Diffusion LLMs Are Fast and Interesting!

I first got curious about diffusion language models after watching Lisa Li’s talk on controlling language models. Before that, I mostly associated diffusion with image generation, not with text, reasoning, or controllability. That talk reopened a simple question I had stopped asking: does a language model really have to generate text strictly from left to right, one token at a time?

One of my research colleagues also worked on generating and imputing single-cell Hi-C contact maps with a diffusion-based approach. I would never have guessed that would work!

This post is my attempt to make sense of the alternative.

I will use DLLMs as shorthand for diffusion language models or diffusion large language models. The naming is still loose in the wild. You will also see diffusion-based LLMs, large language diffusion models, and masked diffusion language models. The important idea is simpler than the terminology: an autoregressive model writes token by token, while a diffusion model starts from a corrupted sequence and iteratively cleans it up.

The core shift

As of March 2026, autoregressive LLMs are still the default. Their training objective is straightforward, their scaling behavior is well understood, and the entire ecosystem of inference, serving, and product UX has been built around next-token prediction. When most people say “LLM,” this is still what they mean.

Formally, an autoregressive language model factorizes a sequence like this:

$$P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$$

and it is usually trained with next-token cross-entropy:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
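As a toy illustration of this objective, here is the loss for a three-token sequence with hand-picked probabilities (an assumption for the example, not the output of any real model):

```python
import math

# Toy per-step probabilities P(x_t | x_<t) for the correct token at each
# position, assuming hand-picked values rather than a trained model.
step_probs = [0.5, 0.25, 0.125]

# Next-token cross-entropy: L = -sum_t log P(x_t | x_<t)
loss = -sum(math.log(p) for p in step_probs)
print(round(loss, 4))  # total negative log-likelihood in nats
```

Less probable tokens contribute more loss, which is exactly the pressure that trains the model to sharpen its next-token predictions.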

At inference time, the model keeps repeating the same loop: read the current prefix, predict the next token, append it, and continue until it stops. That recipe works astonishingly well. It is one of the most successful simplifications in modern machine learning.
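The loop itself is tiny. A minimal sketch, where `predict_next` is a hypothetical stand-in for a trained model (a real system would run a forward pass of a neural network here):

```python
def predict_next(prefix):
    # Toy "model": a fixed bigram table, purely for illustration.
    table = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "</s>"}
    return table[prefix[-1]]

def generate(max_len=10):
    tokens = ["<s>"]
    while len(tokens) < max_len:
        nxt = predict_next(tokens)  # read the current prefix, predict the next token
        tokens.append(nxt)          # append it
        if nxt == "</s>":           # continue until it stops
            break
    return tokens

print(generate())  # ['<s>', 'the', 'cat', 'sat', '</s>']
```

Everything interesting lives inside `predict_next`; the decoding loop around it has barely changed in years.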

Still, its weaknesses are structural, not accidental. Decoding is sequential, so token 500 cannot arrive before token 499. The model generates using left context, not future context. Early mistakes can ripple forward. Rewriting the middle of a sequence is possible, but awkward. Autoregressive models remain extremely strong, yet these pressure points are exactly where diffusion starts to look interesting.

In machine learning, “diffusion” usually refers to a pair of processes. The forward process gradually corrupts data. The reverse process learns to reconstruct clean data from corrupted versions. With images this story is intuitive, because pixels live in a continuous space and Gaussian noise makes sense. Text is much more annoying. Tokens are discrete symbols, not coordinates. You cannot add 0.3 units of “banana” to a word and call that a noisy version.

That mismatch is why text diffusion took longer to become compelling than image diffusion. Language is fragile. A one-token change can flip meaning, break syntax, or destroy a program. Order matters everywhere at once: locally for grammar, globally for coherence, and structurally for long-range dependencies. A text diffusion model has to denoise while preserving all of that.

How text diffusion works

Most DLLMs solve the problem in one of two ways. One path is continuous diffusion: map text into a continuous representation, often an embedding space, run diffusion there, and then project back into tokens. This was the basic route behind work like Diffusion-LM. It inherits some of the mathematical convenience of continuous diffusion and often looks attractive for controllability, but the projection back into discrete text is not trivial. Early results in this family were often more exciting as research instruments than as plain language models.

The other path is discrete diffusion, which corrupts tokens directly in token space. A common version uses an absorbing state such as [MASK]. The forward process replaces some clean tokens with [MASK]; the reverse process predicts the missing tokens and gradually unmasks the sequence. This connects naturally to masked language modeling, denoising, infilling, and editing, which is one reason it has become such an important direction.
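The forward corruption step is easy to sketch. Assuming a `[MASK]` absorbing state and a masking probability tied to a noise level `t` (the names and the independent-masking scheme here are illustrative, not taken from any specific paper):

```python
import random

MASK = "[MASK]"

def corrupt(tokens, t, rng):
    """Forward process sketch: each token is independently replaced by
    [MASK] with probability t (t=0 leaves text clean, t=1 masks everything)."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
clean = "diffusion models may change how we generate code".split()
print(corrupt(clean, 1.0, rng))  # fully masked at t = 1
print(corrupt(clean, 0.5, rng))  # roughly half the tokens masked at t = 0.5
```

The reverse model is then trained to predict the original tokens at the masked positions, which is why this family connects so directly to masked language modeling.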

A useful mental model is this:

  • Autoregressive LLM: writes the answer from left to right.
  • Diffusion LLM: starts from a rough, corrupted draft and repeatedly revises the whole thing.

That difference sounds small, but it changes a lot. A diffusion model can update many positions at once, so generation feels closer to iterative editing than to incremental typing.

Suppose the final sentence is:

diffusion models may change how we generate code

A masked diffusion process might begin here:

[MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]

Then it could fill easier positions first:

diffusion [MASK] may [MASK] how we [MASK] code

Then refine further:

diffusion models may change how we generate code

Some methods stop there. Others keep revising uncertain positions for a few more denoising steps. The details vary, but the key point stays the same: many tokens can be predicted in parallel during each refinement step. That is the source of the speed story, and also the source of many of the tradeoffs.
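The walkthrough above can be sketched as a confidence-ordered unmasking loop. The `target` and `confidence` inputs are hypothetical stand-ins for a real denoiser, which would predict tokens and per-position scores jointly:

```python
MASK = "[MASK]"

def denoise(masked, target, confidence, tokens_per_step=3):
    """Reverse process sketch: each refinement step fills the most confident
    masked positions in parallel. `target` plus `confidence` stand in for a
    real denoiser's joint token-and-score predictions."""
    seq = list(masked)
    steps = []
    while MASK in seq:
        holes = [i for i, tok in enumerate(seq) if tok == MASK]
        # Fill the easiest positions first, several per step.
        for i in sorted(holes, key=lambda i: -confidence[i])[:tokens_per_step]:
            seq[i] = target[i]
        steps.append(" ".join(seq))
    return steps

target = "diffusion models may change how we generate code".split()
confidence = [0.9, 0.4, 0.8, 0.3, 0.7, 0.6, 0.2, 0.95]
for step in denoise([MASK] * 8, target, confidence):
    print(step)
```

Eight tokens arrive in three refinement steps instead of eight sequential ones, which is the parallelism the speed claims rest on.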

The research arc that got us here is fairly easy to sketch in retrospect. D3PM helped establish diffusion-style modeling in discrete state spaces. Diffusion-LM and related work showed that language diffusion was not a category error. Likelihood-based approaches such as Plaid pushed quality and scaling questions more seriously. SEDD improved discrete diffusion training, while MDLM showed that simple masked diffusion could be much stronger than many people assumed. By the time LLaDA argued that an 8B diffusion model could compete with strong autoregressive baselines in several settings, the question had shifted from “can this work at all?” to “where does this work best?”

Where it looks strongest

The most convincing case for DLLMs is not that they universally beat autoregressive models. They do not. The case is that their generation style lines up well with a different set of workloads.

Parallel decoding is the obvious headline. If a model can revise many positions at once and reach acceptable quality in a small number of denoising steps, latency can drop sharply, especially for longer outputs. But this is also where a lot of hype sneaks in. Speed comparisons depend heavily on output length, denoising step count, hardware, batch size, and whether the task is open-ended generation, editing, or completion. Sometimes the speedups are real. Sometimes they are benchmark theater.
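Back-of-envelope arithmetic shows why the comparison is so sensitive to step count. Assuming, purely for illustration, that one autoregressive step and one denoising pass cost roughly the same and ignoring batching and KV caching:

```python
def sequential_passes(output_len, denoise_steps=None):
    """Rough count of sequential forward passes (purely illustrative).
    Autoregressive: one pass per token. Diffusion: one pass per step."""
    return output_len if denoise_steps is None else denoise_steps

tokens = 1000
ar = sequential_passes(tokens)
for k in (8, 64, 512):
    print(f"{k} denoising steps vs {ar} AR steps: "
          f"{ar / sequential_passes(tokens, k):.1f}x fewer sequential passes")
```

The headline number moves from over 100x to barely 2x depending on how many refinement steps acceptable quality actually requires, which is why step count belongs in every speed claim.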

Editing and infilling are a more natural fit. If the job is to fill a missing span, rewrite one paragraph, update a code block, or revise a sentence under constraints, masked diffusion is operating in its home territory. The model is already set up to treat the sequence as something partially known and partially to be repaired.

Bidirectional context is another practical advantage. Because denoising happens across the whole sequence, the model can use both left and right context while deciding what belongs at a position. That matters for infilling, structured generation, and cases where global consistency is more important than streaming text token by token.

Controllability is part of what made Lisa Li’s talk memorable to me in the first place. Diffusion models expose more opportunities to intervene during generation because the output is refined over multiple steps instead of committed one token at a time. In continuous variants especially, researchers have shown ways to steer generation toward target properties such as syntax, sentiment, or structure. That does not mean diffusion is automatically controllable, only that it tends to offer more handles.
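One concrete kind of “handle” is constraint clamping: because the output is revisited over multiple steps, pinned positions can simply be re-imposed after every refinement. A toy sketch, where `toy_denoise_step` is a hypothetical stand-in for a real denoiser:

```python
MASK = "[MASK]"

def clamp(seq, constraints):
    """Re-impose hard constraints: user-pinned positions are overwritten
    after every denoising step, so they always survive refinement."""
    for pos, tok in constraints.items():
        seq[pos] = tok
    return seq

def toy_denoise_step(seq, proposal):
    """Hypothetical denoiser step: fills every remaining mask at once."""
    return [proposal[i] if tok == MASK else tok for i, tok in enumerate(seq)]

proposal = "diffusion models may change how we generate code".split()
constraints = {6: "write", 7: "poetry"}  # user pins the last two words

seq = [MASK] * 8
seq = clamp(seq, constraints)           # constraints visible before denoising
seq = toy_denoise_step(seq, proposal)
seq = clamp(seq, constraints)           # and re-imposed after the step
print(" ".join(seq))  # diffusion models may change how we write poetry
```

An autoregressive decoder has no equivalent moment to re-impose a constraint on token 7 once token 6 has been committed; the multi-step structure is what creates the intervention point.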

This is also why diffusion keeps showing up in discussions of code editing and other structured workloads. A model that can revise the whole output may have an advantage when the task benefits from global planning rather than irreversible left-to-right commitment.

Evaluation reflects this awkwardly. There is no single metric that settles the argument. Language quality still matters, and metrics like perplexity have not disappeared, but they are not always the cleanest fit across different diffusion formulations. Downstream performance on instruction following, coding, infilling, summarization, and reasoning-style tasks often tells a more useful story. So do latency, throughput, and the number of refinement steps required before quality becomes acceptable. In practice, the most revealing DLLM evaluations are often not plain chat benchmarks at all. They are tasks where whole-sequence revision is actually native to the problem.

Why it is still hard

The harder part of the story is that diffusion does not simply replace one bottleneck with freedom. It trades one set of constraints for another.

Next-token prediction is simple, universal, and battle-tested. Diffusion training is less standardized. You have to care about corruption schedules, parameterization choices, loss formulations, denoising samplers, and token update strategies. That makes the space richer, but also messier.

Inference is iterative too. Diffusion avoids strictly sequential left-to-right decoding, but it still needs multiple denoising steps. The architecture wins only when parallelism compensates for those extra passes. Long-form coherence also remains hard. Revising the whole sequence is powerful, but it does not magically solve the problem of keeping a long answer consistent over many positions and multiple updates.

The ecosystem is another real constraint. Serving stacks, benchmarking habits, UX expectations, and infrastructure tricks such as KV-cache-heavy optimization all grew around autoregressive assumptions. Diffusion models are not competing on a blank slate. They are pushing against an installed base that is deeply optimized for a different architecture.

I think the framing “DLLMs are the future” is too binary. The plausible future is not winner-takes-all. It is probably a mix:

  • autoregressive models for the broad default case
  • diffusion models for editing, infilling, controllable refinement, and some low-latency long-output workloads
  • hybrid systems that borrow from both

DLLMs suggest a different mental model for language generation. Instead of imagining a model that types one token after another, imagine one that drafts, revises, and globally reshapes text through repeated refinement. That feels closer to editing, planning, and constraint satisfaction than the usual autocomplete metaphor.

Some References

Foundational diffusion

Diffusion for text and categorical data

Recent DLLM progress

Products and research systems
