I first got curious about diffusion language models after watching Lisa Li’s talk on controlling language models. Before that, I mostly associated diffusion with image generation, not with text, reasoning, or controllability. That talk cracked open the idea that language models do not have to generate text strictly left to right, one token at a time. There is another route, and it is a very interesting one.
This is my attempt to make sense of that route.
I will use DLLMs as shorthand for diffusion language models or diffusion large language models. The naming is still a bit messy in the wild. You will also see terms like diffusion-based LLMs, large language diffusion models, and masked diffusion language models.
The big idea: Autoregressive LLMs write text token by token. Diffusion language models start from a heavily corrupted sequence and iteratively clean it up.
Why this matters now
As of March 2026, autoregressive LLMs are still the dominant paradigm in both research and products. Their training objective is clean, their scaling story is well understood, and the entire tooling stack is built around next-token prediction.
But diffusion language models are now a “serious alternative” worth watching. Recent work has pushed them much closer to practical relevance, especially for parallel generation, editing, infilling, and controllability. On the product side, Google DeepMind has an experimental Gemini Diffusion model, Inception Labs has commercial Mercury models for code, ByteDance introduced Seed Diffusion Preview, and open research efforts such as LLaDA made the case that diffusion can scale into the LLM regime. The field is still young, but it is no longer just a lab curiosity.
The autoregressive baseline
When people say “LLM” today, they usually mean an autoregressive model.
An autoregressive language model generates text from left to right. Given a token sequence $x = (x_1, \dots, x_T)$, it factorizes the probability of the full sequence as:

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$

Training is next-token prediction with cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
The training rule is: given the prefix, predict the next token. At inference time, the model repeats the same game step by step:
- read the current prefix
- predict the next token
- append it
- repeat until done
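The loop above fits in a few lines of Python. This is only an illustrative sketch: `toy_next_token` is a hypothetical stand-in for a real model's next-token prediction, hard-coded here so the loop is runnable.

```python
# Sketch of the autoregressive decoding loop described above.
# `toy_next_token` stands in for a real model: it deterministically
# follows a tiny hard-coded continuation table.
def toy_next_token(prefix):
    continuation = {"the": "cat", "cat": "sat", "sat": "<eos>"}
    return continuation.get(prefix[-1], "<eos>")

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        nxt = toy_next_token(tokens)  # read the prefix, predict the next token
        if nxt == "<eos>":            # stop condition
            break
        tokens.append(nxt)            # append it and repeat
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Note that each iteration depends on the previous one, which is exactly the sequential bottleneck discussed below.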
This recipe works astonishingly well. It is one of the great engineering hacks of modern AI.
It also comes with structural constraints:
- Sequential decoding bottleneck. You cannot generate token 500 before generating token 499.
- Unidirectional generation. The model sees the past when generating, not the future.
- Error accumulation. A bad early token can poison later ones.
- Editing is awkward. Rewriting the middle of a sequence is not the native shape of the model.
These are not fatal flaws. Autoregressive models are still incredibly strong. But diffusion models attack exactly these pressure points.
What “diffusion” means here
In ordinary English, diffusion means something spreading out, like ink dispersing in water.
In machine learning, diffusion usually means a generative process with two parts:
- a forward process that gradually corrupts data
- a reverse process that learns to reconstruct clean data from noisy versions
In image models, the corruption often looks like adding Gaussian noise. In text, things get trickier because text is discrete, not continuous. Tokens are symbols, not pixels. You cannot just add 0.3 units of “banana” to a word.
That mismatch is the core headache of text diffusion.
Why text is harder than images for diffusion
Diffusion was born in continuous spaces, where gradual corruption is natural. Images live there. Audio does too. Text does not.
Text has at least three annoying properties:
1. Tokens are discrete
A token is a categorical choice from a vocabulary. If you corrupt a token, what does “slightly noisier than giraffe” even mean? This forced researchers to invent special machinery for diffusion over discrete symbols.
2. Small changes can be catastrophic
Changing one pixel barely matters. Changing one token can flip the entire meaning of a sentence, break syntax, or wreck a program.
3. Order matters a lot
Language is fragile. Global coherence, local syntax, and long-range dependencies all matter at once. A diffusion process has to denoise in a way that preserves both token identity and structure.
That is why diffusion for text took longer to become compelling than diffusion for images.
Two main paths: continuous and discrete diffusion for language
Most diffusion language models fall into one of two camps.
1) Continuous diffusion
The model maps text into a continuous representation, often an embedding space, and runs diffusion there.
This was the route taken by Diffusion-LM. Instead of diffusing directly over tokens, the model diffuses over continuous latent representations and then rounds or projects them back into words.
Why bother?
Because continuous spaces inherit some of the mathematical convenience of image diffusion. They can also make certain types of control easier, such as steering toward syntactic patterns or sentence-level attributes.
The catch is that you eventually have to turn continuous vectors back into discrete tokens. That projection step is not trivial, and early continuous approaches often looked more exciting for controllability research than for plain language-model quality.
2) Discrete diffusion
The model corrupts tokens directly in token space.
A common version uses an absorbing state, usually a special [MASK] token. In the forward process, some clean tokens are replaced by [MASK]. In the reverse process, the model predicts the missing tokens and gradually unmasks the sequence.
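The forward (corruption) half of that process is easy to sketch. Assuming the simplest schedule, where each token is independently replaced by the absorbing state with probability equal to the noise level `t`:

```python
import random

MASK = "[MASK]"  # the absorbing state

def forward_mask(tokens, t, rng):
    # At noise level t in [0, 1], each token is independently
    # replaced by [MASK] with probability t. t=0 leaves the text
    # clean; t=1 absorbs everything.
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
clean = "diffusion models may change how we generate code".split()
print(forward_mask(clean, 0.0, rng))  # no corruption
print(forward_mask(clean, 0.5, rng))  # roughly half masked
print(forward_mask(clean, 1.0, rng))  # fully masked
```

The reverse process is what the model learns: predict the original tokens at the masked positions.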
This family turned out to be especially important, because it connects diffusion to things language researchers already know well:
- masked language modeling
- denoising objectives
- infilling and editing
- parallel token prediction
Recent progress in DLLMs has heavily favored this direction.
A useful mental model
A practical way to think about the difference is this:
- Autoregressive LLM: writes the answer from left to right.
- Diffusion LLM: starts with a rough, corrupted draft and repeatedly revises the whole thing.
That means a diffusion model can update many positions at once. It is much more like iterative editing than incremental typing.
That single difference has huge consequences.
The historical path to modern DLLMs
The field did not appear out of nowhere. A rough storyline looks like this:
Early foundations
Work such as D3PM helped establish diffusion-like modeling in discrete state spaces, including token corruption schemes relevant to text. This gave researchers a vocabulary for talking about diffusion over symbols instead of only continuous signals.
Continuous-language diffusion
Papers like Diffusion-LM and Continuous Diffusion for Categorical Data explored how to adapt diffusion to language through continuous representations. These works were especially influential for controllable generation and for showing that text diffusion was not total nonsense.
Better likelihood and scaling
Plaid and related likelihood-based work pushed diffusion LMs toward stronger language modeling performance and more serious scaling analysis.
Discrete diffusion gets sharper
SEDD introduced score-entropy training for discrete diffusion, helping close the quality gap. MDLM showed that simple masked diffusion, with the right parameterization and training recipe, was much stronger than many people expected.
Large-scale diffusion LLMs
LLaDA argued that a diffusion model trained from scratch at 8B scale could rival strong autoregressive baselines in several regimes. By this point the conversation changed from “Can this even work?” to “Where does this work best, and what are the tradeoffs?”
How diffusion generation actually works for text
Let us strip the machinery down to the bones.
Suppose the final clean text is:
diffusion models may change how we generate code
A discrete diffusion process might do something like this during generation:
Step 0: start from maximum corruption
[MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
Step 1: fill in easy positions first
diffusion [MASK] may [MASK] how we [MASK] code
Step 2: refine more positions
diffusion models may change how we generate code
Step 3: optional extra refinement
The model may continue revising uncertain positions based on confidence, schedule, or sampling strategy.
The exact details vary by method, but the common theme is this: many tokens can be predicted in parallel at each denoising step.
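A toy version of such a confidence-based unmasking loop might look like this. Everything here is a simplifying assumption: `toy_denoiser` is a hypothetical stand-in that always proposes the right token with a made-up per-position confidence, where a real DLLM would output a distribution over the vocabulary at every masked position.

```python
MASK = "[MASK]"

def toy_denoiser(seq, target):
    # Hypothetical stand-in for the model: for each masked position,
    # propose the target token with a fake confidence score.
    return {i: (target[i], 1.0 / (i + 1))
            for i, tok in enumerate(seq) if tok == MASK}

def decode(length, target, steps=3):
    seq = [MASK] * length           # step 0: maximum corruption
    per_step = -(-length // steps)  # ceil(length / steps) positions per step
    for _ in range(steps):
        proposals = toy_denoiser(seq, target)
        if not proposals:
            break
        # Commit the most confident positions, many at once, in parallel.
        top = sorted(proposals, key=lambda i: -proposals[i][1])[:per_step]
        for i in top:
            seq[i] = proposals[i][0]
    return seq

target = "diffusion models may change how we generate code".split()
print(decode(len(target), target))  # recovers the full sentence in 3 steps
```

The point of the sketch: eight tokens appear in three denoising steps instead of eight sequential ones.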
That is the engine behind the speed story.
Where DLLMs shine
Diffusion language models are not “better than autoregressive models” in some universal cosmic sense. That would be marketing goblin talk. They are better aligned with certain tasks and constraints.
1. Parallel decoding
This is the headline feature.
Autoregressive models decode one token at a time. Diffusion models can update many positions in parallel, which can reduce latency dramatically, especially when the output is long and the number of denoising steps is kept small.
2. Editing and infilling
Diffusion models naturally support tasks where only part of the text should change:
- fill the missing span
- rewrite one paragraph
- update a code block
- revise a sentence under constraints
This is a very native workload for masked diffusion.
3. Bidirectional context
Because the model denoises across the whole sequence, it can use information from both left and right context during prediction. That can be valuable for global consistency, structured generation, and non-left-to-right editing.
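The mechanical difference is visible in the attention mask. This is illustrative only: a causal (autoregressive) mask lets position i attend to positions j ≤ i, while a diffusion-style denoiser can attend over the whole sequence.

```python
# Attention masks for a sequence of length 4.
# 1 means "position i may attend to position j".
n = 4
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
full = [[1] * n for _ in range(n)]

# Position 1 under the causal mask sees only positions 0-1;
# under the full (bidirectional) mask it sees all four.
print(causal[1])  # [1, 1, 0, 0]
print(full[1])    # [1, 1, 1, 1]
```

With the full mask, a prediction at any position can be conditioned on both its left and right context.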
4. Controllability
This is one reason Lisa Li’s work is so interesting.
Diffusion models give you multiple chances to intervene during generation. In continuous variants especially, researchers have shown that the denoising trajectory can be steered toward target properties such as syntax, sentiment, or structure. Diffusion is not automatically controllable, but it often gives you more handles to grab.
5. Global planning for structured outputs
For code editing, form filling, or constrained text generation, a model that can repeatedly revise the whole output may have an advantage over a model that commits left to right and then has to live with its earlier mistakes.
Why speed claims need a footnote the size of a truck
You will often see diffusion systems advertised as much faster than autoregressive models. Sometimes that is true. Sometimes it is benchmark aerobics.
The fair comparison depends on at least four things:
- output length
- number of denoising steps
- hardware and batch size
- whether the task is generation, editing, or completion
If a diffusion model generates many tokens in parallel with a small number of refinement steps, it can be extremely fast. But the speed-quality tradeoff is real. More refinement steps usually improve quality but reduce the raw speed advantage.
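A back-of-envelope version of this tradeoff, under the big simplifying assumption that one autoregressive decode step and one denoising step cost roughly one comparable forward pass (in practice per-step costs differ):

```python
# Idealized latency comparison: an AR model needs one forward pass
# per output token; a diffusion model needs one per denoising step.
def rough_speedup(output_len, denoise_steps):
    ar_steps = output_len
    return ar_steps / denoise_steps

print(rough_speedup(1024, 32))   # 32.0x in this idealized model
print(rough_speedup(1024, 256))  # 4.0x: more refinement steps, less speedup
```

The second call is the footnote in numeric form: every extra refinement step eats into the advantage.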
So yes, the speed story is important. No, you should not swallow every tokens-per-second number like candy.
How DLLMs are evaluated today
There is no single metric that settles the debate.
Researchers and product teams usually evaluate DLLMs along several axes:
Language quality
Traditional language modeling metrics such as perplexity still matter, though they are not always the cleanest fit across different diffusion formulations.
Downstream task performance
Instruction following, coding benchmarks, editing tasks, infilling, summarization, and reasoning-style tasks all show different strengths and weaknesses.
Controllability
Can the model satisfy structured constraints, rewrite toward a target style, obey syntax templates, or preserve required spans?
Latency and throughput
This is where diffusion models often try to win attention.
Refinement behavior
How many denoising steps are needed before quality becomes acceptable? Is performance robust when the step budget is small?
In practice, the most interesting evaluations for DLLMs are often not just plain open-ended chat. They are tasks where iterative whole-sequence refinement is naturally useful.
Current challenges and limitations
This is where the hype meets the furniture.
1. The objective is less standard
Next-token prediction is simple, universal, and battle-tested. Diffusion training involves more design choices:
- corruption schedule
- parameterization
- loss formulation
- denoising sampler
- token update strategy
That makes the space richer, but also messier.
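To make those design choices concrete, here is a sketch of one training step in the masked-diffusion style: sample a noise level, mask positions, score cross-entropy on the masked positions only, and apply a loss weight. The 1/t weighting shown here is one MDLM-style choice among several, and `true_token_prob` is a hypothetical stand-in for the model's predicted probability of the true token.

```python
import math
import random

def masked_diffusion_loss(tokens, true_token_prob, rng):
    # One sketched training step of a masked-diffusion objective:
    #   1. corruption schedule: draw a noise level t
    #   2. forward process: mask each position with probability t
    #   3. loss formulation: cross-entropy on masked positions only,
    #      weighted by 1/t (one choice among several in the literature)
    t = rng.uniform(0.01, 1.0)
    masked = [i for i in range(len(tokens)) if rng.random() < t]
    if not masked:
        return 0.0
    cross_entropy = -sum(math.log(true_token_prob(i)) for i in masked)
    return cross_entropy / t

rng = random.Random(0)
loss = masked_diffusion_loss("a b c d".split(), lambda i: 0.9, rng)
print(loss)
```

Each commented step corresponds to one of the design choices listed above; swapping any of them out changes the model family you end up in.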
2. Inference is iterative too
Diffusion avoids left-to-right decoding, but it still requires multiple denoising steps. That means latency is not magically free. The system wins when parallelism compensates for the extra iterations.
3. Long-form coherence is still a hard problem
Whole-sequence revision is powerful, but long-context language generation remains difficult. A diffusion model still has to preserve consistency across many positions over multiple updates.
4. Infrastructure and serving are less mature
The entire LLM ecosystem was built around autoregressive assumptions: KV cache tricks, token streaming expectations, benchmarking habits, serving libraries, product UX. Diffusion models have to fight that installed base.
5. Evaluation is still settling
A diffusion model can be great at editing and still look mediocre under a benchmark suite built for left-to-right chat generation. The field still needs better task-model alignment.
So are diffusion LLMs the future?
Maybe. But the more useful answer is: they are becoming a serious part of the design space.
I do not think the right framing is “autoregressive versus diffusion, winner takes all.” The more plausible future is architectural pluralism:
- autoregressive models for the broad default case
- diffusion models for fast editing, parallel generation, and controllable refinement
- hybrid models that blend blockwise autoregression with diffusion-style updates
That last category may be especially important. Some recent work already explores hybrids that try to preserve the strengths of both worlds.
The interesting question is no longer whether diffusion can touch language at all. It obviously can. The real question is where it becomes the better tool.
A practical intuition for builders
If you build AI systems, here is the simplest heuristic I would keep in my head:
Use autoregressive models when
- you want the strongest default ecosystem
- you need mature serving and tooling
- your workload is mostly standard chat or completion
- token streaming behavior matters a lot
Watch diffusion models closely when
- your workload is mostly editing or infilling
- you care about low-latency long outputs
- you need controllable iterative refinement
- you are generating structured text or code that benefits from whole-sequence planning
That does not mean diffusion will automatically win these cases. It means the architecture is aligned with them in a way that is worth taking seriously.
Questions I am still exploring
Here are the questions I am still chewing on:
- Which workloads truly favor diffusion once comparisons are made fairly?
- How much of the current gain comes from architecture, and how much comes from training recipe and evaluation choice?
- Will hybrid AR-diffusion systems beat pure versions of either camp?
- Can controllability be made practical for real applications, not just neat demos?
That last point matters. A lot. Technologies do not win only because they are elegant. They win because they fit the whole stack.
Closing thought
DLLMs suggest a different mental model. Instead of a model that types one token after another, imagine a model that drafts, revises, and globally reshapes text through repeated refinement. That framing feels much closer to editing, planning, and controlled generation than the standard autocomplete story.
This “draft -> revise -> final” process feels very much like an agent workflow (ReAct).
Whether diffusion becomes dominant or not, it has already done something useful: it made the design space for language models feel open again.
And that is always where the fun starts.
Playground links
- Controlling Language Models, Lisa Li (Stanford talk)
- Gemini Diffusion
- Inception Mercury
- LLaDA demo
- Seed Diffusion Preview
References
Foundational diffusion
- Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv. https://arxiv.org/abs/2006.11239
- Austin, Jacob, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS. https://arxiv.org/abs/2107.03006
Diffusion for text and categorical data
- Li, Xiang Lisa, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. 2022. Diffusion-LM Improves Controllable Text Generation. NeurIPS. https://arxiv.org/abs/2205.14217
- Dieleman, Sander, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. 2022. Continuous Diffusion for Categorical Data. arXiv. https://arxiv.org/abs/2211.15089
- Chen, Ting, Ruixiang Zhang, and Geoffrey Hinton. 2023. Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. ICLR. https://arxiv.org/abs/2208.04202
- Gulrajani, Ishaan, and Tatsunori Hashimoto. 2023. Likelihood-Based Diffusion Language Models. NeurIPS. https://arxiv.org/abs/2305.18619
Recent DLLM progress
- Lou, Aaron, Chenlin Meng, and Stefano Ermon. 2024. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD). ICML. https://arxiv.org/abs/2310.16834
- Sahoo, Subham Sekhar, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. Simple and Effective Masked Diffusion Language Models (MDLM). NeurIPS. https://arxiv.org/abs/2406.07524
- Nie, Shen, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv. https://arxiv.org/abs/2502.09992
- Li, Tianyi, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. A Survey on Diffusion Language Models. arXiv. https://arxiv.org/abs/2508.10875
Products and research systems
- Google DeepMind. 2025. Gemini Diffusion. https://deepmind.google/models/gemini-diffusion/
- Google. 2025. Gemini Diffusion is our new experimental research model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-diffusion/
- Inception Labs. 2025. Introducing Mercury, the World’s First Commercial-Scale Diffusion Large Language Model. https://www.inceptionlabs.ai/blog/introducing-mercury
- Inception Labs. 2025. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv. https://arxiv.org/abs/2506.17298
- ByteDance Seed. 2025. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference. arXiv. https://arxiv.org/abs/2508.02193