I first got curious about diffusion language models after watching Lisa Li’s talk on controlling language models. Before that, I mostly associated diffusion with image generation, not with text, reasoning, or controllability. That talk cracked open the idea that language models do not have to generate text strictly left to right, one token at a time. There is another route, and it is a very interesting one.
This is my attempt to make sense of that route.
I will use DLLMs as shorthand for diffusion language models or diffusion large language models. The naming is still a bit messy in the wild. You will also see terms like diffusion-based LLMs, large language diffusion models, and masked diffusion language models.
The big idea: Autoregressive LLMs write text token by token. Diffusion language models start from a heavily corrupted sequence and iteratively clean it up.
Why this matters now
As of March 2026, autoregressive LLMs are still the dominant paradigm in both research and products. Their training objective is clean, their scaling story is well understood, and the entire tooling stack is built around next-token prediction.
But diffusion language models are now a “serious alternative” worth watching. Recent work has pushed them much closer to practical relevance, especially for parallel generation, editing, infilling, and controllability. On the product side, Google DeepMind has an experimental Gemini Diffusion model, Inception Labs has commercial Mercury models for code, ByteDance introduced Seed Diffusion Preview, and open research efforts such as LLaDA made the case that diffusion can scale into the LLM regime. The field is still young, but it is no longer just a lab curiosity.
The autoregressive baseline
When people say “LLM” today, they usually mean an autoregressive model.
An autoregressive language model generates text from left to right. Given a token sequence $x = (x_1, \dots, x_T)$, it factorizes the probability of the full sequence as:

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})$$

Training is next-token prediction with cross-entropy loss:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
The training rule is: given the prefix, predict the next token. At inference time, the model repeats the same game step by step:
- read the current prefix
- predict the next token
- append it
- repeat until done
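The loop above fits in a few lines of Python. This is only an illustrative sketch: `toy_next_token` is a hypothetical stand-in for a real model's next-token prediction, hard-coded here so the loop is runnable.

```python
# Sketch of the autoregressive decoding loop described above.
# `toy_next_token` stands in for a real model: it deterministically
# follows a tiny hard-coded continuation table.
def toy_next_token(prefix):
    continuation = {"the": "cat", "cat": "sat", "sat": "<eos>"}
    return continuation.get(prefix[-1], "<eos>")

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        nxt = toy_next_token(tokens)  # read the prefix, predict the next token
        if nxt == "<eos>":            # stop condition
            break
        tokens.append(nxt)            # append it and repeat
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Note that each iteration depends on the previous one, which is exactly the sequential bottleneck discussed below.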
This recipe works astonishingly well. It is one of the great engineering hacks of modern AI.
It also comes with structural constraints:
- Sequential decoding bottleneck. You cannot generate token 500 before generating token 499.
- Unidirectional generation. The model sees the past when generating, not the future.
- Error accumulation. A bad early token can poison later ones.
- Editing is awkward. Rewriting the middle of a sequence is not the native shape of the model.
These are not fatal flaws. Autoregressive models are still incredibly strong. But diffusion models attack exactly these pressure points.
What “diffusion” means here
In ordinary English, diffusion means something spreading out, like ink dispersing in water.
In machine learning, diffusion usually means a generative process with two parts:
- a forward process that gradually corrupts data
- a reverse process that learns to reconstruct clean data from noisy versions
In image models, the corruption often looks like adding Gaussian noise. In text, things get trickier because text is discrete, not continuous. Tokens are symbols, not pixels. You cannot just add 0.3 units of “banana” to a word.
That mismatch is the core headache of text diffusion.
Why text is harder than images for diffusion
Diffusion was born in continuous spaces, where gradual corruption is natural. Images live there. Audio does too. Text does not.
Text has at least three annoying properties:
1. Tokens are discrete
A token is a categorical choice from a vocabulary. If you corrupt a token, what does “slightly noisier than giraffe” even mean? This forced researchers to invent special machinery for diffusion over discrete symbols.
2. Small changes can be catastrophic
Changing one pixel barely matters. Changing one token can flip the entire meaning of a sentence, break syntax, or wreck a program.
3. Order matters a lot
Language is fragile. Global coherence, local syntax, and long-range dependencies all matter at once. A diffusion process has to denoise in a way that preserves both token identity and structure.
That is why diffusion for text took longer to become compelling than diffusion for images.
Two main paths: continuous and discrete diffusion for language
Most diffusion language models fall into one of two camps.
1) Continuous diffusion
The model maps text into a continuous representation, often an embedding space, and runs diffusion there.
This was the route taken by Diffusion-LM. Instead of diffusing directly over tokens, the model diffuses over continuous latent representations and then rounds or projects them back into words.
Why bother?
Because continuous spaces inherit some of the mathematical convenience of image diffusion. They can also make certain types of control easier, such as steering toward syntactic patterns or sentence-level attributes.
The catch is that you eventually have to turn continuous vectors back into discrete tokens. That projection step is not trivial, and early continuous approaches often looked more exciting for controllability research than for plain language-model quality.
2) Discrete diffusion
The model corrupts tokens directly in token space.
A common version uses an absorbing state, usually a special [MASK] token. In the forward process, some clean tokens are replaced by [MASK]. In the reverse process, the model predicts the missing tokens and gradually unmasks the sequence.
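The forward (corruption) half of that process is easy to sketch. Assuming the simplest schedule, where each token is independently replaced by the absorbing state with probability equal to the noise level `t`:

```python
import random

MASK = "[MASK]"  # the absorbing state

def forward_mask(tokens, t, rng):
    # At noise level t in [0, 1], each token is independently
    # replaced by [MASK] with probability t. t=0 leaves the text
    # clean; t=1 absorbs everything.
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)
clean = "diffusion models may change how we generate code".split()
print(forward_mask(clean, 0.0, rng))  # no corruption
print(forward_mask(clean, 0.5, rng))  # roughly half masked
print(forward_mask(clean, 1.0, rng))  # fully masked
```

The reverse process is what the model learns: predict the original tokens at the masked positions.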
This family turned out to be especially important, because it connects diffusion to things language researchers already know well:
- masked language modeling
- denoising objectives
- infilling and editing
- parallel token prediction
Recent progress in DLLMs has heavily favored this direction.
A useful mental model
A practical way to think about the difference is this:
- Autoregressive LLM: writes the answer from left to right.
- Diffusion LLM: starts with a rough, corrupted draft and repeatedly revises the whole thing.
That means a diffusion model can update many positions at once. It is much more like iterative editing than incremental typing.
That single difference has huge consequences.
The historical path to modern DLLMs
The field did not appear out of nowhere. A rough storyline looks like this:
Early foundations
Work such as D3PM helped establish diffusion-like modeling in discrete state spaces, including token corruption schemes relevant to text. This gave researchers a vocabulary for talking about diffusion over symbols instead of only continuous signals.
Continuous-language diffusion
Papers like Diffusion-LM and Continuous Diffusion for Categorical Data explored how to adapt diffusion to language through continuous representations. These works were especially influential for controllable generation and for showing that text diffusion was not total nonsense.
Better likelihood and scaling
Plaid and related likelihood-based work pushed diffusion LMs toward stronger language modeling performance and more serious scaling analysis.
Discrete diffusion gets sharper
SEDD introduced score-entropy training for discrete diffusion, helping close the quality gap. MDLM showed that simple masked diffusion, with the right parameterization and training recipe, was much stronger than many people expected.
Large-scale diffusion LLMs
LLaDA argued that a diffusion model trained from scratch at 8B scale could rival strong autoregressive baselines in several regimes. By this point the conversation changed from “Can this even work?” to “Where does this work best, and what are the tradeoffs?”
How diffusion generation actually works for text
Let us strip the machinery down to the bones.
Suppose the final clean text is:
diffusion models may change how we generate code
A discrete diffusion process might do something like this during generation:
Step 0: start from maximum corruption
[MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]
Step 1: fill in easy positions first
diffusion [MASK] may [MASK] how we [MASK] code
Step 2: refine more positions
diffusion models may change how we generate code
Step 3: optional extra refinement
The model may continue revising uncertain positions based on confidence, schedule, or sampling strategy.
The exact details vary by method, but the common theme is this: many tokens can be predicted in parallel at each denoising step.
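A toy version of such a confidence-based unmasking loop might look like this. Everything here is a simplifying assumption: `toy_denoiser` is a hypothetical stand-in that always proposes the right token with a made-up per-position confidence, where a real DLLM would output a distribution over the vocabulary at every masked position.

```python
MASK = "[MASK]"

def toy_denoiser(seq, target):
    # Hypothetical stand-in for the model: for each masked position,
    # propose the target token with a fake confidence score.
    return {i: (target[i], 1.0 / (i + 1))
            for i, tok in enumerate(seq) if tok == MASK}

def decode(length, target, steps=3):
    seq = [MASK] * length           # step 0: maximum corruption
    per_step = -(-length // steps)  # ceil(length / steps) positions per step
    for _ in range(steps):
        proposals = toy_denoiser(seq, target)
        if not proposals:
            break
        # Commit the most confident positions, many at once, in parallel.
        top = sorted(proposals, key=lambda i: -proposals[i][1])[:per_step]
        for i in top:
            seq[i] = proposals[i][0]
    return seq

target = "diffusion models may change how we generate code".split()
print(decode(len(target), target))  # recovers the full sentence in 3 steps
```

The point of the sketch: eight tokens appear in three denoising steps instead of eight sequential ones.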
That is the engine behind the speed story.
Where DLLMs shine
Diffusion language models are not “better than autoregressive models” in some universal cosmic sense. That would be marketing goblin talk. They are better aligned with certain tasks and constraints.
1. Parallel decoding
This is the headline feature.
Autoregressive models decode one token at a time. Diffusion models can update many positions in parallel, which can reduce latency dramatically, especially when the output is long and the number of denoising steps is kept small.
2. Editing and infilling
Diffusion models naturally support tasks where only part of the text should change:
- fill the missing span
- rewrite one paragraph
- update a code block
- revise a sentence under constraints
This is a very native workload for masked diffusion.
3. Bidirectional context
Because the model denoises across the whole sequence, it can use information from both left and right context during prediction. That can be valuable for global consistency, structured generation, and non-left-to-right editing.
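The mechanical difference is visible in the attention mask. This is illustrative only: a causal (autoregressive) mask lets position i attend to positions j ≤ i, while a diffusion-style denoiser can attend over the whole sequence.

```python
# Attention masks for a sequence of length 4.
# 1 means "position i may attend to position j".
n = 4
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
full = [[1] * n for _ in range(n)]

# Position 1 under the causal mask sees only positions 0-1;
# under the full (bidirectional) mask it sees all four.
print(causal[1])  # [1, 1, 0, 0]
print(full[1])    # [1, 1, 1, 1]
```

With the full mask, a prediction at any position can be conditioned on both its left and right context.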
4. Controllability
This is one reason Lisa Li’s work is so interesting.
Diffusion models give you multiple chances to intervene during generation. In continuous variants especially, researchers have shown that the denoising trajectory can be steered toward target properties such as syntax, sentiment, or structure. Diffusion is not automatically controllable, but it often gives you more handles to grab.
5. Global planning for structured outputs
For code editing, form filling, or constrained text generation, a model that can repeatedly revise the whole output may have an advantage over a model that commits left to right and then has to live with its earlier mistakes.
Why speed claims need a footnote the size of a truck
You will often see diffusion systems advertised as much faster than autoregressive models. Sometimes that is true. Sometimes it is benchmark aerobics.
The fair comparison depends on at least four things:
- output length
- number of denoising steps
- hardware and batch size
- whether the task is generation, editing, or completion
If a diffusion model generates many tokens in parallel with a small number of refinement steps, it can be extremely fast. But the speed-quality tradeoff is real. More refinement steps usually improve quality but reduce the raw speed advantage.
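A back-of-envelope version of this tradeoff, under the big simplifying assumption that one autoregressive decode step and one denoising step cost roughly one comparable forward pass (in practice per-step costs differ):

```python
# Idealized latency comparison: an AR model needs one forward pass
# per output token; a diffusion model needs one per denoising step.
def rough_speedup(output_len, denoise_steps):
    ar_steps = output_len
    return ar_steps / denoise_steps

print(rough_speedup(1024, 32))   # 32.0x in this idealized model
print(rough_speedup(1024, 256))  # 4.0x: more refinement steps, less speedup
```

The second call is the footnote in numeric form: every extra refinement step eats into the advantage.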
So yes, the speed story is important. No, you should not swallow every tokens-per-second number like candy.
How DLLMs are evaluated today
There is no single metric that settles the debate.
Researchers and product teams usually evaluate DLLMs along several axes:
Language quality
Traditional language modeling metrics such as perplexity still matter, though they are not always the cleanest fit across different diffusion formulations.
Downstream task performance
Instruction following, coding benchmarks, editing tasks, infilling, summarization, and reasoning-style tasks all show different strengths and weaknesses.
Controllability
Can the model satisfy structured constraints, rewrite toward a target style, obey syntax templates, or preserve required spans?
Latency and throughput
This is where diffusion models often try to win attention.
Refinement behavior
How many denoising steps are needed before quality becomes acceptable? Is performance robust when the step budget is small?
In practice, the most interesting evaluations for DLLMs are often not just plain open-ended chat. They are tasks where iterative whole-sequence refinement is naturally useful.
Current challenges and limitations
This is where the hype meets the furniture.
1. The objective is less standard
Next-token prediction is simple, universal, and battle-tested. Diffusion training involves more design choices:
- corruption schedule
- parameterization
- loss formulation
- denoising sampler
- token update strategy
That makes the space richer, but also messier.
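To make those design choices concrete, here is a sketch of one training step in the masked-diffusion style: sample a noise level, mask positions, score cross-entropy on the masked positions only, and apply a loss weight. The 1/t weighting shown here is one MDLM-style choice among several, and `true_token_prob` is a hypothetical stand-in for the model's predicted probability of the true token.

```python
import math
import random

def masked_diffusion_loss(tokens, true_token_prob, rng):
    # One sketched training step of a masked-diffusion objective:
    #   1. corruption schedule: draw a noise level t
    #   2. forward process: mask each position with probability t
    #   3. loss formulation: cross-entropy on masked positions only,
    #      weighted by 1/t (one choice among several in the literature)
    t = rng.uniform(0.01, 1.0)
    masked = [i for i in range(len(tokens)) if rng.random() < t]
    if not masked:
        return 0.0
    cross_entropy = -sum(math.log(true_token_prob(i)) for i in masked)
    return cross_entropy / t

rng = random.Random(0)
loss = masked_diffusion_loss("a b c d".split(), lambda i: 0.9, rng)
print(loss)
```

Each commented step corresponds to one of the design choices listed above; swapping any of them out changes the model family you end up in.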
2. Inference is iterative too
Diffusion avoids left-to-right decoding, but it still requires multiple denoising steps. That means latency is not magically free. The system wins when parallelism compensates for the extra iterations.
3. Long-form coherence is still a hard problem
Whole-sequence revision is powerful, but long-context language generation remains difficult. A diffusion model still has to preserve consistency across many positions over multiple updates.
4. Infrastructure and serving are less mature
The entire LLM ecosystem was built around autoregressive assumptions: KV cache tricks, token streaming expectations, benchmarking habits, serving libraries, product UX. Diffusion models have to fight that installed base.
5. Evaluation is still settling
A diffusion model can be great at editing and still look mediocre under a benchmark suite built for left-to-right chat generation. The field still needs better task-model alignment.
So are diffusion LLMs the future?
Maybe. But the more useful answer is: they are becoming a serious part of the design space.
I do not think the right framing is “autoregressive versus diffusion, winner takes all.” The more plausible future is architectural pluralism:
- autoregressive models for the broad default case
- diffusion models for fast editing, parallel generation, and controllable refinement
- hybrid models that blend blockwise autoregression with diffusion-style updates
That last category may be especially important. Some recent work already explores hybrids that try to preserve the strengths of both worlds.
The interesting question is no longer whether diffusion can touch language at all. It obviously can. The real question is where it becomes the better tool.
A practical intuition for builders
If you build AI systems, here is the simplest heuristic I would keep in my head:
Use autoregressive models when
- you want the strongest default ecosystem
- you need mature serving and tooling
- your workload is mostly standard chat or completion
- token streaming behavior matters a lot
Watch diffusion models closely when
- your workload is mostly editing or infilling
- you care about low-latency long outputs
- you need controllable iterative refinement
- you are generating structured text or code that benefits from whole-sequence planning
That does not mean diffusion will automatically win these cases. It means the architecture is aligned with them in a way that is worth taking seriously.
Questions I am still exploring
Here are the questions I am still chewing on:
- Which workloads truly favor diffusion once comparisons are made fairly?
- How much of the current gain comes from architecture, and how much comes from training recipe and evaluation choice?
- Will hybrid AR-diffusion systems beat pure versions of either camp?
- Can controllability be made practical for real applications, not just neat demos?
That last point matters. A lot. Technologies do not win only because they are elegant. They win because they fit the whole stack.
Closing thought
DLLMs suggest a different mental model. Instead of a model that types one token after another, imagine a model that drafts, revises, and globally reshapes text through repeated refinement. That framing feels much closer to editing, planning, and controlled generation than the standard autocomplete story.
This “draft -> revise -> final” process feels very much like an agent workflow (ReAct).
Whether diffusion becomes dominant or not, it has already done something useful: it made the design space for language models feel open again.
And that is always where the fun starts.
Playground links
- Controlling Language Models, Lisa Li (Stanford talk)
- Gemini Diffusion
- Inception Mercury
- LLaDA demo
- Seed Diffusion Preview
References
Foundational diffusion
- Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv. https://arxiv.org/abs/2006.11239
- Austin, Jacob, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. Structured Denoising Diffusion Models in Discrete State-Spaces (D3PM). NeurIPS. https://arxiv.org/abs/2107.03006
Diffusion for text and categorical data
- Li, Xiang Lisa, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori Hashimoto. 2022. Diffusion-LM Improves Controllable Text Generation. NeurIPS. https://arxiv.org/abs/2205.14217
- Dieleman, Sander, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, and Jonas Adler. 2022. Continuous Diffusion for Categorical Data. arXiv. https://arxiv.org/abs/2211.15089
- Chen, Ting, Ruixiang Zhang, and Geoffrey Hinton. 2023. Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning. ICLR. https://arxiv.org/abs/2208.04202
- Gulrajani, Ishaan, and Tatsunori Hashimoto. 2023. Likelihood-Based Diffusion Language Models. NeurIPS. https://arxiv.org/abs/2305.18619
Recent DLLM progress
- Lou, Aaron, Chenlin Meng, and Stefano Ermon. 2024. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD). ICML. https://arxiv.org/abs/2310.16834
- Sahoo, Subham Sekhar, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. 2024. Simple and Effective Masked Diffusion Language Models (MDLM). NeurIPS. https://arxiv.org/abs/2406.07524
- Nie, Shen, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. 2025. Large Language Diffusion Models. arXiv. https://arxiv.org/abs/2502.09992
- Li, Tianyi, Mingda Chen, Bowei Guo, and Zhiqiang Shen. 2025. A Survey on Diffusion Language Models. arXiv. https://arxiv.org/abs/2508.10875
Products and research systems
- Google DeepMind. 2025. Gemini Diffusion. https://deepmind.google/models/gemini-diffusion/
- Google. 2025. Gemini Diffusion is our new experimental research model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-diffusion/
- Inception Labs. 2025. Introducing Mercury, the World’s First Commercial-Scale Diffusion Large Language Model. https://www.inceptionlabs.ai/blog/introducing-mercury
- Inception Labs. 2025. Mercury: Ultra-Fast Language Models Based on Diffusion. arXiv. https://arxiv.org/abs/2506.17298
- ByteDance Seed. 2025. Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference. arXiv. https://arxiv.org/abs/2508.02193