NVIDIA, HKU, and MIT Launch Fast-dLLM v2: 2.5× End-to-End Throughput Boost


Autoregressive (AR) LLMs vs. Diffusion LLMs (dLLM)

Autoregressive (AR) large language models generate output sequentially, token-by-token, which limits inference efficiency.

Diffusion LLMs (dLLMs) can generate many tokens in parallel (contrasted with AR decoding in the sketch after this list), but have traditionally struggled with:

  • KV cache reuse
  • Variable-length generation
  • Consistently matching or outperforming AR models in output quality
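
To make the contrast concrete, here is a minimal, illustrative sketch of the two decoding loops. `ar_step` and `denoise_block` are hypothetical stand-ins for a model forward pass, not Fast-dLLM v2 APIs.

```python
# Illustrative contrast between sequential AR decoding and block-parallel
# diffusion-style decoding. `ar_step` / `denoise_block` are hypothetical callables.

def ar_decode(prompt_ids, ar_step, max_new_tokens):
    """Autoregressive: one new token per forward pass."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = ar_step(ids)      # each step depends on all previously emitted tokens
        ids.append(next_id)
    return ids

def diffusion_decode(prompt_ids, denoise_block, block_size, num_passes):
    """Diffusion-style: a whole block of masked positions is refined in parallel."""
    MASK = -1                       # placeholder id for a still-masked position
    block = [MASK] * block_size
    for _ in range(num_passes):     # a few denoising passes instead of block_size steps
        block = denoise_block(prompt_ids, block)
    return list(prompt_ids) + block
```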

---

Fast-dLLM v2 — Pragmatic Parallel Decoding

Fast-dLLM v2 adapts a pre-trained AR model into a Block-dLLM with only ~1B tokens of fine-tuning, enabling lossless migration.

Key benefits:

  • No need for massive datasets (Dream requires ~580B tokens)
  • Runs efficiently on A100/H100 GPUs
  • Up to 2.5× throughput boost without accuracy loss


---

Key Highlights


Minimal Data Adaptation

  • Only ~1B tokens of fine-tuning required
  • Works with existing AR models like Qwen2.5-Instruct 1.5B/7B
  • Avoids hundreds of billions of tokens needed by other approaches

AR-Friendly Architecture

  • Block-internal bidirectional attention + block-to-block causal attention (mask construction sketched after this list)
  • Preserves AR semantics, KV cache reuse, and variable-length generation
  • Uses complementary masking + token-shift for robust adaptation
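
The block-wise attention pattern can be pictured as a boolean mask that is bidirectional inside a block and causal across blocks. The following is a minimal PyTorch sketch; `block_diffusion_mask` is a hypothetical helper, not the released implementation.

```python
import torch

def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): bidirectional inside a block,
    causal from a block to all earlier blocks."""
    block_id = torch.arange(seq_len) // block_size
    same_block = block_id[:, None] == block_id[None, :]    # full attention within a block
    earlier_block = block_id[:, None] > block_id[None, :]  # causal attention to past blocks
    return same_block | earlier_block

# Example: 8 tokens, block size 4 -> positions 0-3 attend to each other,
# positions 4-7 attend to each other plus all of block 0.
print(block_diffusion_mask(8, 4).int())
```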

Hierarchical Cache + Parallel Decoding

  • Block-level KV cache for efficiency
  • DualCache reduces redundant work during denoising/refinement
  • Confidence-thresholded parallel decoding boosts end-to-end speed

Large-Model Validation

  • At 7B scale, matches Qwen2.5-7B-Instruct quality
  • Throughput improvement: 2.54×

---

Methodology — From AR to Block Diffusion


1. Block Diffusion with AR-Friendly Attention

  • Split sequences into fixed-size blocks
  • Within-block: bidirectional attention → parallel denoising
  • Between-block: causal attention → preserves AR semantics
  • Complementary masking + token-shift → each token is learned in both visible and masked states (sketched below)
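
As a rough illustration of the complementary-masking idea, assuming PyTorch tensors: `complementary_masked_views` is a hypothetical helper, and the token-shift detail is only noted in a comment.

```python
import torch

def complementary_masked_views(block_ids: torch.Tensor, mask_id: int, p: float = 0.5):
    """Build two training views of the same block with complementary masks, so each
    position is masked in exactly one view and visible in the other.
    (Token-shift, i.e. predicting position i from the hidden state at i-1 as in AR
    training, is part of the recipe but omitted from this sketch.)"""
    mask_a = torch.rand(block_ids.shape) < p
    mask_b = ~mask_a                                  # complementary positions
    fill = torch.full_like(block_ids, mask_id)
    view_a = torch.where(mask_a, fill, block_ids)
    view_b = torch.where(mask_b, fill, block_ids)
    return view_a, view_b
```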

2. Hierarchical Cache Structure

Block-Level Cache

  • Reuses KV for fully decoded blocks → native AR-style caching

DualCache (Sub-Block)

  • Stores prefix & suffix KV for partially decoded blocks
  • Avoids repeated computation during refinement cycles (see the sketch below)
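
A simplified Python sketch of the hierarchical-cache idea; `DualCache` here is an illustrative stand-in, not the paper's implementation.

```python
class DualCache:
    """Keep KV for the clean prefix (fully decoded blocks) and for the current
    block's already-finalized tokens, so refinement passes only recompute the
    still-masked positions. Simplified, illustrative sketch."""

    def __init__(self):
        self.prefix_kv = []   # one (key, value) entry per fully decoded block
        self.suffix_kv = {}   # position -> (key, value) for finalized tokens in the active block

    def commit_block(self, block_kv):
        """Once a block is fully decoded, promote it to the AR-style prefix cache."""
        self.prefix_kv.append(block_kv)
        self.suffix_kv.clear()

    def add_finalized(self, position, kv):
        """Cache a token finalized mid-block so later refinement passes skip it."""
        self.suffix_kv[position] = kv

    def positions_to_recompute(self, block_positions):
        """Only positions without cached KV need a fresh forward pass."""
        return [p for p in block_positions if p not in self.suffix_kv]
```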

3. Confidence-Aware Parallel Decoding

  • When a token's confidence exceeds the threshold (e.g., 0.9), multiple tokens are finalized in a single pass (see the sketch after this list)
  • Low-confidence tokens → left masked and refined in later passes
  • Example (GSM8K):
    • Tokens/s: 39.1 → 101.7 (~2.6× speedup)
    • Accuracy impact: negligible
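
A minimal sketch of one confidence-thresholded decoding pass, assuming PyTorch tensors; `parallel_decode_step` is a hypothetical helper, not the official API.

```python
import torch

def parallel_decode_step(logits: torch.Tensor, block_ids: torch.Tensor,
                         mask_id: int, threshold: float = 0.9) -> torch.Tensor:
    """One confidence-thresholded pass over a block (illustrative only).

    logits:    [block_size, vocab_size] model outputs for the masked block
    block_ids: [block_size] current block, with undecided positions == mask_id
    Returns the updated block; low-confidence positions stay masked for the next pass.
    """
    probs = logits.softmax(dim=-1)
    confidence, candidates = probs.max(dim=-1)
    undecided = block_ids == mask_id
    accept = undecided & (confidence >= threshold)    # finalize several tokens at once
    return torch.where(accept, candidates, block_ids)
```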

---

Practical Applications

Fast-dLLM v2 combines AR robustness with parallel efficiency, making it ideal for latency-sensitive workloads.

Monetization Example — AiToEarn

For creators, AiToEarn connects:

  • AI-powered content generation
  • Cross-platform publishing
  • Analytics + monetization

Platforms include:

Douyin, Kwai, WeChat, Bilibili, Xiaohongshu (Rednote), Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter).

It also provides:

  • Multi-platform analytics
  • AI model rankings
  • Open-source tools for scaling AI creativity into revenue

---

Performance Results

  • End-to-end acceleration: up to 2.5× speedup with maintained quality
  • 7B model on A100:
    • Throughput: 2.54×
    • Accuracy: +5.2% (GSM8K) over Fast-dLLM-LLaDA
  • Scaling:
    • Larger batches on A100/H100 → acceleration grows
    • ~1.5× boost (A100) → up to ~1.8× (H100)

Benchmark Scores

  • 1.5B Model: Avg 45.0 — new SOTA among ~1B-parameter models
  • 7B Model: Avg 60.3 — beats Qwen2.5-7B-Nemo-FT (59.6) and Dream (57.6)
  • Benchmarks include HumanEval, MBPP, GSM8K, MATH, MMLU, GPQA, IFEval

---

Training Cost

Efficiency:

  • ~1B tokens of fine-tuning vs. Dream’s ~580B
  • Trained on 64 A100 GPUs in a matter of hours
  • Full reproducibility with provided configs

---

Conclusion

Fast-dLLM v2 delivers:

  • AR → Block Diffusion adaptation in hours
  • 2.5× throughput boost
  • Comparable or better accuracy
  • Low compute requirements

Tuning options for balancing speed and quality (a configuration sketch follows this list):

  • Block size
  • Confidence threshold
  • Cache strategy
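
For illustration, these knobs might be grouped into a decoding configuration like the sketch below; the field names and defaults are hypothetical, not the released Fast-dLLM v2 API.

```python
from dataclasses import dataclass

@dataclass
class DecodingConfig:
    # Hypothetical configuration illustrating the tuning knobs listed above.
    block_size: int = 32               # larger blocks -> more parallelism, more refinement passes
    confidence_threshold: float = 0.9  # lower -> faster, but riskier early commitments
    use_dual_cache: bool = True        # cache prefix + finalized-suffix KV during refinement

config = DecodingConfig(block_size=32, confidence_threshold=0.9, use_dual_cache=True)
```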

These gains enhance both developer productivity and content monetization workflows.

Platforms like the AiToEarn official website turn high-performance AI models into global reach and revenue via open-source, multi-platform integration.

---

