NVIDIA, HKU, and MIT Launch Fast-dLLM v2: 2.5× End-to-End Throughput Boost


Autoregressive (AR) LLMs vs. Diffusion LLMs (dLLM)

Autoregressive (AR) large language models generate output sequentially, token-by-token, which limits inference efficiency.

Diffusion LLMs (dLLMs) can generate many tokens in parallel (contrasted with AR decoding in the sketch after this list), but have traditionally struggled with:

  • KV cache reuse
  • Variable-length generation
  • Consistently matching or outperforming AR models in output quality
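
To make the contrast concrete, here is a minimal, illustrative sketch of the two decoding loops. `ar_step` and `denoise_block` are hypothetical stand-ins for a model forward pass, not Fast-dLLM v2 APIs.

```python
# Illustrative contrast between sequential AR decoding and block-parallel
# diffusion-style decoding. `ar_step` / `denoise_block` are hypothetical callables.

def ar_decode(prompt_ids, ar_step, max_new_tokens):
    """Autoregressive: one new token per forward pass."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = ar_step(ids)      # each step depends on all previously emitted tokens
        ids.append(next_id)
    return ids

def diffusion_decode(prompt_ids, denoise_block, block_size, num_passes):
    """Diffusion-style: a whole block of masked positions is refined in parallel."""
    MASK = -1                       # placeholder id for a still-masked position
    block = [MASK] * block_size
    for _ in range(num_passes):     # a few denoising passes instead of block_size steps
        block = denoise_block(prompt_ids, block)
    return list(prompt_ids) + block
```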

---

Fast-dLLM v2 — Pragmatic Parallel Decoding

Fast-dLLM v2 adapts a pre-trained AR model into a Block-dLLM with only ~1B tokens of fine-tuning, enabling lossless migration.

Key benefits:

  • No need for massive datasets (Dream requires ~580B tokens)
  • Runs efficiently on A100/H100 GPUs
  • Up to 2.5× throughput boost without accuracy loss


---

Key Highlights


Minimal Data Adaptation

  • Only ~1B tokens of fine-tuning required
  • Works with existing AR models like Qwen2.5-Instruct 1.5B/7B
  • Avoids hundreds of billions of tokens needed by other approaches

AR-Friendly Architecture

  • Block-internal bidirectional attention + block-to-block causal attention (mask construction sketched after this list)
  • Preserves AR semantics, KV cache reuse, and variable-length generation
  • Uses complementary masking + token-shift for robust adaptation
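
The block-wise attention pattern can be pictured as a boolean mask that is bidirectional inside a block and causal across blocks. The following is a minimal PyTorch sketch; `block_diffusion_mask` is a hypothetical helper, not the released implementation.

```python
import torch

def block_diffusion_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): bidirectional inside a block,
    causal from a block to all earlier blocks."""
    block_id = torch.arange(seq_len) // block_size
    same_block = block_id[:, None] == block_id[None, :]    # full attention within a block
    earlier_block = block_id[:, None] > block_id[None, :]  # causal attention to past blocks
    return same_block | earlier_block

# Example: 8 tokens, block size 4 -> positions 0-3 attend to each other,
# positions 4-7 attend to each other plus all of block 0.
print(block_diffusion_mask(8, 4).int())
```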

Hierarchical Cache + Parallel Decoding

  • Block-level KV cache for efficiency
  • DualCache reduces redundant work during denoising/refinement
  • Confidence-thresholded parallel decoding boosts end-to-end speed

Large-Model Validation

  • At 7B scale, matches Qwen2.5-7B-Instruct quality
  • Throughput improvement: 2.54×

---

Methodology — From AR to Block Diffusion


1. Block Diffusion with AR-Friendly Attention

  • Split sequences into fixed-size blocks
  • Within-block: bidirectional attention → parallel denoising
  • Between-block: causal attention → preserves AR semantics
  • Complementary masking + token-shift → each token is learned in both visible and masked states (sketched below)
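
As a rough illustration of the complementary-masking idea, assuming PyTorch tensors: `complementary_masked_views` is a hypothetical helper, and the token-shift detail is only noted in a comment.

```python
import torch

def complementary_masked_views(block_ids: torch.Tensor, mask_id: int, p: float = 0.5):
    """Build two training views of the same block with complementary masks, so each
    position is masked in exactly one view and visible in the other.
    (Token-shift, i.e. predicting position i from the hidden state at i-1 as in AR
    training, is part of the recipe but omitted from this sketch.)"""
    mask_a = torch.rand(block_ids.shape) < p
    mask_b = ~mask_a                                  # complementary positions
    fill = torch.full_like(block_ids, mask_id)
    view_a = torch.where(mask_a, fill, block_ids)
    view_b = torch.where(mask_b, fill, block_ids)
    return view_a, view_b
```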

2. Hierarchical Cache Structure

Block-Level Cache

  • Reuses KV for fully decoded blocks → native AR-style caching

DualCache (Sub-Block)

  • Stores prefix & suffix KV for partially decoded blocks
  • Avoids repeated computation during refinement cycles (see the sketch below)
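
A simplified Python sketch of the hierarchical-cache idea; `DualCache` here is an illustrative stand-in, not the paper's implementation.

```python
class DualCache:
    """Keep KV for the clean prefix (fully decoded blocks) and for the current
    block's already-finalized tokens, so refinement passes only recompute the
    still-masked positions. Simplified, illustrative sketch."""

    def __init__(self):
        self.prefix_kv = []   # one (key, value) entry per fully decoded block
        self.suffix_kv = {}   # position -> (key, value) for finalized tokens in the active block

    def commit_block(self, block_kv):
        """Once a block is fully decoded, promote it to the AR-style prefix cache."""
        self.prefix_kv.append(block_kv)
        self.suffix_kv.clear()

    def add_finalized(self, position, kv):
        """Cache a token finalized mid-block so later refinement passes skip it."""
        self.suffix_kv[position] = kv

    def positions_to_recompute(self, block_positions):
        """Only positions without cached KV need a fresh forward pass."""
        return [p for p in block_positions if p not in self.suffix_kv]
```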

3. Confidence-Aware Parallel Decoding

  • When a token's confidence exceeds the threshold (e.g., 0.9), multiple tokens are finalized in a single pass (see the sketch after this list)
  • Low-confidence tokens → left masked and refined in later passes
  • Example (GSM8K):
    • Tokens/s: 39.1 → 101.7 (~2.6× speedup)
    • Accuracy impact: negligible
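
A minimal sketch of one confidence-thresholded decoding pass, assuming PyTorch tensors; `parallel_decode_step` is a hypothetical helper, not the official API.

```python
import torch

def parallel_decode_step(logits: torch.Tensor, block_ids: torch.Tensor,
                         mask_id: int, threshold: float = 0.9) -> torch.Tensor:
    """One confidence-thresholded pass over a block (illustrative only).

    logits:    [block_size, vocab_size] model outputs for the masked block
    block_ids: [block_size] current block, with undecided positions == mask_id
    Returns the updated block; low-confidence positions stay masked for the next pass.
    """
    probs = logits.softmax(dim=-1)
    confidence, candidates = probs.max(dim=-1)
    undecided = block_ids == mask_id
    accept = undecided & (confidence >= threshold)    # finalize several tokens at once
    return torch.where(accept, candidates, block_ids)
```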

---

Practical Applications

Fast-dLLM v2 combines AR robustness with parallel efficiency, making it ideal for latency-sensitive workloads.

Monetization Example — AiToEarn

For creators, AiToEarn connects:

  • AI-powered content generation
  • Cross-platform publishing
  • Analytics + monetization

Platforms include:

Douyin, Kwai, WeChat, Bilibili, Xiaohongshu (Rednote), Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter).

It also provides:

  • Multi-platform analytics
  • AI model rankings
  • Open-source tools for scaling AI creativity into revenue

---

Performance Results

  • End-to-end acceleration: up to 2.5× speedup with maintained quality
  • 7B model on A100:
    • Throughput: 2.54×
    • Accuracy: +5.2% (GSM8K) over Fast-dLLM-LLaDA
  • Scaling:
    • Larger batches on A100/H100 → acceleration grows
    • ~1.5× boost (A100) → up to ~1.8× (H100)

Benchmark Scores

  • 1.5B Model: Avg 45.0 — new SOTA among ~1B-parameter models
  • 7B Model: Avg 60.3 — beats Qwen2.5-7B-Nemo-FT (59.6) and Dream (57.6)
  • Benchmarks include HumanEval, MBPP, GSM8K, MATH, MMLU, GPQA, IFEval

---

Training Cost

Efficiency:

  • ~1B tokens of fine-tuning vs. Dream’s ~580B
  • Trained on 64 A100 GPUs in a matter of hours
  • Full reproducibility with provided configs

---

Conclusion

Fast-dLLM v2 delivers:

  • AR → Block Diffusion adaptation in hours
  • 2.5× throughput boost
  • Comparable or better accuracy
  • Low compute requirements

Tuning options for balancing speed and quality (a configuration sketch follows this list):

  • Block size
  • Confidence threshold
  • Cache strategy
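
For illustration, these knobs might be grouped into a decoding configuration like the sketch below; the field names and defaults are hypothetical, not the released Fast-dLLM v2 API.

```python
from dataclasses import dataclass

@dataclass
class DecodingConfig:
    # Hypothetical configuration illustrating the tuning knobs listed above.
    block_size: int = 32               # larger blocks -> more parallelism, more refinement passes
    confidence_threshold: float = 0.9  # lower -> faster, but riskier early commitments
    use_dual_cache: bool = True        # cache prefix + finalized-suffix KV during refinement

config = DecodingConfig(block_size=32, confidence_threshold=0.9, use_dual_cache=True)
```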

These gains enhance both developer productivity and content monetization workflows.

Platforms like the AiToEarn official website turn high-performance AI models into global reach and revenue via open-source, multi-platform integration.

---

