Fast-dLLM
NVIDIA, HKU, and MIT Launch Fast-dLLM v2: 2.5× End-to-End Throughput Boost
Autoregressive (AR) LLMs vs. Diffusion LLMs (dLLMs)

Autoregressive (AR) large language models generate output sequentially, token by token, which limits inference efficiency. Diffusion LLMs (dLLMs) allow parallel generation, but have traditionally struggled with:

* KV cache reuse
* Variable-length generation
* Consistently outperforming AR models in output quality

---

Fast-dLLM v2: Pragmatic Parallel Decoding

Fast-dLLM v2 adapts a pretrained autoregressive model into a block-diffusion decoder, enabling parallel token generation within each block.
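To make the contrast concrete, below is a minimal toy sketch (NumPy only, not the authors' implementation) of the two decoding loops. `toy_logits` is a stand-in for a real model forward pass, and the block size, confidence `threshold`, and fraction of peaked distributions are illustrative assumptions; the point is simply that the AR loop pays one forward pass per token, while a confidence-thresholded block decoder can commit several tokens per pass.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 100
MASK = -1  # sentinel for a not-yet-decoded position


def toy_logits(num_positions: int) -> np.ndarray:
    """Stand-in for a model forward pass: one row of logits per position."""
    logits = rng.normal(size=(num_positions, VOCAB_SIZE))
    # Give some positions a sharply peaked (high-confidence) distribution,
    # imitating positions the model is already sure about.
    for i in range(num_positions):
        if rng.random() < 0.5:
            logits[i, rng.integers(VOCAB_SIZE)] += 10.0
    return logits


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def ar_decode(num_tokens: int) -> tuple[list[int], int]:
    """Autoregressive loop: exactly one forward pass per generated token."""
    tokens, passes = [], 0
    for _ in range(num_tokens):
        probs = softmax(toy_logits(1))[0]
        tokens.append(int(probs.argmax()))
        passes += 1
    return tokens, passes


def block_parallel_decode(block_size: int, threshold: float = 0.9) -> tuple[list[int], int]:
    """Block-parallel loop: every masked position whose top-1 probability
    clears the threshold is committed in the same forward pass, so several
    tokens can be finalized per pass."""
    block, passes = [MASK] * block_size, 0
    while MASK in block:
        probs = softmax(toy_logits(block_size))
        passes += 1
        masked = [i for i, t in enumerate(block) if t == MASK]
        confident = [i for i in masked if probs[i].max() >= threshold]
        if not confident:  # guarantee progress: commit the single best guess
            confident = [max(masked, key=lambda i: probs[i].max())]
        for i in confident:
            block[i] = int(probs[i].argmax())
    return block, passes


if __name__ == "__main__":
    _, ar_passes = ar_decode(num_tokens=8)
    _, par_passes = block_parallel_decode(block_size=8)
    print(f"AR: {ar_passes} passes; block-parallel: {par_passes} passes")
```

The fallback branch (committing one token per pass when nothing clears the threshold) reflects a general design constraint of confidence-gated parallel decoding: it degrades gracefully to sequential speed rather than stalling.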