Inference Speed Soars 60×: DiDi-Instruct Lets Diffusion LLMs Beat Thousand-Step GPT in Just 16 Steps

2025-10-27 13:21 Beijing

DiDi-Instruct introduces a breakthrough approach for extreme acceleration, alignment, and reinforcement learning in large-scale language models.


---

Overview

A collaboration between Purdue University, the University of Texas, the National University of Singapore, Morgan Stanley ML Research, and Xiaohongshu Hi-Lab has produced a post-training method for discrete diffusion large language models (dLLMs) called Discrete Diffusion Divergence Instruct (DiDi-Instruct).

Key advantages:

  • Up to 60× acceleration over traditional GPT and standard diffusion LLMs.
  • Distills a fast "student" from a diffusion "teacher" model, improving both efficiency and generation quality.

---

How It Works

Core Innovation — Post-training via Probability Distribution Matching

  • Teacher model: requires >500 inference steps.
  • Student model: generates a full paragraph in only 8–16 steps.
  • Benchmark results (OpenWebText):
      • 64× inference speedup.
      • Outperforms both the 1024-step teacher dLLM and 1024-step GPT-2 in generation quality.

Practical Impact

The algorithm improves both efficiency and reasoning quality, enabling ultra-efficient LLM deployment.


---


Research Background

The "Speed Limit" of LLM Generation

Autoregressive models (ARMs), e.g., ChatGPT, DeepSeek:

  • Generate sequentially, token-by-token.
  • Even with parallel hardware, long-form generation latency remains high.

Diffusion language models (dLLMs):

  • Start with noise/masked sequences → iteratively denoise in parallel.
  • Faster paragraph generation possible, but still require hundreds of steps.
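The parallel, iterative denoising loop described above can be sketched as a toy example (illustrative only, not an actual dLLM implementation; `predict` stands in for the model's per-position token prediction, and `MASK` is a hypothetical mask token id):

```python
import random

MASK = -1  # hypothetical mask token id

def denoise_step(tokens, predict, frac_unmask):
    """Unmask a fraction of the still-masked positions; in a real dLLM,
    all chosen positions are filled from a single parallel forward pass."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    k = max(1, int(len(masked) * frac_unmask)) if masked else 0
    for i in random.sample(masked, min(k, len(masked))):
        tokens[i] = predict(tokens, i)
    return tokens

def generate(length, steps, predict):
    """Start from an all-masked sequence and denoise over a fixed step budget."""
    tokens = [MASK] * length
    for s in range(steps):
        # schedule so that all remaining masks are filled by the final step
        tokens = denoise_step(tokens, predict, frac_unmask=1.0 / (steps - s))
    return tokens
```

Reducing `steps` from ~1024 to 8–16 while keeping output quality is exactly the gap DiDi-Instruct targets.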

Key research question:

Can a model, in just 8–16 iterations, outperform a 1024-step GPT?

---

DiDi-Instruct — The Game-Changer

Post-training algorithm for dLLMs:

  • Reduces 1024 steps → 8–16 steps
  • Improves performance vs. teacher and AR baselines.

Implication:

Optimizes high-speed generation without sacrificing quality — ideal for real-world deployment.

---

Theoretical Foundation

Derived from Diff-Instruct, originally developed for continuous diffusion models, and extended here to the discrete setting.

Goal: Minimize integral KL divergence between student and teacher models across the entire discrete denoising trajectory.

This:

  • Aggregates KL divergence across time steps (weighted).
  • Ensures global matching rather than only final output alignment.
  • Leads to better convergence and stability.
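With notation assumed for illustration (the paper's exact weighting and parameterization may differ), the objective can be written as:

```latex
\min_{\theta} \; \int_{0}^{1} w(t)\, D_{\mathrm{KL}}\!\left( p_{t}^{\theta} \,\middle\|\, q_{t} \right) \mathrm{d}t
```

where p_t^θ and q_t are the time-t marginals of the student and teacher along the discrete noising trajectory, and w(t) weights each noise level. Matching at every t, not only at the clean endpoint, is what gives the global trajectory alignment described above.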

---


Pipeline:

  • Reconstruct clean text from masked input (teacher & student).
  • Discriminator compares outputs → reward signal.
  • Student model learns to match teacher’s distribution.
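The discriminator-to-reward step in this pipeline can be illustrated with a scalar sketch (an assumed simplification; the actual discriminator scores token sequences at intermediate noise levels, not single probabilities):

```python
import math

def discriminator_reward(d_prob: float) -> float:
    """Turn the discriminator's probability that a noisy sample is
    teacher-like into a reward via the log-density ratio
    log d - log(1 - d)."""
    eps = 1e-6  # clamp to avoid log(0)
    d = min(max(d_prob, eps), 1.0 - eps)
    return math.log(d) - math.log(1.0 - d)
```

The reward is zero when the discriminator cannot tell student from teacher (d = 0.5), positive for teacher-like samples, and negative otherwise, which is the signal the student's policy-gradient update consumes.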

---

Key Technical Innovations

  • Policy-Gradient Distribution Matching
      • Reformulates the KL objective into a policy-gradient form with a reward function, avoiding the non-differentiability of discrete tokens.
  • Dynamic Reward Shaping (Adversarial Learning)
      • A discriminator distinguishes noisy samples from student and teacher; its log-density ratio serves as the reward signal.
  • Stability & Quality Enhancements:
      • Grouped Reward Normalization: reduces variance, inspired by GRPO.
      • Stepwise Intermediate-State Matching: prevents mode collapse and improves output diversity.
      • Reward-Guided Ancestral Sampling: integrates gradient guidance with candidate re-ranking for high-quality inference.
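Of these, grouped reward normalization is the simplest to illustrate. A minimal standalone sketch (not the authors' implementation): rewards for a group of samples generated from the same prompt are standardized within the group before entering the policy-gradient update, which reduces gradient variance.

```python
def group_normalize(rewards):
    """Standardize rewards within one group of samples for the same prompt
    (GRPO-style): subtract the group mean, divide by the group std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std < 1e-8:
        return [0.0] * n  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]
```

After normalization the group's advantages sum to zero, so only relative quality within the group drives the update.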
image

Post-training algorithm.

image

Reward-guided ancestral sampling.
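The candidate re-ranking part of reward-guided ancestral sampling can be sketched as follows (a hypothetical standalone illustration; `reward_fn` stands in for the learned discriminator reward, and the gradient-guidance component is omitted):

```python
import math
import random

def rerank_step(candidates, reward_fn, temperature=1.0):
    """At one denoising step, score each candidate partial sequence with the
    reward and sample the next state from a softmax over those scores."""
    scores = [reward_fn(c) / temperature for c in candidates]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return random.choices(candidates, weights=weights, k=1)[0]
```

Lowering `temperature` makes selection greedier (closer to picking the highest-reward candidate), while a higher temperature preserves diversity.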

---

Experimental Results

1. New Quality Benchmark

  • Outperformed GPT-2, MDLM, DUO, and SDTT across 8–128 NFEs (number of function evaluations).
  • At 16 NFEs, output quality exceeds that of the 1024-step teacher, a roughly 30% improvement.
  • Maintains diversity (only ~1% entropy loss).

---

2. Training Efficiency

  • Distillation training: ~1 hour on a single NVIDIA H100 GPU.
  • Competing baselines require over 20× longer training.

---

3. Cross-Domain Generalization

  • Applied to protein sequence generation.
  • Student retained variable-length generation capacity.
  • Produced high-confidence structures in few steps.

---

4. Ablation Studies

Key findings:

  • Intermediate-state matching: the crucial stability driver.
  • Time-step coupling: major perplexity (PPL) reduction at low step counts.
  • Regularization: helps at low NFEs, may hinder at high NFEs.
  • Guided inference: improves quality at low NFEs, boosts diversity at high NFEs.

---

Technical Outlook

Promise:

Efficient, adaptive generative models across NLP, bioinformatics, and beyond.

Future potential:

Coupling step-distillation methods like DiDi-Instruct with production deployment pipelines could enable large-scale, low-latency, and cost-efficient AI generation.

---

Team

  • Haoyang Zheng (First Author, Ph.D. candidate, Purdue University, supervised by Prof. Guang Lin).
  • Weijian Luo (Hi-Lab, Xiaohongshu).
  • Wei Deng (Morgan Stanley ML Research, NY).

---

Conclusion:

DiDi-Instruct delivers record-setting acceleration and top-tier quality in discrete diffusion LLMs — establishing a robust framework for next-gen AI content generation at unprecedented efficiency.

---
