Lilian Weng’s “Elegant” Approach to Policy Distillation — How It Redefines Cost and Efficiency | New Paper Analysis

A Leap of Imagination


AI Future Compass — a paper-interpretation column breaking down top conference and journal highlights with frontline perspectives and accessible language.

---

Breaking the "Impossible Triangle"

For years, post-training of models has been trapped in an impossible triangle:

Researchers want models to have strong capabilities, low training costs, and controllable alignment — all at once.

Limitations of Mainstream Methods:

  • SFT & Distillation
    • Pros: Simple, parallelizable
    • Cons: "Cramming-style" training makes models brittle; they fail in unknown scenarios or when errors occur.
  • RL (Reinforcement Learning)
    • Pros: Enables exploration and adaptability
    • Cons: Sparse rewards require costly large-scale trial-and-error.

---

On-Policy Distillation (OPD): A New Path

Against this backdrop, Thinking Machines’ deep dive into Qwen’s work introduces On-Policy Distillation (OPD) — aiming to solve the impossible triangle.

About Thinking Machines

Founded in February 2025 by former OpenAI CTO Mira Murati, the team includes alumni from OpenAI, DeepMind, Meta, and core contributors to ChatGPT and DALL·E.

Notably, Lilian Weng, former VP of Safety Research at OpenAI, is a co-founder.


Lilian Weng describes OPD as an "elegant combination" — a surprising fusion that addresses the challenges of both major paradigms.

---

01 How OPD Differs from Traditional Distillation

The Shortcomings of Traditional Distillation

  • Off-policy: Student learns from static, perfect trajectories generated by the teacher.
  • Students see only states the teacher would encounter — leading to compounding errors when reality deviates.

OPD’s Shift to On-Policy

  • Data come from the student’s own real-time outputs, not fixed teacher demos.
  • Students receive dense, token-level guidance from the teacher on their own (possibly flawed) trajectories.
  • This lets students learn from mistakes, improving robustness.

Key benefit:

OPD trains students to think in states they will actually encounter, fixing the compounding error problem of SFT and distillation.
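To make the loop concrete, below is a minimal sketch of one OPD training step in PyTorch, assuming Hugging Face-style `generate`/`logits` interfaces. The helper name `opd_step` and all variable names are illustrative assumptions, not code from the Thinking Machines post.

```python
# Minimal sketch of one OPD step: the student samples its own trajectory,
# the teacher scores every position, and the student minimizes a dense
# per-token reverse KL toward the teacher. Interfaces are assumed HF-style.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, optimizer, max_new_tokens=256):
    # 1. The student samples its own (possibly flawed) continuation.
    with torch.no_grad():
        rollout = student.generate(prompt_ids, do_sample=True,
                                   max_new_tokens=max_new_tokens)

    # 2. Both models score every position along the student's trajectory.
    student_logits = student(rollout).logits[:, :-1]      # predicts token t+1
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, :-1]

    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # 3. Dense token-level supervision: reverse KL(student || teacher),
    #    averaged over the tokens the student actually generated.
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    gen_mask = torch.zeros_like(rollout[:, 1:], dtype=torch.bool)
    gen_mask[:, prompt_ids.shape[1] - 1:] = True          # mask out the prompt
    loss = per_token_kl[gen_mask].mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The essential point is step 3: every generated token receives a teacher-derived signal on a state the student itself reached, rather than one sparse reward per rollout on teacher-chosen states.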

---

02 Fusing RL’s Resilience with SFT’s Efficiency

RL’s Strength

  • Learns from real error states.
  • Avoids SFT’s brittle failure cascades.

RL’s Weakness

  • Inefficient due to sparse rewards and hard credit assignment.
  • Process Reward Models (PRM) help but require costly training.

OPD’s Elegant Simplification

  • No PRM needed — the teacher provides direct KL divergence feedback.
  • RL’s adaptability + SFT’s dense supervision = High efficiency and enhanced stability.

OPD brings RL’s soul (on-policy data) into SFT’s body (dense divergence loss) — avoiding SFT’s fragility and RL’s sluggishness.
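As a toy illustration of the "no PRM needed" point: the dense signal can be read directly off the two models' log-probabilities of the student's own sampled tokens, which is the single-sample estimate of the negative reverse KL at each position. The function and values below are hypothetical.

```python
# Dense per-token "reward" without a process reward model: the teacher's
# log-prob gap on the student's sampled tokens plays the role of the signal.
import torch

def per_token_rewards(student_logp_sampled, teacher_logp_sampled):
    """r_t = -(log p_student(y_t) - log p_teacher(y_t)): the single-sample
    estimate of the negative reverse KL at each generated position."""
    return -(student_logp_sampled - teacher_logp_sampled)

# Toy usage: token 2 is one the teacher strongly disagrees with, so it gets
# a strongly negative reward, pinpointing exactly where the student went wrong.
student_lp = torch.tensor([-0.20, -0.10, -0.05])   # student is confident everywhere
teacher_lp = torch.tensor([-0.30, -2.50, -0.04])   # teacher rejects token 2
print(per_token_rewards(student_lp, teacher_lp))    # approx. tensor([-0.10, -2.40, 0.01])
```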

---

03 Cost Efficiency: The Second Layer of Elegance

At first glance, OPD seems costly — the large teacher model is invoked every step.

But real numbers tell a different story:

  • SFT: Cheap at first, but cost skyrockets for small performance gains beyond initial plateaus.
  • RL: Wastes most compute on unproductive exploration.
  • OPD: Higher per-step cost, yet vastly higher sample efficiency; no step is wasted.

Example:

  • To go from 60% to 70% accuracy:
  • SFT: Needs ~5× more data (costly brute force)
  • OPD: A small, targeted correction phase after SFT on 400K examples suffices.

Result: OPD's extra cost is only about 1/13.6 of the SFT brute-force approach.

It replaces wasted trials with targeted, high-quality guidance.
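For intuition, here is a back-of-envelope version of that comparison. The sample counts and per-sample costs below are placeholder assumptions, not the article's actual figures; the article itself reports a ratio of roughly 1/13.6.

```python
# Illustrative cost comparison: OPD pays more per sample (the teacher is
# queried at every step) but needs far fewer samples to close the same gap.
def relative_extra_cost(sft_extra_samples, sft_cost_per_sample,
                        opd_extra_samples, opd_cost_per_sample):
    """Ratio of OPD's extra cost to SFT's brute-force extra cost."""
    return (opd_extra_samples * opd_cost_per_sample) / (
        sft_extra_samples * sft_cost_per_sample)

# Placeholder plug-ins: SFT needs on the order of 5x the original 400K examples
# of new data, while OPD needs only a short correction phase at a higher
# per-sample cost because the teacher is invoked each step.
ratio = relative_extra_cost(sft_extra_samples=1_600_000, sft_cost_per_sample=1.0,
                            opd_extra_samples=40_000,    opd_cost_per_sample=3.0)
print(f"OPD extra cost is about {ratio:.1%} of SFT's")   # 7.5% with these placeholders
```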

---

04 The Leap of Imagination

Why was OPD not broadly adopted earlier?

Source Inspirations:

  • Qwen3 (engineering foundation)
  • Agarwal et al.'s research on on-policy distillation
  • Tinker (Thinking Machines' fine-tuning infrastructure)

Cultural Barriers:

  • SFT camp: Focused on keeping costs low, so it dismissed online methods that query a large teacher at every step.
  • RL camp: Fixated on reward models, undervalued imitation learning.

OPD broke both biases:

  • An online method that’s cheaper overall than SFT due to sample efficiency.
  • Dispenses with complex reward models entirely.

---

Practical Applications & Content Platforms

For practitioners exploring OPD or similar high-efficiency techniques, platforms such as the official AiToEarn site streamline the journey from idea to impact:

  • AI-powered content generation
  • Cross-platform publishing to Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter)
  • Performance analytics & model leaderboards (AI model rankings).

This is akin to OPD’s philosophy: minimal waste, maximum reach.

---

Final Thought

OPD = On-Policy sampling + KL divergence loss

A minimalist blend that resolves the weaknesses of two complex paradigms (SFT and RL): a true leap of imagination.
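Spelled out, one standard way to write this objective (notation ours, consistent with the per-token reverse-KL description above: the expectation is over the student's own rollouts, and every token gets a dense divergence penalty toward the teacher):

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta) =
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ \sum_{t=1}^{|y|}
    D_{\mathrm{KL}}\!\left(
      \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\,
      \pi_{\mathrm{teacher}}(\cdot \mid x, y_{<t})
    \right)
  \right]
```

In practice the rollout is sampled first and treated as fixed; gradients flow only through the student's token-level log-probabilities.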

Technical report:

https://thinkingmachines.ai/blog/on-policy-distillation/


---

In essence: OPD shows how simple combinations can solve deeply complex challenges, much like how a well-designed platform can multiply the reach of creative work without unnecessary overhead.

