Why Isn't RL as Stable as SFT? Plus Various RL Tricks
Large Model Intelligence — Discussion
This article is a casual exploration of thoughts on large language model (LLM) reinforcement learning (RL). The inspiration comes from countless exchanges about RL training with my good friend @真中合欢, who often warns me against casually “stuffing in techniques.” He convinced me, and now I’m seeing whether I can persuade others.
---
01 — SFT and RL
Common Perceptions About SFT
Today, many people hold a prejudice against supervised fine-tuning (SFT), believing it is just “data washing” with no technical depth.
This view is only partially correct, and only if you work on infra: RL frameworks do demand more advanced coding skills than SFT frameworks.
For algorithm and data teams, RL work often involves recombining known RL tricks for experiments. I argue this isn’t inherently more technically challenging than SFT.
Most importantly:
> Algorithm work should not treat SFT and RL as disconnected.
> The exploration space during RL comes from SFT. A strong SFT cold start is essential for RL to deliver meaningful gains.
---
Revisiting Loss Functions
A quick review of RL loss versus SFT loss:
- Transformer Output
  - Each step produces a logits vector.
  - Softmax yields the probability distribution over tokens.
  - Denote the logits as \( z \).
- RL Step Behavior
  - Positive advantage → push the selected token’s logit higher.
  - Negative advantage → push the selected token’s logit lower.
- SFT Loss — Cross-Entropy
  - For a target distribution \( p \) (often one-hot) and prediction \( q \):
    \[
    \text{CE}(p, q) = - \sum_{i} p_i \log q_i
    \]
Key Insight
SFT loss and RL loss are formally identical.
SFT is the special case of RL in which the advantage is a constant 1 for every sample (see the sketch below).
> Correction from 俞扬:
> In RL, advantage \( A(s,a) \) changes with state–action pairs and must be estimated via sampling — introducing inevitable error and instability. In SFT, advantage truly is a constant 1.
> Loss form similarity is a lead-in: the bigger difference is training stability due to differing data distributions.
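To make the formal point concrete, here is a minimal sketch (shapes and function names are my own assumptions, not from the original post) showing that a token-level policy-gradient loss with a constant advantage of 1 reduces to plain cross-entropy:

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes assumed): the REINFORCE-style token loss -A * log pi(token),
# averaged over tokens, equals plain cross-entropy when A = 1 everywhere.
def pg_token_loss(logits, target_ids, advantage):
    logprobs = F.log_softmax(logits, dim=-1)                            # (B, T, V)
    chosen = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(advantage * chosen).mean()

B, T, V = 2, 5, 100
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

sft_loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
rl_loss = pg_token_loss(logits, targets, advantage=torch.ones(B, T))
assert torch.allclose(sft_loss, rl_loss, atol=1e-6)
```

With a sampled, non-constant advantage, the same loss becomes the noisier RL objective the correction above describes.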
---
02 — Why RL Is Less Stable Than SFT
If the loss forms are equivalent, why do we see:
- SFT → stable for months with the same hyperparameters
- RL → prone to collapse within a single run
Speculation
- Infra Complexity
  - RL’s infra side demands more care; bugs lurk everywhere (the first pitfall is sketched after this list):
    - Did you account for `temperature`, `top_p`, and `top_k` when computing logprobs?
    - When the reward changes from binary to fractional, are the sample-handling rules still correct?
    - Is model rollout duplication handled correctly?
  - Frameworks like Megatron, vllm, and sglang hide subtler bugs.
- Data Cleanliness Gap
  - SFT uses multi-step filtering (rules + models + manual review).
  - RL often simply adds a reward model; with complex prompts and complex ground truths, even 90% judgment accuracy counts as good.
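As a concrete instance of the logprob pitfall above, a toy sketch (the function and values are hypothetical, not taken from any specific framework):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: if rollout sampling used temperature T != 1, the stored
# logprob for a sampled token must come from softmax(logits / T). Recomputing it
# from softmax(logits) silently biases the importance ratio exp(lp_new - lp_old).
def sampled_logprob(logits, token_id, temperature=1.0):
    return F.log_softmax(logits / temperature, dim=-1)[token_id]

logits = torch.tensor([2.0, 1.0, 0.1])
token_id = 1
lp_rollout = sampled_logprob(logits, token_id, temperature=0.7)  # what the sampler actually used
lp_naive = sampled_logprob(logits, token_id, temperature=1.0)    # what a careless trainer recomputes
print(float(lp_rollout), float(lp_naive))                        # they differ, so downstream ratios are wrong
```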
---
Radical View: RL Is “Rowing Upstream”
- Every RL sample carries some toxicity; if the model fails to extract the valuable signal, it drifts toward collapse.
- Positive samples → risk overfitting by further reinforcing already-high-probability tokens.
- Negative samples → disrupt the probability distribution, shifting mass onto all other tokens and pushing the model toward unknown states (a toy sketch follows below).
Open Question
> What exactly do negative samples teach?
> When a token’s probability is redistributed, what knowledge is actually gained?
> How can negative samples be used correctly?
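A toy sketch of that redistribution, under my own simplifying assumptions (a single token position and one vanilla gradient step on \( -A \log \pi \)):

```python
import torch
import torch.nn.functional as F

# Toy sketch: one gradient step with a negative advantage lowers the chosen token's
# probability and spreads the freed mass across all other tokens, roughly in
# proportion to their current probabilities; it never says which of them is right.
logits = torch.tensor([3.0, 1.0, 0.5, 0.1], requires_grad=True)
chosen, advantage, lr = 0, -1.0, 0.5

loss = -advantage * F.log_softmax(logits, dim=-1)[chosen]
loss.backward()
with torch.no_grad():
    new_logits = logits - lr * logits.grad

print(F.softmax(logits, dim=-1))      # before: the chosen token dominates
print(F.softmax(new_logits, dim=-1))  # after: its mass is pushed onto every other token
```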
---
03 — RL Tricks to Stabilize Training
Because RL collapses easily, these techniques are common:
- Entropy Loss — whether it actually stabilizes training is debated
- CLIP — strong in theory, mixed results in practice (a minimal sketch appears below)
- Token Masking — special logic for certain token sets
- Reward Shaping
  - Control the proportion of binary-reward samples
  - Use pass@K as the optimization target
  - Reward via test-case pass rate
  - Apply length penalties
- Train / Inference Consistency — hot topic, e.g., `tis`, `icepop`
> Personally, I dislike adding too many tricks (entropy_loss, kl_loss) without knowing their precise impact — often they just prevent collapse superficially.
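For reference, a minimal sketch of the clipped surrogate behind the CLIP bullet above (token-level; the shapes and the value of `eps` are my assumptions):

```python
import torch

# Minimal sketch of a PPO-style clipped surrogate over per-token logprobs.
def clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                           # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                     # pessimistic (clipped) objective
```

The clip keeps the ratio near 1, but, as the note above warns, it also quietly changes which samples effectively contribute to training.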
---
Understanding Entropy Behavior
- Entropy Increases → high-probability tokens are being treated as negative, or low-probability tokens as positive.
- Entropy Falls Too Fast → rollouts lack diversity; adjusting the rollout temperature or the prompts is usually better than adding entropy regularization.
Bottom line:
> Analyzing changes in rollout data distribution matters more than blindly introducing tricks. Every trick alters the training data distribution — sometimes unnoticed.
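In that spirit, a small monitoring sketch (my own; the shapes are assumptions) that logs rollout entropy as a diagnostic rather than adding it to the loss:

```python
import torch
import torch.nn.functional as F

# Diagnostic sketch: mean per-token entropy of rollout logits with shape (B, T, V).
# Log and watch this value; don't reach for an entropy regularizer by default.
def mean_token_entropy(logits):
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()
```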
---
04 — RL Data Is King
The Only Universal Trick:
- Clean the data
- Train the reward model properly
---
Practical Data Problems in Advanced Tasks
On difficult benchmarks (e.g., AIME, IMO), trainers often no longer fully understand the data themselves.
Flaw in Batch Cleaning Methods:
Batch cleaning cannot distinguish between (see the sketch after the examples):
- Hard problems: the model is often wrong but sometimes right.
- Wrong problems: the model is usually right, but its answers get flagged wrong because of a ground-truth mismatch.
Examples:
- “A ticket costs 2 and you have 9” → the expected wrong answer is blatant nonsense like “356 tickets”; the actual mistake is the far subtler “4.5 tickets”.
- Math problems → the solution is correct but adds extra (also correct) information, and a strict ground-truth match then gives 0 reward.
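A tiny illustration of the flaw described above, assuming (my assumption) that batch cleaning is driven by measured pass rates over rollouts; all names and numbers are hypothetical:

```python
# Hypothetical sketch: a genuinely hard prompt and a mislabeled prompt both show a
# low measured pass rate, so a pass-rate threshold alone cannot separate them.
def low_pass_rate_prompts(records, threshold=0.2):
    """records: iterable of dicts like {"prompt_id": ..., "passes": int, "rollouts": int}."""
    flagged = []
    for r in records:
        if r["passes"] / r["rollouts"] < threshold:
            flagged.append(r["prompt_id"])  # hard problem OR wrong ground truth: indistinguishable here
    return flagged

records = [
    {"prompt_id": "hard_geometry", "passes": 1, "rollouts": 16},     # genuinely hard
    {"prompt_id": "bad_ground_truth", "passes": 0, "rollouts": 16},  # model is right, label is wrong
]
print(low_pass_rate_prompts(records))  # both get flagged; telling them apart still needs a human look
```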
---
Reward Model Issues
Purely rule-based reward models rarely suffice.
A good reward model must:
- Understand output format requirements
- Recognize equivalent transformations
- Parse complex formulas
- Follow instructions strictly (avoid “re-solving” the problem itself)
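As an illustration of the “recognize equivalent transformations” requirement, a hedged sketch using sympy (my choice of tool, not something the post prescribes):

```python
import sympy

# Sketch: an exact string match would reject "1/2" vs "0.5", while a symbolic
# comparison accepts them. Real reward models need far more than this, but it
# shows the gap between rule-based matching and equivalence checking.
def math_equivalent(pred: str, gold: str) -> bool:
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        return False

print(math_equivalent("1/2", "0.5"))        # True
print(math_equivalent("2*(x+1)", "2*x+2"))  # True
print(math_equivalent("3/4", "0.5"))        # False
```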
---
05 — Broader Context: Content Monetization Platforms
Platforms like the AiToEarn official site illustrate data quality and distribution control in another domain: AI content monetization.
AiToEarn:
- Open-source, global AI content monetization infrastructure
- Tools for AI content generation
- Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- Analytics + AI model rankings
Such multi-channel content publishing parallels RL’s need for reliable deployment workflows: outputs must reach diverse audiences and be tracked for performance.
---
Closing Thoughts
LLM RL today differs greatly from classical RL, but studying classics like TRPO remains valuable.
The methodology and rigor in TRPO can inspire more stable and principled approaches.
And to echo the theme again:
> Focus on data integrity and understanding true distributions before adding techniques.
> Tricks are secondary.
---
Acknowledgment:
Thanks to @真中合欢 for ongoing RL guidance. His walkthrough of TRPO was eye-opening.