Why Isn't RL as Stable as SFT? Plus Various RL Tricks
Large Model Intelligence — Discussion
This article is a casual exploration of thoughts on large language model (LLM) reinforcement learning (RL). The inspiration comes from countless exchanges about RL training with my good friend @真中合欢, who often warns me against casually “stuffing in techniques.” He convinced me, and now I’m seeing whether I can persuade others.
---
01 — SFT and RL
Common Perceptions About SFT
Today, many people hold a prejudice against supervised fine-tuning (SFT), believing it is just “data washing” with no technical depth.
This view is only partially correct, and only if you work on infra: RL frameworks do demand more advanced coding skills than SFT frameworks.
For algorithm and data teams, RL work often involves recombining known RL tricks for experiments. I argue this isn’t inherently more technically challenging than SFT.
Most importantly:
> Algorithm work should not treat SFT and RL as disconnected.
> The exploration space during RL comes from SFT. A strong SFT cold start is essential for RL to deliver meaningful gains.
---
Revisiting Loss Functions
A quick review of RL loss versus SFT loss:
- Transformer Output
  - Each step produces a logits vector.
  - Softmax yields the probability distribution over tokens.
  - Denote the logits as \( z \).
- RL Step Behavior
  - Positive advantage → push the selected token’s logit higher.
  - Negative advantage → push the selected token’s logit lower.
- SFT Loss — Cross-Entropy
  - For a target distribution \( p \) (often one-hot) and prediction \( q \):
    \[
    \text{CE}(p, q) = - \sum_{i} p_i \log q_i
    \]
Key Insight
SFT loss and RL loss are formally identical.
SFT is the special case of RL in which the advantage is a constant 1 for every sample (see the sketch below).
> Correction from 俞扬:
> In RL, advantage \( A(s,a) \) changes with state–action pairs and must be estimated via sampling — introducing inevitable error and instability. In SFT, advantage truly is a constant 1.
> Loss form similarity is a lead-in: the bigger difference is training stability due to differing data distributions.
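To make the formal point concrete, here is a minimal sketch (shapes and function names are my own assumptions, not from the original post) showing that a token-level policy-gradient loss with a constant advantage of 1 reduces to plain cross-entropy:

```python
import torch
import torch.nn.functional as F

# Minimal sketch (shapes assumed): the REINFORCE-style token loss -A * log pi(token),
# averaged over tokens, equals plain cross-entropy when A = 1 everywhere.
def pg_token_loss(logits, target_ids, advantage):
    logprobs = F.log_softmax(logits, dim=-1)                            # (B, T, V)
    chosen = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return -(advantage * chosen).mean()

B, T, V = 2, 5, 100
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

sft_loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
rl_loss = pg_token_loss(logits, targets, advantage=torch.ones(B, T))
assert torch.allclose(sft_loss, rl_loss, atol=1e-6)
```

With a sampled, non-constant advantage, the same loss becomes the noisier RL objective the correction above describes.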
---
02 — Why RL Is Less Stable Than SFT
If the loss forms are equivalent, why do we see:
- SFT → stable for months with the same hyperparameters
- RL → prone to collapse within a single run
Speculation
- Infra Complexity
  - RL’s infra side demands more care; bugs lurk everywhere (the first pitfall is sketched after this list):
    - Did you account for `temperature`, `top_p`, and `top_k` when computing logprobs?
    - When the reward changes from binary to fractional, are the sample-handling rules still correct?
    - Is model rollout duplication handled correctly?
  - Frameworks like Megatron, vllm, and sglang hide subtler bugs.
- Data Cleanliness Gap
  - SFT uses multi-step filtering (rules + models + manual review).
  - RL often simply adds a reward model; with complex prompts and complex ground truths, even 90% judgment accuracy counts as good.
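As a concrete instance of the logprob pitfall above, a toy sketch (the function and values are hypothetical, not taken from any specific framework):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: if rollout sampling used temperature T != 1, the stored
# logprob for a sampled token must come from softmax(logits / T). Recomputing it
# from softmax(logits) silently biases the importance ratio exp(lp_new - lp_old).
def sampled_logprob(logits, token_id, temperature=1.0):
    return F.log_softmax(logits / temperature, dim=-1)[token_id]

logits = torch.tensor([2.0, 1.0, 0.1])
token_id = 1
lp_rollout = sampled_logprob(logits, token_id, temperature=0.7)  # what the sampler actually used
lp_naive = sampled_logprob(logits, token_id, temperature=1.0)    # what a careless trainer recomputes
print(float(lp_rollout), float(lp_naive))                        # they differ, so downstream ratios are wrong
```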
---
Radical View: RL Is “Rowing Upstream”
- Every RL sample carries some toxicity; if the model fails to extract the valuable signal, it drifts toward collapse.
- Positive samples → risk overfitting by further reinforcing already-high-probability tokens.
- Negative samples → disrupt the probability distribution, shifting mass onto all other tokens and pushing the model toward unknown states (a toy sketch follows below).
Open Question
> What exactly do negative samples teach?
> When a token’s probability is redistributed, what knowledge is actually gained?
> How can negative samples be used correctly?
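A toy sketch of that redistribution, under my own simplifying assumptions (a single token position and one vanilla gradient step on \( -A \log \pi \)):

```python
import torch
import torch.nn.functional as F

# Toy sketch: one gradient step with a negative advantage lowers the chosen token's
# probability and spreads the freed mass across all other tokens, roughly in
# proportion to their current probabilities; it never says which of them is right.
logits = torch.tensor([3.0, 1.0, 0.5, 0.1], requires_grad=True)
chosen, advantage, lr = 0, -1.0, 0.5

loss = -advantage * F.log_softmax(logits, dim=-1)[chosen]
loss.backward()
with torch.no_grad():
    new_logits = logits - lr * logits.grad

print(F.softmax(logits, dim=-1))      # before: the chosen token dominates
print(F.softmax(new_logits, dim=-1))  # after: its mass is pushed onto every other token
```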
---
03 — RL Tricks to Stabilize Training
Because RL collapses easily, these techniques are common:
- Entropy Loss — whether it actually stabilizes training is debated
- CLIP — strong in theory, mixed results in practice (a minimal sketch appears below)
- Token Masking — special logic for certain token sets
- Reward Shaping
  - Control the proportion of binary-reward samples
  - Use pass@K as the optimization target
  - Reward via test-case pass rate
  - Apply length penalties
- Train / Inference Consistency — hot topic, e.g., `tis`, `icepop`
> Personally, I dislike adding too many tricks (entropy_loss, kl_loss) without knowing their precise impact — often they just prevent collapse superficially.
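For reference, a minimal sketch of the clipped surrogate behind the CLIP bullet above (token-level; the shapes and the value of `eps` are my assumptions):

```python
import torch

# Minimal sketch of a PPO-style clipped surrogate over per-token logprobs.
def clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                           # importance ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                     # pessimistic (clipped) objective
```

The clip keeps the ratio near 1, but, as the note above warns, it also quietly changes which samples effectively contribute to training.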
---
Understanding Entropy Behavior
- Entropy Increases → high-probability tokens are being treated as negative, or low-probability tokens as positive.
- Entropy Falls Too Fast → rollouts lack diversity; adjusting the rollout temperature or the prompts is usually better than adding entropy regularization.
Bottom line:
> Analyzing changes in rollout data distribution matters more than blindly introducing tricks. Every trick alters the training data distribution — sometimes unnoticed.
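In that spirit, a small monitoring sketch (my own; the shapes are assumptions) that logs rollout entropy as a diagnostic rather than adding it to the loss:

```python
import torch
import torch.nn.functional as F

# Diagnostic sketch: mean per-token entropy of rollout logits with shape (B, T, V).
# Log and watch this value; don't reach for an entropy regularizer by default.
def mean_token_entropy(logits):
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()
```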
---
04 — RL Data Is King
The Only Universal Trick:
- Clean the data
- Train the reward model properly
---
Practical Data Problems in Advanced Tasks
On difficult benchmarks (e.g., AIME, IMO), trainers often no longer fully understand the data themselves.
Flaw in Batch Cleaning Methods:
Batch cleaning cannot distinguish between (see the sketch after the examples):
- Hard problems: the model is often wrong but sometimes right.
- Wrong problems: the model is usually right, but its answers get flagged wrong because of a ground-truth mismatch.
Examples:
- “A ticket costs 2 and you have 9” → the expected wrong answer is blatant nonsense like “356 tickets”; the actual mistake is the far subtler “4.5 tickets”.
- Math problems → the solution is correct but adds extra (also correct) information, and a strict ground-truth match then gives 0 reward.
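A tiny illustration of the flaw described above, assuming (my assumption) that batch cleaning is driven by measured pass rates over rollouts; all names and numbers are hypothetical:

```python
# Hypothetical sketch: a genuinely hard prompt and a mislabeled prompt both show a
# low measured pass rate, so a pass-rate threshold alone cannot separate them.
def low_pass_rate_prompts(records, threshold=0.2):
    """records: iterable of dicts like {"prompt_id": ..., "passes": int, "rollouts": int}."""
    flagged = []
    for r in records:
        if r["passes"] / r["rollouts"] < threshold:
            flagged.append(r["prompt_id"])  # hard problem OR wrong ground truth: indistinguishable here
    return flagged

records = [
    {"prompt_id": "hard_geometry", "passes": 1, "rollouts": 16},     # genuinely hard
    {"prompt_id": "bad_ground_truth", "passes": 0, "rollouts": 16},  # model is right, label is wrong
]
print(low_pass_rate_prompts(records))  # both get flagged; telling them apart still needs a human look
```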
---
Reward Model Issues
Purely rule-based reward models rarely suffice.
A good reward model must:
- Understand output format requirements
- Recognize equivalent transformations
- Parse complex formulas
- Follow instructions strictly (avoid “re-solving” the problem itself)
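As an illustration of the “recognize equivalent transformations” requirement, a hedged sketch using sympy (my choice of tool, not something the post prescribes):

```python
import sympy

# Sketch: an exact string match would reject "1/2" vs "0.5", while a symbolic
# comparison accepts them. Real reward models need far more than this, but it
# shows the gap between rule-based matching and equivalence checking.
def math_equivalent(pred: str, gold: str) -> bool:
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(gold)) == 0
    except (sympy.SympifyError, TypeError):
        return False

print(math_equivalent("1/2", "0.5"))        # True
print(math_equivalent("2*(x+1)", "2*x+2"))  # True
print(math_equivalent("3/4", "0.5"))        # False
```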
---
05 — Broader Context: Content Monetization Platforms
Platforms like the AiToEarn official site illustrate data quality and distribution control in another domain: AI content monetization.
AiToEarn:
- Open-source, global AI content monetization infrastructure
- Tools for AI content generation
- Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- Analytics + AI model rankings
Such multi-channel content publishing parallels RL’s need for reliable deployment workflows: outputs must reach diverse audiences and be tracked for performance.
---
Closing Thoughts
LLM RL today differs greatly from classical RL, but studying classics like TRPO remains valuable.
The methodology and rigor in TRPO can inspire more stable and principled approaches.
And to echo the theme again:
> Focus on data integrity and understanding true distributions before adding techniques.
> Tricks are secondary.
---
Acknowledgment:
Thanks to @真中合欢 for ongoing RL guidance. His walkthrough of TRPO was eye-opening.