Meta's Departed Experts Keep Publishing, One Paper After Another
Former Meta Researcher Tian Yuandong Publishes New RLVR Findings
Former top researchers from Meta continue to produce influential work after leaving the company.
This time, Tian Yuandong and his team tackle a puzzling question in large-scale reinforcement learning (RL) for language models:
> Why does RL training yield significant performance gains while changing only a tiny fraction of the model’s parameters?

---
Key Insights
The study focuses on Reinforcement Learning with Verifiable Rewards (RLVR) and challenges the common belief that sparse parameter updates are the main story.
Instead, the team finds that sparsity is a symptom of a deeper optimization bias that is fixed by the base model itself.
- Consistent Targeting: For the same base pretrained model, RLVR repeatedly modifies the same small subset of parameters, regardless of dataset or RL algorithm.
- Novel Framework: Introduction of the Three-Gate Theory, explaining why RLVR updates are localized.

---
The High-Gain, Low-Change Paradox
Models like OpenAI-o3 and DeepSeek-R1 improve dramatically in math and coding after RLVR training.
Yet:
- RL updates are sparse.
- Supervised fine-tuning (SFT) updates are dense.
Tian's team investigated this paradox, examining open-source models (the Qwen series and DeepSeek-R1-Distill-Qwen) trained for more than 3,000 RL steps on tasks ranging from mathematics to logic puzzles.
Findings from a custom bfloat16-precision-aware probe, where sparsity means the fraction of parameters left unchanged (a sketch of such a probe follows):
- SFT sparsity: 0.6%–18.8%
- RL sparsity: 36%–92% (≈10× higher)
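
As a rough illustration of what such a probe does, the sketch below (hypothetical code, not the paper's implementation) compares two PyTorch state dicts at bfloat16 precision and reports the fraction of parameters left unchanged:

```python
import torch

def update_sparsity(state_before: dict, state_after: dict) -> float:
    """Fraction of parameters left unchanged after training, measured
    at bfloat16 precision (a hypothetical probe, not the paper's code)."""
    unchanged, total = 0, 0
    for name, w0 in state_before.items():
        w1 = state_after[name]
        # Cast both checkpoints to bfloat16 so sub-ULP drift
        # does not count as a change.
        same = w0.to(torch.bfloat16) == w1.to(torch.bfloat16)
        unchanged += same.sum().item()
        total += same.numel()
    return unchanged / total

# Usage: a value close to 1.0 means almost no weights moved.
# before = torch.load("base.pt"); after = torch.load("rl.pt")
# print(f"sparsity = {update_sparsity(before, after):.1%}")
```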

---
The Three-Gate Theory
The research identifies three mechanisms that explain RLVR's behavior.
Gate 1: KL Anchor
- Prevents large stylistic drifts in outputs.
- Enforced by policy KL limits or implicit KL bounds (ratio clipping).
- Each training step stays close to the current policy, limiting how far parameters can move (a sketch of the clipping mechanism follows).
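
For concreteness, the "implicit KL bound" of ratio clipping is the familiar PPO-style surrogate; a generic sketch, not the paper's training code:

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style ratio clipping: an implicit KL anchor. Updates that
    push the policy ratio outside [1-eps, 1+eps] get zero gradient,
    keeping each step near the old policy."""
    ratio = torch.exp(logp_new - logp_old)   # pi_new / pi_old per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic minimum: the objective never rewards large ratio moves.
    return -torch.min(unclipped, clipped).mean()
```

Because the clamp zeroes the gradient once the policy ratio leaves [1 − ε, 1 + ε], every step is anchored near the previous policy, which in turn bounds how far the weights can drift.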

---
Gate 2: Model Geometry
- Pretrained models have structured geometry:
  - High-curvature regions: affect reasoning strongly but risk instability if disturbed.
- RL updates prioritize low-curvature directions to preserve reasoning ability.
- SFT often targets high-curvature areas for accuracy, risking structural damage (a curvature sketch follows this list).
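
One way to make "curvature along a direction" concrete is a Hessian-vector product: the quadratic form vᵀHv is large along high-curvature directions and small along the flat ones RLVR appears to favor. A generic PyTorch sketch (the function name and setup are illustrative, not from the paper):

```python
import torch

def directional_curvature(loss, params, direction):
    """Estimate v^T H v of `loss` w.r.t. `params` along `direction`
    using a double-backward Hessian-vector product."""
    # First-order grads, keeping the graph so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Inner product <grad, v>; differentiating it once more yields H v.
    gv = sum((g * v).sum() for g, v in zip(grads, direction))
    hv = torch.autograd.grad(gv, params)
    return sum((h * v).sum() for h, v in zip(hv, direction)).item()
```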

---
Gate 3: Precision Filtering
- Training in bfloat16 hides small updates because of its limited precision (7-bit mantissa).
- Changes below the ULP (unit in the last place) threshold cannot be represented, and so appear as sparsity.
- Switching to float32 reveals more of the updates (a minimal demonstration follows).
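
The effect is easy to reproduce. With a 7-bit mantissa, a bfloat16 weight near 1.0 has a ULP of about 2⁻⁷ ≈ 0.008, so any smaller update rounds away entirely:

```python
import torch

w = torch.tensor(1.0, dtype=torch.bfloat16)
delta = 1e-3                  # a realistic per-step weight update

# Applied in bfloat16, the update rounds back to 1.0 and vanishes.
print((w + delta) == w)       # tensor(True)

# The same update survives in float32.
w32 = w.to(torch.float32)
print((w32 + delta) == w32)   # tensor(False)
```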
---
Experimental Validation
- SVD Analysis: RL updates avoid the base weights' top principal components and overlap more with low-magnitude weights (see the overlap sketch after this list).
- Layer Disruption Tests: Rotating/swapping heads in Qwen3-4B-Base lowers update overlap to random levels in intervened layers.
- Spectral Stability: RLVR maintains stable top principal component spectra; SFT shows larger rotations and drifts.
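
A minimal version of the SVD overlap diagnostic might look like the following (illustrative, assuming you have a layer's base weight `W` and its RL update `dW`; this is a simple energy ratio, not necessarily the paper's exact metric):

```python
import torch

def principal_overlap(W: torch.Tensor, dW: torch.Tensor, k: int = 32) -> float:
    """Fraction of the update's energy lying in the span of the base
    weight's top-k singular directions."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    Uk, Vk = U[:, :k], Vh[:k, :].T
    # Project dW onto the top-k left/right singular subspaces of W.
    proj = Uk @ (Uk.T @ dW.float() @ Vk) @ Vk.T
    return (proj.norm() / dW.float().norm()).item()
```

A low overlap for RL updates, versus a higher one for SFT, would match the avoidance pattern reported above.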



---
Implications for Fine-Tuning Methods
Limitations:
- Many PEFT methods designed in the SFT era (which impose low-rank or sparse priors aligned with principal directions) transfer poorly to RLVR.
Findings:
- Sparse fine-tuning along SFT-favored principal components produces the worst training trajectories and the slowest-rising KL curves.
- Targeting non-principal, low-magnitude weights aligns with RLVR's preferred update path.
LoRA Variants:
- PiSSA (principal-component-oriented) offers no gains over standard LoRA.
- With high learning rates, PiSSA becomes unstable and collapses early because its updates move along high-curvature principal directions (the two initializations are contrasted below).
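
For reference, here is the core difference between the two initializations in condensed form (PiSSA's real implementation also freezes the residual W − AB; treat this as a sketch):

```python
import torch

def lora_init(W, r):
    """Standard LoRA: B starts at zero, so training begins exactly at W
    and explores whatever directions the RL gradient finds."""
    A = torch.randn(W.shape[0], r) * 0.01
    B = torch.zeros(r, W.shape[1])
    return A, B

def pissa_init(W, r):
    """PiSSA: the adapter is seeded with W's top-r singular directions,
    so updates move along high-curvature principal components from step one."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    A = U[:, :r] * S[:r].sqrt()                 # shape: m x r
    B = S[:r].sqrt().unsqueeze(1) * Vh[:r, :]   # shape: r x n
    return A, B
```

Under the Three-Gate view, seeding the adapter with principal directions places it in exactly the subspace RLVR tries to avoid, which is consistent with the instability observed at high learning rates.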

---
Practical Impact
Understanding the Three-Gate Theory could:
- Improve RL training efficiency.
- Allow targeted optimization without harmful interference.
- Enable synergy between RLVR and complementary fine-tuning methods.
---
Related Tools: AiToEarn
Researchers aiming to share RLVR innovations globally may benefit from the official AiToEarn site, an open-source global AI content monetization platform.
Features:
- AI content generation → multi-platform publishing.
- Supported platforms: Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter).
- Integrated analytics & model rankings.
- Efficient monetization of AI creativity.
---
Paper link: arXiv:2511.08567