Scaling Laws and Best Practices for Large Model RL
2025-10-30 · Jiangsu
The paper The Art of Scaling Reinforcement Learning Compute for LLMs from Meta proposes a scaling law for RL applied to LLMs, performs extensive comparisons and ablations, and distills these insights into what may be considered the current best RL recipe.
---


Why This Paper Matters
This work consolidates years of RL-for-LLMs innovations into one clearly written reference, pairing an empirical scaling law with extensive design ablations and distilling them into actionable recipes for practitioners.

---
Abstract Summary
Reinforcement Learning is now a core component in LLM training. While pre-training benefits from well-understood scaling laws, RL has lacked such a guide.
Key contributions:
- Identified the scaling law for LLM RL through 400,000 GPU-hours of experiments.
- Used this scaling law to evaluate various RL designs, culminating in a best-practice RL recipe.
---
RL Scaling Law

Experimental setup:
- 50,000 GPU-hours of training runs on:
  - an 8B dense model
  - a 17B × 16 MoE model
- Measured i.i.d. validation performance over roughly 7,000 training steps.
Findings:
- Performance follows a sigmoid-shaped scaling curve as a function of RL compute.
- Extrapolations from early training match the results of extended training, demonstrating the curve's predictive validity.
- Downstream evaluations (e.g., AIME-24) follow the same curve.
Formula:

> RL Improvement = A × f(C)
> where f is a sigmoid function of the compute investment C, parameterized by the curve constants below.
Curve parameters:
- A (Upper Limit) — max converged performance.
- B (Learning Efficiency) — how steeply performance improves with additional compute.
- C_mid — compute needed to reach half of possible improvement.
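As a concrete illustration, below is a minimal curve-fitting sketch, assuming the sigmoidal parameterization R(C) = R0 + (A - R0) / (1 + (C_mid / C)^B); the exact functional form in the paper may differ in detail, and the data points are made up.

```python
# Minimal sketch: fit a sigmoidal compute-scaling curve and extrapolate.
# Assumed form (not necessarily the paper's exact one):
#   R(C) = R0 + (A - R0) / (1 + (C_mid / C) ** B)
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(C, A, B, C_mid, R0):
    """Expected validation reward after investing compute C (e.g. GPU-hours)."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical (compute, pass-rate) measurements from an RL run.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 5e4])
reward = np.array([0.22, 0.25, 0.31, 0.40, 0.48, 0.54, 0.56])

# Fit A (asymptote), B (efficiency), C_mid (compute at half the gain), R0 (start).
params, _ = curve_fit(
    sigmoid_scaling, compute, reward,
    p0=[0.6, 1.0, 5e3, 0.2],
    bounds=([0.0, 0.0, 1.0, 0.0], [1.0, 5.0, 1e6, 1.0]),
)
A, B, C_mid, R0 = params
print(f"A={A:.3f}  B={B:.2f}  C_mid={C_mid:.0f}  R0={R0:.3f}")

# Extrapolate to a larger compute budget to sanity-check predictive validity.
print("predicted reward at 4e5 GPU-hours:", sigmoid_scaling(4e5, *params))
```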
---
Empirical RL Design Studies
Asynchronous RL Setups
Variants:
- PPO-off-policy-k
  - Used in Qwen3 and ProRL. Rollouts for B prompts are generated with a frozen policy and split into k minibatches of size B/k; the trainer takes one gradient update per minibatch, so later updates rely on rollouts from an increasingly old policy (see the sketch after this list).
- Pipeline-RL-k
  - Used in Magistral. Trainer updates are loaded into the generator immediately, and the generator continues partially generated sequences with the new parameters; k bounds how far the trainer may run ahead of the generator.
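A framework-free sketch of the PPO-off-policy-k scheduling pattern described above; every function here is an illustrative stub rather than the paper's implementation. PipelineRL-k differs in that the generator would pick up fresh weights mid-generation instead of once per round.

```python
# Structural sketch (stubs only) of PPO-off-policy-k scheduling:
# one generation round per B prompts, then k gradient updates on B/k minibatches.

def generate_rollouts(policy_version, prompts):
    # Stand-in for the inference engine; records which policy produced each rollout.
    return [(p, f"rollout-for-{p}", policy_version) for p in prompts]

def gradient_step(policy_version, minibatch):
    # Stand-in for one optimizer update on the trainer side.
    return policy_version + 1

def ppo_off_policy_k(prompts, k, rounds):
    version = 0
    for _ in range(rounds):
        # Generator weights are refreshed only here, so they lag by up to k updates.
        rollouts = generate_rollouts(version, prompts)
        size = len(rollouts) // k
        for i in range(k):  # k trainer updates per generation round
            version = gradient_step(version, rollouts[i * size:(i + 1) * size])
    return version

print(ppo_off_policy_k(prompts=[f"p{i}" for i in range(32)], k=8, rounds=3))
```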
Background:
Generation and training typically run in separate frameworks, so the generator and the trainer each hold their own copy of the model parameters and their own implementation, and the two copies can drift out of sync.

Conclusions:
- Pipeline RL and PPO-off-policy reach the same asymptote A.
- Pipeline RL has higher B (efficiency).
- Optimal k = 8.
---
Loss Function Variants
Common Methods:
- DAPO — token-level clipping; clipped tokens do not contribute gradients.
- GSPO — sequence-level clipping.
- CISPO — builds on vanilla REINFORCE; the importance-sampling weights are clipped and stop-gradiented, so clipped tokens still contribute gradients (see the sketch below).
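A hedged sketch of what a CISPO-style token loss could look like, based only on the description above; the clip bounds, tensor shapes, and final token-mean aggregation are illustrative choices, not the paper's exact settings.

```python
# CISPO-style token loss sketch: REINFORCE weighted by a clipped,
# stop-gradient importance ratio, so clipped tokens still carry gradients.
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """logp_new / logp_old / advantages: [batch, seq_len] token-level tensors."""
    ratio = torch.exp(logp_new - logp_old)                       # importance-sampling weight
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # clip the weight only
    weight = clipped.detach()                                    # stop-gradient on the weight
    # Gradients flow through logp_new for every token, including clipped ones.
    per_token_loss = -weight * advantages * logp_new
    return per_token_loss.mean()                                 # aggregation is illustrative

# Toy usage with random tensors standing in for real log-probabilities.
B, T = 4, 16
logp_new = torch.randn(B, T, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(B, T)
advantages = torch.randn(B, T)

loss = cispo_loss(logp_new, logp_old, advantages)
loss.backward()
print(loss.item(), logp_new.grad.shape)
```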

Conclusions:
- CISPO > GSPO > DAPO in A (performance ceiling).
---
FP32 Precision for Logits
Conclusion:
Computing logits in FP32 in both the generator and the trainer yields significant gains (a fix proposed for MiniMax-M1).
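A small sketch of the idea, assuming a BF16 model whose final logits are upcast to FP32 before the softmax; `lm_head`, the shapes, and the vocabulary size are placeholders.

```python
# Illustration: compute log-probs from FP32 logits even when the model runs in BF16.
import torch

torch.manual_seed(0)
hidden = torch.randn(2, 8, 64, dtype=torch.bfloat16)                # [batch, seq, d_model]
lm_head = torch.nn.Linear(64, 1000, bias=False).to(torch.bfloat16)  # final projection

logits_bf16 = lm_head(hidden)
logits_fp32 = logits_bf16.float()                                   # upcast before softmax

logp_bf16 = torch.log_softmax(logits_bf16, dim=-1)
logp_fp32 = torch.log_softmax(logits_fp32, dim=-1)

# This gap is the generator/trainer probability mismatch that low-precision
# logits introduce; using FP32 logits on both sides shrinks it.
print((logp_bf16.float() - logp_fp32).abs().max().item())
```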
---
Loss Aggregation Strategies

Options:
- Sample Averaging — per trajectory (GRPO).
- Prompt Averaging — per prompt (DAPO).
- Token Averaging — across the batch.
Conclusion:
Prompt averaging performs best; the sketch below contrasts the three options.
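A minimal sketch of the three aggregation options over a per-token loss tensor; the shape [n_prompts, n_rollouts, seq_len], the response mask, and the exact grouping assumed for each named method are my reading of the descriptions above.

```python
# Contrast sample-, prompt-, and token-level loss aggregation on masked token losses.
import torch

torch.manual_seed(0)
per_token_loss = torch.randn(4, 8, 32)           # [prompt, rollout, token]
mask = (torch.rand(4, 8, 32) > 0.3).float()      # 1 for real response tokens

def sample_avg(loss, mask):
    # GRPO-style: average within each trajectory, then over all trajectories.
    per_traj = (loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return per_traj.mean()

def prompt_avg(loss, mask):
    # Prompt-level: token-average over each prompt's rollouts, then over prompts.
    per_prompt = (loss * mask).sum((-1, -2)) / mask.sum((-1, -2)).clamp(min=1)
    return per_prompt.mean()

def token_avg(loss, mask):
    # Flat average over every response token in the batch.
    return (loss * mask).sum() / mask.sum().clamp(min=1)

print(sample_avg(per_token_loss, mask),
      prompt_avg(per_token_loss, mask),
      token_avg(per_token_loss, mask))
```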
---
Advantage Normalization
Methods differ in how the normalizing standard deviation is computed:
1. Std over the rollouts of each prompt (GRPO).
2. Std over the whole batch.
3. No std (Dr. GRPO).
Conclusion:
All perform similarly; option (2), batch-level std, is preferred on theoretical grounds (see the sketch below).
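A small sketch of the three normalization choices for a reward matrix of shape [n_prompts, n_rollouts]; the toy rewards and the epsilon are illustrative.

```python
# Advantage normalization variants: per-prompt std, batch std, or no std.
import torch

rewards = torch.tensor([[1., 0., 1., 1.],
                        [0., 0., 1., 0.]])               # [n_prompts, n_rollouts]
eps = 1e-6
centered = rewards - rewards.mean(dim=1, keepdim=True)   # group-relative advantage

adv_per_prompt = centered / (rewards.std(dim=1, keepdim=True) + eps)  # (1) GRPO
adv_batch_std  = centered / (rewards.std() + eps)                     # (2) batch-level std
adv_no_std     = centered                                             # (3) Dr. GRPO

print(adv_per_prompt, adv_batch_std, adv_no_std, sep="\n")
```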
---
Zero-Variance Filtering

Idea: Remove prompts whose rollouts all receive the same reward; their group-relative advantages are identically zero, so they contribute no learning signal.
Result: Improves performance (a filtering sketch follows below).
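A minimal illustration of the filter, assuming binary per-rollout rewards arranged as [n_prompts, n_rollouts].

```python
# Zero-variance filtering: keep only prompts whose rollouts disagree on the reward.
import torch

rewards = torch.tensor([[1., 1., 1., 1.],   # always correct -> no signal
                        [0., 1., 0., 1.],   # mixed outcomes -> useful
                        [0., 0., 0., 0.]])  # always wrong   -> no signal
keep = rewards.std(dim=1) > 0
print(keep)            # tensor([False,  True, False])
print(rewards[keep])   # only the informative prompt survives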
---
Adaptive Prompt Filtering
- Strategy: Remove prompts whose historical average accuracy exceeds 0.9 (Polaris).
- Result: Increases A (see the sketch below).
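A hedged sketch of one way to implement such a filter with a running per-prompt pass rate; the bookkeeping structure and threshold handling are mine, not Polaris's actual implementation.

```python
# Adaptive prompt filtering: stop sampling prompts that have become too easy.
from collections import defaultdict

pass_counts = defaultdict(lambda: [0, 0])    # prompt_id -> [num_correct, num_attempts]

def record(prompt_id, rollout_rewards):
    correct, attempts = pass_counts[prompt_id]
    pass_counts[prompt_id] = [correct + sum(rollout_rewards),
                              attempts + len(rollout_rewards)]

def active_prompts(prompt_ids, threshold=0.9):
    kept = []
    for pid in prompt_ids:
        correct, attempts = pass_counts[pid]
        if attempts == 0 or correct / attempts <= threshold:  # keep unseen or still-hard prompts
            kept.append(pid)
    return kept

record("q1", [1, 1, 1, 1])                    # solved every time -> filtered out
record("q2", [0, 1, 0, 1])
print(active_prompts(["q1", "q2", "q3"]))     # ['q2', 'q3']
```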
---
Recommended RL Recipe — ScaleRL
Combining these findings, the recommended recipe includes the following components (collected into a config sketch below):
- PipelineRL (k=8)
- Interruption-based length control
- FP32 logits
- CISPO loss
- Prompt-level loss aggregation
- Zero-variance filtering
- No-positive resampling

Ablation confirms each component’s contribution.
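To make the checklist concrete, here is a hypothetical configuration sketch bundling the ScaleRL choices; the field names are mine and do not correspond to any released API.

```python
# Hypothetical ScaleRL configuration bundle (illustrative field names only).
from dataclasses import dataclass

@dataclass
class ScaleRLConfig:
    async_setup: str = "pipeline_rl"        # PipelineRL with bounded staleness k
    max_staleness_k: int = 8
    length_control: str = "interruption"    # interrupt over-long rollouts
    fp32_logits: bool = True                # FP32 logits in generator and trainer
    loss: str = "cispo"
    loss_aggregation: str = "prompt"        # prompt-level averaging
    zero_variance_filtering: bool = True
    no_positive_resampling: bool = True

print(ScaleRLConfig())
```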
---
Factors Affecting RL Scaling
Model Size
The scaling law holds for both the 8B dense and the 17B × 16 MoE models.
Context Length
Longer contexts reduce B but raise A.

Batch Size
Small batches help early performance; large batches raise final A when compute is abundant.

Rollout Number
With the total batch size fixed, trading off the number of rollouts per prompt (n_rollout) against the number of prompts (n_prompt) has no noticeable effect on training.
---
Takeaways

Scaling Law:
- Performance curve is sigmoid in log-compute space.
- A and B depend on model, algorithm, and design choices.
Principles:
- The performance ceiling A is set by algorithmic and design choices; it is not a universal constant.
- Small-scale wins may not scale; estimate A/B before committing huge compute.
- Many design choices alter the efficiency B but not the ceiling A (loss aggregation, curriculum learning, length penalties, advantage normalization).
ScaleRL offers a well-tested, scalable RL recipe for modern LLMs.
---