Scaling Laws and Best Practices for Large Model RL

2025-10-30 · Jiangsu

The paper The Art of Scaling Reinforcement Learning Compute for LLMs from Meta proposes a scaling law for RL applied to LLMs, performs extensive comparisons and ablations, and distills these insights into what may be considered the current best RL recipe.

---


Why This Paper Matters

This work effectively consolidates years of RLHF-related innovations for LLMs into one clearly written reference. It blends theoretical scaling insights with empirical design guidelines, providing actionable recipes for practitioners.


---

Abstract Summary

Reinforcement learning is now a core component of LLM training. While pre-training benefits from well-understood scaling laws, RL has lacked a comparable guide.

Key contributions:

  • Identified the scaling law for LLM RL through 400,000 GPU-hours of experiments.
  • Used this scaling law to evaluate various RL designs, culminating in a best-practice RL recipe.

---

RL Scaling Law


Experimental setup:

  • 50,000 GPU-hours of training runs, covering an 8B dense model and a 17B × 16 MoE model.
  • Measured i.i.d. validation performance over ~7,000 steps.

Findings:

  • Performance fits a sigmoid-shaped scaling curve.
  • Extrapolations from early training align with extended-training results, demonstrating the curve's predictive validity.
  • Downstream evals (AIME-24) follow the same curve.

Formula:

> RL Improvement: ΔR(C) = A × f(C), where f(C) = 1 / (1 + (C_mid / C)^B)

> f is a sigmoid in log-compute: it equals 1/2 at C = C_mid and approaches 1 as C grows, so the improvement saturates at A.

Curve parameters:

  • A (Upper Limit) — the maximum performance the run converges to.
  • B (Learning Efficiency) — how steeply performance rises with compute.
  • C_mid — the compute needed to reach half of the possible improvement.
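
For intuition, here is a minimal Python sketch (not from the paper) that fits this curve to hypothetical (compute, improvement) pairs with scipy and then extrapolates it; the data points and starting values are made up for illustration.

```python
# Minimal sketch (not the paper's code): fit the sigmoidal compute-performance
# curve dR(C) = A / (1 + (C_mid / C)**B) to observed (compute, improvement)
# pairs. Parameter names follow the post; the data below is purely synthetic.
import numpy as np
from scipy.optimize import curve_fit

def rl_scaling_curve(C, A, B, C_mid):
    """Predicted RL improvement at compute C (a sigmoid in log-compute)."""
    return A / (1.0 + (C_mid / C) ** B)

# Fake measurements: compute in GPU-hours, improvement over the base model.
compute = np.array([500, 1_000, 2_000, 5_000, 10_000, 25_000, 50_000], float)
improvement = np.array([0.04, 0.08, 0.14, 0.23, 0.30, 0.36, 0.39])

(A, B, C_mid), _ = curve_fit(
    rl_scaling_curve, compute, improvement, p0=[0.4, 1.0, 5_000.0]
)
print(f"A (ceiling) = {A:.3f}, B (efficiency) = {B:.2f}, C_mid = {C_mid:.0f} GPU-h")
print("Extrapolated improvement at 200k GPU-h:",
      round(rl_scaling_curve(200_000, A, B, C_mid), 3))
```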

---


Empirical RL Design Studies

Asynchronous RL Setups

Variants:

  • PPO-off-policy-k — used in Qwen3 and ProRL. Rollouts for a batch of B prompts are generated with the current (soon stale) policy, the batch is split into k minibatches of size B/k, and a gradient update is taken on each minibatch.
  • PipelineRL-k — used in Magistral. Trainer updates are loaded into the generator as soon as they are ready, and the generator continues partially generated sequences with the new parameters; k bounds how many updates the trainer may run ahead of the generator.

Background:

Generation and training often run in different frameworks, so the generator and the trainer can hold independent copies of the model parameters and separate implementations.
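
As a concrete illustration of the PPO-off-policy-k schedule, here is a toy Python sketch (my own simplification, not the paper's or any framework's code; B, k, and the version counter are illustrative):

```python
# Toy sketch of PPO-off-policy-k: rollouts for B prompts are generated with a
# frozen snapshot of the policy, then the trainer takes k gradient updates on
# B/k-sized minibatches before the generator is refreshed.
import random

B, K = 32, 8          # prompts per generation batch, minibatches per batch
policy_version = 0    # stands in for the actual model parameters

def generate_rollouts(prompts, version):
    # In a real system this is the inference engine; here we just tag each
    # rollout with the policy version that produced it.
    return [{"prompt": p, "gen_version": version} for p in prompts]

for step in range(3):                       # a few outer iterations
    prompts = [f"prompt_{step}_{i}" for i in range(B)]
    rollouts = generate_rollouts(prompts, policy_version)
    random.shuffle(rollouts)
    for mb in range(K):                     # k updates on increasingly stale data
        minibatch = rollouts[mb * (B // K):(mb + 1) * (B // K)]
        staleness = policy_version - minibatch[0]["gen_version"]
        policy_version += 1                 # one gradient update
        print(f"step {step} update {mb}: staleness = {staleness} versions")

# PipelineRL-k instead streams new parameters to the generator mid-sequence,
# only requiring that the trainer stay at most k updates ahead.
```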


Conclusions:

  • PipelineRL and PPO-off-policy reach the same asymptote A.
  • PipelineRL has a higher B (better compute efficiency).
  • Optimal k = 8.

---

Loss Function Variants

Common Methods:

  • DAPO — Token-level clipping; clipped tokens contribute no gradient.
  • GSPO — Sequence-level clipping.
  • CISPO — REINFORCE-style objective with a clipped, stop-gradient importance-sampling weight; clipped tokens still contribute gradients.
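
To make the loss differences concrete, the following PyTorch sketch contrasts a DAPO/PPO-style token-clipped loss with a CISPO-style loss; it is my own illustration with a symmetric clipping range and fake tensors, not the paper's implementation.

```python
# Sketch of the key difference: PPO/DAPO-style clipping silences clipped
# tokens, while CISPO detaches (stop-gradient) the clipped importance weight
# so every token still contributes a REINFORCE gradient.
import torch

def dapo_token_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    # Tokens clipped on the active side of the min get zero gradient.
    return -torch.mean(torch.min(ratio * adv, clipped * adv))

def cispo_token_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)
    # Clip the importance weight, then detach it: it scales the log-prob
    # gradient but is excluded from backprop, so no token is dropped.
    weight = torch.clamp(ratio, 1 - eps, 1 + eps).detach()
    return -torch.mean(weight * adv * logp_new)

# Tiny usage example with fake per-token log-probs and advantages.
logp_new = torch.randn(16, requires_grad=True)
logp_old = logp_new.detach() + 0.3 * torch.randn(16)
adv = torch.randn(16)
print(dapo_token_loss(logp_new, logp_old, adv).item())
print(cispo_token_loss(logp_new, logp_old, adv).item())
```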

Conclusions:

  • CISPO > GSPO > DAPO in A (performance ceiling).

---

FP32 Precision for Logits

Conclusion:

Computing logits in FP32 in both the generator and the trainer yields significant gains (a fix proposed in MiniMax-M1).
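
A sketch of what this can look like in practice (illustrative PyTorch, not tied to any specific trainer): run the LM head and log-softmax in FP32 even when the rest of the forward pass is BF16, reducing the numerical mismatch between generator and trainer.

```python
# Illustrative FP32 log-prob computation on top of a BF16 forward pass.
import torch

def fp32_token_logprobs(hidden_states, lm_head_weight, token_ids):
    # hidden_states: [batch, seq, dim] in bf16; lm_head_weight: [vocab, dim]
    logits = torch.nn.functional.linear(
        hidden_states.float(), lm_head_weight.float()
    )                                          # FP32 logits
    logprobs = torch.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

# Usage with dummy tensors.
h = torch.randn(2, 5, 64, dtype=torch.bfloat16)
W = torch.randn(1000, 64, dtype=torch.bfloat16)
ids = torch.randint(0, 1000, (2, 5))
print(fp32_token_logprobs(h, W, ids).shape)    # torch.Size([2, 5])
```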

---

Loss Aggregation Strategies


Options:

  • Sample Averaging — each trajectory weighted equally (GRPO).
  • Prompt Averaging — each prompt weighted equally (DAPO).
  • Token Averaging — each token in the batch weighted equally.

Conclusion:

Prompt averaging performs best.
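
The three options differ only in how per-token losses are weighted. A small illustrative sketch (the nested batch structure and names are hypothetical, not from the paper):

```python
# Aggregating per-token losses grouped as prompts -> rollouts -> tokens.
import torch

def aggregate(batch, mode):
    if mode == "token":       # every token in the batch weighted equally
        tokens = torch.cat([t for prompt in batch for t in prompt])
        return tokens.mean()
    if mode == "sample":      # every rollout (trajectory) weighted equally
        samples = [t.mean() for prompt in batch for t in prompt]
        return torch.stack(samples).mean()
    if mode == "prompt":      # every prompt weighted equally
        prompts = [torch.stack([t.mean() for t in prompt]).mean()
                   for prompt in batch]
        return torch.stack(prompts).mean()
    raise ValueError(mode)

# Two prompts, each with rollouts of different lengths.
batch = [[torch.randn(7), torch.randn(3)], [torch.randn(12), torch.randn(5)]]
for mode in ("token", "sample", "prompt"):
    print(mode, aggregate(batch, mode).item())
```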

---

Advantage Normalization

Methods differ in how the standard deviation (Std) used to normalize advantages is computed:

  • Std over rollouts per prompt (GRPO).
  • Std over batch.
  • No Std (Dr.GRPO).

Conclusion:

All three perform similarly; batch-level Std (the second option) is preferred on theoretical grounds.
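
A compact sketch of the three Std choices for group-relative advantages (illustrative; `rewards` is a hypothetical [n_prompts, n_rollouts] tensor):

```python
# Group-relative advantages with different standard-deviation choices.
import torch

def advantages(rewards, std_mode):
    centered = rewards - rewards.mean(dim=1, keepdim=True)  # per-prompt mean baseline
    if std_mode == "prompt":   # GRPO: Std over the rollouts of each prompt
        return centered / (rewards.std(dim=1, keepdim=True) + 1e-6)
    if std_mode == "batch":    # Std over all rewards in the batch
        return centered / (rewards.std() + 1e-6)
    if std_mode == "none":     # Dr.GRPO: mean-centering only
        return centered
    raise ValueError(std_mode)

rewards = torch.randint(0, 2, (4, 8)).float()   # binary rewards, 4 prompts x 8 rollouts
for mode in ("prompt", "batch", "none"):
    print(mode, advantages(rewards, mode)[0])
```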

---

Zero-Variance Filtering


Idea: Remove prompts whose rollouts all receive identical rewards; with a group-relative baseline their advantages are all zero, so they contribute no learning signal.

Result: Improves performance.
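
A minimal sketch of the filter (illustrative; assumes binary rewards arranged as [n_prompts, n_rollouts]):

```python
# Drop prompts whose rollouts all received the same reward.
import torch

def zero_variance_filter(rewards):
    """rewards: [n_prompts, n_rollouts] -> boolean mask of prompts to keep."""
    return rewards.std(dim=1) > 0

rewards = torch.tensor([[1., 1., 1., 1.],    # all correct  -> dropped
                        [0., 0., 0., 0.],    # all wrong    -> dropped
                        [1., 0., 1., 0.]])   # mixed        -> kept
print(zero_variance_filter(rewards))         # tensor([False, False,  True])
```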

---

Adaptive Prompt Filtering

  • Strategy: Remove prompts with >0.9 average accuracy (Polaris).
  • Result: Increases A.
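
A possible implementation sketch (the `PromptFilter` class and its bookkeeping are hypothetical, not from Polaris or the paper): track each prompt's running pass rate and stop sampling prompts that exceed the threshold.

```python
# Drop prompts whose historical pass rate exceeds a threshold (0.9 here,
# following the post), so later epochs focus on unmastered prompts.
from collections import defaultdict

class PromptFilter:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.stats = defaultdict(lambda: [0, 0])   # prompt -> [passes, tries]

    def update(self, prompt_id, rollout_rewards):
        passes, tries = self.stats[prompt_id]
        self.stats[prompt_id] = [passes + sum(rollout_rewards),
                                 tries + len(rollout_rewards)]

    def keep(self, prompt_id):
        passes, tries = self.stats[prompt_id]
        return tries == 0 or passes / tries <= self.threshold

f = PromptFilter()
f.update("easy", [1, 1, 1, 1])
f.update("hard", [0, 1, 0, 0])
print(f.keep("easy"), f.keep("hard"))   # False True
```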

---

The ScaleRL Recipe

Combining these findings, the recommended recipe (ScaleRL) includes:

  • PipelineRL (k=8)
  • Interruption-based length control
  • FP32 logits
  • CISPO loss
  • Prompt-level loss aggregation
  • Zero-variance filtering
  • No-positive resampling

Ablation confirms each component’s contribution.

---

Factors Affecting RL Scaling

Model Size

The scaling law holds for both the 8B dense model and the 17B × 16 MoE model.

Context Length

Longer generation contexts reduce learning efficiency B but raise the ceiling A.


Batch Size

Small batches help early performance; large batches raise final A when compute is abundant.


Rollout Number

With the total batch size held fixed, how it is split between n_rollout and n_prompt has no effect on training.

---

Takeaways


Scaling Law:

  • Performance curve is sigmoid in log-compute space.
  • A and B depend on model, algorithm, and design choices.

Principles:

  • Algorithm and design choices set RL's upper bound A; it is not a universal constant.
  • Small-scale wins may not persist at scale; estimate A and B from early runs before committing large compute.
  • Many design choices alter B but not A (loss aggregation, curriculum learning, length penalties, advantage normalization).

ScaleRL offers a tested, best-practice RL recipe for modern LLMs.


