HKUST Proposes New Algorithm to Revolutionize LLM Reasoning: Random Policy Evaluation Emerges as a Breakthrough in Mathematical Reasoning
2025-10-31 · Beijing
“Simplify, Don’t Complicate” — The Real Key to Advancing Performance


Authors & Affiliations
- He Haoran — PhD student at The Hong Kong University of Science and Technology (HKUST), specializing in reinforcement learning and foundation models.
- Ye Yuxiao — First-year PhD student at HKUST (Co-first author).
- Pan Ling — Assistant Professor, Department of Electronic and Computer Engineering & Department of Computer Science and Engineering, HKUST (Corresponding author).
---
Background: Reinforcement Learning in LLM Reasoning
In large language models (LLMs), Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key strategy for improving mathematical reasoning.
However, mainstream approaches—such as PPO and GRPO—are based on policy gradient updates within a policy iteration framework:
- Policy evaluation: Assessing current policy performance.
- Policy improvement: Iterative refinement via optimization.
Problem: These methods often suffer from unstable training, loss of diversity, and heavy tuning requirements.
---
A Radical Proposal: ROVER
A collaborative team from HKUST, Step (Jieyue), and Kwai presents ROVER (Random Policy Valuation for Diverse Reasoning), an ultra-minimalist method that skips the policy improvement loop entirely.
- Core idea: Simply evaluating a uniformly random policy can be enough to identify an optimal reasoning path.
- Impact:
  - Outperforms existing methods on multiple benchmarks.
  - Maintains both high solution quality and high diversity.
  - Needs no extra value network or reference model, making training lighter and more efficient.
Resources:

---
The Pain Points of Traditional RL
Mainstream methods (PPO, GRPO) operate under Generalized Policy Iteration (GPI):
Iterative Steps
- Policy evaluation: Compute advantage function or value estimates.
- Policy improvement: Update the policy via gradient steps that favor high-advantage responses.
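A self-contained toy version of this loop (the two-armed bandit setup, group size, and learning rate are illustrative assumptions, not the actual RLVR recipe):

```python
import math
import random

# Toy generalized-policy-iteration loop in the PPO/GRPO spirit:
# alternate policy evaluation (sample and score a group) with
# policy improvement (a gradient step on the policy parameters).
actions = ["solution_A", "solution_B"]           # stand-ins for sampled responses
logits = {a: 0.0 for a in actions}               # "policy parameters"
reward = {"solution_A": 1.0, "solution_B": 0.0}  # verifiable 0/1 reward

def softmax(logits):
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

for step in range(200):
    probs = softmax(logits)
    # Policy evaluation: sample a group, score it, form group-relative advantages.
    group = random.choices(actions, weights=[probs[a] for a in actions], k=8)
    rewards = [reward[a] for a in group]
    baseline = sum(rewards) / len(rewards)
    # Policy improvement: REINFORCE-style update with the group-mean baseline.
    for sampled, r in zip(group, rewards):
        adv = r - baseline
        for a in actions:
            grad = (1.0 if a == sampled else 0.0) - probs[a]
            logits[a] += 0.05 * adv * grad

print(softmax(logits))  # probability mass shifts toward the rewarded response
```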
Key Problems
- Poor stability: Non-stationary targets cause training collapses, and patches such as KL regularization or entropy monitoring add complexity and are themselves brittle.
- Heavy infrastructure: Extra value networks or reference models increase compute cost.
- Reduced diversity: Models collapse onto a narrow set of solution patterns, hurting exploration and pass@k performance.
---
ROVER’s Radical Simplicity
Theoretical Insight
LLM reasoning tasks can often be modeled as finite-horizon, tree-structured MDPs with:
- Deterministic transitions.
- Single parent per state.
- Binary sparse rewards (Correct / Incorrect).
Finding:
In such MDPs, Q-values from a uniform random policy naturally indicate the optimal policy.
Proof intuition:
Rewards are binary and nonzero only at correct terminal states, and every state is reached by a unique path, so an action's Q-value under the uniform random policy is positive if and only if its subtree contains at least one correct solution.
Greedily selecting a positive-Q action at every step therefore stays on a path that ends in a correct answer, i.e. it recovers an optimal policy.
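A toy illustration of this property (the small tree below is invented for illustration, not taken from the paper): an action's uniform-policy Q-value is positive exactly when its subtree contains a correct terminal state, so greedy decoding over those values reaches a correct answer.

```python
# Toy tree-structured MDP: deterministic transitions, one parent per state,
# and a binary reward (1 = correct, 0 = incorrect) only at terminal nodes.
tree = {
    "root": ["s1", "s2"],
    "s1":   ["wrong_a", "wrong_b"],   # subtree with no correct answer
    "s2":   ["wrong_c", "correct"],   # subtree containing a correct answer
}
reward = {"wrong_a": 0, "wrong_b": 0, "wrong_c": 0, "correct": 1}

def q_uniform(action: str) -> float:
    """Q-value of moving to `action` under a uniform random continuation."""
    if action in reward:                       # terminal node: return its reward
        return reward[action]
    children = tree[action]
    return sum(q_uniform(c) for c in children) / len(children)

# Greedy selection on the uniform-policy Q-values recovers a correct path.
state, path = "root", []
while state in tree:
    state = max(tree[state], key=q_uniform)
    path.append(state)
print(path)  # ['s2', 'correct'] -- only the branch with Q > 0 is chosen
```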
---
ROVER Workflow: Three Minimal Steps

1. Q-value Estimation
Use a generalized Bellman equation under a uniform random policy:
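A minimal form of that backup (written here assuming sparse terminal rewards and a uniform choice among the available next actions; the paper's generalized operator may differ in detail):

$$
Q(s_t, a_t) \;=\; r(s_t, a_t) \;+\; \frac{1}{|\mathcal{A}(s_{t+1})|} \sum_{a' \in \mathcal{A}(s_{t+1})} Q(s_{t+1}, a')
$$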

2. Policy Construction
- Greedy choice is optimal but can reduce diversity.
- ROVER applies softmax sampling over Q-values:
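In its standard form (the notation below, with temperature $\tau$, follows common practice rather than being quoted from the paper):

$$
\pi(a \mid s) \;=\; \frac{\exp\big(Q(s, a)/\tau\big)}{\sum_{a'} \exp\big(Q(s, a')/\tau\big)}
$$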

- Temperature parameter tunes exploration.
3. Training Objectives
- No separate value network: the value function is embedded directly in the LLM's own parameters.
- Group reward centralization: Reduces variance for stable Q-value learning.
Loss Function:
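The exact objective is given in the paper; the sketch below only illustrates the group-centralization idea (the MSE form, tensor shapes, and where the Q estimates come from are assumptions made for illustration):

```python
import torch

def rover_style_loss(q_estimates: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative loss: regress per-response Q estimates toward
    group-centered rewards. Not the paper's exact objective.

    q_estimates: (group_size,) Q-value estimates for each sampled response.
    rewards:     (group_size,) verifiable 0/1 rewards for the same responses.
    """
    centered = rewards - rewards.mean()            # group reward centralization
    return ((q_estimates - centered) ** 2).mean()  # MSE regression target

# Example: a group of 4 sampled solutions, one of them correct.
q = torch.tensor([0.2, -0.1, 0.4, 0.0], requires_grad=True)
r = torch.tensor([0.0, 0.0, 1.0, 0.0])
rover_style_loss(q, r).backward()
```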

---
Experimental Results
Benchmarks
ROVER was evaluated on AIME24/25, HMMT25, AMC, MATH, Countdown, and GPQA-diamond.
Highlights:
- AIME24 pass@1: 30.6 (+19.1 points over the baseline).
- HMMT25 pass@1: 14.6 (a 106% relative gain over the best baseline).

---
Sustained Exploration
Unlike PPO/GRPO, which plateau quickly as k grows, ROVER keeps improving pass@k even at pass@256.
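For reference, pass@k is usually reported with the standard unbiased estimator over n sampled generations of which c are correct (this is the common metric definition, not ROVER-specific code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 generations, 12 of them correct.
print(pass_at_k(n=256, c=12, k=1))    # ~0.047
print(pass_at_k(n=256, c=12, k=256))  # 1.0
```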

---
Diversity Gains
- +17.6% average strategy diversity over baseline.
- Excels in cosine distance & utility measures.
- Improved out-of-distribution (OOD) performance on GPQA-diamond.
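As one illustration of what a cosine-distance diversity measure can look like (the embedding source and averaging scheme below are assumptions, not the paper's exact protocol):

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Average pairwise cosine distance between solution embeddings
    (shape: num_solutions x dim). Higher values mean more diverse strategies."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                      # pairwise cosine similarities
    iu = np.triu_indices(len(embeddings), k=1)    # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))

# Example with stand-in embeddings for 8 sampled solutions.
print(mean_pairwise_cosine_distance(np.random.randn(8, 128)))
```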

---
Case Study: More Solution Paths
Example: 2×3 grid arrangement problem.
- Base & GRPO: 2 strategies.
- ROVER: 4 strategies, including approaches the other models did not produce, such as the divider (stars-and-bars) method and the inclusion–exclusion principle.

---
Broader Impact & Outlook
Key message: In structured reasoning tasks — simplify rather than complicate.
> Simplicity is the ultimate sophistication. — Da Vinci
ROVER embodies this principle, offering:
- Lower computational load.
- Higher diversity.
- Robust performance.
For scaling such models in practice, open-source AI platforms like AiToEarn provide:
- AI content generation.
- Cross-platform publishing.
- Analytics and AI model ranking.
---