# HiPO: Intelligent Thinking Switch for Efficient LLM Reasoning
When a user asks a large language model (LLM) a simple question,
for example *How many letter P's are in the word “HiPPO”?*,
the model may unexpectedly produce a long chain-of-thought explanation:
> “First, let’s analyze the word HiPPO. The English term for ‘hippopotamus’ is hippo, a semi-aquatic mammal. Here the user used capital letters...”
Such verbose reasoning:
- **Wastes compute resources**
- **Increases wait time**
- **Can even result in wrong answers** due to reasoning loops
**This overthinking habit** is a growing issue as LLMs pursue stronger reasoning capabilities.
---

## The Problem: Overthinking in LLMs
LLMs excel at complex cognitive tasks thanks to **Chain-of-Thought (CoT)** reasoning, which mimics human step-by-step analysis.
However, they often apply deep reasoning to trivial questions.
### Key Drawbacks
- **High cost & latency**: Longer outputs mean more tokens, inflated API costs, and slower responses.
- **Poor adaptability**: The same “think deeply” mode is applied everywhere—simple or complex.
- **Error risk**: Over-complicated reasoning can deviate from the correct answer.
### Existing Mitigation Approaches
1. **Training-based Adaptive Reasoning**: RL with length penalties or SFT that promotes brevity (see the sketch after this list)
↳ Risk: Penalizing length can harm performance on genuinely hard tasks.
2. **External Controls**: Prompt engineering to limit reasoning steps
↳ Risk: Manual, non-scalable, poor generalization.
3. **Post-hoc Optimization**: Pruning generated reasoning chains
↳ Limitation: Cannot fundamentally change the model’s thinking process.
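
To make the first risk concrete, here is a minimal, illustrative sketch of a length-penalized reward of the kind training-based approaches use; the reward shape and the coefficient `lambda_len` are assumptions for illustration, not a formula from HiPO or any specific paper.

```python
# Illustrative length-penalized reward; the shape and lambda_len are assumptions.
def length_penalized_reward(is_correct: bool, num_tokens: int,
                            lambda_len: float = 0.001) -> float:
    """+1 for a correct answer minus a flat per-token penalty."""
    return (1.0 if is_correct else 0.0) - lambda_len * num_tokens

# A correct but necessarily long solution to a hard problem is scored worse
# than a correct short answer to an easy one, which is why naive length
# penalties can hurt accuracy on genuinely hard tasks.
print(length_penalized_reward(True, 2000))  # -1.0
print(length_penalized_reward(True, 200))   #  0.8
```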
---
## Introduction to HiPO
The **KwaiKAT team** at Kuaishou, in collaboration with **Nanjing University’s NLINK** group (Prof. Liu Jia-heng) and the **ARiSE lab** (Prof. Zhang Yu-qun), launched **HiPO (Hybrid Policy Optimization)**—an intelligent “thinking switch” for LLMs.
**Core idea:**
Enable the model to decide **when** to activate detailed reasoning (**Think-on**) and **when** to give a direct answer (**Think-off**).
Benefits:
- **Improved accuracy** on complex tasks
- **Reduced token consumption** and latency on simple tasks

**Resources:**
- Paper: [https://arxiv.org/abs/2509.23967](https://arxiv.org/abs/2509.23967)
- Model: [https://huggingface.co/Kwaipilot/HiPO-8B](https://huggingface.co/Kwaipilot/HiPO-8B)
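
A minimal generation sketch with the Hugging Face `transformers` library, assuming the released checkpoint works with the standard causal-LM chat interface; the generation settings are illustrative, and nothing here reflects documented model-specific options.

```python
# Minimal sketch, assuming Kwaipilot/HiPO-8B follows the standard
# transformers chat interface; parameters below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwaipilot/HiPO-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many letter P's are in the word 'HiPPO'?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The trained policy is expected to decide on its own whether to open a
# reasoning segment (Think-on) or answer directly (Think-off).
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```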
---
## I. Root Cause: Cognitive Inertia in LLMs
### Why LLMs Overthink
The success of step-by-step reasoning in hard problems has led to **cognitive inertia**:
models treat every task as complex.
---
## II. HiPO’s Two-Component Solution
### Component 1: Hybrid Data Cold-Start

**Goal:** Provide training data for both “think” and “no-think” modes.
**Process:**
1. **Collect diverse datasets** – math & code reasoning sources like AM-Thinking-v1-Distilled, AceReason-Math, II-Thought-RL, Skywork-OR1-RL-Data.
2. **Dual-mode generation** – Use a strong model (e.g., DeepSeek-V3) to produce multiple “Think-on” and “Think-off” answers per question.
3. **Select the mode by comparing pass rates** (see the sketch after this list):
- Choose **Think-on** only if its pass rate is significantly higher; otherwise prefer **Think-off**.
- If the pass-rate difference is below a threshold δ, default to Think-off.
4. **Pick the shortest correct answer** within the chosen mode.
5. **Add justification signals** – Short explanations for mode choice, injecting “why” reasoning awareness.
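
A minimal sketch of the selection rule in steps 3–4, assuming per-question pools of sampled answers in each mode; the data structures, the δ value, and the tie-breaking here are illustrative, not the paper's exact procedure.

```python
# Illustrative cold-start selection (steps 3-4); delta and the structures are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    text: str       # full generated answer (with or without reasoning)
    mode: str       # "think_on" or "think_off"
    correct: bool   # whether the final answer passes the checker

def select_sample(candidates: list[Candidate], delta: float = 0.1) -> Optional[Candidate]:
    on = [c for c in candidates if c.mode == "think_on"]
    off = [c for c in candidates if c.mode == "think_off"]
    pass_on = sum(c.correct for c in on) / max(len(on), 1)
    pass_off = sum(c.correct for c in off) / max(len(off), 1)

    # Keep Think-on only when it is clearly better; if the pass-rate gap is
    # below delta, lean toward the cheaper Think-off mode.
    mode = "think_on" if pass_on - pass_off >= delta else "think_off"

    # Step 4: among correct answers in the chosen mode, keep the shortest one.
    correct = [c for c in candidates if c.mode == mode and c.correct]
    return min(correct, key=lambda c: len(c.text)) if correct else None
```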
---
### Component 2: Hybrid Reinforcement Learning Reward System
**Goal:** Refine mode choice; avoid “thinking” inertia.
**Rewards:**
1. **Base reward** – Based on answer correctness and format.
2. **Bias adjustment mechanism** (see the sketch below):
- Compare the average rewards of Think-on and Think-off, and apply a small bias ω (~0.01) in favor of Think-off.
- When the performance gap between the modes is small, the boost to Think-off encourages brevity.
3. **Advantage functions**:
- **Judgment Advantage** *(A_judge)*: Ensures mode choice aligns with justification quality.
- **Answer Advantage** *(A_answer)*: Improves answer quality within chosen mode.
**Training:** PPO-like algorithm with token-level optimization for both reasoning and answer segments.
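
A rough sketch of how the base reward and bias adjustment could be composed; ω follows the ~0.01 value above, while the format bonus, the gap test, and the function names are assumptions rather than the paper's exact formulation.

```python
# Illustrative hybrid reward shaping; only omega ~ 0.01 comes from the text above,
# while the format bonus and gap_threshold are assumptions.
def hybrid_reward(correct: bool, well_formatted: bool, mode: str,
                  avg_reward_on: float, avg_reward_off: float,
                  omega: float = 0.01, gap_threshold: float = 0.05) -> float:
    # Base reward from answer correctness and output format.
    base = (1.0 if correct else 0.0) + (0.1 if well_formatted else 0.0)

    # Bias adjustment: when Think-on is barely better than Think-off on average,
    # give Think-off a small boost so the policy does not default to verbose reasoning.
    small_gap = (avg_reward_on - avg_reward_off) < gap_threshold
    bias = omega if (mode == "think_off" and small_gap) else 0.0
    return base + bias
```

In the PPO-like update described above, rewards of this kind would feed the judgment and answer advantages (*A_judge*, *A_answer*).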
---
## III. Results: Faster & More Accurate

**Benchmarks:** AIME2024/25, HumanEval, LiveCodeBench v6, MATH-500, GPQA-Diamond.
**Highlights:**
- **30% shorter outputs** and a **37% lower “thinking” rate** (RatioT; see the note after this section)
- **+6.3% accuracy improvement**
- Outperforms **AdaptThink** and **AutoThink**
**Dynamic decision analysis:**
- Mode activation gap narrowed from **89.5% to 53.1%** over RL training
- Mode rates adapt to dataset difficulty
↳ High Think-on rates on hard tasks (AIME), lower on simpler ones (HumanEval)
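
For reference, the thinking rate RatioT cited above can be read as the fraction of responses generated in Think-on mode; the snippet below is an assumed formulation, not the paper's exact definition.

```python
# Assumed RatioT: share of responses that activated Think-on.
def ratio_t(modes: list[str]) -> float:
    return sum(m == "think_on" for m in modes) / len(modes)

# Example: 3 of 8 responses used detailed reasoning -> RatioT = 0.375
print(ratio_t(["think_on", "think_off", "think_off", "think_on",
               "think_off", "think_on", "think_off", "think_off"]))
```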

---
## IV. Future Outlook
1. **Practical LLM deployment** – Lower latency & cost for large-scale applications.
2. **Model lightweighting** – Skipping unnecessary reasoning can aid compression and distillation.
3. **Metacognitive AI** – Models deciding *how* to think represent a leap in AI intelligence.
---
## V. Conclusion
The HiPO framework teaches LLMs **to discern when deep reasoning is warranted** vs when a direct answer suffices—balancing **quality** and **efficiency**.
**Open-source model**: [https://huggingface.co/Kwaipilot/HiPO-8B](https://huggingface.co/Kwaipilot/HiPO-8B)
---