# RewardMap: Tackling Sparse Rewards in Fine-Grained Visual Reasoning

## Research Collaboration
This work is led by the **ENCODE Lab at Westlake University** in collaboration with:
- **Tongji University**
- **Zhejiang University**
- **National University of Singapore**

The team has strong expertise in **large model reinforcement learning** and **multimodal reasoning**.

---

## Background

In recent years, **Large Language Models (LLMs)** and **Multimodal Large Language Models (MLLMs)** have achieved breakthrough progress in:
- Scene understanding
- Complex reasoning tasks

Yet a key question remains:  
> When visual information becomes extremely complex and densely structured, can a model truly *“understand the picture”*?

Real-world examples — e.g., **high-resolution subway maps** — require:
- Fine-grained visual perception
- Spatial reasoning across multiple lines and stations

---

## Earlier Work: ReasonMap

The team’s prior work, **ReasonMap**, was **the first systematic study** to reveal challenges in high-resolution map reasoning:
- Even state-of-the-art MLLMs suffer from **reasoning hallucinations**:
  - Misreading lines
  - Missing stations
  - Repeating routes

### Key Observation
On high-resolution, information-rich subway maps:
- RL with **only success/failure signals** from the final answer falls into a **sparse-reward trap** (illustrated in the sketch below)
- The few accidentally correct outputs produce **high-variance gradient estimates**
- Training becomes **slow and unstable**
- Hallucinations persist in long-chain path-planning tasks
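
A toy Python sketch makes the trap concrete (all numbers are hypothetical, not from the paper): it compares the noise of a binary success/failure reward against a dense, decomposed reward on the same simulated rollouts.

```python
import random

random.seed(0)

def rollout_quality():
    # Fraction of sub-goals (start/end stations, lines, transfers) a rollout
    # gets right; uniform here purely for illustration.
    return random.random()

def binary_reward(q):
    # All-or-nothing: reward only when the entire route is correct.
    return 1.0 if q > 0.95 else 0.0

def dense_reward(q):
    # Partial credit for every correct sub-goal (the shaping idea).
    return q

def mean_and_cv(xs):
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, (var ** 0.5) / mean  # mean and relative noise

rollouts = [rollout_quality() for _ in range(100_000)]
for name, fn in [("sparse", binary_reward), ("dense", dense_reward)]:
    mean, cv = mean_and_cv([fn(q) for q in rollouts])
    print(f"{name}: mean={mean:.3f}, relative noise={cv:.2f}")
# The sparse signal is almost always zero, so gradient estimates are driven
# by rare spikes; the dense signal carries information on every rollout.
```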

---

## Proposed Solution: RewardMap

**RewardMap** is a **multi-stage reinforcement learning framework** specifically designed for real-world map reasoning.

**Core Innovations:**
- **Difficulty-aware fine-grained rewards**
- **Curriculum learning** from easy to hard tasks

**Outcome:**  
Improved **fine-grained visual understanding** and **spatial reasoning** in MLLMs.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-362.jpg)

---

**Paper Title:**  
*RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning*

**Links:**
- **[Paper](https://arxiv.org/abs/2510.02240)**
- **[Project Homepage](https://fscdc.github.io/RewardMap/)**
- **[Code](https://github.com/fscdc/RewardMap)**
- **[Dataset](https://huggingface.co/collections/FSCCS/reasonmap-688517b57d771707a5d64656)**

---

## ReasonMap-Plus: Dense Supervision for Cold Start

![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-337.jpg)

Building on ReasonMap, the team developed **ReasonMap-Plus**:
- High-resolution metro/rail maps from **30 cities**
- **4,018 problem samples**
- Five categories of **fine-grained perception-focused tasks**:
  - Two types of Local Counting
  - Global Counting
  - Two types of True/False
- **Difficulty tags**: Easy / Medium / Hard
- **Balanced train/test splits** by city and difficulty

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-312.jpg)
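
As a sketch of what "balanced by city and difficulty" could mean in practice, the snippet below performs a stratified 80/20 split over hypothetical (city, difficulty) buckets; the field names, cities, and ratio are assumptions, not the released format.

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical records; the real ReasonMap-Plus schema may differ.
samples = [
    {"city": city, "difficulty": diff, "question": f"q-{city}-{diff}-{i}"}
    for city in ["Beijing", "Tokyo", "Paris"]
    for diff in ["easy", "medium", "hard"]
    for i in range(20)
]

# Group by (city, difficulty), then split each bucket 80/20 so every
# stratum is represented proportionally in both train and test.
buckets = defaultdict(list)
for s in samples:
    buckets[(s["city"], s["difficulty"])].append(s)

train, test = [], []
for group in buckets.values():
    random.shuffle(group)
    cut = int(0.8 * len(group))
    train.extend(group[:cut])
    test.extend(group[cut:])

print(len(train), len(test))  # every (city, difficulty) pair feeds both splits
```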

---

## RewardMap Framework

### Step-by-Step Approach
1. **Decomposable Fine-Grained Rewards**  
   Split route planning into evaluable sub-goals; avoid binary-only signals.
2. **Curriculum Training**  
   Train first on dense, low-noise subtasks → then on full real-world planning.

### Core Components
1. **Difficulty-Aware Fine-Grained Rewards**
2. **Multi-Stage RL** leveraging ReasonMap-Plus tasks for strong cold start signals.
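
A minimal sketch of how such a multi-stage schedule might be wired up is shown below; the stage names, task pools, and step counts are illustrative assumptions, not the paper's hyperparameters.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str      # human-readable stage label
    tasks: list    # which task pools feed the RL sampler
    steps: int     # RL steps before advancing (toy values here)

CURRICULUM = [
    # Cold start: dense, perception-focused ReasonMap-Plus subtasks.
    Stage("perception", ["local_count", "global_count", "true_false"], 3),
    # Blend perception subtasks with easier route-planning questions.
    Stage("mixed", ["true_false", "route_planning_easy"], 3),
    # Final phase: full real-world route planning on ReasonMap.
    Stage("planning", ["route_planning_full"], 5),
]

def sample_batch(task_pools, batch_size=8):
    # Placeholder sampler: draw (pool, index) prompts for one RL step.
    return [(pool, i) for i, pool in enumerate(task_pools)][:batch_size]

for stage in CURRICULUM:
    for _ in range(stage.steps):
        batch = sample_batch(stage.tasks)
        # rl_update(policy, batch)  # a GRPO/PPO-style update would go here
    print(f"finished stage: {stage.name}")
```

Front-loading the dense perception tasks gives the policy non-zero rewards from the very first steps, which is the cold-start benefit the text describes.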

**Difficulty Awareness:**  
Reward weighting considers:
- **Map difficulty** (three levels)
- **Problem difficulty** (more required transfers imply a harder problem)
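
The snippet below sketches one plausible weighting scheme over these two signals; the specific multipliers are assumptions for illustration.

```python
# Three map-difficulty levels, per the text; multipliers are hypothetical.
MAP_LEVEL_WEIGHT = {"easy": 1.0, "medium": 1.2, "hard": 1.5}

def difficulty_weight(map_level: str, num_transfers: int) -> float:
    # Harder maps and routes with more transfers get a larger reward scale,
    # so progress on the hardest cases is paid proportionally more.
    return MAP_LEVEL_WEIGHT[map_level] * (1.0 + 0.1 * num_transfers)

print(difficulty_weight("easy", 0))   # 1.0
print(difficulty_weight("hard", 3))   # 1.95
```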

---

## Reward Function Design

Reward components:
- **Format compliance**
- **Final correctness**
- **Detail items** (weighted by α = 0.5)

**Detail items** add/deduct points based on:
- Correct start/end stations
- Correct route names
- Proper transfer stations
- Correct number of route segments

**Benefit:** Delivers *partial correctness signals*, stabilizing gradients compared to all-or-nothing scoring.
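
Putting the pieces together, here is a hedged sketch of the composed reward. Only the structure (format compliance + final correctness + α-weighted detail items, α = 0.5) follows the text; the field names and per-item scores are assumptions, and the `weight` argument is where a difficulty weight like the one sketched above would plug in.

```python
ALPHA = 0.5  # detail-item weight, per the text

def detail_score(pred: dict, gold: dict) -> float:
    # Add points for each correct detail item, deduct for each wrong one
    # (per-item magnitudes are hypothetical).
    score = 0.0
    score += 1.0 if pred["start"] == gold["start"] else -0.5
    score += 1.0 if pred["end"] == gold["end"] else -0.5
    score += 1.0 if pred["lines"] == gold["lines"] else -0.5
    score += 1.0 if pred["transfers"] == gold["transfers"] else -0.5
    score += 1.0 if len(pred["segments"]) == len(gold["segments"]) else -0.5
    return score

def reward(pred: dict, gold: dict, well_formatted: bool,
           weight: float = 1.0) -> float:
    r_format = 1.0 if well_formatted else 0.0   # format compliance
    r_final = 1.0 if pred == gold else 0.0      # final correctness
    return weight * (r_format + r_final + ALPHA * detail_score(pred, gold))

gold = {"start": "A", "end": "F", "lines": ["L1", "L3"],
        "transfers": ["C"], "segments": [("A", "C"), ("C", "F")]}
pred = dict(gold, end="E")            # wrong terminal station only
print(reward(pred, gold, True))       # 2.75: partial credit, not all-or-nothing
```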

---

## Results

📈 **Performance Gains**  
RewardMap was evaluated on:
1. **ReasonMap**
2. **ReasonMap-Plus**
3. Six external benchmarks across:
   - Spatial reasoning
   - Fine-grained vision
   - General VQA

**Highlights:**
- **Largest improvement**: +13.51% on *SpatialEval*
- Outperformed traditional SFT → RL pipelines

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-292.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-265.jpg)

### Qualitative Improvements
- Fewer line misidentifications
- Reduced hallucinations (repeated routes)
- More accurate start/end stations
- Better route segmentation matching real map structures

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-247.jpg)

---

## Future Outlook

RewardMap demonstrates a **reusable RL paradigm** for high-resolution, structured visual tasks:
- Break problems into measurable sub-goals
- Apply difficulty modeling to balance sparse data
- Link perception-focused subtasks with reasoning-heavy tasks

This ensures the model progresses from:
> **"Seeing clearly" → "Thinking clearly"**

Post-training on map data also boosts general MLLM capability, suggesting a larger role for real-world structured data in future multimodal AI development.

---

## Related Tools for AI Creators

Platforms like the **[AiToEarn official site](https://aitoearn.ai/)** showcase how **integrated tooling** benefits creators:
- AI content generation
- Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X)
- Analytics and model rankings

Visit:
- [AiToEarn Blog](https://blog.aitoearn.ai)
- [AI Model Rankings](https://rank.aitoearn.ai)

While distinct from RewardMap's research goals, **structured methodologies and integrated ecosystems** apply across both AI development and content workflows.

---
