Making Robots “Think and Act Accurately”: VLA-R1 Brings “Reasoning + Action” into the Real World

2025-10-25 12:24 Beijing

Letting the model both explain its reasoning process clearly and execute actions accurately


---

Introduction

In robotics and intelligent agents, a core challenge is bridging the gap between understanding instructions and performing precise actions.

For example:

  • “Put the yellow bowl into the white empty basket.”
  • “Take the milk out of the microwave and place it on the dining table.”

This requires:

  • Environment understanding
  • Instruction parsing
  • Path planning & affordance reasoning
  • Grounding reasoning into exact physical actions

Current Vision-Language-Action (VLA) models often output direct action sequences without explicit reasoning about affordance–trajectory relationships, which increases error rates in complex scenarios.

Goal of VLA-R1:

Add a structured reasoning step, then use reinforcement learning to ensure accurate execution — making the robot explain first and act second.

---

VLA-R1: Overview


Summary:

VLA-R1 is a reason-first, execute-second model that integrates:

  • Chain-of-Thought (CoT) supervision
  • Reinforcement learning with verifiable rewards (RLVR), built on GRPO

It optimizes both reasoning quality and execution accuracy, following the structure:

...
...

---

Key Innovations

1. Two-Stage Training (SFT + RL)

  • Stage 1 — SFT with CoT supervision: Teacher-guided fine-tuning
  • Stage 2 — RL with verifiable rewards (GRPO): Stable refinement from “can think” to “can do”
      • Uses group-wise normalized advantages
      • Enforces a KL constraint against the reference policy (see the sketch below)
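
The post only names these ingredients, so the snippet below is a minimal PyTorch sketch of how a GRPO-style update with group-wise normalized advantages and a KL penalty is typically implemented; the clipping and KL coefficients are illustrative assumptions, not values reported for VLA-R1.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-wise normalized advantages: for each prompt, a group of responses
    is sampled and their verifiable rewards are normalized within that group."""
    mean = rewards.mean(dim=-1, keepdim=True)   # (num_prompts, 1)
    std = rewards.std(dim=-1, keepdim=True)     # (num_prompts, 1)
    return (rewards - mean) / (std + eps)       # (num_prompts, group_size)

def grpo_loss(logp_new: torch.Tensor,      # log-probs under the current policy
              logp_old: torch.Tensor,      # log-probs under the sampling policy
              logp_ref: torch.Tensor,      # log-probs under the frozen SFT reference
              advantages: torch.Tensor,
              clip_eps: float = 0.2,       # illustrative value
              kl_coeff: float = 0.04) -> torch.Tensor:  # illustrative value
    """Clipped policy-gradient objective plus a KL penalty toward the reference
    policy, following the standard GRPO recipe."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # k3 estimator of KL(current || reference), common in GRPO implementations
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return policy_loss + kl_coeff * kl.mean()

# Example: one prompt, a group of four rollouts scored by the verifiable rewards.
rewards = torch.tensor([[0.9, 0.2, 0.5, 0.7]])
print(grpo_advantages(rewards))  # zero-mean, roughly unit-variance within the group
```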

---

2. Three Verifiable Rewards (RLVR)

Ensures the model:

  • Sees correctly
  • Moves correctly
  • Formats correctly

Reward Types:

  • Spatial Alignment (GIoU): keeps gradients informative even when predicted and ground-truth boxes don’t overlap.
  • Trajectory Consistency (ALHF Fréchet distance): considers position, tangent angles, and segment-length ratios.
  • Output Format Reward: enforces `<think>` / `<answer>` structuring of the output (a sketch of all three follows below).
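
To make the three checks concrete, here is a minimal sketch of plausible reward functions. The GIoU and format checks follow their standard definitions; the trajectory term uses a plain discrete Fréchet distance as a simplified stand-in for the paper's angle- and length-aware (ALHF) variant, and the `<think>`/`<answer>` tag names and all constants are assumptions for illustration.

```python
import re
import numpy as np

def giou_reward(pred: np.ndarray, gt: np.ndarray) -> float:
    """Generalized IoU for boxes in (x1, y1, x2, y2) format. Unlike plain IoU,
    it stays informative (goes negative) when the boxes do not overlap."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = max(area_p + area_g - inter, 1e-9)
    # Smallest axis-aligned box enclosing both boxes
    c_area = max((max(pred[2], gt[2]) - min(pred[0], gt[0])) *
                 (max(pred[3], gt[3]) - min(pred[1], gt[1])), 1e-9)
    return inter / union - (c_area - union) / c_area

def frechet_reward(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified trajectory reward: negative discrete Fréchet distance between
    two (N, 2) waypoint polylines. (The paper's ALHF variant additionally weighs
    tangent angles and segment-length ratios.)"""
    n, m = len(pred), len(gt)
    memo = np.full((n, m), -1.0)
    def dp(i, j):
        if memo[i, j] >= 0:
            return memo[i, j]
        dist = float(np.linalg.norm(pred[i] - gt[j]))
        if i == 0 and j == 0:
            memo[i, j] = dist
        elif i == 0:
            memo[i, j] = max(dp(0, j - 1), dist)
        elif j == 0:
            memo[i, j] = max(dp(i - 1, 0), dist)
        else:
            memo[i, j] = max(min(dp(i - 1, j), dp(i - 1, j - 1), dp(i, j - 1)), dist)
        return memo[i, j]
    return -dp(n - 1, m - 1)

def format_reward(text: str) -> float:
    """1.0 if the response is wrapped as <think>...</think><answer>...</answer>."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, text.strip(), flags=re.DOTALL) else 0.0
```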

---

3. VLA-CoT Data Engine & Dataset

  • Generated CoT data using Qwen2.5-VL-72B
  • 13K samples aligned with visual/action data
  • Structured four-step reasoning prompt (illustrated in the sketch below)
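
The exact prompt is not reproduced in this post, so the template below is only a hypothetical illustration of what a structured four-step reasoning prompt for Qwen2.5-VL-72B might look like; the step names and wording are assumptions, not the VLA-CoT data engine's actual prompt.

```python
# Hypothetical four-step CoT prompt (illustrative only; the actual data-engine
# prompt used with Qwen2.5-VL-72B is not reproduced in this post).
COT_PROMPT = """You are a robot manipulation assistant. Given the image and the
instruction "{instruction}", reason in four steps before answering:
1. Scene understanding: list the relevant objects and their spatial relations.
2. Instruction parsing: identify the target object and the goal location.
3. Affordance reasoning: give the 2D box where the target should be grasped.
4. Trajectory planning: outline the waypoints from the grasp point to the goal.
Wrap the reasoning in <think>...</think> and the final box plus waypoints in
<answer>...</answer>."""

def build_cot_query(instruction: str) -> str:
    """Fill the template for one scene-instruction pair."""
    return COT_PROMPT.format(instruction=instruction)

print(build_cot_query("Put the yellow bowl into the white empty basket."))
```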

---

Experimental Overview

Evaluation across:

  • In-Domain
  • Out-of-Domain
  • Simulation platforms
  • Real robots

Ablations: CoT vs. RL vs. both.

---

Benchmarks

In-Domain

Dataset: VLA-CoT-13K — Affordance Detection + Trajectory Generation.

Objects: bowls, cups, spoons, pens, boxes, baskets.

Results:

  • Affordance IoU: 36.51 (+17.78% over baseline)
  • Avg trajectory error: 91.74 (a 17.25% reduction over baseline)

---

Out-of-Domain

Datasets: UMD (affordance labels) + VAIT (scene-instruction pairs)

Results:

  • Affordance IoU: 33.96 (UMD)
  • Trajectory error: 93.90 (VAIT)

---

Real Robot Experiments — 4 Tabletop Scenarios

Scenarios:

  • S1: Colored bowl pickup (similar colors challenge)
  • S2: Fruit pickup (same-category differentiation)
  • S3: Complex kitchen with occlusion
  • S4: Mixed clutter with multiple containers

Results:

  • Affordance SR: 62.5%
  • Trajectory SR: 75%

---

Simulation (Piper / UR5)

Tested across both platforms with diverse objects and instructions.

Results:

  • Affordance SR: 60% (Piper) / 50% (UR5)
  • Trajectory SR: 80% (Piper) / 60% (UR5)

---

Ablation Study

Configurations:

  • Direct trajectory output
  • CoT only
  • CoT + RL

Findings:

  • CoT boosts IoU from 23.74 → 28.37
  • CoT+RL boosts IoU to 36.51 with lower trajectory error

---

Demos

Reasoning Process Showcase

Real-Robot Platform

Simulation Platform

---

Application Prospects

Household Picking & Storage

  • Handles clutter, similar colors, and uneven lighting
  • Resolves ambiguities before acting (e.g., spoon → bowl, pen → white box)

Warehouse Picking & Light Assembly

  • Explains why a particular container or path was chosen
  • Generates smooth, safe trajectories with MES/PLC integration

Education & Evaluation

  • The `<think>` + `<answer>` format supports grading and teaching (see the parsing sketch below)
  • Standard metrics enable comparison across training methods
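
As a rough illustration of why this format is machine-gradable, here is a minimal parsing sketch; the tag names match the format reward above, while the function and returned fields are hypothetical.

```python
import re

def parse_structured_output(text: str) -> dict:
    """Split a reason-first response into reasoning and answer parts so each can
    be graded separately (e.g., CoT quality vs. affordance/trajectory accuracy)."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
        "well_formed": bool(think and answer),
    }

sample = "<think>The yellow bowl sits left of the basket ...</think><answer>box=[120, 80, 210, 160]</answer>"
print(parse_structured_output(sample)["well_formed"])  # True
```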

---

Platforms like AiToEarn let creators monetize AI-driven content.

Features:

  • Generate → Publish → Analyze → Monetize
  • Multi-platform delivery: Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X
  • Open-source model ranking & analytics

---

© THE END

For repost authorization, please contact this account.
