AI Online Reinforcement Learning “Learn While Doing”: Stanford Team Boosts 7B Model to Surpass GPT-4o

AgentFlow: A New Framework for Adaptive, Multi‑Agent Reasoning

Overview

Stanford and collaborators have introduced AgentFlow, a framework that uses online reinforcement learning to help agentic systems achieve more with less, in some cases enabling a 7B model to surpass GPT‑4o.

Core Concept:

AgentFlow continuously improves an agentic system’s reasoning on complex problems through the collaboration of four specialized modules:

  • Planner
  • Executor
  • Verifier
  • Generator

These modules coordinate through a shared memory, and the Planner is optimized in real time using the novel Flow‑GRPO method.


---

Performance Highlights

Built on Qwen‑2.5‑7B‑Instruct, AgentFlow achieves notable improvements over existing systems across 10 benchmarks:

  • Search tasks: +14.9%
  • Agentic tasks: +14.0%
  • Math tasks: +14.5%
  • Science tasks: +4.1%

It even outperforms models roughly 50× its size, including GPT‑4o and Llama3.1‑405B.


---

Community Reception

AgentFlow has attracted strong interest:

> “Multi‑agent flow feels like phase‑coupled reasoning. Looking forward to coordination ability replacing scale as the key metric for intelligence.”


> “Flow‑GRPO’s shared‑memory multi‑agent architecture is brilliantly designed. The Verifier’s ability to block hallucinated tool calls reduces error propagation in multi‑step reasoning chains.”


---

Need for AgentFlow

Agent-based AI systems have expanded rapidly across vertical and general-purpose applications, yet they still struggle with:

  • Complex decision-making
  • Continuous optimization

Breakthrough: Integrating agent reasoning with reinforcement learning for self-improvement.

Prior work — DeepSeek‑R1, Search‑R1, LangGraph, PydanticAI, OWL — advanced task planning, agent collaboration, and tool integration. AgentFlow builds on these foundations.


---

Core Architecture

Four Specialized Agents (with persistent memory):

  • Planner – Analyzes tasks, develops strategy, and chooses tools.
  • Executor – Executes selected tools and consolidates results.
  • Verifier – Validates intermediate outputs against shared memory.
  • Generator – Produces final task output.
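The four‑module loop above can be sketched as a minimal Python skeleton. All names here (`SharedMemory`, `run_task`, the `wiki_search` tool) are illustrative assumptions, not AgentFlow's actual API; a real system would back each module with an LLM and real tools.

```python
from dataclasses import dataclass, field

@dataclass
class SharedMemory:
    """Evolving record of the task state, readable by all four modules."""
    entries: list = field(default_factory=list)

    def write(self, role: str, content: str):
        self.entries.append((role, content))

# Hypothetical tool registry; AgentFlow wires in real search/code tools.
TOOLS = {"wiki_search": lambda q: f"wiki results for {q!r}"}

def planner(task, memory):
    # Analyze the task and memory so far; choose a sub-goal and a tool.
    return "wiki_search", task

def executor(tool_name, tool_input):
    # Run the selected tool and return its result.
    return TOOLS[tool_name](tool_input)

def verifier(result, memory):
    # Accept a result only if it is non-empty and not already recorded.
    return bool(result) and ("executor", result) not in memory.entries

def generator(memory):
    # Produce the final answer from whatever the memory holds.
    return memory.entries[-1][1] if memory.entries else ""

def run_task(task, max_steps=3):
    memory = SharedMemory()
    for _ in range(max_steps):
        tool, arg = planner(task, memory)
        result = executor(tool, arg)
        if verifier(result, memory):
            memory.write("executor", result)
            break  # verified: stop early rather than exhausting the budget
    return generator(memory)
```

The early `break` is where the Verifier pays off: unverified results never enter shared memory, so they cannot propagate into later steps.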

---

Real‑Time Strategy Adjustment

During each task:

  • Planner modifies approach based on environment changes and feedback from other agents.
  • Continuous co‑evolution of modules nurtures adaptive reasoning.
  • Updates are stored in shared memory for future optimization.

---


Reinforcement Learning in the Flow

AgentFlow’s novelty is on‑policy, real‑time optimization of the Planner within the interactive agent workflow.

Workflow Steps:

  • Environment perception + memory retrieval
  • Action planning + tool selection
  • Strategy optimization + shared memory update
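The three steps above can be illustrated with a toy, self‑contained loop. The bandit‑style "tools", the hard‑coded reward, and the 0.1 learning rate are purely hypothetical stand‑ins for the LLM Planner and the Flow‑GRPO update:

```python
def toy_workflow_step(policy_weights, memory):
    """One iteration of the perceive -> plan -> optimize loop (toy sketch)."""
    # 1) Environment perception + memory retrieval
    context = len(memory)  # here: simply how many steps were tried already
    # 2) Action planning + tool selection: pick the highest-weight tool
    action = max(policy_weights, key=policy_weights.get)
    reward = 1.0 if action == "good_tool" else 0.0  # toy environment signal
    # 3) Strategy optimization + shared-memory update
    policy_weights[action] += 0.1 * (reward - policy_weights[action])
    memory.append((context, action, reward))
    return action, reward
```

Even this crude update eventually steers the "planner" away from an initially attractive but useless tool, which is the qualitative behavior the workflow aims for.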

---

Flow‑GRPO: Tackling Multi‑Turn RL Challenges

Challenges in agent RL:

  • Multi‑turn credit assignment
  • Sparse rewards
  • Long‑horizon reasoning

Solution:

Flow‑GRPO uses an action‑level, multi‑turn optimization objective: the final outcome reward (success or failure) is broadcast to every step of the trajectory, reducing the multi‑turn credit‑assignment problem to a series of single‑turn policy updates and improving training efficiency.
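A minimal sketch of the broadcasting idea, assuming a GRPO‑style group‑normalized baseline (the function name and the normalization details are my assumptions, not the paper's exact objective):

```python
import statistics

def flow_grpo_advantages(final_rewards, steps_per_rollout):
    """Broadcast each rollout's final outcome reward (e.g. 1.0 success /
    0.0 failure) to all of its steps, normalized against the group mean."""
    mean = statistics.mean(final_rewards)
    std = statistics.pstdev(final_rewards) or 1.0  # avoid division by zero
    per_step = []
    for reward, n_steps in zip(final_rewards, steps_per_rollout):
        adv = (reward - mean) / std
        # Every step in a rollout shares the same advantage, so each step
        # can be trained as an independent single-turn policy update.
        per_step.append([adv] * n_steps)
    return per_step
```

Because the advantage is constant within a rollout, no per‑step value estimates or sparse intermediate rewards are needed, which is precisely the simplification claimed above.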


---

Benchmark Results

AgentFlow was tested across knowledge retrieval, agentic tasks, math reasoning, and science reasoning.

Results:

  • Search: +14.9%
  • Agentic: +14.0%
  • Math: +14.5%
  • Science: +4.1%

It again outperforms far larger models, including GPT‑4o (reportedly ~200B parameters).


---

Key Insights

1. Model Size ≠ Ultimate Performance

A well‑trained 7B‑parameter AgentFlow beat GPT‑4o and Llama3.1‑405B, leading on search (+8.2%) and agentic reasoning (+15.8%).

2. Importance of “Learning in the Flow”

Replacing online training with offline supervised fine‑tuning (SFT) reduced performance by ~19%.

Online training:

  • Corrects tool invocation errors quickly
  • Plans precise sub‑tasks
  • Enhances overall completion rates

---


3. Autonomous Tool Path Discovery

Post Flow‑GRPO training:

  • Planner learned optimal tool combinations
  • Discovered “tool chain” usage (e.g., Wikipedia Search → targeted web search)
  • Significantly improved information retrieval depth

---

4. Dynamic Reasoning Depth

For complex tasks (e.g., Multihop Search):

  • Performance improved with higher step limits without increasing average step count
  • Indicates selective deep reasoning only when necessary
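This behavior can be illustrated with a toy early‑stopping solver; the task "difficulties" and step caps below are purely hypothetical numbers, not the paper's data:

```python
def solve(difficulty, max_steps):
    """Toy agent that stops as soon as the task is solved; 'difficulty' is
    the number of reasoning steps the task actually needs."""
    for step in range(1, max_steps + 1):
        if step >= difficulty:
            return step, True    # solved: stop early
    return max_steps, False      # step budget exhausted, unsolved

tasks = [1, 1, 2, 6]             # mostly easy, one hard multi-hop task
for cap in (3, 8):
    results = [solve(d, cap) for d in tasks]
    avg_steps = sum(s for s, _ in results) / len(results)
    n_solved = sum(ok for _, ok in results)
    # Raising the cap from 3 to 8 makes the hard task solvable, while the
    # easy tasks still stop early, keeping the average step count low.
```

Because easy tasks terminate at their natural depth, a higher cap costs extra steps only on the tasks that genuinely need them.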

---

Conclusion

AgentFlow proposes a shift in agent training strategies:

  • Move away from reliance on massive monolithic LLMs
  • Enable continuous, adaptive, collaborative learning
  • Combine collective intelligence with learning-by-doing for complex task handling

While real-world deployment at scale remains a challenge, AgentFlow’s potential is considerable.

