New LLM Reinforcement Learning Framework: UCSD Multi-Agent Training Boosts LLM Tool-Use Capability by 5.8×

Reinforcement Learning Framework for Large Language Model Agents

First Implementation of Universal Multi-Agent Group Reinforcement

---

Background

Numerous studies show that multi-agent workflows with large language models (LLMs) often outperform single-agent systems — even without targeted training.

Yet, most current LLM agent training frameworks are limited to single-agent training, leaving universal multi-agent “group reinforcement” an open problem.

---

Introducing PettingLLMs

Researchers from UCSD and Intel developed PettingLLMs, a generalized multi-agent reinforcement learning framework that supports training multiple LLMs in arbitrary combinations.

Multi-agent LLM systems can significantly boost performance in:

  • Healthcare
  • Programming
  • Scientific research
  • Embodied AI

---

Core Algorithm: Group Relative Policy Optimization (GRPO)

GRPO has proven effective for large model agent training.

Principle:

  • All candidates start from the same input prompt
  • Sample multiple candidate responses
  • Evaluate them with a reward model
  • Compute relative advantages within the group

Key Assumption: All responses are generated from exactly the same context (prompt).
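
To make the grouping concrete, the following minimal sketch (an illustration, not the paper's implementation) computes group-relative advantages by normalizing each candidate's reward against its group's mean and standard deviation:

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO-style advantage: A_i = (r_i - mean(r)) / std(r), computed
    # within a group of responses sampled from the SAME prompt.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# Four candidates for one prompt, scored by a reward model
print(group_relative_advantages([0.0, 1.0, 1.0, 0.5]))
```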

---

Fundamental Challenge in Multi-Agent Environments

In multi-agent, multi-turn tasks:

  • Prompts evolve differently per agent and turn
  • Example: in coding tasks, a second-turn prompt may include:
    • The original question
    • Code from the first turn
    • Unit tests generated by other agents

Grouping responses from different prompts together for advantage calculation violates GRPO's shared-context requirement and makes the comparison unfair.

---

Solution: Greedy-Search Tree Sampling

Approach:

  • Each turn: Every agent forms a node with K branches
  • After branching: Only the highest-reward branch proceeds
  • Balances exploration vs exploitation

Rewards:

  • Each agent receives a role-specific reward plus a global task reward
  • This combination drives both specialized capabilities and cooperation skills
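
A minimal sketch of one greedy tree-sampling turn, under stated assumptions: `sample_branch` is a hypothetical stand-in for an LLM rollout plus its reward, and the real framework's interfaces differ. The key point is that all K branches of a node share one prompt (so they form a valid GRPO group), while only the best branch extends the context:

```python
import random

def sample_branch(agent, context):
    # Hypothetical stand-in for one LLM rollout and its reward
    # (role-specific reward plus global task reward in the paper).
    response = f"[{agent} draft {random.randint(0, 999)}]"
    return response, random.random()

def greedy_tree_turn(agents, context, k=4):
    # Each agent is a tree node expanding K branches from the shared
    # context; all K samples form a same-prompt group for advantage
    # computation, but only the highest-reward branch is kept as the
    # context for the next node (greedy search).
    groups = {}
    for agent in agents:
        branches = [sample_branch(agent, context) for _ in range(k)]
        groups[agent] = branches
        best_response, _ = max(branches, key=lambda b: b[1])
        context = context + "\n" + best_response
    return context, groups

context, groups = greedy_tree_turn(["coder", "tester"], "Solve task X")
```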

---

Specialization vs Shared Model Strategy

Question: When should agents have specialized models vs a shared model?

The implemented system uses asynchronous, distributed training with two modes:

Training Modes

  • Specialized Model Mode:
    • Maintains multiple independent model pools (Pool i, Pool j)
    • Routing sends Agent i’s data to Pool i → updates Model i only
  • Shared Model Mode:
    • Merges all agents’ data into one pool
    • Updates a single shared model for all agents
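
A minimal sketch of the routing idea, assuming a hypothetical `AGENT_TO_POOL` table; the actual pool and scheduler APIs in PettingLLMs may look quite different:

```python
from collections import defaultdict

# Hypothetical routing table: agent role -> model pool.
AGENT_TO_POOL = {"planner": "pool_i", "coder": "pool_j"}

def route_rollouts(rollouts, shared=False):
    # Specialized mode: Agent i's trajectories update Model i only.
    # Shared mode: all trajectories merge into one pool and update a
    # single shared policy for every agent.
    pools = defaultdict(list)
    for agent, trajectory in rollouts:
        pool = "shared" if shared else AGENT_TO_POOL[agent]
        pools[pool].append(trajectory)
    return pools  # each pool's batch then drives one asynchronous model update

print(route_rollouts([("planner", "traj_a"), ("coder", "traj_b")]))
```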

---

Framework Advantages

PettingLLMs unifies multi-agent cooperation and specialization in a single training framework.

It is released as open source to accelerate community development.

Environment Support:

  • Developers only need to implement (see the skeleton after this list):
    • Task-specific interaction logic
    • Reward functions
  • Built-in environments: mathematics, coding, games
  • Arbitrary mappings between models and agents
  • Individual LoRA configurations per agent
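
As a rough illustration of that developer surface, here is a hypothetical environment skeleton; the real base class, method names, and LoRA configuration format in PettingLLMs may differ:

```python
class PlanPathEnv:
    # Hypothetical per-agent LoRA configuration (format is assumed).
    agent_lora = {"planner": {"rank": 8}, "executor": {"rank": 16}}

    def reset(self) -> str:
        # Task-specific interaction logic: produce the initial prompt.
        return "Navigate from A to B on this grid: ..."

    def step(self, agent: str, response: str):
        # Fold the agent's response into the next prompt and score it
        # with a task-specific reward function.
        next_prompt = f"Previous {agent} output:\n{response}\nRefine the plan."
        reward = 1.0 if "reached goal" in response else 0.0
        return next_prompt, reward, reward == 1.0  # (prompt, reward, done)
```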

---

Real-World Training Results

In Sokoban, a long-horizon planning task, AT-GRPO (the framework's agent- and turn-wise GRPO algorithm) improved task success from 14% to 96%.

---

Large-Scale Experiments

Models: Qwen3-1.7B and Qwen3-8B

Tasks:

  • Planning: Sokoban, Plan-Path
  • Coding: LiveCodeBench, APPS, CodeContests
  • Mathematics: AIME 24/25, OlympiadBench

Performance Gains:

  • Planning: Sokoban +82%, Plan-Path +52.5%
  • Coding: LiveCodeBench +6.1%, APPS +4.2%, CodeContests +7.0%
  • Math: AIME 24 +9.0%, AIME 25 +17.9%

---

Ablation Studies

  • Single-agent training is limited:
    • Training only the planner or tool agent yields +6–9%; the multi-agent system gains just +16% in total
  • Swapping role strategies causes collapse:
    • Accuracy drops from 96% to 6%
  • Cooperation improves over training:
    • Agents’ rewards rise in sync and fewer interaction rounds are required

---

Resources

  • Paper: https://huggingface.co/papers/2510.11062
  • GitHub: https://github.com/pettingllms-ai/PettingLLMs

---

Complementary Tools

For deployment and publication, AiToEarn offers:

  • Open-source AI content monetization platform
  • Publishing to Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter)
  • Integrated analytics and AI model rankings
  • Docs: AiToEarn documentation
  • Repo: AiToEarn open-source repository

---

Conclusion

PettingLLMs bridges the gap between single-agent RL training and universal multi-agent group reinforcement — enabling coordinated, specialized, and scalable LLM agent evolution across diverse tasks.
