Tencent Releases Ultra-Low-Cost AI Training Method: $17 Beats $9,700 Fine-Tuning

Training-Free GRPO: A Cost-Effective Breakthrough in LLM Optimization

Only 120 RMB (about $17), outperforming fine-tuning that costs 70,000 RMB (about $9,700)!

Tencent has introduced a new method for upgrading large-model agents: Training-Free Group Relative Policy Optimization (Training-Free GRPO).

Key idea:

No parameter updates are required: the method distills concise experiential knowledge into the prompt, yielding highly cost-effective performance gains.


Experimental highlights:

On mathematical reasoning and web search tasks, the DeepSeek-V3.1-Terminus model with Training-Free GRPO showed remarkable cross-domain performance improvements.

Compared with fine-tuning a 32B model, this approach:

  • Requires less training data
  • Is far cheaper when applied to a 671B LLM

> Comment from netizens: "So worth it!"

---

Background: Limitations of Parameter-Space Optimization

LLMs have evolved into general agents that excel at:

  • Complex reasoning
  • Web research
  • Generalized problem solving

However, in specialized scenarios that require:

  • External tools (calculators, APIs, etc.)
  • Specialized prompting strategies

…they often underperform due to unfamiliarity with domain-specific requirements.

Challenges with GRPO-based Parameter Tuning

Traditional GRPO applies reinforcement learning to task-specific optimization by updating model parameters.
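
For context, classic GRPO scores each rollout relative to its group and uses the result to weight a gradient update. A common formulation of the group-relative advantage (following the original GRPO objective) is:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)}, \qquad i = 1,\ldots,G,
$$

where $r_i$ is the reward of the $i$-th of $G$ rollouts sampled for the same prompt. In standard GRPO this scalar $\hat{A}_i$ drives an update to the parameters $\theta$; Training-Free GRPO keeps the group comparison but, as described below, replaces the scalar with natural-language feedback.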

While effective, this approach struggles with:

  • High computational cost
  • Poor cross-domain generalization
  • Limited training data availability
  • Diminishing returns

This raises a key question:

> Can LLM agents be improved non-parametrically, reducing data and computational cost?

---

The Proposed Solution: Training-Free GRPO

Tencent Youtu Lab’s Training-Free GRPO:

  • Keeps model parameters frozen
  • Maintains a lightweight experience library as plain tokens in the model's context
  • Optimizes performance without any parameter updates

Core concept:

Reuses the relative group evaluation logic of classic GRPO, but shifts it entirely to the inference stage.
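
A minimal sketch of that inference-time loop, assuming a hypothetical `llm` wrapper whose `solve`, `compare`, and `update_experiences` methods each issue one chat-completion call (the real prompts and agent scaffolding live in the youtu-agent repository):

```python
# Minimal sketch of the Training-Free GRPO outer loop.
# `llm.solve`, `llm.compare`, and `llm.update_experiences` are hypothetical
# wrappers around chat-completion calls, not an API from the paper's code.

def training_free_grpo(llm, problems, num_steps=3, group_size=4):
    experiences = []  # experience library starts empty; model weights stay frozen

    for _ in range(num_steps):
        for problem in problems:
            # 1. Sample a group of rollouts, conditioned on current experiences.
            group = [llm.solve(problem, context=experiences) for _ in range(group_size)]

            # 2. Compare rollouts within the group and express the "advantage"
            #    in natural language rather than as a scalar.
            semantic_advantage = llm.compare(problem, group, context=experiences)

            # 3. Distill the comparison into add/delete/modify/keep operations
            #    on the experience library, the only thing being "optimized".
            experiences = llm.update_experiences(experiences, semantic_advantage)

    return experiences  # reusable, human-readable knowledge for inference
```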

---

How It Works

  • Frozen parameters: model parameters (θ) remain fixed; no gradient updates.
  • Experience knowledge base: starts empty and is updated dynamically based on semantic advantages (illustrated below).
  • Natural-language advantages: group-relative performance feedback is expressed in plain text.
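
To make the experience knowledge base concrete, here is an invented illustration of what the in-context library might look like; the entry wording is made up, but the mechanism (plain text prepended to the task prompt) follows the paper:

```python
# Invented example entries; real experiences are distilled automatically
# by the LLM during optimization.
experience_base = [
    "Before committing to a long derivation, test small cases to find a pattern.",
    "For web research, issue one focused query per sub-question rather than one broad query.",
]

def with_experiences(problem: str) -> str:
    """Prepend the experience base to a task prompt as plain context tokens."""
    bullets = "\n".join(f"- {e}" for e in experience_base)
    return f"Useful experiences from past attempts:\n{bullets}\n\nProblem: {problem}"
```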

---

Step-by-Step Process

  • Generate analysis summaries: for each output in the group, the LLM M creates an analysis summary.
  • Explain success/failure: using the summaries plus the current experience base, M explains the reasons for each output's relative success or failure, then extracts concise experiential knowledge (sketched below).
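
A hedged sketch of how the second step could be posed as a single prompt; the wording below is invented, and only the structure (summaries plus current experiences in, distilled lessons out) follows the paper:

```python
def build_comparison_prompt(problem, summaries, experiences):
    """Ask M to explain relative success/failure and distill lessons.

    The prompt wording is invented for illustration; Tencent's actual
    prompts are in the youtu-agent repository.
    """
    exp_block = "\n".join(f"- {e}" for e in experiences) or "(empty)"
    sum_block = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries))
    return (
        f"Problem:\n{problem}\n\n"
        f"Current experiences:\n{exp_block}\n\n"
        f"Per-rollout analysis summaries:\n{sum_block}\n\n"
        "Explain why some rollouts did better than others, then distill the "
        "lessons into at most three concise, reusable experiences."
    )
```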

---

Updating the Experience Base

Instead of parameter updates (as in standard GRPO’s gradient ascent), Training-Free GRPO:

  • Add: append a new experience distilled from the natural-language advantage `A_text`
  • Delete: remove low-quality experiences
  • Modify: refine existing entries
  • Keep: leave the base unchanged

This updates the conditional policy by changing the context, not the parameters — guiding the model toward high-reward outputs.
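
A minimal sketch of applying those four operations programmatically; the operation names come from the paper, but the dictionary format and field names are assumptions for illustration:

```python
def apply_operations(experiences, operations):
    """Apply add/delete/modify/keep operations emitted by the LLM."""
    for op in operations:
        if op["op"] == "add":
            experiences.append(op["text"])
        elif op["op"] == "delete":
            experiences = [e for e in experiences if e != op["target"]]
        elif op["op"] == "modify":
            experiences = [op["text"] if e == op["target"] else e
                           for e in experiences]
        elif op["op"] == "keep":
            pass  # leave the base unchanged this round
    return experiences

# Example round of updates (contents invented):
ops = [{"op": "add", "text": "Verify units before giving a final numeric answer."}]
```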

Advantages:

  • Natural-language feedback acts as the optimization signal
  • The frozen base model keeps outputs stable, playing a role similar to the KL-divergence constraint in standard GRPO

---

Applications Beyond Research

Platforms such as AiToEarn integrate non-parametric optimization methods like Training-Free GRPO into creator ecosystems:

  • Multi-platform AI publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Analytics and AI model ranking
  • Streamlined content generation, analysis, and monetization

This demonstrates how innovations like Training-Free GRPO can bridge cutting-edge AI research with practical creative workflows.

---

Experimental Evaluation

Benchmarks:

  • Mathematical Reasoning (AIME24, AIME25)
  • Web Search (WebWalkerQA)

Focus: Expensive, challenging-to-fine-tune LLMs like DeepSeek-V3.1-Terminus.


---

Mathematical Reasoning Results

Baseline: DeepSeek-V3.1-Terminus + ReAct

  • AIME24: 80.0%
  • AIME25: 67.9%

With Training-Free GRPO:

  • AIME24: 82.7% (+2.7%)
  • AIME25: 73.3% (+5.4%)

Key point: achieved with only 100 cross-domain training samples and no gradient updates, at a cost of roughly $18, versus more than $10,000 for traditional RL fine-tuning of a 32B model.


---

Observations:

  • Performance improves with each learning step, even with just 100 training problems.
  • Tool usage drops as agents learn shortcuts and avoid redundant calls.

---

Web Search Results

Dataset: WebWalkerQA

Pass@1 improvement: 63.2% → 67.8%


---

Ablation Tests (51 sampled instances):

  • Adding raw experience context without optimization lowered performance
  • Training-Free GRPO (without ground truth) matched baseline Pass@1 but improved Pass@3
  • Full method: Best performance (Pass@1: 68.6%, Pass@3: 78.4%)

---

Limitations

Baseline model capability is critical:

On QwQ-32B, Training-Free GRPO scored only 25.5% Pass@1, worse than both DeepSeek-V3.1-Terminus and QwQ-32B's own untuned baseline.

This shows the method requires strong reasoning + tool-use capabilities to excel.

---

References

  • Paper: https://arxiv.org/abs/2510.08191
  • Reference Post: https://x.com/rohanpaul_ai/status/1978048482003890625
  • GitHub: https://github.com/TencentCloudADP/youtu-agent/tree/training_free_GRPO

---

> Conclusion:

> Training-Free GRPO significantly improves tool-augmented LLM performance at a fraction of the cost of fine-tuning.

> Its context-driven approach opens new opportunities for research, enterprise, and creator monetization via scalable, non-parametric optimization.
