Tencent Releases Ultra-Low-Cost AI Training Method: $17 Beats $9,700 Fine-Tuning

Training-Free GRPO: A Cost-Effective Breakthrough in LLM Optimization

Only 120 RMB (about $17), outperforming fine-tuning that costs 70,000 RMB (about $9,700)!

Tencent has introduced a new method for upgrading large-model agents: Training-Free Group Relative Policy Optimization (Training-Free GRPO).

Key idea:

No parameter updates are required: the method distills concise experiential knowledge into the prompt, yielding highly cost-effective performance gains.


Experimental highlights:

On mathematical reasoning and web search tasks, the DeepSeek-V3.1-Terminus model with Training-Free GRPO showed remarkable cross-domain performance improvements.

Compared with fine-tuning a 32B model, this approach:

  • Requires less training data
  • Is far cheaper when applied to a 671B LLM

> Comment from netizens: "So worth it!"

---

Background: Limitations of Parameter-Space Optimization

LLMs have evolved into general agents that excel at:

  • Complex reasoning
  • Web research
  • Generalized problem solving

However, in specialized scenarios that require:

  • External tools (calculators, APIs, etc.)
  • Specialized prompting strategies

…they often underperform due to unfamiliarity with domain-specific requirements.

Challenges with GRPO-based Parameter Tuning

Traditional GRPO applies reinforcement learning to task-specific optimization by updating model parameters.
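
For context, classic GRPO scores each rollout relative to its group and uses the result to weight a gradient update. A common formulation of the group-relative advantage (following the original GRPO objective) is:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\ldots,r_G)}{\operatorname{std}(r_1,\ldots,r_G)}, \qquad i = 1,\ldots,G,
$$

where $r_i$ is the reward of the $i$-th of $G$ rollouts sampled for the same prompt. In standard GRPO this scalar $\hat{A}_i$ drives an update to the parameters $\theta$; Training-Free GRPO keeps the group comparison but, as described below, replaces the scalar with natural-language feedback.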

While effective, this approach struggles with:

  • High computational cost
  • Poor cross-domain generalization
  • Limited training data availability
  • Diminishing returns

This raises a key question:

> Can LLM agents be improved non-parametrically, reducing data and computational cost?

---

The Proposed Solution: Training-Free GRPO

Tencent Youtu Lab’s Training-Free GRPO:

  • Keeps model parameters frozen
  • Maintains a lightweight experience library as plain tokens in the model's context
  • Optimizes performance without any parameter updates

Core concept:

Reuses the relative group evaluation logic of classic GRPO, but shifts it entirely to the inference stage.
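
A minimal sketch of that inference-time loop, assuming a hypothetical `llm` wrapper whose `solve`, `compare`, and `update_experiences` methods each issue one chat-completion call (the real prompts and agent scaffolding live in the youtu-agent repository):

```python
# Minimal sketch of the Training-Free GRPO outer loop.
# `llm.solve`, `llm.compare`, and `llm.update_experiences` are hypothetical
# wrappers around chat-completion calls, not an API from the paper's code.

def training_free_grpo(llm, problems, num_steps=3, group_size=4):
    experiences = []  # experience library starts empty; model weights stay frozen

    for _ in range(num_steps):
        for problem in problems:
            # 1. Sample a group of rollouts, conditioned on current experiences.
            group = [llm.solve(problem, context=experiences) for _ in range(group_size)]

            # 2. Compare rollouts within the group and express the "advantage"
            #    in natural language rather than as a scalar.
            semantic_advantage = llm.compare(problem, group, context=experiences)

            # 3. Distill the comparison into add/delete/modify/keep operations
            #    on the experience library, the only thing being "optimized".
            experiences = llm.update_experiences(experiences, semantic_advantage)

    return experiences  # reusable, human-readable knowledge for inference
```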

---

How It Works

  • Frozen parameters: model parameters (θ) remain fixed; no gradient updates.
  • Experience knowledge base: starts empty and is updated dynamically based on semantic advantages (illustrated below).
  • Natural-language advantages: group-relative performance feedback is expressed in plain text.
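
To make the experience knowledge base concrete, here is an invented illustration of what the in-context library might look like; the entry wording is made up, but the mechanism (plain text prepended to the task prompt) follows the paper:

```python
# Invented example entries; real experiences are distilled automatically
# by the LLM during optimization.
experience_base = [
    "Before committing to a long derivation, test small cases to find a pattern.",
    "For web research, issue one focused query per sub-question rather than one broad query.",
]

def with_experiences(problem: str) -> str:
    """Prepend the experience base to a task prompt as plain context tokens."""
    bullets = "\n".join(f"- {e}" for e in experience_base)
    return f"Useful experiences from past attempts:\n{bullets}\n\nProblem: {problem}"
```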

---

Step-by-Step Process

  • Generate analysis summaries: for each output in the group, the LLM M creates an analysis summary.
  • Explain success/failure: using the summaries plus the current experience base, M explains the reasons for each output's relative success or failure, then extracts concise experiential knowledge (sketched below).
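
A hedged sketch of how the second step could be posed as a single prompt; the wording below is invented, and only the structure (summaries plus current experiences in, distilled lessons out) follows the paper:

```python
def build_comparison_prompt(problem, summaries, experiences):
    """Ask M to explain relative success/failure and distill lessons.

    The prompt wording is invented for illustration; Tencent's actual
    prompts are in the youtu-agent repository.
    """
    exp_block = "\n".join(f"- {e}" for e in experiences) or "(empty)"
    sum_block = "\n".join(f"[{i}] {s}" for i, s in enumerate(summaries))
    return (
        f"Problem:\n{problem}\n\n"
        f"Current experiences:\n{exp_block}\n\n"
        f"Per-rollout analysis summaries:\n{sum_block}\n\n"
        "Explain why some rollouts did better than others, then distill the "
        "lessons into at most three concise, reusable experiences."
    )
```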

---

Updating the Experience Base

Instead of parameter updates (as in standard GRPO’s gradient ascent), Training-Free GRPO:

  • Add: append a new experience distilled from the natural-language advantage `A_text`
  • Delete: remove low-quality experiences
  • Modify: refine existing entries
  • Keep: leave the base unchanged

This updates the conditional policy by changing the context, not the parameters — guiding the model toward high-reward outputs.
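
A minimal sketch of applying those four operations programmatically; the operation names come from the paper, but the dictionary format and field names are assumptions for illustration:

```python
def apply_operations(experiences, operations):
    """Apply add/delete/modify/keep operations emitted by the LLM."""
    for op in operations:
        if op["op"] == "add":
            experiences.append(op["text"])
        elif op["op"] == "delete":
            experiences = [e for e in experiences if e != op["target"]]
        elif op["op"] == "modify":
            experiences = [op["text"] if e == op["target"] else e
                           for e in experiences]
        elif op["op"] == "keep":
            pass  # leave the base unchanged this round
    return experiences

# Example round of updates (contents invented):
ops = [{"op": "add", "text": "Verify units before giving a final numeric answer."}]
```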

Advantages:

  • Natural-language feedback acts as the optimization signal
  • The frozen base model keeps outputs stable, playing a role similar to the KL-divergence constraint in standard GRPO

---

Applications Beyond Research

Platforms such as AiToEarn integrate non-parametric optimization methods like Training-Free GRPO into creator ecosystems:

  • Multi-platform AI publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Analytics and AI model ranking
  • Streamlined content generation, analysis, and monetization

This demonstrates how innovations like Training-Free GRPO can bridge cutting-edge AI research with practical creative workflows.

---

Experimental Evaluation

Benchmarks:

  • Mathematical Reasoning (AIME24, AIME25)
  • Web Search (WebWalkerQA)

Focus: Expensive, challenging-to-fine-tune LLMs like DeepSeek-V3.1-Terminus.


---

Mathematical Reasoning Results

Baseline: DeepSeek-V3.1-Terminus + ReAct

  • AIME24: 80.0%
  • AIME25: 67.9%

With Training-Free GRPO:

  • AIME24: 82.7% (+2.7%)
  • AIME25: 73.3% (+5.4%)

Key point: achieved with only 100 cross-domain training samples and no gradient updates, at a cost of roughly $18, versus more than $10,000 for traditional RL fine-tuning of a 32B model.


---

Observations:

  • Performance improves with each learning step, even with just 100 training problems.
  • Tool usage drops as agents learn shortcuts and avoid redundant calls.

---

Web Search Results

Dataset: WebWalkerQA

Pass@1 improvement: 63.2% → 67.8%


---

Ablation Tests (51 sampled instances):

  • Adding raw experience context without optimization lowered performance
  • Training-Free GRPO (without ground truth) matched baseline Pass@1 but improved Pass@3
  • Full method: Best performance (Pass@1: 68.6%, Pass@3: 78.4%)

---

Limitations

Baseline model capability is critical:

On QwQ-32B, Training-Free GRPO scored only 25.5% Pass@1, worse than both DeepSeek-V3.1-Terminus and QwQ-32B's own untuned baseline.

This shows the method requires strong reasoning + tool-use capabilities to excel.

---

References

  • Paper: https://arxiv.org/abs/2510.08191
  • Reference Post: https://x.com/rohanpaul_ai/status/1978048482003890625
  • GitHub: https://github.com/TencentCloudADP/youtu-agent/tree/training_free_GRPO

---

> Conclusion:

> Training-Free GRPO significantly improves tool-augmented LLM performance at a fraction of the cost of fine-tuning.

> Its context-driven approach opens new opportunities for research, enterprise, and creator monetization via scalable, non-parametric optimization.
