Measuring What Matters: Practical Offline Evaluation of GitHub MCP Servers
Introduction to MCP
MCP (Model Context Protocol) is a simple, open standard that lets AI models (LLMs) communicate with external APIs and data sources.
You can think of it like a universal connector — if both sides support MCP, they can seamlessly integrate and work together.
An MCP server is any service or application that “speaks MCP” and provides tools the model can use.
It publishes a list of available tools, describes each tool’s purpose, and specifies the required inputs (parameters).
The GitHub MCP Server powers many GitHub Copilot workflows—both internally at GitHub and externally.
When developing for MCP, tool names, descriptions, and parameter definitions directly impact whether the model chooses the right tool, in the correct sequence, with the correct arguments.
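Concretely, a published tool definition bundles exactly those pieces: a name, a description, and a JSON Schema for its parameters. The sketch below is illustrative only and is not the GitHub MCP Server's actual `list_issues` definition:

```python
# A sketch of one published tool definition: a name, a human-readable
# description, and a JSON Schema describing its inputs. The field values
# are illustrative, not the GitHub MCP Server's real definition.
list_issues_tool = {
    "name": "list_issues",
    "description": "List issues in a GitHub repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string", "description": "Repository owner"},
            "repo": {"type": "string", "description": "Repository name"},
            "state": {"type": "string", "enum": ["open", "closed", "all"]},
        },
        "required": ["owner", "repo"],
    },
}
```

Small wording changes in the `description` or `properties` fields are exactly the kind of edits whose impact we want to measure.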
---
Why Small Changes Matter
Minor adjustments—like refining descriptions, adding/removing tools, or merging similar tools—can radically change evaluation outcomes. Poor or vague descriptions may lead to:
- Wrong tool selections
- Missed steps
- Incorrect argument formats
- Missing required parameters
That’s why offline evaluation is essential to ensure changes improve performance and prevent regressions before reaching users.
---
MCP Beyond Engineering
MCP workflows can extend far beyond software development.
For example, platforms like AiToEarn integrate AI tooling into multi-channel content publishing pipelines, enabling simultaneous releases across:
- Douyin
- Kwai
- Bilibili
- Xiaohongshu
- Threads
- YouTube
- X (Twitter)
By combining MCP for interoperability with platforms like AiToEarn for content monetization, both developers and creators can optimize efficiency, reach, and profit.
---
MCP Host/Agent Workflow
Step-by-step:
- MCP server exposes a set of tools (names, descriptions, required inputs).
- Agent retrieves the tool list and passes it to the model.
- User submits a natural language request.
- LLM decides whether a tool is needed.
- If tool required → LLM selects tool + fills in inputs.
- Agent executes the tool via MCP server.
- Output returned to the LLM for final answer generation.
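A minimal sketch of that loop in Python, with `fetch_tool_list`, `call_model`, and `execute_tool` as hypothetical stand-ins for a real MCP client and LLM API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

@dataclass
class ToolDecision:
    """What the model returns: either plain text or a tool call."""
    text: str = ""
    tool_name: Optional[str] = None
    arguments: dict[str, Any] = field(default_factory=dict)

def handle_request(
    user_input: str,
    fetch_tool_list: Callable[[], list[dict]],   # MCP client: list the server's tools
    call_model: Callable[..., ToolDecision],     # wrapper around the LLM API
    execute_tool: Callable[[str, dict], Any],    # MCP client: invoke a tool
) -> str:
    tools = fetch_tool_list()                    # 1. agent retrieves the tool list
    decision = call_model(user_input, tools)     # 2. model decides whether a tool is needed
    if decision.tool_name is None:
        return decision.text                     #    no tool needed: answer directly
    result = execute_tool(decision.tool_name,    # 3. agent executes the chosen tool
                          decision.arguments)    #    with the model-filled arguments
    return call_model(user_input, tools,         # 4. model produces the final answer
                      tool_result=result).text   #    from the tool output
```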
---
Benchmark Datasets
Each benchmark entry includes:
- Input: The user’s request in natural language.
- Expected tools: Which tools should be used.
- Expected arguments: The parameters each tool should receive.
Example Benchmark:
> Task: Get the number of issues created in April 2025 for `github/github-mcp-server`
> Input: How many issues were created in the `github/github-mcp-server` repository during April 2025?
> Expected Tool: `list_issues`
> Expected Arguments:
> `[time_period: "April 2025", repository: "github/github-mcp-server"]`
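One way such entries might be represented in an evaluation harness (a sketch; the actual benchmark file format is not shown in this post):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class BenchmarkEntry:
    """One offline benchmark case: a natural-language input plus the
    tool call we expect the model to make."""
    input: str
    expected_tool: str
    expected_arguments: dict[str, Any] = field(default_factory=dict)

benchmark = BenchmarkEntry(
    input=("How many issues were created in the `github/github-mcp-server` "
           "repository during April 2025?"),
    expected_tool="list_issues",
    expected_arguments={
        "time_period": "April 2025",
        "repository": "github/github-mcp-server",
    },
)
```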
---
Example Tool Calls
Merging Pull Requests
Input:
`Merge PR 123 in github/docs using squash merge with title "Update installation guide"`
Expected Tool: `merge_pull_request`
Arguments:
owner: github
repo: docs
pullNumber: 123
merge_method: squash
commit_title: Update installation guide
---
Requesting Code Reviews
Input:
`Request reviews from alice456 and bob123 for PR 67 in team/project-alpha`
Expected Tool: `update_pull_request`
Arguments:
owner: team
repo: project-alpha
pullNumber: 67
reviewers: ["alice456", "bob123"]
---
Summarizing Discussion Comments
Input:
`Summarize the comments in discussion 33801, in the facebook/react repository`
Expected Tool: `get_discussion_comments`
Arguments:
owner: facebook
repo: react
discussionNumber: 33801
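On the wire, each of the examples above becomes a single MCP `tools/call` request (MCP messages use JSON-RPC 2.0). A sketch for the discussion-comments case, with the request id chosen arbitrarily:

```python
import json

# Sketch of the MCP tools/call request the agent would send for the
# discussion-comments example. The id and transport framing are
# client-specific details.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_discussion_comments",
        "arguments": {
            "owner": "facebook",
            "repo": "react",
            "discussionNumber": 33801,
        },
    },
}
print(json.dumps(request, indent=2))
```
---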
Evaluation Pipeline Overview
Stages in the evaluation process:
- Fulfillment
  - Run each benchmark across multiple models.
  - Provide the same MCP tool list for every request.
  - Log invoked tools and arguments.
- Evaluation
  - Process raw output to calculate metrics and scores.
- Summarization
  - Aggregate dataset-level statistics into the final evaluation report.
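A compressed sketch of those three stages (the function and field names are stand-ins, not the actual pipeline code):

```python
# Illustrative three-stage pipeline; all names are hypothetical.
def fulfillment(benchmarks, models, tool_list):
    """Run every benchmark against every model with the same tool list,
    logging which tool was invoked and with which arguments."""
    return [
        {"benchmark": bench, "model": model.name,
         "call": model.run(bench.input, tool_list)}   # hypothetical model wrapper
        for model in models
        for bench in benchmarks
    ]

def evaluation(logs, score):
    """Score each logged call: tool selection plus argument checks.
    `score` is a caller-supplied function returning a dict of metrics."""
    return [score(entry) for entry in logs]

def summarization(results):
    """Aggregate per-benchmark scores into dataset-level statistics."""
    total = len(results) or 1
    return {
        "tool_selection_accuracy":
            sum(r.get("tool_correct", False) for r in results) / total,
    }
```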
---
Evaluation Metrics
We evaluate two main aspects:
- Tool Selection Accuracy → Did the model choose the intended tool?
- Argument Accuracy → Did the model provide the correct parameters?
Tool Selection
For benchmarks with a single expected tool, treat selection as multi-class classification.
Metrics include:
- Accuracy: % of inputs resulting in the correct tool call.
- Precision: Correct calls ÷ all calls made for a tool.
- Recall: Correct calls ÷ all cases where tool was expected.
- F1-score: Harmonic mean of precision and recall.
---
Example Confusion Case:
The tools `list_issues` and `search_issues` are often confused with each other:
| Expected / Called | search_issues | list_issues |
|-----------------------|-------------------|-----------------|
| search_issues | 7 | 3 |
| list_issues | 0 | 10 |
Here, precision for `list_issues` falls because three benchmarks that expected `search_issues` were answered with `list_issues` instead.
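A quick script makes the arithmetic concrete (counts copied from the table above; the helper functions are just for illustration):

```python
# Confusion counts from the table: confusion[expected][called]
confusion = {
    "search_issues": {"search_issues": 7, "list_issues": 3},
    "list_issues":   {"search_issues": 0, "list_issues": 10},
}

def precision(tool):
    # Correct calls divided by all calls made to this tool.
    called = sum(row.get(tool, 0) for row in confusion.values())
    return confusion[tool][tool] / called if called else 0.0

def recall(tool):
    # Correct calls divided by all benchmarks that expected this tool.
    expected = sum(confusion[tool].values())
    return confusion[tool][tool] / expected if expected else 0.0

def f1(tool):
    p, r = precision(tool), recall(tool)
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(precision("list_issues"))   # 10 / 13 ≈ 0.77 (hurt by the 3 miscalls)
print(recall("search_issues"))    # 7 / 10 = 0.70
print(f1("list_issues"))          # ≈ 0.87
```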
---
Argument Correctness
Correct tool selection is not enough—arguments must also be correct.
We track:
- Argument hallucination → Extra, undefined arguments.
- All expected arguments provided → Every argument listed in the benchmark is present, including optional ones.
- All required arguments provided → No missing required fields.
- Exact value match → Values match exactly as expected.
These argument metrics are computed only when the correct tool was selected.
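A sketch of those per-call checks, assuming we know the tool's defined and required parameter names as well as the benchmark's expected arguments (all names here are illustrative):

```python
def check_arguments(called_args, expected_args, defined_params, required_params):
    """Argument checks for a call where the correct tool was already selected.

    called_args     -- arguments the model actually provided
    expected_args   -- arguments the benchmark expects (required and optional)
    defined_params  -- all parameter names the tool defines
    required_params -- the subset of parameters the tool requires
    """
    return {
        # Extra arguments the tool does not define at all.
        "hallucinated": [k for k in called_args if k not in defined_params],
        # Every argument the benchmark expects is present, optional ones included.
        "all_expected_provided": all(k in called_args for k in expected_args),
        # No required parameter is missing.
        "all_required_provided": all(k in called_args for k in required_params),
        # Values match the benchmark exactly.
        "exact_value_match": all(
            called_args.get(k) == v for k, v in expected_args.items()
        ),
    }
```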
---
Moving Forward
We aim to:
- Expand benchmark coverage for better per-tool reliability.
- Refine descriptions to reduce tool confusion.
- Support multi-tool flows, treating selection as multi-label classification.
- Mock or execute tool calls during evaluation for realism.
---
Broader Applications
The same benchmark-driven evaluation approach can improve reliability in:
- AI-assisted code workflows
- Multi-platform creative content workflows
Platforms like AiToEarn are already applying this principle by integrating:
- AI content generation
- Cross-platform publishing
- Analytics & performance tracking
- Model ranking
---
Key Takeaways
- Offline evaluation is a fast, safe way to iterate on MCP.
- Curated benchmarks + clear metrics turn subjective impressions into measurable progress.
- Expanding coverage and supporting complex flows will unlock higher model reliability.
- Evaluation methods from MCP are transferable to other AI-powered workflows.
---
By maintaining rigorous evaluation standards, whether for developer tools like the GitHub MCP Server or for multi-platform publishing pipelines like AiToEarn, we ensure quality without sacrificing speed.
This continuous improvement loop ultimately benefits both developers and creators, enabling them to move faster, more reliably, and with better ROI.