Essential Skills for AI Product Development

The Definitive Guide to Systematically Building Evals

---

📌 Key Takeaways

  • Evaluation (Evals) is a systematic method to measure and improve AI applications — far beyond traditional unit testing.
  • Effective evaluation starts with error analysis on real data, led by a domain-informed benevolent dictator.
  • Use open coding to capture issues, then apply AI-assisted axial coding to prioritize failure modes.
  • LLM-as-a-judge can automate scoring of subjective outputs — but must be calibrated to align with human judgment.
  • Evaluation, intuition, and A/B testing are complementary QA methods; skipping data analysis is a major pitfall.

---

🎯 Background

Guests Hamel Husain and Shreya Shankar have pioneered making evaluation central to AI product development.

They teach the #1-rated Maven course on evals, training over 2,000 PMs and engineers across 500+ companies (including OpenAI & Anthropic).

This discussion covers:

  • Full end-to-end eval development
  • Common misconceptions
  • Best practices for lasting product improvement

---

1️⃣ Unveiling the Mystery of Evals

What Is Evaluation?

Evaluation is a systematic approach to measure and improve AI applications. Unlike unit testing, it deals with open-ended and unpredictable product behavior.

> Example:

> Without evals, improving a real estate chatbot is guesswork. With evals, PMs establish metrics that guide iteration confidently.

Unit Tests vs. Evals:

  • Unit tests: Verify defined rules and outputs.
  • Evals: Include qualitative checks, long-term metrics, and open-ended behavior monitoring (see the contrast sketched below).
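
A minimal sketch of the contrast, in Python with purely illustrative names: the unit test pins one exact output, while the eval scores open-ended behavior across many logged interactions and returns a pass rate rather than a single assertion.

```python
# Unit test: a fixed input must map to one exact, predefined output.
def parse_price(text: str) -> int:
    # Illustrative deterministic function under test.
    return int(text.replace("$", "").replace(",", ""))

def test_parse_price():
    assert parse_price("$1,200") == 1200

# Eval: judge open-ended behavior over many real traces; there is no
# single correct string, only a pass/fail verdict per case.
def run_eval(traces, evaluator) -> float:
    verdicts = [evaluator(t["user_input"], t["output"]) for t in traces]
    return sum(verdicts) / len(verdicts)  # pass rate, tracked over time

if __name__ == "__main__":
    test_parse_price()
    # Trivial illustrative evaluator: the reply must not be empty.
    traces = [{"user_input": "Any 2-bed units?", "output": "Yes, unit 4B is free."}]
    print(run_eval(traces, lambda user, out: bool(out.strip())))
```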

---

💡 Pro Tip: Think beyond accuracy — incorporate ongoing feedback loops using data analysis, qualitative review, and targeted experiments.

---

2️⃣ Practical Exercise: Error Analysis → Axial Coding

Step 1 – Error Analysis

  • Review application logs (traces: system prompts, user inputs, AI outputs).
  • Manually note issues across a sample of roughly 100 cases, continuing until theoretical saturation (no new failure modes appear); see the sampling sketch below.

> Example:

> If an AI assistant ends a conversation abruptly when no unit matches a request — note: “Should hand off to human agent.”
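
A minimal sketch of the sampling step, assuming traces are stored one JSON object per line; the file name and field names are illustrative, not prescribed by the guide.

```python
import json
import random

# Assumed trace log: one JSON object per line with "system_prompt",
# "user_input", and "output" fields (names are illustrative).
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Pull ~100 traces for manual review; stop early once new traces stop
# surfacing new failure modes (theoretical saturation).
random.seed(42)
sample = random.sample(traces, k=min(100, len(traces)))

for i, trace in enumerate(sample):
    print(f"--- Trace {i} ---")
    print("USER:", trace["user_input"])
    print("AI:  ", trace["output"])
    # Write one short open-coding note per trace, e.g.
    # "Should hand off to human agent."
```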

---

Step 2 – Open Coding

  • Human-led annotation to capture problems.
  • Led by a benevolent dictator (domain-savvy PM) to avoid indecision.

---

Step 3 – Axial Coding

  • Export open coding notes to CSV.
  • Use LLMs (Claude, ChatGPT) to cluster into major failure modes.
  • Refine categories, apply them to every note, and summarize the counts with a pivot table (see the sketch below).

> Output: A clear error frequency map — e.g., “17 cases of conversation flow problems.”
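
A minimal sketch of the summarizing step, assuming the open-coding notes were exported to CSV and a failure_mode column was added during axial coding; the file and column names are illustrative.

```python
import pandas as pd

# Assumed input: one row per annotated trace, with "trace_id" and a
# "failure_mode" label assigned during axial coding.
notes = pd.read_csv("open_coding_notes.csv")

# Count how often each failure mode appears: the error frequency map
# described above.
frequency = (
    notes.pivot_table(index="failure_mode", values="trace_id", aggfunc="count")
    .rename(columns={"trace_id": "count"})
    .sort_values("count", ascending=False)
)
print(frequency)
# e.g. conversation_flow_problem    17
#      missing_human_handoff         9
```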

---

3️⃣ Building Automated Evaluators

Two main approaches:

  • Code-based evaluators: deterministic checks, e.g., JSON format compliance.
  • LLM-as-a-judge: for subjective failure modes, returning a binary pass/fail verdict (both are sketched below).
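
A minimal sketch of both evaluator types, assuming the OpenAI Python client; the model name, prompt wording, and field names are placeholders rather than the course's exact setup.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM client would work

def json_format_check(output: str) -> bool:
    """Code-based evaluator: does the model output parse as valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

JUDGE_PROMPT = """You are evaluating a real estate assistant's reply.
Failure condition: the assistant ends the conversation without offering
a handoff to a human agent when no unit matches the request.

User message:
{user_input}

Assistant reply:
{output}

Answer with exactly one word: PASS or FAIL."""

def llm_judge(user_input: str, output: str) -> bool:
    """LLM-as-a-judge evaluator that returns a binary pass/fail verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(user_input=user_input, output=output),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")
```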

---

Best Practices

  • Judge prompts must define failure conditions clearly.
  • Avoid Likert scales; they make results harder to act on.
  • Always validate the judge model against human-labeled examples before deployment.
  • Track agreement and disagreement cases, not just a single percentage score (see the sketch below).
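
A minimal sketch of that calibration step, assuming a CSV with one row per trace containing a human verdict and a judge verdict; column names are illustrative.

```python
import pandas as pd

# Assumed input: "trace_id", "human_label", "judge_label" columns,
# each label being "pass" or "fail".
labels = pd.read_csv("judge_calibration.csv")

agreement = (labels["human_label"] == labels["judge_label"]).mean()
print(f"Raw agreement: {agreement:.0%}")

# The single percentage hides *where* the judge goes wrong, so list the
# disagreement cases for manual review.
disagreements = labels[labels["human_label"] != labels["judge_label"]]
print(disagreements[["trace_id", "human_label", "judge_label"]])

# A small confusion table shows whether the judge is too lenient
# (human: fail, judge: pass) or too strict (human: pass, judge: fail).
print(pd.crosstab(labels["human_label"], labels["judge_label"]))
```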

---

4️⃣ Myths, Intuition & A/B Testing

Common Misinterpretations:

  • Evals are just unit tests.
  • Intuition-driven development means evaluation can be skipped.

> Reality: Intuition works when developers are domain experts AND perform internal trials and error analysis.

Evaluation vs. A/B Testing:

  • A/B tests require metrics — provided by evaluation.
  • Ground your A/B tests in real data from error analysis, not untested hypotheses.

---

5️⃣ Keys to Successful Evaluation

Common Misconceptions

  • Plug-and-play tools ≠ full evaluation.
  • Skipping raw data review is a huge missed opportunity.

Core Advice for Beginners

  • Don’t fear messy data — aim for actionable improvement.
  • Use AI for assistance, not replacement, in summarizing insights.
  • Make data review painless: build lightweight annotation tools (a minimal sketch follows below).
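
A minimal sketch of such a tool: a terminal loop that shows each trace and records a free-text note to CSV; file names and fields are illustrative.

```python
import csv
import json

# Assumed input: traces stored one JSON object per line.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

# Write open-coding notes to a CSV that can feed the axial coding step.
with open("open_coding_notes.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["trace_id", "note"])
    for i, trace in enumerate(traces):
        print(f"\n--- Trace {i} ---")
        print("USER:", trace["user_input"])
        print("AI:  ", trace["output"])
        note = input("Note (leave blank if no issue): ").strip()
        if note:
            writer.writerow([i, note])
```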

---

Time Investment:

  • Initial setup: 3–4 days
  • Weekly upkeep: ~30 min

---

6️⃣ Lightning Round & Resources

Books:

  • Machine Learning — Mitchell (Occam’s Razor principle)
  • Artificial Intelligence: A Modern Approach — Russell & Norvig
  • Pachinko — Min Jin Lee
  • Apple in China

Shows:

  • Frozen (family-friendly)
  • The Wire (crime drama)

Tools:

  • Cursor
  • Claude Code

Mottos:

  • “Keep learning, think like a beginner.” — Hamel
  • “Understand the other person’s argument.” — Shreya

---

📍 Connect & Courses

  • Hamel Husain
  • Search “AI evals course” on Maven — by Shreya & Hamel

---

💡 Final Insight:

Blending human judgment with AI assistance results in the most reliable product improvement loop.
