
# Bridging AI Evaluation with Real-World Business Impact  
*Magdalena Picariello reframes AI conversations around measurable business value, iterative development, and feedback-driven optimization.*

---

## Introduction  

Magdalena shifts focus from **algorithms and metrics** to **tangible business outcomes**, advocating for evaluation systems that go beyond accuracy. Her approach:  
- **Continuous feedback loops**  
- **Iterative development**  
- **Clear ROI alignment**  

---

## Key Takeaways  

- **Mindset Shift for Engineers**  
  Move from binary thinking (“it works or it doesn’t”) to embracing the **spectrum of correctness** in machine learning.  

- **Generative AI as a Black Box**  
  Debugging LLMs is different from traditional systems — rely on **LLM testing (“evals”)** for visibility into successes and failures.  

- **Data-Driven, User-Focused Development**  
  Start with **user expectations** → transform them into scalable, automated test cases.  

- **Business-Relevant Evaluation**  
  Assess success against **business KPIs**, not just technical metrics.  

- **Value-Driven Edge Cases**  
  Rare but high-value scenarios deserve dedicated test cases.  

---

### Practical Outcome  
Integrate **evaluation frameworks** that align technical metrics with business priorities.  
Example: [AiToEarn](https://aitoearn.ai/), an open-source AI content monetization platform enabling:  
- Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X)  
- Analytics and business-oriented metrics  

---

**Subscribe:** Apple Podcasts • YouTube • SoundCloud • Spotify • Overcast • [Podcast Feed](http://www.infoq.com/podcasts/ai-evaluation-driven-development/)  

---

## Why AI Demands a New Engineering Mindset  

Generative AI requires abandoning **binary logic** in favor of a **gradient perspective**: an output may be only 80% correct or partially true and still deliver value.  

Key implications:  
1. Systems **continuously evolve**; behavior is not a rigid true/false.  
2. Evaluation requires probabilistic and qualitative metrics.  
3. Feedback loops become critical for iteration.
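
To make the shift concrete, here is a minimal sketch of a graded check in Python, replacing a pass/fail assertion with a score between 0 and 1. The `coverage_score` helper, the sample answer, and the 0.6 threshold are illustrative assumptions, not details from the talk.

```python
# A graded check instead of a binary assert: the output earns a score in [0, 1].
# The helper, sample data, and threshold below are illustrative assumptions.

def coverage_score(answer: str, expected_facts: list[str]) -> float:
    """Fraction of expected facts that the answer mentions (0.0 to 1.0)."""
    if not expected_facts:
        return 1.0
    answer_lower = answer.lower()
    hits = sum(1 for fact in expected_facts if fact.lower() in answer_lower)
    return hits / len(expected_facts)

answer = "You can return items within 30 days if you still have the receipt."
facts = ["30 days", "receipt", "refund to the original payment method"]

score = coverage_score(answer, facts)          # 0.67: partially correct, still useful
print(f"score = {score:.2f}")
assert score >= 0.6, "below the agreed quality bar"
```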

---

## Real-World Challenges with GenAI  

**Magdalena’s insight:**  
- No clear ground truth in LLM outputs.  
- Output quality varies by **human preference** (culture, context).  
- Debugging = black box → need eval frameworks.  

**Example battle story:**  
- Client chatbot stuck at 60% accuracy.  
- Prompt tweaking failed → **solution:** build a **prompt testing system** instead of chasing the “perfect prompt”.
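
A minimal sketch of what such a prompt-testing system might look like: score a candidate system prompt against a whole set of cases instead of hand-tweaking it on one conversation. The `call_llm` stub, the test cases, and the substring checks are placeholders standing in for the client's real setup.

```python
# Sketch of a prompt-testing harness: grade a system prompt against many cases
# rather than hand-tuning it on a single chat. `call_llm`, the cases, and the
# substring checks are placeholders for the real client setup.

from typing import Callable

CASES = [
    {"question": "How do I reset my password?", "must_mention": ["reset link", "email"]},
    {"question": "Can I get a refund?", "must_mention": ["30 days", "receipt"]},
]

def call_llm(system_prompt: str, question: str) -> str:
    raise NotImplementedError("replace with your model/provider call")

def passes(answer: str, must_mention: list[str]) -> bool:
    return all(term.lower() in answer.lower() for term in must_mention)

def pass_rate(system_prompt: str, llm: Callable[[str, str], str]) -> float:
    """Fraction of cases the prompt handles acceptably."""
    results = [passes(llm(system_prompt, c["question"]), c["must_mention"]) for c in CASES]
    return sum(results) / len(results)

# pass_rate("You are a concise support assistant.", call_llm)  # e.g. 0.60 -> iterate
```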

---

## Stop Building Features — Start Delivering Outcomes  

**Approach:**  
1. Begin with **user expectations**.  
2. Transform into **automated, scalable test cases**.  
3. Use a **coverage matrix**:
   - **Segment users** (new vs. returning).
   - **Categorize queries** (billing, product, technical).
   - Map **frequency vs. business importance**.
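
A minimal sketch of building that matrix from logged queries, assuming each log entry carries a user segment and a query category (the field names are assumptions):

```python
# Build a coverage matrix from logged queries: tally traffic per
# (user segment, query category) cell. Field names are assumptions.

from collections import Counter

logged_queries = [
    {"segment": "new",       "category": "billing"},
    {"segment": "new",       "category": "product"},
    {"segment": "returning", "category": "technical"},
    {"segment": "returning", "category": "billing"},
    {"segment": "returning", "category": "billing"},
]

matrix = Counter((q["segment"], q["category"]) for q in logged_queries)

for (segment, category), count in sorted(matrix.items()):
    share = count / len(logged_queries)
    print(f"{segment:>9} | {category:<9} | {share:.0%} of traffic")
```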

---

### Coverage Matrix Benefits  
- Visualize distribution of queries.  
- Assign **business impact score**.  
- Multiply **occurrence × value** to prioritize cases.  

Example:  
- 1 in 1,000 wine fair visitors wants 1,000 bottles monthly — rare but **huge business value** → requires targeted handling.
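
A minimal sketch of that prioritization, multiplying how often a case occurs by what it is worth. The 1-in-1,000 occurrence comes from the wine-fair anecdote; the monetary values are illustrative placeholders.

```python
# Priority = occurrence x business value. The 1-in-1,000 occurrence comes from
# the wine-fair anecdote; all monetary values are illustrative placeholders.

cases = [
    {"name": "common billing question",             "occurrence": 0.300, "value": 5},
    {"name": "product recommendation",              "occurrence": 0.150, "value": 40},
    {"name": "bulk order: 1,000 bottles per month", "occurrence": 0.001, "value": 20_000},
]

for case in cases:
    case["priority"] = case["occurrence"] * case["value"]

for case in sorted(cases, key=lambda c: c["priority"], reverse=True):
    print(f'{case["name"]:<35} priority = {case["priority"]:>6.2f}')
```

Even though it appears once in a thousand queries, the bulk-order case lands at the top of the list, which is exactly the point of scoring by occurrence × value.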

---

## Align AI Metrics with Business KPIs  

Core principle: **Accuracy matters less than business impact**.  
- Identify rare, high-value cases.  
- Quantify impact in revenue, cost savings, or productivity gains.  
- Prioritize test development based on bottom-line contribution.

---

## Testing & Iteration Framework  

**Steps:**  
1. Build comprehensive, automated test coverage.  
2. Experiment with system prompt variations.  
3. Evaluate model performance with numeric/qualitative metrics.  
4. Maintain **human-in-the-loop** verification.  
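
A minimal sketch of steps 2–4, assuming a per-case scoring function already exists (for example, the harness sketched earlier); the prompt variants, the `run_eval` stub, and the 0.6 review threshold are illustrative.

```python
# Compare system-prompt variants on the same test set and flag weak cases for
# human review. `run_eval` is a stub for whatever per-case scoring you use;
# the variants and the 0.6 review threshold are illustrative.

from statistics import mean

PROMPT_VARIANTS = {
    "baseline": "You are a helpful support assistant.",
    "grounded": "You are a support assistant. Answer only from the provided context.",
}

def run_eval(system_prompt: str) -> list[float]:
    """Return one graded score per test case (plug in your own harness here)."""
    raise NotImplementedError

def compare(variants: dict[str, str], review_threshold: float = 0.6) -> None:
    for name, prompt in variants.items():
        scores = run_eval(prompt)
        flagged = [i for i, s in enumerate(scores) if s < review_threshold]
        print(f"{name}: mean score {mean(scores):.2f}, "
              f"{len(flagged)} case(s) routed to human review")
```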

---

## Selecting and Evaluating Models  

Avoid chasing hype. Instead:  
- Encode user needs into test cases.  
- Abstract the model from application logic.  
- Switch models → rerun tests → compare results.
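
A minimal sketch of that abstraction: hide the model behind a small interface so switching providers means writing one adapter and rerunning the same suite. The class and method names are assumptions, not any particular SDK.

```python
# Abstract the model behind a small interface; swapping providers then means
# adding an adapter and rerunning the same tests. Names are assumptions.

from typing import Protocol

class ChatModel(Protocol):
    def complete(self, system_prompt: str, user_message: str) -> str: ...

class ProviderAAdapter:
    def complete(self, system_prompt: str, user_message: str) -> str:
        raise NotImplementedError("call provider A's API here")

class LocalModelAdapter:
    def complete(self, system_prompt: str, user_message: str) -> str:
        raise NotImplementedError("call the locally hosted model here")

def suite_pass_rate(model: ChatModel, cases: list[dict]) -> float:
    """Run the same cases against any model and report a comparable pass rate."""
    passed = sum(
        1 for c in cases
        if all(term.lower() in model.complete(c["system"], c["question"]).lower()
               for term in c["must_mention"])
    )
    return passed / len(cases)
```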

Tools mentioned:  
- [DeepEval](https://github.com/confident-ai/deepeval)  
- [opik](https://github.com/comet-ml/opik)  
- [Evidently AI](http://github.com/evidentlyai/evidently)  
- [MLflow](https://github.com/mlflow/mlflow)  
- [Langfuse](https://github.com/langfuse/langfuse)  
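
As one concrete example, a test written against DeepEval's documented quickstart pattern might look like the sketch below; exact class names and signatures can differ between versions, so treat it as a starting point rather than a reference.

```python
# Sketch following DeepEval's quickstart pattern (pip install deepeval).
# Class names and signatures may differ between versions; verify against the docs.

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_answer():
    test_case = LLMTestCase(
        input="Can I get a refund after 30 days?",
        actual_output="Refunds are available within 30 days of purchase with a receipt.",
    )
    # Graded, LLM-judged relevancy with a pass threshold, not an exact-match assert.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```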

---

## Observability in AI Systems  

Observability isn’t just about the black box — it’s about **user interaction patterns**:  
- Where do conversations stall?  
- Do repeated queries indicate poor resolution?  
- Does latency hinder engagement?  

Translate **business KPIs** into **code + metrics** to measure if problems are solved and goals are met.
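
A minimal sketch of that translation, computing a resolution rate, a repeat-query rate, and a rough p95 latency from conversation logs; the log schema and sample values are assumptions.

```python
# Turn the observability questions above into numbers from conversation logs.
# The log schema (resolved, repeat_within_24h, latency_ms) is an assumption.

conversations = [
    {"resolved": True,  "repeat_within_24h": False, "latency_ms": 900},
    {"resolved": False, "repeat_within_24h": True,  "latency_ms": 2300},
    {"resolved": True,  "repeat_within_24h": False, "latency_ms": 1200},
    {"resolved": True,  "repeat_within_24h": True,  "latency_ms": 4100},
]

resolution_rate = sum(c["resolved"] for c in conversations) / len(conversations)
repeat_rate = sum(c["repeat_within_24h"] for c in conversations) / len(conversations)
latencies = sorted(c["latency_ms"] for c in conversations)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]   # crude percentile

print(f"resolution rate:   {resolution_rate:.0%}")  # are problems actually solved?
print(f"repeat-query rate: {repeat_rate:.0%}")      # proxy for poor resolution
print(f"p95 latency:       {p95_latency} ms")       # does latency hinder engagement?
```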

---

## Conclusion  

Magdalena’s strategy combines:  
- **User-first design**  
- **Data-driven evaluation**  
- **Business KPI alignment**  
- **Iterative testing & observability**  

Platforms like [AiToEarn](https://aitoearn.ai/) integrate all of these into a **practical, open-source workflow** for AI creators:  
- AI content generation  
- Automated multi-platform publishing  
- Performance analytics  
- [AI model ranking](https://rank.aitoearn.ai)  

---

**Podcast Links:**  
[RSS Feed](http://www.infoq.com/podcasts/ai-evaluation-driven-development/) • [SoundCloud](https://soundcloud.com/infoq-channel) • [Apple Podcasts](https://itunes.apple.com/gb/podcast/the-infoq-podcast/id1106971805?mt=2) • [Spotify](https://open.spotify.com/show/4NhWaYYpPWgWRDAOqeRQbj) • [Overcast](https://overcast.fm/itunes1106971805/the-infoq-podcast) • [YouTube](https://youtube.com/playlist?list=PLndbWGuLoHeZLVC9vl0LzLvMWHzpzIpir&si=Kvb9UpSdGzObuWgg)  

---
