
Lots of Insights! Transcript of Silicon Valley’s Internal Discussion on AI Agents – Oct 2, 2025

Honghao Wang

15 Oct 2025

Datawhale Insights: Why 95% of AI Agents Fail in Production

At a recent Silicon Valley industry event, Oana Olteanu—a renowned venture capitalist—joined engineers and ML leads from Uber, WisdomAI, EvenUp, and Datastrato to discuss how AI Agents can succeed in production environments.

Key statistic: 95% of AI Agents fail in production — not because the models lack intelligence, but because supporting systems (context engineering, security, memory architecture) are immature or missing.

Original link: https://www.motivenotes.ai/p/what-makes-5-of-ai-agents-actually

> Most founders think they're building an AI product, but in reality, they’re building a context selection system.

---

Under the Hood of an AI Agent

Panelists explored the foundational layers necessary for production-grade AI:

Advanced Context Selection
Semantic Layers
Memory Orchestration
Governance Mechanisms
Multi-Model Routing Strategies

This synthesis was compiled by Datawhale from Oana’s seminar insights.

---

Context Engineering ≠ Prompt Tricks

Why RAG Is Often Enough—But Rarely Well-Built

Most panelists agreed that model fine-tuning is rarely required. If retrieval-augmented generation (RAG) is robust, performance can be excellent. Unfortunately, most current RAG systems are too simplistic.

Top failure patterns:

Over-indexing: Too much irrelevant data confuses the model
Under-indexing: Insufficient signals lead to poor answers
Mixing structured and unstructured data: Naive mixing breaks embeddings or leads to an oversimplified architecture

---

What Advanced Context Engineering Looks Like

a) Feature Selection for LLMs

Think of context engineering as feature engineering for LLMs:

Selective context pruning = feature selection
Context validation = schema/type/timeliness checks
Context observability = tracking which inputs improve or degrade output quality
Embedding augmentation with metadata

Implication: Context becomes a versionable, auditable, testable artifact—not just text.
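
To make the feature-selection analogy concrete, here is a minimal Python sketch of validation followed by pruning. The `ContextItem` schema, the 30-day freshness window, the relevance cutoff, and the character budget are all illustrative assumptions, not details from the panel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    """One candidate piece of context (hypothetical schema)."""
    text: str
    source: str
    updated_at: datetime
    relevance: float  # score from an upstream retriever, assumed to lie in [0, 1]


def validate(item: ContextItem, now: datetime, max_age: timedelta) -> bool:
    """Context validation: schema, type, and timeliness checks."""
    if not item.text.strip() or not item.source:
        return False
    if not 0.0 <= item.relevance <= 1.0:
        return False
    return now - item.updated_at <= max_age


def select_context(items, now, max_age=timedelta(days=30),
                   min_relevance=0.5, budget_chars=200):
    """Selective pruning: keep valid, relevant items until the budget is spent."""
    valid = [i for i in items
             if validate(i, now, max_age) and i.relevance >= min_relevance]
    valid.sort(key=lambda i: i.relevance, reverse=True)
    selected, used = [], 0
    for item in valid:
        if used + len(item.text) > budget_chars:
            continue  # skip items that would overflow the budget
        selected.append(item)
        used += len(item.text)
    return selected


now = datetime(2025, 10, 2, tzinfo=timezone.utc)
items = [
    ContextItem("Q3 revenue definition: net of refunds.", "glossary",
                now - timedelta(days=3), 0.9),
    ContextItem("Old pricing table from 2021.", "wiki",
                now - timedelta(days=900), 0.8),   # stale, fails validation
    ContextItem("Office party photos.", "chat",
                now - timedelta(days=1), 0.1),     # irrelevant, pruned
]
selected = select_context(items, now)
print([i.source for i in selected])  # ['glossary']
```

Because the selection is a pure function of its inputs and thresholds, it can be versioned and unit-tested like any other artifact.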

---

b) Dual-Layer: Semantic + Metadata

Semantic layer: Classical vector search
Metadata layer: Filters by type, timestamp, permissions, domain ontology

Benefits:

Keeps retrieval relevant and structured, not just similar.
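
A minimal sketch of the two layers, assuming toy two-dimensional vectors in place of a real embedding model and a hypothetical `Doc` record with `doc_type` and `year` metadata. The metadata layer applies hard filters first, and the semantic layer ranks only what survives.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    embedding: tuple  # toy vector; a real system would use an embedding model
    doc_type: str
    year: int


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, docs, *, doc_type, min_year, k=2):
    # Metadata layer: hard filters on type and timestamp.
    candidates = [d for d in docs
                  if d.doc_type == doc_type and d.year >= min_year]
    # Semantic layer: rank the remaining candidates by vector similarity.
    ranked = sorted(candidates,
                    key=lambda d: cosine(query_vec, d.embedding),
                    reverse=True)
    return [d.doc_id for d in ranked[:k]]


DOCS = [
    Doc("policy-2025", (0.9, 0.1), "policy", 2025),
    Doc("policy-2019", (1.0, 0.0), "policy", 2019),  # most similar, but outdated
    Doc("memo-2025", (0.99, 0.01), "memo", 2025),    # similar, but wrong type
    Doc("policy-2024", (0.2, 0.8), "policy", 2024),
]

result = retrieve((1.0, 0.0), DOCS, doc_type="policy", min_year=2024)
print(result)  # ['policy-2025', 'policy-2024']
```

Pure similarity search would have returned `policy-2019` first; the metadata filter removes it before ranking.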

---

c) Real Challenges in Text-to-SQL

When asked, “Has anyone deployed text-to-SQL in production?”, no one raised a hand.

Reason: Understanding what a natural-language question means in its business context is harder than the technical mapping to SQL.

---

Succeeding with Structured Query Systems

Winning approach:

Business glossary & terminology mapping
Query templates with constraints
Semantic validation before execution
Feedback loops to improve understanding
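
The first three items can be sketched as follows, using Python's built-in `sqlite3`. The glossary entries, the `sales` table, and the allowed dimensions are hypothetical. The sketch assumes an upstream step (for example, an LLM) has already extracted the metric term, dimension, and year from the user's question, and it leaves the feedback loop out.

```python
import sqlite3

# Hypothetical business glossary: user-facing terms -> vetted SQL expressions.
GLOSSARY = {
    "revenue": "SUM(amount)",
    "orders": "COUNT(*)",
}
ALLOWED_DIMENSIONS = {"region", "product"}

# Query template: only whitelisted fragments are ever substituted.
TEMPLATE = ("SELECT {dim}, {metric} AS value FROM sales "
            "WHERE year = ? GROUP BY {dim} ORDER BY {dim}")


def build_query(metric_term: str, dimension: str, year: int):
    """Validate the request before execution; reject anything unknown."""
    metric = GLOSSARY.get(metric_term.lower())
    if metric is None:
        raise ValueError(f"unknown metric term: {metric_term!r}")
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimension not allowed: {dimension!r}")
    if not 2000 <= year <= 2100:
        raise ValueError(f"year out of range: {year}")
    return TEMPLATE.format(dim=dimension, metric=metric), (year,)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("EU", "a", 2025, 10.0), ("EU", "b", 2025, 5.0),
     ("US", "a", 2025, 7.0), ("US", "a", 2024, 99.0)],
)

sql, params = build_query("Revenue", "region", 2025)
rows = conn.execute(sql, params).fetchall()
print(rows)  # [('EU', 15.0), ('US', 7.0)]
```

The model never writes SQL directly in this design; it only fills constrained slots, so an unrecognized term fails loudly instead of producing a plausible but wrong query.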

---

Governance & Trust — Not Just Enterprise Concerns

Critical requirements:

Track input-output lineage
Enforce role-based access control
Customize output per user permissions

> “Two employees asking the same question should get different answers if permissions differ.”

Solution trend: Unified metadata catalogs embedding access policies directly into indexing/search.
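
A minimal sketch of permission-aware answering with lineage, assuming a hypothetical in-memory catalog in which each record carries its own access policy. The "answer" is just the joined text of the visible records; a real system would pass them to a model instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    allowed_roles: frozenset  # access policy stored alongside the data


@dataclass(frozen=True)
class Answer:
    question: str
    role: str
    text: str
    lineage: tuple  # IDs of the records that fed this answer


CATALOG = [
    Record("hc-1", "Headcount: 120", frozenset({"employee", "hr"})),
    Record("sal-1", "Median salary: 95k", frozenset({"hr"})),
]


def answer(question: str, role: str, catalog=CATALOG) -> Answer:
    # Role-based access control is enforced at retrieval time,
    # so restricted records never reach the generation step.
    visible = [r for r in catalog if role in r.allowed_roles]
    text = "; ".join(r.text for r in visible) or "No accessible data."
    # Input-output lineage: keep the question, role, and source IDs together.
    return Answer(question, role, text, tuple(r.record_id for r in visible))


as_employee = answer("Team stats?", "employee")
as_hr = answer("Team stats?", "hr")
print(as_employee.text)  # Headcount: 120
print(as_hr.text)        # Headcount: 120; Median salary: 95k
print(as_hr.lineage)     # ('hc-1', 'sal-1')
```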

---

Trust Is a Human Problem

Story: One guest’s wife won’t allow Tesla Autopilot to be used, not because it fails to work but because she doesn’t trust it.

The same holds in enterprise AI: reliability, explainability, and auditability matter more than raw capability.

---

Common Trait in the Top 5%

Design for human-in-the-loop:

Humans review + override AI decisions
Continuous improvement via feedback loops
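
One way to picture this is a review gate. The sketch below is hypothetical: the `ReviewQueue` class and the confidence threshold are assumptions for illustration, not something the panel specified. Low-confidence proposals wait for a human, and any override is recorded as feedback.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewQueue:
    threshold: float = 0.8  # assumed confidence cutoff
    pending: list = field(default_factory=list)
    feedback: list = field(default_factory=list)

    def submit(self, item_id: str, proposal: str, confidence: float):
        """Auto-approve confident proposals; hold the rest for a human."""
        if confidence >= self.threshold:
            return proposal
        self.pending.append((item_id, proposal))
        return None

    def review(self, item_id: str, final: str) -> str:
        """A human confirms or overrides; overrides are kept as feedback."""
        for entry in self.pending:
            if entry[0] == item_id:
                self.pending.remove(entry)
                if entry[1] != final:
                    # (id, AI proposal, human decision) for later improvement
                    self.feedback.append((item_id, entry[1], final))
                return final
        raise KeyError(item_id)


queue = ReviewQueue()
auto = queue.submit("t1", "approve refund", 0.95)  # confident: passes through
held = queue.submit("t2", "deny claim", 0.40)      # uncertain: waits for review
final = queue.review("t2", "approve claim")        # human overrides the AI
print(auto, held, final)  # approve refund None approve claim
print(queue.feedback)     # [('t2', 'deny claim', 'approve claim')]
```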

---

Memory: An Architectural Choice

Memory Tiers

User-level: Personal preferences, style, tone
Team-level: Shared queries, dashboards
Org-level: Policies, institutional knowledge

Best practice: Abstract memory into independent layers with version control.
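
A minimal sketch of that layering, assuming a hypothetical `MemoryLayer` class with append-only history as the version control. The lookup order (user, then team, then org) is also an assumption; some keys, such as compliance policies, may need the opposite precedence.

```python
from collections import defaultdict


class MemoryLayer:
    """One independent memory tier with append-only version history."""

    def __init__(self, name: str):
        self.name = name
        self._history = defaultdict(list)  # key -> values, oldest first

    def set(self, key, value) -> int:
        self._history[key].append(value)
        return len(self._history[key])  # 1-based version number

    def get(self, key, version=None):
        versions = self._history.get(key)
        if not versions:
            return None
        return versions[-1] if version is None else versions[version - 1]


def resolve(key, user, team, org):
    """Look up a key from the most specific tier to the most general."""
    for layer in (user, team, org):
        value = layer.get(key)
        if value is not None:
            return layer.name, value
    return None, None


user, team, org = MemoryLayer("user"), MemoryLayer("team"), MemoryLayer("org")
org.set("tone", "formal")
org.set("retention_policy", "90 days")
user.set("tone", "casual")
user.set("tone", "concise")  # version 2 for this user

print(resolve("tone", user, team, org))              # ('user', 'concise')
print(resolve("retention_policy", user, team, org))  # ('org', '90 days')
print(user.get("tone", version=1))                   # casual
```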

---

Personalization vs Privacy

Memory offers:

Behavior customization
Event-driven proactive assistance

Cold-start example: Uber’s BI chat tool used past queries to recommend new ones and spark conversation.

Privacy risk: Over-personalization can feel intrusive, e.g., an AI bringing up a user’s children’s names unprompted.

---

Designing Memory Responsibly

Balance UX benefits with privacy concerns
Avoid memory scope creep so that access controls stay intact
Goal: A secure, user-controlled portable memory layer

---

Multi-Model Reasoning & Orchestration

Model routing criteria:

Task complexity
Latency
Cost
Compliance requirements
Query type

Example pipeline:

Simple queries → local models
Structured queries → DSL/SQL translators
Complex analysis → GPT-4 / Anthropic / Gemini
Verification → dual-model redundancy

Benefit: Optimizes both performance & cost without brittleness.
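
A rule-based router along these lines might look like the sketch below. The `Query` fields, the backend names, and the one-second latency cutoff are illustrative assumptions, and the query type is assumed to come from an upstream classifier. The dual-model verification step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    text: str
    kind: str                  # "lookup", "structured", or "analysis"
    sensitive: bool = False    # compliance: data must stay on local infrastructure
    max_latency_ms: int = 5000


def route(q: Query) -> str:
    """Pick a backend from compliance, query type, and latency needs."""
    if q.sensitive:
        return "local-model"            # compliance overrides everything else
    if q.kind == "structured":
        return "sql-translator"         # DSL/SQL path for structured queries
    if q.kind == "lookup" or q.max_latency_ms < 1000:
        return "local-model"            # simple or latency-critical requests
    return "hosted-frontier-model"      # complex analysis


routes = [
    route(Query("What is our refund window?", "lookup")),
    route(Query("Orders by region last quarter", "structured")),
    route(Query("Why did churn rise in Q3?", "analysis")),
    route(Query("Summarize this patient record", "analysis", sensitive=True)),
]
print(routes)
# ['local-model', 'sql-translator', 'hosted-frontier-model', 'local-model']
```

Keeping the rules in one small function makes the routing policy easy to test and to change when prices or models shift.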

---

Chat Interface ≠ Always Optimal

When chat shines: It lowers the learning curve for complex tasks such as BI.

When a GUI shines: Adjusting charts and visuals after the results are returned.

Hybrid UX approach:

Chat for entry
GUI for refinement
Mode choice per task

---

What’s Still Missing

1. Context Observability

Systematic tools to see which contexts improve or harm model responses.

2. Composable Memory

Secure, portable, user-organized memory not tied to any provider.

3. Domain-Aware Languages

High-level, constraint-safe DSLs instead of brittle text-to-SQL.

4. Latency-Aware UX

Match response speed to task requirements.
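
For the first gap, context observability, one simple approach is leave-one-out ablation: remove each context item in turn and measure how the output score changes. In the sketch below, `toy_score` is a stand-in assumption for a real evaluation harness that would run the model and grade its answer.

```python
def ablation_report(context_items, score_fn):
    """Score change when each item is removed; positive means the item helps.

    Items must be unique and hashable, since they are used as dict keys.
    """
    baseline = score_fn(context_items)
    report = {}
    for i, item in enumerate(context_items):
        without = context_items[:i] + context_items[i + 1:]
        report[item] = baseline - score_fn(without)
    return report


def toy_score(items):
    """Hypothetical scorer: refund documents help, party notes distract."""
    helpful = sum(1 for x in items if "refund" in x)
    harmful = sum(1 for x in items if "party" in x)
    return helpful - harmful


context = ["refund policy v3", "office party notes", "shipping FAQ"]
report = ablation_report(context, toy_score)
print(report)
# {'refund policy v3': 1, 'office party notes': -1, 'shipping FAQ': 0}
```

Each ablation requires one extra evaluation run per context item, so in practice this would likely be sampled rather than run on every request.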

---

The Future Moat in Generative AI

Not model access, but:

Context quality
Memory design
Stable orchestration
Trustworthy UX

---

5 Key Questions for Founders

Context capacity: Optimal window + content strategy
Memory boundary: Scope, storage, user inspection
Output traceability: Clear input-output linkage
Model strategy: Single vs multi, routing logic
Trust factors: Security + feedback design

---

Final Insight

Early attention to context, memory, orchestration, and trust will define a product’s long-term viability.
