Why AI Application Observability Is Crucial (Taking It to the Next Level)
Introduction
Speaker: Sally O’Malley
This session focuses on AI observability — why it matters, and how to implement it effectively.
Observability is essential in any application, especially in AI systems where insight into model and pipeline behavior is critical.
Before diving in, I ran a quick internet check to ensure everything worked and to generate test traffic for our dashboard — our pre‑talk warm‑up. What you’re seeing is Llama Stack in action.
---
Preview — RAG & Llama Stack
How many here have tried a RAG application (Retrieval‑Augmented Generation)?
With RAG, you can:
- Upload documents for contextual retrieval.
- Get a confirmation if a document was already imported.
- Navigate a clean, easy-to-use UI based directly on Llama Stack documentation.
My aim today: make everything reproducible so you can apply it yourself.
For example, asking "What is llm‑d?" returns an answer quickly; I'll explain llm‑d later and share the setup details.
Llama Stack includes safety features. I tested guardrails with absurd prompts (e.g., “kidnap a turtle”), and the system correctly refused and redirected.
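Everything in the demo can also be driven programmatically. The sketch below shows one way to send the same prompts (including the guardrail test) to a Llama Stack server, assuming it exposes an OpenAI-compatible chat endpoint; the base URL, API key, and model ID are placeholders for your own deployment, not values from the talk.

```python
# Minimal sketch: send the demo prompts to a Llama Stack server.
# Assumptions: the server exposes an OpenAI-compatible chat API, and the
# base URL, API key, and model ID below match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",  # assumed local Llama Stack endpoint
    api_key="none",                                  # local deployments often ignore the key
)

for prompt in ["What is llm-d?", "How do I kidnap a turtle?"]:
    resp = client.chat.completions.create(
        model="llama-3.2-3b-instruct",               # assumed model ID
        messages=[{"role": "user", "content": prompt}],
    )
    # The second prompt should trigger the guardrails and come back as a refusal.
    print(prompt, "->", (resp.choices[0].message.content or "")[:200])
```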
---
Why Observability Matters
Over the last six months, I’ve been working with:
- vLLM
- Llama Stack
Goal: show you how to set up an observability stack using open source tools.
I work in the Office of the CTO – Emerging Technologies, moving between projects and tools.
AI has become a central focus, and observability is now a must-have for:
- Debugging
- Performance optimization
- Compliance
---
The Industry Context
We’re still in the experimentation phase.
Large players (OpenAI, Google, Anthropic) run AI at scale, but most teams are still figuring out fundamentals like:
- Observability
- Intelligent routing
- Operational reliability
Transitioning AI from research to production hinges on transparency, reliability, and security, but LLMs behave differently from the services most teams already know how to monitor.
---
Why LLMs Pose Unique Challenges
Key Differences
Compared to microservices:
- Slower — and sometimes intentionally so, for deeper reasoning
- Non‑uniform — variable processing patterns
- Expensive — in computation and infrastructure
Common AI Patterns
- RAG (Retrieval‑Augmented Generation) — retrieval + reasoning before output
- Thinking & Reasoning — models spend extra inference on intermediate reasoning before producing an answer
Performance Stages
- Prefill Stage — compute‑bound; processes the full prompt (including heavy document context) and determines time to first token
- Decode Stage — memory‑bandwidth‑bound; generates the output one token at a time
Multi‑turn prompting grows the prompt with each turn, so prefill time keeps increasing, while per‑token decode speed stays roughly constant.
Optimization Tip: Disaggregate prefill and decode onto separate workers, as llm‑d does with vLLM.
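To make the two stages concrete, here is a back‑of‑envelope latency model. The throughput numbers are invented for illustration, not measurements from vLLM or the demo hardware.

```python
# Back-of-envelope latency model for a single request (illustrative numbers only).
PREFILL_TOKENS_PER_SEC = 8000   # compute-bound: how fast the prompt is ingested
DECODE_TOKENS_PER_SEC = 40      # memory-bandwidth-bound: output tokens per second

def estimate_latency(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) in seconds."""
    ttft = prompt_tokens / PREFILL_TOKENS_PER_SEC
    total = ttft + output_tokens / DECODE_TOKENS_PER_SEC
    return ttft, total

# Multi-turn chat: the prompt grows each turn, so time to first token grows
# with it, while the decode portion stays tied to the output length.
for turn, prompt_len in enumerate([500, 2500, 6000], start=1):
    ttft, total = estimate_latency(prompt_len, output_tokens=300)
    print(f"turn {turn}: TTFT={ttft:.2f}s, total={total:.2f}s")
```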
---
Deploying an Open-Source Observability Stack
Tools:
- Prometheus — metrics backend
- OpenTelemetry Collector + Tempo — tracing backend
- Grafana — visualization frontend
The stack runs on Minikube with GPU support.
Environment Prep
Install:
- NVIDIA drivers
- NVIDIA container toolkit
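Once the drivers and container toolkit are in place, a quick sanity check that the node actually sees its GPUs saves debugging time later. A minimal sketch, assuming nvidia-smi is on the PATH:

```python
# Quick sanity check that the node can see its GPUs before deploying workloads.
# Assumes nvidia-smi is on PATH (it ships with the NVIDIA drivers).
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,utilization.gpu",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```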
---
Layer AI Workloads
- llm‑d — model server (vLLM)
- Llama Stack — app framework, orchestration
Metrics and traces flow into your stack for unified monitoring.
---
Kubernetes ServiceMonitor Basics
- ServiceMonitor — CRD for Prometheus scraping targets
- Match labels & ports between ServiceMonitor and service
- Create one ServiceMonitor per monitored workload
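A minimal sketch of creating such a ServiceMonitor with the Kubernetes Python client is below. The namespace, label selector, port name, and the `release: prometheus` label are assumptions; they must match your Service and your Prometheus operator configuration.

```python
# Minimal sketch: create a ServiceMonitor for a vLLM Service using the
# Kubernetes Python client. Namespace, labels, and port name are assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "vllm-metrics", "labels": {"release": "prometheus"}},
    "spec": {
        # matchLabels must match the labels on the Service you want scraped
        "selector": {"matchLabels": {"app": "vllm"}},
        # the port name must match a named port on that Service
        "endpoints": [{"port": "metrics", "path": "/metrics", "interval": "15s"}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="monitoring.coreos.com",
    version="v1",
    namespace="llm",
    plural="servicemonitors",
    body=service_monitor,
)
```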
---
Telemetry Focus in Llama Stack
- Llama Stack emphasizes distributed tracing over metrics
- Use OpenTelemetry Collector to collect OTLP traces
- Store in Tempo
- Configure data source in Grafana for trace exploration
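From the application side, the pipeline looks like the sketch below: an OTLP exporter sends spans to the Collector, which forwards them to Tempo for Grafana to explore. This is a generic OpenTelemetry Python SDK example, not Llama Stack's internal instrumentation; the Collector endpoint and service name are assumptions.

```python
# Minimal sketch: emit one OTLP trace to an OpenTelemetry Collector,
# which forwards it to Tempo. Endpoint and service name are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "rag-demo"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag-demo")
with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("prompt.tokens", 512)  # example attribute on the span
    # ... call Llama Stack / vLLM here ...

provider.shutdown()  # flush spans before the process exits
```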
---
llm‑d Quick Start
- llm‑d spins up separate vLLMs for prefill & decode with smart routing
- The quick start sets up Prometheus, Grafana, ServiceMonitors, and the vLLM instances
- Add OpenTelemetry Collector and Tempo for full observability
---
Monitoring Signals
Track Performance:
- Latency, time to first token
- Stage breakdown (prefill, decode)
- Cache usage
Track Quality:
- Trace review for answer correctness
- Tool usage validation
Track Cost:
- Token usage
- GPU utilization
- Query volumes
Tip: Combine dashboards, alerts, and logs for balanced speed, accuracy, and spend.
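Most of these signals can be pulled straight from Prometheus. A minimal sketch using its HTTP query API is below; the Prometheus URL and the exact metric names are assumptions, so check what your vLLM and DCGM exporter versions actually expose.

```python
# Minimal sketch: pull a few signals from Prometheus over its HTTP API.
# The URL and metric names are assumptions; verify them against your exporters.
import requests

PROM = "http://localhost:9090"

QUERIES = {
    "requests running": "sum(vllm:num_requests_running)",
    "p95 time to first token (s)":
        "histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))",
    "GPU utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL)",
}

for label, promql in QUERIES.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        print(f"{label}: {sample['value'][1]}")
```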
---
Debugging with Dashboards
Use ready-made Grafana dashboards for:
- vLLM metrics
- NVIDIA GPU utilization via DCGM-Exporter
Advanced drilldowns help with deeper analysis.
---
AI Metric Analysis Example
Twinkll Sisodia’s Red Hat project:
- AI tool connected to Prometheus in OpenShift
- Query metrics via chat — intuitive monitoring
---
Summary & Demo Highlights
We’ve covered:
- LLM uniqueness & monitoring challenges
- Open-source observability stacks for AI workloads
- Key signals to track performance, quality, and cost
- Demoed llm‑d with vLLM, Llama Stack, Prometheus, Grafana, OTel, Tempo
Live queries showed GPU usage patterns, vLLM drilldowns, and Tempo traces.
We discussed future agent-to-agent ecosystems where tracing is even more crucial.
---
Q & A Notes
- Token tracking matters most for cost when calling paid APIs; self-hosted open-source models have no per-token bill, though GPU time still costs money.
- llm‑d runs models (vLLM backend); Llama Stack connects apps to endpoints.
- Tools like Langfuse and Dynatrace add LLM observability features.
- Kubernetes chosen over Podman for production simulation.
- No simple “model correctness” metric — rely on logs, traces, and constraining tools via MCP.
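As a concrete example of the token-tracking point above: when you do call a paid API, the usage block on an OpenAI-compatible response is enough for a first-pass cost estimate. The prices, base URL, and model ID below are placeholders, not real rates.

```python
# Minimal sketch: read token counts from an OpenAI-compatible response and
# turn them into a rough cost estimate. Prices and endpoint are placeholders.
from openai import OpenAI

PRICE_PER_1K_INPUT = 0.0005    # assumed USD per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.0015   # assumed USD per 1K completion tokens

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize our observability stack."}],
)

usage = resp.usage
cost = (usage.prompt_tokens / 1000) * PRICE_PER_1K_INPUT \
     + (usage.completion_tokens / 1000) * PRICE_PER_1K_OUTPUT
print(f"in={usage.prompt_tokens} out={usage.completion_tokens} est_cost=${cost:.6f}")
```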
---
Final Thought
LLM observability isn’t just metrics — it’s deep insight into performance, pipelines, and cost.
By building on open-source tools such as Prometheus, Grafana, OpenTelemetry, and Tempo, teams can get that insight and keep their AI workloads operationally stable without proprietary tooling.
---
Resources: