Why AI Application Observability Is Crucial (Taking It to the Next Level)
Introduction
Speaker: Sally O’Malley
This session focuses on AI observability — why it matters, and how to implement it effectively.
Observability is essential in any application, especially in AI systems where insight into model and pipeline behavior is critical.
Before diving in, I ran a quick internet check to ensure everything worked and to generate test traffic for our dashboard — our pre‑talk warm‑up. What you’re seeing is Llama Stack in action.
---
Preview — RAG & Llama Stack
How many here have tried a RAG application (Retrieval‑Augmented Generation)?
With RAG, you can:
- Upload documents for contextual retrieval.
- Get a confirmation if a document was already imported.
- Navigate a clean, easy-to-use UI based directly on Llama Stack documentation.
My aim today: make everything reproducible so you can apply it yourself.
For example, asking "What is llm‑d?" returns an answer quickly; I'll explain llm‑d later and share the setup details.
Llama Stack includes safety features. I tested guardrails with absurd prompts (e.g., “kidnap a turtle”), and the system correctly refused and redirected.
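Everything in the demo can also be driven programmatically. The sketch below shows one way to send the same prompts (including the guardrail test) to a Llama Stack server, assuming it exposes an OpenAI-compatible chat endpoint; the base URL, API key, and model ID are placeholders for your own deployment, not values from the talk.

```python
# Minimal sketch: send the demo prompts to a Llama Stack server.
# Assumptions: the server exposes an OpenAI-compatible chat API, and the
# base URL, API key, and model ID below match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1/openai/v1",  # assumed local Llama Stack endpoint
    api_key="none",                                  # local deployments often ignore the key
)

for prompt in ["What is llm-d?", "How do I kidnap a turtle?"]:
    resp = client.chat.completions.create(
        model="llama-3.2-3b-instruct",               # assumed model ID
        messages=[{"role": "user", "content": prompt}],
    )
    # The second prompt should trigger the guardrails and come back as a refusal.
    print(prompt, "->", (resp.choices[0].message.content or "")[:200])
```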
---
Why Observability Matters
Over the last six months, I’ve been working with:
- vLLM
- Llama Stack
Goal: show you how to set up an observability stack using open source tools.
I work in the Office of the CTO – Emerging Technologies, moving between projects and tools.
AI has become a central focus, and observability is now a must-have for:
- Debugging
- Performance optimization
- Compliance
---
The Industry Context
We’re still in the experimentation phase.
Large players (OpenAI, Google, Anthropic) run AI at scale, but most teams are still figuring out fundamentals like:
- Observability
- Intelligent routing
- Operational reliability
Transitioning AI from research to production hinges on transparency, reliability, and security, but LLMs behave differently from the services most teams already know how to monitor.
---
Why LLMs Pose Unique Challenges
Key Differences
Compared to microservices:
- Slower — and sometimes intentionally so, for deeper reasoning
- Non‑uniform — variable processing patterns
- Expensive — in computation and infrastructure
Common AI Patterns
- RAG (Retrieval‑Augmented Generation) — retrieval + reasoning before output
- Thinking & Reasoning — models spend extra inference on intermediate reasoning before producing an answer
Performance Stages
- Prefill Stage — compute‑bound; processes the full prompt (including heavy document context) and determines time to first token
- Decode Stage — memory‑bandwidth‑bound; generates the output one token at a time
Multi‑turn prompting grows the prompt with each turn, so prefill time keeps increasing, while per‑token decode speed stays roughly constant.
Optimization Tip: Disaggregate prefill and decode onto separate workers, as llm‑d does with vLLM.
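To make the two stages concrete, here is a back‑of‑envelope latency model. The throughput numbers are invented for illustration, not measurements from vLLM or the demo hardware.

```python
# Back-of-envelope latency model for a single request (illustrative numbers only).
PREFILL_TOKENS_PER_SEC = 8000   # compute-bound: how fast the prompt is ingested
DECODE_TOKENS_PER_SEC = 40      # memory-bandwidth-bound: output tokens per second

def estimate_latency(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (time_to_first_token, total_latency) in seconds."""
    ttft = prompt_tokens / PREFILL_TOKENS_PER_SEC
    total = ttft + output_tokens / DECODE_TOKENS_PER_SEC
    return ttft, total

# Multi-turn chat: the prompt grows each turn, so time to first token grows
# with it, while the decode portion stays tied to the output length.
for turn, prompt_len in enumerate([500, 2500, 6000], start=1):
    ttft, total = estimate_latency(prompt_len, output_tokens=300)
    print(f"turn {turn}: TTFT={ttft:.2f}s, total={total:.2f}s")
```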
---
Deploying an Open-Source Observability Stack
Tools:
- Prometheus — metrics backend
- OpenTelemetry Collector + Tempo — tracing backend
- Grafana — visualization frontend
The stack runs on Minikube with GPU support.
Environment Prep
Install:
- NVIDIA drivers
- NVIDIA container toolkit
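Once the drivers and container toolkit are in place, a quick sanity check that the node actually sees its GPUs saves debugging time later. A minimal sketch, assuming nvidia-smi is on the PATH:

```python
# Quick sanity check that the node can see its GPUs before deploying workloads.
# Assumes nvidia-smi is on PATH (it ships with the NVIDIA drivers).
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,utilization.gpu",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())
```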
---
Layer AI Workloads
- llm‑d — model server (vLLM)
- Llama Stack — app framework, orchestration
Metrics and traces flow into your stack for unified monitoring.
---
Kubernetes ServiceMonitor Basics
- ServiceMonitor — CRD for Prometheus scraping targets
- Match labels & ports between ServiceMonitor and service
- Create one ServiceMonitor per monitored workload
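A minimal sketch of creating such a ServiceMonitor with the Kubernetes Python client is below. The namespace, label selector, port name, and the `release: prometheus` label are assumptions; they must match your Service and your Prometheus operator configuration.

```python
# Minimal sketch: create a ServiceMonitor for a vLLM Service using the
# Kubernetes Python client. Namespace, labels, and port name are assumptions.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "vllm-metrics", "labels": {"release": "prometheus"}},
    "spec": {
        # matchLabels must match the labels on the Service you want scraped
        "selector": {"matchLabels": {"app": "vllm"}},
        # the port name must match a named port on that Service
        "endpoints": [{"port": "metrics", "path": "/metrics", "interval": "15s"}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="monitoring.coreos.com",
    version="v1",
    namespace="llm",
    plural="servicemonitors",
    body=service_monitor,
)
```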
---
Telemetry Focus in Llama Stack
- Llama Stack emphasizes distributed tracing over metrics
- Use OpenTelemetry Collector to collect OTLP traces
- Store in Tempo
- Configure data source in Grafana for trace exploration
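From the application side, the pipeline looks like the sketch below: an OTLP exporter sends spans to the Collector, which forwards them to Tempo for Grafana to explore. This is a generic OpenTelemetry Python SDK example, not Llama Stack's internal instrumentation; the Collector endpoint and service name are assumptions.

```python
# Minimal sketch: emit one OTLP trace to an OpenTelemetry Collector,
# which forwards it to Tempo. Endpoint and service name are assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "rag-demo"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("rag-demo")
with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("prompt.tokens", 512)  # example attribute on the span
    # ... call Llama Stack / vLLM here ...

provider.shutdown()  # flush spans before the process exits
```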
---
llm‑d Quick Start
- llm‑d spins up separate vLLMs for prefill & decode with smart routing
- The quick start sets up Prometheus, Grafana, ServiceMonitors, and the vLLM instances
- Add OpenTelemetry Collector and Tempo for full observability
---
Monitoring Signals
Track Performance:
- Latency, time to first token
- Stage breakdown (prefill, decode)
- Cache usage
Track Quality:
- Trace review for answer correctness
- Tool usage validation
Track Cost:
- Token usage
- GPU utilization
- Query volumes
Tip: Combine dashboards, alerts, and logs for balanced speed, accuracy, and spend.
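Most of these signals can be pulled straight from Prometheus. A minimal sketch using its HTTP query API is below; the Prometheus URL and the exact metric names are assumptions, so check what your vLLM and DCGM exporter versions actually expose.

```python
# Minimal sketch: pull a few signals from Prometheus over its HTTP API.
# The URL and metric names are assumptions; verify them against your exporters.
import requests

PROM = "http://localhost:9090"

QUERIES = {
    "requests running": "sum(vllm:num_requests_running)",
    "p95 time to first token (s)":
        "histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))",
    "GPU utilization (%)": "avg(DCGM_FI_DEV_GPU_UTIL)",
}

for label, promql in QUERIES.items():
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        print(f"{label}: {sample['value'][1]}")
```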
---
Debugging with Dashboards
Use ready-made Grafana dashboards for:
- vLLM metrics
- NVIDIA GPU utilization via DCGM-Exporter
Advanced drilldowns help with deeper analysis.
---
AI Metric Analysis Example
Twinkll Sisodia’s Red Hat project:
- AI tool connected to Prometheus in OpenShift
- Query metrics via chat — intuitive monitoring
---
Summary & Demo Highlights
We’ve covered:
- LLM uniqueness & monitoring challenges
- Open-source observability stacks for AI workloads
- Key signals to track performance, quality, and cost
- Demoed llm‑d with vLLM, Llama Stack, Prometheus, Grafana, OTel, Tempo
Live queries showed GPU usage patterns, vLLM drilldowns, and Tempo traces.
We discussed future agent-to-agent ecosystems where tracing is even more crucial.
---
Q & A Notes
- Token tracking matters most for cost when calling paid APIs; self-hosted open-source models have no per-token bill, though GPU time still costs money.
- llm‑d runs models (vLLM backend); Llama Stack connects apps to endpoints.
- Tools like Langfuse and Dynatrace add LLM observability features.
- Kubernetes chosen over Podman for production simulation.
- No simple “model correctness” metric — rely on logs, traces, and constraining tools via MCP.
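As a concrete example of the token-tracking point above: when you do call a paid API, the usage block on an OpenAI-compatible response is enough for a first-pass cost estimate. The prices, base URL, and model ID below are placeholders, not real rates.

```python
# Minimal sketch: read token counts from an OpenAI-compatible response and
# turn them into a rough cost estimate. Prices and endpoint are placeholders.
from openai import OpenAI

PRICE_PER_1K_INPUT = 0.0005    # assumed USD per 1K prompt tokens
PRICE_PER_1K_OUTPUT = 0.0015   # assumed USD per 1K completion tokens

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize our observability stack."}],
)

usage = resp.usage
cost = (usage.prompt_tokens / 1000) * PRICE_PER_1K_INPUT \
     + (usage.completion_tokens / 1000) * PRICE_PER_1K_OUTPUT
print(f"in={usage.prompt_tokens} out={usage.completion_tokens} est_cost=${cost:.6f}")
```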
---
Final Thought
LLM observability isn’t just metrics — it’s deep insight into performance, pipelines, and cost.
By building on open-source tools such as Prometheus, Grafana, OpenTelemetry, and Tempo, teams can get that insight and keep their AI workloads operationally stable without proprietary tooling.
---
Resources: