Why AI Application Observability Is Crucial (Taking It to the Next Level)

Introduction

Speaker: Sally O’Malley

This session focuses on AI observability — why it matters, and how to implement it effectively.

Observability is essential in any application, especially in AI systems where insight into model and pipeline behavior is critical.

Before diving in, I ran a quick connectivity check to make sure everything was working and to generate some test traffic for our dashboards, our pre‑talk warm‑up. What you’re seeing is Llama Stack in action.

---

Preview — RAG & Llama Stack

How many here have tried a RAG application (Retrieval‑Augmented Generation)?

With RAG, you can:

  • Upload documents for contextual retrieval.
  • Get a confirmation if a document was already imported.
  • Navigate a clean, easy-to-use UI based directly on Llama Stack documentation.

My aim today: make everything reproducible so you can apply it yourself.

For example, asking “What is llm‑d?” returns an answer quickly; I’ll explain what llm‑d is later and share the setup details.

Llama Stack includes safety features. I tested guardrails with absurd prompts (e.g., “kidnap a turtle”), and the system correctly refused and redirected.

---

Why Observability Matters

Over the last six months, I’ve been working with:

  • vLLM
  • Llama Stack

Goal: show you how to set up an observability stack using open source tools.

I work in the Office of the CTO – Emerging Technologies, moving between projects and tools.

AI has become a central focus, and observability is now a must-have for:

  • Debugging
  • Performance optimization
  • Compliance

---

The Industry Context

We’re still in the experimentation phase.

Large players (OpenAI, Google, Anthropic) run AI at scale, but most teams are still figuring out fundamentals like:

  • Observability
  • Intelligent routing
  • Operational reliability

Transitioning AI from research to production hinges on transparency, reliability, and security, but LLMs behave differently from traditional services.

---

Why LLMs Pose Unique Challenges

Key Differences

Compared to microservices:

  • Slower — and sometimes intentionally so, for deeper reasoning
  • Non‑uniform — variable processing patterns
  • Expensive — in computation and infrastructure

Common AI Patterns

  • RAG (Retrieval‑Augmented Generation) — retrieval + reasoning before output
  • Thinking & Reasoning — heavier inference from context

Performance Stages

  • Prefill Stage — compute‑bound; processes the full prompt (including heavy document context) and determines time to first token
  • Decode Stage — memory‑bandwidth‑bound; generates the output one token at a time, each step reading the model weights

Multi-turn prompting grows the prompt with every turn, so prefill time increases with each iteration while per-token decode cost stays roughly flat.

Optimization Tip: Disaggregate the prefill and decode phases onto separate workers, as llm‑d does with vLLM. The client-side sketch below shows one way to observe the two phases separately.
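
To see the two stages from the client side, here is a minimal sketch against a vLLM server’s OpenAI-compatible API. The base URL (http://localhost:8000/v1) and the model name are assumptions; substitute whatever your deployment exposes. Time to the first streamed chunk approximates prefill cost, and chunks per second afterwards approximate decode throughput.

```python
# Minimal sketch: approximate prefill (time to first token) and decode
# (chunks/second) from the client side. Assumes a vLLM server exposing the
# OpenAI-compatible API at http://localhost:8000/v1 and a model named
# "meta-llama/Llama-3.1-8B-Instruct" -- both are assumptions, adjust as needed.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is llm-d?"}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # roughly the prefill cost
    chunks += 1
end = time.perf_counter()

if first_token_at is None:
    raise SystemExit("no tokens received")

decode_time = end - first_token_at
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"decode: {chunks} chunks in {decode_time:.2f}s "
      f"(~{chunks / max(decode_time, 1e-9):.1f} chunks/s)")
```

Most servers stream roughly one token per chunk, so this is only an approximation; the server-side vLLM histograms scraped by Prometheus give the exact numbers.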

---

Deploying an Open-Source Observability Stack

Tools:

  • Prometheus — metrics backend
  • OpenTelemetry Collector + Tempo — tracing backend
  • Grafana — visualization frontend
  • Stack runs on Minikube with GPUs

Environment Prep

Install:

  • NVIDIA drivers
  • NVIDIA container toolkit

---

Layer AI Workloads

  • llm‑d — model server (vLLM)
  • Llama Stack — app framework, orchestration

Metrics and traces flow into your stack for unified monitoring.

---

Kubernetes ServiceMonitor Basics

  • ServiceMonitor — CRD for Prometheus scraping targets
  • Match labels & ports between ServiceMonitor and service
  • Create one ServiceMonitor per monitored workload (see the sketch after this list)
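
As a sketch of the label-and-port matching, the snippet below creates a ServiceMonitor for a hypothetical vLLM Service labelled app: vllm with a metrics port named metrics. The names, namespace, and scrape interval are assumptions and must match your actual Service; it also assumes the Prometheus Operator CRDs are installed. A plain YAML manifest applied with kubectl works just as well.

```python
# Minimal sketch: create a ServiceMonitor for a hypothetical vLLM Service.
# Assumes the Prometheus Operator CRDs are installed and that the Service
# carries the label app=vllm and names its metrics port "metrics" --
# adjust labels, port name, and namespace to match your deployment.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "vllm-monitor", "namespace": "llm"},
    "spec": {
        # matchLabels must match the Service's labels.
        "selector": {"matchLabels": {"app": "vllm"}},
        # "port" refers to the *name* of the port in the Service definition.
        "endpoints": [{"port": "metrics", "interval": "15s"}],
    },
}

api.create_namespaced_custom_object(
    group="monitoring.coreos.com",
    version="v1",
    namespace="llm",
    plural="servicemonitors",
    body=service_monitor,
)
```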

---

Telemetry Focus in Llama Stack

  • Llama Stack emphasizes distributed tracing over metrics
  • Use OpenTelemetry Collector to collect OTLP traces
  • Store in Tempo
  • Configure a Tempo data source in Grafana for trace exploration (a minimal export sketch follows below)
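
To illustrate the OTLP path, here is a minimal Python sketch that sends one span to an OpenTelemetry Collector assumed to be listening for gRPC on localhost:4317; the endpoint, service name, and span and attribute names are all placeholders. Llama Stack emits its own traces; this only shows how additional application-side spans reach the same Collector and end up in Tempo for exploration in Grafana.

```python
# Minimal sketch: export a span over OTLP/gRPC to an OpenTelemetry Collector
# that forwards to Tempo. Endpoint, service name, and span/attribute names
# are placeholders -- adjust to your setup.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "rag-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-app")

# Wrap one "RAG query" in a span so it shows up in Tempo via Grafana.
with tracer.start_as_current_span("rag.query") as span:
    span.set_attribute("rag.question", "What is llm-d?")
    # ... retrieval and model calls would go here ...

provider.shutdown()  # flush pending spans before the process exits
```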

---

llm‑d Quick Start

  • llm‑d spins up separate vLLMs for prefill & decode with smart routing
  • The quick start sets up Prometheus, Grafana, ServiceMonitors, and the vLLM instances
  • Add OpenTelemetry Collector and Tempo for full observability

---

Monitoring Signals

Track Performance:

  • Latency, time to first token
  • Stage breakdown (prefill, decode)
  • Cache usage

Track Quality:

  • Trace review for answer correctness
  • Tool usage validation

Track Cost:

  • Token usage
  • GPU utilization
  • Query volumes

Tip: Combine dashboards, alerts, and logs for balanced speed, accuracy, and spend.
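
One way to pull a few of these signals programmatically is the Prometheus HTTP API. In the sketch below, the Prometheus URL and the vLLM metric names (vllm:time_to_first_token_seconds_*, vllm:generation_tokens_total, vllm:prompt_tokens_total) are assumptions; metric names vary across vLLM versions, so confirm them against your server’s /metrics output.

```python
# Minimal sketch: pull a few vLLM performance/cost signals from Prometheus.
# The Prometheus URL and metric names are assumptions (vLLM metric names vary
# by version) -- confirm them against the model server's /metrics endpoint.
import requests

PROMETHEUS = "http://localhost:9090"  # e.g. a port-forwarded Prometheus service

QUERIES = {
    # Average time to first token over the last 5 minutes (prefill-dominated).
    "avg time to first token (s)": (
        "rate(vllm:time_to_first_token_seconds_sum[5m])"
        " / rate(vllm:time_to_first_token_seconds_count[5m])"
    ),
    # Throughput and rough cost proxies.
    "generated tokens / s": "rate(vllm:generation_tokens_total[5m])",
    "prompt tokens / s": "rate(vllm:prompt_tokens_total[5m])",
}

for label, promql in QUERIES.items():
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        _, value = result["value"]  # value is [timestamp, "number-as-string"]
        print(f"{label}: {float(value):.3f}  {result['metric']}")
```

The same expressions can back Grafana panels and alert rules, which is usually where they end up.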

---

Debugging with Dashboards

Use ready-made Grafana dashboards for:

  • vLLM metrics
  • NVIDIA GPU utilization via DCGM-Exporter

Advanced drilldowns help with deeper analysis.

---

AI Metric Analysis Example

Twinkll Sisodia’s Red Hat project:

  • AI tool connected to Prometheus in OpenShift
  • Query metrics via chat — intuitive monitoring

---

Summary & Demo Highlights

We’ve covered:

  • LLM uniqueness & monitoring challenges
  • Open-source observability stacks for AI workloads
  • Key signals to track performance, quality, and cost
  • Demoed llm‑d with vLLM, Llama Stack, Prometheus, Grafana, OTel, Tempo

Live queries showed GPU usage patterns, vLLM drilldowns, and Tempo traces.

We discussed future agent-to-agent ecosystems where tracing is even more crucial.

---

Q & A Notes

  • Token tracking matters for cost when using paid APIs; self-hosted open-source models have no per-token charge, though GPU time still costs money.
  • llm‑d runs models (vLLM backend); Llama Stack connects apps to endpoints.
  • Tools like Langfuse and Dynatrace add LLM observability features.
  • Kubernetes chosen over Podman for production simulation.
  • No simple “model correctness” metric — rely on logs, traces, and constraining tools via MCP.

---

Final Thought

LLM observability isn’t just metrics — it’s deep insight into performance, pipelines, and cost.

With an open-source stack of Prometheus, Grafana, OpenTelemetry, and Tempo, teams can build that insight themselves and keep performance, quality, and spend visible in production.
