From Dashboard Soup to Observability Lasagna: Building a Better Observability System
Speaker: Martha Lambert — Product Engineer at incident.io, specializing in reliability and observability.
---
Overview
In this talk, you’ll learn how we transitioned from chaotic dashboard soup to a structured, layered observability lasagna.
Key takeaways:
- A proven iterative process to "unsoup" your dashboards.
- How layered observability can guide engineers through debugging.
- Practical, technical tips for smooth integration into your stack.
---
The Backstory
In early 2024, we had built an on-call product — handling alerts, paging, and postmortems — with very high reliability requirements. We had two months to validate it before release.
The challenge was meta: if our on-call product failed, no pages would go out, meaning customers wouldn't find out their own systems were down.
Reliability Goals:
- Proactive — Confident the system handles expected scenarios daily.
- Reactive — Capable of rapid discovery and resolution when something breaks.
---
> Great observability underpins both goals. It’s not about expensive tools — it’s about having the right strategy.
---
The "Dashboard Soup" Problem
Dashboard soup = dozens of one-off dashboards created during incidents, never revisited, hard to find, with no clear investigation path.
How Layering Helps:
- Start from system-wide KPIs.
- Drill into service metrics → request traces → individual logs.
- Maintain focus and prevent duplication.
- Provide clear entry points and context for newer engineers.
---
Unsouping Our Stack — The Iterative Process
We applied a TDD-like loop for observability called:
Predict → Prove → Measure
1. Predict
Form hypotheses about system behavior under load.
Example:
> “Server response times stay under 200ms at peak.”
2. Prove
Run drills, apply real load, observe behavior — no assumptions.
3. Measure
Add metrics, logs, or traces to cover the gaps discovered during tests (see the sketch below).
Loop continuously until you are confident that unexpected events will be detected as soon as they happen.
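To make the Measure step concrete, here is a minimal sketch of the kind of metric that could back the 200ms prediction, assuming a Go service instrumented with Prometheus' client_golang; the metric name, buckets, and route are illustrative, not the actual instrumentation from the talk.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical histogram backing the "<200ms at peak" prediction: the buckets
// bracket the 200ms threshold so a load drill shows exactly where the
// latency distribution sits.
var responseDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Server response time, used to verify the under-200ms prediction.",
		Buckets: []float64{0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1},
	},
	[]string{"route"},
)

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		responseDuration.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
	}()
	time.Sleep(time.Duration(rand.Intn(150)) * time.Millisecond) // stand-in for real work
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/alerts", handle)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus during the drill
	http.ListenAndServe(":8080", nil)
}
```

During a drill, a quantile query over this histogram (or simply the bucket counts) shows whether the prediction held under real load.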
---
Team Drills — Stress Testing in Practice
Steps:
- Gather engineers.
- Brainstorm failure modes.
- Simulate in controlled environment.
- Observe, record, and share insights.
Record daily outcomes (a hypothetical record format is sketched after this list):
- Confidence level.
- What was tested.
- Load volumes.
- Code/observability changes made.
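As a sketch of what that daily record might look like if captured in code rather than a doc, here is a hypothetical Go struct; the type and field names are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// DrillRecord is a hypothetical shape for the daily drill outcome described above.
type DrillRecord struct {
	Date        time.Time `json:"date"`
	Confidence  int       `json:"confidence"`   // self-assessed, e.g. 1-5
	Tested      string    `json:"tested"`       // what was tested
	LoadVolume  string    `json:"load_volume"`  // e.g. "5x normal alert ingest"
	ChangesMade []string  `json:"changes_made"` // code/observability changes that followed
}

func main() {
	rec := DrillRecord{
		Date:        time.Now(),
		Confidence:  3,
		Tested:      "alert ingest backlog after a downstream outage",
		LoadVolume:  "5x normal alert ingest",
		ChangesMade: []string{"added queue depth gauge", "alert on delivery lag > 60s"},
	}
	out, _ := json.MarshalIndent(rec, "", "  ")
	fmt.Println(string(out))
}
```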
---
Avoiding Dashboard Pitfalls
- Over-specific dashboards: build views that answer broad, lasting triage questions, not just the "known unknowns" from a single incident.
- Static dashboards with no navigation: make sure there are clear paths between layers (metrics → logs → traces) so no view is a dead end.
---
The Four-Layer Lasagna
- Overview Dashboard — traffic-light triage view at product level.
- System Dashboard — SLIs and health of specific subsystems.
- Logs — filtered, detailed event views.
- Traces — request-level flow and timing.
Principle:
- Engineers unfamiliar with a dashboard must still be able to use it instantly.
- Click-through navigation between all layers.
---
Overview Dashboard Best Practices
- Single homepage in Grafana.
- Each row = subsystem health indicator.
- No deep detail — visual signposts only.
- Links direct to system dashboards.
---
System Dashboards
Show all metrics that indicate reliability and customer impact (see the instrumentation sketch after this list):
- Capacity vs. usage.
- Throughput rates.
- User-observed delays (time from alert to page).
- Outcome states beyond success/error.
- Direct links to filtered logs.
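The bullets above map naturally onto a handful of Prometheus series. A hedged sketch, assuming client_golang in Go; every metric name, label value, and subsystem here is illustrative rather than taken from the talk.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metrics backing a system dashboard for an alert-delivery subsystem.
var (
	// Capacity vs. usage: graph both series on one panel so headroom is visible.
	workerCapacity = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "escalation_worker_capacity",
		Help: "Configured number of escalation workers.",
	})
	workersBusy = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "escalation_workers_busy",
		Help: "Escalation workers currently processing.",
	})

	// Throughput, with outcome states beyond plain success/error.
	escalationsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "escalations_total",
		Help: "Escalations processed, labelled by outcome.",
	}, []string{"outcome"}) // e.g. "delivered", "acked", "timed_out", "rate_limited"

	// User-observed delay: time from the alert firing to the page arriving.
	alertToPageSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "alert_to_page_seconds",
		Help:    "Delay between alert creation and page delivery, as the user feels it.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s up to ~256s
	})
)

func main() {
	// Illustrative updates; real code would do this where the work actually happens.
	workerCapacity.Set(32)
	workersBusy.Set(7)
	escalationsTotal.WithLabelValues("delivered").Inc()
	alertToPageSeconds.Observe(4.2)
}
```

Graphing the capacity gauge on the same panel as the busy gauge is what keeps limits visible as dynamic lines rather than static constants.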
---
Logs and Traces
Logs:
- Consistent format across systems.
- Coupled closely with metrics (event logs).
- Always emitted, even on error paths.
Traces:
- Direct links from logs (see the sketch after this list).
- Capture all critical events.
- Record active processing in separate spans from time spent waiting on connections.
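A minimal sketch of tying logs and traces together, assuming Go's log/slog and the OpenTelemetry SDK; the tracer, function, and field names are illustrative. Once every log line carries the trace ID, a data link on that field in the log view can jump straight to the trace.

```go
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the current trace and span IDs,
// so the log view can link each line directly to its trace.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, attrs ...any) {
	span := trace.SpanFromContext(ctx)
	if sc := span.SpanContext(); sc.IsValid() {
		attrs = append(attrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.InfoContext(ctx, msg, attrs...)
}

func deliverPage(ctx context.Context, logger *slog.Logger, escalationID string) {
	// Child span covering only active processing; waiting on the provider
	// connection would live in its own span so the two are distinguishable.
	// A real service would configure a TracerProvider at startup.
	ctx, span := otel.Tracer("escalator").Start(ctx, "deliver_page")
	defer span.End()

	logWithTrace(ctx, logger, "page delivered", slog.String("escalation_id", escalationID))
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)) // one consistent JSON format everywhere
	deliverPage(context.Background(), logger, "esc_123")
}
```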
---
Practical Implementation Tips
- User Impact First — tie metrics to customer outcomes.
- Track User-Observed Times — measure delays as felt by users.
- Connect Metrics → Logs → Traces — make it impossible to hit a dead end.
- Use Exemplars — clickable dots in metrics linking directly to traces (see the sketch after this list).
- Visualize Limits — graph capacity lines dynamically, not as static constants.
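A hedged sketch of the exemplar tip, assuming Prometheus' client_golang and an OpenTelemetry trace in context; the metric name is illustrative, and exemplars are only exposed when the /metrics handler serves the OpenMetrics format. Grafana can then render each exemplar as a clickable dot that opens the linked trace.

```go
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

// Hypothetical latency histogram that will carry trace exemplars.
var pageLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "page_delivery_duration_seconds",
	Help:    "Time to deliver a page, with trace exemplars.",
	Buckets: prometheus.DefBuckets,
})

// observeWithTraceExemplar records a latency and, when a trace is active,
// attaches its ID as an exemplar so the metric panel can link to the trace.
func observeWithTraceExemplar(ctx context.Context, d time.Duration) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	if eo, ok := pageLatency.(prometheus.ExemplarObserver); ok && sc.IsValid() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	pageLatency.Observe(d.Seconds()) // no active trace: record without an exemplar
}

func main() {
	observeWithTraceExemplar(context.Background(), 120*time.Millisecond)
}
```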
---
When Is It "Done"?
Never completely — systems evolve. But you can reach confidence when:
- Tested under varied loads and failures.
- Responses are well understood.
- User impact is always clear.
---
Team Ownership
Reliability isn’t just about tools — it’s about every engineer being able to handle incidents:
- Run team drills (open-book).
- Hold quarterly “Game Days” (deep-end simulation).
- Rotate roles to share knowledge.
---
Summary
Moving from soup to lasagna requires:
- Iterative Predict → Prove → Measure loops.
- UX-oriented dashboard design.
- Tight layer integration.
- Shared team understanding of tools.
---