From Dashboard Soup to Observability Lasagna: Building a Better Observability System
Speaker: Martha Lambert — Product Engineer at incident.io, specializing in reliability and observability.
---
Overview
In this talk, you’ll learn how we transitioned from chaotic dashboard soup to a structured, layered observability lasagna.
Key takeaways:
- A proven iterative process to "unsoup" your dashboards.
- How layered observability can guide engineers through debugging.
- Practical, technical tips for smooth integration into your stack.
---
The Backstory
In early 2024, we had built an on-call product — handling alerts, paging, and postmortems — with very high reliability requirements. We had two months to validate it before release.
The challenge was meta: if our on-call product failed, no pages would go out, meaning customers wouldn't find out their own systems were down.
Reliability Goals:
- Proactive — Confident the system handles expected scenarios daily.
- Reactive — Capable of rapid discovery and resolution when something breaks.
---
> Great observability underpins both goals. It’s not about expensive tools — it’s about having the right strategy.
---
The "Dashboard Soup" Problem
Dashboard soup = dozens of one-off dashboards created during incidents, never revisited, hard to find, with no clear investigation path.
How Layering Helps:
- Start from system-wide KPIs.
- Drill into service metrics → request traces → individual logs.
- Maintain focus and prevent duplication.
- Provide clear entry points and context for newer engineers.
---
Unsouping Our Stack — The Iterative Process
We applied a TDD-like loop for observability called:
Predict → Prove → Measure
1. Predict
Form hypotheses about system behavior under load.
Example:
> “Server response times stay under 200ms at peak.”
2. Prove
Run drills, apply real load, observe behavior — no assumptions.
3. Measure
Add metrics, logs, or traces to cover the gaps discovered during tests (see the sketch below).
Loop continuously until you are confident that unexpected events will be detected as soon as they happen.
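To make the Measure step concrete, here is a minimal sketch of the kind of metric that could back the 200ms prediction, assuming a Go service instrumented with Prometheus' client_golang; the metric name, buckets, and route are illustrative, not the actual instrumentation from the talk.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical histogram backing the "<200ms at peak" prediction: the buckets
// bracket the 200ms threshold so a load drill shows exactly where the
// latency distribution sits.
var responseDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Server response time, used to verify the under-200ms prediction.",
		Buckets: []float64{0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1},
	},
	[]string{"route"},
)

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		responseDuration.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
	}()
	time.Sleep(time.Duration(rand.Intn(150)) * time.Millisecond) // stand-in for real work
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/alerts", handle)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus during the drill
	http.ListenAndServe(":8080", nil)
}
```

During a drill, a quantile query over this histogram (or simply the bucket counts) shows whether the prediction held under real load.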
---
Team Drills — Stress Testing in Practice
Steps:
- Gather engineers.
- Brainstorm failure modes.
- Simulate in controlled environment.
- Observe, record, and share insights.
Record daily outcomes (a hypothetical record format is sketched after this list):
- Confidence level.
- What was tested.
- Load volumes.
- Code/observability changes made.
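As a sketch of what that daily record might look like if captured in code rather than a doc, here is a hypothetical Go struct; the type and field names are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// DrillRecord is a hypothetical shape for the daily drill outcome described above.
type DrillRecord struct {
	Date        time.Time `json:"date"`
	Confidence  int       `json:"confidence"`   // self-assessed, e.g. 1-5
	Tested      string    `json:"tested"`       // what was tested
	LoadVolume  string    `json:"load_volume"`  // e.g. "5x normal alert ingest"
	ChangesMade []string  `json:"changes_made"` // code/observability changes that followed
}

func main() {
	rec := DrillRecord{
		Date:        time.Now(),
		Confidence:  3,
		Tested:      "alert ingest backlog after a downstream outage",
		LoadVolume:  "5x normal alert ingest",
		ChangesMade: []string{"added queue depth gauge", "alert on delivery lag > 60s"},
	}
	out, _ := json.MarshalIndent(rec, "", "  ")
	fmt.Println(string(out))
}
```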
---
Avoiding Dashboard Pitfalls
- Over-specific dashboards: build views that answer broad, lasting triage questions, not just the "known unknowns" from a single incident.
- Static dashboards with no navigation: make sure there are clear paths between layers (metrics → logs → traces) so no view is a dead end.
---
The Four-Layer Lasagna
- Overview Dashboard — traffic-light triage view at product level.
- System Dashboard — SLIs and health of specific subsystems.
- Logs — filtered, detailed event views.
- Traces — request-level flow and timing.
Principle:
- Engineers unfamiliar with a dashboard must still be able to use it instantly.
- Click-through navigation between all layers.
---
Overview Dashboard Best Practices
- Single homepage in Grafana.
- Each row = subsystem health indicator.
- No deep detail — visual signposts only.
- Links direct to system dashboards.
---
System Dashboards
Show all metrics that indicate reliability and customer impact (see the instrumentation sketch after this list):
- Capacity vs. usage.
- Throughput rates.
- User-observed delays (time from alert to page).
- Outcome states beyond success/error.
- Direct links to filtered logs.
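The bullets above map naturally onto a handful of Prometheus series. A hedged sketch, assuming client_golang in Go; every metric name, label value, and subsystem here is illustrative rather than taken from the talk.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metrics backing a system dashboard for an alert-delivery subsystem.
var (
	// Capacity vs. usage: graph both series on one panel so headroom is visible.
	workerCapacity = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "escalation_worker_capacity",
		Help: "Configured number of escalation workers.",
	})
	workersBusy = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "escalation_workers_busy",
		Help: "Escalation workers currently processing.",
	})

	// Throughput, with outcome states beyond plain success/error.
	escalationsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "escalations_total",
		Help: "Escalations processed, labelled by outcome.",
	}, []string{"outcome"}) // e.g. "delivered", "acked", "timed_out", "rate_limited"

	// User-observed delay: time from the alert firing to the page arriving.
	alertToPageSeconds = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "alert_to_page_seconds",
		Help:    "Delay between alert creation and page delivery, as the user feels it.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s up to ~256s
	})
)

func main() {
	// Illustrative updates; real code would do this where the work actually happens.
	workerCapacity.Set(32)
	workersBusy.Set(7)
	escalationsTotal.WithLabelValues("delivered").Inc()
	alertToPageSeconds.Observe(4.2)
}
```

Graphing the capacity gauge on the same panel as the busy gauge is what keeps limits visible as dynamic lines rather than static constants.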
---
Logs and Traces
Logs:
- Consistent format across systems.
- Coupled closely with metrics (event logs).
- Always emitted, even on error paths.
Traces:
- Direct links from logs (see the sketch after this list).
- Capture all critical events.
- Record active processing in separate spans from time spent waiting on connections.
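A minimal sketch of tying logs and traces together, assuming Go's log/slog and the OpenTelemetry SDK; the tracer, function, and field names are illustrative. Once every log line carries the trace ID, a data link on that field in the log view can jump straight to the trace.

```go
package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the current trace and span IDs,
// so the log view can link each line directly to its trace.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, attrs ...any) {
	span := trace.SpanFromContext(ctx)
	if sc := span.SpanContext(); sc.IsValid() {
		attrs = append(attrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.InfoContext(ctx, msg, attrs...)
}

func deliverPage(ctx context.Context, logger *slog.Logger, escalationID string) {
	// Child span covering only active processing; waiting on the provider
	// connection would live in its own span so the two are distinguishable.
	// A real service would configure a TracerProvider at startup.
	ctx, span := otel.Tracer("escalator").Start(ctx, "deliver_page")
	defer span.End()

	logWithTrace(ctx, logger, "page delivered", slog.String("escalation_id", escalationID))
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil)) // one consistent JSON format everywhere
	deliverPage(context.Background(), logger, "esc_123")
}
```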
---
Practical Implementation Tips
- User Impact First — tie metrics to customer outcomes.
- Track User-Observed Times — measure delays as felt by users.
- Connect Metrics → Logs → Traces — make it impossible to hit a dead end.
- Use Exemplars — clickable dots in metrics linking directly to traces (see the sketch after this list).
- Visualize Limits — graph capacity lines dynamically, not as static constants.
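A hedged sketch of the exemplar tip, assuming Prometheus' client_golang and an OpenTelemetry trace in context; the metric name is illustrative, and exemplars are only exposed when the /metrics handler serves the OpenMetrics format. Grafana can then render each exemplar as a clickable dot that opens the linked trace.

```go
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

// Hypothetical latency histogram that will carry trace exemplars.
var pageLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "page_delivery_duration_seconds",
	Help:    "Time to deliver a page, with trace exemplars.",
	Buckets: prometheus.DefBuckets,
})

// observeWithTraceExemplar records a latency and, when a trace is active,
// attaches its ID as an exemplar so the metric panel can link to the trace.
func observeWithTraceExemplar(ctx context.Context, d time.Duration) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	if eo, ok := pageLatency.(prometheus.ExemplarObserver); ok && sc.IsValid() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	pageLatency.Observe(d.Seconds()) // no active trace: record without an exemplar
}

func main() {
	observeWithTraceExemplar(context.Background(), 120*time.Millisecond)
}
```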
---
When Is It "Done"?
Never completely — systems evolve. But you can reach confidence when:
- Tested under varied loads and failures.
- Responses are well understood.
- User impact is always clear.
---
Team Ownership
Reliability isn’t just about tools — it’s about every engineer being able to handle incidents:
- Run team drills (open-book).
- Hold quarterly “Game Days” (deep-end simulation).
- Rotate roles to share knowledge.
---
Summary
Moving from soup to lasagna requires:
- Iterative Predict → Prove → Measure loops.
- UX-oriented dashboard design.
- Tight layer integration.
- Shared team understanding of tools.
---