From Dashboard Soup to Observability Lasagna: Building a Better Observability System

Speaker: Martha Lambert — Product Engineer at incident.io, specializing in reliability and observability.

---

Overview

In this talk, you’ll learn how we transitioned from chaotic dashboard soup to a structured, layered observability lasagna.

Key takeaways:

  • A proven iterative process to "unsoup" your dashboards.
  • How layered observability can guide engineers through debugging.
  • Practical, technical tips for smooth integration into your stack.

---

The Backstory

In early 2024, we had built an on-call product — handling alerts, paging, and postmortems — with very high reliability requirements. We had two months to validate it before release.

The challenge was meta: if our own tool failed, pages would never fire, and customers wouldn't find out their systems were down.

Reliability Goals:

  • Proactive — Confident the system handles expected scenarios daily.
  • Reactive — Capable of rapid discovery and resolution when something breaks.

---

> Great observability underpins both goals. It’s not about expensive tools — it’s about having the right strategy.

---

The "Dashboard Soup" Problem

Dashboard soup = dozens of one-off dashboards created during incidents, never revisited, hard to find, with no clear investigation path.

How Layering Helps:

  • Start from system-wide KPIs.
  • Drill into service metrics → request traces → individual logs.
  • Maintain focus and prevent duplication.
  • Provide clear entry points and context for newer engineers.

---

Unsouping Our Stack — The Iterative Process

We applied a TDD-like loop for observability called:

Predict → Prove → Measure

1. Predict

Form hypotheses about system behavior under load.

Example:

> “Server response times stay under 200ms at peak.”

2. Prove

Run drills, apply real load, observe behavior — no assumptions.

3. Measure

Add metrics/logs/traces to cover gaps discovered during tests.

Repeat the loop until you are confident that unexpected events will be detected immediately.
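The Predict → Prove → Measure loop above can be sketched as a concrete check. A minimal Python sketch, assuming response-time samples (in milliseconds) collected during the Prove step; `hypothesis_holds` and its parameters are illustrative, not actual incident.io code:

```python
import statistics

def hypothesis_holds(samples_ms, threshold_ms=200.0, percentile=95):
    """Check the 'Predict' hypothesis: p95 response time stays under threshold.

    samples_ms are response times collected while the 'Prove' step
    applies real load to the system.
    """
    if not samples_ms:
        return False  # no evidence yet: the hypothesis is unproven
    # statistics.quantiles with n=100 yields the 1st..99th percentiles
    p = statistics.quantiles(sorted(samples_ms), n=100)[percentile - 1]
    return p < threshold_ms

# Drill result: mostly fast responses with a slow tail
drill = [120.0] * 95 + [250.0] * 5
print(hypothesis_holds(drill))  # the slow tail pushes p95 over 200ms
```

When the check fails, the Measure step asks which metric, log, or trace would have revealed the slow tail before the drill did.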

---

Team Drills — Stress Testing in Practice

Steps:

  • Gather engineers.
  • Brainstorm failure modes.
  • Simulate in controlled environment.
  • Observe, record, and share insights.

Record daily outcomes:

  • Confidence level.
  • What was tested.
  • Load volumes.
  • Code/observability changes made.

---

Avoiding Dashboard Pitfalls

  • Over-specific dashboards — build dashboards that answer broad, lasting triage questions, not only the "known unknowns" of a single incident.
  • Static dashboards with no navigation — make sure each layer links to the next: metrics → logs → traces.

---

The Four-Layer Lasagna

  • Overview Dashboard — traffic-light triage view at product level.
  • System Dashboard — SLIs and health of specific subsystems.
  • Logs — filtered, detailed event views.
  • Traces — request-level flow and timing.

Principle:

  • Engineers unfamiliar with a dashboard must still be able to use it instantly.
  • Click-through navigation between all layers.

---

Overview Dashboard Best Practices

  • Single homepage in Grafana.
  • Each row = subsystem health indicator.
  • No deep detail — visual signposts only.
  • Links direct to system dashboards.

---

System Dashboards

Show all metrics indicating reliability and customer impact:

  • Capacity vs. usage.
  • Throughput rates.
  • User-observed delays (time from alert to page).
  • Outcome states beyond success/error.
  • Direct links to filtered logs.
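Two of these signals can be sketched directly. A minimal example, assuming "user-observed delay" is measured from alert arrival to page delivery; the function names are illustrative, not incident.io's actual code:

```python
from datetime import datetime, timedelta

def user_observed_delay(alert_received: datetime, page_sent: datetime) -> float:
    """Delay as the user experiences it: alert arrival to page delivery, in seconds."""
    return (page_sent - alert_received).total_seconds()

def utilisation(used: float, capacity: float) -> float:
    """Capacity vs. usage as a ratio -- the kind of series worth graphing
    against a dynamic capacity line rather than a hardcoded constant."""
    return used / capacity if capacity else 1.0

received = datetime(2024, 3, 1, 12, 0, 0)
paged = received + timedelta(seconds=4.2)
print(user_observed_delay(received, paged))  # 4.2
print(utilisation(used=80, capacity=200))    # 0.4
```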

---

Logs and Traces

Logs:

  • Consistent format across systems.
  • Coupled closely with metrics (event logs).
  • Always emitted, even on error paths.
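The log guidelines above can be sketched with the standard library alone. A minimal example, assuming a JSON event-log format; the logger name and field names are illustrative:

```python
import json
import logging
import sys
import time

logging.basicConfig(stream=sys.stdout, format="%(message)s", level=logging.INFO)
log = logging.getLogger("escalations")  # subsystem name is illustrative

def emit(event: str, **fields) -> str:
    """Emit one JSON event log with a consistent shape across systems."""
    line = json.dumps({"event": event, "ts": time.time(), **fields})
    log.info(line)
    return line

def deliver_page(alert_id: str) -> None:
    outcome = "error"
    try:
        # ... real delivery work would happen here ...
        outcome = "delivered"
    finally:
        # the finally block guarantees the event is emitted on every path,
        # including error paths
        emit("page.delivery", alert_id=alert_id, outcome=outcome)

deliver_page("alert-123")
```

The `try`/`finally` pattern is what makes "always emitted, even on error paths" hold by construction rather than by discipline.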

Traces:

  • Direct links from logs.
  • Capture all critical events.
  • Separate spans for active processing vs. time spent waiting (e.g. on connections).
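Splitting processing time from waiting time can be sketched with a tiny span recorder. This is an illustrative stand-in for a real tracer (such as OpenTelemetry), not incident.io's actual instrumentation:

```python
import time
from contextlib import contextmanager

spans = []  # collected (name, kind, duration) tuples; a stand-in for a tracer

@contextmanager
def span(name: str, kind: str):
    """Record a span, tagging whether it was doing work or waiting."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, kind, time.perf_counter() - start))

with span("deliver_page", "processing"):
    with span("acquire_connection", "waiting"):
        time.sleep(0.01)  # simulated wait for a pool connection
    # ... actual delivery work would happen here ...

waiting = sum(d for _, k, d in spans if k == "waiting")
print(f"waiting time: {waiting:.3f}s")
```

Tagging spans this way lets a dashboard answer "is this slow because we're working, or because we're queuing?" at a glance.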

---

Practical Implementation Tips

  • User Impact First — tie metrics to customer outcomes.
  • Track User-Observed Times — measure delays as felt by users.
  • Connect Metrics → Logs → Traces — make it impossible to hit a dead end.
  • Use Exemplars — clickable dots in metrics linking directly to traces.
  • Visualize Limits — graph capacity lines dynamically, not as static constants.
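Connecting metrics to logs can be as simple as templating a deep link into each dashboard panel. A sketch with hypothetical query parameters; real Grafana or log-backend links use their own URL schemes:

```python
from urllib.parse import urlencode

def logs_link(base_url: str, subsystem: str, start: str, end: str) -> str:
    """Build a pre-filtered logs URL to embed in a dashboard panel, so a
    metric spike is one click away from the matching logs.

    The path and parameter names here are illustrative, not a real
    Grafana API.
    """
    params = {"subsystem": subsystem, "from": start, "to": end}
    return f"{base_url}/logs?{urlencode(params)}"

print(logs_link("https://grafana.example.com", "escalations",
                "2024-03-01T12:00:00Z", "2024-03-01T12:30:00Z"))
```

The point of the pattern: every panel carries its own escape hatch to the next layer down, so no investigation dead-ends on a graph.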

---

When Is It "Done"?

Never completely — systems evolve. But you can reach confidence when:

  • Tested under varied loads and failures.
  • Responses are well understood.
  • User impact is always clear.

---

Team Ownership

Reliability isn’t just about tools — it’s about every engineer being able to handle incidents:

  • Run team drills (open-book).
  • Hold quarterly “Game Days” (deep-end simulation).
  • Rotate roles to share knowledge.

---

Summary

Moving from soup to lasagna requires:

  • Iterative Predict → Prove → Measure loops.
  • UX-oriented dashboard design.
  • Tight layer integration.
  • Shared team understanding of tools.

---
