You're Asking the Wrong Question (On Reliability and SRE)

Honghao Wang

04 Nov 2025 — 4 min read
# Successful Public Speaking and Site Reliability Insights  
*By David Blank-Edelman — SRE Academy Program Lead, Microsoft*  

---

## Introduction  

I’m David Blank-Edelman, and a big part of my work involves **helping people with public speaking**.  
To kick things off, I’d like to share a selection from my *Top 10 Tips for Successful Public Speaking*.  

One unexpected tip is: **Insult the audience** — at least, figuratively.  
Not personally, but collectively, by asking provocative questions that open up new ways of thinking. This encourages curiosity and conversation.  

Speaking of questions, I believe we need **more poetry in our lives** — so let's start with a favorite passage from the poet Rilke:  

> *Have patience with everything unresolved in your heart and try to love the questions themselves… Live the questions now. Perhaps then… you will gradually live your way into the answer.*  

This talk will stay in the **land of questions** — because embracing unanswered questions benefits modern creators, engineers, and anyone blending human creativity with AI.

---

## How This Ties Into Creativity and AI  

Platforms like [AiToEarn](https://aitoearn.ai/) let creators:  
- Experiment with AI-driven content ideas  
- Publish seamlessly across Douyin, Bilibili, YouTube, Instagram, X, and more  
- Gather audience feedback and refine their approach  
- Monetize through an open-source, global content distribution ecosystem  

Much like “living the questions,” creators explore concepts freely before knowing the “right” answer.

---

## **1. Is My “Something” Working Reliably?**

### Reliability Is Multi-Dimensional  
When people hear “reliability,” they usually think only about **availability** (up or down).  
But reliability also includes:  
- **Latency** — Slow feels like down.  
- **Throughput** — Capacity to process required volume.  
- **Coverage** — Percent of intended data processed.  
- **Correctness** — Accuracy of output.  
- **Fidelity** — Consistency of full expected experience.  
- **Durability** — Writing and reading data intact.  
- **Freshness** — How quickly data reflects real changes.

**Key Principle:** Measure reliability from the **customer’s perspective**, not the component’s.
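
To make the dimensions above concrete, here is a minimal Python sketch that scores a batch of customer-facing requests on availability, latency, and correctness. The `Request` record shape, the `reliability_report` helper, and the 300 ms latency budget are illustrative assumptions, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One customer-facing request (hypothetical record shape)."""
    succeeded: bool    # did the customer get a response at all?
    latency_ms: float  # how long the customer waited
    correct: bool      # was the response content right?


def reliability_report(requests: list[Request],
                       latency_budget_ms: float = 300.0) -> dict[str, float]:
    """Return the fraction of requests that were good along each dimension."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to measure")
    available = sum(r.succeeded for r in requests)
    fast = sum(r.succeeded and r.latency_ms <= latency_budget_ms for r in requests)
    correct = sum(r.succeeded and r.correct for r in requests)
    # "overall" counts only requests that were good on every dimension at once.
    good = sum(
        r.succeeded and r.correct and r.latency_ms <= latency_budget_ms
        for r in requests
    )
    return {
        "availability": available / total,
        "latency": fast / total,
        "correctness": correct / total,
        "overall": good / total,
    }


sample = [
    Request(True, 120.0, True),
    Request(True, 950.0, True),   # up, but slow
    Request(True, 80.0, False),   # up and fast, but wrong
    Request(False, 0.0, False),   # down
]
report = reliability_report(sample)
print(report)
```

In this sample an availability-only view reports 75% while only one request in four was fully good for the customer, which is the gap the other dimensions are meant to expose.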

---

### Quiz Scenario: 100 Cloud Tote Bag Servers  
You run 100 servers for your tote bag business.  
14 fail unexpectedly — is it:  
A) Not a big deal  
B) Urgent but manageable  
C) Existential crisis  

**Answer:** It depends — on whether customers notice, speed degrades, or critical revenue streams fail.
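
One way to see why the answer is "it depends" is a rough capacity check. The sketch below assumes the servers are interchangeable and that peak demand can be expressed as a number of servers; `assess_server_loss` and `needed_at_peak` are hypothetical names used only for illustration.

```python
def assess_server_loss(total: int, failed: int, needed_at_peak: int) -> str:
    """Classify a server loss by whether the remaining capacity covers peak demand."""
    if failed < 0 or failed > total:
        raise ValueError("failed must be between 0 and total")
    healthy = total - failed
    if healthy < needed_at_peak:
        return "customers affected: demand exceeds remaining capacity"
    headroom = healthy - needed_at_peak
    return f"customers unaffected: {headroom} spare servers remain"


# The same 14 failures, two very different situations.
print(assess_server_loss(total=100, failed=14, needed_at_peak=60))
print(assess_server_loss(total=100, failed=14, needed_at_peak=95))
```

Real systems complicate this picture (stateful servers, uneven load, slow degradation), so treat it as a starting question rather than a verdict.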

---

Platforms like [AiToEarn](https://aitoearn.ai/):  
- Ensure multi-channel content remains not just available, but high-quality and consistent  
- Integrate reliability-like analytics ([AI Model Rankings](https://rank.aitoearn.ai)) for creator pipelines

---

## **2. How Do I Eliminate All Failures or Errors?**

### The SRE Mindset  
Two core questions:  
1. **How does a system work?**  
2. **How does a system fail?**  

Failures are **signals**, not enemies — they reveal how systems behave in real life.

---

### Curiosity as Driver  
SRE begins with curiosity about:  
- Operational load  
- Scalability  
- Accessibility  
- Speed and quality improvements

Sometimes letting **controlled errors through** teaches more than blocking all errors.
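
A common way SRE teams quantify how many errors they can afford to let through is an error budget derived from a service level objective. The summary above does not name this technique, so the sketch below is an assumption about how the idea might be applied; the function name and the 99.9% target are illustrative.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# With a 99.9% target over 1,000,000 requests, about 1,000 failures are tolerated.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget is left")
```

While budget remains, a team can accept some failures in exchange for learning or shipping faster. Once it is spent, the balance tips toward stability work.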

---

[**AiToEarn**](https://aitoearn.ai/) applies a similar iterative philosophy for creators:  
- Test AI-generated content across many platforms  
- Measure engagement and adapt  
- Monetize through continuous learning

---

## **3. What Is the Root Cause of “Some Outage”?**

### Complex Systems Fail in Chains  
Example: several people and conditions (a tripped cable, server configurations, database shards) each contribute to a single failure.

---

**Lesson:**  
- Rarely a single “root” cause — examine **contributing factors**  
- Use **blameless postmortems** to focus on learning rather than fault-finding

---

**Recommended Reading:**  
- Dr. Richard Cook — *How Complex Systems Fail*  
- Move away from simplistic “Five Whys” toward **systems thinking**

---

**Modern Knowledge Sharing:**  
[**AiToEarn**](https://aitoearn.ai/) supports publishing cross-platform incident reports, analyses, and AI-generated postmortems with analytics and ranking.

---

## **Common Traps in Post-Incident Reviews**

### 1. “Human Error” Shortcut  
Instead of stopping at “human error,” ask:
- Why was the mistake made?  
- What systemic issues enabled it?  

### 2. Counterfactual Reasoning  
Avoid “should have, could have” hindsight bias.  
Focus on **what was known at the time**.

### 3. Mechanistic Blame  
Systems would not run flawlessly if humans were removed; people add adaptive capacity.  
Ask what sustained the system’s *success*, not just what caused its failure.

### 4. Gatekeeping Role Trap  
SRE roles shift — firefighting → gatekeeping → advocacy → partnership.  
Avoid being a chokepoint; aim for collaboration.

---

## **4. How Can I Sell SRE Internally?**

### Avoid These Pitches:
- **Fear-based (insurance sales)**: for example, “Imagine the cost of downtime”  
- **Overpromising**: reliability is not a magic box with guaranteed outcomes

---

**Instead:**  
- Connect reliability directly to **business metrics** (customer retention, sales)  
- Show operational value with data and patterns

---

**Tip:** Use audience-appropriate terms, not only SLI/SLO jargon.

---

## **5. How Do We Automate Away Toil?**

### SRE Definition of Toil (per Vivek Rau)  
- Manual  
- Repetitive  
- No enduring value  
- Scales linearly with service size

---

Automation must fix the **root cause**, not just mask symptoms.  
For example, auto-restarting a server that leaks memory hides the toil rather than eliminating it.
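
If an auto-restart has to stay in place for now, one option is to make it surface the underlying problem instead of burying it. The sketch below is a hypothetical `RestartTracker` that counts restarts in a sliding time window and signals when the count passes a threshold; the class, its parameters, and the thresholds are illustrative assumptions.

```python
from collections import deque


class RestartTracker:
    """Count automated restarts and flag when they become frequent enough to investigate."""

    def __init__(self, max_restarts: int, window_s: float) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._times: deque[float] = deque()

    def record_restart(self, now_s: float) -> bool:
        """Record a restart at time now_s; return True if escalation is warranted."""
        self._times.append(now_s)
        # Drop restarts that fell out of the sliding window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.max_restarts


tracker = RestartTracker(max_restarts=3, window_s=3600.0)
results = [tracker.record_restart(t) for t in (0.0, 600.0, 1200.0, 1800.0)]
print(results)  # the fourth restart within the hour crosses the threshold
```

The restart still happens, but the automation now produces a signal that points back at the cause (the leak) rather than silently absorbing it.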

---

[**AiToEarn**](https://aitoearn.ai/) analogy: automate repetitive publishing *and* improve content quality to reduce creative toil.

---

## **6. Is My “Something” Resilient?**

### Fault Tolerance ≠ Resilience  
Resilience engineering focuses on:  
- **Adaptive capacity** — handling surprises  
- **Robustness** — coping with increasing stress  
- **Graceful extensibility** — responding beyond normal bounds  
- **Sustained adaptability** — adjusting over time

---

**Example:**  
- Spare tire = fault tolerance  
- Knowing multiple alternative transport options = resilience

---

**Recommended Reading:**  
- David Woods — *Resilience is a Verb*

---

**Customer Metrics:**  
Prefer direct impact measures over proxies; use proxies knowingly when needed.

---

## Adaptability in Practice

**Content Creators:**  
Adapt not just to traffic spikes but to changes in algorithms or platforms.  
Tools like [**AiToEarn**](https://aitoearn.ai/) aim to keep workflows resilient by integrating:  
- AI generation  
- Cross-platform publishing  
- Analytics  
- Model ranking

---

## Summary: Key Takeaways  

- **Reliability** is multi-dimensional — measure from the customer’s view.  
- **Failures** are valuable signals — the goal is learning, not complete elimination.  
- **Root cause** thinking is limited — focus on contributing factors and systems behavior.  
- Avoid **human error** shortcuts and **counterfactual bias** in incident reviews.  
- Roles in SRE evolve — aim for advocacy and partnership over gatekeeping.  
- Automate toil **by fixing causes**, not hiding symptoms.  
- Resilience means adapting to surprises — more than just redundancy.  

---

**Explore related resources:**  
- [AiToEarn official site](https://aitoearn.ai/): open-source AI content monetization platform  
- [AiToEarn Blog](https://blog.aitoearn.ai)  
- [AI Model Rankings](https://rank.aitoearn.ai)  
- [SRE Books & Workbooks](https://sre.google/books/)