Architectural Design Realism: A Conversation with Randy Shoup

# Podcast: Evolving Software After Failure — Resilience Through Events and Workflows

In this episode, **Michael Stiefel** speaks with **Randy Shoup** about evolving software after failure and improving resilience by **modeling transient states** using **events** and **workflows**.

> Failure is inevitable — but learning from it, and making cultural as well as technical changes, can greatly improve resilience.  
> Focus on **finding the truth**, not **finding the guilty**.  
> In an asynchronous world, transient events are key resilience touchpoints — often where failures happen or compensations must occur.  
> **Workflows** and **events** are the most effective tools for modeling such systems.

---

## Key Takeaways

- **Failures will happen**: Even improbable cases occur in large-scale systems. Learn deeply, beyond surface causes.
- **Postmortems drive cultural change**: Aim for truth-seeking, not blame. Culture impacts prevention and recovery.
- **Model reality’s asynchronous nature**: Events and workflows embrace real-world timing and complexity.
- **Expose transient states**: Visibility into intermediate steps increases resilience.
- **Accurate models lower cognitive load**: That clarity improves developer productivity and retention.

---



## Transcript Highlights

### Introduction
**Michael Stiefel** introduces guest **Randy Shoup**, whose three decades in distributed systems include roles at Oracle, eBay, Google, Stitch Fix, and Thrive Market.

---

### Learning From Failure

**Core Question:**  
> How do we integrate failure insights back into architecture and culture for more resilient systems?

**Key Practices:**
1. **Blameless Postmortems**
2. **Expand Beyond Proximate Causes**
3. **Iterative Architectural Changes**

**Example:**  
Postmortem analysis often reveals systemic causes, much as rigorous incident reviews have driven safety improvements in aviation.

---

### Proximate vs. Real Causes

**Randy Shoup’s 5-Step Framework:**
1. **Detect** — Could we spot it sooner?  
2. **Diagnose** — Could we understand faster?  
3. **Mitigate** — Could we limit the damage?  
4. **Remediate** — Could we fix the root?  
5. **Prevent** — Could we avoid it entirely?

---

### Case Study: Google App Engine Failure

**2012 Incident:**  
An 8-hour global outage affected 3M public apps and 15K internal services.  

**Response:**
- **Brainstorm all catastrophic failure causes**  
- Group into reliability themes:
  - Provisioning
  - Authentication / Authorization
  - Scalability / Load Management
- Assign owners, develop ideas, prioritize by effort (hour/day/week/month/year)
- Six months later, reliability issues had dropped 10×, accompanied by a **cultural shift** toward resilience.

---

### Cultural Lessons

- **Reliability is a P0 feature**  
- **Truth over blame** fosters transparency  
- Crisis can justify prioritizing resilience over new features

---

## Key Cultural Takeaways

- **Shared ownership**: SREs, Developers, Product, Support operate as one team.
- **Physical proximity** improves collaboration; remote teams need strong digital workflows.
- **Quality of life**: Preventive practices reduce stress and improve retention.

---

## Architecture Lessons

### Event-Driven Thinking
- Events make complex systems understandable and enable decoupled, scalable design.
- At eBay, events like `item.new` or `item.bid` had 10+ different consumers performing unique tasks.
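
The fan-out described above, where one event feeds several independent consumers, can be sketched in a few lines. This is a minimal in-memory illustration, not eBay's implementation: the `EventBus` class, the payload fields, and the two consumers (a search index and a notifier) are all hypothetical; only the event names `item.new` and `item.bid` come from the episode.

```python
from collections import defaultdict
from typing import Callable

Event = dict
Handler = Callable[[Event], None]


class EventBus:
    """In-memory publish/subscribe bus (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Event) -> int:
        """Deliver the event to every subscriber; return how many were notified."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


# Two hypothetical consumers, each doing its own job with the same event.
search_index: list[str] = []
notifications: list[str] = []

bus = EventBus()
bus.subscribe("item.new", lambda e: search_index.append(e["item_id"]))
bus.subscribe("item.new", lambda e: notifications.append(f"new listing {e['item_id']}"))
bus.subscribe("item.bid", lambda e: notifications.append(f"bid {e['amount']} on {e['item_id']}"))

delivered = bus.publish("item.new", {"item_id": "A1"})
bus.publish("item.bid", {"item_id": "A1", "amount": 25})
```

The publisher never learns who consumes the event, so adding an eleventh consumer means one more `subscribe` call rather than a change to the producer.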

### Handling Asynchronous Reality
- Minimize synchronous work during user actions.
- Use workflows/sagas to model multi-step processes with failure handling.
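
One way to read "minimize synchronous work" is: accept the user's action, enqueue the follow-up work, and return immediately. The sketch below assumes a single background worker and an in-process queue; the function names `place_bid` and `worker` and the job fields are hypothetical, and a production system would typically use a durable queue instead.

```python
import queue
import threading

work: queue.Queue = queue.Queue()
processed: list[str] = []


def place_bid(item_id: str, amount: int) -> str:
    """Synchronous path: enqueue the follow-up work and return at once."""
    work.put({"item_id": item_id, "amount": amount})
    return "accepted"


def worker() -> None:
    """Background path: slower follow-up work runs off the request path."""
    while True:
        job = work.get()
        if job is None:  # sentinel value: stop the worker
            break
        processed.append(f"notified watchers of {job['item_id']}")


thread = threading.Thread(target=worker)
thread.start()
status = place_bid("A1", 25)  # returns without waiting for the worker
work.put(None)
thread.join()
```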

### Example Workflow:
1. Create order
2. Reserve inventory
3. Charge payment
4. Ship item
5. Handle failures (refunds, replacement)
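
The steps above can be sketched as a saga: each step pairs an action with a compensation, and a failure triggers the compensations of already-completed steps in reverse order. This is plain Python under stated assumptions, not the API of Temporal or any other workflow engine; `run_saga`, `StepFailed`, and the context fields are hypothetical names chosen for the example.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative.
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]


class StepFailed(Exception):
    """Raised by an action that cannot complete."""


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
        except StepFailed:
            ctx["log"].append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate(ctx)
                ctx["log"].append(f"compensated:{done_name}")
            return False
        ctx["log"].append(f"done:{name}")
        completed.append(step)
    return True


def charge_payment(ctx: dict) -> None:
    if ctx["card_declined"]:
        raise StepFailed("card declined")
    ctx["charged"] = True


steps: list[Step] = [
    ("create_order", lambda c: c.update(order="created"), lambda c: c.update(order="cancelled")),
    ("reserve_inventory", lambda c: c.update(stock=c["stock"] - 1), lambda c: c.update(stock=c["stock"] + 1)),
    ("charge_payment", charge_payment, lambda c: c.update(charged=False)),  # compensation = refund
    ("ship_item", lambda c: c.update(shipped=True), lambda c: None),
]

ctx = {"stock": 5, "card_declined": True, "log": []}
ok = run_saga(steps, ctx)
```

With a declined card, the inventory reservation is released and the order is cancelled, and the shipping step never runs. The log in `ctx["log"]` records each intermediate outcome, which is the kind of visibility a monolithic transaction hides.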

---

## Exposing Transient States
- Model **real-world processes** in software.
- Provide visibility into intermediate workflow steps.
- Improves **failure response** and **system clarity**.
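
A small sketch of what exposing transient states can look like: intermediate steps become named states with explicit allowed transitions, and every change is recorded. The state names, the `ALLOWED` table, and the `Order` class are assumptions made for illustration; the episode does not prescribe a particular set of states.

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    INVENTORY_RESERVED = "inventory_reserved"
    PAYMENT_PENDING = "payment_pending"  # transient: charge sent, result unknown
    PAID = "paid"
    SHIPPED = "shipped"
    FAILED = "failed"


# Which states may follow which (hypothetical rules for this example).
ALLOWED = {
    OrderState.CREATED: {OrderState.INVENTORY_RESERVED, OrderState.FAILED},
    OrderState.INVENTORY_RESERVED: {OrderState.PAYMENT_PENDING, OrderState.FAILED},
    OrderState.PAYMENT_PENDING: {OrderState.PAID, OrderState.FAILED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.FAILED},
    OrderState.SHIPPED: set(),
    OrderState.FAILED: set(),
}


class Order:
    def __init__(self, order_id: str) -> None:
        self.order_id = order_id
        self.state = OrderState.CREATED
        self.history = [OrderState.CREATED]  # visible trail of every state

    def transition(self, new_state: OrderState) -> None:
        """Move to new_state, rejecting transitions the model does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.history.append(new_state)


order = Order("o-1")
order.transition(OrderState.INVENTORY_RESERVED)
order.transition(OrderState.PAYMENT_PENDING)
```

An order stuck in `PAYMENT_PENDING` is now something an operator can query and act on, instead of an invisible moment inside a transaction.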

---

## Recommended Tools

- **Temporal Workflow Engine** — models sagas cleanly with events.

---

## Resilience Principles
- **Model the world as it is**, not as you wish it were.
- Real workflows are easier than monolithic transactions — conceptually and operationally.
- Prioritize truth, iterative improvement, and collaborative culture to sustain resilience.

---

## Mentioned Resources
- [How Complex Systems Fail – Dr. Richard Cook](https://how.complexsystems.fail/)
- [Sidney Dekker – Drift Into Failure](https://api.pageplace.de/preview/DT0400.9781351942928_A34376077/preview-9781351942928_A34376077.pdf)
- [Five Whys](https://en.wikipedia.org/wiki/Five_whys)
- [Accelerate DORA Research](https://dora.dev/research/2024/)
- [*Vibe Coding* – Gene Kim & Steve Yegge](https://www.simonandschuster.com/books/Vibe-Coding/Gene-Kim/9781966280026)
- [Temporal](https://temporal.io/)

---

## Previous Podcasts & More Info

Explore more episodes via:
- [RSS Feed](http://www.infoq.com/podcasts/architecture-should-model-world/)
- [SoundCloud](https://soundcloud.com/infoq-channel)
- [Apple Podcasts](https://itunes.apple.com/gb/podcast/the-infoq-podcast/id1106971805?mt=2)
- [Spotify](https://open.spotify.com/show/4NhWaYYpPWgWRDAOqeRQbj)
- [Overcast](https://overcast.fm/itunes1106971805/the-infoq-podcast)
- [YouTube Playlist](https://youtube.com/playlist?list=PLndbWGuLoHeZLVC9vl0LzLvMWHzpzIpir&si=Kvb9UpSdGzObuWgg)
