A Brief Discussion on Incident Review

Types of Fault Sources

Faults in systems generally come from three primary sources:

  • Human-initiated changes (~80%) — Code, configuration, or release process changes triggered during business iteration.
  • Expected low-probability events — Hardware or network failures like disk damage or packet loss. Usually mitigated by Business Continuity Planning (BCP) and disaster recovery designs.
  • Unexpected system fragilities — Underlying performance bottlenecks, poor security defenses, configuration defects, or architecture flaws.

> Focus of this article: The first and most common source — human-initiated changes.

---

Service Availability: Nines and Operational Demands

  • From 99.9% (three nines) to 99.99% (four nines) → Achievable with robust disaster recovery.
  • From 99.99% (four nines) to 99.999% (five nines) → Requires extreme process discipline, tooling excellence, and cultural maturity.
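
To put these targets in concrete terms, here is a minimal, self-contained calculation of the yearly downtime budget each level of nines allows (plain arithmetic, nothing project-specific):

```python
# Annual downtime budget implied by each availability target (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label} ({availability:.3%}): ~{budget_minutes:.1f} minutes of downtime per year")
```

Five nines leaves roughly five minutes of downtime per year, which is why it cannot be reached by disaster recovery alone and demands the process and tooling discipline described above.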

When mitigating change-induced faults, center on:

  • Assessment
  • Verification
  • Monitoring

Goal:

> “Use processes to constrain people, and tools to replace manual execution.”

---

1. Insufficient Assessment

Common Problems

  • Missing code segments
  • No compatibility evaluation
  • Developers unsure of business context
  • Duplicate code implementations

Best Practices

  • Document standard business use cases — Keep knowledge consolidated.
  • Mandatory updates to use case docs before requirement reviews or code changes.
  • Release form requirements — Include impact assessments and key monitoring points.
  • Tools to assist code review (see the duplication-detection sketch after this list):
      • Static scanning
      • Call chain analysis
      • Duplication detection to prevent redundant implementations
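
As a rough illustration of duplication detection, the sketch below fingerprints normalized blocks of Python source and flags blocks that appear in more than one place; the directory name, window size, and file pattern are assumptions, not a prescribed tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def normalize(line: str) -> str:
    """Strip comments and surrounding whitespace so formatting noise does not hide duplicates."""
    return line.split("#", 1)[0].strip()


def block_fingerprints(path: Path, window: int = 8):
    """Yield (hash, approximate location) for every sliding window of normalized lines."""
    lines = [normalize(l) for l in path.read_text(encoding="utf-8").splitlines()]
    lines = [l for l in lines if l]  # drop blank and comment-only lines
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        yield hashlib.sha1(chunk.encode("utf-8")).hexdigest(), f"{path} (block {i + 1})"


def find_duplicates(root: str = "src") -> dict:
    """Group identical fingerprints across the tree; more than one hit suggests copy-paste."""
    seen = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        for digest, location in block_fingerprints(path):
            seen[digest].append(location)
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    for locations in find_duplicates().values():
        print("possible duplicate implementation:", "; ".join(locations))
```

Real projects would normally wire an existing scanner (for example a linter's copy-paste detector) into CI; the point is that the check runs automatically on every change instead of relying on a reviewer's memory.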

---

2. Inadequate Testing and Verification

Common Root Causes

  • Testing environment incomplete or unstable
  • Missing test data
  • Environment mismatches between testing and production

Best Practices

  • Environment governance
      • Containerized deployments
      • Full-link pre-release environments
      • Traffic coloring (tag-based routing; see the routing sketch after this list)
  • Data handling
      • Regularly migrate production data to testing
      • Generate test accounts and datasets via tools
  • Dependency consistency
      • Layer online (production) binaries over base container images
      • Automated connectivity probing
  • Mocking & automation
      • Mock services
      • Automated testing frameworks
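
To make traffic coloring concrete, the sketch below routes requests carrying a hypothetical "X-Traffic-Color" header to a full-link pre-release pool and everything else to production; in practice a gateway or service mesh would apply the same rule so the tag follows the request across every hop.

```python
# Minimal tag-based routing sketch; the header name and addresses are illustrative only.
PRODUCTION_POOL = ["10.0.0.11:8080", "10.0.0.12:8080"]   # hypothetical production instances
PRE_RELEASE_POOL = ["10.0.1.21:8080"]                     # hypothetical full-link pre-release instances


def pick_backend(headers: dict) -> str:
    """Send 'gray'-colored traffic to the pre-release pool, everything else to production."""
    color = headers.get("X-Traffic-Color", "").lower()
    pool = PRE_RELEASE_POOL if color == "gray" else PRODUCTION_POOL
    return pool[0]  # a real load balancer would pick among the pool, not just the first entry


# A tester's request colored at the entry gateway keeps its tag on every downstream call,
# so the whole call chain can be verified against production-like dependencies.
print(pick_backend({"X-Traffic-Color": "gray"}))  # -> pre-release instance
print(pick_backend({}))                           # -> production instance
```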

---

3. Monitoring Omissions or Delays

Problems

  • Main paths usually monitored
  • Edge branches often overlooked

Recommendations

  • Better too many alerts than a missed failure
  • Each branch path → Dedicated monitoring point linked to corresponding test case
  • Framework layer → Automatic monitoring and reporting
  • Client applications → Auto-report error codes
  • Any unexpected error code → Immediate action trigger
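
At the framework layer, "auto-report error codes and act on anything unexpected" can be as simple as keeping an allow-list of known codes and escalating everything else; the codes and the alert hook below are placeholders for whatever reporting and paging stack a team actually uses.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("error-report")

# Error codes the service is known to emit; anything outside this set means an
# edge branch is failing in a way no monitoring point was planned for.
KNOWN_ERROR_CODES = {0, 1001, 1002, 2001}  # hypothetical codes


def trigger_alert(message: str) -> None:
    """Hypothetical escalation hook; swap in the team's real paging/alerting client."""
    logger.error("ALERT: %s", message)


def report_error_code(code: int, context: str) -> None:
    """Record every error code and escalate immediately when the code is unexpected."""
    logger.info("error_code=%s context=%s", code, context)
    if code not in KNOWN_ERROR_CODES:
        trigger_alert(f"unexpected error code {code} from {context}")


# The client or framework calls this on every non-success response:
report_error_code(1001, "checkout/main-path")    # known code -> logged only
report_error_code(5123, "checkout/edge-branch")  # unknown code -> immediate alert
```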

---

4. Deployment & Gradual Rollout (Gray Release)

Key Principle

Separate deployment (binary release) from traffic rollout to reveal and isolate issues.

Best Practices

  • In small clusters, avoid the trap where deploying to a single server already exposes half of all traffic to the change.
  • Start the rollout by tagging users at the frontend; increase traffic only after deployment has fully completed.
  • Use gray tags for major frontend revamps to load different templates.
  • Ensure tooling for (see the rollout sketch after this list):
      • Unified and controllable deployment
      • Automation
      • Decoupled and reversible rollout
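
The sketch below shows one way to keep deployment and rollout decoupled: the new binary is already on every server, and exposure is controlled purely by a frontend gray tag plus a percentage dial, so rolling back is a configuration change rather than a redeploy. The hashing scheme and parameter names are illustrative, not a prescribed design.

```python
import hashlib


def in_gray_rollout(user_id: str, rollout_percent: int, gray_tagged: bool = False) -> bool:
    """Decide whether this user gets the new behaviour after the binary is fully deployed."""
    if gray_tagged:  # users explicitly marked at the frontend (internal testers, pilot users)
        return True
    # Hash the user id into a stable 0-99 bucket so a user's experience does not
    # flip-flop while the rollout percentage stays constant.
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent


# Day 1: only tagged users see the change; day 2: 5% of all traffic; rollback: set percent to 0.
print(in_gray_rollout("user-42", rollout_percent=0, gray_tagged=True))  # True
print(in_gray_rollout("user-42", rollout_percent=5))                    # stable True/False per user
```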

---

Conclusion

  • 80% of faults originate from human changes.
  • Processes = Front-line defense
  • Tools = Protective moat
  • Real resilience comes when change safety is embedded in the organization's culture, not just checked off a list.

---

Share Your Retrospectives

How do you conduct fault retrospectives?

Drop a comment to share your insights — you might receive exclusive community merchandise.

---

Helpful Tooling Inspiration

Teams aiming to standardize processes and strengthen tooling can explore solutions such as AiToEarn.

AiToEarn offers:

  • Cross-platform content publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Integrated analytics & automation
  • AI-powered workflows

While designed for content creation and monetization, its automation-first approach can inspire robust operational practices for change safety.
