A Brief Discussion on Incident Review

Types of Fault Sources

Faults in systems generally come from three primary sources:

  • Human-initiated changes (~80%) — Code, configuration, or release process changes triggered during business iteration.
  • Expected low-probability events — Hardware or network failures like disk damage or packet loss. Usually mitigated by Business Continuity Planning (BCP) and disaster recovery designs.
  • Unexpected system fragilities — Underlying performance bottlenecks, poor security defenses, configuration defects, or architecture flaws.

> Focus of this article: The first and most common source — human-initiated changes.

---

Service Availability: Nines and Operational Demands

  • From 99.9% (three nines) to 99.99% (four nines) → Achievable with robust disaster recovery.
  • From 99.99% (four nines) to 99.999% (five nines) → Requires extreme process discipline, tooling excellence, and cultural maturity.
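
To put these targets in concrete terms, here is a minimal, self-contained calculation of the yearly downtime budget each level of nines allows (plain arithmetic, nothing project-specific):

```python
# Annual downtime budget implied by each availability target (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label} ({availability:.3%}): ~{budget_minutes:.1f} minutes of downtime per year")
```

Five nines leaves roughly five minutes of downtime per year, which is why it cannot be reached by disaster recovery alone and demands the process and tooling discipline described above.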

When mitigating change-induced faults, center on:

  • Assessment
  • Verification
  • Monitoring

Goal:

> “Use processes to constrain people, and tools to replace manual execution.”

---

1. Insufficient Assessment

Common Problems

  • Missing code segments
  • No compatibility evaluation
  • Developers unsure of business context
  • Duplicate code implementations

Best Practices

  • Document standard business use cases — Keep knowledge consolidated.
  • Mandatory updates to use case docs before requirement reviews or code changes.
  • Release form requirements — Include impact assessments and key monitoring points.
  • Tools to assist code review (see the duplication-detection sketch after this list):
      • Static scanning
      • Call chain analysis
      • Duplication detection to prevent redundant implementations
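
As a rough illustration of duplication detection, the sketch below fingerprints normalized blocks of Python source and flags blocks that appear in more than one place; the directory name, window size, and file pattern are assumptions, not a prescribed tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def normalize(line: str) -> str:
    """Strip comments and surrounding whitespace so formatting noise does not hide duplicates."""
    return line.split("#", 1)[0].strip()


def block_fingerprints(path: Path, window: int = 8):
    """Yield (hash, approximate location) for every sliding window of normalized lines."""
    lines = [normalize(l) for l in path.read_text(encoding="utf-8").splitlines()]
    lines = [l for l in lines if l]  # drop blank and comment-only lines
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        yield hashlib.sha1(chunk.encode("utf-8")).hexdigest(), f"{path} (block {i + 1})"


def find_duplicates(root: str = "src") -> dict:
    """Group identical fingerprints across the tree; more than one hit suggests copy-paste."""
    seen = defaultdict(list)
    for path in Path(root).rglob("*.py"):
        for digest, location in block_fingerprints(path):
            seen[digest].append(location)
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    for locations in find_duplicates().values():
        print("possible duplicate implementation:", "; ".join(locations))
```

Real projects would normally wire an existing scanner (for example a linter's copy-paste detector) into CI; the point is that the check runs automatically on every change instead of relying on a reviewer's memory.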

---

2. Inadequate Testing and Verification

Common Root Causes

  • Testing environment incomplete or unstable
  • Missing test data
  • Environment mismatches between testing and production

Best Practices

  • Environment governance
      • Containerized deployments
      • Full-link pre-release environments
      • Traffic coloring (tag-based routing; see the routing sketch after this list)
  • Data handling
      • Regularly migrate production data to testing
      • Generate test accounts and datasets via tools
  • Dependency consistency
      • Layer online (production) binaries over base container images
      • Automated connectivity probing
  • Mocking & automation
      • Mock services
      • Automated testing frameworks
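
To make traffic coloring concrete, the sketch below routes requests carrying a hypothetical "X-Traffic-Color" header to a full-link pre-release pool and everything else to production; in practice a gateway or service mesh would apply the same rule so the tag follows the request across every hop.

```python
# Minimal tag-based routing sketch; the header name and addresses are illustrative only.
PRODUCTION_POOL = ["10.0.0.11:8080", "10.0.0.12:8080"]   # hypothetical production instances
PRE_RELEASE_POOL = ["10.0.1.21:8080"]                     # hypothetical full-link pre-release instances


def pick_backend(headers: dict) -> str:
    """Send 'gray'-colored traffic to the pre-release pool, everything else to production."""
    color = headers.get("X-Traffic-Color", "").lower()
    pool = PRE_RELEASE_POOL if color == "gray" else PRODUCTION_POOL
    return pool[0]  # a real load balancer would pick among the pool, not just the first entry


# A tester's request colored at the entry gateway keeps its tag on every downstream call,
# so the whole call chain can be verified against production-like dependencies.
print(pick_backend({"X-Traffic-Color": "gray"}))  # -> pre-release instance
print(pick_backend({}))                           # -> production instance
```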

---

3. Monitoring Omissions or Delays

Problems

  • Main paths usually monitored
  • Edge branches often overlooked

Recommendations

  • Better too many alerts than a missed failure
  • Each branch path → Dedicated monitoring point linked to corresponding test case
  • Framework layer → Automatic monitoring and reporting
  • Client applications → Auto-report error codes
  • Any unexpected error code → Immediate action trigger
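
At the framework layer, "auto-report error codes and act on anything unexpected" can be as simple as keeping an allow-list of known codes and escalating everything else; the codes and the alert hook below are placeholders for whatever reporting and paging stack a team actually uses.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("error-report")

# Error codes the service is known to emit; anything outside this set means an
# edge branch is failing in a way no monitoring point was planned for.
KNOWN_ERROR_CODES = {0, 1001, 1002, 2001}  # hypothetical codes


def trigger_alert(message: str) -> None:
    """Hypothetical escalation hook; swap in the team's real paging/alerting client."""
    logger.error("ALERT: %s", message)


def report_error_code(code: int, context: str) -> None:
    """Record every error code and escalate immediately when the code is unexpected."""
    logger.info("error_code=%s context=%s", code, context)
    if code not in KNOWN_ERROR_CODES:
        trigger_alert(f"unexpected error code {code} from {context}")


# The client or framework calls this on every non-success response:
report_error_code(1001, "checkout/main-path")    # known code -> logged only
report_error_code(5123, "checkout/edge-branch")  # unknown code -> immediate alert
```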

---

4. Deployment & Gradual Rollout (Gray Release)

Key Principle

Separate deployment (binary release) from traffic rollout to reveal and isolate issues.

Best Practices

  • In small clusters, avoid the trap where deploying to a single server already exposes half of all traffic to the change.
  • Start the rollout by tagging users at the frontend; increase traffic only after deployment has fully completed.
  • Use gray tags for major frontend revamps to load different templates.
  • Ensure tooling for (see the rollout sketch after this list):
      • Unified and controllable deployment
      • Automation
      • Decoupled and reversible rollout
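
The sketch below shows one way to keep deployment and rollout decoupled: the new binary is already on every server, and exposure is controlled purely by a frontend gray tag plus a percentage dial, so rolling back is a configuration change rather than a redeploy. The hashing scheme and parameter names are illustrative, not a prescribed design.

```python
import hashlib


def in_gray_rollout(user_id: str, rollout_percent: int, gray_tagged: bool = False) -> bool:
    """Decide whether this user gets the new behaviour after the binary is fully deployed."""
    if gray_tagged:  # users explicitly marked at the frontend (internal testers, pilot users)
        return True
    # Hash the user id into a stable 0-99 bucket so a user's experience does not
    # flip-flop while the rollout percentage stays constant.
    bucket = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent


# Day 1: only tagged users see the change; day 2: 5% of all traffic; rollback: set percent to 0.
print(in_gray_rollout("user-42", rollout_percent=0, gray_tagged=True))  # True
print(in_gray_rollout("user-42", rollout_percent=5))                    # stable True/False per user
```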

---

Conclusion

  • 80% of faults originate from human changes.
  • Processes = Front-line defense
  • Tools = Protective moat
  • Real resilience comes when change safety is embedded in the organization's culture, not just checked off a list.

---

Share Your Retrospectives

How do you conduct fault retrospectives?

Drop a comment to share your insights — you might receive exclusive community merchandise.

---

Helpful Tooling Inspiration

Teams aiming to standardize processes and strengthen tooling can explore solutions such as AiToEarn.

AiToEarn offers:

  • Cross-platform content publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Integrated analytics & automation
  • AI-powered workflows

While designed for content creation and monetization, its automation-first approach can inspire robust operational practices for change safety.
