A Brief Discussion on Incident Review
Types of Fault Sources
Faults in systems generally come from three primary sources:
- Human-initiated changes (~80%) — Code, configuration, or release process changes triggered during business iteration.
- Expected low-probability events — Hardware or network failures like disk damage or packet loss. Usually mitigated by Business Continuity Planning (BCP) and disaster recovery designs.
- Unexpected system fragilities — Underlying performance bottlenecks, poor security defenses, configuration defects, or architecture flaws.
> Focus of this article: The first and most common source — human-initiated changes.
---
Service Availability: Nines and Operational Demands
- From 99.9% (three nines) to 99.99% (four nines) → Achievable with robust disaster recovery.
- From 99.99% (four nines) to 99.999% (five nines) → Requires extreme process discipline, tooling excellence, and cultural maturity.
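To see what each of these steps actually means in allowed downtime, here is a quick back-of-the-envelope calculation (a minimal sketch; the helper name is mine, not from the article):

```python
# Convert an availability target into an annual downtime budget.
# Illustrative helper, not a specific tool from the article.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, target in [("three nines", 0.999),
                      ("four nines", 0.9999),
                      ("five nines", 0.99999)]:
    print(f"{label} ({target}): {downtime_budget_minutes(target):.1f} min/year")

# Roughly: 526 min (about 8.8 hours), 52.6 min, and 5.3 min per year.
# At five nines there is almost no room left to absorb a careless change.
```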
When mitigating change-induced faults, center on:
- Assessment
- Verification
- Monitoring
Goal:
> “Use processes to constrain people, and tools to replace manual execution.”
---
1. Insufficient Assessment
Common Problems
- Missing code segments
- No compatibility evaluation
- Developers unsure of business context
- Duplicate code implementations
Best Practices
- Document standard business use cases — Keep knowledge consolidated.
- Mandatory updates to use case docs before requirement reviews or code changes.
- Release form requirements — Include impact assessments and key monitoring points.
- Tools to assist code review (a toy duplication check follows this list):
  - Static scanning
  - Call chain analysis
  - Duplication detection to prevent redundant implementations
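As a toy illustration of how duplication detection can work, the sketch below hashes function bodies and flags functions that share a fingerprint. This is only an assumed, simplified approach, not a specific tool the article recommends:

```python
import ast
import hashlib
from collections import defaultdict

def function_fingerprints(source: str) -> dict[str, list[str]]:
    """Group function names by a fingerprint of their bodies.

    Hypothetical helper: parses a module, dumps each function body's AST,
    and hashes it. Functions sharing a fingerprint have structurally
    identical bodies and are worth a second look during code review.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            body_dump = ast.dump(ast.Module(body=node.body, type_ignores=[]))
            fingerprint = hashlib.sha1(body_dump.encode()).hexdigest()
            groups[fingerprint].append(node.name)
    return groups

source = """
def price_with_tax(p): return p * 1.13
def cost_with_tax(p): return p * 1.13
"""
for names in function_fingerprints(source).values():
    if len(names) > 1:
        print("possible duplicates:", names)
```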
---
2. Inadequate Testing and Verification
Common Root Causes
- Testing environment incomplete or unstable
- Missing test data
- Environment mismatches between testing and production
Recommended Solutions
- Environment Governance
  - Containerized deployments
  - Full-link pre-release environments
  - Traffic coloring (tag-based routing; see the routing sketch after this list)
- Data Handling
  - Regularly sync production data into the test environment
  - Generate test accounts and datasets via tools
- Dependency Consistency
  - Overlay production ("online") binaries onto base container images so test dependencies match production
  - Automated connectivity probing
- Mocking & Automation
  - Mock services
  - Automated testing frameworks
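To make the traffic coloring idea above concrete, here is a minimal sketch of tag-based routing; the header name, tag value, and upstream addresses are illustrative assumptions, not the article's actual setup:

```python
# Minimal sketch of tag-based routing ("traffic coloring").
# Header name and upstream addresses are illustrative assumptions.
PRERELEASE_TAG_HEADER = "x-traffic-color"

UPSTREAMS = {
    "prerelease": "http://order-service.prerelease.internal",
    "default": "http://order-service.prod.internal",
}

def pick_upstream(headers: dict[str, str]) -> str:
    """Route tagged ("colored") requests to the full-link pre-release environment."""
    color = headers.get(PRERELEASE_TAG_HEADER, "").lower()
    if color == "prerelease":
        return UPSTREAMS["prerelease"]
    return UPSTREAMS["default"]

# A tester's request carries the color tag end to end; normal users do not.
print(pick_upstream({"x-traffic-color": "prerelease"}))  # pre-release upstream
print(pick_upstream({}))                                 # production upstream
```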
---
3. Monitoring Omissions or Delays
Problems
- Main paths usually monitored
- Edge branches often overlooked
Recommendations
- Better to have too many monitoring points than to miss one entirely
- Each branch path → Dedicated monitoring point linked to corresponding test case
- Framework layer → Automatic monitoring and reporting
- Client applications → Auto-report error codes
- Any unexpected error code → Immediate action trigger (a minimal reporting sketch follows this list)
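A hedged sketch of the last two points, auto-reported error codes with an immediate trigger on anything unexpected; the function names and error codes are illustrative assumptions rather than a specific framework:

```python
# Minimal sketch: clients report error codes; anything outside the
# expected set raises an alert immediately instead of waiting for a
# human to notice a dashboard. All names here are illustrative.
import logging

EXPECTED_ERROR_CODES = {0, 1001, 1002}  # codes covered by existing test cases

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("error-code-monitor")

def trigger_alert(branch: str, code: int) -> None:
    # In a real system this would page on-call or open an incident.
    log.error("UNEXPECTED error code %s on branch %s, investigate now", code, branch)

def report_error_code(branch: str, code: int) -> None:
    """Record an error code for a branch path and alert on anything unexpected."""
    log.info("branch=%s code=%s", branch, code)
    if code not in EXPECTED_ERROR_CODES:
        trigger_alert(branch, code)

report_error_code("checkout.coupon", 1001)   # known code, just recorded
report_error_code("checkout.coupon", 5007)   # unknown code, alerts immediately
```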
---
4. Deployment & Gradual Rollout (Gray Release)
Key Principle
Separate deployment (binary release) from traffic rollout to reveal and isolate issues.
Best Practices
- For services running on only a few servers, avoid the trap where deploying to a single machine already exposes a large share of traffic (e.g., one of two servers means a 50% rollout).
- Start the rollout by tagging users at the frontend; ramp traffic only after the binary deployment has fully completed (a user-bucketing sketch follows this list).
- Use gray tags for major frontend revamps to load different templates.
- Ensure tooling for:
  - Unified and controllable deployment
  - Automation
  - Decoupled and reversible rollout
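To show how traffic rollout can be decoupled from deployment, here is a minimal user-bucketing sketch; the hashing scheme and function names are assumptions for illustration, not the article's tooling:

```python
import hashlib

def in_gray_rollout(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) for a given feature.

    The same user always lands in the same bucket, so ramping
    rollout_percent from 1 -> 10 -> 50 -> 100 only ever adds users,
    independently of which servers already run the new binary.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Deployment is already finished everywhere; exposure is controlled separately.
for percent in (1, 10, 50, 100):
    exposed = sum(in_gray_rollout(f"user-{i}", "new-checkout", percent)
                  for i in range(10_000))
    print(f"{percent:>3}% target -> {exposed / 100:.1f}% of sample exposed")
```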
---
Conclusion
- 80% of faults originate from human changes.
- Processes = Front-line defense
- Tools = Protective moat
- Real resilience comes when change safety is embedded in the organization's culture, not just checked off a list.
---
Share Your Retrospectives
How do you conduct fault retrospectives?
Drop a comment to share your insights — you might receive exclusive community merchandise.
---
Helpful Tooling Inspiration
Teams aiming to standardize processes and strengthen tooling can explore solutions like the AiToEarn official website.
AiToEarn offers:
- Cross-platform content publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- Integrated analytics & automation
- AI-powered workflows
While designed for content creation and monetization, its automation-first approach can inspire robust operational practices for change safety.