# Don’t Let Incident Postmortems Become Mere Formalities
## Use AI to Extract Value from Every “Fall”


This article examines why **incident postmortems** often drift into superficial formalities: root causes are concealed, event chains stay fragmented, and support roles lack deep technical insight. It proposes an **AI-powered Intelligent Postmortem Agent** to address these pain points.

You’ll learn:

- **Architecture design**: data collection, preprocessing, memory, intent recognition
- **Prompt iteration** across multiple versions
- **Real-world cases** showing impact

By applying AI to postmortem document generation, fault tree analysis, tagging, and Q&A, teams can **turn incidents into data assets**. This enables a shift from passive response to proactive defense and raises the level of professionalism for technical support, R&D, and non-technical stakeholders.

---
## Business Perspective: Why Postmortems Matter
> **Blameless culture:** You can’t “fix” people, but you can fix the systems and processes that help people make better decisions when designing and maintaining complex systems.

Technical support has three primary missions:

1. **Risk prevention**
2. **Real-time emergency collaboration**
3. **Post-incident review with follow-up improvements**
### Postmortems Deliver Two Core Benefits
1. **Risk Detection & Closure** — Accurately find and fix risks before they affect sister systems.
2. **Incident → Asset Conversion** — Use past incidents as case-based learning for newer engineers and other teams.
### Common Pitfalls
- Authors are reluctant to document the deeper causes of change-related failures
- Teams in the middle of the call chain avoid reconstructing the event chain (“not my problem”)
- Postmortems are cut short once the fix ships, and review meetings steer away from assigning responsibility
- Support lacks architectural depth, so dev teams dismiss its suggestions without discussion

**Ideal state**: Minor incidents are handled entirely online, incident records are auto-linked to change, monitoring, and rollout data, and developers run self-service postmortems embedded in the DevOps loop, practicing **You built it, you own it**.

---
## AI-Powered Capabilities for Postmortems
Key functions include:
- **Auto summarization**: Incident overview, timeline, and scope, integrated with monitoring and change systems to produce a first draft immediately after recovery.
- **Root Cause Analysis (RCA) & Fault Tree Analysis (FTA)**: Map deep causes and generate focus points for the review.
- **Multi-dimensional tagging**: Turn postmortem docs into structured knowledge assets.
- **Risk integration**: Feed detected risks into **Health Center** dashboards.
- **Natural language fault Q&A**: Search, count, analyze incident history interactively.
- **Common risk extraction**: Learn patterns from bulk incidents.

Mission: Make postmortems actionable foresight, not just hindsight.

---
## Intelligent Postmortem Agent — Overview
- **Core mission:** Turn every incident into predictable, preventable insight.
- **Target users:** SREs, DevOps, technical support, stability managers
- **Value:** AI-assisted deep analysis replacing manual “form-filling”
- **Solution:** LLM-driven full-process fault review support

### Functional Panorama

- **Data aggregation** from chats, meetings, emergency platforms
- **One-click drafts**: overview, timeline, impact scope
- **Conversational deep-dive**: cause analysis, hidden focus points
- **Knowledge-enhanced suggestions** tied to tech stack
- **Closed-loop knowledge assets** feeding back to models
---
## Technical Architecture
### Multi-Agent System

Agents specialize by role/task; orchestration routes intent to relevant experts.

Workflow:
1. **Question & Plan** — Interpret user intent
2. **Orchestrate** — Assign tasks to specialist agents
3. **Execution** — Query APIs, RAG KBs, perform deep reasoning
4. **Integrate** — Compose coherent answer for user
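
The four workflow steps can be sketched as a small pipeline. This is a minimal illustration, not the article's implementation: the `Task` type, the `SPECIALISTS` table, and the keyword-based `plan` function are all hypothetical stand-ins, and a real system would use an LLM for planning and real API or RAG calls for execution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    agent: str      # name of the specialist agent that should handle the task
    payload: str    # the part of the user's request assigned to that agent


def plan(question: str) -> List[Task]:
    """Question & Plan: a keyword stand-in for LLM-based intent interpretation."""
    q = question.lower()
    tasks = []
    if "timeline" in q:
        tasks.append(Task("timeline", question))
    if "root cause" in q or "why" in q:
        tasks.append(Task("rca", question))
    return tasks or [Task("chat", question)]


# Execution: each specialist would query APIs, a RAG knowledge base, or an LLM.
# The lambdas below only return placeholder strings.
SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "timeline": lambda p: "timeline: (events pulled from monitoring)",
    "rca": lambda p: "rca: (candidate causes from deep reasoning)",
    "chat": lambda p: "chat: (direct answer)",
}


def orchestrate(tasks: List[Task]) -> List[str]:
    """Orchestrate and execute: dispatch each task to its specialist."""
    return [SPECIALISTS[t.agent](t.payload) for t in tasks]


def integrate(results: List[str]) -> str:
    """Integrate: compose the partial results into one answer."""
    return "\n".join(results)


def answer(question: str) -> str:
    return integrate(orchestrate(plan(question)))
```

Calling `answer("Give me the timeline and root cause")` runs two specialists and joins their outputs, while a question that matches no keyword falls back to the chat specialist.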

---
## Core Technologies
### 4.1 Data Collection & Preprocessing
- **Challenges:** Heterogeneous formats, uneven density, noise → reduced efficiency and accuracy
- **Goal:** Unified fault data layer for RCA — aggregate messaging, meeting, monitoring, release data → 360° incident view
- **Methods:** noise reduction, speech alignment, multi-modal fusion, timeline reconstruction, causal inference
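
As one illustration of noise reduction and timeline reconstruction, the sketch below merges events from several sources into a single ordered timeline. The `Event` type, the `NOISE` word list, and `build_timeline` are assumptions made for this example; the article does not describe its actual data model, and real pipelines would also need speech alignment and causal inference, which are not shown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str   # e.g. "chat", "monitoring", "release"
    text: str


NOISE = {"ok", "ack", "+1", "thanks"}  # hypothetical chat filler to drop


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from all sources, drop noise and duplicates, sort by time."""
    seen = set()
    merged = []
    for source in sources:
        for ev in source:
            if ev.text.strip().lower() in NOISE:
                continue  # noise reduction: skip low-information messages
            key = (ev.ts, ev.text)
            if key in seen:
                continue  # the same message forwarded into several channels
            seen.add(key)
            merged.append(ev)
    return sorted(merged, key=lambda ev: ev.ts)
```

Passing the chat, monitoring, and release event lists to `build_timeline` yields one chronological view that later analysis steps can consume.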

---
### 4.2 Memory Management
#### Pain point
Multi-turn, long-running tasks quickly exceed the token limit and lose key context.
#### Approach
**Noise Reduction → Summarization → Preservation**

Compared with general-purpose agents, postmortem sessions run longer, carry a higher density of valuable memory, and contain more noise.

**Strategies:**
- Intelligent summarization that preserves recent context and system prompts
- Structured summaries (an eight-section recap)
- Preservation of critical early instructions

```python
async def _execute_fifo_summary(self, messages: List[Message]) -> List[Message]:
    # 1. Separate System messages
    ...
```
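
The fragment above is elided in the source, so the following is only a guess at the general shape of a FIFO summarization step, not the author's code. It assumes a simple `Message` dataclass, a `keep_recent` parameter, and a placeholder `summarize` coroutine where a real system would call an LLM.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


async def summarize(messages: List[Message]) -> str:
    """Placeholder for an LLM call; here it only truncates each message."""
    return " | ".join(m.content[:40] for m in messages)


async def fifo_summary(messages: List[Message], keep_recent: int = 4) -> List[Message]:
    # 1. Separate system messages so the original instructions are never summarized away.
    system = [m for m in messages if m.role == "system"]
    dialog = [m for m in messages if m.role != "system"]
    if len(dialog) <= keep_recent:
        return messages
    # 2. The oldest turns leave first (FIFO) and are compressed into one summary.
    split = len(dialog) - keep_recent
    old, recent = dialog[:split], dialog[split:]
    recap = Message("assistant", "Summary of earlier turns: " + await summarize(old))
    # 3. Reassemble: system prompts, the summary, then the recent turns verbatim.
    return system + [recap] + recent
```

With six user turns and `keep_recent=2`, the result holds the system prompt, one summary message covering the first four turns, and the last two turns unchanged.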

---
### 4.3 Intent Recognition
The design moved from a monolithic “all-in-one” agent to **multi-agent routing and nesting**:
- **ChatAgent**: Q&A, light analysis
- **WorkAgent**: Task execution, professional output
- **Selector**: Routes to domain-specific agents
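
Routing and nesting can be illustrated with a selector that is itself usable as an agent. In this sketch the keyword matching is a simplification (a production selector would more likely classify intent with an LLM), and the route keywords and domain agent names are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class Selector:
    """Routes a request to the first agent whose keyword appears in it."""

    def __init__(self, routes: Dict[str, Handler], default: Handler):
        self.routes = routes
        self.default = default

    def __call__(self, request: str) -> str:
        text = request.lower()
        for keyword, handler in self.routes.items():
            if keyword in text:
                return handler(request)
        return self.default(request)


def chat_agent(request: str) -> str:
    return "ChatAgent: quick answer"


# Domain-specific agents nested under the WorkAgent (names are hypothetical).
work_agent = Selector(
    routes={
        "draft": lambda r: "WorkAgent/draft: postmortem draft",
        "fault tree": lambda r: "WorkAgent/fta: fault tree analysis",
    },
    default=lambda r: "WorkAgent: generic task",
)

# Top-level selector: task-like requests go to the WorkAgent, the rest to chat.
root = Selector(
    routes={"generate": work_agent, "draft": work_agent, "fault tree": work_agent},
    default=chat_agent,
)
```

Because a `Selector` has the same call signature as any other handler, selectors can be nested to any depth without the top level knowing about the domain agents below it.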

---
### 4.4 Page Interaction Interface
Enhancements over general agents:
- Step-level streaming with exposure control
- Cache-based message producer/consumer
- Frontend components registered as tools for rich UI
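
The producer/consumer pattern with exposure control might look like the sketch below. It assumes an in-memory `asyncio.Queue` in place of whatever cache the real system uses, and the `Step` type with its `expose` flag is invented for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    name: str
    text: str
    expose: bool = True   # exposure control: internal steps stay hidden


async def producer(queue: "asyncio.Queue[Optional[Step]]") -> None:
    """The agent pushes each finished step into the cache as soon as it is ready."""
    for step in (
        Step("plan", "Collecting alerts and change records"),
        Step("scratchpad", "raw tool output", expose=False),
        Step("answer", "Draft timeline ready"),
    ):
        await queue.put(step)
    await queue.put(None)  # sentinel: the stream is finished


async def consumer(queue: "asyncio.Queue[Optional[Step]]") -> List[str]:
    """The page reads from the cache and renders only the exposed steps."""
    shown = []
    while (step := await queue.get()) is not None:
        if step.expose:
            shown.append(f"[{step.name}] {step.text}")
    return shown


async def main() -> List[str]:
    queue: "asyncio.Queue[Optional[Step]]" = asyncio.Queue()
    _, shown = await asyncio.gather(producer(queue), consumer(queue))
    return shown
```

Running `asyncio.run(main())` returns the two exposed steps and omits the scratchpad step, which is the effect exposure control is meant to have on the page.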

---
### 4.5 RAG Knowledge Enhancement
RAG is essential for injecting private-domain expertise into the LLM context.
The loop is closed: detect blind spots → turn them into Q&A pairs → add the pairs to the knowledge base → retrieve them through RAG.
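
The closed loop can be sketched with a toy knowledge base. Everything here is an assumption for illustration: the `KnowledgeBase` class, the `min_overlap` threshold, and the word-overlap scoring, which stands in for the embedding-based retrieval a real RAG system would typically use.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def tokens(text: str) -> set:
    return set(text.lower().split())


@dataclass
class KnowledgeBase:
    pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer)
    blind_spots: List[str] = field(default_factory=list)

    def retrieve(self, query: str, min_overlap: int = 2) -> Optional[str]:
        """Return the answer whose question shares the most words with the query."""
        best, best_score = None, 0
        for question, answer in self.pairs:
            score = len(tokens(query) & tokens(question))
            if score > best_score:
                best, best_score = answer, score
        if best_score < min_overlap:
            self.blind_spots.append(query)  # nothing relevant found: record a blind spot
            return None
        return best

    def fill(self, question: str, answer: str) -> None:
        """An expert answers a blind spot; the Q&A pair joins the knowledge base."""
        self.pairs.append((question, answer))
        if question in self.blind_spots:
            self.blind_spots.remove(question)
```

A query that finds nothing is logged as a blind spot; once an expert supplies the answer through `fill`, later queries with similar wording retrieve it.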

---
### 4.6 Evaluation Mechanism
Evaluation moved from **lexical metrics** to **business-value scoring**:
- ROUGE/BLEU → BERTScore → LLM-as-Judge → scoring on **Depth/Logic/Actionability/Evidence**
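
One way to combine the four dimensions into a single number is a weighted average of judge ratings. The weights and the 1 to 5 rating scale below are assumptions; the article names the dimensions but does not say how they are scored or weighted.

```python
from typing import Dict

# Hypothetical weights; the article names the dimensions but not their weighting.
WEIGHTS: Dict[str, float] = {
    "depth": 0.3,
    "logic": 0.2,
    "actionability": 0.3,
    "evidence": 0.2,
}


def business_value_score(ratings: Dict[str, int]) -> float:
    """Combine per-dimension judge ratings (1 to 5) into one weighted score."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= ratings[name] <= 5:
            raise ValueError(f"{name} rating must be between 1 and 5")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
```

The ratings themselves would come from an LLM judge prompted with a rubric for each dimension; this function only aggregates them.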

---
### 4.7 Prompt Optimization
- **V1: General prompts.** Output was too generic.
- **V2: Risk tags + chain-of-thought (CoT).** Structure improved, but the suggested improvements were still generic.
- **V3: Split tag libraries.** Precision improved, but findings were still forced to match a tag.
- **V4: Tags discarded.** The prompt became question-first, evidence-based, actionable, and anti-hallucination.
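
A V4-style prompt could be assembled as in the sketch below. The wording of the instructions and the `build_v4_prompt` helper are illustrative assumptions; the article does not publish its actual prompt text.

```python
from typing import List


def build_v4_prompt(incident_summary: str, evidence: List[str]) -> str:
    """Assemble a question-first, evidence-based prompt (wording is illustrative)."""
    numbered = "\n".join(f"[{i}] {item}" for i, item in enumerate(evidence, start=1))
    return (
        "You are reviewing an incident postmortem.\n"
        f"Incident summary:\n{incident_summary}\n\n"
        f"Evidence:\n{numbered}\n\n"
        "Instructions:\n"
        # Question-first: open questions come before any conclusions.
        "1. Start by listing the open questions this incident raises.\n"
        # Evidence-based: every finding must point at a numbered item.
        "2. Support every finding with an evidence number such as [2].\n"
        # Actionable: recommendations must be concrete.
        "3. Turn each recommendation into a concrete, assignable action.\n"
        # Anti-hallucination: allow the model to decline.
        "4. If the evidence does not support a conclusion, say 'insufficient evidence' "
        "instead of guessing.\n"
    )
```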
---
## Multi-Role Empowerment
| Role | Pain Points | AI Empowerment | Benefits |
|-----------------|-------------|----------------|----------|
| Technical Support | Dispersed info, shallow analysis | Draft generation, completeness check, focus point assist | Faster, deeper, more professional reviews |
| R&D | Fragmented causes | Attribution guidance, blind spot alerts | Higher quality cause chains |
| General Users | Cannot understand reports | One-line summary, visual FTA, Q&A | Instant comprehension, richer knowledge assets |
---
### Technical Support Workflow
1. An initial draft is generated automatically after the incident is resolved
2. Interactive deep cause analysis
3. Risks are uncovered and architectural factors are brought into the analysis

---
### R&D Workflow
1. The initial draft helps engineers recall the facts
2. Cause quality analysis, followed by prompts about blind spots

---
### General User Workflow
1. A concise summary is generated automatically for non-technical readers
2. Visual process diagrams and FTA trees

---
**Disclaimer:** All cases are fictional for illustration.

---
**Key takeaway:** AI-assisted postmortems deepen incident analysis, turn failures into knowledge, and foster proactive risk management. Pair the **Intelligent Postmortem Agent** with your DevOps workflows to strengthen operational resilience.