Don't Let Incident Reviews Become Formalities: Use AI to Uncover the Value of Every Failure

Honghao Wang

04 Nov 2025 — 5 min read

# Don’t Let Incident Postmortems Become Mere Formalities  
## Use AI to Extract Value from Every “Fall”  

![image](https://blog.aitoearn.ai/content/images/2025/11/img_001-77.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_002-73.jpg)  

This article examines why **incident postmortems** often drift into superficial formalities: root causes get concealed, event chains stay fragmented, and support roles lack deep technical insight. It then proposes an **AI‑powered Intelligent Postmortem Agent** to address these pain points.

You’ll learn:
- **Architecture design**: data collection, preprocessing, memory, intent recognition  
- **Prompt iteration** across multiple versions  
- **Real-world cases** showing impact  

By applying AI to postmortem document generation, fault tree analysis, tagging, and Q&A, teams can **turn incidents into data assets** — enabling a shift from passive response to proactive defense, and raising professionalism for technical support, R&D, and non-technical stakeholders.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_003-66.jpg)  

---

## Business Perspective: Why Postmortems Matter  

> **Blameless culture:** You can’t “fix” people — but you can fix systems and processes that help people make better decisions in designing and maintaining complex systems.

Technical support has three primary responsibilities:
1. **Risk prevention**  
2. **Real-time emergency collaboration**  
3. **Post-incident review with follow-up improvements**  

### Postmortems Deliver Two Core Benefits  
1. **Risk Detection & Closure** — Accurately find and fix risks before they affect sister systems.  
2. **Incident → Asset Conversion** — Use past incidents as case-based learning for newer engineers and other teams.

### Common Pitfalls  
- Teams are reluctant to document the deeper causes of change-related failures  
- Teams in the middle of the call chain decline to help reconstruct the full event chain (“not my problem”)  
- Postmortems are cut short once the fix is in, and review meetings steer away from accountability  
- Support lacks architectural depth, and dev teams dismiss its suggestions without discussion  

**Ideal state**: Minor incidents are handled entirely online, incident records are auto-linked to change, monitoring, and rollout data, and developers run self-service postmortems that plug into the DevOps loop, practicing **You built it, you own it**.

---

## AI-Powered Capabilities for Postmortems  

Key functions include:

- **Auto summarization**: Incident overview, timeline, scope — integrated with monitoring/change systems to produce first draft immediately after recovery.  
- **Root Cause Analysis (RCA) & Fault Tree Analysis (FTA)**: Map deep causes and generate focus points for the review.
- **Multi-dimensional tagging**: Turn postmortem docs into structured knowledge assets.
- **Risk integration**: Feed detected risks into **Health Center** dashboards.
- **Natural language fault Q&A**: Search, count, analyze incident history interactively.  
- **Common risk extraction**: Learn patterns from bulk incidents.
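
To make the FTA capability concrete, here is a minimal sketch of a fault tree with AND/OR gates. The `Node` class, the `occurs` function, and the sample events are hypothetical illustrations, not the agent's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" (a leaf event)
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, observed: Set[str]) -> bool:
    """Return True if this node's event happens given the observed basic events."""
    if node.gate == "BASIC":
        return node.name in observed
    results = [occurs(child, observed) for child in node.children]
    # An AND gate needs every child event; an OR gate needs at least one.
    return all(results) if node.gate == "AND" else any(results)


# Made-up example tree: the top event fires via either branch.
tree = Node("checkout outage", "OR", [
    Node("db overload", "AND", [
        Node("slow query released"),
        Node("no rate limit"),
    ]),
    Node("network partition"),
])
```

Evaluating the tree against the basic events observed during an incident shows which branch explains the top-level failure, which is the kind of structure a visual FTA view could render.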

![image](https://blog.aitoearn.ai/content/images/2025/11/img_004-62.jpg)  

Mission: Make postmortems actionable foresight, not just hindsight.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_005-53.jpg)  

---

## Intelligent Postmortem Agent — Overview  

**Core mission:** Turn every incident into predictable, preventable insight.  

**Target users:** SREs, DevOps, technical support, stability managers  
**Value:** AI-assisted deep analysis replacing manual “form-filling”  
**Solution:** LLM-driven full-process fault review support  

### Functional Panorama  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_006-48.jpg)  

- **Data aggregation** from chats, meetings, emergency platforms  
- **One-click drafts**: overview, timeline, impact scope  
- **Conversational deep-dive**: cause analysis, hidden focus points  
- **Knowledge-enhanced suggestions** tied to tech stack  
- **Closed-loop knowledge assets** feeding back to models

---

## Technical Architecture  

### Multi-Agent System  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_008-39.jpg)  

Agents specialize by role/task; orchestration routes intent to relevant experts.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_009-38.jpg)  

Workflow:
1. **Question & Plan** — Interpret user intent  
2. **Orchestrate** — Assign tasks to specialist agents  
3. **Execute** — Query APIs, RAG KBs, perform deep reasoning  
4. **Integrate** — Compose coherent answer for user
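
The four workflow steps can be sketched as a small pipeline. This is an illustrative toy under stated assumptions: the agent names, the keyword rule in `plan`, and every function here are hypothetical, and a real system would use an LLM for planning and real API or RAG calls for execution.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    agent: str  # name of the specialist agent that should handle the task
    query: str  # what the agent is asked to do


# Hypothetical specialist agents returning canned strings.
def timeline_agent(query: str) -> str:
    return f"[timeline] reconstructed events for: {query}"


def rca_agent(query: str) -> str:
    return f"[rca] candidate causes for: {query}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "timeline": timeline_agent,
    "rca": rca_agent,
}


def plan(question: str) -> List[Task]:
    """Step 1: interpret intent (toy keyword rule standing in for an LLM)."""
    tasks = [Task("timeline", question)]
    lowered = question.lower()
    if "why" in lowered or "cause" in lowered:
        tasks.append(Task("rca", question))
    return tasks


def orchestrate_and_execute(tasks: List[Task]) -> List[str]:
    """Steps 2 and 3: route each task to its agent and collect the results."""
    return [AGENTS[task.agent](task.query) for task in tasks]


def integrate(results: List[str]) -> str:
    """Step 4: compose one answer from the partial results."""
    return "\n".join(results)


def answer(question: str) -> str:
    return integrate(orchestrate_and_execute(plan(question)))
```

Keeping planning, dispatch, and integration as separate functions mirrors the role split described above, so any one stage can be swapped without touching the others.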

![image](https://blog.aitoearn.ai/content/images/2025/11/img_010-37.jpg)  

---

## Core Technologies  

### 4.1 Data Collection & Preprocessing  
- **Challenges:** Heterogeneous formats, uneven density, noise → reduced efficiency and accuracy  
- **Goal:** Unified fault data layer for RCA — aggregate messaging, meeting, monitoring, release data → 360° incident view  
- **Methods:** noise reduction, speech alignment, multi-modal fusion, timeline reconstruction, causal inference  
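
As a rough illustration of two of these methods, noise reduction and timeline reconstruction, the sketch below merges events from several sources into one ordered timeline. The `Event` type, the `NOISE` word list, and `build_timeline` are assumptions made for this example. Speech alignment, multi-modal fusion, and causal inference are out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str  # e.g. "chat", "monitoring", "release"
    text: str


# Hypothetical list of low-value chat acknowledgements to drop.
NOISE = {"ok", "ack", "+1", "thanks"}


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from all sources, drop noise and duplicates, sort by time."""
    merged = [event for source in sources for event in source]
    seen = set()
    timeline = []
    for event in sorted(merged, key=lambda e: e.ts):
        normalized = event.text.strip().lower()
        if normalized in NOISE:
            continue  # noise reduction
        key = (event.ts, normalized)
        if key in seen:
            continue  # same message reported by more than one source
        seen.add(key)
        timeline.append(event)
    return timeline
```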

![image](https://blog.aitoearn.ai/content/images/2025/11/img_011-36.jpg)  

---

### 4.2 Memory Management  

#### Pain point:  
Multi-turn, long-running tasks quickly exceed the context window and lose key context.  

#### Approach:  
**Noise Reduction → Summarization → Preservation**  

![image](https://blog.aitoearn.ai/content/images/2025/11/img_012-32.jpg)  

Compared with general-purpose agents, postmortem sessions run longer, carry a higher density of valuable memory, and contain more noise.  

**Strategies:**  
- Intelligent summarization preserving recent contexts & system prompts  
- Structured summaries (eight-section recap)  
- Preservation of critical early instructions

```python
async def _execute_fifo_summary(self, messages: List[Message]) -> List[Message]:
    # 1. Separate System messages
    ...
```


![image](https://blog.aitoearn.ai/content/images/2025/11/img_015-20.jpg)  
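
The original method body is elided, so the following is only a guess at the general shape of a FIFO summary step: keep system messages, keep the most recent turns, and fold the older turns into a single summary message. The `Message` dataclass, the `fifo_summarize` name, the `keep_recent` parameter, and the injected `summarize` callable are all assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str  # "system", "user", or "assistant"
    content: str


def fifo_summarize(
    messages: List[Message],
    summarize: Callable[[List[Message]], str],
    keep_recent: int = 4,
) -> List[Message]:
    """Compress older turns into one summary; keep system and recent messages."""
    # 1. Separate system messages so they are never summarized away.
    system = [m for m in messages if m.role == "system"]
    dialog = [m for m in messages if m.role != "system"]
    if len(dialog) <= keep_recent:
        return messages  # nothing old enough to compress

    # 2. Oldest turns go to the summarizer; the newest stay verbatim.
    split = len(dialog) - keep_recent
    old, recent = dialog[:split], dialog[split:]

    # 3. Replace the old turns with a single summary message.
    summary = Message("assistant", "Summary of earlier turns: " + summarize(old))
    return system + [summary] + recent
```

In practice `summarize` would call an LLM with a structured recap prompt. Passing it in as a callable keeps the trimming logic testable without a model.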

---

### 4.3 Intent Recognition  

From monolithic “all-in-one” → **multi-agent routing & nesting**:

- **ChatAgent**: Q&A, light analysis  
- **WorkAgent**: Task execution, professional output  
- **Selector**: Routes to domain-specific agents  
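
A minimal sketch of the routing idea, assuming a keyword rule in place of a real intent model. The `WORK_VERBS` set and the function bodies are hypothetical; only the ChatAgent, WorkAgent, and Selector roles come from the list above.

```python
import re
from typing import Callable

# Hypothetical verbs that signal a task to execute rather than a question.
WORK_VERBS = {"generate", "draft", "build", "export"}


def chat_agent(text: str) -> str:
    return f"ChatAgent: answering '{text}'"


def work_agent(text: str) -> str:
    return f"WorkAgent: executing '{text}'"


def selector(text: str) -> Callable[[str], str]:
    """Pick an agent by matching whole words, not substrings."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return work_agent if words & WORK_VERBS else chat_agent


def handle(text: str) -> str:
    return selector(text)(text)
```

Nesting follows naturally from this shape: an agent returned by one selector can itself be another selector that routes to more specialized agents.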

![image](https://blog.aitoearn.ai/content/images/2025/11/img_017-18.jpg)  

---

### 4.4 Page Interaction Interface  

Enhancements over general agents:
- Step-level streaming with exposure control  
- Cache-based message producer/consumer  
- Frontend components registered as tools for rich UI  
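
The producer/consumer pattern with exposure control can be sketched as follows. An in-process `asyncio.Queue` stands in for the cache, and the `Step` type with its `visible` flag is an assumption for illustration. The source does not say which cache or message format is used.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    payload: str
    visible: bool  # exposure control: should the user see this step?


async def producer(queue: asyncio.Queue, steps: List[Step]) -> None:
    """The agent side: publish each step as soon as it completes."""
    for step in steps:
        await queue.put(step)
    await queue.put(None)  # sentinel marking the end of the stream


async def consumer(queue: asyncio.Queue) -> List[str]:
    """The page side: read steps and render only those marked visible."""
    shown = []
    while True:
        step = await queue.get()
        if step is None:
            break
        if step.visible:
            shown.append(f"{step.name}: {step.payload}")
    return shown


async def stream(steps: List[Step]) -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    _, shown = await asyncio.gather(producer(queue, steps), consumer(queue))
    return shown
```

Decoupling the two sides through a buffer lets the agent keep working while the page renders at its own pace, and hidden steps never reach the UI.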

![image](https://blog.aitoearn.ai/content/images/2025/11/img_018-14.jpg)  

---

### 4.5 RAG Knowledge Enhancement  
RAG is essential for injecting private-domain expertise into the LLM context.  
The loop is closed: detect blind spots → write Q&A pairs → add them to the knowledge base → retrieve them via RAG.
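
The closed loop can be sketched with a toy knowledge base. Token overlap stands in for real embedding retrieval, and the `KnowledgeBase` class, `build_prompt`, and the prompt layout are hypothetical choices made for this example.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class KnowledgeBase:
    """Stores Q&A pairs written after a blind spot is detected."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[str, str]] = []

    def add(self, question: str, answer: str) -> None:
        self.pairs.append((question, answer))

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Rank stored pairs by token overlap with the query."""
        q = tokens(query)
        scored = [(len(q & tokens(question)), (question, answer))
                  for question, answer in self.pairs]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [pair for _, pair in scored[:k]]


def build_prompt(kb: KnowledgeBase, query: str) -> str:
    """Inject the retrieved pairs into the context sent to the LLM."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in kb.retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Each time the agent fails to answer a domain question, a reviewer can add one Q&A pair, and the next similar query retrieves it.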

![image](https://blog.aitoearn.ai/content/images/2025/11/img_019-14.jpg)  

---

### 4.6 Evaluation Mechanism  

From **lexical metrics** to **business-value scoring**:  
- ROUGE/BLEU → BERTScore → LLM-as-Judge → **Depth/Logic/Actionability/Evidence** scoring  
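
One way to turn the four dimensions into a single number is a weighted average of per-dimension ratings. The 1 to 5 scale, the equal default weights, and the `rubric_score` function are assumptions for this sketch. The article does not specify how the scores are combined.

```python
from typing import Dict, Optional

DIMENSIONS = ("depth", "logic", "actionability", "evidence")


def rubric_score(
    scores: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of 1-5 ratings, e.g. as returned by an LLM judge."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing score for {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} score must be between 1 and 5")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight
```

Raising the weight on `evidence`, for example, penalizes fluent but unsupported analyses more heavily than lexical metrics such as ROUGE would.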

![image](https://blog.aitoearn.ai/content/images/2025/11/img_020-14.jpg)  

---

### 4.7 Prompt Optimization  

#### V1: General prompts → too generic  
#### V2: Risk tags + CoT → better structure, still generic improvements  
#### V3: Split tag libraries → more precision, still forced matches  
#### V4: Discard tags → question-first, evidence-based, actionable, anti-hallucination  
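
To illustrate the V4 direction, the sketch below assembles a question-first prompt with numbered evidence and explicit anti-hallucination rules. The wording of the rules and the `build_v4_prompt` helper are invented for this example and are not the author's actual prompt.

```python
from typing import List


def build_v4_prompt(question: str, evidence: List[str]) -> str:
    """Build a question-first prompt that forces evidence-backed answers."""
    evidence_block = "\n".join(
        f"[E{i}] {item}" for i, item in enumerate(evidence, start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        "Rules:\n"
        "1. Answer the question directly; do not force findings into predefined tags.\n"
        "2. Cite an evidence ID such as [E1] for every claim.\n"
        "3. Propose concrete, verifiable follow-up actions.\n"
        "4. If the evidence is insufficient, say so instead of guessing.\n"
    )
```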

---

## Multi-Role Empowerment  

| Role            | Pain Points | AI Empowerment | Benefits |
|-----------------|-------------|----------------|----------|
| Technical Support | Dispersed info, shallow analysis | Draft generation, completeness check, focus point assist | Faster, deeper, more professional reviews |
| R&D | Fragmented causes | Attribution guidance, blind spot alerts | Higher quality cause chains |
| General Users | Cannot understand reports | One-line summary, visual FTA, Q&A | Instant comprehension, richer knowledge assets |

---

### Technical Support Workflow  

1. Automatic first draft once the incident is resolved  
2. Interactive deep cause analysis  
3. Risk discovery, including architecture-level factors  

![image](https://blog.aitoearn.ai/content/images/2025/11/img_027-7.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_028-6.jpg)  

---

### R&D Workflow  

1. Initial draft for factual recall  
2. Cause quality analysis → blind spot prompts  

![image](https://blog.aitoearn.ai/content/images/2025/11/img_029-3.jpg)  

---

### General User Workflow  

1. Auto concise summary for non-tech readers  
2. Visual process diagrams & FTA trees  

![image](https://blog.aitoearn.ai/content/images/2025/11/img_030-3.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_031-3.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/11/img_032-3.jpg)  

---

**Disclaimer:** All cases are fictional for illustration.

---

**Key takeaway:** AI-assisted postmortems raise incident analysis depth, turn failures into knowledge, and foster proactive risk management. Pair the **Intelligent Postmortem Agent** with your DevOps workflows for maximum operational resilience.
