K8s Mysterious Crash Causes P0 Incident — Overnight Investigation Reveals Etcd Database Fragmentation
# ⚠️ Incident Report: etcd Database Fragmentation Crisis
**Incident Level:** P0
**Impact Scope:** All Services
**Recovery Duration:** 4 hours 23 minutes
---
## 1. Disaster Strikes — 3 AM Alert Storm
**Date/Time:** December 15, 2024 — 03:17
I was asleep when the phone started buzzing relentlessly. Monitoring alerts poured in:
```
CRITICAL: K8s API Server response timeout
CRITICAL: etcd cluster health check failed
CRITICAL: All Pod statuses abnormal
CRITICAL: All business services offline
```
With 8 years of ops experience, I knew instantly — this was **serious**.
---
## 2. Initial Diagnosis — Symptoms Worsen
SSHing into the **Master node** revealed grim results:
```bash
$ kubectl get nodes
The connection to the server localhost:8080 was refused

$ kubectl cluster-info
Unable to connect to the server: dial tcp 10.0.1.10:6443: i/o timeout
```
The **API Server** was unreachable.
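Before blaming etcd, it is worth confirming whether the API server container itself is even running. A quick sketch for a kubeadm-style control plane (the static-pod layout and container runtime here are assumptions, not from the original write-up):

```bash
# static pod manifests should still be in place on the master
ls /etc/kubernetes/manifests/
# is the kube-apiserver container running, exited, or crash-looping?
crictl ps -a | grep kube-apiserver
```

In this incident the trail led straight to etcd, which is where the next check goes.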
Checking **etcd cluster health**:
```bash
$ systemctl status etcd
● etcd.service - etcd
   Active: active (running) but degraded

$ etcdctl endpoint health --cluster
10.0.1.10:2379 is unhealthy: took too long
10.0.1.11:2379 is unhealthy: took too long
10.0.1.12:2379 is unhealthy: took too long
```
All nodes reported **timeout** — clearly not a simple network glitch.
---
## 3. Deep Inspection — A Shocking Discovery
### Step 1 — etcd Logs
```bash
$ journalctl -u etcd -n 100
Dec 15 03:15:45 etcd[1234]: database space exceeded
Dec 15 03:16:02 etcd[1234]: mvcc: database space exceeded
```
**Keyword:** `database space exceeded` — the smoking gun.
### Step 2 — Database Investigation
`database space exceeded` means etcd has hit its backend quota: it raises a NOSPACE alarm and drops into read-only mode (only reads and deletes are accepted), so the API server can no longer persist writes and the whole cluster is effectively paralyzed. Proper quotas, disk-usage monitoring, and automated compaction are critical.
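The investigation itself is a couple of `etcdctl` calls; a minimal sketch (single-endpoint defaults, table output just for readability):

```bash
$ etcdctl alarm list                                     # NOSPACE appears here once the quota is exceeded
$ etcdctl endpoint status --cluster --write-out=table    # per-member DB size, leader, raft state
```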
---
## 4. **Root Cause Analysis** — 8GB Database, 98% Fragmentation!
### Step 3 — Fragmentation Check
```bash
$ du -sh /var/lib/etcd/
8.4G    /var/lib/etcd/

$ etcdctl endpoint status --write-out=json    # compare dbSize (file) with dbSizeInUse (live data)
```

Live data: ~156 MB. File on disk: 8.4 GB. Fragmentation: ~98.1%.
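A sketch of deriving that ratio from the JSON status output (assumes etcd >= 3.4, which reports `dbSizeInUse` alongside `dbSize`):

```bash
status=$(etcdctl endpoint status --write-out=json)
total=$(echo "$status" | grep -oE '"dbSize":[0-9]+'      | grep -oE '[0-9]+')
inuse=$(echo "$status" | grep -oE '"dbSizeInUse":[0-9]+' | grep -oE '[0-9]+')
echo "fragmentation: $(( 100 - 100 * inuse / total ))%"
```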
### Step 4 — Root Causes
#### 1. Pod Restart Storm

```bash
$ grep "PUT /registry/pods" etcd.log | wc -l
2847293
```

Millions of writes → severe fragmentation.
#### 2. Historical Version Bloat

```bash
$ etcdctl get --prefix --keys-only /registry/ | wc -l
1847293
```

No compaction → huge version accumulation.
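To see which resource types dominate the keyspace, one option is to group keys by their `/registry/<resource>/...` prefix (a sketch; the default kubeadm key layout is assumed):

```bash
etcdctl get /registry/ --prefix --keys-only | \
  awk -F/ 'NF {print $3}' | sort | uniq -c | sort -rn | head
```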
#### 3. Improper Config

```yaml
auto-compaction-retention: "0"     # Disabled
quota-backend-bytes: 8589934592    # Max quota hit
```
---
## 5. **Emergency Recovery Plan**
### Phase 1 — **Storage Expansion** (15 min)

`quota-backend-bytes` is a server setting rather than a key you can `etcdctl put`, so the quota was raised to 12884901888 (12 GiB) in each member's config file (or via the `--quota-backend-bytes` flag) and the members restarted:

```bash
$ systemctl restart etcd
```
### Phase 2 — **Manual Compaction** (45 min)

```bash
$ rev=$(etcdctl endpoint status ... )
$ etcdctl compact $((rev-1000))
```
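The elided status call typically reads the current revision out of the JSON status output, as in the etcd maintenance docs (a sketch, assuming the v3 `etcdctl`):

```bash
rev=$(etcdctl endpoint status --write-out=json | grep -oE '"revision":[0-9]+' | grep -oE '[0-9]+' | head -1)
etcdctl compact $((rev - 1000))   # compact away history older than the most recent ~1000 revisions
```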
### Phase 3 — **Defragmentation** (180 min)

```bash
for ep in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
  etcdctl --endpoints=$ep defrag
done
```
**Result:** DB size ↓ **8.4GB → ~180MB**
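One step the recovery log above does not show: after a `database space exceeded` event, etcd keeps the NOSPACE alarm raised and stays read-only until the alarm is cleared, even once compaction and defragmentation have reclaimed space. A minimal sketch:

```bash
$ etcdctl alarm list     # shows alarm:NOSPACE while the quota alarm is still active
$ etcdctl alarm disarm   # clear it once the database is back under quota
```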
### Phase 4 — **Service Verification** (23 min)

```bash
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | wc -l
```
✅ All services restored.
---
## 6. **Permanent Preventive Measures**
### Automated Compaction

```yaml
auto-compaction-mode: periodic
auto-compaction-retention: "5m"
```
### Monitoring & Alerts (Prometheus)

```yaml
- alert: EtcdHighFragmentation
  expr: (dbSize - dbSizeInUse) / dbSize > 0.5
  for: 10m
```

```bash
#!/bin/bash
check_fragmentation() { ... }
```
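A fuller sketch of the same alert, written against the metric names etcd exposes (`etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes`); the 50% threshold and labels are illustrative:

```yaml
groups:
  - name: etcd-fragmentation
    rules:
      - alert: EtcdHighFragmentation
        expr: |
          (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)
            / etcd_mvcc_db_total_size_in_bytes > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd keyspace is more than 50% fragmented; schedule a defrag"
```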
---
## 7. **Lessons Learned**
**Key Metrics:**
- DB size: critical above 5 GB
- Fragmentation ratio: critical above 80%
- Retained revisions: keep roughly 1,000-5,000
**Best Practices:**
- **Prevention > Cure:** Regular checks
- **Automate:** Avoid human error
- **Strong Monitoring:** Infrastructure lifeline
- **Prepared Scripts:** Ready before crisis
---
## 8. **Final Thoughts**
This 4h23m battle against near-total etcd fragmentation tested skill, resilience, and rapid decision-making. In ops, we are the last defense for business continuity — every outage adds hard-earned expertise.
---
### Daily Health Script