K8s Mysterious Crash Causes P0 Incident — Overnight Investigation Reveals Etcd Database Fragmentation
# ⚠️ Incident Report: etcd Database Fragmentation Crisis
**Incident Level:** P0
**Impact Scope:** All Services
**Recovery Duration:** 4 hours 23 minutes
---
## 1. Disaster Strikes — 3 AM Alert Storm
**Date/Time:** December 15, 2024 — 03:17
I was asleep when the phone started buzzing relentlessly. Monitoring alerts poured in:
```
CRITICAL: K8s API Server response timeout
CRITICAL: etcd cluster health check failed
CRITICAL: All Pod statuses abnormal
CRITICAL: All business services offline
```
With 8 years of ops experience, I knew instantly — this was **serious**.
---
## 2. Initial Diagnosis — Symptoms Worsen
SSHing into the **Master node** revealed grim results:
```bash
$ kubectl get nodes
The connection to the server localhost:8080 was refused

$ kubectl cluster-info
Unable to connect to the server: dial tcp 10.0.1.10:6443: i/o timeout
```
The **API Server** was unreachable.
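Before blaming etcd, it is worth confirming whether the API server container itself is even running. A quick sketch for a kubeadm-style control plane (the static-pod layout and container runtime here are assumptions, not from the original write-up):

```bash
# static pod manifests should still be in place on the master
ls /etc/kubernetes/manifests/
# is the kube-apiserver container running, exited, or crash-looping?
crictl ps -a | grep kube-apiserver
```

In this incident the trail led straight to etcd, which is where the next check goes.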
Checking **etcd cluster health**:
```bash
$ systemctl status etcd
● etcd.service - etcd
   Active: active (running) but degraded

$ etcdctl endpoint health --cluster
10.0.1.10:2379 is unhealthy: took too long
10.0.1.11:2379 is unhealthy: took too long
10.0.1.12:2379 is unhealthy: took too long
```
All nodes reported **timeout** — clearly not a simple network glitch.
---
## 3. Deep Inspection — A Shocking Discovery
### Step 1 — etcd Logs
```bash
$ journalctl -u etcd -n 100
Dec 15 03:15:45 etcd[1234]: database space exceeded
Dec 15 03:16:02 etcd[1234]: mvcc: database space exceeded
```
**Keyword:** `database space exceeded` — the smoking gun.
### Step 2 — Database Investigation
`database space exceeded` means etcd has hit its backend quota: it raises a NOSPACE alarm and drops into read-only mode (only reads and deletes are accepted), so the API server can no longer persist writes and the whole cluster is effectively paralyzed. Proper quotas, disk-usage monitoring, and automated compaction are critical.
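The investigation itself is a couple of `etcdctl` calls; a minimal sketch (single-endpoint defaults, table output just for readability):

```bash
$ etcdctl alarm list                                     # NOSPACE appears here once the quota is exceeded
$ etcdctl endpoint status --cluster --write-out=table    # per-member DB size, leader, raft state
```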
---
## 4. **Root Cause Analysis** — 8GB Database, 98% Fragmentation!
### Step 3 — Fragmentation Check
```bash
$ du -sh /var/lib/etcd/
8.4G    /var/lib/etcd/

$ etcdctl endpoint status --write-out=json    # compare dbSize (file) with dbSizeInUse (live data)
```

Live data: ~156 MB. File on disk: 8.4 GB. Fragmentation: ~98.1%.
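A sketch of deriving that ratio from the JSON status output (assumes etcd >= 3.4, which reports `dbSizeInUse` alongside `dbSize`):

```bash
status=$(etcdctl endpoint status --write-out=json)
total=$(echo "$status" | grep -oE '"dbSize":[0-9]+'      | grep -oE '[0-9]+')
inuse=$(echo "$status" | grep -oE '"dbSizeInUse":[0-9]+' | grep -oE '[0-9]+')
echo "fragmentation: $(( 100 - 100 * inuse / total ))%"
```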
### Step 4 — Root Causes
#### 1. Pod Restart Storm

```bash
$ grep "PUT /registry/pods" etcd.log | wc -l
2847293
```

Millions of writes → severe fragmentation.
#### 2. Historical Version Bloat

```bash
$ etcdctl get --prefix --keys-only /registry/ | wc -l
1847293
```

No compaction → huge version accumulation.
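To see which resource types dominate the keyspace, one option is to group keys by their `/registry/<resource>/...` prefix (a sketch; the default kubeadm key layout is assumed):

```bash
etcdctl get /registry/ --prefix --keys-only | \
  awk -F/ 'NF {print $3}' | sort | uniq -c | sort -rn | head
```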
#### 3. Improper Config

```yaml
auto-compaction-retention: "0"     # Disabled
quota-backend-bytes: 8589934592    # Max quota hit
```
---
## 5. **Emergency Recovery Plan**
### Phase 1 — **Storage Expansion** (15 min)

`quota-backend-bytes` is a server setting rather than a key you can `etcdctl put`, so the quota was raised to 12884901888 (12 GiB) in each member's config file (or via the `--quota-backend-bytes` flag) and the members restarted:

```bash
$ systemctl restart etcd
```
### Phase 2 — **Manual Compaction** (45 min)

```bash
$ rev=$(etcdctl endpoint status ... )
$ etcdctl compact $((rev-1000))
```
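The elided status call typically reads the current revision out of the JSON status output, as in the etcd maintenance docs (a sketch, assuming the v3 `etcdctl`):

```bash
rev=$(etcdctl endpoint status --write-out=json | grep -oE '"revision":[0-9]+' | grep -oE '[0-9]+' | head -1)
etcdctl compact $((rev - 1000))   # compact away history older than the most recent ~1000 revisions
```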
### Phase 3 — **Defragmentation** (180 min)

```bash
for ep in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
  etcdctl --endpoints=$ep defrag
done
```
**Result:** DB size ↓ **8.4GB → ~180MB**
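One step the recovery log above does not show: after a `database space exceeded` event, etcd keeps the NOSPACE alarm raised and stays read-only until the alarm is cleared, even once compaction and defragmentation have reclaimed space. A minimal sketch:

```bash
$ etcdctl alarm list     # shows alarm:NOSPACE while the quota alarm is still active
$ etcdctl alarm disarm   # clear it once the database is back under quota
```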
### Phase 4 — **Service Verification** (23 min)

```bash
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running | wc -l
```
✅ All services restored.
---
## 6. **Permanent Preventive Measures**
### Automated Compaction

```yaml
auto-compaction-mode: periodic
auto-compaction-retention: "5m"
```
### Monitoring & Alerts (Prometheus)

```yaml
- alert: EtcdHighFragmentation
  expr: (dbSize - dbSizeInUse) / dbSize > 0.5
  for: 10m
```

```bash
#!/bin/bash
check_fragmentation() { ... }
```
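A fuller sketch of the same alert, written against the metric names etcd exposes (`etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes`); the 50% threshold and labels are illustrative:

```yaml
groups:
  - name: etcd-fragmentation
    rules:
      - alert: EtcdHighFragmentation
        expr: |
          (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes)
            / etcd_mvcc_db_total_size_in_bytes > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd keyspace is more than 50% fragmented; schedule a defrag"
```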
---
## 7. **Lessons Learned**
**Key Metrics:**
- DB size: critical above 5 GB
- Fragmentation ratio: critical above 80%
- Retained revisions: keep roughly 1,000-5,000
**Best Practices:**
- **Prevention > Cure:** Regular checks
- **Automate:** Avoid human error
- **Strong Monitoring:** Infrastructure lifeline
- **Prepared Scripts:** Ready before crisis
---
## 8. **Final Thoughts**
This 4h23m battle against near-total etcd fragmentation tested skill, resilience, and rapid decision-making. In ops, we are the last defense for business continuity — every outage adds hard-earned expertise.
---
### Daily Health Script