Root Cause Localization in Practice with DeepSeek and Multi-Agent Systems
AIOps & RCA in Practice: Insights from Chen Dihao at the 2025 XCOPS Summit
> Based on Chen Dihao’s online talk at the 2025 XCOPS Intelligent Operations Management Annual Conference (Guangzhou session).

---
Speaker Profile
Chen Dihao — Head of the AI Technology Platform at SF Technology
- Oversees AI and large-model infrastructure.
- Previously platform architect at Fourth Paradigm, PMC member of OpenMLDB, architect of Xiaomi’s deep learning platform, and storage and container lead at UnitedStack.
- Active in distributed systems and machine learning open-source communities, contributing to HBase, OpenStack, TensorFlow, etc.
---
Session Highlights
- Evolution trends in AIOps and RCA
- Multi-agent architecture for operations systems
- Large models in multi-scenario RCA
- DeepSeek optimization & practice
---
1. Evolution Trends in AIOps and RCA Technology
1.1 From DevOps to AIOps

DevOps — Automated Ops
- Breaks barriers between Dev & Ops
- Focus on CI/CD to optimize delivery cycles and improve iteration speed
AIOps — Intelligent Ops
- Leverages big data + machine learning for:
  - Anomaly detection
  - Root cause localization
  - Automated remediation
- Shifts from passive response to proactive prediction
- Cuts MTTR, reduces business risks
---
1.2 RCA Challenges

Challenges:
- Multi-modal data integration — logs, metrics, traces, events in varying formats/timings
- Complex causal inference — intertwined dependencies, spurious correlations
- High data quality requirements — noise/missing values affect accuracy
- Engineering implementation difficulty — high performance + interpretability demands
---
Response Strategies:
- Integrate O&M data into a unified platform (unstructured + graph-structured)
- Multi-agent collaboration to manage complex causal webs
- Large-model reasoning with private deployment + domain knowledge bases
- Focus on large-model security to protect operational data
---
Future AIOps & RCA Trends

- Multimodal data fusion
- Large-model-driven decisions
- Automated repair loops
- End-to-end causal chain tracking
- Human–AI co-evolution
- Dynamic threshold optimization
---
2. Multi-Agent Architecture at SF Express
Platform Overview

- GPU cluster: 1,000+ GPUs
- Private deployment of latest DeepSeek models
- Users: 7,000+ internal LLM users
- Daily calls: 200M+
---
Core Application Scenarios

- Root Cause Localization
  - Causal graph + agents to pinpoint faults quickly
- Policy Recommendations
  - Agents suggest tailored O&M strategies
- Dynamic Thresholds
  - Auto-adjust monitoring ranges based on real-time + historical data
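
To make the dynamic-threshold idea concrete, here is a minimal sketch that derives a monitoring band from recent history and checks a real-time value against it. The talk does not say which method SF uses (the Q&A only mentions traditional time-series methods), so the rolling mean ± k·σ rule and the names `dynamic_band`, `is_anomalous`, `window`, and `k` are all assumptions for illustration.

```python
from statistics import mean, stdev


def dynamic_band(history, window=12, k=3.0):
    """Return (lower, upper) bounds computed from the last `window` points.

    Assumption: a mean +/- k standard deviations band; other methods
    (seasonal baselines, quantiles) would slot in the same way.
    """
    recent = history[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two historical points")
    mu, sigma = mean(recent), stdev(recent)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, window=12, k=3.0):
    """True if the real-time value falls outside the band built from history."""
    lower, upper = dynamic_band(history, window, k)
    return not (lower <= value <= upper)


# Hypothetical latency samples (ms) standing in for historical monitoring data.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99]

print(dynamic_band(latency_ms))
print(is_anomalous(104, latency_ms))  # inside the band
print(is_anomalous(180, latency_ms))  # far outside the band
```

Because the band is recomputed from the most recent window, it widens or narrows as the metric's behavior changes, instead of relying on a fixed threshold.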
---
3. Architecture for Multi-Agent RCA

Four Specializations:
- Unified Logical Topology — integrate CMDB & APM
- Collaborative Diagnostics — specialized agents per alert type
- O&M Knowledge Base — combine internal experience + external best practices
- Multi-Scenario AIOps Tools — integrate RCA into alert handling platforms
---
4. Multi-Alert RCA Workflow

Process:
- Converge alerts by time
- Filter duplicates/unrelated alerts
- Identify common dependency nodes via topology mapping
- Trace via large models + domain experts to confirm root cause
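
The first three steps above can be sketched as follows. This is an illustration under stated assumptions, not SF's implementation: the `Alert` type, the `DEPENDS_ON` topology, the 300-second window, and all service names are hypothetical. The final step (tracing with large models and domain experts) is left out; the candidate nodes returned here are what would be handed to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    kind: str
    ts: int  # seconds since epoch


# Hypothetical topology: service -> services it directly depends on.
DEPENDS_ON = {
    "checkout": ["orders", "payments"],
    "orders": ["mysql-main"],
    "payments": ["mysql-main", "redis-cache"],
}


def converge_by_time(alerts, window_s=300):
    """Start a new group whenever the gap to the previous alert exceeds window_s."""
    groups, current = [], []
    for alert in sorted(alerts, key=lambda a: a.ts):
        if current and alert.ts - current[-1].ts > window_s:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups


def dedupe(group):
    """Keep the first alert for each (service, kind) pair."""
    seen, unique = set(), []
    for alert in group:
        key = (alert.service, alert.kind)
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def reachable(service):
    """Return the service plus everything it transitively depends on."""
    found, stack = set(), [service]
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return found


def common_dependencies(group):
    """Nodes that every alerting service depends on: candidate root causes."""
    sets = [reachable(a.service) for a in group]
    return set.intersection(*sets) if sets else set()


alerts = [
    Alert("orders", "latency", 1000),
    Alert("payments", "errors", 1030),
    Alert("orders", "latency", 1045),   # duplicate of the first alert
    Alert("checkout", "errors", 5000),  # separate incident, much later
]
groups = converge_by_time(alerts)
first = dedupe(groups[0])
print(common_dependencies(first))
```

Here the `orders` and `payments` alerts land in one group, and the only node both depend on is `mysql-main`, which becomes the candidate to confirm.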
---
5. Multi-Agent Collaboration Mechanism

- Architect Agent — coordinates analysis across agents
- Domain-Specific Agents — alarm analysis, cloud logs, APM tracing, core components, basic monitoring, DB analysis
- Each agent has:
  - Independent LLM capability
  - Dedicated knowledge base
  - Data acquisition interface
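
One way to picture this division of labor is the sketch below: a coordinating agent routes each alert type to a domain agent that owns its own knowledge base and data interface. All class and field names (`DomainAgent`, `ArchitectAgent`, `knowledge`, `fetch`) are hypothetical, and the LLM call is replaced by a dictionary lookup so the example stays runnable. The dictionary-based routing loosely mirrors the Q&A remark that scheduling is currently hardcoded or workflow-engine based.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DomainAgent:
    name: str
    knowledge: dict              # stand-in for a dedicated knowledge base
    fetch: Callable[[str], str]  # stand-in for a data acquisition interface

    def analyze(self, target: str) -> str:
        evidence = self.fetch(target)
        # A real agent would pass evidence and knowledge to its LLM here;
        # this sketch only looks the evidence up in the knowledge dict.
        hint = self.knowledge.get(evidence, "no known pattern")
        return f"{self.name}: {evidence} -> {hint}"


@dataclass
class ArchitectAgent:
    agents: dict = field(default_factory=dict)  # alert type -> DomainAgent

    def register(self, alert_type: str, agent: DomainAgent) -> None:
        self.agents[alert_type] = agent

    def diagnose(self, alert_types, target):
        """Route each alert type to its agent; skip types with no agent."""
        return [
            self.agents[t].analyze(target)
            for t in alert_types
            if t in self.agents
        ]


# Hypothetical agents with canned data sources.
db_agent = DomainAgent(
    "db", {"slow queries": "check missing index"}, lambda target: "slow queries"
)
apm_agent = DomainAgent(
    "apm", {"timeout on /pay": "inspect downstream DB"}, lambda target: "timeout on /pay"
)

architect = ArchitectAgent()
architect.register("db", db_agent)
architect.register("apm", apm_agent)

findings = architect.diagnose(["apm", "db", "network"], "payments")
for line in findings:
    print(line)
```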
---
6. Large Models in Multi-Scenario RCA
Steps:
- Data Preparation — O&M middle platform
- Knowledge Integration
- Multi-Agent Deployment
- Tool Integration — auto-trigger diagnostics in production
Key AIOps Indicators

- Data processing
- Localization accuracy
- Automated response
- Explainability
---
Alarm Convergence & Node Filtering

- Merge related alerts
- Filter by severity, impact, and proximity to the root cause
- Prioritize significant anomalies
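
As a rough illustration of filtering and prioritizing, the sketch below scores each alert on the three criteria and drops low scorers. The weighted sum, the weights, the `min_score` cutoff, the `hops_to_candidate` field, and the sample alerts are all assumptions; the talk does not describe the actual scoring rule.

```python
def priority(alert, weights=(0.5, 0.3, 0.2)):
    """Weighted score from severity, impact, and proximity to a candidate root cause.

    Assumptions: severity and impact are normalized to [0, 1], and proximity
    decays with the number of topology hops to the candidate node.
    """
    w_sev, w_imp, w_prox = weights
    proximity = 1.0 / (1 + alert["hops_to_candidate"])
    return w_sev * alert["severity"] + w_imp * alert["impact"] + w_prox * proximity


def prioritize(alerts, min_score=0.5):
    """Drop alerts scoring below min_score, then sort the rest, highest first."""
    kept = [a for a in alerts if priority(a) >= min_score]
    return sorted(kept, key=priority, reverse=True)


# Hypothetical alerts from one converged group.
alerts = [
    {"name": "orders", "severity": 0.6, "impact": 0.5, "hops_to_candidate": 1},
    {"name": "batch-report", "severity": 0.2, "impact": 0.1, "hops_to_candidate": 3},
    {"name": "mysql-main", "severity": 0.9, "impact": 0.8, "hops_to_candidate": 0},
]

ranked = prioritize(alerts)
print([a["name"] for a in ranked])
```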
---
7. Results & Enhancements
Alert Dashboard:

- Calls relevant agents by alert type
- Generates summaries via LLM
Root Cause Localization:

- Maps alerts → graph nodes
- Selects the top-weighted nodes for RCA
- Recommends a strategy: restart or rollback
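
A small sketch of this flow, under loud assumptions: the alert-to-node mapping `ALERT_TO_NODES` is hypothetical, node weight is simply the number of alerts that implicate a node, and the rule "roll back if the node was recently deployed, otherwise restart" is invented for illustration rather than taken from the talk.

```python
from collections import Counter

# Hypothetical mapping from alert name to the topology nodes it implicates.
ALERT_TO_NODES = {
    "orders-latency": ["orders", "mysql-main"],
    "payments-errors": ["payments", "mysql-main"],
    "mysql-slow-query": ["mysql-main"],
}


def top_nodes(alert_names, k=1):
    """Weight each node by how many alerts implicate it; return the top k."""
    weights = Counter()
    for name in alert_names:
        for node in ALERT_TO_NODES.get(name, []):
            weights[node] += 1
    return weights.most_common(k)


def recommend(node, recently_deployed):
    """Assumed rule: roll back a freshly deployed node, otherwise restart it."""
    return "rollback" if node in recently_deployed else "restart"


candidates = top_nodes(list(ALERT_TO_NODES))
node, weight = candidates[0]
print(node, weight, recommend(node, recently_deployed={"mysql-main"}))
```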
---
8. Multimodal LLM Integration & Human–Machine Collaboration

- Image-capable LLMs augment manual anomaly detection on charts
- ASR/TTS integration feeds war-room discussion to the models in real time
---
9. Practical Value vs. Technical Challenges

Value:
- Faster RCA = quicker recovery
- Business continuity
- Better resource allocation
- Knowledge reuse
Challenges:
- Accurate data acquisition
- Algorithm–performance balance
- Real-time RCA demands
- Complexity of dynamic systems
---
10. DeepSeek Optimization for RCA
Modules:

- O&M Middleware
- Automation Tools
- Agent Platform
- RCA Algorithms
---
Private Large Model Deployment

- Hybrid cloud for data security
- Local data storage
- Inference optimization (batching, quantization, accelerated frameworks)
---
11. RCA Scenarios with DeepSeek

- Multi-alert convergence
- Log analysis optimization
- Root cause node tracing
- Time-series anomaly detection
- Multi-agent collaboration
---
12. RCA & Strategy Recommendations

- Prompt Engineering: +36% RCA accuracy
- Knowledge Base Integration: +29% accuracy
- Multi-Agent Collaboration: target 90% RCA accuracy
---
Q&A Highlights

Q1: Fine-tuning vs. RAG & multi-agent scheduling
- Fine-tuning suits tasks with strict performance requirements
- RCA at SF: no fine-tuning yet, due to high cost & limited gain for smaller models
- Multi-agent logic currently hardcoded or workflow engine-based
Q2: Large models in real-time monitoring
- LLM calls are triggered manually after an alert fires
- Thresholds are precomputed with traditional time-series methods
- CV models or fine-tuned binary classifiers check whether an anomaly is valid
---
