Root Cause Localization System Practice Based on DeepSeek and Multi-Agent

AIOps & RCA in Practice: Insights from Chen Dihao at the 2025 XCOPS Summit

> Based on Chen Dihao’s online talk at the 2025 XCOPS Intelligent Operations Management Annual Conference (Guangzhou Station).

---

Speaker Profile

Chen Dihao — Head of the AI Technology Platform at SF Technology

  • Oversees AI and large-model infrastructure.
  • Former Platform Architect at Fourth Paradigm, PMC member of OpenMLDB, Architect for Xiaomi’s Deep Learning Platform, and storage and container lead at UnitedStack.
  • Active in distributed systems and machine learning open-source communities, contributing to HBase, OpenStack, TensorFlow, etc.

---

Session Highlights

  • Evolution trends in AIOps and RCA
  • Multi-agent architecture for operations systems
  • Large models in multi-scenario RCA
  • DeepSeek optimization & practice

---

1.1 From DevOps to AIOps

DevOps — Automated Ops

  • Breaks barriers between Dev & Ops
  • Focus on CI/CD to optimize delivery cycles and improve iteration speed

AIOps — Intelligent Ops

  • Leverages big data + machine learning for:
    • Anomaly detection
    • Root cause localization
    • Automated remediation
  • Shifts from passive response to proactive prediction
  • Cuts MTTR, reduces business risks

---

1.2 RCA Challenges

Challenges:

  • Multi-modal data integration — logs, metrics, traces, events in varying formats/timings
  • Complex causal inference — intertwined dependencies, spurious correlations
  • High data quality requirements — noise/missing values affect accuracy
  • Engineering implementation difficulty — high performance + interpretability demands

---

Response Strategies:

  • Integrate O&M data into a unified platform (unstructured + graph-structured)
  • Multi-agent collaboration to manage complex causal webs
  • Large-model reasoning with private deployment + domain knowledge bases
  • Focus on large-model security to protect operational data

---

  • Multimodal data fusion
  • Large-model-driven decisions
  • Automated repair loops
  • End-to-end causal chain tracking
  • Human–AI co-evolution
  • Dynamic threshold optimization

---

2. Multi-Agent Architecture at SF Express

Platform Overview

  • GPU cluster: 1,000+ GPUs
  • Private deployment of latest DeepSeek models
  • Users: 7,000+ internal LLM users
  • Daily calls: 200M+

---

Core Application Scenarios

  • Root Cause Localization
    • Causal graph + agents to pinpoint faults quickly
  • Policy Recommendations
    • Agents suggest tailored O&M strategies
  • Dynamic Thresholds
    • Auto-adjust monitoring ranges based on real-time + historical data
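
The dynamic-threshold idea can be illustrated with a rolling statistical band. The talk does not say which method SF uses, so the following is only a minimal sketch under assumptions: the `DynamicThreshold` class, the window size, and the `k` multiplier are all hypothetical, and the band is simply the rolling mean plus or minus `k` standard deviations.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Rolling mean +/- k*stdev band over the most recent normal samples.

    Hypothetical illustration; not the method described in the talk.
    """

    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)  # recent history of normal values
        self.k = k

    def bounds(self):
        """Return (lower, upper), or None until two samples are available."""
        if len(self.samples) < 2:
            return None
        mu, sigma = mean(self.samples), stdev(self.samples)
        return mu - self.k * sigma, mu + self.k * sigma

    def observe(self, value):
        """Return True if value falls outside the current band."""
        band = self.bounds()
        anomalous = band is not None and not (band[0] <= value <= band[1])
        if not anomalous:
            # Only normal values extend the history, so a spike
            # does not widen the band for the samples that follow it.
            self.samples.append(value)
        return anomalous


# Made-up latency samples in milliseconds.
detector = DynamicThreshold(window=7, k=3.0)
warmup_flags = [detector.observe(v) for v in [100, 102, 98, 101, 99, 103, 97]]
spike_flag = detector.observe(150)
normal_flag = detector.observe(104)
```

Excluding flagged values from the history is one design choice among several; a production system would likely also account for seasonality, which this sketch ignores.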

---

3. Architecture for Multi-Agent RCA

Four Specializations:

  • Unified Logical Topology — integrate CMDB & APM
  • Collaborative Diagnostics — specialized agents per alert type
  • O&M Knowledge Base — combine internal experience + external best practices
  • Multi-Scenario AIOps Tools — integrate RCA into alert handling platforms

---

4. Multi-Alert RCA Workflow

Process:

  • Converge alerts by time
  • Filter duplicates/unrelated alerts
  • Identify common dependency nodes via topology mapping
  • Trace via large models + domain experts to confirm root cause
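
The first three steps of this process can be sketched in a few lines. Everything here is an assumption for illustration: the `DEPENDS_ON` topology, the service names, the alert fields, and the five-minute window are made up, and the final step (confirmation by large models and domain experts) is left out.

```python
# Hypothetical topology: service -> services it depends on.
DEPENDS_ON = {
    "web-gateway": ["order-api"],
    "order-api": ["payment-svc", "mysql-orders"],
    "payment-svc": ["mysql-orders", "redis-cache"],
    "mysql-orders": [],
    "redis-cache": [],
}


def converge_by_time(alerts, window_s=300):
    """Group alerts that arrive within window_s seconds of a group's first alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if groups and alert["ts"] - groups[-1][0]["ts"] <= window_s:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


def deduplicate(group):
    """Keep the first alert for each (service, type) pair."""
    seen, unique = set(), []
    for alert in group:
        key = (alert["service"], alert["type"])
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def reachable(service):
    """Return the service plus all of its transitive dependencies."""
    seen, stack = set(), [service]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return seen


def common_dependencies(group):
    """Nodes that every alerting service depends on (or is)."""
    sets = [reachable(alert["service"]) for alert in group]
    return set.intersection(*sets) if sets else set()


alerts = [
    {"ts": 0, "service": "web-gateway", "type": "latency"},
    {"ts": 30, "service": "order-api", "type": "error_rate"},
    {"ts": 31, "service": "order-api", "type": "error_rate"},  # duplicate
    {"ts": 45, "service": "payment-svc", "type": "timeout"},
    {"ts": 4000, "service": "redis-cache", "type": "memory"},  # separate incident
]
groups = converge_by_time(alerts)
first_incident = deduplicate(groups[0])
candidates = common_dependencies(first_incident)
```

The resulting candidate set is only a shortlist of shared dependencies; narrowing it to a single root cause is the job of the tracing step described above.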

---

5. Multi-Agent Collaboration Mechanism

  • Architect Agent — coordinates analysis across agents
  • Domain-Specific Agents — alarm analysis, cloud logs, APM tracing, core components, basic monitoring, DB analysis
  • Each agent has:
    • Independent LLM capability
    • Dedicated knowledge base
    • Data acquisition interface
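
As a rough illustration of this structure, the sketch below models a coordinating agent that routes an alert to domain agents, each holding its own knowledge base, data-fetch function, and LLM callable. All class and function names are hypothetical, the routing is a plain lookup table, and `stub_llm` stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


def stub_llm(prompt: str) -> str:
    """Placeholder for a call to a privately deployed model."""
    return f"[analysis of {len(prompt)} chars of context]"


@dataclass
class DomainAgent:
    name: str
    knowledge_base: list[str]          # dedicated knowledge snippets
    fetch_data: Callable[[dict], str]  # data acquisition interface
    llm: Callable[[str], str] = stub_llm

    def analyze(self, alert: dict) -> str:
        """Combine knowledge and fetched data, then ask the model."""
        context = "\n".join(self.knowledge_base + [self.fetch_data(alert)])
        return f"{self.name}: {self.llm(context)}"


@dataclass
class ArchitectAgent:
    # alert type -> agents that should look at it (hardcoded routing)
    routes: dict[str, list[DomainAgent]] = field(default_factory=dict)

    def register(self, alert_type: str, agent: DomainAgent) -> None:
        self.routes.setdefault(alert_type, []).append(agent)

    def diagnose(self, alert: dict) -> list[str]:
        """Collect findings from every agent registered for this alert type."""
        return [agent.analyze(alert) for agent in self.routes.get(alert["type"], [])]


db_agent = DomainAgent(
    "db-analysis",
    ["Slow queries often follow a missing index."],
    lambda a: f"slow-query log for {a['service']}",
)
apm_agent = DomainAgent(
    "apm-tracing",
    ["Long spans usually point at the slowest downstream call."],
    lambda a: f"trace spans for {a['service']}",
)

architect = ArchitectAgent()
architect.register("db_latency", db_agent)
architect.register("db_latency", apm_agent)
findings = architect.diagnose({"type": "db_latency", "service": "mysql-orders"})
```

A fixed routing table is the simplest possible coordinator; a workflow engine or model-driven planner could replace `routes` without changing the agent interface.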

---

6. Large Models in Multi-Scenario RCA

Steps:

  • Data Preparation — O&M middle platform
  • Knowledge Integration
  • Multi-Agent Deployment
  • Tool Integration — auto-trigger diagnostics in production

Key AIOps Indicators

  • Data processing
  • Localization accuracy
  • Automated response
  • Explainability

---

Alarm Convergence & Node Filtering

  • Merge related alerts
  • Filter based on severity/impact/root cause proximity
  • Prioritize significant anomalies
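
One way to picture filtering on severity, impact, and root cause proximity is a weighted score with a cutoff. The weights, field names, and the 0.5 cutoff below are assumptions chosen for illustration, not values from the talk.

```python
# Hypothetical severity scale.
SEVERITY = {"critical": 3, "major": 2, "minor": 1}


def priority(alert, w_sev=0.5, w_impact=0.3, w_prox=0.2):
    """Weighted score in [0, 1]; higher means more worth investigating."""
    sev = SEVERITY[alert["severity"]] / 3
    impact = min(alert["affected_services"] / 10, 1.0)  # capped at 10 services
    proximity = 1 / (1 + alert["hops_to_candidate"])    # 0 hops -> 1.0
    return w_sev * sev + w_impact * impact + w_prox * proximity


def filter_alerts(alerts, min_score=0.5):
    """Drop low-scoring alerts and return the rest, highest score first."""
    ranked = sorted(alerts, key=priority, reverse=True)
    return [a for a in ranked if priority(a) >= min_score]


alerts = [
    {"id": "a1", "severity": "critical", "affected_services": 8, "hops_to_candidate": 0},
    {"id": "a2", "severity": "minor", "affected_services": 1, "hops_to_candidate": 3},
    {"id": "a3", "severity": "major", "affected_services": 4, "hops_to_candidate": 1},
]
kept = filter_alerts(alerts)
kept_ids = [a["id"] for a in kept]
```

With these made-up inputs, the minor alert three hops away from any candidate node falls below the cutoff, while the other two are kept in score order.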

---

7. Results & Enhancements

Alert Dashboard:

  • Calls relevant agents by alert type
  • Generates summaries via LLM

Root Cause Localization:

  • Maps alerts → graph nodes
  • Selects top-weighted nodes for RCA
  • Recommends a strategy, such as restart or rollback
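
The talk does not describe how node weights are computed. As one plausible reading, the sketch below lets each alerting service add weight to itself and a decayed weight to everything it depends on, so a dependency shared by several alerting services rises to the top. The topology, the decay factor, and the `recommend` rule are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical topology: service -> services it depends on.
DEPENDS_ON = {
    "order-api": ["mysql-orders"],
    "payment-svc": ["mysql-orders"],
    "inventory-svc": ["mysql-orders"],
    "mysql-orders": [],
}


def node_weights(alerting_services, decay=0.5):
    """Each alert adds decay**depth to every node reachable from its service."""
    weights = defaultdict(float)
    for service in alerting_services:
        queue, seen = deque([(service, 0)]), {service}
        while queue:  # breadth-first walk along dependency edges
            node, depth = queue.popleft()
            weights[node] += decay ** depth
            for dep in DEPENDS_ON.get(node, []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, depth + 1))
    return dict(weights)


def top_nodes(weights, k=1):
    """Highest-weighted nodes; ties are broken alphabetically."""
    return sorted(weights, key=lambda n: (-weights[n], n))[:k]


def recommend(node, recently_deployed):
    """Hypothetical rule: roll back a node with a recent release, else restart it."""
    return "rollback" if node in recently_deployed else "restart"


weights = node_weights(["order-api", "payment-svc", "inventory-svc"])
suspects = top_nodes(weights, k=1)
action = recommend(suspects[0], recently_deployed={"payment-svc"})
```

Here the shared database accumulates 0.5 from each of the three alerting services and outranks any single service; in practice the recommendation step would draw on deployment history and the knowledge base, not a two-way rule.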

---

8. Multimodal LLM Integration & Human–Machine Collaboration

  • Image-capable LLMs augment manual anomaly detection on monitoring charts
  • ASR/TTS integration feeds war-room discussion to the models in real time

---

9. Practical Value vs. Technical Challenges

Value:

  • Faster RCA = quicker recovery
  • Business continuity
  • Better resource allocation
  • Knowledge reuse

Challenges:

  • Accurate data acquisition
  • Algorithm–performance balance
  • Real-time RCA demands
  • Complexity of dynamic systems

---

10. DeepSeek Optimization for RCA

Modules:

  • O&M Middleware
  • Automation Tools
  • Agent Platform
  • RCA Algorithms

---

Private Large Model Deployment

  • Hybrid cloud for data security
  • Local data storage
  • Inference optimization (batching, quantization, accelerated frameworks)

---

11. RCA Scenarios with DeepSeek

  • Multi-alert convergence
  • Log analysis optimization
  • Root cause node tracing
  • Time-series anomaly detection
  • Multi-agent collaboration

---

12. RCA & Strategy Recommendations

  • Prompt Engineering: +36% RCA accuracy
  • Knowledge Base Integration: +29% accuracy
  • Multi-Agent Collaboration: target 90% RCA accuracy
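
To make the first two items concrete, the sketch below shows how a structured prompt might be combined with knowledge base retrieval. It is an illustration only: the keyword-overlap retrieval, the prompt wording, and the sample entries are assumptions, and nothing here reproduces the accuracy figures above.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, knowledge_base, top_k=2):
    """Rank entries by shared tokens with the query; drop entries with no overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(doc)), doc) for doc in knowledge_base]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(alert_summary, knowledge_base):
    """Assemble a structured RCA prompt from the alert and retrieved entries."""
    snippets = retrieve(alert_summary, knowledge_base)
    knowledge = "\n".join(f"- {s}" for s in snippets) or "- (no matching entries)"
    return (
        "You are an operations engineer performing root cause analysis.\n"
        f"Alert summary:\n{alert_summary}\n\n"
        f"Relevant knowledge base entries:\n{knowledge}\n\n"
        "List the most likely root cause, the evidence for it, "
        "and one remediation step."
    )


# Hypothetical knowledge base entries.
KB = [
    "mysql connection pool exhaustion causes timeout errors in upstream services",
    "redis memory eviction leads to cache miss spikes",
    "rollback the latest release when error rate rises right after a deploy",
]
summary = "order-api timeout errors, mysql connection count at limit"
matches = retrieve(summary, KB)
prompt = build_prompt(summary, KB)
```

A real deployment would more likely use embedding-based retrieval than token overlap; the point is only that the model receives the alert together with the most relevant operational knowledge in a fixed layout.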

---

Q&A Highlights

Q1: Fine-tuning vs. RAG & multi-agent scheduling

  • Fine-tuning suits tasks with strict performance requirements
  • RCA at SF does not use fine-tuning yet: the cost is high and the gain is limited for smaller models
  • Multi-agent scheduling logic is currently hardcoded or driven by a workflow engine

Q2: Large models in real-time monitoring

  • LLM calls are triggered manually after an alert fires
  • Thresholds are precomputed using traditional time-series methods
  • CV or fine-tuned binary classifiers check whether a detected anomaly is valid
