Decoupled Scheduling Architecture: Meta’s Journey in Scaling AI

# Disaggregated Scheduled Fabric (DSF) — Meta’s Next-Generation AI Network Fabric

**Disaggregated Scheduled Fabric (DSF)** is Meta’s advanced network fabric technology for AI training, designed to overcome the constraints of traditional Clos-based architectures. It enables **scalable, low-latency, lossless AI networks** by disaggregating the switching hardware and optimizing traffic management for large GPU clusters.

This document outlines:

- **Challenges** in traditional IP fabrics for AI workloads
- **Innovations** in DSF design
- **Scalable deployment models** from single zones to mega clusters
- **Future directions** including new interconnect technologies

---

## Why DSF?

The rapid growth of **GenAI applications** demands **high-performance AI networks** that can handle massive AI model training workloads. DSF supports this by:

- Replacing monolithic chassis switches with modular units
- Implementing **VOQ-based architectures**
- Using open standards like [OCP-SAI](https://github.com/opencomputeproject/SAI) and [FBOSS](https://engineering.fb.com/2018/09/04/data-infrastructure/research-in-brief-building-switch-software-at-scale-and-in-the-open/)
- Delivering **better load balancing** and **congestion management** for intra-/inter-cluster traffic

---

## 1. Challenges With Traditional IP Fabrics

Running AI training jobs over RoCE (RDMA carried over UDP/IP) exposed **three key issues** with traditional IP fabrics:

### 1.1 Elephant Flows
- Long-duration, high-volume traffic congests specific links.
- Causes head-of-line blocking.

### 1.2 Low Entropy
- Few IP flows in collective GPU operations.
- Inefficient hashing → hotspot congestion.
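
The low-entropy problem is easy to see in a toy simulation: hash a handful of long-lived RoCE-style flows onto uplinks the way a 5-tuple ECMP switch would, and most uplinks sit idle while one or two carry everything. The addresses, flow sizes, and hash below are illustrative assumptions, not Meta's implementation.

```python
# Toy illustration of ECMP hashing with low flow entropy (not Meta's implementation).
# A few large "elephant" RoCE flows are hashed onto uplinks by their 5-tuple;
# with so few flows, a couple of links end up carrying most of the traffic.
import hashlib
from collections import defaultdict

NUM_UPLINKS = 8

def ecmp_pick(five_tuple, num_links=NUM_UPLINKS):
    """Deterministically map a 5-tuple to one uplink, as hash-based ECMP would."""
    key = "|".join(map(str, five_tuple)).encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_links

# Hypothetical collective: 4 GPU pairs, one long-lived elephant flow each (sizes in GB).
flows = [
    (("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP"), 200),
    (("10.0.0.2", "10.0.1.2", 4791, 4791, "UDP"), 200),
    (("10.0.0.3", "10.0.1.3", 4791, 4791, "UDP"), 200),
    (("10.0.0.4", "10.0.1.4", 4791, 4791, "UDP"), 200),
]

link_load = defaultdict(int)
for tup, size_gb in flows:
    link_load[ecmp_pick(tup)] += size_gb

for link in range(NUM_UPLINKS):
    print(f"uplink {link}: {link_load[link]} GB")
# Only a few uplinks carry any traffic at all; the rest sit idle --
# the hotspot congestion described above.
```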

### 1.3 Suboptimal Fabric Utilization
- Uneven bandwidth usage across links.
- Forces **overprovisioning** to maintain performance.

---

## 2. Attempted Solutions & Limitations

1. **BGP-Based Pinning**
   - Pins traffic to specific uplinks.
   - Improves low-entropy scenarios.
   - Breaks down in failure cases → relies on ECMP fallback.

2. **Load-Aware ECMP**
   - Tries to balance fat flows.
   - Requires complex tuning.
   - Introduces **out-of-order packets** — problematic for RDMA (see the sketch after this list).

3. **Centralized Traffic-Engineering**
   - Pre-computes flow patterns per model.
   - Scales poorly with network growth.
   - Slow reaction to failures.
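
To see why mid-flow rebalancing is risky for RDMA, the sketch below moves a single flow to a different path halfway through and shows the receiver observing packets out of sequence; the path latencies are made-up numbers purely for illustration.

```python
# Sketch: why mid-flow path changes reorder packets (illustrative latencies, not measurements).
# Packets 0-9 of one RDMA flow; halfway through, the load balancer moves the flow
# to a less-loaded path with lower queueing delay.
path_latency_us = {"A": 20.0, "B": 1.0}     # hypothetical one-way queueing delays

arrivals = []
for seq in range(10):
    path = "A" if seq < 5 else "B"          # flow is re-pinned after packet 4
    send_time = seq * 1.0                    # packets sent 1 us apart
    arrivals.append((send_time + path_latency_us[path], seq))

arrivals.sort()                              # receiver sees packets in arrival-time order
received_seqs = [seq for _, seq in arrivals]
print(received_seqs)                         # [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
# The packets sent on the faster path overtake the earlier ones. A go-back-N RoCE
# responder treats the gap as loss and forces retransmission, which is why
# per-flow rebalancing is problematic for RDMA traffic.
```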

---

## 3. DSF Overview

### 3.1 Core Idea
DSF separates the **Ethernet domain** from the **Fabric domain**:

- **Ethernet domain** — normal server-facing networking
- **Fabric domain** — cell-based spraying of traffic across all fabric links, with hardware reassembly at the egress

### 3.2 Components
- **Interface Nodes (INs)** = Rack Disaggregated Switches (RDSWs)
- **Fabric Nodes (FNs)** = Fabric Disaggregated Switches (FDSWs)

Together, they form a **virtual chassis** that appears as a single switch to the outside network.
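
A minimal sketch of the Fabric-domain idea, under stated assumptions: the ingress slices each packet into fixed-size cells, sprays them across all fabric links, and the egress reassembles them by sequence number so the Ethernet domain still sees complete, in-order packets. The cell size and header fields here are illustrative, not the actual DSF cell format.

```python
# Illustrative cell spraying and reassembly (assumed cell size and header; not the real DSF format).
from dataclasses import dataclass

CELL_PAYLOAD = 256  # bytes per cell -- an assumption for illustration

@dataclass
class Cell:
    packet_id: int
    seq: int          # cell index within the packet
    total: int        # total cells in the packet
    payload: bytes

def spray(packet_id: int, packet: bytes, num_fabric_links: int):
    """Ingress side: slice a packet into cells and spread them across all fabric links."""
    chunks = [packet[i:i + CELL_PAYLOAD] for i in range(0, len(packet), CELL_PAYLOAD)]
    per_link = [[] for _ in range(num_fabric_links)]
    for seq, chunk in enumerate(chunks):
        per_link[seq % num_fabric_links].append(Cell(packet_id, seq, len(chunks), chunk))
    return per_link

def reassemble(cells):
    """Egress side: put cells back in order regardless of which link delivered them first."""
    cells = sorted(cells, key=lambda c: c.seq)
    assert len(cells) == cells[0].total, "waiting for missing cells"
    return b"".join(c.payload for c in cells)

# Usage: a 1 KiB packet sprayed over 4 fabric links and reassembled on the far side.
pkt = bytes(range(256)) * 4
lanes = spray(packet_id=1, packet=pkt, num_fabric_links=4)
arrived = [cell for lane in reversed(lanes) for cell in lane]   # links may deliver in any order
assert reassemble(arrived) == pkt
print("reassembled", len(pkt), "bytes from", sum(len(l) for l in lanes), "cells")
```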

---

## 4. Traffic Management

- **Packet spraying** across all paths
- **Credit-based congestion control** (see the sketch after this list)
- **VOQ scheduling** for lossless delivery
- **In-order delivery** guaranteed within the fabric
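
How VOQs and credit-based scheduling combine into lossless delivery can be sketched as follows: traffic waits in a per-destination virtual output queue at the ingress and only crosses the fabric once the egress has granted credits for buffer space it actually has, so nothing is dropped mid-fabric. The queue and credit sizes below are arbitrary illustrative values, not DSF's real scheduler.

```python
# Minimal sketch of VOQ + credit-based scheduling (illustrative sizes, not DSF's actual scheduler).
from collections import deque

class Egress:
    """Egress port scheduler: hands out credits only for buffer space it actually has."""
    def __init__(self, buffer_cells: int):
        self.free_cells = buffer_cells

    def grant(self, requested: int) -> int:
        granted = min(requested, self.free_cells)
        self.free_cells -= granted
        return granted

    def drain(self, cells: int):
        self.free_cells += cells            # buffer space freed as the egress link transmits

class IngressVOQ:
    """One virtual output queue: traffic destined to a single egress port."""
    def __init__(self):
        self.queue = deque()
        self.credits = 0

    def enqueue(self, cell):                # cells wait at the ingress, not inside the fabric
        self.queue.append(cell)

    def request_credits(self, egress: Egress):
        self.credits += egress.grant(len(self.queue) - self.credits)

    def transmit(self):
        sent = []
        while self.queue and self.credits > 0:   # only send what was explicitly granted
            sent.append(self.queue.popleft())
            self.credits -= 1
        return sent

# Usage: the egress has room for 2 cells, the ingress has 5 queued -- only 2 cross the
# fabric now; the rest stay buffered at the ingress instead of being dropped mid-fabric.
egress = Egress(buffer_cells=2)
voq = IngressVOQ()
for i in range(5):
    voq.enqueue(f"cell-{i}")
voq.request_credits(egress)
print(voq.transmit())        # ['cell-0', 'cell-1']
egress.drain(2)
voq.request_credits(egress)
print(voq.transmit())        # ['cell-2', 'cell-3']
```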

---

## 5. DSF Deployment Models

### 5.1 Single AI Zone (L1 Zone)
- Multiple scaling units (GPU racks + RDSWs)
- RDSWs connected to FDSWs
- Two identical network planes for fault tolerance

**Figure 1:**
![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-79.png)

---

### 5.2 Dual-Stage Fabric (L2 Zone)
- 4× L1 zones interconnected via **Spine DSF Switches (SDSWs)**
- Non-blocking topology for 18,000 GPUs at 800G each (a bandwidth sanity check follows Figure 2)

**Figure 2:**
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-74.png)
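
As a rough sanity check on what "non-blocking for 18,000 GPUs at 800G" implies, the arithmetic below computes the aggregate injection bandwidth the dual-stage fabric must carry; only the two quoted figures come from the text, the rest is simple multiplication.

```python
# Back-of-the-envelope aggregate bandwidth for the dual-stage (L2) fabric.
gpus = 18_000                 # GPUs in the L2 zone (from the text)
gbps_per_gpu = 800            # per-GPU network speed (from the text)

total_tbps = gpus * gbps_per_gpu / 1_000
total_pbps = total_tbps / 1_000
print(f"aggregate injection bandwidth: {total_tbps:,.0f} Tbps (~{total_pbps:.1f} Pbps)")
# -> aggregate injection bandwidth: 14,400 Tbps (~14.4 Pbps)
# A non-blocking fabric must be able to carry this full load through its spine tier,
# which is why the RDSW, FDSW, and SDSW tiers have to scale in lockstep.
```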

---

### 5.3 DSF Region
- 5× L2 zones interconnected via **L3 super-spine**
- Edge PoD architecture with EDSWs → L3 super-spine links

**Figure 3:**
![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-65.png)

---

## 6. Input Balanced Mode

**Purpose:** Maintain input ≤ output capacity even under remote link failures.

### Failure Handling:
1. **RDSW ↔ FDSW Failure**
2. **FDSW ↔ SDSW Failure**
3. **Combined Failures**

When links fail, a randomized selection of the remaining links stops advertising reachability, keeping input capacity at or below output capacity and avoiding oversubscription (a sketch of this policy follows).
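
A minimal sketch of the input-balanced idea, under assumptions about the exact policy: when some of a switch's downstream links fail, it withdraws reachability on a randomly chosen, proportionate share of its upstream links, so upstream devices can never inject more than it can forward. The link counts and the selection rule are illustrative.

```python
# Sketch of input-balanced link withdrawal (illustrative policy, not the exact DSF algorithm).
import math
import random

def links_to_withdraw(uplinks: int, downlinks: int, failed_downlinks: int) -> int:
    """How many uplinks to stop advertising so input capacity <= remaining output capacity."""
    surviving_fraction = (downlinks - failed_downlinks) / downlinks
    max_usable_uplinks = math.floor(uplinks * surviving_fraction)
    return uplinks - max_usable_uplinks

def apply_input_balanced(uplinks, downlinks, failed_downlinks, rng=random):
    withdraw = links_to_withdraw(len(uplinks), len(downlinks), failed_downlinks)
    withdrawn = set(rng.sample(uplinks, withdraw))      # randomized selection spreads the impact
    still_advertised = [u for u in uplinks if u not in withdrawn]
    return still_advertised, sorted(withdrawn)

# Usage: an FDSW with 8 uplinks and 8 downlinks loses 2 downlinks; it withdraws
# reachability on 2 randomly chosen uplinks so upstream senders see the reduced capacity too.
uplinks = [f"up{i}" for i in range(8)]
downlinks = [f"down{i}" for i in range(8)]
advertised, withdrawn = apply_input_balanced(uplinks, downlinks, failed_downlinks=2)
print("withdrawn:", withdrawn)
print("advertised:", advertised)
```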

---

**Figures 4–17:**
![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-59.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-59.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-52.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-42.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_008-40.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_009-36.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_010-28.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_011-21.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_012-21.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_013-17.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_014-15.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_015-8.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_016-6.png)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_017-2.png)

---

## 7. Future Work

- **Multi-region mega clusters**  
  Cross-region GPU interconnects spanning tens of kilometers ([details](https://engineering.fb.com/2025/09/29/data-infrastructure/metas-infrastructure-evolution-and-the-advent-of-ai/)).

- **Hyperports**  
  Aggregate multiple 800G ports at the ASIC level so they behave as a single higher-speed logical port.

- **Heterogeneous GPU/NIC support**  
  Natively handle varied hardware configurations.

