# HAMi NVIDIA GPU Topology-Aware Scheduling — Design & Code Deep Dive

**Date:** 2025-10-25 13:30 (Zhejiang)

This article explains the **design philosophy**, **core principles**, and **code implementation** of HAMi’s new **topology-aware scheduling** capability for NVIDIA GPUs in version `v2.7.0`. We focus on how HAMi intelligently schedules GPU workloads in **HPC** and **large AI model training** scenarios to minimize communication bottlenecks and maximize cluster efficiency.

---

## Introduction

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-506.jpg)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-464.jpg)

HAMi is an active **open-source project** maintained by over **350 contributors** from more than **15 countries**. Adopted by over **200 enterprises**, it is known for **scalability** and **production-grade stability**.

The `v2.7.0` release brings **topology-aware scheduling** for NVIDIA GPUs, allowing workloads to be deployed to GPU groups with the **closest physical connections** (e.g., NVLink, PCIe) and **fastest communication speeds**.

---

## Key Features

- **Dynamic topology scoring**:  
  Device Plugin uses **NVML** to detect GPU physical topology and converts it into **numeric communication scores**.

- **Dual-strategy anti-fragmentation scheduling**:  
  - **Best match** for multi-GPU tasks, maximizing inter-GPU communication efficiency.
  - **Minimal disruption** for single-GPU tasks, preserving topology for future workloads.

---

## Core Principles

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-437.jpg)

HAMi scheduling consists of two main stages:

### **Stage 1: Topology Registration**
Turn the invisible hardware topology into quantifiable communication scores:
1. **Detect connection types** between GPU pairs (`NVLink`, `PCIe`) via NVML.
2. **Model the topology graph** for all devices on a node.
3. **Assign scores** using rules (e.g., `SingleNVLINKLink` = 100, `P2PLinkCrossCPU` = 10).
4. **Create a device score table** and register it in **node annotations**.

### **Stage 2: Scheduling Decision**
Use the score table in the `Fit()` function to choose optimal GPUs:
1. **Filter** GPUs by resource requirements.
2. **Score & optimize**:
   - For **multi-GPU**: select the highest total communication score.
   - For **single-GPU**: select the least-connected GPU to reduce topology impact.
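Both selection rules reduce to arithmetic over a pairwise score matrix. The sketch below uses illustrative numbers and simplified stand-ins (`combinationScore`, `worstSingleCard`) rather than HAMi's real helpers, but it captures the two rules: maximize total pair score for multi-GPU requests, and pick the least-connected card for single-GPU requests.

```go
package main

import "fmt"

// Toy pair-score matrix for 4 GPUs: scores[i][j] is the communication
// score between GPU i and GPU j. GPUs 0-2 are NVLink-connected (100);
// GPU 3 only reaches the others over cross-CPU PCIe (10).
var scores = [4][4]int{
	{0, 100, 100, 10},
	{100, 0, 100, 10},
	{100, 100, 0, 10},
	{10, 10, 10, 0},
}

// combinationScore sums the pairwise scores inside one candidate GPU
// set, mirroring the "highest total communication score" rule.
func combinationScore(gpus []int) int {
	total := 0
	for i := 0; i < len(gpus); i++ {
		for j := i + 1; j < len(gpus); j++ {
			total += scores[gpus[i]][gpus[j]]
		}
	}
	return total
}

// worstSingleCard returns the GPU whose connections to all others sum
// lowest, mirroring the single-GPU "minimal disruption" rule.
func worstSingleCard(n int) int {
	best, bestSum := 0, -1
	for g := 0; g < n; g++ {
		sum := 0
		for o := 0; o < n; o++ {
			sum += scores[g][o]
		}
		if bestSum < 0 || sum < bestSum {
			best, bestSum = g, sum
		}
	}
	return best
}

func main() {
	fmt.Println(combinationScore([]int{0, 1})) // NVLink pair: 100
	fmt.Println(worstSingleCard(4))            // GPU 3 is least connected
}
```

Handing GPU 3 to a single-GPU job leaves the fully NVLinked trio {0, 1, 2} intact for a future multi-GPU job, which is exactly the anti-fragmentation goal.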

---

## Principle Implementation — Code Analysis

### 1. **Topology Discovery & Score Calculation**

#### Build the Topology Graph
File: `pkg/device/nvidia/calculate_score.go`
- **Initialize device list** with empty `Links` maps.
- **Populate links**:
  - `GetP2PLink` and `GetNVLink` detect interconnects.
  - Store connections as `P2PLink` objects in each device’s `Links`.

#### Quantify to Scores
Function: `calculateGPUPairScore`
- Sum scores for all links between two GPUs using a `switch` statement.

```go
func (o *deviceListBuilder) build() (DeviceList, error) {
	var devices DeviceList

	// First pass: one Device entry (with an empty Links map) per NVML
	// device; the construction of `device` is elided in this excerpt.
	for i, d := range nvmlDevices {
		devices = append(devices, device)
	}

	// Second pass: for every ordered GPU pair, record both the PCIe
	// (P2P) link and the NVLink connection in the Links map.
	for i, d1 := range nvmlDevices {
		for j, d2 := range nvmlDevices {
			if i != j {
				p2plink, _ := GetP2PLink(d1, d2)
				devices[i].Links[j] = append(
					devices[i].Links[j], P2PLink{devices[j], p2plink},
				)

				nvlink, _ := GetNVLink(d1, d2)
				devices[i].Links[j] = append(
					devices[i].Links[j], P2PLink{devices[j], nvlink},
				)
			}
		}
	}

	return devices, nil
}

func calculateGPUPairScore(gpu0 Device, gpu1 Device) int {
	score := 0
	// Sum the score of every recorded link between the two GPUs.
	for _, link := range gpu0.Links[gpu1.Index] {
		switch link.Type {
		case P2PLinkCrossCPU:
			score += 10
		case SingleNVLINKLink:
			score += 100
		// ...
		}
	}
	return score
}
```
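The pair scoring can be exercised in isolation with a self-contained toy version. The types below are simplified stand-ins for the real ones in `pkg/device/nvidia` (only the fields the scoring loop touches are kept), and only two of the link-type cases are modeled.

```go
package main

import "fmt"

// Simplified stand-ins for the types in calculate_score.go.
type P2PLinkType int

const (
	P2PLinkCrossCPU P2PLinkType = iota
	SingleNVLINKLink
)

type P2PLink struct {
	Type P2PLinkType
}

type Device struct {
	Index int
	Links map[int][]P2PLink // peer index -> links to that peer
}

// calculateGPUPairScore sums the scores of every link recorded between
// two GPUs, as in the switch statement above.
func calculateGPUPairScore(gpu0, gpu1 Device) int {
	score := 0
	for _, link := range gpu0.Links[gpu1.Index] {
		switch link.Type {
		case P2PLinkCrossCPU:
			score += 10
		case SingleNVLINKLink:
			score += 100
		}
	}
	return score
}

func main() {
	// GPU0 reaches GPU1 over one NVLink plus a cross-CPU PCIe path.
	gpu0 := Device{Index: 0, Links: map[int][]P2PLink{
		1: {{SingleNVLINKLink}, {P2PLinkCrossCPU}},
	}}
	gpu1 := Device{Index: 1}
	fmt.Println(calculateGPUPairScore(gpu0, gpu1)) // 100 + 10 = 110
}
```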


---

### 2. **Device-side Scheduling Decision**

File: `pkg/device/nvidia/device.go`
- Check if **topology-aware policy** is requested.
- Apply appropriate strategy:

```go
func (nv *NvidiaGPUDevices) Fit(...) {
	// Did the pod request the topology-aware policy via annotation?
	needTopology := util.GetGPUSchedulerPolicyByPod(...) == util.GPUSchedulerPolicyTopology.String()

	// Filter idle GPUs into tmpDevs (filtering logic elided in this excerpt).

	if needTopology {
		// Only optimize when there are more candidates than requested.
		if len(tmpDevs[k.Type]) > int(originReq) {
			if originReq == 1 {
				// Single-GPU: pick the least-connected card to preserve topology.
				lowestDevices := computeWorstSignleCard(nodeInfo, request, tmpDevs)
				tmpDevs[k.Type] = lowestDevices
			} else {
				// Multi-GPU: enumerate combinations and keep the best-scoring one.
				combinations := generateCombinations(request, tmpDevs)
				combination := computeBestCombination(nodeInfo, combinations)
				tmpDevs[k.Type] = combination
			}
			return true, tmpDevs, ""
		}
	}
}
```


---

## Fit Function Strategies

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-410.jpg)

### **Strategy 1: Best Match (Multi-GPU)**
- Enumerate all combinations via `generateCombinations`.
- Select the one with **maximum total communication score** using `computeBestCombination`.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-374.jpg)

### **Strategy 2: Minimal Disruption (Single-GPU)**
- Use `computeWorstSignleCard` to find the least-connected GPU.
- Preserves topology for future larger jobs.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-350.jpg)

---

## Usage Example

Enable topology-aware scheduling with an **annotation**:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-topology-aware-job
  annotations:
    hami.io/gpu-scheduler-policy: "topology-aware"
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.6.2-base-ubuntu20.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          hami.io/gpu: "4"
```

---

## Summary

HAMi’s **dynamic discovery + dual optimization** approach:
- Best performance for current workloads.
- Long-term cluster topology health.
- Essential for **large-scale AI training** and **HPC workloads**.

---

## References

- [HAMi NVIDIA GPU Topology Policy Proposal](https://github.com/Project-HAMi/HAMi/blob/master/docs/proposals/gpu-topo-policy.md)  
- [NVIDIA GPU Topology Scheduler Guide](https://github.com/Project-HAMi/HAMi/blob/master/docs/proposals/nvidia-gpu-topology-scheduler.md)  
- Relevant PRs: [#1018](https://github.com/Project-HAMi/HAMi/pull/1018), [#1276](https://github.com/Project-HAMi/HAMi/pull/1276)  
