# Open-Source Collaboration Reshaping Inference Infrastructure: Mooncake’s Path from Architectural Innovation to Ecosystem Synergy

# The "Triangle Dilemma" of Large Model Inference Deployment  
*From Technical Exploration to Industrial Implementation*

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-420.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-389.jpg)  

---

## Introduction: Understanding the "Triangle Dilemma"

Large model inference deployments in the real world face a **three-way trade-off** between:

- **Cost** — reducing the cost per million tokens
- **Throughput** — meeting massive concurrency demands
- **Long-context processing** — enabling long-form texts and multi-turn conversations

These constraints form a **triangle dilemma** where optimizing one often compromises the others.

The open-source project **Mooncake** proposes *compute–storage decoupling* — leveraging **Prefill–Decode (PD) separation** and **KVCache pooling** to create scalable, efficient inference infrastructure.  

In *AI Evolution Theory: Paths to Breakthrough in the Age of Intelligent Computing OS* (session five), Professor **Zhang Mingxing** (Tsinghua University, Mooncake co-initiator) and Dr. **Ma Teng** (Alibaba Cloud, core contributor) detailed Mooncake’s **technical logic**, **open-source value**, and **enterprise practices** — providing key insights for intelligent computing OS development.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-363.jpg)

---

## 1. Industry Pain Points & Mooncake’s Background

### Q1: Core industry challenge — cost, throughput, context length

**Professor Zhang Mingxing (Tsinghua University)**  
The main issue is **balancing cost against user experience**.

Inference has two phases:

1. **Prefill** — processes long user inputs  
2. **Decode** — generates output token-by-token  

Deploying both on the same GPU leads to interference, hurting output smoothness.  
Mooncake’s solution:

- **Separate Prefill & Decode pipelines**
- Large **KVCache pool** for multi-turn dialogs and prompt sharing

With massive models (600B+ parameters) and long contexts (tens of thousands of tokens), a separated architecture combined with techniques like **speculative decoding** is vital to maintain speed and keep costs in check. A minimal sketch of the separated flow follows.
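
A minimal Python sketch of that separated flow, using entirely hypothetical names (`KVCachePool`, `prefill_worker`, and `decode_worker` are illustrations, not Mooncake's actual API): prefill computes the KVCache once and publishes it to a shared pool; decode fetches it from the pool instead of recomputing.

```python
# Minimal sketch of Prefill-Decode separation over a shared KVCache pool.
# All names and data structures are illustrative, not Mooncake's actual API.

def run_prefill(prompt: str) -> list:
    # Stand-in for the compute-bound forward pass over the full input.
    return [ord(c) % 7 for c in prompt]

def decode_step(kv_blocks: list):
    # Stand-in for one memory-bound autoregressive step that extends the cache.
    tok = sum(kv_blocks) % 100
    return tok, kv_blocks + [tok % 7]

class KVCachePool:
    """Shared pool keyed by prompt; enables multi-turn reuse and prompt sharing."""
    def __init__(self):
        self._store: dict = {}

    def put(self, key: str, kv_blocks: list) -> None:
        self._store[key] = kv_blocks

    def get(self, key: str):
        return self._store.get(key)  # None signals a cache miss

def prefill_worker(prompt: str, pool: KVCachePool) -> str:
    key = prompt  # real systems hash prefixes to enable partial-prompt sharing
    if pool.get(key) is None:        # skip recompute when a prior turn cached it
        pool.put(key, run_prefill(prompt))
    return key

def decode_worker(key: str, pool: KVCachePool, max_tokens: int) -> list:
    kv_blocks = pool.get(key)        # fetched from the pool, not recomputed
    tokens = []
    for _ in range(max_tokens):
        tok, kv_blocks = decode_step(kv_blocks)
        tokens.append(tok)
    return tokens

pool = KVCachePool()
key = prefill_worker("What is PD separation?", pool)
print(decode_worker(key, pool, max_tokens=4))
```

Because the two phases no longer share a GPU, the compute-bound prefill pass cannot stall the memory-bound decode loop.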

**Dr. Ma Teng (Alibaba Cloud)**  
Cost, throughput, and context length form a **triangle relationship**:

- **Long context** → high memory usage  
- **High throughput** → requires batching  
- **Batching** → can reduce supported context length

High-end GPUs are expensive — **PD separation** and **hierarchical storage** aim to balance all three.  
Different workloads (e.g., offline batch vs. real-time chat) need tailored strategies.
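
The memory side of the triangle is easy to quantify. A back-of-envelope sizing in Python, assuming an illustrative 80-layer model with grouped-query attention (8 KV heads, head dimension 128) and an FP16 cache, not any specific model:

```python
# Back-of-envelope KVCache sizing: why long context plus batching exhausts VRAM.
# Model shape is illustrative, not any specific model.

def kvcache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x accounts for the K and V tensors, one pair per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = 1024 ** 3
per_token = kvcache_bytes(80, 8, 128, 1, 1)
print(per_token / 1024, "KiB per token")                              # 320.0
print(kvcache_bytes(80, 8, 128, 128_000, 1) / gib, "GiB, one 128K request")  # ~39
print(kvcache_bytes(80, 8, 128, 128_000, 8) / gib, "GiB, batch of 8")        # ~312
```

Under these assumptions a single 128K-token request already holds roughly 39 GiB of KVCache, and a batch of eight would exceed any single 80 GiB GPU, which is exactly why batching and supported context length pull against each other.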

---

### Q2: Original motivation & open-source benefits

**Dr. Ma Teng (Alibaba Cloud)**  
*(The origin story is covered under "Origin and Early Development" below.)*

**Open-source advantage:**
- **Ecosystem cycle** — avoid closed-door development
- Collaboration attracts more partners (Ant Group, Moore Threads)
- Broader scenario coverage through community input

**Prof. Zhang Mingxing (Tsinghua University)**  
Goal: evolve from a single-company engine to **general-purpose infrastructure**.

- Open source reduces collaboration cost
- Neutral community (Dragon Lizard Community) builds trust
- Short academic-to-industry transformation cycle accelerated via open source

---

### Origin and Early Development

Timeline:
- **Jun–Jul last year** — Kimi/Tsinghua KVCache Pooling tech report sparks idea
- Initial internal use at Kimi → first open-source version in **Nov**
- Direction shift to **higher-level inference framework integration** by **May–Jun this year**

**Key principle:** Build neutral, collaboratively maintained infrastructure.

---

### Q3: Role of Alibaba Cloud’s foundational software

**Dr. Ma Teng (Alibaba Cloud)**
Integration challenges:
- **eRDMA tuning** took months to reach optimal performance
- Hardware topology differences between cloud GPUs & traditional servers
- Expanded transport engine to be **topology aware** — reusable by other enterprises

Community collaboration via Dragon Lizard AI SIG helped:
- Integrate domestic hardware vendors
- Push **OS & driver optimization** to the limit to extract full hardware performance
- Enable faster protocol/hardware adaptation

---

## 2. Mooncake’s Core Technology & Design Logic

### Q4: Key differences from traditional inference approaches

**Prof. Zhang Mingxing (Tsinghua University)**
Mooncake utilizes **decoupled architecture**:
- **KVCache-centric separation**
- Prefill → KVCache generation → KVCache pool → Decode
- Independent KVCache management (some decoding components split across devices)

**OS layer requirements:**
- Extreme hardware performance extraction
- Support **async ops**, **zero-copy transfers**, and advanced **topology awareness** (see the sketch below)
- High fault tolerance
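
To make those requirements concrete, here is a hedged sketch of what an async, zero-copy transfer path looks like at the API level; `TransferEngineSketch` and all of its methods are hypothetical stand-ins, not the real Transfer Engine interface:

```python
# Hypothetical async, zero-copy transfer interface; illustrative only,
# not the real Mooncake Transfer Engine API.
import asyncio

class TransferHandle:
    def __init__(self):
        self._done = asyncio.Event()

    async def wait(self):
        await self._done.wait()

class TransferEngineSketch:
    def register_region(self, buf: memoryview) -> int:
        # Real engines pin and register memory with the NIC so RDMA can
        # read/write it directly, with no intermediate CPU copy (zero-copy).
        return id(buf)

    def submit_read(self, remote_addr: int, local_region: int) -> TransferHandle:
        handle = TransferHandle()
        # A real implementation would pick the NIC closest to the target GPU
        # (topology awareness) and post an RDMA read; here we just complete
        # the transfer on the next event-loop tick.
        asyncio.get_running_loop().call_soon(handle._done.set)
        return handle

def do_other_compute():
    pass  # placeholder for decode work that overlaps the in-flight transfer

async def main():
    engine = TransferEngineSketch()
    region = engine.register_region(memoryview(bytearray(4096)))
    handle = engine.submit_read(remote_addr=0x1000, local_region=region)  # async
    do_other_compute()          # overlap compute with the transfer
    await handle.wait()         # block only when the data is actually needed

asyncio.run(main())
```

The payoff of the pattern is that the caller submits the read, keeps computing, and awaits completion only when the data is needed, so transfer latency hides behind decode work.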

**Dr. Ma Teng (Alibaba Cloud)**
The trend is toward **multikernel architectures**, where the whole cluster behaves as a single OS.
OS roles:
- Bridge hardware and Mooncake
- Driver adaptation and capability abstraction
- Pre-packaged intelligent computing images for ease of deployment

---

### Q5: KVCache Pooling & Efficient Transmission — Core breakthrough points

**Challenges:**
- **Standardization** — unified API for multiple inference frameworks
- **Scalability** — multi-tenancy, cloud-native fit, new protocol support (CXL, RDMA)

**Transport Engine**:
- Supports **eRDMA** & **GPU Direct**
- Low latency + clean architecture

**Hardware speed alignment**:
- Micro/nano-second operation optimization
- Manage heterogeneous devices & data locality
- Marginal gains compound: raising the cache hit rate from 90% to 95% halves the miss rate, i.e., roughly 50% less recomputation (worked out below)
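
Spelling out the arithmetic: prefill recomputation scales with the miss rate, not the hit rate, so a five-point hit-rate gain near the top of the range halves the remaining work.

```python
# Why a 5-point hit-rate gain is worth chasing: prefill recomputation
# scales with the miss rate, not the hit rate.
def relative_prefill_compute(hit_rate: float) -> float:
    return 1.0 - hit_rate  # fraction of prompt tokens that must be recomputed

before, after = relative_prefill_compute(0.90), relative_prefill_compute(0.95)
print(f"miss rate: {before:.0%} -> {after:.0%}")                # 10% -> 5%
print(f"compute reduction: {(before - after) / before:.0%}")    # 50%
```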

---

### Q6: Adapting to Enterprise Needs

**Early stage** — focus on **fast launch** over enterprise-grade demands  
**Later refinements**:
- Reliability & stability enhancements
- Compatibility tuning for eRDMA, CXL
- Larger KVCache pool for Ant Group multi-turn conversations → reduced TTFT (time to first token)
- Cloud-native integration with Alibaba ACK → better resource scheduling

**Modularization**:
- Split into sub-projects (Transport Engine, Mooncake Store, etc.)
- Flexible adoption and maintenance

**Community role**:
- Hardware vendors adapt own solutions
- Avoid ecosystem fragmentation
- Future plan: donation to foundation for neutral growth

---

## 3. Industry Practice & Effectiveness Verification

### Q7: Framework adaptation challenges

Framework differences:
- **SGLang** → point-to-point transport
- **vLLM** → suited to Put/Get semantics

**Strategy**:
- Component reuse
- Middleware abstraction (**Mooncake Store**); a sketch of both integration styles over one transport follows below
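
One plausible shape for that middleware layer (interfaces are illustrative, not the actual Mooncake Store or framework APIs): a Put/Get object store for vLLM-style lookups and a direct point-to-point handoff for SGLang-style transfers, both built on the same transport primitive.

```python
# Sketch of middleware-style abstraction: one transport layer, two front-ends.
# Interfaces are illustrative, not the actual Mooncake Store / framework APIs.
from typing import Protocol

class Transport(Protocol):
    def send(self, dst: str, payload: bytes) -> None: ...
    def recv(self, src: str) -> bytes: ...

class LoopbackTransport:
    """In-process stand-in for an RDMA-backed transport."""
    def __init__(self):
        self._mailbox: dict = {}
    def send(self, dst: str, payload: bytes) -> None:
        self._mailbox[dst] = payload
    def recv(self, src: str) -> bytes:
        return self._mailbox[src]

class PutGetStore:
    """vLLM-style front-end: named KV blobs with Put/Get semantics."""
    def __init__(self, transport: Transport):
        self._t = transport
    def put(self, key: str, kv: bytes) -> None:
        self._t.send(f"store/{key}", kv)
    def get(self, key: str) -> bytes:
        return self._t.recv(f"store/{key}")

def p2p_handoff(transport: Transport, decode_node: str, kv: bytes) -> None:
    """SGLang-style front-end: ship KV straight from prefill to a decode node."""
    transport.send(decode_node, kv)

t = LoopbackTransport()
PutGetStore(t).put("prompt-123", b"kv-bytes")   # Put/Get semantics
p2p_handoff(t, "decode-0", b"kv-bytes")         # point-to-point transfer
```

Keeping one transport underneath both front-ends is what allows component reuse across frameworks with different cache semantics.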

**Results**:
- SGLang PD separation → throughput ↑ 30%+, TTFT ↓ 20%

---

### Q8: Enterprise-driven iterations

- Ant Group's multi-round dialog workloads drove KVCache reuse
- Collaboration to integrate with SGLang BlackCache improved TTFT

---

**Multi-tenant optimization (Alibaba Cloud)**:
- Resource isolation in Mooncake Store  
- VRAM pooling to unite idle GPU memory  
- Offload inactive KVCache to disk/CFS → significant cost reduction with roughly 20% performance impact (tiering sketch below)
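
A minimal sketch of that offload policy, with the tiers modeled as plain dictionaries (a real deployment would use pooled GPU memory and disk/CFS): hot entries stay in VRAM and the least recently used ones are demoted.

```python
# Two-tier KVCache sketch: hot entries in (pooled) VRAM, cold entries on disk.
# Tiers are modeled as dicts; this is illustrative, not the production design.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, vram_capacity: int):
        self.vram: OrderedDict = OrderedDict()  # insertion order tracks recency
        self.disk: dict = {}
        self.capacity = vram_capacity

    def put(self, key: str, kv: bytes) -> None:
        self.vram[key] = kv
        self.vram.move_to_end(key)
        while len(self.vram) > self.capacity:          # demote coldest entry
            cold_key, cold_kv = self.vram.popitem(last=False)
            self.disk[cold_key] = cold_kv

    def get(self, key: str):
        if key in self.vram:
            self.vram.move_to_end(key)                 # refresh recency
            return self.vram[key]
        if key in self.disk:                           # promote on a disk hit
            self.put(key, self.disk.pop(key))
            return self.vram[key]
        return None                                    # full miss: re-prefill
```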

**Enterprise-grade robustness**:
- eRDMA optimization
- Layered KVCache storage for long-text workloads
- Auto-config tools → combine model specs and business SLOs to recommend resources (toy example below)
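
A toy version of such a recommendation tool; the sizing rule, the headroom factor, and all numbers are illustrative placeholders rather than the actual tooling:

```python
# Toy resource recommender: model cache footprint + SLO -> suggested GPU count.
# The sizing rule, the 60% headroom factor, and all numbers are placeholders.
def recommend_gpus(kv_bytes_per_token: int, max_context: int,
                   concurrent_sessions: int, vram_per_gpu_gib: int = 80) -> int:
    need_gib = kv_bytes_per_token * max_context * concurrent_sessions / 1024**3
    usable_gib = vram_per_gpu_gib * 0.6   # leave headroom for weights/activations
    return max(1, int(-(-need_gib // usable_gib)))  # ceiling division

# ~320 KiB/token cache, 32K context, 16 concurrent sessions -> 4 GPUs
print(recommend_gpus(kv_bytes_per_token=327_680, max_context=32_000,
                     concurrent_sessions=16))
```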

---

### Q9: Achieving $0.20 per 1M tokens

**Feasibility conditions**:
- High concurrency utilization
- Workloads that tolerate modest output speed (e.g., chat scenarios); the arithmetic below shows what this implies

**Cost drivers**:
- Resource utilization maximization (KVCache pooling, PD separation)
- Storage offload & VRAM pooling
- Selectively using smaller models in multi-agent scenarios
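
The arithmetic behind the target is straightforward once a GPU price is assumed (the $2/hour figure below is an assumption, not a quoted rate): a $0.20-per-million-token price only works if each GPU sustains thousands of tokens per second, hence the emphasis on utilization.

```python
# Throughput needed to hit a $/1M-token target; the GPU hourly price is assumed.
def required_tokens_per_sec(gpu_dollars_per_hour: float,
                            target_dollars_per_million_tokens: float) -> float:
    tokens_per_dollar = 1_000_000 / target_dollars_per_million_tokens
    return gpu_dollars_per_hour * tokens_per_dollar / 3600

print(f"{required_tokens_per_sec(2.0, 0.20):,.0f} tokens/s per GPU")  # ~2,778
```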

---

## 4. Mooncake’s Evolution & Industry Insights

### Q10: Future inference tech trends

- Move toward **AI-Aware OS**
- Deep optimization in networking, storage, GPU scheduling
- **Mooncake Store V2**:  
  1. Multi-tenant KVCache sharing  
  2. Low-cost layered storage for expanded KVCache capacity

---

### Q11: Ecosystem expansion & developer advice

- Prioritize openness/fairness to avoid monopoly
- Deepen hardware vendor collaboration
- Donate to foundation for neutrality
- Promote interface standardization + automated tuning tools
- Developers: map existing expertise (storage, networking) to AI infra needs

---

### Q12: Technical paradigm shifts & OS implications

**Resource efficiency**, **shared infra**, **layered storage**, and **adaptive configuration** matter most.

- OS: more fine-grained hardware abstraction
- Decoupled architecture emerging as consensus

---

## Conclusion: Co-building Inference Infrastructure

Mooncake transcends point optimizations — it represents **open-source-driven collaboration** blending traditional infra tech with AI-specific needs.

**Key outcomes**:
- Balance cost, throughput, and long-context support
- Foster academia–industry synergy
- Promote system-wide optimization over isolated fixes

**Repository**: [https://github.com/kvcache-ai/mooncake](https://github.com/kvcache-ai/mooncake)

---

## About the Series

*AI Evolution Theory: Paths to Breakthrough in the Age of Intelligent Computing OS* explores how AI is reshaping industries and driving domestic technology substitution, focusing on cloud, AI, and security capabilities in server operating systems, with Alibaba Cloud's OS as the reference platform.
Full video replay: **[Read Original](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK)**  

