# The "Triangle Dilemma" of Large Model Inference Deployment
*From Technical Exploration to Industrial Implementation*


---
## Introduction: Understanding the "Triangle Dilemma"
Large model inference deployments in the real world face a **three-way trade-off** between:
- **Cost** — reducing the cost per million tokens
- **Throughput** — meeting massive concurrency demands
- **Long-context processing** — enabling long-form texts and multi-turn conversations
These constraints form a **triangle dilemma** where optimizing one often compromises the others.
The open-source project **Mooncake** proposes *compute–storage decoupling* — leveraging **Prefill–Decode (PD) separation** and **KVCache pooling** to create scalable, efficient inference infrastructure.
In *AI Evolution Theory: Paths to Breakthrough in the Age of Intelligent Computing OS* (session five), Professor **Zhang Mingxing** (Tsinghua University, Mooncake co-initiator) and Dr. **Ma Teng** (Alibaba Cloud, core contributor) detailed Mooncake’s **technical logic**, **open-source value**, and **enterprise practices** — providing key insights for intelligent computing OS development.

---
## 1. Industry Pain Points & Mooncake’s Background
### Q1: Core industry challenge — cost, throughput, context length
**Professor Zhang Mingxing (Tsinghua University)**
The main issue is **balancing cost against user experience**.
Inference has two phases:
1. **Prefill** — processes long user inputs
2. **Decode** — generates output token-by-token
Deploying both on the same GPU leads to interference, hurting output smoothness.
Mooncake’s solution:
- **Separate Prefill & Decode pipelines**
- Large **KVCache pool** for multi-turn dialogs and prompt sharing
With massive models (600B+ parameters) and long contexts (tens of thousands of tokens), the separated architecture plus techniques like **speculative decoding (SpecDecoding)** are vital to maintaining speed while keeping costs in check.
**Dr. Ma Teng (Alibaba Cloud)**
Cost, throughput, and context length form a **triangle relationship**:
- **Long context** → high memory usage
- **High throughput** → requires batching
- **Batching** → can reduce supported context length
High-end GPUs are expensive; **PD separation** and **hierarchical storage** aim to balance all three (a sizing example follows below).
Different workloads (e.g., offline batch vs. real-time chat) need tailored strategies.
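To make the memory pressure concrete, here is a back-of-the-envelope KVCache sizing in Python. The model dimensions are illustrative assumptions (a hypothetical 70B-class model with grouped-query attention), not figures from the interview.

```python
# Back-of-the-envelope KVCache sizing. All dimensions are illustrative
# assumptions for a hypothetical 70B-class model with grouped-query
# attention, not figures from the interview.

def kvcache_bytes(layers: int, kv_heads: int, head_dim: int,
                  seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """KVCache footprint: 2 tensors (key + value) per layer per token, fp16."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

one = kvcache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_000, batch=1)
print(f"one 32k-token request: {one / 2**30:.1f} GiB")          # ~9.8 GiB

batched = kvcache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_000, batch=8)
print(f"batch of 8 such requests: {batched / 2**30:.1f} GiB")   # ~78 GiB
```

At these assumed dimensions, batching just eight 32k-token requests already rivals the capacity of an 80 GB accelerator before model weights are accounted for, which is exactly the tension between throughput and context length described above.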
---
### Q2: Original motivation & open-source benefits
**Dr. Ma Teng (Alibaba Cloud)**
*(Details continue in "Origin and Early Development" below.)*
**Open-source advantage:**
- **Ecosystem cycle** — avoid closed-door development
- Collaboration attracts more partners (Ant Group, Moore Threads)
- Broader scenario coverage through community input
**Prof. Zhang Mingxing (Tsinghua University)**
Goal: evolve from a single-company engine to **general-purpose infrastructure**.
- Open source reduces collaboration cost
- Neutral community (Dragon Lizard Community) builds trust
- Short academic-to-industry transformation cycle accelerated via open source
---
### Origin and Early Development
Timeline:
- **Jun–Jul last year** — Kimi/Tsinghua KVCache Pooling tech report sparks idea
- Initial internal use at Kimi → first open-source version in **Nov**
- Direction shift to **higher-level inference framework integration** by **May–Jun this year**
**Key principle:** Build neutral, collaboratively maintained infrastructure.
---
### Q3: Role of Alibaba Cloud’s foundational software
**Dr. Ma Teng (Alibaba Cloud)**
Integration challenges:
- **eRDMA tuning** took months to reach optimal performance
- Hardware topology differences between cloud GPUs & traditional servers
- Expanded the Transfer Engine to be **topology-aware**, making it reusable by other enterprises
Community collaboration via Dragon Lizard AI SIG helped:
- Integrate domestic hardware vendors
- Push **OS & driver optimization** to their limits to extract full hardware performance
- Enable faster protocol/hardware adaptation
---
## 2. Mooncake’s Core Technology & Design Logic
### Q4: Key differences from traditional inference approaches
**Prof. Zhang Mingxing (Tsinghua University)**
Mooncake adopts a **decoupled architecture** (a sketch follows the list below):
- **KVCache-centric separation**
- Prefill → KVCache generation → KVCache pool → Decode
- Independent KVCache management (some decoding components split across devices)
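A minimal sketch of this flow, using hypothetical class and function names rather than Mooncake's actual API: prefill and decode run as separate workers that meet only through the shared pool.

```python
# Minimal sketch of KVCache-centric PD separation. All names are
# hypothetical illustrations, not Mooncake's actual API.

class KVCachePool:
    """Shared pool keyed by a prompt-prefix hash. In practice this spans
    VRAM/DRAM/SSD tiers and is reached over (e)RDMA."""
    def __init__(self):
        self._store: dict[str, list] = {}

    def put(self, key: str, kv_blocks: list) -> None:
        self._store[key] = kv_blocks

    def get(self, key: str):
        return self._store.get(key)

def run_prefill(prompt: str) -> list:
    return [f"kv({tok})" for tok in prompt.split()]    # stand-in for the forward pass

def run_decode(kv_blocks: list, max_tokens: int) -> str:
    return " ".join("tok" for _ in range(max_tokens))  # stand-in for generation

def prefill_worker(pool: KVCachePool, prompt: str) -> str:
    """Compute-bound phase: build the KVCache once per unique prefix."""
    key = str(hash(prompt))
    if pool.get(key) is None:      # multi-turn dialogs and shared prompts hit here
        pool.put(key, run_prefill(prompt))
    return key

def decode_worker(pool: KVCachePool, key: str, max_tokens: int) -> str:
    """Memory-bound phase, possibly on a different device: fetch, then generate."""
    return run_decode(pool.get(key), max_tokens)

pool = KVCachePool()
key = prefill_worker(pool, "What is PD separation?")
print(decode_worker(pool, key, max_tokens=4))
```

Because the two phases share nothing but the pool, each can be scaled and scheduled independently, which removes the interference described in Q1.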
**OS layer requirements:**
- Extreme hardware performance extraction
- Support **async ops**, **zero-copy transfers**, and advanced **topology awareness**
- High fault tolerance
**Dr. Ma Teng (Alibaba Cloud)**
The industry is trending toward **multikernel architectures**, where a whole cluster behaves as a single OS.
OS roles:
- Bridge hardware and Mooncake
- Driver adaptation and capability abstraction
- Pre-packaged intelligent computing images for ease of deployment
---
### Q5: KVCache Pooling & Efficient Transmission — Core breakthrough points
**Challenges:**
- **Standardization** — unified API for multiple inference frameworks
- **Scalability** — multi-tenancy, cloud-native fit, support for new protocols (CXL, RDMA)
**Transfer Engine**:
- Supports **eRDMA** & **GPUDirect**
- Low latency + clean architecture
**Hardware speed alignment**:
- Micro/nano-second operation optimization
- Manage heterogeneous devices & data locality
- Marginal gains compound: raising the KVCache hit rate from 90% to 95% halves the misses, a 50% prefill-compute reduction (see the check below)
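The arithmetic behind that last point: prefill compute is spent only on cache misses, so moving the hit rate from 90% to 95% halves the misses and therefore halves the recomputation.

```python
# Prefill recomputation scales with the cache-miss rate, so a 5-point
# hit-rate gain at the top end halves the remaining work.
def relative_prefill_compute(hit_rate: float) -> float:
    return 1.0 - hit_rate      # fraction of prefixes that must be recomputed

before, after = relative_prefill_compute(0.90), relative_prefill_compute(0.95)
print(f"prefill compute reduction: {1 - after / before:.0%}")   # 50%
```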
---
### Q6: Adapting to Enterprise Needs
**Early stage** — focus on **fast launch** over enterprise-grade demands
**Later refinements**:
- Reliability & stability enhancements
- Compatibility tuning for eRDMA, CXL
- Larger KVCache pool for Ant Group's multi-turn conversations → reduced TTFT (time to first token)
- Cloud-native integration with Alibaba ACK → better resource scheduling
**Modularization**:
- Split into sub-projects (Transfer Engine, Mooncake Store, etc.)
- Flexible adoption and maintenance
**Community role**:
- Hardware vendors adapt own solutions
- Avoid ecosystem fragmentation
- Future plan: donation to foundation for neutral growth
---
## 3. Industry Practice & Effectiveness Verification
### Q7: Framework adaptation challenges
Framework differences:
- **SGLang** → point-to-point transport
- **vLLM** → suited to Put/Get semantics
**Strategy**:
- Component reuse
- Middleware abstraction (Mooncake Store); the two hand-off styles are sketched below
**Results**:
- SGLang PD separation → throughput ↑ 30%+, TTFT ↓ 20%
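The two hand-off styles differ mainly in who needs to know whom. A hedged sketch of the contrast, with hypothetical interfaces rather than the real SGLang or vLLM connector APIs:

```python
# Two KVCache hand-off styles, sketched with hypothetical interfaces
# (not the real SGLang/vLLM connector APIs).

class Transport:
    """Stand-in for a point-to-point transfer layer."""
    def send(self, peer: str, payload: list) -> None:
        print(f"sent {len(payload)} KV blocks directly to {peer}")

class Store:
    """Stand-in for Mooncake Store's put/get middleware."""
    def __init__(self):
        self._d: dict[str, list] = {}
    def put(self, key: str, payload: list) -> None:
        self._d[key] = payload
    def get(self, key: str):
        return self._d.get(key)

# SGLang-style: the prefill node pushes straight to a known decode peer.
def p2p_handoff(transport: Transport, kv_blocks: list, decode_peer: str) -> None:
    transport.send(decode_peer, kv_blocks)

# vLLM-style: prefill publishes, decode fetches; neither side needs the
# other's address, which is what the middleware abstracts away.
def put_get_handoff(store: Store, key: str, kv_blocks: list) -> list:
    store.put(key, kv_blocks)      # producer side
    return store.get(key)          # consumer side, possibly much later

p2p_handoff(Transport(), ["kv0", "kv1"], decode_peer="decode-node-3")
print(put_get_handoff(Store(), "prefix-abc", ["kv0", "kv1"]))
```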
---
### Q8: Enterprise-driven iterations
Ant Group's multi-round dialogs → KVCache reuse
Collaborated to integrate with SGLang BlackCache → improved TTFT
**Multi-tenant optimization (Alibaba Cloud)**:
- Resource isolation in Mooncake Store
- VRAM pooling to unite idle GPU memory
- Offload inactive KVCache to disk/CFS → significant cost savings at a performance impact of roughly 20% (the offload policy is sketched below)
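A minimal sketch of the offload idea, assuming a simple LRU policy between a bounded hot tier and an unbounded cold tier. Names and policy are hypothetical, not Mooncake Store's actual implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """LRU offload from a bounded 'VRAM' tier to a large 'disk' tier
    (hypothetical sketch of the idea, not Mooncake Store's code)."""

    def __init__(self, vram_capacity: int):
        self.vram = OrderedDict()   # hot tier, bounded
        self.disk = {}              # cold tier, effectively unbounded
        self.cap = vram_capacity

    def put(self, key, kv_blocks):
        self.vram[key] = kv_blocks
        self.vram.move_to_end(key)
        while len(self.vram) > self.cap:          # evict least-recently-used
            cold_key, cold_val = self.vram.popitem(last=False)
            self.disk[cold_key] = cold_val        # offload instead of dropping

    def get(self, key):
        if key in self.vram:
            self.vram.move_to_end(key)            # refresh recency
            return self.vram[key]
        if key in self.disk:                      # slower path: promote back
            self.put(key, self.disk.pop(key))
            return self.vram[key]
        return None

cache = TieredKVCache(vram_capacity=2)
for k in ("a", "b", "c"):
    cache.put(k, f"kv-{k}")
print(list(cache.vram), list(cache.disk))   # ['b', 'c'] ['a']
print(cache.get("a"), list(cache.disk))     # 'kv-a' ['b']  (promoted; 'b' offloaded)
```

The roughly 20% performance impact quoted above is the price of the slower promotion path; the win is that cold entries stop occupying expensive VRAM.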
**Enterprise-grade robustness**:
- eRDMA optimization
- Layered KVCache storage for long-text workloads
- Auto config tools → combine models + business SLOs to recommend resources
---
### Q9: Achieving $0.20 per 1M tokens
**Feasibility conditions**:
- High concurrency utilization
- Workloads with modest output-speed requirements (e.g., chat scenarios)
**Cost drivers**:
- Resource utilization maximization (KVCache pooling, PD separation)
- Storage offload & VRAM pooling
- Selectively using smaller models in multi-agent scenarios (see the cost sketch below)
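Why high-concurrency utilization is the precondition, as a rough cost model. Every figure below is an illustrative assumption, not a measured Mooncake number.

```python
# Rough $/1M-token estimate. All figures are illustrative assumptions.
gpu_cost_per_hour = 2.0      # hypothetical cloud GPU price, $/hour
tokens_per_second = 3_500    # aggregate decode throughput at high batch size
utilization = 0.8            # fraction of each hour doing useful work

tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.2f} per 1M tokens")   # ~$0.20 at these numbers
```

Halve the utilization and the cost doubles, which is why the target is only feasible for workloads that keep batches full.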
---
## 4. Mooncake’s Evolution & Industry Insights
### Q10: Future inference tech trends
- Move toward **AI-Aware OS**
- Deep optimization in networking, storage, GPU scheduling
- **Mooncake Store V2**:
  1. Multi-tenant KVCache sharing (a minimal sketch follows below)
2. Low-cost layered storage for expanded KVCache capacity
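A minimal sketch of what per-tenant sharing could look like, assuming namespaced keys and a simple byte quota per tenant. The interface is hypothetical; V2's actual design may differ.

```python
# Hypothetical sketch of multi-tenant isolation in a shared KVCache store:
# namespaced keys prevent cross-tenant collisions, quotas bound each tenant.
class MultiTenantStore:
    def __init__(self, quotas: dict[str, int]):
        self._quotas = dict(quotas)            # tenant -> remaining bytes
        self._data: dict[tuple, bytes] = {}

    def put(self, tenant: str, key: str, value: bytes) -> bool:
        if self._quotas.get(tenant, 0) < len(value):
            return False                       # over quota: caller must evict first
        self._quotas[tenant] -= len(value)
        self._data[(tenant, key)] = value      # key is namespaced by tenant
        return True

    def get(self, tenant: str, key: str):
        return self._data.get((tenant, key))

store = MultiTenantStore({"tenant-a": 1024})
print(store.put("tenant-a", "prefix-1", b"kv-bytes"))   # True
print(store.put("tenant-b", "prefix-1", b"kv-bytes"))   # False: no quota
```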
---
### Q11: Ecosystem expansion & developer advice
- Prioritize openness/fairness to avoid monopoly
- Deepen hardware vendor collaboration
- Donate to foundation for neutrality
- Promote interface standardization + automated tuning tools
- Developers: map existing expertise (storage, networking) to AI infra needs
---
### Q12: Technical paradigm shifts & OS implications
**Resource efficiency**, **shared infra**, **layered storage**, and **adaptive configuration** matter most.
- OS: more fine-grained hardware abstraction
- Decoupled architecture emerging as consensus
---
## Conclusion: Co-building Inference Infrastructure
Mooncake transcends point optimizations — it represents **open-source-driven collaboration** blending traditional infra tech with AI-specific needs.
**Key outcomes**:
- Balance cost, throughput, and long-context support
- Foster academia–industry synergy
- Promote system-wide optimization over isolated fixes
**Repository**: [https://github.com/kvcache-ai/mooncake](https://github.com/kvcache-ai/mooncake)
---
## About the Series
*AI Evolution Theory: Paths to Breakthrough in the Age of Intelligent Computing OS* explores how AI is reshaping industries and driving domestic technology substitution, focusing on cloud, AI, and security in server operating systems, with Alibaba Cloud's OS as the reference.
Full video replay: **[Read Original](https://www.infoq.cn/video/vUNUp9tkjBHxGqaO33WK)**