Huawei launches near-trillion-parameter MoE inference and open-sources two key optimization technologies

Machine Heart Report: Ultra-Large MoE Inference Breakthroughs

---

2025 Landscape: Inference Efficiency Takes Center Stage

As 2025 concludes, large AI models have evolved from niche tools into foundational infrastructure powering enterprise systems. In this shift, inference efficiency has become the critical factor for scalable deployment.

For ultra-large-scale MoE (Mixture-of-Experts) models, complex inference pipelines demand extreme optimization in:

  • Computation
  • Inter-node communication
  • Memory access

Industry-wide, the challenge is ensuring highly efficient, controllable inference pathways.

---

Huawei’s Near-Trillion Parameter MoE Stack

Huawei has introduced a comprehensive technical solution:

  • Model: `openPangu-Ultra-MoE-718B-V1.1`, a near-trillion-parameter MoE architecture
  • Key technologies: Omni Proxy scheduling and AMLA acceleration (boosting Ascend hardware utilization to 86%)

👉 Open-source project: GitCode - Ascend Inference Cluster

---

Why Inference Efficiency Matters Now

Previously, large-model innovation focused on training scale and capability breakthroughs.

Today, the central question is:

> Who can run the model stably, at lower cost, with minimal latency?

Especially for ultra-large MoE models, real deployment means tackling:

  • Optimal parallelism strategies
  • Tight cost control — down to each operator call and communication node
  • Avoiding latency amplification in coupled pipelines

---

OpenPangu-Ultra Deploys Efficiently on Ascend Hardware

With deep optimization:

  • Peak operator performance
  • Multi-stream concurrency that overlaps communication with computation (see the sketch below)
  • Inter-node communication latency hidden behind compute
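
The overlap idea can be illustrated with a generic sketch (assumptions: PyTorch's asynchronous collectives stand in for the Ascend multi-stream mechanism, and the tensors are placeholders, not Huawei's implementation):

```python
import torch
import torch.distributed as dist

def overlapped_step(moe_output: torch.Tensor, local_work: torch.Tensor) -> torch.Tensor:
    """Overlap an inter-node all-reduce with independent local computation.

    Assumes torch.distributed is already initialized (e.g. via
    dist.init_process_group); the matmul on `local_work` is a placeholder
    for any computation that does not depend on the collective's result.
    """
    # Launch the collective asynchronously; it proceeds in the background.
    handle = dist.all_reduce(moe_output, op=dist.ReduceOp.SUM, async_op=True)

    # Useful work that does not depend on the reduced tensor, hiding
    # (part of) the communication latency behind computation.
    local_result = local_work @ local_work.T

    # Block only when the reduced result is actually needed.
    handle.wait()
    return moe_output + local_result.sum()
```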

China’s surging token demand drives the need for maximal inference efficiency — critical for monetizing large models.

---

The Industry’s Core Challenge

> Run MoE models at hundreds of billions to near-trillion parameters with:
> - High speed
> - Stability
> - Production-grade deployment readiness

Huawei meets this challenge with Ascend-optimized inference acceleration and intelligent scheduling.

---

Model Overview

  • openPangu-Ultra-MoE-718B-V1.1
  • 718B total parameters
  • 39B activated parameters
  • Enhanced tool invocation & integration
  • Quantized version: `openPangu-Ultra-MoE-718B-V1.1-Int8`
  • Proven feasible inference path on Ascend hardware

🔗 Model GitCode Repository

---


Ascend-Affinity Acceleration Technology

Objective: A faster, more stable route for near-trillion parameter MoE inference.

This approach integrates:

  • Hardware synergy
  • Intelligent scheduling
  • Algorithmic optimization

Result: Production-ready AI deployment.

---

Ecosystem Synergy — AiToEarn

Platforms like the AiToEarn official site extend Huawei’s deployment advantages into content monetization:

  • AI generation across major social networks (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Real-time analytics, publishing, and AI model rankings

Pairing deployment technology with monetization infrastructure in this way is what unlocks the models’ full commercial potential.

---

Omni Proxy Innovations for MoE Scheduling

Omni-Infer Integration

  • Optimized for vLLM and SGLang
  • Plug-and-play with Ascend hardware
  • No need to rewrite application logic
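
As a rough illustration of the plug-and-play claim: if the accelerated engine is exposed through vLLM's standard OpenAI-compatible endpoint, existing client code keeps working unchanged. The URL, API key, and exact model identifier below are illustrative assumptions; actual deployment details come from the GitCode project linked above.

```python
from openai import OpenAI

# Point the stock OpenAI client at a locally served, OpenAI-compatible
# vLLM endpoint (URL, key, and model name are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openPangu-Ultra-MoE-718B-V1.1",
    messages=[{"role": "user", "content": "Summarize MoE inference in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```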

---

Omni Proxy vs. Traditional Scheduling

Large model inference requests face:

  • Cyclic load patterns (long Prefill, short Decode cycles)
  • Scheduler blind spots — missing tokenization, batch size, KV cache metrics
  • Poor KV cache matching
  • Duplicate tokenization in multi-node P/D setups
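
To make the last two pain points concrete, here is a minimal, hypothetical routing heuristic (not Omni Proxy's actual algorithm): send each request to the worker whose cached token prefixes overlap most with the incoming prompt, so existing KV-cache entries are reused instead of recomputed.

```python
from typing import Dict, List

def longest_cached_prefix(prompt: List[int], cached_prefixes: List[List[int]]) -> int:
    """Length of the longest cached token prefix matching the prompt."""
    best = 0
    for prefix in cached_prefixes:
        match = 0
        for a, b in zip(prompt, prefix):
            if a != b:
                break
            match += 1
        best = max(best, match)
    return best

def pick_worker(prompt: List[int], worker_caches: Dict[str, List[List[int]]]) -> str:
    """Route to the worker that can reuse the most KV cache."""
    return max(worker_caches, key=lambda w: longest_cached_prefix(prompt, worker_caches[w]))

# Worker "p1" has already served a request sharing a 3-token prefix.
caches = {"p0": [[7, 8]], "p1": [[1, 2, 3, 9]], "p2": []}
print(pick_worker([1, 2, 3, 4, 5], caches))  # -> "p1"
```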

---

Omni Proxy — Six Key Innovations

  • Full-lifecycle scheduling: the request path is decomposed into 10 stages for fine-grained control
  • Dual request modes: sequential and parallel Prefill–Decode scheduling
  • APC-aware cache reuse: global KV-cache state tracking with precise matching
  • Tokenizer result reuse: 30% reduction in tokenization overhead for multi-machine operations
  • Weighted request sorting: prompt length and wait time are balanced for throughput (a minimal sketch follows this list)
  • Master-slave scheduling with shared memory: global consistency and scalability
---


Full-Stack Inference Acceleration

Core Components

  • API Server & Scale Out: Smooth cluster scaling
  • Omni Placement: Expert placement for uneven cross-layer MoE configurations (a simplified placement sketch follows this list)
  • MTP (Multi-Token Prediction): parallel generation of multiple tokens per decode step
  • Fusion Operator: Reduce redundancy via operator fusion
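
Expert placement can be pictured as a load-balancing problem: given per-expert activation frequencies (which differ across MoE layers), assign experts to devices so no device becomes a hotspot. The greedy heuristic below is a simplified illustration, not the Omni Placement algorithm.

```python
import heapq
from typing import Dict, List

def place_experts(expert_load: Dict[str, float], num_devices: int) -> List[List[str]]:
    """Greedy placement: heaviest experts first, each onto the currently
    least-loaded device (longest-processing-time heuristic)."""
    heap = [(0.0, d) for d in range(num_devices)]   # (accumulated load, device index)
    heapq.heapify(heap)
    placement: List[List[str]] = [[] for _ in range(num_devices)]

    for expert, load in sorted(expert_load.items(), key=lambda kv: kv[1], reverse=True):
        device_load, device = heapq.heappop(heap)
        placement[device].append(expert)
        heapq.heappush(heap, (device_load + load, device))
    return placement

# Skewed per-expert traffic (hypothetical numbers).
loads = {"e0": 9.0, "e1": 7.0, "e2": 3.0, "e3": 2.0, "e4": 2.0, "e5": 1.0}
print(place_experts(loads, 2))  # [['e0', 'e3', 'e5'], ['e1', 'e2', 'e4']], load 12.0 each
```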

---


AMLA — Ascend Hardware FLOPS Utilization at 86%

Why It Matters

MoE inference performance = full chip utilization + operator optimization + efficient communication.

This directly impacts cost per token and system stability.

---

AMLA’s Two Breakthroughs

  • Addition instead of multiplication: expensive floating-point multiplications are converted into lightweight integer additions, and GM–UB roundtrips are eliminated, slashing memory-access latency (a bit-level sketch follows this list)
  • Preload pipelining + hierarchical blocking: Cube and Vector cores operate concurrently, keeping data flowing continuously for maximum FLOPS utilization
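
The "addition instead of multiplication" idea can be shown at the bit level: for a normal floating-point value, multiplying by a power of two amounts to adding to the exponent field of its binary representation, so an integer addition replaces a floating-point multiply. The NumPy sketch below demonstrates only this generic principle; AMLA's actual scheme on Ascend, applied inside MLA attention, is more elaborate.

```python
import numpy as np

def mul_pow2_via_int_add(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply float32 values by 2**k using a single integer addition.

    Valid for normal (non-zero, finite) values whose exponent does not
    overflow: adding k << 23 to the raw bits bumps the exponent field by k.
    """
    bits = x.view(np.int32)
    return (bits + (k << 23)).view(np.float32)

x = np.array([1.5, -3.25, 0.75], dtype=np.float32)
print(mul_pow2_via_int_add(x, 3))  # [ 12.  -26.    6. ]
print(x * 2.0 ** 3)                # same values, computed with an FP multiply
```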

---


Performance Results

  • 614 TFLOPS on Ascend hardware
  • 86.8% compute utilization — near theoretical peak
  • Beats open-source FlashMLA (~66.7% utilization on NVIDIA H800 SXM5)
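
As a quick consistency check on the quoted figures (the implied peak is derived from them, not an official hardware spec):

```python
achieved_tflops = 614.0
utilization = 0.868
implied_peak = achieved_tflops / utilization
print(f"implied theoretical peak: {implied_peak:.0f} TFLOPS")  # about 707 TFLOPS
```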

---

Conclusion: Deployment Feasibility Achieved

Huawei’s integrated approach — OpenPangu-Ultra + Ascend Optimization Stack — combines:

  • Framework-level accelerators for vLLM/SGLang
  • Omni Proxy for smart scheduling
  • AMLA for low-level operator performance

This achieves a balanced trifecta of:

  • Cost control
  • Performance stability
  • Scalable deployment

---

Commercial Impact

As Scaling Law benefits plateau, optimized inference efficiency is enabling large models to reach industrial deployment scale.

Open ecosystems like AiToEarn mirror this principle — combining technical capability with operational monetization efficiency.

Creators, researchers, and enterprises now have a clear blueprint:

> Optimize both model capability and operational efficiency for sustainable success.
