Huawei launches near-trillion-parameter MoE inference and open-sources two key optimization technologies

Machine Heart Report: Ultra-Large MoE Inference Breakthroughs

---

2025 Landscape: Inference Efficiency Takes Center Stage

As 2025 concludes, large AI models have evolved from niche tools into foundational infrastructure powering enterprise systems. In this shift, inference efficiency has become the critical factor for scalable deployment.

For ultra-large-scale MoE (Mixture-of-Experts) models, complex inference pipelines demand extreme optimization in:

  • Computation
  • Inter-node communication
  • Memory access

Industry-wide, the challenge is ensuring highly efficient, controllable inference pathways.

---

Huawei’s Near-Trillion Parameter MoE Stack

Huawei has introduced a comprehensive technical solution:

  • Model: `openPangu-Ultra-MoE-718B-V1.1`, a near-trillion-parameter MoE architecture
  • Key technologies: Omni Proxy scheduling and AMLA acceleration (boosting Ascend hardware utilization to 86%)

👉 Open-source project: GitCode - Ascend Inference Cluster

---

Why Inference Efficiency Matters Now

Previously, large-model innovation focused on training scale and capability breakthroughs.

Today, the central question is:

> Who can run the model stably, at lower cost, with minimal latency?

Especially for ultra-large MoE models, real deployment means tackling:

  • Optimal parallelism strategies
  • Tight cost control — down to each operator call and communication node
  • Avoiding latency amplification in coupled pipelines

---

OpenPangu-Ultra Deploys Efficiently on Ascend Hardware

With deep optimization:

  • Peak operator performance
  • Multi-stream concurrency that overlaps communication with computation (see the sketch below)
  • Inter-node communication latency hidden behind compute
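
The overlap idea can be illustrated with a generic sketch (assumptions: PyTorch's asynchronous collectives stand in for the Ascend multi-stream mechanism, and the tensors are placeholders, not Huawei's implementation):

```python
import torch
import torch.distributed as dist

def overlapped_step(moe_output: torch.Tensor, local_work: torch.Tensor) -> torch.Tensor:
    """Overlap an inter-node all-reduce with independent local computation.

    Assumes torch.distributed is already initialized (e.g. via
    dist.init_process_group); the matmul on `local_work` is a placeholder
    for any computation that does not depend on the collective's result.
    """
    # Launch the collective asynchronously; it proceeds in the background.
    handle = dist.all_reduce(moe_output, op=dist.ReduceOp.SUM, async_op=True)

    # Useful work that does not depend on the reduced tensor, hiding
    # (part of) the communication latency behind computation.
    local_result = local_work @ local_work.T

    # Block only when the reduced result is actually needed.
    handle.wait()
    return moe_output + local_result.sum()
```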

China’s surging token demand drives the need for maximal inference efficiency — critical for monetizing large models.

---

The Industry’s Core Challenge

> Run MoE models at hundreds of billions to near-trillion parameters with:
> - High speed
> - Stability
> - Production-grade deployment readiness

Huawei meets this challenge with Ascend-optimized inference acceleration and intelligent scheduling.

---

Model Overview

  • openPangu-Ultra-MoE-718B-V1.1
  • 718B total parameters
  • 39B activated parameters
  • Enhanced tool invocation & integration
  • Quantized version: `openPangu-Ultra-MoE-718B-V1.1-Int8`
  • Proven feasible inference path on Ascend hardware

🔗 Model GitCode Repository

---


Ascend-Affinity Acceleration Technology

Objective: A faster, more stable route for near-trillion parameter MoE inference.

This approach integrates:

  • Hardware synergy
  • Intelligent scheduling
  • Algorithmic optimization

Result: Production-ready AI deployment.

---

Ecosystem Synergy — AiToEarn

Platforms like the AiToEarn official site extend Huawei’s deployment advantages into content monetization:

  • AI generation across major social networks (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
  • Real-time analytics, publishing, and AI model rankings

Pairing deployment technology with monetization infrastructure in this way is what unlocks the models’ full commercial potential.

---

Omni Proxy Innovations for MoE Scheduling

Omni-Infer Integration

  • Optimized for vLLM and SGLang
  • Plug-and-play with Ascend hardware
  • No need to rewrite application logic
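
As a rough illustration of the plug-and-play claim: if the accelerated engine is exposed through vLLM's standard OpenAI-compatible endpoint, existing client code keeps working unchanged. The URL, API key, and exact model identifier below are illustrative assumptions; actual deployment details come from the GitCode project linked above.

```python
from openai import OpenAI

# Point the stock OpenAI client at a locally served, OpenAI-compatible
# vLLM endpoint (URL, key, and model name are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openPangu-Ultra-MoE-718B-V1.1",
    messages=[{"role": "user", "content": "Summarize MoE inference in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```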

---

Omni Proxy vs. Traditional Scheduling

Large model inference requests face:

  • Cyclic load patterns (long Prefill, short Decode cycles)
  • Scheduler blind spots — missing tokenization, batch size, KV cache metrics
  • Poor KV cache matching
  • Duplicate tokenization in multi-node P/D setups
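
To make the last two pain points concrete, here is a minimal, hypothetical routing heuristic (not Omni Proxy's actual algorithm): send each request to the worker whose cached token prefixes overlap most with the incoming prompt, so existing KV-cache entries are reused instead of recomputed.

```python
from typing import Dict, List

def longest_cached_prefix(prompt: List[int], cached_prefixes: List[List[int]]) -> int:
    """Length of the longest cached token prefix matching the prompt."""
    best = 0
    for prefix in cached_prefixes:
        match = 0
        for a, b in zip(prompt, prefix):
            if a != b:
                break
            match += 1
        best = max(best, match)
    return best

def pick_worker(prompt: List[int], worker_caches: Dict[str, List[List[int]]]) -> str:
    """Route to the worker that can reuse the most KV cache."""
    return max(worker_caches, key=lambda w: longest_cached_prefix(prompt, worker_caches[w]))

# Worker "p1" has already served a request sharing a 3-token prefix.
caches = {"p0": [[7, 8]], "p1": [[1, 2, 3, 9]], "p2": []}
print(pick_worker([1, 2, 3, 4, 5], caches))  # -> "p1"
```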

---

Omni Proxy — Six Key Innovations

  • Full-lifecycle scheduling: the request path is decomposed into 10 stages for fine-grained control
  • Dual request modes: sequential and parallel Prefill–Decode scheduling
  • APC-aware cache reuse: global KV-cache state tracking with precise matching
  • Tokenizer result reuse: 30% reduction in tokenization overhead for multi-machine operations
  • Weighted request sorting: prompt length and wait time are balanced for throughput (a minimal sketch follows this list)
  • Master-slave scheduling with shared memory: global consistency and scalability
---


Full-Stack Inference Acceleration

Core Components

  • API Server & Scale Out: Smooth cluster scaling
  • Omni Placement: Expert placement for uneven cross-layer MoE configurations (a simplified placement sketch follows this list)
  • MTP (Multi-Token Prediction): parallel generation of multiple tokens per decode step
  • Fusion Operator: Reduce redundancy via operator fusion
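
Expert placement can be pictured as a load-balancing problem: given per-expert activation frequencies (which differ across MoE layers), assign experts to devices so no device becomes a hotspot. The greedy heuristic below is a simplified illustration, not the Omni Placement algorithm.

```python
import heapq
from typing import Dict, List

def place_experts(expert_load: Dict[str, float], num_devices: int) -> List[List[str]]:
    """Greedy placement: heaviest experts first, each onto the currently
    least-loaded device (longest-processing-time heuristic)."""
    heap = [(0.0, d) for d in range(num_devices)]   # (accumulated load, device index)
    heapq.heapify(heap)
    placement: List[List[str]] = [[] for _ in range(num_devices)]

    for expert, load in sorted(expert_load.items(), key=lambda kv: kv[1], reverse=True):
        device_load, device = heapq.heappop(heap)
        placement[device].append(expert)
        heapq.heappush(heap, (device_load + load, device))
    return placement

# Skewed per-expert traffic (hypothetical numbers).
loads = {"e0": 9.0, "e1": 7.0, "e2": 3.0, "e3": 2.0, "e4": 2.0, "e5": 1.0}
print(place_experts(loads, 2))  # [['e0', 'e3', 'e5'], ['e1', 'e2', 'e4']], load 12.0 each
```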

---


AMLA — Ascend Hardware FLOPS Utilization at 86%

Why It Matters

MoE inference performance = full chip utilization + operator optimization + efficient communication.

This directly impacts cost per token and system stability.

---

AMLA’s Two Breakthroughs

  • Addition instead of multiplication: expensive floating-point multiplications are converted into lightweight integer additions, and GM–UB roundtrips are eliminated, slashing memory-access latency (a bit-level sketch follows this list)
  • Preload pipelining + hierarchical blocking: Cube and Vector cores operate concurrently, keeping data flowing continuously for maximum FLOPS utilization
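
The "addition instead of multiplication" idea can be shown at the bit level: for a normal floating-point value, multiplying by a power of two amounts to adding to the exponent field of its binary representation, so an integer addition replaces a floating-point multiply. The NumPy sketch below demonstrates only this generic principle; AMLA's actual scheme on Ascend, applied inside MLA attention, is more elaborate.

```python
import numpy as np

def mul_pow2_via_int_add(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply float32 values by 2**k using a single integer addition.

    Valid for normal (non-zero, finite) values whose exponent does not
    overflow: adding k << 23 to the raw bits bumps the exponent field by k.
    """
    bits = x.view(np.int32)
    return (bits + (k << 23)).view(np.float32)

x = np.array([1.5, -3.25, 0.75], dtype=np.float32)
print(mul_pow2_via_int_add(x, 3))  # [ 12.  -26.    6. ]
print(x * 2.0 ** 3)                # same values, computed with an FP multiply
```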

---


Performance Results

  • 614 TFLOPS on Ascend hardware
  • 86.8% compute utilization — near theoretical peak
  • Beats open-source FlashMLA (~66.7% utilization on NVIDIA H800 SXM5)
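
As a quick consistency check on the quoted figures (the implied peak is derived from them, not an official hardware spec):

```python
achieved_tflops = 614.0
utilization = 0.868
implied_peak = achieved_tflops / utilization
print(f"implied theoretical peak: {implied_peak:.0f} TFLOPS")  # about 707 TFLOPS
```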

---

Conclusion: Deployment Feasibility Achieved

Huawei’s integrated approach — OpenPangu-Ultra + Ascend Optimization Stack — combines:

  • Framework-level accelerators for vLLM/SGLang
  • Omni Proxy for smart scheduling
  • AMLA for low-level operator performance

This achieves a balanced trifecta of:

  • Cost control
  • Performance stability
  • Scalable deployment

---

Commercial Impact

As Scaling Law benefits plateau, optimized inference efficiency is enabling large models to reach industrial deployment scale.

Open ecosystems like AiToEarn mirror this principle — combining technical capability with operational monetization efficiency.

Creators, researchers, and enterprises now have a clear blueprint:

> Optimize both model capability and operational efficiency for sustainable success.
