Kimi Open-Sources New Linear Attention Architecture, Surpasses Full Attention Models for the First Time with 6× Faster Inference

The Transformer Era Is Being Rewritten

Moonshot AI has unveiled its open-source Kimi Linear architecture — introducing a novel attention mechanism that, for the first time, outperforms traditional full attention under identical training conditions.

In long-context tasks, Kimi Linear:

  • Reduces KV cache usage by 75%
  • Achieves up to 6× faster inference speed

Some netizens are already asking: With this architecture, when will Kimi K2.5 arrive?

Before speculating, let’s explore how Kimi Linear challenges the traditional Transformer.

---

Why Transformer Attention Is Powerful — and Costly

Transformers are highly capable, but computationally expensive.

  • Fully connected attention means each token interacts with every other token.
  • Complexity grows quadratically with sequence length — O(N²).
  • Generating each new token requires attending to every previously cached key and value (see the sketch below).

Impact:

Large context windows (128K and beyond) consume massive GPU memory for KV caches, often triggering out-of-memory warnings or outright crashes. The larger the model, the more severe the strain and cost.

---

Searching for Linear Attention

Teams worldwide have explored linear attention, aiming to:

  • Reduce complexity from O(N²) to O(N)
  • Improve speed and lower resource usage

Challenge:

Linear attention has historically forgotten context quickly: it is fast, but its memory retention is weak.

Solution:

Kimi Linear aims to deliver both: linear-time speed together with strong memory retention.

---

Core Innovation: Kimi Delta Attention (KDA)

Key Features

  • Fine-grained forgetting gates
    • Unlike the uniform forgetting of traditional linear attention, KDA controls memory retention independently for each channel dimension, retaining critical information while discarding redundancy.
  • Improved Delta Rule for stability
    • Prevents gradient explosion or vanishing, even with million-token contexts (sketched after this list).
  • 3:1 Hybrid Layer Design
    • Three layers of KDA linear attention for every one layer of full attention.
    • Preserves global semantic understanding while keeping most computation linear.
  • No RoPE (Rotary Position Encoding)
    • Position is handled instead by a time-decay kernel function, yielding better stability and generalization.

---

Efficient Computation — DPLR Structure

How It Works

  • Applies a Diagonal-Plus-Low-Rank (DPLR) structure during KDA state updates
  • Splits each state-transition matrix into:
    • a diagonal component
    • a low-rank correction
  • Enables parallel GPU computation over larger chunks, doubling throughput (see the sketch below)

Additional Engineering Optimizations

  • Block-parallel computation
  • Kernel fusion to cut GPU memory I/O overhead
  • Fully compatible with the vLLM inference framework — drop-in replacement for any Transformer-based system

---

Experimental Results

Training scale: 1.4T tokens — identical to baseline Transformers.

Performance Gains:

  • Surpasses full-attention Transformers on MMLU, BBH, RULER, GPQA-Diamond
  • Decoding speed: up to 6× faster
  • KV cache: 75% less usage

Accuracy preserved, with improved math reasoning and code generation stability.

---

Broader Impact

Architectures like Kimi Linear could redefine efficiency in long-context AI.

For creators and developers, platforms like the AiToEarn official site complement these advances:

  • Cross-platform AI content publishing
  • Integration with Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X
  • Built-in AI tools, analytics, and AI model rankings

---

The Shift Away from Transformer Dependency

Recent research suggests Transformers may not be the only path forward:

  • State Space Models (SSM) — as seen in Mamba — offer long-sequence efficiency.
  • Google MoR — recursive structures that replace part of the attention computation, reducing effective depth.
  • Apple — adopting Mamba for edge devices due to energy efficiency and low latency.

Kimi Linear approaches from a linear attention angle, reducing memory load while preserving accuracy.

Yet, the open-source leader MiniMax M2 still uses full attention — showing diversity in design preferences.

Technical report: huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct

---

Conclusion

The dominance of traditional Transformers is now being questioned.

From SSM-based models to hybrid attention architectures, the AI community is stepping into a period of diversified innovation.

For innovators sharing these breakthroughs globally, platforms like AiToEarn provide the infrastructure to publish, measure, and monetize across major global and Chinese platforms, helping bridge cutting-edge AI research with real-world adoption.

---
