Kimi Open-Sources New Linear Attention Architecture, Surpasses Full Attention Models for the First Time with 6× Faster Inference

The Transformer Era Is Being Rewritten

Moonshot AI has unveiled its open-source Kimi Linear architecture — introducing a novel attention mechanism that, for the first time, outperforms traditional full attention under identical training conditions.

In long-context tasks, Kimi Linear:

  • Reduces KV cache usage by 75%
  • Achieves up to 6× faster inference speed

Some netizens are already asking: With this architecture, when will Kimi K2.5 arrive?

Before speculating, let’s explore how Kimi Linear challenges the traditional Transformer.

---

Why Transformer Attention Is Powerful — and Costly

Transformers are highly capable, but computationally expensive.

  • Fully connected attention means each token interacts with every other token.
  • Complexity grows quadratically with sequence length — O(N²).
  • Generating each new token requires attending to every previously cached key and value (see the sketch below).

Impact:

Large context windows (128K and beyond) consume massive GPU memory for KV caches, often triggering out-of-memory warnings or outright crashes. The larger the model, the more severe the strain and cost.

---

Searching for Linear Attention

Teams worldwide have explored linear attention, aiming to:

  • Reduce complexity from O(N²) to O(N)
  • Improve speed and lower resource usage

Challenge:

Linear attention has historically forgotten context quickly: it is fast, but its memory retention is weak.

Solution:

Kimi Linear aims to deliver both: linear-time speed together with strong memory retention.

---

Core Innovation: Kimi Delta Attention (KDA)

Key Features

  • Fine-grained forgetting gates
    • Unlike the uniform forgetting of traditional linear attention, KDA controls memory retention independently for each channel dimension, retaining critical information while discarding redundancy.
  • Improved Delta Rule for stability
    • Prevents gradient explosion or vanishing, even with million-token contexts (sketched after this list).
  • 3:1 Hybrid Layer Design
    • Three layers of KDA linear attention for every one layer of full attention.
    • Preserves global semantic understanding while keeping most computation linear.
  • No RoPE (Rotary Position Encoding)
    • Position is handled instead by a time-decay kernel function, yielding better stability and generalization.

---

Efficient Computation — DPLR Structure

How It Works

  • Applies a Diagonal-Plus-Low-Rank (DPLR) structure during KDA state updates
  • Splits each state-transition matrix into:
    • a diagonal component
    • a low-rank correction
  • Enables parallel GPU computation over larger chunks, doubling throughput (see the sketch below)

Additional Engineering Optimizations

  • Block-parallel computation
  • Kernel fusion to cut GPU memory I/O overhead
  • Fully compatible with the vLLM inference framework — drop-in replacement for any Transformer-based system

---

Experimental Results

Training scale: 1.4T tokens — identical to baseline Transformers.

Performance Gains:

  • Surpasses full-attention Transformers on MMLU, BBH, RULER, GPQA-Diamond
  • Decoding speed: up to 6× faster
  • KV cache: 75% less usage

Accuracy preserved, with improved math reasoning and code generation stability.

---

Broader Impact

Architectures like Kimi Linear could redefine efficiency in long-context AI.

For creators and developers, platforms like the AiToEarn official site complement these advances:

  • Cross-platform AI content publishing
  • Integration with Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X
  • Built-in AI tools, analytics, and AI model rankings

---

The Shift Away from Transformer Dependency

Recent research suggests Transformers may not be the only path forward:

  • State Space Models (SSM) — as seen in Mamba — offer long-sequence efficiency.
  • Google MoR — recursive structures that replace part of the attention computation, reducing effective depth.
  • Apple — adopting Mamba for edge devices due to energy efficiency and low latency.

Kimi Linear approaches from a linear attention angle, reducing memory load while preserving accuracy.

Yet, the open-source leader MiniMax M2 still uses full attention — showing diversity in design preferences.

Technical report: huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct

---

Conclusion

The dominance of traditional Transformers is now being questioned.

From SSM-based models to hybrid attention architectures, the AI community is stepping into a period of diversified innovation.

For innovators sharing these breakthroughs globally, platforms like AiToEarn provide the infrastructure to publish, measure, and monetize across major global and Chinese platforms, helping bridge cutting-edge AI research with real-world adoption.

---
