The Return of Linear Attention: Kimi's New Model Makes a Splash, While MiniMax Quietly Switches Back to a Traditional Architecture
2025‑11‑01 23:49 — Jiangsu


---
Overview
In the LLM (Large Language Model) domain, linear attention mechanisms are making a notable comeback.
This resurgence is largely led by Chinese domestic models, driven not only by constrained compute resources but also by a longer-term goal: making AI agents more reliable for real-world tasks.
By contrast, most top international models remain closed-source and appear to rely heavily on massive compute and brute-force scaling.
Below is a structured breakdown of the ongoing technical route debate.
---
Early Stage: The Efficiency–Accuracy Dilemma
Linear attention is not new. Since the early 2020s, multiple papers have explored it.
Core goal:
- Reduce attention complexity from O(n²) to O(n).
- Increase efficiency for long‑sequence processing.
Problem:
- Early implementations traded away accuracy and were never adopted in open-source SOTA models (the sketch below illustrates the efficiency–accuracy trade-off).
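A minimal NumPy sketch of that trade-off, purely for illustration: the feature map `phi`, shapes, and names are assumptions and are not taken from any of the models discussed. Reordering the matrix products removes the n × n score matrix, but the kernel feature map only approximates softmax weighting, which is where early accuracy losses came from.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: materializes an (n, n) score matrix -> O(n^2) in sequence length."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: phi(Q) @ (phi(K).T @ V) never forms the (n, n) matrix -> O(n),
    at the cost of only approximating the softmax weighting."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                                  # (d, d_v): size independent of n
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T       # (n, 1): per-query normalizer
    return (Qf @ KV) / Z

n, d = 2048, 64
Q, K, V = np.random.default_rng(0).standard_normal((3, n, d))
print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)  # (2048, 64) (2048, 64)
```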
---
The New Wave: Domestic Models Lead
In the second half of this year, variants of linear attention saw a revival:
- June: MiniMax‑M1
  - Mixture-of-Experts (MoE) with 456B total parameters, 46B active parameters
  - “Lightning Attention” mechanism
- August: Qwen3‑Next
  - Linear attention variant
- September: DeepSeek V3.2
  - Sparse attention (sub‑quadratic complexity)
Common trait:
These models replaced traditional quadratic attention in most or all layers with linear or sub‑quadratic variants.
---
Plot Twist: MiniMax “Defects”
Just as momentum was building, MiniMax announced its new 230B-parameter M2 model, reverting from linear attention to conventional full attention.
Stated reason:
- Linear attention handles standard prompts well,
- but it struggles with complex reasoning and multi-turn dialogue, both critical for chatbots and agentic AI.
This sparked doubts about linear attention’s future viability.
---
Kimi Enters: Hybrid Attention Strategy
Last week, the Kimi team introduced Kimi Linear, bringing the focus back to linear attention.
Official performance claims vs. full attention (a rough sanity check of the first figure follows this list):
- 75% smaller KV cache
- Up to 6× decoding throughput
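A back-of-the-envelope check of the KV-cache figure, assuming (as the hybrid layout described in the next section implies) that only one layer in four keeps a KV cache growing with sequence length while the linear layers hold a fixed-size state; the layer count and per-token byte count here are made up for illustration:

```python
seq_len = 1_000_000                 # context length in tokens (illustrative)
num_layers = 48                     # hypothetical layer count
kv_bytes_per_token_per_layer = 512  # purely illustrative

full_cache = num_layers * seq_len * kv_bytes_per_token_per_layer
hybrid_cache = (num_layers // 4) * seq_len * kv_bytes_per_token_per_layer  # only 1 in 4 layers caches K/V
print(f"{100 * (1 - hybrid_cache / full_cache):.0f}% smaller")  # -> 75% smaller
```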
Architecture (a minimal layout sketch follows this list):
- Overall design very similar to Qwen3‑Next
- Hybrid attention strategy: 3 linear attention blocks for every 1 full attention block
- Linear portion: Kimi Delta Attention (KDA), an improved variant of Gated DeltaNet
- Full attention portion: Multi-Head Latent Attention (MLA) in place of standard full attention
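A minimal sketch of how such a 3:1 interleaving could be laid out; the block names and the placement rule are assumptions for illustration, not Kimi's actual code:

```python
from dataclasses import dataclass

@dataclass
class Block:
    index: int
    kind: str   # "linear" (KDA-style) or "full" (MLA-style)

def build_hybrid_stack(num_layers: int, ratio: int = 3) -> list[Block]:
    """Every (ratio + 1)-th block uses full attention; the rest use linear attention.
    Only the full-attention blocks need a KV cache that grows with sequence length."""
    return [
        Block(i, "full" if (i + 1) % (ratio + 1) == 0 else "linear")
        for i in range(num_layers)
    ]

print([b.kind for b in build_hybrid_stack(8)])
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```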
---
Why Hybrid May Win
Hybrid designs like Kimi’s show promise for balancing efficiency and accuracy, especially for large‑scale AI agents.
For developers and researchers, platforms like AiToEarn (AiToEarn官网) enable AI-generated outputs and technical analyses to be auto-published across Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X, with built-in analytics and AI model rankings (AI模型排名).
---
Comparative Notes
Although Kimi Linear has not been compared directly with Qwen3‑Next, it has been benchmarked against Gated DeltaNet‑H1 (a hybrid of Gated DeltaNet and sliding-window attention).
Finding:
- At the same token generation speed, Kimi Linear achieves higher accuracy.
Current limitation:
- The MLA portion of Kimi Linear currently lacks an output gate (a sigmoid bypass on the attention output, sketched below); this is planned for a future release.
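For context, an output gate usually means modulating a block's attention output element-wise with a sigmoid gate computed from the block's input. A minimal sketch under assumed names and shapes, not Kimi's implementation:

```python
import numpy as np

def gated_attention_output(x, attn_out, W_gate):
    """Element-wise gate in (0, 1), computed from the block input, scales the attention output."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))   # sigmoid(x @ W_gate), shape (n, d)
    return gate * attn_out

n, d = 4, 8
rng = np.random.default_rng(0)
x, attn_out = rng.standard_normal((2, n, d))
W_gate = rng.standard_normal((d, d))
print(gated_attention_output(x, attn_out, W_gate).shape)  # (4, 8)
```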
---
Closing Thought
For AI researchers working on model optimization, tools that integrate experimentation and multi‑platform publishing are becoming essential.
AiToEarn is one such global, open‑source AI monetization platform, helping creators generate, publish, and monetize AI content simultaneously across major channels — supported by analytics and model ranking systems.
These bridges between cutting-edge AI research and public communication may accelerate the adoption of innovative architectures like hybrid attention.