MiniMax and Kimi Face Off for “Attention”

Moonshot vs. MiniMax: A 72-Hour Duel in Attention Mechanisms

On October 29, Moonshot AI researcher Zhou Xinyu reposted MiniMax's M2 tech blog announcement on X with a cheeky comment:

> “Minimax don't worry, Kimi got your back 😘”

He later posted the same remark on Zhihu.

Whether playful or provocative, the message hinted at something brewing.

---

The M2 Release and MiniMax’s Full-Attention Bet

Two days after M2’s release, MiniMax pre-training lead Haohai Sun published a candid technical blog on Zhihu and X explaining why the team abandoned efficient attention.

Key point:

> “In industrial systems, efficient attention methods still fall behind Full Attention.”

The post tackled the inevitable question, "Why not linear or sparse attention?", and walked through the challenges of deploying those methods in industrial systems.

---

What Did “Got Your Back” Mean?

The answer came 24 hours later:

On October 30, Moonshot AI launched Kimi Linear, a 48B-parameter hybrid-attention model:

  • 75% KV cache reduction
  • 6× throughput boost for long-context tasks

The abstract claimed:

> “…for the first time, outperforms full attention under fair comparisons across various scenarios.”

---

1. MiniMax M2: Returning to Full Attention

Previous setup: M1 paired Lightning Attention (a linear variant) with periodic softmax layers and MoE, supporting million-token contexts.

M2 change: a return to Full Attention for robustness in agent and code-generation workloads.

Performance & Cost Wins:

  • Price: ~8% of Claude Sonnet 4.5's price ($0.30 per million input tokens)
  • Speed: ~2× faster inference (≈100 tokens per second)
  • Method: efficient activation-parameter design to balance quality against cost

---

Industry Praise:

The blog Why M2 is Full Attention drew commendations for its openness:

  • “A rare, candid share of engineering perspective”
  • “The discussion of sparse attention’s tail risks is brilliant”

---

Haohai Sun's Three Key Challenges

1. Pipeline Complexity Explosion

  • Large models must serve more than ten scenarios (code/math, agents, multimodal, RL, etc.).
  • Every new efficient-attention mechanism must pass validation across all of them, so complexity grows exponentially.

2. Limitations in Evaluation Systems

  • Gains at small scale often fail to hold in large-scale training.
  • Multi-hop reasoning weaknesses only surface after significant resources have been committed.
  • Benchmarks cited: KorBench, BBEH, and the Dyck-language task from BBH.

3. Incomplete Infrastructure

  • Linear-attention training is memory-bound.
  • Inference requires low-precision storage, prefix caching, and speculative decoding, all still immature for these architectures.
  • For new Transformer variants, verification time exceeds design time.

---

Cost vs Latency:

Full Attention's bottleneck is cost, not speed. MiniMax is betting that GPU advances plus engineering optimizations will keep performance high while costs fall.
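
A back-of-envelope sketch makes the cost framing concrete. The model dimensions below are illustrative assumptions, not M2's actual configuration; the point is that a full-attention KV cache grows linearly with context while prefill compute grows quadratically, so long contexts get expensive even when per-token latency is fine.

```python
# Illustrative KV-cache sizing for full attention.
# All model dimensions are assumed for the example, not taken from M2.

def kv_cache_bytes(seq_len: int, n_layers: int = 60, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

for ctx in (8_192, 128_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> {kv_cache_bytes(ctx) / 2**30:7.1f} GiB per sequence")
```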

---

2. Moonshot’s Counter: Kimi Linear

Zhou Xinyu — co-author of the MoBA (Mixture of Block Attention) paper — hinted at Kimi’s move.

Just 72 hours after M2's release, Moonshot unveiled Kimi Linear:

  • 48B parameters, 3B activation parameters
  • Trained on 5.7T tokens
  • 1M-token context length
  • Fully open-sourced (weights, code, report)

---

Three Innovations in Kimi Linear

1. Kimi Delta Attention (KDA)

  • Builds on Gated DeltaNet
  • Upgrades the scalar forget gate to channel-wise gates, enabling more granular forgetting
  • Lets each feature channel keep its own "memory strength" (see the sketch below)
  • Performance: roughly 2× the computational efficiency of general DPLR (diagonal-plus-low-rank) formulations
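
To make the gating upgrade concrete, here is a minimal sketch of one delta-rule memory update over a DeltaNet-style state S of shape (d_k, d_v); the only difference between the scalar and channel-wise variants is the shape of the forget gate alpha. This illustrates the idea, not Moonshot's implementation.

```python
import torch

def delta_step(S, k, v, beta, alpha):
    """One delta-rule update of the (d_k, d_v) associative memory S.

    alpha is the forget gate: a float models a Gated DeltaNet-style scalar
    gate; a (d_k,) tensor models KDA-style channel-wise gating, giving each
    key channel its own memory strength.
    """
    if torch.is_tensor(alpha) and alpha.ndim == 1:
        S = alpha[:, None] * S               # channel-wise forgetting
    else:
        S = alpha * S                        # uniform scalar forgetting
    error = v - S.T @ k                      # what the memory mispredicts for key k
    return S + beta * torch.outer(k, error)  # delta-rule correction

d_k, d_v = 8, 8
S = torch.zeros(d_k, d_v)
S = delta_step(S, torch.randn(d_k), torch.randn(d_v),
               beta=0.9, alpha=torch.full((d_k,), 0.95))
```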

---

2. 3:1 Hybrid Ratio: KDA + MLA

  • MLA compresses the attention state into a low-dimensional latent and expands it only when needed
  • Sweeping the mixing ratio found the optimum at 3 KDA layers per 1 MLA layer (see the layer-schedule sketch after the results)

Results (KDA:MLA ratio → PPL):

  • 0:1 (pure MLA) → 5.77
  • 1:1 → 5.66
  • 3:1 → 5.65 (best)
  • 7:1 → 5.70
  • 15:1 → 5.82
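
The interleaving itself is simple to express. Here is a sketch of the 3:1 layer schedule, assuming the MLA layer closes each group of four (the exact placement within a group is my assumption, not stated in the report excerpt above):

```python
def layer_schedule(n_layers: int, kda_per_mla: int = 3) -> list[str]:
    """Repeat [KDA, KDA, KDA, MLA]; only MLA layers grow a KV cache."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(n_layers)]

print(layer_schedule(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```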

---

3. No Position Encoding (NoPE)

  • MLA layers skip RoPE entirely; the KDA layers carry all positional information
  • Benefits:
      • Higher inference efficiency (a RoPE-free MLA layer can be served as MQA)
      • Simplified training
      • Better long-context generalization

---

Architecture Overview

Each block stacks a token-mixing layer (KDA or MLA) on top of an MoE channel-mixing layer; these blocks repeat to form the network. A minimal structural sketch follows.
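
The stand-in modules below are placeholders for the real KDA, MLA, and MoE implementations; only the block structure is the point. Consistent with the NoPE design, no rotary embedding appears anywhere in the sketch.

```python
import torch
import torch.nn as nn

class StubMixer(nn.Module):
    """Placeholder for a KDA or MLA token-mixing layer."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return self.proj(x)

class StubMoE(nn.Module):
    """Placeholder for the MoE channel-mixing (feed-forward) layer."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return self.ff(x)

class Block(nn.Module):
    """One stacked unit: token mixing, then MoE channel mixing, both residual."""
    def __init__(self, d):
        super().__init__()
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.mix, self.moe = StubMixer(d), StubMoE(d)
    def forward(self, x):
        x = x + self.mix(self.n1(x))   # token mixing (KDA or MLA in the real model)
        x = x + self.moe(self.n2(x))   # channel mixing
        return x

x = torch.randn(2, 16, 64)
print(Block(64)(x).shape)  # torch.Size([2, 16, 64])
```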

---

3. Kimi Linear: Numbers & Proofs

KV Cache Reduction:

  • 75% reduction → roughly 4× lower cache-memory cost

Throughput:

  • 1M-token context decoding: 6.3× faster than MLA
  • TPOT (time per output token) drops from 11.48 ms to 1.84 ms (see the arithmetic below)
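
Both headline numbers can be sanity-checked. If only the MLA layers keep a growing KV cache, the 3:1 ratio alone implies the 75% reduction (assuming the KDA layers' fixed-size recurrent state is negligible), and the reported TPOT figures reproduce the roughly 6× decode speedup:

```python
kda_per_mla = 3
cache_kept = 1 / (kda_per_mla + 1)            # only 1 layer in 4 grows a KV cache
print(f"KV cache kept: {cache_kept:.0%}")     # 25%, i.e. a 75% reduction

tpot_mla, tpot_kimi = 11.48, 1.84             # ms per output token at 1M context
print(f"decode speedup: {tpot_mla / tpot_kimi:.2f}x")  # ~6.24x
```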

RULER test (128k):

  • Score: 84.3
  • Speed: 3.98× faster than MLA
  • Pareto-optimal: the speed gain comes with no quality trade-off

---

Scaling Law Validation (1.4T tokens):

  • MLA: Loss = 2.3092 × C^(−0.0536)
  • Kimi Linear: Loss = 2.2879 × C^(−0.0527)
  • ≈1.16× computational-efficiency advantage for Kimi Linear
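
The two fitted laws can be compared directly. The helper below solves for how much extra compute the MLA baseline needs to match Kimi Linear's loss at a reference compute C; because the exponents differ slightly, the multiplier depends on C, and the report's ~1.16× presumably refers to its evaluated compute range (my assumption):

```python
def mla(C):   # fitted loss vs. training compute for the MLA baseline
    return 2.3092 * C ** -0.0536

def kimi(C):  # fitted loss vs. training compute for Kimi Linear
    return 2.2879 * C ** -0.0527

def compute_multiplier(C):
    """Solve mla(r * C) == kimi(C) for r."""
    return (kimi(C) / 2.3092) ** (-1 / 0.0536) / C

for C in (1.0, 2.0, 4.0):  # arbitrary compute units
    print(f"C={C}: loss gap {mla(C) - kimi(C):+.4f}, "
          f"MLA needs {compute_multiplier(C):.2f}x compute to match")
```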

---

Synthetic Task Testing:

  • Palindrome, multi-query associative recall (MQAR), and stack tracking: KDA reaches 100% accuracy
  • Gated DeltaNet (GDN) and Mamba2 fail on long sequences
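
For readers unfamiliar with these probes, here is a minimal generator for an MQAR-style item. The vocabulary and format are arbitrary choices for illustration; the point is that the model must store every key-value pair and recall several of them much later in the sequence:

```python
import random

def mqar_example(n_pairs=4, n_queries=2, seed=0):
    """One multi-query associative recall (MQAR) item: (prompt tokens, answers)."""
    rng = random.Random(seed)
    keys = rng.sample(range(100, 200), n_pairs)
    vals = rng.sample(range(200, 300), n_pairs)
    kv = dict(zip(keys, vals))
    queries = rng.sample(keys, n_queries)
    prompt = [t for pair in zip(keys, vals) for t in pair] + ["?"] + queries
    return prompt, [kv[q] for q in queries]

print(mqar_example())
# e.g. ([k1, v1, k2, v2, ..., '?', q1, q2], [answer1, answer2])
```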

---

vLLM Integration:

  • Kimi Delta Attention has been merged into vLLM's main branch; a routine upgrade is enough to use it.
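
Serving it then looks like any other vLLM model. A minimal sketch; the Hugging Face model ID below is my assumption, so check Moonshot AI's model page for the exact released checkpoint name:

```python
from vllm import LLM, SamplingParams

# Model ID assumed for illustration; verify against Moonshot AI's HF page.
llm = LLM(model="moonshotai/Kimi-Linear-48B-A3B-Instruct", trust_remote_code=True)

outputs = llm.generate(["Explain the delta rule in one sentence."],
                       SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```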

---

4. Two Paths: Full vs. Hybrid Attention

MiniMax:

  • Bets on Full Attention
  • Assumes GPU costs keep falling and prefers safer, proven technology
  • Strategy: trade time for space, letting hardware progress absorb Full Attention's memory cost

Kimi/Moonshot:

  • Bets on the KDA + MLA hybrid
  • Redesigns the architecture to cut costs now while maturing the engineering
  • Strategy: trade space for time, shrinking the KV cache to buy decoding speed today

Other philosophies:

  • DeepSeek: MLA focus
  • Mistral: Sliding-window sparse
  • OpenAI/Anthropic: Likely optimized Full Attention

---

Industry Impact

These divergent approaches display:

  • Open, healthy competition
  • No single “right answer” in large-model attention design

The rivalry is both technical — Full vs. Efficient Attention — and market-facing, influencing long-term competitiveness.

---

5. AI Content Ecosystem Tie-In

For developers/creators leveraging these advancements, AiToEarn offers:

  • AI-assisted content generation
  • One-click publishing to Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, YouTube, X(Twitter), Pinterest, LinkedIn, Threads
  • Integrated analytics & AI Model Ranking

Paralleling Kimi’s efficiency goals, AiToEarn maximizes reach while minimizing operational overhead.

---

Bottom Line:

The “battle for attention” — both engineering and business — is far from over. MiniMax and Kimi’s rivalry will likely shape not only technical standards but also the ecosystems around how AI-powered products are delivered and monetized.

In this race, innovation, openness, and practical deployment tools will determine who truly gets your back.
