ZTE Publishes a Paper Offering Insights into Cutting-Edge AI Research Directions
Trillion‑Parameter AI Models: Challenges, Bottlenecks & Next‑Gen Paradigms
As frontier models like GPT‑4o and Llama 4 push performance boundaries, the AI industry faces severe constraints:
- Transformer inefficiency
- Extreme compute demand
- Weak linkage to the physical world
A ZTE paper, Insights into Next-Generation AI Large Model Computing Paradigms, analyzes today’s bottlenecks and explores breakthrough directions—offering guidance for AGI progress.
---
1. Status & Hidden Risks in the Scaling Race
Scaling Law (2020) — Model performance improves with more parameters, compute, and data.
Example: GPT‑3 (175 B parameters) greatly advanced NLP and Q&A tasks.
Recent validations: DeepSeek‑V3, GPT‑4o, Llama 4, Qwen 3, Grok 4.
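For reference, the 2020 scaling-law results express this as a power law in model size; a representative fitted form (standard Kaplan et al. notation, not taken from the ZTE paper):

```latex
% Test loss as a power law in non-embedding parameters N
% (analogous laws hold for data D and compute C); Kaplan et al.
% fit \alpha_N \approx 0.076 for Transformer language models.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
```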
Costs:
- Hundreds of thousands of compute cards
- Hundreds of TB of corpus
- Months of training in autoregressive Transformers
- Grok 4: 200,000 cards across two 150 MW data centers; six-month pre‑training
Gap between industry & academia:
- Industry: trillion‑scale training
- Academia: theory & small (<7 B parameter) experiments
Despite algorithmic, hardware, and cost limits, optimism about the Scaling Law keeps driving growth in model scale.
---
2. Transformer Architecture Bottlenecks
- Low compute efficiency & high bandwidth demand
- Arithmetic intensity ≈ 2 FLOPs/byte, versus hundreds for CNN workloads (estimated in the sketch below)
- High data transfer → low Model FLOPs Utilization (MFU)
- Poor parallelism for Softmax, LayerNorm, Swish
- Heavy reliance on HBM & advanced fabrication → high cost
With growing inference contexts (long reasoning chains, AI for Science), these issues worsen, especially as Moore's Law slows and the von Neumann separation of compute and storage runs into the "power wall" and "memory wall".
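A back-of-the-envelope sketch of why decode-time attention is bandwidth-bound (illustrative shapes, not figures from the paper):

```python
# Rough arithmetic-intensity estimate for single-token (decode) attention
# reading a KV cache. Shapes are illustrative, not from the ZTE paper.

def attention_decode_intensity(seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved when one query attends over a cached K/V of
    length seq_len (per head; FP16 by default)."""
    # Q·K^T: seq_len dot products of length head_dim -> 2*seq_len*head_dim FLOPs;
    # softmax(A)·V adds another 2*seq_len*head_dim FLOPs.
    flops = 4 * seq_len * head_dim
    # The K and V caches must each be streamed from memory once.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# -> 1.0 FLOPs/byte at FP16 -- the same order as the ~2 cited above,
# versus hundreds for the large GEMMs in CNNs.
print(attention_decode_intensity(seq_len=4096, head_dim=128))
```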
---
3. AGI Road Debates
Transformer autoregression = “next token prediction.”
Issues:
- Hallucinations
- Poor interpretability
- No embodiment or hierarchy
Experts like Yann LeCun argue that current LLMs lack an understanding of the physical world.
Fundamental neural network shortcomings:
- Artificial neurons lack genuine learning, memory, and decision abilities; intelligence only emerges at scale
- Progress depends on brute‑force scaling
- Weak world‑model mapping
---
4. Engineering Optimizations for LLMs
4.1 Algorithm-Level Improvements
Attention Mechanism Optimizations:
- Target: long-context support
- Reduce O(N²) self-attention cost
- Techniques: GQA, MLA, FlashAttention (see the GQA sketch after this list)
- New mechanisms: linear attention, RWKV, Mamba
- RoPE interpolation refinements
- Examples: NSA, MoBA, Ring-Attention, Tree-Attention
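As a concrete illustration of one item above, a minimal grouped-query attention (GQA) sketch in NumPy (hypothetical shapes; production code fuses this into FlashAttention-style kernels):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA: many query heads share few K/V heads, shrinking the
    KV cache (and its bandwidth cost) by n_q_heads / n_kv_heads.
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]                     # queries per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kh, vh = k[h // group], v[h // group]           # shared K/V head
        scores = q[h] @ kh.T / np.sqrt(d)               # (seq, seq)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                   # softmax
        out[h] = w @ vh
    return out

# 8 query heads sharing 2 KV heads -> 4x smaller KV cache
q = np.random.randn(8, 16, 64)
k, v = np.random.randn(2, 16, 64), np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)           # (8, 16, 64)
```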
Low‑Precision Quantization:
- Use FP8, FP4, MXFP to lower bandwidth use & raise throughput
- 4‑bit offers best scalability in practice
- Trade‑offs: quantization error, extra nonlinear layer overhead
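A minimal sketch of the bandwidth/error trade-off just listed, using plain symmetric 4-bit integer quantization (the FP8/FP4/MXFP formats above also encode exponents, so this is only the simplest case):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit codes in [-8, 7].
    Stored weights shrink 4x vs FP16, cutting memory traffic."""
    scale = np.abs(x).max() / 7.0                        # map max magnitude to 7
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
codes, scale = quantize_int4(x)
err = np.abs(dequantize(codes, scale) - x).mean()
print(f"mean quantization error: {err:.4f}")             # the accuracy trade-off
```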
Recursive Parameter Reuse:
- Universal Transformer, MoEUT
- Cross‑layer parameter sharing boosts arithmetic intensity under bandwidth limits (see the sketch after this list)
- Current experiments are small scale; stability unclear
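The core idea, as a toy sketch (hypothetical stand-in block, not the Universal Transformer itself): one set of weights is fetched once and applied repeatedly, so FLOPs per byte of weight traffic rises with the iteration count.

```python
import numpy as np

def recurrent_block(x, w, n_iters):
    """Apply the same weights n_iters times (cross-layer weight tying).
    Weights are fetched from memory once but reused n_iters times,
    raising arithmetic intensity under a fixed bandwidth budget."""
    for _ in range(n_iters):
        x = np.tanh(x @ w)          # stand-in for a full attention/FFN block
    return x

x = np.random.randn(16, 512)
w = np.random.randn(512, 512) / np.sqrt(512)
y = recurrent_block(x, w, n_iters=12)   # 12 "layers" of compute, 1 layer of weights
```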
---
4.2 Cluster System Improvements
LLMs require multi‑card, multi‑machine clusters, combining:
- Tensor Parallelism (TP)
- Data Parallelism (DP)
- Pipeline Parallelism (PP)
- Expert Parallelism (EP)
MoE paradigm: activate top‑K experts only → large compute reduction
Example: DeepSeek V3 cut FFN load to 1/32.
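A minimal top-K routing sketch of the MoE idea (illustrative only; DeepSeek V3's actual router adds shared experts and load balancing):

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route each token to its top_k experts; only those FFNs execute.
    x: (tokens, d); router_w: (d, n_experts); experts: list of (d, d) weights."""
    logits = x @ router_w                                 # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]      # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = np.exp(logits[t, chosen[t]])
        g /= g.sum()                                      # gate over chosen experts
        for gate, e in zip(g, chosen[t]):
            out[t] += gate * np.tanh(x[t] @ experts[e])   # toy one-layer "expert"
    return out

d, n_experts = 64, 32
experts = [np.random.randn(d, d) / np.sqrt(d) for _ in range(n_experts)]
out = moe_forward(np.random.randn(8, d), np.random.randn(d, n_experts), experts)
# 2 of 32 experts run per token -> ~1/16 of the dense FFN compute
```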
Prefill vs Decode separation:
- Prefill: compute‑intensive; optimized for TTFT (time to first token)
- Decode: bandwidth‑intensive; optimized for TPOT (time per output token)
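A self-contained toy showing why the two phases bind differently (a single projection stands in for the model; timings are illustrative):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
D, V = 256, 1000
W = rng.standard_normal((D, V)) / np.sqrt(D)     # toy "model": one projection

step = lambda h: h @ W                           # stand-in for a full forward pass

prompt = rng.standard_normal((512, D))           # 512 prompt tokens

# Prefill: one batched pass over the whole prompt (large GEMM, compute-bound)
t0 = time.perf_counter()
logits = step(prompt)
ttft = time.perf_counter() - t0                  # time to first token (TTFT)

# Decode: one token per step; each step re-reads all weights (bandwidth-bound)
t0, n_new = time.perf_counter(), 64
for _ in range(n_new):
    logits = step(rng.standard_normal((1, D)))   # single-row GEMV
tpot = (time.perf_counter() - t0) / n_new        # time per output token (TPOT)

print(f"TTFT={ttft*1e3:.2f} ms  TPOT={tpot*1e3:.3f} ms")
```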
---
4.3 Hardware Advances
Techniques:
- Domain‑specific microarchitectures (DSAs): GPU Tensor Cores + asynchronous data transfer
- Interconnect optimization:
  - Scale‑up: NVLink, low latency (~200 ns)
  - Scale‑out: RDMA + NCCL collective primitives
- Optoelectronic hybrid clusters: large optical interconnect, wafer‑level expansion
- Compute‑in‑memory paradigms
- Simulation platforms for 10k+ card clusters
Future keys:
- Optical I/O supernodes: sub‑100 ns latency, memory pooling
- Novel memory systems: 3D DRAM, capacitor‑less DRAM, heterogeneous media
---
5. Beyond Next Token Prediction
5.1 Advanced Architectures
- Diffusion LLMs: LLaDA, Mercury → 10× throughput, 1/10 energy
- Joint embedding prediction: JEPA, LCM → learning in latent space, energy‑based prediction
5.2 Physics-Principles-Based Models
- Liquid neural networks: LTCNs (liquid time‑constant networks), biologically inspired continuous‑time RNNs
- Energy‑Based Models: Hopfield nets, RBM, DBN — probability density estimation & flexible dependency modeling
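For reference, the classic Hopfield energy (textbook form, not reproduced from the paper); the network relaxes toward local minima of E, which serve as stored memories:

```latex
% Binary states s_i \in \{-1, +1\}, symmetric weights w_{ij} = w_{ji}
E(\mathbf{s}) = -\frac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j + \sum_i \theta_i s_i
```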
---
6. Emerging Computing Paradigms
Energy efficiency > raw compute.
Move from von Neumann binary simulation to architectures exploiting natural computation.
6.1 Physical-Principles-Inspired
- Optical computing: optical neural networks (ONNs) exploiting light's speed, bandwidth, and parallelism
- Quantum computing: optimization via QUBO (quadratic unconstrained binary optimization) formulations, reservoir computing
6.2 Electromagnetic Computing
- Use wave properties (microwave, mmWave, THz) for linear transformations & real‑time processing.
6.3 Materials-Property Analog
- Probabilistic computing: p‑bit units for stochastic tasks
- Attractor networks: RRAM devices for hysteresis neurons
- Thermodynamic computing: physical equilibrium for sampling
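A p-bit is a binary stochastic neuron; a minimal software emulation using the standard update rule from the probabilistic-computing literature (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def p_bit(input_current, beta=1.0):
    """Probabilistic bit: fluctuates between -1 and +1, biased by its input.
    Standard update: m = sgn(tanh(beta * I) - U), U ~ Uniform(-1, 1),
    so E[m] = tanh(beta * I)."""
    return np.sign(np.tanh(beta * input_current) - rng.uniform(-1, 1))

# Strong positive input -> mostly +1; zero input -> a fair coin
samples = [p_bit(2.0) for _ in range(10_000)]
print(np.mean(samples))   # close to tanh(2.0) ≈ 0.96
```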
6.4 Bio‑Inspired
- Neuromorphic computing: brain‑like energy‑efficient models
- DNA computing: high density, low energy biochemical systems
---
7. ZTE’s Next‑Gen Exploration
Microarchitecture Innovations:
- 8T SRAM in‑memory computing
- XPU–PIM heterogeneous integration → large efficiency gains over GPU baselines
Physics-informed design:
- Recurrent Transformer → fewer parameters, same expressive power
Support engineering:
- Optical interconnect, next‑gen memory
- Compute–storage separation, memory semantic interconnect
- Large-scale simulation platforms
- High‑bandwidth UCIe‑attached memory tailored to LLM access patterns
Sparse Deep Boltzmann Machines (DBMs):
- Non‑volatile memory + probabilistic computing → 100×+ speedup for edge inference
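For context, one Gibbs-sampling sweep in a restricted Boltzmann machine, the building block of such samplers (minimal NumPy sketch; the hardware performs the stochastic draws physically in non-volatile memory, which is where the claimed speedup comes from):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(v, W, b_h, b_v):
    """One RBM Gibbs sweep: sample hidden given visible, then visible
    given hidden. Repeated sweeps draw samples from the model."""
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_h)))     # P(h=1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-(h @ W.T + b_v)))   # P(v=1 | h)
    return (rng.random(p_v.shape) < p_v).astype(float)

n_v, n_h = 16, 8
W = rng.standard_normal((n_v, n_h)) * 0.1
v = (rng.random(n_v) < 0.5).astype(float)
for _ in range(100):                               # burn-in sampling loop
    v = gibbs_sweep(v, W, np.zeros(n_h), np.zeros(n_v))
print(v)
```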
---
8. Conclusions & Outlook
Scaling has driven AI’s rise but exposed deep limits in efficiency, cost, and physical grounding.
Solutions require:
- Hardware–software co‑design
- Physics‑first principles
- New computing substrates
Future:
- Hybrid paradigms (edge–cloud, embodied AI)
- Energy‑aware architectures replacing brute‑force scaling
- Universal intelligence with autonomous awareness