ZTE Publishes a Paper Offering Insights into Cutting-Edge AI Research Directions
Trillion‑Parameter AI Models: Challenges, Bottlenecks & Next‑Gen Paradigms
As frontier models like GPT‑4o and Llama 4 push performance boundaries, the AI industry faces severe constraints:
- Transformer inefficiency
- Extreme compute demand
- Weak linkage to the physical world
A ZTE paper, Insights into Next-Generation AI Large Model Computing Paradigms, analyzes today’s bottlenecks and explores breakthrough directions—offering guidance for AGI progress.
---
1. Status & Hidden Risks in the Scaling Race
Scaling Law (2020) — Model performance improves with more parameters, compute, and data.
Example: GPT‑3 (175 B parameters) greatly advanced NLP and Q&A tasks.
Recent validations: DeepSeek‑V3, GPT‑4o, Llama 4, Qwen 3, Grok 4.
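For reference, the 2020 scaling-law results express this as a power law in model size; a representative fitted form (standard Kaplan et al. notation, not taken from the ZTE paper):

```latex
% Test loss as a power law in non-embedding parameters N
% (analogous laws hold for data D and compute C); Kaplan et al.
% fit \alpha_N \approx 0.076 for Transformer language models.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
```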
Costs:
- Hundreds of thousands of compute cards
- Hundreds of TB of corpus
- Months of training in autoregressive Transformers
- Grok 4: 200,000 cards across two 150 MW data centers; six-month pre‑training
Gap between industry & academia:
- Industry: trillion‑scale training
- Academia: theory & small (<7 B parameter) experiments
Despite algorithmic, hardware, and cost limits, optimism about the Scaling Law keeps driving growth in model scale.
---
2. Transformer Architecture Bottlenecks
- Low compute efficiency & high bandwidth demand
- Arithmetic intensity ≈ 2 FLOPs/byte, versus hundreds for CNN workloads (estimated in the sketch below)
- High data transfer → low Model FLOPs Utilization (MFU)
- Poor parallelism for Softmax, LayerNorm, Swish
- Heavy reliance on HBM & advanced fabrication → high cost
With growing inference contexts (long reasoning chains, AI for Science), these issues worsen, especially as Moore's Law slows and the von Neumann separation of compute and storage runs into the "power wall" and "memory wall".
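A back-of-the-envelope sketch of why decode-time attention is bandwidth-bound (illustrative shapes, not figures from the paper):

```python
# Rough arithmetic-intensity estimate for single-token (decode) attention
# reading a KV cache. Shapes are illustrative, not from the ZTE paper.

def attention_decode_intensity(seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved when one query attends over a cached K/V of
    length seq_len (per head; FP16 by default)."""
    # Q·K^T: seq_len dot products of length head_dim -> 2*seq_len*head_dim FLOPs;
    # softmax(A)·V adds another 2*seq_len*head_dim FLOPs.
    flops = 4 * seq_len * head_dim
    # The K and V caches must each be streamed from memory once.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_moved

# -> 1.0 FLOPs/byte at FP16 -- the same order as the ~2 cited above,
# versus hundreds for the large GEMMs in CNNs.
print(attention_decode_intensity(seq_len=4096, head_dim=128))
```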
---
3. AGI Road Debates
Transformer autoregression = “next token prediction.”
Issues:
- Hallucinations
- Poor interpretability
- No embodiment or hierarchy
Experts like Yann LeCun argue that current LLMs lack an understanding of the physical world.
Fundamental neural network shortcomings:
- Artificial neurons lack genuine learning, memory, and decision abilities; intelligence only emerges at scale
- Progress depends on brute‑force scaling
- Weak world‑model mapping
---
4. Engineering Optimizations for LLMs
4.1 Algorithm-Level Improvements
Attention Mechanism Optimizations:
- Target: long-context support
- Reduce O(N²) self-attention cost
- Techniques: GQA, MLA, FlashAttention (see the GQA sketch after this list)
- New mechanisms: linear attention, RWKV, Mamba
- RoPE interpolation refinements
- Examples: NSA, MoBA, Ring-Attention, Tree-Attention
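As a concrete illustration of one item above, a minimal grouped-query attention (GQA) sketch in NumPy (hypothetical shapes; production code fuses this into FlashAttention-style kernels):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA: many query heads share few K/V heads, shrinking the
    KV cache (and its bandwidth cost) by n_q_heads / n_kv_heads.
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // k.shape[0]                     # queries per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kh, vh = k[h // group], v[h // group]           # shared K/V head
        scores = q[h] @ kh.T / np.sqrt(d)               # (seq, seq)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                   # softmax
        out[h] = w @ vh
    return out

# 8 query heads sharing 2 KV heads -> 4x smaller KV cache
q = np.random.randn(8, 16, 64)
k, v = np.random.randn(2, 16, 64), np.random.randn(2, 16, 64)
print(grouped_query_attention(q, k, v).shape)           # (8, 16, 64)
```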
Low‑Precision Quantization:
- Use FP8, FP4, MXFP to lower bandwidth use & raise throughput
- 4‑bit offers best scalability in practice
- Trade‑offs: quantization error, extra nonlinear layer overhead
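A minimal sketch of the bandwidth/error trade-off just listed, using plain symmetric 4-bit integer quantization (the FP8/FP4/MXFP formats above also encode exponents, so this is only the simplest case):

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization to 4-bit codes in [-8, 7].
    Stored weights shrink 4x vs FP16, cutting memory traffic."""
    scale = np.abs(x).max() / 7.0                        # map max magnitude to 7
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
codes, scale = quantize_int4(x)
err = np.abs(dequantize(codes, scale) - x).mean()
print(f"mean quantization error: {err:.4f}")             # the accuracy trade-off
```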
Recursive Parameter Reuse:
- Universal Transformer, MoEUT
- Cross‑layer parameter sharing boosts arithmetic intensity under bandwidth limits (see the sketch after this list)
- Current experiments are small scale; stability unclear
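The core idea, as a toy sketch (hypothetical stand-in block, not the Universal Transformer itself): one set of weights is fetched once and applied repeatedly, so FLOPs per byte of weight traffic rises with the iteration count.

```python
import numpy as np

def recurrent_block(x, w, n_iters):
    """Apply the same weights n_iters times (cross-layer weight tying).
    Weights are fetched from memory once but reused n_iters times,
    raising arithmetic intensity under a fixed bandwidth budget."""
    for _ in range(n_iters):
        x = np.tanh(x @ w)          # stand-in for a full attention/FFN block
    return x

x = np.random.randn(16, 512)
w = np.random.randn(512, 512) / np.sqrt(512)
y = recurrent_block(x, w, n_iters=12)   # 12 "layers" of compute, 1 layer of weights
```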
---
4.2 Cluster System Improvements
LLMs require multi‑card, multi‑machine clusters, combining:
- Tensor Parallelism (TP)
- Data Parallelism (DP)
- Pipeline Parallelism (PP)
- Expert Parallelism (EP)
MoE paradigm: activate top‑K experts only → large compute reduction
Example: DeepSeek V3 cut FFN load to 1/32.
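A minimal top-K routing sketch of the MoE idea (illustrative only; DeepSeek V3's actual router adds shared experts and load balancing):

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route each token to its top_k experts; only those FFNs execute.
    x: (tokens, d); router_w: (d, n_experts); experts: list of (d, d) weights."""
    logits = x @ router_w                                 # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]      # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = np.exp(logits[t, chosen[t]])
        g /= g.sum()                                      # gate over chosen experts
        for gate, e in zip(g, chosen[t]):
            out[t] += gate * np.tanh(x[t] @ experts[e])   # toy one-layer "expert"
    return out

d, n_experts = 64, 32
experts = [np.random.randn(d, d) / np.sqrt(d) for _ in range(n_experts)]
out = moe_forward(np.random.randn(8, d), np.random.randn(d, n_experts), experts)
# 2 of 32 experts run per token -> ~1/16 of the dense FFN compute
```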
Prefill vs Decode separation:
- Prefill: compute‑intensive; optimized for TTFT (time to first token)
- Decode: bandwidth‑intensive; optimized for TPOT (time per output token)
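A self-contained toy showing why the two phases bind differently (a single projection stands in for the model; timings are illustrative):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
D, V = 256, 1000
W = rng.standard_normal((D, V)) / np.sqrt(D)     # toy "model": one projection

step = lambda h: h @ W                           # stand-in for a full forward pass

prompt = rng.standard_normal((512, D))           # 512 prompt tokens

# Prefill: one batched pass over the whole prompt (large GEMM, compute-bound)
t0 = time.perf_counter()
logits = step(prompt)
ttft = time.perf_counter() - t0                  # time to first token (TTFT)

# Decode: one token per step; each step re-reads all weights (bandwidth-bound)
t0, n_new = time.perf_counter(), 64
for _ in range(n_new):
    logits = step(rng.standard_normal((1, D)))   # single-row GEMV
tpot = (time.perf_counter() - t0) / n_new        # time per output token (TPOT)

print(f"TTFT={ttft*1e3:.2f} ms  TPOT={tpot*1e3:.3f} ms")
```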
---
4.3 Hardware Advances
Techniques:
- Domain‑specific microarchitectures (DSAs): GPU Tensor Cores + asynchronous data transfer
- Interconnect optimization:
  - Scale‑up: NVLink, low latency (~200 ns)
  - Scale‑out: RDMA + NCCL collective primitives
- Optoelectronic hybrid clusters: large optical interconnect, wafer‑level expansion
- Compute‑in‑memory paradigms
- Simulation platforms for 10k+ card clusters
Future keys:
- Optical I/O supernodes: sub‑100 ns latency, memory pooling
- Novel memory systems: 3D DRAM, capacitor‑less DRAM, heterogeneous media
---
5. Beyond Next Token Prediction
5.1 Advanced Architectures
- Diffusion LLMs: LLaDA, Mercury → 10× throughput, 1/10 energy
- Joint embedding prediction: JEPA, LCM → learning in latent space, energy‑based prediction
5.2 Physics-Principles-Based Models
- Liquid neural networks: LTCNs (liquid time‑constant networks), biologically inspired continuous‑time RNNs
- Energy‑Based Models: Hopfield nets, RBM, DBN — probability density estimation & flexible dependency modeling
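For reference, the classic Hopfield energy (textbook form, not reproduced from the paper); the network relaxes toward local minima of E, which serve as stored memories:

```latex
% Binary states s_i \in \{-1, +1\}, symmetric weights w_{ij} = w_{ji}
E(\mathbf{s}) = -\frac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j + \sum_i \theta_i s_i
```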
---
6. Emerging Computing Paradigms
Energy efficiency > raw compute.
Move from von Neumann binary simulation to architectures exploiting natural computation.
6.1 Physical-Principles-Inspired
- Optical computing: optical neural networks (ONNs) exploiting light's speed, bandwidth, and parallelism
- Quantum computing: optimization via QUBO (quadratic unconstrained binary optimization) formulations, reservoir computing
6.2 Electromagnetic Computing
- Use wave properties (microwave, mmWave, THz) for linear transformations & real‑time processing.
6.3 Materials-Property Analog
- Probabilistic computing: p‑bit units for stochastic tasks
- Attractor networks: RRAM devices for hysteresis neurons
- Thermodynamic computing: physical equilibrium for sampling
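A p-bit is a binary stochastic neuron; a minimal software emulation using the standard update rule from the probabilistic-computing literature (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def p_bit(input_current, beta=1.0):
    """Probabilistic bit: fluctuates between -1 and +1, biased by its input.
    Standard update: m = sgn(tanh(beta * I) - U), U ~ Uniform(-1, 1),
    so E[m] = tanh(beta * I)."""
    return np.sign(np.tanh(beta * input_current) - rng.uniform(-1, 1))

# Strong positive input -> mostly +1; zero input -> a fair coin
samples = [p_bit(2.0) for _ in range(10_000)]
print(np.mean(samples))   # close to tanh(2.0) ≈ 0.96
```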
6.4 Bio‑Inspired
- Neuromorphic computing: brain‑like energy‑efficient models
- DNA computing: high density, low energy biochemical systems
---
7. ZTE’s Next‑Gen Exploration
Microarchitecture Innovations:
- 8T SRAM in‑memory computing
- XPU–PIM heterogeneous integration → large efficiency gains over GPU baselines
Physics-informed design:
- Recurrent Transformer → fewer parameters, same expressive power
Support engineering:
- Optical interconnect, next‑gen memory
- Compute–storage separation, memory semantic interconnect
- Large-scale simulation platforms
- High‑bandwidth UCIe‑attached memory tailored to LLM access patterns
Sparse Deep Boltzmann Machines (DBMs):
- Non‑volatile memory + probabilistic computing → 100×+ speedup for edge inference
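For context, one Gibbs-sampling sweep in a restricted Boltzmann machine, the building block of such samplers (minimal NumPy sketch; the hardware performs the stochastic draws physically in non-volatile memory, which is where the claimed speedup comes from):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(v, W, b_h, b_v):
    """One RBM Gibbs sweep: sample hidden given visible, then visible
    given hidden. Repeated sweeps draw samples from the model."""
    p_h = 1.0 / (1.0 + np.exp(-(v @ W + b_h)))     # P(h=1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = 1.0 / (1.0 + np.exp(-(h @ W.T + b_v)))   # P(v=1 | h)
    return (rng.random(p_v.shape) < p_v).astype(float)

n_v, n_h = 16, 8
W = rng.standard_normal((n_v, n_h)) * 0.1
v = (rng.random(n_v) < 0.5).astype(float)
for _ in range(100):                               # burn-in sampling loop
    v = gibbs_sweep(v, W, np.zeros(n_h), np.zeros(n_v))
print(v)
```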
---
8. Conclusions & Outlook
Scaling has driven AI’s rise but exposed deep limits in efficiency, cost, and physical grounding.
Solutions require:
- Hardware–software co‑design
- Physics‑first principles
- New computing substrates
Future:
- Hybrid paradigms (edge–cloud, embodied AI)
- Energy‑aware architectures replacing brute‑force scaling
- Universal intelligence with autonomous awareness