ZTE Published a Paper Offering Insights into Cutting-Edge AI Research Directions

Trillion‑Parameter AI Models: Challenges, Bottlenecks & Next‑Gen Paradigms

As frontier models like GPT‑4o and Llama 4 push performance boundaries, the AI industry faces severe constraints:

  • Transformer inefficiency
  • Extreme compute demand
  • Weak linkage to the physical world

A ZTE paper, Insights into Next-Generation AI Large Model Computing Paradigms, analyzes today’s bottlenecks and explores breakthrough directions—offering guidance for AGI progress.

---

1. Status & Hidden Risks in the Scaling Race

Scaling Law (2020) — Model performance improves with more parameters, compute, and data.

Example: GPT‑3 (175 B parameters) greatly advanced NLP and Q&A tasks.

Recent validations: DeepSeek‑V3, GPT‑4o, Llama 4, Qwen 3, Grok 4.
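For intuition, the Scaling Law is usually stated as a power law in parameter count. The sketch below uses constants roughly in line with Kaplan et al. (2020); treat both the form and the numbers as illustrative, not values from the ZTE paper.

```python
# Illustrative power-law form of the Scaling Law: loss L(N) = (N_c / N) ** alpha_N.
# N_c and alpha_N roughly follow Kaplan et al. (2020) and are placeholders,
# not numbers taken from the ZTE paper.

def scaling_law_loss(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Predicted pre-training loss as a function of (non-embedding) parameter count."""
    return (n_c / n_params) ** alpha_n

for n in (1.75e11, 1e12, 1e13):  # GPT-3 scale, 1T, 10T parameters
    print(f"N = {n:.1e} params -> predicted loss ~ {scaling_law_loss(n):.3f}")
```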

Costs:

  • Hundreds of thousands of compute cards
  • Hundreds of TB of corpus
  • Months of training in autoregressive Transformers
  • Grok 4: 200,000 cards across two 150 MW data centers; six-month pre‑training

Gap between industry & academia:

  • Industry: trillion‑scale training
  • Academia: theory & small (<7 B parameter) experiments

Despite known limits in algorithms, hardware, and cost, optimism about the Scaling Law continues to drive growth in model scale.

---

2. Transformer Architecture Bottlenecks

  • Low compute efficiency & high bandwidth demand
  • Arithmetic intensity ≈ 2 FLOPs/byte, vs. hundreds for CNNs (see the sketch at the end of this section)
  • High data transfer → low Model FLOPs Utilization (MFU)
  • Poor parallelism for Softmax, LayerNorm, Swish
  • Heavy reliance on HBM & advanced fabrication → high cost

With growing inference contexts (long reasoning chains, AI for Science), these issues worsen, especially as Moore’s Law slows and the von Neumann separation of compute and storage runs into the “power wall” and “memory wall.”
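A back-of-the-envelope sketch of the arithmetic-intensity point above (my own illustration, not a calculation from the paper): intensity is FLOPs divided by bytes moved, and for a batch-1 decode GEMV it collapses to the order of 1–2 FLOPs/byte, while a large prefill GEMM reaches the hundreds or thousands.

```python
# Arithmetic intensity = FLOPs / bytes moved, for FP16 matrix multiplies.
# Shows why batch-1 autoregressive decoding is bandwidth-bound (~1-2 FLOPs/byte)
# while large prefill GEMMs are compute-bound (hundreds+). Dimensions are
# illustrative (a 7B-class FFN projection), not taken from the paper.
BYTES_PER_ELEM = 2  # FP16

def matmul_intensity(m: int, k: int, n: int) -> float:
    flops = 2 * m * k * n                                   # multiply-accumulates
    bytes_moved = BYTES_PER_ELEM * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

print(f"decode  GEMV (1 token)     : {matmul_intensity(1, 4096, 11008):7.1f} FLOPs/byte")
print(f"prefill GEMM (4096 tokens) : {matmul_intensity(4096, 4096, 11008):7.1f} FLOPs/byte")
```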

---

3. AGI Road Debates

Transformer autoregression = “next token prediction.”

Issues:

  • Hallucinations
  • Poor interpretability
  • No embodiment or hierarchy

Experts like Yann LeCun suggest current LLMs lack physical world understanding.

Fundamental neural network shortcomings:

  • Individual neurons lack true learning, memory, and decision-making abilities; intelligence is emergent
  • Progress depends on brute‑force scaling
  • Weak world‑model mapping

---

4. Engineering Optimizations for LLMs

4.1 Algorithm-Level Improvements

Attention Mechanism Optimizations:

  • Target: long-context support
  • Reduce O(N²) self-attention cost
  • Techniques: GQA, MLA, FlashAttention (GQA is sketched after this list)
  • New mechanisms: Linear-Attention, RWKV, Mamba
  • RoPE interpolation refinements
  • Sparse and distributed attention examples: NSA, MoBA, Ring Attention, Tree Attention
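To make one of these techniques concrete, here is a minimal NumPy sketch of grouped-query attention (GQA): several query heads share each key/value head, shrinking the KV cache and its bandwidth cost. Head counts and shapes are arbitrary, and masking and projections are omitted.

```python
import numpy as np

# Minimal grouped-query attention (GQA) sketch: n_q query heads share
# n_kv (< n_q) key/value heads, shrinking the KV cache by n_q / n_kv.
# All shapes and head counts are illustrative, not taken from the paper.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(Q, K, V):
    """Q: (T, n_q, d); K, V: (T, n_kv, d), with n_q a multiple of n_kv."""
    T, n_q, d = Q.shape
    n_kv = K.shape[1]
    group = n_q // n_kv
    out = np.empty_like(Q)
    for h in range(n_q):
        kv = h // group                              # query head h reuses KV head kv
        scores = Q[:, h] @ K[:, kv].T / np.sqrt(d)   # (T, T) attention logits
        out[:, h] = softmax(scores) @ V[:, kv]
    return out.reshape(T, n_q * d)

T, n_q, n_kv, d = 16, 8, 2, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, n_q, d))
K = rng.standard_normal((T, n_kv, d))   # KV cache is 4x smaller than full MHA
V = rng.standard_normal((T, n_kv, d))
print(gqa(Q, K, V).shape)               # (16, 512)
```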

Low‑Precision Quantization:

  • Use FP8, FP4, MXFP to lower bandwidth use & raise throughput
  • 4‑bit offers best scalability in practice
  • Trade‑offs: quantization error, extra nonlinear layer overhead
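A minimal sketch of where quantization error comes from, using generic symmetric 4-bit weight quantization; this is a toy example of the general idea, not the FP8/FP4/MXFP recipes referenced above.

```python
import numpy as np

# Symmetric per-tensor INT4 quantize/dequantize round trip, showing the
# bandwidth saving (4 bits per weight, stored here in an int8 container)
# and the resulting quantization error. Generic sketch, not the paper's scheme.

def quantize_int4(w):
    scale = np.abs(w).max() / 7.0                        # INT4 symmetric range [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(dequantize(q, scale) - w)
print(f"32 -> 4 bits/weight, mean abs error = {err.mean():.4f}")
```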

Recursive Parameter Reuse:

  • Universal Transformer, MoEUT
  • Cross‑layer parameter sharing boosts arithmetic intensity under bandwidth limits
  • Current experiments are small scale; stability unclear
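A minimal PyTorch sketch of the cross-layer sharing idea: one Transformer block applied repeatedly, so depth adds compute (and arithmetic intensity per weight fetched) without adding parameters. Hyperparameters are arbitrary, and this is not the Universal Transformer or MoEUT implementation itself.

```python
import torch
import torch.nn as nn

# Recursive parameter reuse: a single Transformer block is applied `depth`
# times, so effective depth costs compute but no extra weights, raising
# arithmetic intensity under a fixed weight-bandwidth budget.

class RecurrentEncoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8, depth=12):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.depth = depth                    # reuse the same weights `depth` times

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)
        return x

model = RecurrentEncoder()
x = torch.randn(2, 16, 512)                   # (batch, tokens, d_model)
print(model(x).shape, sum(p.numel() for p in model.parameters()))
```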

---

4.2 Cluster System Improvements

LLMs need multi‑card/multi‑machine clusters via:

  • Tensor Parallelism (TP)
  • Data Parallelism (DP)
  • Pipeline Parallelism (PP)
  • Expert Parallelism (EP)

MoE paradigm: activate top‑K experts only → large compute reduction

Example: DeepSeek-V3 cuts per-token FFN compute to roughly 1/32 of a dense equivalent.
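A generic top-K routing sketch to illustrate why MoE cuts per-token FFN compute; the gate, expert shapes, and counts are toy values, and this is not DeepSeek-V3's actual router.

```python
import numpy as np

# Generic top-K expert routing: each token activates only k of E expert FFNs,
# so per-token FFN compute scales with k/E. Simplified gate, toy experts.

rng = np.random.default_rng(0)
T, d, E, k = 4, 64, 16, 2                     # tokens, hidden dim, experts, active experts
x = rng.standard_normal((T, d))
W_gate = rng.standard_normal((d, E))
experts = [rng.standard_normal((d, d)) for _ in range(E)]   # toy "FFN" per expert

logits = x @ W_gate
topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k highest-scoring experts
out = np.zeros_like(x)
for t in range(T):
    scores = logits[t, topk[t]]
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    for w, e in zip(weights, topk[t]):
        out[t] += w * (x[t] @ experts[e])     # only k of E experts run per token
print(out.shape)                              # (4, 64)
```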

Prefill vs Decode separation:

  • Prefill: compute-intensive; optimized for time to first token (TTFT)
  • Decode: bandwidth-intensive; optimized for time per output token (TPOT)
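To make the two metrics concrete, a small sketch of how TTFT and TPOT are typically derived from request timestamps; the numbers are hypothetical.

```python
# TTFT = time to first token (dominated by prefill); TPOT = average time per
# output token after the first (dominated by decode bandwidth).
# Timestamps and token count below are hypothetical.

request_sent = 0.00
first_token_at = 0.42                     # prefill finished, first token streamed
last_token_at = 3.62
n_output_tokens = 65

ttft = first_token_at - request_sent
tpot = (last_token_at - first_token_at) / (n_output_tokens - 1)
print(f"TTFT = {ttft*1000:.0f} ms, TPOT = {tpot*1000:.0f} ms/token")
```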

---

4.3 Hardware Advances

Techniques:

  • Microarchitecture DSA-ization: GPU Tensor Cores + asynchronous transfer
  • Interconnect Optimization:
      • Scale Up: NVLink, low latency (~200 ns)
      • Scale Out: RDMA + NCCL primitives
  • Optoelectronic hybrid clusters: large optical interconnect, wafer-level expansion
  • Compute-in-memory paradigms
  • Simulation platforms for 10k+ card clusters

Future keys:

  • Optical I/O supernodes: sub‑100 ns latency, memory pooling
  • Novel memory systems: 3D DRAM, capacitor‑less DRAM, heterogeneous media

---

5. Beyond Next Token Prediction

5.1 Advanced Architectures

  • Diffusion LLMs: LLaDA, Mercury → 10× throughput, 1/10 energy
  • Joint Embedding Prediction: JEPA, LCM → latent space learning, energy‑based predictions

5.2 Physics-Principles-Based Models

  • Liquid Neural Models: LTCN, continuous‑time RNNs with biological inspiration
  • Energy‑Based Models: Hopfield nets, RBM, DBN — probability density estimation & flexible dependency modeling
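As one concrete member of the energy-based family, a tiny classical Hopfield network: Hebbian storage plus asynchronous updates that never increase the energy E(s) = -1/2 sᵀWs. A minimal illustration of the idea, not an architecture from the paper.

```python
import numpy as np

# Tiny classical Hopfield network: Hebbian weight storage plus asynchronous
# updates that descend the energy E(s) = -0.5 * s^T W s.
# The stored pattern is arbitrary; this is a minimal illustration.

rng = np.random.default_rng(0)
pattern = rng.choice([-1, 1], size=32)

W = np.outer(pattern, pattern).astype(float)  # Hebbian rule
np.fill_diagonal(W, 0.0)

def energy(s):
    return -0.5 * s @ W @ s

state = pattern.copy()
state[:8] *= -1                               # corrupt 8 of 32 bits
for _ in range(5):                            # asynchronous updates
    for i in rng.permutation(len(state)):
        state[i] = 1 if W[i] @ state >= 0 else -1

print("recovered:", np.array_equal(state, pattern), "energy:", energy(state))
```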

---

6. Emerging Computing Paradigms

Energy efficiency > raw compute.

Move from von Neumann binary simulation to architectures exploiting natural computation.

6.1 Physical-Principles-Inspired

  • Optical computing: optical neural networks (ONNs) exploiting light’s speed, bandwidth, and parallelism
  • Quantum computing: QUBO-based optimization, reservoir computing
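For context on the QUBO formulation mentioned above: QUBO (quadratic unconstrained binary optimization) asks for the binary vector x minimizing xᵀQx, which is the native problem form for quantum annealers and Ising machines. Below is a brute-force classical reference on a made-up 4-variable instance.

```python
import itertools
import numpy as np

# Toy QUBO: minimize x^T Q x over x in {0,1}^n. Quantum annealers and Ising
# machines target this problem form; here we brute-force a made-up
# 4-variable instance as a classical reference.

Q = np.array([[-1.0,  2.0,  0.0,  0.0],
              [ 0.0, -1.0,  2.0,  0.0],
              [ 0.0,  0.0, -1.0,  2.0],
              [ 0.0,  0.0,  0.0, -1.0]])

best = min(itertools.product([0, 1], repeat=4),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print("optimal assignment:", best, "cost:", np.array(best) @ Q @ np.array(best))
```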

6.2 Electromagnetic Computing

  • Use wave properties (microwave, mmWave, THz) for linear transformations & real‑time processing.

6.3 Materials-Property Analog

  • Probabilistic computing: p‑bit units for stochastic tasks (emulated in the sketch after this list)
  • Attractor networks: RRAM devices for hysteresis neurons
  • Thermodynamic computing: physical equilibrium for sampling
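A minimal software emulation of a p-bit, the building block usually cited for probabilistic computing: a binary unit whose time-averaged output follows tanh of its input. Hardware realizes this with noisy devices such as magnetic tunnel junctions; the sketch below only reproduces the statistics.

```python
import numpy as np

# Software emulation of a single p-bit: a stochastic +/-1 unit whose
# time-averaged output follows tanh(bias). Purely illustrative.

rng = np.random.default_rng(0)

def p_bit(bias, n_samples=20000):
    """Return +/-1 samples; the mean approaches tanh(bias)."""
    return np.sign(np.tanh(bias) + rng.uniform(-1, 1, size=n_samples))

for bias in (-2.0, 0.0, 2.0):
    samples = p_bit(bias)
    print(f"bias={bias:+.1f}: <m> = {samples.mean():+.3f} (tanh = {np.tanh(bias):+.3f})")
```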

6.4 Bio‑Inspired

  • Neuromorphic computing: brain‑like energy‑efficient models
  • DNA computing: high density, low energy biochemical systems

---

7. ZTE’s Next‑Gen Exploration

Microarchitecture Innovations:

  • 8T SRAM in‑memory computing
  • XPU–PIM heterogeneous architecture → large efficiency gains over GPU-only designs

Physics-informed design:

  • Recurrent Transformer → fewer parameters, same expressive power

Support engineering:

  • Optical interconnect, next‑gen memory
  • Compute–storage separation, memory semantic interconnect
  • Large-scale simulation platforms
  • High-bandwidth UCIe-based memory tailored for LLM access patterns

Sparse Boltzmann Machines (DBM):

  • Non‑volatile memory + probabilistic computing → 100×+ speedup for edge inference

---

8. Conclusions & Outlook

Scaling has driven AI’s rise but exposed deep limits in efficiency, cost, and physical grounding.

Solutions require:

  • Hardware–software co‑design
  • Physics‑first principles
  • New computing substrates

Future:

  • Hybrid paradigms (edge–cloud, embodied AI)
  • Energy‑aware architectures replacing brute‑force scaling
  • Universal intelligence with autonomous awareness
