LeCun’s Prediction Comes True: 790 Years’ Worth of Video Trains the Strongest Open-Source “World Model”
The Third Scaling Paradigm in AI
Emu3.5 Ushers in a Native Multimodal World


New Intelligence Report
Editor: Peach Sleepy
---
Executive Summary
The third scaling paradigm in AI is here — Emu3.5.
With 34 billion parameters, trained on the equivalent of 790 years of long-form video, this native multimodal world model can generate coherent 3D worlds and delivers up to 20× faster image inference.
---
World Models — The New Battleground in AI (2025)
- Google: Genie 3 — Generates a live 720p simulated world from a single sentence, dubbed by netizens as “Game Engine 2.0.”
- World Labs (Fei-Fei Li): RTFM — Real-time 3D rendering using only one H100 GPU.
- Meta FAIR: Code World Model (CWM)
- Runway: General World Model (GWM)
- Tesla: Neural network simulator
Core focus across the industry: Multimodal world models


Why World Models?
Leading AI researchers (e.g., Fei-Fei Li, Yann LeCun) emphasize: language alone cannot replicate human intelligence — AI must understand and simulate the physical world.
World models mimic human “mental models,” predicting environmental behavior and dynamics.
---
Introducing Emu3.5 — A Milestone by BAAI
Official launch by Beijing Academy of Artificial Intelligence (BAAI)
President Dr. Zhongyuan Wang:
> “Not every large model has to follow paths already taken. Emu is our own technical route — one we lead.”
Key difference:
While mainstream models are “modular assemblies” (LLM + CLIP + DiT), Emu3.5 returns to first principles:
- Continuous, long-term visual learning
- Unified autoregressive architecture for both understanding and generation
Capabilities:
- Long-text rendering
- Complex image editing
- Visual storytelling with physical dynamics, causality, spacetime, and logic
Technical report: https://arxiv.org/pdf/2510.26583
Homepage: https://zh.emu.world
---
Core Research Questions Emu3.5 Addresses
- How should multimodality be unified?
  → Native, end-to-end autoregressive “Next-State Prediction”
- What should a world model learn?
  → Long video data rich in world knowledge, temporal consistency, and causality
- How is scaling achieved?
  → Third Scaling Paradigm: pretraining + multimodal RL, leveraging LLM infrastructure
- How is it implemented efficiently?
  → Inference acceleration via DiDA to overcome the autoregressive decoding bottleneck

---
Learning Like Humans — From Next-Token to Next-State
Human learning starts with perception, not text.
- Babies observe and interact with the world → understand physics → develop language
Problem with current models:
- Video/image generators rely on separate modules (e.g., a Diffusion Transformer bolted onto a language model) → no truly unified intelligence.
Emu3.5’s Native Multimodal Path
- Unified tokenization: Images, text, action instructions
- Single autoregressive Transformer predicts the next token (visual, textual, or instructive); see the sketch after this list
- Benefits:
- Unification: Shared context for understanding & generation
- Scalability: Reuse LLM’s proven infrastructure
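Below is a minimal sketch of what this unified next-state formulation looks like in code. It assumes a toy decoder-only Transformer and made-up token-ID ranges for text, visual, and action tokens; none of the sizes, names, or ranges reflect Emu3.5’s actual tokenizer or architecture.

```python
# Toy illustration of unified next-token ("next-state") prediction:
# one causal Transformer, one shared vocabulary covering text, visual,
# and action tokens. All sizes and ID ranges are hypothetical.
import torch
import torch.nn as nn

TEXT_VOCAB = 65_536     # assumed IDs [0, 65_535] for text tokens
VISION_VOCAB = 32_768   # assumed IDs [65_536, 98_303] for image tokens
ACTION_VOCAB = 1_024    # assumed IDs [98_304, 99_327] for action tokens
TOTAL_VOCAB = TEXT_VOCAB + VISION_VOCAB + ACTION_VOCAB

class UnifiedAutoregressiveModel(nn.Module):
    """A single decoder predicts the next token, whatever its modality."""
    def __init__(self, d_model=1024, n_heads=16, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(TOTAL_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, TOTAL_VOCAB)

    def forward(self, token_ids):  # token_ids: (batch, seq) of mixed-modality IDs
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        hidden = self.backbone(self.embed(token_ids), mask=causal)
        return self.lm_head(hidden)  # logits over the shared vocabulary

def next_state_loss(model, interleaved_ids):
    """Plain next-token cross-entropy on an interleaved text/image/action
    sequence: the same objective regardless of modality."""
    logits = model(interleaved_ids[:, :-1])
    targets = interleaved_ids[:, 1:]
    return nn.functional.cross_entropy(
        logits.reshape(-1, TOTAL_VOCAB), targets.reshape(-1)
    )
```

The point of the sketch is the shared vocabulary and the single loss: because understanding and generation reduce to the same next-token objective, the model can reuse the training and serving stack already built for LLMs.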

---
Third Scaling Paradigm — 790 Years of Video + Multimodal RL
Data scale: > 13 trillion multimodal tokens
- Core: 790 years of long videos (documentaries, education, vlogs, gaming, animation)
- Rich spatiotemporal, causal, and coherent context
Training stages:
- Large-scale pretraining (~10 trillion tokens) → Basic multimodal alignment
- Large-scale multimodal RL → Decision-making & contextual reasoning across modalities
RL innovation:
- Unified autoregressive architecture supports multi-task multimodal RL
- Reward types: General (aesthetics, image–text consistency) + Task-specific (OCR accuracy, face ID preservation)
- Optimization via GRPO in a unified reward space (a hedged sketch follows this list)
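The sketch below shows how a unified reward space with group-relative (GRPO-style) advantages can be wired up. The scorer names, weights, and sample fields are illustrative placeholders, not the actual reward functions used for Emu3.5; the report only names the reward categories listed above.

```python
# Illustrative unified reward space + GRPO-style group-relative advantages.
# Scorer names, weights, and sample fields are hypothetical.
from typing import Callable, Dict, List
import statistics

RewardFn = Callable[[dict], float]  # each scorer maps one rollout -> a scalar

def unified_reward(sample: dict, scorers: Dict[str, RewardFn],
                   weights: Dict[str, float]) -> float:
    """Project heterogeneous reward signals into a single scalar."""
    return sum(weights[name] * fn(sample) for name, fn in scorers.items())

def grpo_advantages(group_rewards: List[float]) -> List[float]:
    """Group-relative advantage: normalize each rollout's reward against
    the other rollouts sampled for the same prompt."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in group_rewards]

# Usage: score a group of rollouts for one prompt, then feed the advantages
# into a clipped policy-gradient update (omitted here).
scorers = {
    "aesthetics": lambda s: s["aesthetic_score"],   # general reward
    "consistency": lambda s: s["clip_similarity"],  # image-text consistency
    "ocr": lambda s: s["ocr_accuracy"],             # task-specific reward
}
weights = {"aesthetics": 0.3, "consistency": 0.4, "ocr": 0.3}
rollouts = [
    {"aesthetic_score": 0.8, "clip_similarity": 0.7, "ocr_accuracy": 0.9},
    {"aesthetic_score": 0.6, "clip_similarity": 0.9, "ocr_accuracy": 0.5},
    {"aesthetic_score": 0.7, "clip_similarity": 0.8, "ocr_accuracy": 0.8},
]
rewards = [unified_reward(r, scorers, weights) for r in rollouts]
print(grpo_advantages(rewards))
```

What the unified autoregressive architecture buys here is that the same scalar reward interface applies whether a rollout is text, an image, or an interleaved sequence, so one RL loop can cover many tasks.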


---
DiDA — 20× Faster Inference for Autoregressive Models
Problem: Autoregressive image generation = slow (token-by-token)
Solution: Discrete Diffusion Adaptation (DiDA)
- Converts model from sequential → parallel token generation
- Process: generate noisy tokens → denoise in parallel refinement steps (sketched after this list)
- Result: ~20× faster inference with negligible quality loss
- Matches inference efficiency of top closed Diffusion models (e.g., Midjourney)
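To make the idea concrete, here is a minimal mask-and-refine decoding loop in the spirit of DiDA’s parallel refinement. The mask token, confidence schedule, and model interface are assumptions for illustration, not the exact procedure described in the technical report.

```python
# Sketch of parallel refinement decoding: all image-token positions are
# predicted at once, and only the least confident ones are revisited in
# later passes. MASK_ID, the schedule, and the model interface are assumed.
import torch

MASK_ID = 0  # hypothetical placeholder ID for "not yet decided"

@torch.no_grad()
def parallel_decode(model, prompt_ids, num_image_tokens, steps=8):
    canvas = torch.full((1, num_image_tokens), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = model(torch.cat([prompt_ids, canvas], dim=1))
        logits = logits[:, -num_image_tokens:]      # image positions only
        conf, pred = logits.softmax(-1).max(-1)     # parallel predictions
        # Keep a growing fraction of the most confident tokens each pass;
        # everything else reverts to MASK and is re-predicted next pass.
        keep = max(1, num_image_tokens * (step + 1) // steps)
        threshold = conf.topk(keep, dim=-1).values[..., -1:]
        canvas = torch.where(conf >= threshold, pred,
                             torch.full_like(pred, MASK_ID))
    return canvas  # every position is resolved after the final pass
```

Replacing thousands of sequential decoding steps over the image tokens with a handful of parallel refinement passes is where the reported ~20× speedup comes from.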


---
Performance Highlights — From Editing to World Simulation
Any-to-Image Generation & Editing:
- Complex bilingual content, formulas
- Benchmark results surpass Gemini 1.5 Flash and Qwen-VL-Max
Semantic Understanding & World-Building:
- Logical consistency in generated worlds


Reasoning:
- Object replacement via numerical labels

Viewpoint Transformation:
- Bird’s-eye conversion with spatial awareness

Long-sequence Consistency:
- Coherent states/storylines across videos
---
Unique Capabilities in the "World Model" Category
- Visual Narrative
  - Coherent protagonist in multi-image stories
- Visual Guidance
  - Step-by-step, image+text instructions (e.g., folding clothes, growing kale)
- World Exploration
  - Scene navigation commands produce consistent exploration visuals
- Embodied Manipulation
  - Plan & visualize robotic-arm tasks step by step

---
Open Source Approach & Future Potential
- Model release with detailed technical report to invite global collaboration
- Emu3.5 trained with only 34B parameters and <1% of public internet video data
- Expected breakthroughs as scale and data grow


---
Practical Applications for Creators
Platforms like the official AiToEarn site connect cutting-edge models with content monetization pipelines, enabling:
- AI content generation
- Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Instagram, YouTube, X, etc.)
- Analytics & AI model rankings
This ecosystem lets models like Emu3.5 move quickly from research demos to large-scale deployment and monetization.
---