Fei-Fei Li’s Latest Essay: AI Is Hot, but Possibly Headed Off Track
From Words to Worlds
Why AI Needs Spatial Intelligence to Move Beyond Language

AI excels at talking, but struggles to truly understand the world.
Recently, Google unveiled Gemini 3 Pro, sparking a wave of online buzz. The questions came quickly: Does it have more parameters? Can it handle longer context windows? Are we getting closer to AGI (Artificial General Intelligence)?
Renowned computer scientist Fei-Fei Li — U.S. National Academy of Engineering member and Stanford professor — offers a reality check.
On November 10, she published a detailed article arguing:
> Bigger models and better algorithms aren’t enough. Without world understanding, AI will never reach true intelligence.
---
The Problem with Today’s Large Language Models

Like Well-Read Scholars Who’ve Never Gone Outside
Think of ChatGPT, Gemini, DeepSeek, or Doubao — all powered by Large Language Models (LLMs).
LLMs predict the next word in a sequence.
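Example: You say “床前明月光” (“Before my bed, the bright moonlight”), the opening line of Li Bai’s “Quiet Night Thought”; the model guesses that the next line is “疑是地上霜” (“I suspect it is frost on the ground”).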
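To make "predict the next word" concrete, here is a minimal sketch in Python. It is a toy, not how production LLMs work: real models use neural networks over tokens, while this one only counts which word follows which in a tiny made-up corpus. The names `train_bigram` and `predict_next` and the corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        # Pair every word with the word that comes right after it.
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


# Hypothetical three-sentence corpus, invented for this example.
corpus = [
    "the cup falls to the floor",
    "the cup falls and shatters",
    "the cup sits on the table",
]
model = train_bigram(corpus)
print(predict_next(model, "cup"))
```

The toy predictor picks a continuation purely because it was frequent in the text it saw. Nothing in it represents a cup, a floor, or gravity.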
This word-prediction ability, trained on massive text datasets, enables LLMs to:
- Pass professional exams
- Solve complex math problems
But they falter at simple, physical reasoning:
- “How far is that car from the tree?”
- “Will this box fit in the trunk?”
Sometimes, they make absurd predictions — e.g., assuming a cup will float upward.
LLMs may know formulas, but they lack physical common sense. Fei-Fei Li calls them “wordsmiths in the dark.”
---
Why Hallucinations Happen
LLMs rely on statistical patterns in text, not real-world experience.
They could claim “The sun rises in the west” in fluent, grammatical prose, because next-word prediction optimizes for plausible-sounding text and nothing in it checks the claim against physics.
They’ve read thousands of books, but never touched the world outside.
---
Language Can Fabricate — The Physical World Doesn’t Lie

Enter Spatial Intelligence
Fei-Fei Li believes AI must develop spatial intelligence — the ability to understand and interact with the physical world.
Example: Drinking coffee
- Vision: Judging cup-to-mouth distance
- Motor control: Adjusting grip based on weight
- Touch: Avoiding burns
- Balance & motion: Keeping cup level
This process uses perception, imagination, and action — not verbal commands.
Spatial intelligence is key because true intelligence requires all of the following in changing, uncertain environments:
- Prediction
- Action
- Goal achievement
---
Learning Through Interaction
- Babies: Push over blocks → hear crash → learn cause & effect.
- Scientists: Watson and Crick built physical DNA models to reveal the double helix — insight wasn’t in the “words” but in spatial arrangement.
---
From Predicting Words to Predicting Frames

Fei-Fei Li advocates shifting AI from predicting the next word to predicting the next frame of reality.
Example: Letting go of a glass cup
- Human mind predicts: fall → impact → shatter
- Without reading about it, you know what happens.
Key Difference:
- Word prediction = grammatical logic
- Frame prediction = physical logic
This requires world models — spaces with consistent gravity, light, occlusion, and physics.
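As a rough illustration of "predicting the next frame" with physical logic, the sketch below rolls a dropped cup forward in time under constant gravity. It is a hand-written toy simulation, not a learned world model and not anything from World Labs. The `Frame` fields, the function names, the 1 m drop height, and the 0.05 s time step are all assumptions chosen for the example.

```python
from dataclasses import dataclass

GRAVITY = 9.81  # m/s^2, downward acceleration near Earth's surface


@dataclass
class Frame:
    time: float      # seconds since the cup was released
    height: float    # metres above the floor
    velocity: float  # m/s, negative means moving downward


def next_frame(frame, dt=0.05):
    """Predict the next frame from the current one with a fixed time step."""
    # Update velocity first, then move the cup using the new velocity.
    velocity = frame.velocity - GRAVITY * dt
    # The floor is at height 0; the cup cannot pass through it.
    height = max(0.0, frame.height + velocity * dt)
    return Frame(frame.time + dt, height, velocity)


def rollout(start_height, dt=0.05):
    """Generate frames from release until the cup reaches the floor."""
    frames = [Frame(0.0, start_height, 0.0)]
    while frames[-1].height > 0.0:
        frames.append(next_frame(frames[-1], dt))
    return frames


frames = rollout(1.0)
print(f"impact after about {frames[-1].time:.2f} s over {len(frames)} frames")
```

Each frame here follows from the previous one by a physical rule, not by what text usually comes next. A world model would have to learn rules like this from data instead of having them written in by hand.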
---
Challenges in Building World Models
Fei-Fei Li identifies two key obstacles:
- Finding the formula: LLMs succeed with “predict next word” simplicity; can we find an equally elegant equation for spatial intelligence?
- Finding the data: Requires massive spatial datasets; extracting 3D info from 2D video is an active research area.
---
Marble: A Glimpse of Spatial AI

Fei-Fei Li’s World Labs developed Marble — input text or a photo, get an explorable 3D space.
In a test with a single uploaded image, Marble inferred the chairs, desks, and room layout (still rough, but promising).
---
Potential Impact of Spatial Intelligence
- Robots at Home:
  - Avoid fragile vases
  - Dry wet floors before walking
  - Assist elderly care
- Industrial Applications:
  - Controllable video generation for ads & film
  - Virtual production efficiency (Sony partner reported 40× gains with Marble)
- Consumer Products:
  - Interactive interior design
  - 3D memory albums
  - VR therapy for phobias
- Synthetic Data Markets:
  - “Textbooks” for robots: domain-specific task data
---
AI Monetization in the Spatial Era
As spatial AI matures, tools that combine creation + cross-platform publishing will become essential.
Example: The AiToEarn official website offers:
- AI content generation
- Distribution across platforms like Douyin, Bilibili, Xiaohongshu, Instagram, YouTube, LinkedIn
- Analytics and AI model ranking
This could help spatial AI creators share interactive environments instantly and monetize their innovations globally.
---
Final Takeaways
Why AI still makes simple mistakes:
It “thinks” in statistical patterns, not through cause-and-effect in reality.
Fei-Fei Li’s proposal:
Shift from predicting text to predicting reality via spatial intelligence & world models.
Possible outcomes:
- Household robots with genuine awareness
- AI scientists finding new laws of nature
- Fully controllable, physically consistent creative tools
Current status:
- Marble is early-stage
- Formula for world models unknown
- Spatial datasets scarce
But now, the path to true intelligence looks clearer.
---
References
- From Words to Worlds: Spatial Intelligence is AI’s Next Frontier
- Google Developer Guide: Introduction to Large Language Models