Xie Saining, Fei-Fei Li, and Yann LeCun Team Up for the First Time! Introducing the New "Supersensing" Paradigm — AI Can Now Predict and Remember, Not Just See

Spatial Intelligence & Supersensing: The Next Frontier in AI

Leading AI researchers — Fei-Fei Li, Saining Xie, and Yann LeCun — have been highlighting a transformative concept: Spatial Intelligence.

This goes beyond simply “understanding images or videos.” It’s about:

  • Comprehending spatial structures
  • Remembering events
  • Predicting future outcomes

In essence, a truly capable AI should not only “see,” but also sense, understand, and actively organize experience — a core competency for future multimodal intelligence.

---

Introducing Cambrian-S: Spatial Supersensing in Video

The trio recently collaborated to publish Cambrian-S: Towards Spatial Supersensing in Video.

Their proposed paradigm — Supersensing — emphasizes that AI models must:

  • Observe, recognize, and answer
  • Remember and understand 3D structure
  • Predict future events
  • Organize experiences into an Internal World Model

Co-first author Shusheng Yang noted: supersensing demands active prediction, filtering, and organization of sensory inputs — not passive reception.

Saining Xie described Cambrian-S as their first step into spatial supersensing for video. Despite its length, the paper is rich with detail and groundbreaking perspectives — essential reading for those in video-based multimodal modeling.

---

Defining “Spatial Supersensing”

Background

Last year, Xie’s team released Cambrian-1 — focusing on multimodal image modeling.

Instead of rushing towards Cambrian-2 or Cambrian-3, they paused to reflect:

  • What truly is multimodal intelligence?
  • Is LLM-style perception modeling adequate?
  • Why is human perception intuitive yet powerful?

Their conclusion: Without Supersensing, there can be no Superintelligence.

---

What Supersensing Is — and Isn’t

Supersensing is not about better sensors or cameras. It’s how a digital lifeform truly experiences the world — continuously absorbing sensory streams and learning from them.

Key point: Human agents can solve abstract problems without perception, but to operate in the real world, AI agents require sensory modeling.

As Andrej Karpathy has remarked: sensory modeling may be all that intelligence requires.

---

The Five Levels of Supersensing

  1. No Sensory Capabilities
     • Example: LLMs, which operate purely on text/symbols without grounding in the physical world.
  2. Semantic Perception
     • Parsing pixels into objects, attributes, and relationships.
     • Strong performance on tasks like image captioning.
  3. Streaming Event Cognition
     • Handling continuous, real-time data streams.
     • Active interpretation of and response to ongoing events.
  4. Implicit 3D Spatial Cognition
     • Understanding video as a projection of the 3D world.
     • Answering "what exists," "where it is," and "how it changes over time."
     • Current video models are still lacking here.
  5. Predictive World Modeling
     • Anticipating possible future states based on prior expectations.
     • Using surprise as a driver for attention, memory, and learning.
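
For readers who prefer code, here is a minimal, purely illustrative sketch of this taxonomy as an ordered Python enum; the stage names below are paraphrases of the list above, not identifiers from the paper.

```python
from enum import IntEnum

class SensingStage(IntEnum):
    """Illustrative ordering of the capability stages described above."""
    NO_SENSING = 0                     # e.g. text-only LLMs, no grounding in the physical world
    SEMANTIC_PERCEPTION = 1            # pixels -> objects, attributes, relationships
    STREAMING_EVENT_COGNITION = 2      # continuous, real-time interpretation of events
    IMPLICIT_3D_SPATIAL_COGNITION = 3  # video understood as a projection of the 3D world
    PREDICTIVE_WORLD_MODELING = 4      # anticipate futures; surprise drives attention and memory

# The enum ordering mirrors the list above, from no sensing to predictive world modeling.
assert SensingStage.SEMANTIC_PERCEPTION < SensingStage.PREDICTIVE_WORLD_MODELING
```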

---

Why Predictive World Modeling Matters

The human brain continuously predicts potential world states. High prediction errors trigger attention and learning, but current multimodal systems lack such mechanisms.
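
One simple way to make this concrete (notation mine, not the paper's formulation): treat surprise at time $t$ as the prediction error between the model's predicted latent state and the encoding of what is actually observed, and let attention and learning fire only when that error is large:

$$
s_t = \lVert \hat{z}_t - z_t \rVert_2^2, \qquad \text{attend / update memory at } t \iff s_t > \tau
$$

Here $\hat{z}_t$ is the predicted latent, $z_t$ the observed one, and $\tau$ an attention threshold; all three symbols are illustrative assumptions.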

---

Evaluating Spatial Supersensing

Existing video benchmarks focus on language understanding and semantic perception, but neglect higher-level supersensing.

Even efforts like VSI-Bench address spatial perception only for short videos, missing unbounded, continuous visual streams.

---

VSI-SUPER: A New Benchmark

VSI-SUPER introduces two challenging tasks:

  • VSI-SUPER Recall (VSR)
    • Recall the locations of anomalous objects across long videos.
  • VSI-SUPER Count (VSC)
    • Continuously accumulate spatial information across long scenarios.

Unlike conventional tests, VSI-SUPER stitches short clips into arbitrarily long videos, challenging models to remember unbounded visual streams.
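
To make the construction concrete, here is a rough sketch of how one might stitch annotated short clips into an arbitrarily long counting sample; all names and fields are my own assumptions, not the authors' benchmark code.

```python
import random
from dataclasses import dataclass

@dataclass
class Clip:
    """A short annotated clip; object_counts holds per-clip ground truth, e.g. {"chair": 3}."""
    video_path: str
    object_counts: dict

def build_vsc_sample(clips: list, target_num_clips: int) -> dict:
    """Stitch short clips into one long video sample, VSC-style.

    The answer is the sum of per-clip counts, so a model must keep accumulating
    spatial information across the entire stream instead of answering from any
    single segment.
    """
    chosen = random.sample(clips, k=min(target_num_clips, len(clips)))
    totals = {}
    for clip in chosen:
        for name, count in clip.object_counts.items():
            totals[name] = totals.get(name, 0) + count
    target = random.choice(sorted(totals))
    return {
        "videos": [c.video_path for c in chosen],  # concatenated at evaluation time
        "question": f"Across the whole video, how many {target}s did you see in total?",
        "answer": totals[target],
    }
```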

Testing Gemini-2.5 Flash revealed high performance on standard benchmarks, yet failure on VSI-SUPER.

---

Cambrian-S: Addressing the Challenge

While data volume and model scale matter, the key missing ingredient is training data designed specifically for spatial cognition.

VSI-590K Dataset

  • 590,000 samples drawn from:
    • First-person indoor environments with 3D annotations
    • Simulator-generated scenes
    • Pseudo-labeled YouTube videos
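
As a purely hypothetical illustration of how such a mixture might be organized (the field and source names below are assumptions, not the released VSI-590K schema), each sample can carry a tag for its origin so that spatial QA pairs can be generated and audited per source:

```python
from dataclasses import dataclass

# Hypothetical source tags for the three data origins described above.
SOURCES = ("indoor_3d_annotated", "simulator", "youtube_pseudo_labeled")

@dataclass
class SpatialQASample:
    video_path: str
    question: str
    answer: str
    source: str  # one of SOURCES

def mixture_summary(samples: list) -> dict:
    """Count samples per source, e.g. to sanity-check the training mixture."""
    counts = {s: 0 for s in SOURCES}
    for sample in samples:
        counts[sample.source] += 1
    return counts
```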

Models from 0.5B to 7B parameters were trained, achieving up to +30% spatial reasoning gains over baseline MLLMs.

Still, VSI-SUPER remains unsolved — suggesting that LLM-style multimodal systems are not the ultimate path to supersensing.

---

Predictive Perception Prototype

Before turning to prediction, note what Cambrian-S already delivers:

  • Strong general video and image understanding
  • Leading spatial perception performance

Additional strengths:

  • Generalizes to unseen spatial tasks
  • Passes debiasing stress tests

On top of this foundation, the team prototyped predictive sensing, inspired by how humans anticipate the world.

---

Human Analogy

In baseball, players predict ball trajectory before their brain fully processes the visual input — a skill rooted in internal predictive world models.

This ability filters sensory overload by ignoring inputs that were accurately predicted and focusing on surprises.

---

Latent Frame Prediction

The team trained a latent frame prediction (LFP) head that:

  • Predicts the latent representation of the next input frame
  • Compares the prediction with the actual frame's latent (the difference is the "surprise" value)
  • Uses surprise for:
    • Memory management (skipping unremarkable frames)
    • Event segmentation (detecting scene boundaries)
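
A minimal sketch of how such a head and its surprise signal might be wired up (PyTorch pseudocode under my own assumptions; layer sizes, thresholds, and names are illustrative, not the released Cambrian-S implementation):

```python
import torch
import torch.nn as nn

class LatentFramePredictor(nn.Module):
    """Predicts the next frame's latent from the current one; the prediction
    error serves as a per-frame surprise score."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def surprise(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        # Mean-squared error between the predicted and the actual next-frame latent.
        return ((self.head(z_t) - z_next) ** 2).mean(dim=-1)

def filter_and_segment(latents: torch.Tensor, lfp: LatentFramePredictor,
                       keep_thresh: float = 0.1, boundary_thresh: float = 0.5):
    """Use surprise for memory management (drop well-predicted frames) and
    event segmentation (large spikes suggest scene boundaries).

    `latents` has shape [T, dim]; both thresholds are illustrative.
    """
    kept, boundaries = [0], []
    with torch.no_grad():
        for t in range(latents.shape[0] - 1):
            s = lfp.surprise(latents[t], latents[t + 1]).item()
            if s > keep_thresh:        # only surprising frames are written to memory
                kept.append(t + 1)
            if s > boundary_thresh:    # a very large surprise marks a new event
                boundaries.append(t + 1)
    return kept, boundaries
```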

Even a smaller model using this method beat Gemini on VSI-SUPER.

---

The team's other contributions include:

  • Multimodal benchmark design research that removes language bias
  • Simulator-based tools for collecting spatial perception video data

---



Summary:

Spatial Supersensing represents a paradigm shift in AI development — moving from passive perception to active, predictive, and structurally aware intelligence. Benchmarks like VSI-SUPER and models like Cambrian-S are critical steps toward this future, where AI can remember, predict, and organize experiences across unbounded multimodal streams.
