Saining Xie, Fei-Fei Li, and Yann LeCun Team Up for the First Time! Introducing the New "Supersensing" Paradigm: AI Can Now Predict and Remember, Not Just See

Spatial Intelligence & Supersensing: The Next Frontier in AI

Leading AI researchers — Fei-Fei Li, Saining Xie, and Yann LeCun — have been highlighting a transformative concept: Spatial Intelligence.

This goes beyond simply “understanding images or videos.” It’s about:

  • Comprehending spatial structures
  • Remembering events
  • Predicting future outcomes

In essence, a truly capable AI should not only “see,” but also sense, understand, and actively organize experience — a core competency for future multimodal intelligence.

---

Introducing Cambrian-S: Spatial Supersensing in Video

The trio recently collaborated to publish Cambrian-S: Towards Spatial Supersensing in Video.


Their proposed paradigm — Supersensing — emphasizes that AI models must:

  • Observe, recognize, and answer
  • Remember and understand 3D structure
  • Predict future events
  • Organize experiences into an Internal World Model

Co-first author Shusheng Yang noted: supersensing demands active prediction, filtering, and organization of sensory inputs — not passive reception.


Saining Xie described Cambrian-S as their first step into spatial supersensing for video. Despite its length, the paper is rich with detail and groundbreaking perspectives — essential reading for those in video-based multimodal modeling.


---

Defining “Spatial Supersensing”

Background

Last year, Xie’s team released Cambrian-1 — focusing on multimodal image modeling.

Instead of rushing towards Cambrian-2 or Cambrian-3, they paused to reflect:

  • What truly is multimodal intelligence?
  • Is LLM-style perception modeling adequate?
  • Why is human perception intuitive yet powerful?

Their conclusion: Without Supersensing, there can be no Superintelligence.

---

What Supersensing Is — and Isn’t

Supersensing is not about better sensors or cameras. It’s how a digital lifeform truly experiences the world — continuously absorbing sensory streams and learning from them.

Key point: agents can solve abstract problems without perception, but to operate in the real world, they require sensory modeling.

As Andrej Karpathy has remarked: sensory modeling may be all that intelligence requires.

---

The Five Levels of Supersensing

  • No Sensory Capability
    • Example: LLMs, which operate purely on text and symbols, without grounding in the physical world.
  • Semantic Perception
    • Parses pixels into objects, attributes, and relationships; strong performance on tasks like image captioning.
  • Streaming Event Cognition
    • Handles continuous, real-time data streams, actively interpreting and responding to ongoing events.
  • Implicit 3D Spatial Cognition
    • Understands video as a projection of the 3D world, answering "what exists," "where it is," and "how it changes over time." Current video models are still lacking here.
  • Predictive World Modeling
    • Anticipates possible future states based on prior expectations, using surprise as a driver for attention, memory, and learning.

---

Why Predictive World Modeling Matters

The human brain continuously predicts potential world states. High prediction errors trigger attention and learning, but current multimodal systems lack such mechanisms.
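One common way to formalize this (the notation below is ours, not the paper's) is to define surprise at time t as the prediction error between the predicted and actual latent states:

```latex
s_t = \left\lVert \hat{z}_t - z_t \right\rVert_2, \qquad \hat{z}_t = f_\theta(z_{1:t-1})
```

Here f_\theta is a learned predictor over the past latents z_{1:t-1}; a high s_t would trigger attention, memory writes, and learning updates, while low-surprise inputs can be compressed or discarded.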


---

Evaluating Spatial Supersensing

Existing video benchmarks focus on language understanding and semantic perception, but neglect higher-level supersensing.

Even efforts like VSI-Bench address spatial perception only for short videos, leaving unbounded, continuous visual streams untested.


---

VSI-SUPER: A New Benchmark

VSI-SUPER introduces two challenging tasks:

  • VSI-SUPER Recall (VSR): recall anomalous object locations in long videos.
  • VSI-SUPER Count (VSC): continuously accumulate spatial information across long scenarios.

Unlike conventional tests, VSI-SUPER stitches short clips into arbitrarily long videos, challenging models to remember unbounded visual streams.
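As a concrete illustration, a minimal Python sketch of this stitching idea follows; the class and function names are hypothetical, not the released benchmark code. Short clips are concatenated into one long stream while the ground-truth counts a VSC-style task must track are accumulated:

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    frames: list                                        # decoded frames of one short video
    object_counts: dict = field(default_factory=dict)   # e.g. {"chair": 3}

def build_long_stream(clips):
    """Concatenate short clips into one arbitrarily long stream and
    accumulate the ground-truth object counts a model must track."""
    frames, totals = [], {}
    for clip in clips:
        frames.extend(clip.frames)
        for obj, n in clip.object_counts.items():
            totals[obj] = totals.get(obj, 0) + n
    return frames, totals
```

Because the stream can be made as long as desired, any model with a fixed context window eventually cannot hold the full history, which is exactly the failure mode the benchmark probes.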

Testing Gemini-2.5 Flash revealed high performance on standard benchmarks, yet failure on VSI-SUPER.


---

Cambrian-S: Addressing the Challenge

While data and scale matter, the key missing factor is training data designed for spatial cognition.

VSI-590K Dataset

  • 590,000 samples drawn from:
    • first-person indoor environments (with 3D annotations)
    • simulators
    • pseudo-labeled YouTube videos (see the sketch below)
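How the YouTube pseudo-labels were produced is not detailed here; as a generic illustration only (the `detector` callable and its `score` attribute are hypothetical, not the actual VSI-590K tooling), a confidence-filtered labeling pass might look like:

```python
def pseudo_label(video_frames, detector, min_confidence=0.8):
    """Keep only high-confidence detections from an off-the-shelf model
    as pseudo ground-truth spatial annotations."""
    labels = []
    for frame in video_frames:
        detections = detector(frame)  # detections with confidence scores
        labels.append([d for d in detections if d.score >= min_confidence])
    return labels
```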

Models from 0.5B to 7B parameters were trained, achieving spatial reasoning gains of up to 30% over baseline MLLMs.


Still, VSI-SUPER remains unsolved, suggesting that LLM-style multimodal systems are not the ultimate path to supersensing.


---

Cambrian-S Performance

Cambrian-S features:

  • Strong general video/image understanding
  • Leading spatial perception performance

Additional strengths:

  • Generalizes to unseen spatial tasks
  • Passes debiasing stress tests

---

Human Analogy

In baseball, players predict ball trajectory before their brain fully processes the visual input — a skill rooted in internal predictive world models.


This ability filters sensory overload: inputs that match predictions are ignored, while surprising ones draw attention.

---

Latent Frame Prediction

The team trained a latent frame prediction (LFP) head that:

  • Predicts the next input frame in latent space
  • Compares the prediction with the actual frame (the difference is a "surprise" value)
  • Uses surprise, as sketched below, for:
    • memory management (skip unremarkable frames)
    • event segmentation (detect scene boundaries)
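A minimal sketch of this surprise-driven loop, assuming `encoder` and `lfp_head` are existing modules (all names are illustrative; the paper's actual implementation differs):

```python
import torch

def process_stream(frames, encoder, lfp_head, threshold=0.5):
    """Surprise-driven streaming: store only surprising frames in memory
    and treat surprise spikes as candidate event boundaries."""
    memory, boundaries = [], []
    prev_latent = None
    for t, frame in enumerate(frames):
        z = encoder(frame)                 # latent for the current frame
        if prev_latent is not None:
            z_hat = lfp_head(prev_latent)  # predicted latent for frame t
            surprise = torch.norm(z_hat - z).item()
            if surprise > threshold:
                memory.append(z)           # keep the surprising frame
                boundaries.append(t)       # candidate scene boundary
            # low-surprise frames are skipped: the prediction already covers them
        prev_latent = z
    return memory, boundaries
```

In practice, memory management and event segmentation could use separate statistics (for example, a running mean of surprise rather than a fixed threshold).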

Even a smaller model using this method beat Gemini on VSI-SUPER.

---

Other Contributions

  • Research on multimodal benchmark design, including removing language bias
  • Simulator tools for collecting spatial-perception video data

---


Connection to AI Content Platforms

Advanced cognitive capabilities like predictive perception could eventually power AI-driven content creation.

Platforms such as AiToEarn allow:

  • AI content generation
  • Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, YouTube, LinkedIn, Instagram, Pinterest, Threads, X)
  • Integrated analytics & model ranking

Read AiToEarn Blog | Open-source Repo

---

Summary

Spatial Supersensing represents a paradigm shift in AI development — moving from passive perception to active, predictive, and structurally aware intelligence. Benchmarks like VSI-SUPER and models like Cambrian-S are critical steps toward this future, where AI can remember, predict, and organize experiences across unbounded multimodal streams.
