# From Detection to General Perception: Building the Foundations of Spatial Intelligence
## 1. Importance of Visual Perception & Object Understanding
Visual perception is a **core capability** for AI systems to interact with the physical world, and a necessary step toward **general intelligence**.
This article is adapted from a presentation given in June at **AICon 2025 Beijing** by the Chair Scientist of the IDEA Research Institute’s Computer Vision & Robotics Research Center:
*"From Detection to General Perception: Building the Foundations of Spatial Intelligence"*.
**Key Focus Areas of the Talk:**
- Comparing **language-native** vs. **vision-native** model architectures.
- Evolution of **Transformer-based object detection** from **DETR** to **DINO**.
- Advances in **open-set detection**: **Grounding DINO**, **DINO-X**, and extensions to keypoint localization, attribute understanding, and 3D perception.
- Using detection technologies as the foundation for **spatial intelligence** and real-world AI applications.
---
### AICon 2025 Theme Preview
At the December 19–20 **AICon Beijing** conference:
- **Frontier topics**: Large model training & inference, AI Agents, new R&D paradigms, organizational transformation.
- **Goal**: Building trustworthy, scalable, and commercializable agentic operating systems.
- **Benefit for enterprises**: Cost reduction, efficiency improvement, growth breakthrough.
**Full schedule:** [AICon Beijing 2025](https://aicon.infoq.cn/202512/beijing/schedule)
---
## 1.1 Visual Perception: Humans vs. Machines
- **Humans and animals** interacted with the world long before language, relying on sight and action.
- **Machines** likewise need *visual perception* before they can act in the world.
- Early interactions were via programming; natural language AI (e.g., GPT) opened usage to non-programmers.
- **Even with language**, machines **must** have strong visual capabilities for real-world action.
## 1.2 Two Primary Technical Routes
1. **Language Understanding**
2. **Visual Understanding**
Convergence is happening via **multimodal AI**.
Language without vision is inadequate; vision without language misses human-machine synergy.

---
## 1.3 Object Detection Basics
Object detection answers: *Does an image contain a specific object? If so, where?*
**Milestones Timeline:**
- **2001 — Viola–Jones algorithm**
  - Efficient real-time face detection.
  - Used for autofocus in digital cameras.
- **Faster R-CNN era**
  - Enabled breakthroughs in autonomous driving.
  - Introduced accurate environmental perception (a minimal usage sketch follows this timeline).
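
To make the closed-set paradigm concrete, here is a minimal sketch using torchvision's pretrained Faster R-CNN; the image path and score threshold are placeholders, and the model can only name the fixed COCO classes it was trained on:

```python
# Minimal closed-set detection sketch using torchvision's Faster R-CNN.
# The model can only output the fixed set of classes it was trained on (COCO).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street.jpg").convert("RGB")   # any test image (placeholder path)
inputs = [to_tensor(image)]                       # list of CHW float tensors in [0, 1]

with torch.no_grad():
    outputs = model(inputs)                       # one dict per image

# Each output dict answers "is there an object, and where?" for the known classes only.
for box, label, score in zip(outputs[0]["boxes"], outputs[0]["labels"], outputs[0]["scores"]):
    if score > 0.5:                               # confidence threshold (illustrative)
        print(f"class_id={label.item():3d}  score={score.item():.2f}  box={box.tolist()}")
```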
Modern direction: detection + **open-set capability** + **zero-shot learning**, enabling detection of unknown objects via language prompts.
---
## 1.4 Transformer Architecture in Vision
Pre-2020 → Transformers mainly used for **language**.
**2020's DETR (DEtection TRansformer)** brought Transformers to **vision**, but it faced two problems:
1. Slow convergence (roughly 10× slower to train than CNN detectors).
2. Lower accuracy than CNN-based methods.
Our **DINO** series solved these:
- Faster convergence.
- Achieved **SOTA** performance in object detection.

---
## 2. Language-Native vs. Vision-Native Architectures
**Language-native:**
- Autoregressive decoder → outputs one token at a time.
- Needs more data and lacks built-in **translation invariance**.
**Vision-native:**
- Decodes all detection queries in parallel.
- Built-in translation invariance → efficient learning with fewer parameters (<1B). The two decoding styles are contrasted in the sketch below.
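
A shape-level sketch of the contrast, with illustrative dimensions and random tensors standing in for real model outputs: the language decoder produces one token per step, while the detection decoder predicts every query's class and box in a single pass:

```python
# Shape-level sketch (illustrative, not a real model): autoregressive token decoding
# vs. parallel decoding of a fixed set of detection queries.
import torch

batch, vocab, num_queries, num_classes, dim = 2, 32000, 900, 80, 256

# Language-native: one token per step; step t depends on all previous steps.
tokens = torch.zeros(batch, 0, dtype=torch.long)
for t in range(16):                                  # 16 decoding steps -> 16 tokens
    logits = torch.randn(batch, vocab)               # stand-in for decoder(tokens, ...)
    next_token = logits.argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)  # sequential: O(length) forward passes

# Vision-native (DETR-style): all queries decoded in ONE forward pass.
queries = torch.randn(batch, num_queries, dim)       # learned detection queries
class_logits = torch.randn(batch, num_queries, num_classes)  # parallel class predictions
boxes = torch.rand(batch, num_queries, 4)            # parallel (cx, cy, w, h) predictions

print(tokens.shape, class_logits.shape, boxes.shape)
```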

---
### Our Key Contributions
1. **DAB-DETR** — improved query modeling, with decoder queries formulated as explicit 4D anchor boxes (sketched below).
2. Faster-convergence training methods for DETR.
3. Engineering integration → **true SOTA** on the COCO benchmark.
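
A hedged sketch of the query-modeling idea behind DAB-DETR as described in the paper, not the released implementation: each decoder query carries an explicit 4D anchor box that is refined layer by layer:

```python
# Illustrative sketch of DAB-DETR-style queries: each query is a learnable
# 4D anchor box (cx, cy, w, h) refined across decoder layers. Not the official code.
import torch
import torch.nn as nn

class AnchorBoxQueries(nn.Module):
    def __init__(self, num_queries=300, dim=256, num_layers=6):
        super().__init__()
        # Queries are explicit boxes, not opaque embedding vectors.
        self.anchors = nn.Parameter(torch.randn(num_queries, 4))
        self.content = nn.Parameter(torch.randn(num_queries, dim))
        self.layers = nn.ModuleList(
            nn.Linear(dim, 4) for _ in range(num_layers)   # stand-in for decoder layers
        )

    def forward(self):
        boxes, outputs = self.anchors, []
        for layer in self.layers:
            delta = layer(self.content)           # box offsets predicted from content features
            boxes = boxes + delta                 # iterative refinement, layer by layer
            outputs.append(torch.sigmoid(boxes))  # (cx, cy, w, h) normalized to [0, 1]
        return outputs                            # one refined box set per decoder layer

refined = AnchorBoxQueries()()
print(len(refined), refined[-1].shape)            # 6 layers, (300, 4) boxes
```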

---
## 3. From Closed-Set to Open-Set Detection
### Closed-set constraints:
- Trained on fixed classes → cannot detect unseen categories.
- Retraining needed for new categories.
### Open-set advantages:
- Uses **language embeddings** for classification, so any phrase can serve as a category (see the sketch after this list).
- Detects objects never seen during training.
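
A minimal sketch of how such language-based classification typically works; the encoders here are random stand-ins and the category phrases are just examples. Region features are matched against text embeddings, so adding a new category means encoding one more phrase rather than retraining:

```python
# Sketch of open-set classification: compare region features with text embeddings
# instead of using a fixed classifier head. Encoders are random stand-ins.
import torch
import torch.nn.functional as F

dim = 256
categories = ["fire hydrant", "forklift", "welding spot", "oil barrel pull ring"]

region_features = F.normalize(torch.randn(12, dim), dim=-1)               # 12 candidate regions
text_embeddings = F.normalize(torch.randn(len(categories), dim), dim=-1)  # one per phrase

# Similarity between every region and every text phrase; adding a new category
# only requires encoding one more phrase and appending a row.
logits = region_features @ text_embeddings.T                              # (12, 4)
scores, labels = logits.max(dim=-1)

for i, (s, l) in enumerate(zip(scores, labels)):
    print(f"region {i}: best match '{categories[l.item()]}' (similarity {s.item():.2f})")
```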
### Our Projects:
- **Mask DINO** — integrates segmentation & detection.
- **Grounding DINO** — detects via natural language prompts.
- **T-Rex2** — long-tail detection via visual prompts.

---
### Grounding DINO → Grounded SAM
- Detect objects from free-form text or noun lists.
- SAM then segments each detected bounding box.
- Simplifies image editing driven by language commands (pipeline sketched below).
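
A data-flow sketch of the Grounded SAM pipeline; `detect_with_text` and `segment_box` are hypothetical wrappers standing in for the actual Grounding DINO and SAM calls, and the threshold is illustrative:

```python
# Pipeline sketch for Grounded SAM: text -> boxes (Grounding DINO) -> masks (SAM).
# The two functions below are hypothetical placeholders that only show the data flow.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    box: List[float]      # [x0, y0, x1, y1] in pixels
    phrase: str           # the noun phrase that matched
    score: float

def detect_with_text(image, prompt: str) -> List[Detection]:
    """Placeholder for Grounding DINO: open-set detection from a text prompt."""
    raise NotImplementedError

def segment_box(image, box: List[float]):
    """Placeholder for SAM: segment the object inside a bounding-box prompt."""
    raise NotImplementedError

def grounded_sam(image, prompt: str):
    masks = []
    for det in detect_with_text(image, prompt):   # e.g. "the dog on the left . bench"
        if det.score > 0.35:                      # box threshold (illustrative)
            masks.append((det.phrase, segment_box(image, det.box)))
    return masks                                  # language command -> editable masks
```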

---
### DINO-X
- Trained on **10× more data**: hundreds of millions of images.
- Universal detection across domains.
- Simplifies downstream tasks (pose estimation, segmentation).

**Training Challenges:**
Vision requires **object-level annotations** (bounding boxes, labels), which are far costlier to collect than typical image-text pairs; compare the two records below.
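
To make that cost difference concrete, here is a COCO-style object-level record next to an image-text pair; the field values are invented for illustration:

```python
# Object-level annotation (COCO-style): every object needs a box, a category,
# and often a polygon mask -- far denser supervision than a single caption per image.
object_level_annotation = {
    "image_id": 42,
    "category_id": 3,                      # e.g. "car" in the dataset's label map
    "bbox": [120.0, 86.5, 64.0, 48.0],     # [x, y, width, height] in pixels
    "segmentation": [[120.0, 86.5, 184.0, 86.5, 184.0, 134.5, 120.0, 134.5]],
    "iscrowd": 0,
}

# Image-text pair: one free-form caption, no localization at all.
image_text_pair = {
    "image_id": 42,
    "caption": "a silver car parked next to a fire hydrant on a quiet street",
}
```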
---
## 3.1 DINO-X Capabilities Demo
1. Text-prompt detection.
2. Prompt-free detection.
3. Visual-prompt detection.
4. Segmentation + detection.
5. Human pose keypoints.
6. Hand keypoints — high precision.

---
We lead in COCO & LVIS benchmarks.

---
## 4. Toward General Perception
### Visual Prompt Optimization:
- **User input**: 10–20 images with bounding boxes.
- Only a single **embedding vector** is optimized; the detector itself stays frozen.
- Achieves industrial-grade accuracy (>99% recall/precision). A training sketch follows this list.
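
A hedged PyTorch sketch of the procedure as described above, assuming the detector can be conditioned on a prompt embedding; the detector call and loss are placeholders rather than the production implementation:

```python
# Sketch of visual-prompt tuning: freeze the detector, optimize only a prompt
# embedding on a handful of annotated images. Detector and loss are placeholders.
import torch

dim, steps = 256, 200
prompt_embedding = torch.zeros(1, dim, requires_grad=True)     # the ONLY trainable tensor
optimizer = torch.optim.AdamW([prompt_embedding], lr=1e-3)

def frozen_detector(image, prompt):
    """Placeholder: returns boxes/scores conditioned on the prompt embedding."""
    raise NotImplementedError

def detection_loss(predictions, gt_boxes):
    """Placeholder: box regression + classification loss against the user's boxes."""
    raise NotImplementedError

def tune_prompt(dataset):                   # dataset: 10-20 (image, gt_boxes) pairs
    for _ in range(steps):
        for image, gt_boxes in dataset:
            preds = frozen_detector(image, prompt_embedding)
            loss = detection_loss(preds, gt_boxes)
            optimizer.zero_grad()
            loss.backward()                 # gradients flow only into the embedding
            optimizer.step()
    return prompt_embedding.detach()        # deploy: detector weights remain untouched
```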
Industrial examples:
- Component detection (appliance, automotive).
- Welding spot detection.
- Oil barrel & pull ring detection.

---
Future expansion:
- More scenarios and faster edge deployment.
- 3D attributes and environment understanding.
- Adjective-based detection (e.g., “all people wearing white”).
---
## 5. From General Perception to Spatial Intelligence
### Our Extensions:
- 2D keypoints → full 3D meshes.
- Detect, mask, and reconstruct objects in 3D to allow rotation and viewpoint changes.
- Apply detection to point clouds → object-level segmentation in 3D space (back-projection sketched below).
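
One common way to turn a 2D detection into an object-level 3D segment, sketched under the assumption that a registered depth map and camera intrinsics are available (a generic back-projection, not necessarily the exact method used in our work):

```python
# Sketch: lift pixels inside a 2D instance mask into a 3D point cloud, giving an
# object-level segment in 3D. Assumes a registered depth map and known intrinsics.
import numpy as np

def mask_to_points(mask: np.ndarray, depth: np.ndarray, fx, fy, cx, cy):
    """mask: (H, W) bool, depth: (H, W) meters -> (N, 3) points in the camera frame."""
    v, u = np.nonzero(mask & (depth > 0))   # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                   # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Toy example with made-up values.
mask = np.zeros((480, 640), dtype=bool)
mask[200:240, 300:360] = True               # a 40 x 60 pixel object mask
depth = np.full((480, 640), 2.0)            # flat 2 m depth for illustration
points = mask_to_points(mask, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(points.shape)                         # (2400, 3) object points
```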

---
### Attribute-based Queries:
- “People wearing purple.”
- “People standing under the climbing wall but not sitting.”
- Such queries require precise **language-vision alignment**.



---
## 6. Summary & Outlook
Detection often needs **reasoning** integration:
- **DINO-X + MCP** → feed detection output to LLM (e.g., Claude 4.0) for contextual analysis.
Examples:
- Describing the tallest person’s clothing.
- **Calorie estimation**: precise food counts → better nutritional analysis. A prompt-composition sketch follows.
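
A simplified sketch of the pattern: detections are serialized as structured context and handed to a language model, which then answers grounded questions. The payload values and the question are made up, and the MCP/LLM wiring is left as a placeholder:

```python
# Sketch of "detector output -> LLM reasoning": serialize detections as structured
# context, then let a language model answer questions grounded in that context.
# Payload values are invented; wiring this through MCP or a specific LLM API is omitted.
import json

detections = [
    {"label": "person", "box": [102, 40, 180, 410], "height_px": 370, "clothing": "white shirt"},
    {"label": "person", "box": [300, 95, 360, 330], "height_px": 235, "clothing": "blue jacket"},
    {"label": "pizza slice", "box": [420, 500, 520, 580]},
]

question = "Describe what the tallest person is wearing."

prompt = (
    "You are given object detections from an image as JSON. "
    "Answer the question using ONLY this structured evidence.\n\n"
    f"Detections:\n{json.dumps(detections, indent=2)}\n\nQuestion: {question}"
)

# response = llm.generate(prompt)   # placeholder: any chat/completion API can be used here
print(prompt)
```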




---
### Spatial Intelligence Context
- Advocated by **Prof. Fei-Fei Li**.
- **Digital Cousin** project: image detection → simulation environment creation.
- Step 1 uses **Grounding DINO**.

---
We aim to:
- Extend detection to segmentation, keypoints, 3D structures.
- Support **full spatial intelligence**: understanding objects, attributes, and environments.

---
**Try our Playground:** [DINO-X Online Demo](https://cloud.deepdataspace.com/playground/dino-x)
---