Trained on 1.55 Million Synthesized Video–Language Pairs: GVE Learns 9 Video Retrieval Skills at Once
Quantum Bit | QbitAI
Breaking the Bottleneck in Video Retrieval
Video retrieval research has hit a self-reinforcing bottleneck:
For years, narrow-domain benchmarks such as MSRVTT have dominated, steering models toward coarse-grained text queries.
This led to:
- Biased training data
- Limited capabilities
- Poor handling of fine-grained semantics
- Weak long-context understanding
- Inability to handle complex multi-modal queries
✅ Proposed solution: restructure retrieval from a "task-specific" to a "universal" paradigm.
---
The UVR and UVRB Frameworks
Researchers from HKUST (Guangzhou) and Alibaba DAMO Academy’s Tongyi Lab:
- Introduced the Universal Video Retrieval (UVR) concept
- Built UVRB, a comprehensive benchmark covering:
- 16 datasets
- Multiple tasks and domains
- Synthesized 1.55M high-quality, diverse video-language training pairs
- Designed a modality pyramid curriculum training strategy for large multimodal foundation models
Result: General Video Embedding (GVE) model
- Available in 3B and 7B parameter versions
- Surpassed 14 mainstream models in true zero-shot evaluations
- Strongest generalization reported to date
---

Limitations of Mainstream Video Retrieval Models
Examples:
Microsoft’s CLIP4Clip, Shanghai AI Lab’s InternVideo2, Kuaishou’s Unite
Performance:
Good on MSRVTT but constrained to simple text–video match tasks.
Issues:
- Short, generic captions (“a person dancing”)
- Cannot handle complex, multi-modal queries, such as:
- Text + image combination
- Example clip search
- Spatial relations (“red-shirt person on left”)
- Temporal dynamics (“jump and then land”)
- Partially matching semantics (“any mention of ‘drone’”)
---
Universal Video Retrieval (UVR) Definition
Task Types
- TXT: Text-only
- CMP: Text + image combination
- VIS: Vision-only
Domains
- Coarse-grained (CG)
- Fine-grained (FG)
- FG subtypes:
- S: Spatial
- T: Temporal
- PR: Partial relevance
- Long-context (LC)

Example:
- TXT+S: “couple walking a dog in a vlog”
- CMP+T: Image + temporal change description (“person walks into a house”)
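
To make the taxonomy concrete, below is a minimal sketch of how a UVR query could be represented in code, pairing a task type with its domain tags. The class and field names are illustrative assumptions, not part of the released codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class TaskType(Enum):
    TXT = "text_only"       # natural-language query
    CMP = "composed"        # text + reference image
    VIS = "vision_only"     # the query is itself an image or clip


class Domain(Enum):
    CG = "coarse_grained"            # short, generic descriptions
    FG_S = "fine_grained_spatial"    # objects, positions, attributes
    FG_T = "fine_grained_temporal"   # action order, state changes
    FG_PR = "partial_relevance"      # only part of the video must match
    LC = "long_context"              # long queries or long videos


@dataclass
class UVRQuery:
    """One retrieval query in this illustrative UVR taxonomy."""
    task: TaskType
    domains: List[Domain]
    text: Optional[str] = None        # used by TXT and CMP queries
    image_path: Optional[str] = None  # used by CMP and VIS queries


# The two examples from the text:
q1 = UVRQuery(TaskType.TXT, [Domain.FG_S], text="couple walking a dog in a vlog")
q2 = UVRQuery(TaskType.CMP, [Domain.FG_T],
              text="person walks into a house", image_path="reference_frame.jpg")
```
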
---
Universal Video Retrieval Benchmark (UVRB)
- Covers 3 tasks, 3 domains, 3 FG sub-domains
- 9 capabilities tested
- Reveals biases in existing models
- Challenges the illusion of “benchmark saturation”

---
V-SynFlow Data Pipeline
Three-stage synthesis:
- Multi-granularity quality filtering
- Denoising
- Consistency checks
- MLLM-driven semantic enrichment
- Spatial, temporal, topical, and stylistic descriptions
- Expanded synthesis
- Text+image → video
- Frame → video
- Segment → video
Modality coverage: TXT→video, IMG→video, TXT+IMG→video, VIDEO→video
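
A minimal sketch of how such a three-stage pipeline could be wired together is shown below. The function names, thresholds, dictionary keys, and the `mllm.describe` call are assumptions for illustration, not the released V-SynFlow implementation.

```python
from typing import Dict, List

def filter_clips(raw_clips: List[Dict]) -> List[Dict]:
    """Stage 1 (assumed): drop noisy clips and keep only caption-consistent ones."""
    return [c for c in raw_clips
            if c["duration_s"] >= 2.0 and c["caption_consistency"] >= 0.7]

def enrich_with_mllm(clip: Dict, mllm) -> Dict:
    """Stage 2 (assumed): ask an MLLM for spatial, temporal, topical and stylistic text."""
    prompt = ("Describe the spatial layout, temporal changes, topic and style "
              "of this video in four separate sentences.")
    clip["rich_caption"] = mllm.describe(clip["frames"], prompt)  # hypothetical MLLM call
    return clip

def expand_pairs(clip: Dict) -> List[Dict]:
    """Stage 3 (assumed): derive several query -> video training pairs from one clip."""
    mid_frame = clip["frames"][len(clip["frames"]) // 2]
    return [
        {"query": clip["rich_caption"], "target": clip["video_id"]},   # TXT -> video
        {"query": (clip["frames"][0], clip["rich_caption"]),
         "target": clip["video_id"]},                                  # TXT+IMG -> video
        {"query": mid_frame, "target": clip["video_id"]},              # IMG -> video
        {"query": clip["frames"][:4], "target": clip["video_id"]},     # segment (VIDEO) -> video
    ]
```
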

---
GVE Model Design (Based on Qwen2.5-VL)
Architecture:
- Backbone: Qwen2.5-VL
- Frozen visual encoder
- LoRA fine-tuning on the LLM part only
Input Fusion:
- Supports text/image/video
- Injects visual features via special tokens
Representation Extraction:
- Use final token’s hidden state
- L2-normalized for retrieval
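
A rough sketch of this extraction step, assuming a generic Hugging Face-style multimodal backbone (the exact Qwen2.5-VL prompt template, padding handling, and API are not reproduced here):

```python
import torch
import torch.nn.functional as F

def extract_embedding(backbone, inputs) -> torch.Tensor:
    """Run the multimodal backbone, take the final token's hidden state,
    and L2-normalise it so cosine similarity reduces to a dot product."""
    with torch.no_grad():
        out = backbone(**inputs, output_hidden_states=True)  # assumed HF-style interface
    last_hidden = out.hidden_states[-1]   # (batch, seq_len, dim) from the last layer
    emb = last_hidden[:, -1, :]           # hidden state of the final token
    return F.normalize(emb, p=2, dim=-1)  # unit-norm embedding

# Retrieval then scores a query against candidate videos with a dot product:
# scores = query_emb @ video_embs.T
```
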
Training Objective:
- Symmetric InfoNCE loss
- Hard negative mining
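
Below is a compact sketch of a symmetric InfoNCE objective that scores in-batch positives against in-batch and mined hard negatives. This is a standard formulation; the paper's exact temperature and mining scheme are not specified here.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(q, v, hard_neg, tau: float = 0.05) -> torch.Tensor:
    """
    q:        (B, D) L2-normalised query embeddings
    v:        (B, D) L2-normalised positive video embeddings, aligned by index
    hard_neg: (B, D) L2-normalised mined hard-negative video embeddings
    """
    B = q.size(0)
    candidates = torch.cat([v, hard_neg], dim=0)     # (2B, D): positives + hard negatives
    logits_q2v = q @ candidates.T / tau              # (B, 2B); column i is the positive for row i
    logits_v2q = v @ q.T / tau                       # (B, B) reverse direction, in-batch only
    labels = torch.arange(B, device=q.device)
    loss_q2v = F.cross_entropy(logits_q2v, labels)   # query -> video term
    loss_v2q = F.cross_entropy(logits_v2q, labels)   # video -> query term
    return 0.5 * (loss_q2v + loss_v2q)
```
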
Curriculum Learning Strategy:
- Build fundamental abilities first (e.g., object recognition)
- Progress to complex tasks (e.g., temporal reasoning)
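
A minimal sketch of staged curriculum sampling under this idea; the stage names, mixing ratios, and step counts are illustrative assumptions, not the released schedule.

```python
# Illustrative schedule: early stages emphasise simpler capabilities,
# later stages shift the sampling mix toward harder, compositional tasks.
CURRICULUM = [
    {"stage": 1, "mix": {"coarse_text": 0.7, "spatial": 0.3}},
    {"stage": 2, "mix": {"coarse_text": 0.3, "spatial": 0.4, "temporal": 0.3}},
    {"stage": 3, "mix": {"spatial": 0.2, "temporal": 0.3,
                         "composed": 0.3, "partial_relevance": 0.2}},
]

def sampling_weights(step: int, steps_per_stage: int = 10_000) -> dict:
    """Return the data-mixing weights to use at the given training step."""
    idx = min(step // steps_per_stage, len(CURRICULUM) - 1)
    return CURRICULUM[idx]["mix"]
```
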


---
Evaluation & Results
Benchmarks: UVRB — 16 datasets
Baselines:
14 mainstream models:
- CLIP-based (87M – 8.3B params)
- MLLM-based (e.g., GME-7B, Unite-7B, B3-7B)
GVE Advantages:
- Avoided training on any evaluation-domain data
- True zero-shot
- 8 uniform frames sampled
- No audio/speech/metadata
- Cosine similarity for retrieval
- Multi-image embeddings for non-video models
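
A sketch of the zero-shot protocol above, i.e. uniform frame sampling plus cosine-similarity ranking over unit-norm embeddings (decoding and preprocessing details are assumed):

```python
import numpy as np

def uniform_frame_indices(num_frames: int, k: int = 8) -> np.ndarray:
    """Pick k uniformly spaced frame indices from a video with num_frames frames."""
    return np.linspace(0, num_frames - 1, k).astype(int)

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """
    query_emb:  (D,)   L2-normalised query embedding
    video_embs: (N, D) L2-normalised candidate video embeddings
    Returns candidate indices sorted by cosine similarity, highest first.
    """
    scores = video_embs @ query_emb   # dot product == cosine similarity for unit vectors
    return np.argsort(-scores)

# R@1 then checks whether the ground-truth video sits at rank 0.
```
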
Performance:
- GVE-7B: Avg R@1 = 0.573 (vs Unite-7B’s 0.538 → +6.5%)
- GVE-3B: Avg R@1 = 0.544 (beats Unite-7B despite fewer parameters)

---
Ablation Insights
- The 1.55M-pair synthetic dataset (UVRD): +27% relative improvement on complex CMP tasks
- Modality Pyramid Curriculum: GVE-7B overall ↑ from 0.594 → 0.600
- Combined gains: +1.8%–3.1% overall performance

---
Four Key Findings
1. Traditional Benchmarks Are Misleading
Benchmarks like MSRVTT show only a 0.58 correlation with overall capability.
Partial-relevance (PR) tasks show a 0.97 correlation, the strongest single indicator of embedding quality.
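
For context, a correlation like this is typically computed across models: each model's score on one benchmark versus its average over the nine UVRB capabilities. A sketch of that analysis is below; the paper may use a different correlation measure, so treat this as an assumption.

```python
import numpy as np

def benchmark_vs_overall_correlation(benchmark_scores: np.ndarray,
                                     overall_scores: np.ndarray) -> float:
    """
    benchmark_scores: per-model R@1 on one benchmark (e.g. MSRVTT), shape (num_models,)
    overall_scores:   per-model mean over the nine UVRB capabilities, shape (num_models,)
    Returns the Pearson correlation between the two sets of model scores.
    """
    return float(np.corrcoef(benchmark_scores, overall_scores)[0, 1])
```
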
---
2. Spatial vs. Temporal Decoupling
- Models handle objects/positions but fail at action sequences
- Spatial–Temporal correlation: 0.12
- Temporal understanding is decisive (0.98 correlation)
---
3. Architecture Influences Capability Path
- CLIP models: Strong coarse-grained spatial (0.99), weak temporal
- MLLM models: Balanced, better semantic reasoning and temporal coupling
---
4. Size Isn’t Everything
- The small CLIP4Clip (87M parameters) beats the far larger Unite-7B (≈8B parameters) on pure visual tasks
- Visual task correlation to overall retrieval: only 0.26
---
Experimental Goal
Test if general video retrieval emerges from:
- Better evaluation systems
- Cleaner training data
- Smarter learning strategies
Outcome:
High-quality synthetic UVRD + pyramid curriculum → significant generalization gains.
---
Capability Structure Insights
Analysis shows:
- Classic benchmarks ≠ overall capability
- Spatial and temporal embeddings are decoupled
- Architecture shapes evolution path
---
Toward the Next Era
Shift from “matching titles” to “understanding content”:
- Richer training signals
- Explicit inter-task dependency modeling
- New evaluation standards
---
Open Sourcing
- UVRB benchmark
- GVE models
- V-SynFlow pipeline
- Modality pyramid curriculum
Goal: Move from leaderboard chasing to capability diagnosis.
---
Paper: arXiv:2510.27571
Project: Homepage
Models & Data: HuggingFace Collection