Train the Model with 1.55 Million Simulated Videos: GVE Learns 9 Video Retrieval Skills at Once

Quantum Bit|QbitAI

Breaking the Bottleneck in Video Retrieval

Video retrieval research has hit a self-reinforcing bottleneck:

For years, narrow-domain benchmarks such as MSRVTT have dominated, optimizing models for coarse-grained text queries.

This led to:

  • Biased training data
  • Limited capabilities
  • Poor handling of fine-grained semantics
  • Weak long-context understanding
  • Inability to handle complex multi-modal queries

Solution proposed: Restructure from "task-specific" to "universal" retrieval paradigms.

---

The UVR and UVRB Frameworks

Researchers from HKUST (Guangzhou) and Alibaba DAMO Academy’s Tongyi Lab:

  • Introduced the Universal Video Retrieval (UVR) concept
  • Built UVRB, a comprehensive benchmark covering:
      • 16 datasets
      • Multiple tasks and domains
  • Synthesized 1.55M high-quality, diverse video-language training pairs
  • Designed a task pyramid curriculum training strategy for large multimodal foundation models

Result: General Video Embedding (GVE) model

  • Available in 3B and 7B parameter versions
  • Surpassed 14 mainstream models in true zero-shot evaluations
  • Highest generalization capabilities to date

---


Limitations of Mainstream Video Retrieval Models

Examples:

Microsoft’s CLIP4Clip, Shanghai AI Lab’s InternVideo2, Kuaishou’s Unite

Performance:

They perform well on MSRVTT but are limited to simple text-to-video matching.

Issues:

  • Short, generic captions (“a person dancing”)
  • Cannot handle complex, multi-modal queries, e.g.:
      • Text + image combination
      • Example clip search
      • Spatial relations (“red-shirt person on left”)
      • Temporal dynamics (“jump and then land”)
      • Partially matching semantics (“any mention of ‘drone’”)

---

Universal Video Retrieval (UVR) Definition

Task Types

  • TXT: Text-only
  • CMP: Text + image combination
  • VIS: Vision-only

Domains

  • Coarse-grained (CG)
  • Fine-grained (FG), with subtypes:
      • S: Spatial
      • T: Temporal
      • PR: Partial relevance
  • Long-context (LC)

Example:

  • TXT+S: “couple walking a dog in a vlog”
  • CMP+T: Image + temporal change description (“person walks into a house”)
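
To make the label space concrete, here is a minimal sketch of how the task types and domains can be combined into query capabilities. The pairings and example queries below are illustrative placeholders, not the paper's exact nine-capability enumeration.

```python
# Illustrative pairing of UVR task types with domains; the exact nine
# capabilities evaluated in UVRB follow the paper, not this sketch.
TASKS = ["TXT", "CMP", "VIS"]                    # text-only, text+image, vision-only
DOMAINS = ["CG", "FG-S", "FG-T", "FG-PR", "LC"]  # coarse, spatial, temporal, partial relevance, long-context

EXAMPLE_QUERIES = {
    ("TXT", "FG-S"): "couple walking a dog in a vlog",
    ("CMP", "FG-T"): ("<reference image>", "person walks into a house"),
    ("TXT", "FG-PR"): "any mention of 'drone'",
}
```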

---

Universal Video Retrieval Benchmark (UVRB)

  • Covers 3 tasks, 3 domains, 3 FG sub-domains
  • 9 capabilities tested
  • Reveals biases in existing models
  • Challenges illusion of “benchmark saturation”

---

V-SynFlow Data Pipeline

Three-stage synthesis (a code sketch follows the list below):

  1. Multi-granularity quality filtering
      • Denoising
      • Consistency checks
  2. MLLM-driven semantic enrichment
      • Spatial, temporal, topical, and stylistic descriptions
  3. Expanded synthesis
      • Text+image → video
      • Frame → video
      • Segment → video

Modality coverage: TXT→video, IMG→video, TXT+IMG→video, VIDEO→video
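
A minimal sketch of how the three stages might chain together. The helper names (`quality_score`, `mllm_describe`, `expand_pairs`) and the threshold are assumed placeholders, not the released pipeline.

```python
# Hypothetical skeleton of a V-SynFlow-style synthesis pipeline.
# All helper callables are assumed placeholders passed in by the caller.
from dataclasses import dataclass

@dataclass
class VideoSample:
    video_path: str
    caption: str

def run_pipeline(raw_samples, quality_score, mllm_describe, expand_pairs, min_quality=0.7):
    # Stage 1: multi-granularity quality filtering (denoising + consistency checks)
    clean = [s for s in raw_samples if quality_score(s) >= min_quality]

    # Stage 2: MLLM-driven semantic enrichment (spatial/temporal/topical/stylistic text)
    enriched = [VideoSample(s.video_path, mllm_describe(s)) for s in clean]

    # Stage 3: expanded synthesis (text+image→video, frame→video, segment→video pairs)
    pairs = []
    for s in enriched:
        pairs.extend(expand_pairs(s))
    return pairs
```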


---

GVE Model Design (Based on Qwen2.5-VL)

Architecture:

  • Backbone: Qwen2.5-VL
  • Frozen visual encoder
  • LoRA fine-tuning on the LLM part only

Input Fusion:

  • Supports text/image/video
  • Injects visual features via special tokens

Representation Extraction:

  • Use final token’s hidden state
  • L2-normalized for retrieval
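
A minimal PyTorch sketch of this last-token pooling, assuming right-padded batches; only the tensor shapes matter here.

```python
import torch
import torch.nn.functional as F

def pool_last_token(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Take each sequence's final non-padding hidden state and L2-normalize it.

    hidden_states: (batch, seq_len, dim) from the LLM backbone's last layer.
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    """
    last_idx = attention_mask.long().sum(dim=1) - 1                    # position of final real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    emb = hidden_states[batch_idx, last_idx]                           # (batch, dim)
    return F.normalize(emb, p=2, dim=-1)                               # unit norm, ready for cosine retrieval
```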

Training Objective:

  • Symmetric InfoNCE loss
  • Hard negative mining
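
A compact sketch of a symmetric InfoNCE objective with appended hard negatives; the temperature value and the hard-negative layout are assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(q, v, hard_neg=None, tau=0.05):
    """q, v: (B, D) L2-normalized query/video embeddings; in-batch items act as negatives.
    hard_neg: optional (B, K, D) mined hard-negative video embeddings per query.
    tau: assumed temperature.
    """
    labels = torch.arange(q.size(0), device=q.device)
    logits_q2v = q @ v.t() / tau                                  # (B, B) similarity matrix
    if hard_neg is not None:
        hn = torch.einsum("bd,bkd->bk", q, hard_neg) / tau        # (B, K) extra negative columns
        logits_q2v = torch.cat([logits_q2v, hn], dim=1)
    loss_q2v = F.cross_entropy(logits_q2v, labels)
    loss_v2q = F.cross_entropy(v @ q.t() / tau, labels)           # reverse (video-to-query) direction
    return 0.5 * (loss_q2v + loss_v2q)
```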

Curriculum Learning Strategy:

  • Build fundamental abilities first (e.g., object recognition)
  • Progress to complex tasks (e.g., temporal reasoning)
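
A minimal sketch of a stage-wise curriculum schedule; the stage names, grouping, and ordering below are illustrative assumptions, not the released modality-pyramid configuration.

```python
# Hypothetical stage ordering: foundational matching first, reasoning-heavy tasks last.
CURRICULUM = [
    ("foundation",   ["coarse_text_video"]),
    ("fine_grained", ["spatial", "partial_relevance"]),
    ("complex",      ["temporal", "composed_query", "long_context"]),
]

def curriculum_batches(loaders_by_task):
    """Yield (stage, batch) pairs, exhausting easier stages before harder ones."""
    for stage, tasks in CURRICULUM:
        for task in tasks:
            for batch in loaders_by_task[task]:
                yield stage, batch
```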

---

Evaluation & Results

Benchmarks: UVRB — 16 datasets

Baselines:

14 mainstream models:

  • CLIP-based (87M – 8.3B params)
  • MLLM-based (e.g., GME-7B, Unite-7B, B3-7B)

Evaluation protocol (sketched in code below):

  • GVE avoided training on any data from the evaluation domains (true zero-shot)
  • 8 uniformly sampled frames per video
  • No audio, speech, or metadata used
  • Cosine similarity for retrieval
  • Multi-image embeddings for baselines without native video support
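
A small sketch of the stated protocol: uniformly sampling 8 frames, scoring candidates by cosine similarity, and computing R@1. Only the frame count and similarity choice come from the article; the rest is assumed scaffolding.

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> np.ndarray:
    """Indices of 8 evenly spaced frames across the video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def recall_at_1(query_embs: np.ndarray, video_embs: np.ndarray, gt_ids: np.ndarray) -> float:
    """query_embs: (Q, D), video_embs: (N, D), both L2-normalized; gt_ids: (Q,) ground-truth indices.
    With unit-norm vectors, the dot product equals cosine similarity."""
    top1 = (query_embs @ video_embs.T).argmax(axis=1)
    return float((top1 == gt_ids).mean())
```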

Performance:

  • GVE-7B: average R@1 = 0.573 (vs. Unite-7B’s 0.538, a +6.5% relative gain)
  • GVE-3B: average R@1 = 0.544 (beats Unite-7B despite far fewer parameters)

---

Ablation Insights

  • UVRD synthetic dataset: +27% relative improvement in complex CMP tasks
  • Modality Pyramid Curriculum: GVE-7B overall ↑ from 0.594 → 0.600
  • Combined gains: +1.8%–3.1% overall performance

---

Four Key Findings

1. Traditional Benchmarks Are Misleading

Benchmarks like MSRVTT correlate only 0.58 with overall capability.

Partial-relevance (PR) tasks correlate at 0.97, making them the strongest single indicator of embedding quality.
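
These correlations are presumably computed across the evaluated models, relating each capability's scores to overall performance; a hedged sketch of that calculation:

```python
import numpy as np

def capability_correlations(scores: np.ndarray, names: list) -> dict:
    """Pearson correlation of each capability column with the per-model overall average.

    scores: (num_models, num_capabilities) matrix of R@1 values, one row per model.
    """
    overall = scores.mean(axis=1)
    return {
        name: float(np.corrcoef(scores[:, j], overall)[0, 1])
        for j, name in enumerate(names)
    }
```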

---

2. Spatial vs. Temporal Decoupling

  • Models handle objects/positions but fail at action sequences
  • Spatial–Temporal correlation: 0.12
  • Temporal understanding is decisive (0.98 correlation)

---

3. Architecture Influences Capability Path

  • CLIP models: Strong coarse-grained spatial (0.99), weak temporal
  • MLLM models: Balanced, better semantic reasoning and temporal coupling

---

4. Size Isn’t Everything

  • The small CLIP4Clip (87M parameters) beats the much larger Unite-7B on pure visual tasks
  • Visual task correlation to overall retrieval: only 0.26

---

Experimental Goal

Test if general video retrieval emerges from:

  • Better evaluation systems
  • Cleaner training data
  • Smarter learning strategies

Outcome:

High-quality synthetic UVRD + pyramid curriculum → significant generalization gains.

---

Capability Structure Insights

Analysis shows:

  • Classic benchmarks ≠ overall capability
  • Spatial and temporal embeddings are decoupled
  • Architecture shapes evolution path

---

Toward the Next Era

Shift from “matching titles” to “understanding content”:

  • Richer training signals
  • Explicit inter-task dependency modeling
  • New evaluation standards

---

Open Sourcing

  • UVRB benchmark
  • GVE models
  • V-SynFlow pipeline
  • Modality pyramid curriculum

Goal: Move from leaderboard chasing to capability diagnosis.

---

Paper: arXiv:2510.27571

Project: Homepage

Models & Data: HuggingFace Collection

---

Platforms like AiToEarn support similar open-source, scalable, multi-platform AI content generation and monetization — applicable for both academic research and creative industries.

They enable publishing across Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter), with integrated analytics and AI model ranking.

---
