Can AI Perform “Pilgrimage” Tours? New Multimodal Evaluation Benchmark VIR-Bench Released

From Anime Pilgrimages to AI-Powered Travel

Many of us have felt that spark:

  • Watching an anime you love, you suddenly want to visit its real-life locations.
  • Seeing a beautifully edited travel vlog, you bookmark it, hoping to follow that exact route someday.

Travel combined with video inspires curiosity and a desire to explore.

Now imagine: what if AI could automatically analyze these travel videos, tell you “which places were visited”, “in what order”, and even generate an instant, personalized itinerary?

This is more than a pop culture fantasy — it’s a realistic scenario for multimodal large language models (MLLMs).

---

Introducing VIR‑Bench

Researchers from Waseda University (Japan), CyberAgent, and the Nara Institute of Science and Technology have developed VIR‑Bench — a benchmark to evaluate whether AI can truly grasp the geographical and temporal structure in travel videos.

The core question:

> “Where did I come from? Where am I going?”

---

Task Overview — Itinerary Reconstruction

Objective: Automatically produce a visiting order graph from a travel vlog.

A visiting order graph is a directed graph with:

  • Nodes: visited locations in three hierarchical levels — Prefecture → City → Point of Interest (POI).
  • Inclusion edges: “A contains B” (e.g., City contains POI).
  • Transition edges: chronological travel from one node to another within the same level.
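
The definition above can be written down as a small data structure. This is only an illustrative sketch: the names `Node` and `VisitingOrderGraph` and the two-stop trip are hypothetical and are not taken from the VIR‑Bench release.

```python
from dataclasses import dataclass, field

# The three hierarchy levels named in the task description.
LEVELS = ("prefecture", "city", "poi")


@dataclass(frozen=True)
class Node:
    name: str
    level: str  # one of LEVELS


@dataclass
class VisitingOrderGraph:
    nodes: set = field(default_factory=set)
    inclusion_edges: set = field(default_factory=set)   # (parent, child) pairs
    transition_edges: set = field(default_factory=set)  # (earlier, later) pairs


# Hypothetical trip, for illustration only: two POIs visited in one city.
tokyo = Node("Tokyo", "prefecture")
taito = Node("Taito", "city")
sensoji = Node("Senso-ji", "poi")
ueno_park = Node("Ueno Park", "poi")

graph = VisitingOrderGraph(
    nodes={tokyo, taito, sensoji, ueno_park},
    inclusion_edges={(tokyo, taito), (taito, sensoji), (taito, ueno_park)},
    transition_edges={(sensoji, ueno_park)},  # Senso-ji was visited first
)
```

With only one prefecture and one city in this example, transition edges appear only at the POI level.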

This requires combining:

  • Location recognition — identify each place visited.
  • Temporal sequencing — determine the order of visits.
  • Spatial reasoning — map containment relationships.

---

Subtasks

To simplify evaluation, the authors split the problem into two subtasks:

  • Node Prediction
      • Input: travel video.
      • Task: list all visited prefectures, cities, and POIs.
  • Edge Prediction
      • Input: video + unordered node list.
      • Task: identify all inclusion edges and transition edges.
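
Both subtasks produce sets (of nodes or of edges), so one natural way to score them is set-level precision, recall, and F1 against the annotated graph. This article does not state the exact metric, so the sketch below is an assumption, and the function name `set_f1` and the sample data are hypothetical.

```python
def set_f1(predicted, gold):
    """F1 between a predicted set and a gold set, using exact matches."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical node prediction: two of three gold nodes found, one wrong guess.
gold_nodes = {("city", "Taito"), ("poi", "Senso-ji"), ("poi", "Ueno Park")}
pred_nodes = {("city", "Taito"), ("poi", "Senso-ji"), ("poi", "Tokyo Tower")}

score = set_f1(pred_nodes, gold_nodes)  # precision = recall = 2/3, so F1 = 2/3
```

The same function applies to edge prediction if each edge is represented as a `(source, target)` tuple.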

This approach allows separate testing of:

  • Geographic recognition ability
  • Temporal reasoning ability
  • How these abilities combine in practice

---

Dataset Details

Scale & Scope:

  • 200 Japan travel vlogs
  • 3,689 POIs across 43 prefectures

Annotation Process:

  • Human annotators marked start/end times for each POI with Google Maps links.
  • Verification by a second annotator.
  • Automated generation of visiting order graphs.
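
The last step can be sketched as follows: given timestamped POI annotations, sort them by start time, derive inclusion edges from the hierarchy, and derive transition edges from consecutive visits at each level. The article does not describe the authors' actual procedure, so this is an assumed reconstruction; `Segment`, `build_graph`, and the sample trip are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One annotated POI visit (hypothetical schema)."""
    prefecture: str
    city: str
    poi: str
    start: float  # seconds from the start of the video
    end: float


def dedupe_consecutive(items):
    """Collapse runs of identical neighbours: [A, A, B] -> [A, B]."""
    out = []
    for item in items:
        if not out or out[-1] != item:
            out.append(item)
    return out


def build_graph(segments):
    ordered = sorted(segments, key=lambda s: s.start)
    inclusion = set()
    for s in ordered:
        inclusion.add((("prefecture", s.prefecture), ("city", s.city)))
        inclusion.add((("city", s.city), ("poi", s.poi)))
    transitions = []
    for level in ("prefecture", "city", "poi"):
        visits = dedupe_consecutive([(level, getattr(s, level)) for s in ordered])
        transitions.extend(zip(visits, visits[1:]))  # same-level pairs only
    return inclusion, transitions


# Hypothetical annotations, deliberately given out of order.
segments = [
    Segment("Kyoto", "Kyoto", "Fushimi Inari Taisha", 400.0, 600.0),
    Segment("Tokyo", "Taito", "Senso-ji", 0.0, 120.0),
    Segment("Tokyo", "Taito", "Ueno Park", 150.0, 300.0),
]
inclusion, transitions = build_graph(segments)
```

Tagging every node with its level keeps the prefecture "Kyoto" and the city "Kyoto" distinct.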

---

Why It Matters

Applications could include:

  • AI travel apps that watch your videos and map your trip automatically.
  • Interactive itinerary generation for vlog creators.
  • Synchronization with maps, timelines, and monetizable content platforms.

---

Experimental Findings

Key Insights:

  • Open-source models lag behind commercial ones (especially in POI node recognition and transition edge prediction).
  • Transition edge prediction is the hardest challenge: many models misread the constraint that transition edges connect only nodes at the same level.
  • Model size matters — bigger models perform better in edge prediction.
  • Geo-relevant pretraining boosts POI recognition accuracy.
  • Chain-of-Thought (CoT) reasoning helps edge prediction much more than node prediction.
  • Audio input significantly improves performance (e.g., Gemini‑2.5‑Pro).
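
The same-level constraint on transition edges is easy to check mechanically. The sketch below separates a model's predicted transition edges into those that respect the constraint and those that violate it. It is a hypothetical post-processing helper, not part of the benchmark, and `split_valid_transitions` and the sample data are assumptions.

```python
def split_valid_transitions(edges, level_of):
    """Split predicted transition edges by whether both ends share a level."""
    valid, invalid = [], []
    for src, dst in edges:
        same_level = (
            src in level_of
            and dst in level_of
            and level_of[src] == level_of[dst]
        )
        (valid if same_level else invalid).append((src, dst))
    return valid, invalid


# Hypothetical node levels and model output, for illustration only.
level_of = {
    "Tokyo": "prefecture",
    "Taito": "city",
    "Senso-ji": "poi",
    "Ueno Park": "poi",
}
predicted = [
    ("Senso-ji", "Ueno Park"),  # POI to POI: allowed
    ("Taito", "Senso-ji"),      # city to POI: an inclusion relation, not a transition
]
valid, invalid = split_valid_transitions(predicted, level_of)
```

Counting the `invalid` edges gives a quick measure of how often a model confuses containment with travel order.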

Ablation Insights:

  • More input frames give the model more travel clues.
  • Longer reasoning chains lead to better sequence reconstruction.
  • Audio provides semantic hints for location and order.

Despite these gains, even top models like Gemini‑2.5‑Pro still make many errors — showing how challenging long-range temporal and geographic understanding is.

---

Tables

Table 1: Node Prediction (results shown as an image in the original post)

Table 2: Edge Prediction (results shown as an image in the original post)

---

Conclusion

VIR‑Bench is more than a benchmark — it’s a bridge toward future AI applications requiring joint understanding of where and when.

Potential impacts:

  • Robotics: route comprehension and planning.
  • Autonomous driving: decision-making in dynamic environments.
  • AI content tools that convert raw travel footage into mapped, shareable experiences.

Current challenge:

> Large models still struggle with long-range reasoning and spatiotemporal integration.

Growth path:

  • Stronger geo-spatial awareness
  • Reliable temporal logic
  • Enhanced multimodal fusion

With these advances, AI could progress from simply watching videos to acting within the world.

---


By Honghao Wang