Can AI Perform “Pilgrimage” Tours? New Multimodal Evaluation Benchmark VIR-Bench Released

Honghao Wang

15 Oct 2025 — 4 min read

From Anime Pilgrimages to AI-Powered Travel

Many of us have felt that spark:

Watching an anime you love, you suddenly want to visit its real-life locations.
Seeing a beautifully edited travel vlog, you bookmark it, hoping to follow that exact route someday.

Travel combined with video inspires curiosity and a desire to explore.

Now imagine: what if AI could automatically analyze these travel videos, tell you “which places were visited”, “in what order”, and even generate an instant, personalized itinerary?

This is more than a pop culture fantasy — it’s a realistic scenario for multimodal large language models (MLLMs).

---

Introducing VIR‑Bench

Researchers from Waseda University (Japan), CyberAgent, and the Nara Institute of Science and Technology have developed VIR‑Bench — a benchmark to evaluate whether AI can truly grasp the geographical and temporal structure in travel videos.

The core question:

> “Where did I come from? Where am I going?”

Paper: https://www.arxiv.org/abs/2509.19002
GitHub: https://github.com/nlp-waseda/VIR-Bench

---

Task Overview — Itinerary Reconstruction

Objective: Automatically produce a visiting order graph from a travel vlog.

A visiting order graph is a directed graph with:

Nodes: visited locations in three hierarchical levels — Prefecture → City → Point of Interest (POI).
Inclusion edges: “A contains B” (e.g., City contains POI).
Transition edges: chronological travel from one node to another within the same level.
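
The structure described above can be sketched as a small data class. This is only an illustrative sketch: the class name, field names, and the rule that a parent sits exactly one level above its child are assumptions made here, not the schema used in the VIR-Bench repository.

```python
from dataclasses import dataclass, field

# Levels ordered from coarse to fine, as described above.
LEVELS = ("prefecture", "city", "poi")


@dataclass
class VisitingOrderGraph:
    """Hypothetical container; field names are not taken from the VIR-Bench code."""

    nodes: dict = field(default_factory=dict)       # node name -> level
    inclusion: set = field(default_factory=set)     # (parent, child) pairs
    transition: set = field(default_factory=set)    # (earlier, later) pairs

    def add_node(self, name, level):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        self.nodes[name] = level

    def add_inclusion(self, parent, child):
        # Assumption: a parent sits exactly one level above its child.
        gap = LEVELS.index(self.nodes[child]) - LEVELS.index(self.nodes[parent])
        if gap != 1:
            raise ValueError(f"{parent!r} cannot directly contain {child!r}")
        self.inclusion.add((parent, child))

    def add_transition(self, earlier, later):
        # Transition edges only connect nodes at the same level.
        if self.nodes[earlier] != self.nodes[later]:
            raise ValueError("transition edges must stay within one level")
        self.transition.add((earlier, later))


graph = VisitingOrderGraph()
graph.add_node("Tokyo", "prefecture")
graph.add_node("Taito", "city")
graph.add_node("Senso-ji", "poi")
graph.add_node("Ueno Park", "poi")
graph.add_inclusion("Tokyo", "Taito")
graph.add_inclusion("Taito", "Senso-ji")
graph.add_inclusion("Taito", "Ueno Park")
graph.add_transition("Senso-ji", "Ueno Park")  # visited in this order
```

Keeping inclusion and transition edges in separate sets mirrors the two edge types in the definition and makes it easy to inspect each one on its own.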

This requires combining:

Location recognition — identify each place visited.
Temporal sequencing — determine the order of visits.
Spatial reasoning — map containment relationships.

---

Subtasks

To simplify evaluation, the authors split the problem into two subtasks:

Node Prediction: given the travel video, list all visited prefectures, cities, and POIs.
Edge Prediction: given the video and the unordered node list, identify all inclusion edges and transition edges.

This approach allows separate testing of:

Geographic recognition ability
Temporal reasoning ability
How these abilities combine in practice
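
Because both subtasks produce sets (of nodes or of directed edges), one plausible way to score them is set-based precision, recall, and F1. The sketch below assumes that metric for illustration; this post does not state which metric the benchmark uses, and the function name `prf1` and the sample data are hypothetical.

```python
def prf1(predicted, gold):
    """Return (precision, recall, F1) for two sets of hashable items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Node prediction: compare predicted and annotated POI names.
gold_pois = {"Senso-ji", "Ueno Park", "Tokyo Tower"}
pred_pois = {"Senso-ji", "Tokyo Tower", "Shibuya Crossing"}
node_scores = prf1(pred_pois, gold_pois)

# Edge prediction: edges are (source, target) tuples, so direction matters.
gold_edges = {("Senso-ji", "Ueno Park"), ("Ueno Park", "Tokyo Tower")}
pred_edges = {("Senso-ji", "Ueno Park"), ("Tokyo Tower", "Ueno Park")}
edge_scores = prf1(pred_edges, gold_edges)
```

In the edge example, the second predicted edge names the right pair of places but in the wrong order, so it counts as an error. That is the kind of mistake that separates temporal reasoning from pure location recognition.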

---

Dataset Details

Scale & Scope:

200 Japan travel vlogs
3,689 POIs across 43 prefectures

Annotation Process:

Human annotators marked the start and end time of each POI visit and attached a Google Maps link for the location.
A second annotator verified each annotation.
Visiting order graphs were then generated automatically from the verified annotations.
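
To make the last step concrete, here is a minimal sketch of how timestamped POI annotations could be turned into inclusion and transition edges. It is an assumption about how such a conversion might work, not the authors' pipeline: the `PoiVisit` record, its fields, and the rule of linking consecutive distinct names at each level are all hypothetical.

```python
from typing import NamedTuple


class PoiVisit(NamedTuple):
    """Hypothetical annotation record for one POI segment in a video."""
    start_sec: float
    end_sec: float
    poi: str
    city: str
    prefecture: str


def collapse_repeats(names):
    """Drop consecutive duplicates while keeping chronological order."""
    runs = []
    for name in names:
        if not runs or runs[-1] != name:
            runs.append(name)
    return runs


def build_edges(visits):
    """Derive (inclusion, transition) edge sets from POI annotations."""
    ordered = sorted(visits, key=lambda v: v.start_sec)
    inclusion = set()
    for v in ordered:
        inclusion.add((v.prefecture, v.city))
        inclusion.add((v.city, v.poi))
    transition = set()
    for level in ("prefecture", "city", "poi"):
        sequence = collapse_repeats(getattr(v, level) for v in ordered)
        # Link each place to the next distinct place at the same level.
        transition.update(zip(sequence, sequence[1:]))
    return inclusion, transition


visits = [
    PoiVisit(150, 300, "Tokyo Tower", "Minato", "Tokyo"),
    PoiVisit(0, 60, "Senso-ji", "Taito", "Tokyo"),
    PoiVisit(60, 150, "Ueno Park", "Taito", "Tokyo"),
]
inclusion, transition = build_edges(visits)
```

Sorting by start time means the annotations can arrive in any order. Collapsing repeats keeps two consecutive POIs in the same city from producing a city-to-itself transition.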

---

Why It Matters

Applications could include:

AI travel apps that watch your videos and map your trip automatically.
Interactive itinerary generation for vlog creators.
Synchronization with maps, timelines, and monetizable content platforms.

Platforms like AiToEarn already integrate AI content generation, multi-platform publishing, analytics, and model ranking — turning ideas like VIR‑Bench into sustainable creative tools.

---

Experimental Findings

Key Insights:

Open-source models lag behind commercial ones (especially in POI node recognition and transition edge prediction).
Transition edge prediction is the hardest subtask: many models violate the constraint that transition edges exist only between nodes at the same level.
Model size matters — bigger models perform better in edge prediction.
Geo-relevant pretraining boosts POI recognition accuracy.
Chain-of-Thought (CoT) reasoning helps edge prediction much more than node prediction.
Audio input significantly improves performance (e.g., Gemini‑2.5‑Pro).
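
The same-level constraint on transition edges is easy to check mechanically. As a hedged illustration, the sketch below separates a model's predicted transition edges into valid ones and constraint violations; the function name and the sample prediction are hypothetical and not part of the benchmark code.

```python
def split_by_level_constraint(node_levels, predicted_transitions):
    """Separate predicted transition edges into (valid, violations).

    An edge is valid only if both endpoints are known nodes at the same level.
    """
    valid, violations = [], []
    for src, dst in predicted_transitions:
        same_level = (
            src in node_levels
            and dst in node_levels
            and node_levels[src] == node_levels[dst]
        )
        (valid if same_level else violations).append((src, dst))
    return valid, violations


node_levels = {
    "Tokyo": "prefecture",
    "Taito": "city",
    "Minato": "city",
    "Senso-ji": "poi",
    "Tokyo Tower": "poi",
}
predicted = [
    ("Taito", "Minato"),          # city -> city: allowed
    ("Senso-ji", "Minato"),       # poi -> city: crosses levels
    ("Senso-ji", "Tokyo Tower"),  # poi -> poi: allowed
]
valid, violations = split_by_level_constraint(node_levels, predicted)
```

A check like this only catches structural mistakes. An edge can pass it and still be wrong about the actual visiting order.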

Ablation Insights:

More input frames give the model more travel clues.
Longer reasoning chains lead to better sequence reconstruction.
Audio provides semantic hints for location and order.
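
One intuition for the frame-count effect: with uniform sampling, a short POI segment can fall entirely between two sampled frames. The sketch below assumes uniform midpoint sampling, which this post does not confirm as the benchmark's method, and all names and numbers are illustrative.

```python
def sample_timestamps(duration_sec, num_frames):
    """Return the midpoints of num_frames equal-length slices of the video."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = duration_sec / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def segment_is_seen(start_sec, end_sec, timestamps):
    """True if at least one sampled frame falls inside the segment."""
    return any(start_sec <= t <= end_sec for t in timestamps)


# A 10-minute vlog with a one-minute POI visit from 4:00 to 5:00.
few_frames = sample_timestamps(600, 4)
many_frames = sample_timestamps(600, 16)
seen_with_few = segment_is_seen(240, 300, few_frames)
seen_with_many = segment_is_seen(240, 300, many_frames)
```

With four frames the one-minute visit is skipped entirely, while sixteen frames place two samples inside it.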

Despite these gains, even top models like Gemini‑2.5‑Pro still make many errors — showing how challenging long-range temporal and geographic understanding is.

---

Tables

Table 1: Node Prediction

Table 2: Edge Prediction

---

Conclusion

VIR‑Bench is more than a benchmark — it’s a bridge toward future AI applications requiring joint understanding of where and when.

Potential impacts:

Robotics: route comprehension and planning.
Autonomous driving: decision-making in dynamic environments.
AI content tools that convert raw travel footage into mapped, shareable experiences.

Current challenge:

> Large models still struggle with long-range reasoning and spatiotemporal integration.

Growth path:

Stronger geo-spatial awareness
Reliable temporal logic
Enhanced multimodal fusion

With these advances, AI will progress from simply watching videos to truly acting within the world.

---

For developers and creators exploring multimodal spatiotemporal AI, platforms like AiToEarn offer a practical way to publish and monetize innovations across:

Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter).
