VisionWeaver: From Phenomenon Recognition to Cause Diagnosis, Opening a New Chapter in AI Visual Hallucination Research
VisionWeaver & VHBench-10 — Root Cause Diagnosis for LVLM Hallucinations
Date: 2025-11-14 · Location: Shanghai

The Bilibili User Technology Center has unveiled VisionWeaver and its diagnostic benchmark VHBench-10, offering a new paradigm for understanding and tackling hallucinations in large vision-language models (LVLMs).
---
📖 Preface
For years, we have known that LVLMs can misinterpret images, yet we lacked a tool that reveals why these errors happen: we could see the symptoms without understanding the cause.
We want models to stop inventing details and to stop "calling a deer a horse." That bottleneck has now been broken.
VisionWeaver introduces a context-aware routing network that dynamically leverages multiple specialist visual encoders. It is paired with VHBench-10, a benchmark that moves hallucination research from observing phenomena to diagnosing causes. The work appears in the Findings of EMNLP 2025.
---
🔗 Related Resources
- Paper: arxiv.org/abs/2509.13836v1
- Code: github.com/whwangovo/VisionWeaver
- Dataset: huggingface.co/datasets/whwangovo/VHBench_10
---
📄 Paper Overview
Hallucinations remain one of the biggest barriers to LVLM adoption. Existing evaluations (e.g., POPE) are coarse-grained: like a doctor who knows the patient has a fever but cannot find the lesion.
VHBench-10 addresses this by tracing hallucinations to four core visual capabilities:
- Detection
- Segmentation
- Localization
- Classification
These are further divided into 10 specific subtasks such as color recognition, counting, and text extraction — like a CT scan for AI perception.
Building on these insights, VisionWeaver is designed to choose the right visual expert for each image rather than relying on a single encoder.

Figure 1 — VisionWeaver System Overview
---
🎯 Key Contributions
- VHBench-10 Diagnostic Benchmark
  - Advances hallucination research from phenomenon recognition to cause analysis.
  - Links errors directly to failures in foundational visual tasks.
- Encoder-Specific Hallucination Profiling
  - Systematic, quantified analysis of each encoder's strengths and weaknesses.
  - Targets root causes rather than surface-level fixes.
- VisionWeaver Adaptive Architecture
  - Context-aware routing chooses the best specialist encoder for each image.
  - Outperforms traditional feature-fusion methods.
- SOTA Benchmark Results
  - Lower hallucination rates and improved performance across multiple datasets.
---
⚙️ Method Overview
1. Context-Aware Routing
   - Uses the `[CLS]` token from a base CLIP encoder to capture global image semantics.
   - Generates adaptive soft routing weights for each expert encoder (ConvNeXt, DINOv2, SAM, Vary).
   - Weights the experts according to their importance for the current image.
2. Knowledge Enhancement & Fusion
   - Weighted fusion of expert outputs produces an aggregated representation.
   - Preserves fine-grained details via residual connections to the CLIP patch tokens.
   - Projects the enhanced features into the LLM embedding space for generation (a minimal sketch of both steps follows).
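The post does not reproduce the paper's exact layers, so the following PyTorch sketch only illustrates the two steps above under stated assumptions: a small gating MLP over the CLIP `[CLS]` token produces per-expert soft weights, the expert feature maps are assumed to be pre-resampled to the same token grid as the CLIP patches, and the class names (`ContextAwareRouter`, `VisionWeaverSketch`) are hypothetical.

```python
import torch
import torch.nn as nn


class ContextAwareRouter(nn.Module):
    """Sketch of the routing idea: the CLIP [CLS] token decides how much
    each expert encoder should contribute for the current image."""

    def __init__(self, cls_dim: int, num_experts: int):
        super().__init__()
        # A small MLP maps global semantics to one soft weight per expert.
        self.gate = nn.Sequential(
            nn.Linear(cls_dim, cls_dim // 2),
            nn.GELU(),
            nn.Linear(cls_dim // 2, num_experts),
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, cls_dim) -> routing weights (B, num_experts), rows sum to 1
        return torch.softmax(self.gate(cls_token), dim=-1)


class VisionWeaverSketch(nn.Module):
    """Weighted fusion of expert features with a residual to the CLIP patch
    tokens, then a projection into the LLM embedding space."""

    def __init__(self, cls_dim: int, patch_dim: int, expert_dims: list[int], llm_dim: int):
        super().__init__()
        self.router = ContextAwareRouter(cls_dim, num_experts=len(expert_dims))
        # Align each expert's feature dimension with the CLIP patch dimension.
        self.align = nn.ModuleList(nn.Linear(d, patch_dim) for d in expert_dims)
        self.proj = nn.Linear(patch_dim, llm_dim)  # into the LLM embedding space

    def forward(self, cls_token, clip_patches, expert_feats):
        # cls_token:    (B, cls_dim)          global CLIP semantics
        # clip_patches: (B, N, patch_dim)     fine-grained CLIP patch tokens
        # expert_feats: list of (B, N, d_i)   one feature map per expert encoder
        weights = self.router(cls_token)                       # (B, E)
        aligned = [f(x) for f, x in zip(self.align, expert_feats)]
        stacked = torch.stack(aligned, dim=1)                  # (B, E, N, patch_dim)
        fused = (weights[:, :, None, None] * stacked).sum(1)   # (B, N, patch_dim)
        enhanced = clip_patches + fused                        # residual keeps details
        return self.proj(enhanced)                             # (B, N, llm_dim)
```

In the real system the expert features would come from frozen ConvNeXt, DINOv2, SAM, and Vary backbones; how they are resampled to a shared token grid, and whether the gate is a single linear layer or a deeper MLP, are implementation details not specified in this post.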

Figure 2 — VHBench-10: 4 types, 10 subtypes of hallucinations
---
🌟 Key Features
- Granular Hallucination Diagnosis via VHBench-10
- Dynamic Expert Routing beyond single encoders
- Collaborative Multi-Encoder Fusion
- Significant Hallucination Suppression
- Optimized Inference with lightweight experts & KV caching
---
🧪 Experiments
Fine-Grained Hallucination Evaluation
VHBench-10 assesses 10 visual subtasks, pinpointing exactly where a model's perception breaks down.
Over 10,000 images with controlled error types (e.g., color mismatches, counting errors) are used to measure each model's hallucination tendencies precisely; a sketch of the per-subtask scoring idea is shown below.
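For intuition, here is a minimal, hypothetical scoring loop for per-subtask error rates. The record fields (`image`, `subtask`, `question`, `answer`) and the exact-match criterion are assumptions for illustration, not the official VHBench-10 schema or protocol.

```python
from collections import defaultdict


def per_subtask_error_rate(records, predict):
    """Aggregate error rates per VHBench-10 subtask.

    `records` is assumed to be an iterable of dicts such as
    {"image": ..., "subtask": "counting", "question": ..., "answer": "3"};
    `predict` is any callable mapping (image, question) -> answer string.
    """
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        pred = predict(r["image"], r["question"])
        totals[r["subtask"]] += 1
        if pred.strip().lower() != r["answer"].strip().lower():
            errors[r["subtask"]] += 1
    # One error rate per subtask, e.g. {"counting": 0.12, "color": 0.05, ...}
    return {t: errors[t] / totals[t] for t in totals}
```

Aggregating errors per subtask rather than reporting a single overall score is what turns a flat accuracy number into the "CT scan" style diagnosis described above.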
Comparative Performance
VisionWeaver outperforms single/multi-encoder baselines across VHBench-10, POPE, and AutoHallusion.

Figure 3 — VisionWeaver achieves the lowest error rate across all 10 subtasks
---
📝 Summary
VisionWeaver and VHBench-10 shift LVLM hallucination research from vague "the AI is wrong" statements to clear root-cause analysis.
The approach pairs diagnosis with adaptive treatment, dynamically routing visual expertise to prevent errors before they occur.
This sets a new standard for building reliable, precise multimodal AI systems.
---
💬 Developer Q&A
What hallucination issues have you encountered with large models?
Share below — and join our giveaway!
Prize: Starry Dragon Series Card Sleeve Set
Deadline: Nov 21, 12:00
---
📚 Recommended Reads
- TextFlux: Multi-language High-fidelity Scene Text Editing
- Multimodal Technology for Scene Classification
- AniSoraV3 — AniME Framework for Long-form Video Creation