Meituan LongCat Team Releases All-Modal One-Stop Evaluation Benchmark UNO-Bench
Introduction: The Shift to Full-Modality AI
Multimodal AI is evolving from single-perception systems toward integrated vision, audio, and text processing — the era of full-modality large models (Omni-models).
However, evaluation systems lag behind:
- Tools are scarce, fragmented, and English-centric
- Limited support for Chinese-language scenarios
- Some datasets fail to require actual multimodal fusion, making it hard to gauge true cross-modal reasoning

---
Meet UNO-Bench
UNO-Bench, from Meituan’s LongCat team, is a high-quality, diversified benchmark suite that:
- Accurately measures single- and full-modality comprehension
- Tests a new “Combination Law” in full-modality performance:
  - For weaker models → bottleneck effect
  - For stronger models → synergistic gains
- Is designed via manual annotation to ensure quality and avoid data contamination
- Introduces multi-step open-ended questions to test deep reasoning beyond traditional MCQs
---
1. Evaluation Landscape: Current State & Challenges
Mature Single-Modality Benchmarks
Examples:
- MMBench — Visual understanding
- MathVision — Mathematical/logical reasoning
- MVBench — Video scene analysis
- MMAU — Audio cognition
Gaps in Full-Modality Evaluation
- Models like Gemini & Qwen-3-Omni integrate visual + audio modalities
- But existing benchmarks:
  - Contain errors (e.g., OmniBench)
  - Can be solved without true modal integration (e.g., WorldSense)
UNO-Bench’s Contribution
- 1,250 manually annotated full-modality samples
- 2,480 enhanced single-modality samples
- Covers 44 task types (Chinese-language scenarios)
- 98% of tasks strictly require multimodal fusion

> Table legend: I / A / V / T = Image / Audio / Video / Text; Acc. = accuracy; Solvable = % of samples requiring full-modality fusion; QA Type = MC (multiple choice) or MO (multi-step open-ended)
---
2. UNO-Bench Construction
2.1 Top-Level Design

Two Capability Layers:
- Perception
  - Object/attribute recognition
  - Scene understanding
  - Spatial judgment
  - Cross-modal transformation & alignment
- Reasoning
  - General reasoning (commonsense + logic)
  - STEM, Coding
  - Spatial reasoning (static + dynamic)
  - Temporal reasoning
  - Complex reasoning
---
2.2 Data Pipeline
Three Stages:
- Curated data materials
- Expert-level Q&A annotation
- Rigorous multi-round quality inspection
Key innovations:
- Modality ablation → removes information from one modality at a time to verify that solving the question truly requires cross-modal fusion (>98% compliance; see the sketch after this list)
- Audio–video separation + recombination → Breaks redundancy between modalities
- 90%+ privately created original content
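
The modality-ablation check lends itself to a simple programmatic sketch. The snippet below is a minimal illustration rather than the team’s actual pipeline: `ask_model` is a hypothetical stand-in for a call to any multimodal model with only a chosen subset of modalities attached, and a sample passes only if dropping any single modality breaks it.

```python
# Hypothetical helper: queries some multimodal model with only the given
# modalities attached and returns its answer string. Not a real API.
def ask_model(question, modalities, sample):
    raise NotImplementedError("plug in your own model call here")

def requires_fusion(sample, modalities=("audio", "video")):
    """Return True if the sample can only be solved with all modalities present.

    A sample fails the check if the model still answers correctly after any
    single modality is removed, i.e. the question is solvable without fusion.
    """
    for dropped in modalities:
        kept = [m for m in modalities if m != dropped]
        answer = ask_model(sample["question"], kept, sample)
        if answer.strip() == sample["ground_truth"].strip():
            return False  # solvable without the dropped modality
    return True

# Usage sketch: keep only samples that pass the ablation check.
# filtered = [s for s in dataset if requires_fusion(s)]
```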

---
2.3 Data Optimization
- Fewer than 11% of samples are supplemented from public datasets (AV-Odyssey, WorldSense)
- A novel clustering-guided hierarchical sampling strategy (see the sketch below):
  - Cuts evaluation costs by 90%+
  - Maintains ranking consistency
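
The exact sampling algorithm isn’t detailed above, but the idea behind clustering-guided hierarchical sampling can be sketched as: embed every benchmark item, cluster the embeddings, then draw a proportional share from each cluster so the compressed subset preserves the original distribution. A minimal sketch under those assumptions, using scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_subsample(embeddings, keep_ratio=0.1, n_clusters=20, seed=0):
    """Cluster sample embeddings and draw a proportional subset from each cluster.

    embeddings: (n_samples, dim) array, e.g. feature vectors per benchmark item.
    Returns indices of the retained samples (~keep_ratio of the original set).
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)

    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        k = max(1, int(round(len(members) * keep_ratio)))  # keep at least one per cluster
        kept.extend(rng.choice(members, size=k, replace=False))
    return np.sort(np.array(kept))

# Usage sketch:
# idx = hierarchical_subsample(sample_embeddings, keep_ratio=0.1)
# compressed_benchmark = [dataset[i] for i in idx]
```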

---
2.4 Evaluation Innovations

- Multi-step Open-ended (MO) Questions
  - Complex reasoning is broken into sequential sub-questions
  - Expert-weighted scoring (max score 10) reveals “reasoning decay” (a scoring sketch follows this list)
- General scoring model
  - Supports automatic scoring for six question types
  - 95% accuracy on out-of-distribution models
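
One way to picture expert-weighted MO scoring is sketched below. The sub-question structure, weights, and the exact-match judge are illustrative assumptions, not the benchmark’s published rubric; in practice the general scoring model would replace the toy judge.

```python
def score_mo_question(sub_answers, sub_truths, weights, max_score=10.0):
    """Score a multi-step open-ended (MO) question.

    sub_answers: model answers to the sequential sub-questions.
    sub_truths:  reference answers for each step.
    weights:     expert-assigned importance of each step (need not sum to 1).
    Returns a score on a 0..max_score scale; if later steps carry more weight,
    errors late in the chain ("reasoning decay") cost more.
    """
    assert len(sub_answers) == len(sub_truths) == len(weights)
    total_w = sum(weights)
    earned = sum(w for ans, ref, w in zip(sub_answers, sub_truths, weights)
                 if ans.strip().lower() == ref.strip().lower())  # toy exact-match judge
    return max_score * earned / total_w

# Usage sketch: three sub-steps, the final one weighted most heavily.
# score_mo_question(["4", "16", "wrong"], ["4", "16", "64"], weights=[1, 2, 3])
# -> 10 * (1 + 2) / 6 = 5.0
```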
---
3. Experiments & Findings
3.1 Model Performance Overview
- Closed-source models lead (Gemini series dominates)
- LongCat-Flash-Omni achieves SOTA in open-source segment

---
Capability Breakdown

- Perception strong across models
- Reasoning = key differentiator
- Spatial reasoning is the hardest (top score: 45, by Gemini-2.5-Pro)
---
Human vs AI

- Perception parity → Gemini matches human performance
- Reasoning gap → humans outperform Gemini on complex problems
---
3.2 Relationship: Single vs Full-Modality

Combination Law formula:

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422

- Exponent > 1 → convex curve → accelerating gains for strong models
- Bottleneck vs synergy observed in weak vs strong models
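
Applying the fitted law is straightforward: given a model’s audio-only score P_A and visual-only score P_V on a 0–1 scale, it predicts the full-modality score P_Omni. A small sketch:

```python
def predict_omni(p_audio, p_visual, a=1.0332, k=2.1918, c=0.2422):
    """Predict full-modality performance from single-modality scores (0..1 scale)
    using the Combination Law: P_Omni ≈ a * (P_A * P_V)^k + c."""
    return a * (p_audio * p_visual) ** k + c

# Because the exponent k > 1, the curve is convex: weak single-modality scores
# sit near the constant term (bottleneck), while strong scores compound
# super-linearly (synergy).
# print(predict_omni(0.5, 0.5))  # ~0.29
# print(predict_omni(0.9, 0.9))  # ~0.89
```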
---
Ablation Verification:

- Top-tier models (e.g., Gemini) extract richer signals from raw audio-video input than from text transcriptions
---
3.3 Validity of UNO-Bench
Strengths:
- Differentiates performance across models (MO questions amplify cognitive gaps)
- Efficient compression with minimal rank changes (SRCC/PLCC > 0.98; a correlation check is sketched after this list)
- High data quality: 100% annotation accuracy, 98% cross-modal solvability
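
The ranking-consistency claim is measured with standard rank correlations (SRCC = Spearman, PLCC = Pearson). A minimal check with SciPy, using placeholder scores for a handful of models:

```python
from scipy.stats import spearmanr, pearsonr

# Placeholder scores: each position is one model, evaluated on the full
# benchmark vs. the compressed (clustering-sampled) subset.
full_set_scores = [72.1, 65.4, 58.9, 51.2, 44.7]
compressed_scores = [71.5, 66.0, 58.1, 52.0, 44.3]

srcc, _ = spearmanr(full_set_scores, compressed_scores)  # rank-order agreement
plcc, _ = pearsonr(full_set_scores, compressed_scores)   # linear agreement
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")  # values near 1.0 mean rankings are preserved
```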



---
4. Conclusion & Outlook
UNO-Bench:
- Shows that full-modality intelligence is more than a simple sum of single-modality scores
- Reveals bottleneck & synergy effects
- Builds high-quality, Chinese-language multimodal benchmark
- Finds perception approaching human levels, but reasoning still trails
Future roadmap:
- Expand dataset via human–AI co-construction
- Add more challenging tasks (STEM, Coding)
- Deep dive into modal interaction mechanisms
---
Open-Source Resources
- GitHub: UNO-Bench Repo
- Hugging Face: UNO-Bench Dataset
- Paper PDF: UNO-Bench Whitepaper
---
Practical Applications
Platforms like AiToEarn can extend UNO-Bench insights into AI-driven content creation & monetization:
- Open-source global AI publishing & analytics
- Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, FB, IG, LinkedIn, Threads, YouTube, Pinterest, X)
- Integrated model ranking: AI模型排名 (AI model rankings)
- Docs: AiToEarn文档 (AiToEarn documentation)
---
In essence, UNO-Bench is a scientific benchmark for multimodal evaluation — and with proper ecosystem tools, its insights can directly fuel real-world AI applications and creator economy growth.