Meituan LongCat Team Releases All-Modal One-Stop Evaluation Benchmark UNO-Bench


Introduction: The Shift to Full-Modality AI

Multimodal AI is evolving from single-perception systems toward integrated vision, audio, and text processing — the era of full-modality large models (Omni-models).

However, evaluation systems lag behind:

  • Tools are scarce, fragmented, and English-centric
  • Limited support for Chinese-language scenarios
  • Some datasets fail to require actual multimodal fusion, making it hard to gauge true cross-modal reasoning

---

Meet UNO-Bench

UNO-Bench, from Meituan’s LongCat team, is a high-quality, diversified benchmark suite that:

  • Accurately measures single- and full-modality comprehension
  • Tests a new “Combination Law” of full-modality performance:
      • For weaker models → bottleneck effect
      • For stronger models → synergistic gains
  • Designed via manual annotation to ensure quality and avoid contamination
  • Introduces multi-step open-ended questions to test deep reasoning beyond traditional MCQs

---

1. Evaluation Landscape: Current State & Challenges

Mature Single-Modality Benchmarks

Examples:

  • MMBench — Visual understanding
  • MathVision — Mathematical/logical reasoning
  • MVBench — Video scene analysis
  • MMAU — Audio cognition

Gaps in Full-Modality Evaluation

  • Models like Gemini & Qwen-3-Omni integrate visual + audio modalities
  • But existing benchmarks:
      • Contain errors (e.g., OmniBench)
      • Can be solved without true modal integration (e.g., WorldSense)

UNO-Bench’s Contribution

  • 1,250 manually annotated full-modality samples
  • 2,480 enhanced single-modality samples
  • Covers 44 task types (Chinese-language scenarios)
  • 98% of tasks strictly require multimodal fusion

> Legend: I / A / V / T = Image, Audio, Video, Text; Acc. = Accuracy; Solvable = % of samples requiring full-modality fusion; QA Type = MC (Multiple Choice) or MO (Multi-step Open-ended)

---

2. UNO-Bench Construction

2.1 Top-Level Design


Two Capability Layers:

  • Perception
      • Object/attribute recognition
      • Scene understanding
      • Spatial judgment
      • Cross-modal transformation & alignment
  • Reasoning
      • General reasoning (commonsense + logic)
      • STEM, Coding
      • Spatial reasoning (static + dynamic)
      • Temporal reasoning
      • Complex reasoning
---

2.2 Data Pipeline

Three Stages:

  • Curated data materials
  • Expert-level Q&A annotation
  • Rigorous multi-round quality inspection

Key innovations:

  • Modality ablation → removes the information from one modality at a time to verify that answering truly requires cross-modal fusion (>98% compliance; a minimal sketch of such a check appears below)
  • Audio–video separation + recombination → Breaks redundancy between modalities
  • 90%+ privately created original content
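
A minimal sketch of what a modality-ablation check could look like. Here `answer_question` is a hypothetical probe (not part of UNO-Bench's released tooling) that answers a sample given only the listed modalities; samples that remain solvable after dropping any single modality do not require fusion.

```python
from typing import Callable, Dict, Set

def requires_fusion(sample: Dict, answer_question: Callable[[Dict, Set[str]], str]) -> bool:
    """Return True only if the sample cannot be answered after removing any one modality."""
    modalities: Set[str] = set(sample["modalities"])   # e.g. {"audio", "video"}
    gold = sample["answer"]
    for dropped in modalities:
        kept = modalities - {dropped}
        if answer_question(sample, kept) == gold:
            # Still solvable with one modality removed -> no cross-modal fusion needed
            return False
    return True
```

Samples that fail such a check would be revised or discarded, which is how a >98% fusion-required rate can be enforced.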

---

2.3 Data Optimization

  • Fewer than 11% of samples are supplemented from public datasets (AV-Odyssey, WorldSense)
  • Novel clustering-guided hierarchical sampling (sketched below):
      • Cuts evaluation costs by 90%+
      • Maintains ranking consistency
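
The exact sampling procedure is described in the paper; the sketch below only illustrates the general idea of clustering-guided subset selection, assuming each sample has an embedding. The cluster count, keep ratio, and embedding source are illustrative choices, not UNO-Bench's actual settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_benchmark(embeddings: np.ndarray, keep_ratio: float = 0.1,
                       n_clusters: int = 50, seed: int = 0) -> np.ndarray:
    """Pick a small, diversity-preserving subset by sampling proportionally from clusters."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        k = max(1, round(len(idx) * keep_ratio))        # proportional allocation per cluster
        selected.extend(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.sort(np.asarray(selected))
```

Sampling within clusters keeps every region of the task space represented, which is what preserves model rankings even when 90%+ of samples are dropped.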

---

2.4 Evaluation Innovations

  • Multi-step Open-ended (MO) questions
      • Complex reasoning is broken into sequential sub-questions
      • Expert-weighted scoring (max score 10) reveals “reasoning decay” (a scoring sketch follows this list)
  • General scoring model
      • Supports auto-scoring for six question types
      • 95% accuracy on out-of-distribution models
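
A hedged sketch of how weighted multi-step scoring can work; the sub-question structure and step weights below are illustrative, not UNO-Bench's actual rubric.

```python
def score_mo_question(step_correct: list, step_weights: list, max_score: float = 10.0) -> float:
    """Weighted credit over sequential sub-questions, scaled to max_score (10 in UNO-Bench)."""
    earned = sum(w for ok, w in zip(step_correct, step_weights) if ok)
    return max_score * earned / sum(step_weights)

# Later steps typically carry more weight, so "reasoning decay" (early steps right,
# later steps wrong) produces a sharply reduced score.
print(score_mo_question([True, True, False], [2.0, 3.0, 5.0]))  # -> 5.0 out of 10
```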

---

3. Experiments & Findings

3.1 Model Performance Overview

  • Closed-source models lead (Gemini series dominates)
  • LongCat-Flash-Omni achieves SOTA in open-source segment

---

Capability Breakdown

  • Perception strong across models
  • Reasoning = key differentiator
  • Spatial inference is hardest (top score: 45 by Gemini-2.5-Pro)

---

Human vs AI

  • Perception parity → Gemini matches humans
  • Reasoning gap → Humans outperform Gemini in complex problems

---

3.2 Relationship: Single vs Full-Modality


Combination Law Formula:

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422

  • Exponent > 1 → convex curve → acceleration for strong models
  • Bottleneck effect observed in weak models vs. synergy in strong models
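
A minimal numerical sketch of the published fit, assuming P_A and P_V are single-modality audio and visual scores normalized to [0, 1]; the coefficients are those reported above.

```python
def predict_omni_score(p_a: float, p_v: float) -> float:
    """Combination Law: predicted full-modality score from audio (p_a) and visual (p_v) scores."""
    return 1.0332 * (p_a * p_v) ** 2.1918 + 0.2422

# Because the exponent is > 1, the curve is convex: weak single-modality models are
# bottlenecked, while strong ones see super-additive (synergistic) gains.
print(round(predict_omni_score(0.5, 0.5), 3))  # ~0.292  (bottleneck regime)
print(round(predict_omni_score(0.9, 0.9), 3))  # ~0.893  (synergy regime)
```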

---

Ablation Verification:

  • Top-tier models (Gemini) extract richer signals from raw AV than text transcriptions

---

3.3 Validity of UNO-Bench

Strengths:

  • Differentiates performance across models (MO questions amplify cognitive gaps)
  • Efficient benchmark compression with minimal rank changes (SRCC/PLCC > 0.98, illustrated below)
  • High data quality — 100% accuracy, 98% cross-modal solvability
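
To illustrate what SRCC/PLCC > 0.98 means in practice, the snippet below compares per-model scores on the full benchmark against scores on the compressed subset; all numbers are placeholders, not UNO-Bench results.

```python
from scipy.stats import pearsonr, spearmanr

full_scores       = [72.1, 65.4, 58.9, 51.2, 44.7]   # placeholder per-model scores, full set
compressed_scores = [71.6, 66.0, 58.1, 50.8, 45.3]   # placeholder scores, compressed subset

srcc, _ = spearmanr(full_scores, compressed_scores)  # Spearman rank correlation (SRCC)
plcc, _ = pearsonr(full_scores, compressed_scores)   # Pearson linear correlation (PLCC)
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")           # values near 1.0 mean the ranking is preserved
```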

---

4. Conclusion & Outlook

UNO-Bench:

  • Proves full-modality intelligence > simple sum of single-modal scores
  • Reveals bottleneck & synergy effects
  • Builds high-quality, Chinese-language multimodal benchmark
  • Finds perception approaching human levels, but reasoning still trails

Future roadmap:

  • Expand dataset via human–AI co-construction
  • Add more challenging tasks (STEM, Coding)
  • Deep dive into modal interaction mechanisms

---

Open-Source Resources

---

Practical Applications

Platforms like AiToEarn can extend UNO-Bench insights into AI-driven content creation & monetization:

  • Open-source global AI publishing & analytics
  • Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, FB, IG, LinkedIn, Threads, YouTube, Pinterest, X)
  • Integrated AI model ranking (AI模型排名)
  • Documentation (AiToEarn文档)

---

In essence, UNO-Bench is a scientific benchmark for multimodal evaluation — and with proper ecosystem tools, its insights can directly fuel real-world AI applications and creator economy growth.
