Multimodal AI

Multimodal AI Models Learn Reflection and Review — SJTU & Shanghai AI Lab Tackle Complex Multimodal Reasoning

Multimodal AI Breakthrough: MM-HELIX Enables Long-Chain Reflective Reasoning

Multimodal large models are becoming increasingly impressive, yet many users are frustrated by their overly direct approach. Whether generating code, interpreting charts, or answering complex questions, many multimodal large language models (MLLMs) jump straight to a final answer without reconsideration. Like an…

By Honghao Wang

Bilibili

Bilibili’s Multimodal Fine-Grained Image Quality Analysis Model Excels at ICCV 2025 Competition

Preface

During the summer, the Bilibili Multimedia Lab team competed in the international challenge of the Detailed Image Quality Assessment track at the ICCV MIPI (Mobile Intelligent Photography and Imaging) Workshop. We proposed an innovative multimodal training strategy that boosted the composite score by 13.5%, ultimately winning second place. This competition served…

By Honghao Wang

Multimodal AI

New Approach to Document Image Parsing: Efficient Recognition and Structuring with Multimodal Models | Open Source Daily No.760

Dolphin: Multimodal Document Image Parsing
Repo: bytedance/Dolphin · Stars: 6.4k · License: MIT

Dolphin is a multimodal model for document image parsing, using heterogeneous anchor prompts to enable an “analyze first, then parse” workflow.

Key Features
* Two-stage processing:
  * Layout Analysis: Page-level layout detection that produces an element sequence in natural reading order…
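To make the two-stage idea concrete, here is a minimal sketch of an “analyze first, then parse” pipeline. Everything in it (the Element type and the detect_layout / parse_element stubs) is a hypothetical placeholder chosen for illustration, not Dolphin’s actual API.

```python
# Hypothetical sketch of a two-stage "analyze first, then parse" pipeline.
# Element, detect_layout, and parse_element are illustrative placeholders,
# NOT Dolphin's actual API.
from dataclasses import dataclass

@dataclass
class Element:
    kind: str    # e.g. "heading", "paragraph", "table"
    bbox: tuple  # (x0, y0, x1, y1) page coordinates

def detect_layout(page_image):
    # Stage 1: page-level layout detection. A real model would return the
    # detected elements in natural reading order; dummy values stand in here.
    return [
        Element("heading", (40, 30, 560, 70)),
        Element("paragraph", (40, 90, 560, 320)),
    ]

def parse_element(page_image, element):
    # Stage 2: parse a single element with an element-specific (anchor)
    # prompt, e.g. text recognition for paragraphs, structure for tables.
    return f"<parsed {element.kind} at {element.bbox}>"

def parse_document(page_image):
    # Analyze first (layout), then parse each detected element in order.
    return [
        {"kind": e.kind, "content": parse_element(page_image, e)}
        for e in detect_layout(page_image)
    ]

print(parse_document(page_image=None))  # no real image in this sketch
```

The design point the sketch illustrates is the separation of concerns: a cheap page-level pass fixes the element sequence once, so the heavier per-element parsing can run independently for each region.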

By Honghao Wang

Multimodal AI

Douyin & LV-NUS Release Open-Source Multimodal Model, Achieves SOTA with Small-Scale Design, 8B Inference Rivals GPT-4o

SAIL-VL2: 2B Model Ranked #1 Among Open-Source Models Under 4B Parameters

The SAIL-VL2 multimodal large model, jointly developed by Douyin's SAIL team and the LV-NUS Lab, has achieved remarkable results. Available in 2B and 8B parameter versions, it delivers breakthrough performance across 106 datasets, rivaling or surpassing both similar-scale…

By Honghao Wang

Open Source AI

Today’s Open Source (2025-10-10): Microsoft Releases UserLM for User Role Simulation in Conversations, Advancing Real Interaction Technology

Daily Discovery of Latest LLMs
Date: 2025-10-10 · Location: Hong Kong, China

---

📢 Overview

Highlighted Releases:
* Language Model: UserLM
* Foundation Model: Lumina-DiMOO
* Language Model (Code): CoDA-v0-Instruct
* Reasoning Model: Jamba-Reasoning
* Visualization Tool: Model Explorer ONNX
* Video Framework: Code2Video

---

🏆 Foundation Models

1. UserLM: Simulating the User Side of Conversations
Key Points: …

By Honghao Wang