Multimodal AI

Multimodal AI Models Learn Reflection and Review — SJTU & Shanghai AI Lab Tackle Complex Multimodal Reasoning

Multimodal AI Breakthrough: MM-HELIX Enables Long-Chain Reflective Reasoning

Multimodal large models are becoming increasingly impressive, yet many users are frustrated by their overly direct approach. Whether generating code, interpreting charts, or answering complex questions, many multimodal large language models (MLLMs) jump straight to a final answer without reconsideration. Like an…

By Honghao Wang

Bilibili

Bilibili’s Multimodal Fine-Grained Image Quality Analysis Model Excels at ICCV 2025 Competition

Preface

During the summer, the Bilibili Multimedia Lab team competed in the international challenge of the Detailed Image Quality Assessment track at the ICCV MIPI (Mobile Intelligent Photography and Imaging) Workshop. We proposed an innovative multimodal training strategy that boosted the composite score by 13.5%, ultimately winning second place. This competition served…

By Honghao Wang

Multimodal AI

New Approach to Document Image Parsing: Efficient Recognition and Structuring with Multimodal Models | Open Source Daily No.760

Dolphin: Multimodal Document Image Parsing
Repo: bytedance/Dolphin · Stars: 6.4k · License: MIT

Dolphin is a multimodal model for document image parsing, using heterogeneous anchor prompts to enable an “analyze first, then parse” workflow.

Key Features
* Two-stage processing:
  * Layout Analysis: Page-level layout detection that produces an element sequence in natural reading order…
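To make the two-stage idea concrete, here is a minimal sketch of an “analyze first, then parse” pipeline. Everything in it (the Element type and the detect_layout / parse_element stubs) is a hypothetical placeholder chosen for illustration, not Dolphin’s actual API.

```python
# Hypothetical sketch of a two-stage "analyze first, then parse" pipeline.
# Element, detect_layout, and parse_element are illustrative placeholders,
# NOT Dolphin's actual API.
from dataclasses import dataclass

@dataclass
class Element:
    kind: str    # e.g. "heading", "paragraph", "table"
    bbox: tuple  # (x0, y0, x1, y1) page coordinates

def detect_layout(page_image):
    # Stage 1: page-level layout detection. A real model would return the
    # detected elements in natural reading order; dummy values stand in here.
    return [
        Element("heading", (40, 30, 560, 70)),
        Element("paragraph", (40, 90, 560, 320)),
    ]

def parse_element(page_image, element):
    # Stage 2: parse a single element with an element-specific (anchor)
    # prompt, e.g. text recognition for paragraphs, structure for tables.
    return f"<parsed {element.kind} at {element.bbox}>"

def parse_document(page_image):
    # Analyze first (layout), then parse each detected element in order.
    return [
        {"kind": e.kind, "content": parse_element(page_image, e)}
        for e in detect_layout(page_image)
    ]

print(parse_document(page_image=None))  # no real image in this sketch
```

The design point the sketch illustrates is the separation of concerns: a cheap page-level pass fixes the element sequence once, so the heavier per-element parsing can run independently for each region.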

By Honghao Wang

Multimodal AI

Douyin & LV-NUS Release Open-Source Multimodal Model, Achieves SOTA with Small-Scale Design, 8B Inference Rivals GPT-4o

SAIL-VL2: 2B Model Ranked #1 Among Open-Source Models Under 4B Parameters

The SAIL-VL2 multimodal large model, jointly developed by Douyin's SAIL team and the LV-NUS Lab, has achieved remarkable results. Available in 2B and 8B parameter versions, it delivers breakthrough performance across 106 datasets, rivaling or surpassing both similar-scale…

By Honghao Wang

Open Source AI

Today’s Open Source (2025-10-10): Microsoft Releases UserLM for User Role Simulation in Conversations, Advancing Real Interaction Technology

Daily Discovery of Latest LLMs
Date: 2025-10-10 · Location: Hong Kong, China

---

📢 Overview

Highlighted Releases:
* Language Model: UserLM
* Foundation Model: Lumina-DiMOO
* Language Model (Code): CoDA-v0-Instruct
* Reasoning Model: Jamba-Reasoning
* Visualization Tool: Model Explorer ONNX
* Video Framework: Code2Video

---

🏆 Foundation Models

1. UserLM: Simulating the User Side of Conversations
Key Points: …

By Honghao Wang