multimodal AI

Only 3B Active Parameters, Stronger Multimodal Understanding and Reasoning — Baidu ERNIE-4.5-VL-28B-A3B-Thinking Officially Open-Sourced

Baidu ERNIE

PaddlePaddle · ERNIE-4.5-VL-28B-A3B-Thinking · November 11, 2025 · Zhejiang. Baidu has officially open-sourced its new ERNIE-4.5-VL-28B-A3B-Thinking multimodal deep-thinking model, a leading performer in document and chart understanding, cross-disciplinary reasoning, general visual reasoning, and cross-modal problem-solving. With only 3B activated parameters, it delivers capabilities comparable to top-tier…
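For readers who want to try the open-sourced weights, here is a minimal inference sketch. It assumes the checkpoint is published on Hugging Face under a repo id like `baidu/ERNIE-4.5-VL-28B-A3B-Thinking` and ships a transformers-compatible processor via remote code; both the repo id and the processor behavior are assumptions, not confirmed by the announcement above.

```python
# Hedged inference sketch for ERNIE-4.5-VL-28B-A3B-Thinking.
# Assumptions: the Hugging Face repo id below, and a standard VLM-style
# processor (images + text in, token ids out) exposed via trust_remote_code.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"  # assumed repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # MoE: 28B total parameters, ~3B active per token
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("chart.png")  # any local document/chart image
prompt = "Describe the main trend shown in this chart."
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```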

By Honghao Wang
Flexing Muscles While Building Walls: NVIDIA Launches OmniVinci, Outperforms Qwen2.5-Omni but Faces “Fake Open Source” Criticism

NVIDIA

NVIDIA OmniVinci: A Breakthrough in Multimodal AI. NVIDIA has unveiled OmniVinci, a large language model designed for multimodal understanding and reasoning, capable of processing text, visual, audio, and even robotic data inputs. Led by the NVIDIA Research team, the project explores human-like perception: integrating and interpreting information across multiple data…

By Honghao Wang
Today’s Open Source (2025-11-3): Kuaishou and Nanjing University Lab Co-Develop HiPO for Hybrid Policy Optimization in Dynamic LLM Reasoning; Dual-Mode Switching Balances Accuracy and Efficiency

LLM optimization

🏆 Foundational Models ① Project: HiPO. HiPO-8B is built on HiPO, a novel reinforcement learning framework based on Hybrid Policy Optimization that enables dynamic reasoning capabilities in large language models (LLMs). Key Highlights: * Developed by the KwaiKAT team at Kuaishou in collaboration with the NJU-LINK Laboratory (Nanjing University) and the ARiSE Laboratory. * Features “think-on” and “think-off” mode switching to…
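The excerpt cuts off before explaining how the mode switch works at inference time. Purely as an illustration of the dual-mode idea, here is a sketch in which a per-query controller picks between an explicit reasoning trace (“think-on”) and a direct answer (“think-off”); HiPO learns this decision with reinforcement learning, whereas the heuristic and all names below are hypothetical.

```python
# Illustrative sketch of "think-on"/"think-off" dual-mode dispatch.
# NOT the HiPO implementation: the mode decision here is a toy length
# heuristic standing in for HiPO's RL-learned switching policy.

def decide_mode(query: str) -> str:
    """Stand-in for the learned switching policy: pick a mode per query."""
    return "think-on" if len(query.split()) > 20 else "think-off"

def generate(query: str, llm) -> str:
    """Dispatch to a long reasoning trace or a direct answer."""
    if decide_mode(query) == "think-on":
        # Emit an explicit chain of thought first: higher accuracy, more tokens.
        prompt = f"Question: {query}\nLet's reason step by step."
        max_tokens = 2048
    else:
        # Answer directly: lower latency, fewer tokens.
        prompt = f"Question: {query}\nAnswer concisely."
        max_tokens = 256
    return llm(prompt, max_tokens=max_tokens)

if __name__ == "__main__":
    # Stub model so the sketch runs standalone.
    fake_llm = lambda prompt, max_tokens: f"[{max_tokens}-token budget] {prompt[:40]}..."
    print(generate("What is 2 + 2?", fake_llm))  # short query -> think-off path
```

The design point is that accuracy-hungry queries pay the token cost of explicit reasoning while routine queries keep latency low, which is the accuracy/efficiency balance the headline refers to.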

By Honghao Wang
LongCat-Flash-Omni Released and Open-Sourced: Ushering in the Era of Real-Time All-Modal Interaction

LongCat-Flash

LongCat-Flash-Omni: Next-Generation Open-Source Full-Modality Model. On September 1, Meituan launched the LongCat-Flash series, beginning with LongCat-Flash-Chat and LongCat-Flash-Thinking, which quickly gained attention in the developer community. Today, the series expands with a new flagship addition: LongCat-Flash-Omni. Highlights: * Efficient Shortcut-Connected MoE architecture (including zero-computation experts, sketched below) * Highly efficient multimodal perception modules * Speech…
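A “zero-computation expert” is an expert that simply returns its input, so any token the router sends there skips FFN compute entirely; that is how the architecture cuts average per-token cost. Below is a minimal PyTorch sketch of that routing idea. It is illustrative only, not Meituan’s implementation; dimensions, the top-1 router, and all names are made up.

```python
# Minimal sketch of a MoE layer with zero-computation (identity) experts.
# Tokens routed to an identity expert bypass the FFN entirely.
import torch
import torch.nn as nn

class MoEWithZeroExperts(nn.Module):
    def __init__(self, d_model=512, n_ffn_experts=4, n_zero_experts=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )
        # The router scores real FFN experts and zero experts jointly.
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)

    def forward(self, x):                        # x: (n_tokens, d_model)
        top1 = self.router(x).argmax(dim=-1)     # top-1 expert per token
        out = x.clone()                          # zero experts: identity, free
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])      # only these tokens pay FFN cost
        return out

x = torch.randn(8, 512)
print(MoEWithZeroExperts()(x).shape)             # torch.Size([8, 512])
```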

By Honghao Wang
Zhiyuan Wujie·Emu3.5 Released, Launching “Next-State Prediction”! Wang Zhongyuan: Could Open the Third Scaling Paradigm

Emu3.5

Wujie·Emu3.5: The Next Leap in Multimodal World Models. In October 2024, the Beijing Academy of Artificial Intelligence (BAAI) released the world’s first natively multimodal world model, Wujie·Emu3. This groundbreaking model is based entirely on next-token prediction, avoiding diffusion or composite methods, and achieves a unified…
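To make “based entirely on next-token prediction” concrete: images are tokenized into discrete codes (e.g., by a VQ-style tokenizer) and placed in the same sequence as text tokens, so a single decoder and a single cross-entropy loss cover both modalities. The sketch below shows that unified objective schematically; vocabulary sizes, model shape, and the tokenizer split are assumptions for illustration, not Emu3’s actual configuration.

```python
# Schematic: one autoregressive decoder over interleaved text + image tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_CODES = 32_000, 8_192      # illustrative sizes
VOCAB = TEXT_VOCAB + IMAGE_CODES             # shared vocabulary: text ids,
                                             # then discrete image codes

class TinyUnifiedLM(nn.Module):
    """A toy causal LM that treats text and image codes identically."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):                                  # ids: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.emb(ids), mask=mask, is_causal=True)
        return self.head(h)

# Interleave: text ids, then image codes offset into the shared vocabulary.
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(0, IMAGE_CODES, (1, 64)) + TEXT_VOCAB
seq = torch.cat([text, image], dim=1)

model = TinyUnifiedLM()
logits = model(seq[:, :-1])                  # predict every next token
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
print(loss.item())                           # one loss covers both modalities
```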

By Honghao Wang
Zhiyuan Wujie·Emu3.5 Reshapes the World Model Landscape: Introducing the First Multimodal Scaling Paradigm for Next-Gen AI Understanding

world models

Once Again, Pushing the Limits of World Models. A new benchmark has been set in the race for world models. The Beijing Academy of Artificial Intelligence (BAAI) has announced its large-scale multimodal world model, Wujie·Emu3.5. It not only simulates complex, dynamic physical realities with remarkable realism, but also…

By Honghao Wang
Wenxin 4.5’s Most Powerful Derivative Model PaddleOCR-VL: Pushing the Limits of Document Parsing with 0.9B Parameters

PaddleOCR-VL

PaddleOCR-VL: Next-Generation Multimodal Document Parsing. 2025-10-27 · Shanghai. We have officially launched PaddleOCR-VL, a next-generation multimodal document parsing solution supporting 109 languages. With only 0.9B parameters, it sets new records across multiple authoritative benchmarks. Testing, both public and internal, shows industry-leading performance…
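The announcement above does not show any calling code; as an assumption about how recent paddleocr releases expose the model, here is a hedged usage sketch. The `PaddleOCRVL` class name, its `predict` method, and the result helpers are assumptions meant to illustrate the workflow, not confirmed API.

```python
# Hedged usage sketch for PaddleOCR-VL document parsing (API names assumed).
from paddleocr import PaddleOCRVL  # assumed entry point in recent paddleocr

pipeline = PaddleOCRVL()
# Parse one document page into structured elements (text, tables, formulas...).
results = pipeline.predict("report_page.png")
for res in results:
    res.print()                      # assumed helper: inspect parsed structure
    res.save_to_markdown("output/")  # assumed helper: export as Markdown
```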

By Honghao Wang