Douyin & LV-NUS Release Open-Source Multimodal Model, Achieves SOTA with Small-Scale Design, 8B Inference Rivals GPT-4o

SAIL-VL2: 2B Model Ranked #1 Among Open-Source Models Under 4B Parameters

The SAIL-VL2 multimodal large model, jointly developed by Douyin's SAIL team and the LV-NUS Lab, is released in 2B and 8B parameter versions. It achieves performance breakthroughs across 106 datasets, rivaling or surpassing both similar-scale models and much larger closed-source models on complex reasoning benchmarks such as MMMU and MathVista.


---

Key Innovations

SAIL-VL2 advances the field through data, training, and architecture improvements, proving that small models can be powerful. It combines fine-grained visual perception with reasoning abilities comparable to large-scale systems. Both model weights and inference code are open-sourced for community use.


---

Architecture Highlights

SAIL-VL2 diverges from traditional dense LLMs by using a sparse Mixture of Experts (MoE) approach, with flexible configurations targeted at diverse applications.


---

SAIL-ViT: Progressively Optimized Visual Encoder

To solve visual–language alignment, SAIL-VL2 trains SAIL-ViT with a three-stage strategy (a freeze/unfreeze sketch follows this list):

  • Warm-up Adaptation:
    • Freeze SAIL-ViT and the LLM
    • Train only the Adapter with 8M samples
    • Activates the cross-modal mapping
  • Fine-grained Alignment:
    • Freeze the LLM; unlock SAIL-ViT and the Adapter
    • Train with 6.7M caption & OCR samples for deeper alignment
  • World Knowledge Injection:
    • Unlock all parameters
    • Use 36.5M multi-task samples to improve generalization
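
This schedule amounts to a simple freeze/unfreeze routine over the three modules. A minimal PyTorch sketch is shown below; the `vit`, `adapter`, and `llm` arguments are hypothetical stand-ins, not the released implementation.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(vit: nn.Module, adapter: nn.Module, llm: nn.Module, stage: int) -> None:
    """Freeze/unfreeze schedule for the three alignment stages (illustrative)."""
    schedules = {
        1: (False, True, False),  # Warm-up Adaptation: only the Adapter learns
        2: (True, True, False),   # Fine-grained Alignment: SAIL-ViT + Adapter learn
        3: (True, True, True),    # World Knowledge Injection: all parameters learn
    }
    for module, trainable in zip((vit, adapter, llm), schedules[stage]):
        set_trainable(module, trainable)
```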

Results:

  • Nearest-neighbor distance: 1.42 → 1.15
  • Wasserstein distance: 4.86 → 3.88
  • Both indicate stronger alignment.
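
For intuition, such alignment metrics can be computed between visual and text embedding sets as sketched below. This assumes a mean nearest-neighbor distance and a sliced (random-projection) approximation of the Wasserstein distance; the paper's exact measurement protocol may differ.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import wasserstein_distance

def mean_nn_distance(vis: np.ndarray, txt: np.ndarray) -> float:
    """Average L2 distance from each visual embedding to its nearest text embedding."""
    dists, _ = cKDTree(txt).query(vis, k=1)
    return float(dists.mean())

def sliced_wasserstein(vis: np.ndarray, txt: np.ndarray, n_proj: int = 64, seed: int = 0) -> float:
    """Approximate the Wasserstein distance between two high-dimensional
    embedding sets by averaging 1-D distances over random projections."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        d = rng.standard_normal(vis.shape[1])
        d /= np.linalg.norm(d)
        total += wasserstein_distance(vis @ d, txt @ d)
    return total / n_proj

# Demo with random features only; closer modality distributions give smaller values.
vis = np.random.randn(1000, 256).astype(np.float32)
txt = np.random.randn(1000, 256).astype(np.float32)
print(mean_nn_distance(vis, txt), sliced_wasserstein(vis, txt))
```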

---

MoE Architecture: Efficiency & Balance

  • The 31.1B-parameter model builds on Qwen3-MoE, activating only 3B parameters per inference.
  • Load-balancing losses and data calibration improve expert activation entropy by 20%, ensuring expert specialization (a sketch of these quantities follows).
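
A common formulation of these quantities is sketched below: a Switch-Transformer-style auxiliary loss plus the entropy of the average routing distribution. The exact losses used in SAIL-VL2 are not reproduced here, so treat this as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Switch-style auxiliary loss penalizing uneven expert usage.
    router_logits: (num_tokens, num_experts)."""
    num_experts = router_logits.size(-1)
    probs = router_logits.softmax(dim=-1)                 # soft routing probabilities
    chosen = probs.topk(top_k, dim=-1).indices            # hard top-k assignment
    mask = F.one_hot(chosen, num_experts).float().sum(1)  # (num_tokens, num_experts)
    fraction_routed = mask.mean(0) / top_k                # f_i: share of tokens per expert
    mean_prob = probs.mean(0)                             # P_i: average router probability
    return num_experts * (fraction_routed * mean_prob).sum()

def expert_activation_entropy(router_logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the average routing distribution; higher means more balanced experts."""
    p = router_logits.softmax(dim=-1).mean(0)
    return -(p * (p + 1e-9).log()).sum()
```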

---

SAIL-ViT-AnyRes: Flexible Visual Resolution

  • 2D RoPE interpolation allows arbitrary input resolutions, up to 1792×1792.
  • In RefCOCO localization tasks, SAIL-ViT-AnyRes scores 57.82 vs. 53.28 for fixed-resolution models.
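
One plausible way to implement this is to rescale each input's patch-grid coordinates onto the base training grid before computing the rotary angles, so larger images reuse the learned frequency range. The sketch below is illustrative only; the dimension split and interpolation scheme are assumptions, not SAIL-VL2's released code.

```python
import torch

def rope_2d_angles(h: int, w: int, dim: int, base_hw: int = 32, theta: float = 10000.0) -> torch.Tensor:
    """Build 2D rotary angles for an h x w patch grid, interpolating patch
    coordinates onto the base training grid so arbitrary resolutions reuse
    the trained frequency range. Returns (h*w, dim/2) angles."""
    assert dim % 4 == 0
    # Coordinate interpolation: e.g. a 56x56 grid maps onto [0, base_hw) like a 32x32 one.
    ys = torch.arange(h).float() * (base_hw / h)
    xs = torch.arange(w).float() * (base_hw / w)
    freqs = 1.0 / theta ** (torch.arange(0, dim // 2, 2).float() / (dim // 2))
    ang_y = torch.outer(ys, freqs)  # (h, dim/4)
    ang_x = torch.outer(xs, freqs)  # (w, dim/4)
    # Half of the rotary dimensions encode the row, the other half the column.
    grid_y = ang_y[:, None, :].expand(h, w, -1)
    grid_x = ang_x[None, :, :].expand(h, w, -1)
    return torch.cat([grid_y, grid_x], dim=-1).reshape(h * w, dim // 2)
```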

---

Automated Data Pipeline


Two main strategies: quality filtering and scope expansion.

  • SAIL-Caption2:
    • Scores each sample for Visual Information Richness (VIR) and Image–Text Alignment (ITA) on a 1–5 scale
    • Discards samples scoring below 3 (a filtering sketch follows this list)
    • Produces 250M general captions and 1.69M chart captions
  • Synthetic VQA generated from captions boosts QA dataset diversity
  • Pure-text & multimodal instruction data preserve the LLM's language ability and strengthen instruction-following
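
The VIR/ITA step reduces to a threshold filter over judge-model scores. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class CaptionSample:
    image_id: str
    caption: str
    vir: int  # Visual Information Richness score, 1-5 (from a judge model)
    ita: int  # Image-Text Alignment score, 1-5 (from a judge model)

def filter_captions(samples: list[CaptionSample], threshold: int = 3) -> list[CaptionSample]:
    """Keep only samples whose VIR and ITA scores both reach the threshold."""
    return [s for s in samples if s.vir >= threshold and s.ita >= threshold]
```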

---

Progressive Pretraining: Perception → Reasoning


Two-Stage Multimodal Pretraining

  • Basic Pretraining with 64M data for cross-modal alignment
  • Multi-task Pretraining with 180M data for visual comprehension & instruction-following

Data Resampling

  • Balances datasets
  • Optimizes n-gram distribution
  • Mitigates bias
  • Improves training efficiency
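
One simple way to optimize the n-gram distribution is to down-weight samples dominated by over-represented n-grams when drawing the training mixture. The sketch below is an assumption about the mechanism, not the team's actual resampler:

```python
from collections import Counter
import random

def ngrams(text: str, n: int = 2) -> list[str]:
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def resample(corpus: list[str], k: int, n: int = 2, seed: int = 0) -> list[str]:
    """Draw k samples, down-weighting texts full of over-represented n-grams
    so the resampled corpus has a flatter n-gram distribution."""
    counts = Counter(g for text in corpus for g in ngrams(text, n))
    weights = []
    for text in corpus:
        grams = ngrams(text, n) or [""]
        avg_freq = sum(counts[g] for g in grams) / len(grams)
        weights.append(1.0 / (1.0 + avg_freq))  # rarer n-grams => higher weight
    random.seed(seed)
    return random.choices(corpus, weights=weights, k=k)
```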

---

SAIL-Video: Video QA Alignment

  • Screened 6 source datasets, yielding 6.23M candidate samples
  • Scored each sample on QA alignment, content richness, and QA difficulty
  • Result: 5.1M high-quality video-QA pairs

---

SAIL-Instruction2: Instruction Fine-Tuning

  • Uses datasets like Mammoth & MMPR
  • Dual validation + category filtering
  • Generates 20M instruction samples

---

Multimodal Chain-of-Thought (CoT) Data

  • Sources: VisualWebInstruct, MathV360K
  • Filtered for questions that are challenging yet solvable (see the pass-rate sketch after this list)
  • Result:
    • 400k LongCoT SFT samples
    • 1M Think-Fusion SFT samples
    • 150k RL samples
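
"Challenging yet solvable" is commonly operationalized as a pass-rate band: sample a reference model several times per question and keep items it solves sometimes but not always. A sketch, where `solve` is a hypothetical stand-in for one sampled attempt plus answer checking:

```python
from typing import Callable

def difficulty_filter(questions: list[dict], solve: Callable[[dict], bool],
                      attempts: int = 8, lo: float = 0.1, hi: float = 0.9) -> list[dict]:
    """Keep questions a reference model solves sometimes but not always:
    too-easy (pass rate > hi) and unsolvable (pass rate < lo) items are dropped."""
    kept = []
    for q in questions:
        pass_rate = sum(solve(q) for _ in range(attempts)) / attempts
        if lo <= pass_rate <= hi:
            kept.append(q)
    return kept
```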

---

Five-Stage Post-Training Strategy

  • Base SFT – 4-stage data injection with model fusion for strong instruction-following
  • LongCoT SFT – 400k CoT samples for stepwise reasoning
  • Verifiable Reward RL – Dual rewards for correctness & format compliance, focusing on STEM accuracy (reward sketch after this list)
  • Think-Fusion SFT – Mixed data + conditional loss for flexible reasoning
  • Mixed Reward RL – Complex 3D signals for balanced reasoning & concise output
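
For the Verifiable Reward RL stage, the dual reward can be expressed as answer correctness plus a format-compliance bonus. The sketch below assumes a \boxed{} final answer and <think> tags; the actual reward specification is an assumption:

```python
import re

def verifiable_reward(response: str, reference: str) -> float:
    """Dual reward: 1.0 for a correct final answer, plus a small bonus for
    following the expected <think>...</think> output format."""
    fmt_ok = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    correct = m is not None and m.group(1).strip() == reference.strip()
    return float(correct) + 0.1 * float(fmt_ok)
```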

---

Training Efficiency Optimizations

  • Dynamic Batch Packaging (greedy packing sketch after this list):
    • Concatenate samples to reduce padding
    • +50% training speed, +0.7% QA performance
  • Visual Token Balance:
    • Relieves encoder memory pressure
    • +48% efficiency
  • Kernel Fusion – fewer memory ops, 3× MoE training speed
  • Streaming Data Loading + Hybrid Parallelism – reduced communication overhead
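
Dynamic batch packaging is essentially bin-packing variable-length samples into fixed-length sequences to minimize padding. A greedy first-fit sketch over token lengths (illustrative, not the released implementation):

```python
def pack_samples(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit packing: group samples into fixed-size bins so little
    of each batch is padding. Returns lists of sample indices per bin."""
    bins: list[list[int]] = []
    space: list[int] = []
    # Longest-first improves fill rate for first-fit packing.
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for b, free in enumerate(space):
            if lengths[idx] <= free:
                bins[b].append(idx)
                space[b] -= lengths[idx]
                break
        else:
            bins.append([idx])
            space.append(max_len - lengths[idx])
    return bins
```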

---

Benchmark Results

General Multimodal Benchmark

  • SAIL-VL2-2B: 70.31 on OpenCompass
    • Surpasses Qwen2.5-VL-3B (65.36) and InternVL3.5-2B (66.64)
    • #1 open-source model under 4B parameters
  • SAIL-VL2-8B: highest score among similar-scale models

---

Fine-Grained Tasks

  • SAIL-VL2-2B:
    • MMStar: 64.07
    • OCRBench: 89.50
  • SAIL-VL2-8B:
    • MMStar: 70.73
    • OCRBench: 91.30

---

Multimodal Reasoning

  • SAIL-VL2-8B-Thinking: 54.4, 2nd only to GPT-4o-latest (54.8)
  • SAIL-VL2-A3B-Thinking: 53.6, beating Gemini-2.0-Flash (50.6)

---

Practical Integration & Monetization

Platforms like AiToEarn (official site) integrate powerful models, cross-platform publishing, analytics, and rankings, enabling creators to monetize AI outputs (image, video, text) across networks such as Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter).


---

Paper: https://arxiv.org/pdf/2509.14033

Code & Models: https://github.com/BytedanceDouyinContent/SAIL-VL2

Hugging Face Model Hub: https://huggingface.co/BytedanceDouyinContent

---


By Honghao Wang