# AndesVL: Next-Generation On-Device Multimodal Large Model
## Introduction
Multimodal large models running directly on devices often suffer from **insufficient performance**, **limited capabilities**, and **poor adaptability**, which makes it difficult to meet the **high-performance**, **strong-privacy**, and **low-latency** demands of edge AI applications. These issues have become a bottleneck in the evolution of AI smartphones.
The **OPPO AI Center** has introduced **AndesVL**, an **open-source** **on-device multimodal large model** series, fully adapted for end-to-end deployment on device hardware. AndesVL combines **state-of-the-art multimodal understanding and reasoning** with **edge-specific optimizations**.
### Key Facts
- **Parameter sizes**: 0.6B, 1B, 2B, 4B
- Supports **flexible deployment in multiple scenarios**
- Equipped with **GUI and multilingual capabilities**
- **Fully open source**
**Performance metrics:**
- Peak decoding speedup of up to 6.7×
- Compression to under 1.8 bits per weight (BPW)
- SOTA results on 30+ benchmarks against similar-sized models
#### Resources:
- [Tech Report](https://arxiv.org/pdf/2510.11496)
- [Hugging Face Models](https://huggingface.co/OPPOer)
- [GitHub Evaluation Toolkit](https://github.com/OPPO-Mente-Lab/AndesVL_Evaluation)
---
## Technical Background
Cloud-based multimodal models have achieved **great success**, but edge deployment faces challenges:
- The need for **high performance** on device hardware
- Strict **privacy requirements**
- Demands for **low latency**
**AndesVL** is **China's first** fully end-to-end on-device multimodal large model series, optimized for smartphone deployment via:
- **Rapid deployment**
- **Inference acceleration**
- **Broad application support**
---
## Key Highlights
1. **Strong General Capabilities**
- Outperforms similar-sized open-source models across **30+ public benchmarks**
- Domains: math reasoning, OCR, image-text understanding, multi-image comprehension, hallucination suppression
2. **Specialized Edge Capabilities**
- Enhanced **multilingual** and **GUI** understanding
- Maintains edge-relevant SOTA performance
3. **Broad Applicability**
- Four sizes: 0.6B, 1B, 2B, 4B
- *Instruct* vs *Thinking* variants for different complexity scenarios
4. **Excellent Edge Deployment**
- **Sparsification**
- **Quantization-aware training**
- **Codec acceleration**
5. **Rapid Edge Adaptation**
- **1+N LoRA architecture**
- **QALFT technology** for scenario-independent updates


---
## Model Architecture Overview
**Components:**
- **Vision Encoder (ViT)**
- **Multi-Layer Perceptron (MLP)**
- **Large Language Model (LLM)**
**Encoders:**
- 1B–4B: **AimV2-300M** (low-power, edge-friendly)
- 0.6B: **SigLIP-2-base** (lightweight for constrained deployments)
**Enhancements:**
- **2D-RoPE** for resolution flexibility
- **NaViT** for arbitrary input resolution
- Pixel-shuffle token compression for faster inference (sketched below)
**LLM Base:**
- **Qwen3 series**, *Instruct* & *Thinking* modes
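
The pixel-shuffle step noted above trades spatial resolution of the visual token grid for channel depth, cutting the number of tokens handed to the LLM. Below is a minimal sketch of a 2×2 merge, assuming the vision encoder emits a flat raster of tokens; the function name and merge factor are illustrative, not the exact AndesVL implementation.

```python
import torch

def pixel_shuffle_compress(tokens: torch.Tensor, grid_h: int, grid_w: int,
                           factor: int = 2) -> torch.Tensor:
    """Fold each factor x factor neighbourhood of visual tokens into one token.

    tokens: (batch, grid_h * grid_w, dim) from the vision encoder, row-major.
    Returns (batch, (grid_h // factor) * (grid_w // factor), dim * factor**2),
    i.e. 4x fewer tokens for factor=2, before the MLP projects to the LLM width.
    """
    b, n, d = tokens.shape
    assert n == grid_h * grid_w and grid_h % factor == 0 and grid_w % factor == 0
    # split the raster into factor x factor neighbourhoods
    x = tokens.view(b, grid_h // factor, factor, grid_w // factor, factor, d)
    # bring each neighbourhood together, then fold it into the channel dim
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (grid_h // factor) * (grid_w // factor), d * factor * factor)
```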
---
## Training Strategy
### 1. Pre-Training Stages
#### Vision–Language Alignment
- Low → high resolution fine-tuning (896×896 → 1792×1792)
- Data: captions, OCR, VQA
#### Joint Vision–Language Pre-Training
- **Full-parameter training** with low LR
- Expand context from **2K → 8K tokens**
- Random position replacement for image sequence optimization
- **Thinking versions** receive large reasoning datasets
#### Multi-Task Pre-Training
- ViT sequence length increased from 4,096 to 16,384
- LLM context length increased from 8K to 32K tokens (packing sketch below)
- Data includes grounding & GUI understanding datasets
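
The report only states that the context window grows from 2K to 8K and then to 32K tokens; it does not describe how samples are arranged into long rows. A common approach is to pack several short samples into one long-context row. The first-fit-decreasing sketch below is purely illustrative and is not AndesVL's data loader.

```python
from typing import List

def pack_samples(lengths: List[int], max_len: int = 32768) -> List[List[int]]:
    """Group sample indices so each group's total token count fits in one row."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    rows: List[List[int]] = []
    free: List[int] = []          # remaining space per row
    for i in order:
        if lengths[i] > max_len:  # over-long samples handled upstream (truncate/split)
            continue
        for r, space in enumerate(free):
            if lengths[i] <= space:
                rows[r].append(i)
                free[r] -= lengths[i]
                break
        else:
            rows.append([i])
            free.append(max_len - lengths[i])
    return rows

# Example: pack_samples([30000, 1500, 900, 600]) == [[0, 1, 2], [3]]
```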


---
### 2. Post-Training Stages
#### Supervised Fine-Tuning (SFT)
- Multi-modal data: captioning, VQA, summarization, code generation
- Filtering pipeline: modality filtering → task clustering → LLM scoring (sketch below)
- Result: 16M high-quality entries
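
A hypothetical sketch of the three-stage filter named above. The `Sample` fields, the `score_fn` judge, and the 0.7 threshold are illustrative stand-ins, not the actual AndesVL pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Sample:
    image_ok: bool   # image decoded and passed basic modality checks
    task: str        # e.g. "caption", "vqa", "summarization", "code"
    text: str        # instruction-response pair serialized as text

def filter_sft_data(samples: List[Sample],
                    score_fn: Callable[[str], float],  # LLM-as-judge score in [0, 1]
                    min_score: float = 0.7) -> List[Sample]:
    # 1) modality filtering: drop entries with broken or missing images
    valid = [s for s in samples if s.image_ok]
    # 2) task clustering: bucket by task so each cluster is curated separately
    clusters: Dict[str, List[Sample]] = {}
    for s in valid:
        clusters.setdefault(s.task, []).append(s)
    # 3) LLM scoring: keep only entries the judge rates above the threshold
    kept: List[Sample] = []
    for group in clusters.values():
        kept.extend(s for s in group if score_fn(s.text) >= min_score)
    return kept
```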

#### Mixed Preference Optimization (MPO)
- Addresses the challenges of applying DPO to multimodal models
- Dedicated data pipeline plus MMPR data to enhance reasoning
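
For context, the standard DPO-style preference term that MPO-style recipes build on is shown below; this is the textbook formulation, not necessarily the exact objective used for AndesVL:

$$
\mathcal{L}_{\text{pref}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)
$$

where $(y_w, y_l)$ are the preferred and rejected responses for prompt $x$, $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen reference model, and $\beta$ is a temperature.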

#### GRPO Training
- Data distillation to capture long reasoning chains
- "With/without reasoning" datasets enable mode switching
- Curriculum learning: easy → hard
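
GRPO estimates advantages by comparing each sampled response against the other responses generated for the same prompt, rather than relying on a learned value model. A minimal sketch of the standard group-relative advantage computation (not AndesVL's training code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled response.
    Each response's advantage is its reward normalized by the mean and standard
    deviation of its own group, as in standard GRPO."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```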
---
## On-Device Deployment Solutions
### Algorithm Optimization
- **Sparsification**: 75% sparsity at under 1.8 bits per weight (BPW); generic pruning sketch below
- Hardware-level compression support developed in collaboration with **MediaTek** for the **Dimensity 9500**
- Memory usage reduced by 30.9%, speed increased by 45%
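
The report pairs 75% sparsity with hardware support for compressed weights; the exact sparsification scheme is not reproduced here. As a generic reference point, the sketch below zeroes the smallest-magnitude 75% of a weight tensor.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.75) -> torch.Tensor:
    """Zero out the smallest-magnitude `sparsity` fraction of weights.
    Illustrative only; AndesVL's sparsification and its Dimensity 9500
    co-design are described in the tech report."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)
```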
### Quantization-Aware Training (QAT)
- Base-model QAT with mixed precision for weights and activations (fake-quantization sketch below)
- **QALFT** enables scenario-specific LoRAs to be trained and updated independently
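
QAT trains with quantization simulated in the forward pass so the weights adapt to the low-bit grid. Below is a minimal fake-quantization sketch with a straight-through estimator, assuming symmetric per-tensor weight quantization; the actual mixed-precision recipe is described in the tech report.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Simulate low-bit weights during training: round to the integer grid in
    the forward pass, but let gradients flow as if quantization were the
    identity (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()
```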

### Encoding & Decoding Acceleration
- **OKV compression**: retains only 25% of the KV cache at 128K context (generic eviction sketch below)
- **Speculative decoding**: up to 6.7× peak speedup
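
OKV keeps roughly a quarter of the KV cache at long context; its exact selection rule is given in the tech report. As a generic illustration only, the sketch below keeps the positions that have received the most accumulated attention mass per head.

```python
import torch

def evict_kv(keys: torch.Tensor, values: torch.Tensor,
             attn_mass: torch.Tensor, keep_ratio: float = 0.25):
    """Generic importance-based KV-cache eviction (not OKV's actual rule).

    keys, values: (num_heads, seq_len, head_dim)
    attn_mass:    (num_heads, seq_len) accumulated attention each position received
    """
    num_heads, seq_len, head_dim = keys.shape
    k = max(1, int(seq_len * keep_ratio))
    keep = attn_mass.topk(k, dim=-1).indices.sort(dim=-1).values  # preserve order
    idx = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)
```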
### 1+N LoRA Architecture
- Single base model + multiple pluggable LoRAs
- Enables **dynamic loading & targeted updates**
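
The 1+N design keeps one shared base model resident and swaps in small per-scenario LoRA adapters. A simplified sketch of a linear layer with pluggable adapters follows; the class name and the adapter-selection mechanism are illustrative, not the deployed runtime.

```python
import torch
import torch.nn as nn

class PluggableLoRALinear(nn.Module):
    """One frozen base projection shared by all scenarios, plus N small LoRA
    adapters that can be added, loaded, or swapped at runtime."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # base model stays frozen
        self.adapters = nn.ModuleDict()           # scenario name -> LoRA pair
        self.rank = rank
        self.active = None                        # currently selected scenario

    def add_adapter(self, name: str) -> None:
        down = nn.Linear(self.base.in_features, self.rank, bias=False)
        up = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(up.weight)                 # adapter starts as a no-op
        self.adapters[name] = nn.Sequential(down, up)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)
        if self.active is not None:
            y = y + self.adapters[self.active](x)  # add the scenario's delta
        return y
```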
---
## Evaluation Results

**Highlights:**
- 4B-Thinking: overall score of **70.9**, 3.2 points ahead of the next-best model
- Smaller models (0.6B, 1B, 2B) also top their size categories
- High performance across reasoning, multi-image, multilingual, hallucination suppression
---
## Domain-Specific Capabilities
### Mathematics & Logical Reasoning

- 4B-Thinking: Top composite score (**58.3**)
- Edge benefit: Strong chain-of-thought reasoning
### Visual–Text Understanding

- ChartQA score: **90.4** vs previous best 86.0
### Multi-Image Understanding

### General Q&A

### Hallucination Suppression

---
## Multilingual & UI Understanding
### Multilingual

- Top score of **64.9**, surpassing Ovis2-4B by 1.5 points
### UI

- Leading scores in ScreenSpot & **AndesUI**
---
## Device-Level Evaluation
### QAT Accuracy

### Cache Eviction

### Speculative Decoding

---
## AndesUI Benchmark Dataset
- Screenshots from **90 apps** (65 third-party, 25 native)
- **227,000 labeled UI elements**
- Two task tiers: basic (pointing/localization) and advanced (descriptions/Q&A)


---
## Model Output Examples

---
## Future Outlook
**Roadmap:**
- Optimized visual encoders
- Post-training enhancements
- Knowledge distillation for efficiency
- Tri-modal (text + vision + speech) integration
OPPO aims to deliver secure, fast, and intelligent mobile AI experiences — driving industry-wide innovation.
---