Open Source! High Performance, Strong Results, Strict Privacy — OPPO's On-Device Large Model in Practice

# AndesVL: Next-Generation On-Device Multimodal Large Model

## Introduction

Multimodal large models running directly on devices often suffer from **insufficient performance**, **limited capabilities**, and **poor adaptability** — making it challenging to meet **high-performance**, **strong privacy**, and **low-latency** demands in edge AI applications. These issues create a bottleneck in the evolution of AI smartphones.

The **OPPO AI Center** has introduced **AndesVL**, an **open-source**, fully end-to-end adapted **on-device multimodal large model**. AndesVL merges **state-of-the-art multimodal understanding & reasoning** with **edge-specific optimizations**.

### Key Facts
- **Parameter sizes**: 0.6B, 1B, 2B, 4B
- Supports **flexible deployment in multiple scenarios**
- Equipped with **GUI and multilingual capabilities**
- **Fully open source**

### Performance Metrics
- **Up to 6.7× peak decoding speedup**
- **Compression to under 1.8 bits per weight (BPW)**
- **SOTA results on 30+ benchmarks** among similar-sized models

### Resources
- [Tech Report](https://arxiv.org/pdf/2510.11496)
- [Hugging Face Models](https://huggingface.co/OPPOer)
- [GitHub Evaluation Toolkit](https://github.com/OPPO-Mente-Lab/AndesVL_Evaluation)

---

## Technical Background

Cloud-based multimodal models have achieved **great success**, but edge deployment faces distinct challenges:
- **High performance** expected from resource-constrained hardware
- Strict **privacy** requirements that favor on-device processing
- **Low latency** demands that rule out a network round trip

**AndesVL** is **China's first** fully end-to-end on-device multimodal large model series, optimized for smartphone deployment via:
- **Rapid deployment**
- **Inference acceleration**
- **Broad application support**

---

## Key Highlights

1. **Strong General Capabilities**
   - Outperforms similar-sized open-source models across **30+ public benchmarks**
   - Domains: math reasoning, OCR, image-text understanding, multi-image comprehension, hallucination suppression

2. **Specialized Edge Capabilities**
   - Enhanced **multilingual** and **GUI** understanding
   - Maintains edge-relevant SOTA performance

3. **Broad Applicability**
   - Four sizes: 0.6B, 1B, 2B, 4B
   - *Instruct* vs *Thinking* variants for different complexity scenarios

4. **Excellent Edge Deployment**
   - **Sparsification**
   - **Quantization-aware training**
   - **Codec acceleration**

5. **Rapid Edge Adaptation**
   - **1+N LoRA architecture**
   - **QALFT** (quantization-aware LoRA fine-tuning) for scenario-independent updates

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-278.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-261.jpg)

---

## Model Architecture Overview

**Components:**
- **Vision Encoder (ViT)**
- **Multi-Layer Perceptron (MLP)**
- **Large Language Model (LLM)**

**Encoders:**
- 1B–4B: **AimV2-300M** (low-power, edge-friendly)
- 0.6B: **SigLIP-2-base** (lightweight for constrained deployments)

**Enhancements:**
- **2D-RoPE** positional encoding for resolution flexibility
- **NaViT**-style support for arbitrary (native) input resolutions
- **Pixel-shuffle** token compression for faster inference (see the connector sketch below)

**LLM Base:**
- **Qwen3 series**, *Instruct* & *Thinking* modes
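
Concretely, the three components chain together as ViT → pixel-shuffle → MLP projector → LLM. Below is a minimal sketch of that connector path; the dimensions, shuffle factor, and all names are illustrative assumptions, not AndesVL's actual implementation.

```python
import torch
import torch.nn as nn

class VisionConnector(nn.Module):
    """Illustrative ViT -> pixel-shuffle -> MLP -> LLM connector.
    Dimensions and names are assumptions, not AndesVL's actual code."""
    def __init__(self, vit_dim=1024, llm_dim=2048, shuffle=2):
        super().__init__()
        self.s = shuffle  # each s x s patch neighborhood becomes one token
        self.proj = nn.Sequential(
            nn.Linear(vit_dim * shuffle * shuffle, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_tokens, h, w):
        # vit_tokens: (batch, h*w, vit_dim) patch features from the encoder
        b, _, d = vit_tokens.shape
        s = self.s
        # pixel shuffle: fold each s x s neighborhood into the channel dim,
        # cutting the visual token count by s*s
        x = vit_tokens.view(b, h // s, s, w // s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // s) * (w // s), s * s * d)
        return self.proj(x)  # (batch, h*w / s^2, llm_dim), fed to the LLM
```

With a shuffle factor of 2, a 32×32 patch grid (1,024 tokens) collapses to 256 visual tokens before reaching the LLM, which is the main lever pixel-shuffle offers for faster prefill.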

---

## Training Strategy

### 1. Pre-Training Stages
#### Vision–Language Alignment
- Low → high resolution fine-tuning (896×896 → 1792×1792)
- Data: captions, OCR, VQA

#### Joint Vision–Language Pre-Training
- **Full-parameter training** at a low learning rate
- Context expanded from **2K → 8K tokens**
- Random position replacement to optimize image-token sequences
- *Thinking* variants additionally trained on large-scale reasoning data

#### Multi-Task Pre-Training
- ViT sequence length increased: 4,096 → 16,384 tokens
- LLM context length increased: 8K → 32K tokens
- Data includes grounding & GUI understanding datasets

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-210.jpg)
![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-193.jpg)

---

### 2. Post-Training Stages

#### Supervised Fine-Tuning (SFT)
- Multi-modal data: captioning, VQA, summarization, code generation
- Filtering pipeline: modality filtering → task clustering → LLM scoring
- Result: 16M high-quality entries
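
A hedged sketch of what such a three-stage filter could look like is below; the stage order follows the bullet above, but `score_fn`, the threshold, and all field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    has_image: bool
    task: str = "general"

def filter_sft_pool(pool, score_fn, threshold=7.0):
    """Hypothetical three-stage SFT data filter:
    modality filtering -> task clustering -> LLM scoring."""
    # 1) Modality filtering: drop text-only or empty samples
    pool = [s for s in pool if s.has_image and s.text.strip()]
    # 2) Task clustering: bucket by task so coverage stays balanced
    buckets = {}
    for s in pool:
        buckets.setdefault(s.task, []).append(s)
    # 3) LLM scoring: a judge model rates each sample, e.g. on a 0-10 scale
    return [s for samples in buckets.values() for s in samples
            if score_fn(s) >= threshold]
```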

![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-170.jpg)

#### Mixed Preference Optimization (MPO)
- Mitigates the known failure modes of vanilla DPO on multimodal data
- Automated preference-data pipeline plus MMPR data to strengthen reasoning
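
For intuition, the sketch below combines the three loss terms commonly used in mixed preference optimization: a DPO-style relative preference term, a BCO-style absolute-quality term, and a plain SFT generation term. The weights and exact decomposition are assumptions, not a reproduction of AndesVL's recipe.

```python
import torch
import torch.nn.functional as F

def mpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
             sft_nll, beta=0.1, w_pref=0.8, w_qual=0.1, w_gen=0.1):
    """Sketch of a mixed preference objective over per-response log-probs
    from the policy (logp_*) and a frozen reference model (ref_*)."""
    # Relative preference (DPO): prefer chosen over rejected responses
    pref = -F.logsigmoid(beta * ((logp_chosen - ref_chosen)
                                 - (logp_rejected - ref_rejected))).mean()
    # Absolute quality (BCO-style): push chosen up and rejected down, independently
    qual = (-F.logsigmoid(beta * (logp_chosen - ref_chosen))
            - F.logsigmoid(-beta * (logp_rejected - ref_rejected))).mean()
    # Generation term: ordinary negative log-likelihood on the chosen response
    return w_pref * pref + w_qual * qual + w_gen * sft_nll
```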

![image](https://blog.aitoearn.ai/content/images/2025/10/img_008-150.jpg)

#### GRPO Training
- Data distillation to capture long reasoning chains
- "With/without reasoning" datasets enable mode switching
- Curriculum learning: easy → hard
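
GRPO's core trick is replacing a learned value model with group-relative reward normalization; a minimal version of that advantage computation is sketched below (curriculum ordering of prompts happens upstream of this step).

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages, the core of GRPO: each sampled response
    is scored against its own group's statistics, so no value model is needed.
    rewards: (num_prompts, group_size) scalar rewards for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# e.g. 4 sampled answers to one prompt, rewarded by an exactness checker:
# grpo_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]])) -> roughly [+0.87, -0.87, +0.87, -0.87]
```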

---

## On-Device Deployment Solutions

### Algorithm Optimization
- **Sparsification**: 75% sparsity, BPW < 1.8
- Collaboration with **MediaTek Dimensity 9500** for hardware-level compression
- Result: memory footprint down 30.9%, inference speed up 45%
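
As a generic stand-in for the pruning step (AndesVL's scheme is co-designed with the Dimensity NPU and not public in code form), here is plain unstructured magnitude pruning to 75% sparsity; the sub-1.8 BPW figure presumably reflects sparsity and low-bit quantization combined.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.75):
    """Illustrative unstructured magnitude pruning to a target sparsity."""
    k = int(weight.numel() * sparsity)
    # Zero out the k smallest-magnitude weights
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask, mask
```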

### Quantization-Aware Training (QAT)
- Base model QAT: mixed precision for weights/activations
- **QALFT** lets each scenario's LoRA be fine-tuned independently against the quantized base
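
The standard building block behind QAT is fake quantization with a straight-through estimator: the forward pass sees quantized values while gradients flow as if quantization were the identity. A minimal per-tensor version follows; the bit width and scaling scheme are illustrative.

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Fake quantization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, x, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max() / qmax + 1e-8
        # Round to the low-bit grid, then rescale back to float
        return torch.clamp((x / scale).round(), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: pass gradients as if quantization were identity
        return grad_output, None

# During QAT the forward pass uses FakeQuant.apply(w) in place of w,
# so the network learns weights that survive low-bit rounding.
```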

![image](https://blog.aitoearn.ai/content/images/2025/10/img_009-132.jpg)

### Encoding & Decoding Acceleration
- **OKV compression**: retain only 25% of KV cache for 128K context
- **Speculative decoding**: 6.7× peak speedup
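
Of the two, speculative decoding is the easier to show compactly: a small draft model proposes a few tokens and the full model verifies them in a single forward pass, so accepted tokens cost roughly one target step each. The greedy sketch below assumes HuggingFace-style causal LMs whose outputs expose `.logits`; everything else is illustrative, and a real implementation would also reuse KV caches.

```python
import torch

@torch.no_grad()
def speculative_decode(target, draft, ids, k=4, rounds=16):
    """Greedy speculative decoding sketch. `target` and `draft` are assumed
    to return `.logits` of shape (1, seq_len, vocab); k drafted tokens are
    verified by one target forward pass per round."""
    for _ in range(rounds):
        n = ids.shape[1]
        # 1) Cheap draft model proposes k tokens autoregressively
        proposal = ids
        for _ in range(k):
            nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
            proposal = torch.cat([proposal, nxt], dim=-1)
        # 2) One target pass scores all k drafted positions at once
        tgt = target(proposal).logits[:, n - 1:n + k - 1].argmax(-1)
        # 3) Accept the longest prefix where draft and target agree,
        #    then take the target's own token at the first mismatch
        agree = (tgt == proposal[:, n:]).long().cumprod(-1).sum().item()
        ids = torch.cat([proposal[:, :n + agree],
                         tgt[:, agree:agree + 1]], dim=-1)
    return ids
```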

### 1+N LoRA Architecture
- Single base model + multiple pluggable LoRAs
- Enables **dynamic loading & targeted updates**
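
A minimal sketch of the "1 + N" idea: one frozen base projection plus a registry of pluggable, per-scenario low-rank adapters that can be loaded or swapped at runtime. All class and adapter names here are illustrative.

```python
import torch
import torch.nn as nn

class PluggableLoRALinear(nn.Module):
    """One frozen base layer + N swappable per-scenario LoRA adapters."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base.requires_grad_(False)  # shared frozen base weights
        self.rank, self.scale = rank, alpha / rank
        self.adapters = nn.ModuleDict()         # N pluggable scenario adapters
        self.active = None

    def add_adapter(self, name):
        inf, outf = self.base.in_features, self.base.out_features
        self.adapters[name] = nn.ModuleDict({
            "A": nn.Linear(inf, self.rank, bias=False),
            "B": nn.Linear(self.rank, outf, bias=False),
        })
        nn.init.zeros_(self.adapters[name]["B"].weight)  # new adapter starts as a no-op

    def forward(self, x):
        y = self.base(x)
        if self.active:  # dynamic loading: only the active scenario adds its delta
            a = self.adapters[self.active]
            y = y + self.scale * a["B"](a["A"](x))
        return y
```

Because the base weights never change, shipping a new scenario means shipping only its small A/B matrices, and QALFT (above) ensures each adapter is trained against the quantized base it will actually run on.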

---

## Evaluation Results

![image](https://blog.aitoearn.ai/content/images/2025/10/img_010-121.jpg)

**Highlights:**
- 4B-Thinking: **70.9** score, +3.2 over next best
- Smaller models (0.6B, 1B, 2B) also top their size categories
- High performance across reasoning, multi-image, multilingual, hallucination suppression

---

## Domain-Specific Capabilities

### Mathematics & Logical Reasoning
![image](https://blog.aitoearn.ai/content/images/2025/10/img_011-109.jpg)

- 4B-Thinking: Top composite score (**58.3**)
- Edge benefit: Strong chain-of-thought reasoning

### Visual–Text Understanding
![image](https://blog.aitoearn.ai/content/images/2025/10/img_012-94.jpg)
- ChartQA score: **90.4** vs previous best 86.0

### Multi-Image Understanding
![image](https://blog.aitoearn.ai/content/images/2025/10/img_013-89.jpg)

### General Q&A
![image](https://blog.aitoearn.ai/content/images/2025/10/img_014-79.jpg)

### Hallucination Suppression
![image](https://blog.aitoearn.ai/content/images/2025/10/img_015-72.jpg)

---

## Multilingual & UI Understanding

### Multilingual
![image](https://blog.aitoearn.ai/content/images/2025/10/img_016-67.jpg)
- Top score: **64.9**, surpasses Ovis2-4B (+1.5)

### UI
![image](https://blog.aitoearn.ai/content/images/2025/10/img_017-67.jpg)
- Leading scores in ScreenSpot & **AndesUI**

---

## Device-Level Evaluation

### QAT Accuracy
![image](https://blog.aitoearn.ai/content/images/2025/10/img_018-63.jpg)

### Cache Eviction
![image](https://blog.aitoearn.ai/content/images/2025/10/img_019-55.jpg)

### Speculative Decoding
![image](https://blog.aitoearn.ai/content/images/2025/10/img_020-45.jpg)

---

## AndesUI Benchmark Dataset

- Screenshots from **90 apps** (65 third-party, 25 native)
- **227,000 labeled UI elements**
- Two tiers: basic (pointing/location) + advanced (descriptions/Q&A)

![image](https://blog.aitoearn.ai/content/images/2025/10/img_021-41.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_022-37.jpg)
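
To make the two tiers concrete, a single annotation might look like the record below; the field names are hypothetical, not the dataset's published schema.

```python
# Hypothetical shape of one AndesUI annotation (illustrative fields only)
ui_element = {
    "app": "settings",               # one of the 90 source apps
    "screenshot": "screens/000123.png",
    "bbox": [412, 880, 668, 944],    # element location in pixels
    "type": "button",
    "text": "Wi-Fi",
    "tier": "basic",                 # basic tier: pointing / location
    "qa": None,                      # advanced tier adds descriptions / Q&A
}
```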

---

## Model Output Examples

![image](https://blog.aitoearn.ai/content/images/2025/10/img_023-33.jpg)

---

## Future Outlook

**Roadmap:**
- Optimized visual encoders
- Post-training enhancements
- Knowledge distillation for efficiency
- Tri-modal (text + vision + speech) integration


OPPO aims to deliver secure, fast, and intelligent mobile AI experiences — driving industry-wide innovation.

---


By Honghao Wang