Just Now: DeepSeek Launches New Model, Small yet Powerful

DeepSeek Releases 3B DeepSeek-OCR Model

Moments ago, DeepSeek announced an open-source 3-billion parameter model — DeepSeek-OCR.

While 3B parameters might not sound massive, the innovative architecture behind it is remarkable.

---

Tackling the Long-Text Bottleneck

LLMs struggle with long sequences because computational complexity scales quadratically with sequence length.

More tokens = more compute cost.
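
For intuition, here is a rough back-of-envelope sketch (not from the paper) of how the attention term grows with sequence length; the hidden size and the decision to count only the two n×n matmuls are simplifying assumptions.

```python
# Rough illustration: how self-attention cost grows with token count.
# Only the QK^T and AV matmuls are counted; d_model is an arbitrary choice.

def attention_flops(n_tokens: int, d_model: int = 4096) -> float:
    """Approximate FLOPs for one self-attention layer (both n x n matmuls)."""
    return 2 * (n_tokens ** 2) * d_model

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_flops(n):.2e} FLOPs")
# 10x more tokens -> ~100x more attention compute
```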

DeepSeek’s key insight:

Large amounts of text can be embedded in far fewer tokens if the text is converted into an image.

They call this Optical Compression — compressing textual data by shifting it to a visual modality.
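
As a toy illustration of the idea (my sketch, not DeepSeek's pipeline), the snippet below renders text onto a fixed-size page image and counts the vision tokens it would cost; the 16-pixel patch size and the 16× compressor are assumptions borrowed from the architecture described later.

```python
# Toy sketch of optical compression: render text onto a page image and count
# the vision tokens it would cost. Patch size and the 16x compressor are assumptions.
from PIL import Image, ImageDraw

def render_text(text: str, side: int = 1024) -> Image.Image:
    img = Image.new("RGB", (side, side), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black")
    return img

text = "\n".join(["a long line of document text to be compressed optically"] * 60)
img = render_text(text)
patch = 16
raw_patches = (img.width // patch) * (img.height // patch)  # 4,096 for 1024x1024
vision_tokens = raw_patches // 16                           # after a 16x compressor
print(raw_patches, "patches ->", vision_tokens, "vision tokens")
```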


Why OCR?

OCR inherently performs visual → text conversion, making it an ideal test ground with measurable results.


---

Compression Rate & Accuracy

Highlights from the paper:

  • 10× compression rate while maintaining >97% OCR accuracy.
  • Even at 20× compression, accuracy is about 60%.

What this means:

100 visual tokens can replace 1,000 text tokens with minimal loss.
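
In budget terms (a quick back-of-envelope, assuming attention cost dominates), trading 1,000 text tokens for 100 vision tokens also shrinks the quadratic attention term by roughly 100×:

```python
# Back-of-envelope: what a 10x token compression buys.
text_tokens, ratio = 1_000, 10
vision_tokens = text_tokens // ratio                 # 100 visual tokens
print(vision_tokens, "tokens,", f"~{ratio ** 2}x smaller attention term")
```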

OmniDocBench Results

  • 100 tokens/page → beats GOT-OCR 2.0 (256 tokens/page)
  • <800 tokens/page → beats MinerU 2.0 (6,000 tokens/page avg.)

Production Scale Example:

  • 1 × A100-40G GPU → 200,000+ pages/day of training data
  • 160 × A100 GPUs (20 nodes) → 33 million pages/day
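(Sanity check on those numbers: 33 million pages across 160 GPUs works out to roughly 206,000 pages per GPU per day, consistent with the single-GPU figure above.)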

---

Core Architecture Overview

Two main components:

  • DeepEncoder — Image feature extraction + compression
  • DeepSeek3B-MoE Decoder — Text reconstruction

---

1. DeepEncoder — The Compression Engine

Architecture: SAM-base + CLIP-large in series.

  • SAM-base (80M params) → windowed attention for visual detail
  • CLIP-large (300M params) → global attention for holistic context

Key Feature: 16× Convolutional Compressor

  • Reduces token count drastically before global attention
  • e.g., a 1024×1024 image → 4,096 patch tokens → compressed 16× to 256 tokens (sketched below)
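
A minimal PyTorch sketch of that serial layout is below; it is illustrative only, with plain transformer layers standing in for the SAM and CLIP stages and placeholder dimensions rather than the released weights.

```python
# Minimal sketch of the DeepEncoder layout (not the actual SAM/CLIP weights):
# windowed-attention features -> 16x conv compressor -> global attention.
import torch
import torch.nn as nn

class ConvCompressor16x(nn.Module):
    """Two stride-2 convs: 4x downsampling per spatial dim = 16x fewer tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                         # x: (B, N, C), N = H*W patch tokens
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.net(x)                           # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)       # (B, N/16, C)

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16px patches
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.compressor = ConvCompressor16x(dim)
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, image):                                 # (B, 3, 1024, 1024)
        x = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 4096, C)
        x = self.local_attn(x)      # stands in for SAM-style windowed attention
        x = self.compressor(x)      # (B, 256, C): the 16x compression step
        return self.global_attn(x)  # stands in for CLIP-style global attention

tokens = DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)   # torch.Size([1, 256, 768])
```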

Multi-Resolution Modes:

  • Native Tiny / Small / Base / Large
  • Dynamic-res Gundam mode for maximum flexibility

---

2. DeepSeek-3B-MoE Decoder

  • 3B total parameters
  • Mixture-of-Experts (MoE):
      • 64 experts (6 activated per token) + 2 shared experts
      • ~570M active parameters per inference
  • Offers 3B-level representational power at roughly 500M-level inference cost

Role:

Rebuild original text from compressed visual tokens — learned effectively via OCR training.
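
Below is a minimal sketch of the routing idea (my illustration, not DeepSeek's implementation): a router picks the top-6 of 64 experts per token, while 2 shared experts always fire. The dimensions are deliberately small; the real decoder is far larger, which is how ~570M of the 3B parameters end up active per token.

```python
# Minimal MoE routing sketch: 64 routed experts, top-6 per token, 2 shared experts.
import torch
import torch.nn as nn

class MoELayerSketch(nn.Module):
    def __init__(self, dim=256, hidden=1024, n_experts=64, top_k=6, n_shared=2):
        super().__init__()
        make_expert = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                            nn.Linear(hidden, dim))
        self.experts = nn.ModuleList(make_expert() for _ in range(n_experts))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.router = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = sum(e(x) for e in self.shared)            # shared experts see every token
        for t in range(x.size(0)):                      # naive per-token dispatch
            for w, i in zip(weights[t], idx[t]):
                out[t] += w * self.experts[i](x[t])
        return out

y = MoELayerSketch()(torch.randn(4, 256))
print(y.shape)   # torch.Size([4, 256])
```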

---

Data Scale

Collected 30M pages in ~100 languages:

  • 25M Chinese & English

Types of data:

  • Coarse annotation: extracted with fitz for lower-resource language training (see the sketch below)
  • Fine annotation: generated via PP-DocLayout, MinerU, and GOT-OCR 2.0
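
A coarse-annotation pass with fitz (PyMuPDF) could look roughly like the sketch below; the file path is a placeholder and the pairing logic is my assumption about the pipeline, not DeepSeek's actual code.

```python
# Coarse-annotation sketch with fitz (PyMuPDF): render each page to an image
# and pair it with the extracted plain text. The file path is a placeholder.
import fitz  # PyMuPDF

doc = fitz.open("sample.pdf")
pairs = []
for page in doc:
    pixmap = page.get_pixmap(dpi=150)   # page rendered as an image
    text = page.get_text()              # rough, layout-free ground truth
    pairs.append((pixmap.tobytes("png"), text))
print(len(pairs), "page image/text pairs")
```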

Model Flywheel strategy for minority languages:

  • Cross-lingual layout model for detection
  • Train GOT-OCR 2.0 on fitz-generated data
  • Use trained model to label more data
  • Repeat → 600K samples achieved

Other data collected:

  • 3M Word docs → improved formula & table parsing
  • Scene OCR: 10M Chinese + 10M English samples (from LAION, Wukong via PaddleOCR)

---

Beyond Text Recognition — Deep Parsing

DeepSeek-OCR can extract structured data from complex imagery:

  • Charts → structured datasets
  • Chemical diagrams → SMILES format
  • Geometric figures → duplication & structural analysis
  • Natural images → dense captions

This opens doors for STEM fields requiring symbolic + graphical parsing.

---

Optical Compression Inspired by Human Memory

Proposed experimental approach:

  • Render conversation history older than the last k turns as images
  • Stage 1 compression: ~10× fewer tokens
  • Further reduce resolution for distant history
  • Information fades like human memory decay — less detail in older context

Goal: Support infinite context by balancing fidelity + compute cost.
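
A sketch of how such a scheme might be wired up is shown below, under explicit assumptions (Pillow for rendering, an arbitrary halving-per-age resolution schedule); it illustrates the proposal rather than reproducing any implementation from the paper.

```python
# Sketch of the memory-decay idea: keep the last k turns as text, render older
# turns as images whose resolution shrinks with age (fewer vision tokens).
from PIL import Image, ImageDraw

def render_turn(text: str, side: int) -> Image.Image:
    img = Image.new("RGB", (side, side), "white")
    ImageDraw.Draw(img).multiline_text((8, 8), text, fill="black")
    return img

def compress_history(turns, k=4, base_side=1024, decay=0.5, min_side=256):
    """Keep the last k turns as text; render older ones at decaying resolution."""
    recent, older = turns[-k:], turns[:-k]
    images = []
    for age, turn in enumerate(reversed(older), start=1):
        side = max(min_side, int(base_side * decay ** age))
        images.append(render_turn(turn, side))
    return recent, images

recent, images = compress_history([f"turn {i}" for i in range(10)])
print(len(recent), "text turns,", [im.size for im in images])
```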

---

More Than OCR — A Visual Compression Engine

DeepSeek-OCR is essentially a test of whether the visual modality can serve as a compression medium for text fed to LLMs.

Early results: 7–20× token compression.


Future directions:

  • Alternating digital ↔ optical text pretraining
  • Long-context stress tests ("needle in a haystack")

---

Resources

  • GitHub: http://github.com/deepseek-ai/DeepSeek-OCR
  • Paper: https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
  • Model Download: https://huggingface.co/deepseek-ai/DeepSeek-OCR
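
For completeness, here is a minimal sketch of pulling the released weights via the transformers AutoModel API; the actual inference entry point is defined by the repository's custom code, so take the exact OCR call from the README rather than from this snippet.

```python
# Loading the released weights from Hugging Face (loading only; see the repo
# README for the model's own inference call).
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
```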

---

AI Ecosystem & Monetization: AiToEarn Example

Creators can explore AiToEarn — a global open-source platform enabling:

  • AI generation → cross-platform publishing → analytics → model ranking

Supported platforms:

Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X (Twitter)

---
