PaddleOCR-VL with Just 0.9B Parameters — Currently the Strongest OCR Model

🚀 The OCR Track Is Experiencing a True Renaissance

Introduction

Over the past few days, OCR (Optical Character Recognition) has become one of the hottest topics in AI, thanks largely to DeepSeek-OCR. The field is enjoying a real renaissance and drawing widespread attention.

On Hugging Face’s Trending Models board:

  • Three of the top four models are OCR-related.
  • Even Qwen3-VL-8B handles OCR tasks well, so the current top of the board is decidedly OCR-heavy.

Following my last DeepSeek-OCR post, many readers asked me to compare it with PaddleOCR-VL from Baidu. So… here’s a detailed look at PaddleOCR-VL.

---

Why Talk About PaddleOCR-VL?

I’m usually cautious when writing about Baidu products — but PaddleOCR-VL is genuinely impressive.

The original PaddleOCR:

  • First released in 2020
  • Fully open source from the start
  • Continuously improved for 5+ years
  • Now boasting 60K GitHub stars — possibly the most-starred OCR repo worldwide.

The newly released PaddleOCR-VL marks the first time Baidu has integrated a large model at the core of document parsing.

Despite having only 0.9B parameters, it achieves state-of-the-art (SOTA) results on nearly all sub-tasks of the OmniDocBench v1.5 benchmark.

---

Benchmark Performance

Categories compared:

  • Traditional multi-stage OCR pipelines
  • General-purpose multimodal LLMs
  • Task-specific vision-language models for document parsing

Highlights:

  • Smallest parameter size
  • Highest scores
  • Latest results:
      • PaddleOCR-VL: 92.56 overall
      • DeepSeek-OCR: 86.46 overall
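
To put the two overall scores in perspective, the gap can be worked out directly. This is a minimal sketch that uses only the two figures quoted above; the variable names are illustrative.

```python
from decimal import Decimal

# Overall OmniDocBench v1.5 scores as quoted in this post.
scores = {
    "PaddleOCR-VL": Decimal("92.56"),
    "DeepSeek-OCR": Decimal("86.46"),
}

# Absolute gap in benchmark points.
gap = scores["PaddleOCR-VL"] - scores["DeepSeek-OCR"]

# Gap relative to the DeepSeek-OCR score, in percent, rounded to one decimal.
relative = (gap / scores["DeepSeek-OCR"] * 100).quantize(Decimal("0.1"))

print(f"Absolute gap: {gap} points")
print(f"Relative gap: {relative}%")
```

That works out to a lead of 6.10 points, or roughly 7% relative to the DeepSeek-OCR score.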

---

How Does a 0.9B Model Beat Larger Ones?

Modular Two-Step Approach

Unlike many end-to-end multimodal models, PaddleOCR-VL uses a divide-and-conquer method:

Step 1 — Layout Analysis

  • Uses the PP-DocLayoutV2 model
  • Identifies and boxes distinct regions: titles, body text, tables, formulas, etc.
  • Establishes the natural reading order
  • Runs very fast and does not need a large model
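
The output of a layout step like this can be pictured as a list of labeled boxes sorted into reading order. The sketch below is a toy illustration only: the `Region` type, the `reading_order` function, and the column-then-top-to-bottom heuristic are assumptions made for this example, not how PP-DocLayoutV2 works internally.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A detected layout region (hypothetical type for this sketch)."""
    label: str  # e.g. "title", "text", "table", "formula"
    x0: int     # left edge in pixels
    y0: int     # top edge in pixels
    x1: int     # right edge in pixels
    y1: int     # bottom edge in pixels


def reading_order(regions: list[Region], page_width: int, columns: int = 2) -> list[Region]:
    """Order regions column by column, top to bottom within each column.

    Naive heuristic: a region belongs to the column that contains its
    horizontal center. Regions wider than one column (such as a page-wide
    title) are assigned to the first column.
    """
    col_width = page_width / columns

    def key(r: Region) -> tuple[int, int, int]:
        spans_columns = (r.x1 - r.x0) > col_width
        center_x = (r.x0 + r.x1) / 2
        col = 0 if spans_columns else min(int(center_x // col_width), columns - 1)
        return (col, r.y0, r.x0)

    return sorted(regions, key=key)


# A made-up two-column page, listed in scrambled order.
page = [
    Region("text", 520, 100, 960, 400),     # right column, top
    Region("title", 40, 20, 960, 80),       # page-wide title
    Region("table", 40, 420, 480, 700),     # left column, lower
    Region("text", 40, 100, 480, 400),      # left column, top
    Region("formula", 520, 420, 960, 500),  # right column, lower
]
ordered = reading_order(page, page_width=1000)
print([r.label for r in ordered])
```

A fixed rule like this breaks quickly on irregular pages (a wide figure in the middle of a column, for instance), which is one reason to hand reading order to a dedicated layout model.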

Step 2 — Region OCR

  • The main PaddleOCR-VL (0.9B) model processes the cropped images from Step 1
  • Handles small segments, converting each to a suitable format:
      • Tables → Markdown
      • Formulas → LaTeX
  • Maintains high accuracy without a massive parameter count
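
One way to picture Step 2 is a dispatcher that picks an output format per region type before calling the recognizer. In the sketch below, `recognize_region` is a hypothetical stand-in that returns canned strings; it is not PaddleOCR-VL's API. The format table encodes only the two mappings mentioned above, plus a plain-text default that is an assumption of this example.

```python
from typing import Callable

# Output format per region label. Only the table and formula mappings come
# from the description above; every other label falls back to plain text.
OUTPUT_FORMAT = {"table": "markdown", "formula": "latex"}


def recognize_region(label: str, output_format: str) -> str:
    """Hypothetical recognizer stub; a real system would run the model on a crop."""
    canned = {
        "markdown": "| a | b |\n| --- | --- |\n| 1 | 2 |",
        "latex": r"E = mc^2",
        "text": "Plain paragraph text.",
    }
    return canned[output_format]


def parse_regions(
    labels: list[str],
    recognizer: Callable[[str, str], str] = recognize_region,
) -> list[str]:
    """Run the recognizer on each region in order, choosing the format by label."""
    results = []
    for label in labels:
        fmt = OUTPUT_FORMAT.get(label, "text")
        content = recognizer(label, fmt)
        # Wrap formulas in display-math delimiters so the joined output stays readable.
        results.append(f"$${content}$$" if fmt == "latex" else content)
    return results


# Labels are assumed to arrive already sorted in reading order from Step 1.
results = parse_regions(["text", "formula", "table"])
document = "\n\n".join(results)
print(document)
```

Because each call sees only one small crop with one expected format, the recognizer's job stays narrow, which is the point of the two-step split.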

This design reduces complexity, improves speed, and minimizes hallucination risks.

---

Real-World Testing

1. Scanned PDFs

Even very blurry documents are successfully segmented and recognized.

Formulas and text are both extracted flawlessly.

---

2. Handwritten Notes

Handles both Chinese and English handwriting — as long as it’s legible.

---

3. Dense Layouts & Newspapers

Multi-column layouts preserved; reading order correct; recognition nearly perfect.

---

4. Charts & Diagrams

Supports end-to-end chart parsing and can reconstruct the content of visual charts.

---

5. Invoices & Receipts

Reliable in semi-structured data extraction — one of the most trustworthy OCR models in its category.

---

6. Complex Tables

Accurately recovers table structures, cell contents, and relationships — ideal for automated info extraction.
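
Because tables come back as Markdown, downstream extraction can be plain string processing. This is a minimal sketch, assuming a simple pipe table with a header row and a separator row. `parse_markdown_table` is a helper written for this example, and the sample table is made up rather than real model output.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Convert a simple Markdown pipe table into a list of row dicts.

    Assumes the first non-empty line is the header and the second is the
    separator (| --- | --- |). Escaped pipes inside cells are not handled.
    """
    def split_row(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in markdown.splitlines() if ln.strip()]
    if len(lines) < 2:
        return []
    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # skip the header and separator lines
        rows.append(dict(zip(header, split_row(line))))
    return rows


sample = """
| Item | Qty | Price |
| --- | --- | --- |
| Paper | 2 | 4.50 |
| Ink | 1 | 12.00 |
"""
rows = parse_markdown_table(sample)
total = sum(int(r["Qty"]) * float(r["Price"]) for r in rows)
print(rows[0]["Item"], total)
```

Merged cells and nested headers have no direct Markdown equivalent, so complex tables would need a richer parser than this.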

---

7. Platform Data Extraction

Fits seamlessly into multi-dimensional spreadsheet workflows, outperforming more expensive multimodal models in cost-effectiveness.

---

Deployment & Demos

PaddleOCR-VL is open source:

🔗 GitHub Repo

Official demos:

---

Final Thoughts

  • DeepSeek-OCR: innovative, experimental, pushing boundaries with contextual optical compression.
  • PaddleOCR-VL: pragmatic, task-optimized, delivers SOTA in a very specific OCR domain.

If your goal is accurate, efficient document OCR, PaddleOCR-VL is a top contender, and it fits well into workflows that integrate with publishing and analytics platforms such as AiToEarn (official website).

---

👍 If you found this useful, don't forget to like, share, and star — see you next time!
