Zhipu Was Just Unlucky — Visual Token Research Collided with DeepSeek

Glyph & DeepSeek-OCR: The Visual Token Showdown

It’s quite the coincidence — Zhipu and DeepSeek have crossed paths again.

Just one day after DeepSeek-OCR debuted, Zhipu open-sourced its own visual token framework: Glyph.


And on the same stage? Enter Karpathy, who’s been showering DeepSeek with likes the past few days:

> Maybe you’ll be interested in our work too.


Publishing papers is one thing, but why does this feel like competing for affection? 🐶

Netizens joked: The AI world now has its own “CEO romance drama.”


---

The Context Length Problem

Like DeepSeek-OCR, Zhipu’s Glyph seeks to tackle overly long LLM contexts — but does so visually.

As LLM capabilities climb, long context needs rise sharply.

Whether for:

  • Long document analysis
  • Code reviews
  • Complex, multi-round dialogues

…models can’t afford goldfish memory. They need large, stable working memory.

Why Extending Context Is Difficult

Extending context length is costly:

  • Doubling context length from 50K → 100K roughly quadruples compute, because self-attention cost grows with the square of sequence length (see the sketch after this list).
  • More tokens = more activations, caches, and attention weights = higher training & inference bills.
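
A back-of-the-envelope sketch of that scaling in Python; the width `d_model = 4096` is an illustrative assumption, not a figure from the article:

```python
# Self-attention cost grows with the square of sequence length: the QK^T
# score matrix and the weighted sum over V each touch roughly
# seq_len^2 * d_model multiply-adds.
def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    return 4 * seq_len ** 2 * d_model  # ~2 FLOPs per multiply-add, 2 matmuls

short_ctx, long_ctx = 50_000, 100_000
ratio = attention_flops(long_ctx) / attention_flops(short_ctx)
print(f"50K -> 100K tokens: {ratio:.0f}x the attention FLOPs")  # 4x
```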

And even then — you may not get better performance.

IBM research highlights that more tokens ≠ linear improvements — longer, noisier inputs may cause overload and reduce accuracy.

---

Mainstream Approaches to Long Context

1. Extending Positional Encoding

  • Transformer models don’t natively grasp token order.
  • Positional encoding stretches input range, e.g., from 0–32K → 0–100K.
  • Limitations: the model still processes every token, it was never trained on such enormous ranges, and the gains are limited.
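
One well-known technique in this family is position interpolation for RoPE-based models: rather than extrapolating to positions the model never saw, rescale long-sequence positions back into the trained range. A minimal NumPy sketch with illustrative dimensions:

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int = 64, base: float = 10000.0) -> np.ndarray:
    """Rotary-embedding angles for each (position, frequency) pair."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, freqs)

trained_len, target_len = 32_000, 100_000
positions = np.arange(target_len)

# Position interpolation: squeeze 0..100K into the 0..32K range the model
# was actually trained on, instead of extrapolating to unseen positions.
scaled = positions * (trained_len / target_len)
angles = rope_angles(scaled)
print(angles.shape)  # (100000, 32): every position maps inside the trained range
```

The trade-off is resolution: nearby positions become harder to tell apart, which is why interpolated models typically still need some fine-tuning.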

2. Attention Mechanism Optimisation

  • Use sparse/linear attention for efficiency.
  • Limitations: efficiency gains alone still can't make inputs of hundreds of thousands of tokens practical.

3. Retrieval-Augmented Generation (RAG)

  • Retrieve only relevant content to feed model.
  • Limitations: retrieval can miss relevant content and is less reliable than responses the model was trained to produce; the extra retrieval step also slows responses.

---

Glyph’s Approach: Images as Memory

Glyph’s philosophy:

If raw text lacks information density, render it as an image.

Why This Matters

  • Text → split into tokens → processed sequentially → low efficiency
  • Visual tokens pack dense information in fewer units
  • Lets fixed-context VLMs handle much longer texts without complex tricks (see the sketch below)
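
A hypothetical Pillow sketch of the idea: text rendered onto a fixed-size page becomes a fixed number of ViT-style patches, regardless of how many words it carries. The page size, font, and 16×16-patch assumption are mine, not Glyph's actual rendering pipeline:

```python
from PIL import Image, ImageDraw, ImageFont

def render_page(text: str, size: int = 1024, margin: int = 32) -> Image.Image:
    """Render text onto a white page with naive word wrapping."""
    page = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()  # a real TTF would allow denser layouts
    lines, line = [], ""
    for word in text.split():
        candidate = f"{line} {word}".strip()
        if draw.textlength(candidate, font=font) > size - 2 * margin:
            lines.append(line)
            line = word
        else:
            line = candidate
    lines.append(line)
    draw.multiline_text((margin, margin), "\n".join(lines), fill="black", font=font)
    return page

page = render_page("It is a truth universally acknowledged... " * 40)
patches = (1024 // 16) ** 2  # a ViT with 16x16 patches sees 4,096 visual tokens
print(f"one page = {patches} visual tokens, however much text it carries")
```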

---

Example: Compressing Jane Eyre

  • Jane Eyre ≈ 240K text tokens
  • Traditional LLM (context window: 128K) can process ≈ half.
  • Glyph rendering → ≈ 80K visual tokens
  • VLM with 128K context can “see” and process the entire book.

This enables broader plot understanding and global question answering.
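
The arithmetic behind the example, using the article's own numbers:

```python
text_tokens, visual_tokens, window = 240_000, 80_000, 128_000

print(f"compression: {text_tokens / visual_tokens:.0f}x")      # 3x
print(f"raw text fits 128K window? {text_tokens <= window}")   # False (~half fits)
print(f"rendered book fits?        {visual_tokens <= window}") # True, 48K to spare
```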


---

Glyph's Training Process

Stage 1: Continual Pre-Training

  • Render long text into images with varied fonts, layouts, and styles.
  • Train the VLM to align rendered text with its semantic meaning.

Stage 2: Rendering Search

  • Balance compression rate against readability.
  • Use genetic search to optimise (see the toy sketch below):
    • Font size
    • Layout
    • Resolution
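
A toy sketch of what such a genetic search could look like. The three knobs mirror the list above, but the fitness function is a stand-in of my own; the real pipeline evaluates candidates with the model itself:

```python
import random

def random_config() -> dict:
    """One rendering candidate: the knobs the search tunes."""
    return {"font_size": random.randint(8, 16),
            "dpi": random.choice([72, 96, 120, 150]),
            "line_spacing": round(random.uniform(1.0, 1.6), 2)}

def fitness(cfg: dict) -> float:
    # Smaller fonts / lower resolution pack more text per visual token,
    # but unreadable pages are worthless: reject them outright.
    compression = (16 / cfg["font_size"]) * (150 / cfg["dpi"]) / cfg["line_spacing"]
    readability = min(1.0, cfg["font_size"] * cfg["dpi"] / 1000)
    return compression if readability > 0.7 else 0.0

population = [random_config() for _ in range(32)]
for _ in range(20):                                # generations
    population.sort(key=fitness, reverse=True)
    parents = population[:8]                       # selection: keep the fittest
    children = []
    for _ in range(24):
        a, b = random.sample(parents, 2)
        child = {k: random.choice([a[k], b[k]]) for k in a}  # crossover
        if random.random() < 0.3:                            # mutation
            key = random.choice(list(child))
            child[key] = random_config()[key]
        children.append(child)
    population = parents + children

print("best rendering config:", max(population, key=fitness))
```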

Stage 3: Post-Training

  • Apply Supervised Fine-Tuning (SFT) & Reinforcement Learning (RL).
  • Introduce OCR alignment tasks to preserve fine-text detail.

---

Skills Achieved

  • Precise reasoning over long text
  • Accurate fine-detail extraction from rendered images

---

Glyph’s Performance

  • 3–4× token compression while matching accuracy of Qwen3-8B
  • 4× faster prefill & decoding
  • 2× faster SFT training
  • Handles million-token tasks in a 128K window
  • Strong generalisation to real multimodal tasks, despite being trained mainly on rendered text

---

Paper: https://arxiv.org/pdf/2510.17800

GitHub: https://github.com/thu-coai/Glyph

Reference: [1] https://x.com/ShawLiu12/status/1980485737507352760

---

Humans, Visual Tokens & AI Futures

DeepSeek-OCR achieves 97.3% accuracy with 10× fewer tokens.

DeepSeek-OCR lets a single NVIDIA A100-40G process more than 200K pages per day, reducing pre-training data costs drastically.

Karpathy notes pixels > text for LLM input:

  • Greater compression → shorter context, greater efficiency
  • More expressive bandwidth → includes fonts, colours, arbitrary images

Elon Musk is even bolder:

> In the long run, over 99% of inputs/outputs for AI models will be photons.


---

Neuroscience Connection

Human brains process images first — text is an abstraction layer.

OCR and visual tokens mimic how we naturally absorb data.

This could reshape LLMs’ core information handling:

  • From text → pixels
  • From linear reading → visual comprehension

---

Linking Research to Content Creation

Platforms like AiToEarn bring these concepts into real-world publishing:

  • AI-driven content generation
  • Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X)
  • Integrated analytics & AI model rankings

This bridges cutting-edge AI research with global monetisation.

---

📌 Summary

Glyph:

  • Renders text → images → processed as visual tokens
  • Achieves compression, speed & performance gains
  • Mimics human visual cognition

Visual tokens are not just a niche trick — they may be the next fundamental shift in AI context modelling.
