Can Reading Countless Books Help Large Models “See” the Visual World? Meta Reveals the Origins of LLM Visual Priors

How LLMs Learn to “See” Without Visual Input

A Large Language Model (LLM) trained exclusively on text, with no visual data at all, can still develop priors that transfer to vision tasks.

This surprising discovery comes from a recent paper by Meta Superintelligence Labs and the University of Oxford.

---

Study Overview

Researchers conducted an extensive 33-page investigation featuring:

  • 100+ controlled experiments
  • 500,000 GPU hours of computation
  • Systematic analysis of how visual priors emerge in LLMs

The study introduces two categories of visual priors:

  • Reasoning Priors
  • Perception Priors

It also proposes a text-only pre-training recipe that “plants the seeds” of visual ability long before the model ever encounters visual data.

---


Core Insight

Two Distinct Sources of LLM Visual Priors

Visual priors are not a single ability — they consist of:

  • Reasoning Prior
    • An abstract, cross-modal general ability
    • Learned from reasoning-rich data such as code, mathematics, and academic papers
    • Transfers well to complex vision tasks
    • Comparable to humans developing logical skills before applying them to visual problem-solving
  • Perception Prior
    • Concrete recognition abilities (e.g., color, shape, object names)
    • Emerges gradually from broad, diverse text corpora
    • Highly sensitive to visual fine-tuning and vision encoder choice

---

Key Finding

> Massive reasoning data drives gains, while visual descriptions saturate quickly.


Experimental Approach

Researchers used an adapter-style multimodal pipeline (a minimal sketch follows this list):

  • Text-only pre-training of multiple decoder-based LLMs (following the Llama-3 architecture, 340M to 13B parameters; focus on 3B and 7B).
  • Integration stage: visual alignment + supervised fine-tuning.
  • Measurement of visual priors in downstream tasks.
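As a rough illustration of what such an adapter-style setup looks like, the PyTorch sketch below maps vision-encoder patch features into a decoder LLM's embedding space through a small trainable projector. The `VisionAdapter` class, the two-layer MLP design, and all dimensions are assumptions for demonstration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Projects vision-encoder patch features into the LLM's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # Two-layer MLP projector, a common adapter design (an assumption here,
        # not necessarily the paper's exact module).
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)

# Hypothetical usage: prepend projected image tokens to the embedded text prompt,
# then feed the combined sequence to the text-only pre-trained decoder LLM.
adapter = VisionAdapter(vision_dim=1024, llm_dim=4096)
image_feats = torch.randn(2, 256, 1024)   # stand-in for vision-encoder patch features
image_tokens = adapter(image_feats)       # (2, 256, 4096)
text_embeds = torch.randn(2, 32, 4096)    # stand-in for embedded text tokens
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # (2, 288, 4096)
```

During the integration stage, this projector (and typically the LLM itself) would then be trained on image-text alignment data followed by supervised fine-tuning.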

Six conclusions and three hypotheses emerged. Highlights include:

  • Tracing Capability Origins
    • Models trained on code, math, and academic text excelled in visual-reasoning-heavy tasks (e.g., Vision-centric VQA).
  • Reasoning Data Matters Most
    • Increasing reasoning-intensive text (up to ~75% of training data) significantly boosts downstream visual reasoning; a sketch of such a mixture sweep follows this list.
    • In contrast, visual-descriptive text (color, shape, locations) yields rapid early gains but plateaus quickly.
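To make the mixture idea concrete, here is a small sketch of how such a controlled data-mixture sweep could be parameterized, holding the total token budget fixed while the reasoning share varies. The category names, the 1B-token budget, and the 80/20 split of the remainder are assumptions, not the paper's configuration.

```python
def make_mixture(reasoning_share: float, total_tokens: int = 1_000_000_000) -> dict:
    """Build a token budget with a given share of reasoning-intensive text."""
    assert 0.0 <= reasoning_share <= 1.0
    other = 1.0 - reasoning_share
    return {
        "code_math_academic": int(total_tokens * reasoning_share),
        "general_web_text": int(total_tokens * other * 0.8),
        "visual_descriptive_text": int(total_tokens * other * 0.2),
    }

# Sweep toward the ~75% reasoning share highlighted above.
for share in (0.25, 0.50, 0.75):
    print(f"reasoning share {share:.0%}: {make_mixture(share)}")
```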

---

Implications for Multimodal AI Development

Insights like these guide better pre-training strategies and practical workflows.

Platforms such as AiToEarn官网 (AiToEarn's official site) provide a global, open-source ecosystem where creators can:

  • Generate AI content
  • Publish across multiple platforms (Douyin, Bilibili, YouTube, Instagram)
  • Analyze performance via dashboards
  • Rank models via AI模型排名 (AI model rankings)

---

Reasoning vs. Perception: Independence and Dependency

  • Reasoning priors:
    • A universal capability
    • Independent of vision encoder choice (probed in the sketch after this list)
    • Strong reasoning during pre-training improves multimodal reasoning later
  • Perception priors:
    • Dependent on fine-tuning data and vision encoder features
    • Emerge later in the training pipeline
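One way to probe this split, sketched below with placeholder names, is to hold the text-only pre-trained LLM fixed while swapping the vision encoder and fine-tuning set: if reasoning scores stay roughly constant across runs while perception scores move with the encoder and SFT data, the pattern matches the claims above. The encoder, dataset, and benchmark labels are assumptions, not the paper's exact setup.

```python
from itertools import product

VISION_ENCODERS = ["clip-vit-l", "siglip-so400m", "dinov2-l"]
SFT_SETS = ["caption-heavy", "instruction-heavy"]
BENCHMARKS = {
    "reasoning": ["vision-centric VQA"],          # expected to stay roughly stable
    "perception": ["color/shape/object probes"],  # expected to track encoder + SFT data
}

def plan_runs():
    """Enumerate runs that pair the same fixed LLM with different visual front ends."""
    return [
        {"vision_encoder": enc, "sft_data": sft, "benchmarks": BENCHMARKS}
        for enc, sft in product(VISION_ENCODERS, SFT_SETS)
    ]

for run in plan_runs():
    print(run["vision_encoder"], run["sft_data"], list(run["benchmarks"]))
```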

---

In short:

To nurture strong visual potential in an LLM, train its “mind” on logic, math, and code rather than flooding it with raw visual descriptions.

---

Pre-training Recipe: From Theory to Practice

The team designed an optimal mixed-data recipe:

  • Rich reasoning content for cognitive sharpness
  • A small but sufficient amount of visual world knowledge (an illustrative mix follows this list)
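For a concrete sense of such a recipe, the snippet below expresses it as a normalized set of source weights. The specific proportions and source names are illustrative assumptions, not the values reported by the authors.

```python
# Illustrative source weights only; the actual recipe proportions are not reproduced here.
pretraining_recipe = {
    "code": 0.30,
    "math": 0.25,
    "academic_papers": 0.20,         # reasoning-rich content for "cognitive sharpness"
    "general_web_text": 0.15,
    "visual_world_knowledge": 0.10,  # small but sufficient visual/descriptive slice
}
assert abs(sum(pretraining_recipe.values()) - 1.0) < 1e-6
```

In practice, weights like these would drive a sampler over the corresponding corpora during text-only pre-training.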

Results:

  • The 7B model trained with this recipe surpassed models optimized purely for language on language tasks.
  • Outperformed all competitors on visual benchmarks.
  • Demonstrated that text-only pre-training can intentionally inject visual priors.

---

Significance & Outlook

  • Moves multimodal capability cultivation earlier in the pipeline
  • Supports the Platonic Representation Hypothesis (text and images are projections of the same underlying reality)

Future implication:

Model pre-training will evolve from single-modality focus to cross-modal planning — embedding visual seeds from day one.

---

For Creators & Researchers

AiToEarn官网 helps teams:

  • Publish AI-generated content across Douyin, Bilibili, Instagram, YouTube
  • Track analytics
  • Connect to AI generation tools
  • Access global model rankings (AI模型排名)

This streamlined workflow complements the research’s vision of integrated multimodal intelligence.
