Visual Generation
Another Path for Visual Generation: Principles and Practice of the Infinity Autoregressive Architecture

Honghao Wang

31 Oct 2025 — 4 min read
# Infinity: A New Route for Visual Autoregressive Generation

**Date:** 2025-10-31 13:40  
**Location:** Shandong  

---

## Introduction

Large language models such as **ChatGPT** and **DeepSeek** have achieved **tremendous global success**, ushering in a new wave of AI innovation.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-727.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-690.jpg)  

While **diffusion models** remain the dominant force in visual generation, **visual autoregressive methods**—mirroring the advances seen in large language models—offer **better scaling properties** and the ability to **unify understanding and generation tasks**. These advantages have sparked growing interest.

This article summarizes a **June presentation** by **Han Jian**, AIGC Algorithm Engineer at ByteDance’s Commercialization Technology Team, during **AICon 2025 Beijing**. His talk, *Infinity: A New Route for Visual Autoregressive Generation*, selected as a **CVPR 2025 Oral**, explained the **core principles**, explored **image** and **video generation** scenarios, and discussed the latest research results.

---

## AICon 2025 Overview

**Dates:** December 19–20  
**Theme:** *Exploring the Boundaries of AI Applications*

Focus areas include:  

- **Enterprise-level agent deployment**
- **Context engineering**
- **AI product innovation**

The event will feature real-world applications where enterprises leverage large models for **R&D efficiency** and **business growth**, with **experts from leading companies and startups** sharing insights and case studies.

---

## Autoregressive Models & Scaling Law

### What Is Autoregression?

An **autoregressive model**:
1. **Predicts a token**
2. **Uses that prediction as input**
3. **Repeats the cycle**

Perfect for **discrete sequences** like language, but **visual signals** are continuous—requiring **encoder-based compression into discrete tokens**.

### Early Approaches
- Directly treat **pixels as tokens**
- Use **encoder–decoder** for discretization  
- Examples: early **text-to-image** and **class-to-image** generation

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-649.jpg)  

### Scaling Law in Language Models
- Scaling **model size**, **dataset size**, or **compute** improves performance following a **power-law**
- **Small experiments** can predict large-scale results

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-613.jpg)  

### Challenges in Visual Generation
Past research (iGPT, VQVAE, VQGAN, Google Parti):
1. Quality gaps vs. **Diffusion Models (DiT)**
2. Scaling Law verification for visual tokens remains incomplete
3. Raster order = **slow inference**
4. Human vision is **holistic, not raster-based**

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-570.jpg)  

---

## Visual Autoregressive vs. Diffusion Models

### VAR: Coarse-to-Fine
- **Multi-resolution hierarchy**
- Begin with **blurry low-res** → progressively refine to **high-res sharp details**

Advantages over Parti:
- Handles multiple scales
- Parallel token patch prediction (reducing steps by 10×)
- Aligns with **human visual intuition**

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-540.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-495.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_008-452.jpg)  

### Diffusion Models (LDM, DiT, Sora)
- Add and remove **Gaussian noise** at fixed resolution
- LDM compresses into **latent continuous** space
- DiT replaces U-Net with **Transformer**, increasing scalability
- Basis for **Sora** and state-of-the-art diffusion systems

![image](https://blog.aitoearn.ai/content/images/2025/10/img_009-418.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_010-391.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_011-342.jpg)  

**Comparison:**

| Aspect           | VAR (visual autoregressive) | Diffusion |
|------------------|-----------------------------|-----------|
| Scale processing | Multi-resolution pyramid     | Single-scale fixed resolution |
| Training         | Highly parallel across scales| Sequential timesteps |
| Inference speed  | Faster                       | Slower |
| Error correction | Harder (errors propagate)    | Easier (errors fixed in later steps) |

---

## Infinity: Advancing VAR

### Key Challenges in Text-to-Image
1. **Discrete VAE reconstruction quality**
2. **Accumulated autoregressive errors**
3. **High resolution** & **arbitrary aspect ratios** support

### Bitwise Tokenizer & Classifier
- Replace VQ codebook with **sign quantization** (±1 per channel)
- **No codebook utilization issues**
- Multi-stage residual pyramid: covers **1024×1024 in 16 steps**
- **Bitwise predictions** → smaller parameter count, more robust to perturbations

![image](https://blog.aitoearn.ai/content/images/2025/10/img_012-296.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_013-269.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_014-241.jpg)  

### Bitwise Self-Correction
- Predicted results are **quantized and fed back** during training
- Corrects **bit-level errors**
- Effective during inference

![image](https://blog.aitoearn.ai/content/images/2025/10/img_015-209.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_016-185.jpg)  

### Error Simulation Training
- Randomly flip 20% of bits during training
- Network learns **error correction**
- **FID drops 9 → 3**; better ImageReward
- Larger vocabularies win with larger models & compute

![image](https://blog.aitoearn.ai/content/images/2025/10/img_017-169.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_018-151.jpg)  

---

## Speed & Efficiency

**Performance:**
- **2B Infinity model:** **0.8s** at 1024²
- **20B Infinity model:** **3s** → **3.7× faster than DiT**
- Strong in video generation tasks
- Matches closed-source DiT in **T2I Arena**

![image](https://blog.aitoearn.ai/content/images/2025/10/img_019-129.jpg)  

---

## Key Takeaways

- VAR & Infinity make **discrete autoregressive models** competitive with diffusion
- **High resolution** and **large vocabulary scalability**
- Faster inference & strong alignment (via DPO)
- Platforms like **[AiToEarn](https://aitoearn.ai/)** offer tools to **publish & monetize** outputs globally

---

![image](https://blog.aitoearn.ai/content/images/2025/10/img_020-107.jpg)  

## AICon 2025 Beijing: Conference Preview

**Dates:** Dec 19–20  
**Highlights:**
- **Agents**
- **Context Engineering**
- **AI Product Innovation**

Meet **leading enterprise experts** and **innovative teams** for deep technical exchanges.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_021-97.jpg)  

---

## Recommended Reads

- [From Part-time Engineer to CTO (AI Agent success story)](https://mp.weixin.qq.com/s?__biz=MzU1NDA4NjU2MA==&mid=2247647738&idx=1&sn=a7552c8128fc8bb7262a47b8df02fbd1&scene=21#wechat_redirect)
- [GPT-5.1 Leaks: Internal OpenAI criticism](https://mp.weixin.qq.com/s?__biz=MzU1NDA4NjU2MA==&mid=2247647398&idx=1&sn=d31a323e5ce67c30514f05715e5040f4&scene=21#wechat_redirect)
- [Dark Side of the Moon Funding Rumors & Meta Chaos](https://mp.weixin.qq.com/s?__biz=MzU1NDA4NjU2MA==&mid=2247647359&idx=1&sn=f1b85686b8924a2a18d91acd36e402fd&scene=21#wechat_redirect)
- [LangChain’s rewrite driving $1.25B valuation](https://mp.weixin.qq.com/s?__biz=MzU1NDA4NjU2MA==&mid=2247647314&idx=1&sn=e34deb2852f640d31cd2dd54e2df1692&scene=21#wechat_redirect)
- [Meta Layoffs & Alexandr Wang’s aggressive hiring](https://mp.weixin.qq.com/s?__biz=MzU1NDA4NjU2MA==&mid=2247647159&idx=1&sn=a9e0e5d10801f5a2caddaaee0b68c12b&scene=21#wechat_redirect)

![image](https://blog.aitoearn.ai/content/images/2025/10/img_022-86.jpg)  

---

[Read the Original](2247647828)  
[Open in WeChat Here](https://wechat2rss.bestblogs.dev/link-proxy/?k=252fc584&r=1&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzU1NDA4NjU2MA%3D%3D%26mid%3D2247647828%26idx%3D2%26sn%3D8bd42fbe658ff25fb498ba09e9c500b2)

---
Another Path for Visual Generation: Principles and Practice of the Infinity Autoregressive Architecture

Honghao Wang

Read more

Doubao, Kimi, and 10 AI Giants Take on Wall Street — Who’s the Strongest?

XPeng Humanoid Robot Debuts, Flying Car Nears Mass Production

First Hello-Agents Class Begins! November’s Most Diverse Team Learning Program Arrives (Up to 18 Courses)

Vertex AI Agent Builder: A New Way to Build and Scale AI Agents