Qwen3 Turned into a Diffusion Language Model? No Need to Train from Scratch, a Record-Breaking 30B Parameters

# RND1: The Largest Open-Source Diffusion Language Model

**Date:** 2025-10-12 · **Location:** Beijing  

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-88.jpg)

---

## Introduction

Diffusion Language Models (DLMs) have fascinated researchers because, unlike **autoregressive (AR) models** that must generate text left-to-right, **DLMs enable parallel generation** (sketched in the code after the lists below).

**Advantages:**
- Potential for **faster outputs**
- Incorporation of both preceding and following context during generation

**Challenges:**
- Lower scaling efficiency compared to AR models
- Direct DLM training often requires **more iterations** to match AR performance
- AR models benefit from a **first‑mover advantage** including robust training infrastructure, refined recipes, and deep practitioner expertise
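
To make the parallel-generation idea concrete, here is a minimal, hypothetical PyTorch sketch of confidence-based iterative unmasking: start from an all-`[MASK]` sequence and commit several tokens per step with a single bidirectional forward pass. The `mask_id` argument and the scheduling heuristic are illustrative assumptions; real samplers used by models like Dream or LLaDA differ in their details.

```python
import torch

@torch.no_grad()
def diffusion_decode(model, length, steps, mask_id):
    """Hypothetical parallel decoder: start fully masked, then unmask
    the most confident positions a few at a time. Illustrative only."""
    ids = torch.full((1, length), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = ids == mask_id
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        logits = model(input_ids=ids).logits          # one bidirectional pass
        conf, pred = logits.softmax(-1).max(-1)       # confidence per position
        conf = conf.masked_fill(~still_masked, -1.0)  # ignore committed tokens
        k = -(-n_masked // (steps - step))            # ceil: finish in `steps`
        top = conf.view(-1).topk(k).indices
        ids.view(-1)[top] = pred.view(-1)[top]        # commit k tokens at once
    return ids
```

Unlike AR decoding, each forward pass here fills in multiple positions at once, which is where the potential speedup comes from.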

---

## Radical Numerics’ Approach

![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-82.jpg)

Startup **Radical Numerics** took a different route:  
### **Autoregressive-to-Diffusion (A2D) Conversion**
They adapted an existing high-performing AR model into a DLM.

**Release:**  
- **RND1-Base** (Radical Numerics Diffusion)
- **Largest** open-source diffusion language model so far

![image](https://blog.aitoearn.ai/content/images/2025/10/img_003-77.jpg)

**Model Details:**
- **30B-parameter sparse MoE**
- **3B active parameters**
- Converted from **Qwen3-30B-A3B**
- Continually pretrained on **500B tokens** to develop full diffusion capabilities  
- Open release: model weights, training recipes, inference code, and sample outputs

![image](https://blog.aitoearn.ai/content/images/2025/10/img_004-76.jpg)

### **Resources**
- **Technical report:** *Training Diffusion Language Models at Scale using Autoregressive Models*  
  [Download PDF](https://www.radicalnumerics.ai/assets/rnd1_report.pdf)  
- **Code repository:** [GitHub Link](https://github.com/RadicalNumerics/RND1)  
- **HuggingFace model:** [HF Link](https://huggingface.co/radicalnumerics/RND1-Base-0910)  

---

## Main Contributions

- **Systematic study** of large-scale A2D conversion: initialization strategies, hierarchical learning rates, critical batch sizes  
- **Scalability & stability** gains by combining AR pretraining with diffusion-specific techniques  
- Released **RND1-30B**, proving very large foundational DLMs are viable and competitive

---

## Benchmark Performance

**Benchmarks Tested:**
- **Reasoning & QA:** MMLU, ARC-C, RACE, BBH  
- **STEM tasks:** GSM8K  
- **Code generation:** MBPP

**Findings:**
- RND1 **outperforms** competing models like **Dream-7B** and **LLaDA-8B**  
- Retains strong capabilities from its AR predecessor  
- **Note:** No direct comparison with **LLaDA-MoE-7B-A1B**; head-to-head performance remains unclear

![image](https://blog.aitoearn.ai/content/images/2025/10/img_006-65.jpg)

_Source: [arXiv 2509.24389](https://arxiv.org/pdf/2509.24389)_

---

## Simple Continual Pretraining (SCP)

Converting an AR model into a diffusion model requires two things at once:
1. **Enable bidirectional context** in a causal-only architecture
2. **Preserve** the language knowledge acquired during AR pretraining

### Complex Past Approaches:
- **Attention mask annealing**
- **Grafting new attention blocks**

Both are often hard to scale to large models.

### SCP Method:
A simpler recipe with competitive results:
1. Start from a strong **AR checkpoint**
2. Replace **causal mask → bidirectional mask** at init
3. Continue pretraining under **masked diffusion objective** with LR warmup
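
A minimal sketch of what one SCP training step could look like in PyTorch, assuming the model's attention has already been patched to be bidirectional at initialization. The `MASK_ID` value, the masking schedule, and the absence of loss reweighting are simplifying assumptions, not RND1's exact recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 151_669  # hypothetical [MASK] token id added to the tokenizer

def masked_diffusion_step(model, input_ids):
    """One step of the masked diffusion objective on an A2D-converted model."""
    # Sample a masking level per batch; real recipes also reweight the loss
    # by this level, which is omitted here for brevity.
    mask_ratio = torch.rand(()).clamp(min=1e-3)
    noise = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    noisy = torch.where(noise, torch.full_like(input_ids, MASK_ID), input_ids)

    # Forward pass uses bidirectional attention (causal mask removed at init).
    logits = model(input_ids=noisy).logits
    # Supervise only the masked positions: predict the original tokens.
    return F.cross_entropy(logits[noise], input_ids[noise])
```

The appeal of SCP is exactly this simplicity: no mask annealing schedule or grafted modules, just a mask swap and continued training with a fresh LR warmup.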

---

## Preserving AR Knowledge via Layer-Wise Learning Rates

**Goal:** Prevent **catastrophic forgetting**  

**Insights:**
- Factual knowledge often resides in **FFN/MLP layers**
- **Strategy:**  
  - Higher LR in **attention layers** (adapt faster to bidirectional context)  
  - Lower LR in **MLP & embedding layers** (preserve learned knowledge)
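
As a sketch, this strategy maps naturally onto optimizer parameter groups; the name-matching heuristic and the learning-rate values below are illustrative assumptions rather than RND1's published hyperparameters.

```python
import torch

def build_layerwise_optimizer(model, attn_lr=1e-4, preserve_lr=1e-5):
    """Higher LR for attention (adapts to bidirectional context); lower LR
    for MLP/embedding parameters (preserves AR-learned knowledge)."""
    attn, preserve = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (attn if "attn" in name or "attention" in name else preserve).append(p)
    return torch.optim.AdamW([
        {"params": attn,     "lr": attn_lr},
        {"params": preserve, "lr": preserve_lr},
    ])
```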

---

## Benefits of Larger Batch Sizes in A2D

**Key Observation:**  
- AR loss: every token contributes  
- Diffusion loss: only masked tokens (~50%)  

**Impact:**  
- Diffusion training needs **larger batches** for effective learning

**Experiment:**
- Used a 4B-parameter model
- Ran four training runs with varying batch sizes
- Found **loss improvement persists up to ~8M tokens/batch**
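
A back-of-the-envelope check makes the intuition concrete (the 50% masking rate is an expected value here, not a fixed setting):

```python
batch_tokens = 8_000_000    # ~8M tokens/batch, where gains were seen to saturate
avg_mask_ratio = 0.5        # expected fraction of masked (supervised) tokens

ar_signal = batch_tokens                     # AR: every token is supervised
diff_signal = batch_tokens * avg_mask_ratio  # diffusion: masked tokens only
print(f"AR: {ar_signal:,} vs diffusion: {diff_signal:,.0f} supervised tokens")
# A diffusion batch must be ~2x larger to match AR's supervised-token count.
```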

![image](https://blog.aitoearn.ai/content/images/2025/10/img_007-64.jpg)

---

## Why Retrofit AR Models?

**Advantages:**
- Avoid starting from scratch
- Harness robust AR pretraining infrastructure
- Explore **new architectures** more efficiently

---

## Ecosystem for AI Creators: AiToEarn

In parallel to research, creators can benefit from publishing and monetizing outputs globally.

**AiToEarn** ([official site](https://aitoearn.ai/)) offers:
- Open-source platform
- AI-assisted content creation
- Cross-platform publishing (Douyin, Kwai, WeChat, Bilibili, Instagram, YouTube, X/Twitter, etc.)
- Analytics and monetization tools

Fits well for sharing **diffusion model outputs**, **benchmarks**, and creative applications.

![image](https://blog.aitoearn.ai/content/images/2025/10/img_005-71.jpg)

---

## Radical Numerics’ Vision & Team

![image](https://blog.aitoearn.ai/content/images/2025/10/img_008-57.jpg)

Core philosophy:
- **Automated AI research platform**
- Self-improving AI systems through **recursive optimization**
- Faster experimentation via automated loops

**Team Background:**
- From DeepMind, Meta, Liquid, Stanford
- Interests in **hybrid architectures**, **Hyena**, and **Evo**

![image](https://blog.aitoearn.ai/content/images/2025/10/img_009-51.jpg)  

**More Info:** [Read Blog](https://www.radicalnumerics.ai/blog/rnd1)

![image](https://blog.aitoearn.ai/content/images/2025/10/img_010-51.jpg)  

---

## Conclusion

**RND1** proves:
- Large-scale DLMs (>8B params) are practical
- A2D + SCP enables efficient, stable training
- High potential when paired with publishing/monetization infrastructures like **AiToEarn**

**THE END**  

[Open in WeChat](https://wechat2rss.bestblogs.dev/link-proxy/?k=467c04ea&r=1&u=https%3A%2F%2Fmp.weixin.qq.com%2Fs%3F__biz%3DMzA3MzI4MjgzMw%3D%3D%26mid%3D2650994927%26idx%3D2%26sn%3Dabb637d57d22a7c5b334b8cad2cf4b85)  


By Honghao Wang