Drink Some VC | YC Talks with Anthropic’s Head of Pretraining: Pretraining Teams Must Also Consider Inference, Balancing Pretraining and Post‑Training Still in Early Exploration

# Y Combinator Conversation with Nick Joseph — Scaling Laws, Compute, and the Future of AI

**Date:** 2025‑10‑16 11:01 (Beijing)  

> *“Scaling laws show that as you put in more compute, data, and parameters, model loss decreases in a predictable way — this is the core engine driving AI progress.”*

![image](https://blog.aitoearn.ai/content/images/2025/10/img_001-217.jpg)  
![image](https://blog.aitoearn.ai/content/images/2025/10/img_002-207.jpg)  

*Image source: Y Combinator*

---

## 🔍 Z Highlights

- **Driving down the loss function** is the core of pretraining — the single, consistent goal over time.  
- **Alignment** is about teaching models to share human goals — especially critical once models surpass human intelligence.  
- With **unlimited compute**, the main challenge becomes *using it efficiently* and solving scaling engineering.  
- **Current bottleneck:** limited compute, more so than a lack of algorithms.

*Nick Joseph — Head of Pretraining at Anthropic, with prior work at Vicarious and OpenAI on AI safety and scaling research — shares his insights on pretraining strategies, scaling laws, data, and alignment. This interview (Sept 30, 2025) is hosted by Y Combinator.*

---

## 1. From AI Safety to Leading Pretraining at Anthropic

### Career Path
- **At Vicarious:** Focused on computer vision models for robotics; gained ML engineering fundamentals & infrastructure skills.
- **At GiveWell (internship):** Early exposure to discussions of AGI risk; chose to work in AI as the path to real-world impact.
- **At OpenAI:** Joined safety research and worked on code generation with GPT‑3; saw AI's potential for self-improvement; moved to Anthropic at its founding.

---

## 2. Pretraining Fundamentals

### What is Pretraining?  
- Uses *massive compute* on large, unlabeled datasets (e.g., text from the internet).
- **Next‑word prediction** is the primary objective and provides a dense learning signal (see the sketch below).
- Scaling compute, parameters, and data **predictably reduces loss** → better performance.
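To make the objective concrete, here is a minimal sketch of next-word prediction as a cross-entropy loss. The `model` callable and tensor shapes are illustrative assumptions, not Anthropic's training code.

```python
# Minimal sketch of the pretraining objective: predict token t+1 from tokens <= t.
# `model` is a hypothetical callable mapping token ids to logits of shape
# (batch, seq_len, vocab_size), standing in for any decoder-only transformer.
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the prediction at position t and the token at t+1.

    Every position in every sequence contributes a gradient, which is why
    next-word prediction is such a dense learning signal.
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift inputs and targets by one
    logits = model(inputs)                            # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten batch and positions
        targets.reshape(-1),
    )
```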

**Key Principle — Scaling Laws:**  
Compute ↑ + Data ↑ + Params ↑ ⇒ Loss ↓ (predictable curve)  
Drives a feedback loop: better models → more revenue → more compute reinvested in training → even better models.
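As a worked illustration of that predictable curve, the sketch below uses the parametric form popularized by the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The constants are that paper's published fit and are used here purely for illustration; they are not Anthropic's numbers.

```python
# Hedged sketch of a scaling-law prediction, assuming the Chinchilla-style form
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants follow the published Chinchilla fit (Hoffmann et al., 2022) and are
# illustrative only.
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return E + A / n_params ** alpha + B / n_tokens ** beta

# More compute (more parameters trained on more tokens) gives a predictably lower loss:
print(predicted_loss(7e9, 140e9))    # ~7B params on ~140B tokens  -> ~2.18
print(predicted_loss(70e9, 1.4e12))  # ~70B params on ~1.4T tokens -> ~1.94
```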

---

## 3. Why Autoregressive Modeling Won

- Outperformed alternatives (e.g., masked language modeling in the style of BERT) in empirical comparisons.
- **Advantage:** Sampling coherent text left to right is straightforward, which makes the models product-friendly (see the sketch below).
- Supports smooth **business–tech loop**: release → revenue → reinvest in training.
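A brief sketch of why sampling is easy with an autoregressive model: generation is just repeated next-token prediction. The `model` callable and the temperature value below are illustrative assumptions.

```python
# Minimal autoregressive sampling loop (illustrative, not production inference code).
import torch

@torch.no_grad()
def sample(model, tokens: torch.Tensor, max_new_tokens: int = 50,
           temperature: float = 0.8) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1, :]                   # distribution over the next token only
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)    # append the sample and repeat
    return tokens
```

A masked objective in the BERT style has no comparably natural left-to-right generation loop, which is part of why the autoregressive formulation proved more product-friendly.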

---

## 4. Engineering at the Compute Frontier

### Challenges
- Training involves a vast hyperparameter space; scaling laws narrow it down, but tuning is still needed.
- Infrastructure: early Anthropic trained on cloud providers while optimizing the hardware aggressively.
- **Distributed Training:** Custom communication collectives (all‑reduce, etc.) had to be written by hand, beyond what public frameworks offered (see the data-parallel sketch after the list below).

**Optimization Focus:**
1. Combine data, pipeline, and model parallelism.
2. Write custom low-level ops for attention mechanisms.
3. Profile at scale (hundreds to thousands of GPUs), which often required self-built profilers.
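As a toy illustration of one layer of that stack, the data-parallel sketch below averages gradients across workers with an all-reduce. It uses plain `torch.distributed` and omits the pipeline/model parallelism and custom kernels described above, so treat it as a sketch rather than Anthropic's actual training loop.

```python
# Data-parallel gradient averaging with an all-reduce (illustrative sketch).
# Assumes the process group has already been initialized, e.g. with
# dist.init_process_group(backend="nccl"), one process per GPU.
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Sum every parameter's gradient across all ranks, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # communicate across workers
            param.grad.div_(world_size)                        # average the summed gradients
```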

---

## 5. Scaling Teams and Specialization

- As the team grew, it split into specialists (attention, parallelization, …) and generalists.
- The balance matters: avoid both *knowledge silos* and *shallow breadth*.

---

## 6. Hardware & Training Realities

- **GPU/TPU Diversity:** Different strengths (throughput vs bandwidth) → match workloads to hardware.
- **Failures:** A single faulty GPU can take down an entire training run; vendor debugging requires minimal, reproducible test cases (see the sketch below).
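To illustrate the minimal-repro idea, here is a hedged sketch that runs the same deterministic matmul on every visible GPU and compares it against a CPU reference; NaNs or large drift would flag a suspect device. This is an illustration only, not Anthropic's diagnostic tooling.

```python
# Hedged sketch: a tiny reproducible check of each GPU against a CPU reference matmul.
# Healthy devices should show only small floating-point drift; NaNs or large
# discrepancies are the kind of signal worth escalating to a hardware vendor.
import torch

def check_gpus(size: int = 4096, seed: int = 0) -> None:
    torch.manual_seed(seed)
    a = torch.randn(size, size)
    b = torch.randn(size, size)
    reference = (a @ b).sum().item()                      # CPU reference checksum
    for i in range(torch.cuda.device_count()):
        result = (a.to(f"cuda:{i}") @ b.to(f"cuda:{i}")).sum().item()
        print(f"GPU {i}: checksum={result:.2f}, drift from CPU={abs(result - reference):.2f}")

if __name__ == "__main__":
    check_gpus()
```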

---

## 7. Pretraining vs Post-training

- Shift toward post-training (RL, alignment tuning) is real — *optimal balance* is still unknown.
- Pretraining still central for base capability gains; post-training builds reasoning & task adherence.

---

## 8. Data Challenges

- Claims that the “useful internet” has been exhausted are uncertain; quality-versus-quantity trade-offs persist.
- **Risk:** Training on AI-generated content can teach the model to copy a flawed distribution.

---

## 9. Evaluation & Alignment

- The **loss function** is a robust training metric, but the ultimate metric should align with deployment goals.
- Designing evals that are low-noise, fast, and meaningful is the hardest part (see the worked example below).
- **Alignment:** Matching model goals with human values; requires practical constraints (rules, prompts) plus theory.
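A small worked example of the noise point: the statistical error of an accuracy-style eval shrinks only with the square root of the number of graded samples, so small evals cannot resolve small differences between models. The numbers below are made up for illustration.

```python
# Worked example: sampling noise in an accuracy eval.
# The standard error of a proportion is sqrt(p * (1 - p) / n); a ~95% margin is ~1.96x that.
import math

def accuracy_with_margin(correct: int, total: int) -> tuple[float, float]:
    p = correct / total
    margin = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, margin

print(accuracy_with_margin(430, 500))    # 86.0% accuracy with roughly a +/- 3 point margin
print(accuracy_with_margin(4300, 5000))  # same accuracy, ~3x tighter margin (+/- ~1 point)
```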

---

## 10. Paradigm Shifts & Hidden Bugs

- Expect *new AI paradigms* beyond scaling — RL is one such shift.
- **Worst fear:** Subtle, hard-to-find bugs wasting months of compute.
- Rare but vital skill: tracing issues across *all stack layers* (ML logic → CUDA → networking).

---

## 11. Future Directions

- Autoregression may be *enough* for AGI, but inference optimizations (attention variants, KV caching, …) will matter more as compute grows.
- Inference efficiency is crucial; large, slow-to-serve models hurt the user experience (see the KV-cache sketch below).
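One of the inference optimizations alluded to above is KV caching: at each decoding step, only the new token's keys and values are computed and appended to a cache, so attention does not have to be recomputed over the whole prefix. The single-head, single-layer setup below is an illustrative assumption.

```python
# Minimal single-head KV-cache sketch for one attention layer (illustrative only).
import torch

def decode_step(w_q, w_k, w_v, x_new, kv_cache):
    """One autoregressive decoding step with cached keys and values.

    w_q, w_k, w_v: (d_model, d_head) projection matrices
    x_new:         (batch, 1, d_model) embedding of the newest token
    kv_cache:      dict with "k" and "v" of shape (batch, t, d_head) for past tokens
    """
    q, k, v = x_new @ w_q, x_new @ w_k, x_new @ w_v           # project only the new token
    kv_cache["k"] = torch.cat([kv_cache["k"], k], dim=1)      # append to cached keys
    kv_cache["v"] = torch.cat([kv_cache["v"], v], dim=1)      # append to cached values
    scores = q @ kv_cache["k"].transpose(-2, -1) / k.size(-1) ** 0.5
    attn = torch.softmax(scores, dim=-1)                      # attend over the full prefix
    return attn @ kv_cache["v"], kv_cache                     # new output plus updated cache
```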

**Startup Opportunities:**
- Specialized applications atop current models.
- Chip validation & failure detection tools.
- Scalable org management for large-scale projects.

---

## 📺 Original Source

*Original video: [Anthropic Head of Pretraining on Scaling Laws, Compute, and the Future of AI](https://www.youtube.com/watch?v=YFeb3yAxtjE)*

---

## 💡 Related Tools: AiToEarn

Platforms like **[AiToEarn (official site)](https://aitoearn.ai/)** provide:
- AI content generation plus publishing to Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X.
- Analytics and [AI model rankings](https://rank.aitoearn.ai).
- Open source: [GitHub repo](https://github.com/yikart/AiToEarn) | [Documentation](https://docs.aitoearn.ai/)

They let creators and teams **monetize AI projects** and iterate rapidly, mirroring the feedback loops in AI model scaling.

---

*This recap is compiled from original content linked above and does not represent Z Potentials’ official position.*
