Exclusive Interview with DeepMind’s Tan Jie: Robots, World Models, and Google
Interview: Zhang Xiaojun × Tan Jie
Google DeepMind Robotics – Foundation Models, Reinforcement Learning, and the Future of General-Purpose Robots


Guest: Tan Jie, Senior Research Scientist & Technical Lead, Google DeepMind Robotics. He works on applying foundation models and deep reinforcement learning to robotics.
---
Overview
China’s robotics sector is often perceived as stronger in hardware, while the U.S. is seen as leading in the “brains” of robots. Tan Jie shares Google DeepMind’s perspective, including insights from the recent "Gemini Robotics 1.5 Brings AI Agents into the Physical World" release, and discusses:
- The parallels between computer graphics and robotics
- Sim-to-real transfer and reinforcement learning breakthroughs
- Large language models as the “brain” for robots
- The evolving landscape of embodied intelligence and robotics foundation models
- Data scarcity, cross-embodiment transfer, and motion transfer innovations
- The competitive culture in Silicon Valley post-ChatGPT
---
1. From Computer Graphics to Robotics
Early Career Path
- Undergraduate: Shanghai Jiao Tong University
- PhD: Focused on computer graphics, specifically physics-based character animation; interned at Pixar working on animation.
- Founded a startup in Shanghai (similar to Kujiale/CoolHome).
- Joined Lytro in Silicon Valley, worked on light field cameras.
- Moved to Google Brain, which later merged with DeepMind to form the Google DeepMind Robotics team.
Perspective Shift
> "Robotics is graphics in the real world; graphics is robotics in simulation."
- Initially motivated to apply simulation-based techniques from graphics to real-world robotics.
---
2. Robotics Before Deep Reinforcement Learning
- Dominated by traditional, hand-engineered control methods such as MPC (Model Predictive Control).
- High barrier to entry: required PhD-level math.
- Graphics simulations could show agile motions that real robots failed to achieve (DARPA Robotics Challenge robots vs. simulated robots doing flips).
Goal: Bring simulation capabilities to real-world robots for agile locomotion and manipulation.
---
3. Breakthrough: Sim-to-Real + Reinforcement Learning
- First Google paper: "Sim-to-Real: Learning Agile Locomotion for Quadruped Robots"
- Introduced deep RL methods (PPO) inspired by AlphaGo successes.
- Pioneered RL in quadruped locomotion, influencing Unitree, Boston Dynamics, and others.
Paradigm Shifts in Robotics Over 10 Years:
- Reinforcement Learning – solving gait and locomotion
- Large Language Models (LLMs) – bringing language comprehension and common sense to robots
---
4. LLMs + RL = Brain + Cerebellum
- LLMs: “Brain” – reasoning, planning, language understanding
- RL: “Cerebellum” – execution, control, balance, precise movement
- Both components are essential for advanced robotics.
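
The split above can be pictured as a two-rate control loop: a slow planner that picks subgoals and a fast controller that tracks them. The following is a minimal sketch under stated assumptions, not DeepMind's implementation. `slow_planner`, `fast_controller`, `PLAN_EVERY`, and the 1-D "robot" are all hypothetical stand-ins: a real system would call a language model for the plan and a learned policy for control.

```python
from dataclasses import dataclass

PLAN_EVERY = 10  # hypothetical: the "brain" replans once per 10 control steps


@dataclass
class Subgoal:
    name: str
    target: float  # hypothetical 1-D target position


def slow_planner(instruction: str, step: int) -> Subgoal:
    """Stand-in for the LLM 'brain': alternates between two fixed targets."""
    targets = [1.0, 0.0]
    target = targets[(step // PLAN_EVERY) % len(targets)]
    return Subgoal(name=f"{instruction}: move_to_{target}", target=target)


def fast_controller(position: float, subgoal: Subgoal, gain: float = 0.5) -> float:
    """Stand-in for the RL 'cerebellum': a proportional controller."""
    return gain * (subgoal.target - position)


def run(instruction: str, n_steps: int = 20):
    """Run the two-rate loop; return the final position and the plan count."""
    position, subgoal, plans = 0.0, None, 0
    for step in range(n_steps):
        if step % PLAN_EVERY == 0:  # low-frequency reasoning
            subgoal = slow_planner(instruction, step)
            plans += 1
        position += fast_controller(position, subgoal)  # high-frequency control
    return position, plans
```

The point of the design is that the expensive reasoning step runs rarely, while the cheap control step runs on every tick.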
---
5. Robotics Foundation Models — Independent Discipline?
Current Status:
- Most work extends LLMs/multimodal models to output robot actions.
- Lacks standalone robotics-specific pretraining paradigms.
- May become an independent discipline in the future, with its own world models and data formats.
---
6. Data as the Primary Bottleneck
Why Data Is Scarce in Robotics
- The real world is unstructured, so robots need a huge diversity of experience
- No large-scale open datasets like in language modeling
- Human teleoperation data is expensive
Data Pyramid in Robotics:
- Massive low-quality internet-scale data
- Egocentric human video data (YouTube, wearable cameras)
- Simulation data (physics engines, synthetic environments)
- Robot-specific high-fidelity data (teleoperation, task-specific collections)
---
7. Gemini Robotics 1.5 – Key Innovations
1. Adding “Thinking” into VLA Models
- Allows multi-step reasoning before executing actions
- Improves human-robot transparency and safety
Process Example: Sorting clothes by color
- Identify item color
- Plan which pile it belongs to
- Execute placement and repeat
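
The three steps above can be sketched as a think-then-act loop in which each action is preceded by an explicit, human-readable reasoning trace. This is only an illustration of the control flow, assuming a hypothetical `think` function and a whites/colors sorting rule that are not from the interview. In a real VLA model, both the trace and the action would come from the model itself.

```python
def think(item):
    """Produce a reasoning trace and a decision before any motion happens."""
    color = item["color"]
    pile = "whites" if color == "white" else "colors"
    trace = [f"The item is {color}.", f"It belongs in the {pile} pile."]
    return trace, pile


def sort_clothes(items):
    """Identify, plan, and place each item; keep the traces for inspection."""
    piles = {"whites": [], "colors": []}
    log = []
    for item in items:
        trace, pile = think(item)         # steps 1-2: identify color, plan pile
        log.extend(trace)                 # the trace is what makes behavior inspectable
        piles[pile].append(item["name"])  # step 3: stand-in for the placement motion
    return piles, log


piles, log = sort_clothes([
    {"name": "shirt", "color": "white"},
    {"name": "sock", "color": "red"},
])
```

Because the trace is emitted before the action, a person (or a safety check) can read the robot's intent before it moves, which is the transparency benefit noted above.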
---
2. Cross-Embodiment Transfer via Motion Transfer
- Enables using data collected on Robot A to train Robot B
- Tested across:
  - ALOHA: table-top dual-arm robot
  - Bi-arm Franka: industrial arms
  - Apptronik Apollo: humanoid robot
- Result: Tasks requiring unseen workspace configurations transferred successfully between embodiments.
---
8. Technical Challenges and Solutions
- Reasoning speed: Robots have tighter inference budgets (0.5–1s per decision) vs. LLMs (can think for 20s+)
- Overfitting thinking traces: Need diverse annotations to generalize reasoning to new tasks
- Reward function design in RL: Simple for locomotion, extremely hard for varied manipulation tasks
- Embodiment gap: Larger physical differences reduce transfer effectiveness
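
To illustrate the reward-design point in the list above, here is a minimal sketch of why locomotion rewards tend to be simple: forward progress minus an energy penalty fits in a few lines. The function name, the `energy_weight` value, and the time step are illustrative assumptions, not values from the interview or any specific paper.

```python
def locomotion_reward(forward_velocity, joint_torques, joint_velocities,
                      dt=0.01, energy_weight=0.005):
    """Reward for one control step: distance covered minus weighted energy use."""
    distance = forward_velocity * dt
    # Mechanical power summed over joints: |torque * angular velocity|
    power = sum(abs(t * v) for t, v in zip(joint_torques, joint_velocities))
    return distance - energy_weight * power * dt
```

No comparably compact formula exists for "fold the laundry" or "pack a lunch box": success depends on object state, contact, and task semantics, which is why manipulation rewards are so much harder to write by hand.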
---
9. Simulation vs. Real Data
Real Data:
+ Avoids sim-to-real gap
− Limited scalability
Simulation Data:
+ Scalable, cheaper in long run
− Initial performance gap, hallucinations in generative video simulation
New Direction:
- Generative video-based simulation (Veo, Sora 2, Genie) may replace traditional physics simulation
- Prompt-based scene generation scales faster than manually modeling environments
---
10. Future Directions
- Scaling data via simulation, human video, and model-generated datasets
- Bridging world models (Vision-Language-Vision) with action outputs
- Incorporating additional modalities (tactile sensing critical for dexterous hands)
- Moving from gripper era to dexterous-hand era robotics
---
11. Development Timelines
Predictions:
- 2–3 years: Robotics “GPT moment” with useful general-purpose models
- 5–10 years: Widespread deployment in industries and eventually homes
- Specialist systems will be outperformed once true generalist models mature
---
12. Notes on Silicon Valley Culture Shift
- Post-ChatGPT: extreme competitiveness (“996” style now common)
- Large-scale coordinated teams replacing small, independent research
- Balancing top-down direction with bottom-up innovation
- “Big effort” is necessary but needs smart innovation for breakthroughs
---
13. Talent & Leadership
- AI talent costs soaring due to supply-demand imbalance
- Mission alignment more important than money for top-tier hires
- Significant Chinese representation (50–60%) in Google Robotics
- Prediction: More Chinese leaders in Silicon Valley AI and robotics in coming years
---
14. Key Takeaways from Tan Jie’s Journey
- Focus: Solve AGI in the physical world
- Preferred Form Factor: Humanoid robots
- Preferred Architecture: End-to-end unified models
- Critical Bet: Scalable synthetic data
- Collaboration between hardware-rich China and AI-rich U.S. could accelerate progress globally
---
Quick Recommendations
- Books:
  - Start With Why
  - The 7 Habits of Highly Effective People
- Key Papers:
  - Sim-to-Real: Learning Agile Locomotion for Quadruped Robots
  - RT‑1, RT‑2, RT‑X series
  - Gemini Robotics 1.5
---
Closing Perspective
> “When a true generalist robot arrives, specialists will struggle to survive. Whether in robotics or AI content creation, scalable multi-modal data, strong foundation models, and the right collaborations will be key to reaching that point.”