Robert Nishihara: Open-Source AI Computing Solution with Kubernetes, Ray, PyTorch, and vLLM
AI Compute Stacks for Emerging Workloads
AI workloads are rapidly increasing in complexity, in both their compute and their data requirements.
Technologies like Kubernetes and PyTorch are critical in building production-ready AI systems capable of handling such demands.
At KubeCon + CloudNativeCon North America 2025,
Robert Nishihara (Anyscale) shared insights on how a stack integrating Kubernetes, Ray, PyTorch, and vLLM can effectively power next-generation AI workloads.
---
Overview of Ray
Ray is an open-source framework designed for scaling machine learning and Python applications.
It originated from a reinforcement learning research project at Berkeley and now orchestrates infrastructure for distributed workloads.
Ray recently joined the PyTorch Foundation to deepen its role in the open-source AI ecosystem.
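As a quick illustration of what "scaling Python applications" means in practice, here is a minimal sketch of Ray's task API; the `square` function and the task count are placeholders, not anything from the talk.

```python
# Minimal sketch: scaling a plain Python function with Ray tasks.
# Assumes `pip install ray`; square() is a stand-in workload.
import ray

ray.init()  # connects to an existing cluster, or starts a local one

@ray.remote
def square(x: int) -> int:
    return x * x

# Launch 100 tasks in parallel across the cluster and gather the results.
futures = [square.remote(i) for i in range(100)]
print(ray.get(futures)[:5])  # [0, 1, 4, 9, 16]
```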
---
Drivers of Next-Generation AI Workloads
Nishihara identified three primary drivers:
- Data Processing
  - Shift from traditional tabular data to multimodal datasets (images, videos, audio, text, sensor data).
  - Multimodal datasets are essential for inference tasks in modern AI.
- Model Training
  - Models are growing in size and complexity.
  - Training uses distributed CPU/GPU computing to accelerate development.
- Model Serving
  - Efficient deployment at scale requires flexible frameworks.
  - Must support high-throughput, low-latency inference.
Key Trend:
Hardware requirements must now accommodate GPUs alongside CPUs.
Computing focus has shifted from “SQL ops on CPUs” to “inference ops on GPUs”.
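To make the shift from "SQL ops on CPUs" to "inference ops on GPUs" concrete, below is a hedged sketch of multimodal batch inference with Ray Data; the `EmbedImages` class, the S3 paths, and the worker counts are hypothetical placeholders.

```python
# Hedged sketch: running "inference ops on GPUs" over multimodal data with Ray Data.
import ray
import numpy as np

class EmbedImages:
    """Stateful UDF: would load a vision model once per worker and reuse it across batches."""
    def __init__(self):
        # e.g. self.model = load_my_vision_model().to("cuda")  # hypothetical loader
        self.model = None

    def __call__(self, batch: dict) -> dict:
        # Run the model on the GPU; here we fake one 512-dim embedding per image.
        n = len(batch["image"])
        return {"embedding": np.zeros((n, 512), dtype=np.float32)}

ds = ray.data.read_images("s3://my-bucket/images/")        # hypothetical input path
ds = ds.map_batches(EmbedImages, batch_size=64,
                    num_gpus=1, concurrency=4)              # 4 single-GPU actors
ds.write_parquet("s3://my-bucket/embeddings/")              # hypothetical output path
```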
---
Example: AiToEarn Platform
AiToEarn demonstrates how such stacks enable content creation and monetization:
- Generates AI content
- Publishes across multiple platforms (Douyin, WeChat, Facebook, YouTube, etc.)
- Offers analytics and AI model rankings
- Fully open source
Purpose:
Connect AI tools, cross-platform publishing, analytics, and deployment — much like Kubernetes + PyTorch + Ray does for enterprise AI workloads.
---
Ray for Model Training and Inference
Model training includes:
- Reinforcement Learning (RL)
- Post-training dataset generation via inference
Using Ray’s Actor API:
- An Actor is a stateful worker.
- Decorating a class creates an actor: Ray launches a dedicated worker process for each instance.
- Method calls on that instance are scheduled onto its worker, so state persists across calls.
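A minimal sketch of the Actor API described above; the `ModelWorker` class, its checkpoint path, and the GPU reservation are illustrative assumptions, not code from the session.

```python
# Minimal sketch of Ray's Actor API: a decorated class becomes a stateful
# worker process, and method calls are scheduled on that same process.
import ray

ray.init()

@ray.remote(num_gpus=1)          # reserve one GPU for this actor (assumes a GPU node)
class ModelWorker:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint   # state lives with the worker
        # e.g. self.model = load_model(checkpoint)  # hypothetical loader

    def predict(self, prompt: str) -> str:
        return f"prediction for {prompt!r} from {self.checkpoint}"

worker = ModelWorker.remote("s3://my-bucket/ckpt")   # hypothetical checkpoint path
result = ray.get(worker.predict.remote("hello"))
```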
Performance Boost:
Ray supports RDMA for direct GPU memory transport → faster object transfers.
---
RL Frameworks Built on Ray
Examples include:
- Cursor Composer (Cursor, the AI-powered code editor)
- Verl (Bytedance)
- OpenRLHF
- ROLL (Alibaba)
- NeMo-RL (NVIDIA)
- SkyRL (UC Berkeley)
Training Engines:
Serving Engines:
- Hugging Face, vLLM, SGLang, OpenAI
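As one example of the serving side, here is a hedged sketch of offline batch generation with vLLM; the model name and sampling settings are placeholders you would swap for your own.

```python
# Hedged sketch: high-throughput offline inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")    # example model, not prescribed by the talk
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Explain Ray actors in one sentence.",
           "What does KubeRay do?"]
for output in llm.generate(prompts, params):      # requests are batched internally
    print(output.outputs[0].text)
```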
---
Architecture View — Top & Bottom Layers
Top Layers:
- AI workloads
- Model training/inference frameworks (PyTorch, vLLM, Megatron, SGLang)
Bottom Layers:
- Computing substrates (GPUs, CPUs)
- Orchestrators (Kubernetes, Slurm)
Bridge Layer:
- Distributed compute frameworks (Ray, Spark)
- Manage data ingestion and movement
---
Kubernetes + Ray: Complementary Roles
- Kubernetes → container-level isolation
- Ray → process-level isolation
- Both provide vertical & horizontal autoscaling
Dynamic GPU Allocation:
- Inference workloads fluctuate far more than training workloads
- Ray + Kubernetes → reallocate GPUs as needed
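A hedged sketch of how an application can nudge GPU reallocation from the Ray side; it assumes a running, autoscaler-enabled Ray cluster (for example one managed by KubeRay), and the bundle counts are illustrative.

```python
# Hedged sketch: hinting the Ray autoscaler to provision extra GPUs for an
# inference traffic spike. On Kubernetes, the autoscaler would add GPU pods.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")   # attach to the running Ray cluster

# Ask the autoscaler to scale toward 8 single-GPU workers...
request_resources(bundles=[{"GPU": 1}] * 8)

# ...and later overwrite the hint so idle GPU pods can scale back down
# (each call to request_resources replaces the previous request).
request_resources(bundles=[])
```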
---
Essential Capabilities for AI Platforms
Nishihara emphasized:
- Native multi-cloud support
- Workload prioritization tied to GPU reservations
- Observability & tooling at container, workload, and process levels
- Model/data lineage tracking
- Governance
Observability Tip:
Track object transfer speeds & performance across all levels.
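A small sketch of process- and cluster-level checks available directly from Ray, complementing the container-level metrics Kubernetes already exposes; the output filename is arbitrary.

```python
# Hedged sketch: built-in Ray observability hooks.
import ray

ray.init(address="auto")

print(ray.cluster_resources())      # total CPUs/GPUs/memory registered in the cluster
print(ray.available_resources())    # what is currently free

# Dump a Chrome-trace timeline of recent tasks, including object transfer
# spans, for inspection in chrome://tracing or Perfetto.
ray.timeline(filename="ray-timeline.json")
```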
---
AiToEarn — Open-Source AI Content Engine
For creators and AI engineers:
- Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- AI content generation tools
- Detailed analytics & model rankings
- Governance-friendly workflow
---
In Summary:
The integration of Kubernetes, PyTorch, vLLM, and Ray forms a powerful stack that can:
- Train large-scale models efficiently
- Serve them with low latency
- Dynamically allocate compute resources
- Enable both enterprise AI workloads and creative AI monetization platforms