Robert Nishihara: Open-Source AI Computing Solution with Kubernetes, Ray, PyTorch, and vLLM
AI Compute Stacks for Emerging Workloads
AI workloads are rapidly increasing in complexity, in both their compute and their data requirements.
Technologies like Kubernetes and PyTorch are critical in building production-ready AI systems capable of handling such demands.
At KubeCon + CloudNativeCon North America 2025,
Robert Nishihara (Anyscale) shared insights on how a stack integrating Kubernetes, Ray, PyTorch, and vLLM can effectively power next-generation AI workloads.
---
Overview of Ray
Ray is an open-source framework designed for scaling machine learning and Python applications.
It originated from a reinforcement learning research project at Berkeley and now orchestrates infrastructure for distributed workloads.
Ray recently joined the PyTorch Foundation to deepen its role in the open-source AI ecosystem.
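As a quick illustration of what "scaling Python applications" means in practice, here is a minimal sketch of Ray's task API; the `square` function and the task count are placeholders, not anything from the talk.

```python
# Minimal sketch: scaling a plain Python function with Ray tasks.
# Assumes `pip install ray`; square() is a stand-in workload.
import ray

ray.init()  # connects to an existing cluster, or starts a local one

@ray.remote
def square(x: int) -> int:
    return x * x

# Launch 100 tasks in parallel across the cluster and gather the results.
futures = [square.remote(i) for i in range(100)]
print(ray.get(futures)[:5])  # [0, 1, 4, 9, 16]
```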
---
Drivers of Next-Generation AI Workloads
Nishihara identified three primary drivers:
- Data Processing
  - Shift from traditional tabular data to multimodal datasets (images, videos, audio, text, sensor data).
  - Multimodal datasets are essential for inference tasks in modern AI.
- Model Training
  - Models are growing in size and complexity.
  - Training uses distributed CPU/GPU computing to accelerate development.
- Model Serving
  - Efficient deployment at scale requires flexible frameworks.
  - Must support high-throughput, low-latency inference.
Key Trend:
Hardware requirements must now accommodate GPUs alongside CPUs.
Computing focus has shifted from “SQL ops on CPUs” to “inference ops on GPUs”.
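To make the shift from "SQL ops on CPUs" to "inference ops on GPUs" concrete, below is a hedged sketch of multimodal batch inference with Ray Data; the `EmbedImages` class, the S3 paths, and the worker counts are hypothetical placeholders.

```python
# Hedged sketch: running "inference ops on GPUs" over multimodal data with Ray Data.
import ray
import numpy as np

class EmbedImages:
    """Stateful UDF: would load a vision model once per worker and reuse it across batches."""
    def __init__(self):
        # e.g. self.model = load_my_vision_model().to("cuda")  # hypothetical loader
        self.model = None

    def __call__(self, batch: dict) -> dict:
        # Run the model on the GPU; here we fake one 512-dim embedding per image.
        n = len(batch["image"])
        return {"embedding": np.zeros((n, 512), dtype=np.float32)}

ds = ray.data.read_images("s3://my-bucket/images/")        # hypothetical input path
ds = ds.map_batches(EmbedImages, batch_size=64,
                    num_gpus=1, concurrency=4)              # 4 single-GPU actors
ds.write_parquet("s3://my-bucket/embeddings/")              # hypothetical output path
```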
---
Example: AiToEarn Platform
AiToEarn demonstrates how such stacks enable content creation and monetization:
- Generates AI content
- Publishes across multiple platforms (Douyin, WeChat, Facebook, YouTube, etc.)
- Offers analytics and AI model rankings
- Fully open source
Purpose:
Connect AI tools, cross-platform publishing, analytics, and deployment — much like Kubernetes + PyTorch + Ray does for enterprise AI workloads.
---
Ray for Model Training and Inference
Model training includes:
- Reinforcement Learning (RL)
- Post-training dataset generation via inference
Using Ray’s Actor API:
- An Actor is a stateful worker.
- Decorating a class creates an actor: Ray launches a dedicated worker process for each instance.
- Method calls on that instance are scheduled onto its worker, so state persists across calls.
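A minimal sketch of the Actor API described above; the `ModelWorker` class, its checkpoint path, and the GPU reservation are illustrative assumptions, not code from the session.

```python
# Minimal sketch of Ray's Actor API: a decorated class becomes a stateful
# worker process, and method calls are scheduled on that same process.
import ray

ray.init()

@ray.remote(num_gpus=1)          # reserve one GPU for this actor (assumes a GPU node)
class ModelWorker:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint   # state lives with the worker
        # e.g. self.model = load_model(checkpoint)  # hypothetical loader

    def predict(self, prompt: str) -> str:
        return f"prediction for {prompt!r} from {self.checkpoint}"

worker = ModelWorker.remote("s3://my-bucket/ckpt")   # hypothetical checkpoint path
result = ray.get(worker.predict.remote("hello"))
```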
Performance Boost:
Ray supports RDMA for direct GPU memory transport → faster object transfers.
---
RL Frameworks Built on Ray
Examples include:
- Cursor Composer (Cursor, the AI-powered code editor)
- Verl (Bytedance)
- OpenRLHF
- ROLL (Alibaba)
- NeMo-RL (NVIDIA)
- SkyRL (UC Berkeley)
Training Engines:
Serving Engines:
- Hugging Face, vLLM, SGLang, OpenAI
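As one example of the serving side, here is a hedged sketch of offline batch generation with vLLM; the model name and sampling settings are placeholders you would swap for your own.

```python
# Hedged sketch: high-throughput offline inference with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")    # example model, not prescribed by the talk
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Explain Ray actors in one sentence.",
           "What does KubeRay do?"]
for output in llm.generate(prompts, params):      # requests are batched internally
    print(output.outputs[0].text)
```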
---
Architecture View — Top & Bottom Layers
Top Layers:
- AI workloads
- Model training/inference frameworks (PyTorch, vLLM, Megatron, SGLang)
Bottom Layers:
- Computing substrates (GPUs, CPUs)
- Orchestrators (Kubernetes, Slurm)
Bridge Layer:
- Distributed compute frameworks (Ray, Spark)
- Manage data ingestion and movement
---
Kubernetes + Ray: Complementary Roles
- Kubernetes → container-level isolation
- Ray → process-level isolation
- Both provide vertical & horizontal autoscaling
Dynamic GPU Allocation:
- Inference workloads fluctuate far more than training workloads
- Ray + Kubernetes → reallocate GPUs as needed
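A hedged sketch of how an application can nudge GPU reallocation from the Ray side; it assumes a running, autoscaler-enabled Ray cluster (for example one managed by KubeRay), and the bundle counts are illustrative.

```python
# Hedged sketch: hinting the Ray autoscaler to provision extra GPUs for an
# inference traffic spike. On Kubernetes, the autoscaler would add GPU pods.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")   # attach to the running Ray cluster

# Ask the autoscaler to scale toward 8 single-GPU workers...
request_resources(bundles=[{"GPU": 1}] * 8)

# ...and later overwrite the hint so idle GPU pods can scale back down
# (each call to request_resources replaces the previous request).
request_resources(bundles=[])
```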
---
Essential Capabilities for AI Platforms
Nishihara emphasized:
- Native multi-cloud support
- Workload prioritization tied to GPU reservations
- Observability & tooling at container, workload, and process levels
- Model/data lineage tracking
- Governance
Observability Tip:
Track object transfer speeds & performance across all levels.
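A small sketch of process- and cluster-level checks available directly from Ray, complementing the container-level metrics Kubernetes already exposes; the output filename is arbitrary.

```python
# Hedged sketch: built-in Ray observability hooks.
import ray

ray.init(address="auto")

print(ray.cluster_resources())      # total CPUs/GPUs/memory registered in the cluster
print(ray.available_resources())    # what is currently free

# Dump a Chrome-trace timeline of recent tasks, including object transfer
# spans, for inspection in chrome://tracing or Perfetto.
ray.timeline(filename="ray-timeline.json")
```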
---
AiToEarn — Open-Source AI Content Engine
For creators and AI engineers:
- Multi-platform publishing (Douyin, Kwai, WeChat, Bilibili, Rednote, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter)
- AI content generation tools
- Detailed analytics & model rankings
- Governance-friendly workflow
---
In Summary:
The integration of Kubernetes, PyTorch, vLLM, and Ray forms a powerful stack that can:
- Train large-scale models efficiently
- Serve them with low latency
- Dynamically allocate compute resources
- Enable both enterprise AI workloads and creative AI monetization platforms