# An Engineering Design Perspective on veRL and OpenRLHF
## Introduction
In **RL**, **distillation**, and other AI workflows, multiple models often need to **cooperate** — managing computation, data communication, and process control.
For example, in **PPO** (and its derivatives), you may need to coordinate up to *five types of modules*:
- **Actor**
- **Rollout**
- **Ref**
- **Critic**
- **Reward**
Each can take on one or more roles: *train*, *eval*, or *generate*.
In distillation, you might have multiple groups of Teachers and Students distilling collaboratively.
When using traditional **Pretrain** or **SFT** training setups — single-script, multi-process (`deepspeed` or `torchrun`) — flexible task scheduling and resource allocation become difficult.
**Ray’s** *remote asynchronous invocation* + *Actor abstraction* allow each module to have:
- Independent execution units
- Independent task-handling logic
This **decoupled architecture** fits scenarios of multiple models interacting frequently.
In this article, we examine **veRL** [1] and **OpenRLHF** [2] from three angles:
1. **Roles** of each module and their training/inference backends
2. Using Ray to **allocate** and **share resources** (colocate / Hybrid Engine)
3. **Data flow** and **control flow** — the full scheduling chain from distribution to execution
> Code links are provided; reading the actual code is highly recommended — *talk is cheap*, code reveals truth.
---
## 1. Ray Overview
### 1.1 Launching Ray
Ray supports many languages, but Python is most common.
- **Auto-cluster via Python**:

  ```python
  import ray
  ray.init()
  ```

  This creates a **driver process** and starts a Ray cluster automatically.

- **Manual start via CLI**:

  ```bash
  ray start --head
  ```

  Then attach from Python via `ray.init(address="auto")`.
### 1.2 Execution Model
- Ray runs a **pool of resident processes** per node.
- `@ray.remote` turns a function/class into a schedulable **Task** or **Actor**.
- `.remote()` schedules them asynchronously — results retrieved via `ray.get()`.
- Arguments/returns are **serialized** into **Objects** stored in Ray's **Object Store**: a per-node shared-memory store whose objects can be fetched from any node in the cluster.
- Use `ray.get(object_ref)` to retrieve and deserialize transparently (see the sketch below).
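A minimal, self-contained sketch of these pieces (a remote task, ObjectRefs, and the Object Store):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def square(block):
    # Runs in a Ray worker process; the argument is fetched from the Object Store.
    return block ** 2

# ray.put serializes the array into the Object Store and returns an ObjectRef.
data_ref = ray.put(np.arange(8))

# .remote() schedules the task asynchronously and immediately returns an ObjectRef.
result_ref = square.remote(data_ref)

# ray.get blocks until the result is ready and deserializes it transparently.
print(ray.get(result_ref))  # [ 0  1  4  9 16 25 36 49]
```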
---
### Example: Nested Actor Creation
```python
import ray

ray.init()

@ray.remote
class ChildActor:
    def do_work(self):
        return "Work done by child"

@ray.remote
class ParentActor:
    def create_child(self):
        self.child_actor = ChildActor.remote()

    def get_work(self):
        return ray.get(self.child_actor.do_work.remote())

parent_actor = ParentActor.remote()
ray.get(parent_actor.create_child.remote())
print(ray.get(parent_actor.get_work.remote()))  # Work done by child
```
✅ Actors can **create** and **invoke** other Actors
⚠️ No method inheritance between Actors — apply `@ray.remote` only to final classes.
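If you need to share code between Actors, one pattern consistent with this constraint is to keep the shared logic in a plain Python base class and apply `ray.remote` only to the concrete leaf class. A minimal sketch (class names are illustrative):

```python
import ray

ray.init()

class BaseWorker:
    # Plain Python base class: no @ray.remote here.
    def greet(self):
        return "hello from the base class"

@ray.remote
class GPUWorker(BaseWorker):
    # Only the concrete leaf class is turned into a Ray Actor.
    def work(self):
        return self.greet() + ", doing work"

worker = GPUWorker.remote()
print(ray.get(worker.work.remote()))  # hello from the base class, doing work
```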
---
## 2. Resource Scheduling in Ray
### 2.1 Basic Resource Allocation
When creating an Actor:
```python
MyActor.options(num_cpus=2, num_gpus=1).remote()
```
If resources are insufficient, scheduling waits until they become available, enforcing **exclusive** allocation (see the sketch below).
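A minimal, CPU-only sketch of this behavior (the cluster size and actor names are arbitrary):

```python
import time
import ray

# Start a small local cluster with only 2 CPUs so contention is easy to observe.
ray.init(num_cpus=2)

@ray.remote
class Worker:
    def busy(self):
        time.sleep(1)
        return "done"

# Each actor reserves 2 CPUs for its lifetime, so only one fits at a time;
# the second actor stays pending until the first releases its resources.
first = Worker.options(num_cpus=2).remote()
second = Worker.options(num_cpus=2).remote()

print(ray.get(first.busy.remote()))   # runs immediately
ray.kill(first)                       # frees the 2 CPUs
print(ray.get(second.busy.remote()))  # can now be scheduled
```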
### 2.2 Placement Groups
For fine-grained control, create **placement groups** and divide them into **bundles**:
- Enable **exclusive** or **shared** GPU/CPU allocation
- Adopted by both **veRL** and **OpenRLHF**
Example (from a distillation framework):
```python
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

# One Ray Actor per worker, pinned to a specific bundle of the placement group
# owned by the surrounding framework (the self.* attributes come from that class).
remote_ray_worker = ray.remote(
    num_gpus=self.num_gpus_per_worker,
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=self.resource_group.ray_pg,
        placement_group_bundle_index=worker_rank,
    ),
    runtime_env={"env_vars": env_vars},
    max_concurrency=2,
)(self.worker_cls).remote(...)
```
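For a self-contained illustration of placement groups and bundles, independent of any framework, here is a minimal CPU-only sketch (the bundle sizes and class names are arbitrary):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=4)

# Reserve two bundles of 2 CPUs each; PACK tries to keep them on the same node.
pg = placement_group([{"CPU": 2}, {"CPU": 2}], strategy="PACK")
ray.get(pg.ready())  # block until the reservation is granted

@ray.remote(num_cpus=2)
class Worker:
    def pid(self):
        import os
        return os.getpid()

# Pin one worker to each bundle of the placement group.
workers = [
    Worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,
        )
    ).remote()
    for i in range(2)
]
print(ray.get([w.pid.remote() for w in workers]))  # two distinct worker processes
```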
---
## 3. OpenRLHF — Engineering Analysis
Version: **v0.5.9.post1**
Ray-based training supports **PPO** + derivatives (REINFORCE++, GRPO, RLOO).
### 3.1 Key Modules
- **Actor**: Train (forward + backward, weight updates)
- **Critic**: Train + Eval (forward + backward)
- **Rollout**: Batch inference — needs Actor weight sync
- **RM/Ref**: Eval — forward only, no weight updates
### 3.2 Backends
- Train: **Deepspeed**
- Batch inference: **vLLM** (supports DP + TP)
⚠️ Numerical precision can differ between the training and inference engines, which affects metrics such as eval loss.
---
### 3.3 Resource Colocation
**Colocate = multiple Ray Actors share same GPU**.
Modes:
1. `colocate_actor_ref`
2. `colocate_critic_reward`
3. `colocate_all_models`
- Modules in same **placement_group** share GPU bundles
- Set fractional GPU allocation: `num_gpus_per_actor=0.2`
- Actor-Rollout colocation requires **CUDA IPC** for weight sync because NCCL cannot communicate between processes on the same GPU
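A minimal sketch of the colocation idea (not OpenRLHF's actual code; it assumes a node with at least one GPU): several Ray Actors are pinned to the same placement-group bundle, each requesting a fraction of the GPU, so they end up sharing the same device.

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One bundle holding a single GPU that several modules will share.
pg = placement_group([{"GPU": 1, "CPU": 4}])
ray.get(pg.ready())

@ray.remote
class Module:
    def __init__(self, name):
        self.name = name

    def device_report(self):
        import os
        # Colocated actors see the same CUDA device.
        return self.name, os.environ.get("CUDA_VISIBLE_DEVICES")

strategy = PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)

# Fractional num_gpus lets multiple actors share one physical GPU.
actor = Module.options(num_gpus=0.2, scheduling_strategy=strategy).remote("actor")
ref = Module.options(num_gpus=0.2, scheduling_strategy=strategy).remote("ref")

print(ray.get([actor.device_report.remote(), ref.device_report.remote()]))
```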
---
### 3.4 Data/Control Flow
OpenRLHF uses **distributed control**. A coarse overview of the flow:
```text
ActorGroup.async_fit_actor_model():
    for ActorWorker:
        PPOTrainer.fit():
            bind Ref, Critic, RM workers
            for episode:
                sample_and_generate_rollout()
                prepare_experiences()
                store_in_replay_buffer()
                train_actor_and_critic()
                if colocated:
                    broadcast_weights_CUDA_IPC()
```
Actor Workers handle:
- Computation
- Data distribution
- Scheduling
➡️ Potential performance bottleneck.
---
## 4. veRL — Engineering Analysis
Version: **v0.2.0.post2**
### 4.1 Key Traits
- **Single Controller** Actor (CPU) — centralized control
- Manages:
- Data/control flow
- WorkerDict creation
- Task scheduling
- Recommended: run controller on **non-head node**
### 4.2 WorkerDict
- Base Worker containing **all module types** (Actor, Critic, Rollout, Ref, Reward)
- Methods bound dynamically via `@register`
- Supports **Hybrid Engine**: same WorkerDict can switch between multiple backends
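A hypothetical illustration of the "register plus dynamically bound methods" idea (not veRL's actual implementation; `register`, `ActorRole`, `RolloutRole`, and `create_worker_dict_cls` are made-up names): a decorator tags the methods each role wants to expose, and a combined worker class is assembled from several role classes.

```python
def register(fn):
    # Tag a method so it can be exported onto the combined worker class.
    fn._registered = True
    return fn

class ActorRole:
    @register
    def update_policy(self, batch):
        return f"actor updated on {len(batch)} samples"

class RolloutRole:
    @register
    def generate(self, prompts):
        return [p + " -> <generated>" for p in prompts]

def create_worker_dict_cls(role_classes):
    """Assemble one worker class exposing every @register-tagged method."""
    namespace = {}
    for role_cls in role_classes:
        for name, attr in vars(role_cls).items():
            if callable(attr) and getattr(attr, "_registered", False):
                namespace[name] = attr
    return type("WorkerDict", (object,), namespace)

WorkerDict = create_worker_dict_cls([ActorRole, RolloutRole])
w = WorkerDict()
print(w.generate(["hello"]))       # ['hello -> <generated>']
print(w.update_policy([1, 2, 3]))  # actor updated on 3 samples
```

In practice the combined class would then be wrapped with `ray.remote` and scheduled like any other Actor.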
---
### 4.3 Scheduler
**RayWorkerGroup**:
- Retrieves resources
- Assigns modules to WorkerDict
- Dispatches tasks
- Collects results with pre-defined data distribution strategies (see the sketch below)
Colocation:
- Current code: **all modules colocated**
- Potential: arbitrary combinations (requires refactor)
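A rough sketch of the dispatch/collect pattern (not veRL's actual `RayWorkerGroup`; worker and method names are hypothetical): the controller splits a batch into shards, invokes the same method on every worker in parallel, and concatenates the results.

```python
import ray

ray.init(num_cpus=4)

@ray.remote
class WorkerDict:
    def compute_values(self, shard):
        # Stand-in for a real module method (e.g. a critic forward pass).
        return [x * 2 for x in shard]

workers = [WorkerDict.remote() for _ in range(4)]

def dispatch_and_collect(workers, method_name, batch):
    """Split `batch` into contiguous shards, call `method_name` on each worker's
    shard in parallel, then concatenate the results in the original order."""
    n = len(workers)
    chunk = (len(batch) + n - 1) // n
    shards = [batch[i * chunk:(i + 1) * chunk] for i in range(n)]
    refs = [getattr(w, method_name).remote(s) for w, s in zip(workers, shards)]
    return [y for out in ray.get(refs) for y in out]

print(dispatch_and_collect(workers, "compute_values", list(range(8))))
# [0, 2, 4, 6, 8, 10, 12, 14]
```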
---
### 4.4 Weight Sharing
In veRL, the Actor and Rollout share weights via `ShardingManager` within the same process, so no CUDA IPC is required. This is memory-efficient and potentially faster.
---
## 5. Comparison & Summary
### Module Design
- **OpenRLHF**: separate modules per backend (simpler & clear)
- **veRL**: unified WorkerDict (flexible, supports multi-backend hybrid engines)
### Scheduling
- **OpenRLHF**: placement group per model; same-rank workers share bundles
- **veRL**: multiple modules in one WorkerDict share resources
Preferred: OpenRLHF’s clarity; veRL’s approach more complex but potentially more efficient.
---
### Control Flow
- **OpenRLHF**: control on Actor workers (can bottleneck)
- **veRL**: centralized in single controller (potentially better scalability)
---
### Performance Considerations
- Large tensor transfers via Ray Object Store → costly serialization/deserialization
- Optimization: use **NCCL / CUDA IPC** for large transfers, plus asynchronous pipelines (double buffering; see the sketch below)
- Careful VRAM management needed
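A sketch of the asynchronous-pipeline (double-buffering) idea, with hypothetical function names standing in for generation and training: launch the next rollout before consuming the current one, so generation and training overlap.

```python
import time
import ray

ray.init(num_cpus=4)

@ray.remote
def generate_rollout(step):
    time.sleep(0.5)  # stand-in for batched inference
    return f"rollout-{step}"

def train_on(rollout):
    time.sleep(0.5)  # stand-in for a training step
    print("trained on", rollout)

# Double buffering: while we train on rollout i, rollout i+1 is already being generated.
next_ref = generate_rollout.remote(0)
for step in range(1, 4):
    current = next_ref
    next_ref = generate_rollout.remote(step)  # prefetch the next batch asynchronously
    train_on(ray.get(current))                # train on the previously generated batch
train_on(ray.get(next_ref))
```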
---
## References
1. veRL: [https://github.com/volcengine/verl](https://github.com/volcengine/verl)
2. OpenRLHF: [https://github.com/OpenRLHF/OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
3. Ray Tasks: [https://docs.ray.io/en/latest/ray-core/tasks.html](https://docs.ray.io/en/latest/ray-core/tasks.html)
4. Ray Actors: [https://docs.ray.io/en/latest/ray-core/actors.html](https://docs.ray.io/en/latest/ray-core/actors.html)
... *(see full original reference list for all links)*
---
**Final Thoughts**
Both approaches have trade-offs between **clarity**, **flexibility**, and **performance**.
Understanding Ray’s abstractions — Actors, Tasks, Placement Groups — is fundamental to building efficient, multi-model pipelines.