verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves O1-level performance on math benchmarks.
When to Use verl
Choose verl when you need:
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
Consider alternatives when:
- You need Megatron-native training → use slime or miles
- You want PyTorch-native abstractions with Monarch → use torchforge
- You only need simple SFT/DPO → use TRL or Axolotl
Key Features
- Training backends: FSDP, FSDP2, Megatron-LM
- Rollout engines: vLLM, SGLang, HuggingFace Transformers
- Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
Installation
Option 1: pip install
```bash
pip install verl[vllm]   # or verl[sglang] for the SGLang backend
```
Option 2: Docker (recommended for production)
```bash
docker pull verlai/verl:vllm011.latest
```
Option 3: From source
```bash
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
```
Quick Start: GRPO Training
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8
```
Core Architecture
verl uses a HybridFlow programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                          │
│  - Orchestrates: rollout → reward → train → sync         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                    │
│  ├── ActorRolloutRefWorker (policy + generation)         │
│  ├── CriticWorker (value estimation, PPO only)           │
│  └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘
```
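To make the split concrete, here is a toy, runnable sketch of that loop. All function names and stub bodies are hypothetical illustrations, not verl's actual classes or APIs:

```python
# Toy sketch of the single-controller loop: the driver orchestrates
# rollout -> reward -> advantage -> update. Names and stubs below are
# hypothetical, not verl's real worker interfaces.

def generate_sequences(prompts):            # stand-in for the rollout workers
    return [f"{p} -> answer {i % 2 * 42}" for i, p in enumerate(prompts)]

def compute_rewards(responses):             # stand-in for the RewardManager
    return [float("42" in r) for r in responses]

def update_policy(responses, advantages):   # stand-in for the actor workers
    print(f"update on {len(responses)} samples, "
          f"mean |adv| = {sum(abs(a) for a in advantages) / len(advantages):.2f}")

for prompts in [["What is 15 + 27?"] * 4]:          # one toy batch
    responses = generate_sequences(prompts)          # 1. rollout
    rewards = compute_rewards(responses)             # 2. reward
    baseline = sum(rewards) / len(rewards)           # 3. simple baseline advantage
    advantages = [r - baseline for r in rewards]
    update_policy(responses, advantages)             # 4. train, then sync weights back
```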
Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
Prerequisites Checklist
- GPU cluster with 8+ GPUs (H100 recommended)
- Dataset in parquet format with prompt and reward_model columns
- Base model from HuggingFace Hub
Step 1: Prepare Dataset
```python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"},
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
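Before training, it can help to sanity-check the schema of the file you just wrote; a minimal check with pandas:

```python
import pandas as pd

df = pd.read_parquet("train.parquet")
assert {"prompt", "reward_model"} <= set(df.columns), df.columns
print(df.iloc[0]["prompt"])        # chat-format list of {"role", "content"} messages
print(df.iloc[0]["reward_model"])  # {"ground_truth": ...}
```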
Step 2: Define Reward Function
```python
# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
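A quick local test of the reward function, with made-up responses for illustration:

```python
responses = [
    "The sum is \\boxed{42}.",  # correct, formatted answer
    "The sum is \\boxed{41}.",  # wrong answer
    "The sum is 42.",           # no \boxed{} -> no credit
]
ground_truths = ["42", "42", "42"]
print(compute_reward(responses, ground_truths))  # expected: [1.0, 0.0, 0.0]
```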
Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8              # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
```
Step 5: Monitor and Validate
- Check WandB/TensorBoard for loss curves
- Verify that reward is increasing over steps
- Run evaluation on a held-out test set
Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
Key Differences from GRPO
- Requires a separate critic model
- Uses Generalized Advantage Estimation (GAE); see the sketch below
- Better for tasks with dense rewards
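For intuition, GAE can be written as a short backward recursion over per-token rewards and value estimates. This is a generic sketch of the estimator with toy numbers, not verl's internal implementation:

```python
# Generic GAE sketch: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
# A_t = delta_t + gamma * lam * A_{t+1}, computed backwards over the trajectory
# (terminal next value assumed to be 0).
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Toy example: sparse reward of 1.0 on the final token, made-up value estimates.
print(gae_advantages([0.0, 0.0, 1.0], values=[0.2, 0.4, 0.6]))
```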
Configuration
```yaml
algorithm:
  adv_estimator: gae   # use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct   # can be the same as or different from the actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2   # PPO clipping
```
Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8
```
Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
Prerequisites
- Install the Megatron-LM bridge: pip install mbridge
- Convert the model to Megatron format
- Multi-node cluster with NVLink/InfiniBand
Configuration for 70B+ Models
```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```
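As a sanity check on the layout (generic parallelism arithmetic, not verl-specific): each model replica spans tensor-parallel × pipeline-parallel GPUs, and the remaining GPUs form the data-parallel dimension. For the 4-node launch below:

```python
tp, pp = 8, 2                        # from the config above
nnodes, gpus_per_node = 4, 8         # from the launch command below
gpus_per_replica = tp * pp           # 16 GPUs hold one copy of the model
dp = (nnodes * gpus_per_node) // gpus_per_replica
print(gpus_per_replica, dp)          # 16 GPUs per replica, data-parallel size 2
```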
Launch Multi-Node
```bash
# On the head node
ray start --head --port=6379

# On each worker node
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
```
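Before launching, you can confirm that every node has joined the cluster using Ray's standard Python API from the head node:

```python
import ray

ray.init(address="auto")         # attach to the running cluster
print(ray.cluster_resources())   # expect nnodes * n_gpus_per_node under "GPU"
```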
Configuration Reference
Algorithm Selection
| Algorithm   | adv_estimator       | Use Case                        |
|-------------|---------------------|---------------------------------|
| GRPO        | grpo                | Critic-free, math/reasoning     |
| PPO/GAE     | gae                 | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction              |
| RLOO        | rloo                | Leave-one-out baseline          |
| ReMax       | remax               | Maximum-reward baseline         |
| OPO         | opo                 | Optimal policy optimization     |
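As an illustration of the critic-free estimators, GRPO's advantage is simply each sample's reward normalized within its prompt group of rollout.n responses. A generic sketch of the idea (not verl's implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    # Normalize each sample's reward against its own prompt group:
    # A_i = (r_i - mean(group)) / (std(group) + eps)
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# 8 rollouts for one prompt, 3 of them correct:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]))
```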
Key Parameters
```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8               # samples per prompt
actor_rollout_ref.rollout.temperature: 0.7   # sampling temperature
actor_rollout_ref.rollout.top_p: 0.95        # nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6             # learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2      # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1             # for adaptive KL control
```
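The target_kl value drives an adaptive controller that raises the KL coefficient when the policy drifts too far from the reference and lowers it when it stays well inside the target. A generic sketch of that rule (the classic adaptive-KL heuristic, not verl's exact code):

```python
def adapt_kl_coef(kl_coef, observed_kl, target_kl=0.1, factor=1.5):
    # Tighten the penalty when observed KL overshoots the target,
    # relax it when the policy stays comfortably close to the reference.
    if observed_kl > target_kl * factor:
        kl_coef *= 2.0
    elif observed_kl < target_kl / factor:
        kl_coef /= 2.0
    return kl_coef

print(adapt_kl_coef(0.001, observed_kl=0.25))   # KL too high -> 0.002
print(adapt_kl_coef(0.001, observed_kl=0.02))   # KL low     -> 0.0005
```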
Common Issues and Solutions
Issue: OOM During Rollout
Symptoms: CUDA out of memory during generation phase
Solutions:
```yaml
# Reduce the micro-batch size used during log-prob computation
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
Issue: Training Instability
Symptoms: Loss spikes, reward collapse
Solutions:
```yaml
# Reduce the learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase the KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
Issue: Slow Weight Sync
Symptoms: Long pauses between rollout and training
Solutions:
```bash
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true
```
Issue: vLLM Version Mismatch
Symptoms: Import errors or generation failures
Solution: Use compatible versions:
```bash
pip install "vllm>=0.8.5,<=0.12.0"
```
Avoid vLLM 0.7.x (known bugs).
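A quick runtime check that the installed version falls in this range (assumes the packaging package is available):

```python
from importlib.metadata import version
from packaging.version import Version

v = Version(version("vllm"))
assert Version("0.8.5") <= v <= Version("0.12.0"), f"vLLM {v} is outside the supported range"
print(f"vLLM {v} OK")
```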
Advanced Topics
Multi-Turn Tool Calling
See references/multi-turn.md for agentic workflows with tool use.
Vision-Language Models
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```
LoRA Training
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
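For a rough sense of scale, a LoRA adapter adds about r × (d_in + d_out) parameters per targeted matrix. The dimensions below are illustrative assumptions for a 7B-scale model, not values read from the checkpoint:

```python
# Back-of-the-envelope LoRA parameter count; the shapes are assumed, not
# taken from the actual model config.
r = 16
layers = 28
shapes = {"q_proj": (3584, 3584), "v_proj": (3584, 512)}  # (d_in, d_out)

lora_params = sum(r * (d_in + d_out) for d_in, d_out in shapes.values()) * layers
print(f"{lora_params / 1e6:.1f}M trainable LoRA parameters")  # a few million vs. ~7B frozen
```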
Resources
- Documentation: https://verl.readthedocs.io/
- Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- Community: Slack at verl-project