verl: Volcano Engine Reinforcement Learning for LLMs
verl is a flexible, efficient, and production-ready RL training library for large language models from ByteDance's Seed team. It implements the HybridFlow framework (EuroSys 2025) and powers models such as Doubao-1.5-pro, which achieves O1-level performance on math benchmarks.
When to Use verl
Choose verl when you need:
- Production-ready RL training at scale (tested up to 671B parameters)
- Flexibility to swap backends (FSDP ↔ Megatron-LM ↔ vLLM ↔ SGLang)
- Support for multiple RL algorithms (PPO, GRPO, RLOO, REINFORCE++, DAPO)
- Multi-turn rollout with tool calling for agentic workflows
- Vision-language model RL training
Consider alternatives when:
- You need Megatron-native training → use slime or miles
- You want PyTorch-native abstractions with Monarch → use torchforge
- You only need simple SFT/DPO → use TRL or Axolotl
Key Features
- Training backends: FSDP, FSDP2, Megatron-LM
- Rollout engines: vLLM, SGLang, HuggingFace Transformers
- Algorithms: PPO, GRPO, DAPO, RLOO, ReMax, REINFORCE++, SPIN, SPPO
- Models: Qwen-3, Llama-3.1, DeepSeek, Gemma-2 (0.5B to 671B)
- Advanced: LoRA RL, sequence parallelism, expert parallelism, multi-turn tools
Installation
Option 1: pip install
```bash
pip install verl[vllm]   # or verl[sglang] for the SGLang backend
```
Option 2: Docker (recommended for production)
```bash
docker pull verlai/verl:vllm011.latest
```
Option 3: From source
```bash
git clone https://github.com/volcengine/verl.git
cd verl && pip install -e .[vllm,math]
```
Quick Start: GRPO Training
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=~/data/gsm8k/train.parquet \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    trainer.n_gpus_per_node=8
```
Core Architecture
verl uses a HybridFlow programming model separating control flow from computation:
```
┌─────────────────────────────────────────────────────────┐
│ Single-Process Controller (Ray)                          │
│  - Orchestrates: rollout → reward → train → sync         │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Multi-Process Workers                                    │
│  ├── ActorRolloutRefWorker (policy + generation)         │
│  ├── CriticWorker (value estimation, PPO only)           │
│  └── RewardManager (model-based or rule-based rewards)   │
└─────────────────────────────────────────────────────────┘
```
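To make the split concrete, here is a toy, runnable sketch of that loop. All function names and stub bodies are hypothetical illustrations, not verl's actual classes or APIs:

```python
# Toy sketch of the single-controller loop: the driver orchestrates
# rollout -> reward -> advantage -> update. Names and stubs below are
# hypothetical, not verl's real worker interfaces.

def generate_sequences(prompts):            # stand-in for the rollout workers
    return [f"{p} -> answer {i % 2 * 42}" for i, p in enumerate(prompts)]

def compute_rewards(responses):             # stand-in for the RewardManager
    return [float("42" in r) for r in responses]

def update_policy(responses, advantages):   # stand-in for the actor workers
    print(f"update on {len(responses)} samples, "
          f"mean |adv| = {sum(abs(a) for a in advantages) / len(advantages):.2f}")

for prompts in [["What is 15 + 27?"] * 4]:          # one toy batch
    responses = generate_sequences(prompts)          # 1. rollout
    rewards = compute_rewards(responses)             # 2. reward
    baseline = sum(rewards) / len(rewards)           # 3. simple baseline advantage
    advantages = [r - baseline for r in rewards]
    update_policy(responses, advantages)             # 4. train, then sync weights back
```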
Workflow 1: Math Reasoning with GRPO
Use this workflow for training reasoning models on math tasks like GSM8K or MATH.
Prerequisites Checklist
- GPU cluster with 8+ GPUs (H100 recommended)
- Dataset in parquet format with prompt and reward_model columns
- Base model from HuggingFace Hub
Step 1: Prepare Dataset
```python
import pandas as pd

data = [
    {
        "prompt": [{"role": "user", "content": "What is 15 + 27?"}],
        "reward_model": {"ground_truth": "42"},
    },
    # ... more examples
]
df = pd.DataFrame(data)
df.to_parquet("train.parquet")
```
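Before training, it can help to sanity-check the schema of the file you just wrote; a minimal check with pandas:

```python
import pandas as pd

df = pd.read_parquet("train.parquet")
assert {"prompt", "reward_model"} <= set(df.columns), df.columns
print(df.iloc[0]["prompt"])        # chat-format list of {"role", "content"} messages
print(df.iloc[0]["reward_model"])  # {"ground_truth": ...}
```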
Step 2: Define Reward Function
```python
# reward_function.py
import re

def compute_reward(responses, ground_truths):
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Extract the \boxed{...} answer from the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if match and match.group(1).strip() == gt.strip():
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards
```
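A quick local test of the reward function, with made-up responses for illustration:

```python
responses = [
    "The sum is \\boxed{42}.",  # correct, formatted answer
    "The sum is \\boxed{41}.",  # wrong answer
    "The sum is 42.",           # no \boxed{} -> no credit
]
ground_truths = ["42", "42", "42"]
print(compute_reward(responses, ground_truths))  # expected: [1.0, 0.0, 0.0]
```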
Step 3: Create Training Config
```yaml
# config/grpo_math.yaml
algorithm:
  adv_estimator: grpo
  gamma: 1.0
  lam: 1.0

data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256
  max_prompt_length: 512
  max_response_length: 2048

actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.001
    ppo_mini_batch_size: 64
  rollout:
    name: vllm
    n: 8              # samples per prompt
    temperature: 0.7
    top_p: 0.95

trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  save_freq: 100
```
Step 4: Launch Training
```bash
python3 -m verl.trainer.main_ppo \
    --config-path config \
    --config-name grpo_math \
    trainer.experiment_name=grpo_math_qwen7b
```
Step 5: Monitor and Validate
- Check WandB/TensorBoard for loss curves
- Verify that reward is increasing over steps
- Run evaluation on a held-out test set
Workflow 2: PPO with Critic Model
Use this workflow when you need value-based advantage estimation (GAE).
Key Differences from GRPO
- Requires a separate critic model
- Uses Generalized Advantage Estimation (GAE); see the sketch below
- Better for tasks with dense rewards
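For intuition, GAE can be written as a short backward recursion over per-token rewards and value estimates. This is a generic sketch of the estimator with toy numbers, not verl's internal implementation:

```python
# Generic GAE sketch: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
# A_t = delta_t + gamma * lam * A_{t+1}, computed backwards over the trajectory
# (terminal next value assumed to be 0).
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Toy example: sparse reward of 1.0 on the final token, made-up value estimates.
print(gae_advantages([0.0, 0.0, 1.0], values=[0.2, 0.4, 0.6]))
```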
Configuration
```yaml
algorithm:
  adv_estimator: gae   # use GAE instead of GRPO
  gamma: 0.99
  lam: 0.95

critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct   # can be the same as or different from the actor
  ppo_mini_batch_size: 64

actor_rollout_ref:
  actor:
    use_kl_loss: true
    kl_loss_coef: 0.02
    clip_ratio: 0.2   # PPO clipping
```
Launch with Critic
```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=gae \
    critic.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.n_gpus_per_node=8
```
Workflow 3: Large-Scale Training with Megatron
Use this workflow for models >70B parameters or when you need expert parallelism.
Prerequisites
- Install the Megatron-LM bridge: pip install mbridge
- Convert the model to Megatron format
- Multi-node cluster with NVLink/InfiniBand
Configuration for 70B+ Models
```yaml
actor_rollout_ref:
  model:
    path: /path/to/megatron/checkpoint
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
  rollout:
    name: vllm
    tensor_parallel_size: 8
```
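As a sanity check on the layout (generic parallelism arithmetic, not verl-specific): each model replica spans tensor-parallel × pipeline-parallel GPUs, and the remaining GPUs form the data-parallel dimension. For the 4-node launch below:

```python
tp, pp = 8, 2                        # from the config above
nnodes, gpus_per_node = 4, 8         # from the launch command below
gpus_per_replica = tp * pp           # 16 GPUs hold one copy of the model
dp = (nnodes * gpus_per_node) // gpus_per_replica
print(gpus_per_replica, dp)          # 16 GPUs per replica, data-parallel size 2
```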
Launch Multi-Node
```bash
# On the head node
ray start --head --port=6379

# On each worker node
ray start --address='head_ip:6379'

# Launch training
python3 -m verl.trainer.main_ppo \
    trainer.nnodes=4 \
    trainer.n_gpus_per_node=8
```
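Before launching, you can confirm that every node has joined the cluster using Ray's standard Python API from the head node:

```python
import ray

ray.init(address="auto")         # attach to the running cluster
print(ray.cluster_resources())   # expect nnodes * n_gpus_per_node under "GPU"
```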
Configuration Reference
Algorithm Selection
| Algorithm   | adv_estimator       | Use Case                        |
|-------------|---------------------|---------------------------------|
| GRPO        | grpo                | Critic-free, math/reasoning     |
| PPO/GAE     | gae                 | Dense rewards, value estimation |
| REINFORCE++ | reinforce_plus_plus | Variance reduction              |
| RLOO        | rloo                | Leave-one-out baseline          |
| ReMax       | remax               | Maximum-reward baseline         |
| OPO         | opo                 | Optimal policy optimization     |
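As an illustration of the critic-free estimators, GRPO's advantage is simply each sample's reward normalized within its prompt group of rollout.n responses. A generic sketch of the idea (not verl's implementation):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    # Normalize each sample's reward against its own prompt group:
    # A_i = (r_i - mean(group)) / (std(group) + eps)
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# 8 rollouts for one prompt, 3 of them correct:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]))
```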
Key Parameters
```yaml
# Rollout parameters
actor_rollout_ref.rollout.n: 8               # samples per prompt
actor_rollout_ref.rollout.temperature: 0.7   # sampling temperature
actor_rollout_ref.rollout.top_p: 0.95        # nucleus sampling

# Training parameters
actor_rollout_ref.actor.lr: 1e-6             # learning rate
actor_rollout_ref.actor.ppo_mini_batch_size: 64
actor_rollout_ref.actor.clip_ratio: 0.2      # PPO clip range

# KL control
actor_rollout_ref.actor.use_kl_loss: true
actor_rollout_ref.actor.kl_loss_coef: 0.001
algorithm.kl_ctrl.target_kl: 0.1             # for adaptive KL control
```
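The target_kl value drives an adaptive controller that raises the KL coefficient when the policy drifts too far from the reference and lowers it when it stays well inside the target. A generic sketch of that rule (the classic adaptive-KL heuristic, not verl's exact code):

```python
def adapt_kl_coef(kl_coef, observed_kl, target_kl=0.1, factor=1.5):
    # Tighten the penalty when observed KL overshoots the target,
    # relax it when the policy stays comfortably close to the reference.
    if observed_kl > target_kl * factor:
        kl_coef *= 2.0
    elif observed_kl < target_kl / factor:
        kl_coef /= 2.0
    return kl_coef

print(adapt_kl_coef(0.001, observed_kl=0.25))   # KL too high -> 0.002
print(adapt_kl_coef(0.001, observed_kl=0.02))   # KL low     -> 0.0005
```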
Common Issues and Solutions
Issue: OOM During Rollout
Symptoms: CUDA out of memory during generation phase
Solutions:
```yaml
# Reduce the micro-batch size used during log-prob computation
actor_rollout_ref.rollout.log_prob_micro_batch_size: 4

# Enable gradient checkpointing
actor_rollout_ref.model.enable_gradient_checkpointing: true

# Use FSDP2 with CPU offloading
actor_rollout_ref.actor.strategy: fsdp2
actor_rollout_ref.actor.fsdp_config.offload_policy: true
```
Issue: Training Instability
Symptoms: Loss spikes, reward collapse
Solutions:
```yaml
# Reduce the learning rate
actor_rollout_ref.actor.lr: 5e-7

# Increase the KL penalty
actor_rollout_ref.actor.kl_loss_coef: 0.01

# Enable gradient clipping
actor_rollout_ref.actor.max_grad_norm: 1.0
```
Issue: Slow Weight Sync
Symptoms: Long pauses between rollout and training
Solutions:
```bash
# Use FSDP2 for faster resharding
actor_rollout_ref.actor.strategy=fsdp2

# Enable async weight transfer
trainer.async_weight_update=true
```
Issue: vLLM Version Mismatch
Symptoms: Import errors or generation failures
Solution: Use compatible versions:
```bash
pip install "vllm>=0.8.5,<=0.12.0"
```
Avoid vLLM 0.7.x (known bugs).
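A quick runtime check that the installed version falls in this range (assumes the packaging package is available):

```python
from importlib.metadata import version
from packaging.version import Version

v = Version(version("vllm"))
assert Version("0.8.5") <= v <= Version("0.12.0"), f"vLLM {v} is outside the supported range"
print(f"vLLM {v} OK")
```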
Advanced Topics
Multi-Turn Tool Calling
See references/multi-turn.md for agentic workflows with tool use.
Vision-Language Models
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
```
LoRA Training
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj"]
```
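For a rough sense of scale, a LoRA adapter adds about r × (d_in + d_out) parameters per targeted matrix. The dimensions below are illustrative assumptions for a 7B-scale model, not values read from the checkpoint:

```python
# Back-of-the-envelope LoRA parameter count; the shapes are assumed, not
# taken from the actual model config.
r = 16
layers = 28
shapes = {"q_proj": (3584, 3584), "v_proj": (3584, 512)}  # (d_in, d_out)

lora_params = sum(r * (d_in + d_out) for d_in, d_out in shapes.values()) * layers
print(f"{lora_params / 1e6:.1f}M trainable LoRA parameters")  # a few million vs. ~7B frozen
```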
Resources
- Documentation: https://verl.readthedocs.io/
- Recipes: https://github.com/verl-project/verl-recipe (DAPO, GSPO, etc.)
- Community: Slack at verl-project