torchforge: PyTorch-Native Agentic RL Library

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "torchforge-rl-training" with this command: npx skills add zechenzhangagi/ai-research-skills/zechenzhangagi-ai-research-skills-torchforge-rl-training

torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research by letting you focus on algorithms while handling distributed training, inference, and weight sync automatically.

When to Use torchforge

Choose torchforge when you need:

  • Clean separation between RL algorithms and infrastructure

  • PyTorch-native abstractions (no Ray dependency)

  • Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)

  • Scalable training with Monarch actor system

  • Integration with TorchTitan for model parallelism

Consider alternatives when:

  • You need production-ready stability → use miles or verl

  • You want Megatron-native training → use slime

  • You need long-term API stability → torchforge is experimental and its APIs may change

Key Features

  • Algorithm isolation: Implement RL algorithms without touching infrastructure

  • Scalability: From single GPU to thousands via Monarch

  • Modern stack: TorchTitan (training), vLLM (inference), TorchStore (sync)

  • Loss functions: GRPO, DAPO, CISPO, GSPO, SAPO built-in

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│ Application Layer (Your Code)                           │
│ - Define reward models, loss functions, sampling        │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Forge API Layer                                         │
│ - Episode, Group dataclasses                            │
│ - Service interfaces (async/await)                      │
└─────────────────────┬───────────────────────────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│ Distributed Services (Monarch)                          │
│ ├── Trainer (TorchTitan FSDP)                           │
│ ├── Generator (vLLM inference)                          │
│ ├── Reference Model (frozen KL baseline)                │
│ └── Reward Actors (compute rewards)                     │
└─────────────────────────────────────────────────────────┘
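Read top to bottom, the control flow is: your application asks the Generator service for a group of completions, Reward Actors score them, and the Trainer consumes the resulting batch. The sketch below is a conceptual stand-in, not torchforge's actual API: the stub classes and the generate/compute method names are assumptions for illustration, and only train_step also appears in the API reference later in this page.

import asyncio
from dataclasses import dataclass

@dataclass
class Completion:                 # simplified stand-in for the Completion dataclass described below
    text: str
    logprobs: list[float]

class Generator:                  # stands in for the vLLM inference service
    async def generate(self, prompt: str, n: int) -> list[Completion]:
        return [Completion(text=f"draft {i} ... 42", logprobs=[-0.1]) for i in range(n)]

class RewardActor:                # stands in for a reward actor
    async def compute(self, prompt: str, completion: Completion) -> float:
        return float("42" in completion.text)

class Trainer:                    # stands in for the TorchTitan-backed trainer
    def train_step(self, inputs, targets) -> float:
        return 0.0                # the real service returns the loss

async def rollout_and_train(prompt: str) -> float:
    generator, reward, trainer = Generator(), RewardActor(), Trainer()
    completions = await generator.generate(prompt, n=8)              # one group per prompt
    rewards = [await reward.compute(prompt, c) for c in completions]
    # In a real run, inputs/targets are built from tokens, logprobs and advantages
    return trainer.train_step(inputs=[{}], targets=[{"rewards": rewards}])

loss = asyncio.run(rollout_and_train("What is 6 * 7?"))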

Installation

Create environment

conda create -n forge python=3.12
conda activate forge

Install (handles PyTorch nightly + dependencies)

./scripts/install.sh

Verify

python -c "import torch, forge, vllm; print('OK')"

ROCm Installation

./scripts/install_rocm.sh

Quick Start

SFT Training (2+ GPUs)

python -m apps.sft.main --config apps/sft/llama3_8b.yaml

GRPO Training (3+ GPUs)

python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml

Workflow 1: GRPO Training for Math Reasoning

Use this workflow for training reasoning models with group-relative advantages.

Prerequisites Checklist

  • 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)

  • Model from HuggingFace Hub

  • Training dataset (GSM8K, MATH, etc.)

Step 1: Create Configuration

config/grpo_math.yaml

model: "Qwen/Qwen2.5-7B-Instruct"

dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8        # Responses per prompt
  clip_low: 0.2
  clip_high: 0.28
  beta: 0.1           # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true

Step 2: Define Reward Function

rewards.py

Reward functions are in forge.data.rewards

from forge.data.rewards import MathReward, ThinkingReward
import re

Or define your own reward function

class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the boxed answer from the response
        match = re.search(r'\\boxed\{([^}]+)\}', response)
        if not match:
            return 0.0

        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
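To see how group-relative advantages arise from such a reward, the sketch below scores one group of sampled responses and normalizes the rewards within the group (subtract the group mean, divide by the group standard deviation). This mirrors the usual GRPO formulation and reuses the CustomMathReward defined above; the prompt and response strings are made up, and this is an illustration rather than torchforge's internal advantage code.

import torch

reward_fn = CustomMathReward()
prompt = "Natalia sold 48 clips in April and half as many in May. How many in total?"
target = "72"

# One group: n_samples responses sampled for the same prompt
responses = [
    "... so the total is \\boxed{72}",
    "... the answer is \\boxed{96}",
    "... therefore \\boxed{72}",
    "no boxed answer given",
]

rewards = torch.tensor([reward_fn(prompt, r, target) for r in responses])

# Group-relative advantages: normalize rewards within the group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(rewards.tolist(), advantages.tolist())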

Step 3: Launch Training

python -m apps.grpo.main --config config/grpo_math.yaml

Step 4: Monitor Progress

  • Check W&B dashboard for loss curves

  • Verify entropy is decreasing (policy becoming more deterministic)

  • Monitor KL divergence (should stay bounded; see the sketch below for a quick way to estimate both)
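If your logging setup does not already report these, token-level entropy and KL can be approximated from per-token log-probabilities. A minimal sketch, assuming you have logprobs, ref_logprobs and padding_mask tensors shaped [batch, seq_len] as in the loss-function signatures later in this page; the estimators here are crude sample-based proxies, not torchforge's own metrics:

import torch

def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> float:
    return ((x * mask).sum() / mask.sum()).item()

def policy_stats(logprobs, ref_logprobs, padding_mask):
    # Entropy proxy: negative mean log-prob of the sampled tokens
    entropy = masked_mean(-logprobs, padding_mask)
    # Crude KL(policy || reference) estimate on the sampled tokens
    kl = masked_mean(logprobs - ref_logprobs, padding_mask)
    return {"entropy": entropy, "kl": kl}

# Example with dummy tensors
lp = torch.randn(2, 8).clamp(max=0)
ref = torch.randn(2, 8).clamp(max=0)
mask = torch.ones(2, 8)
print(policy_stats(lp, ref, mask))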

Workflow 2: Custom Loss Function

Use this workflow to implement new RL algorithms.

Step 1: Create Loss Class

src/forge/losses/custom_loss.py

import torch
import torch.nn as nn

class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Compute importance ratio
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy gradient
        clipped_ratio = torch.clamp(
            ratio,
            1 - self.clip_range,
            1 + self.clip_range
        )
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # KL penalty
        kl = ref_logprobs - logprobs

        # Apply mask and aggregate
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()

        return loss

Step 2: Integrate into Application

apps/custom/main.py

from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

In training loop

loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
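Before wiring the loss into a full run, it can be sanity-checked on dummy tensors. A minimal smoke test; the [batch, seq_len] shapes are arbitrary assumptions, and any layout with a matching mask works:

import torch
from forge.losses.custom_loss import CustomLoss

loss_fn = CustomLoss(clip_range=0.2, beta=0.1)

batch, seq_len = 2, 16
logprobs = torch.randn(batch, seq_len, requires_grad=True)
ref_logprobs = torch.randn(batch, seq_len)
advantages = torch.randn(batch, seq_len)
padding_mask = torch.ones(batch, seq_len)

loss = loss_fn(logprobs, ref_logprobs, advantages, padding_mask)
assert loss.ndim == 0               # scalar loss
loss.backward()                     # gradients flow back to logprobs
print(float(loss), logprobs.grad.shape)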

Workflow 3: Multi-GPU Distributed Training

Use this workflow for scaling to multiple GPUs or nodes.

Configuration for Distributed

config/distributed.yaml

model: "meta-llama/Meta-Llama-3.1-8B-Instruct"

parallelism:
  tensor_parallel_degree: 2      # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2        # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true

Launch with SLURM

Submit job

sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh

Launch Locally (Multi-GPU)

8 GPU setup

python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4

Core API Reference

Training Batch Format

torchforge uses dictionary-based batches for training:

inputs: list of dicts with torch.Tensor values

inputs = [{"tokens": torch.Tensor}]

targets: list of dicts with training signals

targets = [{
    "response": torch.Tensor,
    "ref_logprobs": torch.Tensor,
    "advantages": torch.Tensor,
    "padding_mask": torch.Tensor
}]

train_step returns loss as float

loss = trainer.train_step(inputs, targets)
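A hedged end-to-end sketch of assembling one such batch. The inputs/targets keys and the trainer.train_step call come from this reference; the tensor contents are random placeholders, the shapes are assumptions, and trainer is assumed to be an already-constructed training service:

import torch

batch, seq_len = 4, 128

inputs = [{"tokens": torch.randint(0, 32_000, (batch, seq_len))}]

targets = [{
    "response": torch.randint(0, 32_000, (batch, seq_len)),
    "ref_logprobs": torch.randn(batch, seq_len).clamp(max=0),
    "advantages": torch.randn(batch, seq_len),
    "padding_mask": torch.ones(batch, seq_len),
}]

loss = trainer.train_step(inputs, targets)    # returns the loss as a float
print(f"step loss: {loss:.4f}")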

Completion

Generated output from vLLM:

@dataclass
class Completion:
    text: str                # Generated text
    token_ids: list[int]     # Token IDs
    logprobs: list[float]    # Log probabilities
    metadata: dict           # Custom metadata
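One way to bridge from completions to the dictionary batch format above is to pad each response in a group to a common length. The helper below is illustrative only, not torchforge's own collation code, and it assumes the generator has already returned completions as a list of Completion objects:

import torch

def collate(completions, pad_id: int = 0):
    max_len = max(len(c.token_ids) for c in completions)
    responses, logprobs, mask = [], [], []
    for c in completions:
        pad = max_len - len(c.token_ids)
        responses.append(c.token_ids + [pad_id] * pad)
        logprobs.append(c.logprobs + [0.0] * pad)
        mask.append([1.0] * len(c.token_ids) + [0.0] * pad)
    # Tensors shaped [group_size, max_len], ready for the targets dict
    return torch.tensor(responses), torch.tensor(logprobs), torch.tensor(mask)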

Built-in Loss Functions

Loss Functions

Loss functions are in the forge.losses module:

from forge.losses import SimpleGRPOLoss, ReinforceLoss

SimpleGRPOLoss for GRPO training

loss_fn = SimpleGRPOLoss(beta=0.1)

Forward pass

loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask
)

ReinforceLoss

from forge.losses.reinforce_loss import ReinforceLoss

With optional importance ratio clipping

loss_fn = ReinforceLoss(clip_ratio=0.2)

Common Issues and Solutions

Issue: Not Enough GPUs

Symptoms: "Insufficient GPU resources" error

Solutions:

Reduce service requirements

services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true

Alternatively, remove the ref_model service entirely (the generator weights are used instead)

Or use CPU for reference model:

ref_model:
  with_gpus: false

Issue: OOM During Generation

Symptoms: CUDA OOM in vLLM

Solutions:

Reduce batch size

grpo:
  n_samples: 4      # Reduce from 8

Or reduce sequence length

training:
  seq_len: 2048

Issue: Slow Weight Sync

Symptoms: Long pauses between training and generation

Solutions:

Enable RDMA (if available)

export TORCHSTORE_USE_RDMA=1

Or reduce sync frequency

training:
  sync_interval: 10      # Sync every 10 steps

Issue: Policy Collapse

Symptoms: Entropy drops to zero, reward stops improving

Solutions:

Increase KL penalty

grpo:
  beta: 0.2      # Increase from 0.1

Or add entropy bonus

training:
  entropy_coef: 0.01
