# miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.
## When to Use miles

Choose miles when you need:

- Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
- FP8 or INT4 quantization-aware training
- Bit-wise identical train-inference alignment
- Speculative RL for maximum throughput
- Production stability with enterprise support

Consider alternatives when:

- You want the research-grade original → use slime
- You need flexible backend swapping → use verl
- You want PyTorch-native abstractions → use torchforge
## Key Features

### Low-Precision Training

- **Unified FP8**: End-to-end FP8 for both inference and training
- **INT4 QAT**: Fits 1TB-class models in single-machine VRAM (H200)
- **Rollout Routing Replay (R3)**: Bit-wise expert alignment for MoE

### Performance Optimizations

- **Speculative RL**: 25%+ rollout speedup with online-SFT draft models
- **Zero-Copy Weight Sync**: CUDA IPC zero-copy mapping
- **Partial Rollout**: Recycles half-finished trajectories
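The partial-rollout idea can be sketched in a few lines. This is an illustrative toy, not miles' actual classes: trajectories cut off by a generation budget go into a buffer and are resumed on the next round instead of being thrown away.

```python
from collections import deque


class PartialRolloutBuffer:
    """Toy sketch of partial rollout (hypothetical API, not miles' real code):
    half-finished trajectories are kept and resumed on the next rollout round."""

    def __init__(self):
        self.unfinished = deque()

    def add_unfinished(self, prompt_tokens, generated_tokens):
        # Store the partial trajectory so generation can continue from it later.
        self.unfinished.append((prompt_tokens, generated_tokens))

    def next_batch(self, fresh_prompts, batch_size):
        batch = []
        # Resume half-finished trajectories first: their work is already paid for.
        while self.unfinished and len(batch) < batch_size:
            prompt, generated = self.unfinished.popleft()
            batch.append((prompt, generated))
        # Fill the remainder of the batch with fresh prompts.
        for prompt in fresh_prompts[: batch_size - len(batch)]:
            batch.append((prompt, []))
        return batch


buf = PartialRolloutBuffer()
buf.add_unfinished([1, 2, 3], [10, 11])
batch = buf.next_batch(fresh_prompts=[[4, 5], [6, 7]], batch_size=2)
# The resumed trajectory comes first, then one fresh prompt fills the batch.
```

The point of the design is that a trajectory interrupted mid-generation still carries useful prefix computation; recycling it avoids wasting that work on the next round.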
### Train-Inference Alignment

- **TIS/MIS**: Truncated/Masked Importance Sampling for off-policy correction
- **Kernel-level optimization**: FlashAttention-3 and DeepGEMM integration
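The TIS/MIS idea can be sketched as follows. This is a conceptual illustration only (function names and default thresholds are made up, not miles' API): both methods compute a per-token importance ratio between the training policy and the slightly stale rollout policy, then either cap it (TIS) or mask out-of-range tokens entirely (MIS).

```python
import math


def tis_weights(train_logprobs, rollout_logprobs, threshold=2.0):
    """Truncated importance sampling: per-token ratio between training and
    rollout policies, capped at `threshold` so stale samples cannot dominate."""
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        weights.append(min(ratio, threshold))  # truncate large ratios
    return weights


def mis_mask(train_logprobs, rollout_logprobs, low=0.5, high=2.0):
    """Masked importance sampling: drop (mask out) tokens whose ratio falls
    outside [low, high] instead of reweighting them."""
    mask = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        mask.append(1 if low <= ratio <= high else 0)
    return mask


# A token the rollout policy found unlikely but the training policy now favors
# has raw ratio exp(2.0) ≈ 7.39; TIS truncates it to the threshold of 2.0.
w = tis_weights([-0.1], [-2.1], threshold=2.0)
```

Both corrections address the same problem named above: rollout samples become slightly off-policy by the time gradients are computed, and unbounded importance ratios make those gradients high-variance.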
## Installation

### Recommended: Docker

```bash
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash
```

### From source

```bash
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
```
## Quick Start

miles inherits slime's configuration system. Basic training:

```bash
python train.py \
  --advantage-estimator grpo \
  --model-name qwen3-30b-a3b \
  --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
  --rollout-batch-size 512 \
  --n-samples-per-prompt 8
```
## Workflow 1: Large MoE Training

Use this workflow for training large MoE models such as DeepSeek V3 or Qwen3-MoE.

### Prerequisites Checklist

- [ ] H100/H200 GPUs with FP8 support
- [ ] MoE model (DeepSeek V3, Qwen3-MoE)
- [ ] Docker environment with miles

### Step 1: Environment Setup

```bash
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
```

### Step 2: Configure Training

```bash
python train.py \
  --actor-num-gpus-per-node 8 \
  --rollout-num-gpus 8 \
  --hf-checkpoint /path/to/deepseek-v3 \
  --advantage-estimator grpo \
  --tensor-model-parallel-size 8 \
  --expert-model-parallel-size 4 \
  --prompt-data /path/to/data.jsonl \
  --num-rollout 3000
```
### Verification Checklist

- [ ] Model loads without errors
- [ ] Routing decisions are consistent
- [ ] No NaN/Inf in loss values
## Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

### How Speculative RL Works

1. A small draft model generates candidate tokens
2. The target model verifies them in parallel
3. The draft model is updated via online SFT to track the evolving policy
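The accept/verify loop above can be sketched with toy models. This is a conceptual illustration of speculative decoding, not SGLang's EAGLE implementation: the draft proposes a short continuation, the target scores the same positions in one pass, and the longest agreeing prefix is kept plus one token from the target.

```python
def speculative_step(draft_propose, target_verify, context, num_draft_tokens=4):
    """One round of speculative decoding (toy sketch, hypothetical callables):
    keep the prefix where draft and target agree; on the first disagreement,
    take the target's token; if all drafts are accepted, take a bonus token."""
    drafted = draft_propose(context, num_draft_tokens)   # k candidate tokens
    target_tokens = target_verify(context, drafted)       # k+1 target choices
    accepted = []
    for d, t in zip(drafted, target_tokens):
        if d == t:
            accepted.append(d)   # agreement: token accepted "for free"
        else:
            accepted.append(t)   # first mismatch: take target's token, stop
            break
    else:
        accepted.append(target_tokens[len(drafted)])  # all accepted: bonus token
    return context + accepted


# Toy models: the draft guesses consecutive integers; the target agrees on the
# first two positions only, so one round yields three tokens instead of one.
draft = lambda ctx, k: [ctx[-1] + i + 1 for i in range(k)]
target = lambda ctx, drafted: [ctx[-1] + 1, ctx[-1] + 2, 99, 100, 101]
out = speculative_step(draft, target, [0], num_draft_tokens=4)
# out == [0, 1, 2, 99]: two accepted draft tokens plus the target's correction.
```

This also shows why the online-SFT step matters: as the policy moves away from the draft model, agreements become rarer and each round degrades toward one token, which is exactly the draft-drift issue discussed later.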
### Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

```bash
python train.py \
  --actor-num-gpus-per-node 8 \
  --hf-checkpoint /path/to/target-model \
  --sglang-speculative-algorithm EAGLE \
  --sglang-speculative-num-steps 3 \
  --sglang-speculative-eagle-topk 1 \
  --sglang-speculative-num-draft-tokens 4 \
  --sglang-speculative-draft-model-path /path/to/draft-model \
  --advantage-estimator grpo \
  --prompt-data /path/to/data.jsonl
```
### Step 2: Enable Online MTP Training (Optional)

For online SFT of the draft model during training, add:

```bash
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
```

**Note**: Online MTP training requires a torch dist checkpoint with MTP weights. Add `--mtp-num-layers 1` during checkpoint conversion from HuggingFace.
### Expected Speedup

| Configuration | Rollout Throughput |
|---|---|
| Standard rollout | Baseline |
| Speculative RL | 25-40% faster |
| + Partial rollout | Additional 10-15% |
## Configuration Reference

miles inherits all slime arguments. See the slime API Reference for the complete list.

### Cluster Resources (from slime)

```bash
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate
```

### Megatron Parallelism (from slime)

```bash
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4   # MoE expert parallelism
```

### Speculative Decoding (miles-specific)

```bash
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path
```

### Online MTP Training (miles-specific)

```bash
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
```
## Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary. Consult the miles repository for the latest configuration.

### Unified FP8 Pipeline

End-to-end FP8 sampling and training that eliminates the quantization-induced train-inference discrepancy that can cause RL collapse in MoE models.
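The block-scaling idea behind the `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` setting can be illustrated numerically. This is a simulation only, not the fused TransformerEngine kernels: each block of values shares one FP32 scale chosen so the block's maximum maps to the FP8 E4M3 dynamic range, and integer rounding stands in for the actual FP8 cast.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_block_fp8(x, block_size=128):
    """Simulate per-block FP8 scaling (illustrative only): each block gets its
    own FP32 scale, so one outlier only degrades precision within its block
    rather than across the whole tensor."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # One FP32 scale per block: map the block's absolute max onto E4M3_MAX.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)
    # Real kernels cast to FP8 here; rounding to the scaled grid stands in for it.
    quantized = np.round(blocks / scales)
    return (quantized * scales).reshape(-1)[: len(x)]


x = np.linspace(-1.0, 1.0, 64)
x_q = quantize_block_fp8(x)
# Round-trip error stays below half the block's scale step (~0.0011 here).
```

Per-block scales are what keep MoE training stable: a single outlier activation in one expert no longer forces a coarse scale onto every other value in the tensor.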
### Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 works:

1. During SGLang inference, expert routing decisions are recorded
2. The routing decisions are stored in `sample.rollout_routed_experts`
3. During Megatron training, the recorded routing is replayed instead of recomputed
4. This ensures identical expert selection between training and inference
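The steps above can be sketched as follows. The helper names are hypothetical, not miles' internals; the point is that replaying recorded expert indices makes routing immune to the tiny numerical drift between inference and training kernels.

```python
def topk_experts(router_logits, k):
    """Pick the k highest-scoring experts for one token (greedy top-k)."""
    return sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]


def route_tokens(router_logits_per_token, k=2, replay=None):
    """Sketch of Rollout Routing Replay: during inference, compute and record
    each token's top-k experts; during training, replay the recorded choices
    instead of recomputing them from the (slightly different) training logits."""
    if replay is not None:
        return replay  # training path: reuse recorded routing decisions
    return [topk_experts(logits, k) for logits in router_logits_per_token]


# Inference: record routing (this is what sample.rollout_routed_experts holds).
recorded = route_tokens([[0.1, 0.9, 0.3]], k=2)       # token routed to experts [1, 2]
# Training sees marginally drifted logits (0.1 → 0.31); recomputing top-k would
# flip the selection to [1, 0], but replaying keeps it bit-identical.
recomputed = route_tokens([[0.31, 0.9, 0.3]], k=2)
replayed = route_tokens([[0.31, 0.9, 0.3]], k=2, replay=recorded)
```

The drifted example is the failure mode R3 prevents: near-tied router logits mean a difference in the third decimal place is enough to send a token to a different expert during training than during rollout.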
### INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on an H200).

Memory savings with INT4:

| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|---|---|---|---|
| 70B | 140 GB | 45 GB | 3.1x |
| 235B | 470 GB | 150 GB | 3.1x |
| 671B | 1.3 TB | 420 GB | 3.1x |
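The table's numbers follow from simple bits-per-parameter arithmetic. This back-of-the-envelope helper is illustrative only; the overhead fraction is an assumption standing in for per-group INT4 scales and other bookkeeping, which is why the reduction lands near 3.1x rather than a naive 4x.

```python
def vram_gb(num_params_b, bits_per_param, overhead_frac=0.0):
    """Back-of-the-envelope weight memory in GB: parameters x bits, plus an
    assumed overhead fraction for quantization scales and metadata."""
    bytes_total = num_params_b * 1e9 * bits_per_param / 8
    return bytes_total * (1 + overhead_frac) / 1e9


bf16 = vram_gb(70, 16)         # 140 GB, matching the 70B row above
int4 = vram_gb(70, 4, 0.25)    # ~43.75 GB with an assumed ~25% scale overhead
reduction = bf16 / int4        # ~3.2x, close to the table's measured 3.1x
```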
### Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

- FlashAttention-3
- DeepGEMM
- Batch-invariant kernels from Thinking Machines Lab
- torch.compile integration
## Sample Data Structure

miles uses the same `Sample` dataclass as slime, with the `rollout_routed_experts` field added for MoE routing replay:

```python
@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3
```

See the slime API Reference for the complete `Sample` definition.
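To make the routing-replay field concrete, here is one plausible shape for it, using a trimmed stand-in for the real dataclass (the field names match the definition above, but the values and the two-expert layout are illustrative assumptions):

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Trimmed stand-in for slime's Sample, keeping only the fields needed
    to illustrate the R3 routing payload."""
    tokens: list
    rollout_log_probs: list = field(default_factory=list)
    rollout_routed_experts: list = field(default_factory=list)


# One inner list per generated token, holding that token's routed expert ids
# (here a hypothetical top-2 routing over an 8-expert MoE layer):
s = Sample(
    tokens=[101, 42, 7],
    rollout_log_probs=[-0.12, -1.05, -0.33],
    rollout_routed_experts=[[3, 5], [0, 5], [2, 7]],
)
assert len(s.rollout_routed_experts) == len(s.tokens)
```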
## Common Issues and Solutions

### Issue: FP8 Training Collapse

**Symptoms**: Loss explodes, NaN values

**Solutions**:

- Use block scaling: `export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1`
- Reduce the learning rate: `--lr 5e-7`
- Ensure MoE routing is consistent between training and inference

### Issue: Speculative Draft Drift

**Symptoms**: Acceptance rate drops over time

**Solutions**:

- Enable online MTP training to keep the draft model aligned with the policy
- Reduce speculative steps: `--sglang-speculative-num-steps 2`
- Use the CPU backup: `--sglang-enable-draft-weights-cpu-backup`

### Issue: Train-Inference Mismatch

**Symptoms**: Policy divergence, reward collapse

**Solutions**:

- Use TIS for off-policy correction: `--use-tis --tis-threshold 0.9`
- Verify that log probs match between SGLang and Megatron
- Enable R3 for MoE models
## Supported Models

| Family | Models | MoE Support |
|---|---|---|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |
## Resources

- Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
- slime (upstream): https://github.com/THUDM/slime