simpo-training

SimPO - Simple Preference Optimization

Install skill "simpo-training" with this command: npx skills add zechenzhangagi/ai-research-skills/zechenzhangagi-ai-research-skills-simpo-training

Quick start

SimPO is a reference-free preference optimization method: it trains directly on chosen/rejected pairs and, in the original paper's evaluations, outperforms DPO while eliminating the reference model entirely.
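
Under the hood, SimPO scores each response with a length-normalized implicit reward, beta times the average per-token log-probability, and pushes the chosen response above the rejected one by a target margin gamma. A minimal sketch of the sigmoid loss variant in PyTorch (function and tensor names are illustrative, not the repository's actual code; the *_logps inputs are summed token log-probabilities under the policy):

import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_len, rejected_len,
               beta=2.0, gamma_beta_ratio=0.5):
    # Implicit reward: beta * average per-token log-probability
    chosen_reward = beta * chosen_logps / chosen_len
    rejected_reward = beta * rejected_logps / rejected_len
    # The configs express the target margin gamma as a fraction of beta
    gamma = beta * gamma_beta_ratio
    # Sigmoid (Bradley-Terry) variant; the hinge variant replaces
    # logsigmoid with a linear penalty on margins below the target
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()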

Installation:

Create environment

conda create -n simpo python=3.10 && conda activate simpo

Install PyTorch 2.2.2

Visit: https://pytorch.org/get-started/locally/

Install alignment-handbook

git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .

Install Flash Attention 2

python -m pip install flash-attn --no-build-isolation
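
A quick sanity check after the installs (flash_attn only imports successfully on a machine with a supported NVIDIA GPU toolchain):

import torch
print(torch.__version__, torch.cuda.is_available())

import flash_attn  # fails here if the flash-attn build did not succeed
print(flash_attn.__version__)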

Training (Mistral 7B):

ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py \
  training_configs/mistral-7b-base-simpo.yaml

Common workflows

Workflow 1: Train from base model (Mistral 7B)

Config (mistral-7b-base-simpo.yaml):

Model

model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16

Dataset

dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs

SimPO hyperparameters

beta: 2.0                # Reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5    # Target margin as a fraction of beta (0-1)
loss_type: sigmoid       # sigmoid or hinge
sft_weight: 0.0          # Optional SFT regularization

Training

learning_rate: 5e-7    # Critical: 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8

Output

output_dir: ./outputs/mistral-7b-simpo
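
Two derived numbers are worth checking before launch: the actual margin (gamma_beta_ratio is gamma divided by beta, not gamma itself) and the effective batch size, which also scales with GPU count. A quick back-of-envelope in Python (the 8-GPU count is an assumption for a single node):

beta, gamma_beta_ratio = 2.0, 0.5
gamma = beta * gamma_beta_ratio                # target reward margin = 1.0
per_device_bs, grad_accum, num_gpus = 1, 8, 8  # num_gpus is an assumption
effective_batch = per_device_bs * grad_accum * num_gpus
print(f"gamma={gamma}, effective batch={effective_batch} preference pairs")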

Launch training:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml

Workflow 2: Fine-tune instruct model (Llama 3 8B)

Config (llama3-8b-instruct-simpo.yaml):

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct

dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0

beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1    # Add SFT loss to preserve capabilities

num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo

Launch:

accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml

Workflow 3: Reasoning-intensive tasks (lower LR)

For math/code tasks:

model_name_or_path: deepseek-ai/deepseek-math-7b-base

dataset_mixer:
  argilla/distilabel-math-preference-dpo: 1.0

beta: 5.0                # Higher for stronger signal
gamma_beta_ratio: 0.7    # Larger margin
learning_rate: 3e-7      # Lower LR for reasoning
sft_weight: 0.0

num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 16

When to use vs alternatives

Use SimPO when:

  • Want simpler training than DPO (no reference model)

  • Have preference data (chosen/rejected pairs; expected row format sketched after this list)

  • Need better performance than DPO

  • Limited compute resources

  • Single-node training sufficient
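
For the preference-data requirement above, the datasets used here follow the UltraFeedback-binarized layout: each row holds a prompt plus full chosen and rejected chat transcripts. A sketch of one row (field names match HuggingFaceH4/ultrafeedback_binarized; the example text is invented, and upstream rows also carry extra scoring fields):

example = {
    "prompt": "Explain gradient checkpointing in one paragraph.",
    "chosen": [
        {"role": "user", "content": "Explain gradient checkpointing in one paragraph."},
        {"role": "assistant", "content": "Gradient checkpointing trades compute for memory..."},
    ],
    "rejected": [
        {"role": "user", "content": "Explain gradient checkpointing in one paragraph."},
        {"role": "assistant", "content": "It makes training faster."},
    ],
}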

Algorithm selection:

  • SimPO: Simplest, best performance, no reference model (contrast with DPO sketched after this list)

  • DPO: Need reference model baseline, more conservative

  • PPO: Maximum control, need reward model, complex setup

  • GRPO: Memory-efficient RL, no critic
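
The concrete difference behind the first two bullets: DPO's implicit reward is a log-ratio against a frozen reference model, so every step costs an extra forward pass, while SimPO's reward is the policy's own length-normalized log-probability. A schematic contrast (illustrative names, not either trainer's real code):

def dpo_logits(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta):
    # DPO: reward = beta * log(pi / pi_ref); requires reference-model log-probs
    return beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))

def simpo_logits(pi_chosen, pi_rejected, len_chosen, len_rejected, beta, gamma):
    # SimPO: reward = beta * average log-prob; no reference model, fixed target margin
    return beta * pi_chosen / len_chosen - beta * pi_rejected / len_rejected - gamma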

Use alternatives instead:

  • OpenRLHF: Multi-node distributed training, PPO/GRPO

  • TRL: Need multiple methods in one framework

  • DPO: Established baseline comparison

Common issues

Issue: Loss divergence

Reduce learning rate:

learning_rate: 3e-7 # Reduce from 5e-7

Reduce beta:

beta: 1.0 # Reduce from 2.0

Issue: Model forgets capabilities

Add SFT regularization:

sft_weight: 0.1 # Add SFT loss component
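
What the flag does: the trainer mixes a cross-entropy (SFT) term on the chosen responses into the preference objective, anchoring the model to its supervised behavior. Schematically (a sketch of the weighting only, not the trainer's exact code):

def combined_loss(preference_loss, chosen_nll, sft_weight=0.1):
    # The preference term separates chosen from rejected; the NLL term on
    # the chosen response preserves instruction-following capabilities.
    return preference_loss + sft_weight * chosen_nll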

Issue: Poor preference separation

Increase beta and margin:

beta: 5.0               # Increase from 2.0
gamma_beta_ratio: 0.8   # Increase from 0.5

Issue: OOM during training

Reduce batch size:

per_device_train_batch_size: 1
gradient_accumulation_steps: 16    # Maintain effective batch

Enable gradient checkpointing:

gradient_checkpointing: true
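
The same switch is also available on the model object via the standard Transformers API, useful when experimenting outside the YAML configs (the model name here is just the one from Workflow 1):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
model.gradient_checkpointing_enable()  # recompute activations during backward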

Advanced topics

Loss functions: See references/loss-functions.md for sigmoid vs hinge loss, mathematical formulations, and when to use each.

Hyperparameter tuning: See references/hyperparameters.md for beta, gamma, learning rate selection guide, and model-size-specific recommendations.

Dataset preparation: See references/datasets.md for preference data formats, quality filtering, and custom dataset creation.

Hardware requirements

  • GPU: NVIDIA A100/H100 recommended

  • VRAM:
    • 7B model: 1× A100 40GB (DeepSpeed ZeRO-3)
    • 8B model: 2× A100 40GB
    • 70B model: 8× A100 80GB

  • Single-node: DeepSpeed ZeRO-3 sufficient

  • Mixed precision: BF16 recommended

Memory optimization:

  • DeepSpeed ZeRO-3 (default config)

  • Gradient checkpointing

  • Flash Attention 2
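
A rough sense of why these levers matter for the VRAM figures above: in bf16 mixed precision with Adam, the full training state is around 16 bytes per parameter before sharding; ZeRO-3 divides it across ranks (and can additionally offload to CPU, which is how the single-GPU 7B row fits). A back-of-envelope sketch that ignores activations and fragmentation:

params = 7e9
weights_gb = params * 2 / 1e9               # bf16 weights: ~14 GB
grads_gb = params * 2 / 1e9                 # bf16 gradients: ~14 GB
adam_gb = params * (4 + 4 + 4) / 1e9        # fp32 master weights + 2 moments: ~84 GB
total_gb = weights_gb + grads_gb + adam_gb  # ~112 GB of training state
for n_gpus in (1, 2, 4, 8):
    # ZeRO-3 shards weights, gradients, and optimizer states across ranks
    print(n_gpus, "GPU(s):", round(total_gb / n_gpus), "GB each, plus activations")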
