
HuggingFace Model Trainer

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install this skill, send the following command to your AI assistant:

npx skills add frankxai/ai-architect-academy/frankxai-ai-architect-academy-huggingface-model-trainer

HuggingFace Model Trainer

You are an expert in training and fine-tuning large language models using HuggingFace's TRL (Transformer Reinforcement Learning), Transformers, and PEFT libraries. You help with dataset preparation, training configuration, GPU selection, and deployment.

Training Methods Overview

Method Selection Guide

```
┌─────────────────────────────────────────────────────────────────┐
│                   TRAINING METHOD SELECTION                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  HAVE LABELED DATA?                                             │
│  ├── Yes: Input/Output pairs                                    │
│  │    └── Use SFT (Supervised Fine-Tuning)                      │
│  │                                                              │
│  ├── Yes: Preference pairs (chosen/rejected)                    │
│  │    └── Use DPO (Direct Preference Optimization)              │
│  │                                                              │
│  ├── No: Have a reward function/verifier                        │
│  │    └── Use GRPO (Group Relative Policy Optimization)         │
│  │                                                              │
│  └── No: Just want to continue pretraining                      │
│       └── Use CLM (Causal Language Modeling)                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

1. Supervised Fine-Tuning (SFT)

When to Use

  • You have instruction/response pairs

  • Adapting a model to your domain

  • Teaching specific output formats

Basic SFT Script

```python
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model and tokenizer
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("your-org/your-dataset", split="train")

# Training configuration
config = SFTConfig(
    output_dir="./sft-output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,  # Use bfloat16 on supported GPUs
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

# Train
trainer.train()
trainer.save_model("./final-model")
```

SFT with Chat Template

```python
from trl import SFTTrainer, SFTConfig

# Dataset should have a 'messages' column in chat format:
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]

config = SFTConfig(
    output_dir="./chat-sft",
    max_seq_length=4096,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,  # Automatically applies the chat template
)
```
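Before launching a run, it can help to sanity-check how the template renders a row. A quick check using the Transformers API (assumes the tokenizer ships a chat template):

```python
# Render the first conversation with the tokenizer's chat template
# to confirm the exact text the trainer will see.
example = dataset[0]["messages"]
print(tokenizer.apply_chat_template(example, tokenize=False))
```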

2. Direct Preference Optimization (DPO)

When to Use

  • You have preference data (chosen vs rejected responses)

  • Aligning model with human preferences

  • Improving response quality

DPO Script

```python
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load model
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs: prompt, chosen, rejected columns
dataset = load_dataset("your-org/preference-data", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,  # Lower LR for DPO
    beta=0.1,            # KL penalty coefficient
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
```

Preference Data Format

Required columns: prompt, chosen, rejected

```python
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses quantum bits...",  # Better response
    "rejected": "Computers are fast machines...",        # Worse response
}
```
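If you only have raw generations, one way to assemble these rows is to score candidate answers per prompt and keep the best and worst. In this sketch, `prompt_to_candidates` and `rank_answers` are hypothetical placeholders for your own data and scoring logic:

```python
from datasets import Dataset

rows = []
for prompt, candidates in prompt_to_candidates.items():  # hypothetical mapping
    best, worst = rank_answers(prompt, candidates)       # hypothetical scorer
    rows.append({"prompt": prompt, "chosen": best, "rejected": worst})

preference_dataset = Dataset.from_list(rows)
```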

3. Group Relative Policy Optimization (GRPO)

When to Use

  • You have a reward function or verifier

  • Math/code tasks with checkable answers

  • RL-based training without paired preferences

GRPO Script

```python
from trl import GRPOTrainer, GRPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dataset needs a 'prompt' column (placeholder dataset id)
dataset = load_dataset("your-org/your-dataset", split="train")

# Define reward function
def reward_fn(completions, prompts, **kwargs):
    """Return one reward per completion."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        # Example: reward correct math answers
        # (verify_math_answer is user-supplied; see the sketch below)
        if verify_math_answer(completion, prompt):
            rewards.append(1.0)
        else:
            rewards.append(-0.5)
    return rewards

config = GRPOConfig(
    output_dir="./grpo-output",
    per_device_train_batch_size=4,
    num_generations=4,  # Generate 4 samples per prompt
    learning_rate=1e-6,
    num_train_epochs=1,
)

trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=reward_fn,  # TRL accepts a callable or a list of callables
)

trainer.train()
```
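The script above calls `verify_math_answer`, which you must supply. A toy sketch that compares the last number in the completion against a gold answer, assuming a hypothetical `gold_answers` lookup from prompt to expected value:

```python
import re

def verify_math_answer(completion: str, prompt: str) -> bool:
    # Extract the last number in the completion and compare to the gold answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return False
    return float(numbers[-1]) == float(gold_answers[prompt])  # hypothetical lookup
```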

4. Parameter-Efficient Fine-Tuning (PEFT/LoRA)

Why Use LoRA

  • Train large models on limited GPU memory

  • 10-100x fewer trainable parameters

  • Fast training, easy to merge or swap adapters

LoRA Configuration

```python
from peft import LoraConfig, get_peft_model, TaskType

# LoRA configuration
lora_config = LoraConfig(
    r=16,           # Rank (start with 8-32)
    lora_alpha=32,  # Alpha scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Apply to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.082
```

SFT with LoRA

```python
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./lora-sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,  # Higher LR for LoRA
    num_train_epochs=3,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,  # Pass LoRA config
)

trainer.train()
```
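To merge the trained adapter back into the base weights (handy before GGUF conversion below), a minimal sketch using the standard PEFT API, assuming the adapter was saved to `./lora-sft`:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Reload the base model, attach the adapter, and fold it into the weights.
base = AutoModelForCausalLM.from_pretrained(model_id)
merged = PeftModel.from_pretrained(base, "./lora-sft").merge_and_unload()
merged.save_pretrained("./merged-model")
```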

QLoRA (Quantized LoRA)

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

Then apply LoRA as normal.
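A minimal sketch of that step, using PEFT's standard helper for quantized bases:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard PEFT preprocessing for 4-bit bases (casts norm layers and
# prepares the model for gradient checkpointing).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```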

GPU Selection Guide

Memory Requirements

| Model Size | Full Fine-tune | LoRA | QLoRA |
|------------|----------------|------|-------|
| 7-8B       | 60GB+          | 16GB | 8GB   |
| 13B        | 100GB+         | 24GB | 12GB  |
| 34B        | 200GB+         | 48GB | 24GB  |
| 70B        | 400GB+         | 80GB | 48GB  |
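As a rough cross-check of the table, a bytes-per-parameter heuristic. These are illustrative assumptions only; the table's full fine-tune figures additionally assume memory savers such as gradient checkpointing, sharding, or 8-bit optimizers:

```python
# Very rough weights-plus-training-state footprint, ignoring activations
# and KV cache. Assumptions: full = bf16 weights + grads + fp32 Adam states
# (~16 bytes/param); lora = frozen bf16 base (~2 bytes/param); qlora =
# 4-bit base (~0.6 bytes/param). Adapter overhead is negligible.
BYTES_PER_PARAM = {"full": 16, "lora": 2, "qlora": 0.6}

def rough_vram_gb(params_billions: float, method: str) -> float:
    return params_billions * BYTES_PER_PARAM[method]

print(rough_vram_gb(8, "qlora"))  # ~4.8 GB for weights; the table adds headroom
```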

GPU Recommendations

```
┌─────────────────────────────────────────────────────────────────┐
│                      GPU SELECTION GUIDE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TASK                    │  RECOMMENDED GPU                     │
│  ────────────────────────┼───────────────────────────────────── │
│  QLoRA 8B                │  RTX 4090 (24GB), A10G               │
│  QLoRA 70B               │  A100 40GB x2, H100                  │
│  LoRA 8B                 │  A100 40GB, A10G x2                  │
│  LoRA 70B                │  A100 80GB x2, H100 x2               │
│  Full FT 8B              │  A100 80GB x2, H100                  │
│  Full FT 70B             │  H100 x8, A100 80GB x8               │
│                                                                 │
│  CLOUD PROVIDERS:                                               │
│  - AWS: p4d (A100), p5 (H100)                                   │
│  - GCP: a2-highgpu (A100), a3-highgpu (H100)                    │
│  - Azure: NC A100, ND H100                                      │
│  - Lambda Labs: Most cost-effective for training                │
│  - RunPod: Good spot pricing                                    │
│  - HuggingFace Jobs: Managed training infrastructure            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

Dataset Preparation

Chat Format Dataset

```python
from datasets import Dataset

# Conversation format
conversations = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a programming language..."},
        ]
    },
    # More examples...
]

dataset = Dataset.from_list(conversations)
dataset.push_to_hub("your-org/chat-dataset")
```

Instruction Format

```python
# Alpaca-style format
instruction_data = [
    {
        "instruction": "Summarize the following text",
        "input": "Long text here...",
        "output": "Summary here...",
    }
]

# Or a simpler prompt/completion format
simple_data = [
    {
        "prompt": "Question or instruction",
        "completion": "Expected response",
    }
]
```
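To feed Alpaca-style rows into a chat-template SFT run, you can map them into the 'messages' layout shown earlier. A sketch, assuming the field names follow the Alpaca convention above:

```python
def alpaca_to_messages(example):
    # Fold the optional 'input' field into the user turn.
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

dataset = dataset.map(alpaca_to_messages, remove_columns=dataset.column_names)
```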

Data Quality Tips

```python
# Filter low-quality examples
def filter_quality(example):
    # Remove very short responses
    if len(example["completion"]) < 50:
        return False
    # Remove repetitive content
    if example["completion"].count(example["completion"][:20]) > 3:
        return False
    return True

dataset = dataset.filter(filter_quality)
```

Deduplicate

```python
def deduplicate(dataset, column="prompt"):
    # Keep only the first occurrence of each value in `column`.
    seen = set()
    indices = []
    for i, example in enumerate(dataset):
        key = example[column]
        if key not in seen:
            seen.add(key)
            indices.append(i)
    return dataset.select(indices)
```

Training on HuggingFace Jobs

Using HF Jobs MCP Tool

```python
# If using Claude Code with the HF Jobs MCP tool,
# this script is submitted via the hf_jobs() MCP tool.
training_script = '''
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
dataset = load_dataset("your-org/your-dataset", split="train")

config = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,
    push_to_hub=True,
    hub_model_id="your-org/fine-tuned-model",
)

trainer = SFTTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
'''

# Submit via MCP:
# hf_jobs("uv", {"script": training_script, "gpu": "a100"})
```

Cost Estimation

```python
# Rough cost estimates for HF Jobs / cloud GPUs
TRAINING_COSTS = {
    # GPU type: (hourly_rate_usd, tokens_per_hour_for_8B)
    "a10g": (1.50, 50_000_000),
    "a100_40gb": (3.50, 150_000_000),
    "a100_80gb": (5.00, 200_000_000),
    "h100": (8.00, 400_000_000),
}

def estimate_cost(
    model_size: str,
    dataset_tokens: int,
    epochs: int,
    gpu_type: str = "a100_40gb",
) -> dict:
    rate, throughput = TRAINING_COSTS[gpu_type]
    total_tokens = dataset_tokens * epochs
    hours = total_tokens / throughput
    cost = hours * rate
    return {
        "gpu": gpu_type,
        "estimated_hours": round(hours, 1),
        "estimated_cost": f"${cost:.2f}",
        "total_tokens": f"{total_tokens:,}",
    }

# Example: 10M-token dataset, 3 epochs on an A100 40GB
estimate_cost("8B", 10_000_000, 3, "a100_40gb")
# {'gpu': 'a100_40gb', 'estimated_hours': 0.2, 'estimated_cost': '$0.70', 'total_tokens': '30,000,000'}
```
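To get `dataset_tokens` for the estimator, a quick count with the model's tokenizer. A sketch, assuming the text lives in a 'text' column:

```python
def count_tokens(dataset, tokenizer, column="text"):
    # Sum tokenized lengths across the dataset (slow but simple).
    return sum(len(tokenizer(example[column])["input_ids"]) for example in dataset)
```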

GGUF Conversion for Local Deployment

Convert to GGUF for llama.cpp / Ollama

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model (merge LoRA adapters first if you used PEFT)
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

# Save in a format ready for conversion
model.save_pretrained("./model-for-gguf", safe_serialization=True)
tokenizer.save_pretrained("./model-for-gguf")
```

Then convert with llama.cpp. Note that convert_hf_to_gguf.py emits float GGUFs; k-quants such as q4_k_m are produced in a second step with llama-quantize:

```
python convert_hf_to_gguf.py ./model-for-gguf --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Quantization Options

| Type   | Size Reduction | Quality Loss | Use Case             |
|--------|----------------|--------------|----------------------|
| f16    | 2x             | None         | Best quality         |
| q8_0   | 4x             | Minimal      | Good balance         |
| q4_k_m | 8x             | Small        | Production           |
| q4_0   | 8x             | Moderate     | Resource constrained |
| q2_k   | 16x            | Significant  | Extreme constraints  |

Evaluation

Using lm-eval-harness

Install: pip install lm-eval

```
# Command-line evaluation
lm_eval --model hf \
  --model_args pretrained=./fine-tuned-model \
  --tasks hellaswag,arc_easy \
  --batch_size 8
```

Programmatic:

```python
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=./fine-tuned-model",
    tasks=["hellaswag", "arc_easy", "mmlu"],
    batch_size=8,
)

print(results["results"])
```

Custom Evaluation

```python
def evaluate_on_test_set(model, tokenizer, test_dataset):
    correct = 0
    total = 0

    for example in test_dataset:
        prompt = example["prompt"]
        expected = example["expected"]

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=100)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)

        if expected.lower() in response.lower():
            correct += 1
        total += 1

    return {"accuracy": correct / total, "total": total}
```

Best Practices

Training Checklist

before_training:

  • Validate dataset format and quality
  • Check GPU memory requirements
  • Set up monitoring (W&B, TensorBoard)
  • Configure checkpointing strategy
  • Test with a small subset first (see the sketch after this checklist)

during_training:

  • Monitor loss curves
  • Watch for gradient issues
  • Check learning rate schedule
  • Validate checkpoints periodically

after_training:

  • Evaluate on held-out test set
  • Compare with base model
  • Test on diverse prompts
  • Convert to desired format (GGUF, etc.)
  • Push to Hub with model card
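Two of the before-training items (monitoring setup and a small-subset test) as a sketch; `report_to="wandb"` assumes wandb is installed and you are logged in:

```python
from trl import SFTTrainer, SFTConfig

config = SFTConfig(
    output_dir="./dry-run",
    report_to="wandb",  # or "tensorboard"
    logging_steps=10,
    max_steps=50,       # short smoke test before the full run
)

# Train on a small slice to validate data, config, and memory headroom.
smoke_dataset = dataset.select(range(min(500, len(dataset))))
trainer = SFTTrainer(model=model, args=config, train_dataset=smoke_dataset, tokenizer=tokenizer)
trainer.train()
```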

Hyperparameter Guidelines

SFT defaults

```python
SFT_DEFAULTS = {
    "learning_rate": 2e-5,       # Full fine-tune
    "learning_rate_lora": 2e-4,  # LoRA (higher)
    "batch_size": 4,
    "gradient_accumulation": 4,  # Effective batch = 16
    "epochs": "1-3",             # quoted: a bare 1-3 is Python subtraction
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
}
```

DPO defaults

```python
DPO_DEFAULTS = {
    "learning_rate": 5e-7,  # Much lower than SFT
    "beta": 0.1,            # KL penalty
    "epochs": 1,            # Usually 1 is enough
}
```

Resources

  • TRL Documentation

  • PEFT Documentation

  • HuggingFace Hub

  • HuggingFace Jobs

  • lm-eval-harness

  • Axolotl - High-level training framework

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
