Fine-Tune Models on Your Data

Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.

Should you fine-tune?

Before writing any code, walk through these questions with the user:

Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
What's the goal — quality or cost?
- Quality: You've maxed out prompt optimization and need more accuracy
- Cost: You want a small cheap model to match an expensive one

When to fine-tune

You've already optimized prompts with MIPROv2 and hit a ceiling
You have 500+ labeled examples (1000+ is better)
Your baseline is >50% and you need to push higher
You want to distill an expensive model into a cheaper one (10-50x cost savings)
Your domain has specialized vocabulary or patterns the base model doesn't know
You need faster inference (smaller fine-tuned models are faster)

When NOT to fine-tune

You haven't tried prompt optimization yet — start with /ai-improving-accuracy
You have fewer than 500 examples — need more data? Use /ai-generating-data to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
Your baseline is below 50% — your data or task definition needs work
You're still iterating on what the task is — fine-tuning locks you in
You don't have a clear metric — you can't evaluate fine-tuning without one
Your use case changes frequently — fine-tuned models don't adapt to new instructions easily

Prerequisites checklist

Before starting, confirm:

Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
Metric: Clear, automated metric that scores predictions
Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU

Step 1: Prepare your data and baseline

Build a strong baseline first

Always compare fine-tuning against a prompt-optimized baseline:

import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)

# Define your program
class Classify(dspy.Signature):
    """Classify the support ticket."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# Prepare data
import json
with open("labeled_data.json") as f:
    data = json.load(f)

examples = [dspy.Example(text=x["text"], category=x["category"]).with_inputs("text") for x in data]

# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]

# Measure baseline
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")

Optimize prompts first (your comparison point)

optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.

Step 2: BootstrapFinetune (core fine-tuning)

The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.

optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# Evaluate the fine-tuned model
finetuned_score = evaluator(finetuned)
print(f"Baseline:         {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned:       {finetuned_score:.1f}%")

How it works

Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
Filter by metric: Only successful traces become training data
Fine-tune weights: Sends traces to the model provider's fine-tuning API
Return optimized program: The program now uses the fine-tuned model

Requirements

A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
500+ training examples (more traces bootstrapped = better fine-tuning)
A metric that reliably identifies good outputs

Step 3: Model distillation (expensive to cheap)

Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.

Teacher-student pattern

# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)

teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student — fine-tune cheap model on teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)

student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")

Typical results

Model	Quality	Cost per 1M tokens
GPT-4o (teacher)	85%	~$5.00
GPT-4o-mini (no tuning)	70%	~$0.15
GPT-4o-mini (fine-tuned)	81%	~$0.15

The fine-tuned student costs 33x less and retains ~95% of teacher quality.

Step 4: BetterTogether (maximum quality)

BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.

optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)

best_score = evaluator(best)
print(f"Prompt-only:    {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")

How it works

Round 1: Optimize prompts (instructions + few-shot examples)
Round 2: Fine-tune weights using the optimized prompts
Round 3: Re-optimize prompts for the fine-tuned model
Each round builds on the previous, creating synergy between prompt and weight optimization

When to use BetterTogether

You want the absolute best quality and have the compute budget
Fine-tuning alone didn't close the gap to your quality target
You have 500+ examples and a reliable metric

Step 5: Evaluate and deploy

Thorough evaluation

Always evaluate on the held-out test set (not dev set):

test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)

print(f"Test set results:")
print(f"  Baseline:         {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned:       {test_evaluator(finetuned):.1f}%")

Save and load for production

# Save
finetuned.save("finetuned_program.json")

# Load later
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")

When fine-tuning goes wrong

Can't bootstrap enough traces

If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.

Fixes:

Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
Relax your metric during bootstrapping (accept partial credit)
Simplify your task (break multi-step into single steps)

Model overfits (high train accuracy, low test accuracy)

Fixes:

Add more training data
Reduce fine-tuning epochs (if provider allows)
Use a larger base model (less prone to overfitting)
Simplify your output format

Fine-tuning didn't improve over prompt optimization

Fixes:

Check that bootstrapping produced enough successful traces (need 200+)
Try BetterTogether instead of BootstrapFinetune alone
Verify your metric actually correlates with quality
Try a different base model

Infrastructure choices

OpenAI API (easiest)

Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API

Pros: No GPU needed, simple setup, fast
Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices

Local fine-tuning (own your model)

For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:

lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")

Pros: Data stays private, no per-token costs after training, full control
Cons: Needs GPU(s), more setup, slower iteration

Cloud GPU platforms

AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:

Pros: Scalable, no hardware to manage
Cons: Costs vary, setup per platform

Additional resources

For worked examples (classification, distillation, BetterTogether), see examples.md
Use /ai-improving-accuracy to build a strong baseline before fine-tuning
Use /ai-cutting-costs for other cost reduction strategies beyond distillation
Use /ai-fixing-errors if fine-tuning or evaluation errors occur

ai-fine-tuning

Safety Notice

Copy this and send it to your AI assistant to learn

Fine-Tune Models on Your Data

Should you fine-tune?

When to fine-tune

When NOT to fine-tune

Prerequisites checklist

Step 1: Prepare your data and baseline

Build a strong baseline first

Optimize prompts first (your comparison point)

Step 2: BootstrapFinetune (core fine-tuning)

How it works

Requirements

Step 3: Model distillation (expensive to cheap)

Teacher-student pattern

Typical results

Step 4: BetterTogether (maximum quality)

How it works

When to use BetterTogether

Step 5: Evaluate and deploy

Thorough evaluation

Save and load for production

When fine-tuning goes wrong

Can't bootstrap enough traces

Model overfits (high train accuracy, low test accuracy)

Fine-tuning didn't improve over prompt optimization

Infrastructure choices

OpenAI API (easiest)

Local fine-tuning (own your model)

Cloud GPU platforms

Additional resources

Source Transparency

Related Skills

ai-switching-models

ai-stopping-hallucinations

ai-building-chatbots

ai-taking-actions