ai-fine-tuning

Fine-tune models on your data to maximize quality and cut costs. Use when prompt optimization hit a ceiling, you need domain specialization, you want cheaper models to match expensive ones, you heard "fine-tuning will make us AI-native", you have 500+ training examples, or you need to train on proprietary data. Covers DSPy BootstrapFinetune, BetterTogether, model distillation, and when to fine-tune vs optimize prompts.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "ai-fine-tuning" with this command: npx skills add lebsral/dspy-programming-not-prompting-lms-skills/lebsral-dspy-programming-not-prompting-lms-skills-ai-fine-tuning

Fine-Tune Models on Your Data

Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.

Should you fine-tune?

Before writing any code, walk through these questions with the user:

  1. Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
  2. Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
  3. Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
  4. What's the goal — quality or cost?
    • Quality: You've maxed out prompt optimization and need more accuracy
    • Cost: You want a small cheap model to match an expensive one

When to fine-tune

  • You've already optimized prompts with MIPROv2 and hit a ceiling
  • You have 500+ labeled examples (1000+ is better)
  • Your baseline is >50% and you need to push higher
  • You want to distill an expensive model into a cheaper one (10-50x cost savings)
  • Your domain has specialized vocabulary or patterns the base model doesn't know
  • You need faster inference (smaller fine-tuned models are faster)

When NOT to fine-tune

  • You haven't tried prompt optimization yet — start with /ai-improving-accuracy
  • You have fewer than 500 examples — need more data? Use /ai-generating-data to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
  • Your baseline is below 50% — your data or task definition needs work
  • You're still iterating on what the task is — fine-tuning locks you in
  • You don't have a clear metric — you can't evaluate fine-tuning without one
  • Your use case changes frequently — fine-tuned models don't adapt to new instructions easily

Prerequisites checklist

Before starting, confirm:

  • Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
  • Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
  • Metric: Clear, automated metric that scores predictions
  • Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
  • Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU

Step 1: Prepare your data and baseline

Build a strong baseline first

Always compare fine-tuning against a prompt-optimized baseline:

import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)

# Define your program
class Classify(dspy.Signature):
    """Classify the support ticket."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# Prepare data
import json
with open("labeled_data.json") as f:
    data = json.load(f)

examples = [dspy.Example(text=x["text"], category=x["category"]).with_inputs("text") for x in data]

# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]

# Measure baseline
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")

Optimize prompts first (your comparison point)

optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.

Step 2: BootstrapFinetune (core fine-tuning)

The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.

optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# Evaluate the fine-tuned model
finetuned_score = evaluator(finetuned)
print(f"Baseline:         {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned:       {finetuned_score:.1f}%")

How it works

  1. Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
  2. Filter by metric: Only successful traces become training data
  3. Fine-tune weights: Sends traces to the model provider's fine-tuning API
  4. Return optimized program: The program now uses the fine-tuned model

Requirements

  • A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
  • 500+ training examples (more traces bootstrapped = better fine-tuning)
  • A metric that reliably identifies good outputs

Step 3: Model distillation (expensive to cheap)

Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.

Teacher-student pattern

# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)

teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student — fine-tune cheap model on teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)

student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")

Typical results

ModelQualityCost per 1M tokens
GPT-4o (teacher)85%~$5.00
GPT-4o-mini (no tuning)70%~$0.15
GPT-4o-mini (fine-tuned)81%~$0.15

The fine-tuned student costs 33x less and retains ~95% of teacher quality.

Step 4: BetterTogether (maximum quality)

BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.

optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)

best_score = evaluator(best)
print(f"Prompt-only:    {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")

How it works

  1. Round 1: Optimize prompts (instructions + few-shot examples)
  2. Round 2: Fine-tune weights using the optimized prompts
  3. Round 3: Re-optimize prompts for the fine-tuned model
  4. Each round builds on the previous, creating synergy between prompt and weight optimization

When to use BetterTogether

  • You want the absolute best quality and have the compute budget
  • Fine-tuning alone didn't close the gap to your quality target
  • You have 500+ examples and a reliable metric

Step 5: Evaluate and deploy

Thorough evaluation

Always evaluate on the held-out test set (not dev set):

test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)

print(f"Test set results:")
print(f"  Baseline:         {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned:       {test_evaluator(finetuned):.1f}%")

Save and load for production

# Save
finetuned.save("finetuned_program.json")

# Load later
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")

When fine-tuning goes wrong

Can't bootstrap enough traces

If the base model fails on most training examples, there aren't enough successful traces to fine-tune on.

Fixes:

  • Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
  • Relax your metric during bootstrapping (accept partial credit)
  • Simplify your task (break multi-step into single steps)

Model overfits (high train accuracy, low test accuracy)

Fixes:

  • Add more training data
  • Reduce fine-tuning epochs (if provider allows)
  • Use a larger base model (less prone to overfitting)
  • Simplify your output format

Fine-tuning didn't improve over prompt optimization

Fixes:

  • Check that bootstrapping produced enough successful traces (need 200+)
  • Try BetterTogether instead of BootstrapFinetune alone
  • Verify your metric actually correlates with quality
  • Try a different base model

Infrastructure choices

OpenAI API (easiest)

Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API
  • Pros: No GPU needed, simple setup, fast
  • Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices

Local fine-tuning (own your model)

For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:

lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
  • Pros: Data stays private, no per-token costs after training, full control
  • Cons: Needs GPU(s), more setup, slower iteration

Cloud GPU platforms

AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:

  • Pros: Scalable, no hardware to manage
  • Cons: Costs vary, setup per platform

Additional resources

  • For worked examples (classification, distillation, BetterTogether), see examples.md
  • Use /ai-improving-accuracy to build a strong baseline before fine-tuning
  • Use /ai-cutting-costs for other cost reduction strategies beyond distillation
  • Use /ai-fixing-errors if fine-tuning or evaluation errors occur

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

ai-switching-models

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

ai-stopping-hallucinations

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

ai-building-chatbots

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

ai-taking-actions

No summary provided by upstream source.

Repository SourceNeeds Review