training-data-curation

Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.

Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed variants trained on larger but noisier datasets [1]. Focus on clean, diverse, well-formatted data.

Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.

Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
  • Each sample is a complete conversation
  • Multi-turn: alternate user/assistant messages
  • System prompts optional: {"role": "system", "content": "..."}
  • JSONL format, one sample per line (see the validation sketch below)
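A minimal validation sketch for this schema, using only stdlib Python. The filename and the strict user/assistant alternation check are illustrative assumptions, not requirements of any particular trainer:

```python
import json

def validate_sft_jsonl(path):
    """Check that every line is a complete messages-format conversation."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            sample = json.loads(line)              # raises on malformed JSON
            messages = sample["messages"]
            if messages and messages[0]["role"] == "system":
                messages = messages[1:]            # optional system prompt comes first
            assert messages, f"line {i}: no user/assistant turns"
            for turn, msg in enumerate(messages):
                expected = "user" if turn % 2 == 0 else "assistant"
                assert msg["role"] == expected, f"line {i}: roles must alternate"
                assert msg["content"].strip(), f"line {i}: empty content"

validate_sft_jsonl("train.jsonl")  # hypothetical filename
```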

Preference Learning (DPO/ORPO/KTO)

Requires paired comparisons [2]:

{"prompt": "...", "chosen": "...", "rejected": "..."}
  • chosen and rejected must respond to the same prompt
  • Quality difference should be clear and consistent
  • Annotator agreement >70% indicates usable samples [1] (a sanity-check sketch follows below)
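A quick structural check over preference pairs, assuming the JSONL schema shown above; the function name is a hypothetical:

```python
import json

def check_preference_pairs(path):
    """Flag structurally broken {prompt, chosen, rejected} records."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            pair = json.loads(line)
            for field in ("prompt", "chosen", "rejected"):
                assert pair.get(field, "").strip(), f"line {i}: missing {field}"
            # Identical responses carry no preference signal at all
            assert pair["chosen"] != pair["rejected"], f"line {i}: degenerate pair"
```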

For KTO, pairs aren't required—just binary labels on completions [7]:

{"prompt": "...", "completion": "...", "label": true/false}

Reward Modeling (RLHF)

Needs ranked responses [1]:

{"prompt": "...", "responses": ["best", "second", "worst"]}

Quality Checklist

Before training, verify each item below (an audit sketch for the automatable checks follows the list):

  • No duplicates — exact and near-duplicate removal [3]
  • No empty fields — all required fields populated
  • Consistent format — schema matches throughout
  • Appropriate length — not too short (noise) or too long (truncation)
  • Clean text — proper encoding, no HTML/boilerplate artifacts [8]
  • Manual inspection — reviewed random sample of 50-100 examples
  • No PII/sensitive data — unless intentionally included
  • License verified — legal to use for training
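A one-pass audit covering the duplicate, empty-field, and length items, assuming messages-format JSONL. It catches exact duplicates only (near-duplicates need MinHash tooling), and the character thresholds are placeholders to tune per tokenizer and context window:

```python
import hashlib
import json

def audit_jsonl(path, min_chars=20, max_chars=20_000):
    """Report exact duplicates, empty fields, and length outliers."""
    seen, issues = set(), []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            sample = json.loads(line)
            canonical = json.dumps(sample, sort_keys=True)
            digest = hashlib.sha256(canonical.encode()).hexdigest()
            if digest in seen:
                issues.append((i, "exact duplicate"))
            seen.add(digest)
            contents = [m.get("content", "") for m in sample.get("messages", [])]
            if not all(c.strip() for c in contents):
                issues.append((i, "empty field"))
            total = sum(len(c) for c in contents)
            if total < min_chars:
                issues.append((i, "too short"))
            elif total > max_chars:
                issues.append((i, "may truncate"))
    return issues
```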

Common Quality Issues

| Issue | Detection | Fix | Source |
|---|---|---|---|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
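The repetition, alpha-ratio, and boilerplate rows translate directly into code. A minimal sketch using the thresholds from the table; the keyword list is illustrative, and word-level trigrams stand in for whatever tokenization you prefer:

```python
def unique_trigram_ratio(text):
    """Fraction of distinct word trigrams; low values indicate repetition."""
    words = text.split()
    trigrams = list(zip(words, words[1:], words[2:]))
    return len(set(trigrams)) / len(trigrams) if trigrams else 1.0

def alpha_ratio(text):
    """Fraction of alphabetic characters; low values suggest markup or noise."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)

BOILERPLATE = ("subscribe", "cookie policy", "all rights reserved")  # illustrative

def passes_filters(text):
    return (unique_trigram_ratio(text) >= 0.30       # table: <30% -> flag
            and alpha_ratio(text) >= 0.50            # table: <50% -> remove
            and not any(kw in text.lower() for kw in BOILERPLATE))
```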

Data Sources

High quality:

  • Curated human annotations [1]
  • Expert-written examples
  • Filtered high-quality web data [3]

Medium quality:

  • Synthetic data from stronger models (distillation)
  • Community Q&A with voting signals
  • Filtered user-generated content

Use with caution:

  • Raw web scrapes
  • Unfiltered synthetic data
  • Data without clear provenance [6]

Sizing Guidelines

| Dataset Size | Use Case | Source |
|---|---|---|
| 100-1K | Quick experiments, specific behaviors | |
| 1K-10K | Production SFT, domain adaptation | |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |

Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

  • JSONL — one JSON object per line, human-readable
  • Parquet — efficient for large datasets, built-in compression [3]
  • Sharding — split files >500MB into chunks (see the sketch below)
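A sketch of line-preserving sharding for JSONL; the 500MB default mirrors the guideline above, and the shard naming scheme is an assumption:

```python
def shard_jsonl(path, max_bytes=500 * 1024**2):
    """Split a JSONL file into shards of at most max_bytes, never splitting a line."""
    shard_idx, size, out = 0, 0, None
    with open(path, "rb") as src:
        for line in src:
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                out = open(f"{path}.{shard_idx:04d}", "wb")  # e.g. data.jsonl.0000
                shard_idx += 1
                size = 0
            out.write(line)
            size += len(line)
    if out:
        out.close()
```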

References

  1. Llama 2 Paper — Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
  2. TRL Library — HuggingFace trainer implementations for SFT, DPO, KTO, ORPO
  3. FineWeb Paper — Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
  4. Data-Juicer — Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
  5. Tinker API — Training API using messages format for SFT, DPO/RLHF support
  6. Data Provenance Initiative — Longpre et al. (2023). Dataset licensing and attribution audit
  7. KTO Paper — Ethayarajh et al. (2024). Binary preference learning without pairs
  8. C4/T5 Paper — Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal
