funsloth-check

Validate datasets for Unsloth fine-tuning. Use when the user wants to check a dataset, analyze tokens, calculate Chinchilla optimality, or prepare data for training.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "funsloth-check" with this command: npx skills add chrisvoncsefalvay/funsloth/chrisvoncsefalvay-funsloth-funsloth-check

Dataset Validation for Unsloth Fine-tuning

Validate datasets before fine-tuning with Unsloth.

Quick Start

For automated validation, use the script:

python scripts/validate_dataset.py --dataset "dataset-id" --model llama-3.1-8b --lora-rank 16

Workflow

1. Get Dataset Source

Ask for a Hugging Face dataset ID (e.g., mlabonne/FineTome-100k) or a local path (e.g., ./data.jsonl).

2. Load and Detect Format

Auto-detect format from structure. See DATA_FORMATS.md for details.

Format   | Detection            | Key Fields
Raw      | text only            | text
Alpaca   | instruction + output | instruction, output
ShareGPT | conversations array  | from, value
ChatML   | messages array       | role, content
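
The detection rules in the table above can be sketched as a check on one record's top-level keys. This is a minimal illustration, not the logic of scripts/validate_dataset.py; the function name `detect_format` and the lowercase labels (chosen to match the `format_type` value used in the handoff) are assumptions.

```python
from typing import Any, Mapping


def detect_format(record: Mapping[str, Any]) -> str:
    """Guess the dataset format from one record's top-level keys."""
    # Check the list-based chat formats first: they are the most specific.
    if isinstance(record.get("messages"), list):
        return "chatml"
    if isinstance(record.get("conversations"), list):
        return "sharegpt"
    # Alpaca needs both fields to be present.
    if "instruction" in record and "output" in record:
        return "alpaca"
    # Raw data carries a single text field.
    if "text" in record:
        return "raw"
    return "unknown"
```

Checking the first few records rather than only one guards against datasets whose first row is atypical.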

3. Validate Schema

Check required fields exist. Report issues with fix suggestions.
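
A schema check along these lines might look like the sketch below. The required keys come from the Key Fields column in step 2; the function name `validate_record`, the lowercase format labels, and the wording of the messages are hypothetical.

```python
from typing import Any, List, Mapping

# Required keys per format, taken from the Key Fields column in step 2.
FLAT_KEYS = {"raw": ("text",), "alpaca": ("instruction", "output")}
TURN_KEYS = {
    "sharegpt": ("conversations", ("from", "value")),
    "chatml": ("messages", ("role", "content")),
}


def validate_record(record: Mapping[str, Any], format_type: str) -> List[str]:
    """Return a list of human-readable issues; an empty list means valid."""
    issues: List[str] = []
    if format_type in FLAT_KEYS:
        for key in FLAT_KEYS[format_type]:
            if key not in record:
                issues.append(f"missing field '{key}'")
    elif format_type in TURN_KEYS:
        list_key, turn_keys = TURN_KEYS[format_type]
        turns = record.get(list_key)
        if not isinstance(turns, list) or not turns:
            issues.append(f"'{list_key}' must be a non-empty list")
        else:
            # Every turn must be a dict carrying both per-turn keys.
            for i, turn in enumerate(turns):
                for key in turn_keys:
                    if not isinstance(turn, dict) or key not in turn:
                        issues.append(f"{list_key}[{i}] missing field '{key}'")
    else:
        issues.append(f"unknown format '{format_type}'")
    return issues
```

Returning messages instead of raising lets the report collect every problem in a record before suggesting fixes.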

4. Show Samples

Display 2-3 examples for visual verification.

5. Token Analysis

Report statistics: total tokens, min/max/mean/median sequence length.

Flag concerns:

  • Sequences > 4096 tokens
  • Sequences < 10 tokens
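
The statistics and flags above can be computed from a list of per-example token counts. The sketch below assumes those counts were already produced by the target model's tokenizer (not shown here); the function name `summarize_lengths` and the result keys are hypothetical, while the 4096 and 10 thresholds come from the list above.

```python
import statistics
from typing import Dict, Sequence


def summarize_lengths(
    lengths: Sequence[int], max_len: int = 4096, min_len: int = 10
) -> Dict[str, float]:
    """Summarize per-example token counts and count flagged sequences."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return {
        "total_tokens": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Sequences strictly above max_len or strictly below min_len.
        "too_long": sum(1 for n in lengths if n > max_len),
        "too_short": sum(1 for n in lengths if n < min_len),
    }
```

For example, `summarize_lengths([5, 100, 200, 5000])` reports 5305 total tokens, a median of 150, one sequence over 4096 tokens and one under 10.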

6. Chinchilla Analysis

Ask for the target model and LoRA rank, then calculate the Chinchilla fraction and interpret it:

Chinchilla Fraction | Interpretation
< 0.5x              | Dataset may be too small
0.5x - 2.0x         | Good range
> 2.0x              | Large dataset, may take longer
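
The source does not spell out how the fraction is computed. One plausible sketch, assuming the common rule of thumb of about 20 training tokens per parameter, divides the dataset's token count by that token budget. Which parameter count the bundled script uses (the full model or only the LoRA-trainable weights) is not stated here, so `param_count` is left to the caller; both function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # assumption: widely cited Chinchilla rule of thumb


def chinchilla_fraction(
    total_tokens: int, param_count: int, tokens_per_param: int = TOKENS_PER_PARAM
) -> float:
    """Ratio of available tokens to the assumed Chinchilla token budget."""
    if param_count <= 0:
        raise ValueError("param_count must be positive")
    return total_tokens / (tokens_per_param * param_count)


def interpret(fraction: float) -> str:
    """Map a fraction to the interpretation bands from the table above."""
    if fraction < 0.5:
        return "Dataset may be too small"
    if fraction <= 2.0:
        return "Good range"
    return "Large dataset, may take longer"
```

As a purely illustrative input, 15,000,000 tokens against a hypothetical count of 625,000 parameters gives 15,000,000 / (20 × 625,000) = 1.2, which falls in the good range.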

7. Recommendations

Based on analysis, suggest:

  • standardize_sharegpt() for ShareGPT data
  • Sequence length adjustments
  • Learning rate for small datasets
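
To show the kind of normalization the ShareGPT recommendation refers to, the sketch below maps ShareGPT turns (from, value) onto role/content messages in plain Python. It is an illustration only and not Unsloth's implementation of standardize_sharegpt(); the function name `sharegpt_to_messages` and the role mapping are assumptions.

```python
from typing import Dict, List

# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(conversations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert ShareGPT turns to role/content messages."""
    messages = []
    for turn in conversations:
        # Unknown speaker labels are passed through unchanged.
        role = ROLE_MAP.get(turn["from"], turn["from"])
        messages.append({"role": role, "content": turn["value"]})
    return messages
```

In an actual Unsloth workflow the library helper would be preferred over a hand-written conversion like this one.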

8. Optional: HF Upload

Offer to upload local datasets to Hub.

9. Handoff

Pass context to funsloth-train:

dataset_id: "mlabonne/FineTome-100k"
format_type: "sharegpt"
total_tokens: 15000000
target_model: "llama-3.1-8b"
use_lora: true
lora_rank: 16
chinchilla_fraction: 1.2
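
If the handoff is assembled in Python, the same fields could be carried in a small typed container. This is a sketch under that assumption; the class name `HandoffContext` is hypothetical, and the values simply repeat the example above.

```python
from dataclasses import asdict, dataclass


@dataclass
class HandoffContext:
    """Fields passed from funsloth-check to funsloth-train."""
    dataset_id: str
    format_type: str
    total_tokens: int
    target_model: str
    use_lora: bool
    lora_rank: int
    chinchilla_fraction: float


context = HandoffContext(
    dataset_id="mlabonne/FineTome-100k",
    format_type="sharegpt",
    total_tokens=15_000_000,
    target_model="llama-3.1-8b",
    use_lora=True,
    lora_rank=16,
    chinchilla_fraction=1.2,
)

# asdict() yields a plain dict that is easy to serialize or log.
payload = asdict(context)
```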

Bundled Resources
