## Overview

This skill provides tools to manage datasets on the Hugging Face Hub, with a focus on creation, configuration, content management, and SQL-based data manipulation. It is designed to complement the existing Hugging Face MCP server by adding dataset editing and querying capabilities.
## Integration with HF MCP Server

- **Use the HF MCP Server for:** dataset discovery, search, and metadata retrieval
- **Use this skill for:** dataset creation, content editing, SQL queries, data transformation, and structured data formatting
## Version

2.1.0
## Dependencies

- huggingface_hub
- duckdb (for SQL queries)
- datasets (for pushing query results to the Hub)
- json (built-in)
- time (built-in)
## Core Capabilities

### Dataset Lifecycle Management

- **Initialize:** Create new dataset repositories with proper structure
- **Configure:** Store detailed configuration, including system prompts and metadata
- **Stream Updates:** Add rows efficiently without downloading entire datasets (one possible approach is sketched below)
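For context, here is a minimal sketch of one way streaming-style appends can work: commit the new rows as an additional JSONL shard via huggingface_hub instead of rewriting existing files. The shard path and repo id are illustrative, and `dataset_manager.py`'s actual approach may differ.

```python
import io
import json
from huggingface_hub import CommitOperationAdd, HfApi

# Illustrative sketch: append rows by committing a new JSONL shard, avoiding
# a download-modify-reupload cycle. Shard and repo names are examples only.
api = HfApi()  # resolves HF_TOKEN from the environment
rows = [{"question": "What is AI?", "answer": "Artificial Intelligence..."}]
payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

api.create_commit(
    repo_id="your-username/dataset-name",
    repo_type="dataset",
    operations=[CommitOperationAdd("data/shard-0001.jsonl", io.BytesIO(payload))],
    commit_message="Append rows as a new shard",
)
```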
### SQL-Based Dataset Querying (NEW)

Query any Hugging Face dataset using DuckDB SQL via `scripts/sql_manager.py`:

- **Direct Queries:** Run SQL on datasets using the `hf://` protocol
- **Schema Discovery:** Describe dataset structure and column types
- **Data Sampling:** Get random samples for exploration
- **Aggregations:** Counts, histograms, and unique-value analysis
- **Transformations:** Filter, join, and reshape data with SQL
- **Export & Push:** Save results locally or push them to new Hub repos
### Multi-Format Dataset Support

Supports diverse dataset types through a template system:

- **Chat/Conversational:** Chat templating, multi-turn dialogues, tool-usage examples
- **Text Classification:** Sentiment analysis, intent detection, topic classification
- **Question Answering:** Reading comprehension, factual QA, knowledge bases
- **Text Completion:** Language modeling, code completion, creative writing
- **Tabular Data:** Structured data for regression/classification tasks
- **Custom Formats:** Flexible schema definition for specialized needs
### Quality Assurance Features

- **JSON Validation:** Ensures data integrity during uploads
- **Batch Processing:** Efficient handling of large datasets
- **Error Recovery:** Graceful handling of upload failures and conflicts
## Usage Instructions

The skill includes two Python scripts:

- `scripts/dataset_manager.py`: dataset creation and management
- `scripts/sql_manager.py`: SQL-based dataset querying and transformation
### Prerequisites

- huggingface_hub library: `uv add huggingface_hub`
- duckdb library (for SQL): `uv add duckdb`
- datasets library (for pushing): `uv add datasets`
- The `HF_TOKEN` environment variable must be set to a write-access token
- Activate the virtual environment: `source .venv/bin/activate`
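A quick sanity check that the token is set and valid before running the scripts (`whoami` simply reports the identity behind the token):

```python
import os
from huggingface_hub import HfApi

# Confirm HF_TOKEN is set and resolves to a Hub identity.
token = os.environ.get("HF_TOKEN")
if not token:
    raise SystemExit("HF_TOKEN is not set")
print(HfApi(token=token).whoami()["name"])
```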
## SQL Dataset Querying (sql_manager.py)

Query, transform, and push Hugging Face datasets using DuckDB SQL. The `hf://` protocol provides direct access to any public dataset (or to private datasets when a token is supplied).
### Quick Start

```bash
# Query a dataset
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition' LIMIT 10"

# Get the dataset schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"

# Sample random rows
python scripts/sql_manager.py sample --dataset "cais/mmlu" --n 5

# Count rows with a filter
python scripts/sql_manager.py count --dataset "cais/mmlu" --where "subject='nutrition'"
```
### SQL Query Syntax

Use `data` as the table name in your SQL; it gets replaced with the actual `hf://` path:

```sql
-- Basic select
SELECT * FROM data LIMIT 10

-- Filtering
SELECT * FROM data WHERE subject='nutrition'

-- Aggregations
SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject ORDER BY cnt DESC

-- Column selection and transformation
-- (DuckDB lists are 1-indexed; MMLU's answer column is 0-indexed, hence + 1)
SELECT question, choices[answer + 1] AS correct_answer FROM data

-- Regex matching
SELECT * FROM data WHERE regexp_matches(question, 'nutrition|diet')

-- String functions
SELECT regexp_replace(question, '\n', '') AS cleaned FROM data
```
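Under the hood, the `data` placeholder presumably resolves to the `hf://` path format described later in this document. A minimal sketch of that substitution (the helper name and defaults are hypothetical; `sql_manager.py`'s real logic may differ):

```python
# Hypothetical helper: expand the `data` placeholder into an hf:// Parquet path.
def expand_sql(sql: str, dataset: str, config: str = "default", split: str = "train") -> str:
    path = f"hf://datasets/{dataset}@~parquet/{config}/{split}/*.parquet"
    return sql.replace("FROM data", f"FROM '{path}'")

print(expand_sql("SELECT * FROM data LIMIT 10", "cais/mmlu", split="test"))
# SELECT * FROM 'hf://datasets/cais/mmlu@~parquet/default/test/*.parquet' LIMIT 10
```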
### Common Operations

#### Explore Dataset Structure

```bash
# Get the schema
python scripts/sql_manager.py describe --dataset "cais/mmlu"

# Get unique values in a column
python scripts/sql_manager.py unique --dataset "cais/mmlu" --column "subject"

# Get a value distribution
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject" --bins 20
```
#### Filter and Transform

```bash
# Complex filtering with SQL
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT subject, COUNT(*) as cnt FROM data GROUP BY subject HAVING cnt > 100"

# Using the transform command
python scripts/sql_manager.py transform \
  --dataset "cais/mmlu" \
  --select "subject, COUNT(*) as cnt" \
  --group-by "subject" \
  --order-by "cnt DESC" \
  --limit 10
```
#### Create Subsets and Push to Hub

```bash
# Query and push to a new dataset
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --push-to "username/mmlu-nutrition-subset" \
  --private

# Transform and push
python scripts/sql_manager.py transform \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --select "question, answers" \
  --where "LENGTH(question) > 50" \
  --push-to "username/duorc-long-questions"
```
#### Export to Local Files

```bash
# Export to Parquet
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject='nutrition'" \
  --output "nutrition.parquet" \
  --format parquet

# Export to JSONL
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data LIMIT 100" \
  --output "sample.jsonl" \
  --format jsonl
```
#### Working with Dataset Configs/Splits

```bash
# Specify a config (subset)
python scripts/sql_manager.py query \
  --dataset "ibm/duorc" \
  --config "ParaphraseRC" \
  --sql "SELECT * FROM data LIMIT 5"

# Specify a split
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "test" \
  --sql "SELECT COUNT(*) FROM data"

# Query all splits
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --split "*" \
  --sql "SELECT * FROM data LIMIT 10"
```
#### Raw SQL with Full Paths

For complex queries or for joining datasets:

```bash
python scripts/sql_manager.py raw --sql "
  SELECT a.*, b.*
  FROM 'hf://datasets/dataset1@~parquet/default/train/*.parquet' a
  JOIN 'hf://datasets/dataset2@~parquet/default/train/*.parquet' b
  ON a.id = b.id
  LIMIT 100
"
```
### Python API Usage

```python
from sql_manager import HFDatasetSQL

sql = HFDatasetSQL()

# Query
results = sql.query("cais/mmlu", "SELECT * FROM data WHERE subject='nutrition' LIMIT 10")

# Get the schema
schema = sql.describe("cais/mmlu")

# Sample
samples = sql.sample("cais/mmlu", n=5, seed=42)

# Count
count = sql.count("cais/mmlu", where="subject='nutrition'")

# Histogram
dist = sql.histogram("cais/mmlu", "subject")

# Filter and transform
results = sql.filter_and_transform(
    "cais/mmlu",
    select="subject, COUNT(*) as cnt",
    group_by="subject",
    order_by="cnt DESC",
    limit=10,
)

# Push to Hub
url = sql.push_to_hub(
    "cais/mmlu",
    "username/nutrition-subset",
    sql="SELECT * FROM data WHERE subject='nutrition'",
    private=True,
)

# Export locally
sql.export_to_parquet("cais/mmlu", "output.parquet", sql="SELECT * FROM data LIMIT 100")

sql.close()
```
### HF Path Format

DuckDB uses the `hf://` protocol to access datasets:

```
hf://datasets/{dataset_id}@{revision}/{config}/{split}/*.parquet
```

Examples:

- `hf://datasets/cais/mmlu@~parquet/default/train/*.parquet`
- `hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/*.parquet`

The `@~parquet` revision points to auto-converted Parquet files, available for any dataset format.
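The same paths work in plain DuckDB, independently of `sql_manager.py`. A minimal sketch (requires network access; recent DuckDB versions resolve `hf://` paths via the httpfs extension):

```python
import duckdb

# Query a dataset's auto-converted Parquet files directly with DuckDB.
con = duckdb.connect()
result = con.sql(
    "SELECT subject, COUNT(*) AS cnt "
    "FROM 'hf://datasets/cais/mmlu@~parquet/default/test/*.parquet' "
    "GROUP BY subject ORDER BY cnt DESC LIMIT 5"
).fetchall()
print(result)
```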
### Useful DuckDB SQL Functions

```sql
-- String functions
LENGTH(column)                   -- String length
regexp_replace(col, '\n', '')    -- Regex replace
regexp_matches(col, 'pattern')   -- Regex match
LOWER(col), UPPER(col)           -- Case conversion

-- Array functions
choices[1]                       -- Array indexing (1-based in DuckDB)
array_length(choices)            -- Array length
unnest(choices)                  -- Expand an array into rows

-- Aggregations
COUNT(*), SUM(col), AVG(col)
GROUP BY col HAVING condition

-- Sampling
USING SAMPLE 10                  -- Random sample
USING SAMPLE 10 (RESERVOIR, 42)  -- Reproducible sample

-- Window functions
ROW_NUMBER() OVER (PARTITION BY col ORDER BY col2)
```
## Dataset Creation (dataset_manager.py)

### Recommended Workflow

1. **Discovery (use the HF MCP Server):**

```
# Use HF MCP tools to find existing datasets
search_datasets("conversational AI training")
get_dataset_details("username/dataset-name")
```

2. **Creation (use this skill):**

```bash
# Initialize a new dataset
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a detailed system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "$(cat system_prompt.txt)"
```

3. **Content Management (use this skill):**

```bash
# Quick setup with any template
python scripts/dataset_manager.py quick_setup \
  --repo_id "your-username/dataset-name" \
  --template classification

# Add data with template validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json "$(cat your_qa_data.json)"
```
### Template-Based Data Structures

#### Chat Template (`--template chat`)

```json
{
  "messages": [
    {"role": "user", "content": "Natural user request"},
    {"role": "assistant", "content": "Response with tool usage"},
    {"role": "tool", "content": "Tool response", "tool_call_id": "call_123"}
  ],
  "scenario": "Description of use case",
  "complexity": "simple|intermediate|advanced"
}
```

#### Classification Template (`--template classification`)

```json
{
  "text": "Input text to be classified",
  "label": "classification_label",
  "confidence": 0.95,
  "metadata": {"domain": "technology", "language": "en"}
}
```

#### QA Template (`--template qa`)

```json
{
  "question": "What is the question being asked?",
  "answer": "The complete answer",
  "context": "Additional context if needed",
  "answer_type": "factual|explanatory|opinion",
  "difficulty": "easy|medium|hard"
}
```

#### Completion Template (`--template completion`)

```json
{
  "prompt": "The beginning text or context",
  "completion": "The expected continuation",
  "domain": "code|creative|technical|conversational",
  "style": "description of writing style"
}
```

#### Tabular Template (`--template tabular`)

```json
{
  "columns": [
    {"name": "feature1", "type": "numeric", "description": "First feature"},
    {"name": "target", "type": "categorical", "description": "Target variable"}
  ],
  "data": [
    {"feature1": 123, "target": "class_a"},
    {"feature1": 456, "target": "class_b"}
  ]
}
```
### Advanced System Prompt Template

For high-quality training data generation:

```
You are an AI assistant expert at using MCP tools effectively.

MCP SERVER DEFINITIONS
[Define available servers and tools]

TRAINING EXAMPLE STRUCTURE
[Specify the exact JSON schema for chat templating]

QUALITY GUIDELINES
[Detail requirements for realistic scenarios, progressive complexity, and proper tool usage]

EXAMPLE CATEGORIES
[List development workflows, debugging scenarios, and data management tasks]
```
Example Categories & Templates
The skill includes diverse training examples beyond just MCP usage:
Available Example Sets:
-
training_examples.json
-
MCP tool usage examples (debugging, project setup, database analysis)
-
diverse_training_examples.json
-
Broader scenarios including:
-
Educational Chat - Explaining programming concepts, tutorials
-
Git Workflows - Feature branches, version control guidance
-
Code Analysis - Performance optimization, architecture review
-
Content Generation - Professional writing, creative brainstorming
-
Codebase Navigation - Legacy code exploration, systematic analysis
-
Conversational Support - Problem-solving, technical discussions
Using different example sets:

```bash
# Add MCP-focused examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/training_examples.json)"

# Add diverse conversational examples
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(cat examples/diverse_training_examples.json)"

# Mix both for comprehensive training data
python scripts/dataset_manager.py add_rows --repo_id "your-username/dataset-name" \
  --rows_json "$(jq -s '.[0] + .[1]' examples/training_examples.json examples/diverse_training_examples.json)"
```
### Commands Reference

List available templates:

```bash
python scripts/dataset_manager.py list_templates
```

Quick setup (recommended):

```bash
python scripts/dataset_manager.py quick_setup --repo_id "your-username/dataset-name" --template classification
```

Manual setup:

```bash
# Initialize the repository
python scripts/dataset_manager.py init --repo_id "your-username/dataset-name" [--private]

# Configure with a system prompt
python scripts/dataset_manager.py config --repo_id "your-username/dataset-name" --system_prompt "Your prompt here"

# Add data with validation
python scripts/dataset_manager.py add_rows \
  --repo_id "your-username/dataset-name" \
  --template qa \
  --rows_json '[{"question": "What is AI?", "answer": "Artificial Intelligence..."}]'
```

View dataset statistics:

```bash
python scripts/dataset_manager.py stats --repo_id "your-username/dataset-name"
```
## Error Handling

- **Repository exists:** The script notifies you and continues with configuration
- **Invalid JSON:** Clear error message with parsing details
- **Network issues:** Automatic retry for transient failures (see the sketch below)
- **Token permissions:** Validated before operations begin
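For illustration, a sketch of the kind of backoff retry the network-issues behavior implies, using only the built-in `time` module listed under Dependencies (the script's actual recovery logic may differ):

```python
import time

def with_retries(operation, attempts: int = 3, base_delay: float = 2.0):
    """Retry a Hub operation prone to transient failures, with exponential backoff.

    Illustrative only; dataset_manager.py's real error handling may differ.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```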
## Combined Workflow Examples

### Example 1: Create a Training Subset from an Existing Dataset

```bash
# 1. Explore the source dataset
python scripts/sql_manager.py describe --dataset "cais/mmlu"
python scripts/sql_manager.py histogram --dataset "cais/mmlu" --column "subject"

# 2. Query and create the subset
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT * FROM data WHERE subject IN ('nutrition', 'anatomy', 'clinical_knowledge')" \
  --push-to "username/mmlu-medical-subset" \
  --private
```
### Example 2: Transform and Reshape Data

```bash
# Transform MMLU to QA format with the correct answers extracted
# (DuckDB lists are 1-indexed; MMLU's answer column is 0-indexed, hence + 1)
python scripts/sql_manager.py query \
  --dataset "cais/mmlu" \
  --sql "SELECT question, choices[answer + 1] as correct_answer, subject FROM data" \
  --push-to "username/mmlu-qa-format"
```
### Example 3: Merge Multiple Dataset Splits

```bash
# Export all splits combined into one file
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --split "*" \
  --output "mmlu_all.parquet"
```
### Example 4: Quality Filtering

```bash
# Filter for high-quality examples
python scripts/sql_manager.py query \
  --dataset "squad" \
  --sql "SELECT * FROM data WHERE LENGTH(context) > 500 AND LENGTH(question) > 20" \
  --push-to "username/squad-filtered"
```
### Example 5: Create a Custom Training Dataset

```bash
# 1. Query the source data
python scripts/sql_manager.py export \
  --dataset "cais/mmlu" \
  --sql "SELECT question, subject FROM data WHERE subject='nutrition'" \
  --output "nutrition_source.jsonl" \
  --format jsonl

# 2. Process with your own pipeline (add answers, format, etc.)

# 3. Push the processed data
python scripts/dataset_manager.py init --repo_id "username/nutrition-training"
python scripts/dataset_manager.py add_rows \
  --repo_id "username/nutrition-training" \
  --template qa \
  --rows_json "$(cat processed_data.json)"
```