
Golden Dataset

Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in rules/ loaded on-demand.

Quick Reference

| Category | Rules | Impact | When to Use |
|---|---|---|---|
| Curation | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| Management | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| Validation | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| Add Workflow | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |

Total: 10 rules across 4 categories

Curation

Content collection, multi-agent annotation, and diversity analysis for golden datasets.

| Rule | File | Key Pattern |
|---|---|---|
| Collection | rules/curation-collection.md | Content type classification, quality thresholds, duplicate prevention |
| Annotation | rules/curation-annotation.md | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | rules/curation-diversity.md | Difficulty stratification, domain coverage, balance guidelines |
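The annotation rules describe a multi-agent pipeline with consensus aggregation. As a minimal sketch of that aggregation step (the function name and return shape are illustrative, not from the rule files), majority voting with an agreement threshold routes low-consensus entries to human review:

```python
from collections import Counter


def aggregate_consensus(labels: list[str], min_agreement: float = 0.5) -> dict:
    """Majority-vote consensus over labels from multiple annotator agents.

    Returns the winning label and its agreement ratio; entries below
    min_agreement should be routed to human review.
    """
    if not labels:
        return {"label": None, "agreement": 0.0, "needs_review": True}
    winner, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    return {
        "label": winner,
        "agreement": agreement,
        "needs_review": agreement < min_agreement,
    }
```

With three agents voting `["tutorial", "tutorial", "article"]`, the winner is `tutorial` at 0.67 agreement, above the review threshold.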

Management

Versioning, storage, and CI/CD automation for golden datasets.

| Rule | File | Key Pattern |
|---|---|---|
| Versioning | rules/management-versioning.md | JSON backup format, embedding regeneration, disaster recovery |
| Storage | rules/management-storage.md | Backup strategies, URL contract, data integrity checks |
| CI Integration | rules/management-ci.md | GitHub Actions automation, pre-deployment validation, weekly backups |
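The versioning rule pairs a JSON backup format with embedding regeneration on restore. A minimal backup sketch under those assumptions (the `backup_dataset` helper and payload fields are illustrative) drops the `embedding` field, since embeddings are large, non-diffable, and reproducible:

```python
import json
from datetime import datetime, timezone


def backup_dataset(documents: list[dict], path: str) -> dict:
    """Write a version-controlled JSON backup, excluding embeddings.

    Embeddings are excluded because they bloat diffs and can be
    regenerated from document text on restore.
    """
    payload = {
        "backed_up_at": datetime.now(timezone.utc).isoformat(),
        "count": len(documents),
        "documents": [
            {k: v for k, v in doc.items() if k != "embedding"}
            for doc in documents
        ],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, sort_keys=True)
    return payload
```

Sorted keys and indentation keep successive backups diff-friendly in version control.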

Validation

Quality scoring, drift detection, and regression testing for golden datasets.

| Rule | File | Key Pattern |
|---|---|---|
| Quality | rules/validation-quality.md | Schema validation, content quality, referential integrity |
| Drift | rules/validation-drift.md | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | rules/validation-regression.md | Difficulty distribution, pre-commit hooks, full dataset validation |
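The drift rule's duplicate detection uses the similarity thresholds from the Key Decisions table (block at 0.90, warn at 0.85). A minimal sketch of that check over embeddings (the function names are illustrative, not the rule file's API):

```python
import math

BLOCK_THRESHOLD = 0.90  # similarity at or above this blocks the entry
WARN_THRESHOLD = 0.85   # similarity at or above this only warns


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_duplicate(candidate: list[float], existing: dict[str, list[float]]) -> dict:
    """Compare a candidate embedding against existing entries.

    Returns the closest match and an action: 'block', 'warn', or 'ok'.
    """
    best_id, best_sim = None, 0.0
    for doc_id, emb in existing.items():
        sim = cosine_similarity(candidate, emb)
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    if best_sim >= BLOCK_THRESHOLD:
        action = "block"
    elif best_sim >= WARN_THRESHOLD:
        action = "warn"
    else:
        action = "ok"
    return {"closest": best_id, "similarity": best_sim, "action": action}
```

In practice the embeddings would come from the same service used at curation time, so candidate and stored vectors are comparable.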

Add Workflow

Structured workflow for adding new documents to the golden dataset.

| Rule | File | Key Pattern |
|---|---|---|
| Add Document | rules/curation-add-workflow.md | 9-phase curation, parallel quality analysis, bias detection |

Quick Start Example

```python
from app.shared.services.embeddings import embed_text


async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []

    # 1. URL contract check
    if "placeholder" in document.get("source_url", ""):
        errors.append("URL must be canonical, not a placeholder")

    # 2. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 3. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```

Key Decisions

| Decision | Recommendation |
|---|---|
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |
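The difficulty balance decision (Trivial 3, Easy 3, Medium 5, Hard 3 minimum) is mechanical enough to enforce in a pre-commit check. A minimal sketch, assuming each test query carries a `difficulty` field (the field name and function are illustrative):

```python
# Minimum test-query counts per difficulty tier, from the Key Decisions table.
DIFFICULTY_MINIMUMS = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}


def check_difficulty_balance(queries: list[dict]) -> list[str]:
    """Report difficulty tiers that fall below the recommended minimums."""
    counts = {tier: 0 for tier in DIFFICULTY_MINIMUMS}
    for query in queries:
        tier = query.get("difficulty", "").lower()
        if tier in counts:
            counts[tier] += 1
    return [
        f"{tier}: have {counts[tier]}, need {minimum}"
        for tier, minimum in DIFFICULTY_MINIMUMS.items()
        if counts[tier] < minimum
    ]
```

An empty return means the distribution meets every minimum; otherwise each string names a tier to backfill before the dataset change lands.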

Common Mistakes

  • Using placeholder URLs instead of canonical source URLs

  • Skipping embedding regeneration after restore

  • Not validating referential integrity between documents and queries

  • Over-indexing on articles (neglecting tutorials, research papers)

  • Missing difficulty distribution balance in test queries

  • Not running verification after backup/restore operations

  • Testing restore procedures in production instead of staging

  • Committing SQL dumps instead of JSON (not version-control friendly)
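One of the mistakes above, skipping referential-integrity validation between documents and queries, can be caught with a simple set check. A sketch under assumed field names (`id` on documents, `expected_doc_id` on queries; neither is confirmed by the rule files):

```python
def check_referential_integrity(documents: list[dict], queries: list[dict]) -> list[str]:
    """Verify every test query points at an existing document, and vice versa.

    Orphaned queries and query-less documents are common after manual
    edits or partial restores.
    """
    doc_ids = {doc["id"] for doc in documents}
    errors = []
    # Queries referencing documents that no longer exist.
    for query in queries:
        if query.get("expected_doc_id") not in doc_ids:
            errors.append(
                f"query {query.get('id')} references missing document "
                f"{query.get('expected_doc_id')}"
            )
    # Documents with no test queries at all.
    referenced = {query.get("expected_doc_id") for query in queries}
    for doc_id in sorted(doc_ids - referenced):
        errors.append(f"document {doc_id} has no test queries")
    return errors
```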

Evaluations

See test-cases.json for 9 test cases across all categories.

Related Skills

  • ork:rag-retrieval

  • Retrieval evaluation using golden dataset

  • langfuse-observability

  • Tracing patterns for curation workflows

  • ork:testing-unit

  • Unit testing patterns and strategies

  • ai-native-development

  • Embedding generation for restore

Capability Details

curation

Keywords: golden dataset, curation, content collection, annotation, quality criteria

Solves:

  • Classify document content types for golden dataset

  • Run multi-agent quality analysis pipelines

  • Generate test queries for new documents

management

Keywords: golden dataset, backup, restore, versioning, disaster recovery

Solves:

  • Backup and restore golden datasets with JSON

  • Regenerate embeddings after restore

  • Automate backups with CI/CD
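Regenerating embeddings after restore, listed above, is the counterpart to excluding them from backups. A minimal restore sketch, assuming an async embedding callable like the `embed_text` service imported in the Quick Start (the payload shape and field names are illustrative):

```python
import json


async def restore_dataset(path: str, embed_text) -> list[dict]:
    """Restore documents from a JSON backup, regenerating embeddings.

    `embed_text` is an async callable (e.g. the project's embedding
    service); since embeddings are excluded from backups, each document
    is re-embedded from its title and content on restore.
    """
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    documents = payload["documents"]
    for doc in documents:
        text = f"{doc.get('title', '')}\n{doc.get('content', '')}"
        doc["embedding"] = await embed_text(text)
    return documents
```

Running a verification pass (count, schema, similarity spot-checks) after this step is what the "Not running verification after backup/restore operations" mistake above refers to.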

validation

Keywords: golden dataset, validation, schema, duplicate detection, quality metrics

Solves:

  • Validate entries against document schema

  • Detect duplicate or near-duplicate entries

  • Analyze dataset coverage and distribution gaps
