# Golden Dataset Management

Protect and maintain high-quality test datasets for AI/ML systems.
## Overview

A golden dataset is a curated collection of high-quality examples used for:

- **Regression testing**: Ensure new code doesn't break existing functionality
- **Retrieval evaluation**: Measure search quality (precision, recall, MRR)
- **Model benchmarking**: Compare different models and approaches
- **Reproducibility**: Consistent results across environments
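The retrieval-evaluation metrics above can be sketched for a single golden query. This is a minimal, hypothetical helper (not part of OrchestKit's scripts), assuming each test query carries a set of expected document IDs:

```python
def evaluate_query(retrieved: list[str], expected: set[str], k: int = 10) -> dict:
    """Score one golden query: precision@k, recall@k, and reciprocal rank."""
    top_k = retrieved[:k]
    hits = [doc for doc in top_k if doc in expected]
    # Reciprocal rank: 1 / position of the first relevant result (0 if none).
    # Averaging this value across all test queries yields MRR.
    rr = 0.0
    for rank, doc in enumerate(top_k, start=1):
        if doc in expected:
            rr = 1.0 / rank
            break
    return {
        "precision": len(hits) / k,
        "recall": len(hits) / len(expected) if expected else 0.0,
        "reciprocal_rank": rr,
    }
```

A pass rate like the 91.6% quoted below can then be defined as the fraction of queries whose scores clear a chosen threshold.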
**When to use this skill:**

- Building test datasets for RAG systems
- Implementing backup/restore for critical data
- Validating data integrity (URL contracts, embeddings)
- Migrating data between environments
## OrchestKit's Golden Dataset

**Stats (production):**

- 98 analyses (completed content analyses)
- 415 chunks (embedded text segments)
- 203 test queries (with expected results)
- 91.6% pass rate (retrieval quality metric)

**Purpose:**

- Test hybrid search (vector + BM25 + RRF)
- Validate metadata boosting strategies
- Detect regressions in retrieval quality
- Benchmark new embedding models
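The RRF step of the hybrid search above can be sketched as follows. This is an illustrative implementation of standard Reciprocal Rank Fusion, not OrchestKit's actual search code; the inputs are assumed to be ranked lists of document IDs from the vector and BM25 retrievers:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge multiple rankings into one.

    Each document scores sum(1 / (k + rank)) across the input rankings;
    k = 60 is the conventional constant from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of both rankings dominate the fused list, which is what the golden dataset's test queries exercise.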
## Core Concepts

### Data Integrity Contracts

**The URL Contract:** Golden dataset analyses MUST store real canonical URLs, not placeholders.

```python
# WRONG - Placeholder URL (breaks restore)
analysis.url = "https://orchestkit.dev/placeholder/123"

# CORRECT - Real canonical URL (enables re-fetch if needed)
analysis.url = "https://docs.python.org/3/library/asyncio.html"
```
**Why this matters:**

- Enables re-fetching content if embeddings need regeneration
- Allows validation that source content hasn't changed
- Provides an audit trail for data provenance
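A URL-contract check can be sketched as a small validator. This is a hypothetical helper, assuming analyses are plain dicts with `id` and `url` keys; the marker strings match the placeholder-URL check described later in the validation table:

```python
PLACEHOLDER_MARKERS = ("orchestkit.dev", "placeholder")

def check_url_contract(analyses: list[dict]) -> list[str]:
    """Return an error message for each analysis violating the URL contract."""
    errors = []
    for analysis in analyses:
        url = analysis.get("url", "")
        if not url.startswith("https://"):
            errors.append(f"{analysis.get('id')}: non-canonical URL {url!r}")
        elif any(marker in url for marker in PLACEHOLDER_MARKERS):
            errors.append(f"{analysis.get('id')}: placeholder URL {url!r}")
    return errors
```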
### Backup Strategy Comparison

| Strategy | Version Control | Restore Speed | Portability | Inspection |
|----------|-----------------|---------------|-------------|------------|
| JSON (recommended) | Yes | Slower (regenerates embeddings) | High | Easy |
| SQL dump | No (binary) | Fast | DB-version dependent | Hard |

OrchestKit uses JSON backups for version control and portability.
## Quick Reference

### Backup Format

```jsonc
{
  "version": "1.0",
  "created_at": "2025-12-19T10:30:00Z",
  "metadata": {
    "total_analyses": 98,
    "total_chunks": 415,
    "total_artifacts": 98
  },
  "analyses": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "url": "https://docs.python.org/3/library/asyncio.html",
      "content_type": "documentation",
      "status": "completed",
      "created_at": "2025-11-15T08:20:00Z",
      "chunks": [
        {
          "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
          "content": "asyncio is a library...",
          "section_title": "Introduction to asyncio"
          // embedding NOT included (regenerated on restore)
        }
      ]
    }
  ]
}
```
**Key design decisions:**

- Embeddings excluded (regenerated on restore with the current model)
- Nested structure (analyses -> chunks -> artifacts)
- Metadata block for validation
- ISO timestamps for reproducibility
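The serialization side of these decisions can be sketched in a few lines. This is a hypothetical helper, not the actual `backup_golden_dataset.py` script; it assumes analyses are dicts whose chunks may carry an `embedding` key that must be dropped:

```python
import json
from datetime import datetime, timezone

def serialize_backup(analyses: list[dict]) -> str:
    """Serialize analyses into the backup format above, excluding embeddings."""
    def strip_chunk(chunk: dict) -> dict:
        # Embeddings are excluded; they are regenerated on restore
        return {key: value for key, value in chunk.items() if key != "embedding"}

    payload = {
        "version": "1.0",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "metadata": {
            "total_analyses": len(analyses),
            "total_chunks": sum(len(a.get("chunks", [])) for a in analyses),
        },
        "analyses": [
            {**a, "chunks": [strip_chunk(c) for c in a.get("chunks", [])]}
            for a in analyses
        ],
    }
    return json.dumps(payload, indent=2)
```

Computing the metadata counts at serialization time is what makes the count-mismatch validation check possible later.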
### CLI Commands

```bash
cd backend

# Backup golden dataset
poetry run python scripts/backup_golden_dataset.py backup

# Verify backup integrity
poetry run python scripts/backup_golden_dataset.py verify

# Restore from backup (WARNING: deletes existing data)
poetry run python scripts/backup_golden_dataset.py restore --replace

# Restore without deleting (adds to existing data)
poetry run python scripts/backup_golden_dataset.py restore
```
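The restore path, including embedding regeneration, can be sketched as below. This is an illustrative outline, not the real restore command; `embed` stands in for whatever embedding function the current model provides:

```python
from typing import Callable

def restore_chunks(backup: dict, embed: Callable[[str], list[float]]) -> list[dict]:
    """Rebuild chunks from a parsed backup, regenerating embeddings."""
    restored = []
    for analysis in backup["analyses"]:
        for chunk in analysis.get("chunks", []):
            restored.append({
                **chunk,
                "analysis_id": analysis["id"],       # re-link chunk to its parent
                "embedding": embed(chunk["content"]),  # regenerated, never restored
            })
    return restored
```

Because embeddings are regenerated rather than restored, swapping in a new embedding model is just a matter of passing a different `embed` function.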
### Validation Checks

| Check | Severity | Description |
|-------|----------|-------------|
| Count mismatch | Error | Analysis/chunk count differs from metadata |
| Placeholder URLs | Error | URLs containing `orchestkit.dev` or `placeholder` |
| Missing embeddings | Error | Chunks without embeddings after restore |
| Orphaned chunks | Warning | Chunks with no parent analysis |
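The first two checks in the table can be sketched against a parsed backup file. This is a hypothetical verifier, not the real `verify` command; the missing-embeddings and orphaned-chunks checks apply to the database after restore, so they are omitted here:

```python
def verify_backup(backup: dict) -> list[str]:
    """Run pre-restore integrity checks on a parsed backup; return errors."""
    errors = []
    analyses = backup.get("analyses", [])
    meta = backup.get("metadata", {})

    # Count mismatch (error)
    chunk_total = sum(len(a.get("chunks", [])) for a in analyses)
    if meta.get("total_analyses") != len(analyses):
        errors.append("analysis count differs from metadata")
    if meta.get("total_chunks") != chunk_total:
        errors.append("chunk count differs from metadata")

    # Placeholder URLs (error)
    for analysis in analyses:
        url = analysis.get("url", "")
        if "orchestkit.dev" in url or "placeholder" in url:
            errors.append(f"placeholder URL in analysis {analysis.get('id')}")

    return errors
```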
## Best Practices Summary

- **Version control backups** - Commit to git for history and diffs
- **Validate before deployment** - Run `verify` before production changes
- **Test restore in staging** - Never test restore in production first
- **Document changes** - Track additions/removals in metadata
## Disaster Recovery Quick Guide

| Scenario | Steps |
|----------|-------|
| Accidental deletion | `restore --replace` -> `verify` -> run tests |
| Migration failure | `alembic downgrade -1` -> `restore --replace` -> fix migration |
| New environment | Clone repo -> set up DB -> `restore` -> run tests |
## References

For detailed implementation patterns, see:

- `references/storage-patterns.md` - Backup strategies, JSON format, backup script implementation, CI/CD automation
- `references/versioning.md` - Restore implementation, embedding regeneration, validation checklist, disaster recovery scenarios
## Related Skills

- `golden-dataset-validation` - Schema and integrity validation
- `golden-dataset-curation` - Quality criteria and curation workflows
- `pgvector-search` - Retrieval evaluation using the golden dataset
- `ai-native-development` - Embedding generation for restore
**Version:** 1.0.0 (December 2025)
**Status:** Production-ready patterns from OrchestKit's 98-analysis golden dataset
## Capability Details

### backup

**Keywords:** golden dataset, backup, export, json backup, version control data

**Solves:**

- How do I back up the golden dataset?
- Export analyses to JSON for version control
- Protect critical test datasets
- Create portable database snapshots

### restore

**Keywords:** restore dataset, import analyses, regenerate embeddings, disaster recovery, new environment

**Solves:**

- How do I restore from backup?
- Import the golden dataset into a new environment
- Regenerate embeddings after restore
- Disaster recovery procedures

### validation

**Keywords:** verify dataset, url contract, data integrity, validate backup, placeholder urls

**Solves:**

- How do I validate dataset integrity?
- Check URL contracts (no placeholders)
- Verify embeddings exist
- Detect orphaned chunks

### ci-cd-automation

**Keywords:** automated backup, github actions, ci cd backup, scheduled backup

**Solves:**

- How do I automate dataset backups?
- Set up GitHub Actions for weekly backups
- Commit backups to git automatically
- CI/CD integration patterns

### disaster-recovery

**Keywords:** disaster recovery, accidental deletion, migration failure, rollback

**Solves:**

- What if I accidentally delete the dataset?
- Database migration gone wrong
- Restore after data corruption
- Rollback procedures

### orchestkit-golden-dataset

**Keywords:** orchestkit, 98 analyses, 415 chunks, retrieval evaluation, real world

**Solves:**

- What is OrchestKit's golden dataset?
- How does OrchestKit protect test data?
- Real-world backup/restore examples
- Production golden dataset stats