Golden Dataset Curation
Curate high-quality documents for the golden dataset with multi-agent validation
Overview
This skill provides patterns and workflows for adding new documents to the golden dataset with thorough quality analysis. It complements golden-dataset-management, which handles backup/restore.
When to use this skill:
- Adding new documents to the golden dataset
- Classifying content types and difficulty levels
- Generating test queries for new documents
- Running multi-agent quality analysis
Content Types
| Type | Description | Quality Focus |
|------|-------------|---------------|
| article | Technical articles, blog posts | Depth, accuracy, actionability |
| tutorial | Step-by-step guides | Completeness, clarity, code quality |
| research_paper | Academic papers, whitepapers | Rigor, citations, methodology |
| documentation | API docs, reference materials | Accuracy, completeness, examples |
| video_transcript | Transcribed video content | Structure, coherence, key points |
| code_repository | README, code analysis | Code quality, documentation |
Difficulty Levels
| Level | Semantic Complexity | Expected Score | Characteristics |
|-------|---------------------|----------------|-----------------|
| trivial | Direct keyword match | 0.85 | Technical terms, exact phrases |
| easy | Common synonyms | 0.70 | Well-known concepts, slight variations |
| medium | Paraphrased intent | 0.55 | Conceptual queries, multi-topic |
| hard | Multi-hop reasoning | 0.40 | Cross-domain, comparative analysis |
| adversarial | Edge cases | Graceful degradation | Robustness tests, off-domain |
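The table above can be encoded as a lookup so that evaluation code compares a retrieval score against the expectation for a query's difficulty. This is a minimal sketch, assuming the expected score acts as a lower bound and that adversarial queries are judged separately on graceful degradation. `EXPECTED_SCORE` and `meets_expectation` are hypothetical names, not part of the skill.

```python
from typing import Optional

# Expected scores copied from the Difficulty Levels table. Adversarial queries
# have no numeric target; they are judged on graceful degradation instead.
EXPECTED_SCORE: dict[str, Optional[float]] = {
    "trivial": 0.85,
    "easy": 0.70,
    "medium": 0.55,
    "hard": 0.40,
    "adversarial": None,
}


def meets_expectation(level: str, score: float) -> Optional[bool]:
    """Return True/False for scored levels, or None when no numeric target applies.

    Assumption: the table's expected score is treated as a lower bound.
    """
    if level not in EXPECTED_SCORE:
        raise ValueError(f"unknown difficulty level: {level!r}")
    expected = EXPECTED_SCORE[level]
    if expected is None:
        return None
    return score >= expected
```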
Quality Dimensions
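| Dimension | Weight | Perfect | Acceptable | Failing |
|-----------|--------|---------|------------|---------|
| Accuracy | 0.25 | 0.95-1.0 | 0.70-0.94 | <0.70 |
| Coherence | 0.20 | 0.90-1.0 | 0.60-0.89 | <0.60 |
| Depth | 0.25 | 0.90-1.0 | 0.55-0.89 | <0.55 |
| Relevance | 0.30 | 0.95-1.0 | 0.70-0.94 | <0.70 |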
Evaluation focuses:
- Accuracy: Technical correctness, code validity, up-to-date info
- Coherence: Logical structure, clear flow, consistent terminology
- Depth: Comprehensive coverage, edge cases, appropriate detail
- Relevance: Alignment with AI/ML, backend, frontend, DevOps domains
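To make the weights and bands concrete, the sketch below combines four dimension scores into one quality score and classifies a single dimension as perfect, acceptable, or failing. It assumes the aggregate is a plain weighted sum using the weights from the table; the function names are illustrative only.

```python
# Weights and band boundaries copied from the Quality Dimensions table.
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

# (lower bound of "perfect", lower bound of "acceptable") per dimension.
BANDS = {
    "accuracy": (0.95, 0.70),
    "coherence": (0.90, 0.60),
    "depth": (0.90, 0.55),
    "relevance": (0.95, 0.70),
}


def weighted_quality(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimension scores (the weights sum to 1.0)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def band(dimension: str, score: float) -> str:
    """Classify one dimension score as perfect, acceptable, or failing."""
    perfect_min, acceptable_min = BANDS[dimension]
    if score >= perfect_min:
        return "perfect"
    if score >= acceptable_min:
        return "acceptable"
    return "failing"
```

Relevance carries the largest weight, so a document that is accurate but off-domain loses more than one that is on-domain but slightly shallow.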
Multi-Agent Pipeline
    INPUT: URL/Content
             |
             v
    +------------------+
    |   FETCH AGENT    |  Extract structure, detect type
    +--------+---------+
             |
             v
    +-----------------------------------------------+
    |           PARALLEL ANALYSIS AGENTS            |
    |   Quality | Difficulty | Domain | Query Gen   |
    +-----------------------------------------------+
             |
             v
    +------------------+
    |    CONSENSUS     |  Weighted score + confidence
    |    AGGREGATOR    |  -> include/review/exclude
    +--------+---------+
             |
             v
    +------------------+
    |  USER APPROVAL   |  Show scores, confirm
    +--------+---------+
             |
             v
    OUTPUT: Curated document entry
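The parallel analysis stage can be sketched with `asyncio.gather`, which runs the four agents concurrently and collects their results in order. The agent functions here are hypothetical stubs that return fixed values; a real agent would call a model and is specified in references/annotation-patterns.md.

```python
import asyncio


# Hypothetical stand-ins for the four analysis agents. Each returns a partial
# result dict; the fixed values are placeholders, not real agent output.
async def quality_agent(content: str) -> dict:
    return {"quality": 0.80}


async def difficulty_agent(content: str) -> dict:
    return {"difficulty": "medium"}


async def domain_agent(content: str) -> dict:
    return {"tags": ["backend", "devops"]}


async def query_agent(content: str) -> dict:
    return {"queries": ["example query"]}


async def analyze(content: str) -> dict:
    """Run the four analysis agents concurrently and merge their outputs."""
    results = await asyncio.gather(
        quality_agent(content),
        difficulty_agent(content),
        domain_agent(content),
        query_agent(content),
    )
    merged: dict = {}
    for partial in results:
        merged.update(partial)
    return merged
```

Calling `asyncio.run(analyze(text))` returns one merged dict that a consensus step could then score.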
Decision Thresholds
| Quality Score | Confidence | Decision |
|---------------|------------|----------|
| >= 0.75 | >= 0.70 | include |
| >= 0.55 | any | review |
| < 0.55 | any | exclude |
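The thresholds translate into a small decision function. This sketch assumes the first two rows are lower bounds (`>=`) and that rows are evaluated top to bottom, so a high-quality but low-confidence document falls through to review. The name `decide` is illustrative.

```python
def decide(quality: float, confidence: float) -> str:
    """Map an aggregated quality score and confidence to a curation decision."""
    if quality >= 0.75 and confidence >= 0.70:
        return "include"
    if quality >= 0.55:
        # Covers mid-range quality and high quality with low confidence.
        return "review"
    return "exclude"
```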
Quality Thresholds
Recommended thresholds for golden dataset inclusion:

    minimum_quality_score: 0.70
    minimum_confidence: 0.65
    required_tags: 2      # At least 2 domain tags
    required_queries: 3   # At least 3 test queries
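One way to apply these thresholds is a check that returns every reason a candidate fails, so that a reviewer sees all problems at once. This is a sketch under the assumption that tags are counted as distinct values; `InclusionThresholds` and `inclusion_problems` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InclusionThresholds:
    # Defaults copied from the recommended thresholds above.
    minimum_quality_score: float = 0.70
    minimum_confidence: float = 0.65
    required_tags: int = 2
    required_queries: int = 3


def inclusion_problems(
    quality: float,
    confidence: float,
    tags: list[str],
    queries: list[str],
    thresholds: InclusionThresholds = InclusionThresholds(),
) -> list[str]:
    """Return the reasons a candidate fails; an empty list means it passes."""
    problems = []
    if quality < thresholds.minimum_quality_score:
        problems.append(f"quality {quality:.2f} below {thresholds.minimum_quality_score:.2f}")
    if confidence < thresholds.minimum_confidence:
        problems.append(f"confidence {confidence:.2f} below {thresholds.minimum_confidence:.2f}")
    if len(set(tags)) < thresholds.required_tags:
        problems.append(f"needs at least {thresholds.required_tags} domain tags")
    if len(queries) < thresholds.required_queries:
        problems.append(f"needs at least {thresholds.required_queries} test queries")
    return problems
```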
Coverage Balance Guidelines
Maintain balanced coverage across:
- Content types: Don't over-index on articles
- Difficulty levels: Need trivial AND hard queries
- Domains: Spread across AI/ML, backend, frontend, etc.
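A quick way to spot imbalance is to compute each value's share of the dataset for a given field. The sketch below assumes entries are dicts carrying fields such as `content_type` or `difficulty`. The 50% cutoff is an arbitrary illustrative default, since the guidelines do not specify a number.

```python
from collections import Counter


def coverage_shares(entries: list[dict], field: str) -> dict[str, float]:
    """Fraction of entries holding each value of `field`."""
    counts = Counter(entry[field] for entry in entries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def over_represented(entries: list[dict], field: str, max_share: float = 0.5) -> list[str]:
    """Values whose share exceeds `max_share` (hypothetical cutoff)."""
    shares = coverage_shares(entries, field)
    return sorted(value for value, share in shares.items() if share > max_share)
```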
Duplicate Prevention Checklist
Before adding:
- Check URL against existing source_url_map.json
- Run semantic similarity against existing document embeddings
- Warn if >80% similar to existing document
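The checklist can be sketched as a two-step check: an exact URL match first, then a similarity scan. This example assumes ">80% similar" means cosine similarity above 0.80 and takes the known URLs and embeddings as plain in-memory arguments, since the layout of source_url_map.json is not described here. In practice the similarity scan would more likely run inside the vector store.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def check_duplicate(
    url: str,
    embedding: list[float],
    known_urls: set[str],
    known_embeddings: dict[str, list[float]],
    threshold: float = 0.80,  # assumption: ">80% similar" as cosine similarity
) -> dict:
    """Return a status plus the IDs of any near-duplicate documents."""
    if url in known_urls:
        return {"status": "duplicate_url", "similar_to": []}
    similar = [
        doc_id
        for doc_id, existing in known_embeddings.items()
        if cosine_similarity(embedding, existing) > threshold
    ]
    return {"status": "warn_similar" if similar else "ok", "similar_to": similar}
```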
Provenance Tracking
Always record:
- Source URL (canonical)
- Curation date
- Agent scores (for audit trail)
- Langfuse trace ID
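A provenance record holding these four fields might look like the sketch below. The field names and the ISO date format are assumptions for illustration, and the caller is expected to pass an already canonicalized URL.

```python
from dataclasses import asdict, dataclass
from datetime import date
from typing import Optional


@dataclass
class Provenance:
    source_url: str                 # canonical URL, normalized by the caller
    curation_date: str              # ISO 8601 date, e.g. "2025-12-01"
    agent_scores: dict[str, float]  # per-agent or per-dimension scores for audit
    langfuse_trace_id: str


def make_provenance(
    source_url: str,
    agent_scores: dict[str, float],
    trace_id: str,
    today: Optional[date] = None,
) -> dict:
    """Build a JSON-serializable provenance record; `today` is injectable for tests."""
    curated_on = (today or date.today()).isoformat()
    return asdict(Provenance(source_url, curated_on, dict(agent_scores), trace_id))
```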
Langfuse Integration
Trace Structure
    trace = langfuse.trace(
        name="golden-dataset-curation",
        metadata={"source_url": url, "document_id": doc_id},
    )

    # Log individual dimension scores
    trace.score(name="accuracy", value=0.85)
    trace.score(name="coherence", value=0.90)
    trace.score(name="depth", value=0.78)
    trace.score(name="relevance", value=0.92)

    # Final aggregated score
    trace.score(name="quality_total", value=0.87)
    trace.event(name="curation_decision", metadata={"decision": "include"})
Managed Prompts
| Prompt Name | Purpose |
|-------------|---------|
| golden-content-classifier | Classify content_type |
| golden-difficulty-classifier | Assign difficulty |
| golden-domain-tagger | Extract tags |
| golden-query-generator | Generate test queries |
References
For detailed implementation patterns, see:
- references/selection-criteria.md - Content type classification, difficulty stratification, quality evaluation dimensions, and best practices
- references/annotation-patterns.md - Multi-agent pipeline architecture, agent specifications, consensus aggregation logic, and Langfuse integration
Related Skills
- golden-dataset-management - Backup/restore operations
- golden-dataset-validation - Validation rules and checks
- langfuse-observability - Tracing patterns
- pgvector-search - Duplicate detection
Version: 1.0.0 (December 2025) Issue: #599
Capability Details
content-classification
Keywords: content type, classification, document type, golden dataset
Solves:
- Classify document content types for golden dataset
- Categorize entries by domain and purpose
- Identify content requiring special handling
difficulty-stratification
Keywords: difficulty, stratification, complexity level, challenge rating
Solves:
- Assign difficulty levels to golden dataset entries
- Ensure balanced difficulty distribution
- Identify edge cases and challenging examples
quality-evaluation
Keywords: quality, evaluation, quality dimensions, quality criteria
Solves:
- Evaluate entry quality against defined criteria
- Score entries on multiple quality dimensions
- Identify entries needing improvement
multi-agent-analysis
Keywords: multi-agent, parallel analysis, consensus, agent evaluation
Solves:
- Run parallel agent evaluations on entries
- Aggregate consensus from multiple analysts
- Resolve disagreements in classifications
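As one illustration of resolving classification disagreements, the sketch below uses a simple majority vote and reports the agreement ratio as a rough confidence. This is a stand-in technique, not the consensus logic documented in references/annotation-patterns.md; `resolve_label` and its `min_agreement` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional


def resolve_label(votes: list[str], min_agreement: float = 0.5) -> tuple[Optional[str], float]:
    """Majority vote over agent labels.

    Returns (label, agreement). The label is None when no label is backed by
    strictly more than `min_agreement` of the votes, which signals that the
    entry needs human review.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement > min_agreement:
        return label, agreement
    return None, agreement
```

With the default cutoff, an even split such as one `easy` vote and one `hard` vote yields no label, so ties are never resolved silently.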