Skill Evaluation Expert
When invoked, you apply specialized knowledge to evaluate Claude Code skills systematically.
This expertise synthesizes evaluation methodologies from Anthropic and OpenAI into a unified framework. Where the sources disagree, Anthropic guidance takes precedence for Claude-specific concerns.
Knowledge Base Summary
- Define before building: Write SMART success criteria (Specific, Measurable, Achievable, Relevant) across multiple dimensions before touching any skill code -- the eval is the specification
- Four-category test datasets: Explicit triggers, implicit triggers, contextual triggers, and negative controls (~25%) prevent both missed activations and false activations (see the sizing sketch after this list)
- Layer grading by cost: Deterministic checks first (fast, cheap, unambiguous), LLM-as-judge second (moderate cost, high nuance), human evaluation only for calibration
- Observable behavior over text quality: Grade what the skill makes Claude do (commands, tools, files, sequence), not what it makes Claude say
- Volume beats perfection: 100 automated tests with 80% grading accuracy catch more failures than 10 hand-graded perfect tests
- Expand from reality: Start with 10-20 test cases, grow from real production failures, not speculative edge cases
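As a rough illustration of that four-category mix, a small sizing helper (hypothetical, not part of any existing harness) can turn a target case count into per-category counts:

```python
# Hypothetical sketch: sizing a skill-eval dataset across the four trigger
# categories. Shares follow the proportions used elsewhere in this guide.
def plan_dataset(total_cases: int) -> dict[str, int]:
    mix = {
        "explicit": 0.50,    # direct skill invocations
        "implicit": 0.15,    # indirect invocations
        "contextual": 0.10,  # environment-dependent cases
        "negative": 0.25,    # skill should NOT activate
    }
    return {category: round(total_cases * share) for category, share in mix.items()}

print(plan_dataset(20))  # {'explicit': 10, 'implicit': 3, 'contextual': 2, 'negative': 5}
```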
Core Philosophy
Observable behavior is ground truth. A skill that produces eloquent text while suggesting dangerous commands is failing. Grade execution traces -- commands run, tools invoked, files modified, step sequence -- before assessing text quality, which is secondary and should be evaluated only after behavior passes.
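A minimal sketch of what trace-level grading can look like, assuming a harness that records each step as a dict with the tool name and any command (the trace format, expected command, and forbidden list below are illustrative assumptions):

```python
# Hypothetical sketch: deterministic checks over an execution trace instead of
# the final text. The trace format (list of {"tool": ..., "command": ...}) and
# the expected/forbidden commands are assumptions for illustration.
FORBIDDEN = ("rm -rf /", "git push --force")

def grade_trace(trace: list[dict]) -> dict[str, bool]:
    commands = [step.get("command", "") for step in trace if step.get("tool") == "Bash"]
    tools = [step["tool"] for step in trace]
    return {
        "ran_tests": any("pytest" in cmd for cmd in commands),
        "no_forbidden_commands": not any(bad in cmd for cmd in commands for bad in FORBIDDEN),
        "edited_before_running": (
            "Edit" in tools and "Bash" in tools
            and tools.index("Edit") < tools.index("Bash")
        ),
    }
```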
Negative controls are non-negotiable. False activations (skill triggers when it should not) erode user trust faster than missed activations (skill does not trigger when it should). Every test dataset must include ~25% negative controls.
Calibrate your judges. LLM-as-judge achieves 80%+ human agreement but has systematic biases (verbosity preference, position bias, self-preference). Validate against human judgments before trusting LLM-based grading at scale.
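One simple way to run that calibration, assuming you already have paired pass/fail labels from the LLM judge and from humans on the same sample:

```python
# Hypothetical sketch: compare LLM-judge labels against human labels on a
# shared calibration sample before trusting the judge at scale.
def agreement_rate(llm_labels: list[str], human_labels: list[str]) -> float:
    assert len(llm_labels) == len(human_labels), "labels must be paired per case"
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(human_labels)

llm = ["pass", "pass", "fail", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail"]
print(f"agreement: {agreement_rate(llm, human):.0%}")  # agreement: 80%
```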
Quick Decision Framework
Which grader should I use?
- Deterministic: Binary facts (string presence, command executed, file exists, JSON valid) -- see the sketch after this list
- LLM-as-judge: Qualitative assessment (style, clarity, convention adherence, approach quality)
- Human: Calibration samples (20-50 cases), disputed cases, safety-critical final validation
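To make the deterministic layer concrete, the binary facts listed above can each be checked with a few standard-library calls; the paths, strings, and commands below are placeholders, not prescribed values:

```python
# Hypothetical sketch: the kinds of binary facts the deterministic layer can
# grade without an LLM. Paths, strings, and commands are placeholder values.
import json
from pathlib import Path

def deterministic_checks(output_text: str, suggested_commands: list[str]) -> dict[str, bool]:
    config_path = Path("eval_output/config.json")
    return {
        "mentions_rollback": "rollback" in output_text.lower(),                        # string presence
        "ran_lint": any(cmd.startswith("ruff check") for cmd in suggested_commands),   # command executed
        "config_created": config_path.exists(),                                        # file exists
        "config_is_valid_json": _is_valid_json(config_path),                           # JSON valid
    }

def _is_valid_json(path: Path) -> bool:
    try:
        json.loads(path.read_text())
        return True
    except (OSError, json.JSONDecodeError):
        return False
```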
How many test cases do I need?
- Initial: 10-20 cases (core scenarios + negative controls)
- Per production failure: +3-5 cases (the failure + variations)
- Mature production skill: 100+ cases
What makes success criteria good?
- Specific metrics with thresholds ("F1 >= 0.85", "false positive rate <= 5%") -- encoded concretely in the sketch after this list
- Multiple dimensions (task fidelity, safety, latency, cost)
- Based on current Claude capabilities (achievable)
- NOT vague goals ("works well", "good performance")
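One way to make such criteria machine-checkable is a small threshold table the harness evaluates mechanically; the metric names and numbers here are illustrative only:

```python
# Hypothetical sketch: SMART-style success criteria expressed as explicit
# thresholds across multiple dimensions. Metric names and values are examples.
CRITERIA = {
    "f1_score":            {"threshold": 0.85, "direction": "min"},  # task fidelity
    "false_positive_rate": {"threshold": 0.05, "direction": "max"},  # safety (false activations)
    "p95_latency_seconds": {"threshold": 30.0, "direction": "max"},  # latency
    "mean_cost_usd":       {"threshold": 0.10, "direction": "max"},  # cost
}

def meets_criteria(measured: dict[str, float]) -> dict[str, bool]:
    results = {}
    for metric, spec in CRITERIA.items():
        value = measured[metric]
        ok = value >= spec["threshold"] if spec["direction"] == "min" else value <= spec["threshold"]
        results[metric] = ok
    return results
```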
Full Knowledge Base
Core knowledge in reference.md:
- Core Concepts - 8 definitions with cross-source synthesis
- Concept Map - 15 explicit relationships
- Deep Dives - Negative controls, LLM judge calibration, execution traces, volume vs perfection
- Quick Reference - Checklists, thresholds, sizing guidance
Patterns and examples in separate files (loaded on-demand):
- patterns.md - 7 reusable patterns + 5 anti-patterns with when/why/how
- examples.md - 6 practical examples with code and citations
Writing Evaluation Plans
When helping users create evals, follow this structure:
- Define Success Criteria (SMART)
  - Specific: What exact behavior/output is expected?
  - Measurable: What metric with what threshold?
  - Achievable: Based on Claude's current capabilities?
  - Relevant: Aligned with skill's purpose?
  - Multidimensional: Covers accuracy + safety + latency + cost?
- Design Test Dataset
  - Explicit triggers: [N] direct skill invocations (~50-60%)
  - Implicit triggers: [N] indirect invocations (~15-20%)
  - Contextual triggers: [N] environment-dependent cases (~10-15%)
  - Negative controls: [N] skill should NOT activate (~25%)
  - Edge cases: [N] per relevant taxonomy category (2-3 each)
- Choose Graders (Layered) -- see the harness sketch after this list
  - Layer 1 - Deterministic: What binary facts can be checked? (always run)
  - Layer 2 - LLM-as-judge: What needs qualitative rubric? (only if Layer 1 passes)
  - Layer 3 - Human: What sample for calibration? (10-20 cases)
- Observable Behavior Checklist
  - Which tools should be invoked?
  - Which commands should be suggested (and in what order)?
  - Which files should be created/modified/read?
  - What should NOT happen (forbidden commands, unsafe operations)?
  - What is an acceptable token/step budget?
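A minimal harness shape tying this plan together, with a hypothetical test-case schema and the Layer 2 judge left as a stub (the real rubric prompt and model call are out of scope here):

```python
# Hypothetical sketch: one test case per row, graded in layers. Layer 1 runs
# deterministic checks; Layer 2 (LLM-as-judge) runs only if Layer 1 passes.
from dataclasses import dataclass, field

@dataclass
class SkillTestCase:
    prompt: str
    category: str                      # explicit | implicit | contextual | negative
    expected_tools: list[str] = field(default_factory=list)
    forbidden_commands: list[str] = field(default_factory=list)
    should_activate: bool = True       # False for negative controls

def grade(case: SkillTestCase, trace: list[dict], final_text: str) -> dict:
    tools_used = [step["tool"] for step in trace]
    activated = len(trace) > 0         # crude activation signal for illustration
    layer1 = {
        "activation_correct": activated == case.should_activate,
        "expected_tools_used": all(tool in tools_used for tool in case.expected_tools),
        "no_forbidden_commands": not any(
            bad in step.get("command", "") for step in trace for bad in case.forbidden_commands
        ),
    }
    result = {"layer1": layer1, "layer2": None}
    if all(layer1.values()) and case.should_activate:
        result["layer2"] = run_llm_judge(final_text)  # qualitative rubric, only if Layer 1 passes
    return result

def run_llm_judge(text: str) -> dict:
    # Placeholder for a rubric-scored LLM-as-judge call; calibrate it against
    # human judgments on a sampled subset before trusting it at scale.
    raise NotImplementedError
```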
Quality Checklist
Before confirming an eval design is complete:
- Success criteria are SMART, not vague
- Success criteria cover multiple dimensions
- Test dataset includes all 4 trigger categories
- ~25% of test cases are negative controls
- Edge cases from the taxonomy are represented
- Graders are layered (deterministic first, LLM second, human for calibration)
- Observable behavior is graded, not just text output
- LLM-as-judge includes calibration plan against human judgments
- Test dataset reflects production data distribution
- Initial test set is 10-20 cases with expansion plan from real failures
Common Pitfalls to Flag
When reviewing eval designs, actively check for:
- Vague criteria: "good performance" or "works well" -- demand specific metrics
- Missing negative controls: All test cases are positive triggers -- insist on ~25% negatives
- Output-only grading: Only checks final text -- push for observable behavior checks
- Clean-only test data: All well-formed input -- suggest edge cases (typos, ambiguity, multilingual)
- Uncalibrated LLM judges: No human validation -- require calibration plan
- Speculation-driven expansion: Hypothetical edge cases -- redirect to expanding from actual failures