Skill Evaluation Expert

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "skill-evaluation" with this command: npx skills add williamhallatt/cogworks/williamhallatt-cogworks-skill-evaluation

When invoked, you operate with specialized knowledge in evaluating Claude Code skills systematically.

This expertise synthesizes evaluation methodologies from Anthropic and OpenAI into a unified framework. Where the sources disagree, Anthropic guidance takes precedence for Claude-specific concerns.

Knowledge Base Summary

  • Define before building: Write SMART success criteria (Specific, Measurable, Achievable, Relevant) across multiple dimensions before touching any skill code -- the eval is the specification

  • Four-category test datasets: Explicit triggers, implicit triggers, contextual triggers, and negative controls (~25%) prevent both missed activations and false activations

  • Layer grading by cost: Deterministic checks first (fast, cheap, unambiguous), LLM-as-judge second (moderate cost, high nuance), human evaluation only for calibration

  • Observable behavior over text quality: Grade what the skill makes Claude do (commands, tools, files, sequence) not what it makes Claude say

  • Volume beats perfection: 100 automated tests with 80% grading accuracy catch more failures than 10 hand-graded perfect tests

  • Expand from reality: Start with 10-20 test cases, grow from real production failures, not speculative edge cases
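The four-category dataset shape above can be sketched as a minimal harness. Everything here is illustrative: the dataclass shape, the hypothetical "commit-helper" skill, and the prompts are assumptions, not part of any real index.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    category: str          # "explicit" | "implicit" | "contextual" | "negative"
    should_activate: bool  # False for negative controls

# Starter dataset for a hypothetical "commit-helper" skill.
cases = [
    TestCase("Use the commit helper for this change", "explicit", True),
    TestCase("Help me describe these staged changes", "implicit", True),
    TestCase("I'm in a repo with staged files; what now?", "contextual", True),
    TestCase("Explain what git rebase does", "negative", False),
]

# Enforce the ~25% negative-control guideline structurally.
negative_ratio = sum(not c.should_activate for c in cases) / len(cases)
assert negative_ratio == 0.25
```

Encoding the category as data (rather than convention) lets the harness itself verify the negative-control ratio as the set grows.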

Core Philosophy

Observable behavior is ground truth. A skill that produces eloquent text while suggesting dangerous commands is failing. Grade execution traces -- commands run, tools invoked, files modified, step sequence -- before assessing text quality. Text quality is secondary and should only be evaluated after behavior passes.

Negative controls are non-negotiable. False activations (skill triggers when it should not) erode user trust faster than missed activations (skill does not trigger when it should). Every test dataset must include ~25% negative controls.

Calibrate your judges. LLM-as-judge achieves 80%+ human agreement but has systematic biases (verbosity preference, position bias, self-preference). Validate against human judgments before trusting LLM-based grading at scale.
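The calibration step can be sketched as a simple agreement check, assuming judge and human verdicts have already been collected as one boolean per calibration case (the values below are made up):

```python
# Fraction of cases where the LLM judge matches the human label.
def agreement_rate(llm_verdicts, human_verdicts):
    matches = sum(l == h for l, h in zip(llm_verdicts, human_verdicts))
    return matches / len(human_verdicts)

llm   = [True, True, False, True,  False, True, True, False, True, True]
human = [True, True, False, False, False, True, True, False, True, True]

rate = agreement_rate(llm, human)
assert rate == 0.9  # at or above ~80%, LLM grading is trustworthy at scale
```

In practice you would also inspect the disagreements themselves, since systematic biases (verbosity, position, self-preference) cluster rather than distribute randomly.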

Quick Decision Framework

Which grader should I use?

  • Deterministic: Binary facts (string presence, command executed, file exists, JSON valid)

  • LLM-as-judge: Qualitative assessment (style, clarity, convention adherence, approach quality)

  • Human: Calibration samples (20-50 cases), disputed cases, safety-critical final validation
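The deterministic tier can be written as plain predicate functions. The three checks below (string presence, JSON validity, file existence) are illustrative assumptions about what a Layer-1 harness might verify, not a fixed API:

```python
import json
from pathlib import Path

# Layer-1 predicates: each returns a binary fact, no model call involved.
def contains_string(output: str, needle: str) -> bool:
    return needle in output

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def file_exists(path: str) -> bool:
    return Path(path).exists()

assert contains_string("ran `git status` first", "git status")
assert is_valid_json('{"activated": true}')
assert not is_valid_json("not json at all")
```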

How many test cases do I need?

  • Initial: 10-20 cases (core scenarios + negative controls)

  • Per production failure: +3-5 cases (the failure + variations)

  • Mature production skill: 100+ cases

What makes success criteria good?

  • Specific metrics with thresholds ("F1 >= 0.85", "false positive rate <= 5%")

  • Multiple dimensions (task fidelity, safety, latency, cost)

  • Based on current Claude capabilities (achievable)

  • NOT vague goals ("works well", "good performance")
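One way to keep criteria specific and measurable is to encode each threshold as data rather than prose. The dimension names and numbers below are illustrative assumptions:

```python
import operator

# SMART criteria as explicit, checkable thresholds, one per dimension.
CRITERIA = {
    "task_fidelity_f1":    (operator.ge, 0.85),
    "false_positive_rate": (operator.le, 0.05),
    "p95_latency_seconds": (operator.le, 30.0),
    "cost_per_run_usd":    (operator.le, 0.10),
}

def passes(measured: dict) -> dict:
    """Return a pass/fail verdict per dimension."""
    return {name: op(measured[name], threshold)
            for name, (op, threshold) in CRITERIA.items()}

results = passes({"task_fidelity_f1": 0.88, "false_positive_rate": 0.03,
                  "p95_latency_seconds": 22.0, "cost_per_run_usd": 0.04})
assert all(results.values())
```

A criteria table like this doubles as the report schema: every eval run produces one verdict per dimension, never a single undifferentiated "pass".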

Full Knowledge Base

Core knowledge in reference.md:

  • Core Concepts - 8 definitions with cross-source synthesis

  • Concept Map - 15 explicit relationships

  • Deep Dives - Negative controls, LLM judge calibration, execution traces, volume vs perfection

  • Quick Reference - Checklists, thresholds, sizing guidance

Patterns and examples in separate files (loaded on-demand):

  • patterns.md - 7 reusable patterns + 5 anti-patterns with when/why/how

  • examples.md - 6 practical examples with code and citations

Writing Evaluation Plans

When helping users create evals, follow this structure:

  1. Define Success Criteria (SMART)

  • Specific: What exact behavior/output is expected?
  • Measurable: What metric, with what threshold?
  • Achievable: Based on Claude's current capabilities?
  • Relevant: Aligned with the skill's purpose?
  • Multidimensional: Covers accuracy + safety + latency + cost?

  2. Design Test Dataset

  • Explicit triggers: [N] direct skill invocations (~50-60%)
  • Implicit triggers: [N] indirect invocations (~15-20%)
  • Contextual triggers: [N] environment-dependent cases (~10-15%)
  • Negative controls: [N] cases where the skill should NOT activate (~25%)
  • Edge cases: [N] per relevant taxonomy category (2-3 each)

  3. Choose Graders (Layered)

  • Layer 1 - Deterministic: What binary facts can be checked? (always run)
  • Layer 2 - LLM-as-judge: What needs a qualitative rubric? (only if Layer 1 passes)
  • Layer 3 - Human: What sample for calibration? (10-20 cases)
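The layers can be composed so that cheaper checks gate more expensive ones. This is a sketch, not a prescribed harness; `judge_fn` is a placeholder for a real model call:

```python
# Layered grading: cheap deterministic checks gate the costly judge.
def grade(case, result, deterministic_checks, judge_fn=None):
    for check in deterministic_checks:            # Layer 1: always run
        if not check(result):
            return {"passed": False, "layer": "deterministic",
                    "check": check.__name__}
    if judge_fn is not None:                      # Layer 2: only if Layer 1 passes
        return {"passed": judge_fn(case, result), "layer": "llm_judge"}
    return {"passed": True, "layer": "deterministic"}

def mentions_commit(result: str) -> bool:
    return "git commit" in result

# Fails fast at Layer 1, so a judge (if any) is never invoked.
outcome = grade(case=None, result="run `ls`",
                deterministic_checks=[mentions_commit])
assert outcome["passed"] is False and outcome["check"] == "mentions_commit"
```

Returning the failing layer and check name makes failures diagnosable without re-running the eval.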

  4. Observable Behavior Checklist
  • Which tools should be invoked?
  • Which commands should be suggested (and in what order)?
  • Which files should be created/modified/read?
  • What should NOT happen (forbidden commands, unsafe operations)?
  • What is an acceptable token/step budget?
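A checklist like this can be turned directly into assertions over an execution trace. The trace shape, values, and forbidden-command list below are assumptions for illustration:

```python
# Assumed trace shape captured from a skill run (illustrative values).
trace = {
    "tools": ["Read", "Bash"],
    "commands": ["git diff --staged", "git commit -m 'fix: typo'"],
    "files_modified": [],
    "steps": 6,
}

FORBIDDEN = ["rm -rf", "git push --force"]   # what must NOT happen

assert "Bash" in trace["tools"]                         # expected tool invoked
assert trace["commands"][0].startswith("git diff")      # diff reviewed first
assert not any(bad in cmd for cmd in trace["commands"]
               for bad in FORBIDDEN)                    # no unsafe operations
assert trace["steps"] <= 10                             # within step budget
```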

Quality Checklist

Before confirming an eval design is complete:

  • Success criteria are SMART, not vague

  • Success criteria cover multiple dimensions

  • Test dataset includes all 4 trigger categories

  • ~25% of test cases are negative controls

  • Edge cases from the taxonomy are represented

  • Graders are layered (deterministic first, LLM second, human for calibration)

  • Observable behavior is graded, not just text output

  • LLM-as-judge includes calibration plan against human judgments

  • Test dataset reflects production data distribution

  • Initial test set is 10-20 cases with expansion plan from real failures

Common Pitfalls to Flag

When reviewing eval designs, actively check for:

  • Vague criteria: "good performance" or "works well" -- demand specific metrics

  • Missing negative controls: All test cases are positive triggers -- insist on ~25% negatives

  • Output-only grading: Only checks final text -- push for observable behavior checks

  • Clean-only test data: All well-formed input -- suggest edge cases (typos, ambiguity, multilingual)

  • Uncalibrated LLM judges: No human validation -- require calibration plan

  • Speculation-driven expansion: Hypothetical edges -- redirect to expanding from actual failures

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. No summaries are provided by the upstream source, and each entry is flagged "Needs Review":

  • cogworks (General)
  • cogworks-learn (General)
  • claude-prompt-engineering (General)
  • cogworks-encode (Coding)