evals

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "evals" with this command: npx skills add danielmiessler/personal_ai_infrastructure/danielmiessler-personal-ai-infrastructure-evals

Customization

Before executing, check for user customizations at: ~/.claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &

Output text notification:

Running the WorkflowName workflow in the Evals skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.

When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"

  • "regression test", "capability test"

  • Compare agent behaviors across changes

  • Validate agent workflows before deployment

  • Verify ALGORITHM ISC rows

  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
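A code-based grader can be as small as a pure function from output to verdict. A minimal sketch of the first two graders in the table, assuming an illustrative result shape (the skill's real definitions live in Types/index.ts):

```typescript
// Illustrative result shape; the real type definitions live in Types/index.ts.
type GraderResult = { pass: boolean; score: number };

// string_match: exact substring check — deterministic, cheap, reproducible.
function stringMatch(output: string, expected: string): GraderResult {
  const pass = output.includes(expected);
  return { pass, score: pass ? 1 : 0 };
}

// regex_match: pattern check for outputs with variable wording.
function regexMatch(output: string, pattern: RegExp): GraderResult {
  const pass = pattern.test(output);
  return { pass, score: pass ? 1 : 0 };
}

console.log(stringMatch("Auth bypass fixed in commit abc123", "Auth bypass fixed").pass); // true
```

The brittleness the table mentions is visible here: any rewording of the expected substring fails the check, which is why nuanced quality judgments route to model-based graders instead.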

Evaluation Types

| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least 1 success in k trials (measures capability)

  • pass^k: Probability all k trials succeed (measures consistency/reliability)
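Under the simplifying assumption of independent trials with a single per-trial success rate p, both metrics reduce to one-line formulas (real agent trials are often correlated, so treat this as an idealized sketch):

```typescript
// pass@k: probability of at least one success in k independent trials.
// Measures capability — can the agent do it at all?
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that all k independent trials succeed.
// Measures consistency — does the agent do it every time?
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// With p = 0.5 and k = 3: pass@3 = 0.875 but pass^3 = 0.125 —
// the same agent looks capable or unreliable depending on the metric.
console.log(passAtK(0.5, 3), passHatK(0.5, 3));
```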

Workflow Routing

| Trigger | Workflow |
|---|---|
| "run evals", "evaluate suite" | Run suite via Tools/AlgorithmBridge.ts |
| "log failure" | Log failure via Tools/FailureToTask.ts log |
| "convert failures" | Convert to tasks via Tools/FailureToTask.ts convert-all |
| "create suite" | Create suite via Tools/SuiteManager.ts create |
| "check saturation" | Check via Tools/SuiteManager.ts check-saturation |
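The routing above amounts to a keyword-to-tool lookup. A minimal sketch, with trigger phrases and tool paths taken from the table and the matching logic itself an illustrative assumption:

```typescript
// Map trigger phrases (from the routing table) to the tool each workflow runs.
const routes: Array<{ triggers: string[]; command: string }> = [
  { triggers: ["run evals", "evaluate suite"], command: "Tools/AlgorithmBridge.ts" },
  { triggers: ["log failure"], command: "Tools/FailureToTask.ts log" },
  { triggers: ["convert failures"], command: "Tools/FailureToTask.ts convert-all" },
  { triggers: ["create suite"], command: "Tools/SuiteManager.ts create" },
  { triggers: ["check saturation"], command: "Tools/SuiteManager.ts check-saturation" },
];

// First route whose trigger phrase appears in the request wins.
function route(request: string): string | undefined {
  const text = request.toLowerCase();
  return routes.find(r => r.triggers.some(t => text.includes(t)))?.command;
}

console.log(route("please run evals on the auth suite")); // "Tools/AlgorithmBridge.ts"
```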

Quick Reference

CLI Commands

Run an eval suite

bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

Log a failure for later conversion

bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

Convert failures to test tasks

bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

Manage suites

bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

Run eval and update ISC row

bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|---|---|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
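The core of a tool_calls grader — did the transcript invoke the expected tools, in order — can be sketched as an in-order subsequence match. The transcript shape here is a simplified stand-in; the skill's real capture format lives in Tools/TranscriptCapture.ts:

```typescript
// Simplified stand-in for a captured transcript entry.
type ToolCall = { tool: string; args?: unknown };

// Check that `expected` appears as an in-order subsequence of the calls made.
// Unrelated calls in between are allowed — we grade outcomes, not exact paths.
function toolSequenceMatches(calls: ToolCall[], expected: string[]): boolean {
  let i = 0;
  for (const call of calls) {
    if (i < expected.length && call.tool === expected[i]) i++;
  }
  return i === expected.length;
}

const transcript: ToolCall[] = [
  { tool: "read_file" },
  { tool: "grep" },       // extra exploratory call — still passes
  { tool: "edit_file" },
  { tool: "run_tests" },
];
console.log(toolSequenceMatches(transcript, ["read_file", "edit_file", "run_tests"])); // true
```

Allowing gaps in the sequence is a deliberate choice that lines up with the "grade outputs, not paths" principle below: the agent is free to explore, as long as the required steps happen in order.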

Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
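pairwise_comparison guards against position bias by judging each pair twice with the order swapped. A sketch of just the aggregation step, with the judge abstracted as a callback (the real grader calls an LLM; this shape is an assumption):

```typescript
type Verdict = "A" | "B" | "tie";

// judge(first, second) returns which *position* won.
// Judge twice with positions swapped; only agreement counts as a win.
function pairwiseWithSwap(
  candidate: string,
  reference: string,
  judge: (first: string, second: string) => Verdict,
): Verdict {
  const forward = judge(candidate, reference);  // candidate in position A
  const backward = judge(reference, candidate); // candidate in position B
  const candWinsForward = forward === "A";
  const candWinsBackward = backward === "B";
  if (candWinsForward && candWinsBackward) return "A"; // candidate wins both orderings
  if (!candWinsForward && !candWinsBackward && forward !== "tie" && backward !== "tie") {
    return "B"; // reference wins both orderings
  }
  return "tie"; // disagreement suggests position bias; don't count it
}

// Toy judge preferring the longer answer, to exercise the aggregation:
console.log(pairwiseWithSwap("a detailed, correct answer", "short answer",
  (a, b) => a.length > b.length ? "A" : "B")); // "A"
```

A judge that always favors position A, for instance, produces contradictory verdicts under the swap and collapses to a tie instead of silently biasing the comparison.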

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.

Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

graders:
  - type: binary_tests
    required: [test_empty_pw.py]
    weight: 0.30

  - type: tool_calls
    weight: 0.20
    params:
      sequence: [read_file, edit_file, run_tests]

  - type: llm_rubric
    weight: 0.50
    params:
      rubric: prompts/security_review.md

trials: 3
pass_threshold: 0.75
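A natural reading of the schema is that a trial passes when the weight-averaged grader scores clear pass_threshold. That aggregation is an assumption here (the real runner is Tools/TrialRunner.ts), but with the weights above it sketches as:

```typescript
type GraderScore = { weight: number; score: number }; // score in [0, 1]

// Weighted average of grader scores vs. the task's pass threshold.
// Assumed semantics — the skill's actual runner may combine scores differently.
function trialPasses(scores: GraderScore[], passThreshold: number): boolean {
  const totalWeight = scores.reduce((sum, g) => sum + g.weight, 0);
  const weighted = scores.reduce((sum, g) => sum + g.weight * g.score, 0);
  return weighted / totalWeight >= passThreshold;
}

// Tests pass (1.0), tool sequence matched (1.0), rubric scored 0.6:
// (0.30*1 + 0.20*1 + 0.50*0.6) / 1.0 = 0.80 >= 0.75 → pass.
console.log(trialPasses(
  [{ weight: 0.30, score: 1 }, { weight: 0.20, score: 1 }, { weight: 0.50, score: 0.6 }],
  0.75,
)); // true
```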

Resource Index

| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  • Start with 20-50 real failures - Don't overthink, capture what actually broke

  • Unambiguous tasks - Two experts should reach identical verdicts

  • Balanced problem sets - Test both "should do" AND "should NOT do"

  • Grade outputs, not paths - Don't penalize valid creative solutions

  • Calibrate LLM judges - Against human expert judgment

  • Check transcripts regularly - Verify graders work correctly

  • Monitor saturation - Graduate to regression when hitting 95%+

  • Build infrastructure early - Evals shape how quickly you can adopt new models
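The saturation principle in the last bullets can be sketched as a rolling pass-rate check: once a capability suite holds at 95%+ it is a graduation candidate. The 0.95 cutoff comes from the text; the window size is an assumption for illustration:

```typescript
// Flag a capability suite as saturated when the pass rate over the most
// recent `window` runs reaches the threshold (95%+ per the guidance above).
function isSaturated(passHistory: boolean[], window = 20, threshold = 0.95): boolean {
  if (passHistory.length < window) return false; // not enough data to judge yet
  const recent = passHistory.slice(-window);
  const rate = recent.filter(Boolean).length / recent.length;
  return rate >= threshold;
}

// 19 passes and 1 failure in the last 20 runs → rate 0.95 → saturated.
const history: boolean[] = Array(19).fill(true).concat([false]);
console.log(isSaturated(history)); // true
```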

Related

  • ALGORITHM: Evals is a verification method

  • Science: Evals implements scientific method

  • Browser: For visual verification graders

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
