evals

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "evals" with this command: npx skills add danielmiessler/personal_ai_infrastructure/danielmiessler-personal-ai-infrastructure-evals

Customization

Before executing, check for user customizations at: ~/.claude/skills/CORE/USER/SKILLCUSTOMIZATIONS/Evals/

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.

🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

You MUST send this notification BEFORE doing anything else when this skill is invoked.

Send voice notification:

curl -s -X POST http://localhost:8888/notify \
  -H "Content-Type: application/json" \
  -d '{"message": "Running the WORKFLOWNAME workflow in the Evals skill to ACTION"}' \
  > /dev/null 2>&1 &

Output text notification:

Running the WorkflowName workflow in the Evals skill to ACTION...

This is not optional. Execute this curl command immediately upon skill invocation.

Evals - AI Agent Evaluation Framework

Comprehensive agent evaluation system based on Anthropic's "Demystifying Evals for AI Agents" (Jan 2026).

Key differentiator: Evaluates agent workflows (transcripts, tool calls, multi-turn conversations), not just single outputs.

When to Activate

  • "run evals", "test this agent", "evaluate", "check quality", "benchmark"

  • "regression test", "capability test"

  • Compare agent behaviors across changes

  • Validate agent workflows before deployment

  • Verify ALGORITHM ISC rows

  • Create new evaluation tasks from failures

Core Concepts

Three Grader Types

| Type | Strengths | Weaknesses | Use For |
|---|---|---|---|
| Code-based | Fast, cheap, deterministic, reproducible | Brittle, lacks nuance | Tests, state checks, tool verification |
| Model-based | Flexible, captures nuance, scalable | Non-deterministic, expensive | Quality rubrics, assertions, comparisons |
| Human | Gold standard, handles subjectivity | Expensive, slow | Calibration, spot checks, A/B testing |
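A code-based grader can be as small as a pure function from output to verdict. A minimal sketch of the first two graders in the table, assuming an illustrative result shape (the skill's real definitions live in Types/index.ts):

```typescript
// Illustrative result shape; the real type definitions live in Types/index.ts.
type GraderResult = { pass: boolean; score: number };

// string_match: exact substring check — deterministic, cheap, reproducible.
function stringMatch(output: string, expected: string): GraderResult {
  const pass = output.includes(expected);
  return { pass, score: pass ? 1 : 0 };
}

// regex_match: pattern check for outputs with variable wording.
function regexMatch(output: string, pattern: RegExp): GraderResult {
  const pass = pattern.test(output);
  return { pass, score: pass ? 1 : 0 };
}

console.log(stringMatch("Auth bypass fixed in commit abc123", "Auth bypass fixed").pass); // true
```

The brittleness the table mentions is visible here: any rewording of the expected substring fails the check, which is why nuanced quality judgments route to model-based graders instead.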

Evaluation Types

| Type | Pass Target | Purpose |
|---|---|---|
| Capability | ~70% | Stretch goals, measuring improvement potential |
| Regression | ~99% | Quality gates, detecting backsliding |

Key Metrics

  • pass@k: Probability of at least 1 success in k trials (measures capability)

  • pass^k: Probability all k trials succeed (measures consistency/reliability)
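Under the simplifying assumption of independent trials with a single per-trial success rate p, both metrics reduce to one-line formulas (real agent trials are often correlated, so treat this as an idealized sketch):

```typescript
// pass@k: probability of at least one success in k independent trials.
// Measures capability — can the agent do it at all?
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that all k independent trials succeed.
// Measures consistency — does the agent do it every time?
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// With p = 0.5 and k = 3: pass@3 = 0.875 but pass^3 = 0.125 —
// the same agent looks capable or unreliable depending on the metric.
console.log(passAtK(0.5, 3), passHatK(0.5, 3));
```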

Workflow Routing

| Trigger | Workflow |
|---|---|
| "run evals", "evaluate suite" | Run suite via Tools/AlgorithmBridge.ts |
| "log failure" | Log failure via Tools/FailureToTask.ts log |
| "convert failures" | Convert to tasks via Tools/FailureToTask.ts convert-all |
| "create suite" | Create suite via Tools/SuiteManager.ts create |
| "check saturation" | Check via Tools/SuiteManager.ts check-saturation |
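The routing above amounts to a keyword-to-tool lookup. A minimal sketch, with trigger phrases and tool paths taken from the table and the matching logic itself an illustrative assumption:

```typescript
// Map trigger phrases (from the routing table) to the tool each workflow runs.
const routes: Array<{ triggers: string[]; command: string }> = [
  { triggers: ["run evals", "evaluate suite"], command: "Tools/AlgorithmBridge.ts" },
  { triggers: ["log failure"], command: "Tools/FailureToTask.ts log" },
  { triggers: ["convert failures"], command: "Tools/FailureToTask.ts convert-all" },
  { triggers: ["create suite"], command: "Tools/SuiteManager.ts create" },
  { triggers: ["check saturation"], command: "Tools/SuiteManager.ts check-saturation" },
];

// First route whose trigger phrase appears in the request wins.
function route(request: string): string | undefined {
  const text = request.toLowerCase();
  return routes.find(r => r.triggers.some(t => text.includes(t)))?.command;
}

console.log(route("please run evals on the auth suite")); // "Tools/AlgorithmBridge.ts"
```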

Quick Reference

CLI Commands

Run an eval suite

bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s <suite>

Log a failure for later conversion

bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts log "description" -c category -s severity

Convert failures to test tasks

bun run ~/.claude/skills/Evals/Tools/FailureToTask.ts convert-all

Manage suites

bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts create <name> -t capability -d "description"
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts list
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts check-saturation <name>
bun run ~/.claude/skills/Evals/Tools/SuiteManager.ts graduate <name>

ALGORITHM Integration

Evals is a verification method for THE ALGORITHM ISC rows:

Run eval and update ISC row

bun run ~/.claude/skills/Evals/Tools/AlgorithmBridge.ts -s regression-core -r 3 -u

ISC rows can specify eval verification:

| # | What Ideal Looks Like | Verify |
|---|---|---|
| 1 | Auth bypass fixed | eval:auth-security |
| 2 | Tests all pass | eval:regression |

Available Graders

Code-Based (Fast, Deterministic)

| Grader | Use Case |
|---|---|
| string_match | Exact substring matching |
| regex_match | Pattern matching |
| binary_tests | Run test files |
| static_analysis | Lint, type-check, security scan |
| state_check | Verify system state after execution |
| tool_calls | Verify specific tools were called |
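The core of a tool_calls grader — did the transcript invoke the expected tools, in order — can be sketched as an in-order subsequence match. The transcript shape here is a simplified stand-in; the skill's real capture format lives in Tools/TranscriptCapture.ts:

```typescript
// Simplified stand-in for a captured transcript entry.
type ToolCall = { tool: string; args?: unknown };

// Check that `expected` appears as an in-order subsequence of the calls made.
// Unrelated calls in between are allowed — we grade outcomes, not exact paths.
function toolSequenceMatches(calls: ToolCall[], expected: string[]): boolean {
  let i = 0;
  for (const call of calls) {
    if (i < expected.length && call.tool === expected[i]) i++;
  }
  return i === expected.length;
}

const transcript: ToolCall[] = [
  { tool: "read_file" },
  { tool: "grep" },       // extra exploratory call — still passes
  { tool: "edit_file" },
  { tool: "run_tests" },
];
console.log(toolSequenceMatches(transcript, ["read_file", "edit_file", "run_tests"])); // true
```

Allowing gaps in the sequence is a deliberate choice that lines up with the "grade outputs, not paths" principle below: the agent is free to explore, as long as the required steps happen in order.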

Model-Based (Nuanced)

| Grader | Use Case |
|---|---|
| llm_rubric | Score against detailed rubric |
| natural_language_assert | Check assertions are true |
| pairwise_comparison | Compare to reference with position swap |
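pairwise_comparison guards against position bias by judging each pair twice with the order swapped. A sketch of just the aggregation step, with the judge abstracted as a callback (the real grader calls an LLM; this shape is an assumption):

```typescript
type Verdict = "A" | "B" | "tie";

// judge(first, second) returns which *position* won.
// Judge twice with positions swapped; only agreement counts as a win.
function pairwiseWithSwap(
  candidate: string,
  reference: string,
  judge: (first: string, second: string) => Verdict,
): Verdict {
  const forward = judge(candidate, reference);  // candidate in position A
  const backward = judge(reference, candidate); // candidate in position B
  const candWinsForward = forward === "A";
  const candWinsBackward = backward === "B";
  if (candWinsForward && candWinsBackward) return "A"; // candidate wins both orderings
  if (!candWinsForward && !candWinsBackward && forward !== "tie" && backward !== "tie") {
    return "B"; // reference wins both orderings
  }
  return "tie"; // disagreement suggests position bias; don't count it
}

// Toy judge preferring the longer answer, to exercise the aggregation:
console.log(pairwiseWithSwap("a detailed, correct answer", "short answer",
  (a, b) => a.length > b.length ? "A" : "B")); // "A"
```

A judge that always favors position A, for instance, produces contradictory verdicts under the swap and collapses to a tie instead of silently biasing the comparison.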

Domain Patterns

Pre-configured grader stacks for common agent types:

| Domain | Primary Graders |
|---|---|
| coding | binary_tests + static_analysis + tool_calls + llm_rubric |
| conversational | llm_rubric + natural_language_assert + state_check |
| research | llm_rubric + natural_language_assert + tool_calls |
| computer_use | state_check + tool_calls + llm_rubric |

See Data/DomainPatterns.yaml for full configurations.

Task Schema (YAML)

task:
  id: "fix-auth-bypass_1"
  description: "Fix authentication bypass when password is empty"
  type: regression  # or capability
  domain: coding

graders:
  - type: binary_tests
    required: [test_empty_pw.py]
    weight: 0.30

  - type: tool_calls
    weight: 0.20
    params:
      sequence: [read_file, edit_file, run_tests]

  - type: llm_rubric
    weight: 0.50
    params:
      rubric: prompts/security_review.md

trials: 3
pass_threshold: 0.75
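A natural reading of the schema is that a trial passes when the weight-averaged grader scores clear pass_threshold. That aggregation is an assumption here (the real runner is Tools/TrialRunner.ts), but with the weights above it sketches as:

```typescript
type GraderScore = { weight: number; score: number }; // score in [0, 1]

// Weighted average of grader scores vs. the task's pass threshold.
// Assumed semantics — the skill's actual runner may combine scores differently.
function trialPasses(scores: GraderScore[], passThreshold: number): boolean {
  const totalWeight = scores.reduce((sum, g) => sum + g.weight, 0);
  const weighted = scores.reduce((sum, g) => sum + g.weight * g.score, 0);
  return weighted / totalWeight >= passThreshold;
}

// Tests pass (1.0), tool sequence matched (1.0), rubric scored 0.6:
// (0.30*1 + 0.20*1 + 0.50*0.6) / 1.0 = 0.80 >= 0.75 → pass.
console.log(trialPasses(
  [{ weight: 0.30, score: 1 }, { weight: 0.20, score: 1 }, { weight: 0.50, score: 0.6 }],
  0.75,
)); // true
```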

Resource Index

| Resource | Purpose |
|---|---|
| Types/index.ts | Core type definitions |
| Graders/CodeBased/ | Deterministic graders |
| Graders/ModelBased/ | LLM-powered graders |
| Tools/TranscriptCapture.ts | Capture agent trajectories |
| Tools/TrialRunner.ts | Multi-trial execution with pass@k |
| Tools/SuiteManager.ts | Suite management and saturation |
| Tools/FailureToTask.ts | Convert failures to test tasks |
| Tools/AlgorithmBridge.ts | ALGORITHM integration |
| Data/DomainPatterns.yaml | Domain-specific grader configs |

Key Principles (from Anthropic)

  • Start with 20-50 real failures - Don't overthink, capture what actually broke

  • Unambiguous tasks - Two experts should reach identical verdicts

  • Balanced problem sets - Test both "should do" AND "should NOT do"

  • Grade outputs, not paths - Don't penalize valid creative solutions

  • Calibrate LLM judges - Against human expert judgment

  • Check transcripts regularly - Verify graders work correctly

  • Monitor saturation - Graduate to regression when hitting 95%+

  • Build infrastructure early - Evals shape how quickly you can adopt new models
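The saturation principle in the last bullets can be sketched as a rolling pass-rate check: once a capability suite holds at 95%+ it is a graduation candidate. The 0.95 cutoff comes from the text; the window size is an assumption for illustration:

```typescript
// Flag a capability suite as saturated when the pass rate over the most
// recent `window` runs reaches the threshold (95%+ per the guidance above).
function isSaturated(passHistory: boolean[], window = 20, threshold = 0.95): boolean {
  if (passHistory.length < window) return false; // not enough data to judge yet
  const recent = passHistory.slice(-window);
  const rate = recent.filter(Boolean).length / recent.length;
  return rate >= threshold;
}

// 19 passes and 1 failure in the last 20 runs → rate 0.95 → saturated.
const history: boolean[] = Array(19).fill(true).concat([false]);
console.log(isSaturated(history)); // true
```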

Related

  • ALGORITHM: Evals is a verification method

  • Science: Evals implements scientific method

  • Browser: For visual verification graders

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
