evals

Run and create evals for testing agent behavior. Use when the user wants to create or run an eval.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "evals" with this command: npx skills add adriancooney/evals/adriancooney-evals-evals

Eval Skill

Run and create evals for testing agent behavior.

Discovering Evals

Evals are markdown files matching *.eval.md. Use glob to find them:

**/*.eval.md

A common pattern is to collect evals in an evals/ directory.
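The discovery step can be sketched in Python (an illustrative sketch, not part of the skill itself; `discover_evals` is a hypothetical helper name):

```python
from pathlib import Path

def discover_evals(root="."):
    """Return all *.eval.md files under root, sorted for a stable run order."""
    return sorted(Path(root).glob("**/*.eval.md"))
```

`**` in `Path.glob` matches the root directory itself as well as subdirectories, so evals at the top level and in an `evals/` folder are both found.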

Eval Structure

An eval file contains a prompt and an expectation:

# Eval Title

<prompt>
Instructions for the agent to execute.
</prompt>

<expectation>
Success criteria - describe what must be true for the eval to pass.
</expectation>
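Extracting the two sections from a file with this structure can be sketched with a small parser (the regex approach is an assumption for illustration, not something the skill prescribes):

```python
import re

def parse_eval(text):
    """Pull the <prompt> and <expectation> bodies out of an eval file's text."""
    def section(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"eval file is missing a <{tag}> section")
        return match.group(1).strip()
    return {"prompt": section("prompt"), "expectation": section("expectation")}
```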

Running an Eval

  1. Read the eval file
  2. Extract the <prompt> content
  3. Spawn a subagent with the prompt (runs in current working directory with shared state)
  4. The subagent evaluates its own result against the <expectation> using LLM judgment
  5. Subagent outputs SUCCESS or FAIL with reasoning

When running multiple evals, spawn all subagents in parallel. Report aggregate results at the end.
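The run loop above, including the parallel fan-out, might look like this sketch. `spawn_subagent` is an assumed stand-in for whatever mechanism actually launches a subagent and returns its verdict string:

```python
from concurrent.futures import ThreadPoolExecutor

def run_evals(cases, spawn_subagent):
    """Run every (prompt, expectation) pair in parallel and aggregate verdicts.

    spawn_subagent(prompt, expectation) is assumed to return a string that
    starts with either "SUCCESS" or "FAIL".
    """
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(lambda case: spawn_subagent(*case), cases))
    passed = all(v.startswith("SUCCESS") for v in verdicts)
    ci_line = "eval result: pass" if passed else "eval result: fail"
    return ci_line, verdicts
```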

Always end output with exactly one of these lines for CI parsing:

  • eval result: pass — all evals passed
  • eval result: fail — one or more evals failed
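On the CI side, that trailing line can be picked out of the agent's output with a scan like this (a sketch; `parse_ci_result` is a hypothetical name):

```python
def parse_ci_result(output):
    """Find the last 'eval result: ...' line and return True on pass."""
    for line in reversed(output.splitlines()):
        if line.strip().startswith("eval result:"):
            return line.strip() == "eval result: pass"
    raise ValueError("no 'eval result:' line found in output")
```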

Subagent Instructions

IMPORTANT: The subagent must only test and observe. It must NOT attempt to fix, modify, or change anything to make the expectation pass. The subagent executes the prompt, observes the outcome, and reports whether the expectation was met. If the expectation fails, report FAIL — do not try to make it pass.

Commands

Run a single eval:

/eval run <path-to-eval.eval.md>

Run all evals:

/eval run-all

Creating an Eval

Gather from the user:

  1. Context - The process or flow to evaluate
  2. Expectation - Success criteria in natural language

/eval create <name>

Write the eval to <name>.eval.md in the current directory.
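Writing the file itself is mechanical; a minimal sketch, assuming the eval structure shown earlier:

```python
from pathlib import Path

EVAL_TEMPLATE = """# {title}

<prompt>
{prompt}
</prompt>

<expectation>
{expectation}
</expectation>
"""

def create_eval(name, title, prompt, expectation, directory="."):
    """Write <name>.eval.md in the given directory and return its path."""
    path = Path(directory) / f"{name}.eval.md"
    path.write_text(EVAL_TEMPLATE.format(
        title=title, prompt=prompt, expectation=expectation))
    return path
```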

Isolation

When creating an eval, make it self-contained and reproducible where possible. This isn't critical, but it helps:

  • Try to avoid hardcoded paths — prefer relative paths or have the prompt create its own working directory rather than encoding specific temp directories or absolute paths.
  • Try to avoid external state — if the process relied on existing files or services, consider whether the eval should create that state itself.
  • Parameterize where possible — replace specific values (ports, filenames, IDs) with generic ones the eval can generate.

If you see an opportunity to improve isolation but need clarification, ask the user.
