eval-recipes Runner Skill

Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

When to Use

User asks to "test with eval-recipes"
User says "run the evals" or "benchmark this change"
User wants to validate improvements against codex/claude_code
Testing a PR branch to prove it improves scores

Capabilities

I can run eval-recipes benchmarks to:

Test specific amplihack branches
Compare against baseline agents (codex, claude_code)
Run specific tasks (linkedin_drafting, email_drafting, etc.)
Compare before/after scores for PRs
Generate reports with score improvements

How It Works

Setup (One-Time)

Clone eval-recipes from Microsoft

git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

Copy our agent configs (run from the amplihack repo root)

cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

Install dependencies

cd ~/eval-recipes
uv sync

Running Benchmarks

Test a specific branch:

Update install.dockerfile to use specific branch

Then run benchmark

cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3

Compare before/after:

Test baseline (main)

uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

Test PR branch (copy the agent config to data/agents/amplihack_pr1443 and edit its install.dockerfile to checkout the PR branch)

uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

Compare scores
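
The comparison step can be sketched as a small shell helper that prints the signed difference between two scores. This is a minimal sketch: the function name score_delta is hypothetical and not part of eval-recipes, and it assumes you have already read the two scores from the run output yourself.

```bash
# score_delta BASELINE CANDIDATE
# Hypothetical helper (not part of eval-recipes): prints candidate minus
# baseline with an explicit sign and one decimal place.
score_delta() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%+.1f\n", cand - base }'
}
```

For example, score_delta 6.5 35.2 prints +28.7, which matches the LinkedIn numbers quoted in the Example Usage section.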

Available Tasks

Common tasks from eval-recipes:

linkedin_drafting: Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
email_drafting: Create CLI tool for emails (scored 26/100 before)
arxiv_paper_summarizer: Research tool
github_docs_extractor: Documentation tool
Many more in ~/eval-recipes/data/tasks/

Typical Workflow

When user says "test this change with eval-recipes":

Identify the branch/PR to test
Update agent config to use that branch:

In .claude/agents/eval-recipes/amplihack/install.dockerfile

RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .

Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
Run benchmark: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
Report scores and compare with baseline
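
The "update agent config" step can be scripted rather than edited by hand. The sketch below assumes the install.dockerfile contains a single "git checkout NAME" token, as in the snippet above. The function name set_checkout_branch is hypothetical, and branch names containing "|" or "&" would need escaping before being passed to sed.

```bash
# set_checkout_branch DOCKERFILE BRANCH
# Hypothetical helper: rewrites the argument of "git checkout" in place.
# A temp file is used because "sed -i" behaves differently on GNU and BSD.
set_checkout_branch() {
  local dockerfile="$1" branch="$2" tmp status
  tmp="$(mktemp)" || return 1
  # "|" is the sed delimiter so branch names may contain "/".
  if sed -E "s|(git checkout )[^ ]+|\1${branch}|" "$dockerfile" > "$tmp"; then
    cat "$tmp" > "$dockerfile"   # overwrite contents, keep file permissions
    status=$?
  else
    status=1
  fi
  rm -f "$tmp"
  return "$status"
}
```

A call such as set_checkout_branch .claude/agents/eval-recipes/amplihack/install.dockerfile feat/issue-1435-task-classification would then precede the copy and run steps.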

Expected Scores

Baseline (main branch):

Overall: 40.6/100
LinkedIn: 6.5/100
Email: 26/100

With PR #1443 (task classification):

Overall: 55-60/100 expected (+15-20 points)
LinkedIn: 30-40/100 (creates actual tool)
Email: 45/100 (consistent execution)

Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

Update install.dockerfile to checkout feat/issue-1435-task-classification
Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
Report results: "Score: 35.2/100 (up from 6.5 baseline)"

Prerequisites

eval-recipes cloned to ~/eval-recipes
API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
Docker installed (for containerized runs)
uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
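
The prerequisites above can be checked before a run with a short shell function. This is a sketch only: missing_prereqs is a hypothetical name, and it only tests for presence (commands on PATH, a non-empty API key, the clone directory). It does not verify that Docker is running or that the key is valid.

```bash
# missing_prereqs
# Hypothetical helper: prints one line per missing prerequisite and
# prints nothing when everything listed above is present.
missing_prereqs() {
  local cmd
  for cmd in uv docker; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing command: $cmd"
  done
  [ -n "${ANTHROPIC_API_KEY:-}" ] || echo "missing environment variable: ANTHROPIC_API_KEY"
  [ -d "$HOME/eval-recipes" ] || echo "missing directory: $HOME/eval-recipes"
}
```

Running missing_prereqs and seeing no output suggests the setup is ready. Any printed line names the item to fix.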

Notes

Benchmarks take 2-15 minutes per task depending on complexity
Multiple trials (3-5) give more reliable averages
Docker builds can be cached for speed
Results saved to .benchmark_results/ in eval-recipes repo
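
Averaging several trials, as suggested above, can be sketched with awk. The sketch assumes the per-trial scores have already been collected as one number per line. How they are extracted from .benchmark_results/ depends on the result layout, which is not described here. The name mean_score is hypothetical.

```bash
# mean_score
# Hypothetical helper: reads one score per line on stdin and prints the
# mean with one decimal place. Blank lines are skipped. Exits non-zero
# when no scores were supplied.
mean_score() {
  awk 'NF { sum += $1; n++ }
       END { if (n == 0) exit 1; printf "%.1f\n", sum / n }'
}
```

For instance, printf '30\n35.2\n40\n' | mean_score prints 35.1.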

Automation

For fully autonomous testing:

Test suite for a PR

tasks="linkedin_drafting email_drafting arxiv_paper_summarizer"
for task in $tasks; do
  uv run eval_recipes/main.py --agent amplihack --task "$task" --trials 3
done

Compare results

cat .benchmark_results/*/amplihack/*/score.txt
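
For unattended runs it can help to keep going when one task fails and to preview the commands first. The sketch below wraps the loop above in a function. The names run_suite and DRY_RUN are hypothetical, and the only eval-recipes flags used are the --agent, --task, and --trials flags already shown in this document.

```bash
# run_suite AGENT TRIALS TASK...
# Hypothetical wrapper around the loop above. With DRY_RUN=1 it only
# prints the commands. Otherwise it runs every task, continues past
# failures, and returns non-zero if any task failed.
run_suite() {
  local agent="$1" trials="$2" task failed=""
  shift 2
  for task in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "uv run eval_recipes/main.py --agent $agent --task $task --trials $trials"
    elif ! uv run eval_recipes/main.py --agent "$agent" --task "$task" --trials "$trials"; then
      failed="$failed $task"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed tasks:$failed" >&2
    return 1
  fi
}
```

From ~/eval-recipes, a preview would look like DRY_RUN=1 run_suite amplihack 3 linkedin_drafting email_drafting. Dropping DRY_RUN=1 performs the actual runs.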
