eval-recipes Runner Skill
Purpose
Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.
When to Use
-
User asks to "test with eval-recipes"
-
User says "run the evals" or "benchmark this change"
-
User wants to validate improvements against codex/claude_code
-
Testing a PR branch to prove it improves scores
Capabilities
I can run eval-recipes benchmarks to:
-
Test specific amplihack branches
-
Compare against baseline agents (codex, claude_code)
-
Run specific tasks (linkedin_drafting, email_drafting, etc.)
-
Compare before/after scores for PRs
-
Generate reports with score improvements
How It Works
Setup (One-Time)
Clone eval-recipes from Microsoft
git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes cd ~/eval-recipes
Copy our agent configs
cp -r $(pwd)/.claude/agents/eval-recipes/* data/agents/
Install dependencies
uv sync
Running Benchmarks
Test a specific branch:
Update install.dockerfile to use specific branch
Then run benchmark
cd ~/eval-recipes uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
Compare before/after:
Test baseline (main)
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting
Test PR branch (edit install.dockerfile to checkout PR branch)
uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting
Compare scores
Available Tasks
Common tasks from eval-recipes:
-
linkedin_drafting
-
Create tool for LinkedIn posts (scored 6.5/100 before PR #1443)
-
email_drafting
-
Create CLI tool for emails (scored 26/100 before)
-
arxiv_paper_summarizer
-
Research tool
-
github_docs_extractor
-
Documentation tool
-
Many more in ~/eval-recipes/data/tasks/
Typical Workflow
When user says "test this change with eval-recipes":
-
Identify the branch/PR to test
-
Update agent config to use that branch:
In .claude/agents/eval-recipes/amplihack/install.dockerfile
RUN git clone https://github.com/rysweet/...git /tmp/amplihack &&
cd /tmp/amplihack &&
git checkout BRANCH_NAME &&
pip install -e .
-
Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
-
Run benchmark: cd ~/eval-recipes uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3
-
Report scores and compare with baseline
Expected Scores
Baseline (main branch):
-
Overall: 40.6/100
-
LinkedIn: 6.5/100
-
Email: 26/100
With PR #1443 (task classification):
-
Expected: 55-60/100 (+15-20 points)
-
LinkedIn: 30-40/100 (creates actual tool)
-
Email: 45/100 (consistent execution)
Example Usage
User says: "Test PR #1443 with eval-recipes on the LinkedIn task"
I do:
-
Update install.dockerfile to checkout feat/issue-1435-task-classification
-
Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
-
Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
-
Report results: "Score: 35.2/100 (up from 6.5 baseline)"
Prerequisites
-
eval-recipes cloned to ~/eval-recipes
-
API key in environment: export ANTHROPIC_API_KEY=sk-ant-...
-
Docker installed (for containerized runs)
-
uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
Notes
-
Benchmarks take 2-15 minutes per task depending on complexity
-
Multiple trials (3-5) give more reliable averages
-
Docker builds can be cached for speed
-
Results saved to .benchmark_results/ in eval-recipes repo
Automation
For fully autonomous testing:
Test suite for a PR
tasks="linkedin_drafting email_drafting arxiv_paper_summarizer" for task in $tasks; do uv run eval_recipes/main.py --agent amplihack --task $task --trials 3 done
Compare results
cat .benchmark_results//amplihack//score.txt