# Agent Evaluation (AI Agent Evals)

Based on Anthropic's "Demystifying evals for AI agents".
## When to use this skill

- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, or human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
## Core Concepts

### Eval Evolution: Single-turn → Multi-turn → Agentic

| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| Single-turn | 1 | None | Simple | Low |
| Multi-turn | N | Conversation | Per-turn | Medium |
| Agentic | N | World + History | Outcome | High |
### 7 Key Terms

| Term | Definition |
|------|------------|
| Task | A single test case (prompt + expected outcome) |
| Trial | One agent run on a task |
| Grader | A scoring function (code, model, or human) |
| Transcript | The full record of the agent's actions |
| Outcome | The final state used for grading |
| Harness | The infrastructure that runs evals |
| Suite | A collection of related tasks |
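These terms map naturally onto a few small data structures in a harness. A minimal sketch, assuming a Python harness; the field names are illustrative rather than taken from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Single test case: a prompt plus its expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict
    tags: list[str] = field(default_factory=list)

@dataclass
class Trial:
    """One agent run on a task."""
    task_id: str
    transcript: list[dict]      # full record of the agent's actions
    outcome: dict               # final state handed to the grader
    score: float | None = None  # filled in by a grader

# A suite is simply a named collection of related tasks.
Suite = dict[str, list[Task]]
```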
## Instructions

### Step 1: Understand Grader Types

#### Code-based Graders (Recommended for Coding Agents)

- Pros: fast, objective, reproducible
- Cons: requires clear success criteria
- Best for: coding agents, structured outputs
Example: code-based grader

```python
def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests
```
SWE-bench style grader:

```python
import subprocess

def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check if the patch resolves the issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True,
    )
    return result.returncode == 0
```
#### Model-based Graders (LLM-as-Judge)

- Pros: flexible, handles nuance
- Cons: requires calibration, can be inconsistent
- Best for: conversational agents, open-ended tasks
Example: LLM rubric for a customer support agent

```yaml
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive
    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution
    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
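The rubric assigns a weight to each dimension but leaves the aggregation implicit. A minimal sketch of one way to combine per-dimension 1-5 scores into a single 0-1 score; the function and dictionary layout are assumptions, not part of the rubric format.

```python
def combine_rubric_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 1-5 dimension scores, normalized to the 0-1 range."""
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[name] * weight for name, weight in weights.items())
    return (weighted_sum / total_weight) / 5.0

# Using the weights above: empathy 0.3, resolution 0.5, efficiency 0.2
overall = combine_rubric_scores(
    {"empathy": 4, "resolution": 5, "efficiency": 3},
    {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2},
)  # (1.2 + 2.5 + 0.6) / 1.0 / 5 = 0.86
```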
#### Human Graders

- Pros: highest accuracy, catches edge cases
- Cons: expensive, slow, not scalable
- Best for: final validation, ambiguous cases
### Step 2: Choose Strategy by Agent Type

#### 2.1 Coding Agents

Benchmarks:

- SWE-bench Verified: real GitHub issues (40% → 80%+ achievable)
- Terminal-Bench: complex terminal tasks
- Custom test suites built on your own codebase
Grading Strategy:

```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"]),
    }
```
Key Metrics:

- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better)
#### 2.2 Conversational Agents

Benchmarks:

- τ2-Bench: multi-domain conversation
- Custom domain-specific suites
Grading Strategy (multi-dimensional):

```yaml
success_criteria:
  - empathy_score: ">= 4.0"
  - resolution_rate: ">= 0.9"
  - avg_turns: "<= 5"
  - escalation_rate: "<= 0.1"
```
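A hedged sketch of how a harness might check measured metrics against these thresholds; the metric names mirror the YAML above, everything else is illustrative.

```python
def check_success_criteria(metrics: dict[str, float]) -> dict[str, bool]:
    """Return a pass/fail flag for each success criterion."""
    return {
        "empathy_score": metrics["empathy_score"] >= 4.0,
        "resolution_rate": metrics["resolution_rate"] >= 0.9,
        "avg_turns": metrics["avg_turns"] <= 5,
        "escalation_rate": metrics["escalation_rate"] <= 0.1,
    }

# The suite passes only if every criterion holds.
checks = check_success_criteria(
    {"empathy_score": 4.2, "resolution_rate": 0.93, "avg_turns": 4.1, "escalation_rate": 0.05}
)
suite_passed = all(checks.values())
```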
Key Metrics:

- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate
#### 2.3 Research Agents

Grading Dimensions:

- Grounding: claims backed by sources
- Coverage: all aspects addressed
- Source Quality: authoritative sources used

```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"]),
    }
```
#### 2.4 Computer Use Agents

Benchmarks:

- WebArena: web navigation tasks
- OSWorld: desktop environment tasks

Grading Strategy:

```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome),
    }
```
### Step 3: Follow the 8-Step Roadmap

#### Step 0: Start Early (20-50 Tasks)

```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}
```

Start with representative tasks (a mix check is sketched after the list):

- Common use cases (60%)
- Edge cases (20%)
- Failure modes (20%)
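One way to keep the 60/20/20 mix honest is to tag each task with a category and compare counts against the target; a minimal sketch, assuming tasks carry a `category` field (an assumption, not from the source).

```python
from collections import Counter

TARGET_MIX = {"common": 0.6, "edge": 0.2, "failure": 0.2}

def check_task_mix(tasks: list[dict], tolerance: float = 0.1) -> dict[str, bool]:
    """Flag categories that drift more than `tolerance` from their target share."""
    counts = Counter(task["category"] for task in tasks)
    total = len(tasks)
    return {
        category: abs(counts[category] / total - target) <= tolerance
        for category, target in TARGET_MIX.items()
    }
```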
#### Step 1: Convert Manual Tests

Transform existing QA tests into eval tasks:

```python
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", []),
    }
```
#### Step 2: Ensure Clarity + Reference Solutions

A good task definition:

```yaml
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```
#### Step 3: Balance Positive/Negative Cases

Ensure eval suite balance:

```python
suite_composition = {
    "positive_cases": 0.5,  # Should succeed
    "negative_cases": 0.3,  # Should fail gracefully
    "edge_cases": 0.2,      # Boundary conditions
}
```
#### Step 4: Isolate Environments

Docker-based isolation for coding evals:

```yaml
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```
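A hedged sketch of what the harness side of that configuration could look like: a wrapper that launches each trial in Docker with matching resource and network flags. The image name and timeout come from the YAML above; the wrapper itself and the volume layout are assumptions.

```python
import subprocess

def run_in_sandbox(task_dir: str, command: list[str]) -> subprocess.CompletedProcess:
    """Run one eval command in an isolated, resource-limited container."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--memory", "4g",
            "--cpus", "2",
            "--network", "none",             # isolated network
            "-v", f"{task_dir}:/workspace",  # assumed mount point
            "-w", "/workspace",
            "eval-sandbox:latest",
            *command,
        ],
        capture_output=True,
        timeout=300,  # matches the 300s timeout above
    )
```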
#### Step 5: Focus on Outcomes, Not Paths

```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```
#### Step 6: Always Read Transcripts

Transcript analysis for debugging:

```python
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript),
    }
```
#### Step 7: Monitor Eval Saturation

Detect when evals are no longer useful:

```python
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue",
    }
```
#### Step 8: Long-term Maintenance

Eval suite maintenance checklist:

```yaml
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```
### Step 4: Integrate with Production

#### CI/CD Integration

GitHub Actions example:

```yaml
name: Agent Evals
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```
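The workflow assumes a `run_evals.py` entry point that the source does not show; here is a minimal sketch of what such a runner might look like, with the `--suite` and `--mode` flags taken from the workflow and everything else assumed.

```python
# run_evals.py - minimal harness sketch (assumed layout: evals/tasks/*.json)
import argparse
import json
import pathlib

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--suite", default="core")
    parser.add_argument("--mode", default="full")
    args = parser.parse_args()

    tasks = [
        json.loads(path.read_text())
        for path in pathlib.Path("evals/tasks").glob(f"{args.suite}-*.json")
    ]
    results = [run_trial_and_grade(task, mode=args.mode) for task in tasks]  # assumed helper

    out = pathlib.Path("results") / f"{args.suite}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(results, indent=2))

if __name__ == "__main__":
    main()
```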
#### Production Monitoring

Real-time eval sampling:

```python
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        self.threshold = threshold  # alert threshold; the default here is illustrative

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```
#### A/B Testing

Compare agent versions:

```python
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results),
    }
```
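The `calculate_confidence` call above is left abstract. One reasonable implementation, sketched here as an assumption, is a two-proportion z-test on the pass rates of the two versions.

```python
import math

def two_proportion_confidence(passes_a: int, n_a: int, passes_b: int, n_b: int) -> float:
    """Confidence (1 - p-value) that two versions' pass rates genuinely differ."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # degenerate case: both versions pass or fail everything
    z = abs(p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return 1 - p_value
```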
## Best Practices

### Do's ✅

- Start with 20-50 representative tasks
- Use code-based graders when possible
- Focus on outcomes, not paths
- Read transcripts for debugging
- Monitor for eval saturation
- Balance positive/negative cases
- Isolate eval environments
- Version your eval suites
### Don'ts ❌

- Don't over-rely on model-based graders without calibration
- Don't ignore failed evals (false negatives exist)
- Don't grade on intermediate steps
- Don't skip transcript analysis
- Don't use production data without sanitization
- Don't let eval suites become stale
## Success Patterns

### Pattern 1: Graduated Eval Complexity

- Level 1: Unit evals (single capability)
- Level 2: Integration evals (combined capabilities)
- Level 3: End-to-end evals (full workflows)
- Level 4: Adversarial evals (edge cases)
### Pattern 2: Eval-Driven Development

1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
### Pattern 3: Continuous Calibration

- Weekly: review grader accuracy
- Monthly: update rubrics based on feedback
- Quarterly: full grader audit with human baseline
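One concrete way to run that quarterly audit is to score the same trials with the model grader and with humans, then measure agreement; a sketch using raw agreement plus Cohen's kappa (the metric choice is an assumption, not from the source).

```python
def grader_agreement(model_labels: list[bool], human_labels: list[bool]) -> dict:
    """Compare a model-based grader against a human baseline on the same trials."""
    n = len(model_labels)
    agreement = sum(m == h for m, h in zip(model_labels, human_labels)) / n

    # Chance agreement for binary labels, used by Cohen's kappa
    p_model = sum(model_labels) / n
    p_human = sum(human_labels) / n
    p_chance = p_model * p_human + (1 - p_model) * (1 - p_human)
    kappa = 1.0 if p_chance == 1 else (agreement - p_chance) / (1 - p_chance)

    return {"agreement": agreement, "kappa": kappa}
```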
## Troubleshooting

**Problem: Eval scores at 100%**
Solution: Add harder tasks; check for eval saturation (Step 7).

**Problem: Inconsistent model-based grader scores**
Solution: Add more examples to the rubric, use structured output, or ensemble graders.

**Problem: Evals too slow for CI**
Solution: Use toon mode, parallelize, or sample a subset for PR checks.

**Problem: Agent passes evals but fails in production**
Solution: Add production failure cases to the eval suite and increase task diversity.
## References

- Anthropic: "Demystifying evals for AI agents"
- SWE-bench
- WebArena
- τ2-Bench
## Examples

### Example 1: Simple Coding Agent Eval

Task definition:

```python
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"},
    ],
}
```
Grader:

```python
def grade(task, outcome):
    code = outcome["code"]
    namespace = {}
    exec(code, namespace)  # Run only inside an isolated sandbox
    fizzbuzz = namespace["fizzbuzz"]  # the task asks for a function named fizzbuzz
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```
### Example 2: Conversational Agent Eval with LLM Rubric

```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5

grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```
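A hedged sketch of how the model grader in Example 2 might be invoked with the Anthropic Python SDK; only the model name comes from the task definition above, while the prompt wording, JSON scoring format, and helper name are assumptions.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade_with_rubric(transcript: str, rubric: str) -> dict:
    """Ask the judge model for per-dimension 1-5 scores as JSON."""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"{rubric}\n\nTranscript:\n{transcript}\n\n"
                "Reply with JSON only, for example: "
                '{"empathy": 4, "resolution": 5, "efficiency": 3}'
            ),
        }],
    )
    return json.loads(message.content[0].text)
```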