AI Evaluation (Evals)
Build systematic evaluation frameworks for AI/LLM products to measure quality, catch regressions, and improve model performance.
When to Use
- Building products with LLM/AI components
- Need to measure AI output quality systematically
- Comparing models or prompts (A/B testing)
- Detecting regressions before deployment
- Benchmarking against competitors
- Improving AI accuracy over time
- Explaining AI decisions to stakeholders
Core Concept
AI Evaluation (Evals) ≠ Traditional Testing
Traditional software: deterministic (same input → same output)
AI/LLM systems: probabilistic (same input → variable outputs)
Why Evals Are Hard:
- Outputs are subjective (is this "good" writing?)
- No single right answer (multiple valid responses)
- Edge cases are infinite (can't test everything)
- Models change behavior with updates
Solution: Build eval suites that:
- Define quality metrics (what is "good"?)
- Create representative test cases
- Measure systematically (automated + human)
- Track over time (catch regressions)
Workflow
Step 1: Define What You're Evaluating
AI Component Taxonomy
CLASSIFICATION TASKS:
- Sentiment analysis (positive/negative/neutral)
- Content moderation (safe/unsafe)
- Intent detection (user wants X)
- Entity recognition (extract names, dates)
Eval approach: Accuracy, precision, recall, F1 score
GENERATION TASKS:
- Text generation (summaries, responses, creative writing)
- Code generation (functions, scripts)
- Recommendations (suggest items, next actions)
- Translations (language → language)
Eval approach: Quality scores, human preference, task success rate
RETRIEVAL TASKS:
- Search (find relevant documents)
- Recommendation (rank items by relevance)
- Question answering (retrieve + synthesize answer)
Eval approach: Relevance, ranking quality (NDCG, MRR)
REASONING TASKS:
- Multi-step problem solving
- Complex decision making
- Causal inference
Eval approach: Correctness, reasoning quality, step-by-step validation
Step 2: Build Your Eval Dataset
Golden Dataset: Curated examples with known correct outputs.
Dataset Creation Framework
SIZE REQUIREMENTS:
- Minimum: 50-100 examples (manual review feasible)
- Good: 500-1000 examples (covers edge cases)
- Production: 5,000-10,000+ examples (statistical significance)
COMPOSITION:
- Happy Path (40%) - Typical, well-formed inputs
- Edge Cases (30%) - Unusual but valid inputs
- Adversarial (20%) - Deliberately tricky inputs
- Failure Cases (10%) - Invalid inputs (test error handling)
EXAMPLE (Book Recommendation AI):
Happy Path:
- "Recommend books like Harry Potter for my 10-year-old"
- "My kid loved Percy Jackson, what's next?"
Edge Cases:
- "Books for advanced reader (7yo but reads at 5th grade level)"
- "Fantasy but NO violence or romance"
Adversarial:
- "Best books" (too vague)
- "Books about [topic that doesn't exist for kids]"
Failure Cases:
- Gibberish input
- Adult content request
SOURCES FOR TEST CASES:
- User Logs - Real queries (anonymized)
- Team Brainstorm - Manual generation
- Synthetic - GPT-4 to generate test cases
- Competitor Comparison - Test against their outputs
- Bug Reports - Historical failures
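To make this concrete, a golden dataset is usually just a version-controlled JSON/JSONL file. A minimal sketch for the book-recommendation example above might look like this (the field names and expectations are illustrative, not a required schema):

```python
# Minimal golden-dataset sketch: one entry per test case, tagged by category.
# Store as JSON/JSONL in git next to the prompts it evaluates.
golden_dataset = [
    {
        "id": "rec-001",
        "category": "happy_path",
        "input": "Recommend books like Harry Potter for my 10-year-old",
        "expected": {"must_include_genres": ["fantasy"], "age_range": [8, 12]},
    },
    {
        "id": "rec-042",
        "category": "edge_case",
        "input": "Fantasy but NO violence or romance",
        "expected": {"must_exclude_themes": ["violence", "romance"]},
    },
    {
        "id": "rec-090",
        "category": "adversarial",
        "input": "Best books",  # too vague; system should ask a clarifying question
        "expected": {"behavior": "clarifying_question"},
    },
]
```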
Step 3: Define Evaluation Metrics
Quantitative Metrics:
Metric Types
ACCURACY METRICS (Classification)
- Accuracy: (Correct predictions) / (Total)
- Precision: (True Positives) / (True Positives + False Positives)
- Recall: (True Positives) / (True Positives + False Negatives)
- F1 Score: Harmonic mean of precision and recall
Use when: Clear right/wrong answer (classification, extraction)
QUALITY METRICS (Generation)
- Coherence: Does output make sense? (1-5 scale)
- Relevance: Does output answer the question? (1-5 scale)
- Helpfulness: Is output useful to user? (1-5 scale)
- Safety: Is output safe/appropriate? (pass/fail)
- Hallucination Rate: % of outputs with false information
Use when: Subjective quality assessment needed
TASK SUCCESS METRICS
- Completion Rate: % of tasks successfully completed
- User Satisfaction: Thumbs up/down, NPS
- Time to Success: How long to achieve goal
- Retry Rate: % of users who re-prompt after first response
Use when: Evaluating end-to-end task performance
RANKING METRICS (Retrieval/Recommendation)
- MRR (Mean Reciprocal Rank): Average of 1/rank of first relevant result
- NDCG (Normalized Discounted Cumulative Gain): Quality of ranking
- Precision@K: % of top K results that are relevant
Use when: Evaluating search or recommendation quality
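As a quick illustration of the accuracy and ranking metrics above, here is a small sketch using scikit-learn for precision/recall/F1 plus hand-rolled MRR and Precision@K; the labels are made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# --- Classification metrics on a labeled eval set ---
y_true = [1, 0, 1, 1, 0, 1]   # ground-truth labels (e.g., 1 = unsafe content)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))

# --- Ranking metrics on retrieval results ---
def reciprocal_rank(ranked_relevance):
    """1/rank of the first relevant result, 0 if none are relevant."""
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(ranked_relevance[:k]) / k

# Each list is the relevance (1/0) of results in ranked order for one query.
queries = [[0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 0]]
print("MRR:        ", sum(reciprocal_rank(q) for q in queries) / len(queries))
print("Precision@3:", sum(precision_at_k(q, 3) for q in queries) / len(queries))
```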
Qualitative Metrics:
Human Evaluation
PAIRWISE COMPARISON: Show human raters two outputs (A vs B), ask "Which is better?"
- Advantage: Easier than absolute rating
- Disadvantage: Slower, requires more comparisons
LIKERT SCALE RATING: Rate outputs 1-5 on dimensions (coherence, helpfulness, safety)
- Advantage: Fast, can aggregate scores
- Disadvantage: Subjective, rater disagreement
TASK COMPLETION: Can human complete task using AI output?
- Advantage: Measures real utility
- Disadvantage: Slow, expensive
RED TEAM REVIEW: Experts try to find failures (adversarial testing)
- Advantage: Finds edge cases
- Disadvantage: Not systematic
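For pairwise comparison, a plain win-rate tally over the collected votes is often enough to compare two variants before reaching for heavier aggregation schemes (Elo, Bradley-Terry); the votes below are illustrative:

```python
from collections import Counter

# Each entry is a rater's verdict for one A-vs-B comparison; ties split evenly.
votes = ["A", "B", "B", "A", "B", "tie", "B", "A", "B", "B"]

counts = Counter(votes)
wins_b = counts.get("B", 0) + counts.get("tie", 0) / 2
win_rate_b = wins_b / len(votes)
print(f"B preferred in {win_rate_b:.0%} of comparisons")
```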
Step 4: Automated Evaluation Strategies
Use LLM as Judge:
LLM-as-Evaluator Pattern
CONCEPT: Use GPT-4 (or another strong model) to evaluate outputs from your AI.
PROMPT TEMPLATE: "You are an expert evaluator. Rate the following AI response on:
- Relevance (1-5)
- Accuracy (1-5)
- Helpfulness (1-5)
User query: {query}
AI response: {response}
Ground truth (if available): {truth}
Provide ratings and brief explanation."
ADVANTAGES:
✅ Scalable (can eval thousands of examples)
✅ Consistent (same rubric every time)
✅ Fast (seconds per eval)
✅ Cheap (pennies per eval)
DISADVANTAGES:
❌ Not 100% reliable (LLM judge can be wrong)
❌ Requires validation (compare to human ratings)
❌ Can miss nuanced failures
VALIDATION:
- Run LLM judge on 100-200 examples
- Have humans also rate same examples
- Calculate inter-rater agreement (Cohen's kappa)
- If agreement is substantial (e.g., kappa above ~0.7), treat the LLM judge as trustworthy enough to use at scale
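A minimal sketch of the judge-plus-validation loop is below, assuming the judge is asked to end its reply with a single overall rating; build_judge_prompt and call_llm are placeholders for your own template and model client, and Cohen's kappa comes from scikit-learn:

```python
import re
from sklearn.metrics import cohen_kappa_score

def build_judge_prompt(query: str, response: str, truth: str | None = None) -> str:
    # Mirrors the prompt template above; ground truth is optional.
    prompt = (
        "You are an expert evaluator. Rate the following AI response on "
        "relevance, accuracy, and helpfulness (1-5 each), then end with one "
        "line of the form 'OVERALL: <1-5>'.\n\n"
        f"User query: {query}\nAI response: {response}\n"
    )
    if truth:
        prompt += f"Ground truth: {truth}\n"
    return prompt

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your model client (OpenAI, Anthropic, a local model, ...).
    raise NotImplementedError

def judge_rating(query: str, response: str, truth: str | None = None) -> int | None:
    reply = call_llm(build_judge_prompt(query, response, truth))
    match = re.search(r"OVERALL:\s*([1-5])", reply)
    return int(match.group(1)) if match else None

# Validation step: compare judge ratings to human ratings on the same examples.
human_ratings = [5, 4, 2, 5, 3, 1, 4, 4]   # illustrative
judge_ratings = [5, 4, 3, 5, 3, 1, 4, 5]   # illustrative
kappa = cohen_kappa_score(human_ratings, judge_ratings, weights="quadratic")
print(f"weighted kappa: {kappa:.2f}")  # high agreement -> judge is usable at scale
```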
Step 5: Build Eval Pipeline
Continuous Evaluation System:
Eval Pipeline Architecture
COMPONENTS:
- Test Suite Storage
  - JSON/CSV of test cases
  - Version controlled (git)
  - Tagged by category (happy path, edge case, etc.)
- Runner Script
  - Iterate through test cases
  - Call AI system with each input
  - Collect outputs
  - Log latency, cost, errors
- Scorer
  - Compare output to expected (if available)
  - Run automated metrics (accuracy, ROUGE, BLEU, etc.)
  - Call LLM judge for quality rating
  - Aggregate scores
- Regression Detection
  - Compare current run to baseline
  - Flag significant drops (e.g., accuracy down >5%)
  - Alert team if regression detected
- Reporting Dashboard
  - Visualize metrics over time
  - Drill down into failures
  - Compare models/prompts side-by-side
FREQUENCY:
- Pre-deploy: Every code/prompt change
- Nightly: Full suite run on production
- Weekly: Human review of sample outputs
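A bare-bones version of the runner + scorer + regression check could look like the sketch below; run_system, score_output, and the baseline file format are placeholders you would swap for your own system and metrics:

```python
import json
import time

def run_system(input_text: str) -> str:
    # Placeholder: call your AI system (API, chain, agent, ...) here.
    raise NotImplementedError

def score_output(case: dict, output: str) -> float:
    # Placeholder: exact match, ROUGE, an LLM judge, etc. Return a 0-1 score.
    raise NotImplementedError

def run_eval(test_suite_path: str, baseline_path: str, max_drop: float = 0.05) -> dict:
    with open(test_suite_path) as f:
        cases = json.load(f)

    results = []
    for case in cases:
        start = time.time()
        try:
            output = run_system(case["input"])
            score, error = score_output(case, output), None
        except Exception as exc:  # log failures instead of aborting the whole run
            score, error = 0.0, str(exc)
        results.append({
            "id": case["id"],
            "score": score,
            "latency_s": round(time.time() - start, 3),
            "error": error,
        })

    mean_score = sum(r["score"] for r in results) / len(results)

    # Regression detection: compare against the stored baseline score.
    with open(baseline_path) as f:
        baseline = json.load(f)["mean_score"]
    if mean_score < baseline - max_drop:
        print(f"REGRESSION: {mean_score:.3f} vs baseline {baseline:.3f}")

    return {"mean_score": mean_score, "results": results}
```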
Step 6: Common Eval Patterns
Evaluation Strategies by Use Case
RECOMMENDATION SYSTEMS
Test: Does user engage with recommendation?
Metrics:
- Click-through rate (CTR)
- Conversion rate (purchase, complete)
- Time spent with recommended item
- Diversity (not all same type)
Golden Dataset:
- Historical user behavior (a user liked X and Y; did they also engage with Z?)
- Synthetic: "If user likes [A, B, C], recommend [D]?"
CONTENT MODERATION
Test: Does it correctly flag unsafe content?
Metrics:
- Precision (flagged = actually unsafe)
- Recall (actually unsafe = flagged)
- False positive rate (safe content flagged)
Golden Dataset:
- Curated examples of safe/unsafe content
- Edge cases (satire, context-dependent)
SUMMARIZATION
Test: Does summary capture key points?
Metrics:
- ROUGE score (overlap with reference summary)
- Factual consistency (no hallucinations)
- Compression ratio (length of summary / original)
Golden Dataset:
- Documents with human-written summaries
- Check: All key facts present, no false info
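For the ROUGE piece, the rouge-score package gives a cheap automated signal (factual consistency still needs an LLM judge or human review); a minimal sketch, assuming that package is installed:

```python
from rouge_score import rouge_scorer

reference = "The council approved the budget and delayed the housing vote to March."
candidate = "The council passed the budget; the housing vote was pushed to March."

# ROUGE-1 = unigram overlap with the reference summary; ROUGE-L = longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
print("ROUGE-1 F1:", scores["rouge1"].fmeasure)
print("ROUGE-L F1:", scores["rougeL"].fmeasure)
```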
CODE GENERATION
Test: Does generated code work?
Metrics:
- Syntax correctness (parses without errors)
- Functional correctness (passes unit tests)
- Code quality (readable, efficient)
Golden Dataset:
- Programming problems with test cases
- Example: "Write function that reverses string" + 10 test cases
CONVERSATIONAL AI
Test: Does it handle multi-turn conversation well?
Metrics:
- Coherence across turns
- Context retention (remembers earlier messages)
- Task completion rate (user achieves goal)
- Safety (doesn't generate harmful content)
Golden Dataset:
- Scripted conversations with expected paths
- User logs (real conversations, anonymized)
Step 7: A/B Testing for AI
Compare models, prompts, or configurations:
AI A/B Testing Framework
SETUP:
- Define variants (Model A vs. Model B, or Prompt v1 vs. v2)
- Random assignment (50/50 split)
- Define success metric (accuracy, user satisfaction, task completion)
- Minimum sample size (depends on expected effect size)
METRICS TO TRACK:
- Primary: Quality (accuracy, preference, satisfaction)
- Secondary: Latency, cost, error rate
- Guardrails: Safety violations, user complaints
STATISTICAL SIGNIFICANCE:
- Fix the sample size up front, then test at p < 0.05 (95% confidence); stopping as soon as p dips below the threshold inflates false positives
- Typically need 1,000-10,000 samples depending on effect size
- Use tools: Optimizely, LaunchDarkly, or custom
COMMON TESTS:
- Model comparison: GPT-4 vs. Claude vs. Gemini
- Prompt engineering: Version A vs. B
- Temperature: 0.7 vs. 0.9 (creativity vs. consistency)
- Context window: Include X vs. Y tokens of context
EXAMPLE:
- Variant A: GPT-4 with prompt v1
- Variant B: GPT-4 with prompt v2
Metric: User thumbs up rate
- A: 70% thumbs up (n=1,000)
- B: 75% thumbs up (n=1,000)
- Result: B wins, p≈0.01 (significant) → Ship prompt v2
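A two-proportion z-test is a reasonable default for a thumbs-up-rate comparison like this one; the sketch below uses statsmodels with hypothetical counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: thumbs-up out of total rated responses per variant.
successes = [750, 700]    # variant B, variant A
samples = [1000, 1000]

z_stat, p_value = proportions_ztest(count=successes, nobs=samples)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# Ship the winner only if p is below your pre-registered threshold (e.g., 0.05)
# and guardrail metrics (safety, latency, cost) have not regressed.
```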
Common Eval Mistakes
Anti-Patterns
❌ No Golden Dataset: Testing AI without reference examples → Fix: Curate 100+ examples with expected outputs
❌ Testing Only Happy Path: Ignoring edge cases and adversarial inputs → Fix: Make ~30% of the dataset edge cases
❌ Manual Eval Only: Reviewing outputs one-by-one (doesn't scale) → Fix: Automate with an LLM judge + spot-check with humans
❌ No Regression Detection: Shipping changes without comparing to baseline → Fix: Track metrics over time, alert on drops
❌ Vanity Metrics: Measuring things that don't correlate with user value → Fix: Eval what matters (task success, user satisfaction)
❌ Overfitting to Eval Set: Optimizing prompts specifically for test cases → Fix: Hold out a test set, regularly refresh with new examples
Eval Tooling
Open Source:
- OpenAI Evals - Framework for testing GPT models
- Prompt flow (Microsoft) - End-to-end eval pipeline
- Ragas - RAG evaluation framework
Commercial:
- LangSmith (LangChain) - Eval and tracing platform for LLM apps
- Weights & Biases - Experiment tracking
- Anthropic Console - Claude model evals
- HumanSignal - Human annotation platform
Related Skills
- /building-with-llms - Best practices for AI product development
- /ai-product-strategy - Strategic AI product decisions
- /testing-strategies - Traditional software testing
- /performance-optimization - Optimize AI latency/cost
Last Updated: 2026-01-22