# Evaluation Skill
Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
## Core Insight: The 95% Variance Finding
Research shows 95% of output variance comes from just two sources:

- **80%** from prompt tokens (wording, structure, examples)
- **15%** from random seed/sampling

Temperature, model version, and other factors account for only 5%.

**Implication**: Focus evaluation on prompt quality, not model tweaking.
## What's Included

### Examples (`examples/`)

- **Prompt comparison** - A/B testing prompts with rubrics
- **Model evaluation** - Comparing outputs across models
- **Regression testing** - Detecting output degradation
### Reference Guides (`reference/`)

- **Rubric design** - Multi-dimensional evaluation criteria
- **LLM-as-judge** - Using LLMs to evaluate LLM outputs
- **Statistical methods** - Handling non-determinism
### Templates (`templates/`)

- **Rubric templates** - Ready-to-use evaluation criteria
- **Judge prompts** - LLM-as-judge prompt templates
- **Test case format** - Structured test case templates
### Checklists (`checklists/`)

- **Evaluation setup** - Before running evaluations
- **Rubric validation** - Ensuring rubric quality
## Key Concepts

### Multi-Dimensional Rubrics
Don't use single scores. Break down evaluation into dimensions:
| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
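As a rough sketch of how these weights might be applied, the snippet below combines per-dimension scores (1-5) into one weighted result. The `RUBRIC_WEIGHTS` mapping mirrors the table above, and the `scores` dict stands in for a hypothetical judge output; neither is a fixed API.

```python
# Weighted rubric scoring: combine per-dimension scores (1-5) into one result.
# Weights mirror the table above; `scores` is a hypothetical judge output.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "conciseness": 0.15,
    "format": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted average of per-dimension scores on a 1-5 scale."""
    assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS)

# Example: strong on accuracy and format, weaker on conciseness
print(weighted_score({
    "accuracy": 5, "completeness": 4, "clarity": 4,
    "conciseness": 3, "format": 5,
}))  # -> 4.25
```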
### Handling Non-Determinism
LLMs are non-deterministic. Handle with:
**Strategy 1: Multiple Runs**
- Run the same prompt 3-5 times
- Report the mean and variance
- Flag high-variance cases

**Strategy 2: Seed Control**
- Set temperature=0 for reproducibility
- Document the seed for debugging
- Accept that some variation is normal

**Strategy 3: Statistical Significance**
- Use paired comparisons
- Require a 70%+ win rate before calling one prompt "better"
- Report confidence intervals
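The sketch below combines Strategy 1 and Strategy 3: repeat runs per test case, then do a paired comparison between two prompts. `run_eval` is a hypothetical placeholder for your model-plus-judge call, not part of this skill's tooling.

```python
import statistics

def run_eval(prompt: str, test_input: str) -> float:
    """Hypothetical placeholder: run the model + judge once, return a 1-5 score."""
    raise NotImplementedError

def score_with_variance(prompt: str, test_input: str, runs: int = 3) -> tuple[float, float]:
    """Strategy 1: repeat runs, report mean and standard deviation."""
    scores = [run_eval(prompt, test_input) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

def paired_win_rate(prompt_a: str, prompt_b: str, test_inputs: list[str]) -> float:
    """Strategy 3: paired comparison; require >= 0.70 before calling B 'better'."""
    wins = sum(run_eval(prompt_b, t) > run_eval(prompt_a, t) for t in test_inputs)
    return wins / len(test_inputs)
```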
### LLM-as-Judge Pattern
Use a judge LLM to evaluate outputs:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│  Judge LLM  │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │    Score    │
                                        └─────────────┘
```
**Best Practice**: Use a stronger model as the judge (e.g., Opus judges Sonnet).
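A minimal sketch of the judge step, assuming a hypothetical `call_model` wrapper around whatever LLM client you use; the judge prompt wording, the JSON reply format, and the `claude-opus` placeholder name are illustrative assumptions, not a prescribed interface.

```python
import json

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around your LLM client; returns the model's text reply."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are an evaluator. Score the output against the rubric.
Rubric:
{rubric}

Output to evaluate:
{output}

Reply with JSON: {{"scores": {{"<dimension>": <1-5>, ...}}, "rationale": "<short explanation>"}}"""

def judge(output: str, rubric: str, judge_model: str = "claude-opus") -> dict:
    """Ask a stronger judge model to score an output; returns parsed scores + rationale."""
    # judge_model is a placeholder name; substitute your actual judge model identifier.
    reply = call_model(judge_model, JUDGE_TEMPLATE.format(rubric=rubric, output=output))
    return json.loads(reply)
```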
### Test Case Design
Structure test cases with:
```typescript
interface TestCase {
  id: string
  input: string             // User message or context
  expectedBehavior: string  // What output should do
  rubric: RubricItem[]      // Evaluation criteria
  groundTruth?: string      // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}
```
## Evaluation Workflow

### Step 1: Define Rubric
```yaml
rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors, doesn't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
```
### Step 2: Create Test Cases
```yaml
test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }
```
### Step 3: Run Evaluation

```bash
# Run test suite
python evaluate.py --suite code-generation --runs 3
```

Output:

```
┌─────────────────────────────────────────────┐
│ Test Suite: code-generation                 │
│ Total: 50 | Pass: 47 | Fail: 3              │
│ Accuracy: 94% (±2.1%)                       │
│ Avg Score: 4.2/5.0                          │
└─────────────────────────────────────────────┘
```
### Step 4: Analyze Results

Look for:

- **Low-scoring dimensions** - Target for improvement
- **High-variance cases** - Prompt needs clarification
- **Regression from baseline** - Investigate changes
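A rough sketch of that analysis pass over per-case results, flagging the last two patterns above. The `results` record shape (id, per-run scores, baseline score) and the 0.5 variance threshold are assumptions to adapt to your own output format.

```python
import statistics

# Assumed result shape: {"id": str, "scores": [float, ...], "baseline": float}
def analyze(results: list[dict], variance_threshold: float = 0.5) -> dict:
    """Flag high-variance cases and regressions from baseline."""
    high_variance, regressions = [], []
    for r in results:
        mean = statistics.mean(r["scores"])
        if len(r["scores"]) > 1 and statistics.stdev(r["scores"]) > variance_threshold:
            high_variance.append(r["id"])  # prompt likely needs clarification
        if mean < r["baseline"]:
            regressions.append(r["id"])    # investigate what changed
    return {"high_variance": high_variance, "regressions": regressions}
```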
## Grey Haven Integration

### With TDD Workflow

1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
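Step 5 can be automated as a quality gate. Below is a pytest-style sketch, assuming baseline and latest mean scores are stored as JSON files keyed by test-case id; the file names and tolerance are illustrative, not a required layout.

```python
import json

# Assumed files: baseline.json and latest.json map test-case id -> mean score.
def test_no_regression_from_baseline(tolerance: float = 0.1):
    """Step 5 of the TDD loop: new scores must be >= baseline (within a small tolerance)."""
    baseline = json.load(open("baseline.json"))
    latest = json.load(open("latest.json"))
    regressions = {
        case_id: (baseline[case_id], latest[case_id])
        for case_id in baseline
        if latest.get(case_id, 0.0) < baseline[case_id] - tolerance
    }
    assert not regressions, f"Score regressions vs baseline: {regressions}"
```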
### With Pipeline Architecture

```
acquire → prepare → process → parse → render → EVALUATE
                                                   │
                                           ┌───────┴───────┐
                                           │  Compare to   │
                                           │  ground truth │
                                           │  or rubric    │
                                           └───────────────┘
```
### With Prompt Engineering

```
Current prompt   → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt       → Evaluate → Score: 4.1 ✓
```
## Use This Skill When

- Testing new prompts before production
- Comparing prompt variations (A/B testing)
- Validating model outputs meet quality bar
- Detecting regressions after changes
- Building evaluation datasets
- Implementing automated quality gates
## Related Skills

- **prompt-engineering** - Improve prompts based on evaluation
- **testing-strategy** - Overall testing approaches
- **llm-project-development** - Pipeline with evaluation stage
## Quick Start

```bash
# Design your rubric
cat templates/rubric-template.yaml

# Create test cases
cat templates/test-case-template.yaml

# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md

# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md
```
**Skill Version**: 1.0 | **Key Finding**: 95% variance from prompts (80%) + sampling (15%) | **Last Updated**: 2025-01-15