Validate Evaluator
Calibrate an LLM judge against human judgment.
Overview
- Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
- Run judge on dev set and measure TPR/TNR
- Iterate on the judge until TPR and TNR > 90% on dev set
- Run once on held-out test set for final TPR/TNR
- Apply bias correction formula to production data
Prerequisites
- A built LLM judge prompt (from write-judge-prompt)
- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
- Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
- Labels must come from a domain expert, not outsourced annotators
- Candidate few-shot examples from your labeled data
Core Instructions
Step 1: Create Data Splits
Split human-labeled data into three disjoint sets:
| Split | Size | Purpose | Rules |
|---|---|---|---|
| Training | 10-20% (~10-20 examples) | Source of few-shot examples for the judge prompt | Only clear-cut Pass and Fail cases. Used directly in the prompt. |
| Dev | 40-45% (~40-45 examples) | Iterative evaluator refinement | Never include in the prompt. Evaluate against repeatedly. |
| Test | 40-45% (~40-45 examples) | Final unbiased accuracy measurement | Do NOT look at during development. Used once at the end. |
Target: 30-50 examples of each class (Pass and Fail) across dev and test combined. Use balanced splits even if real-world prevalence is skewed — you need enough Fail examples to measure TNR reliably.
```python
from sklearn.model_selection import train_test_split

# First split: separate test set
train_dev, test = train_test_split(
    labeled_data, test_size=0.4, stratify=labeled_data['label'], random_state=42
)

# Second split: separate training examples from dev set
train, dev = train_test_split(
    train_dev, test_size=0.75, stratify=train_dev['label'], random_state=42
)

# Result: ~15% train, ~45% dev, ~40% test
```
Step 2: Run Evaluator on Dev Set
Run the judge on every example in the dev set. Compare predictions to human labels.
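A minimal driver loop for this step, assuming your judge is wrapped in a callable that takes a trace and returns 'Pass' or 'Fail' (the function and field names here are illustrative, not a specific API):

```python
def evaluate_dev_set(dev_examples, run_judge):
    """Run the judge on each dev example; return (human_labels, judge_labels).

    dev_examples: list of dicts with 'trace' and 'label' keys.
    run_judge: callable taking a trace and returning 'Pass' or 'Fail'
               (your LLM judge call goes here).
    """
    human_labels, judge_labels = [], []
    for ex in dev_examples:
        human_labels.append(ex['label'])
        judge_labels.append(run_judge(ex['trace']))
    return human_labels, judge_labels
```

Collecting both label lists in parallel makes the confusion-matrix computation in Step 3 a one-liner.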
Step 3: Measure TPR and TNR
TPR (True Positive Rate): When a human says Pass, how often does the judge also say Pass?
TPR = (judge says Pass AND human says Pass) / (human says Pass)
TNR (True Negative Rate): When a human says Fail, how often does the judge also say Fail?
TNR = (judge says Fail AND human says Fail) / (human says Fail)
```python
from sklearn.metrics import confusion_matrix

# 'Pass' is the positive class, 'Fail' the negative class
tn, fp, fn, tp = confusion_matrix(
    human_labels, evaluator_labels, labels=['Fail', 'Pass']
).ravel()
tpr = tp / (tp + fn)  # agreement on human-labeled Pass
tnr = tn / (tn + fp)  # agreement on human-labeled Fail
```
Use TPR/TNR, not Precision/Recall or raw accuracy. These two metrics directly map to the bias correction formula. Use Cohen's Kappa only for measuring agreement between two human annotators, not for judge-vs-ground-truth.
Step 4: Inspect Disagreements
Examine every case where the judge disagrees with human labels:
| Disagreement Type | Judge | Human | Fix |
|---|---|---|---|
| False Pass | Pass | Fail | Judge is too lenient. Strengthen Fail definitions or add edge-case examples. |
| False Fail | Fail | Pass | Judge is too strict. Clarify Pass definitions or adjust examples. |
For each disagreement, determine whether to:
- Clarify wording in the judge prompt
- Swap or add few-shot examples from the training set
- Add explicit rules for the edge case
- Split the criterion into more specific sub-checks
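The triage above can be sketched as a small helper that buckets dev-set mismatches for manual review (the function name is ours; labels follow the Pass/Fail convention used throughout):

```python
def bucket_disagreements(human_labels, judge_labels, traces):
    """Group judge/human mismatches into false passes (judge too lenient)
    and false fails (judge too strict) for manual inspection."""
    false_passes, false_fails = [], []
    for human, judge, trace in zip(human_labels, judge_labels, traces):
        if judge == 'Pass' and human == 'Fail':
            false_passes.append(trace)  # strengthen Fail definitions
        elif judge == 'Fail' and human == 'Pass':
            false_fails.append(trace)   # clarify Pass definitions
    return false_passes, false_fails
```

Reviewing each bucket separately keeps the fix targeted: false passes point at missing Fail rules, false fails at overly strict ones.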
Step 5: Iterate
Refine the judge prompt and re-run on the dev set. Repeat until TPR and TNR stabilize.
Stopping criteria:
- Target: TPR > 90% AND TNR > 90%
- Minimum acceptable: TPR > 80% AND TNR > 80%
If alignment stalls:
| Problem | Solution |
|---|---|
| TPR and TNR both low | Use a more capable LLM for the judge |
| One metric low, one acceptable | Inspect disagreements for the low metric specifically |
| Both plateau below target | Decompose the criterion into smaller, more atomic checks |
| Consistently wrong on certain input types | Add targeted few-shot examples from training set |
| Labels themselves seem inconsistent | Re-examine human labels; the rubric may need refinement |
Step 6: Final Measurement on Test Set
Run the judge exactly once on the held-out test set. Record final TPR and TNR.
Do not iterate after seeing test set results. Go back to step 4 with new dev data if needed.
Step 7 (Optional): Estimate True Success Rate (Rogan-Gladen Correction)
Raw judge scores on unlabeled production data are biased. If you need an accurate aggregate pass rate, correct for known judge errors:
theta_hat = (p_obs + TNR - 1) / (TPR + TNR - 1)
Where:
- p_obs = fraction of unlabeled traces the judge scored as Pass
- TPR, TNR = from the test set measurement
- theta_hat = corrected estimate of the true success rate
Clip to [0, 1]. Invalid when TPR + TNR - 1 is near 0 (judge is no better than random).
Example:
- Judge TPR = 0.92, TNR = 0.88
- 500 production traces: 400 scored Pass -> p_obs = 0.80
- theta_hat = (0.80 + 0.88 - 1) / (0.92 + 0.88 - 1) = 0.68 / 0.80 = 0.85
- True success rate is ~85%, not the raw 80%
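The correction and its guard conditions fit in a few lines (a sketch; the function name is ours):

```python
def rogan_gladen(p_obs, tpr, tnr, min_denom=1e-6):
    """Corrected true success rate from the observed judge pass rate."""
    denom = tpr + tnr - 1
    if abs(denom) < min_denom:
        # Judge is no better than random; the correction is undefined
        raise ValueError("TPR + TNR - 1 is near 0; correction invalid.")
    theta = (p_obs + tnr - 1) / denom
    return min(max(theta, 0.0), 1.0)  # clip to [0, 1]
```

Applied to the example above, rogan_gladen(0.80, 0.92, 0.88) returns 0.85.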
Step 8: Confidence Interval
Compute a bootstrap confidence interval. A point estimate alone is not enough.
```python
import numpy as np

def bootstrap_ci(human_labels, eval_labels, p_obs, n_bootstrap=2000):
    """Bootstrap 95% CI for the corrected success rate."""
    n = len(human_labels)
    h_all = np.array(human_labels)
    e_all = np.array(eval_labels)
    estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(n, size=n, replace=True)
        h = h_all[idx]
        e = e_all[idx]
        tp = ((h == 'Pass') & (e == 'Pass')).sum()
        fn = ((h == 'Pass') & (e == 'Fail')).sum()
        tn = ((h == 'Fail') & (e == 'Fail')).sum()
        fp = ((h == 'Fail') & (e == 'Pass')).sum()
        tpr_b = tp / (tp + fn) if (tp + fn) > 0 else 0
        tnr_b = tn / (tn + fp) if (tn + fp) > 0 else 0
        denom = tpr_b + tnr_b - 1
        if abs(denom) < 1e-6:
            continue  # judge no better than random in this resample
        theta = (p_obs + tnr_b - 1) / denom
        estimates.append(np.clip(theta, 0, 1))
    return np.percentile(estimates, 2.5), np.percentile(estimates, 97.5)

lower, upper = bootstrap_ci(test_human, test_eval, p_obs=0.80)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```
Or use judgy (pip install judgy):
```python
from judgy import estimate_success_rate

result = estimate_success_rate(
    human_labels=test_human_labels,
    evaluator_labels=test_eval_labels,
    unlabeled_labels=prod_eval_labels
)
print(f"Corrected rate: {result.estimate:.2f}")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
```
Practical Guidance
- Pin exact model versions for LLM judges (e.g., gpt-4o-2024-05-13, not gpt-4o). Providers update models without notice, causing silent drift.
- Re-validate after changing the judge prompt, switching models, or when production confidence intervals widen unexpectedly.
- Use ~100 labeled examples (50 Pass, 50 Fail). Below 60, confidence intervals become wide.
- One trusted domain expert is the most efficient labeling path. If not feasible, have two annotators label 20-50 traces independently and resolve disagreements before proceeding.
- Improving TPR narrows the confidence interval more than improving TNR. The correction formula divides by (TPR + TNR - 1), and when most production traces pass, the estimate is most sensitive to TPR, so a noisy TPR measurement is amplified into a wide CI.
Anti-Patterns
- Assuming judges "just work" without validation. A judge may consistently miss failures or flag passing traces.
- Using raw accuracy or percent agreement. Use TPR and TNR. With class imbalance, raw accuracy is misleading.
- Dev/test examples as few-shot examples. This is data leakage.
- Reporting dev set performance as final accuracy. Dev numbers are optimistic. The test set gives the unbiased estimate.
- Raw judge scores without bias correction. If you report an aggregate pass rate, apply the Rogan-Gladen formula (Step 7).
- Point estimates without confidence intervals. A corrected rate of 85% could easily be 78-92% with small test sets. Report the range so stakeholders know how much to trust the number.