Add-on: LLM Judge Evals

Use this skill when you need qualitative evaluation (clarity, domain fit, UX coherence, docs quality) in addition to deterministic checks.

Compatibility

Works with all stacks.
Best paired with addon-deterministic-eval-suite.

Inputs

Collect:

JUDGE_BACKEND: auto | langchain | google-adk (default auto).
JUDGE_MODEL: model id to run scoring.
JUDGE_TIMEOUT_SECONDS: default 60.
JUDGE_MAX_RETRIES: default 2.
JUDGE_TEMPERATURE: default 0.
JUDGE_FAIL_ON_BACKEND_MISMATCH: yes | no (default yes).
JUDGE_RUBRIC_MODE: product | security | developer-experience | custom.
PASS_THRESHOLD: default 0.75.
BLOCK_ON_JUDGE_FAIL: yes | no (default no).

Integration Workflow

Add judge artifacts:

config/skill_manifest.json
evals/judge/rubric.md
evals/judge/cases/
scripts/evals/run_llm_judge.py
.github/workflows/evals-judge.yml
REVIEW_BUNDLE/JUDGE_REPORT.md

Copy and adapt this skill's bundled starter script:
scripts/run_llm_judge.py
Place the adapted result in the target project at scripts/evals/run_llm_judge.py.

Define rubric:

scoring categories and weights
failure reasons template
required evidence links (files/lines/commands)

Execute judge run:

evaluate generated files against rubric per scenario
resolve backend from config/skill_manifest.json plus judge inputs
use a single adapter boundary for backend-specific scoring
store structured JSON + markdown summary
replace the bundled starter template's placeholder reporting with a real project-local backend adapter before treating judge scores as authoritative

Merge policy:

default advisory (BLOCK_ON_JUDGE_FAIL=no)
blocking only when explicitly configured

Backend Resolution Contract

scripts/evals/run_llm_judge.py must read config/skill_manifest.json as the source of truth for selected skills and declared judge capabilities.
The manifest should include:

{
  "base_skill": "architect-python-uv-fastapi-sqlalchemy",
  "addons": [
    "addon-deterministic-eval-suite",
    "addon-llm-judge-evals",
    "addon-langchain-llm"
  ],
  "capabilities": {
    "judge_backends": ["langchain"]
  }
}

Resolution order:
- If JUDGE_BACKEND != auto, use the requested backend only if the matching addon is present in the manifest.
- If JUDGE_BACKEND=auto and only addon-langchain-llm is present, use langchain.
- If JUDGE_BACKEND=auto and only addon-google-agent-dev-kit is present, use google-adk.
- If both addons are present, fail and require explicit JUDGE_BACKEND.
- If neither addon is present, fail with an explicit unsupported configuration error.
Model resolution:
- JUDGE_MODEL wins when set.
- For langchain, fall back to DEFAULT_MODEL.
- For google-adk, fall back to ADK_DEFAULT_MODEL.
The judge runner should expose a stable adapter interface (for example JudgeBackend.score(prompt)) so rubric logic, thresholding, and report generation stay backend-agnostic.

Required Template

`evals/judge/rubric.md`

# Judge Rubric

- Technical coherence (0-1)
- Requirement coverage (0-1)
- Domain language alignment (0-1)
- UX quality and states (0-1)
- Documentation clarity (0-1)

Pass threshold: 0.75

Guardrails

Documentation contract for generated code:
- Python: write module docstrings and docstrings for public classes, methods, and functions.
- Next.js/TypeScript: write JSDoc for exported components, hooks, utilities, and route handlers.
- Add concise rationale comments only for non-obvious logic, invariants, or safety constraints.
- Apply this contract even when using template snippets below; expand templates as needed.
Never replace deterministic gates with judge scores.
Keep prompts/rubrics versioned in repo for auditability.
Record model/version and timestamp for each run.
Surface uncertainty as explicit notes, not silent pass.
Do not infer judge backend from incidental files or imports; use the manifest and explicit inputs.
If multiple LLM-capable addons are installed, do not guess. Require an explicit JUDGE_BACKEND.

Validation Checklist

Confirm generated code includes required docstrings/JSDoc and rationale comments for non-obvious logic.

test -f evals/judge/rubric.md
test -f scripts/evals/run_llm_judge.py
test -f .github/workflows/evals-judge.yml
test -f REVIEW_BUNDLE/JUDGE_REPORT.md || true

Decision Justification Rule

Every non-trivial decision must include a concrete justification.
Capture the alternatives considered and why they were rejected.
State tradeoffs and residual risks for the chosen option.
If justification is missing, treat the task as incomplete and surface it as a blocker.

addon-llm-judge-evals

Safety Notice

Copy this and send it to your AI assistant to learn

Add-on: LLM Judge Evals

Compatibility

Inputs

Integration Workflow

Backend Resolution Contract

Required Template

`evals/judge/rubric.md`

Guardrails

Validation Checklist

Decision Justification Rule

Source Transparency

Related Skills

addon-rag-ingestion-pipeline

addon-llm-ancient-greek-translation

addon-nostr-nip-profile-selector

addon-deterministic-eval-suite