# LLM Judge Skill

Compare code implementations across 2+ repositories using structured evaluation.
## Overview

This skill implements a two-phase LLM-as-judge evaluation:

- Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts.
- Phase 2: Judging - Parallel judges score each dimension using consistent rubrics.
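
End to end, the flow reduces to a short skeleton. The helper names here are illustrative, not part of the skill itself; each one is sketched in its section below:

```python
def run_llm_judge(repos: dict[str, str], spec: str) -> list[tuple[str, float]]:
    """repos maps a repo label to its local path; spec is the spec document text."""
    all_facts = gather_facts(repos, spec)        # Phase 1: one agent per repo
    judge_scores = run_judges(all_facts, spec)   # Phase 2: one judge per dimension
    ranking = aggregate(judge_scores, DEFAULT_WEIGHTS)
    write_report(ranking)
    return ranking
```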
## Reference Files

| File | Purpose |
|------|---------|
| `references/fact-schema.md` | JSON schema for Phase 1 facts |
| `references/scoring-rubrics.md` | Detailed rubrics for each dimension |
| `references/repo-agent.md` | Instructions for Phase 1 agents |
| `references/judge-agents.md` | Instructions for Phase 2 judges |
## Scoring Dimensions

| Dimension | Default Weight | Evaluates |
|-----------|----------------|-----------|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
## Scoring Scale

| Score | Meaning |
|-------|---------|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
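
These defaults translate directly into constants. A minimal sketch in Python (the names are illustrative, not part of the skill's interface):

```python
# Default dimension weights, in percent; they must sum to 100.
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "test_quality": 20,
    "overengineering": 15,
    "dead_code": 10,
}

# Judges score each dimension on the 1-5 scale above.
VALID_SCORES = range(1, 6)

assert sum(DEFAULT_WEIGHTS.values()) == 100
```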
## Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

> You are a Phase 1 Repo Agent for the LLM Judge evaluation.
>
> Your Repo: $REPO_LABEL at $REPO_PATH
> Spec Document: $SPEC_CONTENT
>
> Instructions: Read @beagle:llm-judge references/repo-agent.md
>
> Gather facts and return a JSON object following the schema in references/fact-schema.md.
>
> Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
>
> Return ONLY valid JSON, no markdown or explanations.
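
A sketch of the parallel fan-out, assuming a hypothetical `spawn_task(prompt)` helper that runs one Task agent to completion and returns its text reply; the real spawning mechanism belongs to the host runtime:

```python
import json
import string
from concurrent.futures import ThreadPoolExecutor

def spawn_task(prompt: str) -> str:
    """Hypothetical: run one Task agent and return its raw text reply."""
    raise NotImplementedError("depends on the host agent runtime")

REPO_AGENT_PROMPT = string.Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n"
    "Your Repo: $repo_label at $repo_path\n"
    "Spec Document: $spec\n"
    "Instructions: Read @beagle:llm-judge references/repo-agent.md\n"
    "Gather facts and return a JSON object following the schema in "
    "references/fact-schema.md.\n"
    "Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.\n"
    "Return ONLY valid JSON, no markdown or explanations."
)

def gather_facts(repos: dict[str, str], spec: str) -> dict[str, dict]:
    """Run one repo agent per repository in parallel; return label -> facts."""
    prompts = {
        label: REPO_AGENT_PROMPT.substitute(repo_label=label, repo_path=path, spec=spec)
        for label, path in repos.items()
    }
    with ThreadPoolExecutor() as pool:
        replies = dict(zip(prompts, pool.map(spawn_task, prompts.values())))
    return {label: json.loads(reply) for label, reply in replies.items()}
```

If an agent wraps its JSON in markdown fences despite the instruction, the defensive parser shown in the Phase 2 section can guard this decode as well.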
## Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):

> You are the $DIMENSION Judge for the LLM Judge evaluation.
>
> Spec Document: $SPEC_CONTENT
> Facts from all repos: $ALL_FACTS_JSON
>
> Instructions: Read @beagle:llm-judge references/judge-agents.md
>
> Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
>
> Return ONLY valid JSON following the judge output schema.
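
The same fan-out pattern applies per dimension. Judges are told to return bare JSON, but models sometimes wrap replies in fences anyway, so a defensive decode is cheap insurance. This sketch reuses the hypothetical `spawn_task` and `DEFAULT_WEIGHTS` from the earlier snippets, and assumes the judge output is a JSON object keyed by repo label (the real schema lives in references/judge-agents.md):

```python
import json
import re
import string
from concurrent.futures import ThreadPoolExecutor

JUDGE_PROMPT = string.Template(
    "You are the $dimension Judge for the LLM Judge evaluation.\n"
    "Spec Document: $spec\n"
    "Facts from all repos: $all_facts_json\n"
    "Instructions: Read @beagle:llm-judge references/judge-agents.md\n"
    "Score each repo on $dimension using the rubric in references/scoring-rubrics.md.\n"
    "Return ONLY valid JSON following the judge output schema."
)

def parse_judge_json(reply: str) -> dict:
    """Decode a judge reply, tolerating an optional ```json ... ``` wrapper."""
    text = reply.strip()
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    return json.loads(text)

def run_judges(all_facts: dict, spec: str) -> dict[str, dict]:
    """One judge per dimension, run in parallel; returns dimension -> parsed JSON."""
    prompts = {
        dim: JUDGE_PROMPT.substitute(dimension=dim, spec=spec,
                                     all_facts_json=json.dumps(all_facts))
        for dim in DEFAULT_WEIGHTS
    }
    with ThreadPoolExecutor() as pool:
        replies = dict(zip(prompts, pool.map(spawn_task, prompts.values())))
    return {dim: parse_judge_json(reply) for dim, reply in replies.items()}
```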
## Aggregation

After Phase 2 completes:

1. Collect scores from all 5 judges.
2. For each repo, compute the weighted total: `weighted_total = sum(score[dim] * weight[dim]) / 100`. Because the weights are percentages summing to 100, the result stays on the 1-5 scale.
3. Rank repos by weighted total, descending.
4. Generate a verdict explaining the ranking (a sketch of steps 2-3 follows this list).
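
A minimal sketch of steps 2-3, assuming each judge's output reduces to a flat mapping from repo label to an integer score:

```python
def aggregate(judge_scores: dict[str, dict[str, int]],
              weights: dict[str, int]) -> list[tuple[str, float]]:
    """judge_scores: dimension -> {repo_label: score on the 1-5 scale}.
    Returns (repo_label, weighted_total) pairs, best first.
    """
    repos = next(iter(judge_scores.values()))
    totals = {
        repo: sum(judge_scores[dim][repo] * weights[dim] for dim in weights) / 100
        for repo in repos
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For example, a repo scoring 4 on every dimension gets (4·30 + 4·25 + 4·20 + 4·15 + 4·10) / 100 = 4.0.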
## Output

Write results to `.beagle/llm-judge-report.json` and display a markdown summary.
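
A sketch of the final write, with an illustrative report layout; the real report may also carry per-dimension scores and judge reasoning:

```python
import json
from pathlib import Path

def write_report(ranking: list[tuple[str, float]],
                 path: str = ".beagle/llm-judge-report.json") -> None:
    """Persist the ranking as JSON, then print a markdown summary table."""
    report = {"ranking": [{"repo": repo, "weighted_total": total}
                          for repo, total in ranking]}
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(report, indent=2))

    lines = ["| Rank | Repo | Weighted Total |", "|---|---|---|"]
    lines += [f"| {i} | {repo} | {total:.2f} |"
              for i, (repo, total) in enumerate(ranking, 1)]
    print("\n".join(lines))
```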
## Dependencies

- @beagle:llm-artifacts-detection: reused by repo agents for dead code/overengineering analysis