llm-judge

Compare code implementations across 2+ repositories using structured evaluation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "llm-judge" with this command: npx skills add existential-birds/beagle/existential-birds-beagle-llm-judge

LLM Judge Skill

Overview

This skill implements a two-phase LLM-as-judge evaluation:

  • Phase 1: Fact Gathering - Parallel agents explore each repo and extract structured facts

  • Phase 2: Judging - Parallel judges score each dimension using consistent rubrics
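The two-phase flow can be pictured as a fan-out, a barrier, and a second fan-out. The following is only a minimal sketch: `gather_facts` and `judge_dimension` are hypothetical stand-ins for the real agents (which this skill spawns as Task agents, not Python functions), and the placeholder score of 3 is not real output.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["Functionality", "Security", "Test Quality", "Overengineering", "Dead Code"]


def gather_facts(repo_label: str) -> dict:
    """Hypothetical stand-in for a Phase 1 repo agent."""
    return {"repo": repo_label, "facts": {}}


def judge_dimension(dimension: str, all_facts: list[dict]) -> dict:
    """Hypothetical stand-in for a Phase 2 judge; returns a placeholder score per repo."""
    return {"dimension": dimension, "scores": {f["repo"]: 3 for f in all_facts}}


def run_evaluation(repo_labels: list[str]) -> list[dict]:
    with ThreadPoolExecutor() as pool:
        # Phase 1: one agent per repo, run in parallel.
        all_facts = list(pool.map(gather_facts, repo_labels))
        # Phase 2 starts only after every Phase 1 result has been collected.
        verdicts = list(pool.map(lambda d: judge_dimension(d, all_facts), DIMENSIONS))
    return verdicts
```

Every judge receives the facts for all repos, so each dimension is scored against the same evidence.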

Reference Files

| File | Purpose |
| --- | --- |
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

Scoring Dimensions

| Dimension | Default Weight | Evaluates |
| --- | --- | --- |
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |

Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

Your Repo: $REPO_LABEL at $REPO_PATH

Spec Document: $SPEC_CONTENT

Instructions: Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
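One way to fill the `$`-style placeholders in the prompt above and enforce the "ONLY valid JSON" rule is sketched below. It assumes Python's `string.Template` is an acceptable substitution mechanism; `build_repo_prompt` and `parse_agent_reply` are hypothetical helper names, not part of the skill.

```python
import json
from string import Template

# Prompt text copied from this section; $NAME placeholders match string.Template syntax.
PHASE1_PROMPT = Template(
    "You are a Phase 1 Repo Agent for the LLM Judge evaluation.\n\n"
    "Your Repo: $REPO_LABEL at $REPO_PATH\n"
    "Spec Document: $SPEC_CONTENT\n\n"
    "Instructions: Read @beagle:llm-judge references/repo-agent.md\n\n"
    "Gather facts and return a JSON object following the schema in "
    "references/fact-schema.md.\n\n"
    "Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.\n\n"
    "Return ONLY valid JSON, no markdown or explanations."
)


def build_repo_prompt(repo_label: str, repo_path: str, spec_content: str) -> str:
    """Hypothetical helper: substitute() raises KeyError if a placeholder is missing."""
    return PHASE1_PROMPT.substitute(
        REPO_LABEL=repo_label, REPO_PATH=repo_path, SPEC_CONTENT=spec_content
    )


def parse_agent_reply(raw: str) -> dict:
    """Hypothetical helper: reject anything that is not a bare JSON object."""
    facts = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(facts, dict):
        raise ValueError("Phase 1 agent must return a JSON object")
    return facts
```

Parsing strictly at this boundary means a reply wrapped in markdown or prose fails fast instead of leaking into Phase 2.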

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents (one per dimension):

You are the $DIMENSION Judge for the LLM Judge evaluation.

Spec Document: $SPEC_CONTENT

Facts from all repos: $ALL_FACTS_JSON

Instructions: Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.
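Judge replies can be checked against the 1 to 5 scoring scale before aggregation. The judge output schema lives in references/judge-agents.md and is not shown here, so this sketch assumes a hypothetical shape of `{"dimension": ..., "scores": {repo_label: int}}`; adjust it to the real schema.

```python
import json


def parse_judge_reply(raw: str, expected_repos: set[str]) -> dict[str, int]:
    """Validate one judge's reply under the ASSUMED shape described above."""
    data = json.loads(raw)
    scores = data["scores"]

    # Every repo must be scored by every judge.
    missing = expected_repos - set(scores)
    if missing:
        raise ValueError(f"judge did not score: {sorted(missing)}")

    for repo, score in scores.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"score for {repo} is not an integer: {score!r}")
        if not 1 <= score <= 5:
            raise ValueError(f"score for {repo} is outside 1-5: {score}")
    return dict(scores)
```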

Aggregation

After Phase 2 completes:

  • Collect scores from all 5 judges

  • For each repo, compute the weighted total, with weights expressed as percentages that sum to 100: weighted_total = sum(score[dim] * weight[dim]) / 100

  • Rank repos by weighted total (descending)

  • Generate verdict explaining the ranking
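The aggregation steps above can be sketched as follows, using the default weights from the Scoring Dimensions table. The function names and the sample scores are illustrative assumptions, not output from a real run.

```python
# Default weights (percent) from the Scoring Dimensions table.
WEIGHTS = {
    "Functionality": 30,
    "Security": 25,
    "Test Quality": 20,
    "Overengineering": 15,
    "Dead Code": 10,
}


def weighted_total(scores: dict[str, int], weights: dict[str, int] = WEIGHTS) -> float:
    """weighted_total = sum(score[dim] * weight[dim]) / 100"""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(scores[dim] * weight for dim, weight in weights.items()) / 100


def rank_repos(all_scores: dict[str, dict[str, int]]) -> list[tuple[str, float]]:
    """Return (repo, weighted_total) pairs, highest total first."""
    totals = {repo: weighted_total(scores) for repo, scores in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two hypothetical repos.
sample = {
    "repo-b": {"Functionality": 3, "Security": 5, "Test Quality": 4,
               "Overengineering": 2, "Dead Code": 5},
    "repo-a": {"Functionality": 5, "Security": 4, "Test Quality": 3,
               "Overengineering": 4, "Dead Code": 2},
}
ranking = rank_repos(sample)  # [("repo-a", 3.9), ("repo-b", 3.75)]
```

Because every score lies between 1 and 5 and the weights sum to 100, the weighted total also lies between 1 and 5, so it can be read on the same scale as the per-dimension scores.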

Output

Write the results to .beagle/llm-judge-report.json and display a markdown summary.
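A minimal sketch of the output step follows. Only the report path comes from this document; the report's field names (`ranking`, `weighted_total`, `verdict`) and both helper functions are assumptions for illustration.

```python
import json
from pathlib import Path


def write_report(ranking: list[tuple[str, float]], verdict: str, root: str = ".") -> Path:
    """Write the JSON report under <root>/.beagle/ (field names are assumed)."""
    report_path = Path(root) / ".beagle" / "llm-judge-report.json"
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report = {
        "ranking": [{"repo": repo, "weighted_total": total} for repo, total in ranking],
        "verdict": verdict,
    }
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report_path


def markdown_summary(ranking: list[tuple[str, float]], verdict: str) -> str:
    """Render the ranking as a small markdown table followed by the verdict."""
    lines = ["| Rank | Repo | Weighted total |", "| --- | --- | --- |"]
    for rank, (repo, total) in enumerate(ranking, start=1):
        lines.append(f"| {rank} | {repo} | {total:.2f} |")
    lines.extend(["", verdict])
    return "\n".join(lines)
```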

Dependencies

  • @beagle:llm-artifacts-detection: reused by repo agents for dead code and overengineering analysis

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
