judge

Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "judge" with this command: npx skills add 2389-research/claude-plugins/2389-research-claude-plugins-judge

Speed-Run Judge

Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.

Terminology: This skill uses "impl" but works for both:

  • Showdown: runner-1, runner-2, runner-3 (same design, different implementations)

  • Any%: variant-a, variant-b (different approaches/designs)

REQUIRED OUTPUT FORMAT

You MUST produce this exact structure. Do not summarize or abbreviate.

Gate Check

ImplTests PassDesign Adherence
impl-1X/X orYes/No
impl-2X/X orYes/No

Feasibility Check

ImplStatusNotes
impl-1OK / FlagDetails
impl-2OK / FlagDetails

Scoring Worksheet

impl-1

Fitness for Purpose (Does it solve the actual problem?)

Functional requirements:

  • Primary use case works end-to-end?
  • All explicitly stated requirements implemented?
  • Handles realistic scenarios, not just happy path?

User needs (beyond literal requirements):

  • Would the user actually use this, or just demo it?
  • Does it solve the real problem, not just the literal request?
  • Does deployment/distribution match stated needs?

Future considerations (if relevant):

  • If growth/scaling mentioned, does architecture support it?
  • If team/collaboration mentioned, is it maintainable by others?

Checklist: _/8 YES → Score: _/5 (7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1) Note: Not all items apply to every project. Score based on relevant items.

Justified Complexity (Every line earning its keep?)

  • Unnecessary abstractions: ___
  • Dead code: ___
  • Bloat estimate: ___%

Line count comparison (if multiple impls):

  • This impl: ___ lines
  • Smallest impl: ___ lines
  • Extra lines justified by: ___

Score: _/5 (5=minimal, 4=slight bloat <10%, 3=10-25% bloat, 2=25-50%, 1=>50%)

Readability (Understand core flow in 5 min?) Violations:

  • Single-letter vars (not loop index): +1 each = __
  • Functions >50 lines: +1 each = __
  • Nesting >3 levels: +1 each = __
  • Magic numbers: +1 each = __
  • Bad function names: +1 each = __ Total violations: __ → Score: _/5 (0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1)

Robustness & Scale (Handles unexpected + growth?)

  • Input validation?
  • External call error handling?
  • Useful error messages?
  • Null/empty handling?
  • Async timeouts?
  • No unbounded loops?
  • O(n log n) or better?
  • Bounded memory?
  • Queries paginated?
  • No blocking I/O in hot path?
  • Backoff/retry logic?
  • Handles 10x load? Checklist: _/12 YES + feasibility flags → Score: _/5 (11-12 + no flags=5, 9-10 or minor flag=4, 7-8=3, 5-6 or major flag=2, <5 or critical flag=1)

Maintainability (Pain of next change?)

  • Single responsibility per function?
  • Explicit dependencies (no globals)?
  • Business logic separated from infra?
  • New feature = ≤3 files changed?
  • Config externalized?
  • Tests catch regressions? Checklist: _/6 YES → Score: _/5 (6=5, 5=4, 4=3, 2-3=2, 0-1=1)

impl-2

[REPEAT SAME FORMAT]

impl-3 (if applicable)

[REPEAT SAME FORMAT]

Speed-Run Metrics

ImplHosted LLM CallsFix CyclesGeneration Time (ms)
impl-1
impl-2

Judge Scorecard

Criterionimpl-1impl-2impl-3Best
Fitness for Purpose
Justified Complexity
Readability
Robustness & Scale
Maintainability
TOTAL/25/25/25

Hard Gates

GateResult
Fitness Gate (Δ ≥ 2)Triggered/Not triggered
Critical Flaw (any = 1)Triggered/Not triggered

Winner Selection

Winner: impl-X (Score: __/25)

Selection rationale: [2-3 sentences explaining WHY this implementation won]

Trade-offs acknowledged: [What the other implementations did better]

Token efficiency note: [How the winner used hosted LLM - fewer fix cycles, better prompts, etc.]

Scoring Reference

Scores Meaning

Score Meaning

5 Excellent - exceeds expectations

4 Good - fully meets requirements

3 Adequate - core works, some gaps

2 Poor - significant issues

1 Critical flaw - disqualifying

Hard Gates (Automatic)

  • Fitness Gate: If Fitness Δ ≥ 2 between impls → Higher fitness WINS immediately

  • Critical Flaw: If ANY criterion = 1 → That impl is ELIMINATED

Fitness Gate Interpretation

The Fitness Gate triggers the same way in both contexts, but means different things:

Context What Fitness Δ ≥ 2 Means

Showdown One runner deviated from or misunderstood the design. All runners should have similar Fitness since they're implementing the same spec. A large gap is a red flag.

Any% One approach genuinely solves the problem better. Different approaches can legitimately have different Fitness. A large gap means one approach is clearly superior.

In both cases, higher Fitness wins. The interpretation just explains why the gap exists.

Tiebreaker: Speed-Run Efficiency

When total scores are tied or within 1 point, prefer the implementation that:

  • Used fewer hosted LLM fix cycles (cleaner contract prompts)

  • Had fewer total hosted LLM calls (better task decomposition)

  • Generated code faster (simpler, more focused prompts)

This rewards better use of the speed-run pipeline, not just code quality.

Feasibility Red Flags

Check before scoring:

  • O(n²) or worse on unbounded data

  • Unbounded memory growth

  • Self-DDoS patterns (polling, no backoff)

  • Missing pagination

  • Blocking I/O in hot path

  • No error recovery

Process

  • Read all implementation code (should already be in context)

  • Fill out the worksheet for EACH implementation - do not skip sections

  • Fill out the Speed-Run Metrics table

  • Check hard gates

  • Announce winner with rationale (include token efficiency note)

CRITICAL: Use integer scores only (1-5). Do not use half points like 4.5.

CRITICAL: Fill out every checkbox. Do not summarize or abbreviate the worksheet.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Research

binary-re:static-analysis

No summary provided by upstream source.

Repository SourceNeeds Review
Research

binary-re:triage

No summary provided by upstream source.

Repository SourceNeeds Review
Research

omakase-off

No summary provided by upstream source.

Repository SourceNeeds Review
Research

binary-re:dynamic-analysis

No summary provided by upstream source.

Repository SourceNeeds Review