generate-judgements

Use when creating or updating test judgement definitions (judge_definitions) for an agent skill evaluation YAML config. Analyzes a skill's SKILL.md and reference files to produce fine-grained yes/no judge questions with scope tags. Triggers on "generate judgements", "create judge definitions", "write test config", "add judgements", "生成判定", "创建测试配置", "更新判定", "补充judgement".


Generate Judgements for Skill Evaluation

Analyze a skill's source files and produce fine-grained judge_definitions for the mlflow-skills automated evaluation framework. Each judgement is a yes/no question that an LLM judge answers by reading the execution trace.

Prerequisites

  • Access to the target skill directory (must contain SKILL.md)
  • Familiarity with the mlflow-skills YAML config format (see references/yaml-config-spec.md)

Workflow

digraph generate_judgements {
  rankdir=TB;
  node [shape=box];

  collect [label="Phase 1\nCollect & Analyze Skill Files"];
  infer [label="Phase 2\nInfer Scopes"];
  confirm_scope [label="User confirms scopes" shape=diamond];
  generate [label="Phase 3\nGenerate Judgements per Scope"];
  present [label="Phase 4\nPresent to User"];
  confirm_judge [label="User approves?" shape=diamond];
  write [label="Phase 5\nWrite / Update YAML"];

  collect -> infer;
  infer -> confirm_scope;
  confirm_scope -> generate [label="approved"];
  confirm_scope -> infer [label="revise"];
  generate -> present;
  present -> confirm_judge;
  confirm_judge -> write [label="approved"];
  confirm_judge -> generate [label="revise"];
}

Phase 1: Collect and Analyze Skill Files

Ask the user for two inputs (or auto-detect them):

  1. Skill directory path — the folder containing SKILL.md
  2. Existing test config YAML path (optional) — if provided, the tool will update its judge_definitions section instead of creating a new file

Then read all available files in this order:

| Priority | File | Purpose |
|----------|------|---------|
| 1 | SKILL.md | Primary source — workflow steps, behavior rules, output format |
| 2 | references/* | Supporting details — templates, CLI commands, query patterns |
| 3 | README.md / README_CN.md | Additional context — scope boundaries, limitations |
| 4 | Existing test config YAML | Understand current judgements to avoid duplication |

While reading, extract and note:

  • Workflow steps — numbered steps the skill must follow
  • Behavior rules — "must", "always", "never", "do not" directives
  • Output format requirements — file naming, sections, tables, mandatory fields
  • Conditional branches — if/else paths that lead to different outputs
  • Important guidelines — the "Important Guidelines" or similar section at the end
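
For a hypothetical skill, the extracted notes might look like this (every directive below is invented for illustration):

# Hypothetical extraction notes; none of these come from a real SKILL.md
workflow_steps:
  - "Step 2: run 5 documentation search queries"
behavior_rules:
  - "never call MCP tools in parallel"   # candidate judgement: sequential-mcp-calls
output_format:
  - "output file must match {service}-checklist-{date}.md"
conditional_branches:
  - "live resource provided -> assessment report; otherwise -> checklist only"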

Phase 2: Infer Scopes

Analyze the skill for distinct execution paths that produce different outputs or follow different logic. Each distinct path becomes a scope.

How to identify scopes:

  1. Look for conditional branches in the workflow (e.g., "If X → do A; else → do B")
  2. Look for optional steps (e.g., "Only execute this step if...")
  3. Look for different output modes (e.g., "checklist only" vs "assessment report")

Scope naming rules:

  • Use lowercase, single-word or hyphenated names: checklist, assessment, research
  • The scope all is reserved — it means "always run regardless of test_scope"
  • Every skill has at least the implicit all scope for common/shared behavior
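
For example, scope tags attach to judgements like this (a minimal sketch; the names and questions are illustrative):

judge_definitions:
  - name: skill-invoked
    scope: all          # always runs, regardless of test_scope
    question: >
      Check that the agent read SKILL.md before starting work...
  - name: file-naming-convention
    scope: checklist    # runs only when test_scope=checklist
    question: >
      Verify that the output file name matches the documented pattern...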

Present inferred scopes to the user with a brief description of each:

I found the following execution branches in this skill:

1. `all` — Common behavior shared across all paths
   (skill loading, doc search, categorization, source annotations)

2. `checklist` — Checklist-only output path
   (no live resource, generates checklist file, offers next steps)

3. `assessment` — Live assessment path
   (runs AWS CLI, generates assessment report, no separate checklist)

Does this look right? Should I add, remove, or rename any scope?

Wait for user confirmation before proceeding.

Phase 3: Generate Judgements

For each confirmed scope, generate fine-grained judge_definitions. Follow these rules:

3.1 Granularity Principle

One check point per judgement. Each judgement tests exactly ONE behavior or requirement.

# GOOD — one specific check
- name: sequential-mcp-calls
  scope: all
  question: >
    Check that MCP tool calls were executed sequentially...

# BAD — multiple checks crammed into one
- name: workflow-correct
  scope: all
  question: >
    Check that the agent searched docs sequentially, read pages,
    extracted items into 5 categories, and wrote the file...
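
Splitting the BAD example above gives each conjunct its own judgement (a sketch; questions abbreviated):

# Each check packed into "workflow-correct" becomes a separate judgement
- name: sequential-doc-search
  scope: all
  question: >
    Check that documentation searches were executed sequentially...
- name: doc-pages-read
  scope: all
  question: >
    Verify that the agent read the documentation pages it found...
- name: five-categories-complete
  scope: all
  question: >
    Verify that extracted items are grouped into all 5 categories...
- name: output-file-written
  scope: all
  question: >
    Check that the final output file was written...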

3.2 Judgement Categories

Generate judgements in this order, for each scope:

Category A: Skill Loading & Invocation (scope: all)

  • Was the skill loaded (SKILL.md read)?
  • Were reference files read when needed?

Category B: Workflow Behavior (scope: all or scope-specific)

  • Did each workflow step execute correctly?
  • Were sequential/parallel execution rules followed?
  • Were error handling / retry rules followed?
  • Were conditional branches taken correctly?

Category C: Output Quality (scope: all or scope-specific)

  • Does the output contain all mandatory sections/categories?
  • Does the output follow the naming convention?
  • Does the output include required metadata (source annotations, IDs, etc.)?
  • Are quantities within expected ranges?

Category D: Scope-Specific Behavior (per non-all scope)

  • What is unique to this execution path?
  • What should NOT happen in this path? (negative checks)
  • What additional output/actions are expected?

Category E: Guidelines Compliance (scope: all)

  • Are "always/never/must" directives respected?
  • Is the output language correct?
  • Are edge cases handled?
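
For instance, a Category C judgement that encodes a quantity range might read (name and numbers are illustrative):

- name: item-count-in-range
  scope: all
  question: >
    Check the final output file content in the trace and count the
    extracted items across all categories. Answer "yes" if the total
    is roughly 30-50 items; be lenient if the count is slightly
    outside that range.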

3.3 Naming Convention

Use kebab-case names that describe the check:

skill-invoked              — skill was loaded
sequential-mcp-calls       — tool calls are sequential
doc-search-coverage        — search queries cover required topics
five-categories-complete   — output has all 5 categories
file-naming-convention     — output file name matches pattern
aws-cli-commands-executed  — CLI commands were run
no-separate-checklist-file — negative check: no extra file

3.4 Question Writing Rules

Each question field must be a self-contained instruction for the LLM judge. Follow the patterns in references/judgement-patterns.md.

Required elements in every question:

  1. What to check — "Check that..." or "Verify that..."
  2. Where to look — "Look in the trace for...", "Look for tool calls..."
  3. Success criteria — specific, measurable condition for answering "yes"
  4. Leniency guidance (when appropriate) — "Be lenient...", "Answer 'yes' if at least..."

Important clarifications to include when relevant:

  • Distinguish between parallel tool CALLS vs batched requests in one call
  • Specify minimum thresholds (e.g., "at least 4 of 5", "roughly 30-50")
  • Clarify what counts (e.g., "each URL in the requests array counts as a separate page")
  • State default answer when evidence is ambiguous (e.g., "benefit of the doubt → yes")
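
Put together, a question containing all four required elements reads like this (an illustrative sketch; the four sentences map to elements 1-4 in order):

- name: doc-search-coverage
  scope: all
  question: >
    Check that the agent's documentation searches cover the required topics.
    Look in the trace for MCP search tool calls and read their query parameters.
    Answer "yes" if at least 4 of the 5 required topics appear among the queries.
    Be lenient about exact wording; semantically equivalent queries count.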

3.5 Negative Checks

For each scope, also generate negative judgements — things that should NOT happen:

  • In checklist scope: assessment-only artifacts should NOT appear
  • In assessment scope: checklist-only artifacts should NOT appear
  • Across all scopes: forbidden behaviors (e.g., parallel calls when sequential is required)
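
An illustrative negative judgement for the assessment scope (the file roles are hypothetical):

- name: no-separate-checklist-file
  scope: assessment
  question: >
    Verify that the agent did NOT create a separate checklist file in
    addition to the assessment report. Look in the trace for file-write
    tool calls. Answer "yes" if only the assessment report was written;
    answer "no" if an additional checklist file appears.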

Phase 4: Present Judgements to User

Present the generated judgements grouped by scope with clear section headers:

## Generated Judgements

### Scope: all (7 judgements)
| # | Name | Check |
|---|------|-------|
| 1 | skill-invoked | Skill was loaded from .claude/skills/ |
| 2 | sequential-mcp-calls | MCP calls are sequential, not parallel |
| ... | ... | ... |

### Scope: checklist (2 judgements)
| # | Name | Check |
|---|------|-------|
| 1 | file-naming-convention | Output file follows naming pattern |
| ... | ... | ... |

### Scope: assessment (8 judgements)
| ... | ... | ... |

Total: 17 judgements across 3 scopes.

Does this look right? Should I add, remove, or modify any judgement?

Wait for user confirmation. Iterate if the user requests changes.

Phase 5: Write / Update YAML

Once approved, write the output:

If an existing YAML config was provided:

Replace only the judge_definitions: section. Preserve all other fields (name, prompt, skills, timeout_seconds, environment, etc.) exactly as they are.

Add the standard scope comment block above `judge_definitions`:

# ==============================================================
# Judge Definitions
#
# scope values:
#   all        — runs in all test scenarios
#   {scope1}   — only when test_scope={scope1}
#   {scope2}   — only when test_scope={scope2}
# ==============================================================
judge_definitions:

If no existing YAML was provided:

Generate a complete YAML config file. Ask the user for:

  • name — test run name
  • project_dir — temp project directory name
  • prompt — default prompt for the test
  • test_scope — default scope to use

Use sensible defaults from the skill directory name for the rest. See references/yaml-config-spec.md for the full config structure.

File naming: {skill-name}.yaml placed in the appropriate tests/configs/ directory.
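
When generating from scratch, a minimal skeleton built only from the fields named in this document might look like this (all values are placeholders; see references/yaml-config-spec.md for the authoritative structure):

name: my-skill-eval              # test run name
project_dir: my-skill-test       # temp project directory name
prompt: "Default prompt used when none is passed on the CLI"
test_scope: all                  # default scope; overridable from the CLI
skills:
  - my-skill
timeout_seconds: 600             # placeholder; pick per skill
environment: {}                  # empty values won't override existing env vars
judge_definitions:
  - name: skill-invoked
    scope: all
    question: >
      Check that the agent read SKILL.md before starting work...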

After writing, inform the user of the file path and remind them:

  • They can override test_scope and prompt from the CLI
  • Empty environment values won't override existing env vars
  • Judges with scope: all always run

Important Guidelines

  • Be exhaustive: Extract every testable behavior from the skill. It's better to have too many judgements than to miss an important check. The user can always remove extras.
  • One point per judgement: Never combine multiple checks. If you're tempted to use "and" in a question, split it into two judgements.
  • Write for an LLM judge: The question will be answered by an LLM reading a raw MLflow trace (JSON with tool calls and responses). Be explicit about where to find evidence in the trace.
  • Include thresholds: When the skill specifies numbers (e.g., "5 search queries", "30-50 items", "at least 3 per category"), encode those in the judgement.
  • Respect language: Write judgement questions in English (they are consumed by an LLM judge, not shown to end users). But interact with the user in their language.
  • Preserve existing work: When updating an existing YAML, review current judgements first. Keep well-written ones, improve weak ones, add missing ones.
