# Skill Grader
Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.
## When to Use

✅ **Use for:**

- Auditing a single skill's quality
- Comparing skills against each other
- Prioritizing which skills to improve first
- Quality-control sweeps across a skill library
- Generating improvement roadmaps

❌ **NOT for:**

- Creating new skills (use skill-architect)
- Grading code quality or non-skill documents
- Evaluating agent performance (different from skill quality)
## Grading Process

```mermaid
flowchart TD
    A[Read SKILL.md + all files] --> B[Score each of 10 axes]
    B --> C[Assign letter grade per axis]
    C --> D[Compute overall grade]
    D --> E[Write improvement recommendations]
    E --> F[Produce grading report]
```
### Step-by-Step

1. **Read the entire skill folder** — SKILL.md, all references, scripts, CHANGELOG, README (see the inventory sketch after this list)
2. **Score each axis** — use the rubric below (0-100 per axis)
3. **Convert to letter grades** — see the grade scale
4. **Compute the overall grade** — weighted average (Description and Scope carry 2x weight)
5. **Write 1-3 specific improvements** for each axis scoring below B+
6. **Produce the grading report** in the output format below
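Step 1 can be made mechanical. Below is a minimal sketch that inventories a skill folder so nothing gets skipped; `skill_dir` is a hypothetical path, and the layout assumption (one folder holding SKILL.md and its companions) is ours, not a fixed convention.

```python
from pathlib import Path

def inventory_skill(skill_dir: str) -> list[Path]:
    """List every file in the skill folder so nothing is skipped in step 1."""
    root = Path(skill_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for f in files:
        print(f.relative_to(root))
    return files

# Hypothetical usage:
# inventory_skill("skills/skill-grader")
```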
## The 10 Evaluation Axes
### Axis 1: Description Quality (Weight: 2x)

Does the description follow the `[What] [When] [Keywords]. NOT for [Exclusions]` pattern?

| Grade | Criteria |
|---|---|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |
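Parts of this axis are mechanically checkable. A rough heuristic sketch, using the thresholds from the table above; it assumes the description is already extracted as a plain string, and it deliberately leaves keyword quality and vagueness to human judgment:

```python
def check_description(description: str) -> list[str]:
    """Flag mechanically checkable Axis 1 issues. Heuristic only."""
    findings = []
    words = len(description.split())
    if not 25 <= words <= 50:
        findings.append(f"{words} words (rubric target: 25-50)")
    if "NOT" not in description:
        findings.append("no explicit NOT clause (caps Axis 1 at C)")
    return findings
```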
### Axis 2: Scope Discipline (Weight: 2x)

Is the skill narrowly focused on one expertise type, or a catch-all?

| Grade | Criteria |
|---|---|
| A | One clear expertise domain; "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains; should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |
### Axis 3: Progressive Disclosure

Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?

| Grade | Criteria |
|---|---|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |
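The line-count thresholds map directly to code. A sketch, assuming the bands from the table above; the structural criteria (reference index, inlined content) still need a human read:

```python
from pathlib import Path

def disclosure_band(skill_md_path: str) -> str:
    """Map SKILL.md length to the Axis 3 line-count bands."""
    lines = len(Path(skill_md_path).read_text().splitlines())
    if lines < 300:
        return f"A band ({lines} lines)"
    if lines < 500:
        return f"B band ({lines} lines)"
    if lines <= 800:
        return f"C band ({lines} lines)"
    return f"D band ({lines} lines)"
```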
### Axis 4: Anti-Pattern Coverage

Does the skill encode expert knowledge that prevents common mistakes?

| Grade | Criteria |
|---|---|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |
### Axis 5: Self-Contained Tools

Does the skill include working tools (scripts, MCPs, subagents)?

| Grade | Criteria |
|---|---|
| A | Working scripts with CLI interface, error handling, dependency docs; OR a valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |

**Note:** Not every skill needs tools. A pure decision-tree skill can score an A if tools aren't applicable.
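The phantom-tool check (grade D) is fully mechanical: extract every file path SKILL.md mentions and verify each one exists. A sketch; the regex encodes an assumed `scripts/` and `references/` naming convention and will miss other reference styles:

```python
import re
from pathlib import Path

def find_phantom_tools(skill_dir: str) -> list[str]:
    """Return file paths referenced in SKILL.md that don't exist on disk."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text()
    # Assumed reference style: paths like scripts/validate.py or references/foo.md
    mentioned = set(re.findall(r"\b(?:scripts|references)/[\w./-]+", text))
    return sorted(m for m in mentioned if not (root / m).exists())
```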
### Axis 6: Activation Precision

Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?

| Grade | Criteria |
|---|---|
| A | Description has specific keywords matching user language, a clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |
### Axis 7: Visual Artifacts

Does the skill use Mermaid diagrams, code examples, and tables effectively?

| Grade | Criteria |
|---|---|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |
### Axis 8: Output Contracts

Does the skill define what it produces in a format consumable by other agents?

| Grade | Criteria |
|---|---|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| N/A | Pure reference skill — exempt from this axis (see Overall Grade Computation) |
### Axis 9: Temporal Awareness

Does the skill track when knowledge was current and what has changed?

| Grade | Criteria |
|---|---|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere; knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |
### Axis 10: Documentation Quality

README, CHANGELOG, and reference organization.

| Grade | Criteria |
|---|---|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |
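The mechanically verifiable parts of Axes 9 and 10 overlap, so one helper can feed both. A sketch; the ISO date pattern (YYYY-MM-DD) is an assumption, so adjust it to the skill's actual CHANGELOG convention:

```python
import re
from pathlib import Path

def docs_hygiene(skill_dir: str) -> dict[str, bool]:
    """Check the mechanically verifiable parts of Axes 9 and 10."""
    root = Path(skill_dir)
    changelog = root / "CHANGELOG.md"
    report = {
        "readme_exists": (root / "README.md").exists(),
        "changelog_exists": changelog.exists(),
        "changelog_dated": False,
    }
    if report["changelog_exists"]:
        # Assumes ISO-style dates (YYYY-MM-DD) in CHANGELOG entries
        report["changelog_dated"] = bool(
            re.search(r"\d{4}-\d{2}-\d{2}", changelog.read_text())
        )
    return report
```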
## Grade Scale

| Letter | Score Range | Meaning |
|---|---|---|
| A+ | 97-100 | Exemplary — sets the standard |
| A | 93-96 | Excellent — minor improvements possible |
| A- | 90-92 | Very good — a few small gaps |
| B+ | 87-89 | Good — notable room for improvement |
| B | 83-86 | Solid — several areas need work |
| B- | 80-82 | Above average — meaningful gaps |
| C+ | 77-79 | Average — significant improvements needed |
| C | 73-76 | Below average — major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor — fundamental issues |
| D | 63-66 | Very poor — needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing — start over |
## Overall Grade Computation

Axes 1 (Description) and 2 (Scope) carry 2x weight; all other axes carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, drop both its score and its weight from the calculation (divide by 11 instead of 12).

Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12

Convert the numeric average to a letter grade using the scale above.
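The same computation as a sketch in code; the axis scores in the usage comment are hypothetical, and the scale mirrors the table above:

```python
# Descending grade floors, mirroring the Grade Scale table
GRADE_SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]

def overall_grade(scores: dict[int, float],
                  exempt: frozenset[int] = frozenset()) -> tuple[float, str]:
    """Weighted average: Axes 1 and 2 count twice; exempt axes drop out of
    both the numerator and the divisor."""
    weights = {axis: 2 if axis in (1, 2) else 1 for axis in range(1, 11)}
    graded = [a for a in scores if a not in exempt]
    score = sum(weights[a] * scores[a] for a in graded) / sum(weights[a] for a in graded)
    letter = next((g for floor, g in GRADE_SCALE if score >= floor), "F")
    return score, letter

# Hypothetical usage, with Axis 8 exempt:
# overall_grade({1: 90, 2: 85, 3: 88, 4: 70, 5: 95, 6: 80, 7: 75, 9: 60, 10: 82},
#               exempt=frozenset({8}))
```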
## Output Format

Produce this exact structure:

```markdown
# Skill Grading Report: [skill-name]

**Graded:** [date]
**Overall Grade:** [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|---|---|---|---|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

- [Axis]: [Specific action] — [Why this matters, expected grade improvement]
- [Axis]: [Specific action] — [Why this matters, expected grade improvement]
- [Axis]: [Specific action] — [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])

[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]
```
## Quick Grading (Abbreviated)
For rapid triage across many skills, produce only:
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
## Anti-Patterns in Grading

### Grade Inflation

**Wrong:** Giving a B+ because "it's pretty good" without checking criteria.
**Right:** Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

### Missing Context

**Wrong:** Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
**Right:** Mark Axis 5 as "A — tools not applicable for this skill type."

### Ignoring Phantoms

**Wrong:** Scoring Axis 5 as B because scripts are "referenced."
**Right:** Actually check that every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's a D.