AGENTS.md / CLAUDE.md Evaluator
Evaluate whether rules in instruction files actually change model behavior, identify non-discriminating rules (model already does this by default), and optimize the file for maximum impact per token.
Core Concept
Most AGENTS.md/CLAUDE.md files contain rules the model already follows without being told. These waste context tokens every conversation. This skill runs controlled A/B tests — with the instruction file vs. without it — to identify which rules earn their place and which can be cut.
The Codebase-Teaches-Patterns Effect
Well-structured codebases make most instruction rules redundant. The model explores existing code — reads package.json, scans existing components, follows import patterns — and matches conventions automatically. In empirical testing on a 755-line CLAUDE.md across a monorepo with 3 frontend stacks, 25 of 26 assertions passed identically with or without the instruction file. The model discovered React patterns, shadcn/ui conventions, Recharts usage, Hebrew labels, Firestore helpers, Zod validation, and serverTimestamp — all from existing code.
The only assertion that discriminated was pure domain knowledge the codebase couldn't teach: parallel components in separate directories that must always be changed together.
This means your CLAUDE.md is probably 80-95% redundant. The eval process will reveal exactly which rules survive.
The Evaluation Loop
- Read the instruction file and categorize each rule
- Generate eval prompts from commit analysis (not guessing)
- Run clean A/B tests with proper file isolation
- Grade and benchmark the results
- Identify non-discriminating rules and recommend cuts
- Iterate until the file is lean and every rule pulls its weight
Step 1: Read and Categorize Rules
Read ALL instruction files in scope. For project-level evals, find every instruction file recursively:
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" | grep -v node_modules | grep -v .git
For global evals, check all config directories:
ls -la ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md 2>/dev/null
Categorize each rule:
| Category | Examples | Likely discriminating? |
|---|---|---|
| Coding style | type over interface, early returns, no React.FC | Low — models default to modern patterns |
| Response format | Checklist style, tldr footer, separators | Low-Medium — some are model defaults |
| Workflow | TDD, validation steps, branch naming | High — models don't do these unprompted |
| Decision style | Single recommendation, epistemic separation | High — models default to option menus |
| Safety guardrails | No force-push, no rm -rf, no dev server | Can't A/B test safely — keep by default |
| Identity/preferences | Author name, email, no emojis | Keep — low cost, high consequence if wrong |
| Domain-specific | Parallel component parity, project-specific helpers | Keep — can't be known without instruction |
Rules in the last three categories should be kept regardless of eval results. Focus eval effort on coding style, response format, workflow, and decision style.
Step 2: Generate Eval Prompts
Commit-Derived Evals (Preferred)
The most effective eval prompts come from analyzing real project history. Generic prompts ("write a React component") test patterns the model discovers from the codebase anyway.
Spawn a subagent to analyze recent commits:
Analyze the last 50-100 commits in {project-root} using git log.
Look for:
1. Patterns that appear in commits but aren't obvious from code structure alone
2. Mistakes that were fixed (these reveal non-obvious conventions)
3. Multi-file changes that suggest coupling rules
4. Domain-specific patterns (naming, helpers, config)
For each pattern, write a realistic eval prompt that would test whether
the model knows about it WITHOUT being told. Save to {workspace}/evals/commit-analysis.md
Designing Good Assertions
Each eval needs 3-6 assertions. Mix automated (grep-able) and manual:
- Automated:
"check": "companyCollection"— regex against output - Manual: "Mentions or checks the parallel component" — requires reading context
A discriminating assertion:
- Passes when the instruction file is loaded
- Fails when the instruction file is absent
- Cannot be explained by the codebase alone
If an assertion passes in both conditions, the rule it tests is dead weight.
Save prompts to {workspace}/evals/evals.json. Present to the user before running.
Step 3: Run Clean A/B Tests
This is the most error-prone step. Claude Code auto-loads CLAUDE.md/AGENTS.md into every conversation, including subagents. A contaminated baseline produces false results — you'll think rules are "non-discriminating" when really both runs had them.
Clean Baseline Protocol
This protocol was refined through 3 failed attempts. Follow it exactly.
Phase 1: Discover all instruction and memory files
# Find ALL instruction files in the project (monorepos have them in packages)
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" \
| grep -v node_modules | grep -v .git | sort
Also check user-level config and memory files — these are auto-loaded too:
ls -la ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md 2>/dev/null
ls -d ~/.claude/projects/*/memory/ 2>/dev/null
Phase 2: Move ALL files to /tmp/ (not rename, not .bak)
Renaming to .bak in the same directory is weak — the model might still discover it. Move to /tmp/ with directory structure preserved. Include memory files — there is no CLI flag to disable memory loading.
BACKUP_DIR="/tmp/{project-name}-md-backup-$(date +%s)"
mkdir -p "$BACKUP_DIR"
# For each instruction file found in Phase 1:
for f in {list of files}; do
REL_PATH="${f#{project-root}/}"
mkdir -p "$BACKUP_DIR/$(dirname $REL_PATH)"
mv "$f" "$BACKUP_DIR/$REL_PATH"
done
# User-level files: ASK before touching. These are outside the project.
# "I found global instruction files and memory dirs that will contaminate baselines.
# OK to temporarily rename them? I'll restore immediately after."
# Only proceed with explicit user consent:
for f in ~/.claude/CLAUDE.md ~/.codex/AGENTS.md ~/.cursor/AGENTS.md; do
[ -f "$f" ] && mv "$f" "${f}.bak"
done
find ~/.claude/projects -type d -name "memory" -exec sh -c 'mv "$1" "${1}_bak"' _ {} \; 2>/dev/null
# If the user declines, note that baselines may be partially contaminated by global rules.
Phase 3: Verify removal
# Must return zero results:
find {project-root} -name "CLAUDE.md" -o -name "AGENTS.md" \
| grep -v node_modules | grep -v .git | wc -l
If ANY files remain, do not proceed.
Phase 4: Spawn all clean baseline agents
Launch all baseline agents NOW. They start in a world with zero instruction files.
Phase 5: WAIT for ALL baseline agents to complete
DO NOT restore files until every single baseline agent has finished and you have saved its timing data. This is critical — restoring early creates a race condition where agents that haven't fully initialized yet may pick up the restored files.
Phase 6: Restore all files (ALWAYS — even if baselines errored)
Restoration is non-negotiable. If agents fail, time out, or you hit any error — restore FIRST, debug second. Leaving user config renamed breaks all future Claude sessions.
# Only after ALL baselines are done (or failed):
for f in $(find "$BACKUP_DIR" -type f); do
REL_PATH="${f#$BACKUP_DIR/}"
mv "$f" "{project-root}/$REL_PATH"
done
# Restore user-level files (if they were renamed with user consent):
for f in ~/.claude/CLAUDE.md.bak ~/.codex/AGENTS.md.bak ~/.cursor/AGENTS.md.bak; do
[ -f "$f" ] && mv "$f" "${f%.bak}"
done
# Restore memory dirs:
find ~/.claude/projects -type d -name "memory_bak" -exec sh -c 'mv "$1" "${1%_bak}"' _ {} \; 2>/dev/null
Phase 7: Git checkout
Baseline agents may have modified project files during their runs:
cd {project-root} && git checkout .
With-Skill Runs
For with-skill runs, all instruction files are on disk (auto-loaded). Additionally have the agent read the root file explicitly:
You are being evaluated. Read the file {path-to-instruction-file} first,
then follow ALL instructions in it exactly.
Your task: {eval prompt}
Save outputs to {workspace}/iteration-{N}/{eval-name}/with_skill/outputs/
Spawning Strategy
Launch ALL runs — all with-skill and all clean-baseline agents — but in two batches:
- Batch 1: All clean baselines (files are moved to /tmp/)
- Wait for all baselines to complete, restore files, git checkout
- Batch 2: All with-skill runs (files are back on disk)
Within each batch, launch all agents in a single turn for parallelism.
Each agent should save:
- Output files to
outputs/ - Full response text to
outputs/response.md
Capture total_tokens and duration_ms from each task notification into timing.json.
Step 4: Grade and Benchmark
After all runs complete:
Grade Each Run
For each output, evaluate assertions. Prefer scripted checks over eyeballing:
# Automated assertion check example
grep -cP 'companyCollection' outputs/response.md
Save results to grading.json in each run directory. The eval viewer expects this exact schema:
{
"eval_id": 6,
"eval_name": "descriptive-name",
"config": "with_skill",
"expectations": [
{ "text": "Assertion description", "passed": true, "evidence": "How you verified" }
]
}
The expectations array MUST use fields text, passed, and evidence — not name/met/details.
Build the Benchmark
Create benchmark.json. The eval viewer expects this exact schema:
{
"metadata": { "skill_name": "agents-md", ... },
"runs": [
{
"eval_id": 6,
"eval_name": "descriptive-name",
"configuration": "with_skill",
"run_number": 1,
"result": {
"pass_rate": 0.833,
"passed": 5,
"failed": 1,
"total": 6,
"time_seconds": 162.5,
"tokens": 76318,
"tool_calls": 0,
"errors": 0
},
"expectations": [
{ "text": "...", "passed": true, "evidence": "..." }
]
}
],
"run_summary": {
"with_skill": {
"pass_rate": { "mean": 0.793, "stddev": 0.124, "min": 0.667, "max": 1.0 },
"time_seconds": { "mean": 169.5, "stddev": 31.5, ... },
"tokens": { "mean": 69800, "stddev": 12502, ... }
},
"without_skill": { ... },
"delta": { "pass_rate": "+0.067", "time_seconds": "-13.9", "tokens": "+13460" }
}
}
Critical: configuration must be "with_skill" or "without_skill" — the viewer keys on these exact strings. Put each with_skill entry before its without_skill counterpart.
Aggregate with the bundled script:
python {this-skill-path}/scripts/aggregate_benchmark.py {workspace}/iteration-N --skill-name agents-md
Key Metrics
For each assertion, classify it:
- Non-discriminating: Passes in BOTH configs (rule is dead weight — the codebase teaches it)
- Discriminating: Passes with-skill only (rule earns its place)
- Inversely discriminating: Fails with-skill, passes baseline (rule may be hurting)
- Always failing: Fails in both (rule isn't followed, or assertion is wrong)
Step 5: Analyze and Recommend
Expect Most Rules to be Non-Discriminating
Due to the codebase-teaches-patterns effect, most project-specific rules will be non-discriminating. The model reads existing files and matches conventions. This is the expected result, not a test failure. Common non-discriminating patterns:
- Tech stack choices (React vs Angular) — model reads
package.json - Library preferences (Recharts, shadcn/ui) — model reads existing imports
- Helper functions (companyCollection, serverTimestamp) — model reads existing code
- Language/locale (Hebrew, RTL) — model reads existing translations
- Path aliases, naming conventions — model reads
tsconfig.jsonand existing files
What Actually Discriminates
Rules that discriminate tend to be:
- Coupling knowledge: "When you change X, also change Y" (parallel component parity)
- Non-obvious workflow: "Run this specific validation before finishing"
- Decision style: Single recommendation vs. option menus
- Epistemic style: Separate known/inferred/uncertain
- Non-standard patterns: Internal packages wrapping public APIs, generated code, unconventional directory meanings — things that look standard but aren't
The Keep Test
For each surviving rule, apply this filter — it must pass at least ONE criterion:
| Criterion | Question |
|---|---|
| Specificity | Does it describe a project-unique decision? |
| Behavior change | Would the agent do something different without it? |
| Surprise | Would it surprise a competent dev joining the project? |
| Pain | Did it come from a real mistake or recurring PR feedback? |
| Frequency | Does it apply to nearly every task AND cost few tokens? |
| No feedback loop | Can't be enforced by tests, types, linters, or CI? |
If a rule fails all six AND is non-discriminating in A/B tests, it's dead weight.
Safety/Identity/Domain Rules: Keep Regardless
Don't delete based on A/B tests:
- Safety guardrails (too dangerous to test without)
- Identity/preferences (too cheap to worry about — author name, email)
- Domain context the codebase structure can't teach
Token Budget Analysis
Calculate: how many tokens does the instruction file add per conversation vs. behavioral improvement? A 755-line file that produces +3.8% improvement is a poor trade. A 50-line file with the same improvement is excellent.
After Trimming: Restructure for Progressive Disclosure
Once you know which rules survive, restructure the file:
- Root AGENTS.md should be a compact map (<100 lines) — pointers to deeper docs, not a monolith
- Extract domain content to
docs/— detailed conventions, architecture notes, testing patterns - Use nested AGENTS.md for scoped rules —
packages/api/AGENTS.mdfor API-only conventions - Delete slop — identity prompts ("you are an expert"), vague advice ("follow best practices"), things agents already know ("use meaningful names"), empty sections
- Prefer feedback loops over prose — a lint rule or CI check is more reliable than an AGENTS.md bullet
Step 6: Launch the Viewer
python {this-skill-path}/eval-viewer/generate_review.py \
{workspace}/iteration-N \
--skill-name "agents-md" \
--benchmark {workspace}/iteration-N/benchmark.json \
--static /tmp/agents-md-eval-review.html
Fix Viewer Labels
The viewer defaults to "with skill"/"without skill" labels. For agents.md evals, post-process the HTML to show "with agents.md"/"without agents.md":
html = open("/tmp/agents-md-eval-review.html").read()
LABEL_MAP = '{"with_skill": "with agents.md", "without_skill": "without agents.md"}[config] || config.replace(/_/g, " ");'
html = html.replace(
'badge.textContent = config.replace(/_/g, " ");',
'badge.textContent = ' + LABEL_MAP
)
# Also fix benchmark table labels (labelA, labelB, configLabel)
for var in ['labelA', 'labelB', 'configLabel']:
orig = f'const {var} = config'
if var == 'labelA':
orig = 'const labelA = configA.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
repl = 'const labelA = configA === "with_skill" ? "With agents.md" : configA === "without_skill" ? "Without agents.md" : configA.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
html = html.replace(orig, repl)
elif var == 'labelB':
orig = 'const labelB = configB.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
repl = 'const labelB = configB === "with_skill" ? "With agents.md" : configB === "without_skill" ? "Without agents.md" : configB.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
html = html.replace(orig, repl)
elif var == 'configLabel':
orig = 'const configLabel = config.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
repl = 'const configLabel = config === "with_skill" ? "With agents.md" : config === "without_skill" ? "Without agents.md" : config.replace(/_/g, " ").replace(/\\b\\w/g, c => c.toUpperCase());'
html = html.replace(orig, repl)
open("/tmp/agents-md-eval-review.html", "w").write(html)
For iteration 2+, pass --previous-workspace to show diffs.
Step 7: Apply Changes and Re-test
After the user reviews:
- Delete non-discriminating rules from the instruction file
- Strengthen discriminating rules (add "why", make more concise)
- Re-run evals into
iteration-{N+1}/to verify:- Discriminating rules still pass
- No regressions from deletions
- Token count is lower
Keep iterating until:
- Every remaining rule is discriminating or untestable-but-necessary
- The user is satisfied with the file
- Pass rate is stable
Workspace Layout
{workspace}/
├── evals/
│ ├── evals.json (initial eval prompts)
│ ├── evals-v2.json (commit-derived prompts)
│ └── commit-analysis.md (git log findings)
├── iteration-1/
│ ├── {eval-name}/
│ │ ├── eval_metadata.json
│ │ ├── with_skill/
│ │ │ ├── outputs/
│ │ │ ├── timing.json
│ │ │ └── grading.json
│ │ └── clean_baseline/
│ │ ├── outputs/
│ │ ├── timing.json
│ │ └── grading.json
│ ├── benchmark.json
│ └── benchmark.md
├── iteration-2/
│ └── ...
└── evals.json
Common Pitfalls (from real eval sessions)
-
Contaminated baselines: The #1 failure mode. Claude Code auto-loads CLAUDE.md for ALL conversations including subagents. Simply telling an agent "don't read CLAUDE.md" does nothing — the file must be physically absent.
-
Restoring files too early: Spawning agents then immediately restoring creates a race condition. Wait for ALL agents to finish.
-
Missing nested instruction files: Monorepos have package-level CLAUDE.md files (e.g.,
packages/functions/CLAUDE.md). Move ALL of them, not just the root one. -
Generic eval prompts: "Write a React component" tests patterns the model discovers from
package.jsonanyway. Use commit-derived prompts that target non-obvious domain knowledge. -
Wrong benchmark schema: The eval viewer expects a specific JSON structure with
runs[]array andrun_summarywithmean/stddev. Custom schemas produce empty viewers. -
Expecting high deltas: Due to the codebase-teaches-patterns effect, +3-10% is a realistic delta for a well-structured codebase. +30%+ deltas suggest contaminated baselines.
References
All bundled within this skill — no external dependencies:
- Eval viewer:
eval-viewer/generate_review.py— generates the HTML review viewer - Aggregation:
scripts/aggregate_benchmark.py— aggregates grading into benchmark.json - Grader agent:
agents/grader.md— prompt for grading subagent - Analyzer agent:
agents/analyzer.md— prompt for analysis subagent - Schemas:
references/schemas.md— JSON schemas for evals, grading, benchmark - Eval categories:
references/eval-categories.md— rule categories, empirical findings, commit-derived eval guide