Multi-Dimensional Evaluation Framework Designer
A skill for designing custom multi-dimensional evaluation frameworks for AI systems. Walks the user from "I have a system to evaluate" to "I have a calibrated, group-organized scorecard with canonical/proxy duality and explicit failure modes."
The central premise: a single composite score destroys the information you need to debug which dimension actually drove the outcome. This skill produces frameworks that force the reader to look at multiple numbers, with rules for when each measurement is reliable.
Four-stage flow
- Stage 1 — Domain elicitation: what system, what evaluation question, what calibration cases
- Stage 2 — Taxonomy design: group structure + dimensions per group
- Stage 3 — Rubric: canonical/proxy split per dimension + failure modes
- Stage 4 — Judgment: group-wise scorecard interpretation (no composite)
After Stage 4, ask: "Want to score additional cases or adjust the rubric?" — this is the calibration loop.
When to use
Activate when the user:
- Wants to evaluate AI systems (agents, deliberations, RAG, multi-step reasoning) across multiple qualitatively different dimensions
- Needs to compare instances with asymmetric data availability (some have canonical metrics, others have only narrative logs)
- Has noticed single-score benchmarks miss important variation between systems
- Says "tradeoffs" — and wants to make those tradeoffs explicit per dimension
- Wants a reusable scorecard format that survives infrastructure migrations
Don't activate when:
- The user wants a single comparable benchmark number — point them at HumanEval / MMLU / domain-specific benchmarks instead
- The system has a clear single quality metric (perplexity, accuracy on a labeled set)
- The user is asking how to design one metric, not a framework of metrics
Stage 1 — Domain elicitation
Goal: extract enough about the user's evaluation domain to design groups and dimensions.
Turn 1 — concrete instances, not abstract criteria. Ask:
"Give me 1-2 concrete instances of systems you want to evaluate (or have already evaluated). What's the question that comparison should answer? — e.g., 'is system V2 more grounded than V1?' / 'does adding a Critic agent reduce sycophancy?'"
This grounds the design in real comparisons rather than generic axes.
Turn 2 — calibration cases. Ask:
"Of the systems you've already run, which 2-3 do you have strong intuitions about — i.e., 'I expect X to score higher than Y because Z'? Those are your calibration cases."
If the user has no calibration cases yet, the framework can't be calibrated. Either:
- Run on at least 2 prior instances first, or
- Design the framework theoretically and acknowledge it's uncalibrated until run
Turn 3 — data availability. Ask:
"For each calibration case, what data do you have? — structured records (jsonl, database)? narrative logs (markdown, reports)? both? Same schema across cases or different?"
This determines canonical/proxy split for Stage 3.
Turn 4 — capability layers (optional). If the system is complex, ask:
"If you had to split the evaluation into 3 layers, what would they be? Examples: evidence-quality / process-dynamics / structural-form. Or: retrieval-quality / ranking-quality / adaptation-quality."
The user's natural splits become the groups. If the user can't articulate layers, default to a 3-group structure: (1) evidence/grounding, (2) process/dynamics, (3) structural/architecture. Or use the 4-family alternative shown in memory-bench-taxonomy.md.
By end of Stage 1 you should know:
- The system class being evaluated (multi-agent / single-LLM / RAG / tool-using / etc.)
- 2-3 calibration cases with expected ordinals
- Data availability map (which cases have canonical data, which need proxy)
- Group structure (typically 3 groups, may be 2 or 4)
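The Stage 1 outputs above can be captured as one structured record. A minimal sketch, assuming illustrative field names (nothing here is prescribed by the skill):

```python
# Hypothetical Stage 1 elicitation record; field names are illustrative.
stage1 = {
    "system_class": "multi-agent deliberation",
    "calibration_cases": [
        {"name": "V1", "data": "narrative logs"},
        {"name": "V2", "data": "narrative logs"},
        {"name": "V4", "data": "structured jsonl"},
    ],
    # (higher, lower): the higher case is expected to outscore the lower one
    "predicted_ordinals": [("V2", "V1"), ("V4", "V2")],
    "groups": ["evidence/grounding", "process/dynamics", "structural/architecture"],
}

# Cases with structured records can use canonical measures in Stage 3;
# narrative-only cases will need proxies.
canonical_ready = [c["name"] for c in stage1["calibration_cases"]
                   if c["data"] == "structured jsonl"]
```

Having the data-availability map in one place makes the Stage 3 canonical/proxy split a lookup rather than a judgment call.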
Stage 2 — Taxonomy design
Author the group structure + dimensions per group.
Step 1: Surface the 12-axis MADEF reference to the user. Ask which axes feel relevant.
Don't force the user to use all 12 — most domains use 5-8 of the MADEF axes plus 0-3 domain-specific additions. The MADEF table at the bottom of madef-axes.md shows likely keep/modify/drop patterns for common domains (single-LLM reasoning, tool-using agents, RAG, multi-step coding).
Step 2: Show the memory-bench-designer's 4-family taxonomy as alternative shape.
This makes the point that group structure is domain-driven. memory-bench has 4 groups (capability families) because memory has those layers. Deliberation has 3 groups (evidence/process/structure) because deliberation has those layers. Don't blindly copy — let the user's domain shape it.
Step 3: Walk the design worksheet. Use axes-design-worksheet.md to fill in:
- Group names + what each layer asks
- 2-5 dimensions per group
- For each dimension: name + 1-line definition
Cap the total at 8-12 dimensions. More than 12 is unmanageable; fewer than 4 isn't meaningfully multi-dimensional.
Stage 3 — Rubric
For each dimension designed in Stage 2, fill in the operational rubric using canonical-vs-proxy-decision.md:
- Canonical measure (formula given full data)
- Fallback proxy (operationalization for partial data)
- Tie-break rule (partial credit cases)
- Flag conditions (when to attach ⚠)
- Refusal threshold (when the proxy is too noisy to score)
A dimension without all five fields is not yet operational — it's a sketch.
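One way to make the five-field requirement mechanical is a record type that can check its own completeness. A sketch under assumed names (the skill specifies the fields, not this representation):

```python
from dataclasses import dataclass, fields

@dataclass
class DimensionRubric:
    """One operational dimension: all five rubric fields are mandatory."""
    name: str
    canonical: str       # formula given full data
    proxy: str           # operationalization for partial data
    tie_break: str       # partial-credit rule
    flag_condition: str  # when to attach a warning flag
    refusal: str         # when the proxy is too noisy to score

def is_operational(rubric: DimensionRubric) -> bool:
    # A dimension with any empty field is still a sketch, not operational.
    return all(getattr(rubric, f.name).strip() for f in fields(rubric))
```

Running `is_operational` over the whole taxonomy before Stage 4 catches dimensions that were designed but never operationalized.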
Apply the M1-M5 meta-principles from group-design-principles.md:
- M1: ambiguous → report range, not point
- M2: population-count normalization required for cross-instance
- M3: stress conditions evaluated separately
- M4: framework must be falsifiable
- M5: calibration before claims
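M1 and M2 can be sketched as scoring conventions: return a range when evidence is thin, and normalize raw counts by population before any cross-instance comparison. An illustrative sketch; the skill does not prescribe these function names or formulas:

```python
def score_range(evidence_for: int, evidence_against: int) -> tuple[float, float]:
    """M1: ambiguous or thin evidence yields a range, not a point estimate."""
    total = evidence_for + evidence_against
    if total == 0:
        return (0.0, 1.0)  # no evidence: maximally wide range
    point = evidence_for / total
    width = 1.0 / total    # fewer observations -> wider range
    return (max(0.0, point - width / 2), min(1.0, point + width / 2))

def normalized(raw_count: int, population: int) -> float:
    """M2: cross-instance comparison requires per-population normalization."""
    return raw_count / population
```

For example, 10 grounded claims from a 5-agent run and 10 from a 20-agent run are not comparable until both are normalized per agent (or per round).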
Stage 4 — Judgment
Apply the framework to the calibration cases the user named in Stage 1.
For each case, populate scorecard.md.tmpl with group-wise scores.
Critical: report group means separately, never a composite. A failing system with one group at 0.9 and another at 0.2 is not the same as a system with all groups at 0.55.
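A scorecard helper in this spirit exposes only per-group means and deliberately offers no composite. A minimal sketch with hypothetical group and dimension names:

```python
from statistics import mean

def group_means(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Report one mean per group; never collapse across groups."""
    return {group: round(mean(dims.values()), 2) for group, dims in scores.items()}

# Two hypothetical systems that a composite would conflate:
failing  = {"grounding": {"a1": 0.9,  "a2": 0.9},
            "dynamics":  {"b1": 0.2,  "b2": 0.2}}
mediocre = {"grounding": {"a1": 0.55, "a2": 0.55},
            "dynamics":  {"b1": 0.55, "b2": 0.55}}
# Both average to 0.55 as a composite; group means keep them distinct.
```

The deliberate absence of a `composite()` function is the design: any reader of the scorecard is forced to look at each group.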
Verify ordinal predictions: do the calibration cases score in the predicted order? If not:
- Iterate the rubric and log the change in iteration_log.md (see group-design-principles.md M5)
- Or accept that the prediction was wrong and document why
The framework freezes (becomes versioned) when the calibration ordinals hold and at least 2-3 real adjustments have been logged.
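Ordinal verification is mechanical once the scorecards exist: check each predicted pair against the per-group scores. A hypothetical sketch (the skill specifies the check, not this code):

```python
def check_ordinals(predictions, scores):
    """Return the predicted pairs that the scores violate.

    predictions: list of (higher, lower, group) tuples, where `higher`
                 was predicted to outscore `lower` on `group`.
    scores: {case_name: {group: group_mean}}.
    """
    return [(hi, lo, g) for hi, lo, g in predictions
            if scores[hi][g] <= scores[lo][g]]

scores = {"V1": {"grounding": 0.4},
          "V2": {"grounding": 0.6},
          "V4": {"grounding": 0.8}}
preds = [("V2", "V1", "grounding"), ("V4", "V2", "grounding")]
# An empty result means calibration holds for these pairs.
```

Each violation points at exactly one rubric dimension and one case pair, which is what makes the iteration loop tractable.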
Quick example
User: "I have 4 multi-agent debate experiments. The 4th one added claims+verifications infra. I want to evaluate which experiment is doing the most rigorous deliberation."
Stage 1 reveals:
- System class: multi-agent deliberation, 3-5 agents per experiment, 13-20 rounds each
- Calibration cases: V1/V2/V3 (legacy) and V4 (with claims infra)
- Data availability: legacy has narrative round logs only; V4 has full state jsonl
- Predicted ordinals: V2 > V1 (added Critic), V3 > V2 (more agents), V4 highest on grounding (has claims infra)
Stage 2 lands on the 12-axis MADEF taxonomy in madef-axes.md, with 3 groups (Grounding / Dynamics / Architecture).
Stage 3 fills in canonical/proxy for each axis. Most legacy experiments need proxy on A1, A3, B1, B2; V4 has canonical on all.
Stage 4 produces 4 scorecards. The ordinals confirm V4 is highest on Group A (Grounding) but the picture is more nuanced on Group B (V3 outscores V4 on dynamics due to more agents and a unique cross-agent finding). The framework surfaces which dimensions move with the architecture change, which is what the user needed.
Full walkthrough: examples/deliberation-system-eval.md.
How the skill behaves at each turn
- Don't dump all 12 axes at once. Surface them in groups, ask about relevance group-by-group.
- Don't start on the rubric (Stage 3) before the taxonomy is settled (Stage 2). Writing operational definitions before the design intent is settled is wasted work.
- Do push back if the user wants a single composite. The pattern's whole point is to refuse that. Explain why (it hides which dimension failed) rather than just refusing.
- Do verify calibration ordinals before the user "trusts" the framework. If the framework can't reproduce the ordinals the user predicted, something is wrong (rubric, prediction, or scoring) — find which.
References
- references/group-design-principles.md — five design principles + five meta-principles, domain-agnostic
- references/canonical-vs-proxy-decision.md — decision tree for two-track measurement
- references/madef-axes.md — 12-axis instantiation for multi-agent deliberation (use as reference, adapt to your domain)
- references/memory-bench-taxonomy.md — 4-family/8-dimension instantiation for memory eval (alternative shape)
Templates
- templates/axes-design-worksheet.md — fill-in worksheet for designing your own axes
- templates/scorecard.md.tmpl — output format for group-wise scorecards
Examples
- examples/deliberation-system-eval.md — applying MADEF to 4 deliberation experiments
- examples/cross-domain-rag-eval.md — adapting the pattern to RAG evaluation
What this skill does NOT do
- It does not run benchmarks for you — it designs the framework you'll run
- It does not produce automated scoring — scoring is procedurally specified but human-in-the-loop for proxy work
- It does not collapse multi-dim into a single ranking number (refusal is the design)
- It does not validate that the dimensions you choose are the right dimensions for your domain — that's a calibration question, the framework only enforces self-consistency
License
MIT