Multi-Dimensional Evaluation Framework Designer
A skill for designing custom multi-dimensional evaluation frameworks for AI systems. Walks the user from "I have a system to evaluate" to "I have a calibrated, group-organized scorecard with canonical/proxy duality and explicit failure modes."
The central premise: a single composite score destroys the information you need to debug which dimension actually drove the outcome. This skill produces frameworks that force the reader to look at multiple numbers, with rules for when each measurement is reliable.
Four-stage flow
- Stage 1 — Domain elicitation: what system, what evaluation question, what calibration cases
- Stage 2 — Taxonomy design: group structure + dimensions per group
- Stage 3 — Rubric: canonical/proxy split per dimension + failure modes
- Stage 4 — Judgment: group-wise scorecard interpretation (no composite)
After Stage 4, ask: "Want to score additional cases or adjust the rubric?" — this is the calibration loop.
When to use
Activate when the user:
- Wants to evaluate AI systems (agents, deliberations, RAG, multi-step reasoning) across multiple qualitatively different dimensions
- Needs to compare instances with asymmetric data availability (some have canonical metrics, others have only narrative logs)
- Has noticed single-score benchmarks miss important variation between systems
- Says "tradeoffs" — and wants to make those tradeoffs explicit per dimension
- Wants a reusable scorecard format that survives infrastructure migrations
Don't activate when:
- The user wants a single comparable benchmark number — point them at HumanEval / MMLU / domain-specific benchmarks instead
- The system has a clear single quality metric (perplexity, accuracy on a labeled set)
- The user is asking how to design one metric, not a framework of metrics
Stage 1 — Domain elicitation
Goal: extract enough about the user's evaluation domain to design groups and dimensions.
Turn 1 — concrete instances, not abstract criteria. Ask:
"Give me 1-2 concrete instances of systems you want to evaluate (or have already evaluated). What's the question that comparison should answer? — e.g., 'is system V2 more grounded than V1?' / 'does adding a Critic agent reduce sycophancy?'"
This grounds the design in real comparisons rather than generic axes.
Turn 2 — calibration cases. Ask:
"Of the systems you've already run, which 2-3 do you have strong intuitions about — i.e., 'I expect X to score higher than Y because Z'? Those are your calibration cases."
If the user has no calibration cases yet, the framework can't be calibrated. Either:
- Run on at least 2 prior instances first, or
- Design the framework theoretically and acknowledge it's uncalibrated until run
Turn 3 — data availability. Ask:
"For each calibration case, what data do you have? — structured records (jsonl, database)? narrative logs (markdown, reports)? both? Same schema across cases or different?"
This determines canonical/proxy split for Stage 3.
Turn 4 — capability layers (optional). If the system is complex, ask:
"If you had to split the evaluation into 3 layers, what would they be? Examples: evidence-quality / process-dynamics / structural-form. Or: retrieval-quality / ranking-quality / adaptation-quality."
The user's natural splits become the groups. If the user can't articulate layers, default to a 3-group structure: (1) evidence/grounding, (2) process/dynamics, (3) structural/architecture. Or use the 4-family alternative shown in memory-bench-taxonomy.md.
By end of Stage 1 you should know:
- The system class being evaluated (multi-agent / single-LLM / RAG / tool-using / etc.)
- 2-3 calibration cases with expected ordinals
- Data availability map (which cases have canonical data, which need proxy)
- Group structure (typically 3 groups, may be 2 or 4)
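The Stage 1 outputs above can be captured as one structured record. A minimal sketch, assuming illustrative field names (nothing here is prescribed by the skill):

```python
# Hypothetical Stage 1 elicitation record; field names are illustrative.
stage1 = {
    "system_class": "multi-agent deliberation",
    "calibration_cases": [
        {"name": "V1", "data": "narrative logs"},
        {"name": "V2", "data": "narrative logs"},
        {"name": "V4", "data": "structured jsonl"},
    ],
    # (higher, lower): the higher case is expected to outscore the lower one
    "predicted_ordinals": [("V2", "V1"), ("V4", "V2")],
    "groups": ["evidence/grounding", "process/dynamics", "structural/architecture"],
}

# Cases with structured records can use canonical measures in Stage 3;
# narrative-only cases will need proxies.
canonical_ready = [c["name"] for c in stage1["calibration_cases"]
                   if c["data"] == "structured jsonl"]
```

Having the data-availability map in one place makes the Stage 3 canonical/proxy split a lookup rather than a judgment call.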
Stage 2 — Taxonomy design
Author the group structure + dimensions per group.
Step 1: Surface the 12-axis MADEF reference to the user. Ask which axes feel relevant.
Don't force the user to use all 12 — most domains use 5-8 of the MADEF axes plus 0-3 domain-specific additions. The MADEF table at the bottom of madef-axes.md shows likely keep/modify/drop patterns for common domains (single-LLM reasoning, tool-using agents, RAG, multi-step coding).
Step 2: Show the memory-bench-designer's 4-family taxonomy as alternative shape.
This makes the point that group structure is domain-driven. memory-bench has 4 groups (capability families) because memory has those layers. Deliberation has 3 groups (evidence/process/structure) because deliberation has those layers. Don't blindly copy — let the user's domain shape it.
Step 3: Walk the design worksheet. Use axes-design-worksheet.md to fill in:
- Group names + what each layer asks
- 2-5 dimensions per group
- For each dimension: name + 1-line definition
Cap the total at 8-12 dimensions. More than 12 is unmanageable; fewer than 4 isn't meaningfully multi-dimensional.
Stage 3 — Rubric
For each dimension designed in Stage 2, fill in the operational rubric using canonical-vs-proxy-decision.md:
- Canonical measure (formula given full data)
- Fallback proxy (operationalization for partial data)
- Tie-break rule (partial credit cases)
- Flag conditions (when to attach ⚠)
- Refusal threshold (when the proxy is too noisy to score)
A dimension without all five fields is not yet operational — it's a sketch.
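One way to make the five-field requirement mechanical is a record type that can check its own completeness. A sketch under assumed names (the skill specifies the fields, not this representation):

```python
from dataclasses import dataclass, fields

@dataclass
class DimensionRubric:
    """One operational dimension: all five rubric fields are mandatory."""
    name: str
    canonical: str       # formula given full data
    proxy: str           # operationalization for partial data
    tie_break: str       # partial-credit rule
    flag_condition: str  # when to attach a warning flag
    refusal: str         # when the proxy is too noisy to score

def is_operational(rubric: DimensionRubric) -> bool:
    # A dimension with any empty field is still a sketch, not operational.
    return all(getattr(rubric, f.name).strip() for f in fields(rubric))
```

Running `is_operational` over the whole taxonomy before Stage 4 catches dimensions that were designed but never operationalized.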
Apply the M1-M5 meta-principles from group-design-principles.md:
- M1: ambiguous → report range, not point
- M2: population-count normalization required for cross-instance
- M3: stress conditions evaluated separately
- M4: framework must be falsifiable
- M5: calibration before claims
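M1 and M2 can be sketched as scoring conventions: return a range when evidence is thin, and normalize raw counts by population before any cross-instance comparison. An illustrative sketch; the skill does not prescribe these function names or formulas:

```python
def score_range(evidence_for: int, evidence_against: int) -> tuple[float, float]:
    """M1: ambiguous or thin evidence yields a range, not a point estimate."""
    total = evidence_for + evidence_against
    if total == 0:
        return (0.0, 1.0)  # no evidence: maximally wide range
    point = evidence_for / total
    width = 1.0 / total    # fewer observations -> wider range
    return (max(0.0, point - width / 2), min(1.0, point + width / 2))

def normalized(raw_count: int, population: int) -> float:
    """M2: cross-instance comparison requires per-population normalization."""
    return raw_count / population
```

For example, 10 grounded claims from a 5-agent run and 10 from a 20-agent run are not comparable until both are normalized per agent (or per round).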
Stage 4 — Judgment
Apply the framework to the calibration cases the user named in Stage 1.
For each case, populate scorecard.md.tmpl with group-wise scores.
Critical: report group means separately, never a composite. A failing system with one group at 0.9 and another at 0.2 is not the same as a system with all groups at 0.55.
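A scorecard helper in this spirit exposes only per-group means and deliberately offers no composite. A minimal sketch with hypothetical group and dimension names:

```python
from statistics import mean

def group_means(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Report one mean per group; never collapse across groups."""
    return {group: round(mean(dims.values()), 2) for group, dims in scores.items()}

# Two hypothetical systems that a composite would conflate:
failing  = {"grounding": {"a1": 0.9,  "a2": 0.9},
            "dynamics":  {"b1": 0.2,  "b2": 0.2}}
mediocre = {"grounding": {"a1": 0.55, "a2": 0.55},
            "dynamics":  {"b1": 0.55, "b2": 0.55}}
# Both average to 0.55 as a composite; group means keep them distinct.
```

The deliberate absence of a `composite()` function is the design: any reader of the scorecard is forced to look at each group.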
Verify ordinal predictions: do the calibration cases score in the predicted order? If not:
- Iterate the rubric and log the change in iteration_log.md (see group-design-principles.md M5)
- Or accept that the prediction was wrong and document why
The framework freezes (becomes versioned) when the calibration ordinals hold and at least 2-3 real adjustments have been logged.
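Ordinal verification is mechanical once the scorecards exist: check each predicted pair against the per-group scores. A hypothetical sketch (the skill specifies the check, not this code):

```python
def check_ordinals(predictions, scores):
    """Return the predicted pairs that the scores violate.

    predictions: list of (higher, lower, group) tuples, where `higher`
                 was predicted to outscore `lower` on `group`.
    scores: {case_name: {group: group_mean}}.
    """
    return [(hi, lo, g) for hi, lo, g in predictions
            if scores[hi][g] <= scores[lo][g]]

scores = {"V1": {"grounding": 0.4},
          "V2": {"grounding": 0.6},
          "V4": {"grounding": 0.8}}
preds = [("V2", "V1", "grounding"), ("V4", "V2", "grounding")]
# An empty result means calibration holds for these pairs.
```

Each violation points at exactly one rubric dimension and one case pair, which is what makes the iteration loop tractable.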
Quick example
User: "I have 4 multi-agent debate experiments. The 4th one added claims+verifications infra. I want to evaluate which experiment is doing the most rigorous deliberation."
Stage 1 reveals:
- System class: multi-agent deliberation, 3-5 agents per experiment, 13-20 rounds each
- Calibration cases: V1/V2/V3 (legacy) and V4 (with claims infra)
- Data availability: legacy has narrative round logs only; V4 has full state jsonl
- Predicted ordinals: V2 > V1 (added Critic), V3 > V2 (more agents), V4 highest on grounding (has claims infra)
Stage 2 lands on the 12-axis MADEF taxonomy in madef-axes.md, with 3 groups (Grounding / Dynamics / Architecture).
Stage 3 fills in canonical/proxy for each axis. Most legacy experiments need proxy on A1, A3, B1, B2; V4 has canonical on all.
Stage 4 produces 4 scorecards. The ordinals confirm V4 is highest on Group A (Grounding) but the picture is more nuanced on Group B (V3 outscores V4 on dynamics due to more agents and a unique cross-agent finding). The framework surfaces which dimensions move with the architecture change, which is what the user needed.
Full walkthrough: examples/deliberation-system-eval.md.
How the skill behaves at each turn
- Don't dump all 12 axes at once. Surface them in groups, ask about relevance group-by-group.
- Don't start on the rubric (Stage 3) before the taxonomy is settled (Stage 2). Writing operational definitions before the design intent is settled is wasted work.
- Do push back if the user wants a single composite. The pattern's whole point is to refuse that. Explain why (it hides which dimension failed) rather than just refusing.
- Do verify calibration ordinals before the user "trusts" the framework. If the framework can't reproduce the ordinals the user predicted, something is wrong (rubric, prediction, or scoring) — find which.
References
- references/group-design-principles.md — five design principles + five meta-principles, domain-agnostic
- references/canonical-vs-proxy-decision.md — decision tree for two-track measurement
- references/madef-axes.md — 12-axis instantiation for multi-agent deliberation (use as reference, adapt to your domain)
- references/memory-bench-taxonomy.md — 4-family/8-dimension instantiation for memory eval (alternative shape)
Templates
- templates/axes-design-worksheet.md — fill-in worksheet for designing your own axes
- templates/scorecard.md.tmpl — output format for group-wise scorecards
Examples
- examples/deliberation-system-eval.md — applying MADEF to 4 deliberation experiments
- examples/cross-domain-rag-eval.md — adapting the pattern to RAG evaluation
What this skill does NOT do
- It does not run benchmarks for you — it designs the framework you'll run
- It does not produce automated scoring — scoring is procedurally specified but human-in-the-loop for proxy work
- It does not collapse multi-dim into a single ranking number (refusal is the design)
- It does not validate that the dimensions you choose are the right dimensions for your domain — that's a calibration question, the framework only enforces self-consistency
License
MIT