multi-dim-eval-framework

Designs a multi-dimensional evaluation framework for AI systems where single-score benchmarks lose information. Use when comparing experiments/agents across qualitatively different dimensions, when canonical metrics aren't available for legacy systems, or when explaining *which* dimension drove an outcome matters more than ranking.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

To install, send this command to your AI assistant: npx skills add tatsuko-tsukimi/multi-dim-eval-framework

Multi-Dimensional Evaluation Framework Designer

A skill for designing custom multi-dimensional evaluation frameworks for AI systems. Walks the user from "I have a system to evaluate" to "I have a calibrated, group-organized scorecard with canonical/proxy duality and explicit failure modes."

The central premise: a single composite score destroys the information you need to debug which dimension actually drove the outcome. This skill produces frameworks that force the reader to look at multiple numbers, with rules for when each measurement is reliable.

Four-stage flow

  • Stage 1 — Domain elicitation: what system, what evaluation question, what calibration cases
  • Stage 2 — Taxonomy design: group structure + dimensions per group
  • Stage 3 — Rubric: canonical/proxy split per dimension + failure modes
  • Stage 4 — Judgment: group-wise scorecard interpretation (no composite)

After Stage 4, ask: "Want to score additional cases or adjust the rubric?" — this is the calibration loop.

When to use

Activate when the user:

  • Wants to evaluate AI systems (agents, deliberations, RAG, multi-step reasoning) across multiple qualitatively-different dimensions
  • Needs to compare instances with asymmetric data availability (some have canonical metrics, others have only narrative logs)
  • Has noticed single-score benchmarks miss important variation between systems
  • Says "tradeoffs" — and wants to make those tradeoffs explicit per dimension
  • Wants a reusable scorecard format that survives infrastructure migrations

Don't activate when:

  • The user wants a single comparable benchmark number — point them at HumanEval / MMLU / domain-specific benchmarks instead
  • The system has a clear single quality metric (perplexity, accuracy on a labeled set)
  • The user is asking how to design one metric, not a framework of metrics

Stage 1 — Domain elicitation

Goal: extract enough about the user's evaluation domain to design groups and dimensions.

Turn 1 — concrete instances, not abstract criteria. Ask:

"Give me 1-2 concrete instances of systems you want to evaluate (or have already evaluated). What's the question that comparison should answer? — e.g., 'is system V2 more grounded than V1?' / 'does adding a Critic agent reduce sycophancy?'"

This grounds the design in real comparisons rather than generic axes.

Turn 2 — calibration cases. Ask:

"Of the systems you've already run, which 2-3 do you have strong intuitions about — i.e., 'I expect X to score higher than Y because Z'? Those are your calibration cases."

If the user has no calibration cases yet, the framework can't be calibrated. Either:

  • Run on at least 2 prior instances first, or
  • Design the framework theoretically and acknowledge it's uncalibrated until run

Turn 3 — data availability. Ask:

"For each calibration case, what data do you have? — structured records (jsonl, database)? narrative logs (markdown, reports)? both? Same schema across cases or different?"

This determines canonical/proxy split for Stage 3.

Turn 4 — capability layers (optional). If the system is complex, ask:

"If you had to split the evaluation into 3 layers, what would they be? Examples: evidence-quality / process-dynamics / structural-form. Or: retrieval-quality / ranking-quality / adaptation-quality."

The user's natural splits become the groups. If the user can't articulate layers, default to a 3-group structure: (1) evidence/grounding, (2) process/dynamics, (3) structural/architecture. Or use the 4-family alternative shown in memory-bench-taxonomy.md.

By end of Stage 1 you should know:

  • The system class being evaluated (multi-agent / single-LLM / RAG / tool-using / etc.)
  • 2-3 calibration cases with expected ordinals
  • Data availability map (which cases have canonical data, which need proxy)
  • Group structure (typically 3 groups, may be 2 or 4)
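These Stage 1 outputs are concrete enough to capture as a structured record before moving on to Stage 2. A minimal Python sketch; every identifier here (CalibrationCase, OrdinalPrediction, and so on) is this page's illustration, not a name from the skill's own files:

```python
from dataclasses import dataclass, field

@dataclass
class CalibrationCase:
    name: str             # e.g. "V2"
    data_kind: str        # "canonical" (structured records) or "proxy" (narrative logs only)

@dataclass
class OrdinalPrediction:
    higher: str           # case expected to score higher
    lower: str            # case expected to score lower
    reason: str           # the "because Z" from Turn 2
    group: str | None = None   # None = overall; otherwise a specific group

@dataclass
class DomainElicitation:
    system_class: str              # e.g. "multi-agent deliberation"
    evaluation_question: str       # the Turn 1 comparison question
    cases: list[CalibrationCase] = field(default_factory=list)
    predictions: list[OrdinalPrediction] = field(default_factory=list)
    groups: list[str] = field(default_factory=list)   # typically 3, may be 2 or 4
```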

Stage 2 — Taxonomy design

Author the group structure + dimensions per group.

Step 1: Surface the 12-axis MADEF reference to the user. Ask which axes feel relevant.

Don't force the user to use all 12 — most domains use 5-8 of the MADEF axes plus 0-3 domain-specific additions. The MADEF table at the bottom of madef-axes.md shows likely keep/modify/drop patterns for common domains (single-LLM reasoning, tool-using agents, RAG, multi-step coding).

Step 2: Show the memory-bench-designer's 4-family taxonomy as alternative shape.

This makes the point that group structure is domain-driven. memory-bench has 4 groups (capability families) because memory has those layers. Deliberation has 3 groups (evidence/process/structure) because deliberation has those layers. Don't blindly copy — let the user's domain shape it.

Step 3: Walk the design worksheet. Use axes-design-worksheet.md to fill in:

  • Group names + what each layer asks
  • 2-5 dimensions per group
  • For each dimension: name + 1-line definition

Cap at 8-12 total dimensions. More than 12 is unmanageable; fewer than 4 isn't multi-dimensional.
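The worksheet's constraints can be checked mechanically at design time. A sketch under the same hypothetical naming as above, enforcing the 2-5-per-group and 4-12-total bounds:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str
    definition: str      # the 1-line definition from the worksheet

@dataclass
class Group:
    name: str
    question: str        # what this layer asks
    dimensions: list[Dimension]

def validate_taxonomy(groups: list[Group]) -> None:
    total = sum(len(g.dimensions) for g in groups)
    if total > 12:
        raise ValueError(f"{total} dimensions: more than 12 is unmanageable")
    if total < 4:
        raise ValueError(f"{total} dimensions: fewer than 4 isn't multi-dimensional")
    for g in groups:
        if not 2 <= len(g.dimensions) <= 5:
            raise ValueError(f"group {g.name!r} has {len(g.dimensions)} dimensions; aim for 2-5")
```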

Stage 3 — Rubric

For each dimension designed in Stage 2, fill in the operational rubric using canonical-vs-proxy-decision.md:

  • Canonical measure (formula given full data)
  • Fallback proxy (operationalization for partial data)
  • Tie-break rule (partial credit cases)
  • Flag conditions (when to attach a warning flag)
  • Refusal threshold (when proxy is too noisy to score)

A dimension without all five fields is not yet operational — it's a sketch.
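The five fields map directly onto a record, and "operational vs. sketch" becomes a checkable property. A minimal sketch with hypothetical names (the real field definitions live in canonical-vs-proxy-decision.md):

```python
from dataclasses import dataclass

@dataclass
class DimensionRubric:
    dimension: str
    canonical: str | None = None           # formula given full data
    proxy: str | None = None               # operationalization for partial data
    tie_break: str | None = None           # partial-credit rule
    flag_conditions: str | None = None     # when to attach a warning flag
    refusal_threshold: str | None = None   # when the proxy is too noisy to score

    def is_operational(self) -> bool:
        # A dimension missing any of the five fields is still a sketch.
        return None not in (self.canonical, self.proxy, self.tie_break,
                            self.flag_conditions, self.refusal_threshold)
```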

Apply group-design-principles.md M1-M5 meta-principles:

  • M1: ambiguous → report range, not point
  • M2: population-count normalization required for cross-instance
  • M3: stress conditions evaluated separately
  • M4: framework must be falsifiable
  • M5: calibration before claims
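M1 and M2 change what a score *is*, so they are worth one concrete illustration: ambiguous readings become (low, high) ranges, and raw counts are divided by population before any cross-instance comparison. A sketch; the per-agent normalization is an illustrative choice, not the skill's prescribed formula:

```python
# M1: an ambiguous reading is reported as a (low, high) range, never a point.
Score = tuple[float, float]        # a confident point score is just (x, x)

def report(score: Score) -> str:
    low, high = score
    return f"{low:.2f}" if low == high else f"{low:.2f}-{high:.2f}"

# M2: normalize raw counts by population before cross-instance comparison,
# so a 5-agent run isn't rewarded over a 3-agent run just for being bigger.
def per_capita(raw_count: int, population: int) -> float:
    return raw_count / population

print(report((0.4, 0.7)))          # 0.40-0.70
print(per_capita(12, 5))           # 2.4
```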

Stage 4 — Judgment

Apply the framework to the calibration cases the user named in Stage 1.

For each case, populate scorecard.md.tmpl with group-wise scores.

Critical: report group means separately, never a composite. A failing system with one group at 0.9 and another at 0.2 is not the same as a system with all groups at 0.55.
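To see what a composite hides, render those two systems side by side. A sketch of a scorecard report that deliberately exposes group means and nothing else:

```python
def render_scorecard(name: str, group_means: dict[str, float]) -> str:
    # Group means only; deliberately no overall mean.
    lines = [f"{name}:"] + [f"  {group:<12} {mean:.2f}" for group, mean in group_means.items()]
    return "\n".join(lines)

# Both systems average 0.55, but the debugging story is completely different.
print(render_scorecard("system_A", {"grounding": 0.90, "dynamics": 0.20}))
print(render_scorecard("system_B", {"grounding": 0.55, "dynamics": 0.55}))
```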

Verify ordinal predictions: do the calibration cases score in the predicted order? If not:

  • Iterate the rubric and log the change in iteration_log.md (see group-design-principles.md M5)
  • Or accept that the prediction was wrong and document why

The framework freezes (becomes versioned) when the calibration ordinals hold and at least 2-3 real adjustments have been logged.
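The ordinal check itself is simple enough to mechanize. A sketch, assuming scorecards in the group-means shape used above; treating an overall prediction as "the higher case beats the lower case on every group" is this page's assumption, not a rule from the skill:

```python
def check_ordinals(
    scorecards: dict[str, dict[str, float]],        # case -> group -> group mean
    predictions: list[tuple[str, str, str | None]], # (higher, lower, group or None)
) -> list[str]:
    """Return human-readable failures; an empty list means the ordinals hold."""
    failures = []
    for higher, lower, group in predictions:
        if group is not None:
            hi, lo = scorecards[higher][group], scorecards[lower][group]
            scope = f"on {group}"
        else:
            # Overall prediction: demand dominance on every group,
            # since no composite may be formed.
            hi = min(scorecards[higher].values())
            lo = max(scorecards[lower].values())
            scope = "on every group"
        if hi <= lo:
            failures.append(f"expected {higher} > {lower} {scope}: got {hi:.2f} <= {lo:.2f}")
    return failures
```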

Quick example

User: "I have 4 multi-agent debate experiments. The 4th one added claims+verifications infra. I want to evaluate which experiment is doing the most rigorous deliberation."

Stage 1 reveals:

  • System class: multi-agent deliberation, 3-5 agents per experiment, 13-20 rounds each
  • Calibration cases: V1/V2/V3 (legacy) and V4 (with claims infra)
  • Data availability: legacy has narrative round logs only; V4 has full state jsonl
  • Predicted ordinals: V2 > V1 (added Critic), V3 > V2 (more agents), V4 highest on grounding (has claims infra)
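Encoded in the (higher, lower, group) form used by the Stage 4 checker above, those predicted ordinals look like this (the case names are the user's; the tuple encoding is illustrative):

```python
predictions = [
    ("V2", "V1", None),           # added Critic
    ("V3", "V2", None),           # more agents
    ("V4", "V1", "grounding"),    # claims infra should move Group A specifically
    ("V4", "V2", "grounding"),
    ("V4", "V3", "grounding"),
]
```

These tuples plug straight into check_ordinals once the four scorecards exist.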

Stage 2 lands on the 12-axis MADEF taxonomy in madef-axes.md, with 3 groups (Grounding / Dynamics / Architecture).

Stage 3 fills in canonical/proxy for each axis. Most legacy experiments need proxy on A1, A3, B1, B2; V4 has canonical on all.

Stage 4 produces 4 scorecards. The ordinals confirm V4 is highest on Group A (Grounding) but the picture is more nuanced on Group B (V3 outscores V4 on dynamics due to more agents and a unique cross-agent finding). The framework surfaces which dimensions move with the architecture change, which is what the user needed.

Full walkthrough: examples/deliberation-system-eval.md.

How the skill behaves at each turn

  • Don't dump all 12 axes at once. Surface them in groups and ask about relevance group by group.
  • Don't start with the rubric (Stage 3) before the taxonomy is settled (Stage 2). Writing operational definitions before the design intent is settled is wasted work.
  • Do push back if the user wants a single composite. The pattern's whole point is to refuse that. Explain why (it hides which dimension failed) rather than just refusing.
  • Do verify calibration ordinals before the user "trusts" the framework. If the framework can't reproduce the ordinals the user predicted, something is wrong (rubric, prediction, or scoring) — find which.

References

  • madef-axes.md: the 12-axis MADEF reference, with keep/modify/drop patterns per domain
  • memory-bench-taxonomy.md: the 4-family alternative group structure
  • axes-design-worksheet.md: the Stage 2 design worksheet
  • canonical-vs-proxy-decision.md: the Stage 3 rubric fields
  • group-design-principles.md: the M1-M5 meta-principles

Templates

  • scorecard.md.tmpl: group-wise scorecard
  • iteration_log.md: calibration change log

Examples

  • examples/deliberation-system-eval.md: full walkthrough of the quick example

What this skill does NOT do

  • It does not run benchmarks for you — it designs the framework you'll run
  • It does not produce automated scoring — scoring is procedurally specified but human-in-the-loop for proxy work
  • It does not collapse multi-dim into a single ranking number (refusal is the design)
  • It does not validate that the dimensions you choose are the right dimensions for your domain — that's a calibration question, the framework only enforces self-consistency

License

MIT

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • OpenClaw Smartness Eval (Automation): Comprehensive OpenClaw smartness evaluation. Outputs a composite score, evidence, risks, and trends across 14 dimensions (including planning ability and hallucination control). Aligned with the CLEAR / T-Eval / Anthropic industry standards.
  • AI Benchmark — Measure How Your Agent Thinks (Automation): Experiential benchmark for AI reasoning that measures calibration, epistemic flexibility, risk assessment, and metacognition through interactive concert experiences.
  • Agent Benchmark (Automation): Evaluates AI agent capability on 12 standardized tasks covering file operations, data processing, system operations, robustness, and code quality, with automatic scoring and report generation.
  • Botmark Skill (Automation): AI capability benchmarking. When the user asks to "run a benchmark/evaluation", completes the evaluation automatically via the BotMark API and generates a report. Requires the BOTMARK_API_KEY environment variable.