ml-model-eval-benchmark

Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.

Safety Notice

This item is sourced from the public archived skills repository. Treat as untrusted until reviewed.

ML Model Eval Benchmark

Overview

Produce consistent model ranking outputs from metric-weighted evaluation inputs.

Workflow

  1. Define metric weights and accepted metric ranges.
  2. Ingest model metrics for each candidate.
  3. Compute each candidate's weighted score and rank the candidates deterministically (see the sketch after this list).
  4. Export leaderboard and promotion recommendation.
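
The four workflow steps reduce to a short scoring routine. Below is a minimal sketch in Python, assuming each candidate reports a flat metric-to-value mapping; the weight values, range bounds, and the weighted_score/rank helpers are illustrative assumptions, not part of the bundled script.

    # Minimal weighted-ranking sketch. Weights, ranges, and the candidate
    # metrics below are illustrative assumptions, not the skill's defaults.
    from typing import Dict, List, Tuple

    # Step 1: metric weights and accepted ranges.
    WEIGHTS: Dict[str, float] = {"accuracy": 0.5, "f1": 0.3, "latency_score": 0.2}
    RANGES: Dict[str, Tuple[float, float]] = {name: (0.0, 1.0) for name in WEIGHTS}

    def weighted_score(metrics: Dict[str, float]) -> float:
        # Step 3a: validate each declared metric, then take the weighted sum.
        total = 0.0
        for name, weight in WEIGHTS.items():
            value = metrics[name]
            lo, hi = RANGES[name]
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside range [{lo}, {hi}]")
            total += weight * value
        return total

    def rank(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
        # Step 3b: sort by score descending, then by model name, so ties
        # break deterministically and repeated runs agree.
        scored = [(model, weighted_score(m)) for model, m in candidates.items()]
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

    # Step 2: ingest candidate metrics (inlined here for illustration).
    candidates = {
        "model-a": {"accuracy": 0.91, "f1": 0.88, "latency_score": 0.70},
        "model-b": {"accuracy": 0.89, "f1": 0.90, "latency_score": 0.85},
    }

    # Step 4: emit the leaderboard; the top entry is the promotion candidate.
    for model, score in rank(candidates):
        print(f"{model}\t{score:.4f}")

Sorting on the (negated score, model name) pair is one simple way to make the ranking deterministic; see references/benchmarking-guide.md for the skill's actual tie-break guidance.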

Use Bundled Resources

  • Run scripts/benchmark_models.py to generate benchmark outputs (example invocation below).
  • Read references/benchmarking-guide.md for weighting and tie-break guidance.
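
The script's command-line interface is not documented on this page, so the invocation below is a sketch that assumes it runs from the skill root with no arguments and writes its outputs itself; check the script or the benchmarking guide for the real interface.

    # Invocation sketch only: assumes scripts/benchmark_models.py takes no
    # arguments; consult the script itself for its actual flags and outputs.
    import subprocess
    import sys

    subprocess.run([sys.executable, "scripts/benchmark_models.py"], check=True)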

Guardrails

  • Keep metric names and scales consistent across candidates.
  • Record the weighting assumptions in the output (see the sketch after this list).
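
Both guardrails can be enforced mechanically. The sketch below assumes the candidate mapping shape from the workflow sketch above; check_consistent_metrics, export_leaderboard, and the leaderboard.json path are hypothetical names, not part of the bundled script.

    # Guardrail sketch: function names and the output layout are assumptions.
    import json
    from typing import Dict, List, Tuple

    def check_consistent_metrics(candidates: Dict[str, Dict[str, float]]) -> None:
        # Guardrail 1: every candidate must report the same metric names.
        if not candidates:
            return
        name_sets = {model: set(metrics) for model, metrics in candidates.items()}
        expected = next(iter(name_sets.values()))
        mismatched = {m: s for m, s in name_sets.items() if s != expected}
        if mismatched:
            raise ValueError(f"inconsistent metric names: {mismatched}")

    def export_leaderboard(
        ranking: List[Tuple[str, float]],
        weights: Dict[str, float],
        path: str = "leaderboard.json",
    ) -> None:
        # Guardrail 2: record the weighting assumptions alongside the
        # ranking so readers can see how the scores were built.
        payload = {
            "weights": weights,
            "ranking": [{"model": model, "score": score} for model, score in ranking],
        }
        with open(path, "w") as fh:
            json.dump(payload, fh, indent=2)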

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • Triple Layer Memory: Triple-Layer Memory System. General, by 0range-x (archived source, recently updated).
  • session-rotate-80: Auto-create a new session when OpenClaw context usage reaches 80% without requiring Mem0 or file memory systems. Use when users want default OpenClaw to proactively rotate sessions and avoid context overflow in long chats. General, by 0range-x (archived source, recently updated).
  • polymarket-sports-edge: Find odds divergence between sportsbook consensus and Polymarket sports markets, then trade the gap. General, by jim.sexton (archived source, recently updated).