ml-model-evaluation

ML model evaluation workflow for metric design, threshold setting, and failure segmentation. Use when model readiness decisions require explicit accept/reject criteria and segment-level evidence; do not use for generic API-layer or infrastructure-only changes.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "ml-model-evaluation" with this command: npx skills add kentoshimizu/sw-agent-skills/kentoshimizu-sw-agent-skills-ml-model-evaluation

ML Model Evaluation

Overview

Use this skill to evaluate models with decision-grade evidence across aggregate and high-risk segments.

Scope Boundaries

  • Use this skill when the task matches the trigger condition in the skill description above (model readiness decisions requiring explicit accept/reject criteria and segment-level evidence).
  • Do not use this skill when the primary task falls outside model evaluation, such as generic API-layer or infrastructure-only changes.

Shared References

  • Threshold and segmentation rules:
    • references/threshold-and-segmentation-rules.md

Templates And Assets

  • Evaluation report template:
    • assets/evaluation-report-template.md

Inputs To Gather

  • Dataset splits and baseline/candidate definitions.
  • Business cost trade-offs for false positives/negatives.
  • Segment definitions for fairness/risk-critical cohorts.
  • Acceptance thresholds and calibration requirements.
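The inputs above can be collected into a single structure before evaluation begins. A minimal sketch, with illustrative field names and values (nothing here is prescribed by the skill itself):

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationInputs:
    """Inputs to gather before evaluation (field names are illustrative)."""
    splits: dict                 # named dataset splits, e.g. {"holdout": path}
    baseline_model: str          # identifier of the incumbent model
    candidate_model: str         # identifier of the challenger
    cost_false_positive: float   # business cost of acting on a wrong positive
    cost_false_negative: float   # business cost of missing a true positive
    segments: list = field(default_factory=list)  # fairness / risk cohorts
    min_recall: float = 0.0      # acceptance threshold; 0.0 means "not set"
    max_calibration_error: float = 1.0

    def ready(self) -> bool:
        # Mirrors the failure conditions below: no evaluation without
        # defined segments and acceptance thresholds.
        return bool(self.segments) and self.min_recall > 0.0

inputs = EvaluationInputs(
    splits={"holdout": "holdout.parquet"},          # hypothetical path
    baseline_model="fraud-v3", candidate_model="fraud-v4",
    cost_false_positive=1.0, cost_false_negative=25.0,
    segments=["new_accounts", "high_value"],
    min_recall=0.85, max_calibration_error=0.05,
)
```

A `ready()` check of this kind gives the "stop when acceptance thresholds are undefined" rule a concrete, testable form.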

Deliverables

  • Evaluation report with thresholds and decision.
  • Segment-level failure analysis.
  • Acceptance/rejection rationale and follow-ups.
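The segment-level failure analysis deliverable can be sketched as a per-cohort error-rate report that flags cohorts where errors concentrate. Segment labels and the concentration factor below are hypothetical:

```python
import numpy as np

def segment_error_report(y_true, y_pred, segment_labels, concentration_factor=1.5):
    """Error rate per segment; flag segments whose error rate exceeds the
    aggregate rate by `concentration_factor` (the factor is illustrative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.asarray(segment_labels)
    errors = (y_true != y_pred)
    overall = errors.mean()
    report = {}
    for seg in np.unique(labels):
        mask = labels == seg
        rate = errors[mask].mean()
        report[seg] = {"error_rate": float(rate),
                       "n": int(mask.sum()),
                       "flagged": bool(rate > concentration_factor * overall)}
    return float(overall), report

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
segs   = ["new", "new", "new", "new", "tenured", "tenured", "tenured", "tenured"]
overall, report = segment_error_report(y_true, y_pred, segs)
```

In this toy data all errors fall in the "new" cohort, so it is flagged even though the aggregate error rate looks acceptable.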

Workflow

  1. Build the evaluation report from assets/evaluation-report-template.md.
  2. Apply the threshold and segmentation policy in references/threshold-and-segmentation-rules.md.
  3. Validate calibration and check for error concentration in high-risk segments.
  4. Compare baseline and candidate under identical data, threshold, and segment conditions.
  5. Publish the release recommendation along with unresolved risks.
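Steps 3 and 4 above can be sketched as follows: score baseline and candidate on the same split at the same threshold, and attach a simple calibration diagnostic. The equal-width-bin ECE here is one common diagnostic, not the skill's mandated metric:

```python
import numpy as np

def expected_calibration_error(y_true, p, n_bins=10):
    """ECE: bin-weighted mean |observed rate - mean confidence| over
    equal-width probability bins (a simplification of the usual scheme)."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p[mask].mean())
    return float(ece)

def compare(y_true, p_baseline, p_candidate, threshold):
    """Score both models under identical conditions (same split, same threshold)."""
    out = {}
    for name, p in [("baseline", p_baseline), ("candidate", p_candidate)]:
        pred = (np.asarray(p) >= threshold).astype(int)
        yt = np.asarray(y_true)
        tp = int(((pred == 1) & (yt == 1)).sum())
        fp = int(((pred == 1) & (yt == 0)).sum())
        fn = int(((pred == 0) & (yt == 1)).sum())
        out[name] = {"precision": tp / max(tp + fp, 1),
                     "recall": tp / max(tp + fn, 1),
                     "ece": expected_calibration_error(yt, p)}
    return out
```

Keeping the comparison in one function makes it harder to accidentally score the two models on different splits or thresholds.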

Quality Standard

  • Thresholds are tied to business risk trade-offs.
  • Critical segments are explicitly evaluated.
  • Decision rationale is traceable to evidence.
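"Thresholds tied to business risk trade-offs" has a standard decision-theoretic form: for a well-calibrated classifier, predicting positive when p ≥ c_fp / (c_fp + c_fn) minimizes expected cost. A sketch, with an empirical cost check to compare thresholds on real data:

```python
import numpy as np

def cost_optimal_threshold(c_fp, c_fn):
    """Cost-minimizing decision threshold for calibrated probabilities
    (standard result; assumes costs for correct decisions are zero)."""
    return c_fp / (c_fp + c_fn)

def empirical_cost(y_true, p, threshold, c_fp, c_fn):
    """Total realized cost of thresholding scores p at `threshold`."""
    pred = np.asarray(p) >= threshold
    yt = np.asarray(y_true).astype(bool)
    fp = int((pred & ~yt).sum())
    fn = int((~pred & yt).sum())
    return c_fp * fp + c_fn * fn

# With false negatives 25x as costly as false positives, the optimal
# threshold is far below 0.5 (1/26, about 0.038).
t_star = cost_optimal_threshold(c_fp=1.0, c_fn=25.0)
```

This makes the threshold choice traceable to the stated cost inputs rather than to a default 0.5.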

Failure Conditions

  • Stop when evaluation omits high-risk segments.
  • Stop when acceptance thresholds are undefined.
  • Escalate when model risk is unacceptable for rollout.
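The stop/escalate rules above can be encoded as explicit guards at the end of an evaluation run. The function signature is illustrative:

```python
def check_failure_conditions(evaluated_segments, required_segments,
                             acceptance_thresholds, risk_acceptable):
    """Enforce the failure conditions: stop on missing high-risk segments
    or undefined thresholds; escalate on unacceptable risk."""
    missing = set(required_segments) - set(evaluated_segments)
    if missing:
        raise RuntimeError(f"Stop: high-risk segments not evaluated: {sorted(missing)}")
    if not acceptance_thresholds:
        raise RuntimeError("Stop: acceptance thresholds are undefined")
    if not risk_acceptable:
        return "escalate"
    return "proceed"
```

Raising rather than warning keeps an incomplete evaluation from silently producing a release recommendation.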

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • architecture-clean-architecture
  • sqlalchemy-orm-patterns
  • information-architecture
  • db-normalization

No summaries were provided by the upstream source for these skills.