ml-model-evaluation

ML model evaluation workflow for metric design, threshold setting, and failure segmentation. Use when model readiness decisions require explicit accept/reject criteria and segment-level evidence; do not use for generic API-layer or infrastructure-only changes.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "ml-model-evaluation" with this command: npx skills add kentoshimizu/sw-agent-skills/kentoshimizu-sw-agent-skills-ml-model-evaluation

ML Model Evaluation

Overview

Use this skill to evaluate models with decision-grade evidence across aggregate and high-risk segments.

Scope Boundaries

  • Use this skill when the task matches the trigger condition in the skill description above (model readiness decisions requiring explicit accept/reject criteria and segment-level evidence).
  • Do not use this skill when the primary task falls outside model evaluation, such as generic API-layer or infrastructure-only changes.

Shared References

  • Threshold and segmentation rules:
    • references/threshold-and-segmentation-rules.md

Templates And Assets

  • Evaluation report template:
    • assets/evaluation-report-template.md

Inputs To Gather

  • Dataset splits and baseline/candidate definitions.
  • Business cost trade-offs for false positives/negatives.
  • Segment definitions for fairness/risk-critical cohorts.
  • Acceptance thresholds and calibration requirements.
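The inputs above can be collected into a single structure before evaluation begins. A minimal sketch, with illustrative field names and values (nothing here is prescribed by the skill itself):

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationInputs:
    """Inputs to gather before evaluation (field names are illustrative)."""
    splits: dict                 # named dataset splits, e.g. {"holdout": path}
    baseline_model: str          # identifier of the incumbent model
    candidate_model: str         # identifier of the challenger
    cost_false_positive: float   # business cost of acting on a wrong positive
    cost_false_negative: float   # business cost of missing a true positive
    segments: list = field(default_factory=list)  # fairness / risk cohorts
    min_recall: float = 0.0      # acceptance threshold; 0.0 means "not set"
    max_calibration_error: float = 1.0

    def ready(self) -> bool:
        # Mirrors the failure conditions below: no evaluation without
        # defined segments and acceptance thresholds.
        return bool(self.segments) and self.min_recall > 0.0

inputs = EvaluationInputs(
    splits={"holdout": "holdout.parquet"},          # hypothetical path
    baseline_model="fraud-v3", candidate_model="fraud-v4",
    cost_false_positive=1.0, cost_false_negative=25.0,
    segments=["new_accounts", "high_value"],
    min_recall=0.85, max_calibration_error=0.05,
)
```

A `ready()` check of this kind gives the "stop when acceptance thresholds are undefined" rule a concrete, testable form.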

Deliverables

  • Evaluation report with thresholds and decision.
  • Segment-level failure analysis.
  • Acceptance/rejection rationale and follow-ups.
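The segment-level failure analysis deliverable can be sketched as a per-cohort error-rate report that flags cohorts where errors concentrate. Segment labels and the concentration factor below are hypothetical:

```python
import numpy as np

def segment_error_report(y_true, y_pred, segment_labels, concentration_factor=1.5):
    """Error rate per segment; flag segments whose error rate exceeds the
    aggregate rate by `concentration_factor` (the factor is illustrative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.asarray(segment_labels)
    errors = (y_true != y_pred)
    overall = errors.mean()
    report = {}
    for seg in np.unique(labels):
        mask = labels == seg
        rate = errors[mask].mean()
        report[seg] = {"error_rate": float(rate),
                       "n": int(mask.sum()),
                       "flagged": bool(rate > concentration_factor * overall)}
    return float(overall), report

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
segs   = ["new", "new", "new", "new", "tenured", "tenured", "tenured", "tenured"]
overall, report = segment_error_report(y_true, y_pred, segs)
```

In this toy data all errors fall in the "new" cohort, so it is flagged even though the aggregate error rate looks acceptable.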

Workflow

  1. Build the evaluation report from assets/evaluation-report-template.md.
  2. Apply the threshold and segmentation policy in references/threshold-and-segmentation-rules.md.
  3. Validate calibration and check for error concentration in high-risk segments.
  4. Compare baseline and candidate under identical data, threshold, and segment conditions.
  5. Publish the release recommendation along with unresolved risks.
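Steps 3 and 4 above can be sketched as follows: score baseline and candidate on the same split at the same threshold, and attach a simple calibration diagnostic. The equal-width-bin ECE here is one common diagnostic, not the skill's mandated metric:

```python
import numpy as np

def expected_calibration_error(y_true, p, n_bins=10):
    """ECE: bin-weighted mean |observed rate - mean confidence| over
    equal-width probability bins (a simplification of the usual scheme)."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p[mask].mean())
    return float(ece)

def compare(y_true, p_baseline, p_candidate, threshold):
    """Score both models under identical conditions (same split, same threshold)."""
    out = {}
    for name, p in [("baseline", p_baseline), ("candidate", p_candidate)]:
        pred = (np.asarray(p) >= threshold).astype(int)
        yt = np.asarray(y_true)
        tp = int(((pred == 1) & (yt == 1)).sum())
        fp = int(((pred == 1) & (yt == 0)).sum())
        fn = int(((pred == 0) & (yt == 1)).sum())
        out[name] = {"precision": tp / max(tp + fp, 1),
                     "recall": tp / max(tp + fn, 1),
                     "ece": expected_calibration_error(yt, p)}
    return out
```

Keeping the comparison in one function makes it harder to accidentally score the two models on different splits or thresholds.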

Quality Standard

  • Thresholds are tied to business risk trade-offs.
  • Critical segments are explicitly evaluated.
  • Decision rationale is traceable to evidence.
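"Thresholds tied to business risk trade-offs" has a standard decision-theoretic form: for a well-calibrated classifier, predicting positive when p ≥ c_fp / (c_fp + c_fn) minimizes expected cost. A sketch, with an empirical cost check to compare thresholds on real data:

```python
import numpy as np

def cost_optimal_threshold(c_fp, c_fn):
    """Cost-minimizing decision threshold for calibrated probabilities
    (standard result; assumes costs for correct decisions are zero)."""
    return c_fp / (c_fp + c_fn)

def empirical_cost(y_true, p, threshold, c_fp, c_fn):
    """Total realized cost of thresholding scores p at `threshold`."""
    pred = np.asarray(p) >= threshold
    yt = np.asarray(y_true).astype(bool)
    fp = int((pred & ~yt).sum())
    fn = int((~pred & yt).sum())
    return c_fp * fp + c_fn * fn

# With false negatives 25x as costly as false positives, the optimal
# threshold is far below 0.5 (1/26, about 0.038).
t_star = cost_optimal_threshold(c_fp=1.0, c_fn=25.0)
```

This makes the threshold choice traceable to the stated cost inputs rather than to a default 0.5.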

Failure Conditions

  • Stop when evaluation omits high-risk segments.
  • Stop when acceptance thresholds are undefined.
  • Escalate when model risk is unacceptable for rollout.
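The stop/escalate rules above can be encoded as explicit guards at the end of an evaluation run. The function signature is illustrative:

```python
def check_failure_conditions(evaluated_segments, required_segments,
                             acceptance_thresholds, risk_acceptable):
    """Enforce the failure conditions: stop on missing high-risk segments
    or undefined thresholds; escalate on unacceptable risk."""
    missing = set(required_segments) - set(evaluated_segments)
    if missing:
        raise RuntimeError(f"Stop: high-risk segments not evaluated: {sorted(missing)}")
    if not acceptance_thresholds:
        raise RuntimeError("Stop: acceptance thresholds are undefined")
    if not risk_acceptable:
        return "escalate"
    return "proceed"
```

Raising rather than warning keeps an incomplete evaluation from silently producing a release recommendation.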

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • architecture-clean-architecture
  • sqlalchemy-orm-patterns
  • information-architecture
  • db-normalization

No summaries were provided by the upstream source for these skills.