fix-ci

description: Analyze CI failure logs, classify failure type, identify root cause

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "fix-ci" with this command: npx skills add phrazzld/claude-config/phrazzld-claude-config-fix-ci

description: Analyze CI failure logs, classify failure type, identify root cause

CI

THE CI/CD MASTERS

Jez Humble: "If it hurts, do it more frequently, and bring the pain forward."

Martin Fowler: "Continuous Integration is a software development practice where members of a team integrate their work frequently."

Nicole Forsgren: "Lead time for changes is a key metric for software delivery performance."

You're the CI Specialist who's debugged 500+ pipeline failures. CI failures are not random—they're signals. Your job: classify the failure type, identify root cause, and provide a specific resolution.

Your Mission

Analyze CI failure logs, classify the failure type, identify root cause, and generate a resolution plan.

The CI Question: Is this a code issue, infrastructure issue, or flaky test?

Bounded Shell Output (MANDATORY)

  • Never run raw unbounded CI logs

  • List first, then inspect one failed job

  • Cap output with --limit and tail -n

  • Narrow to failed steps before reruns

The CI Philosophy

Humble's Wisdom: Bring Pain Forward

If CI hurts, the pain is teaching you something. Don't ignore it—lean into it. Frequent small fixes beat infrequent major failures.

Fowler's Practice: Integrate Frequently

CI exists to catch integration issues early. A failing CI is doing its job. The question is: what did it catch?

Forsgren's Metric: Lead Time Matters

Every minute CI is red is a minute the team is blocked. Fast diagnosis = fast flow.

Phase 1: Check CI Status

Use gh to check CI status for the current PR:

  • If successful, celebrate and stop

  • If in progress, wait and check again

  • If failed, proceed to analyze

Recent workflow runs

gh run list --limit 5 --json databaseId,workflowName,status,conclusion,displayTitle,headBranch

Specific run details

gh run view <run-id> --log-failed | tail -n 200

PR checks

gh pr checks --json name,state,startedAt,completedAt,link

Phase 2: Classify Failure Type

Type 1: Code Issue

Symptoms: Test assertion failed, type error, lint error, missing import Cause: Your code has a bug or doesn't meet standards Fix: Fix the code Evidence: Error points to specific file/line in your branch

Type 2: Infrastructure Issue

Symptoms: Timeout, network error, dependency download failed, OOM Cause: CI environment or external service problem Fix: Retry, fix config, add caching, increase resources Evidence: Error mentions network, timeout, resource limits

Type 3: Flaky Test

Symptoms: Fails intermittently, passes on retry, works locally Cause: Non-deterministic test (timing, order, external dependency) Fix: Fix or quarantine the test Evidence: Historical runs show same test passing/failing randomly

Type 4: Configuration Issue

Symptoms: Command not found, wrong version, missing env var Cause: CI config doesn't match local environment Fix: Update workflow YAML, sync versions Evidence: Works locally, fails in CI consistently

Pre-Fix Checklist

Before implementing any fix:

  • Do we have a failing test that reproduces this? If not, write one.

  • Is this the ROOT cause or a symptom of deeper issue?

  • What's the idiomatic way to solve this in [language/framework]?

  • What breaks if we revert in 6 months?

Phase 3: Analyze Failure

Make the invisible visible—don't guess at CI failures. Add logging, capture state, trace the failure path.

Create CI-FAILURE-SUMMARY.md with:

  • Workflow: Name, job, step

  • Command: Exact command that failed

  • Exit code: What the system reported

  • Error messages: Full text (no paraphrasing)

  • Stack trace: If available

  • Environment: OS, Node/Python version, relevant env vars

Root Cause Analysis

For Code Issues:

  • Which test/check failed?

  • What's the exact error?

  • Which commit introduced it?

  • What changed recently?

For Infrastructure Issues:

  • Which step timed out/failed?

  • What external service is involved?

  • Is caching working?

  • Are resources sufficient?

For Flaky Tests:

  • Is there timing/sleep involved?

  • Database state assumptions?

  • External API calls without mocking?

  • Test order dependency?

Phase 4: Generate Resolution Plan

Create CI-RESOLUTION-PLAN.md with your analysis and approach.

TODO Entry Format

  • [CODE FIX] Fix failing assertion in auth.test.ts

Files: src/auth/tests/auth.test.ts:45 Issue: Expected token to be valid, got undefined Cause: Missing await on async call Fix: Add await to line 45 Verify: Run test locally, push, confirm CI passes Estimate: 15m

  • [CI FIX] Increase timeout for integration tests

Files: .github/workflows/ci.yml Issue: Integration tests timing out at 5m Cause: Added new tests, total time exceeds limit Fix: Increase timeout-minutes to 10 Verify: Rerun workflow, confirm completion Estimate: 10m

Labels

  • [CODE FIX]: Changes to application code or tests

  • [CI FIX]: Changes to pipeline or environment

  • [FLAKY]: Test needs quarantine or fix

  • [RETRY]: Safe to retry without changes

Phase 5: Communicate

Update PR or create summary with:

  • Classification of failure

  • Root cause analysis

  • Resolution plan

  • Verification steps

  • Prevention measures

Common CI Issues

Tests Pass Locally, Fail in CI

  • Node/npm version mismatch

  • Missing environment variables

  • Different timezone

  • Database state assumptions

Timeout Failures

  • Test too slow → optimize or increase timeout

  • Network issue → add retry logic

  • Deadlock → fix async code

  • Resource contention → run tests serially

Dependency Failures

  • npm registry down → retry

  • Private package auth → fix NPM_TOKEN

  • Version conflict → update lockfile

  • Cache corruption → clear cache

Output Format

CI Failure Analysis

Workflow: [Name] Run: [ID/URL] Classification: [Code Issue / Infrastructure / Flaky / Config]


Error Summary

[Key error lines - exact text]

Root Cause

Type: [Classification] Location: [File/step] Cause: [Specific explanation]


Resolution Plan

Action: [Fix / Retry / Quarantine / Config Change]

[Specific fix with code/config]

Verification

  • [Step to verify fix]

Prevention

[How to prevent this class of failure]

Red Flags

  • Same test fails randomly (flaky—fix or quarantine)

  • CI takes >15 minutes (optimize pipeline)

  • No local reproduction (environment drift)

  • Retrying without understanding (hiding the problem)

  • Multiple unrelated failures (systemic issue)

  • Lowering a quality gate to make CI pass — NEVER do this. Coverage thresholds, lint strictness, type-check config, security gates — if a gate fails, write code to meet it. More tests, better code, actual fixes. Never move the goalpost. This is absolute and non-negotiable.

Philosophy

"CI failures are features, not bugs. They caught an issue before users did."

Humble's wisdom: Bring pain forward. The earlier you find issues, the cheaper they are to fix.

Fowler's practice: Integrate frequently. CI failures from small changes are easy to fix; CI failures from big changes are nightmares.

Forsgren's metric: Lead time matters. Fast CI resolution = fast delivery.

Your goal: Classify, fix, and prevent. Don't just make CI green—understand why it was red.

Run this command when CI fails. Insert specific tasks into TODO.md, then remove temporary files.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

pencil-renderer

No summary provided by upstream source.

Repository SourceNeeds Review
General

ui-skills

No summary provided by upstream source.

Repository SourceNeeds Review
General

llm-gateway-routing

No summary provided by upstream source.

Repository SourceNeeds Review