
Incident Runbook Generator

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "incident-runbook-generator" with this command: npx skills add patricio0312rev/skills/patricio0312rev-skills-incident-runbook-generator

Incident Runbook Generator

Create actionable runbooks for common incidents.

Runbook Template

Runbook: Database Connection Pool Exhausted

Severity: P1 (Critical)
Estimated Time to Resolve: 15-30 minutes
Owner: Database Team (On-call)

Symptoms

  • Application errors: "connection pool exhausted"
  • Increased API latency (>5s)
  • Failed health checks
  • CloudWatch alarm: DatabaseConnectionsHigh

Detection

  • Alert: DatabaseConnectionPoolExhausted
  • Metrics: active_connections > max_connections * 0.9
  • Logs: "Error: connect ETIMEDOUT"
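
The metrics rule above (active_connections > max_connections * 0.9) can be sketched as a small threshold check. This is a hypothetical helper for illustration, not part of the skill; in practice the two counts would come from pg_stat_activity and current_setting('max_connections'):

```python
def pool_exhaustion_alert(active_connections: int, max_connections: int,
                          threshold: float = 0.9) -> bool:
    """Return True when active connections exceed the alert threshold,
    mirroring the detection rule: active > max * 0.9."""
    return active_connections > max_connections * threshold

# 185 active against a limit of 200 is above 90% utilization and fires.
print(pool_exhaustion_alert(185, 200))  # True
print(pool_exhaustion_alert(120, 200))  # False
```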

Immediate Actions (5 min)

  1. Verify the issue
    # Check current connections
    SELECT count(*) FROM pg_stat_activity;

  2. Identify long-running queries
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity
    WHERE state = 'active'
    ORDER BY duration DESC
    LIMIT 10;

  3. Kill blocking queries (if safe)
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
      AND now() - state_change > interval '5 minutes';

Mitigation (10 min)

  1. Scale up connection pool (temporary)
    # Update RDS parameter group (max_connections is a static
    # parameter, so the new value applies at the next reboot)
    aws rds modify-db-parameter-group \
      --db-parameter-group-name prod-params \
      --parameters "ParameterName=max_connections,ParameterValue=200,ApplyMethod=pending-reboot"

  2. Restart application (if needed)
    kubectl rollout restart deployment/api

  3. Monitor recovery
    watch -n 5 'psql -c "SELECT count(*) FROM pg_stat_activity;"'

Root Cause Investigation

Check for:

  • Recent deployment (new code with connection leaks)

  • Traffic spike (legitimate or DDoS)

  • Slow queries holding connections

  • Connection pool configuration too small

  • Application not releasing connections
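
A connection leak of the kind listed above usually comes down to a connection acquired without a guaranteed release. A minimal sketch, using the stdlib sqlite3 module as a stand-in for whatever driver the application actually uses:

```python
import sqlite3

def leaky_query(db_path: str):
    # Leak pattern: if execute() raises, close() is never reached
    # and the connection stays checked out until the pool is exhausted.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT 1").fetchall()
    conn.close()
    return rows

def safe_query(db_path: str):
    # Fix: closing in a finally block (or via a pooling library's
    # context manager) returns the connection even on error.
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT 1").fetchall()
    finally:
        conn.close()

print(safe_query(":memory:"))  # [(1,)]
```

Code review for the try/finally (or context-manager) shape around every connection checkout is a quick way to triage the "new code with connection leaks" hypothesis.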

Rollback Steps

If caused by deployment:

  1. Rollback to previous version
    kubectl rollout undo deployment/api

  2. Verify
    kubectl rollout status deployment/api

Communication Template

Initial (within 5 min):

🚨 INCIDENT: Database connection pool exhausted
Status: Investigating
Impact: API errors and slowness
ETA: 15-30 min
Next update: 10 min

Update (every 10 min):

UPDATE: Killed long-running queries
Status: Mitigating
Impact: Still degraded, improving
Actions: Scaling connection pool
Next update: 10 min

Resolution:

✅ RESOLVED: Database connections normalized
Duration: 25 minutes
Root cause: Connection leak in v2.3.4
Fix: Rolled back to v2.3.3
Follow-up: Bug fix PR #1234
Postmortem: [link]

Prevention

  • Add connection pool metrics to dashboards

  • Implement connection timeout (30s)

  • Add connection leak detection in tests

  • Set up pre-deployment load testing

  • Review connection pool sizing
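
For the pool-sizing review, a common starting heuristic (popularized by the HikariCP pool-sizing guidance for PostgreSQL-backed services) is connections ≈ (cores × 2) + effective spindle count. A hedged sketch; this is a rule of thumb to seed the review, not a prescription:

```python
def suggested_pool_size(core_count: int, spindle_count: int = 1) -> int:
    """Starting point for pool sizing: (cores * 2) + effective spindles.
    A heuristic only -- validate against pre-deployment load testing
    before changing production settings."""
    return core_count * 2 + spindle_count

print(suggested_pool_size(8))      # 17
print(suggested_pool_size(16, 2))  # 34
```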

Related Runbooks

  • Database High CPU

  • Slow Database Queries

  • Application OOM

Output Checklist

  • Symptoms documented
  • Detection criteria
  • Step-by-step actions
  • Owner assigned
  • Rollback procedure
  • Communication templates
  • Prevention measures

