incident-response

Production incident response procedures for Python/React applications. Use when responding to production outages, investigating error spikes, diagnosing performance degradation, or conducting post-mortems. Covers severity classification (SEV1-SEV4), incident commander role, communication templates, diagnostic commands for FastAPI/PostgreSQL/Redis, rollback procedures, and blameless post-mortem process. Does NOT cover monitoring setup (use monitoring-setup) or deployment procedures (use deployment-pipeline).

Install skill "incident-response" with this command: npx skills add hieutrtr/ai1-skills/hieutrtr-ai1-skills-incident-response

Incident Response

When to Use

Activate this skill when:

  • Production service is down or returning errors to users
  • Error rate has spiked beyond normal thresholds
  • Performance has degraded significantly (latency increase, timeouts)
  • An alert has fired from the monitoring system
  • Users are reporting issues that indicate a systemic problem
  • A failed deployment needs investigation and remediation
  • Conducting a post-mortem or root cause analysis after an incident

Output: Write runbooks to docs/runbooks/<service>-runbook.md and post-mortems to postmortem-YYYY-MM-DD.md.

Do NOT use this skill for:

  • Setting up monitoring or alerting rules (use monitoring-setup)
  • Performing routine deployments (use deployment-pipeline)
  • Docker image or infrastructure issues (use docker-best-practices)
  • Feature development or code changes (use python-backend-expert or react-frontend-expert)

Instructions

Severity Classification

Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Impact | Examples | Response Time | Update Cadence |
| --- | --- | --- | --- | --- |
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |

Escalation rules:

  • SEV1: Page on-call engineer + engineering manager immediately
  • SEV2: Page on-call engineer, notify engineering manager
  • SEV3: Notify on-call engineer via Slack
  • SEV4: Create ticket, address during normal working hours

See references/escalation-contacts.md for the contact matrix.
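For alerting glue or chat-ops automation, the severity table above can be encoded as a small lookup. This is an illustrative sketch only; the names `SEVERITY_POLICY` and `escalation_for` are assumptions, not part of this skill's bundled scripts.

```python
# Illustrative encoding of the severity/escalation tables above.
SEVERITY_POLICY = {
    "SEV1": {"response": "immediate (< 5 min)", "update_every_min": 15,
             "page": ["on-call", "eng-manager"]},
    "SEV2": {"response": "< 15 min", "update_every_min": 30,
             "page": ["on-call"]},   # notify eng manager separately
    "SEV3": {"response": "< 1 hour", "update_every_min": 120,
             "page": []},            # Slack notification only
    "SEV4": {"response": "< 4 hours", "update_every_min": 1440,
             "page": []},            # ticket, normal working hours
}

def escalation_for(severity: str) -> dict:
    """Return response time, update cadence, and who to page for a severity."""
    return SEVERITY_POLICY[severity.upper()]
```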

5-Minute Triage Workflow

When an incident is detected, follow this triage workflow within the first 5 minutes.

┌─────────────────────────────────────────────────────────┐
│  MINUTE 0-1: Acknowledge and Classify                   │
│  • Acknowledge the alert or report                      │
│  • Assign severity (SEV1-SEV4)                          │
│  • Designate incident commander                         │
├─────────────────────────────────────────────────────────┤
│  MINUTE 1-2: Assess Scope                               │
│  • Check health endpoints for all services              │
│  • Check error rate and latency dashboards              │
│  • Determine: which services are affected?              │
├─────────────────────────────────────────────────────────┤
│  MINUTE 2-3: Identify Recent Changes                    │
│  • Check: was there a recent deployment?                │
│  • Check: any infrastructure changes?                   │
│  • Check: any external dependency issues?               │
├─────────────────────────────────────────────────────────┤
│  MINUTE 3-4: Initial Communication                      │
│  • Post in #incidents channel                           │
│  • Update status page if SEV1/SEV2                      │
│  • Page additional responders if needed                 │
├─────────────────────────────────────────────────────────┤
│  MINUTE 4-5: Begin Investigation or Mitigate            │
│  • If recent deploy: consider immediate rollback        │
│  • If not deploy-related: begin diagnostic commands     │
│  • Start incident timeline log                          │
└─────────────────────────────────────────────────────────┘

Quick health check command:

./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
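The final triage step is to start an incident timeline log. A minimal append-only sketch; the `IncidentTimeline` class is hypothetical (real responses often use a shared doc or an incident bot), not something bundled with this skill:

```python
import time

class IncidentTimeline:
    """Append-only timeline log for the 'MINUTE 4-5' triage step above."""

    def __init__(self):
        self.entries = []

    def log(self, event, ts=None):
        # Default to a UTC timestamp; allow an explicit one for backfilling.
        ts = ts if ts is not None else time.strftime("%H:%MZ", time.gmtime())
        self.entries.append((ts, event))

    def render(self):
        # One "HH:MMZ - event" line per entry, ready to paste into a post-mortem.
        return "\n".join(f"{ts} - {event}" for ts, event in self.entries)
```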

Incident Commander Role

The incident commander (IC) coordinates the response. They do NOT investigate directly.

IC responsibilities:

  1. Coordinate -- Assign tasks to responders, prevent duplicate work
  2. Communicate -- Post regular updates to stakeholders
  3. Decide -- Make go/no-go decisions on rollback, escalation, communication
  4. Track -- Maintain the incident timeline
  5. Close -- Declare the incident resolved and schedule the post-mortem

IC communication template (initial):

INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]

IC communication template (update):

INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
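If updates are posted by a bot or script, the update template above can be filled programmatically. A hedged sketch; `format_incident_update` is a hypothetical helper, not part of this skill's scripts:

```python
def format_incident_update(title, severity, duration, status,
                           findings, actions, next_update):
    """Fill the IC update template above with concrete values."""
    return (
        f"INCIDENT UPDATE: {title}\n"
        f"Severity: {severity}\n"
        f"Duration: {duration}\n"
        f"Status: {status}\n"
        f"Current findings: {findings}\n"
        f"Actions in progress: {actions}\n"
        f"Next update: {next_update}"
    )
```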

Investigation Steps

Follow these diagnostic steps based on the type of issue.

Application Errors (FastAPI)

# 1. Check application logs for errors
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/

# 2. Check error rate from logs
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"

# 3. Check active connections and request patterns
curl -s http://localhost:8000/health/ready | jq .

# 4. Check if the issue is in a specific endpoint
docker logs app-backend --since 15m 2>&1 | \
  grep "ERROR" | \
  grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn

# 5. Check Python process status
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
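Step 4's grep pipeline can also be expressed in Python, which is convenient once logs have been collected to disk by fetch-logs.sh. An illustrative equivalent; `errors_by_path` is not part of the skill's scripts:

```python
import re
from collections import Counter

def errors_by_path(log_lines):
    """Count ERROR lines per request path -- a Python equivalent of the
    grep | sort | uniq -c pipeline in step 4 above."""
    counts = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue
        match = re.search(r'"path":"([^"]*)"', line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()  # [(path, error_count), ...] descending
```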

Database Issues (PostgreSQL)

# 1. Check database connectivity
docker exec app-db pg_isready -U postgres

# 2. Check active connections (connection pool exhaustion?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT count(*), state FROM pg_stat_activity
  GROUP BY state ORDER BY count DESC;
"

# 3. Check for long-running queries (locks, deadlocks?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT pid, now() - pg_stat_activity.query_start AS duration,
         query, state
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
  AND state != 'idle'
  ORDER BY duration DESC;
"

# 4. Check for lock contention
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocking_locks.pid AS blocking_pid,
         blocked_activity.query AS blocked_query
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation = blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted;
"

# 5. Check disk space
docker exec app-db df -h /var/lib/postgresql/data
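Rows returned by query 3 can be post-processed to decide which backends to terminate. A sketch, assuming the rows arrive as (pid, duration_seconds, state) tuples; the function name is illustrative:

```python
def flag_long_running(rows, threshold_s=30):
    """Given (pid, duration_seconds, state) rows from pg_stat_activity,
    return the pids worth inspecting (and possibly pg_terminate_backend-ing).
    Idle sessions are skipped, matching the query above."""
    return [pid for pid, secs, state in rows
            if secs > threshold_s and state != "idle"]
```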

Redis Issues

# 1. Check Redis connectivity
docker exec app-redis redis-cli ping

# 2. Check memory usage
docker exec app-redis redis-cli info memory | grep used_memory_human

# 3. Check connected clients
docker exec app-redis redis-cli info clients | grep connected_clients

# 4. Check slow log
docker exec app-redis redis-cli slowlog get 10

# 5. Check keyspace
docker exec app-redis redis-cli info keyspace
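The `redis-cli info` output is plain `key:value` lines separated by `# Section` headers, so it is easy to parse for scripted threshold checks. An illustrative parser (not part of the skill's scripts):

```python
def parse_redis_info(raw: str) -> dict:
    """Parse `redis-cli info` output into a flat dict.
    Section headers start with '#'; data lines are 'key:value'."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()  # INFO lines end in \r\n
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info
```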

Network and Infrastructure

# 1. Check DNS resolution
nslookup api.example.com

# 2. Check SSL certificate expiry
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 3. Check container resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# 4. Check disk space on host
df -h /

# 5. Check if dependent services are reachable
curl -sf https://external-api.example.com/health || echo "External API unreachable"
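The `notAfter=` date printed by the openssl command in step 2 can be turned into a days-remaining number for quick triage. A sketch, assuming openssl's default GMT date format (e.g. `Mar 15 12:00:00 2030 GMT`); the helper name is illustrative:

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days left on a certificate, given the notAfter value printed by
    `openssl x509 -noout -dates` (e.g. 'Mar 15 12:00:00 2030 GMT')."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```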

Remediation Actions

Immediate Mitigations (apply within minutes)

| Issue | Mitigation | Command |
| --- | --- | --- |
| Bad deployment | Rollback | ./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/ |
| Connection pool exhausted | Restart backend | docker restart app-backend |
| Long-running query | Kill query | SELECT pg_terminate_backend(<pid>); |
| Memory leak | Restart service | docker restart app-backend |
| Redis full | Flush non-critical keys | redis-cli --scan --pattern "cache:*" \| xargs redis-cli del |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | docker system prune -f |

Longer-Term Fixes (apply after stabilization)

  1. Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
  2. Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
  3. Add tests -- Write regression tests for the failure scenario
  4. Update runbooks -- Document the new failure mode and remediation steps

Communication Protocol

Internal Communication

Channels:

  • #incidents -- Active incident coordination (SEV1/SEV2)
  • #incidents-low -- SEV3/SEV4 tracking
  • #engineering -- Post-incident summaries

Rules:

  1. All communication happens in the designated incident channel
  2. Use threads for investigation details, keep main channel for status updates
  3. IC posts updates at the defined cadence (see severity table)
  4. Tag relevant people explicitly, do not assume they are watching
  5. Timestamp all significant findings and actions

External Communication (SEV1/SEV2)

Status page update template:

[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].

[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].

[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.

Post-Mortem / RCA Framework

Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.

See references/post-mortem-template.md for the full template.

Post-mortem principles:

  1. Blameless -- Focus on systems and processes, not individuals
  2. Thorough -- Identify all contributing factors, not just the trigger
  3. Actionable -- Every finding must produce a concrete action item with an owner
  4. Timely -- Conduct within 48 hours while details are fresh
  5. Shared -- Publish to the entire engineering team

Post-mortem structure:

  1. Summary -- What happened, when, and what was the impact
  2. Timeline -- Minute-by-minute account of detection, investigation, mitigation
  3. Root cause -- The fundamental reason the incident occurred
  4. Contributing factors -- Other conditions that made the incident worse
  5. What went well -- Effective parts of the response
  6. What could be improved -- Gaps in detection, response, or tooling
  7. Action items -- Specific tasks with owners and due dates

Five Whys technique for root cause analysis:

Why did users see 500 errors?
  -> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
  -> Because database connections timed out.
Why did database connections time out?
  -> Because the connection pool was exhausted.
Why was the connection pool exhausted?
  -> Because a new endpoint opened connections without releasing them.
Why were connections not released?
  -> Because the endpoint was missing the async context manager for sessions.

Root cause: Missing async context manager for database sessions in new endpoint.
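When writing the post-mortem, a chain like the one above can be captured as (question, answer) pairs and rendered consistently. A hypothetical formatter that treats the last answer as the root cause; nothing here is part of the skill's scripts:

```python
def render_five_whys(chain):
    """Format a Five Whys chain from (question, answer) pairs, matching the
    layout of the worked example above. The last answer is the root cause."""
    lines = []
    for question, answer in chain:
        lines.append(question)
        lines.append(f"  -> {answer}")
    lines.append("")
    lines.append(f"Root cause: {chain[-1][1]}")
    return "\n".join(lines)
```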

Generate a structured incident report:

python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/

Incident Response Scripts

| Script | Purpose | Usage |
| --- | --- | --- |
| scripts/fetch-logs.sh | Fetch recent logs from services | ./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/ |
| scripts/health-check-all-services.sh | Check health of all services | ./scripts/health-check-all-services.sh --output-dir ./health/ |
| scripts/generate-incident-report.py | Generate structured incident report | python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/ |

Quick Reference: Common Incident Patterns

| Pattern | Symptom | Likely Cause | First Action |
| --- | --- | --- | --- |
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check docker ps, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check docker stats, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |

Output Files

Runbooks: Write to docs/runbooks/<service>-runbook.md:

# Runbook: [Service Name]

## Service Overview
- Purpose, dependencies, critical paths

## Common Issues
### Issue 1: [Description]
- **Symptoms:** [What you see]
- **Diagnosis:** [Commands to run]
- **Resolution:** [Steps to fix]

## Escalation
- On-call: #ops-oncall
- Service owner: @team-name

Post-mortems: Write to postmortem-YYYY-MM-DD.md:

# Post-Mortem: [Incident Title]

## Summary
- **Date:** YYYY-MM-DD
- **Severity:** SEV1-4
- **Duration:** X hours
- **Impact:** [Users/revenue affected]

## Timeline
- HH:MM - [Event]

## Root Cause
[Technical explanation]

## Action Items
- [ ] [Preventive measure] - Owner: @name - Due: YYYY-MM-DD
