incident-response

Production incident response procedures for Python/React applications. Use when responding to production outages, investigating error spikes, diagnosing performance degradation, or conducting post-mortems. Covers severity classification (SEV1-SEV4), incident commander role, communication templates, diagnostic commands for FastAPI/PostgreSQL/Redis, rollback procedures, and blameless post-mortem process. Does NOT cover monitoring setup (use monitoring-setup) or deployment procedures (use deployment-pipeline).

Install skill "incident-response" with this command: npx skills add hieutrtr/ai1-skills/hieutrtr-ai1-skills-incident-response

Incident Response

When to Use

Activate this skill when:

  • Production service is down or returning errors to users
  • Error rate has spiked beyond normal thresholds
  • Performance has degraded significantly (latency increase, timeouts)
  • An alert has fired from the monitoring system
  • Users are reporting issues that indicate a systemic problem
  • A failed deployment needs investigation and remediation
  • Conducting a post-mortem or root cause analysis after an incident

Output: Write runbooks to docs/runbooks/<service>-runbook.md and post-mortems to postmortem-YYYY-MM-DD.md.

Do NOT use this skill for:

  • Setting up monitoring or alerting rules (use monitoring-setup)
  • Performing routine deployments (use deployment-pipeline)
  • Docker image or infrastructure issues (use docker-best-practices)
  • Feature development or code changes (use python-backend-expert or react-frontend-expert)

Instructions

Severity Classification

Classify every incident immediately. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Impact | Examples | Response Time | Update Cadence |
| --- | --- | --- | --- | --- |
| SEV1 (P1) | Complete outage, all users affected | Service down, data loss, security breach | Immediate (< 5 min) | Every 15 min |
| SEV2 (P2) | Major degradation, most users affected | Core feature broken, severe latency | < 15 min | Every 30 min |
| SEV3 (P3) | Partial degradation, some users affected | Non-critical feature broken, intermittent errors | < 1 hour | Every 2 hours |
| SEV4 (P4) | Minor issue, few users affected | Cosmetic bug, edge case error | < 4 hours | Daily |

Escalation rules:

  • SEV1: Page on-call engineer + engineering manager immediately
  • SEV2: Page on-call engineer, notify engineering manager
  • SEV3: Notify on-call engineer via Slack
  • SEV4: Create ticket, address during normal working hours

See references/escalation-contacts.md for the contact matrix.
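For alerting glue or chat-ops automation, the severity table above can be encoded as a small lookup. This is an illustrative sketch only; the names `SEVERITY_POLICY` and `escalation_for` are assumptions, not part of this skill's bundled scripts.

```python
# Illustrative encoding of the severity/escalation tables above.
SEVERITY_POLICY = {
    "SEV1": {"response": "immediate (< 5 min)", "update_every_min": 15,
             "page": ["on-call", "eng-manager"]},
    "SEV2": {"response": "< 15 min", "update_every_min": 30,
             "page": ["on-call"]},   # notify eng manager separately
    "SEV3": {"response": "< 1 hour", "update_every_min": 120,
             "page": []},            # Slack notification only
    "SEV4": {"response": "< 4 hours", "update_every_min": 1440,
             "page": []},            # ticket, normal working hours
}

def escalation_for(severity: str) -> dict:
    """Return response time, update cadence, and who to page for a severity."""
    return SEVERITY_POLICY[severity.upper()]
```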

5-Minute Triage Workflow

When an incident is detected, follow this triage workflow within the first 5 minutes.

┌─────────────────────────────────────────────────────────┐
│  MINUTE 0-1: Acknowledge and Classify                   │
│  • Acknowledge the alert or report                      │
│  • Assign severity (SEV1-SEV4)                          │
│  • Designate incident commander                         │
├─────────────────────────────────────────────────────────┤
│  MINUTE 1-2: Assess Scope                               │
│  • Check health endpoints for all services              │
│  • Check error rate and latency dashboards              │
│  • Determine: which services are affected?              │
├─────────────────────────────────────────────────────────┤
│  MINUTE 2-3: Identify Recent Changes                    │
│  • Check: was there a recent deployment?                │
│  • Check: any infrastructure changes?                   │
│  • Check: any external dependency issues?               │
├─────────────────────────────────────────────────────────┤
│  MINUTE 3-4: Initial Communication                      │
│  • Post in #incidents channel                           │
│  • Update status page if SEV1/SEV2                      │
│  • Page additional responders if needed                 │
├─────────────────────────────────────────────────────────┤
│  MINUTE 4-5: Begin Investigation or Mitigate            │
│  • If recent deploy: consider immediate rollback        │
│  • If not deploy-related: begin diagnostic commands     │
│  • Start incident timeline log                          │
└─────────────────────────────────────────────────────────┘

Quick health check command:

./skills/incident-response/scripts/health-check-all-services.sh \
  --output-dir ./incident-triage/
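The final triage step is to start an incident timeline log. A minimal append-only sketch; the `IncidentTimeline` class is hypothetical (real responses often use a shared doc or an incident bot), not something bundled with this skill:

```python
import time

class IncidentTimeline:
    """Append-only timeline log for the 'MINUTE 4-5' triage step above."""

    def __init__(self):
        self.entries = []

    def log(self, event, ts=None):
        # Default to a UTC timestamp; allow an explicit one for backfilling.
        ts = ts if ts is not None else time.strftime("%H:%MZ", time.gmtime())
        self.entries.append((ts, event))

    def render(self):
        # One "HH:MMZ - event" line per entry, ready to paste into a post-mortem.
        return "\n".join(f"{ts} - {event}" for ts, event in self.entries)
```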

Incident Commander Role

The incident commander (IC) coordinates the response. They do NOT investigate directly.

IC responsibilities:

  1. Coordinate -- Assign tasks to responders, prevent duplicate work
  2. Communicate -- Post regular updates to stakeholders
  3. Decide -- Make go/no-go decisions on rollback, escalation, communication
  4. Track -- Maintain the incident timeline
  5. Close -- Declare the incident resolved and schedule the post-mortem

IC communication template (initial):

INCIDENT DECLARED: [Title]
Severity: [SEV1/SEV2/SEV3/SEV4]
Commander: [Name]
Start time: [UTC timestamp]
Impact: [What users are experiencing]
Status: Investigating
Next update: [Time]

IC communication template (update):

INCIDENT UPDATE: [Title]
Severity: [SEV level]
Duration: [Time since start]
Status: [Investigating/Identified/Mitigating/Resolved]
Current findings: [What we know]
Actions in progress: [What we are doing]
Next update: [Time]
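If updates are posted by a bot or script, the update template above can be filled programmatically. A hedged sketch; `format_incident_update` is a hypothetical helper, not part of this skill's scripts:

```python
def format_incident_update(title, severity, duration, status,
                           findings, actions, next_update):
    """Fill the IC update template above with concrete values."""
    return (
        f"INCIDENT UPDATE: {title}\n"
        f"Severity: {severity}\n"
        f"Duration: {duration}\n"
        f"Status: {status}\n"
        f"Current findings: {findings}\n"
        f"Actions in progress: {actions}\n"
        f"Next update: {next_update}"
    )
```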

Investigation Steps

Follow these diagnostic steps based on the type of issue.

Application Errors (FastAPI)

# 1. Check application logs for errors
./skills/incident-response/scripts/fetch-logs.sh \
  --service backend \
  --since "15 minutes ago" \
  --output-dir ./incident-logs/

# 2. Check error rate from logs
docker logs app-backend --since 15m 2>&1 | grep -c "ERROR"

# 3. Check active connections and request patterns
curl -s http://localhost:8000/health/ready | jq .

# 4. Check if the issue is in a specific endpoint
docker logs app-backend --since 15m 2>&1 | \
  grep "ERROR" | \
  grep -oP '"path":"[^"]*"' | sort | uniq -c | sort -rn

# 5. Check Python process status
docker exec app-backend ps aux
docker exec app-backend python -c "import sys; print(sys.version)"
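Step 4's grep pipeline can also be expressed in Python, which is convenient once logs have been collected to disk by fetch-logs.sh. An illustrative equivalent; `errors_by_path` is not part of the skill's scripts:

```python
import re
from collections import Counter

def errors_by_path(log_lines):
    """Count ERROR lines per request path -- a Python equivalent of the
    grep | sort | uniq -c pipeline in step 4 above."""
    counts = Counter()
    for line in log_lines:
        if "ERROR" not in line:
            continue
        match = re.search(r'"path":"([^"]*)"', line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()  # [(path, error_count), ...] descending
```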

Database Issues (PostgreSQL)

# 1. Check database connectivity
docker exec app-db pg_isready -U postgres

# 2. Check active connections (connection pool exhaustion?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT count(*), state FROM pg_stat_activity
  GROUP BY state ORDER BY count DESC;
"

# 3. Check for long-running queries (locks, deadlocks?)
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT pid, now() - pg_stat_activity.query_start AS duration,
         query, state
  FROM pg_stat_activity
  WHERE (now() - pg_stat_activity.query_start) > interval '30 seconds'
  AND state != 'idle'
  ORDER BY duration DESC;
"

# 4. Check for lock contention
docker exec app-db psql -U postgres -d app_prod -c "
  SELECT blocked_locks.pid AS blocked_pid,
         blocking_locks.pid AS blocking_pid,
         blocked_activity.query AS blocked_query
  FROM pg_catalog.pg_locks blocked_locks
  JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
  JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation = blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
  JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
  WHERE NOT blocked_locks.granted;
"

# 5. Check disk space
docker exec app-db df -h /var/lib/postgresql/data
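Rows returned by query 3 can be post-processed to decide which backends to terminate. A sketch, assuming the rows arrive as (pid, duration_seconds, state) tuples; the function name is illustrative:

```python
def flag_long_running(rows, threshold_s=30):
    """Given (pid, duration_seconds, state) rows from pg_stat_activity,
    return the pids worth inspecting (and possibly pg_terminate_backend-ing).
    Idle sessions are skipped, matching the query above."""
    return [pid for pid, secs, state in rows
            if secs > threshold_s and state != "idle"]
```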

Redis Issues

# 1. Check Redis connectivity
docker exec app-redis redis-cli ping

# 2. Check memory usage
docker exec app-redis redis-cli info memory | grep used_memory_human

# 3. Check connected clients
docker exec app-redis redis-cli info clients | grep connected_clients

# 4. Check slow log
docker exec app-redis redis-cli slowlog get 10

# 5. Check keyspace
docker exec app-redis redis-cli info keyspace
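The `redis-cli info` output is plain `key:value` lines separated by `# Section` headers, so it is easy to parse for scripted threshold checks. An illustrative parser (not part of the skill's scripts):

```python
def parse_redis_info(raw: str) -> dict:
    """Parse `redis-cli info` output into a flat dict.
    Section headers start with '#'; data lines are 'key:value'."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()  # INFO lines end in \r\n
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info
```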

Network and Infrastructure

# 1. Check DNS resolution
nslookup api.example.com

# 2. Check SSL certificate expiry
echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# 3. Check container resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# 4. Check disk space on host
df -h /

# 5. Check if dependent services are reachable
curl -sf https://external-api.example.com/health || echo "External API unreachable"
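The `notAfter=` date printed by the openssl command in step 2 can be turned into a days-remaining number for quick triage. A sketch, assuming openssl's default GMT date format (e.g. `Mar 15 12:00:00 2030 GMT`); the helper name is illustrative:

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days left on a certificate, given the notAfter value printed by
    `openssl x509 -noout -dates` (e.g. 'Mar 15 12:00:00 2030 GMT')."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```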

Remediation Actions

Immediate Mitigations (apply within minutes)

| Issue | Mitigation | Command |
| --- | --- | --- |
| Bad deployment | Rollback | ./scripts/deploy.sh --rollback --env production --version $PREV_SHA --output-dir ./results/ |
| Connection pool exhausted | Restart backend | docker restart app-backend |
| Long-running query | Kill query | SELECT pg_terminate_backend(<pid>); |
| Memory leak | Restart service | docker restart app-backend |
| Redis full | Flush non-critical keys | redis-cli --scan --pattern "cache:*" \| xargs redis-cli del |
| SSL expired | Apply new cert | Update cert in load balancer |
| Disk full | Clean logs/temp files | docker system prune -f |

Longer-Term Fixes (apply after stabilization)

  1. Fix the root cause in code -- Create a branch, fix, test, deploy through normal pipeline
  2. Add monitoring -- If the issue was not caught by existing alerts, add new alert rules
  3. Add tests -- Write regression tests for the failure scenario
  4. Update runbooks -- Document the new failure mode and remediation steps

Communication Protocol

Internal Communication

Channels:

  • #incidents -- Active incident coordination (SEV1/SEV2)
  • #incidents-low -- SEV3/SEV4 tracking
  • #engineering -- Post-incident summaries

Rules:

  1. All communication happens in the designated incident channel
  2. Use threads for investigation details, keep main channel for status updates
  3. IC posts updates at the defined cadence (see severity table)
  4. Tag relevant people explicitly, do not assume they are watching
  5. Timestamp all significant findings and actions

External Communication (SEV1/SEV2)

Status page update template:

[Investigating] We are investigating reports of [issue description].
Users may experience [user-visible impact].
We will provide an update within [time].

[Identified] The issue has been identified as [brief description].
We are working on a fix. Estimated resolution: [time estimate].

[Resolved] The issue affecting [service] has been resolved.
The root cause was [brief description].
We apologize for the disruption and will publish a detailed post-mortem.

Post-Mortem / RCA Framework

Conduct a blameless post-mortem within 48 hours of every SEV1/SEV2 incident. SEV3 incidents receive a lightweight review.

See references/post-mortem-template.md for the full template.

Post-mortem principles:

  1. Blameless -- Focus on systems and processes, not individuals
  2. Thorough -- Identify all contributing factors, not just the trigger
  3. Actionable -- Every finding must produce a concrete action item with an owner
  4. Timely -- Conduct within 48 hours while details are fresh
  5. Shared -- Publish to the entire engineering team

Post-mortem structure:

  1. Summary -- What happened, when, and what was the impact
  2. Timeline -- Minute-by-minute account of detection, investigation, mitigation
  3. Root cause -- The fundamental reason the incident occurred
  4. Contributing factors -- Other conditions that made the incident worse
  5. What went well -- Effective parts of the response
  6. What could be improved -- Gaps in detection, response, or tooling
  7. Action items -- Specific tasks with owners and due dates

Five Whys technique for root cause analysis:

Why did users see 500 errors?
  -> Because the backend service returned errors to the load balancer.
Why did the backend service return errors?
  -> Because database connections timed out.
Why did database connections time out?
  -> Because the connection pool was exhausted.
Why was the connection pool exhausted?
  -> Because a new endpoint opened connections without releasing them.
Why were connections not released?
  -> Because the endpoint was missing the async context manager for sessions.

Root cause: Missing async context manager for database sessions in new endpoint.
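When writing the post-mortem, a chain like the one above can be captured as (question, answer) pairs and rendered consistently. A hypothetical formatter that treats the last answer as the root cause; nothing here is part of the skill's scripts:

```python
def render_five_whys(chain):
    """Format a Five Whys chain from (question, answer) pairs, matching the
    layout of the worked example above. The last answer is the root cause."""
    lines = []
    for question, answer in chain:
        lines.append(question)
        lines.append(f"  -> {answer}")
    lines.append("")
    lines.append(f"Root cause: {chain[-1][1]}")
    return "\n".join(lines)
```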

Generate a structured incident report:

python skills/incident-response/scripts/generate-incident-report.py \
  --title "Database connection pool exhaustion" \
  --severity SEV2 \
  --start-time "2024-01-15T14:30:00Z" \
  --end-time "2024-01-15T15:15:00Z" \
  --output-dir ./post-mortems/

Incident Response Scripts

| Script | Purpose | Usage |
| --- | --- | --- |
| scripts/fetch-logs.sh | Fetch recent logs from services | ./scripts/fetch-logs.sh --service backend --since "30m" --output-dir ./logs/ |
| scripts/health-check-all-services.sh | Check health of all services | ./scripts/health-check-all-services.sh --output-dir ./health/ |
| scripts/generate-incident-report.py | Generate structured incident report | python scripts/generate-incident-report.py --title "..." --severity SEV1 --output-dir ./reports/ |

Quick Reference: Common Incident Patterns

| Pattern | Symptom | Likely Cause | First Action |
| --- | --- | --- | --- |
| 502/503 errors | Users see error page | Backend crashed or overloaded | Check docker ps, restart if needed |
| Slow responses | High latency, timeouts | DB queries, external API | Check slow query log, DB connections |
| Partial failures | Some endpoints fail | Single dependency down | Check individual service health |
| Memory growth | OOM kills, restarts | Memory leak | Check docker stats, restart |
| Error spike after deploy | Errors start exactly at deploy time | Bug in new code | Rollback immediately |
| Gradual degradation | Slowly worsening metrics | Resource exhaustion, connection leak | Check resource usage trends |

Output Files

Runbooks: Write to docs/runbooks/<service>-runbook.md:

# Runbook: [Service Name]

## Service Overview
- Purpose, dependencies, critical paths

## Common Issues
### Issue 1: [Description]
- **Symptoms:** [What you see]
- **Diagnosis:** [Commands to run]
- **Resolution:** [Steps to fix]

## Escalation
- On-call: #ops-oncall
- Service owner: @team-name

Post-mortems: Write to postmortem-YYYY-MM-DD.md:

# Post-Mortem: [Incident Title]

## Summary
- **Date:** YYYY-MM-DD
- **Severity:** SEV1-4
- **Duration:** X hours
- **Impact:** [Users/revenue affected]

## Timeline
- HH:MM - [Event]

## Root Cause
[Technical explanation]

## Action Items
- [ ] [Preventive measure] - Owner: @name - Due: YYYY-MM-DD
