# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
## When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
## Core Concepts
### Incident Severity Levels
| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
### Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
## Runbook Templates
### Template 1: Service Outage Runbook

**[Service Name] Outage Runbook**

#### Overview
- **Service:** Payment Processing Service
- **Owner:** Platform Team
- **Slack:** #payments-incidents
- **PagerDuty:** payments-oncall
#### Impact Assessment
- Which customers are affected?
- What percentage of traffic is impacted? (see the query sketch below)
- Are there financial implications?
- What's the blast radius?
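
To put a quick number on the traffic question, the same Prometheus metric used later in this runbook can be queried for the share of requests currently failing. This is a sketch; adjust the metric name and labels to your environment.

```bash
# Approximate percentage of requests failing (5xx / total over the last 5 minutes)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=100 * sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))" \
  | jq '.data.result[0].value[1]'
```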
#### Detection
**Alerts**

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
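
These thresholds are listed informally. If they live in Prometheus, the first one might be encoded roughly as the rule below; the file name, the `payment_error_rate` recording rule, and the severity label used for PagerDuty routing are all assumptions, not part of this runbook.

```bash
# Hypothetical rule file -- adapt names, labels, and routing to your alerting setup
cat > payment-alerts.yml <<'EOF'
groups:
- name: payments
  rules:
  - alert: PaymentErrorRateHigh
    expr: payment_error_rate > 0.05   # assumes the metric is a 0-1 ratio
    for: 5m
    labels:
      severity: page                  # the label your Alertmanager routes to PagerDuty
EOF
```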
**Dashboards**

- [Link to payments service dashboards]
#### Initial Triage (First 5 Minutes)

**1. Assess Scope**

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
**2. Quick Health Checks** (concrete commands are sketched below)

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe and bank API status
- Recent changes? Check deploy history
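
A sketch of those checks as runnable commands, reusing the endpoints and metric names that appear elsewhere in this runbook:

```bash
# Reachability
curl -I https://api.company.com/payments/health

# Database connectivity (connection pool metrics, same endpoint as section 4.2)
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool

# External dependency (Stripe) reachability and latency, as in section 4.2
curl -o /dev/null -s -w "%{http_code} %{time_total}s\n" https://api.stripe.com/v1/health

# Recent changes (deploy history)
kubectl rollout history deployment/payment-service -n payments
```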
**3. Initial Classification**

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
#### Mitigation Procedures

##### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if the recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
##### 4.2 High Latency
```bash
# Step 1: Check database connections (connection pool metrics)
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (replace <pid> with the offending backend)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if the dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
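
The `curl-format.txt` file referenced in Step 4 is not included in this runbook; a minimal version, using curl's standard `-w` write-out variables, could be:

```bash
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
```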
##### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify the error pattern
kubectl logs -n payments -l app=payment-service --tail=500 \
  | grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, disable it via feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If it looks like a data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
    AND created_at > now() - interval '1 hour';"
```
##### 4.4 Traffic Surge
```bash
# Step 1: Check current pod resource usage
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
#### Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
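
`./scripts/smoke-test-payments.sh` is referenced but not shown here. A minimal sketch of what it might do, using only the health endpoint and Prometheus query that appear earlier in this runbook; the 1% threshold and any other checks are assumptions to replace with your real critical flows:

```bash
#!/usr/bin/env bash
# Minimal smoke test sketch -- replace with real critical-flow checks
set -euo pipefail
BASE_URL="${BASE_URL:-https://api.company.com/payments}"

# Health endpoint must answer 2xx
curl -fsS -o /dev/null "$BASE_URL/health"

# Error ratio must be below 1% (assumed threshold)
ERRORS=$(curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))" \
  | jq -r '.data.result[0].value[1] // "0"')
awk -v e="$ERRORS" 'BEGIN { exit (e < 0.01 ? 0 : 1) }'

echo "Smoke test passed"
```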
#### Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
#### Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| SEV1 unresolved after 15 min | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
#### Communication Templates

##### Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

**Severity:** SEV2
**Status:** Investigating
**Impact:** ~20% of payment requests failing
**Start Time:** [TIME]
**Incident Commander:** [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
##### Status Update

📊 UPDATE: Payment Service Incident

**Status:** Mitigating
**Impact:** Reduced to ~5% failure rate
**Duration:** 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
##### Resolution Notification

✅ RESOLVED: Payment Service Incident

**Duration:** 45 minutes
**Impact:** ~5,000 affected transactions
**Root Cause:** Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
### Template 2: Database Incident Runbook

#### Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
#### Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
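
-- Gauge headroom against the configured connection limit
-- (a quick sketch; ignores superuser_reserved_connections)
SELECT count(*) AS current_connections, setting::int AS max_connections
FROM pg_stat_activity, pg_settings
WHERE pg_settings.name = 'max_connections'
GROUP BY setting;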
```

#### Replication Lag

```sql
-- Check lag on replica
SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
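-- 4. If failover is the call, promotion is typically run on the replica
--    (PostgreSQL 12+; coordinate with whatever HA tooling manages the cluster):
--    SELECT pg_promote();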
```

#### Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM FULL to reclaim space (takes an ACCESS EXCLUSIVE lock -- the table is unavailable while it runs)
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
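# Retained WAL is another common culprit -- check the WAL directory size too
du -sh /var/lib/postgresql/data/pg_wal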
```

## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident