# Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
## When to Use This Skill
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
## Core Concepts
### Incident Severity Levels
| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
### Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
## Runbook Templates
### Template 1: Service Outage Runbook

**[Service Name] Outage Runbook**

#### Overview
- **Service:** Payment Processing Service
- **Owner:** Platform Team
- **Slack:** #payments-incidents
- **PagerDuty:** payments-oncall
#### Impact Assessment
- Which customers are affected?
- What percentage of traffic is impacted? (see the query sketch below)
- Are there financial implications?
- What's the blast radius?
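
To put a quick number on the traffic question, the same Prometheus metric used later in this runbook can be queried for the share of requests currently failing. This is a sketch; adjust the metric name and labels to your environment.

```bash
# Approximate percentage of requests failing (5xx / total over the last 5 minutes)
curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=100 * sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))" \
  | jq '.data.result[0].value[1]'
```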
#### Detection
**Alerts**

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
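
These thresholds are listed informally. If they live in Prometheus, the first one might be encoded roughly as the rule below; the file name, the `payment_error_rate` recording rule, and the severity label used for PagerDuty routing are all assumptions, not part of this runbook.

```bash
# Hypothetical rule file -- adapt names, labels, and routing to your alerting setup
cat > payment-alerts.yml <<'EOF'
groups:
- name: payments
  rules:
  - alert: PaymentErrorRateHigh
    expr: payment_error_rate > 0.05   # assumes the metric is a 0-1 ratio
    for: 5m
    labels:
      severity: page                  # the label your Alertmanager routes to PagerDuty
EOF
```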
**Dashboards**

- [Link to payments service dashboards]
#### Initial Triage (First 5 Minutes)

**1. Assess Scope**

```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
**2. Quick Health Checks** (concrete commands are sketched below)

- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe and bank API status
- Recent changes? Check deploy history
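
A sketch of those checks as runnable commands, reusing the endpoints and metric names that appear elsewhere in this runbook:

```bash
# Reachability
curl -I https://api.company.com/payments/health

# Database connectivity (connection pool metrics, same endpoint as section 4.2)
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool

# External dependency (Stripe) reachability and latency, as in section 4.2
curl -o /dev/null -s -w "%{http_code} %{time_total}s\n" https://api.stripe.com/v1/health

# Recent changes (deploy history)
kubectl rollout history deployment/payment-service -n payments
```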
**3. Initial Classification**

| Symptom | Likely Cause | Go To Section |
|---------|--------------|---------------|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
#### Mitigation Procedures

##### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if the recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
##### 4.2 High Latency
```bash
# Step 1: Check database connections (connection pool metrics)
kubectl exec -n payments deploy/payment-service -- curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed (replace <pid> with the offending backend)
psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if the dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
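
The `curl-format.txt` file referenced in Step 4 is not included in this runbook; a minimal version, using curl's standard `-w` write-out variables, could be:

```bash
cat > curl-format.txt <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
```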
##### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify the error pattern
kubectl logs -n payments -l app=payment-service --tail=500 \
  | grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If a specific endpoint is failing, disable it via feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If it looks like a data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
    AND created_at > now() - interval '1 hour';"
```
##### 4.4 Traffic Surge
```bash
# Step 1: Check current pod resource usage
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
#### Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
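
`./scripts/smoke-test-payments.sh` is referenced but not shown here. A minimal sketch of what it might do, using only the health endpoint and Prometheus query that appear earlier in this runbook; the 1% threshold and any other checks are assumptions to replace with your real critical flows:

```bash
#!/usr/bin/env bash
# Minimal smoke test sketch -- replace with real critical-flow checks
set -euo pipefail
BASE_URL="${BASE_URL:-https://api.company.com/payments}"

# Health endpoint must answer 2xx
curl -fsS -o /dev/null "$BASE_URL/health"

# Error ratio must be below 1% (assumed threshold)
ERRORS=$(curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))" \
  | jq -r '.data.result[0].value[1] // "0"')
awk -v e="$ERRORS" 'BEGIN { exit (e < 0.01 ? 0 : 1) }'

echo "Smoke test passed"
```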
#### Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
#### Escalation Matrix

| Condition | Escalate To | Contact |
|-----------|-------------|---------|
| SEV1 unresolved after 15 min | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
#### Communication Templates

##### Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

**Severity:** SEV2
**Status:** Investigating
**Impact:** ~20% of payment requests failing
**Start Time:** [TIME]
**Incident Commander:** [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
##### Status Update

📊 UPDATE: Payment Service Incident

**Status:** Mitigating
**Impact:** Reduced to ~5% failure rate
**Duration:** 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
##### Resolution Notification

✅ RESOLVED: Payment Service Incident

**Duration:** 45 minutes
**Impact:** ~5,000 affected transactions
**Root Cause:** Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
### Template 2: Database Incident Runbook

#### Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
#### Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';
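
-- Gauge headroom against the configured connection limit
-- (a quick sketch; ignores superuser_reserved_connections)
SELECT count(*) AS current_connections, setting::int AS max_connections
FROM pg_stat_activity, pg_settings
WHERE pg_settings.name = 'max_connections'
GROUP BY setting;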
```

#### Replication Lag

```sql
-- Check lag on replica
SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
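-- 4. If failover is the call, promotion is typically run on the replica
--    (PostgreSQL 12+; coordinate with whatever HA tooling manages the cluster):
--    SELECT pg_promote();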
```

#### Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM FULL to reclaim space (takes an ACCESS EXCLUSIVE lock -- the table is unavailable while it runs)
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
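# Retained WAL is another common culprit -- check the WAL directory size too
du -sh /var/lib/postgresql/data/pg_wal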
```

## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident