Incident Response

Patterns and practices for effective incident management, from detection through postmortem.

When to Use This Skill

Designing incident response processes
Creating incident runbooks
Establishing on-call rotations
Running effective postmortems
Improving mean time to recovery (MTTR)

Incident Lifecycle

┌─────────────────────────────────────────────────────────┐
│                   INCIDENT LIFECYCLE                    │
│                                                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │
│  │ Detect  │─►│ Respond │─►│ Recover │─►│  Learn  │     │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘     │
│       │            │            │            │          │
│       ▼            ▼            ▼            ▼          │
│   Alerting     Triage &    Mitigation   Postmortem      │
│  Monitoring    Diagnosis   Remediation Action Items     │
└─────────────────────────────────────────────────────────┘

Key Metrics

MTTD - Mean Time to Detect
└── Time from incident start to detection

MTTA - Mean Time to Acknowledge
└── Time from alert to human acknowledgment

MTTR - Mean Time to Recover
└── Time from detection to resolution

MTTF - Mean Time to Failure
└── Time between incidents (reliability)

Goal: Minimize MTTD + MTTA + MTTR
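
As a rough illustration, the metrics above can be computed from four timestamps per incident. The `IncidentTimes` record and the `summarize` function are hypothetical names, not part of any particular tool, and the sketch assumes the alert fires at the moment of detection, so MTTA is measured from `detected`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentTimes:
    """Timestamps for one incident (hypothetical record shape)."""
    started: datetime       # when the fault actually began
    detected: datetime      # when an alert fired or a report arrived
    acknowledged: datetime  # when a human acknowledged the alert
    resolved: datetime      # when service was restored


def _mean_minutes(deltas: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Return MTTD, MTTA, and MTTR in minutes, following the definitions above."""
    return {
        "mttd": _mean_minutes([i.detected - i.started for i in incidents]),
        "mtta": _mean_minutes([i.acknowledged - i.detected for i in incidents]),
        "mttr": _mean_minutes([i.resolved - i.detected for i in incidents]),
    }
```

Tracking the three values separately shows whether time is being lost in detection, in acknowledgment, or in recovery.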

Incident Severity

Severity Levels

SEV 1 - Critical
├── Complete outage
├── Data loss or security breach
├── All/most users affected
├── Response: Immediate (24/7)
└── Example: Production database down

SEV 2 - High
├── Major functionality impaired
├── Significant user impact
├── Workaround may exist
├── Response: Urgent (business hours++)
└── Example: Payment processing degraded

SEV 3 - Medium
├── Partial functionality affected
├── Limited user impact
├── Workaround available
├── Response: Normal priority
└── Example: Report generation slow

SEV 4 - Low
├── Minor issue
├── Minimal user impact
├── Response: Best effort
└── Example: UI cosmetic bug

Severity Matrix

                    User Impact
             Low     Medium     High
Scope     ├─────────────────────────────┤
Wide      │  SEV3     SEV2      SEV1    │
Medium    │  SEV4     SEV3      SEV2    │
Limited   │  SEV4     SEV4      SEV3    │
          └─────────────────────────────┘
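
The matrix above can be encoded as a simple lookup so that triage tooling assigns a consistent starting severity. This is a minimal sketch; `SEVERITY_MATRIX` and `classify` are hypothetical names, and the lowercase keys are an assumption about how scope and impact would be recorded.

```python
# Rows are scope, columns are user impact, mirroring the matrix above.
SEVERITY_MATRIX = {
    "wide":    {"low": "SEV3", "medium": "SEV2", "high": "SEV1"},
    "medium":  {"low": "SEV4", "medium": "SEV3", "high": "SEV2"},
    "limited": {"low": "SEV4", "medium": "SEV4", "high": "SEV3"},
}


def classify(scope: str, impact: str) -> str:
    """Look up the severity for a given scope and user impact."""
    try:
        return SEVERITY_MATRIX[scope.lower()][impact.lower()]
    except KeyError:
        raise ValueError(f"unknown scope/impact: {scope!r}/{impact!r}") from None
```

For example, `classify("Wide", "High")` returns `"SEV1"`. The result is a starting point; the Incident Commander can still raise or lower severity as more is learned.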

Incident Roles

Core Roles

Incident Commander (IC)
├── Owns the incident end-to-end
├── Makes decisions and delegates
├── Controls incident channel
├── Does NOT debug (coordinates)
└── Focus: Big picture, communication

Tech Lead
├── Leads technical investigation
├── Coordinates technical responders
├── Makes technical decisions
├── Reports to IC
└── Focus: Root cause, fix

Communications Lead
├── Handles external communication
├── Updates status page
├── Manages customer notifications
├── Reports to IC
└── Focus: Stakeholder updates

Scribe
├── Documents timeline
├── Records decisions and actions
├── Captures important information
├── Reports to IC
└── Focus: Documentation

Role Handoffs

Handoff protocol:

IC: "I'm handing IC to [Name]"
New IC: "I'm taking IC. Current status is..."
IC: "Confirmed, [Name] is now IC"

Handoff when:

Shift ends
Fatigue sets in
Expertise needed
Escalation required

Incident Communication

Communication Channels

Internal:
┌─────────────────────────────────────────────────────────┐
│ #incident-YYYY-MM-DD-topic                              │
│   - All incident communication here                     │
│   - Pinned: Current status, timeline, roles             │
│   - Bridge call link for voice                          │
└─────────────────────────────────────────────────────────┘

External:

Status page (status.example.com)
Customer emails
Social media (if needed)
Support channels
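
The internal channel naming convention, `#incident-YYYY-MM-DD-topic`, is easy to generate automatically when an incident is declared. The helper below is a sketch; `incident_channel_name` is a hypothetical name, and reducing the topic to lowercase letters, digits, and hyphens is an assumption about what the chat tool accepts.

```python
import re
from datetime import date


def incident_channel_name(topic: str, day: date) -> str:
    """Build a channel name in the #incident-YYYY-MM-DD-topic format."""
    # Collapse anything that is not a lowercase letter or digit into a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"
```

For example, `incident_channel_name("Payments API errors", date(2024, 3, 5))` gives `#incident-2024-03-05-payments-api-errors`.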

Status Update Template

[TIME] Incident Update - [TITLE]

Current Status: [Investigating|Identified|Monitoring|Resolved]

Impact: [What users are experiencing]

What we know:

[Key facts]

What we're doing:

[Current actions]

Next update: [Time or "as soon as we learn more"]
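
The template above can be filled in programmatically so that every update has the same shape. This is a sketch under stated assumptions: `render_status_update` is a hypothetical helper, and rendering `[TIME]` as `HH:MM` is one possible choice.

```python
from datetime import datetime

# Status values taken from the template above.
STATUSES = ("Investigating", "Identified", "Monitoring", "Resolved")


def render_status_update(
    when: datetime,
    title: str,
    status: str,
    impact: str,
    known: list[str],
    doing: list[str],
    next_update: str = "as soon as we learn more",
) -> str:
    """Fill in the status update template as plain text."""
    if status not in STATUSES:
        raise ValueError(f"status must be one of {STATUSES}")
    lines = [
        f"[{when:%H:%M}] Incident Update - {title}",
        "",
        f"Current Status: {status}",
        "",
        f"Impact: {impact}",
        "",
        "What we know:",
        *known,
        "",
        "What we're doing:",
        *doing,
        "",
        f"Next update: {next_update}",
    ]
    return "\n".join(lines)
```

Rejecting unknown status values keeps the wording consistent between the incident channel and the status page.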

Update Cadence

SEV 1: Every 15-30 minutes
SEV 2: Every 30-60 minutes
SEV 3: Every 1-2 hours
SEV 4: As needed

Also update when:

Status changes
Major new information
Actions taken
Resolution achieved
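
A reminder for the next scheduled update can be derived from the cadence above. The sketch below uses the upper bound of each range as the deadline, which is an assumption; `next_update_due` is a hypothetical helper name.

```python
from datetime import datetime, timedelta
from typing import Optional

# Upper bound of each cadence range above, in minutes.
# SEV 4 is "as needed", so it has no fixed deadline.
MAX_UPDATE_INTERVAL_MIN = {1: 30, 2: 60, 3: 120}


def next_update_due(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return the latest time the next update should go out, or None for SEV 4."""
    if severity not in (1, 2, 3, 4):
        raise ValueError("severity must be 1, 2, 3, or 4")
    interval = MAX_UPDATE_INTERVAL_MIN.get(severity)
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)
```

The event-driven triggers listed above still apply; the deadline only guards against long silences.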

Incident Response Process

Step 1: Detection & Alert

Triggers:

Automated alerting
User reports
Internal discovery

First responder actions:

Acknowledge alert
Assess initial impact
Declare incident if needed
Page additional responders
Open incident channel

Step 2: Triage & Mobilize

Triage questions:

What is the user impact?
How many users affected?
Is there a workaround?
What's the severity?

Mobilize:

Page appropriate responders
Establish roles (IC, Tech Lead, Comms)
Start incident channel
Begin timeline documentation

Step 3: Investigate & Diagnose

Investigation approach:

What changed recently?
└── Deployments, config, infrastructure
What do metrics/logs show?
└── Error rates, latency, traces
What's the blast radius?
└── Which services, which users
What are the hypotheses?
└── List, prioritize, test
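
The first question, "What changed recently?", is often answered by filtering a change log to the period just before the incident began. The following is a sketch only; the `Change` record, the `recent_changes` function, and the 24-hour default lookback are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """One entry in a change log (hypothetical record shape)."""
    when: datetime
    kind: str         # e.g. "deployment", "config", "infrastructure"
    description: str


def recent_changes(
    changes: list[Change],
    incident_start: datetime,
    lookback: timedelta = timedelta(hours=24),
) -> list[Change]:
    """Return changes made within the lookback window, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

Listing the newest change first reflects that the most recent change is usually the first hypothesis to test.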

Common callouts in the incident channel:

"Checking [system] now"
"Theory: [hypothesis]"
"Found: [discovery]"
"Need: [resource/access]"

Step 4: Mitigate

Mitigation strategies:

Rollback
└── Revert recent changes
Failover
└── Switch to backup/replica
Scale
└── Add capacity
Disable
└── Turn off affected feature
Hotfix
└── Deploy targeted fix

Priority: Restore service first, root cause later

Step 5: Resolve & Verify

Resolution checklist:
□ Service restored
□ Metrics normalized
□ User-facing impact ended
□ Monitoring in place for recurrence
□ Temporary mitigations documented

Verification:

Check key SLIs
Test user flows
Monitor for 15-30 minutes
Confirm with affected teams
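
The "monitor for 15-30 minutes" step can be approximated by checking that an SLI stayed within its threshold for the whole window. This is a minimal sketch: `stable_for` is a hypothetical helper, the samples are assumed to be `(timestamp, error_rate)` pairs, and the function does not check for gaps in the data.

```python
from datetime import datetime, timedelta


def stable_for(
    samples: list[tuple[datetime, float]],
    threshold: float,
    now: datetime,
    window: timedelta = timedelta(minutes=15),
) -> bool:
    """Return True if every sample in the window is at or below the threshold."""
    recent = [value for ts, value in samples if now - window <= ts <= now]
    # No samples means there is no evidence of recovery, so report not stable.
    return bool(recent) and all(value <= threshold for value in recent)
```

Treating an empty window as "not stable" avoids declaring recovery when monitoring itself is broken.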

Step 6: Close & Learn

Closure:

Declare incident resolved
Final status update
Schedule postmortem
Assign postmortem owner
Close incident channel (archive)

Timeline:

Postmortem doc: Within 24-48 hours
Postmortem meeting: Within 5 business days
Action items: Tracked to completion
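
The timeline above translates into concrete due dates once the incident is resolved. The sketch below uses the 48-hour upper bound for the document and counts business days as Monday through Friday with no holiday calendar; both are assumptions, and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of weekdays (holidays are not considered)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def postmortem_deadlines(resolved_at: datetime) -> dict:
    """Return the latest due dates for the postmortem doc and meeting."""
    return {
        "doc_due": resolved_at + timedelta(hours=48),
        "meeting_due": add_business_days(resolved_at.date(), 5),
    }
```

For an incident resolved on a Wednesday, the meeting deadline falls on the following Wednesday.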

On-Call Practices

On-Call Structure

Primary: First responder to alerts
Secondary: Backup if primary unavailable
Escalation: Manager/senior for major incidents

Rotation options:

Weekly rotation
Follow-the-sun (multiple timezones)
Split shifts (day/night)
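
A weekly rotation, the simplest of the options above, can be computed from a roster and a start date. In this sketch the secondary is the person who becomes primary the following week; that pairing is an assumption, as is the `on_call_for` helper name.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], rotation_start: date) -> tuple[str, str]:
    """Return (primary, secondary) for the week containing the given day."""
    if len(roster) < 2:
        raise ValueError("roster needs at least two people")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Whole weeks elapsed since the rotation began.
    week = (day - rotation_start).days // 7
    primary = roster[week % len(roster)]
    secondary = roster[(week + 1) % len(roster)]
    return primary, secondary
```

Making next week's primary the current secondary gives each person a lighter week of context before they take the pager.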

Off-hours policy:

What's page-worthy?
What can wait?
Compensation for off-hours

On-Call Responsibilities

During shift:

Respond to alerts within SLA
Triage and resolve or escalate
Document actions taken
Hand off to next shift

Handoff includes:

Open alerts/incidents
Recent incidents
Known issues
Scheduled changes

Alert Hygiene

Good alert:

Actionable
Urgent
User-impacting
Clear resolution path

Alert anti-patterns:

"Somebody should look at this"
Duplicate alerts
Non-actionable information
Crying wolf (frequent false positives)

Regular review:

Which alerts fired?
Which were actionable?
Which were noise?
What's missing?
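
The review questions above can be answered partly from alert history. The sketch below flags alerts that fire often but are rarely actionable. The `AlertFiring` record, the `noise_report` function, and both thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AlertFiring:
    """One firing of an alert, labeled during review (hypothetical shape)."""
    name: str
    actionable: bool


def noise_report(
    firings: list[AlertFiring],
    min_firings: int = 5,
    max_noise_ratio: float = 0.5,
) -> dict[str, float]:
    """Map each noisy alert name to its share of non-actionable firings."""
    total = Counter(f.name for f in firings)
    noisy = Counter(f.name for f in firings if not f.actionable)
    report = {}
    for name, count in total.items():
        ratio = noisy[name] / count
        if count >= min_firings and ratio > max_noise_ratio:
            report[name] = ratio
    return report
```

Alerts in the report are candidates for tuning, demotion to a ticket, or deletion. The last review question, what is missing, still needs a human comparison against recent incidents.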

Runbooks

Runbook Structure

Runbook: [Alert Name]

Overview

What this alert means and why it matters.

Impact

What users experience when this fires.

Investigation Steps

Check [metric/log/dashboard]
Look for [specific pattern]
Verify [component status]

Mitigation Steps

If [condition], do [action]
If [condition], do [action]
Escalate if [condition]

Rollback Procedure

How to undo changes if needed.

Contacts

Service owner: [name]
Escalation: [name/team]

Postmortems

Postmortem Template

Incident Postmortem: [Title]

Date: [Date]
Duration: [Start - End]
Severity: [SEV level]
Authors: [Names]

Summary

One paragraph summary of what happened.

Impact

Users affected: [Number/percentage]
Duration: [Time]
Revenue impact: [If applicable]

Timeline

[Time] - Event
[Time] - Action taken
[Time] - Resolution

Root Cause

What actually caused the incident.

Contributing Factors

What made it worse or harder to resolve.

What Went Well

[Positive observations]

What Could Be Improved

[Improvement areas]

Action Items

Action	Owner	Due Date	Status
[Action]	[Name]	[Date]	[Status]

Lessons Learned

Key takeaways for the organization.

Action Item Types

Prevent: Stop this from happening again
Detect: Find it faster next time
Mitigate: Reduce impact when it happens
Document: Improve runbooks/documentation

Priority (highest first):

High-impact, low-effort
Required for safety
Reduces toil
Nice to have

Best Practices

Declare incidents early: When in doubt, declare
Focus on mitigation first: Root cause analysis later
Communicate frequently: Silence breeds anxiety
Document as you go: Don't rely on memory
Practice with game days: Drill before real incidents
Blameless postmortems: Systems fail, not people
Track action items: Complete what you commit
Regular on-call review: Improve the on-call experience

Related Skills

slo-sli-error-budget: SLOs and alerting
observability-patterns: Using observability in incidents
chaos-engineering-fundamentals: Proactive resilience testing
