Incident Response

Manage incidents effectively and conduct comprehensive post-mortem analysis to improve system reliability, security, and operational excellence. This skill covers incident management processes, communication protocols, technical investigation, and blameless post-mortems.

When to use me

Use this skill when:

An incident occurs affecting system availability, performance, or security
You need to establish or improve incident response processes
You're conducting post-mortem analysis to prevent recurrence
You need to coordinate cross-functional incident response teams
You want to implement incident severity classification and escalation
You need to document incidents and create actionable follow-up items
You're building incident response playbooks and runbooks
You want to improve mean time to detection (MTTD) and mean time to resolution (MTTR)
You need to comply with incident reporting requirements (SLAs, regulatory)

What I do

Incident severity classification: Classify incidents by severity and impact
Incident coordination: Coordinate cross-functional incident response teams
Communication management: Manage incident communication (internal, external, stakeholders)
Technical investigation: Conduct technical root cause analysis
Timeline reconstruction: Reconstruct incident timeline
Impact assessment: Assess business and technical impact
Remediation coordination: Coordinate immediate remediation actions
Post-mortem facilitation: Facilitate blameless post-mortem analysis
Action item tracking: Track and monitor post-mortem action items
Process improvement: Improve incident response processes based on learnings
Playbook development: Develop and maintain incident response playbooks
Metrics and reporting: Track incident metrics and generate reports

Examples

# Start incident response process
./scripts/analyze-incident-response.sh --severity sev1 --type availability --start

# Conduct post-mortem analysis
./scripts/analyze-incident-response.sh --post-mortem --incident-id INC-2025-001 --output post-mortem.md

# Generate incident report
./scripts/analyze-incident-response.sh --report --incident-id INC-2025-001 --format pdf

# Analyze incident metrics
./scripts/analyze-incident-response.sh --metrics --period monthly --output metrics.json

# Develop incident playbook
./scripts/analyze-incident-response.sh --playbook --type security --output security-playbook.md

Output format

Incident Response Analysis
─────────────────────────────────────
Incident ID: INC-2025-001
Severity: SEV-1 (Critical)
Type: Availability Incident
Start Time: 2025-01-15T02:30:00Z
Resolution Time: 2025-01-15T04:45:00Z
Duration: 2 hours 15 minutes

INCIDENT SUMMARY:
──────────────────
Title: Database cluster failover failure causing 100% API error rate
Impact: Complete service outage for all customers
Affected Systems: Primary database cluster, API services, customer applications
Root Cause: Automated failover script bug combined with network partition
Resolution: Manual failover to standby cluster, script fix, validation

SEVERITY CLASSIFICATION:
────────────────────────
Severity: SEV-1 (Critical)
Criteria:
• Customer Impact: 100% of customers affected
• Revenue Impact: Estimated $85,000/hour
• Service Level: SLA violation (99.95% target, actual 99.12%)
• Duration: 2 hours 15 minutes (> 5 minute target for SEV-1)

TIMELINE RECONSTRUCTION:
────────────────────────
02:30:00 - Monitoring alerts detect 100% API error rate
02:32:15 - Incident declared SEV-1, incident commander assigned
02:35:00 - Technical team begins investigation
02:45:00 - Identified database primary node failure
02:50:00 - Automated failover script fails due to bug
03:00:00 - Manual investigation reveals network partition
03:15:00 - Decision to perform manual failover to standby cluster
03:30:00 - Manual failover initiated
03:45:00 - Standby cluster promoted to primary
04:00:00 - Application connections redirected
04:15:00 - Service begins recovery
04:30:00 - 50% traffic restored
04:45:00 - 100% traffic restored, incident resolved

TECHNICAL INVESTIGATION:
─────────────────────────
Primary Cause: Automated failover script bug (line 245: incorrect condition check)
Contributing Factors:
1. Network partition between database nodes
2. Insufficient failover script testing
3. Lack of circuit breaker pattern in application connections
4. Monitoring gaps in failover process
5. Single point of failure in network configuration

Root Cause Analysis (5 Whys):
1. Why did service outage occur? Database primary node failed
2. Why did automatic failover not work? Failover script had bug
3. Why did bug exist in script? Insufficient testing of edge cases
4. Why insufficient testing? Test suite didn't include network partition scenarios
5. Why no network partition tests? Not considered in risk assessment

IMPACT ASSESSMENT:
───────────────────
Customer Impact:
• 100% of customers unable to access service
• 15,842 failed transactions
• 8,750 customer support tickets generated

Business Impact:
• Revenue loss: $191,250 (estimated)
• SLA credits: $42,500 (estimated)
• Reputation damage: Significant social media complaints

Technical Impact:
• 2 hours 15 minutes of complete outage
• 45 minutes of partial degradation
• Database corruption risk (mitigated by backups)

COMMUNICATION LOG:
───────────────────
02:35:00 - Internal alert: #incidents channel created
02:40:00 - Stakeholder notification: Executive team notified
02:45:00 - Status page update: Investigating increased error rates
03:00:00 - Customer notification: Email sent to enterprise customers
03:30:00 - Status page update: Identified issue, working on fix
04:00:00 - Status page update: Implementing fix, estimated recovery 60 minutes
04:30:00 - Status page update: Service recovering
04:50:00 - Status page update: Service restored, monitoring
05:00:00 - Post-mortem scheduled notification

REMEDIATION ACTIONS:
─────────────────────
Immediate (During Incident):
1. Manual failover to standby cluster (completed)
2. Application connection pool reset (completed)
3. Traffic redistribution (completed)

Short-term (24 hours):
1. Fix failover script bug (completed)
2. Enhance monitoring for failover events (in progress)
3. Update runbook with manual failover steps (in progress)
4. Review network configuration (scheduled)

Long-term (30 days):
1. Implement circuit breaker pattern in all services
2. Add network partition testing to test suite
3. Conduct failover drills quarterly
4. Implement chaos engineering for resilience testing
5. Review and update all automated failover scripts

POST-MORTEM ANALYSIS:
──────────────────────
What Went Well:
• Quick incident declaration and response
• Effective communication throughout incident
• Backup systems worked as designed
• Team coordination across time zones
• Documentation available and useful

What Could Be Improved:
• Automated failover reliability
• Testing coverage for edge cases
• Monitoring visibility into failover process
• Incident playbook completeness
• Stakeholder communication templates

Lessons Learned:
1. Automated systems require regular testing of failure scenarios
2. Network partitions must be considered in high-availability designs
3. Circuit breaker patterns are essential for resilient microservices
4. Regular failover drills build muscle memory for real incidents
5. Blameless culture enables honest post-mortem analysis

ACTION ITEMS:
─────────────
1. HIGH PRIORITY: Fix failover script bug and deploy to all environments
   Owner: Database Engineering
   Due: 2025-01-16
   Status: In Progress

2. HIGH PRIORITY: Implement circuit breaker pattern in API services
   Owner: Platform Engineering
   Due: 2025-01-30
   Status: Not Started

3. MEDIUM PRIORITY: Add network partition testing to test suite
   Owner: Quality Engineering
   Due: 2025-02-15
   Status: Not Started

4. MEDIUM PRIORITY: Update incident playbook with manual failover steps
   Owner: DevOps
   Due: 2025-01-20
   Status: In Progress

5. LOW PRIORITY: Conduct quarterly failover drills
   Owner: SRE
   Due: 2025-04-01 (recurring)
   Status: Scheduled

INCIDENT METRICS:
─────────────────
Mean Time To Detection (MTTD): 2 minutes 15 seconds
Mean Time To Acknowledgment (MTTA): 5 minutes
Mean Time To Resolution (MTTR): 2 hours 15 minutes
Mean Time Between Failures (MTBF): 45 days
Incident Count (30 days): 3
Severity Distribution: SEV-1: 1, SEV-2: 2, SEV-3: 0

PROCESS IMPROVEMENTS:
─────────────────────
1. Update incident severity classification guidelines
2. Implement automated incident timeline generation
3. Create stakeholder communication templates
4. Establish incident commander rotation
5. Implement incident metrics dashboard

RECOMMENDATIONS:
────────────────
1. Implement automated failover testing in CI/CD pipeline
2. Conduct chaos engineering exercises quarterly
3. Establish incident response training for all engineers
4. Implement service-level objectives (SLOs) and error budgets
5. Create incident severity playbooks for common scenarios

INCIDENT RESPONSE MATURITY ASSESSMENT:
───────────────────────────────────────
Current Level: 3/5 (Defined)
• Process: Defined incident response process
• People: Trained incident responders
• Technology: Basic incident management tools
• Culture: Blameless post-mortems established
• Metrics: Basic incident metrics tracked

Target Level: 4/5 (Managed)
• Process: Integrated incident response with DevOps
• People: Cross-functional incident response teams
• Technology: Advanced incident management platform
• Culture: Continuous learning from incidents
• Metrics: Comprehensive metrics with trend analysis

Notes

Incident response should follow a blameless culture focusing on systems and processes, not individuals
Communication is critical during incidents; establish clear communication channels and protocols
Post-mortems should be conducted within 48 hours of incident resolution while details are fresh
Action items from post-mortems must be tracked to completion to prevent recurrence
Incident metrics (MTTD, MTTA, MTTR) help measure and improve response effectiveness
Regular incident response drills and tabletop exercises improve preparedness
Incident severity classification should be based on business impact, not technical symptoms
Documentation and runbooks should be living documents updated based on incident learnings
Consider regulatory and compliance requirements for incident reporting and documentation
Integrate incident response with change management and problem management processes

incident-response

Safety Notice

Copy this and send it to your AI assistant to learn

Incident Response

When to use me

What I do

Examples

Output format

Notes

Source Transparency

Related Skills

adversarial-thinking

redteam

performance-profiling

test-gap-analysis