Incident Management
Incident Severity
| Level | Impact            | Response Time     |
|-------|-------------------|-------------------|
| SEV1  | Complete outage   | Immediate         |
| SEV2  | Major degradation | < 15 min          |
| SEV3  | Minor degradation | < 1 hour          |
| SEV4  | Low impact        | Next business day |
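
The severity levels above could be encoded as data so that tooling (paging, dashboards) reads the same definitions people do. The following is a minimal sketch, not a prescribed implementation: the names `Severity`, `SeverityPolicy`, `POLICIES`, and `response_target` are hypothetical, and representing "Immediate" as a zero-length window and "Next business day" as `None` is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    # Maximum time before someone must respond.
    # Assumption: None stands for "next business day".
    response_within: Optional[timedelta]


# Values copied from the severity table; the structure itself is illustrative.
POLICIES = {
    Severity.SEV1: SeverityPolicy("Complete outage", timedelta(0)),
    Severity.SEV2: SeverityPolicy("Major degradation", timedelta(minutes=15)),
    Severity.SEV3: SeverityPolicy("Minor degradation", timedelta(hours=1)),
    Severity.SEV4: SeverityPolicy("Low impact", None),
}


def response_target(severity: Severity) -> str:
    """Return a human-readable response target for a severity level."""
    policy = POLICIES[severity]
    if policy.response_within is None:
        return "next business day"
    if policy.response_within == timedelta(0):
        return "immediate"
    minutes = int(policy.response_within.total_seconds() // 60)
    return f"within {minutes} min"
```

Keeping the table in one dictionary means a change to a response target only has to be made in one place.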
Incident Response
- Detect
  - Monitoring alerts
  - Customer reports
  - Error logs
- Triage
  - Assess severity
  - Assign incident commander
  - Create communication channel
- Investigate
  - Check recent changes
  - Review logs and metrics
  - Identify root cause
- Mitigate
  - Apply quick fix
  - Rollback if needed
  - Communicate status
- Resolve
  - Confirm fix
  - Monitor for recurrence
  - Close incident
- Learn
  - Post-mortem meeting
  - Document findings
  - Create action items
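
The six phases above can be modeled as a simple ordered lifecycle. The sketch below is one possible way to track which phase an incident is in; the `Phase` and `Incident` names are hypothetical, and the strictly linear progression is a simplifying assumption (a real process may need to step back, for example from Mitigate to Investigate).

```python
from enum import Enum


class Phase(Enum):
    DETECT = "detect"
    TRIAGE = "triage"
    INVESTIGATE = "investigate"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    LEARN = "learn"


# Enum members iterate in definition order, which matches the process order.
PHASE_ORDER = list(Phase)


class Incident:
    """Tracks an incident's current phase (illustrative only)."""

    def __init__(self, title):
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self):
        """Move to the next phase; raise if already in the final phase."""
        index = PHASE_ORDER.index(self.phase)
        if index == len(PHASE_ORDER) - 1:
            raise ValueError("incident is already in the final phase")
        self.phase = PHASE_ORDER[index + 1]
        self.history.append(self.phase)
        return self.phase
```

Recording `history` alongside the current phase gives a starting point for the timeline that the post-mortem later needs.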
Post-Mortem Template
Post-Mortem: [Incident Title]
Summary
[Brief description of what happened]
Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Resolution]
Impact
- Duration: [X hours]
- Users affected: [X]
- Revenue impact: [if applicable]
Root Cause
[What caused this incident]
Contributing Factors
- [Factor 1]
- [Factor 2]
What Went Well
- [Positive 1]
- [Positive 2]
What Could Be Improved
- [Improvement 1]
- [Improvement 2]
Action Items
- [Action 1] - Owner: [Name]
- [Action 2] - Owner: [Name]
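
If post-mortems are generated from structured incident data, part of the template above could be filled in programmatically. This is a rough sketch under several assumptions: the `PostMortem` class and its fields are hypothetical, only a subset of the template sections is rendered, timeline times are `HH:MM` strings within a single day, and duration is reported in minutes rather than hours.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PostMortem:
    title: str
    summary: str
    # Each entry is (HH:MM, event), in chronological order.
    timeline: List[Tuple[str, str]] = field(default_factory=list)
    root_cause: str = ""
    # Each entry is (action, owner).
    action_items: List[Tuple[str, str]] = field(default_factory=list)

    def duration_minutes(self) -> int:
        """Minutes between the first and last timeline entries.

        Assumes all entries fall on the same day.
        """
        def to_minutes(hhmm):
            hours, minutes = hhmm.split(":")
            return int(hours) * 60 + int(minutes)

        if len(self.timeline) < 2:
            return 0
        return to_minutes(self.timeline[-1][0]) - to_minutes(self.timeline[0][0])

    def render(self) -> str:
        """Render a subset of the template sections as plain text."""
        lines = [f"Post-Mortem: {self.title}", "Summary", self.summary, "Timeline"]
        lines += [f"- {time} - {event}" for time, event in self.timeline]
        lines += ["Impact", f"- Duration: {self.duration_minutes()} minutes"]
        lines += ["Root Cause", self.root_cause, "Action Items"]
        lines += [f"- {action} - Owner: {owner}" for action, owner in self.action_items]
        return "\n".join(lines)
```

Sections such as Contributing Factors, What Went Well, and What Could Be Improved are left out here because they usually come from the post-mortem discussion rather than from recorded data.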
Blameless Culture
- Focus on systems, not people
- "What failed?" not "Who failed?"
- Share learnings openly
- Celebrate near-misses