Post-Mortem & Incident Review Framework
Run structured post-mortems that actually prevent repeat failures. Blameless analysis, root cause identification, and action tracking.
When to Use
- After any production incident, outage, or service degradation
- After a missed deadline, failed launch, or lost deal
- After any event costing >$5K or >4 hours of team time
- Quarterly review of recurring incident patterns
Post-Mortem Template
1. Incident Summary (Complete Within 24 Hours)
Incident ID: [AUTO-GENERATED]
Date/Time: [Start] → [End] (Duration: X hours)
Severity: SEV-1 (revenue impact) | SEV-2 (customer impact) | SEV-3 (internal impact)
Impact: [Users affected] | [Revenue lost] | [SLA breached Y/N]
Detection: How was it found? (Monitoring / Customer report / Internal discovery)
Detection Delay: Time from incident start → first alert
2. Timeline (Minute-by-Minute for SEV-1, 15-min blocks for SEV-2/3)
HH:MM - Event description
HH:MM - First alert triggered
HH:MM - Team notified
HH:MM - Investigation started
HH:MM - Root cause identified
HH:MM - Fix deployed
HH:MM - Confirmed resolved
3. Root Cause Analysis — 5 Whys
Why 1: [Direct cause]
Why 2: [Why did that happen?]
Why 3: [Why did THAT happen?]
Why 4: [Systemic cause]
Why 5: [Organizational/cultural root]
4. Contributing Factors
Score each factor 0-3 (0=not a factor, 3=primary contributor):
| Factor | Score | Notes |
|---|---|---|
| Missing/inadequate monitoring | ||
| Insufficient testing | ||
| Documentation gaps | ||
| Process not followed | ||
| Knowledge concentration (bus factor) | ||
| Capacity/scaling limits | ||
| Third-party dependency | ||
| Communication breakdown | ||
| Change management failure | ||
| Technical debt |
5. What Went Well
List 3-5 things that worked during the response:
- Fast detection? Good runbooks? Strong communication? Quick escalation?
6. Action Items
Every action MUST have an owner and deadline:
| # | Action | Owner | Deadline | Priority | Status |
|---|---|---|---|---|---|
| 1 | P0/P1/P2 | Open |
Priority definitions:
- P0: Must complete before next business day
- P1: Must complete within 1 week
- P2: Must complete within 1 sprint/month
7. Recurrence Prevention
- Monitoring added/improved for this failure mode
- Runbook created/updated
- Test coverage added
- Architecture change needed? (If yes, create RFC)
- Training needed for team?
Blameless Post-Mortem Rules
- Focus on systems, not individuals
- "What happened" not "who did it"
- Assume everyone acted with best intentions and available information
- The goal is learning, not punishment
- If you find yourself writing someone's name next to a mistake, rewrite it as a process gap
Incident Cost Calculator
Direct costs:
Revenue lost during downtime: $___
SLA credits issued: $___
Emergency vendor/contractor costs: $___
Indirect costs:
Engineering hours × loaded rate: ___ hrs × $___/hr = $___
Customer churn risk (affected users × churn probability × LTV): $___
Brand/reputation (estimate): $___
Total incident cost: $___
Cost per minute of downtime: $___
Quarterly Incident Review
Every quarter, analyze patterns across all post-mortems:
- Top 3 root cause categories — Where should you invest in prevention?
- Mean time to detect (MTTD) — Is monitoring improving?
- Mean time to resolve (MTTR) — Is response getting faster?
- Action item completion rate — Are you actually fixing things?
- Repeat incidents — Same root cause twice = systemic failure
- Cost trend — Total incident cost per quarter (should decrease)
Industry-Specific Post-Mortem Considerations
| Industry | Key Focus | Regulatory Requirement |
|---|---|---|
| Fintech | Transaction integrity, audit trail | SOX, PCI-DSS incident reporting |
| Healthcare | PHI exposure, patient safety | HIPAA breach notification (60 days) |
| SaaS | SLA compliance, data integrity | SOC 2 incident management |
| E-commerce | Order integrity, payment processing | PCI-DSS, consumer protection |
| Manufacturing | Safety incidents, production loss | OSHA reporting requirements |
Go Deeper
Your post-mortems reveal where AI agents should be deployed first — the repetitive failures, the manual monitoring gaps, the processes that break under load.
- Find your highest-cost gaps: AI Revenue Leak Calculator
- Industry-specific deployment playbooks: AfrexAI Context Packs — $47
- Pick 3: $97 | All 10: $197 | Everything: $247
- Deploy your first agent: Agent Setup Wizard
Built by AfrexAI — turning incident patterns into automation opportunities.