Root Cause Analysis
Symptom → hypothesis formation → evidence gathering → elimination → root cause → verified fix.
<when_to_use>
- Diagnosing system failures or unexpected behavior
- Investigating incidents or outages
- Finding the actual cause vs surface symptoms
- Preventing recurrence through understanding
- Any situation where "why did this happen?" needs answering
NOT for: known issues with documented fixes, simple configuration errors, guessing without evidence
</when_to_use>
<discovery_phase>
**Core Questions**

| Question | Why it matters |
| --- | --- |
| What's the symptom? | Exact manifestation of the problem |
| When did it start? | First occurrence, patterns in timing |
| Can you reproduce it? | Consistently, intermittently, specific conditions |
| What changed recently? | Deployments, config, dependencies, environment |
| What have you tried? | Previous fix attempts, their results |
| What are the constraints? | Time budget, what can't be modified |
**Confidence Thresholds**

| Level | State | Action |
| --- | --- | --- |
| 0-2 | Symptom unclear or can't reproduce | Keep gathering info |
| 3 | Good context, some gaps | Can start hypothesis phase |
| 4+ | Clear picture | Proceed to investigation |
At level 3+, transition to hypothesis formation. Below level 3, keep gathering context.
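
The gate above reduces to a small decision function. A minimal sketch, assuming the self-assessed 0-5 scale from the table (the function name is illustrative, not part of this skill):

```python
def discovery_gate(confidence: int) -> str:
    """Map a self-assessed 0-5 discovery confidence level to the next action."""
    if confidence <= 2:
        return "keep gathering info"       # symptom unclear or can't reproduce
    if confidence == 3:
        return "start hypothesis phase"    # good context, some gaps remain
    return "proceed to investigation"      # clear picture
```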
</discovery_phase>
<hypothesis_formation>
**Quality Criteria**

| Good Hypothesis | Weak Hypothesis |
| --- | --- |
| Testable | Too broad ("something's wrong") |
| Falsifiable | Untestable |
| Specific | Contradicts evidence |
| Plausible | Assumes conclusion |
**Multiple Working Hypotheses**

Generate 2-4 competing theories (a sketch follows this list):

- List each hypothesis with supporting/contradicting evidence
- Rank by likelihood (evidence support, parsimony, testability)
- Design tests to differentiate between them
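
Competing theories are easier to rank when held in a uniform shape. A minimal sketch, using an invented outage scenario for illustration (the class, fields, and scenario are assumptions, not part of this skill):

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str                                         # specific, testable claim
    evidence_for: list[str] = field(default_factory=list)
    evidence_against: list[str] = field(default_factory=list)
    test: str = ""                                         # how to confirm or refute

    def score(self) -> int:
        # Crude likelihood rank: net evidence support.
        return len(self.evidence_for) - len(self.evidence_against)

hypotheses = [
    Hypothesis(
        statement="Connection pool exhausted under peak load",
        evidence_for=["errors correlate with traffic spikes"],
        test="shrink the pool locally and try to reproduce",
    ),
    Hypothesis(
        statement="Last week's config change broke the timeout",
        evidence_against=["symptom predates the deploy"],
        test="diff config against the last known good state",
    ),
]

# Rank by likelihood before designing differentiating tests.
hypotheses.sort(key=lambda h: h.score(), reverse=True)
```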
</hypothesis_formation>
<evidence_gathering>
**Observation Collection**

| Category | What to Gather |
| --- | --- |
| Error manifestation | Exact symptoms, messages, states |
| Reproduction steps | Minimal sequence triggering issue |
| System state | Logs, variables, config at failure time |
| Environment | Versions, platform, dependencies |
| Timing | When started, frequency, patterns |
**Breadcrumb Analysis**

Trace backwards from the symptom (a sketch follows this list):

- Last known good state — what was working?
- First observable failure — when did it break?
- Changes between — what's different?
- Root trigger — first thing that went wrong
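
If the changes between the last known good state and the first failure can be ordered, the trace can be mechanized. A minimal sketch, assuming a `still_works` predicate that re-tests the system with a prefix of the changes applied (both names are placeholders):

```python
def breadcrumb_trace(changes: list[str], still_works) -> str | None:
    """Walk forward from the last known good state and return the first
    change after which the system stops working (the root trigger).

    `changes` is ordered oldest to newest; index 0 immediately follows
    the last known good state.
    """
    applied: list[str] = []
    for change in changes:
        applied.append(change)
        if not still_works(applied):
            return change   # first thing that went wrong
    return None             # symptom is not explained by these changes
```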
</evidence_gathering>
<hypothesis_testing>
**Test Design**

For each hypothesis (a sketch follows this list):

- Prediction — if true, what should we observe?
- Test method — how to verify?
- Expected result — what confirms/refutes?
- Time budget — when to move on?
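
The four fields above can travel with each hypothesis as a small record. A sketch reusing the null-user example from the audit trail below (all field values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TestPlan:
    hypothesis: str
    prediction: str       # if true, what should we observe?
    method: str           # how to verify
    expected: str         # result that confirms or refutes
    budget_minutes: int   # when to move on

plan = TestPlan(
    hypothesis="User object not initialized on the failing path",
    prediction="user is None when the failure occurs",
    method="add null-check logging at the entry point",
    expected="None logged on failing requests only",
    budget_minutes=20,
)
```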
**Testing Priorities**

| Priority | Strategy |
| --- | --- |
| First | Quick, non-destructive, local tests |
| Second | Most likely causes, common failures |
| Third | Edge cases, rare failures |
**Execution Loop**
Baseline → Single variable change → Observe → Document → Iterate
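
The same loop in skeletal form, with every callable a placeholder for your environment (nothing here is a real API):

```python
def execution_loop(candidate_changes, reset_baseline, observe, log_step):
    """Baseline → single variable change → observe → document → iterate."""
    for change in candidate_changes:
        reset_baseline()            # return to a known starting state
        change.apply()              # vary exactly one thing
        outcome = observe()         # run the reproduction, record behavior
        log_step(change, outcome)   # document before moving on
        change.revert()             # keep the next test independent
```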
</hypothesis_testing>
<elimination_methodology>
Three core techniques:
| Technique | When to Use |
| --- | --- |
| Binary Search | Large problem space, ordered changes |
| Variable Isolation | Multiple variables, need causation |
| Process of Elimination | Finite set of possible causes |
See elimination-techniques.md for detailed methods.
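
As one concrete illustration of the first technique: binary search over an ordered change history needs only about log2(N) re-tests to locate the culprit, which is the idea behind `git bisect`. A minimal sketch, assuming the failure persists once introduced and `is_broken(i)` re-tests the system as of change `i` (a placeholder predicate):

```python
def first_bad_change(changes: list, is_broken) -> int:
    """Return the index of the first change that introduces the failure.

    Assumes the state before `changes[0]` is good and the state at
    `changes[-1]` is broken.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(mid):
            hi = mid          # failure already present, look earlier
        else:
            lo = mid + 1      # still good here, culprit is later
    return lo
```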
</elimination_methodology>
<time_boxing>
| Phase | Duration | Exit Condition |
| --- | --- | --- |
| Discovery | 5-10 min | Questions answered, can reproduce |
| Hypothesis | 10-15 min | 2-4 testable theories ranked |
| Testing | 15-30 min per hypothesis | Confirmed or ruled out |
| Fix | Variable | Root cause addressed |
| Verification | 10-15 min | Fix confirmed, prevention documented |
If stuck beyond 2x estimate → step back, seek fresh perspective, or escalate.
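
One lightweight way to enforce the 2x rule is a timer around each phase. A minimal sketch (the helper and its output message are illustrative, not part of this skill):

```python
import time
from contextlib import contextmanager

@contextmanager
def timebox(phase: str, estimate_minutes: float):
    """Warn when a phase runs past 2x its estimate."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = (time.monotonic() - start) / 60
        if elapsed > 2 * estimate_minutes:
            print(f"[{phase}] {elapsed:.0f} min, over 2x estimate: "
                  f"step back, seek fresh perspective, or escalate")

# Usage:
# with timebox("TESTING", 30):
#     ...test one hypothesis...
```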
</time_boxing>
<audit_trail>
Log every step in the form `[TIME] PHASE: Action → Result`:

    [10:15] DISCOVERY: Gathered error logs → Found NullPointerException
    [10:22] HYPOTHESIS: User object not initialized
    [10:28] TEST: Added null check logging → Confirmed user is null
Benefits: prevents revisiting the same ground, enables handoff, catches circular investigation.
See documentation-templates.md for full templates.
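
A few lines suffice to keep entries in that shape. A minimal sketch; the log file name is an assumption, not prescribed by this skill:

```python
from datetime import datetime

def log_step(phase: str, action: str, result: str = "") -> str:
    """Append one entry in the [TIME] PHASE: Action → Result format."""
    entry = f"[{datetime.now():%H:%M}] {phase.upper()}: {action}"
    if result:
        entry += f" → {result}"
    with open("investigation.log", "a") as audit:  # file name is illustrative
        audit.write(entry + "\n")
    return entry

log_step("DISCOVERY", "Gathered error logs", "Found NullPointerException")
```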
</audit_trail>
<common_pitfalls>
Watch for these patterns:
| Trap | Counter |
| --- | --- |
| "I already looked at that" | Re-examine with fresh evidence |
| "That can't be the issue" | Test anyway, let evidence decide |
| "We need to fix this quickly" | Methodical investigation is faster |
| Confirmation bias | Actively seek disconfirming evidence |
| Correlation = causation | Test the direct causal mechanism |
See pitfalls.md for detailed resistance patterns and recovery.
</common_pitfalls>
<confidence_calibration>
| Level | Indicators |
| --- | --- |
| High | Consistent reproduction, clear cause-effect, multiple confirmations, fix verified |
| Moderate | Reproduces mostly, strong correlation, single confirmation |
| Low | Inconsistent reproduction, unclear correlation, unverified hypothesis |
</confidence_calibration>
ALWAYS:

- Gather sufficient context before hypothesizing
- Form multiple competing hypotheses
- Test systematically, one variable at a time
- Document the investigation trail
- Verify the fix actually addresses the root cause
- Document for future prevention
NEVER:

- Jump to solutions without diagnosis
- Trust a single hypothesis without testing alternatives
- Apply fixes without understanding the cause
- Skip verification of the fix
- Repeat the same failed investigation steps
- Hide uncertainty about the root cause
Deep-dive documentation:
- elimination-techniques.md — binary search, variable isolation, process of elimination
- pitfalls.md — cognitive biases and resistance patterns
- documentation-templates.md — investigation logs and RCA reports
Related skills:
- debugging-and-diagnosis — code-specific debugging (loads this skill)
- codebase-analysis — uses this skill for code investigation
- report-findings — presenting investigation results