Incident Root Cause Analysis Skill
Your job is to move from symptom to cause — not just to note that two things happened at the same time, but to establish that one caused the other. You do this through structured hypothesis testing, reading actual code and logs, and eliminating competing explanations until one remains.
RCA Methodology
Work through these four steps explicitly. Don't skip to conclusions.
Step 1: Form a hypothesis
State the most probable root cause based on the symptom, timing, and any context provided. Be specific — name the service, function, query, dependency, or config value you suspect.
Example: "Hypothesis A: DB connection pool exhausted due to a connection leak introduced in UserService.fetchProfile() in the v2.4.1 deploy 2 hours ago."
Vague hypotheses like "it might be a DB issue" are not useful — they don't tell you what evidence to look for next.
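If deployment and repository access are available, anchoring the hypothesis to a concrete change makes it testable. A minimal sketch, assuming the suspect service runs as a Kubernetes Deployment named user-service (the path and time window are illustrative):

```bash
# Assumption: the suspect service is a Kubernetes Deployment named user-service.
# When did the last rollout happen, and which revision is live?
kubectl rollout history deployment/user-service

# What changed in the suspect window? (path and window are illustrative)
git log --oneline --since="3 hours ago" -- services/user-service/
```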
Step 2: Identify discriminating evidence
For each hypothesis, specify what telemetry would confirm it and what would refute it. Then go get that evidence — read the log file, run the diagnostic command, read the source file.
Hypothesis A — DB connection pool exhaustion
Confirms: pool utilization metric at 100%; pg_stat_activity shows idle connections being held open
Refutes: pool utilization below 80%; errors started immediately at deploy (not hours later)
Never theorize about code you can read directly.
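For Hypothesis A above, the evidence-gathering step might look like the sketch below. It assumes a Postgres backend reachable through psql via a DATABASE_URL environment variable; the 80%/100% thresholds come from your pool configuration, not from these queries.

```bash
# Connection counts by state: idle connections piling up while the app reports
# pool timeouts is consistent with a leak.
psql "$DATABASE_URL" -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"

# Server-side ceiling to compare against.
psql "$DATABASE_URL" -c "SHOW max_connections;"

# Longest-idle connections and their last statement: candidates for a missing close().
psql "$DATABASE_URL" -c "SELECT pid, now() - state_change AS idle_for, left(query, 60) AS last_query
                         FROM pg_stat_activity WHERE state = 'idle'
                         ORDER BY idle_for DESC LIMIT 10;"
```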
Step 3: Test and update
State explicitly whether the evidence confirms or refutes the hypothesis.
"The pool utilization metric shows only 60% — this refutes Hypothesis A. The error timeline shows the spike began immediately at deploy, not hours later. Updating to Hypothesis B: ORM lazy loading behavior changed in the SQLAlchemy 1.4→2.0 upgrade bundled in this deploy, causing N+1 queries on the /api/user endpoint."
Keep cycling until one hypothesis survives all available evidence.
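Query-level evidence is what discriminates a hypothesis like B: an N+1 regression shows up as one statement with a call count far above everything else. A sketch assuming the pg_stat_statements extension is enabled and that application logs are JSON with a path field (both are assumptions about the environment):

```bash
# Statements ranked by execution count; an N+1 pattern is usually one SELECT
# with calls orders of magnitude above the rest.
psql "$DATABASE_URL" -c "SELECT calls, left(query, 80) AS query
                         FROM pg_stat_statements
                         ORDER BY calls DESC LIMIT 10;"

# Cross-check against request volume: ~1k requests to /api/user but ~50k
# executions of the suspect SELECT means roughly 50 queries per request.
grep -c '"path":"/api/user"' app.log
```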
Step 4: Write the causal chain
Once converged, output the causal chain in this format. Every arrow must be a causal step, not a correlation.
[SQLAlchemy 2.0 upgrade in deploy v2.4.1] → [lazy loading disabled by default — relationship queries no longer batched] → [N+1 queries on every /api/user call — 1 query per user record instead of 1 total] → [DB CPU at 100%, query latency 8s average] → [HTTP 504s for all authenticated endpoints] → [checkout flow broken for all logged-in users]
Stack Trace Analysis
When given a stack trace or error log:
- Identify the exception type — what class of error is this? (OOM, NPE, connection error, timeout, deserialization failure, deadlock)
- Find the origin frame — the first frame in the developer's own code, not framework or library internals. That's where investigation starts.
- Classify the error pattern — see the table below for known signatures
- Determine novelty — new error (never appeared before) or regression (was working, now broken)?
- Extract correlation/trace IDs — if present, use them to link log lines across services into a single timeline (see the sketch below)
In minified JS environments, a frame like vendor.js:L:C requires source map resolution to identify the actual file and line. Flag this and ask for the .map file or unminified source if available.
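For the correlation-ID step above, stitching log lines into one timeline can be as simple as the sketch below. It assumes JSON-structured logs with trace_id, timestamp, service, level, and message fields; the field names and log path are illustrative.

```bash
TRACE_ID="abc123"   # correlation ID pulled from the failing request

# Merge every service's log, keep only this trace, and order by timestamp.
cat /var/log/services/*.log \
  | jq -r --arg t "$TRACE_ID" \
      'select(.trace_id == $t) | [.timestamp, .service, .level, .message] | @tsv' \
  | sort
```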
Known error pattern signatures
| Exception / Log Pattern | Most Likely Cause | First Diagnostic Step |
| --- | --- | --- |
| connection pool timeout / too many connections | Connection pool exhaustion or leak | Check pool utilization metric over time; look for missing close() in error paths |
| OOMKilled / Java heap space / heap out of memory | Memory leak or undersized limit | Plot memory metric since last deploy — linear climb = leak, step change = new allocation |
| deadlock found / lock wait timeout exceeded | DB lock contention | Check for long-running transactions or missing index on a recently changed query |
| certificate has expired / SSL handshake failed | Cert expiry | openssl s_client -connect <host>:443 — check notAfter field |
| no such host / name resolution failed | DNS / service discovery failure | nslookup <service>; check k8s service and endpoints |
| upstream timeout / 504 Gateway Timeout | Slow downstream dependency | Find the slow span in distributed trace; test latency directly with curl -w "%{time_total}" |
| FATAL ERROR: Allocation failed (Node.js) | V8 heap exhausted | Check --max-old-space-size setting; profile heap with --inspect |
Microservices Topology Reasoning
When the incident spans multiple services, read references/topology-patterns.md for detailed failure signatures and diagnosis commands.
Core approach:
- Use correlation/trace IDs to link log lines across services into a unified timeline
- Identify the origin service (first to show the error) vs propagation services (downstream victims that appear broken but are actually just receiving failures); see the sketch after this list
- The origin service is where RCA focuses — propagation services self-heal once the origin is fixed
- Draw a simplified call graph with the failure point annotated; this helps identify the blast radius
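One way to separate origin from propagation is to compare when each service first logged the error. A sketch assuming one JSON log file per service under /var/log/services/ with timestamp and level fields (paths and field names are assumptions):

```bash
# Print each service's first error timestamp; the earliest is the likely origin,
# later ones are probably propagation.
for f in /var/log/services/*.log; do
  first=$(jq -r 'select(.level == "error") | .timestamp' "$f" | sort | head -1)
  printf '%-30s %s\n' "$(basename "$f")" "${first:-no errors}"
done | sort -k2
```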
Impact Assessment
Before handing off to remediation, quantify the blast radius. This determines which remediation options are acceptable (a risky rollback is warranted for P0 data loss; it's not warranted for a P3 cosmetic issue).
```
IMPACT ASSESSMENT
─────────────────
Users Affected : [number or % of traffic — be specific about which user segment]
Revenue Path   : [yes — [flow name] | no]
Error Budget   : [X% remaining this month | already breached | unknown]
Downstream     : [services that will fail or degrade if this continues]
Data Integrity : [safe | at risk — [what data] | compromised — [what data]]
```
Answer this explicitly: Is a revenue-generating flow currently broken? This single question drives the P0 vs P1 boundary more than anything else.
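A quick way to put numbers behind "Users Affected" is the error rate at the edge. A sketch assuming nginx-style combined access logs, where the status code is field 9 and the client IP is field 1 (log format and path are assumptions):

```bash
# Error rate plus distinct client IPs hitting 5xx, as a rough proxy for users affected.
awk '{
  total++
  if ($9 >= 500) {
    errors++
    if (!($1 in seen)) { uniq++; seen[$1] = 1 }
  }
}
END {
  printf "error rate: %.1f%% (%d of %d requests)\n", (total ? 100 * errors / total : 0), errors, total
  printf "distinct client IPs hitting errors: %d\n", uniq
}' /var/log/nginx/access.log
```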
Reference Files
Load these during investigation — don't try to recall failure patterns from memory:
- references/topology-patterns.md — Diagnostic signatures, discriminating evidence, and exact Bash/SQL commands for 10 common distributed systems failure classes. Read when you need to match a symptom to a known pattern.