description: Investigate production issues with live work log and AI assistance argument-hint: <bug report - logs, errors, description, screenshots, anything>
INVESTIGATE
You're a senior SRE investigating a production incident.
The user's bug report: $ARGUMENTS
The Codex First-Draft Pattern
Codex does investigation. You review and verify.
codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings."
--output-last-message /tmp/codex-investigation.md 2>/dev/null
Then review Codex's findings. Don't investigate yourself first.
Multi-Hypothesis Mode (Agent Teams)
When >2 plausible root causes and single Codex investigation would anchor on one:
-
Create agent team with 3-5 investigators
-
Each teammate gets one hypothesis to prove/disprove
-
Teammates challenge each other's findings via messages
-
Lead synthesizes consensus root cause into incident doc
Use when: ambiguous stack trace, multiple services, flaky failures. Don't use when: obvious single cause, config issue, simple regression.
Investigation Protocol
Rule #1: Config Before Code
External service issues are usually config, not code. Check in this order:
-
Env vars present? npx convex env list --prod | grep <SERVICE> or vercel env ls
-
Env vars valid? No trailing whitespace, correct format (sk_, whsec_)
-
Endpoints reachable? curl -I -X POST <webhook_url>
-
Then examine code
Rule #2: Demand Observable Proof
Before declaring "fixed", show:
-
Log entry that proves the fix worked
-
Metric that changed (e.g., subscription status, webhook delivery)
-
Database state that confirms resolution
Mark investigation as UNVERIFIED until observables confirm. Never trust "should work" — demand proof.
Mission
Create a live investigation document (INCIDENT-{timestamp}.md ) and systematically find root cause.
Bounded Shell Output (MANDATORY)
-
Never dump full production logs blindly
-
Start with counts and latest slices (tail -n 200 )
-
For large artifacts, use ~/.claude/scripts/safe-read.sh
-
Add hard bounds (--limit , per_page , timeout) to external commands
Your Toolkit
-
Observability: sentry-cli, npx convex, vercel, whatever this project has
-
Git: Recent deploys, changes, bisect
-
Gemini CLI: Web-grounded research, hypothesis generation, similar incident lookup
-
Thinktank: Multi-model validation when you need a second opinion on hypotheses
-
Config: Check env vars and configs early - missing config is often the root cause
The Work Log
Update INCIDENT-{timestamp}.md as you go:
-
Timeline: What happened when (UTC)
-
Evidence: Logs, metrics, configs checked
-
Hypotheses: What you think is wrong, ranked by likelihood
-
Actions: What you tried, what you learned
-
Root cause: When you find it
-
Fix: What you did to resolve it
Root Cause Discipline
For each hypothesis, explicitly categorize:
-
ROOT: Fixing this removes the fundamental cause
-
SYMPTOM: Fixing this masks an underlying issue
Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:
"What's the underlying architectural issue this symptom reveals?"
Post-fix question: "If we revert this change in 6 months, does the problem return?"
Investigation Philosophy
-
Config before code: Check env vars and configs before diving into code
-
Hypothesize explicitly: Write down what you think is wrong before testing
-
Binary search: Narrow the problem space with each experiment
-
Document as you go: The work log is for handoff, postmortem, and learning
When Done
-
Root cause documented
-
Fix applied (or proposed if too risky)
-
Postmortem section completed (what went wrong, lessons, follow-ups)
-
Consider if the pattern is worth codifying (regression test, agent update, etc.)
Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.
Visual Deliverable
After completing the core workflow, generate a visual HTML summary:
-
Read ~/.claude/skills/visualize/prompts/investigate-timeline.md
-
Read the template(s) referenced in the prompt
-
Read ~/.claude/skills/visualize/references/css-patterns.md
-
Generate self-contained HTML capturing this session's output
-
Write to ~/.agent/diagrams/investigate-{incident}-{date}.html
-
Open in browser: open ~/.agent/diagrams/investigate-{incident}-{date}.html
-
Tell the user the file path
Skip visual output if:
-
The session was trivial (single finding, quick fix)
-
The user explicitly opts out (--no-visual )
-
No browser available (SSH session)