investigate

description: Investigate production issues with live work log and AI assistance
argument-hint: <bug report - logs, errors, description, screenshots, anything>

INVESTIGATE

You're a senior SRE investigating a production incident.

The user's bug report: $ARGUMENTS

The Codex First-Draft Pattern

Codex does the investigation. You review and verify.

codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings." \
  --output-last-message /tmp/codex-investigation.md 2>/dev/null

Then review Codex's findings. Don't investigate yourself first.

Multi-Hypothesis Mode (Agent Teams)

Use this mode when there are more than two plausible root causes and a single Codex investigation would anchor on one:

Create agent team with 3-5 investigators
Each teammate gets one hypothesis to prove/disprove
Teammates challenge each other's findings via messages
Lead synthesizes consensus root cause into incident doc

Use when: ambiguous stack trace, multiple services, flaky failures.
Don't use when: obvious single cause, config issue, simple regression.

Investigation Protocol

Rule #1: Config Before Code

External service issues are usually config, not code. Check in this order:

Env vars present? npx convex env list --prod | grep <SERVICE> or vercel env ls
Env vars valid? No trailing whitespace, correct format (sk_, whsec_)
Endpoints reachable? curl -I -X POST <webhook_url>
Then examine code
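The validity check in step 2 can be sketched as a small shell function. This is a minimal sketch, not part of the original protocol: the function name `check_env_value` and its status strings are hypothetical, and it also flags leading whitespace, which the rule above does not mention.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate one env var value before reading any code.
# Usage: check_env_value NAME VALUE EXPECTED_PREFIX
check_env_value() {
  local name="$1" value="$2" prefix="$3"

  # Presence check: an empty value is treated as missing.
  if [ -z "$value" ]; then
    echo "$name: MISSING"
    return 1
  fi

  # Leading or trailing whitespace often sneaks in via copy-paste.
  case "$value" in
    [[:space:]]*|*[[:space:]])
      echo "$name: WHITESPACE"
      return 1
      ;;
  esac

  # Format check: the value must start with the expected prefix,
  # e.g. sk_ or whsec_.
  case "$value" in
    "$prefix"*)
      echo "$name: OK"
      ;;
    *)
      echo "$name: BAD_PREFIX"
      return 1
      ;;
  esac
}
```

For example, `check_env_value API_SECRET "$API_SECRET" sk_` prints one status line and returns non-zero on any failure, so it can gate the move to examining code.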

Rule #2: Demand Observable Proof

Before declaring "fixed", show:

Log entry that proves the fix worked
Metric that changed (e.g., subscription status, webhook delivery)
Database state that confirms resolution

Mark investigation as UNVERIFIED until observables confirm. Never trust "should work" — demand proof.

Mission

Create a live investigation document (INCIDENT-{timestamp}.md) and systematically find the root cause.

Bounded Shell Output (MANDATORY)

Never dump full production logs blindly
Start with counts and latest slices (tail -n 200)
For large artifacts, use ~/.claude/scripts/safe-read.sh
Add hard bounds (--limit, per_page, timeout) to external commands
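One way to apply "counts first, then the latest slice" to a local log file is sketched below. The function name `bounded_log_summary` and its output format are assumptions made for illustration; the default of 200 lines mirrors the `tail -n 200` bound above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a log file without dumping it.
# Usage: bounded_log_summary FILE PATTERN [MAX_LINES]
bounded_log_summary() {
  local file="$1" pattern="$2" max="${3:-200}"

  # Counts first: how big is the file, and how many lines match?
  # tr strips the padding some wc implementations add.
  echo "total_lines=$(wc -l < "$file" | tr -d ' ')"
  echo "matches=$(grep -c -- "$pattern" "$file")"

  # Then only the most recent matching lines, capped at MAX_LINES.
  grep -- "$pattern" "$file" | tail -n "$max"
}
```

The same idea carries over to external commands: pass their own limit flags where they exist, and, where GNU coreutils is available, wrap the call in `timeout` so a hung command cannot stall the investigation.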

Your Toolkit

Observability: sentry-cli, npx convex, vercel, whatever this project has
Git: Recent deploys, changes, bisect
Gemini CLI: Web-grounded research, hypothesis generation, similar incident lookup
Thinktank: Multi-model validation when you need a second opinion on hypotheses
Config: Check env vars and configs early - missing config is often the root cause

The Work Log

Update INCIDENT-{timestamp}.md as you go:

Timeline: What happened when (UTC)
Evidence: Logs, metrics, configs checked
Hypotheses: What you think is wrong, ranked by likelihood
Actions: What you tried, what you learned
Root cause: When you find it
Fix: What you did to resolve it
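The work log skeleton could be scaffolded at the start of an investigation with something like the following. This is a sketch under assumptions: the function name `new_incident_log`, the compact UTC timestamp format, and the use of `##` headings are illustrative choices, not something the workflow prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the work log skeleton and print its path.
# Usage: new_incident_log [DIR]
new_incident_log() {
  local dir="${1:-.}"
  local ts file section

  # UTC timestamp, matching the "Timeline (UTC)" convention. Format is assumed.
  ts="$(date -u +%Y%m%dT%H%M%SZ)"
  file="$dir/INCIDENT-$ts.md"

  # One empty section per work log entry listed above.
  {
    echo "# Incident $ts"
    for section in Timeline Evidence Hypotheses Actions "Root cause" Fix; do
      echo
      echo "## $section"
    done
  } > "$file"

  echo "$file"
}
```

Calling `new_incident_log .` prints the path of the new file, which can then be appended to as evidence and hypotheses accumulate.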

Root Cause Discipline

For each hypothesis, explicitly categorize:

ROOT: Fixing this removes the fundamental cause
SYMPTOM: Fixing this masks an underlying issue

Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:

"What's the underlying architectural issue this symptom reveals?"

Post-fix question: "If we revert this change in 6 months, does the problem return?"

Investigation Philosophy

Config before code: Check env vars and configs before diving into code
Hypothesize explicitly: Write down what you think is wrong before testing
Binary search: Narrow the problem space with each experiment
Document as you go: The work log is for handoff, postmortem, and learning
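When the regression is somewhere in recent commits, the binary-search principle maps directly onto `git bisect`. The wrapper below is a minimal sketch: the name `first_bad_commit` is hypothetical, it assumes HEAD is known bad, and it assumes the test command exits 0 on a good commit and 1 on a bad one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first bad commit between GOOD_REF and HEAD.
# Usage: first_bad_commit GOOD_REF TEST_CMD [ARGS...]
# Assumes HEAD is bad and TEST_CMD exits 0 on good commits, 1 on bad ones.
first_bad_commit() {
  local good="$1" bad
  shift

  # Mark HEAD as bad and GOOD_REF as good.
  git bisect start HEAD "$good" >/dev/null 2>&1 || return 1

  # Let git halve the range, running TEST_CMD at each step.
  git bisect run "$@" >/dev/null 2>&1

  # When bisect finishes, refs/bisect/bad points at the first bad commit.
  bad="$(git rev-parse refs/bisect/bad)"

  # Restore the original checkout before reporting.
  git bisect reset >/dev/null 2>&1
  echo "$bad"
}
```

Because it checks out commits, run this in a clean working tree, and record both the command and the resulting commit in the work log under Actions.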

When Done

Root cause documented
Fix applied (or proposed if too risky)
Postmortem section completed (what went wrong, lessons, follow-ups)
Consider if the pattern is worth codifying (regression test, agent update, etc.)

Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.

Visual Deliverable

After completing the core workflow, generate a visual HTML summary:

Read ~/.claude/skills/visualize/prompts/investigate-timeline.md
Read the template(s) referenced in the prompt
Read ~/.claude/skills/visualize/references/css-patterns.md
Generate self-contained HTML capturing this session's output
Write to ~/.agent/diagrams/investigate-{incident}-{date}.html
Open in browser: open ~/.agent/diagrams/investigate-{incident}-{date}.html
Tell the user the file path

Skip visual output if:

The session was trivial (single finding, quick fix)
The user explicitly opts out (--no-visual)
No browser available (SSH session)
