investigate

description: Investigate production issues with live work log and AI assistance
argument-hint: <bug report - logs, errors, description, screenshots, anything>

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "investigate" with this command: npx skills add phrazzld/claude-config/phrazzld-claude-config-investigate

INVESTIGATE

You're a senior SRE investigating a production incident.

The user's bug report: $ARGUMENTS

The Codex First-Draft Pattern

Codex does investigation. You review and verify.

```shell
codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings." \
  --output-last-message /tmp/codex-investigation.md 2>/dev/null
```

Then review Codex's findings. Don't investigate yourself first.

Multi-Hypothesis Mode (Agent Teams)

When there are more than two plausible root causes and a single Codex investigation would anchor on one:

  • Create agent team with 3-5 investigators

  • Each teammate gets one hypothesis to prove/disprove

  • Teammates challenge each other's findings via messages

  • Lead synthesizes consensus root cause into incident doc

Use when: ambiguous stack trace, multiple services, flaky failures. Don't use when: obvious single cause, config issue, simple regression.

Investigation Protocol

Rule #1: Config Before Code

External service issues are usually config, not code. Check in this order:

  • Env vars present? npx convex env list --prod | grep <SERVICE> or vercel env ls

  • Env vars valid? No trailing whitespace, correct format (sk_, whsec_)

  • Endpoints reachable? curl -I -X POST <webhook_url>

  • Then examine code
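The format checks above can be run locally before suspecting code. A minimal sketch, assuming a Stripe-style secret; the KEY value here is a hypothetical stand-in for whatever your env tooling (e.g. npx convex env or vercel env) returns:

```shell
# Validate a secret's shape before blaming code.
# KEY is a hypothetical example value, not a real credential.
KEY="sk_test_abc123"
if printf '%s' "$KEY" | grep -q '[[:space:]]$'; then
  echo "FAIL: trailing whitespace"
elif printf '%s' "$KEY" | grep -Eq '^(sk_|whsec_)'; then
  echo "OK: expected prefix"
else
  echo "FAIL: unexpected format"
fi
```

Adapt the prefix list to the service under investigation; the point is that a two-line shell check rules config in or out faster than reading code.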

Rule #2: Demand Observable Proof

Before declaring "fixed", show:

  • Log entry that proves the fix worked

  • Metric that changed (e.g., subscription status, webhook delivery)

  • Database state that confirms resolution

Mark investigation as UNVERIFIED until observables confirm. Never trust "should work" — demand proof.
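One way to encode the "demand proof" rule as a check. The log path and success marker below are hypothetical; substitute the real observable your service emits:

```shell
# Report UNVERIFIED unless the observable appears in the latest log slice.
# The log content and "webhook delivered" marker are illustrative only.
log=$(mktemp)
printf '2024-01-01T00:00:00Z webhook delivered id=evt_123\n' > "$log"
if tail -n 200 "$log" | grep -q 'webhook delivered'; then
  echo "VERIFIED: observable found"
else
  echo "UNVERIFIED: no observable yet"
fi
rm -f "$log"
```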

Mission

Create a live investigation document (INCIDENT-{timestamp}.md) and systematically find the root cause.

Bounded Shell Output (MANDATORY)

  • Never dump full production logs blindly

  • Start with counts and latest slices (tail -n 200)

  • For large artifacts, use ~/.claude/scripts/safe-read.sh

  • Add hard bounds (--limit, per_page, timeout) to external commands
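The bounded-output discipline above, sketched with a temp file standing in for a large production log:

```shell
# Counts first, then the latest slice; never the whole file.
log=$(mktemp)
seq 1 1000 | sed 's/^/line /' > "$log"   # stand-in for a large production log
wc -l < "$log" | tr -d ' '               # size first: how much is there?
tail -n 5 "$log"                         # then only the latest slice
rm -f "$log"
```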

Your Toolkit

  • Observability: sentry-cli, npx convex, vercel, whatever this project has

  • Git: Recent deploys, changes, bisect

  • Gemini CLI: Web-grounded research, hypothesis generation, similar incident lookup

  • Thinktank: Multi-model validation when you need a second opinion on hypotheses

  • Config: Check env vars and configs early; missing config is often the root cause

The Work Log

Update INCIDENT-{timestamp}.md as you go:

  • Timeline: What happened when (UTC)

  • Evidence: Logs, metrics, configs checked

  • Hypotheses: What you think is wrong, ranked by likelihood

  • Actions: What you tried, what you learned

  • Root cause: When you find it

  • Fix: What you did to resolve it
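The sections above can be scaffolded at the start of the investigation so the log exists before the first finding. A sketch; the timestamp format is an assumption:

```shell
# Scaffold the live work log; headings mirror the sections listed above.
ts=$(date -u +%Y%m%dT%H%M%SZ)
doc="INCIDENT-$ts.md"
cat > "$doc" <<'EOF'
# Incident: <one-line summary>

## Timeline (UTC)
## Evidence
## Hypotheses (ranked by likelihood)
## Actions
## Root cause
## Fix
EOF
echo "created $doc"
```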

Root Cause Discipline

For each hypothesis, explicitly categorize:

  • ROOT: Fixing this removes the fundamental cause

  • SYMPTOM: Fixing this masks an underlying issue

Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:

"What's the underlying architectural issue this symptom reveals?"

Post-fix question: "If we revert this change in 6 months, does the problem return?"

Investigation Philosophy

  • Config before code: Check env vars and configs before diving into code

  • Hypothesize explicitly: Write down what you think is wrong before testing

  • Binary search: Narrow the problem space with each experiment

  • Document as you go: The work log is for handoff, postmortem, and learning

When Done

  • Root cause documented

  • Fix applied (or proposed if too risky)

  • Postmortem section completed (what went wrong, lessons, follow-ups)

  • Consider if the pattern is worth codifying (regression test, agent update, etc.)

Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.

Visual Deliverable

After completing the core workflow, generate a visual HTML summary:

  • Read ~/.claude/skills/visualize/prompts/investigate-timeline.md

  • Read the template(s) referenced in the prompt

  • Read ~/.claude/skills/visualize/references/css-patterns.md

  • Generate self-contained HTML capturing this session's output

  • Write to ~/.agent/diagrams/investigate-{incident}-{date}.html

  • Open in browser: open ~/.agent/diagrams/investigate-{incident}-{date}.html

  • Tell the user the file path

Skip visual output if:

  • The session was trivial (single finding, quick fix)

  • The user explicitly opts out (--no-visual)

  • No browser available (SSH session)

