# Systematic Debugging Skill

## Overview

This skill provides a structured four-phase debugging framework emphasizing root cause discovery before attempting fixes. Core principle: "Random fixes waste time and create new bugs. Quick patches mask underlying issues."
## Quick Start

- Investigate - Gather evidence, reproduce consistently
- Analyze - Compare with working patterns
- Hypothesize - Form and test specific theories
- Implement - Fix with test coverage
## When to Use

- Bug reports requiring investigation
- Test failures with unclear causes
- Production incidents
- Performance regressions
- Integration failures
- Any debugging that requires more than 5 minutes
## The Four Phases

### Phase 1: Root Cause Investigation

Objective: Understand the problem completely before attempting any fix.

Steps:

- Examine error messages thoroughly
- Reproduce the issue consistently
- Review recent changes (commits, configs, dependencies)
- Gather diagnostic evidence (logs, traces, metrics); see the sketch after the questions below
- For multi-component systems, add instrumentation at each boundary

Questions to answer:

- What exactly is failing?
- When did it start failing?
- What changed recently?
- Can I reproduce it reliably?
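For example, the evidence-gathering step can be scripted so facts are recorded rather than remembered. This is a minimal sketch, assuming Python 3 and a Git checkout; the reproduction command and the `collect_evidence` helper are illustrative names, not part of this skill:

```python
# Minimal evidence-gathering sketch (illustrative; adapt commands and paths).
# Records the failure output, environment, and recent changes in one place.
import json
import platform
import subprocess
from datetime import datetime, timezone


def collect_evidence(repro_command):
    """Run the reproduction command and capture context around the failure."""
    repro = subprocess.run(repro_command, capture_output=True, text=True)
    recent = subprocess.run(["git", "log", "--oneline", "-10"],
                            capture_output=True, text=True)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "platform": platform.platform(),
        "python": platform.python_version(),
        "repro_exit_code": repro.returncode,
        "repro_stderr_tail": repro.stderr[-2000:],  # keep the end of the error output
        "recent_commits": recent.stdout.splitlines(),
    }


if __name__ == "__main__":
    # The test path below is a placeholder for your actual reproduction command.
    evidence = collect_evidence(["pytest", "tests/test_example.py", "-x"])
    print(json.dumps(evidence, indent=2))
```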
### Phase 2: Pattern Analysis

Objective: Find working examples and understand differences.

Steps:

- Locate working examples in the codebase
- Compare against reference implementations completely
- Identify differences systematically (see the sketch after the comparisons below)
- Understand all dependencies

Key comparisons:

- Working vs. broken code paths
- Expected vs. actual behavior
- Known good state vs. current state
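Where the working and broken states are captured in files (configs, fixtures, data dumps), the comparison can be mechanical instead of by eye. A minimal sketch using Python's standard `difflib`; both file names are placeholders:

```python
# Sketch: enumerate differences between a known-good config and the current
# one so nothing is missed. The file names below are placeholders.
import difflib
from pathlib import Path

known_good = Path("config.known_good.yaml").read_text().splitlines()
current = Path("config.yaml").read_text().splitlines()

for line in difflib.unified_diff(known_good, current,
                                 fromfile="known good", tofile="current",
                                 lineterm=""):
    print(line)
```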
### Phase 3: Hypothesis and Testing

Objective: Form and validate theories before changing code.

Steps:

- Formulate a specific hypothesis
- Design a test for the hypothesis
- Test with minimal changes (one variable at a time)
- Verify results before proceeding

Hypothesis format: "The bug occurs because [condition] when [trigger], which causes [symptom]."
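For example, a hypothesis in that format might read: "The bug occurs because `round()` rounds half to even when a total lands exactly on .5, which causes some totals to come out one unit lower than the legacy system's." The matching single-variable test is tiny; the surrounding scenario is invented for illustration, though the rounding behavior itself is standard Python 3:

```python
# Hypothesis test: one variable (the rounding of exact halves), one check.
def test_round_half_to_even_hypothesis():
    # Passing confirms the hypothesis: Python rounds 2.5 down to 2, not up to 3.
    assert round(2.5) == 2
    assert round(3.5) == 4  # half rounds toward the even neighbor
```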
### Phase 4: Implementation

Objective: Fix the root cause with proper verification.

Steps:

- Create a failing test case reproducing the bug (see the sketch after this list)
- Implement a single fix addressing the root cause
- Verify the test passes
- Verify no other tests broke
- Document the fix
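A minimal sketch of the first three steps on an invented bug; the `slugify` function and the bug report are illustrative, not part of this skill. The regression test is written first, fails against the old implementation, and passes once the root cause is fixed:

```python
import re


def slugify(title: str) -> str:
    """Turn a title into a URL slug.

    Root cause of the (invented) bug: the old version only replaced spaces,
    so punctuation leaked into slugs. The fix extracts word runs instead of
    patching individual symptoms.
    """
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def test_slugify_punctuation_regression():
    # Written first to reproduce the report, kept afterwards as a regression test.
    assert slugify("Hello,  World!") == "hello-world"
    assert slugify("") == ""
```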
## Critical Safeguards

### Hard Stop Rule

If >= 3 fixes fail: STOP and question the architecture.

When multiple fixes fail, the issue indicates deeper structural problems requiring discussion rather than continued symptom-patching.

### Red Flags (Restart Process)

- Proposing solutions before investigation
- Attempting multiple simultaneous fixes
- Assuming without verification
- Skipping the reproduction step
- "It should work" without evidence
Debugging Anti-Patterns
Anti-Pattern Problem Correct Approach
Shotgun debugging Random changes hoping something works Systematic investigation
Printf debugging only Incomplete picture Structured instrumentation
Blame the framework Avoids understanding Verify framework behavior
"Works on my machine" Environment assumptions Document exact repro steps
Quick patch Hides root cause Find and fix actual cause
## Instrumentation Strategies

### Logging Strategy
- Entry/exit of suspected functions
- Input/output values at boundaries
- State changes at key points
- Timing information for performance issues
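A minimal sketch of that kind of structured instrumentation, using only Python's standard library; the decorator name is illustrative:

```python
# Sketch: log entry/exit, inputs/outputs, and timing of suspected functions
# instead of scattering ad-hoc prints.
import functools
import logging
import time

log = logging.getLogger("debug.instrumentation")


def traced(func):
    """Log entry/exit, arguments, results, and duration of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("raise %s after %.3fs", func.__name__,
                          time.perf_counter() - start)
            raise
        log.debug("exit  %s -> %r (%.3fs)", func.__name__, result,
                  time.perf_counter() - start)
        return result
    return wrapper
```

Apply the decorator only to functions under suspicion and remove it once the root cause is confirmed.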
### Boundary Tracing

For multi-component systems:

    [Input] -> [Component A] -> [Component B] -> [Output]
       ^             ^                ^             ^
       |             |                |             |
    Check 1       Check 2          Check 3       Check 4

Add verification at each boundary to isolate the failure point.
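A minimal sketch of those checks, with Python stand-ins for the real components:

```python
# Sketch: wrap each boundary from the diagram above with a check so the first
# failing stage is obvious from the log. The component functions are stand-ins.
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("debug.boundaries")


def checked(label, value):
    """Log and sanity-check whatever crosses a boundary, then pass it through."""
    log.debug("%s: %r", label, value)
    assert value not in (None, ""), f"{label} produced an empty value"
    return value


def component_a(text):
    return text.strip()  # stand-in for the real first component


def component_b(text):
    return text.upper()  # stand-in for the real second component


def pipeline(raw):
    raw = checked("Check 1: input", raw)
    a_out = checked("Check 2: after Component A", component_a(raw))
    b_out = checked("Check 3: after Component B", component_b(a_out))
    return checked("Check 4: output", b_out)


if __name__ == "__main__":
    print(pipeline("  hello "))
```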
## Best Practices

### Do

- Reproduce before investigating
- Document investigation steps
- Test one hypothesis at a time
- Write a regression test for every bug fix
- Share findings with the team
- Update documentation when the cause is environment-related

### Don't

- Jump to conclusions
- Make multiple changes at once
- Fix symptoms instead of causes
- Skip the hypothesis step
- Merge fixes without tests
- Ignore intermittent failures
## Error Handling

| Situation | Action |
| --- | --- |
| Cannot reproduce | Gather more context, check environment differences |
| Multiple potential causes | Isolate and test each separately |
| Fix breaks other things | Revert, investigate dependencies |
| Root cause unclear after investigation | Escalate, add more instrumentation |
## Metrics

| Metric | Target | Description |
| --- | --- | --- |
| First-fix success rate | 80% | Fixes that resolve the issue first time |
| Regression rate | <5% | Bug fixes causing new bugs |
| Investigation time ratio | 60% | Time spent investigating vs. coding |
| Documentation rate | 100% | Bugs documented with root cause |
## Debugging Checklist

- Issue reproduced consistently
- Recent changes reviewed
- Error messages fully understood
- Working comparison found
- Hypothesis documented
- Single-variable test performed
- Root cause identified
- Failing test written
- Fix implemented
- All tests pass
- Fix documented
## Related Skills

- tdd-obra - Test-first development
- writing-plans - Plan implementations
- code-reviewer - Code quality review
## Version History
- 1.0.0 (2026-01-19): Initial release adapted from obra/superpowers