
Test-Driven Development (TDD)



Strict Red-Green-Refactor workflow for robust, self-documenting, production-ready code.

Quick Navigation

| Situation | Go To |
| --- | --- |
| New to this codebase | Step 1: Explore Environment |
| Know the framework, starting work | Step 2: Select Mode |
| Need the core loop reference | Step 3: Core TDD Loop |
| Complex edge cases to cover | Property-Based Testing |
| Tests are flaky/unreliable | Flaky Test Management |
| Need isolated test environment | Hermetic Testing |
| Measuring test quality | Mutation Testing |

The Three Rules (Robert C. Martin)

  • No Production Code without a failing test

  • Write Only Enough Test to Fail (compilation errors count)

  • Write Only Enough Code to Pass (no optimizations yet)

The Loop: 🔴 RED (write failing test) → 🟢 GREEN (minimal code to pass) → 🔵 REFACTOR (clean up) → Repeat
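As a concrete sketch, here is one iteration of the loop in Python (the `slugify` function and its test are hypothetical, invented for illustration):

```python
# RED: write one failing test first. Running it before slugify exists
# fails for the expected reason (NameError / assertion failure).
def test_slugify_replaces_spaces_with_hyphens():
    assert slugify("hello world") == "hello-world"

# GREEN: the minimal code that makes the test pass -- no extra features.
def slugify(text: str) -> str:
    return text.replace(" ", "-")

# REFACTOR: with the test green, rename and clean up freely, rerunning after each change.
test_slugify_replaces_spaces_with_hyphens()
```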

Step 1: Explore Test Environment

Do NOT assume anything. Explore the codebase first.

Checklist:

  • Search for test files: glob("**/*.test.*"), glob("**/*.spec.*"), glob("**/test_*.py")

  • Check package.json scripts, Makefile, or CI workflows

  • Look for config: vitest.config.*, jest.config.*, pytest.ini, Cargo.toml

Framework Detection:

| Language | Config Files | Test Command |
| --- | --- | --- |
| Node.js | package.json, vitest.config.* | npm test, bun test |
| Python | pyproject.toml, pytest.ini | pytest |
| Go | go.mod, *_test.go | go test ./... |
| Rust | Cargo.toml | cargo test |

Step 2: Select Mode

| Mode | When | First Action |
| --- | --- | --- |
| New Feature | Adding functionality | Read existing module tests, confirm green baseline |
| Bug Fix | Reproducing issue | Write failing reproduction test FIRST |
| Refactor | Cleaning code | Ensure ≥80% coverage on target code |
| Legacy | No tests exist | Add characterization tests before changing |

Tie-breaker: If coverage <20% or tests absent → use Legacy Mode first.

Mode: New Feature

  • Read existing tests for the module

  • Run tests to confirm green baseline

  • Enter Core Loop for new behavior

  • Commits: test(module): add test for X → feat(module): implement X

Mode: Bug Fix

  • Write failing reproduction test (MUST fail before fix)

  • Confirm failure is assertion error, not syntax error

  • Write minimal fix

  • Run full test suite

  • Commits: test: add failing test for bug #123 → fix: description (#123)
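A minimal sketch of the reproduction-first pattern (the `parse_price` bug is hypothetical):

```python
# Hypothetical bug report: parse_price("1,000") returned 1 instead of 1000.

# The minimal fix: strip thousands separators before converting.
def parse_price(text: str) -> int:
    return int(text.replace(",", ""))

# The reproduction test is written FIRST; it must fail with an
# AssertionError against the buggy implementation before the fix lands.
def test_parse_price_handles_thousands_separator():
    assert parse_price("1,000") == 1000

test_parse_price_handles_thousands_separator()
```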

Mode: Refactor

  • Run coverage on the specific function you'll refactor

  • If coverage <80% → add characterization tests first

  • Refactor in small steps (ONE change → run tests → repeat)

  • Never change behavior during refactor

Mode: Legacy Code

  • Find Seams - insertion points for tests (Sensing Seams, Separation Seams)

  • Break Dependencies - use Sprout Method or Wrap Method

  • Add characterization tests (capture current behavior)

  • Build safety net: happy path + error cases + boundaries

  • Then apply TDD for your changes

→ See references/examples.md for full code examples of each mode.

Step 3: The Core TDD Loop

Before Starting: Scenario List

List all behaviors to cover:

  • Happy path cases

  • Edge cases and boundaries

  • Error/failure cases

  • Pessimism: list three ways this could fail (network, null, invalid state)

🔴 RED Phase

  • Write ONE test (single behavior or edge case)

  • Use AAA: Arrange → Act → Assert

  • Run test, verify it FAILS for expected reason

Checks:

  • Is the failure an assertion error? (Not a SyntaxError or ModuleNotFoundError)

  • Can I explain why this should fail?

  • If test passes immediately → STOP. Test is broken or feature exists.
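An AAA-structured test might look like this (the `Cart` class is hypothetical and included only so the snippet is self-contained; in a real RED phase you would run the test before the class exists):

```python
# Included only so the snippet runs; in RED this would not exist yet.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name: str, price: int) -> None:
        self.items.append((name, price))

    def total(self) -> int:
        return sum(price for _, price in self.items)

def test_cart_total_sums_item_prices():
    # Arrange: build the object under test
    cart = Cart()
    cart.add("apple", 3)
    cart.add("pear", 2)
    # Act: exercise exactly one behavior
    total = cart.total()
    # Assert: one clear expectation, phrased as behavior
    assert total == 5

test_cart_total_sums_item_prices()
```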

🟢 GREEN Phase

  • Write minimal code to pass

  • Do NOT implement "perfect" solution

  • Verify test passes

Checks:

  • Is this the simplest solution?

  • Can I delete any of this code and still pass?

🔵 REFACTOR Phase

  • Look for duplication, unclear names, magic values

  • Clean up without changing behavior

  • Verify tests still pass

Repeat

Select next scenario, return to RED.

Triangulation: If implementation is too specific (hardcoded), write another test with different inputs to force generalization.
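Triangulation in miniature (hypothetical `add` function): the first test could be satisfied by a hardcoded return, so a second test with different inputs forces the general implementation.

```python
# After the first test, `return 4` would have passed; the second,
# triangulating test forces the real implementation.
def add(a: int, b: int) -> int:
    return a + b

def test_add_two_and_two():
    assert add(2, 2) == 4

def test_add_one_and_five():  # triangulating test with different inputs
    assert add(1, 5) == 6

test_add_two_and_two()
test_add_one_and_five()
```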

Stop Conditions

| Signal | Response |
| --- | --- |
| Test passes immediately | Check assertions, verify feature isn't already built |
| Test fails for wrong reason | Fix setup/imports first |
| Flaky test | STOP. Fix non-determinism immediately |
| Slow feedback (>5s) | Optimize or mock external calls |
| Coverage decreased | Add tests for uncovered paths |

Test Distribution: The Testing Trophy

The Testing Trophy (Kent C. Dodds) reflects modern testing reality: integration tests give the best confidence-to-effort ratio.

      _____________
     /   System    \      ← Few, slow, high confidence; brittle (E2E)
    /_______________\
   /                 \
  /    Integration    \   ← Real interactions between units — **BEST ROI** (Integration)
  \                   /
   \_________________/
     \    Unit     /      ← Fast & cheap but test in isolation (Unit)
      \___________/
      /   Static  \       ← Typecheck, linting — typos/types (Static)
     /_____________\

Layer Breakdown

| Layer | What | Tools | When |
| --- | --- | --- | --- |
| Static | Type errors, syntax, linting | TypeScript, ESLint | Always on, catches 50%+ of bugs for free |
| Unit | Pure functions, algorithms, utilities | vitest, jest, pytest | Isolated logic with no dependencies |
| Integration | Components + hooks + services together | Testing Library, MSW, Testcontainers | Real user flows, real(ish) data |
| E2E | Full app in browser | Playwright, Cypress | Critical paths only (login, checkout) |

Why Integration Tests Win

Unit tests prove code works in isolation. Integration tests prove code works together.

| Concern | Unit Test | Integration Test |
| --- | --- | --- |
| Component renders | ✅ | ✅ |
| Component + hook works | ❌ | ✅ |
| Component + API works | ❌ | ✅ |
| User flow works | ❌ | ✅ |
| Catches real bugs | Sometimes | Usually |

The insight: Most bugs live in the seams between modules, not inside pure functions. Integration tests catch seam bugs; unit tests don't.

Practical Guidance

  • Start with integration tests - Test the way users use your code

  • Drop to unit tests for complex algorithms or edge cases

  • Use E2E sparingly - Slow, flaky, expensive to maintain

  • Let static analysis do the heavy lifting - TypeScript catches more bugs than most unit tests

  • Prefer fakes over mocks - Fakes have real behavior; mocks just return canned data

  • SMURF quality: Sustainable, Maintainable, Useful, Resilient, Fast

Anti-Patterns

| Pattern | Problem | Fix |
| --- | --- | --- |
| Mirror Blindness | Same agent writes test AND code | State test intent before GREEN |
| Happy Path Bias | Only success scenarios | Include errors in Scenario List |
| Refactoring While Red | Changing structure with failing tests | Get to GREEN first |
| The Mockery | Over-mocking hides bugs | Prefer fakes or real implementations |
| Coverage Theater | Tests without meaningful assertions | Assert behavior, not lines |
| Multi-Test Step | Multiple tests before implementing | One test at a time |
| Verification Trap 🤖 | AI tests what the code does, not what it should do | State intent in plain language; separate agent review |
| Test Exploitation 🤖 | LLMs exploit weak assertions or overload operators | Use PBT alongside examples; strict equality |
| Assertion Omission 🤖 | Missing edge cases (null, undefined, boundaries) | Scenario list with errors; test.each |
| Hallucinated Mock 🤖 | AI generates mocks without proper setup | Testcontainers for integration; real Fakes for unit |

Critical: Verify tests by (1) running them, (2) having a separate agent review them, and (3) never trusting generated tests blindly.

Advanced Techniques

Use these techniques at specific points in your workflow:

Technique Use During Purpose

Test Doubles 🔴 RED phase Isolate dependencies when writing tests

Property-Based Testing 🔴 RED phase Cover edge cases for complex logic

Contract Testing 🔴 RED phase Define API expectations between services

Snapshot Testing 🔴 RED phase Capture UI/response structure

Hermetic Testing 🔵 Setup Ensure test isolation and determinism

Mutation Testing ✅ After GREEN Validate test suite effectiveness

Coverage Analysis ✅ After GREEN Find untested code paths

Flaky Test Management 🔧 Maintenance Fix unreliable tests blocking CI

Test Doubles (Use: Writing Tests with Dependencies)

When: Your code depends on something slow, unreliable, or complex (DB, API, filesystem).

| Type | Purpose | When |
| --- | --- | --- |
| Stub | Returns canned answers | Need specific return values |
| Mock | Verifies interactions | Need to verify calls made |
| Fake | Simplified implementation | Need real behavior without cost |
| Spy | Records calls | Need to observe without changing |

Decision: Dependency slow/unreliable? → Fake (complex) or Stub (simple). Need to verify calls? → Mock/Spy. Otherwise → real implementation.
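A stub and a spy side by side (the currency-conversion code and its `rates` dependency are hypothetical):

```python
# Code under test depends on an external rate service.
def price_in_eur(amount_usd: float, rates) -> float:
    return amount_usd * rates.usd_to_eur()

class StubRates:
    # Stub: returns a canned answer, no logic and no recording.
    def usd_to_eur(self) -> float:
        return 0.5

class SpyRates:
    # Spy: records calls so the test can observe the interaction.
    def __init__(self):
        self.calls = 0

    def usd_to_eur(self) -> float:
        self.calls += 1
        return 0.5
```

Usage: `price_in_eur(10.0, StubRates())` yields a predictable value for assertions, while a `SpyRates` instance lets a test assert the rate service was consulted exactly once.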

→ See references/examples.md → Test Double Examples

Hermetic Testing (Use: Test Environment Setup)

When: Setting up test infrastructure. Tests must be isolated and deterministic.

Principles:

  • Isolation: Unique temp directories/state per test

  • Reset: Clean up in setUp/tearDown

  • Determinism: No time-based logic or shared mutable state
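These principles can be sketched with stdlib `unittest` and a per-test temp directory (the file-writing test itself is hypothetical):

```python
import pathlib
import tempfile
import unittest

class SavesReport(unittest.TestCase):
    def setUp(self):
        # Isolation: each test gets its own fresh temp directory.
        self._tmp = tempfile.TemporaryDirectory()
        self.dir = pathlib.Path(self._tmp.name)

    def tearDown(self):
        # Reset: all state is destroyed after every test.
        self._tmp.cleanup()

    def test_report_written_to_disk(self):
        path = self.dir / "report.txt"
        path.write_text("ok")
        self.assertEqual(path.read_text(), "ok")
```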

Database Strategies:

| Strategy | Speed | Fidelity | Use When |
| --- | --- | --- | --- |
| In-memory (SQLite) | Fast | Low | Unit tests, simple queries |
| Testcontainers | Medium | High | Integration tests |
| Transactional rollback | Fast | High | Tests sharing schema (80x faster than TRUNCATE) |

→ See references/examples.md → Hermetic Testing Examples

Property-Based Testing (Use: Writing Tests for Complex Logic)

When: Writing tests for algorithms, state machines, serialization, or code with many edge cases.

Tools: fast-check (JS/TS), Hypothesis (Python), proptest (Rust)

Properties to Test:

  • Commutativity: f(a, b) == f(b, a)

  • Associativity: f(f(a, b), c) == f(a, f(b, c))

  • Identity: f(a, identity) == a

  • Round-trip: decode(encode(x)) == x

  • Metamorphic: If input changes by X, output changes by Y (useful when you don't know expected output)

How: Replace multiple example-based tests with one property test that generates random inputs.

Critical: Always log the seed on failure. Without it, you cannot reproduce the failing case.

→ See references/examples.md → Property-Based Testing Examples

Mutation Testing (Use: Validating Test Quality)

When: After tests pass, to verify they actually catch bugs. Use for critical code (auth, payments) or before major refactors.

Tools: Stryker (JS/TS), PIT (Java), mutmut (Python)

How: Tool mutates your code (e.g., changes > to >=). If tests still pass → your tests are weak.

Interpretation:

  • A mutation score ≥80% = good test suite

  • Survived mutants = tests don't catch those changes → add tests for these

Equivalent Mutant Problem: Some mutants change syntax but not behavior (e.g., i < 10 → i != 10 in a loop where i only increments). These can't be killed—100% score is often impossible. Focus on surviving mutants in critical paths, not chasing perfect scores.

When NOT to use: Tool-generated code (OpenAPI clients, Protobuf stubs, ORM models), simple DTOs/getters, legacy code with slow tests, or CI pipelines that must finish in <5 minutes. Use --incremental --since main for PR-focused runs. Note: This does NOT mean skip mutation testing on code you (the agent) wrote—always validate your own work.
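What a mutation tool does, in miniature (the `is_adult` function and its mutant are hypothetical; real tools generate mutants automatically):

```python
def is_adult(age: int) -> bool:
    return age >= 18

def mutant_is_adult(age: int) -> bool:
    return age > 18  # the tool flipped >= to >

# A weak test, far from the boundary, passes for both versions:
# the mutant SURVIVES, revealing the test's weakness.
def weak_test(fn) -> bool:
    return fn(30) is True

# A boundary test distinguishes the two: the mutant is KILLED.
def boundary_test(fn) -> bool:
    return fn(18) is True
```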

→ See references/examples.md → Mutation Testing Examples

Flaky Test Management (Use: CI/CD Maintenance)

When: Tests fail intermittently, blocking CI or eroding trust in the test suite.

Root Causes:

| Cause | Fix |
| --- | --- |
| Timing (setTimeout, races) | Fake timers, await properly |
| Shared state | Isolate per test |
| Randomness | Seed or mock |
| Network | Use MSW or fakes |
| Order dependency | Make tests independent |
| Parallel transaction conflicts | Isolate DB connections per worker |

How: Detect (--repeat 10) → Quarantine (separate suite) → Fix root cause → Restore

Quarantine Rules:

  • Issue-linked: Every quarantined test MUST link to a tracking issue. Prevents "quarantine-and-forget."

  • Mute, don't skip: Prefer muting (runs but doesn't fail build) over skipping. You still collect failure data.

  • Reintroduction criteria: Test must pass N consecutive runs (e.g., 100) on main before leaving quarantine.
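Fixing the randomness root cause usually means injecting a seeded RNG instead of relying on global random state (the discount sampler is hypothetical):

```python
import random

# Flaky: depends on global RNG state; assertions on its output
# may pass or fail depending on the run.
def sample_discount() -> int:
    return random.choice([5, 10, 15])

# Deterministic: the caller injects a seeded RNG, so tests control randomness.
def sample_discount_seeded(rng: random.Random) -> int:
    return rng.choice([5, 10, 15])

def test_discount_is_deterministic_with_seed():
    # Same seed in, same value out -- every run.
    assert sample_discount_seeded(random.Random(7)) == sample_discount_seeded(random.Random(7))

test_discount_is_deterministic_with_seed()
```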

→ See references/examples.md → Flaky Test Examples

Contract Testing (Use: Writing Tests for Service Boundaries)

When: Writing tests for code that calls or exposes APIs. Prevents integration breakage.

How (Pact): Consumer defines expected interactions → Contract published → Provider verifies → CI fails if contract broken.

→ See references/examples.md → Contract Testing Examples

Coverage Analysis (Use: Finding Gaps After Tests Pass)

When: After writing tests, to find untested code paths. NOT a goal in itself.

| Metric | Measures | Threshold |
| --- | --- | --- |
| Line | Lines executed | 70-80% |
| Branch | Decision paths | 60-70% |
| Mutation | Test effectiveness | 80% |

Risk-Based Prioritization: P0 (auth, payments) → P1 (core logic) → P2 (helpers) → P3 (config)

Warning: High coverage ≠ good tests. Tests must assert meaningful behavior.
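The coverage-vs-quality gap in two lines (hypothetical `apply_vat` function, assuming a 19% rate for illustration):

```python
def apply_vat(net: float) -> float:
    return round(net * 1.19, 2)  # hypothetical 19% VAT

# Coverage theater: executes the line (100% line coverage) but asserts nothing.
def weak_test():
    apply_vat(100.0)

# Meaningful: asserts observable behavior, including a boundary value.
def strong_test():
    assert apply_vat(100.0) == 119.0
    assert apply_vat(0.0) == 0.0
```

Both tests report identical coverage; only the second one would catch a broken rate.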

Snapshot Testing (Use: Writing Tests for UI/Output Structure)

When: Writing tests for UI components, API responses, or error message formats.

Appropriate: UI structure, API response shapes, error formats. Avoid: Behavior testing, dynamic content, entire pages.

How: Capture output once, verify it doesn't change unexpectedly. Always review diffs carefully.
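The capture-and-compare idea in its simplest form (hand-rolled sketch with a hypothetical error payload; real frameworks store and update snapshots for you):

```python
import json

def error_payload(code: int, msg: str) -> dict:
    return {"error": {"code": code, "message": msg}}

# The stored snapshot: captured once from a known-good run.
SNAPSHOT = '{"error": {"code": 404, "message": "not found"}}'

def test_error_payload_matches_snapshot():
    current = json.dumps(error_payload(404, "not found"))
    # Any structural change fails the test and forces a deliberate diff review.
    assert current == SNAPSHOT, "snapshot changed -- review the diff deliberately"

test_error_payload_matches_snapshot()
```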

→ See references/examples.md → Snapshot Testing Examples

Integration with Other Skills

| Task | Skill | Usage |
| --- | --- | --- |
| Committing | git-commit | test: for RED, feat: for GREEN |
| Code Quality | code-quality | Run during REFACTOR phase |
| Documentation | docs-check | Check if behavior changes need docs |

References

Foundational:

  • Three Rules of TDD - Robert C. Martin

  • Test Pyramid - Martin Fowler

  • Testing Trophy - Kent C. Dodds

  • Working Effectively with Legacy Code - Michael Feathers

Tools: Testcontainers | fast-check | Stryker | MSW | Pact
