shadow-mode

Shadow Mode Migration Pattern

Shadow mode mirrors production traffic to a new system without affecting users. The shadow system's responses are discarded — only the production response reaches the user — but both responses are logged and compared to validate correctness.

When to Use This Skill

Use this skill when... Use dual-write instead when...

Validating read behavior of a replacement service Both systems need to persist writes

Testing performance under real production load You need the new store to be authoritative

Comparing response correctness before cutover Migrating data stores that must stay in sync

Evaluating a new service version safely The new system needs to receive and store mutations

Load testing a new deployment with real traffic You need strong consistency between systems

Core Concepts

Traffic Flow

Client Request │ ▼ ┌─────────────┐ │ Router / │ │ Proxy │ ├──────┬──────┤ │ │ │ ▼ │ ▼ Prod │ Shadow System │ System │ │ │ ▼ │ ▼ Prod │ Shadow Response│ Response │ │ │ ▼ │ (discard) Client │ │ │ ▼ │ Compare & │ Log ▼

Shadow Modes

Mode Description Use case

Full mirror 100% of traffic duplicated Final validation before cutover

Sampled mirror Percentage of traffic (e.g., 10%) Early validation, capacity-constrained shadow

Selective mirror Specific request types or endpoints Targeted validation of changed behavior

Replay mirror Recorded traffic replayed offline Testing without live shadow infrastructure

Implementation Architecture

Key Components

Component Responsibility

Traffic splitter Duplicates requests to shadow system

Shadow router Forwards mirrored requests, manages timeouts

Response comparator Compares prod vs shadow responses

Discrepancy logger Records differences with full context

Metrics collector Tracks match rates, latency, error rates

Kill switch Disables shadow traffic instantly if issues arise

Deployment Topology

Topology How it works Trade-offs

Proxy-based Load balancer or API gateway mirrors requests Simple setup, adds proxy hop

Application-level Application code sends async copy of request Fine-grained control, code coupling

Infrastructure-level Service mesh (Istio, Linkerd) mirrors traffic No code changes, requires mesh

Log replay Capture request logs, replay against shadow No live infrastructure needed, not real-time

Implementation Patterns

Proxy-Based Mirroring

Configure the load balancer or API gateway to:

Forward the original request to the production backend
Clone the request and send it to the shadow backend
Return only the production response to the client
Shadow response is logged but never returned
Shadow request timeout is independent of production

Application-Level Mirroring

Intercept the incoming request at the application layer
Process the request normally through the production path
Asynchronously send a copy of the request to the shadow service
Do not block the production response on the shadow response
Compare responses in a background worker

Response Comparison Strategy

Compare responses field by field with configurable rules:

Field type Comparison approach

IDs, timestamps Ignore (expected to differ)

Computed values Compare within tolerance (e.g., floating point)

Collections Compare as sets (ignore ordering unless significant)

Status codes Exact match required

Error responses Categorize and compare error types

Headers Compare relevant headers only (Content-Type, Cache-Control)

Handling Stateful Requests

Shadow mode works best with read-only requests. For stateful (write) requests:

Approach Description

Skip writes Only mirror read requests to shadow

Isolated state Shadow has its own database seeded from production

Dry-run writes Shadow validates the write but does not persist

Record-only Log what shadow would have written, compare intent

Gradual Rollout

Phase Traffic % Duration Goal

Smoke test 1% Hours Verify shadow receives and processes requests
Canary 5-10% Days Identify obvious discrepancies
Validation 25-50% Days-weeks Build confidence in match rate
Full mirror 100% Days-weeks Final validation before cutover

Validation Metrics

Metric Target Description

Response match rate

99.9% Percentage of identical responses

Shadow latency (P50) Within 2x of prod Shadow performance baseline

Shadow latency (P99) Monitored Tail latency under real load

Shadow error rate < prod error rate Shadow should not produce more errors

Shadow availability Monitored Shadow uptime (not a blocker)

Discrepancy categories Trending to zero Known differences resolved over time

Common Pitfalls

Pitfall Mitigation

Shadow affects production performance Async mirroring, independent timeouts, kill switch

Shadow writes to shared resources Isolate shadow databases, queues, and external services

Non-deterministic responses cause false mismatches Configure comparison rules to ignore timestamps, IDs, nonces

Shadow receives stale data Seed shadow database from recent production snapshot

Traffic amplification overwhelms shadow Use sampled mirroring, auto-scaling, or circuit breakers

Request ordering differs between prod and shadow Compare request-by-request, not sequence-dependent

Authentication tokens expire for shadow Mint shadow-specific tokens or bypass auth in shadow

Integration with Dual Write

Shadow mode and dual write are complementary migration techniques:

Migration phase Technique Purpose

Early validation Shadow mode (reads) Verify the new system returns correct responses

Data sync Dual write Keep both stores authoritative during transition

Pre-cutover Both simultaneously Shadow validates reads, dual write maintains data

Cutover Dual write reversal New system becomes primary, old becomes secondary

Post-cutover Shadow mode (reversed) Mirror to old system to verify nothing broke

Strangler Fig Context

Both patterns are tactics within the broader Strangler Fig migration strategy:

Identify a component to migrate
Shadow traffic to validate the replacement
Dual write to synchronize data stores
Cut over reads, then writes
Decommission the old component
Repeat for the next component

Kill Switch Requirements

Shadow mode must have an immediate disable mechanism:

Feature flag or configuration toggle (no deployment required)
Disables within seconds, not minutes
Monitored — alerts if shadow causes production impact
Tested before enabling shadow traffic

Monitoring Checklist

Production latency impact (should be zero or negligible)
Shadow request success rate
Shadow response latency distribution
Response match rate by endpoint
Discrepancy log volume and categories
Shadow system resource utilization
Kill switch status and responsiveness

Agentic Optimizations

Context Approach

Architecture review Verify shadow isolation (no shared writes), kill switch exists

Code review Check async mirroring does not block production path

Implementation Start with proxy-based mirroring at 1%, increase gradually

Testing Verify kill switch works, confirm production is unaffected when shadow fails

Quick Reference

Term Definition

Shadow system The new system receiving mirrored traffic

Production system The live system serving real users

Traffic splitter Component that duplicates requests

Match rate Percentage of shadow responses matching production

Kill switch Mechanism to instantly disable shadow traffic

Dark launching Synonym for shadow mode — feature is live but invisible to users

Canary traffic Small percentage of mirrored requests for initial validation

Strangler fig Broader migration strategy of incrementally replacing components

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

ruff linting

imagemagick-conversion

jq json processing

api-testing