Shadow Mode Migration Pattern
Shadow mode mirrors production traffic to a new system without affecting users. Only the production response is returned to the user; the shadow system's response is never returned. Both responses are logged and compared to validate correctness.
When to Use This Skill
| Use this skill when... | Use dual-write instead when... |
|---|---|
| Validating read behavior of a replacement service | Both systems need to persist writes |
| Testing performance under real production load | You need the new store to be authoritative |
| Comparing response correctness before cutover | Migrating data stores that must stay in sync |
| Evaluating a new service version safely | The new system needs to receive and store mutations |
| Load testing a new deployment with real traffic | You need strong consistency between systems |
Core Concepts
Traffic Flow
            Client Request
                  │
                  ▼
           ┌─────────────┐
           │  Router /   │
           │   Proxy     │
           ├──────┬──────┤
           │      │      │
           ▼      │      ▼
         Prod     │   Shadow
        System    │   System
           │      │      │
           ▼      │      ▼
         Prod     │   Shadow
       Response   │  Response
           │      │      │
           ▼      │  (discard)
        Client    │      │
                  │      ▼
                  │  Compare &
                  │     Log
                  ▼
Shadow Modes
| Mode | Description | Use case |
|---|---|---|
| Full mirror | 100% of traffic duplicated | Final validation before cutover |
| Sampled mirror | Percentage of traffic (e.g., 10%) | Early validation, capacity-constrained shadow |
| Selective mirror | Specific request types or endpoints | Targeted validation of changed behavior |
| Replay mirror | Recorded traffic replayed offline | Testing without live shadow infrastructure |
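
The full, sampled, and selective modes can share one decision function. The following is a minimal Python sketch under stated assumptions: `MirrorPolicy`, `should_mirror`, and the hash-bucket scheme are hypothetical names chosen for illustration, not the API of any particular proxy or service mesh.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MirrorPolicy:
    """Hypothetical policy object; field names are illustrative."""
    sample_percent: float = 100.0  # 100.0 means full mirror
    endpoints: frozenset = field(default_factory=frozenset)  # empty means all


def should_mirror(policy: MirrorPolicy, request_id: str, endpoint: str) -> bool:
    """Decide whether one request is duplicated to the shadow system."""
    # Selective mirror: restrict to the configured endpoints, if any.
    if policy.endpoints and endpoint not in policy.endpoints:
        return False
    # Sampled mirror: hash the request ID into a stable bucket in [0, 10000),
    # so the same request ID always gets the same decision.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < policy.sample_percent * 100
```

Hashing the request ID rather than drawing a random number keeps the decision reproducible, which can make a specific discrepancy easier to trace later.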
Implementation Architecture
Key Components
| Component | Responsibility |
|---|---|
| Traffic splitter | Duplicates requests to shadow system |
| Shadow router | Forwards mirrored requests, manages timeouts |
| Response comparator | Compares prod vs shadow responses |
| Discrepancy logger | Records differences with full context |
| Metrics collector | Tracks match rates, latency, error rates |
| Kill switch | Disables shadow traffic instantly if issues arise |
Deployment Topology
| Topology | How it works | Trade-offs |
|---|---|---|
| Proxy-based | Load balancer or API gateway mirrors requests | Simple setup, adds proxy hop |
| Application-level | Application code sends async copy of request | Fine-grained control, code coupling |
| Infrastructure-level | Service mesh (Istio, Linkerd) mirrors traffic | No code changes, requires mesh |
| Log replay | Capture request logs, replay against shadow | No live infrastructure needed, not real-time |
Implementation Patterns
Proxy-Based Mirroring
Configure the load balancer or API gateway to:
- Forward the original request to the production backend
- Clone the request and send it to the shadow backend
- Return only the production response to the client
- Log the shadow response but never return it
- Apply a shadow request timeout that is independent of production
Application-Level Mirroring
- Intercept the incoming request at the application layer
- Process the request normally through the production path
- Asynchronously send a copy of the request to the shadow service
- Do not block the production response on the shadow response
- Compare responses in a background worker
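
The steps above can be sketched with `asyncio`. This is an illustrative sketch, not a definitive implementation: `handle_with_shadow`, `prod_handler`, `shadow_handler`, and `on_result` are hypothetical names, and the 0.5 second shadow timeout is an arbitrary assumption.

```python
import asyncio


async def handle_with_shadow(request, prod_handler, shadow_handler, on_result,
                             shadow_timeout=0.5):
    """Serve from production and mirror to shadow in the background.

    prod_handler, shadow_handler, and on_result are hypothetical callables
    supplied by the caller; on_result receives (request, prod, shadow, error).
    """
    prod_response = await prod_handler(request)

    async def mirror():
        try:
            # The shadow call has its own timeout, independent of production.
            shadow_response = await asyncio.wait_for(
                shadow_handler(request), timeout=shadow_timeout)
            on_result(request, prod_response, shadow_response, None)
        except Exception as exc:
            # Shadow failures are recorded, never raised to the caller.
            on_result(request, prod_response, None, exc)

    # Schedule the mirror without awaiting it, so the production response
    # is returned immediately. The caller should keep a reference to the
    # task, because the event loop only holds weak references to tasks.
    task = asyncio.create_task(mirror())
    return prod_response, task
```

In this sketch `on_result` stands in for the background comparison worker; in practice it might enqueue the pair of responses rather than compare them inline.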
Response Comparison Strategy
Compare responses field by field with configurable rules:
| Field type | Comparison approach |
|---|---|
| IDs, timestamps | Ignore (expected to differ) |
| Computed values | Compare within tolerance (e.g., floating point) |
| Collections | Compare as sets (ignore ordering unless significant) |
| Status codes | Exact match required |
| Error responses | Categorize and compare error types |
| Headers | Compare relevant headers only (Content-Type, Cache-Control) |
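
A field-by-field comparator following these rules might look like the sketch below. The response shape (`status` plus a flat `body` dict) and the field names in `IGNORED_FIELDS` and `UNORDERED_FIELDS` are assumptions made for illustration; a real comparator would load its rules from configuration and handle nested bodies and headers.

```python
import math

IGNORED_FIELDS = {"id", "request_id", "timestamp"}  # assumed field names
UNORDERED_FIELDS = {"tags"}  # assumed: ordering is not significant here
NUMERIC = (int, float)


def compare_responses(prod: dict, shadow: dict, rel_tol: float = 1e-6) -> list:
    """Return a list of human-readable discrepancies (empty means match)."""
    diffs = []
    # Status codes require an exact match.
    if prod.get("status") != shadow.get("status"):
        diffs.append(f"status: {prod.get('status')} != {shadow.get('status')}")
    prod_body, shadow_body = prod.get("body", {}), shadow.get("body", {})
    for key in sorted(set(prod_body) | set(shadow_body)):
        if key in IGNORED_FIELDS:
            continue
        if key not in prod_body or key not in shadow_body:
            diffs.append(f"{key}: present in only one response")
            continue
        a, b = prod_body[key], shadow_body[key]
        if key in UNORDERED_FIELDS and isinstance(a, list) and isinstance(b, list):
            # Order-insensitive comparison that still respects duplicates.
            equal = sorted(a, key=repr) == sorted(b, key=repr)
        elif (isinstance(a, NUMERIC) and isinstance(b, NUMERIC)
              and float in (type(a), type(b))):
            # Tolerance applies only when a float is involved.
            equal = math.isclose(a, b, rel_tol=rel_tol)
        else:
            equal = a == b
        if not equal:
            diffs.append(f"{key}: {a!r} != {b!r}")
    return diffs
```

Returning the list of differences, rather than a boolean, gives the discrepancy logger the context it needs to categorize mismatches.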
Handling Stateful Requests
Shadow mode works best with read-only requests. For stateful (write) requests:
| Approach | Description |
|---|---|
| Skip writes | Only mirror read requests to shadow |
| Isolated state | Shadow has its own database seeded from production |
| Dry-run writes | Shadow validates the write but does not persist |
| Record-only | Log what shadow would have written, compare intent |
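
One way to wire these approaches into the mirroring path is to decide per request whether to mirror and how to mark the copy. The sketch below assumes HTTP semantics, where `GET`, `HEAD`, and `OPTIONS` are treated as reads; the `X-Shadow-*` header names are hypothetical, and the shadow service would have to be written to honor them.

```python
from enum import Enum

SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class WriteStrategy(Enum):
    SKIP = "skip"            # only mirror read requests
    ISOLATED = "isolated"    # shadow writes to its own database
    DRY_RUN = "dry_run"      # shadow validates but does not persist
    RECORD_ONLY = "record"   # shadow logs the intended write for comparison


def plan_shadow_request(method: str, strategy: WriteStrategy):
    """Return (mirror, extra_headers) for one request."""
    headers = {"X-Shadow-Request": "true"}  # hypothetical marker header
    if method.upper() in SAFE_METHODS:
        return True, headers
    if strategy is WriteStrategy.SKIP:
        return False, {}
    if strategy is WriteStrategy.DRY_RUN:
        headers["X-Shadow-Dry-Run"] = "true"
    elif strategy is WriteStrategy.RECORD_ONLY:
        headers["X-Shadow-Record-Only"] = "true"
    # ISOLATED needs no extra header: isolation is a property of the shadow
    # deployment (its own database), not of the individual request.
    return True, headers
```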
Gradual Rollout
| Phase | Traffic % | Duration | Goal |
|---|---|---|---|
| Smoke test | 1% | Hours | Verify shadow receives and processes requests |
| Canary | 5-10% | Days | Identify obvious discrepancies |
| Validation | 25-50% | Days-weeks | Build confidence in match rate |
| Full mirror | 100% | Days-weeks | Final validation before cutover |
Validation Metrics
| Metric | Target | Description |
|---|---|---|
| Response match rate | 99.9% | Percentage of identical responses |
| Shadow latency (P50) | Within 2x of prod | Shadow performance baseline |
| Shadow latency (P99) | Monitored | Tail latency under real load |
| Shadow error rate | ≤ prod error rate | Shadow should not produce more errors |
| Shadow availability | Monitored | Shadow uptime (not a blocker) |
| Discrepancy categories | Trending to zero | Known differences resolved over time |
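
The blocking targets in this table can be checked mechanically. The sketch below is one possible reading: `percentile` uses the nearest-rank method, `evaluate_shadow` is a hypothetical helper, error counts are assumed to cover the same set of mirrored requests, and the monitored-only metrics (P99, availability) are left out.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_shadow(matches, total, prod_latencies, shadow_latencies,
                    prod_errors, shadow_errors):
    """Check the blocking targets; returns a dict of check name -> passed."""
    return {
        # Match rate target of 99.9%.
        "match_rate": matches / total >= 0.999,
        # Shadow median latency within 2x of the production median.
        "p50_latency": (percentile(shadow_latencies, 50)
                        <= 2 * percentile(prod_latencies, 50)),
        # Shadow should not produce more errors than production.
        "error_rate": shadow_errors <= prod_errors,
    }
```

For example, `evaluate_shadow(9995, 10000, [10, 12, 14], [15, 20, 30], 5, 3)` passes all three checks: the match rate is 99.95%, the shadow median of 20 is within twice the production median of 12, and shadow has fewer errors.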
Common Pitfalls
| Pitfall | Mitigation |
|---|---|
| Shadow affects production performance | Async mirroring, independent timeouts, kill switch |
| Shadow writes to shared resources | Isolate shadow databases, queues, and external services |
| Non-deterministic responses cause false mismatches | Configure comparison rules to ignore timestamps, IDs, nonces |
| Shadow receives stale data | Seed shadow database from recent production snapshot |
| Traffic amplification overwhelms shadow | Use sampled mirroring, auto-scaling, or circuit breakers |
| Request ordering differs between prod and shadow | Compare request-by-request, not sequence-dependent |
| Authentication tokens expire for shadow | Mint shadow-specific tokens or bypass auth in shadow |
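
As one illustration of the circuit-breaker mitigation, the sketch below stops mirroring after a run of consecutive shadow failures and resumes after a cooldown. `ShadowCircuitBreaker` and its thresholds are hypothetical; the injectable `clock` exists only to make the behavior testable.

```python
import time


class ShadowCircuitBreaker:
    """Stops mirroring after repeated shadow failures, then retries later."""

    def __init__(self, max_failures=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if the next request may be mirrored."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and try mirroring again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Record the outcome of one mirrored request."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

The breaker only ever suppresses shadow traffic, so an open breaker has no effect on the production path.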
Integration with Dual Write
Shadow mode and dual write are complementary migration techniques:
| Migration phase | Technique | Purpose |
|---|---|---|
| Early validation | Shadow mode (reads) | Verify the new system returns correct responses |
| Data sync | Dual write | Keep both stores authoritative during transition |
| Pre-cutover | Both simultaneously | Shadow validates reads, dual write maintains data |
| Cutover | Dual write reversal | New system becomes primary, old becomes secondary |
| Post-cutover | Shadow mode (reversed) | Mirror to old system to verify nothing broke |
Strangler Fig Context
Both patterns are tactics within the broader Strangler Fig migration strategy:
- Identify a component to migrate
- Shadow traffic to validate the replacement
- Dual write to synchronize data stores
- Cut over reads, then writes
- Decommission the old component
- Repeat for the next component
Kill Switch Requirements
Shadow mode must have an immediate disable mechanism:
- Feature flag or configuration toggle (no deployment required)
- Disables within seconds, not minutes
- Monitored, with alerts if shadow traffic affects production
- Tested before enabling shadow traffic
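
A flag-backed kill switch meeting these requirements could be sketched as follows. `KillSwitch` and `read_flag` are hypothetical: `read_flag` stands in for whatever feature-flag or configuration lookup is available, and the 2 second cache TTL is an assumption that bounds how long a change takes to apply.

```python
import time


class KillSwitch:
    """Caches a flag lookup briefly so changes take effect within seconds."""

    def __init__(self, read_flag, ttl_seconds=2.0, clock=time.monotonic):
        self._read_flag = read_flag  # hypothetical: returns True if enabled
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = False
        self._expires = float("-inf")  # forces a lookup on first use

    def shadow_enabled(self) -> bool:
        now = self._clock()
        if now >= self._expires:
            try:
                self._cached = bool(self._read_flag())
            except Exception:
                # Fail closed: if the flag cannot be read, stop mirroring.
                self._cached = False
            self._expires = now + self._ttl
        return self._cached
```

Failing closed is a design choice in this sketch: losing shadow traffic for a while is cheaper than mirroring when the switch cannot be consulted.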
Monitoring Checklist
- Production latency impact (should be zero or negligible)
- Shadow request success rate
- Shadow response latency distribution
- Response match rate by endpoint
- Discrepancy log volume and categories
- Shadow system resource utilization
- Kill switch status and responsiveness
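
Two of the checklist items, match rate by endpoint and discrepancy categories, reduce to simple counters. The sketch below keeps them in memory for illustration only; `ShadowStats` is a hypothetical name, and a real deployment would more likely export these counts to its existing metrics system.

```python
from collections import Counter, defaultdict


class ShadowStats:
    """In-memory per-endpoint counters for shadow comparisons."""

    def __init__(self):
        self.totals = Counter()
        self.matches = Counter()
        self.categories = defaultdict(Counter)

    def record(self, endpoint, matched, category=None):
        """Record one comparison; category labels a mismatch, if known."""
        self.totals[endpoint] += 1
        if matched:
            self.matches[endpoint] += 1
        elif category:
            self.categories[endpoint][category] += 1

    def match_rate(self, endpoint):
        """Return the match rate, or None if the endpoint has no samples."""
        total = self.totals[endpoint]
        return self.matches[endpoint] / total if total else None

    def lowest_match_rates(self, n=3):
        """Return up to n (endpoint, rate) pairs, worst first."""
        rates = [(self.match_rate(e), e) for e in self.totals]
        return [(e, r) for r, e in sorted(rates)[:n]]
```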
Agentic Optimizations
| Context | Approach |
|---|---|
| Architecture review | Verify shadow isolation (no shared writes), kill switch exists |
| Code review | Check async mirroring does not block production path |
| Implementation | Start with proxy-based mirroring at 1%, increase gradually |
| Testing | Verify kill switch works, confirm production is unaffected when shadow fails |
Quick Reference
| Term | Definition |
|---|---|
| Shadow system | The new system receiving mirrored traffic |
| Production system | The live system serving real users |
| Traffic splitter | Component that duplicates requests |
| Match rate | Percentage of shadow responses matching production |
| Kill switch | Mechanism to instantly disable shadow traffic |
| Dark launching | Closely related term, often used interchangeably with shadow mode: the feature is live but invisible to users |
| Canary traffic | Small percentage of mirrored requests for initial validation |
| Strangler fig | Broader migration strategy of incrementally replacing components |