Resilience Patterns
Patterns for building systems that tolerate failures, degrade gracefully, and recover automatically.
When to Use This Skill
- Implementing circuit breakers
- Designing retry strategies
- Isolating failures with bulkheads
- Building fault-tolerant systems
- Handling cascading failures
Why Resilience Matters
In distributed systems, failure is not exceptional—it's normal.
Networks fail. Services crash. Databases time out. The question isn't IF but WHEN.
Resilience = The ability to handle failures gracefully
Goals:
- Prevent cascading failures
- Degrade gracefully
- Recover automatically
- Maintain availability
Core Resilience Patterns
1. Retry Pattern

What: Automatically retry failed operations
When: Transient failures (network blips, temporary unavailability)

Simple retry:
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Request │────►│ Failure │────►│  Retry  │───► Success
└─────────┘     └─────────┘     └─────────┘

With backoff:
Request → Fail → Wait 100ms → Retry
        → Fail → Wait 200ms → Retry
        → Fail → Wait 400ms → Retry
        → Fail → Give up
Backoff strategies:
- Fixed: Wait same time each retry
- Linear: 100ms, 200ms, 300ms...
- Exponential: 100ms, 200ms, 400ms, 800ms...
- Exponential + Jitter: Add randomness to prevent thundering herd
Retry Best Practices
Do:
- Add jitter to prevent thundering herd
- Set maximum retry count
- Use exponential backoff
- Only retry transient failures
- Log retries for visibility
Don't:
- Retry non-idempotent operations blindly
- Retry client errors (400s)
- Retry indefinitely
- Use same delay for all retries
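The practices above can be sketched as a small retry helper. This is a minimal illustration, not a library API; the retryable exception set and delays are assumptions you would tune per dependency:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       retryable=(ConnectionError, TimeoutError)):
    """Retry a callable on transient errors with exponential backoff + jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the last error
            # Exponential backoff (100ms, 200ms, 400ms...) with full jitter
            # so many clients don't retry in lockstep (thundering herd).
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Usage: a flaky call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

print(retry_with_backoff(flaky))  # → ok
```

Note that only the listed transient exception types are retried; a 400-style client error should raise something outside `retryable` and fail immediately.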
2. Circuit Breaker Pattern

What: Stop calling a failing service temporarily
When: Service is consistently failing

States:

  ┌────────┐   failures exceed threshold   ┌────────┐
  │ CLOSED │──────────────────────────────►│  OPEN  │
  └────────┘                               └────────┘
       ▲                                     │    ▲
       │ success                     timeout │    │ failure
       │                                     ▼    │
       │              success          ┌───────────┐
       └───────────────────────────────│ HALF-OPEN │
                                       └───────────┘

CLOSED: Normal operation, requests flow through
OPEN: Failures exceeded threshold, fail fast
HALF-OPEN: Testing if the service has recovered
Circuit Breaker Configuration
Key parameters:
Failure threshold: How many failures to open
- Too low: Opens on minor issues
- Too high: Doesn't protect enough
- Typical: 5-10 failures or 50% error rate
Timeout (open duration): How long to stay open
- Too short: May retry too quickly
- Too long: Slow recovery
- Typical: 30-60 seconds
Success threshold: Successes to close from half-open
- Typically 1-3 successful requests
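A minimal sketch of the state machine and parameters described above (illustrative only; production code should use a library like Resilience4j or Polly, and this version closes from HALF-OPEN after a single success):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED → OPEN on failures, HALF-OPEN after timeout."""

    def __init__(self, failure_threshold=5, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.open_seconds:
                self.state = "HALF_OPEN"  # timeout elapsed: probe the dependency
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Any failure in HALF-OPEN, or too many in CLOSED, re-opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "CLOSED"
        return result
```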
3. Bulkhead Pattern

What: Isolate components to contain failures
When: Prevent one failure from taking down everything

Ship analogy:
┌─────────────────────────────────────────────┐
│ Ship without bulkheads                      │
│ ┌─────────────────────────────────────────┐ │
│ │    One hole → Entire ship floods        │ │
│ └─────────────────────────────────────────┘ │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ Ship with bulkheads                         │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐        │
│  │      │ │  X   │ │      │ │      │        │
│  │  OK  │ │Flood │ │  OK  │ │  OK  │        │
│  └──────┘ └──────┘ └──────┘ └──────┘        │
│  One compartment floods, others stay dry    │
└─────────────────────────────────────────────┘
Bulkhead Implementation
Thread pool isolation:
┌────────────────────────────────────────────────────────┐
│                      Application                       │
│                                                        │
│   ┌─────────────────┐        ┌─────────────────┐       │
│   │ Service A Pool  │        │ Service B Pool  │       │
│   │  [10 threads]   │        │  [10 threads]   │       │
│   └────────┬────────┘        └────────┬────────┘       │
│            │                          │                │
│            ▼                          ▼                │
│       Service A                  Service B             │
│        (slow)                    (healthy)             │
└────────────────────────────────────────────────────────┘

Service A being slow doesn't exhaust threads for Service B.
Semaphore isolation:
- Limit concurrent requests per dependency
- Lighter weight than thread pools
- Good for async operations
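Semaphore isolation can be sketched in a few lines. This is an illustrative example, not a library API; it rejects immediately when the bulkhead is full, which matches the fail-fast spirit of the pattern:

```python
import threading

class Bulkhead:
    """Semaphore-based bulkhead: cap concurrent calls to one dependency."""

    def __init__(self, max_concurrent=10):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject immediately when all slots are taken,
        # rather than queueing and tying up the caller's thread.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return operation()
        finally:
            self._sem.release()
```

Each dependency gets its own `Bulkhead` instance, so a slow Service A can at worst exhaust its own slots, never Service B's.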
4. Timeout Pattern

What: Limit how long to wait for operations
When: Always (every external call needs a timeout)
Without timeout: Request → Service hangs → Caller waits forever → Resources exhausted
With timeout: Request → Service hangs → Timeout after 5s → Caller handles failure
Timeout types:
- Connection timeout: Time to establish connection
- Read timeout: Time to receive response
- Overall timeout: Total time allowed
Timeout Best Practices
Setting timeouts:
- Connection: 1-5 seconds (fast fail)
- Read: Based on p99 latency + buffer
- Overall: Sum of connection + read + processing
Example:
Connection timeout: 2s
Read timeout: 10s (p99 is 5s, 2x buffer)
Overall timeout: 15s
Cascade consideration: If A calls B calls C:
- C timeout < B timeout < A timeout
- Each layer has buffer for retries
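When a client library doesn't expose its own timeouts, an overall timeout can be approximated by running the call in a worker thread. A sketch (illustrative; note the caveat in the docstring):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(operation, timeout_s):
    """Run operation in a worker thread and give up after timeout_s seconds.

    Caveat: the worker thread keeps running after the timeout fires; prefer
    the transport's own connect/read timeouts where they exist.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"operation exceeded {timeout_s}s")
    finally:
        pool.shutdown(wait=False)  # don't block the caller on the stuck worker
```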
5. Fallback Pattern

What: Provide an alternative when the primary fails
When: There's a degraded but acceptable alternative

Fallback options:
┌────────────────────────────────────────────────────────┐
│ Primary fails? Options:                                │
│                                                        │
│ 1. Cached data:         Return stale but valid data    │
│ 2. Default value:       Return safe default            │
│ 3. Degraded service:    Reduced functionality          │
│ 4. Alternative service: Different provider             │
│ 5. Graceful error:      Friendly error message         │
└────────────────────────────────────────────────────────┘

Example:
Primary:    Real-time price service
Fallback 1: Cached price (< 5 min old)
Fallback 2: Last known price with warning
Fallback 3: "Price temporarily unavailable"
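The price-service example can be sketched as a fallback chain. All names and data shapes here are hypothetical; the point is the ordered degradation:

```python
def get_price(symbol, primary, cache, last_known):
    """Try each source in order; degrade gracefully instead of failing hard."""
    try:
        return primary(symbol)              # primary: real-time price service
    except Exception:
        pass                                # fall through to cheaper sources
    cached = cache.get(symbol)              # fallback 1: recent cached price
    if cached is not None:
        return cached
    known = last_known.get(symbol)          # fallback 2: stale price, flagged
    if known is not None:
        return {"price": known, "stale": True}
    return {"error": "Price temporarily unavailable"}  # fallback 3: graceful error
```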
6. Rate Limiting Pattern

What: Control the rate of requests
When: Protect services from overload
Client-side rate limiting:
- Limit outgoing requests
- Prevent overwhelming dependencies
Server-side rate limiting:
- Limit incoming requests
- Protect from traffic spikes
See: rate-limiting-patterns skill for details
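For a flavor of client-side limiting, here is a minimal token-bucket sketch (illustrative; the linked skill covers real algorithms and trade-offs):

```python
import time

class TokenBucket:
    """Client-side limiter: allow up to `rate` requests/second, with bursts."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should back off or drop the request
```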
Pattern Combinations
Typical Resilience Stack
Request Flow:
┌─────────────────────────────────────────────────────────┐
│   ┌────────────┐                                        │
│   │  Timeout   │   ← Overall request timeout            │
│   │ ┌────────┐ │                                        │
│   │ │ Retry  │ │   ← With exponential backoff           │
│   │ │┌──────┐│ │                                        │
│   │ ││Circuit│ │   ← Fail fast if service down          │
│   │ ││Breaker│ │                                        │
│   │ │└──────┘│ │                                        │
│   │ │┌──────┐│ │                                        │
│   │ ││Bulk- ││ │   ← Isolate from other calls           │
│   │ ││head  ││ │                                        │
│   │ │└──────┘│ │                                        │
│   │ └────────┘ │                                        │
│   └────────────┘                                        │
│         │                                               │
│         ▼                                               │
│   ┌──────────┐                                          │
│   │ Service  │                                          │
│   └──────────┘                                          │
│         │                                               │
│      Failure? ──────► Fallback                          │
└─────────────────────────────────────────────────────────┘
Order of Application
Outer to inner:
1. Timeout: Overall time limit
2. Retry: Attempt recovery from transient failures
3. Circuit Breaker: Stop calling failed services
4. Bulkhead: Isolate this call from others
5. [Call service]
6. Fallback: Handle failures gracefully
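This outer-to-inner ordering can be sketched as a single wrapper. The circuit breaker and bulkhead layers are elided here (they would wrap `operation` itself), and the delays are hypothetical:

```python
import time

def resilient_call(operation, fallback, timeout_s=1.0, max_attempts=3,
                   base_delay=0.05):
    """Overall timeout budget → retry with backoff → call → fallback."""
    deadline = time.monotonic() + timeout_s      # 1. overall time limit
    for attempt in range(max_attempts):          # 2. retry
        try:
            return operation()                   # 3-5. (protected) service call
        except Exception:
            delay = base_delay * (2 ** attempt)
            if time.monotonic() + delay >= deadline:
                break                            # out of budget: stop retrying
            time.sleep(delay)
    return fallback()                            # 6. degrade gracefully
```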
Load Shedding
What is Load Shedding?
When system is overloaded:
- Accept what you can handle
- Reject the rest gracefully
- Better to serve some users well than all users poorly
Priority-based shedding:
- High priority: Never shed
- Medium: Shed under moderate load
- Low: Shed first
Implementation
Approaches:

1. Queue-based:
   - Fixed-size queue
   - Reject when queue full
   - Serve based on priority

2. Rate-based:
   - Maximum requests per second
   - Reject when exceeded
   - Return 503 or 429

3. Adaptive:
   - Monitor latency/error rate
   - Reduce acceptance as stress increases
   - Recover as system stabilizes
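Priority-based shedding (from the section above) can be sketched as a simple admission check. The load thresholds here are illustrative, not recommendations:

```python
class LoadShedder:
    """Priority-based shedding: reject low-priority work first as load rises."""

    def __init__(self, high_watermark=0.7, overload=0.9):
        self.high_watermark = high_watermark  # start shedding low priority
        self.overload = overload              # shed everything but high priority

    def accept(self, priority, load):
        """priority: 'high' | 'medium' | 'low'; load: 0.0 (idle) .. 1.0 (saturated)."""
        if load >= self.overload:
            return priority == "high"                 # only critical work
        if load >= self.high_watermark:
            return priority in ("high", "medium")     # shed low priority first
        return True                                   # normal load: accept all
```

Rejected requests should get a fast 503/429 so clients can back off, rather than queueing behind work the system can't finish.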
Graceful Degradation
Levels of Degradation
Level 0: Full functionality
└── Everything works normally

Level 1: Non-essential features disabled
└── Recommendations off, analytics delayed

Level 2: Reduced functionality
└── Read-only mode, cached data only

Level 3: Minimal functionality
└── Core features only, no personalization

Level 4: Maintenance mode
└── Static page, "be back soon"
Transition: Automatic based on system health metrics or manual via feature flags
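An automatic transition can be driven by a health-metric mapping like the one below. The thresholds are purely illustrative; real values come from your SLOs:

```python
def degradation_level(error_rate, p99_latency_ms):
    """Map health metrics to a degradation level (0 = full, 4 = maintenance)."""
    if error_rate > 0.50 or p99_latency_ms > 10_000:
        return 4  # maintenance mode
    if error_rate > 0.25 or p99_latency_ms > 5_000:
        return 3  # core features only
    if error_rate > 0.10 or p99_latency_ms > 2_000:
        return 2  # read-only / cached data
    if error_rate > 0.02 or p99_latency_ms > 1_000:
        return 1  # non-essential features off
    return 0      # full functionality
```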
Feature Degradation Examples
E-commerce during high load:
Full feature:
- Real-time inventory
- Personalized recommendations
- Live chat support
- Detailed analytics
Degraded:
- Cached inventory (5 min delay)
- Generic recommendations
- Contact form only
- Analytics queued
Minimal:
- Static "in stock" status
- No recommendations
- Email support only
- Analytics dropped
Health Checks
Types of Health Checks
1. Liveness check: "Is the process alive?"
   - Simple ping
   - Returns 200 if running
   - Used for restart decisions

2. Readiness check: "Can it handle traffic?"
   - Checks dependencies
   - Returns 200 if ready
   - Used by the load balancer

3. Deep health check: "Is everything working?"
   - Comprehensive checks
   - May be slower
   - Used for monitoring/debugging
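The liveness/readiness split can be sketched as two handlers. Names and response shapes are hypothetical; the key point is that liveness never touches dependencies, while readiness checks each one with its own timeout:

```python
def liveness():
    """Liveness: is the process up? Never block on dependencies here."""
    return {"status": "ok"}

def readiness(check_db, check_cache, timeout_s=1.0):
    """Readiness: can we serve traffic? Each dependency check gets a timeout."""
    checks = {}
    for name, check in (("db", check_db), ("cache", check_cache)):
        try:
            check(timeout_s)          # each check must respect timeout_s itself
            checks[name] = "ok"
        except Exception as exc:
            checks[name] = f"fail: {exc}"
    ready = all(v == "ok" for v in checks.values())
    # Serve this as HTTP 200 when ready, 503 otherwise; omit internals like
    # connection strings from the response body.
    return {"status": "ok" if ready else "unavailable", "checks": checks}
```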
Health Check Best Practices
Do:
- Keep liveness checks simple and fast
- Check all critical dependencies in readiness
- Include version/build info in response
- Return appropriate status codes
Don't:
- Block liveness on dependencies
- Include heavy operations in health checks
- Expose sensitive information
- Forget to handle dependency timeouts
Testing Resilience
How to Test
1. Unit tests:
   - Test each pattern in isolation
   - Mock failures
   - Verify behavior

2. Integration tests:
   - Test pattern combinations
   - Inject failures
   - Verify recovery

3. Chaos engineering:
   - Test in a production-like environment
   - Inject random failures
   - Verify system behavior
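One common building block for such tests is a fault-injecting proxy around a dependency call. A minimal sketch (hypothetical names; a seeded RNG keeps the test deterministic):

```python
import random

def flaky_proxy(operation, failure_rate, rng=None):
    """Wrap a dependency call and inject faults at the given rate."""
    rng = rng or random.Random(42)  # seeded for reproducible tests
    def wrapped():
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return operation()
    return wrapped
```

Wrap a real call with `flaky_proxy(call, 0.3)` in an integration test and assert that your retry/fallback stack still produces a valid response.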
See: chaos-engineering-fundamentals skill
Implementation Considerations
Library vs Custom
Libraries (recommended):
- Polly (.NET)
- Resilience4j (Java)
- Hystrix (Java, in maintenance mode; Netflix recommends Resilience4j)
- go-resilience (Go)
Benefits:
- Battle-tested
- Well-documented
- Community support
- Metrics built-in
Custom implementation:
- Only when specific needs
- High maintenance burden
- Risk of subtle bugs
Monitoring Resilience
Metrics to track:
Circuit Breaker:
- State changes
- Open duration
- Failure rate
Retries:
- Retry count
- Retry success rate
- Final success/failure
Bulkhead:
- Concurrent calls
- Rejections
- Queue depth
Timeouts:
- Timeout count
- Latency distribution
Best Practices
- Every external call needs a timeout: no call should wait forever
- Retry only transient failures: don't retry 400 errors
- Circuit breaker per dependency: different services need different protection
- Bulkhead critical paths: isolate important work from less important work
- Plan fallbacks: know what to do when things fail
- Monitor everything: you can't fix what you can't see
- Test failure paths: happy-path tests aren't enough
Related Skills
- chaos-engineering-fundamentals: Testing resilience
- distributed-transactions: Handling failures in transactions
- rate-limiting-patterns: Controlling request rates