# Resilience Patterns Skill
Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.
## Overview

Use this skill when:

- Building fault-tolerant multi-agent systems
- Implementing LLM API integrations with proper error handling
- Designing distributed workflows that need graceful degradation
- Adding observability to failure scenarios
- Protecting systems from cascade failures
## Core Patterns
### Circuit Breaker Pattern (reference: circuit-breaker.md)
Prevents cascade failures by "tripping" when a service exceeds failure thresholds.
```
+-----------------------------------------------------------------+
|                     Circuit Breaker States                       |
+-----------------------------------------------------------------+
|                                                                   |
|   +----------+  failures >= threshold   +----------+             |
|   |  CLOSED  | -----------------------> |   OPEN   |             |
|   | (normal) |                          | (reject) |             |
|   +----+-----+                          +----+-----+             |
|        ^                                     |                   |
|        | success                     timeout |                   |
|        |                             expires |                   |
|        |        +------------+               |                   |
|        +--------| HALF_OPEN  |<--------------+                   |
|                 |  (probe)   |                                   |
|                 +------------+                                   |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-----------------------------------------------------------------+
```
**Key Configuration:**

- `failure_threshold`: Failures before opening (default: 5)
- `recovery_timeout`: Seconds before attempting recovery (default: 30)
- `half_open_requests`: Probes to allow in half-open (default: 1)
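A minimal sketch of the state machine these settings drive (a hypothetical class for illustration, not the shipped `scripts/circuit-breaker.py`):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call while OPEN."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "HALF_OPEN"  # recovery timeout expired: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or crossing the threshold, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"  # any success closes the circuit
        return result
```

Callers catch `CircuitOpenError` and return a fallback instead of blocking on a dead dependency.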
### Bulkhead Pattern (reference: bulkhead-pattern.md)
Isolates failures by partitioning resources into independent pools.
```
+-----------------------------------------------------------------+
|                       Bulkhead Isolation                         |
+-----------------------------------------------------------------+
|                                                                   |
|  +------------------+        +------------------+                 |
|  | TIER 1: Critical |        | TIER 2: Standard |                 |
|  |   (5 workers)    |        |   (3 workers)    |                 |
|  | +-+ +-+ +-+      |        | +-+ +-+ +-+      |                 |
|  | |#| |#| | |      |        | |#| | | | |      |                 |
|  | +-+ +-+ +-+      |        | +-+ +-+ +-+      |                 |
|  | +-+ +-+          |        |                  |                 |
|  | | | | |          |        | Queue: 2         |                 |
|  | +-+ +-+          |        +------------------+                 |
|  | Queue: 0         |                                             |
|  +------------------+                                             |
|                                                                   |
|  +------------------+                                             |
|  | TIER 3: Optional |   # = Active request                        |
|  |   (2 workers)    |   | | = Available slot                      |
|  | +-+ +-+          |                                             |
|  | |#| |#|  FULL!   |   Tier 1: synthesis, quality_gate           |
|  | +-+ +-+          |   Tier 2: analysis agents                   |
|  | Queue: 5         |   Tier 3: enrichment, optional features     |
|  +------------------+                                             |
|                                                                   |
+-----------------------------------------------------------------+
```
**Tier Configuration (OrchestKit):**

| Tier | Workers | Queue | Timeout | Use Case |
|------|---------|-------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
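A minimal asyncio sketch of tiered semaphores matching the table above (worker counts and timeouts are assumed from the table; bounding the wait queue needs extra bookkeeping not shown, and this is distinct from the shipped `scripts/bulkhead.py`):

```python
import asyncio

# Hypothetical tier limits mirroring the table above.
TIER_SLOTS = {
    1: asyncio.Semaphore(5),  # critical: synthesis, quality gate
    2: asyncio.Semaphore(3),  # standard: analysis agents
    3: asyncio.Semaphore(2),  # optional: enrichment, caching
}
TIER_TIMEOUTS = {1: 300, 2: 120, 3: 60}  # seconds

async def run_in_tier(tier: int, coro_fn, *args):
    """Run coro_fn in its tier's pool, bounded by the tier timeout."""
    async with TIER_SLOTS[tier]:
        return await asyncio.wait_for(coro_fn(*args), timeout=TIER_TIMEOUTS[tier])
```

Because each tier acquires its own pool, a flood of Tier 3 enrichment work can never exhaust the workers that synthesis depends on.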
### Retry Strategies (reference: retry-strategies.md)
Intelligent retry logic with exponential backoff and jitter.
```
+-----------------------------------------------------------------+
|                  Exponential Backoff + Jitter                     |
+-----------------------------------------------------------------+
|                                                                   |
|  Attempt 1: --> X (fail)                                          |
|              wait: 1s +/- 0.5s                                    |
|                                                                   |
|  Attempt 2: --> X (fail)                                          |
|              wait: 2s +/- 1s                                      |
|                                                                   |
|  Attempt 3: --> X (fail)                                          |
|              wait: 4s +/- 2s                                      |
|                                                                   |
|  Attempt 4: --> OK (success)                                      |
|                                                                   |
|  Formula: delay = min(base * 2^attempt, max_delay) * jitter       |
|  Jitter:  random(0.5, 1.5) to prevent thundering herd             |
|                                                                   |
+-----------------------------------------------------------------+
```
**Error Classification for Retries:**
```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,    # HTTP status codes
    ConnectionError, TimeoutError,   # Network errors
    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",       # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,              # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
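A minimal sketch of the backoff loop, applying the formula and jitter from the diagram (a hypothetical helper, simpler than the shipped `scripts/retry-handler.py`; it retries only the network exception types from the set above):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base=1.0, max_delay=30.0):
    """Call fn(), retrying transient errors with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            # delay = min(base * 2^attempt, max_delay) * jitter
            delay = min(base * 2 ** attempt, max_delay) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter factor plus the `max_delay` cap keeps a fleet of clients from retrying in lockstep against a recovering service.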
### LLM-Specific Resilience (reference: llm-resilience.md)
Patterns specific to LLM API integrations.
```
+-----------------------------------------------------------------+
|                        LLM Fallback Chain                         |
+-----------------------------------------------------------------+
|                                                                   |
|  Request --> [Primary Model] --success--> Response                |
|                    |                                              |
|                    | fail                                         |
|                    v                                              |
|              [Fallback Model] --success--> Response               |
|                    |                                              |
|                    | fail                                         |
|                    v                                              |
|              [Cached Response] --hit--> Response                  |
|                    |                                              |
|                    | miss                                         |
|                    v                                              |
|              [Default Response] --> Graceful Degradation          |
|                                                                   |
|  Example Chain:                                                   |
|  1. claude-sonnet-4-5-20251101 (primary)                          |
|  2. gpt-5.2-mini (fallback)                                       |
|  3. Semantic cache lookup                                         |
|  4. "Analysis unavailable" + partial results                      |
|                                                                   |
+-----------------------------------------------------------------+
```
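A minimal sketch of walking that chain (`call_model` and `cache` are assumed interfaces, not the shipped `scripts/llm-fallback-chain.py`):

```python
def with_fallbacks(prompt, models, cache, call_model):
    """Try each model in order, then the cache, then degrade gracefully."""
    for model in models:
        try:
            return call_model(model, prompt)  # e.g. primary, then fallback model
        except Exception:
            continue  # any failure moves to the next link in the chain
    cached = cache.get(prompt)  # semantic cache lookup
    if cached is not None:
        return cached
    return {"status": "degraded", "message": "Analysis unavailable"}
```

Each link only runs when the previous one fails, so the happy path pays no extra cost.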
**Token Budget Management:**
```
+-----------------------------------------------------------------+
|                        Token Budget Guard                         |
+-----------------------------------------------------------------+
|                                                                   |
|  Input: 8,000 tokens                                              |
|  +---------------------------------------------+                  |
|  |#################################            |                  |
|  +---------------------------------------------+                  |
|                                                ^                  |
|                                                |                  |
|                              Context Limit (16K)                  |
|                                                                   |
|  Strategy when approaching limit:                                 |
|  1. Summarize earlier context (compress 4:1)                      |
|  2. Drop low-priority content (optional fields)                   |
|  3. Split into multiple requests                                  |
|  4. Fail fast with "content too large" error                      |
|                                                                   |
+-----------------------------------------------------------------+
```
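A minimal sketch of the guard, trying the strategies in order (`count_tokens` and `summarize` are assumed callables; the 16K limit matches the diagram, and strategies 2 and 3 are omitted for brevity):

```python
CONTEXT_LIMIT = 16_000  # assumed model context window, in tokens
HEADROOM = 0.8          # start degrading before the hard limit

def guard_tokens(text, count_tokens, summarize):
    """Return text that fits the budget: pass through, compress, or fail fast."""
    budget = int(CONTEXT_LIMIT * HEADROOM)
    if count_tokens(text) <= budget:
        return text
    compressed = summarize(text)  # strategy 1: compress earlier context (~4:1)
    if count_tokens(compressed) <= budget:
        return compressed
    raise ValueError("content too large")  # strategy 4: fail fast
```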
## Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |
## OrchestKit Integration Points

- **Workflow Agents**: Each agent wrapped with circuit breaker + bulkhead tier
- **LLM Calls**: All model invocations use fallback chain + retry logic
- **External APIs**: Circuit breaker on YouTube, arXiv, GitHub APIs
- **Database Ops**: Bulkhead isolation for read vs. write operations
## Files in This Skill

### References (Conceptual Guides)

- `references/circuit-breaker.md`: Deep dive on the circuit breaker pattern
- `references/bulkhead-pattern.md`: Bulkhead isolation strategies
- `references/retry-strategies.md`: Retry algorithms and error classification
- `references/llm-resilience.md`: LLM-specific patterns
- `references/error-classification.md`: How to categorize errors
### Templates (Code Patterns)

- `scripts/circuit-breaker.py`: Ready-to-use circuit breaker class
- `scripts/bulkhead.py`: Semaphore-based bulkhead implementation
- `scripts/retry-handler.py`: Configurable retry decorator
- `scripts/llm-fallback-chain.py`: Multi-model fallback pattern
- `scripts/token-budget.py`: Token budget guard implementation
### Examples

- `examples/orchestkit-workflow-resilience.md`: Full OrchestKit integration example
### Checklists

- `checklists/pre-deployment-resilience.md`: Production readiness checklist
- `checklists/circuit-breaker-setup.md`: Circuit breaker configuration guide
## 2026 Best Practices

- **Adaptive Thresholds**: Use sliding windows, not fixed counters (see the sketch after this list)
- **Observability First**: Every circuit trip = alert + metric + trace
- **Graceful Degradation**: Always have a fallback, even if partial
- **Health Endpoints**: Separate health checks from circuit state
- **Chaos Testing**: Regularly test failure scenarios in staging
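A minimal sketch of the sliding-window threshold from the first bullet (a hypothetical class; `window`, `rate`, and `min_calls` are assumed parameters):

```python
import time
from collections import deque

class SlidingWindowThreshold:
    """Trip on failure *rate* over a recent window, not a lifetime counter."""

    def __init__(self, window=60.0, rate=0.5, min_calls=10):
        self.window = window        # seconds of history to consider
        self.rate = rate            # failure ratio that trips the breaker
        self.min_calls = min_calls  # avoid tripping on tiny samples
        self.events = deque()       # (timestamp, failed) pairs

    def record(self, failed: bool) -> None:
        now = time.monotonic()
        self.events.append((now, failed))
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()  # expire events outside the window

    def should_trip(self) -> bool:
        if len(self.events) < self.min_calls:
            return False
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events) >= self.rate
```

Unlike a fixed counter, old failures age out, so the breaker does not trip on stale history after hours of healthy traffic.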
## Related Skills

- `observability-monitoring`: Metrics and alerting for circuit breaker state changes
- `caching-strategies`: Cache as fallback layer in degradation scenarios
- `error-handling-rfc9457`: Structured error responses for resilience failures
- `background-jobs`: Async processing with retry and failure handling
## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |
## Capability Details

### circuit-breaker

**Keywords:** circuit breaker, failure threshold, cascade failure, trip, half-open

**Solves:**

- Prevent cascade failures when external services fail
- Automatically recover when services come back online
- Fail fast instead of waiting for timeouts
### bulkhead

**Keywords:** bulkhead, isolation, semaphore, thread pool, resource pool, tier

**Solves:**

- Isolate failures to prevent entire system crashes
- Prioritize critical operations over optional ones
- Limit concurrent requests to protect resources
### retry-strategies

**Keywords:** retry, backoff, exponential, jitter, thundering herd

**Solves:**

- Handle transient failures automatically
- Avoid overwhelming recovering services
- Classify errors as retryable vs. non-retryable
### llm-resilience

**Keywords:** LLM, fallback, model, token budget, rate limit, context length

**Solves:**

- Handle LLM API rate limits gracefully
- Fall back to alternative models when the primary fails
- Manage token budgets to prevent context overflow
### error-classification

**Keywords:** error, retryable, transient, permanent, classification

**Solves:**

- Determine which errors should be retried
- Categorize errors by severity and recoverability
- Map HTTP status codes to resilience actions