# Resilience Patterns Skill
Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.
## Overview

Use this skill when:

- Building fault-tolerant multi-agent systems
- Implementing LLM API integrations with proper error handling
- Designing distributed workflows that need graceful degradation
- Adding observability to failure scenarios
- Protecting systems from cascade failures
## Core Patterns
### Circuit Breaker Pattern (reference: circuit-breaker.md)
Prevents cascade failures by "tripping" when a service exceeds failure thresholds.
```
+-----------------------------------------------------------------+
|                     Circuit Breaker States                       |
+-----------------------------------------------------------------+
|                                                                   |
|   +----------+  failures >= threshold   +----------+             |
|   |  CLOSED  | -----------------------> |   OPEN   |             |
|   | (normal) |                          | (reject) |             |
|   +----+-----+                          +----+-----+             |
|        ^                                     |                   |
|        | success                     timeout |                   |
|        |                             expires |                   |
|        |        +------------+               |                   |
|        +--------| HALF_OPEN  |<--------------+                   |
|                 |  (probe)   |                                   |
|                 +------------+                                   |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-----------------------------------------------------------------+
```
**Key Configuration:**

- `failure_threshold`: Failures before opening (default: 5)
- `recovery_timeout`: Seconds before attempting recovery (default: 30)
- `half_open_requests`: Probes to allow in half-open (default: 1)
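A minimal sketch of the state machine these settings drive (a hypothetical class for illustration, not the shipped `scripts/circuit-breaker.py`):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call while OPEN."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "HALF_OPEN"  # recovery timeout expired: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or crossing the threshold, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"  # any success closes the circuit
        return result
```

Callers catch `CircuitOpenError` and return a fallback instead of blocking on a dead dependency.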
### Bulkhead Pattern (reference: bulkhead-pattern.md)
Isolates failures by partitioning resources into independent pools.
```
+-----------------------------------------------------------------+
|                       Bulkhead Isolation                         |
+-----------------------------------------------------------------+
|                                                                   |
|  +------------------+        +------------------+                 |
|  | TIER 1: Critical |        | TIER 2: Standard |                 |
|  |   (5 workers)    |        |   (3 workers)    |                 |
|  | +-+ +-+ +-+      |        | +-+ +-+ +-+      |                 |
|  | |#| |#| | |      |        | |#| | | | |      |                 |
|  | +-+ +-+ +-+      |        | +-+ +-+ +-+      |                 |
|  | +-+ +-+          |        |                  |                 |
|  | | | | |          |        | Queue: 2         |                 |
|  | +-+ +-+          |        +------------------+                 |
|  | Queue: 0         |                                             |
|  +------------------+                                             |
|                                                                   |
|  +------------------+                                             |
|  | TIER 3: Optional |   # = Active request                        |
|  |   (2 workers)    |   | | = Available slot                      |
|  | +-+ +-+          |                                             |
|  | |#| |#|  FULL!   |   Tier 1: synthesis, quality_gate           |
|  | +-+ +-+          |   Tier 2: analysis agents                   |
|  | Queue: 5         |   Tier 3: enrichment, optional features     |
|  +------------------+                                             |
|                                                                   |
+-----------------------------------------------------------------+
```
**Tier Configuration (OrchestKit):**

| Tier | Workers | Queue | Timeout | Use Case |
|------|---------|-------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
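A minimal asyncio sketch of tiered semaphores matching the table above (worker counts and timeouts are assumed from the table; bounding the wait queue needs extra bookkeeping not shown, and this is distinct from the shipped `scripts/bulkhead.py`):

```python
import asyncio

# Hypothetical tier limits mirroring the table above.
TIER_SLOTS = {
    1: asyncio.Semaphore(5),  # critical: synthesis, quality gate
    2: asyncio.Semaphore(3),  # standard: analysis agents
    3: asyncio.Semaphore(2),  # optional: enrichment, caching
}
TIER_TIMEOUTS = {1: 300, 2: 120, 3: 60}  # seconds

async def run_in_tier(tier: int, coro_fn, *args):
    """Run coro_fn in its tier's pool, bounded by the tier timeout."""
    async with TIER_SLOTS[tier]:
        return await asyncio.wait_for(coro_fn(*args), timeout=TIER_TIMEOUTS[tier])
```

Because each tier acquires its own pool, a flood of Tier 3 enrichment work can never exhaust the workers that synthesis depends on.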
### Retry Strategies (reference: retry-strategies.md)
Intelligent retry logic with exponential backoff and jitter.
```
+-----------------------------------------------------------------+
|                  Exponential Backoff + Jitter                     |
+-----------------------------------------------------------------+
|                                                                   |
|  Attempt 1: --> X (fail)                                          |
|              wait: 1s +/- 0.5s                                    |
|                                                                   |
|  Attempt 2: --> X (fail)                                          |
|              wait: 2s +/- 1s                                      |
|                                                                   |
|  Attempt 3: --> X (fail)                                          |
|              wait: 4s +/- 2s                                      |
|                                                                   |
|  Attempt 4: --> OK (success)                                      |
|                                                                   |
|  Formula: delay = min(base * 2^attempt, max_delay) * jitter       |
|  Jitter:  random(0.5, 1.5) to prevent thundering herd             |
|                                                                   |
+-----------------------------------------------------------------+
```
**Error Classification for Retries:**
```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,    # HTTP status codes
    ConnectionError, TimeoutError,   # Network errors
    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",       # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,              # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
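A minimal sketch of the backoff loop, applying the formula and jitter from the diagram (a hypothetical helper, simpler than the shipped `scripts/retry-handler.py`; it retries only the network exception types from the set above):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=4, base=1.0, max_delay=30.0):
    """Call fn(), retrying transient errors with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            # delay = min(base * 2^attempt, max_delay) * jitter
            delay = min(base * 2 ** attempt, max_delay) * random.uniform(0.5, 1.5)
            time.sleep(delay)
```

The jitter factor plus the `max_delay` cap keeps a fleet of clients from retrying in lockstep against a recovering service.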
### LLM-Specific Resilience (reference: llm-resilience.md)
Patterns specific to LLM API integrations.
```
+-----------------------------------------------------------------+
|                        LLM Fallback Chain                         |
+-----------------------------------------------------------------+
|                                                                   |
|  Request --> [Primary Model] --success--> Response                |
|                    |                                              |
|                    | fail                                         |
|                    v                                              |
|              [Fallback Model] --success--> Response               |
|                    |                                              |
|                    | fail                                         |
|                    v                                              |
|              [Cached Response] --hit--> Response                  |
|                    |                                              |
|                    | miss                                         |
|                    v                                              |
|              [Default Response] --> Graceful Degradation          |
|                                                                   |
|  Example Chain:                                                   |
|  1. claude-sonnet-4-5-20251101 (primary)                          |
|  2. gpt-5.2-mini (fallback)                                       |
|  3. Semantic cache lookup                                         |
|  4. "Analysis unavailable" + partial results                      |
|                                                                   |
+-----------------------------------------------------------------+
```
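A minimal sketch of walking that chain (`call_model` and `cache` are assumed interfaces, not the shipped `scripts/llm-fallback-chain.py`):

```python
def with_fallbacks(prompt, models, cache, call_model):
    """Try each model in order, then the cache, then degrade gracefully."""
    for model in models:
        try:
            return call_model(model, prompt)  # e.g. primary, then fallback model
        except Exception:
            continue  # any failure moves to the next link in the chain
    cached = cache.get(prompt)  # semantic cache lookup
    if cached is not None:
        return cached
    return {"status": "degraded", "message": "Analysis unavailable"}
```

Each link only runs when the previous one fails, so the happy path pays no extra cost.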
**Token Budget Management:**
```
+-----------------------------------------------------------------+
|                        Token Budget Guard                         |
+-----------------------------------------------------------------+
|                                                                   |
|  Input: 8,000 tokens                                              |
|  +---------------------------------------------+                  |
|  |#################################            |                  |
|  +---------------------------------------------+                  |
|                                                ^                  |
|                                                |                  |
|                              Context Limit (16K)                  |
|                                                                   |
|  Strategy when approaching limit:                                 |
|  1. Summarize earlier context (compress 4:1)                      |
|  2. Drop low-priority content (optional fields)                   |
|  3. Split into multiple requests                                  |
|  4. Fail fast with "content too large" error                      |
|                                                                   |
+-----------------------------------------------------------------+
```
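A minimal sketch of the guard, trying the strategies in order (`count_tokens` and `summarize` are assumed callables; the 16K limit matches the diagram, and strategies 2 and 3 are omitted for brevity):

```python
CONTEXT_LIMIT = 16_000  # assumed model context window, in tokens
HEADROOM = 0.8          # start degrading before the hard limit

def guard_tokens(text, count_tokens, summarize):
    """Return text that fits the budget: pass through, compress, or fail fast."""
    budget = int(CONTEXT_LIMIT * HEADROOM)
    if count_tokens(text) <= budget:
        return text
    compressed = summarize(text)  # strategy 1: compress earlier context (~4:1)
    if count_tokens(compressed) <= budget:
        return compressed
    raise ValueError("content too large")  # strategy 4: fail fast
```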
## Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |
## OrchestKit Integration Points

- **Workflow Agents**: Each agent wrapped with circuit breaker + bulkhead tier
- **LLM Calls**: All model invocations use fallback chain + retry logic
- **External APIs**: Circuit breaker on YouTube, arXiv, GitHub APIs
- **Database Ops**: Bulkhead isolation for read vs. write operations
## Files in This Skill

### References (Conceptual Guides)

- `references/circuit-breaker.md`: Deep dive on the circuit breaker pattern
- `references/bulkhead-pattern.md`: Bulkhead isolation strategies
- `references/retry-strategies.md`: Retry algorithms and error classification
- `references/llm-resilience.md`: LLM-specific patterns
- `references/error-classification.md`: How to categorize errors
### Templates (Code Patterns)

- `scripts/circuit-breaker.py`: Ready-to-use circuit breaker class
- `scripts/bulkhead.py`: Semaphore-based bulkhead implementation
- `scripts/retry-handler.py`: Configurable retry decorator
- `scripts/llm-fallback-chain.py`: Multi-model fallback pattern
- `scripts/token-budget.py`: Token budget guard implementation
### Examples

- `examples/orchestkit-workflow-resilience.md`: Full OrchestKit integration example
### Checklists

- `checklists/pre-deployment-resilience.md`: Production readiness checklist
- `checklists/circuit-breaker-setup.md`: Circuit breaker configuration guide
## 2026 Best Practices

- **Adaptive Thresholds**: Use sliding windows, not fixed counters (see the sketch after this list)
- **Observability First**: Every circuit trip = alert + metric + trace
- **Graceful Degradation**: Always have a fallback, even if partial
- **Health Endpoints**: Separate health checks from circuit state
- **Chaos Testing**: Regularly test failure scenarios in staging
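A minimal sketch of the sliding-window threshold from the first bullet (a hypothetical class; `window`, `rate`, and `min_calls` are assumed parameters):

```python
import time
from collections import deque

class SlidingWindowThreshold:
    """Trip on failure *rate* over a recent window, not a lifetime counter."""

    def __init__(self, window=60.0, rate=0.5, min_calls=10):
        self.window = window        # seconds of history to consider
        self.rate = rate            # failure ratio that trips the breaker
        self.min_calls = min_calls  # avoid tripping on tiny samples
        self.events = deque()       # (timestamp, failed) pairs

    def record(self, failed: bool) -> None:
        now = time.monotonic()
        self.events.append((now, failed))
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()  # expire events outside the window

    def should_trip(self) -> bool:
        if len(self.events) < self.min_calls:
            return False
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events) >= self.rate
```

Unlike a fixed counter, old failures age out, so the breaker does not trip on stale history after hours of healthy traffic.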
## Related Skills

- `observability-monitoring`: Metrics and alerting for circuit breaker state changes
- `caching-strategies`: Cache as fallback layer in degradation scenarios
- `error-handling-rfc9457`: Structured error responses for resilience failures
- `background-jobs`: Async processing with retry and failure handling
## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |
## Capability Details

### circuit-breaker

**Keywords:** circuit breaker, failure threshold, cascade failure, trip, half-open

**Solves:**

- Prevent cascade failures when external services fail
- Automatically recover when services come back online
- Fail fast instead of waiting for timeouts
### bulkhead

**Keywords:** bulkhead, isolation, semaphore, thread pool, resource pool, tier

**Solves:**

- Isolate failures to prevent entire system crashes
- Prioritize critical operations over optional ones
- Limit concurrent requests to protect resources
### retry-strategies

**Keywords:** retry, backoff, exponential, jitter, thundering herd

**Solves:**

- Handle transient failures automatically
- Avoid overwhelming recovering services
- Classify errors as retryable vs. non-retryable
### llm-resilience

**Keywords:** LLM, fallback, model, token budget, rate limit, context length

**Solves:**

- Handle LLM API rate limits gracefully
- Fall back to alternative models when the primary fails
- Manage token budgets to prevent context overflow
### error-classification

**Keywords:** error, retryable, transient, permanent, classification

**Solves:**

- Determine which errors should be retried
- Categorize errors by severity and recoverability
- Map HTTP status codes to resilience actions