resilience-patterns

Resilience Patterns Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "resilience-patterns" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-resilience-patterns

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

Overview

Use this skill when:

  • Building fault-tolerant multi-agent systems

  • Implementing LLM API integrations with proper error handling

  • Designing distributed workflows that need graceful degradation

  • Adding observability to failure scenarios

  • Protecting systems from cascade failures

Core Patterns

  1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

    +-------------------------------------------------------------------+
    |                      Circuit Breaker States                       |
    +-------------------------------------------------------------------+
    |                                                                   |
    |  +----------+   failures >= threshold   +----------+              |
    |  |  CLOSED  | ------------------------> |   OPEN   |              |
    |  | (normal) |                           | (reject) |              |
    |  +----+-----+                           +----+-----+              |
    |       |                                      |                    |
    |       | success                      timeout |                    |
    |       |                              expires |                    |
    |       |         +------------+               |                    |
    |       |         | HALF_OPEN  |<--------------+                    |
    |       +---------+  (probe)   |                                    |
    |                 +------------+                                    |
    |                                                                   |
    |  CLOSED:    Allow requests, count failures                        |
    |  OPEN:      Reject immediately, return fallback                   |
    |  HALF_OPEN: Allow probe request to test recovery                  |
    |                                                                   |
    +-------------------------------------------------------------------+

Key Configuration:

  • failure_threshold: Failures before opening (default: 5)

  • recovery_timeout: Seconds before attempting recovery (default: 30)

  • half_open_requests: Probes to allow in half-open (default: 1)
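
The state machine and the three settings above can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not the contents of scripts/circuit-breaker.py: the names `CircuitBreaker`, `CircuitOpenError`, `State`, and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""


class CircuitBreaker:
    # Hypothetical class; defaults mirror the Key Configuration list above.
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests
        self._clock = clock  # injectable so tests do not need to sleep
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; rejecting call")
            # Recovery timeout expired: allow probe requests.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_requests:
                raise CircuitOpenError("probe limit reached")
            self._probes += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # A success (including a successful probe) closes the circuit.
        self.state = State.CLOSED
        self._failures = 0

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would typically catch `CircuitOpenError` and return a fallback value, which corresponds to the "reject immediately, return fallback" behavior of the OPEN state. A production version would also need locking if the breaker is shared across threads.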

  2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

    +-------------------------------------------------------------------+
    |                        Bulkhead Isolation                         |
    +-------------------------------------------------------------------+
    |                                                                   |
    |   +------------------+   +------------------+                     |
    |   | TIER 1: Critical |   | TIER 2: Standard |                     |
    |   |  (5 workers)     |   |  (3 workers)     |                     |
    |   |  +-+ +-+ +-+     |   |  +-+ +-+ +-+     |                     |
    |   |  |#| |#| | |     |   |  |#| | | | |     |                     |
    |   |  +-+ +-+ +-+     |   |  +-+ +-+ +-+     |                     |
    |   |  +-+ +-+         |   |                  |                     |
    |   |  | | | |         |   |  Queue: 2        |                     |
    |   |  +-+ +-+         |   |                  |                     |
    |   |  Queue: 0        |   +------------------+                     |
    |   +------------------+                                            |
    |                                                                   |
    |   +------------------+                                            |
    |   | TIER 3: Optional |    # = Active request                      |
    |   |  (2 workers)     |      = Available slot                      |
    |   |  +-+ +-+         |                                            |
    |   |  |#| |#|  FULL!  |    Tier 1: synthesis, quality_gate         |
    |   |  +-+ +-+         |    Tier 2: analysis agents                 |
    |   |  Queue: 5        |    Tier 3: enrichment, optional features   |
    |   +------------------+                                            |
    |                                                                   |
    +-------------------------------------------------------------------+

Tier Configuration (OrchestKit):

| Tier | Workers | Queue | Timeout | Use Case |
| --- | --- | --- | --- | --- |
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
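
As a rough illustration of how the tier table could map onto code, the sketch below builds one semaphore-based bulkhead per tier with asyncio. It is not the contents of scripts/bulkhead.py: `Bulkhead`, `BulkheadFullError`, and `make_tiers` are hypothetical names, and treating "Timeout" as a per-operation execution limit is an assumption.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when all worker slots are busy and the wait queue is full."""


class Bulkhead:
    # Hypothetical class: `workers` concurrent slots plus a bounded wait queue.
    def __init__(self, workers, queue_size, timeout):
        self._sem = asyncio.Semaphore(workers)
        self._queue_size = queue_size
        self._timeout = timeout
        self._waiting = 0

    async def run(self, coro_fn, *args):
        # Reject at once if no slot is free and the queue is already full.
        if self._sem.locked() and self._waiting >= self._queue_size:
            raise BulkheadFullError("bulkhead full")
        self._waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self._waiting -= 1
        try:
            # Assumption: the tier timeout bounds execution time only.
            return await asyncio.wait_for(coro_fn(*args), self._timeout)
        finally:
            self._sem.release()


def make_tiers():
    """Build the three tiers from the table; call this inside a running event loop."""
    return {
        1: Bulkhead(workers=5, queue_size=10, timeout=300),  # critical
        2: Bulkhead(workers=3, queue_size=5, timeout=120),   # standard
        3: Bulkhead(workers=2, queue_size=3, timeout=60),    # optional
    }
```

With this shape, an enrichment task would run as `await tiers[3].run(enrich, doc)` (where `enrich` is a placeholder coroutine function), and a saturated Tier 3 raises `BulkheadFullError` without taking slots away from Tier 1.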

  3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.

    +-------------------------------------------------------------------+
    |                   Exponential Backoff + Jitter                    |
    +-------------------------------------------------------------------+
    |                                                                   |
    |   Attempt 1:  --> X (fail)                                        |
    |               wait: 1s +/- 0.5s                                   |
    |                                                                   |
    |   Attempt 2:  --> X (fail)                                        |
    |               wait: 2s +/- 1s                                     |
    |                                                                   |
    |   Attempt 3:  --> X (fail)                                        |
    |               wait: 4s +/- 2s                                     |
    |                                                                   |
    |   Attempt 4:  --> OK (success)                                    |
    |                                                                   |
    |   Formula: delay = min(base * 2^(attempt-1), max_delay) * jitter  |
    |   Jitter: random(0.5, 1.5) to prevent thundering herd             |
    |                                                                   |
    +-------------------------------------------------------------------+
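
The delay computation in the diagram can be written as a small helper. This sketch numbers attempts from 1 so that the waits are centered on 1s, 2s, and 4s as shown; the function name `backoff_delay` and the `max_delay` default of 60 seconds are assumptions, not values from the skill.

```python
import random


def backoff_delay(attempt, base=1.0, max_delay=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based).

    The exponential term is capped at max_delay, then scaled by a random
    jitter factor in [0.5, 1.5] so that clients do not retry in lockstep.
    """
    capped = min(base * 2 ** (attempt - 1), max_delay)
    return capped * random.uniform(0.5, 1.5)


if __name__ == "__main__":
    for attempt in (1, 2, 3):
        print(f"attempt {attempt}: wait {backoff_delay(attempt):.2f}s")
```

Because jitter is applied after the cap, the longest possible wait in this sketch is 1.5 times `max_delay`.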

Error Classification for Retries:

    RETRYABLE_ERRORS = {
        # HTTP status codes
        408, 429, 500, 502, 503, 504,
        # Network errors
        ConnectionError, TimeoutError,
        # LLM-specific
        "rate_limit_exceeded",
        "model_overloaded",
        "context_length_exceeded",  # Retry with truncation
    }

    NON_RETRYABLE_ERRORS = {
        400, 401, 403, 404,  # Client errors
        "invalid_api_key",
        "content_policy_violation",
        "invalid_request_error",
    }
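
One way such a classification might drive a retry loop is sketched below. The `ApiError` exception type, the `is_retryable` and `call_with_retry` helpers, and the choice to treat unknown errors as permanent are all assumptions for illustration; real SDKs expose status codes and error codes in their own ways.

```python
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded", "context_length_exceeded"}
RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and a provider error code."""

    def __init__(self, status=None, code=None):
        super().__init__(f"status={status} code={code}")
        self.status = status
        self.code = code


def is_retryable(exc):
    if isinstance(exc, RETRYABLE_EXCEPTIONS):
        return True
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS or exc.code in RETRYABLE_CODES
    return False  # assumption: unknown errors are treated as permanent


def call_with_retry(func, max_attempts=4, sleep=time.sleep,
                    delay_for=lambda attempt: 2 ** (attempt - 1)):
    """Call func(), retrying retryable errors up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Give up on the last attempt or on a non-retryable error.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            sleep(delay_for(attempt))
```

The `delay_for` hook is where a jittered backoff function would plug in; the default here is plain exponential so the example stays deterministic. Note that "context_length_exceeded" is only worth retrying if the caller shortens the input between attempts, which this sketch does not do.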

  4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

    +-------------------------------------------------------------------+
    |                        LLM Fallback Chain                         |
    +-------------------------------------------------------------------+
    |                                                                   |
    |   Request --> [Primary Model] --success--> Response               |
    |                      |                                            |
    |                    fail                                           |
    |                      v                                            |
    |               [Fallback Model] --success--> Response              |
    |                      |                                            |
    |                    fail                                           |
    |                      v                                            |
    |               [Cached Response] --hit--> Response                 |
    |                      |                                            |
    |                    miss                                           |
    |                      v                                            |
    |               [Default Response] --> Graceful Degradation         |
    |                                                                   |
    |   Example Chain:                                                  |
    |   1. claude-sonnet-4-5-20251101 (primary)                         |
    |   2. gpt-5.2-mini (fallback)                                      |
    |   3. Semantic cache lookup                                        |
    |   4. "Analysis unavailable" + partial results                     |
    |                                                                   |
    +-------------------------------------------------------------------+
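
The chain above reduces to "try each model in order, then the cache, then a default". The sketch below shows that control flow only; `run_fallback_chain` is a hypothetical helper, the model callables stand in for real API clients, and returning a `(response, source)` pair is a design choice made for the example so callers can tell degraded answers apart.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_fallback_chain(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    cache_lookup: Callable[[str], Optional[str]],
    default: str = "Analysis unavailable",
) -> Tuple[str, str]:
    """Return (response, source), where source is 'model', 'cache', or 'default'."""
    for call_model in models:
        try:
            return call_model(prompt), "model"
        except Exception:
            continue  # this model failed; fall through to the next one
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached, "cache"
    return default, "default"


if __name__ == "__main__":
    def primary(prompt):
        raise RuntimeError("primary model unavailable")

    def fallback(prompt):
        return f"fallback answer to: {prompt}"

    print(run_fallback_chain("summarize", [primary, fallback], lambda p: None))
```

In practice each model call would usually sit behind its own retry logic and circuit breaker, so the chain only advances once a model is considered unavailable.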

Token Budget Management:

    +-------------------------------------------------------------------+
    |                        Token Budget Guard                         |
    +-------------------------------------------------------------------+
    |                                                                   |
    |   Input: 8,000 tokens                                             |
    |   +---------------------------------------------+                 |
    |   |#################################            |                 |
    |   +---------------------------------------------+                 |
    |                                                 ^                 |
    |                                                 |                 |
    |                                        Context Limit (16K)        |
    |                                                                   |
    |   Strategy when approaching limit:                                |
    |   1. Summarize earlier context (compress 4:1)                     |
    |   2. Drop low-priority content (optional fields)                  |
    |   3. Split into multiple requests                                 |
    |   4. Fail fast with "content too large" error                     |
    |                                                                   |
    +-------------------------------------------------------------------+
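
Two of the strategies above, dropping low-priority content and failing fast, can be illustrated with a small guard. This is a sketch under stated assumptions: `fit_to_budget` and `ContentTooLargeError` are hypothetical names, the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and the output reserve of 2,000 tokens is an invented default.

```python
class ContentTooLargeError(Exception):
    """Raised when required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return (len(text) + 3) // 4


def fit_to_budget(required, optional, context_limit=16_000, reserve_for_output=2_000):
    """Keep all required sections, then add optional ones while they fit.

    `optional` is expected in priority order, highest first.
    Returns (kept_sections, estimated_tokens_used).
    """
    budget = context_limit - reserve_for_output
    used = sum(estimate_tokens(s) for s in required)
    if used > budget:
        # Strategy 4: fail fast instead of sending an oversized request.
        raise ContentTooLargeError(f"content too large: {used} > {budget} tokens")
    kept = list(required)
    for section in optional:
        cost = estimate_tokens(section)
        if used + cost > budget:
            continue  # Strategy 2: drop low-priority content that does not fit
        kept.append(section)
        used += cost
    return kept, used
```

Summarizing earlier context or splitting into multiple requests would slot in before the fail-fast branch; both depend on the application and are left out here.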

Quick Reference

| Pattern | When to Use | Key Benefit |
| --- | --- | --- |
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

OrchestKit Integration Points

  • Workflow Agents: Each agent wrapped with circuit breaker + bulkhead tier

  • LLM Calls: All model invocations use fallback chain + retry logic

  • External APIs: Circuit breaker on YouTube, arXiv, GitHub APIs

  • Database Ops: Bulkhead isolation for read vs write operations

Files in This Skill

References (Conceptual Guides)

  • references/circuit-breaker.md: Deep dive on circuit breaker pattern

  • references/bulkhead-pattern.md: Bulkhead isolation strategies

  • references/retry-strategies.md: Retry algorithms and error classification

  • references/llm-resilience.md: LLM-specific patterns

  • references/error-classification.md: How to categorize errors

Templates (Code Patterns)

  • scripts/circuit-breaker.py: Ready-to-use circuit breaker class

  • scripts/bulkhead.py: Semaphore-based bulkhead implementation

  • scripts/retry-handler.py: Configurable retry decorator

  • scripts/llm-fallback-chain.py: Multi-model fallback pattern

  • scripts/token-budget.py: Token budget guard implementation

Examples

  • examples/orchestkit-workflow-resilience.md: Full OrchestKit integration example

Checklists

  • checklists/pre-deployment-resilience.md: Production readiness checklist

  • checklists/circuit-breaker-setup.md: Circuit breaker configuration guide

2026 Best Practices

  • Adaptive Thresholds: Use sliding windows, not fixed counters

  • Observability First: Every circuit trip = alert + metric + trace

  • Graceful Degradation: Always have a fallback, even if partial

  • Health Endpoints: Separate health check from circuit state

  • Chaos Testing: Regularly test failure scenarios in staging
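
To make the "sliding windows, not fixed counters" recommendation concrete, the sketch below tracks a failure rate over a recent time window instead of a running count. The class name `SlidingWindowFailureRate` and all defaults (60-second window, minimum of 10 calls, 50% threshold) are assumptions for illustration.

```python
import time
from collections import deque


class SlidingWindowFailureRate:
    # Hypothetical helper: decides whether a breaker should trip based on
    # the failure rate of calls seen in the last `window_seconds`.
    def __init__(self, window_seconds=60.0, min_calls=10,
                 rate_threshold=0.5, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.rate_threshold = rate_threshold
        self._clock = clock
        self._events = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded):
        now = self._clock()
        self._events.append((now, succeeded))
        self._evict(now)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def should_trip(self):
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_calls:
            return False  # too little traffic to judge
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total >= self.rate_threshold
```

Compared with a fixed counter, old failures age out on their own, and the `min_calls` floor keeps a single failure on a quiet service from opening the circuit.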

Related Skills

  • observability-monitoring: Metrics and alerting for circuit breaker state changes

  • caching-strategies: Cache as fallback layer in degradation scenarios

  • error-handling-rfc9457: Structured error responses for resilience failures

  • background-jobs: Async processing with retry and failure handling

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

Capability Details

circuit-breaker

Keywords: circuit breaker, failure threshold, cascade failure, trip, half-open

Solves:

  • Prevent cascade failures when external services fail

  • Automatically recover when services come back online

  • Fail fast instead of waiting for timeouts

bulkhead

Keywords: bulkhead, isolation, semaphore, thread pool, resource pool, tier

Solves:

  • Isolate failures to prevent entire system crashes

  • Prioritize critical operations over optional ones

  • Limit concurrent requests to protect resources

retry-strategies

Keywords: retry, backoff, exponential, jitter, thundering herd

Solves:

  • Handle transient failures automatically

  • Avoid overwhelming recovering services

  • Classify errors as retryable vs non-retryable

llm-resilience

Keywords: LLM, fallback, model, token budget, rate limit, context length

Solves:

  • Handle LLM API rate limits gracefully

  • Fall back to alternative models when primary fails

  • Manage token budgets to prevent context overflow

error-classification

Keywords: error, retryable, transient, permanent, classification

Solves:

  • Determine which errors should be retried

  • Categorize errors by severity and recoverability

  • Map HTTP status codes to resilience actions

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
