# Reliability Strategy Builder
Build resilient systems with proper failure handling and SLOs.
## Reliability Patterns
### Circuit Breaker
Prevent cascading failures by stopping requests to failing services.
```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();
    if (this.failureCount >= 5) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const now = Date.now();
    const elapsed = now - this.lastFailureTime.getTime();
    return elapsed > 60000; // 1 minute
  }
}
```
### Retry with Backoff
Handle transient failures with exponential backoff.
```typescript
// Sleep helper used by the retry loop (assumed not defined elsewhere).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt === maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  throw new Error("Max retries exceeded");
}
```
### Fallback Pattern
Provide degraded functionality when the primary dependency fails.
```typescript
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try primary database
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");

    // Fallback to cache
    const cached = await cache.get(`user:${userId}`);
    if (cached) return cached;

    // Final fallback: return minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```
### Bulkhead Pattern
Isolate failures to prevent resource exhaustion.
```typescript
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools for different operations
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute(priority: string, operation: () => Promise<any>) {
    const pool = this.pools.get(priority);
    if (!pool) throw new Error(`Unknown priority: ${priority}`);
    await pool.acquire();

    try {
      return await operation();
    } finally {
      pool.release();
    }
  }
}
```
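The bulkhead code above assumes a `Semaphore` with `acquire()`/`release()`. If your runtime or utility library does not provide one, a minimal counting-semaphore sketch could look like this:

```typescript
// Minimal counting semaphore (sketch, assumed by the bulkhead example above).
// acquire() resolves immediately while permits remain, otherwise queues the caller.
class Semaphore {
  private queue: Array<() => void> = [];

  constructor(private permits: number) {}

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    // No permits left: wait until release() hands one back.
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) {
      next(); // Hand the freed permit directly to the next waiter.
    } else {
      this.permits++;
    }
  }
}
```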
## SLO Definitions

### SLO Template

```yaml
service: user-api
slos:
  - name: Availability
    description: API should be available for successful requests
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
      window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
      window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
      window: 24 hours
```
### Error Budget

Error Budget = 100% - SLO

Example:
- SLO: 99.9% availability
- Error budget: 0.1% = 43.2 minutes of downtime allowed per 30-day month
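A small helper makes the arithmetic explicit (a sketch; the function name and 30-day default window are illustrative):

```typescript
// Convert an SLO target into an allowed-downtime error budget for a window.
// windowDays = 30 reproduces the 43.2 minutes/month example above.
function errorBudgetMinutes(sloTarget: number, windowDays = 30): number {
  const budgetFraction = 1 - sloTarget;        // e.g. 1 - 0.999 = 0.001
  const windowMinutes = windowDays * 24 * 60;  // 30 days = 43,200 minutes
  return budgetFraction * windowMinutes;       // 0.001 * 43,200 = 43.2
}

console.log(errorBudgetMinutes(0.999)); // ≈ 43.2 minutes of downtime per 30 days
```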
## Failure Mode Analysis
| Component | Failure Mode | Impact | Probability | Detection | Mitigation |
|---|---|---|---|---|---|
| Database | Unresponsive | HIGH | Medium | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload | HIGH | Low | Request queue depth | Rate limiting, auto-scaling |
| Cache | Eviction | MEDIUM | High | Cache hit rate | Fallback to DB, larger cache |
| Queue | Backed up | LOW | Medium | Queue depth metric | Add workers, DLQ |
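Most of the Detection column above relies on periodic probing. A minimal dependency health-check sketch (the `monitorDependency` helper, probe signature, and 10-second interval are illustrative, not an existing API):

```typescript
// Record the health of each dependency so callers can fail fast or fall back
// instead of waiting on a dead service.
const dependencyHealth = new Map<string, boolean>();

function monitorDependency(
  name: string,
  probe: () => Promise<unknown>, // e.g. a ping query or a GET on a health endpoint
  intervalMs = 10_000            // matches the "every 10s" detection in the table
): void {
  setInterval(async () => {
    try {
      await probe();
      dependencyHealth.set(name, true);
    } catch {
      // Unhealthy: downstream mitigation (circuit breaker, replica, cache) takes over.
      dependencyHealth.set(name, false);
    }
  }, intervalMs);
}
```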
## Reliability Checklist

### Infrastructure

- [ ] Load balancer with health checks
- [ ] Multiple availability zones
- [ ] Auto-scaling configured
- [ ] Database replication
- [ ] Regular backups (tested!)
### Application

- [ ] Circuit breakers on external calls
- [ ] Retry logic with backoff
- [ ] Timeouts on all I/O (see the sketch after this list)
- [ ] Fallback mechanisms
- [ ] Graceful degradation
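A sketch of the "timeouts on all I/O" item, assuming a simple promise wrapper (the `withTimeout` name and 5-second default are illustrative):

```typescript
// Reject if an I/O promise does not settle within `ms`, so slow dependencies
// fail fast instead of tying up callers.
function withTimeout<T>(operation: Promise<T>, ms = 5_000, label = "operation"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
    operation
      .then((value) => { clearTimeout(timer); resolve(value); })
      .catch((err) => { clearTimeout(timer); reject(err); });
  });
}

// Usage (illustrative): await withTimeout(primaryDb.users.findById(userId), 2_000, "findById");
```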
### Monitoring

- [ ] SLO dashboard
- [ ] Error budgets tracked
- [ ] Alerting on SLO violations
- [ ] Latency percentiles (p50, p95, p99); see the sketch after this list
- [ ] Dependency health checks
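For the latency-percentile item, a nearest-rank sketch over raw samples (illustrative; production setups usually compute percentiles from histograms in the metrics backend):

```typescript
// Nearest-rank percentile over raw request durations, e.g. for a dashboard or alert rule.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) return 0;
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Example: percentile(recentDurations, 95) > 500 would be burning the latency SLO above.
```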
### Operations

- [ ] Incident response runbook
- [ ] On-call rotation
- [ ] Postmortem template
- [ ] Disaster recovery plan
- [ ] Chaos engineering tests
## Incident Response Plan

### Severity Levels

SEV1 (Critical): Complete service outage, data loss
- Response time: <15 minutes
- Page on-call immediately

SEV2 (High): Partial outage, degraded performance
- Response time: <1 hour
- Alert on-call

SEV3 (Medium): Minor issues, workarounds available
- Response time: <4 hours
- Create ticket

SEV4 (Low): Cosmetic issues, no user impact
- Response time: Next business day
- Backlog
### Incident Response Steps

1. Acknowledge: Confirm receipt within SLA
2. Assess: Determine severity and impact
3. Communicate: Update status page
4. Mitigate: Stop the bleeding (rollback, scale, disable)
5. Resolve: Fix root cause
6. Document: Write postmortem
## Best Practices

- Design for failure: Assume components will fail
- Fail fast: Don't let slow failures cascade
- Isolate failures: Bulkhead pattern
- Graceful degradation: Reduce functionality, don't crash
- Monitor SLOs: Track error budgets
- Test failure modes: Chaos engineering
- Document runbooks: Clear incident response
## Output Checklist

- [ ] Circuit breakers implemented
- [ ] Retry logic with backoff
- [ ] Fallback mechanisms
- [ ] Bulkhead isolation
- [ ] SLOs defined (availability, latency, errors)
- [ ] Error budgets calculated
- [ ] Failure mode analysis
- [ ] Monitoring dashboard
- [ ] Incident response plan
- [ ] Runbooks documented