chaos-engineering-resilience

Chaos Engineering & Resilience Testing

<default_to_action> When testing system resilience or injecting failures:

DEFINE steady state (normal metrics: error rate, latency, throughput)
HYPOTHESIZE that the system stays in steady state during the failure
INJECT real-world failures (network, instance, disk, CPU)
OBSERVE and measure deviation from steady state
FIX weaknesses discovered, document runbooks, repeat

Quick Chaos Steps:

Start small: Dev → Staging → 1% prod → gradual rollout
Define clear rollback triggers (for example, error rate > 5%)
Measure blast radius; never exceed the planned scope
Document findings → runbooks → improved resilience

Critical Success Factors:

Controlled experiments with automatic rollback
Steady state must be measurable
Start in non-production, graduate to production </default_to_action>
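The loop above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: `runExperiment` and its callbacks (`readErrorRate`, `inject`, `rollback`) are hypothetical names, and the error rate is assumed to be a fraction between 0 and 1.

```javascript
// Minimal sketch of the define -> inject -> observe -> rollback loop.
// All names here are hypothetical; plug in your own metric source and fault injector.
async function runExperiment({ readErrorRate, inject, rollback, maxErrorRate, checks, intervalMs }) {
  // DEFINE: confirm the system is in steady state before touching anything.
  const baseline = await readErrorRate();
  if (baseline > maxErrorRate) {
    return { status: "skipped", reason: "system not in steady state", baseline };
  }

  // INJECT the failure, then OBSERVE for a fixed number of checks.
  await inject();
  const observed = [];
  try {
    for (let i = 0; i < checks; i++) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
      const rate = await readErrorRate();
      observed.push(rate);
      if (rate > maxErrorRate) {
        // Rollback trigger hit: the hypothesis is rejected, stop early.
        return { status: "rolled-back", baseline, observed };
      }
    }
    return { status: "hypothesis-held", baseline, observed };
  } finally {
    // Always undo the injection, whether or not the hypothesis held.
    await rollback();
  }
}
```

Putting the rollback in `finally` means the fault is removed even if reading a metric throws, which is one way to get the "automatic rollback" listed under the critical success factors.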

Quick Reference Card

When to Use

Distributed systems validation
Disaster recovery testing
Building confidence in fault tolerance
Pre-production resilience verification

Failure Types to Inject

| Category       | Failures                            | Tools                |
|----------------|-------------------------------------|----------------------|
| Network        | Latency, packet loss, partition     | tc, toxiproxy        |
| Infrastructure | Instance kill, disk failure, CPU    | Chaos Monkey         |
| Application    | Exceptions, slow responses, leaks   | Gremlin, LitmusChaos |
| Dependencies   | Service outage, timeout             | WireMock             |
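For the application-level failures in the table (exceptions, slow responses), a small in-process wrapper is often enough for a first experiment in dev. The sketch below is an assumption-laden illustration: `withFaults` and its options are hypothetical and are not part of any of the tools listed.

```javascript
// Hypothetical helper: wraps an async function so calls can be delayed or made to fail.
// failureRate is a probability in [0, 1]; random is injectable so tests are deterministic.
function withFaults(fn, { latencyMs = 0, failureRate = 0, random = Math.random } = {}) {
  return async function faulty(...args) {
    if (latencyMs > 0) {
      // Simulate a slow response before doing any real work.
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    if (random() < failureRate) {
      // Simulate an application exception.
      throw new Error("injected fault");
    }
    return fn(...args);
  };
}
```

A caller might wrap a database client method, for example `withFaults(fetchUser, { latencyMs: 500 })` with a hypothetical `fetchUser`, to approximate a slow dependency without touching the network.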

Blast Radius Progression

Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓         ↓                   ↓
  Learn     Validate   Careful         Full confidence
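One way to encode this progression is as an ordered list with a gate between stages. In the sketch below the stage labels come from the progression above, while `STAGES` and `nextStage` are hypothetical names, and staying at the same stage after a failed run is an assumed policy.

```javascript
// Stages follow the blast radius progression; identifiers are illustrative only.
const STAGES = ["dev", "staging", "prod-1%", "prod-10%", "prod-50%", "prod-100%"];

// Returns the stage to run next. A failed experiment keeps the scope where it is,
// so the weakness can be fixed and the experiment rerun before widening the blast radius.
function nextStage(current, passed) {
  const index = STAGES.indexOf(current);
  if (index === -1) throw new Error(`unknown stage: ${current}`);
  if (!passed) return current;
  return STAGES[Math.min(index + 1, STAGES.length - 1)];
}
```

Teams that prefer to step back one stage after a failure could return `STAGES[Math.max(index - 1, 0)]` instead of `current`.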

Steady State Metrics

| Metric      | Normal   | Alert Threshold |
|-------------|----------|-----------------|
| Error rate  | < 0.1%   | > 1%            |
| p99 latency | < 200ms  | > 500ms         |
| Throughput  | baseline | -20%            |
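The alert thresholds in this table (1% error rate, 500 ms p99 latency, a 20% throughput drop) translate directly into a check function. This is a sketch under stated assumptions: the field names are hypothetical, rates are fractions rather than percentages, and "-20%" is read as "more than 20% below the measured baseline".

```javascript
// Alert thresholds taken from the table; field names are illustrative.
const THRESHOLDS = { errorRate: 0.01, p99LatencyMs: 500, throughputDrop: 0.2 };

// Compares current metrics with the thresholds and reports which ones are violated.
function checkSteadyState(current, baseline, thresholds = THRESHOLDS) {
  const violations = [];
  if (current.errorRate > thresholds.errorRate) violations.push("errorRate");
  if (current.p99LatencyMs > thresholds.p99LatencyMs) violations.push("p99LatencyMs");
  // Throughput is judged relative to the baseline, not as an absolute number.
  const drop = (baseline.throughput - current.throughput) / baseline.throughput;
  if (drop > thresholds.throughputDrop) violations.push("throughput");
  return { steady: violations.length === 0, violations };
}
```

Returning the list of violated metrics, rather than a bare boolean, makes the result usable both as a rollback trigger and as an entry in the experiment's findings.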

Chaos Experiment Structure

// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
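The `trigger` in this definition is a string, so something has to turn `'errorRate > 5%'` into a decision. The sketch below shows one possible approach; the grammar it accepts (metric name, comparison operator, number with an optional `%`) is an assumption, and `parseTrigger` is a hypothetical helper rather than part of any tool named in this document.

```javascript
// Parses a trigger such as 'errorRate > 5%' into a predicate over a metrics object.
// Percentages are converted to fractions, so 5% becomes 0.05.
function parseTrigger(expression) {
  const match = /^\s*(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)\s*(%?)\s*$/.exec(expression);
  if (!match) throw new Error(`cannot parse trigger: ${expression}`);
  const [, metric, operator, rawValue, percent] = match;
  const value = percent ? Number(rawValue) / 100 : Number(rawValue);
  const compare = {
    ">": (a, b) => a > b,
    ">=": (a, b) => a >= b,
    "<": (a, b) => a < b,
    "<=": (a, b) => a <= b,
  }[operator];
  // A metric missing from the object compares as false, so the trigger does not fire.
  return (metrics) => compare(metrics[metric], value);
}
```

With this helper, `parseTrigger(experiment.rollback.trigger)` yields a function that can be called on each metrics sample during the experiment.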

Agent-Driven Chaos

// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
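How the agent picks instances is not specified here, but the idea of a random selection capped by the blast radius can be illustrated independently. In this sketch `selectTargets` is a hypothetical function, and the blast radius is passed as a fraction (0.1) rather than the `'10%'` string used above.

```javascript
// Picks a random subset of instances no larger than the blast radius fraction.
// random is injectable so the selection can be made deterministic in tests.
function selectTargets(instances, blastRadius, random = Math.random) {
  if (blastRadius < 0 || blastRadius > 1) {
    throw new RangeError("blastRadius must be between 0 and 1");
  }
  // Round down so the selection never exceeds the planned scope.
  const limit = Math.floor(instances.length * blastRadius);
  // Fisher-Yates shuffle on a copy, leaving the caller's array untouched.
  const pool = [...instances];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, limit);
}
```

Rounding down means a small fleet can yield zero targets (5 instances at 10%, for example), which errs on the side of a smaller blast radius; a caller who wants at least one target has to opt in explicitly.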

Agent Coordination Hints

Memory Namespace

aqe/chaos-engineering/
├── experiments/*    - Experiment definitions & results
├── steady-states/*  - Baseline measurements
├── runbooks/*       - Generated recovery procedures
└── blast-radius/*   - Impact analysis

Fleet Coordination

const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});

Related Skills

shift-right-testing - Production testing
performance-testing - Load testing
test-environment-management - Environment stability

Remember

Break things on purpose to prevent unplanned outages. Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.

With Agents: qe-chaos-engineer automates chaos experiments with blast radius control, automatic rollback, and resilience validation. It also generates runbooks from experiment results.
