# Chaos Engineering Fundamentals

Principles and practices for chaos engineering - proactively discovering system weaknesses through controlled experiments.
## When to Use This Skill

- Implementing chaos engineering practices
- Designing fault injection experiments
- Building confidence in system resilience
- Discovering hidden failure modes
- Validating disaster recovery
## What is Chaos Engineering?

Chaos Engineering = Proactive resilience testing

Traditional testing asks: "Does it work when everything is right?" Chaos engineering asks: "Does it work when things go wrong?"

Principle: Build confidence in the system's ability to withstand turbulent conditions in production.

Chaos engineering is not about breaking things randomly. It is about running controlled experiments in order to learn.
## The Chaos Engineering Loop

```
┌─────────────────────────────────────────────────────────┐
│                  CHAOS ENGINEERING LOOP                  │
│                                                          │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐            │
│   │ Define  │────►│ Inject  │────►│ Observe │            │
│   │ Steady  │     │ Chaos   │     │ Results │            │
│   │ State   │     │         │     │         │            │
│   └─────────┘     └─────────┘     └────┬────┘            │
│        ▲                               │                 │
│        │                               │                 │
│        │          ┌─────────┐          │                 │
│        └──────────│ Improve │◄─────────┘                 │
│                   │ System  │                            │
│                   └─────────┘                            │
└─────────────────────────────────────────────────────────┘
```
## Core Principles

### 1. Build Hypothesis Around Steady State

Steady State = Normal system behavior

Define measurable indicators:

- Request success rate: 99.9%
- Latency p99: < 200ms
- Orders processed/minute: > 100
- User sessions active: > 10,000

Hypothesis format: "When [fault condition] occurs, the system will maintain [steady state metrics] within [acceptable bounds]"

Example: "When one database replica fails, request success rate will remain above 99.5% and latency below 500ms"
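A hypothesis in this format can be encoded as a machine-checkable probe. A minimal sketch, assuming a metrics backend exposes current values through a `get_metric` callable; the metric names and thresholds below are illustrative:

```python
# Sketch: encode the steady-state hypothesis as machine-checkable bounds.
from dataclasses import dataclass

@dataclass
class SteadyStateCheck:
    metric: str
    operator: str     # ">=" or "<="
    threshold: float

    def passes(self, value: float) -> bool:
        if self.operator == ">=":
            return value >= self.threshold
        return value <= self.threshold

# "When one database replica fails, request success rate will remain
# above 99.5% and latency below 500ms"
HYPOTHESIS = [
    SteadyStateCheck("request_success_rate_pct", ">=", 99.5),
    SteadyStateCheck("latency_p99_ms", "<=", 500.0),
]

def steady_state_holds(get_metric) -> bool:
    """True while every steady-state bound is still satisfied."""
    return all(c.passes(get_metric(c.metric)) for c in HYPOTHESIS)
```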
### 2. Vary Real-World Events

Inject realistic failures:

Infrastructure:
- Server crash
- Network partition
- Disk full
- CPU exhaustion
- Clock skew

Application:
- Service unavailable
- Slow responses
- Corrupted data
- Certificate expiry
- Resource exhaustion

Dependencies:
- Database failure
- Cache unavailable
- Third-party API down
- Message queue backup
### 3. Run Experiments in Production

Why production?
- Real traffic patterns
- Real infrastructure
- Real dependencies
- Real monitoring

Start safe:
- Begin in non-production
- Graduate to canary
- Progress to production
- Expand blast radius gradually

Safety nets:
- Kill switch ready
- Rollback plan
- Limited blast radius
- Monitoring in place
### 4. Automate Experiments to Run Continuously

One-time experiments find one-time bugs. Continuous experiments catch regressions.

Automation goals:
- Run experiments regularly
- Integrate with CI/CD
- Catch new failure modes
- Validate changes

Example schedule (a runner sketch follows this list):
- Critical paths: Daily
- Core services: Weekly
- Full system: Monthly
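In practice a CI/CD pipeline or cron scheduler usually owns this cadence; as a stdlib-only sketch of the idea, with `run_experiment` as a hypothetical hook into whatever chaos tool you drive:

```python
# Sketch: a stdlib-only loop that runs experiments on a fixed cadence.
# `run_experiment` is a hypothetical hook into your chaos tooling.
import time

SCHEDULE_SECONDS = {
    "critical-path-check": 24 * 3600,        # daily
    "core-service-failure": 7 * 24 * 3600,   # weekly
    "full-system-gameday": 30 * 24 * 3600,   # monthly
}

def run_experiment(name: str) -> None:
    print(f"running chaos experiment: {name}")  # placeholder hook

def main() -> None:
    last_run = {name: 0.0 for name in SCHEDULE_SECONDS}
    while True:
        now = time.time()
        for name, interval in SCHEDULE_SECONDS.items():
            if now - last_run[name] >= interval:
                run_experiment(name)
                last_run[name] = now
        time.sleep(60)  # re-check once a minute

if __name__ == "__main__":
    main()
```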
### 5. Minimize Blast Radius

Control experiment impact (a guard sketch follows this list):

Scope limitations:
- Single instance
- Percentage of traffic
- Specific region
- Test accounts only

Duration limits:
- Seconds to minutes
- Automatic termination
- Scheduled windows

Abort conditions:
- Error rate exceeds threshold
- Customer impact detected
- Manual kill switch
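A minimal sketch of such a guard, assuming request-level fault injection; the traffic percentage, duration, and `is_test_account` attribute are illustrative:

```python
# Sketch: gate every injection decision behind blast-radius controls.
# The percentage, duration, and test-account scoping are illustrative.
import random
import time

class BlastRadiusGuard:
    def __init__(self, traffic_pct: float, max_seconds: float):
        self.traffic_pct = traffic_pct             # e.g. 1.0 = 1% of traffic
        self.deadline = time.time() + max_seconds  # automatic termination
        self.aborted = False                       # flipped by the kill switch

    def should_inject(self, is_test_account: bool) -> bool:
        if self.aborted or time.time() > self.deadline:
            return False                           # abort/duration limit wins
        if not is_test_account:
            return False                           # scope: test accounts only
        return random.random() * 100 < self.traffic_pct

guard = BlastRadiusGuard(traffic_pct=1.0, max_seconds=300)  # 1% for 5 minutes
```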
## Experiment Design

### Experiment Structure

```
Experiment: [Name]
Date: [When]
Team: [Who]

## Hypothesis
When [fault is injected], the system will [expected behavior] because [reasoning].

## Steady State Metrics
- [Metric 1]: [Expected value]
- [Metric 2]: [Expected value]

## Experiment Details
Fault Type: [What we're injecting]
Target: [Where we're injecting]
Magnitude: [How severe]
Duration: [How long]

## Blast Radius
- Affected services: [List]
- Affected users: [Percentage/count]
- Region/zone: [Scope]

## Abort Conditions
- [Condition 1] → Abort
- [Condition 2] → Abort

## Rollback Plan
- [Step 1]
- [Step 2]

## Results
Hypothesis: [Confirmed/Falsified]
Observations: [What we saw]
Action Items: [What to fix]
```
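The same template can live as code ("chaos as code", per the maturity model below), so experiments are versioned and reviewed like any other change. A sketch; the field names mirror the template above and the values are illustrative:

```python
# Sketch: the experiment template as a reviewable, versioned object.
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state: dict          # metric -> expected bound
    fault_type: str
    target: str
    magnitude: str
    duration_s: int
    abort_conditions: list = field(default_factory=list)
    rollback_plan: list = field(default_factory=list)

replica_failure = ChaosExperiment(
    name="db-replica-failure",
    hypothesis="Success rate stays above 99.5% when one replica dies",
    steady_state={"request_success_rate_pct": ">= 99.5",
                  "latency_p99_ms": "<= 500"},
    fault_type="instance termination",
    target="db-replica pool, one instance",
    magnitude="1 of 3 replicas",
    duration_s=300,
    abort_conditions=["error rate > 1%", "manual kill switch"],
    rollback_plan=["restart replica", "verify replication catch-up"],
)
```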
### Common Experiment Types

```
1. Service Failure
   └── Kill instances, return errors
2. Network Failures
   └── Latency injection, packet loss, partitions
3. Resource Exhaustion
   └── CPU stress, memory pressure, disk full
4. Dependency Failures
   └── Database down, cache miss, API timeout
5. State Corruption
   └── Clock skew, data inconsistency
6. Traffic Surge
   └── Sudden load increase
```
## Fault Injection Patterns

### Infrastructure Faults

Instance termination (see the sketch after this subsection):
- Kill random instances
- Verify auto-scaling/recovery
- Netflix Chaos Monkey style

Zone/region failure:
- Simulate full zone outage
- Test failover to other zones
- Verify data consistency

Network partition:
- Split-brain scenarios
- Cross-region communication failure
- Consensus algorithm behavior
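A sketch of Chaos Monkey-style instance termination on AWS with boto3; the `chaos-opt-in` tag is an assumed convention that limits the candidate pool, and the call needs AWS credentials with EC2 permissions:

```python
# Sketch: random instance termination via boto3, restricted to
# instances explicitly tagged as safe to kill.
import random
from typing import Optional

import boto3

def terminate_random_instance(region: str = "us-east-1") -> Optional[str]:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    candidates = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not candidates:
        return None  # nothing opted in: do no harm
    victim = random.choice(candidates)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```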
### Application Faults

Latency injection (a decorator sketch covering latency and errors follows this subsection):
- Add artificial delay
- Test timeout handling
- Verify circuit breakers

Error injection:
- Return 500 errors
- Throw exceptions
- Test error handling paths

Resource leaks:
- Memory leaks
- Connection pool exhaustion
- File handle exhaustion
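Latency and error injection can be as simple as a decorator around a code path. A sketch; the delay and failure probability are illustrative, and in practice a call like this should sit behind the blast-radius guard shown earlier:

```python
# Sketch: a decorator that injects latency and random errors into a
# code path.
import functools
import random
import time

def chaos(latency_s: float = 0.0, error_rate: float = 0.0):
    """Wrap a function with an artificial delay and random failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                time.sleep(latency_s)                 # latency injection
            if random.random() < error_rate:
                raise RuntimeError("chaos: injected failure")  # error injection
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaos(latency_s=0.3, error_rate=0.05)  # +300ms delay, 5% failures
def fetch_profile(user_id: str) -> dict:
    return {"id": user_id}  # stand-in for the real call
```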
### Dependency Faults

Database failures:
- Primary failover
- Replica lag
- Connection pool exhaustion

Cache failures:
- Cache miss scenarios
- Cache cluster failure
- Stampede protection

External API failures (a client-boundary sketch follows this subsection):
- Timeout
- Rate limiting
- Malformed responses
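External-API faults can be simulated at the client boundary, leaving the real dependency untouched. A sketch; the fault rates, `FaultyClient` wrapper, and response shape are assumptions for illustration:

```python
# Sketch: wrap any callable "transport" and randomly substitute
# timeouts or rate-limit responses for real calls.
import random

class FaultyClient:
    def __init__(self, transport, timeout_rate=0.0, rate_limit_rate=0.0):
        self.transport = transport            # the real callable
        self.timeout_rate = timeout_rate
        self.rate_limit_rate = rate_limit_rate

    def get(self, url: str):
        roll = random.random()
        if roll < self.timeout_rate:
            raise TimeoutError(f"chaos: simulated timeout for {url}")
        if roll < self.timeout_rate + self.rate_limit_rate:
            return {"status": 429, "body": "chaos: simulated rate limit"}
        return self.transport(url)            # pass through untouched

client = FaultyClient(lambda url: {"status": 200, "body": "ok"},
                      timeout_rate=0.05, rate_limit_rate=0.05)
```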
## Chaos Engineering Tools

### Popular Tools

Chaos Monkey (Netflix, open source)
- Random instance termination
- AWS-focused
- Part of the Simian Army

Gremlin (commercial)
- Comprehensive chaos platform
- Multiple attack types
- Enterprise features

Litmus (open source)
- Kubernetes-native
- ChaosHub experiment library
- GitOps-friendly

Chaos Mesh (open source)
- Kubernetes-native
- Various fault types
- Dashboard included

Pumba (open source)
- Docker chaos testing
- Container-level faults
- CI/CD integration
### Cloud Provider Tools

AWS:
- Fault Injection Simulator
- Native integration

Azure:
- Chaos Studio
- Azure-native experiments

GCP:
- No native tool (use Gremlin or Litmus)
## Implementation Strategy

### Maturity Model

Level 0: Ad hoc
- Manual testing
- No chaos practice
- Reactive to failures

Level 1: Beginning
- First experiments
- Non-production only
- Manual execution

Level 2: Intermediate
- Regular experiments
- Production experiments
- Some automation

Level 3: Advanced
- Continuous chaos
- Automated experiments
- Broad coverage

Level 4: Expert
- Chaos as code
- Integrated into CI/CD
- Regular GameDays
### Getting Started

Weeks 1-2: Foundation
- Identify critical paths
- Define steady state metrics
- Set up monitoring

Weeks 3-4: First Experiments
- Start with known failures
- Run in non-production
- Document learnings

Month 2: Expand
- Add more experiment types
- Move to production (carefully)
- Automate basic experiments

Month 3+: Mature
- Regular GameDays
- Continuous experiments
- Integrate with CI/CD
## GameDays

### What is a GameDay?

GameDay = Planned chaos exercise

Like a fire drill for systems:
- Scheduled in advance
- Multiple failure scenarios
- Practice incident response
- Learn and improve
### GameDay Structure

Before:
- Define objectives
- Plan scenarios
- Notify stakeholders
- Prepare runbooks
- Set up monitoring

During:
- Run scenarios
- Observe system behavior
- Practice incident response
- Document findings

After:
- Debrief meeting
- Document learnings
- Create action items
- Plan next GameDay
### GameDay Scenarios

Scenario categories:

1. Infrastructure
   - Region failure
   - Network partition
   - Scaling limits
2. Application
   - Service outage
   - Deployment failure
   - Configuration error
3. Data
   - Database corruption
   - Backup restoration
   - Data center switch
4. Security
   - Credential rotation
   - Certificate expiry
   - Access revocation
## Safety and Guardrails

### Experiment Safety

Before running chaos:

1. Have a hypothesis: know what you're testing
2. Limit blast radius: start small, expand gradually
3. Have abort conditions: automatic stops
4. Have a rollback plan: know how to undo
5. Monitor everything: can't learn what you can't see
6. Communicate: the team knows an experiment is running
### Kill Switch

Every experiment needs the following (a watchdog sketch follows this section):

Manual kill switch:
- Instant termination
- Accessible to multiple people
- Tested before the experiment

Automatic abort:
- Error rate threshold
- Latency threshold
- Customer impact detection

Notification:
- Alert when an abort is triggered
- Log the reason for the abort
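A sketch of an automatic-abort watchdog paired with a manual kill switch; `get_metric` and `stop_experiment` are hypothetical hooks into your metrics backend and chaos tooling, and the thresholds are illustrative:

```python
# Sketch: poll steady-state metrics and abort the experiment when a
# threshold is crossed or the manual kill switch is flipped.
import threading
import time

ABORT_THRESHOLDS = {
    "error_rate_pct": 1.0,      # abort if error rate exceeds 1%
    "latency_p99_ms": 1000.0,   # abort if p99 latency exceeds 1s
}

manual_kill = threading.Event()  # call manual_kill.set() to abort

def watchdog(get_metric, stop_experiment, poll_s: float = 5.0) -> None:
    while not manual_kill.is_set():
        for metric, limit in ABORT_THRESHOLDS.items():
            value = get_metric(metric)
            if value > limit:
                stop_experiment(reason=f"{metric}={value} exceeded {limit}")
                return  # alerting/logging happen inside stop_experiment
        time.sleep(poll_s)
    stop_experiment(reason="manual kill switch")
```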
## Measuring Success

Chaos engineering success metrics:

1. Experiments run: are we doing chaos regularly?
2. Issues discovered: are we finding problems?
3. MTTR improvement: are we recovering faster?
4. Incident prevention: did chaos prevent production incidents?
5. Confidence level: does the team trust the system more?
## Best Practices

1. Start small: begin with simple experiments
2. Hypothesis first: know what you're testing
3. Automate gradually: manual first, then automate
4. Production eventually: that's where real chaos lives
5. Blameless culture: findings are learnings, not failures
6. Regular GameDays: practice makes prepared
7. Share learnings: spread knowledge across teams
## Related Skills

- resilience-patterns - Building resilient systems
- gameday-planning - Detailed GameDay planning
- incident-response - Handling discovered issues