Chaos Engineering
Principles
- Build a Hypothesis: Define expected behavior
- Minimize Blast Radius: Start small
- Run in Production: Real conditions matter
- Automate: Make experiments repeatable
- Minimize Impact: Have abort conditions
Experiment Process
- Steady State: Define normal metrics
- Hypothesis: "System will maintain X under condition Y"
- Introduce Variables: Inject failure
- Observe: Compare to steady state
- Analyze: Confirm or disprove hypothesis
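The Observe and Analyze steps come down to comparing a measured value with its steady-state baseline. A minimal sketch of that comparison follows. The helper name within_tolerance, the 10% tolerance, and the latency figures are illustrative assumptions, not values from this document.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 when OBSERVED stays within TOLERANCE_PCT
# percent of BASELINE, and 1 otherwise.
# Usage: within_tolerance BASELINE OBSERVED TOLERANCE_PCT
within_tolerance() {
    awk -v b="$1" -v o="$2" -v t="$3" 'BEGIN {
        diff = o - b
        if (diff < 0) diff = -diff        # absolute deviation
        limit = b * t / 100
        if (limit < 0) limit = -limit     # allowed deviation
        exit (diff <= limit) ? 0 : 1
    }'
}

# Made-up numbers: baseline p99 latency 200 ms, observed 215 ms,
# and a hypothesis that allows a 10% deviation.
if within_tolerance 200 215 10; then
    echo "hypothesis holds"
else
    echo "hypothesis disproved"
fi
```

awk does the arithmetic because POSIX sh has no floating-point support. In practice the observed value would come from whatever monitoring system holds the steady-state metrics.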
Common Experiments
Network Failures
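Add 100 ms of latency on eth0 (requires root):
tc qdisc add dev eth0 root netem delay 100ms
Add 10% packet loss (if a root qdisc is already attached, remove it first or use "tc qdisc change"):
tc qdisc add dev eth0 root netem loss 10%
Remove the injected fault:
tc qdisc del dev eth0 root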
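Because a forgotten netem rule keeps degrading the interface, it can help to wrap injection and removal in one script. The sketch below does so under stated assumptions: the names IFACE, DELAY, DURATION, DRY_RUN, run, cleanup, and inject_latency are illustrative, and DRY_RUN defaults to 1 so the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: inject latency for a fixed window and remove it afterwards.
# All variable and function names here are hypothetical.
IFACE="${IFACE:-eth0}"
DELAY="${DELAY:-100ms}"
DURATION="${DURATION:-60}"
DRY_RUN="${DRY_RUN:-1}"

# In dry-run mode print the command; otherwise execute it (tc needs root).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Remove the netem qdisc from the interface.
cleanup() {
    run tc qdisc del dev "$IFACE" root
}

inject_latency() {
    # If interrupted (for example with Ctrl-C), still remove the qdisc.
    trap 'cleanup; exit 130' INT TERM
    run tc qdisc add dev "$IFACE" root netem delay "$DELAY" || return 1
    run sleep "$DURATION"
    cleanup
    trap - INT TERM
}
```

Calling inject_latency with the defaults prints the add, sleep, and del commands in order. Setting DRY_RUN=0 and running as root would apply them for real.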
Resource Exhaustion
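CPU stress (4 workers for 60 seconds):
stress --cpu 4 --timeout 60s
Memory stress (2 workers allocating 1G each for 60 seconds):
stress --vm 2 --vm-bytes 1G --timeout 60s
Disk fill (writes a 1 GiB file of zeros):
dd if=/dev/zero of=/tmp/fill bs=1M count=1024
Remove the fill file:
rm -f /tmp/fill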
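Unlike the stress commands, the disk fill has no timeout, so the space stays consumed until the file is deleted. A minimal sketch that pairs the fill with its release is shown below. FILL_FILE, fill_disk, and release_disk are hypothetical names, and the block size is written in bytes so it does not depend on a particular dd suffix syntax.

```shell
#!/bin/sh
# Sketch: fill disk space with a file of zeros, then release it.
# FILL_FILE and both function names are illustrative assumptions.
FILL_FILE="${FILL_FILE:-/tmp/fill}"

# Write SIZE_MB mebibytes of zeros to FILL_FILE.
# Usage: fill_disk SIZE_MB
fill_disk() {
    dd if=/dev/zero of="$FILL_FILE" bs=1048576 count="$1" 2>/dev/null
}

# Delete the fill file so the space is returned to the filesystem.
release_disk() {
    rm -f "$FILL_FILE"
}
```

A run matching the command above would be fill_disk 1024, a pause to observe the system, then release_disk.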
Service Failures
- Kill processes
- Restart containers
- Terminate instances
- Block dependencies
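Service-failure experiments often pick their target at random so that no instance is treated as special. The sketch below shows one way to choose a target from a list. The helper pick_victim and the names web-1, web-2, and web-3 are hypothetical, and the script only prints its choice rather than terminating anything.

```shell
#!/bin/sh
# Sketch: pick one target at random from the arguments.
# pick_victim is a hypothetical helper name.
# Usage: pick_victim TARGET...
pick_victim() {
    [ "$#" -gt 0 ] || return 1
    # Seed from the clock plus the shell PID so runs differ from each other.
    seed=$(( $(date +%s) + $$ ))
    # rand() is in [0, 1), so the index falls in 1..N.
    idx=$(awk -v n="$#" -v seed="$seed" \
        'BEGIN { srand(seed); print int(rand() * n) + 1 }')
    shift $((idx - 1))
    echo "$1"
}

# Made-up target names for illustration.
victim=$(pick_victim web-1 web-2 web-3)
echo "would terminate: $victim"
```

Replacing the final echo with the actual kill, restart, or terminate command for the platform in use turns this into a live experiment, so the blast radius should be agreed on first.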
Chaos Tools
- Chaos Monkey: Random instance termination (from Netflix)
- Gremlin: Commercial fault-injection platform
- Litmus: Chaos experiments for Kubernetes (CNCF project)
- Chaos Mesh: Chaos orchestration for Kubernetes (CNCF project)
Experiment Template
Experiment: [Name]
Hypothesis
If [condition], then [expected behavior].
Steady State
- Metric A: [baseline value]
- Metric B: [baseline value]
Method
- [Step 1]
- [Step 2]
- [Step 3]
Abort Conditions
- If [condition], stop immediately
Results
[What happened]
Findings
[What we learned]
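The Abort Conditions entry in the template can be enforced by a small watchdog that runs alongside the experiment. The following is a sketch only. watch_abort and its arguments are hypothetical, and the check and rollback commands are supplied by whoever runs the experiment.

```shell
#!/bin/sh
# Sketch: poll a health check during an experiment and roll back on failure.
# watch_abort is a hypothetical helper name.
# Usage: watch_abort CHECK_CMD ROLLBACK_CMD ROUNDS INTERVAL_SECONDS
# CHECK_CMD must exit 0 while the system is healthy.
watch_abort() {
    check="$1"; rollback="$2"; rounds="$3"; interval="$4"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        if ! sh -c "$check"; then
            echo "abort condition met, rolling back" >&2
            sh -c "$rollback"
            return 1
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 0
}
```

For example, with an assumed health endpoint on localhost:8080, watch_abort 'curl -fsS http://localhost:8080/health >/dev/null' 'tc qdisc del dev eth0 root' 60 1 would check once per second for a minute and remove the network fault on the first failed check.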
Safety Rules
- Start in non-production before running in production
- Have rollback ready
- Monitor continuously
- Communicate with team
- Document everything