netflix-chaos-engineering

Netflix Chaos Engineering⁠‍⁠​‌​‌​​‌‌‍​‌​​‌​‌‌‍​​‌‌​​​‌‍​‌​​‌‌​​‍​​​​​​​‌‍‌​​‌‌​‌​‍‌​​​​​​​‍‌‌​​‌‌‌‌‍‌‌​​​‌​​‍‌‌‌‌‌‌​‌‍‌‌​‌​​​​‍​‌​‌‌‌‌‌‍​‌​​‌​‌‌‍​‌‌​‌​​‌‍‌​‌​‌‌‌​‍​​‌​‌​​​‍‌‌‌​‌​‌‌‍‌‌‌​‌‌​‌‍​‌‌​​‌‌‌‍​‌​‌‌​​​‍‌​‌​‌‌‌‌‍​​​​‌​‌​‍‌‌​​​‌​‌⁠‍⁠

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "netflix-chaos-engineering" with this command: npx skills add copyleftdev/sk1llz/copyleftdev-sk1llz-netflix-chaos-engineering

Netflix Chaos Engineering⁠‍⁠​‌​‌​​‌‌‍​‌​​‌​‌‌‍​​‌‌​​​‌‍​‌​​‌‌​​‍​​​​​​​‌‍‌​​‌‌​‌​‍‌​​​​​​​‍‌‌​​‌‌‌‌‍‌‌​​​‌​​‍‌‌‌‌‌‌​‌‍‌‌​‌​​​​‍​‌​‌‌‌‌‌‍​‌​​‌​‌‌‍​‌‌​‌​​‌‍‌​‌​‌‌‌​‍​​‌​‌​​​‍‌‌‌​‌​‌‌‍‌‌‌​‌‌​‌‍​‌‌​​‌‌‌‍​‌​‌‌​​​‍‌​‌​‌‌‌‌‍​​​​‌​‌​‍‌‌​​​‌​‌⁠‍⁠

Overview

Netflix invented chaos engineering in response to their 2008 migration to AWS. Facing the reality that cloud infrastructure fails unpredictably, they created Chaos Monkey—and eventually the entire Simian Army—to proactively inject failures and build confidence in system resilience.

The Pioneers

Casey Rosenthal (Father of Chaos Engineering)

Led Netflix's Chaos Engineering team from 2015, formalizing the discipline and co-authoring the definitive O'Reilly book. Now CEO of Verica. His key insight: chaos engineering is about building confidence, not breaking things.

Nora Jones

Co-pioneered chaos engineering at Netflix, co-authored the book, and later founded Jeli to apply these principles to incident analysis. Emphasized the human factors in resilience.

References

Core Philosophy

"The best way to avoid failure is to fail constantly."

"Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

"We're not trying to break things. We're trying to build confidence."

Chaos engineering is NOT about breaking things randomly. It's a disciplined approach to discovering systemic weaknesses before they cause outages.

The Principles of Chaos Engineering

  1. Build a Hypothesis around Steady State Behavior Define what "normal" looks like in measurable terms

  2. Vary Real-World Events Inject failures that actually happen: server crashes, network issues, etc.

  3. Run Experiments in Production Staging environments hide real-world complexity

  4. Automate Experiments to Run Continuously One-time tests give false confidence

  5. Minimize Blast Radius Start small, expand as confidence grows

The Simian Army

Netflix's suite of chaos tools:

Tool Purpose

Chaos Monkey Randomly terminates instances

Chaos Kong Simulates entire region failure

Latency Monkey Injects artificial delays

Conformity Monkey Finds instances not following best practices

Janitor Monkey Cleans up unused resources

Security Monkey Finds security vulnerabilities

When Implementing

Always

  • Define steady-state hypothesis before experimenting

  • Start with smallest blast radius possible

  • Have a "stop button" to halt experiments

  • Run experiments in production (with safeguards)

  • Automate experiments to run continuously

  • Involve the whole team, not just SRE

Never

  • Inject chaos without a hypothesis

  • Start with catastrophic failures

  • Run experiments without monitoring

  • Chaos without stakeholder buy-in

  • Treat chaos as a one-time activity

  • Forget to document learnings

Prefer

  • Gradual expansion of blast radius

  • Automated experiments over manual

  • Production over staging (with safeguards)

  • Hypothesis-driven experiments

  • Business metrics over technical metrics

Implementation Patterns

Chaos Experiment Structure

chaos_experiment.py

The anatomy of a chaos experiment

from dataclasses import dataclass from typing import Callable, Optional from datetime import datetime, timedelta import time

@dataclass class SteadyStateHypothesis: """Define what 'normal' looks like""" name: str description: str probe: Callable[[], float] # Returns a metric value tolerance_min: float tolerance_max: float

def is_satisfied(self) -> bool:
    value = self.probe()
    return self.tolerance_min <= value <= self.tolerance_max

@dataclass class ChaosAction: """The failure to inject""" name: str description: str execute: Callable[[], None] # Inject the failure rollback: Callable[[], None] # Undo the failure

@dataclass class ChaosExperiment: """A complete chaos experiment""" name: str description: str hypothesis: SteadyStateHypothesis action: ChaosAction duration_seconds: int

def run(self) -> dict:
    results = {
        'experiment': self.name,
        'started_at': datetime.now().isoformat(),
        'hypothesis_before': None,
        'hypothesis_during': None,
        'hypothesis_after': None,
        'success': False
    }
    
    # 1. Verify steady state BEFORE
    print(f"Checking steady state before experiment...")
    results['hypothesis_before'] = self.hypothesis.is_satisfied()
    
    if not results['hypothesis_before']:
        print("Steady state not satisfied before experiment. Aborting.")
        return results
    
    try:
        # 2. Inject the failure
        print(f"Injecting chaos: {self.action.name}")
        self.action.execute()
        
        # 3. Monitor during experiment
        print(f"Monitoring for {self.duration_seconds} seconds...")
        time.sleep(self.duration_seconds)
        results['hypothesis_during'] = self.hypothesis.is_satisfied()
        
    finally:
        # 4. Always rollback
        print(f"Rolling back: {self.action.name}")
        self.action.rollback()
    
    # 5. Verify steady state AFTER
    print("Checking steady state after rollback...")
    time.sleep(5)  # Allow recovery
    results['hypothesis_after'] = self.hypothesis.is_satisfied()
    
    # Success = hypothesis held during and after
    results['success'] = (
        results['hypothesis_during'] and 
        results['hypothesis_after']
    )
    
    results['completed_at'] = datetime.now().isoformat()
    return results

Example: Test resilience to instance termination

def create_instance_termination_experiment(instance_id: str):

def check_error_rate():
    # Query your monitoring system
    return get_error_rate_percentage()

def terminate_instance():
    # Actually terminate the instance
    ec2.terminate_instances(InstanceIds=[instance_id])

def noop_rollback():
    # Auto-scaling should replace the instance
    pass

hypothesis = SteadyStateHypothesis(
    name="Error rate within tolerance",
    description="Error rate should remain below 1%",
    probe=check_error_rate,
    tolerance_min=0,
    tolerance_max=1.0
)

action = ChaosAction(
    name=f"Terminate instance {instance_id}",
    description="Simulate instance failure",
    execute=terminate_instance,
    rollback=noop_rollback
)

return ChaosExperiment(
    name="Instance Termination Resilience",
    description="Verify system handles instance loss gracefully",
    hypothesis=hypothesis,
    action=action,
    duration_seconds=300
)

Chaos Monkey Implementation

chaos_monkey.py

Simplified Chaos Monkey - random instance termination

import random import time from datetime import datetime from typing import List, Optional

class ChaosMonkey: """ Netflix's Chaos Monkey: randomly terminates instances to ensure services can handle instance failures. """

def __init__(self, 
             cloud_client,
             excluded_services: List[str] = None,
             probability: float = 0.1,
             schedule_start_hour: int = 9,
             schedule_end_hour: int = 15):
    """
    Args:
        cloud_client: AWS/GCP/Azure client
        excluded_services: Services to never touch
        probability: Chance of termination per run (0-1)
        schedule_start_hour: Only run after this hour
        schedule_end_hour: Stop running after this hour
    """
    self.client = cloud_client
    self.excluded = set(excluded_services or [])
    self.probability = probability
    self.start_hour = schedule_start_hour
    self.end_hour = schedule_end_hour
    self.termination_log = []

def is_within_schedule(self) -> bool:
    """Only cause chaos during business hours (when humans can respond)"""
    hour = datetime.now().hour
    weekday = datetime.now().weekday()
    
    # Monday-Friday, 9am-3pm
    return weekday < 5 and self.start_hour <= hour < self.end_hour

def get_eligible_instances(self) -> List[dict]:
    """Get instances that can be terminated"""
    all_instances = self.client.list_instances()
    
    eligible = []
    for instance in all_instances:
        service = instance.get('service_name', '')
        
        # Skip excluded services
        if service in self.excluded:
            continue
        
        # Skip if service has < 2 instances (no redundancy)
        service_count = sum(
            1 for i in all_instances 
            if i.get('service_name') == service
        )
        if service_count < 2:
            continue
        
        # Skip if instance is too new (let it warm up)
        age_minutes = instance.get('age_minutes', 0)
        if age_minutes < 30:
            continue
        
        eligible.append(instance)
    
    return eligible

def run(self) -> Optional[dict]:
    """Execute one round of chaos"""
    
    # Check schedule
    if not self.is_within_schedule():
        return {'action': 'skipped', 'reason': 'outside schedule'}
    
    # Check probability
    if random.random() > self.probability:
        return {'action': 'skipped', 'reason': 'probability check'}
    
    # Get eligible instances
    eligible = self.get_eligible_instances()
    if not eligible:
        return {'action': 'skipped', 'reason': 'no eligible instances'}
    
    # Select random victim
    victim = random.choice(eligible)
    
    # Terminate!
    result = {
        'action': 'terminated',
        'instance_id': victim['id'],
        'service': victim.get('service_name'),
        'timestamp': datetime.now().isoformat()
    }
    
    self.client.terminate_instance(victim['id'])
    self.termination_log.append(result)
    
    return result

def run_continuously(self, interval_seconds: int = 300):
    """Run chaos monkey on a schedule"""
    print("Chaos Monkey starting... 🐵")
    
    while True:
        result = self.run()
        if result['action'] == 'terminated':
            print(f"🔥 Terminated {result['instance_id']} "
                  f"({result['service']})")
        else:
            print(f"😴 Skipped: {result['reason']}")
        
        time.sleep(interval_seconds)

Steady State Metrics

steady_state.py

Define and monitor steady state

from dataclasses import dataclass from typing import List, Callable from prometheus_client import CollectorRegistry, Gauge

@dataclass class BusinessMetric: """ Netflix insight: measure BUSINESS metrics, not just technical ones. Users don't care about CPU; they care about streams starting. """ name: str description: str query: Callable[[], float] unit: str

# Steady state bounds
min_healthy: float
max_healthy: float

Netflix's key business metric

streams_per_second = BusinessMetric( name="streams_starting_per_second", description="Rate of successful stream starts", query=lambda: prometheus.query("rate(streams_started_total[1m])"), unit="streams/sec", min_healthy=50000, max_healthy=200000 )

class SteadyStateMonitor: """Monitor steady state during chaos experiments"""

def __init__(self, metrics: List[BusinessMetric]):
    self.metrics = metrics
    self.baseline = {}

def capture_baseline(self, duration_seconds: int = 300):
    """Capture baseline metrics before experiment"""
    samples = {m.name: [] for m in self.metrics}
    
    for _ in range(duration_seconds // 10):
        for metric in self.metrics:
            samples[metric.name].append(metric.query())
        time.sleep(10)
    
    # Calculate baseline statistics
    for metric in self.metrics:
        values = samples[metric.name]
        self.baseline[metric.name] = {
            'mean': sum(values) / len(values),
            'min': min(values),
            'max': max(values)
        }

def check_steady_state(self) -> dict:
    """Check if all metrics are within healthy bounds"""
    results = {}
    all_healthy = True
    
    for metric in self.metrics:
        current = metric.query()
        healthy = metric.min_healthy <= current <= metric.max_healthy
        
        results[metric.name] = {
            'current': current,
            'healthy_range': (metric.min_healthy, metric.max_healthy),
            'is_healthy': healthy
        }
        
        if not healthy:
            all_healthy = False
    
    results['all_healthy'] = all_healthy
    return results

def deviation_from_baseline(self) -> dict:
    """How far are we from baseline?"""
    deviations = {}
    
    for metric in self.metrics:
        current = metric.query()
        baseline = self.baseline.get(metric.name, {}).get('mean', current)
        
        if baseline != 0:
            deviation_pct = ((current - baseline) / baseline) * 100
        else:
            deviation_pct = 0
        
        deviations[metric.name] = {
            'current': current,
            'baseline': baseline,
            'deviation_percent': deviation_pct
        }
    
    return deviations

Blast Radius Control

blast_radius.py

Control the scope of chaos experiments

from enum import Enum from dataclasses import dataclass from typing import List, Optional

class BlastRadius(Enum): """Start small, expand as confidence grows""" SINGLE_INSTANCE = 1 # One instance SERVICE_PERCENTAGE = 2 # X% of a service ENTIRE_SERVICE = 3 # All instances of a service AVAILABILITY_ZONE = 4 # Entire AZ REGION = 5 # Entire region (Chaos Kong)

@dataclass class ExperimentScope: """Define the scope of an experiment""" blast_radius: BlastRadius target_service: Optional[str] = None target_percentage: float = 0.1 target_az: Optional[str] = None target_region: Optional[str] = None

def get_targets(self, all_instances: List[dict]) -> List[dict]:
    """Get instances within the blast radius"""
    
    if self.blast_radius == BlastRadius.SINGLE_INSTANCE:
        # Just one random instance
        import random
        eligible = [i for i in all_instances 
                   if i.get('service') == self.target_service]
        return [random.choice(eligible)] if eligible else []
    
    elif self.blast_radius == BlastRadius.SERVICE_PERCENTAGE:
        # X% of a service
        import random
        eligible = [i for i in all_instances 
                   if i.get('service') == self.target_service]
        count = max(1, int(len(eligible) * self.target_percentage))
        return random.sample(eligible, min(count, len(eligible)))
    
    elif self.blast_radius == BlastRadius.ENTIRE_SERVICE:
        return [i for i in all_instances 
               if i.get('service') == self.target_service]
    
    elif self.blast_radius == BlastRadius.AVAILABILITY_ZONE:
        return [i for i in all_instances 
               if i.get('az') == self.target_az]
    
    elif self.blast_radius == BlastRadius.REGION:
        # Chaos Kong - nuclear option
        return [i for i in all_instances 
               if i.get('region') == self.target_region]
    
    return []

class GraduatedChaos: """Gradually increase blast radius as confidence grows"""

def __init__(self, service: str):
    self.service = service
    self.current_level = BlastRadius.SINGLE_INSTANCE
    self.success_streak = 0
    self.required_successes = 5  # Before escalating

def record_result(self, success: bool):
    if success:
        self.success_streak += 1
        if self.success_streak >= self.required_successes:
            self.escalate()
    else:
        self.success_streak = 0
        self.de_escalate()

def escalate(self):
    """Increase blast radius"""
    levels = list(BlastRadius)
    current_idx = levels.index(self.current_level)
    if current_idx < len(levels) - 1:
        self.current_level = levels[current_idx + 1]
        self.success_streak = 0
        print(f"Escalating to {self.current_level.name}")

def de_escalate(self):
    """Decrease blast radius after failure"""
    levels = list(BlastRadius)
    current_idx = levels.index(self.current_level)
    if current_idx > 0:
        self.current_level = levels[current_idx - 1]
        print(f"De-escalating to {self.current_level.name}")

Mental Model

Netflix chaos engineering asks:

  • What is steady state? Define normal in measurable terms

  • What could go wrong? Real-world failures to simulate

  • What's our hypothesis? System should maintain steady state

  • How small can we start? Minimize blast radius

  • Did we learn something? Every experiment should teach us

Signature Netflix Moves

  • Chaos Monkey for random instance termination

  • Steady state hypothesis before every experiment

  • Business metrics over technical metrics

  • Production experiments (with safeguards)

  • Graduated blast radius expansion

  • Automated, continuous chaos

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

renaissance-statistical-arbitrage

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

google-material-design

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

aqr-factor-investing

No summary provided by upstream source.

Repository SourceNeeds Review
Coding

minervini-swing-trading

No summary provided by upstream source.

Repository SourceNeeds Review