
GameDay Planning

Comprehensive guide for planning and executing GameDay exercises - organized chaos drills that test system resilience and incident response.

When to Use This Skill

  • Planning GameDay exercises

  • Designing failure scenarios

  • Preparing teams for chaos experiments

  • Running disaster recovery drills

  • Improving incident response readiness

What is a GameDay?

GameDay = Planned chaos exercise for your systems

Like a fire drill, but for infrastructure:

  • Scheduled in advance
  • Controlled environment
  • Practice for real incidents
  • Learn and improve

Not chaos engineering:

  • GameDay: Scheduled team exercise
  • Chaos engineering: Continuous experiments

GameDays include:

  • Failure injection
  • Incident response practice
  • Team coordination
  • Runbook validation

GameDay Types

By Scope

  1. Component GameDay
     ├── Single service or component
     ├── Focused scenarios
     └── 2-4 hours

  2. Service GameDay
     ├── Multiple related services
     ├── Integration scenarios
     └── Half day

  3. Full System GameDay
     ├── Complete system
     ├── Disaster scenarios
     └── Full day

  4. Cross-Team GameDay
     ├── Multiple teams involved
     ├── Complex scenarios
     └── 1-2 days

By Objective

  1. Resilience validation
     └── Does the system handle failures?

  2. Recovery practice
     └── Can we restore from backup?

  3. Incident response training
     └── How well do we coordinate?

  4. Runbook validation
     └── Do our runbooks work?

  5. Capacity testing
     └── What happens under load?

Planning Phase

Timeline Overview

Week -4: Initial planning
├── Define objectives
├── Identify stakeholders
└── Draft scenario ideas

Week -3: Scenario design
├── Detail failure scenarios
├── Define success criteria
└── Identify risks

Week -2: Preparation
├── Review with stakeholders
├── Prepare monitoring
├── Update runbooks
└── Brief participants

Week -1: Final prep
├── Confirm participants
├── Test monitoring
├── Walkthrough scenarios
└── Prepare rollback plans

Day of: Execute
├── Pre-GameDay briefing
├── Run scenarios
├── Document observations
└── Hot debrief

Objective Setting

Good objectives:

  • "Validate failover to secondary region works < 5 minutes"
  • "Confirm team can diagnose database issues using runbooks"
  • "Test load balancer behavior when 50% of nodes fail"

Bad objectives:

  • "See what breaks" (too vague)
  • "Test everything" (too broad)
  • "Find all bugs" (unrealistic)

SMART objectives:

  • Specific: Clear scenario
  • Measurable: Defined success criteria
  • Achievable: Within team capability
  • Relevant: Tests real risks
  • Time-bound: Fits in GameDay

Scenario Design

Scenario template:

Name: [Descriptive name]
Type: [Infrastructure/Application/Data/Process]
Duration: [Expected time]

Objective: What are we testing?

Hypothesis: "When [fault], the system will [expected behavior]"

Setup:

  1. [Pre-condition 1]
  2. [Pre-condition 2]

Execution:

  1. [Injection step 1]
  2. [Injection step 2]

Expected Outcome:

  • [Metric] should [behavior]
  • [Alert] should [fire/not fire]
  • [Recovery] should [happen]

Success Criteria:

  □ [Criterion 1]
  □ [Criterion 2]

Abort Conditions:

  • [Condition] → Stop immediately
  • [Condition] → Pause and assess

Rollback Steps:

  1. [Rollback step 1]
  2. [Rollback step 2]
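
If you keep scenarios in version control, the same template can be expressed in code so tooling can validate and report on it. A minimal sketch using only the standard library (the `Scenario` class and field names are illustrative, not from any particular framework):

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One GameDay scenario, mirroring the template above."""
    name: str
    type: str                 # Infrastructure / Application / Data / Process
    duration_minutes: int
    objective: str
    hypothesis: str           # "When [fault], the system will [expected behavior]"
    setup: list[str] = field(default_factory=list)
    execution: list[str] = field(default_factory=list)
    expected_outcomes: list[str] = field(default_factory=list)
    success_criteria: list[str] = field(default_factory=list)
    abort_conditions: list[str] = field(default_factory=list)
    rollback_steps: list[str] = field(default_factory=list)

# Example instance for a database failover drill
db_failover = Scenario(
    name="Primary DB failover",
    type="Infrastructure",
    duration_minutes=60,
    objective="Validate automatic failover to the replica",
    hypothesis="When the primary DB dies, the replica is promoted in under 5 minutes",
    abort_conditions=["Customer-facing error rate exceeds 5%"],
    rollback_steps=["Restart original primary", "Verify replication catches up"],
)
```

A structured form like this also makes it easy to assert that every scenario has abort conditions and rollback steps before it is allowed on the schedule.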

Common Scenarios

Infrastructure:

  □ Kill primary database instance
  □ Network partition between zones
  □ Full disk on critical service
  □ Memory exhaustion
  □ Certificate expiration

Application:

  □ Deploy bad configuration
  □ Overwhelm with traffic
  □ Corrupt cache entries
  □ Exhaust connection pool
  □ API dependency failure

Data:

  □ Restore from backup
  □ Data corruption detection
  □ Replication lag
  □ Schema migration failure

Process:

  □ Key team member unavailable
  □ Credentials rotation
  □ Access revocation
  □ Runbook-only resolution
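
Many infrastructure scenarios reduce to a scripted, reversible fault. A hedged sketch of the "kill primary database instance" drill against a containerized staging environment (the container name is hypothetical; adapt the commands to your platform):

```python
import subprocess
import time

DB_CONTAINER = "staging-postgres-primary"  # hypothetical container name

def inject_db_failure() -> None:
    # docker stop sends SIGTERM, then SIGKILL after a grace period
    subprocess.run(["docker", "stop", DB_CONTAINER], check=True)

def rollback() -> None:
    subprocess.run(["docker", "start", DB_CONTAINER], check=True)

if __name__ == "__main__":
    inject_db_failure()
    try:
        time.sleep(300)  # observation window: did failover complete in < 5 min?
    finally:
        rollback()       # always restore, even if the run is interrupted
```

Keeping injection and rollback in the same script means the fault cannot be injected without a tested path back.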

Preparation Phase

Stakeholder Communication

Communication plan:

Leadership:

  • What: GameDay overview, risks, benefits
  • When: Week -3 (approval)
  • How: Meeting + document

Participating teams:

  • What: Detailed plan, roles, expectations
  • When: Week -2 (kickoff)
  • How: Meeting + documentation

Adjacent teams:

  • What: Notification, potential impact
  • When: Week -1
  • How: Email + calendar block

On-call:

  • What: Extra vigilance, escalation paths
  • When: Day before
  • How: Briefing + runbook

Participant Briefing

Briefing contents:

  1. Objectives: What are we testing and why?

  2. Roles: Who does what during GameDay?

  3. Schedule: Timeline and scenario order

  4. Ground rules: What's allowed, what's not

  5. Safety: Kill switches, abort conditions

  6. Communication: Channels, updates, escalation

  7. Questions: Clear up any confusion

Monitoring Preparation

Before GameDay:

  1. Verify dashboards work

    • All relevant metrics visible
    • Baselines understood
  2. Configure extra alerting

    • GameDay-specific alerts
    • Lower thresholds if needed
  3. Prepare queries

    • Log queries ready
    • Trace searches prepared
  4. Test recording

    • Screen recording if needed
    • Metrics export configured
  5. Clear noise

    • Suppress known alerts
    • Reduce background chatter
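
Step 1's baseline capture can be scripted so the "normal state" is recorded rather than remembered. A sketch against the Prometheus HTTP API, where the endpoint URL and PromQL queries are assumptions to replace with your own metrics backend:

```python
import json
import time
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.internal:9090"  # assumed Prometheus endpoint

def instant_query(promql: str) -> dict:
    """Run a PromQL instant query and return the decoded JSON response."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Snapshot a few baseline signals before injecting any fault.
baseline = {
    "error_rate": instant_query('sum(rate(http_requests_total{code=~"5.."}[5m]))'),
    "p99_latency": instant_query(
        'histogram_quantile(0.99, '
        'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
    ),
}

with open(f"baseline-{int(time.time())}.json", "w") as f:
    json.dump(baseline, f, indent=2)
```

Running the same queries during and after each scenario turns "error rate returning to normal" into a number.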

Safety Measures

Required safety measures (a kill-switch sketch follows these lists):

Kill switches:

  • Immediate stop for each scenario
  • Multiple people can trigger
  • Tested before GameDay

Blast radius limits:

  • Maximum affected users/traffic
  • Automatic enforcement
  • Clear escalation if exceeded

Rollback plans:

  • Documented for each scenario
  • Tested rollback procedures
  • Time-limited scenarios

Communication:

  • Dedicated channel
  • Clear "STOP" command
  • Status page ready to update

Customer protection:

  • Synthetic traffic if possible
  • Canary approach
  • Quick customer comm ready
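
A kill switch can be as simple as a shared flag that every injection step checks, paired with a guard that trips it automatically when the blast-radius limit is exceeded. A minimal sketch, assuming a placeholder error-rate function and a 5% limit (both are assumptions to replace with your own monitoring and thresholds):

```python
import threading

stop_gameday = threading.Event()  # any participant's tooling can set this

MAX_ERROR_RATE = 0.05  # assumed blast-radius limit: 5% of requests failing

def current_error_rate() -> float:
    """Placeholder: read the live error rate from your monitoring system."""
    raise NotImplementedError

def guard_loop(check_interval: float = 5.0) -> None:
    """Trip the kill switch automatically if the blast radius is exceeded."""
    while not stop_gameday.wait(check_interval):
        if current_error_rate() > MAX_ERROR_RATE:
            stop_gameday.set()
            print("KILL SWITCH: blast radius exceeded, stopping all injection")

def injection_step(step) -> None:
    """Run one injection action only while the kill switch is clear."""
    if stop_gameday.is_set():
        raise RuntimeError("GameDay stopped - skip injection and run rollback")
    step()
```

Test the switch before the GameDay, exactly as the checklist above says: trip it deliberately and confirm every injection path stops.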

Execution Phase

Day-of Structure

Typical GameDay schedule:

08:00 - Pre-GameDay briefing
└── Review objectives, roles, safety

08:30 - Monitoring baseline
└── Capture normal state

09:00 - Scenario 1
└── Execute, observe, document

10:30 - Break + quick debrief

11:00 - Scenario 2
└── Execute, observe, document

12:30 - Lunch break

13:30 - Scenario 3
└── Execute, observe, document

15:00 - Scenario 4 (if time)

16:00 - Hot debrief
└── Initial observations

16:30 - Cleanup
└── Ensure all reverted

Roles During Execution

GameDay Lead:

  • Runs the overall exercise
  • Makes go/no-go decisions
  • Controls pacing
  • Manages safety

Scenario Executor:

  • Injects faults
  • Monitors injection
  • Has kill switch
  • Reports status

Observers:

  • Watch system behavior
  • Document findings
  • Note unexpected events
  • Track metrics

Incident Responders:

  • Act as if real incident
  • Follow runbooks
  • Practice coordination
  • Don't know scenarios in advance (optional)

Scribe:

  • Records timeline
  • Documents decisions
  • Captures quotes
  • Notes action items

Documentation During

Timeline template:

[TIME] [ACTOR] [ACTION/OBSERVATION]

09:00 GameDay Lead: Starting Scenario 1 - DB failover
09:01 Executor: Triggered primary DB shutdown
09:02 Observer: Alert fired: DB connection errors
09:03 Observer: Failover initiated automatically
09:05 Observer: Secondary promoted to primary
09:07 Responder: Services reconnected
09:10 Observer: Error rate returning to normal
09:12 GameDay Lead: Scenario 1 complete - success

Capture:

  • Exact times
  • Who did what
  • System responses
  • Deviations from expected
  • Interesting observations
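
The scribe can keep entries consistent with a tiny helper that timestamps everything in the same format. A sketch (the log file name and format are suggestions, not a standard):

```python
from datetime import datetime, timezone

LOG_FILE = "gameday-timeline.log"  # suggested location

def log(actor: str, message: str) -> None:
    """Append a timestamped '[TIME] [ACTOR] [ACTION/OBSERVATION]' line."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M")
    line = f"{stamp} {actor}: {message}"
    print(line)  # echo to the dedicated channel or terminal
    with open(LOG_FILE, "a") as f:
        f.write(line + "\n")

log("GameDay Lead", "Starting Scenario 1 - DB failover")
log("Observer", "Alert fired: DB connection errors")
```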

Handling Real Incidents

If real incident occurs during GameDay:

  1. STOP GameDay immediately: "GameDay paused - real incident"

  2. Assess the real incident: Is it related to GameDay?

  3. Revert any GameDay changes: If potentially contributing

  4. Handle real incident: Normal incident process

  5. Decide on continuation: Resume or reschedule GameDay?

Always prioritize real incidents over GameDay.

Follow-Up Phase

Hot Debrief

Immediately after GameDay:

Duration: 30-60 minutes
Participants: All GameDay participants

Agenda:

  1. What happened? (5 min per scenario)

    • Timeline walk-through
    • Key observations
  2. What worked well?

    • Celebrate successes
    • Note effective practices
  3. What didn't work?

    • Issues discovered
    • Gaps in tools/process
  4. Initial action items

    • Quick fixes
    • Further investigation needed
  5. Next steps

    • Postmortem schedule
    • Owner assignments

Formal Postmortem

Within 1 week of GameDay:

GameDay Postmortem

Executive Summary

Brief overview of objectives, execution, outcomes

Scenarios Executed

| Scenario | Outcome | Key Findings |
| --- | --- | --- |
| DB failover | Success | 3 min recovery |
| Network partition | Partial | Manual intervention needed |

Detailed Findings

Scenario 1: Database Failover

  • Hypothesis: Automatic failover < 5 min
  • Result: CONFIRMED (3 min actual)
  • Observations: [Details]

Scenario 2: Network Partition

  • Hypothesis: Services continue with degraded mode
  • Result: PARTIALLY CONFIRMED
  • Gap: Service X didn't handle gracefully
  • Observations: [Details]

Action Items

| Action | Owner | Priority | Due Date |
| --- | --- | --- | --- |
| Fix Service X partition handling | @engineer | P1 | 2024-02-01 |
| Update runbook for DB failover | @oncall | P2 | 2024-02-15 |

Recommendations for Next GameDay

  • [Suggestion 1]
  • [Suggestion 2]

Action Item Tracking

Every action item needs:

  • Clear description
  • Single owner
  • Priority level
  • Due date
  • Definition of done

Track in:

  • Issue tracker
  • Dedicated dashboard
  • Regular review meetings

Don't let action items languish. The point is to improve.
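
If your tracker has an API, action items can be filed straight from the postmortem so nothing stays stranded in a document. A sketch against the GitHub REST API (the repository name is hypothetical, and the token is assumed to be in the environment):

```python
import json
import os
import urllib.request

REPO = "example-org/platform"       # hypothetical repository
TOKEN = os.environ["GITHUB_TOKEN"]  # assumed to be set by the caller

def file_action_item(title: str, owner: str, priority: str, due: str) -> None:
    """Create one tracked issue per GameDay action item."""
    body = (
        f"Owner: @{owner}\nPriority: {priority}\nDue: {due}\n"
        "Source: GameDay postmortem"
    )
    req = urllib.request.Request(
        f"https://api.github.com/repos/{REPO}/issues",
        data=json.dumps({"title": title, "body": body, "labels": ["gameday"]}).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)

file_action_item("Fix Service X partition handling", "engineer", "P1", "2024-02-01")
```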

Best Practices

Planning

  1. Start small: First GameDay should be simple

  2. Clear objectives: Know what you're testing

  3. Stakeholder buy-in: Get approval and support

  4. Thorough preparation: Don't rush the prep work

  5. Documented scenarios: Written plans, not in heads

Execution

  1. Safety first: Kill switches ready

  2. Communicate constantly: Everyone knows what's happening

  3. Document everything: You'll forget otherwise

  4. Stay on schedule: Don't let scenarios run over

  5. Be flexible: Adapt to unexpected situations

Follow-Up

  1. Debrief immediately: Hot debrief same day

  2. Formal postmortem: Within a week

  3. Track action items: Don't let them die

  4. Share learnings: Spread knowledge broadly

  5. Plan the next one: Make it a regular practice

Common Pitfalls

Pitfall: Scope creep
Fix: Strict scenario limits, time boxes

Pitfall: Insufficient preparation
Fix: Checklists, dry runs

Pitfall: No safety measures
Fix: Required kill switches, abort criteria

Pitfall: Skipping documentation
Fix: Dedicated scribe, templates

Pitfall: Orphaned action items
Fix: Tracked, owned, reviewed

Pitfall: Infrequent GameDays
Fix: Quarterly schedule, smaller scope

Maturity Progression

Level 1: Ad-hoc

  • First GameDay
  • Simple scenarios
  • Manual execution

Level 2: Regular

  • Quarterly GameDays
  • Multiple scenarios
  • Basic automation

Level 3: Integrated

  • Monthly GameDays
  • Complex scenarios
  • Good documentation
  • Action item tracking

Level 4: Continuous

  • Weekly smaller drills
  • Quarterly large GameDays
  • Automated scenarios
  • Metrics-driven improvement

Related Skills

  • chaos-engineering-fundamentals: Continuous chaos experiments

  • incident-response: Handling real incidents

  • resilience-patterns: Building resilient systems
