Runbook Creation Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "runbook-creation" with this command: npx skills add melodic-software/claude-code-plugins/melodic-software-claude-code-plugins-runbook-creation

When to Use This Skill

Use this skill when:

  • Runbook creation tasks: you are working on operational runbook templates for incident response and procedures

  • Planning or design: you need guidance on runbook creation approaches

  • Best practices: you want to follow established patterns and standards

Overview

Create operational runbooks for incident response, maintenance procedures, and operational tasks.

MANDATORY: Documentation-First Approach

Before creating runbooks:

  • Invoke docs-management skill for runbook patterns

  • Verify SRE best practices via MCP servers (perplexity)

  • Base guidance on Google SRE principles

Runbook Types

Runbook Categories:

┌──────────────────────────────────────┐
│ Incident Response Runbooks           │
│ • Alert-triggered procedures         │
│ • Escalation paths                   │
│ • Communication templates            │
├──────────────────────────────────────┤
│ Operational Runbooks                 │
│ • Deployment procedures              │
│ • Maintenance tasks                  │
│ • Backup/restore operations          │
├──────────────────────────────────────┤
│ Troubleshooting Runbooks             │
│ • Diagnostic procedures              │
│ • Common issue resolution            │
│ • Debug workflows                    │
├──────────────────────────────────────┤
│ Emergency Runbooks                   │
│ • Disaster recovery                  │
│ • Security incident response         │
│ • Business continuity                │
└──────────────────────────────────────┘

Standard Runbook Template

Runbook: [TITLE]

| Property | Value |
|----------|-------|
| ID | RB-[NUMBER] |
| Category | [Incident/Operational/Troubleshooting/Emergency] |
| Service | [Service Name] |
| Owner | [Team/Individual] |
| Last Updated | [YYYY-MM-DD] |
| Last Tested | [YYYY-MM-DD] |
| Review Frequency | [Quarterly/Monthly/Annually] |

Overview

Purpose: [What this runbook helps you accomplish]

When to Use: [Conditions that trigger this runbook]

Expected Outcome: [What success looks like]

Estimated Duration: [Time to complete]


Prerequisites

Required Access

  • [System/Tool 1] - [Role/Permission needed]
  • [System/Tool 2] - [Role/Permission needed]

Required Knowledge

  • [Skill/Knowledge 1]
  • [Skill/Knowledge 2]

Tools Needed

| Tool | Purpose | Access URL |
|------|---------|------------|
| [Tool 1] | [Purpose] | [URL/Link] |
| [Tool 2] | [Purpose] | [URL/Link] |

Quick Reference

Quick Commands:
┌────────────────────────────────────────────────────────────────┐
│ Check service status: kubectl get pods -n [namespace]          │
│ View logs: kubectl logs -f [pod-name] -n [namespace]           │
│ Restart service: kubectl rollout restart deployment/[name]     │
│ Check metrics: [monitoring-url]                                │
└────────────────────────────────────────────────────────────────┘

Procedure

Step 1: [Step Name]

Objective: [What this step accomplishes]

Actions:

- [Action 1]

# Command example
kubectl get pods -n production

- [Action 2]

Expected Result: [What you should see]

If This Fails: Go to Troubleshooting Section

Step 2: [Step Name]

Objective: [What this step accomplishes]

Actions:

- [Action 1]

- [Action 2]

Decision Point:

┌─────────────────────────────────────┐
│ Is the service responding?          │
│                                     │
│ YES → Continue to Step 3            │
│ NO  → Go to Step 4 (Escalation)     │
└─────────────────────────────────────┘
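
The decision point above can be scripted so that the branch does not depend on someone eyeballing a response. The following is a minimal sketch, assuming that any 2xx status from a health endpoint means "responding"; the function names `next_step` and `probe_health` are hypothetical and not part of the template.

```bash
#!/bin/sh
# Map an HTTP status code to the next runbook step.
# Assumption: any 2xx status counts as "the service is responding".
next_step() {
    case "$1" in
        2[0-9][0-9]) echo "Step 3" ;;
        *)           echo "Step 4 (Escalation)" ;;
    esac
}

# Print the HTTP status code of a health URL.
# curl reports 000 when the request fails or times out after 5 seconds.
probe_health() {
    curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true
}
```

A possible invocation, using a placeholder URL: `next_step "$(probe_health "https://api.example.com/health")"`.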

Step 3: [Verification]

Objective: Verify the issue is resolved

Verification Checklist:

- [ ] Service is responding to health checks
- [ ] Metrics show normal values
- [ ] No new errors in logs
- [ ] Users can access the service

Troubleshooting

Issue: [Common Issue 1]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

- [Step 1]

- [Step 2]

Issue: [Common Issue 2]

Symptoms: [What you observe]

Cause: [Root cause]

Resolution:

- [Step 1]

- [Step 2]

Escalation

When to Escalate

- [ ] Issue not resolved after [X] minutes
- [ ] Impact affects [threshold]
- [ ] Required access not available
- [ ] Unsure of next steps

Escalation Path

| Level | Contact | Method | Response Time |
|-------|---------|--------|---------------|
| L1 | On-call Engineer | PagerDuty | 15 min |
| L2 | Team Lead | Slack #incidents | 30 min |
| L3 | Engineering Manager | Phone | 1 hour |
| L4 | VP Engineering | Phone | As needed |
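
One way to make the escalation path mechanical is to derive the current level from the minutes elapsed since the incident started. The sketch below assumes each Response Time in the table doubles as the point at which the incident moves to the next level; the template does not state this, so treat the thresholds as placeholders. The function name `escalation_level` is hypothetical.

```bash
#!/bin/sh
# Print the escalation level for a given number of elapsed minutes.
# Assumption: the Response Time column is reused as the "escalate after" threshold.
escalation_level() {
    elapsed_min=$1
    if [ "$elapsed_min" -lt 15 ]; then
        echo "L1 On-call Engineer (PagerDuty)"
    elif [ "$elapsed_min" -lt 30 ]; then
        echo "L2 Team Lead (Slack #incidents)"
    elif [ "$elapsed_min" -lt 60 ]; then
        echo "L3 Engineering Manager (Phone)"
    else
        echo "L4 VP Engineering (Phone)"
    fi
}
```

For example, `escalation_level 20` prints the L2 contact under these assumed thresholds.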

Communication

Status Updates

Template:

[TIMESTAMP] - [SERVICE] - [STATUS]

Current Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of user impact]
Next Update: [Time of next update]

Actions Taken:
- [Action 1]
- [Action 2]

Next Steps:
- [Planned action]
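
Filling the template by hand during an incident invites typos and inconsistent status words. The sketch below renders the header and the first three fields from arguments and rejects any status outside the four listed above; `render_status_update` is a hypothetical helper, and the Actions Taken and Next Steps lists are left for the operator to append.

```bash
#!/bin/sh
# Render the first part of a status update.
# Arguments: timestamp, service, status, impact, next update time.
render_status_update() {
    ts=$1
    service=$2
    status=$3
    impact=$4
    next_update=$5
    # Only the four statuses named in the template are accepted.
    case "$status" in
        Investigating|Identified|Monitoring|Resolved) ;;
        *) echo "unknown status: $status" >&2; return 1 ;;
    esac
    printf '%s - %s - %s\n\n' "$ts" "$service" "$status"
    printf 'Current Status: %s\n' "$status"
    printf 'Impact: %s\n' "$impact"
    printf 'Next Update: %s\n' "$next_update"
}
```

Passing the timestamp as an argument rather than calling `date` inside the function keeps the output reproducible, which helps when the same text is posted to several channels.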

Stakeholder Notification

| Stakeholder | When to Notify | Method |
|-------------|----------------|--------|
| Engineering | Immediately | Slack |
| Product | If user-impacting | Slack |
| Support | If customer-facing | Email |
| Leadership | If SEV1/SEV2 | Phone |

Post-Incident

Cleanup Tasks

- [ ] Remove any temporary fixes
- [ ] Update monitoring/alerts if needed
- [ ] Document any new learnings

Post-Incident Review

- [ ] Schedule post-mortem meeting
- [ ] Gather timeline and evidence
- [ ] Identify action items

Appendix

Related Runbooks

- [RB-XXX: Related Runbook 1]

- [RB-YYY: Related Runbook 2]

Reference Documentation

- [Link to architecture docs]

- [Link to service docs]

Revision History

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | [Date] | [Name] | Initial version |
| 1.1 | [Date] | [Name] | [Changes] |

Incident Response Runbook Template

# Incident Runbook: [Alert Name]

| Property | Value |
|----------|-------|
| **Alert** | [Alert Name/ID] |
| **Severity** | [SEV1/SEV2/SEV3/SEV4] |
| **Service** | [Service Name] |
| **SLO Impact** | [Which SLO is affected] |

---

## Alert Details

**Trigger Condition:**
```text
[Alert query/condition]
Example: error_rate > 1% for 5 minutes
```

Alert Meaning: [What this alert indicates]

False Positive Indicators: [Signs this might be a false alarm]
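
A trigger condition such as "error_rate > 1% for 5 minutes" only fires when the threshold is exceeded continuously, which is also a quick way to spot a false positive. The sketch below assumes one error-rate sample per minute on standard input, oldest first, expressed in percent; `alert_fires` is a hypothetical helper, not part of any monitoring tool.

```bash
#!/bin/sh
# Exit 0 if the most recent N samples all exceed the threshold, 1 otherwise.
# Arguments: threshold (percent), window size N (number of samples).
# Input: one numeric sample per line, oldest first.
alert_fires() {
    awk -v t="$1" -v n="$2" '
        # Count the current run of consecutive breaches; any good sample resets it.
        { if ($1 + 0 > t + 0) run++; else run = 0 }
        END { exit ((run >= n) ? 0 : 1) }
    '
}
```

For instance, `printf '2\n2\n0.5\n2\n2\n' | alert_fires 1 5` exits non-zero because the third sample breaks the streak.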

Immediate Actions (First 5 Minutes)

1. Acknowledge Alert

# Acknowledge in PagerDuty
pd incident:acknowledge

# Or via Slack
/pd ack

2. Assess Impact

Quick Health Checks:

# Check service status
curl -s https://api.example.com/health | jq .

# Check error rate
kubectl logs -n production -l app=service --tail=100 | grep -c ERROR

# Check pod status
kubectl get pods -n production -l app=service

Impact Assessment:

| Check | Command | Expected | Actual |
|-------|---------|----------|--------|
| Health endpoint | `curl /health` | 200 OK | [Result] |
| Error rate | `grep ERROR` | < 10 | [Result] |
| Pod status | `kubectl get pods` | Running | [Result] |
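
Once the three Actual values are collected, comparing them with the Expected column can be automated. This is a minimal sketch under the thresholds shown in the table; `assess_impact` is a hypothetical helper that takes the HTTP status, the error count, and the pod phase as arguments and returns the number of failed checks as its exit status.

```bash
#!/bin/sh
# Compare observed values against the Expected column of the table.
# Arguments: HTTP status, ERROR line count, pod phase.
# Exit status: number of failed checks (0 means all passed).
assess_impact() {
    http_status=$1
    error_count=$2
    pod_phase=$3
    failures=0
    if [ "$http_status" != "200" ]; then
        echo "FAIL Health endpoint: got $http_status, expected 200"
        failures=$((failures + 1))
    fi
    if [ "$error_count" -ge 10 ]; then
        echo "FAIL Error rate: got $error_count, expected < 10"
        failures=$((failures + 1))
    fi
    if [ "$pod_phase" != "Running" ]; then
        echo "FAIL Pod status: got $pod_phase, expected Running"
        failures=$((failures + 1))
    fi
    if [ "$failures" -eq 0 ]; then
        echo "PASS all checks"
    fi
    return "$failures"
}
```

Returning the failure count lets a calling script distinguish a single degraded signal from a broad outage without parsing the printed lines.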

3. Initial Communication

Post in #incidents:

🔴 INCIDENT: [Service] - [Brief Description]
Severity: [SEV level]
Impact: [User impact]
Status: Investigating
Lead: @[your-name]

Diagnosis

Common Causes and Checks

Cause 1: High Traffic

# Check request rate
kubectl top pods -n production -l app=service

# Check HPA status
kubectl get hpa -n production

If traffic spike confirmed:

- Scale replicas: kubectl scale deployment/service --replicas=10

- Enable rate limiting if available

Cause 2: Database Issues

# Check database connections
kubectl exec -it [pod] -- psql -c "SELECT count(*) FROM pg_stat_activity;"

# Check slow queries
kubectl logs -l app=service | grep "slow query"

If database issues:

- Check connection pool exhaustion

- Look for long-running queries

- Consider read replica failover

Cause 3: Dependency Failure

# Check external dependencies
curl -s https://status.dependency.com/api/v2/status.json | jq .

# Check circuit breaker status
kubectl logs -l app=service | grep "circuit"

If dependency failure:

- Verify external service status

- Check for timeout configuration

- Consider enabling fallback behavior

Resolution Steps

Quick Fixes

| Issue | Quick Fix | Command |
|-------|-----------|---------|
| Pod crash loop | Restart deployment | `kubectl rollout restart deployment/service` |
| Memory pressure | Increase limits | `kubectl edit deployment/service` |
| Config error | Rollback config | `kubectl rollout undo deployment/service` |

Rollback Procedure

# List recent deployments
kubectl rollout history deployment/service -n production

# Rollback to previous version
kubectl rollout undo deployment/service -n production

# Rollback to specific revision
kubectl rollout undo deployment/service -n production --to-revision=2
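
When rolling back to a specific revision, the number has to be read from the history output first. The sketch below extracts the second-newest revision from text piped in on standard input. It assumes the history output lists one revision per line with the number in the first column, oldest first; `previous_revision` is a hypothetical helper.

```bash
#!/bin/sh
# Print the revision immediately before the newest one.
# Input: rollout history text, one revision per line, number in column 1.
# Exit status 1 if fewer than two revisions are present.
previous_revision() {
    awk '
        # Header lines are skipped because their first field is not a number.
        $1 ~ /^[0-9]+$/ { prev = last; last = $1 }
        END { if (prev == "") exit 1; print prev }
    '
}
```

A possible use is `rev=$(kubectl rollout history deployment/service -n production | previous_revision)` followed by passing `$rev` to `--to-revision`.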

Resolution Verification

Verification Checklist:

- [ ] Alert has cleared
- [ ] Health checks passing
- [ ] Error rate below threshold
- [ ] No user complaints in support channels
- [ ] Metrics returning to baseline

Monitoring Period: Monitor for 15 minutes after resolution
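
The monitoring period can be enforced with a small polling loop instead of a manual timer. The sketch below repeats any check command a fixed number of times and stops at the first failure; `monitor_after_resolution` is a hypothetical helper, and the check command and interval are supplied by the caller.

```bash
#!/bin/sh
# Run a check command repeatedly and fail fast on the first failure.
# Arguments: number of checks, seconds between checks, then the command to run.
monitor_after_resolution() {
    checks=$1
    interval=$2
    shift 2
    i=1
    while [ "$i" -le "$checks" ]; do
        if ! "$@"; then
            echo "check $i of $checks failed" >&2
            return 1
        fi
        # Sleep between checks, but not after the final one.
        if [ "$i" -lt "$checks" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
    echo "healthy for $checks consecutive checks"
}
```

For example, `monitor_after_resolution 16 60 curl -sf -o /dev/null https://api.example.com/health` runs 16 checks 60 seconds apart, which spans the 15-minute window.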

Closure

Update Status

✅ RESOLVED: [Service] - [Brief Description]
Duration: [X] minutes
Root Cause: [Brief cause]
Resolution: [What fixed it]
Follow-up: [Any action items]

Post-Incident Tasks

- [ ] Update incident timeline
- [ ] Create post-mortem doc if SEV1/SEV2
- [ ] File tickets for follow-up work
- [ ] Update runbook if needed

Database Failover Runbook

# Runbook: Database Failover

| Property | Value |
|----------|-------|
| **ID** | RB-DB-001 |
| **Category** | Emergency |
| **Service** | PostgreSQL Primary |
| **Owner** | Platform Team |
| **Last Tested** | 2025-01-15 |

---

## Overview

**Purpose:** Failover from primary database to replica when primary is unavailable.

**When to Use:**
- Primary database unresponsive for > 5 minutes
- Primary database corruption detected
- Planned maintenance requiring failover

**Expected Outcome:** Application traffic routed to new primary

**Estimated Duration:** 15-30 minutes

---

## Prerequisites

### Required Access

- [ ] Azure Portal - Contributor on resource group
- [ ] kubectl - cluster-admin
- [ ] Database credentials - postgres superuser

### Pre-Failover Checks

```bash
# Verify replica is healthy and caught up
az postgres flexible-server replica list --resource-group rg-prod --name pg-primary

# Check replication lag
psql -h pg-replica.postgres.database.azure.com -U postgres -c \
  "SELECT pg_last_wal_receive_lsn() - pg_last_wal_replay_lsn() AS lag_bytes;"
```

Acceptable lag: < 1MB
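
The lag threshold can serve as a hard gate before promotion. The sketch below assumes "1MB" means 1048576 bytes and that the lag value has already been read as a bare integer; `lag_ok` is a hypothetical helper. An empty or non-numeric value (for example a NULL result) is treated as an error rather than as zero lag.

```bash
#!/bin/sh
# Exit 0 if replication lag is below the limit, 1 if not, 2 if the value is unusable.
# Arguments: lag in bytes, optional limit in bytes (default 1048576, assumed to be "1MB").
lag_ok() {
    lag_bytes=$1
    max_bytes=${2:-1048576}
    case "$lag_bytes" in
        ''|*[!0-9]*)
            echo "lag is not a non-negative integer: '$lag_bytes'" >&2
            return 2
            ;;
    esac
    [ "$lag_bytes" -lt "$max_bytes" ]
}
```

The value can be captured by adding the `-At` flags to the `psql` command above so that only the number is printed, then calling `lag_ok "$lag" || echo "do not promote yet"`.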

Failover Procedure

Step 1: Confirm Primary is Unavailable

# Test primary connectivity
psql -h pg-primary.postgres.database.azure.com -U postgres -c "SELECT 1;"

# Check Azure status
az postgres flexible-server show --resource-group rg-prod --name pg-primary --query "state"

Expected: Connection timeout or error state

Step 2: Notify Stakeholders

🔴 DATABASE FAILOVER INITIATED
Target: pg-primary → pg-replica
Reason: [Primary unavailable/Maintenance/etc.]
Expected Downtime: 5-10 minutes

Step 3: Promote Replica

# Promote replica to primary (Azure Flexible Server)
az postgres flexible-server replica stop-replication \
  --resource-group rg-prod \
  --name pg-replica

# Verify promotion
az postgres flexible-server show \
  --resource-group rg-prod \
  --name pg-replica \
  --query "replicationRole"

Expected: replicationRole: None (standalone)

Step 4: Update Connection Strings

# Update Kubernetes secret
kubectl create secret generic db-connection \
  --from-literal=host=pg-replica.postgres.database.azure.com \
  -n production \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications to pick up new connection
kubectl rollout restart deployment -l uses-database=true -n production

Step 5: Verify Application Connectivity

# Check application logs
kubectl logs -n production -l app=api-service --tail=50 | grep -i database

# Test application health
curl -s https://api.example.com/health | jq .database

Post-Failover

Immediate Tasks

- [ ] Verify all applications connected to new primary
- [ ] Check for data consistency
- [ ] Monitor error rates

Recovery Tasks (Next 24 Hours)

- [ ] Investigate original primary failure
- [ ] Create new replica from new primary
- [ ] Update DNS/connection strings permanently
- [ ] Document incident and learnings

Rollback

If failover causes issues:

# If original primary is recoverable
# Stop writes to new primary
kubectl scale deployment --replicas=0 -l uses-database=true -n production

# Start the original primary again (assumes the server was stopped, not deleted)
az postgres flexible-server start --resource-group rg-prod --name pg-primary

# Revert connection strings
kubectl create secret generic db-connection \
  --from-literal=host=pg-primary.postgres.database.azure.com \
  -n production \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart applications
kubectl rollout restart deployment -l uses-database=true -n production

Runbook Quality Checklist

| Criterion | Description | Check |
|-----------|-------------|-------|
| Actionable | Every step has a specific action | [ ] |
| Testable | Can be practiced in non-prod | [ ] |
| Current | Reflects current system state | [ ] |
| Complete | Covers happy and error paths | [ ] |
| Accessible | Available during incidents | [ ] |
| Versioned | Changes tracked with dates | [ ] |
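
Part of the "Complete" and "Versioned" criteria can be checked automatically by confirming that a runbook file contains the sections of the standard template. The sketch below assumes each section title sits on its own line, with or without leading `#` markers; `missing_sections` is a hypothetical helper, and the section list is one possible selection from the template.

```bash
#!/bin/sh
# Print the template sections that a runbook file does not contain.
# Argument: path to the runbook file. Prints nothing when all sections are present.
missing_sections() {
    file=$1
    for section in Overview Prerequisites Procedure Troubleshooting \
                   Escalation Communication "Revision History"; do
        # Match the title alone on a line, optionally prefixed by markdown heading markers.
        grep -q -i -E "^#* *${section}\$" "$file" || echo "$section"
    done
}
```

Running it across a runbook directory before publishing gives a quick list of drafts that still lack an escalation path or revision history.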

Workflow

When creating runbooks:

- Identify Need: What operation/incident needs documentation?

- Gather Information: Interview operators, review past incidents

- Draft Runbook: Use appropriate template

- Validate Steps: Walk through with subject matter expert

- Test in Non-Prod: Execute runbook in staging

- Publish: Add to runbook collection

- Train Team: Ensure operators know where to find it

- Maintain: Review and update regularly

Last Updated: 2025-12-26
