Deployment Runbook

Overview

This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.

When to Use This Skill

Planning production deployments
Executing staged rollouts (canary, blue-green)
Performing post-deployment health checks
Rolling back failed deployments
Troubleshooting deployment issues
Establishing deployment best practices
Complementing the devops-automation agent for deployments

Pre-Deployment Checklist

Before any production deployment:

Code Review: All changes reviewed and approved
Tests Pass: CI/CD pipeline green
  Unit tests: ✓
  Integration tests: ✓
  E2E tests: ✓
Database Migrations: Tested in staging
  Backward compatible
  Rollback script prepared
Configuration: Environment variables verified
  Secrets rotated if needed
  Feature flags configured
Monitoring: Dashboards and alerts ready
  Error tracking enabled
  Performance monitoring active
  Log aggregation configured
Communication: Stakeholders notified
  Deployment window announced
  On-call engineer assigned
  Rollback plan documented
Backups: Recent backup verified
  Database backed up < 1 hour ago
  Backup restoration tested
Capacity: Resources scaled appropriately
  Auto-scaling configured
  Rate limits reviewed
  CDN cache warmed
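
The "Database backed up < 1 hour ago" item is easy to automate. The sketch below is one possible check, not part of the included scripts: the function name `backup_is_fresh` is hypothetical, and it assumes you can obtain the backup's completion time as a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone

# Limit taken from the checklist: "Database backed up < 1 hour ago".
MAX_BACKUP_AGE = timedelta(hours=1)


def backup_is_fresh(backup_time, now=None, max_age=MAX_BACKUP_AGE):
    """Return True if the backup finished less than max_age before now.

    backup_time and now must be timezone-aware datetimes.
    A backup timestamp in the future is treated as invalid.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - backup_time
    return timedelta(0) <= age < max_age
```

A pre-deployment script could call this and abort when it returns False.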

Deployment Strategies

Blue-Green Deployment

Best for: Zero-downtime deployments, easy rollbacks

Process:

Deploy to inactive (green) environment
Run health checks on green
Switch traffic from blue to green
Monitor for issues
Keep blue as instant rollback option

Commands:

Deploy to green environment

./deploy.sh --env green

Run health checks

python3 scripts/health_check.py --env green

Switch traffic (gradual)

./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100

If issues: instant rollback

./switch_traffic.sh --from green --to blue --percentage 100
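
The gradual switch and the instant rollback can be tied together in a small driver. This is a minimal sketch under assumptions: `gradual_switch` is a hypothetical helper, and the two callables stand in for whatever invokes `switch_traffic.sh` and the health check in your environment. Checking health after every step is an added safeguard, not something the commands above do on their own.

```python
from typing import Callable, Sequence


def gradual_switch(
    shift_traffic: Callable[[str, str, int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (10, 50, 100),
) -> bool:
    """Shift traffic from blue to green in steps.

    After each step the health check runs; on the first failure all
    traffic is sent back to blue and False is returned.
    """
    for percentage in steps:
        shift_traffic("blue", "green", percentage)
        if not is_healthy():
            # Instant rollback: route everything back to blue.
            shift_traffic("green", "blue", 100)
            return False
    return True
```

In practice `shift_traffic` could wrap `subprocess.run` around `./switch_traffic.sh` with the `--from`, `--to`, and `--percentage` flags shown above.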

Canary Deployment

Best for: Risk-averse deployments, gradual rollouts

Process:

Deploy to small subset of servers (5-10%)
Monitor metrics closely
Gradually increase percentage
Roll back if metrics degrade

Monitoring During Canary:

Error rate < baseline + 1%
Response time < baseline + 10%
Success rate > 99.9%
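
These thresholds can be evaluated mechanically at each canary step. The sketch below assumes one particular reading of them: "baseline + 1%" for error rate means one percentage point above baseline, and "baseline + 10%" for response time means 10% slower than baseline. Adjust if your team defines them differently. The `Metrics` class and `canary_violations` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # percent, e.g. 0.05 means 0.05%
    response_time_ms: float  # e.g. p95 latency in milliseconds
    success_rate: float      # percent


def canary_violations(baseline: Metrics, canary: Metrics) -> list[str]:
    """Return the names of canary metrics that miss their threshold."""
    violations = []
    # Assumed reading: at most one percentage point above baseline.
    if canary.error_rate >= baseline.error_rate + 1.0:
        violations.append("error_rate")
    # Assumed reading: at most 10% slower than baseline.
    if canary.response_time_ms >= baseline.response_time_ms * 1.10:
        violations.append("response_time")
    if canary.success_rate <= 99.9:
        violations.append("success_rate")
    return violations
```

An empty list means the canary may proceed to the next percentage; any entry is a signal to roll back.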

Rolling Deployment

Best for: Standard updates, resource-constrained environments

Process:

Take one instance out of load balancer
Deploy new version
Run health checks
Add back to load balancer
Repeat for remaining instances
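
The rolling process is a simple loop over instances. The following sketch makes that explicit; `rolling_deploy` and its four callables are hypothetical placeholders for your load balancer and deploy tooling. It stops at the first failed health check and leaves the failing instance out of the load balancer, which is one reasonable policy but not the only one.

```python
from typing import Callable, Iterable


def rolling_deploy(
    instances: Iterable[str],
    drain: Callable[[str], None],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> list[str]:
    """Update instances one at a time and return those updated.

    Raises RuntimeError on the first failed health check so the
    remaining instances keep serving the old version.
    """
    updated = []
    for instance in instances:
        drain(instance)      # take out of the load balancer
        deploy(instance)     # install the new version
        if not is_healthy(instance):
            raise RuntimeError(f"health check failed on {instance}; rollout stopped")
        restore(instance)    # add back to the load balancer
        updated.append(instance)
    return updated
```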

Deployment Workflow

Phase 1: Pre-Deployment (T-30 minutes)

1. Verify staging environment

./verify_staging.sh

2. Create deployment tag

git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3

3. Trigger production build

./build_production.sh --tag v1.2.3

4. Backup database

./backup_db.sh --environment production

5. Notify team

./notify_slack.sh "🚀 Starting deployment v1.2.3 in 30 minutes"

Phase 2: Deployment (T-0)

1. Enable maintenance mode (if needed)

./maintenance_mode.sh --enable

2. Run database migrations

./run_migrations.sh --environment production

3. Deploy application

./deploy.sh --environment production --version v1.2.3

4. Disable maintenance mode

./maintenance_mode.sh --disable

Phase 3: Post-Deployment Health Checks

Run comprehensive health checks

python3 scripts/health_check.py --env production

Expected output:

✓ API health endpoint responding

✓ Database connectivity OK

✓ Cache layer accessible

✓ External services reachable

✓ Error rate within threshold

✓ Response time within SLA

Phase 4: Monitoring (T+30 minutes)

Monitor these metrics:

Application Metrics:

Error rate: < 0.1%
Response time (p95): < 200ms
Request throughput: within expected range
Success rate: > 99.9%

Infrastructure Metrics:

CPU utilization: < 70%
Memory usage: < 80%
Disk I/O: normal patterns
Network latency: < 50ms

Business Metrics:

Conversion rate: no significant drop
User signups: within expected range
Transaction volume: normal patterns
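
The numeric thresholds above lend themselves to a table-driven check. This is an illustrative sketch only: the metric key names and the `breached` function are hypothetical, and the qualitative items (throughput, disk I/O, business metrics) are left out because they have no fixed limit.

```python
import operator

# (comparison, limit) pairs copied from the lists above.
# The key names are hypothetical; map them to your monitoring system.
THRESHOLDS = {
    "error_rate_pct": ("<", 0.1),
    "p95_response_ms": ("<", 200),
    "success_rate_pct": (">", 99.9),
    "cpu_pct": ("<", 70),
    "memory_pct": ("<", 80),
    "network_latency_ms": ("<", 50),
}

_OPS = {"<": operator.lt, ">": operator.gt}


def breached(sample: dict) -> list[str]:
    """Return the metrics in sample that miss their threshold.

    Metrics absent from sample are skipped rather than failed.
    """
    failures = []
    for name, (op, limit) in THRESHOLDS.items():
        if name in sample and not _OPS[op](sample[name], limit):
            failures.append(name)
    return failures
```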

Rollback Procedures

When to Roll Back

Roll back immediately if:

Error rate > 1%
Critical functionality broken
Data corruption detected
Security vulnerability introduced
Performance degradation > 50%
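
Encoding these criteria removes hesitation during an incident. The sketch below is one way to do that; `DeploymentStatus` and `rollback_reasons` are hypothetical names, and it assumes "performance degradation" is measured as the relative increase in p95 response time over the pre-deployment baseline.

```python
from dataclasses import dataclass


@dataclass
class DeploymentStatus:
    error_rate_pct: float
    baseline_p95_ms: float
    current_p95_ms: float
    critical_functionality_broken: bool = False
    data_corruption_detected: bool = False
    security_vulnerability_introduced: bool = False


def rollback_reasons(status: DeploymentStatus) -> list[str]:
    """Return every rollback criterion that is currently met."""
    reasons = []
    if status.error_rate_pct > 1.0:
        reasons.append("error rate > 1%")
    if status.critical_functionality_broken:
        reasons.append("critical functionality broken")
    if status.data_corruption_detected:
        reasons.append("data corruption detected")
    if status.security_vulnerability_introduced:
        reasons.append("security vulnerability introduced")
    # Assumption: degradation = relative increase in p95 latency.
    if status.baseline_p95_ms > 0:
        slowdown = (status.current_p95_ms - status.baseline_p95_ms) / status.baseline_p95_ms
        if slowdown > 0.5:
            reasons.append("performance degradation > 50%")
    return reasons
```

Any non-empty result means roll back; the returned strings can go straight into the rollback notification.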

Rollback Methods

Method 1: Traffic Switch (Fastest)

Blue-green: instant rollback

./switch_traffic.sh --from green --to blue --percentage 100

Verification

python3 scripts/health_check.py --env production

Method 2: Version Revert

Deploy previous version

./deploy.sh --environment production --version v1.2.2

Run health checks

python3 scripts/health_check.py --env production

Method 3: Database Rollback

If migrations were applied

./rollback_migration.sh --environment production --steps 1

Restore from backup (last resort)

./restore_db.sh --backup latest --environment production

Post-Rollback

Verify system health

python3 scripts/health_check.py --env production

Notify stakeholders

./notify_slack.sh "⚠️ Deployment v1.2.3 rolled back. System stable on v1.2.2"

Create postmortem

What went wrong?
Why didn't we catch it?
How do we prevent recurrence?

Health Check Script

Use the included health check script:

Run all checks

python3 scripts/health_check.py --env production

Run specific check

python3 scripts/health_check.py --env production --check api

Verbose output

python3 scripts/health_check.py --env production --verbose

See scripts/health_check.py for implementation.
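
When the health check is called from other automation, building the command in one place keeps the flags consistent. The sketch below uses only the flags shown above (`--env`, `--check`, `--verbose`). The wrapper functions are hypothetical, and `run_health_check` assumes the script exits with status 0 only when every check passes, which you should confirm against `scripts/health_check.py`.

```python
import subprocess
from typing import Optional


def health_check_command(env: str, check: Optional[str] = None, verbose: bool = False) -> list[str]:
    """Build the argument list for scripts/health_check.py."""
    cmd = ["python3", "scripts/health_check.py", "--env", env]
    if check:
        cmd += ["--check", check]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_health_check(env: str, check: Optional[str] = None, verbose: bool = False) -> bool:
    """Run the health check and report success.

    Assumption: exit status 0 means all checks passed.
    """
    result = subprocess.run(health_check_command(env, check, verbose))
    return result.returncode == 0
```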

Troubleshooting Guide

Issue: Deployment Hangs

Symptoms:

Deployment script doesn't complete
Services not starting

Diagnosis:

Check service logs

kubectl logs -f deployment/app-name

Check events

kubectl get events --sort-by='.lastTimestamp'

Resolution:

Increase timeout values
Check resource constraints
Verify image pull secrets

Issue: High Error Rate Post-Deployment

Symptoms:

Error rate spike
500 errors in logs

Diagnosis:

Check application logs

tail -f /var/log/app/error.log

Check error distribution

grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
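
The same tally can be produced in Python when the logs are already being read by a script. This sketch mirrors the pipeline above: it keeps lines containing "ERROR", takes the last whitespace-separated field (what `awk '{print $NF}'` prints), and counts occurrences in descending order. The function name `error_distribution` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def error_distribution(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Count the last field of each ERROR line, most frequent first."""
    counts = Counter()
    for line in lines:
        if "ERROR" in line:
            fields = line.split()
            if fields:
                counts[fields[-1]] += 1
    return counts.most_common()
```

For example, `error_distribution(open("/var/log/app/error.log"))` would process the log file named in the diagnosis step.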

Resolution:

Check configuration changes
Verify environment variables
Review recent code changes
Consider immediate rollback

Issue: Database Connection Failures

Symptoms:

"Connection refused" errors
Timeout errors

Diagnosis:

Test database connectivity

python3 scripts/test_db_connection.py

Check connection pool

psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"

Resolution:

Verify connection strings
Check firewall rules
Increase connection pool size
Verify credentials

Communication Templates

Pre-Deployment Announcement

🚀 Production Deployment Scheduled

Version: v1.2.3
Time: 2024-01-15 14:00 UTC (30 minutes)
Duration: ~15 minutes
Impact: No expected downtime

Changes:

Feature: New user dashboard
Fix: Payment processing bug
Performance: API response time improvements

Rollback Plan: Blue-green switch (instant)
On-Call: @engineer-name

Deployment Success

✅ Deployment Complete

Version: v1.2.3
Status: Successful
Duration: 12 minutes

Health Checks: All passing ✓
Metrics: Within normal range
Next Check: T+30 minutes

Monitoring dashboard: [link]

Deployment Rollback

⚠️ Deployment Rolled Back

Version: v1.2.3 → v1.2.2 (rollback)
Reason: Elevated error rate (2.1%)
Status: System stable on v1.2.2

Action Items:

Root cause analysis
Fix identified issue
Re-test in staging
Schedule re-deployment

Incident report: [link]

Resources

scripts/

health_check.py: Comprehensive deployment health checks
test_db_connection.py: Database connectivity verification

references/

deployment-checklist.md: Detailed pre/post deployment checklist
monitoring-guide.md: Metrics to monitor during deployments

Best Practices

Always deploy during low-traffic windows
Never deploy on Fridays (unless critical hotfix)
Keep deployments small (< 200 lines changed)
Monitor for 30+ minutes post-deployment
Document every rollback with postmortem
Test rollback procedure in staging first
Use feature flags for risky changes
Automate health checks (don't rely on manual verification)
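
The "< 200 lines changed" guideline can be enforced in CI. The sketch below assumes the input is the text output of `git diff --numstat` (tab-separated added count, deleted count, and path, with `-` for binary files) and counts added plus deleted lines; whether that matches your team's definition of "lines changed" is an assumption. The function name `lines_changed` is hypothetical.

```python
MAX_LINES_CHANGED = 200  # from the guideline above


def lines_changed(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for row in numstat_output.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # not a numstat row
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary file: git reports no line counts
        total += int(added) + int(deleted)
    return total
```

A pipeline step could fail or warn when `lines_changed(output) >= MAX_LINES_CHANGED`.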

Quick Reference

Emergency Rollback:

./switch_traffic.sh --from green --to blue --percentage 100

Health Check:

python3 scripts/health_check.py --env production

View Logs:

kubectl logs -f deployment/app-name --tail=100

Check Metrics:

curl https://metrics.example.com/api/health

