Deployment Runbook
Overview
This skill provides deployment procedures, automated health checks, and rollback strategies to ensure safe, reliable deployments. Use it to standardize deployment workflows and reduce deployment-related incidents.
When to Use This Skill
- Planning production deployments
- Executing staged rollouts (canary, blue-green)
- Performing post-deployment health checks
- Rolling back failed deployments
- Troubleshooting deployment issues
- Establishing deployment best practices
- Complementing the devops-automation agent for deployments
Pre-Deployment Checklist
Before any production deployment:
- Code Review: All changes reviewed and approved
- Tests Pass: CI/CD pipeline green
  - Unit tests: ✓
  - Integration tests: ✓
  - E2E tests: ✓
- Database Migrations: Tested in staging
  - Backward compatible
  - Rollback script prepared
- Configuration: Environment variables verified
  - Secrets rotated if needed
  - Feature flags configured
- Monitoring: Dashboards and alerts ready
  - Error tracking enabled
  - Performance monitoring active
  - Log aggregation configured
- Communication: Stakeholders notified
  - Deployment window announced
  - On-call engineer assigned
  - Rollback plan documented
- Backups: Recent backup verified
  - Database backed up < 1 hour ago
  - Backup restoration tested
- Capacity: Resources scaled appropriately
  - Auto-scaling configured
  - Rate limits reviewed
  - CDN cache warmed
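The checklist can also be enforced as a gate before the deployment starts. The following is a minimal sketch, not part of the included scripts: the names `CHECKLIST`, `unmet_items`, and `backup_is_fresh` are hypothetical, and the one-hour limit is taken from the "Database backed up < 1 hour ago" item above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Top-level checklist items from this runbook (hypothetical constant name).
CHECKLIST = [
    "Code Review", "Tests Pass", "Database Migrations", "Configuration",
    "Monitoring", "Communication", "Backups", "Capacity",
]

MAX_BACKUP_AGE = timedelta(hours=1)  # "Database backed up < 1 hour ago"


def unmet_items(status: dict[str, bool]) -> list[str]:
    """Return checklist items that are missing or not marked as done."""
    return [item for item in CHECKLIST if not status.get(item, False)]


def backup_is_fresh(backup_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the backup finished less than MAX_BACKUP_AGE before `now`."""
    now = now or datetime.now(timezone.utc)
    age = now - backup_time
    # A negative age means the timestamp is in the future (clock skew or bad
    # metadata), so it is treated as not fresh.
    return timedelta(0) <= age < MAX_BACKUP_AGE
```

A deployment script could refuse to continue while `unmet_items(...)` is non-empty or `backup_is_fresh(...)` is false.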
Deployment Strategies
- Blue-Green Deployment
Best for: Zero-downtime deployments, easy rollbacks
Process:
- Deploy to inactive (green) environment
- Run health checks on green
- Switch traffic from blue to green
- Monitor for issues
- Keep blue as instant rollback option
Commands:
Deploy to green environment
./deploy.sh --env green
Run health checks
python3 scripts/health_check.py --env green
Switch traffic (gradual)
./switch_traffic.sh --from blue --to green --percentage 10
./switch_traffic.sh --from blue --to green --percentage 50
./switch_traffic.sh --from blue --to green --percentage 100
If issues: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
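The gradual switch above can be automated so that each traffic step is followed by a health check, with an immediate rollback on the first failure. This is a sketch under assumptions: `staged_switch` is a hypothetical helper, and the three callbacks stand in for whatever wraps `switch_traffic.sh` and `health_check.py` in your environment.

```python
from typing import Callable, Sequence


def staged_switch(
    shift: Callable[[int], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
    stages: Sequence[int] = (10, 50, 100),
) -> bool:
    """Shift traffic stage by stage; roll back and stop at the first failed check.

    Returns True if every stage passed, False if a rollback was triggered.
    """
    for percentage in stages:
        shift(percentage)      # e.g. move `percentage` % of traffic to green
        if not healthy():      # e.g. run the health checks against green
            rollback()         # e.g. send 100 % of traffic back to blue
            return False
    return True
```

In practice `shift` could wrap a call such as `subprocess.run(["./switch_traffic.sh", "--from", "blue", "--to", "green", "--percentage", str(p)], check=True)`. Passing the steps in as callbacks keeps the control flow testable without touching real traffic.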
- Canary Deployment
Best for: Risk-averse deployments, gradual rollouts
Process:
- Deploy to small subset of servers (5-10%)
- Monitor metrics closely
- Gradually increase percentage
- Roll back if metrics degrade
Monitoring During Canary:
- Error rate < baseline + 1%
- Response time < baseline + 10%
- Success rate > 99.9%
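The canary thresholds can be expressed as a single comparison against baseline metrics. The sketch below makes two interpretive assumptions that the runbook does not spell out: "baseline + 1%" for error rate is read as one percentage point above baseline, and "baseline + 10%" for response time is read as 10% slower than baseline. `Metrics` and `canary_healthy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # percent of failed requests, e.g. 0.05 means 0.05%
    response_time_ms: float  # response time in milliseconds
    success_rate: float      # percent of successful requests


def canary_healthy(baseline: Metrics, canary: Metrics) -> bool:
    """Return True if the canary satisfies all three thresholds."""
    return (
        # Assumption: "+ 1%" means one percentage point above baseline.
        canary.error_rate < baseline.error_rate + 1.0
        # Assumption: "+ 10%" means 10% relative to the baseline latency.
        and canary.response_time_ms < baseline.response_time_ms * 1.10
        and canary.success_rate > 99.9
    )
```

If your team defines the thresholds differently (for example, a relative error-rate increase), adjust the comparisons accordingly.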
- Rolling Deployment
Best for: Standard updates, resource-constrained environments
Process:
- Take one instance out of load balancer
- Deploy new version
- Run health checks
- Add back to load balancer
- Repeat for remaining instances
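The rolling process above is a loop over instances. The following sketch shows one way to structure it; `rolling_deploy` and its callback parameters are hypothetical, and the choice to stop at the first failed health check (leaving the failed instance out of rotation) is an assumption rather than something this runbook prescribes.

```python
from typing import Callable, Iterable


def rolling_deploy(
    instances: Iterable[str],
    drain: Callable[[str], None],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    restore: Callable[[str], None],
) -> list[str]:
    """Update instances one at a time; stop at the first failed health check."""
    updated: list[str] = []
    for instance in instances:
        drain(instance)       # take the instance out of the load balancer
        deploy(instance)      # deploy the new version
        if not healthy(instance):
            # The failed instance stays out of rotation; remaining instances
            # keep serving the old version.
            raise RuntimeError(
                f"health check failed on {instance}; updated so far: {updated}"
            )
        restore(instance)     # add the instance back to the load balancer
        updated.append(instance)
    return updated
```

Stopping early limits the blast radius: at most one instance is ever running an unverified build.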
Deployment Workflow
Phase 1: Pre-Deployment (T-30 minutes)
1. Verify staging environment
./verify_staging.sh
2. Create deployment tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3
3. Trigger production build
./build_production.sh --tag v1.2.3
4. Backup database
./backup_db.sh --environment production
5. Notify team
./notify_slack.sh "🚀 Starting deployment v1.2.3 in 30 minutes"
Phase 2: Deployment (T-0)
1. Enable maintenance mode (if needed)
./maintenance_mode.sh --enable
2. Run database migrations
./run_migrations.sh --environment production
3. Deploy application
./deploy.sh --environment production --version v1.2.3
4. Disable maintenance mode
./maintenance_mode.sh --disable
Phase 3: Post-Deployment Health Checks
Run comprehensive health checks
python3 scripts/health_check.py --env production
Expected output:
✓ API health endpoint responding
✓ Database connectivity OK
✓ Cache layer accessible
✓ External services reachable
✓ Error rate within threshold
✓ Response time within SLA
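For readers who want to see how output of this shape could be produced, here is a minimal sketch of a check runner. It is not the implementation of `scripts/health_check.py`; `run_checks` is a hypothetical name, and the individual checks are supplied by the caller.

```python
from typing import Callable, Mapping


def run_checks(checks: Mapping[str, Callable[[], bool]]) -> tuple[list[str], bool]:
    """Run each named check and return (report lines, overall pass/fail)."""
    lines: list[str] = []
    all_ok = True
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises (e.g. a refused connection) counts as failed.
            ok = False
        lines.append(f"{'✓' if ok else '✗'} {name}")
        all_ok = all_ok and ok
    return lines, all_ok
```

A wrapper script could print the lines and exit non-zero when the overall result is false, so that a pipeline stops on any failed check.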
Phase 4: Monitoring (T+30 minutes)
Monitor these metrics:
Application Metrics:
- Error rate: < 0.1%
- Response time (p95): < 200ms
- Request throughput: within expected range
- Success rate: > 99.9%
Infrastructure Metrics:
- CPU utilization: < 70%
- Memory usage: < 80%
- Disk I/O: normal patterns
- Network latency: < 50ms
Business Metrics:
- Conversion rate: no significant drop
- User signups: within expected range
- Transaction volume: normal patterns
Rollback Procedures
When to Rollback
Rollback immediately if:
- Error rate > 1%
- Critical functionality broken
- Data corruption detected
- Security vulnerability introduced
- Performance degradation > 50%
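These triggers can be encoded so that the rollback decision is made consistently rather than debated during an incident. The sketch below is illustrative: `should_rollback` is a hypothetical function, and "performance degradation > 50%" is interpreted here as p95 response time more than 50% above its pre-deployment baseline, which is an assumption.

```python
def should_rollback(
    error_rate_pct: float,
    baseline_p95_ms: float,
    current_p95_ms: float,
    critical_broken: bool = False,
    data_corruption: bool = False,
    security_issue: bool = False,
) -> bool:
    """Return True if any immediate-rollback trigger from the runbook fires."""
    # Assumption: degradation is measured as p95 latency relative to baseline.
    degraded = current_p95_ms > baseline_p95_ms * 1.5
    return (
        error_rate_pct > 1.0
        or degraded
        or critical_broken
        or data_corruption
        or security_issue
    )
```

For example, the 2.1% error rate used in the rollback communication template would trigger a rollback on the first condition alone.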
Rollback Methods
Method 1: Traffic Switch (Fastest)
Blue-green: instant rollback
./switch_traffic.sh --from green --to blue --percentage 100
Verification
python3 scripts/health_check.py --env production
Method 2: Version Revert
Deploy previous version
./deploy.sh --environment production --version v1.2.2
Run health checks
python3 scripts/health_check.py --env production
Method 3: Database Rollback
If migrations were applied
./rollback_migration.sh --environment production --steps 1
Restore from backup (last resort)
./restore_db.sh --backup latest --environment production
Post-Rollback
Verify system health
python3 scripts/health_check.py --env production
Notify stakeholders
./notify_slack.sh "⚠️ Deployment v1.2.3 rolled back. System stable on v1.2.2"
Create postmortem
- What went wrong?
- Why didn't we catch it?
- How do we prevent recurrence?
Health Check Script
Use the included health check script:
Run all checks
python3 scripts/health_check.py --env production
Run specific check
python3 scripts/health_check.py --env production --check api
Verbose output
python3 scripts/health_check.py --env production --verbose
See scripts/health_check.py for implementation.
Troubleshooting Guide
Issue: Deployment Hangs
Symptoms:
- Deployment script doesn't complete
- Services not starting
Diagnosis:
Check service logs
kubectl logs -f deployment/app-name
Check events
kubectl get events --sort-by='.lastTimestamp'
Resolution:
- Increase timeout values
- Check resource constraints
- Verify image pull secrets
Issue: High Error Rate Post-Deployment
Symptoms:
- Error rate spike
- 500 errors in logs
Diagnosis:
Check application logs
tail -f /var/log/app/error.log
Check error distribution
grep "ERROR" /var/log/app/* | awk '{print $NF}' | sort | uniq -c | sort -nr
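The error-distribution pipeline above groups error lines by their last whitespace-separated field and sorts the groups by count. The same tally can be sketched in Python for cases where the log lines are already in memory or need further processing. `error_distribution` is a hypothetical helper, and it assumes, like the `awk` command, that the last field of each line identifies the error.

```python
from collections import Counter
from typing import Iterable


def error_distribution(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Count ERROR lines by their last field, most frequent first."""
    counts = Counter(
        line.split()[-1]          # last field, like awk's $NF
        for line in lines
        if "ERROR" in line        # same filter as grep "ERROR"
    )
    return counts.most_common()   # sorted by count, descending
```

To run it against a file, pass an open file object, for example `error_distribution(open("/var/log/app/error.log"))`.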
Resolution:
- Check configuration changes
- Verify environment variables
- Review recent code changes
- Consider immediate rollback
Issue: Database Connection Failures
Symptoms:
- "Connection refused" errors
- Timeout errors
Diagnosis:
Test database connectivity
python3 scripts/test_db_connection.py
Check connection pool
psql -h db-host -U user -c "SELECT * FROM pg_stat_activity;"
Resolution:
- Verify connection strings
- Check firewall rules
- Increase connection pool size
- Verify credentials
Communication Templates
Pre-Deployment Announcement
🚀 Production Deployment Scheduled
Version: v1.2.3
Time: 2024-01-15 14:00 UTC (in 30 minutes)
Duration: ~15 minutes
Impact: No expected downtime
Changes:
- Feature: New user dashboard
- Fix: Payment processing bug
- Performance: API response time improvements
Rollback Plan: Blue-green switch (instant)
On-Call: @engineer-name
Deployment Success
✅ Deployment Complete
Version: v1.2.3
Status: Successful
Duration: 12 minutes
Health Checks: All passing ✓
Metrics: Within normal range
Next Check: T+30 minutes
Monitoring dashboard: [link]
Deployment Rollback
⚠️ Deployment Rolled Back
Version: v1.2.3 → v1.2.2 (rollback)
Reason: Elevated error rate (2.1%)
Status: System stable on v1.2.2
Action Items:
- Root cause analysis
- Fix identified issue
- Re-test in staging
- Schedule re-deployment
Incident report: [link]
Resources
scripts/
- health_check.py: Comprehensive deployment health checks
- test_db_connection.py: Database connectivity verification
references/
- deployment-checklist.md: Detailed pre/post deployment checklist
- monitoring-guide.md: Metrics to monitor during deployments
Best Practices
- Always deploy during low-traffic windows
- Never deploy on Fridays (unless critical hotfix)
- Keep deployments small (< 200 lines changed)
- Monitor for 30+ minutes post-deployment
- Document every rollback with postmortem
- Test rollback procedure in staging first
- Use feature flags for risky changes
- Automate health checks (don't rely on manual verification)
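The "keep deployments small" guideline can be checked automatically from the summary line that `git diff --shortstat` prints (for example, between the previous and the new release tag). The sketch below only parses that summary text; `lines_changed` and `is_small_change` are hypothetical helpers, and counting insertions plus deletions as "lines changed" is an assumption.

```python
import re

MAX_LINES_CHANGED = 200  # from the guideline "< 200 lines changed"


def lines_changed(shortstat: str) -> int:
    """Sum insertions and deletions from a `git diff --shortstat` summary line."""
    insertions = re.search(r"(\d+) insertions?\(\+\)", shortstat)
    deletions = re.search(r"(\d+) deletions?\(-\)", shortstat)
    # Either part may be absent (e.g. a change with insertions only).
    return sum(int(m.group(1)) for m in (insertions, deletions) if m)


def is_small_change(shortstat: str) -> bool:
    """Return True if the change is under the runbook's size guideline."""
    return lines_changed(shortstat) < MAX_LINES_CHANGED
```

A CI step could feed the output of `git diff --shortstat v1.2.2 v1.2.3` into `is_small_change` and warn when the release exceeds the guideline.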
Quick Reference
Emergency Rollback:
./switch_traffic.sh --from green --to blue --percentage 100
Health Check:
python3 scripts/health_check.py --env production
View Logs:
kubectl logs -f deployment/app-name --tail=100
Check Metrics: