On-Call Handoff Patterns

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "on-call-handoff-patterns" with this command: npx skills add wshobson/agents/wshobson-agents-on-call-handoff-patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

  • Transitioning on-call responsibilities

  • Writing shift handoff summaries

  • Documenting ongoing investigations

  • Establishing on-call rotation procedures

  • Improving handoff quality

  • Onboarding new on-call engineers

Core Concepts

  1. Handoff Components

| Component | Purpose |
|---|---|
| Active Incidents | What's currently broken |
| Ongoing Investigations | Issues being debugged |
| Recent Changes | Deployments, configs |
| Known Issues | Workarounds in place |
| Upcoming Events | Maintenance, releases |
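The five components listed above can be treated as required fields of a handoff record. The following is a minimal sketch, not part of the upstream skill: the class name `HandoffComponents`, its field names, and the helper `missing_components` are all hypothetical. It assumes that `None` means "not filled in yet" while an empty list means "checked, nothing to report", so that an explicit "None currently active" still counts as addressed.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HandoffComponents:
    """Hypothetical record with one field per handoff component."""

    # None = not filled in yet; [] = checked, nothing to report.
    active_incidents: Optional[List[str]] = None
    ongoing_investigations: Optional[List[str]] = None
    recent_changes: Optional[List[str]] = None
    known_issues: Optional[List[str]] = None
    upcoming_events: Optional[List[str]] = None


def missing_components(handoff: HandoffComponents) -> List[str]:
    """Return the names of components that have not been addressed at all."""
    return [f.name for f in fields(handoff) if getattr(handoff, f.name) is None]


draft = HandoffComponents(
    active_incidents=[],  # explicitly "none currently active"
    ongoing_investigations=["ENG-1234: intermittent API timeouts"],
    recent_changes=["api-gateway v3.2.1 deployed"],
)
print(missing_components(draft))  # ['known_issues', 'upcoming_events']
```

A check like this could run before the sync call so that an omitted section is not mistaken for "nothing to report".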

  2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup
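Adding up the time budgets above is a quick way to check them against the recommended overlap. The sketch below uses only the durations given in the timing breakdown; the names `OVERLAP_MIN`, `total_minutes`, and `overrun` are hypothetical. It shows that the incoming engineer's steps total 35 minutes, 5 more than the 30-minute overlap. One possible reading, which is an assumption rather than something the source states, is that alerting verification happens just outside the overlap window.

```python
from typing import List, Tuple

OVERLAP_MIN = 30  # recommended overlap between shifts

# (step, minutes), taken from the timing breakdown above.
OUTGOING: List[Tuple[str, int]] = [
    ("Write handoff document", 15),
    ("Sync call with incoming", 15),
]
INCOMING: List[Tuple[str, int]] = [
    ("Review handoff document", 15),
    ("Sync call with outgoing", 15),
    ("Verify alerting setup", 5),
]


def total_minutes(steps: List[Tuple[str, int]]) -> int:
    """Sum the minutes budgeted for a list of steps."""
    return sum(minutes for _, minutes in steps)


def overrun(steps: List[Tuple[str, int]], overlap: int = OVERLAP_MIN) -> int:
    """Minutes by which the steps exceed the overlap (0 if they fit)."""
    return max(0, total_minutes(steps) - overlap)


print(total_minutes(OUTGOING), overrun(OUTGOING))  # 30 0
print(total_minutes(INCOMING), overrun(INCOMING))  # 35 5
```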

Templates

Template 1: Shift Handoff Document

On-Call Handoff: Platform Team

Outgoing: @alice (2024-01-15 to 2024-01-22)
Incoming: @bob (2024-01-22 to 2024-01-29)
Handoff Time: 2024-01-22 09:00 UTC


🔴 Active Incidents

None currently active

No active incidents at handoff time.


🟡 Ongoing Investigations

1. Intermittent API Timeouts (ENG-1234)

Status: Investigating
Started: 2024-01-20
Impact: ~0.1% of requests timing out

Context:

  • Timeouts correlate with database backup window (02:00-03:00 UTC)
  • Suspect backup process causing lock contention
  • Added extra logging in PR #567 (deployed 01/21)

Next Steps:

  • Review new logs after tonight's backup
  • Consider moving backup window if confirmed

Resources:

  • Dashboard: API Latency
  • Thread: #platform-eng (01/20, 14:32)

2. Memory Growth in Auth Service (ENG-1235)

Status: Monitoring
Started: 2024-01-18
Impact: None yet (proactive)

Context:

  • Memory usage growing ~5% per day
  • No memory leak found in profiling
  • Suspect connection pool not releasing properly

Next Steps:

  • Review heap dump from 01/21
  • Consider restart if usage > 80%

Resources:


🟢 Resolved This Shift

Payment Service Outage (2024-01-19)

  • Duration: 23 minutes
  • Root Cause: Database connection exhaustion
  • Resolution: Rolled back v2.3.4, increased pool size
  • Postmortem: POSTMORTEM-89
  • Follow-up tickets: ENG-1230, ENG-1231

📋 Recent Changes

Deployments

| Service | Version | Time | Notes |
|---|---|---|---|
| api-gateway | v3.2.1 | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0 | 01/20 10:00 | New profile features |
| auth-service | v4.1.2 | 01/19 16:00 | Security patch |

Configuration Changes

  • 01/21: Increased API rate limit from 1000 to 1500 RPS
  • 01/20: Updated database connection pool max from 50 to 75

Infrastructure

  • 01/20: Added 2 nodes to Kubernetes cluster
  • 01/19: Upgraded Redis from 6.2 to 7.0

⚠️ Known Issues & Workarounds

1. Slow Dashboard Loading

Issue: Grafana dashboards slow on Monday mornings
Workaround: Wait 5 min after 08:00 UTC for cache warm-up
Ticket: OPS-456 (P3)

2. Flaky Integration Test

Issue: test_payment_flow fails intermittently in CI
Workaround: Re-run failed job (usually passes on retry)
Ticket: ENG-1200 (P2)


📅 Upcoming Events

| Date | Event | Impact | Contact |
|---|---|---|---|
| 01/23 02:00 | Database maintenance | 5 min read-only | @dba-team |
| 01/24 14:00 | Major release v5.0 | Monitor closely | @release-team |
| 01/25 | Marketing campaign | 2x traffic expected | @platform |

📞 Escalation Reminders

| Issue Type | First Escalation | Second Escalation |
|---|---|---|
| Payment issues | @payments-oncall | @payments-manager |
| Auth issues | @auth-oncall | @security-team |
| Database issues | @dba-team | @infra-manager |
| Unknown/severe | @engineering-manager | @vp-engineering |

🔧 Quick Reference

Common Commands

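# Check service health: list pods in all namespaces whose status is not Running
# (Completed job pods and the header row will also appear)
kubectl get pods -A | grep -v Running

# Recent cluster events in the current namespace, newest last (add -A for all namespaces)
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections: count of current sessions
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only): deletes every key in the currently selected Redis database
redis-cli FLUSHDB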

Important Links

  • Runbooks

  • Service Catalog

  • Incident Slack

  • PagerDuty

Handoff Checklist

Outgoing Engineer

  • Document active incidents

  • Document ongoing investigations

  • List recent changes

  • Note known issues

  • Add upcoming events

  • Sync with incoming engineer

Incoming Engineer

  • Read this document

  • Join sync call

  • Verify PagerDuty is routing to you

  • Verify Slack notifications working

  • Check VPN/access working

  • Review critical dashboards
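The Escalation Reminders table in Template 1 is effectively a two-level lookup with a catch-all row. The sketch below shows one way it might be encoded so that tooling such as a chat bot can answer "who do I page next?". The contacts are copied from the table. The dictionary keys (`"payment"`, `"auth"`, `"database"`) and the function name `escalation_contact` are hypothetical, and it assumes that any unrecognized issue type falls through to the "Unknown/severe" row.

```python
from typing import Dict, Tuple

# (first escalation, second escalation), copied from the Escalation Reminders table.
# The short keys are illustrative labels, not names used by the source.
ESCALATION: Dict[str, Tuple[str, str]] = {
    "payment": ("@payments-oncall", "@payments-manager"),
    "auth": ("@auth-oncall", "@security-team"),
    "database": ("@dba-team", "@infra-manager"),
}

# "Unknown/severe" row, used when the issue type is not recognized.
FALLBACK: Tuple[str, str] = ("@engineering-manager", "@vp-engineering")


def escalation_contact(issue_type: str, level: int) -> str:
    """Return the contact for escalation level 1 or 2."""
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    first, second = ESCALATION.get(issue_type.lower(), FALLBACK)
    return first if level == 1 else second


print(escalation_contact("database", 1))  # @dba-team
print(escalation_contact("dns", 2))       # @vp-engineering
```

Keeping the table as data rather than prose means the handoff document and any automation can be generated from the same source.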

Template 2: Quick Handoff (Async)

# Quick Handoff: @alice → @bob

## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release in two days (01/24) - be ready for issues

## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)

## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS

## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release

## Questions?
I'll be available on Slack until 17:00 today.
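Because the quick handoff has a fixed shape, it can be rendered from structured data instead of typed by hand. The following is a minimal sketch covering only the title, TL;DR, and Watch List sections. The function `render_quick_handoff` and its parameters are hypothetical, and the sample values are copied from the template above.

```python
from typing import List


def render_quick_handoff(
    outgoing: str,
    incoming: str,
    tldr: List[str],
    watch_list: List[str],
) -> str:
    """Render the first sections of a quick handoff as Markdown text."""
    lines = [f"# Quick Handoff: {outgoing} → {incoming}", "", "## TL;DR"]
    lines += [f"- {item}" for item in tldr]
    lines += ["", "## Watch List"]
    # The watch list is numbered, matching the template.
    lines += [f"{n}. {item}" for n, item in enumerate(watch_list, start=1)]
    return "\n".join(lines)


note = render_quick_handoff(
    outgoing="@alice",
    incoming="@bob",
    tldr=["No active incidents", "1 investigation ongoing (API timeouts, see ENG-1234)"],
    watch_list=[
        "API latency around 02:00-03:00 UTC (backup window)",
        "Auth service memory (restart if > 80%)",
    ],
)
print(note)
```

The remaining sections (Recent, Coming Up, Questions?) would follow the same pattern.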

Template 3: Incident Handoff (Mid-Incident)

# INCIDENT HANDOFF: Payment Service Degradation

**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware
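The "What Needs to Happen" steps in Template 3 describe a small decision rule that the incoming engineer applies while monitoring. The sketch below is one hedged interpretation of it. The function name `next_action` is hypothetical, error rates are assumed to be percentages sampled over the monitoring window (oldest first), and "not improving" is assumed to mean the latest sample is no lower than the first one in the window. The source does not define that term.

```python
from typing import Sequence

TARGET_ERROR_RATE = 1.0  # percent; step 1 expects the rate to drop below this


def next_action(error_rates: Sequence[float]) -> str:
    """Map error-rate samples (percent, oldest first) to the next handoff step."""
    if not error_rates:
        raise ValueError("need at least one error-rate sample")
    latest = error_rates[-1]
    if latest < TARGET_ERROR_RATE:
        # Step 3: the service is stable.
        return "begin root cause investigation"
    if len(error_rates) > 1 and latest >= error_rates[0]:
        # Step 2: no improvement across the window (assumed definition).
        return "escalate to @payments-manager"
    # Step 1: still recovering, or only one sample so far.
    return "keep monitoring"


print(next_action([40.0, 15.0, 6.0]))   # keep monitoring
print(next_action([15.0, 15.0, 16.0]))  # escalate to @payments-manager
print(next_action([15.0, 4.0, 0.5]))    # begin root cause investigation
```

Writing the rule down this explicitly in the handoff removes guesswork about when the incoming engineer should escalate.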
