Deepgram Incident Runbook

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "deepgram-incident-runbook" with this command: npx skills add jeremylongshore/claude-code-plugins-plus-skills/jeremylongshore-claude-code-plugins-plus-skills-deepgram-incident-runbook

Contents

  • Overview

  • Prerequisites

  • Instructions

  • Output

  • Error Handling

  • Examples

  • Resources

Overview

Standardized procedures for responding to Deepgram-related incidents: an initial triage script, severity-based response (SEV1 to SEV4), fallback activation, degradation investigation, and post-incident review templates.

Prerequisites

  • Monitoring and alerting configured

  • On-call rotation established

  • Fallback/queueing system available

  • Communication channels defined

Instructions

Step 1: Run Initial Triage (First 5 Minutes)

Run the triage script. It checks the Deepgram status page, queries the error rate from Prometheus, checks P95 latency, and tests API connectivity with curl.
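The four checks above could be sketched in Python roughly as follows. This is a minimal illustration, not the upstream triage script: the status JSON endpoint assumes a Statuspage-style API, the Prometheus host and metric names are hypothetical, and the connectivity check assumes that listing projects with a `Token` authorization header is an acceptable probe. The `fetch` parameter exists only so the checks can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoints; adjust all of these to your environment.
STATUS_URL = "https://status.deepgram.com/api/v2/status.json"  # assumes a Statuspage-style JSON API
PROMETHEUS_URL = "http://prometheus:9090"                      # hypothetical internal host
DEEPGRAM_PROJECTS_URL = "https://api.deepgram.com/v1/projects"  # assumed connectivity probe

# Hypothetical metric names; replace with the ones your service exports.
ERROR_RATE_QUERY = (
    'sum(rate(deepgram_requests_total{status="error"}[5m]))'
    " / sum(rate(deepgram_requests_total[5m]))"
)
P95_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(deepgram_request_duration_seconds_bucket[5m])) by (le))"
)


def http_get_json(url, headers=None, timeout=10):
    """GET a URL and decode the JSON body."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def prometheus_scalar(fetch, query):
    """Run an instant query and return the first sample as a float, or None."""
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": query})
    results = fetch(url)["data"]["result"]
    return float(results[0]["value"][1]) if results else None


def run_triage(api_key, fetch=http_get_json):
    """Run each check independently so one failure does not hide the others."""
    auth = {"Authorization": "Token " + api_key}
    checks = {
        "status_page": lambda: fetch(STATUS_URL)["status"]["description"],
        "error_rate": lambda: prometheus_scalar(fetch, ERROR_RATE_QUERY),
        "p95_latency_s": lambda: prometheus_scalar(fetch, P95_QUERY),
        "api_reachable": lambda: fetch(DEEPGRAM_PROJECTS_URL, headers=auth) is not None,
    }
    report = {}
    for name, check in checks.items():
        try:
            report[name] = check()
        except Exception as exc:  # record the failure instead of aborting triage
            report[name] = "check failed: " + str(exc)
    return report
```

Calling `run_triage(api_key)` returns a dictionary with one entry per check; a failed check appears as a `check failed: ...` string rather than raising, so the on-call engineer still sees the remaining results.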

Step 2: Classify Severity

  • SEV1 (respond immediately): 100% failure, 5xx errors.

  • SEV2 (respond within 15 minutes): error rate of 50% or more.

  • SEV3 (respond within 1 hour): elevated latency.

  • SEV4 (respond within 24 hours): single feature affected.
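The classification rules above can be expressed as a small function. This is a sketch under stated assumptions: the function name and its boolean inputs are illustrative, and what counts as "elevated latency" is left to the caller because the runbook does not give a threshold.

```python
# Response-time targets as listed in the runbook.
RESPONSE_TIME = {
    "SEV1": "immediate",
    "SEV2": "< 15 min",
    "SEV3": "< 1 hour",
    "SEV4": "< 24 hours",
}


def classify_severity(error_rate, latency_elevated=False, single_feature_affected=False):
    """Map observed symptoms to a severity level, most severe first.

    error_rate is a fraction between 0.0 and 1.0. Returns None when no
    rule matches, meaning no incident needs to be declared.
    """
    if error_rate >= 1.0:
        return "SEV1"
    if error_rate >= 0.5:
        return "SEV2"
    if latency_elevated:
        return "SEV3"
    if single_feature_affected:
        return "SEV4"
    return None
```

The rules are checked from most to least severe, so an incident with both a 60% error rate and elevated latency is classified as SEV2 rather than SEV3.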

Step 3: Respond to SEV1 (Complete Outage)

Acknowledge the incident in PagerDuty or Slack. Verify that the API key is valid and check network connectivity. Activate the fallback: either queue requests for later replay or switch to a backup STT provider. Notify affected customers.
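The "queue requests for later replay" fallback might look like the sketch below. It is an in-memory illustration only; `ReplayQueue` and `send_or_queue` are hypothetical names, and a production fallback would more likely use a durable queue so that pending requests survive a restart.

```python
from collections import deque


class ReplayQueue:
    """Holds requests that could not be sent during an outage."""

    def __init__(self):
        self._pending = deque()

    def __len__(self):
        return len(self._pending)

    def enqueue(self, request):
        self._pending.append(request)

    def replay(self, send):
        """Send queued requests in order.

        Stops at the first failure and keeps that request and everything
        after it queued. Returns the number of requests sent.
        """
        sent = 0
        while self._pending:
            request = self._pending[0]
            try:
                send(request)
            except Exception:
                break
            self._pending.popleft()
            sent += 1
        return sent


def send_or_queue(queue, request, send):
    """Try to send; on any failure, queue the request for later replay."""
    try:
        return send(request)
    except Exception:
        queue.enqueue(request)
        return None
```

During the outage, callers route requests through `send_or_queue`; once the API recovers, a single `queue.replay(send)` call drains the backlog in the original order.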

Step 4: Respond to SEV2 (Major Degradation)

Test transcription across multiple samples and models. Identify whether a specific model, feature, or audio type is affected. Mitigate by reducing the request rate, disabling non-critical features, switching models, or enabling retries.
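One way to narrow down which model is affected is to run the same samples against each model and compare failure rates. The sketch below assumes a caller-supplied `transcribe(model, sample)` function that raises on failure; that function, and the model and sample names in the usage note, are placeholders rather than part of any Deepgram SDK.

```python
def probe_models(transcribe, models, samples):
    """Return the fraction of failed transcriptions for each model.

    transcribe(model, sample) is supplied by the caller and is expected
    to raise an exception when a transcription fails.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    failure_rates = {}
    for model in models:
        failures = 0
        for sample in samples:
            try:
                transcribe(model, sample)
            except Exception:
                failures += 1
        failure_rates[model] = failures / len(samples)
    return failure_rates
```

For example, `probe_models(transcribe, ["model-a", "model-b"], ["short.wav", "long.wav"])` returns a dictionary of failure rates keyed by model name. A rate near 1.0 for one model and near 0.0 for the others suggests switching models as the mitigation.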

Step 5: Respond to SEV3 (Minor Degradation)

Increase timeouts to 60s, enable aggressive retries (5 attempts), switch to a simpler model (Nova), and disable diarization. Monitor for improvement.
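The timeout and retry settings above could be applied with a small wrapper like the one below. This is a sketch: the exponential backoff between attempts is an added assumption (the runbook only specifies 5 attempts and a 60s timeout), and `operation` is a caller-supplied function that accepts the timeout in seconds.

```python
import time


def call_with_retry(operation, attempts=5, timeout_s=60, base_delay_s=1.0, sleep=time.sleep):
    """Call operation(timeout_s) up to `attempts` times.

    Waits base_delay_s, then 2x, 4x, ... between attempts (assumed
    backoff policy). Re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation(timeout_s)
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:  # no delay after the final attempt
                sleep(base_delay_s * 2 ** attempt)
    raise last_error
```

The `sleep` parameter defaults to `time.sleep` and can be replaced in tests so that retries run without real delays.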

Step 6: Conduct Post-Incident Review

Document timeline, root cause, impact (duration, failed requests, revenue). List what went well and areas for improvement. Create action items with owners and due dates.

See detailed implementation for advanced patterns.

Output

  • Automated triage script

  • Severity classification guide

  • Fallback activation procedures

  • Degradation investigation playbook

  • Post-incident review template

Error Handling

Issue                      | Cause               | Solution
---------------------------+---------------------+------------------------------------
All transcriptions failing | API outage          | Activate fallback queue
50%+ error rate            | Partial degradation | Test models, reduce features
Elevated latency           | Overload            | Increase timeouts, reduce rate
Single feature broken      | API regression      | Disable feature, report to Deepgram

Examples

Quick Reference

Resource         | URL or contact
-----------------+-----------------------------
Deepgram Status  | https://status.deepgram.com
Deepgram Console | https://console.deepgram.com
Support          | support@deepgram.com

Severity Levels

Level | Definition        | Response Time
------+-------------------+--------------
SEV1  | Complete outage   | Immediate
SEV2  | Major degradation | < 15 min
SEV3  | Minor degradation | < 1 hour
SEV4  | Minor issue       | < 24 hours

Escalation Contacts

Level | Contact              | When
------+----------------------+--------------------------
L1    | On-call engineer     | First response
L2    | Team lead            | 15 min without resolution
L3    | Deepgram support     | Confirmed Deepgram issue
L4    | Engineering director | SEV1 > 1 hour
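The escalation rules above could be encoded so that alerting tooling can work out who should be involved at a given point in the incident. This is an illustrative sketch; the function name and its parameters are hypothetical, and the thresholds are taken directly from the table.

```python
def escalation_contacts(severity, minutes_unresolved, deepgram_issue_confirmed):
    """Return the contacts who should be engaged, in escalation order."""
    contacts = ["On-call engineer"]  # L1: always the first responder
    if minutes_unresolved >= 15:
        contacts.append("Team lead")  # L2: 15 min without resolution
    if deepgram_issue_confirmed:
        contacts.append("Deepgram support")  # L3: confirmed Deepgram issue
    if severity == "SEV1" and minutes_unresolved > 60:
        contacts.append("Engineering director")  # L4: SEV1 lasting over 1 hour
    return contacts
```

For a SEV1 that has been open for 90 minutes with a confirmed Deepgram issue, the function returns all four contacts; for a fresh SEV3 it returns only the on-call engineer.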

Resources

  • Deepgram Status Page

  • Deepgram Support

  • Internal Runbooks

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
