incident-response

Incident Response and Remediation

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "incident-response" with this command: npx skills add 5dlabs/cto/5dlabs-cto-incident-response

Patterns for diagnosing and fixing production issues.

Healer Mode Workflow

  • Investigate - Gather metrics, logs, and system state

  • Diagnose - Identify root cause before fixing

  • Fix - Implement minimal targeted fix

  • Validate - Confirm metrics improve after deployment

  • Document - Store learnings for future incidents

Tool Usage Priority

  • Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs

  • Kubernetes Tools - Check pod status, events, deployments

  • ArgoCD Tools - Verify GitOps sync status

  • Memory Search - Look for similar past incidents

  • Code Fix - Implement minimal targeted fix

Observability Queries

Prometheus Metrics

Error rate

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Latency P99

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
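To make the P99 query less opaque, here is a minimal Python sketch of the interpolation idea behind histogram_quantile: find the cumulative bucket that contains the target rank, then interpolate linearly inside it. It is a simplification for illustration and does not reproduce every edge case Prometheus handles. The function name and the sample bucket values are assumptions, not taken from any real service.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes the lowest bucket starts at 0 and that
    counts are cumulative, as in a Prometheus classic histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket at the rank; nothing to interpolate.
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts (le in seconds).
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.7, sample_buckets))
```

The estimate is only as precise as the bucket boundaries, so a P99 that sits at a bucket edge usually means the histogram needs finer buckets near that latency.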

CPU usage

sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)

Memory usage

container_memory_working_set_bytes{pod=~"app-.*"}
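The queries above can also be sent to the Prometheus HTTP API rather than pasted into a UI. The sketch below only builds the request URLs for the instant-query endpoint (/api/v1/query); it does not send anything. The server address and the helper name are hypothetical, and authentication is left out.

```python
from urllib.parse import urlencode

# PromQL expressions copied from the text above.
QUERIES = {
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
                  "/ sum(rate(http_requests_total[5m]))",
    "latency_p99": "histogram_quantile(0.99, "
                   "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
    "cpu": 'sum(rate(container_cpu_usage_seconds_total{pod=~"app-.*"}[5m])) by (pod)',
    "memory": 'container_memory_working_set_bytes{pod=~"app-.*"}',
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical server address; replace with your own Prometheus URL.
BASE_URL = "http://prometheus.example:9090"
urls = {name: instant_query_url(BASE_URL, q) for name, q in QUERIES.items()}
for name, url in urls.items():
    print(name, url)
```

Encoding the expression with urlencode matters because PromQL contains braces, brackets, and quotes that are not safe in a raw URL.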

Loki Log Queries

Errors (set the query time range to the last hour)

{namespace="production", pod=~"app-.*"} |= "error" | json | level="error"

Stack traces

{namespace="production"} |~ "panic|stack trace"

Slow requests

{namespace="production"} | json | latency_ms > 1000
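When logs have already been exported, or Loki is unreachable during an incident, the slow-request filter can be approximated locally. This Python sketch mirrors the intent of `| json | latency_ms > 1000` over raw log lines; it is not how Loki evaluates the query. The field name latency_ms comes from the query above, while the function name and sample lines are made up for illustration.

```python
import json


def slow_requests(lines, threshold_ms=1000):
    """Return parsed JSON log entries whose latency_ms exceeds threshold_ms."""
    slow = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not JSON.
        if not isinstance(entry, dict):
            continue
        latency = entry.get("latency_ms")
        is_number = isinstance(latency, (int, float)) and not isinstance(latency, bool)
        if is_number and latency > threshold_ms:
            slow.append(entry)
    return slow


# Hypothetical log lines.
sample_lines = [
    '{"path": "/checkout", "latency_ms": 1500}',
    '{"path": "/health", "latency_ms": 12}',
    "plain text line that is not JSON",
]
print(slow_requests(sample_lines))
```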

Kubernetes Diagnostics

Pod status and events

kubectl get pods -n production -l app=myapp
kubectl describe pod <pod-name> -n production
kubectl get events -n production --sort-by='.lastTimestamp'

Logs

kubectl logs -n production -l app=myapp --tail=100
kubectl logs -n production <pod-name> --previous  # Logs from the previous container instance

Resource usage

kubectl top pods -n production
kubectl top nodes

Deployment status

kubectl rollout status deployment/myapp -n production
kubectl rollout history deployment/myapp -n production
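With many pods, scanning kubectl output by eye is slow. The following Python sketch summarizes the JSON produced by `kubectl get pods -n production -l app=myapp -o json`, listing containers that are stuck waiting or whose last termination was abnormal. The field paths (status.containerStatuses, state.waiting.reason, lastState.terminated.reason, restartCount) follow the Kubernetes Pod status schema; the function name and tuple layout are assumptions for illustration.

```python
import json


def unhealthy_containers(pod_list_json: str):
    """Return (pod, container, waiting_reason, last_terminated_reason, restarts)
    for containers that are waiting or were last terminated abnormally."""
    findings = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            waiting_reason = waiting.get("reason")
            last_reason = terminated.get("reason")
            # "Completed" is a normal exit, so it is not reported on its own.
            if waiting_reason or (last_reason and last_reason != "Completed"):
                findings.append(
                    (pod_name, cs["name"], waiting_reason, last_reason,
                     cs.get("restartCount", 0))
                )
    return findings


# Usage sketch: save the kubectl JSON output to a file, then
#   print(unhealthy_containers(open("pods.json").read()))
```

A pod that is still starting will also appear with a waiting reason such as ContainerCreating, so treat the output as a list of candidates to inspect with `kubectl describe pod`, not as a verdict.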

ArgoCD Status

Application status

argocd app get myapp
argocd app diff myapp

Sync status

argocd app sync myapp --dry-run

Rollback

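argocd app history myapp  # List deployment history IDs
argocd app rollback myapp <history-id>  # Takes a history ID from the list above, not a Git revision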

Common Issues and Solutions

High Error Rate

  • Check recent deployments

  • Review error logs for patterns

  • Check dependency health

  • Verify configuration changes

High Latency

  • Check database query performance

  • Review external service latency

  • Check resource constraints (CPU/memory)

  • Look for lock contention

OOMKilled Pods

  • Increase memory limits

  • Check for memory leaks

  • Review recent code changes

  • Consider horizontal scaling

CrashLoopBackOff

  • Check logs for startup errors

  • Verify secrets and configs exist

  • Check health check endpoints

  • Review recent deployments

ImagePullBackOff

  • Verify image exists in registry

  • Check image pull secrets

  • Verify image tag is correct

  • Check registry connectivity

Healing Guidelines

  • Diagnose first - Understand the root cause before fixing

  • Minimal changes - Fix only what's broken

  • Document findings - Store learnings in memory for future incidents

  • Validate fix - Confirm metrics improve after deployment

  • Roll back if needed - Don't hesitate to roll back if the fix doesn't work
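The "validate, then roll back if needed" guidance can be turned into an explicit check. This is a minimal Python sketch that compares the error ratio before and after a deployment and flags a rollback when the improvement is too small. The function name and the 50% default threshold are assumptions chosen for illustration; pick a threshold that fits your own service.

```python
def should_roll_back(before: float, after: float, min_improvement: float = 0.5) -> bool:
    """Return True when the post-deploy error ratio has not improved enough.

    before, after: error ratios between 0 and 1 (e.g. from the error-rate query).
    min_improvement: required relative reduction (assumed default: 50%).
    """
    if before <= 0:
        # Nothing to improve; roll back only if errors appeared after the fix.
        return after > 0
    relative_improvement = (before - after) / before
    return relative_improvement < min_improvement


# Hypothetical readings: 10% errors before the fix, 1% after.
print(should_roll_back(0.10, 0.01))
```

Sampling the "after" value over the same window length as the "before" value keeps the comparison fair.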

Post-Incident

  • Update metrics/alerts if needed

  • Document root cause and fix

  • Store learnings in memory for similar incidents

  • Consider preventive measures

  • Update runbooks if applicable
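To make stored learnings searchable later, it helps to record them in a consistent shape. The Python sketch below shows one possible structure serialized as JSON. The class, its field names, and the example values are all hypothetical; the source does not specify a memory format, so adapt the shape to whatever store you use.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentRecord:
    """One stored learning from an incident (hypothetical schema)."""
    title: str
    root_cause: str
    fix: str
    preventive_measures: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Invented example values, for illustration only.
record = IncidentRecord(
    title="Elevated 5xx rate after deployment",
    root_cause="Missing configuration key in the new release",
    fix="Restored the key and redeployed",
    preventive_measures=["Add a startup check for required configuration"],
)
print(record.to_json())
```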

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
