k8s-debug

Kubernetes Debugging Expertise

Golden Rule: Events Before Logs

When debugging Kubernetes issues, ALWAYS check events first:

  1. get_pod_events → Shows scheduling, pulling, starting, probes, OOM
  2. THEN get_pod_logs → Application-level errors

Events explain most crash/scheduling issues faster than logs.

Typical Investigation Flow

  1. list_pods → Get overview of pod health in namespace
  2. get_pod_events → Understand WHY pods are in their state
  3. get_pod_logs → Only if events don't explain the issue
  4. get_pod_resources → For performance/resource issues
  5. describe_deployment → Check deployment status and conditions
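
The first three steps of this flow can be sketched as a small driver. The tool names come from this document; their signatures and return shapes (lists of dicts with `name`, `phase`, `ready`, and `reason` keys) are assumptions made for illustration, and the set of "self-explanatory" reasons is a hypothetical starting point rather than a definitive list.

```python
from typing import Callable, Dict, List

# Event reasons that usually explain a pod's state without reading logs.
# Illustrative assumption, not an exhaustive list.
SELF_EXPLANATORY = {"OOMKilled", "FailedScheduling", "Unschedulable", "Evicted"}


def investigate(namespace: str, tools: Dict[str, Callable]) -> List[dict]:
    """Walk unhealthy pods: events first, logs only if events don't explain."""
    findings = []
    for pod in tools["list_pods"](namespace):
        if pod["phase"] == "Running" and pod.get("ready", True):
            continue  # healthy pod, nothing to investigate
        events = tools["get_pod_events"](namespace, pod["name"])
        reasons = [event["reason"] for event in events]
        finding = {"pod": pod["name"], "reasons": reasons, "logs": None}
        if not SELF_EXPLANATORY.intersection(reasons):
            # Events did not explain the state, so fall back to logs.
            finding["logs"] = tools["get_pod_logs"](namespace, pod["name"])
        findings.append(finding)
    return findings
```

Passing the tools in as a dictionary of callables keeps the sketch independent of how the real tools are invoked, and makes the events-before-logs ordering easy to check with stubs.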

Common Issue Patterns

CrashLoopBackOff

First check: get_pod_events

| Event Reason | Likely Cause | Next Step |
|--------------|--------------|-----------|
| OOMKilled | Memory limit too low or memory leak | Check get_pod_resources, increase limits |
| Error | Application crash | Check get_pod_logs for stack trace |
| BackOff | Repeated failures | Check logs for startup errors |

Checklist:

  • Memory limits vs actual usage

  • Recent deployment changes (get_deployment_history)

  • Missing config/secrets

  • Dependency failures (database, external services)
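
The reason-to-next-step mapping for CrashLoopBackOff can be written as a lookup. This is a minimal sketch: the priority order (OOMKilled before Error before BackOff) and the fallback to logs are assumptions, not something the skill specifies.

```python
# Follow-up tool per event reason, taken from the CrashLoopBackOff guidance.
NEXT_TOOL = {
    "OOMKilled": "get_pod_resources",
    "Error": "get_pod_logs",
    "BackOff": "get_pod_logs",
}

# Assumed priority: the most specific reason wins when several are present.
PRIORITY = ("OOMKilled", "Error", "BackOff")


def next_tool(event_reasons):
    """Return the follow-up tool suggested by a pod's event reasons."""
    for reason in PRIORITY:
        if reason in event_reasons:
            return NEXT_TOOL[reason]
    return "get_pod_logs"  # assumption: unknown reasons fall back to logs
```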

OOMKilled

First check: get_pod_events (confirms OOMKilled)

Then: get_pod_resources (compare usage to limits)

Common causes:

  • Memory limit set too low for workload

  • Memory leak (usage increases over time)

  • Sudden traffic spike causing memory pressure

  • Large request payloads cached in memory
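
Comparing usage to limits means putting both values in the same unit first. The sketch below converts Kubernetes-style memory quantities to bytes and reports the fraction of the limit in use. It assumes get_pod_resources returns quantities as strings such as "450Mi", which this document does not confirm, and it handles only the common suffixes (no exponent notation, no P/E suffixes).

```python
# Binary and decimal suffixes commonly used in Kubernetes memory quantities.
# Two-letter binary suffixes are listed first so "Mi" is matched before "M".
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to a byte count."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a plain byte count


def memory_utilization(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)
```

A value that stays close to 1.0 suggests the limit is too low for the workload, while a value that climbs steadily across samples points toward a leak.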

ImagePullBackOff

First check: get_pod_events

Common causes:

  • Wrong image name or tag

  • Private registry without imagePullSecrets

  • Rate limiting from registry

  • Network issues reaching registry

Pending Pods

First check: get_pod_events

Look for:

  • FailedScheduling: Insufficient resources

  • Unschedulable: Node affinity/taints

  • No matching nodes for nodeSelector
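
FailedScheduling messages from the default scheduler typically summarize how many nodes were rejected for each reason. The sketch below pulls those counts out of the message text. The message layout it expects ("N/M nodes are available: <count> <reason>, ...") varies across Kubernetes versions, so treat the parsing as an assumption to verify against your cluster's actual events.

```python
import re


def parse_failed_scheduling(message: str) -> dict:
    """Map each rejection reason in a FailedScheduling message to a node count."""
    match = re.search(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if not match:
        return {}
    # Newer schedulers append a "preemption:" clause; ignore it here.
    detail = match.group(3).split(" preemption:")[0]
    counts = {}
    for part in detail.split(", "):
        item = re.match(r"(\d+) (.+)", part.strip().rstrip("."))
        if item:
            counts[item.group(2)] = int(item.group(1))
    return counts
```

Reasons beginning with "Insufficient" indicate resource shortages, while reasons mentioning taints, affinity, or selectors point to placement constraints.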

Readiness/Liveness Probe Failures

First check: describe_pod (shows probe config)

Then: get_pod_events (probe failure events)

Then: get_pod_logs (why endpoint isn't responding)

Evicted Pods

First check: get_pod_events

Causes:

  • Node resource pressure (disk, memory)

  • Priority preemption

  • Taint-based eviction

Deployment Issues

Stuck Rollout

  1. describe_deployment → Check replicas (desired vs ready vs available)
  2. get_deployment_history → Compare current vs previous revision
  3. get_pod_events → For pods in new ReplicaSet

Common causes:

  • New pods failing (CrashLoopBackOff)

  • Readiness probes failing

  • Resource constraints preventing scheduling
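
The desired vs ready vs available comparison can be sketched as follows. The field names (`spec.replicas`, `status.updatedReplicas`, `status.readyReplicas`, `status.availableReplicas`) are those of the Kubernetes Deployment object; that describe_deployment returns the object in this dict shape is an assumption.

```python
def rollout_gaps(deployment: dict) -> dict:
    """Count replicas still missing at each stage of a rollout."""
    desired = deployment.get("spec", {}).get("replicas", 1)
    status = deployment.get("status", {})
    # The API omits these status fields when their value is zero.
    updated = status.get("updatedReplicas", 0)
    ready = status.get("readyReplicas", 0)
    available = status.get("availableReplicas", 0)
    return {
        "not_updated": max(desired - updated, 0),
        "not_ready": max(desired - ready, 0),
        "not_available": max(desired - available, 0),
    }


def rollout_complete(deployment: dict) -> bool:
    """True when every desired replica is updated, ready, and available."""
    return not any(rollout_gaps(deployment).values())
```

A nonzero `not_ready` with `not_updated` at zero suggests the new pods were created but are failing readiness, which is the cue to check events for the new ReplicaSet's pods.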

Rollback Decision

Use get_deployment_history to see previous working versions.

Error Classification

Non-Retryable (Stop Immediately)

  • 401 Unauthorized - Invalid credentials

  • 403 Forbidden - No permission

  • 404 Not Found - Resource doesn't exist

  • "config_required": true - Integration not configured

Retryable (May retry once)

  • 429 Too Many Requests

  • 500/502/503/504 Server errors

  • Timeout

  • Connection refused
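
This classification can be expressed as a small retry wrapper. It is a sketch under assumptions: tool results are taken to be dicts with an optional `status` code and `config_required` flag, timeouts and refused connections are taken to surface as Python's built-in `TimeoutError` and `ConnectionRefusedError`, and any other status of 400 or above is treated as non-retryable.

```python
NON_RETRYABLE = {401, 403, 404}
RETRYABLE = {429, 500, 502, 503, 504}


def classify(result: dict) -> str:
    """Return 'ok', 'retry', or 'stop' for a tool result."""
    if result.get("config_required") is True:
        return "stop"  # integration not configured
    status = result.get("status")
    if status in NON_RETRYABLE:
        return "stop"
    if status in RETRYABLE:
        return "retry"
    if status is None or status < 400:
        return "ok"
    return "stop"  # assumption: unlisted errors are not retried


def call_with_retry(call, max_retries: int = 1) -> dict:
    """Invoke `call`, retrying retryable failures at most `max_retries` times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            result = call()
            verdict = classify(result)
        except (TimeoutError, ConnectionRefusedError) as exc:
            result, verdict = {"error": repr(exc)}, "retry"
        if verdict != "retry" or attempts > max_retries:
            return result
```

With the default `max_retries=1`, a retryable failure is attempted twice in total and the second result is returned whatever it is.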

Resource Investigation Pattern

For memory/CPU issues:

  1. get_pod_resources → See allocation vs usage
  2. describe_pod → See full container spec
  3. get_cloudwatch_metrics/query_datadog_metrics → Historical usage
  4. detect_anomalies on historical data → Find when issue started
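
To illustrate what "find when the issue started" can mean in step 4, here is one simple approach: flag the first sample that rises well above a baseline window. This is not the detect_anomalies tool's actual method, which this document does not describe; the function name, baseline size, and threshold are all hypothetical.

```python
from statistics import mean, stdev


def first_anomaly(samples, baseline=10, threshold=3.0):
    """Index of the first sample above baseline mean + threshold * stdev.

    The first `baseline` samples define normal behavior. Returns None when
    there is too little data or nothing exceeds the limit.
    """
    if baseline < 2 or len(samples) <= baseline:
        return None
    base = samples[:baseline]
    # Guard against a perfectly flat baseline, where stdev is zero.
    limit = mean(base) + threshold * max(stdev(base), 1e-9)
    for index in range(baseline, len(samples)):
        if samples[index] > limit:
            return index
    return None
```

Mapping the returned index back to its timestamp gives a starting point to compare against get_deployment_history.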
