kubernetes-troubleshooting

Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.

Install skill "kubernetes-troubleshooting" with this command: npx skills add nik-kale/sre-skills/nik-kale-sre-skills-kubernetes-troubleshooting

Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

When to Use This Skill

  • Pod stuck in CrashLoopBackOff
  • OOMKilled errors
  • ImagePullBackOff failures
  • Pod not starting or scheduling
  • Service connectivity issues
  • Resource constraint problems

Quick Diagnostic Commands

Start with these commands to understand the current state:

# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>

Pod Debugging Workflow

Step 1: Check Pod Status

kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>

Look for:

  • Status: What state is the pod in?
  • Conditions: Ready, ContainersReady, PodScheduled
  • Events: Recent events at the bottom of describe output

Step 2: Identify the Problem Category

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| Pending | Scheduling issue | Scheduling Issues |
| CrashLoopBackOff | Application crash | CrashLoopBackOff |
| ImagePullBackOff | Image/registry issue | Image Pull Issues |
| OOMKilled | Memory exhaustion | OOMKilled |
| Running but not Ready | Health check failing | Readiness Issues |
| Error | Container error | Container Errors |
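
The lookup above can be sketched as a small shell helper. The function name `triage_section` and its two arguments (the STATUS and READY columns from `kubectl get pods`) are assumptions made for illustration, not part of the skill itself.

```shell
# Map a pod's STATUS (and READY count, e.g. "0/1") to the section to read next.
# Hypothetical helper; both values are copied from `kubectl get pods` output.
triage_section() {
  status=$1
  ready=$2
  case "$status" in
    Pending)                       echo "Scheduling Issues" ;;
    CrashLoopBackOff)              echo "CrashLoopBackOff" ;;
    ImagePullBackOff|ErrImagePull) echo "Image Pull Issues" ;;
    OOMKilled)                     echo "OOMKilled" ;;
    Error)                         echo "Container Errors" ;;
    Running)
      # Running is only a problem when fewer containers are ready than exist.
      if [ "${ready%/*}" = "${ready#*/}" ]; then
        echo "none"
      else
        echo "Readiness Issues"
      fi ;;
    *)                             echo "unknown" ;;
  esac
}
```

For example, `triage_section Running 0/1` prints `Readiness Issues`.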

Common Issues

Scheduling Issues

Pod stuck in Pending state.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

Common Causes:

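| Event Message | Cause | Fix |
|---|---|---|
| Insufficient cpu/memory | Not enough allocatable resources | Add nodes or reduce requests |
| node(s) had taints | Node taints the pod does not tolerate | Add tolerations or remove taints |
| didn't match Pod's node affinity/selector | No node matches the pod's constraints | Check node selector/affinity |
| persistentvolumeclaim not found | PVC missing | Create the PVC |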

Fix Resource Issues:

# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -n <namespace> -o yaml | grep -A 10 resources
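
When many pods are affected, it can be quicker to list every Pending pod and the scheduler's own events in one pass. The sketch below relies on the standard `--field-selector` flag; the wrapper name `pending_report` is an assumption made for illustration.

```shell
# Show all Pending pods, then the FailedScheduling events that explain them.
# Hypothetical wrapper; both kubectl calls are read-only.
pending_report() {
  kubectl get pods -A --field-selector=status.phase=Pending
  kubectl get events -A --field-selector=reason=FailedScheduling \
    --sort-by='.lastTimestamp'
}
```

Run `pending_report` and compare the event messages against the table of common causes.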

CrashLoopBackOff

Container keeps crashing and restarting.

Diagnostic:

# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"

Common Exit Codes:

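| Exit Code | Meaning | Common Cause |
|---|---|---|
| 0 | Success | Process exited normally (unexpected for a long-running service) |
| 1 | Application error | Check application logs |
| 137 | SIGKILL (128 + 9) | Usually memory limit exceeded; confirm the reason is OOMKilled |
| 139 | SIGSEGV (128 + 11) | Segmentation fault |
| 143 | SIGTERM (128 + 15) | Graceful termination |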
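
Exit codes above 128 conventionally mean the process was killed by a signal, with the signal number equal to the code minus 128. A minimal sketch of that arithmetic follows; the function name `describe_exit_code` is hypothetical.

```shell
# Decode a container exit code using the 128 + signal-number convention.
describe_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application exit status $code"
  fi
}
```

For example, `describe_exit_code 137` prints `killed by signal 9`, which is SIGKILL.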

Common Fixes:

  • Check application logs for startup errors
  • Verify environment variables and secrets
  • Check if dependencies are available
  • Verify resource limits aren't too restrictive

Image Pull Issues

ImagePullBackOff or ErrImagePull.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events

Common Causes:

| Error | Cause | Fix |
|---|---|---|
| repository does not exist | Wrong image name | Fix image name/tag |
| unauthorized | Auth failure | Check imagePullSecrets |
| manifest unknown | Tag doesn't exist | Verify tag exists |
| connection refused | Registry unreachable | Check network/firewall |

Fix Registry Auth:

# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Reference in pod spec
spec:
  imagePullSecrets:
  - name: regcred
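
As an alternative to editing each pod spec, the secret can be attached to a service account so that pods using it pick the secret up automatically. The sketch below wraps the documented `kubectl patch serviceaccount` command. The function name `attach_pull_secret` and its argument order are assumptions, and only pods created after the patch are affected.

```shell
# Attach an image pull secret to a service account (defaults to "default").
# Usage: attach_pull_secret <namespace> <secret-name> [service-account]
# Hypothetical wrapper; this modifies the cluster, so review before running.
attach_pull_secret() {
  kubectl patch serviceaccount "${3:-default}" -n "$1" \
    -p "{\"imagePullSecrets\": [{\"name\": \"$2\"}]}"
}
```

For example, `attach_pull_secret <namespace> regcred` patches the namespace's default service account.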

OOMKilled

Container killed due to memory exhaustion.

Diagnostic:

kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState

Fix Options:

  1. Increase memory limit (if available):

     resources:
       limits:
         memory: '512Mi' # Increase this
       requests:
         memory: '256Mi'

  2. Profile memory usage:

     kubectl top pod <pod-name> -n <namespace> --containers

  3. Check for memory leaks in application code
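
Before changing limits, it helps to confirm which container was actually killed and why. The sketch below reads the last termination state directly with a JSONPath query instead of grepping YAML. The function name `last_termination` is an assumption; the fields come from the pod's `status.containerStatuses`.

```shell
# Print "<container> <reason> <exit code>" for each container's last termination.
# Usage: last_termination <pod-name> <namespace>
# Hypothetical wrapper around a read-only kubectl query.
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A reason of `OOMKilled` alongside exit code 137 points at the memory limit rather than an application bug.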

Readiness Issues

Pod is Running but not Ready.

Diagnostic:

# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health

Common Causes:

  • Application not listening on expected port
  • Readiness endpoint returning non-200
  • Probe timeout too short
  • Dependencies not available

Fix Readiness Probe:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3

Container Errors

Diagnostic:

# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>

Networking Troubleshooting

Service Not Reachable

# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --restart=Never --image=busybox -n <namespace> -- wget -qO- <service>:<port>
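
A service with no ready endpoints will never answer, whatever the network looks like, so that check is worth scripting first. The following is a sketch only; the function name `service_has_endpoints` and its messages are assumptions.

```shell
# Report whether a service currently has ready endpoint addresses.
# Usage: service_has_endpoints <service-name> <namespace>
# Hypothetical helper; returns non-zero when no ready addresses exist.
service_has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints: check the selector and pod readiness"
    return 1
  fi
}
```

An empty result usually means the selector matches no pods or the matching pods are failing their readiness probes.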

DNS Issues

# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
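
A short service name resolves only through the pod's DNS search path, so cross-namespace lookups are more reliable with the fully qualified name. A minimal sketch for building it follows, assuming the default cluster domain `cluster.local`; the function name `service_fqdn` is hypothetical.

```shell
# Build a service's fully qualified DNS name.
# Usage: service_fqdn <service-name> <namespace> [cluster-domain]
# Assumes the default cluster domain "cluster.local" unless a third argument is given.
service_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}
```

The result can be passed straight to `nslookup` inside the pod.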

Resource Analysis

Node Pressure

# Check node conditions
kubectl describe nodes | grep -A 8 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20

PVC Issues

# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>

Quick Reference Commands

# Pod debugging
kubectl logs <pod> -n <ns>                    # Current logs
kubectl logs <pod> -n <ns> --previous         # Previous container logs
kubectl logs <pod> -n <ns> -c <container>     # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f      # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh     # Shell into container
kubectl exec <pod> -n <ns> -- env             # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf  # Check DNS config

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml         # Full pod spec
kubectl describe pod <pod> -n <ns>            # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running         # Non-running pods
kubectl top pods -A --sort-by=cpu             # CPU usage
kubectl top pods -A --sort-by=memory          # Memory usage
