kubeadm Troubleshooting
Overview
Systematic approach to diagnosing kubeadm cluster issues.
Core principle: Trace symptoms to specific components. Most issues are: certificates, networking, or kubelet configuration.
When to Use
kubeadm init or kubeadm join fails
- Nodes show
NotReady status
- Pods stuck in
Pending state
- Certificate-related errors
- kubelet crashlooping
- Control plane components failing
Diagnostic Flowchart
Symptom
│
├─ kubeadm init fails ────────────▶ Check pre-flight errors
│ Check port conflicts
│ Check containerd status
│
├─ kubeadm join fails ────────────▶ Check token validity
│ Check network connectivity
│ Check firewall (port 6443)
│
├─ Node NotReady ─────────────────▶ Check CNI installation
│ Check kubelet logs
│ Check node conditions
│
├─ Pod Pending ───────────────────▶ If CoreDNS: install CNI
│ If other: check node resources
│ Check taints/tolerations
│
├─ Certificate error ─────────────▶ Check certificate expiry
│ Check SAN configuration
│ Check clock synchronization
│
└─ kubelet crashloop ─────────────▶ Check config.yaml exists
Check containerd running
Check cgroup driver match
Quick Diagnostic Commands
# Cluster overview
kubectl get nodes -o wide
kubectl get pods -A
kubectl cluster-info
# Node details
kubectl describe node <node-name>
# Component health
kubectl get componentstatuses # deprecated but sometimes useful
kubectl get --raw='/readyz?verbose'
# kubelet status
systemctl status kubelet
journalctl -u kubelet -f --no-pager
# containerd status
systemctl status containerd
crictl ps
crictl pods
crictl info
# Network
ss -tlnp | grep -E '6443|10250|2379'
ip route
Issue: kubeadm init Fails
Pre-flight Errors
# See what's failing
kubeadm init --dry-run 2>&1 | grep -E '\[ERROR\]|\[WARNING\]'
| Error | Fix |
|---|
[ERROR CRI]: container runtime not running | systemctl start containerd |
[ERROR Swap]: running with swap on | swapoff -a |
[ERROR Port-6443]: Port 6443 in use | Previous cluster: kubeadm reset |
[ERROR DirAvailable]: /etc/kubernetes not empty | rm -rf /etc/kubernetes/* |
[ERROR FileAvailable--etc-kubernetes-manifests-*] | Clean previous manifests |
Port Conflicts
# Check what's using required ports
ss -tlnp | grep -E '6443|2379|2380|10250|10259|10257'
# If previous cluster
kubeadm reset -f
rm -rf /etc/kubernetes /var/lib/kubelet /var/lib/etcd
iptables -F && iptables -t nat -F
containerd Issues
# Check containerd running
systemctl status containerd
# Check CRI plugin enabled (should NOT be in disabled list)
grep disabled_plugins /etc/containerd/config.toml
# Check socket
ls -la /run/containerd/containerd.sock
# Test with crictl
crictl info
Issue: kubeadm join Fails
Connection Refused
# From worker node - test API server reachability
curl -k https://192.168.10.100:6443/healthz
# Expected: "ok"
# If "Connection refused": firewall or API not listening
# On control plane - check API listening
ss -tlnp | grep 6443
# Should show: *:6443 or 0.0.0.0:6443
# Check firewall
systemctl is-active firewalld
# If active:
firewall-cmd --list-ports
# Or disable:
systemctl disable --now firewalld
Token Expired
# On control plane - check token validity
kubeadm token list
# If empty or expired:
kubeadm token create --print-join-command
CA Hash Mismatch
# Regenerate correct hash on control plane
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
openssl rsa -pubin -outform der 2>/dev/null | \
openssl dgst -sha256 -hex | sed 's/^.* //'
TLS Bootstrap Timeout
# On worker - check kubelet logs during join
journalctl -u kubelet -f
# Common causes:
# - Firewall blocking 6443
# - Wrong advertise address (multi-NIC)
# - Network routing issues
Issue: Node NotReady
Check Node Conditions
kubectl describe node <node-name> | grep -A 20 Conditions
| Condition | False Means |
|---|
| Ready | Node unhealthy |
| MemoryPressure | Memory OK |
| DiskPressure | Disk OK |
| PIDPressure | PIDs OK |
| NetworkUnavailable | CNI working |
NetworkUnavailable = True (CNI Missing)
# Check if CNI installed
ls /etc/cni/net.d/
# Empty = no CNI
# Install Flannel
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# Verify
kubectl get pods -n kube-flannel
kubectl get nodes # Should become Ready
kubelet Not Reporting
# On the NotReady node
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50
# Common issues:
# - config.yaml missing: kubeadm join not completed
# - containerd socket missing: containerd not running
# - cgroup driver mismatch: SystemdCgroup not set
Issue: Pods Stuck Pending
CoreDNS Pending (No CNI)
kubectl get pods -n kube-system | grep coredns
# coredns-xxx 0/1 Pending
kubectl describe pod -n kube-system coredns-xxx
# Events: FailedScheduling - no nodes available to schedule
# Fix: Install CNI
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
Other Pods Pending
kubectl describe pod <pod-name> -n <namespace>
# Check Events section
# Common causes:
# - Insufficient resources
# - Node taints without tolerations
# - Node selector mismatch
# - PVC not bound
Check Taints (Control Plane Only Cluster)
# Control plane has NoSchedule taint by default
kubectl describe node | grep Taints
# To schedule on control plane (single-node cluster)
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
Issue: Certificate Errors
Check Certificate Expiry
# All certificates
kubeadm certs check-expiration
# Specific certificate
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
Certificate SAN Issues
# Check API server certificate SANs
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
# If missing required SAN, regenerate:
kubeadm init phase certs apiserver --apiserver-cert-extra-sans=new.domain.com
Renew Certificates
# Renew all
kubeadm certs renew all
# Restart control plane components
crictl pods --name='kube-apiserver|kube-controller|kube-scheduler|etcd' -q | xargs -I {} crictl stop {}
Clock Skew
# Check time on all nodes
date
# Enable NTP
timedatectl set-ntp true
chronyc sources -v
Issue: kubelet Crashlooping
Before kubeadm init/join (Expected)
journalctl -u kubelet | grep "config.yaml"
# "failed to load kubelet config file, path: /var/lib/kubelet/config.yaml"
# This is NORMAL before kubeadm init or join
# kubelet needs config from kubeadm
After kubeadm init/join
# Check config exists
ls -la /var/lib/kubelet/config.yaml
# Check kubeconfig exists
ls -la /etc/kubernetes/kubelet.conf
# Check containerd
systemctl status containerd
# Check cgroup driver match
grep cgroupDriver /var/lib/kubelet/config.yaml
# Should be: cgroupDriver: systemd
grep SystemdCgroup /etc/containerd/config.toml
# Should be: SystemdCgroup = true
Issue: etcd Problems
Check etcd Health
# Using etcdctl with certs
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health
# Check etcd logs
crictl logs $(crictl ps -q --name=etcd)
etcd Data Corruption (Last Resort)
# WARNING: Destroys cluster state
kubeadm reset -f
rm -rf /var/lib/etcd
kubeadm init ...
Full Reset Procedure
# On ALL nodes
kubeadm reset -f
# Clean directories
rm -rf /etc/kubernetes
rm -rf /var/lib/kubelet
rm -rf /var/lib/etcd
rm -rf ~/.kube
# Clean iptables
iptables -F
iptables -t nat -F
iptables -t mangle -F
iptables -X
# Clean CNI
rm -rf /etc/cni/net.d
rm -rf /var/lib/cni
# Restart containerd
systemctl restart containerd
# Then re-init on control plane
kubeadm init --config=kubeadm-init.yaml
Log Collection Script
#!/bin/bash
LOGDIR="/tmp/k8s-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$LOGDIR"
echo "Collecting system info..."
hostnamectl > "$LOGDIR/hostnamectl.txt"
ip addr > "$LOGDIR/ip-addr.txt"
ip route > "$LOGDIR/ip-route.txt"
ss -tlnp > "$LOGDIR/ss-tlnp.txt"
echo "Collecting service status..."
systemctl status kubelet > "$LOGDIR/kubelet-status.txt" 2>&1
systemctl status containerd > "$LOGDIR/containerd-status.txt" 2>&1
echo "Collecting logs..."
journalctl -u kubelet --no-pager > "$LOGDIR/kubelet.log" 2>&1
journalctl -u containerd --no-pager > "$LOGDIR/containerd.log" 2>&1
echo "Collecting Kubernetes info..."
kubectl get nodes -o wide > "$LOGDIR/nodes.txt" 2>&1
kubectl get pods -A -o wide > "$LOGDIR/pods.txt" 2>&1
kubectl describe nodes > "$LOGDIR/nodes-describe.txt" 2>&1
echo "Collecting directory listings..."
ls -la /etc/kubernetes/ > "$LOGDIR/etc-kubernetes.txt" 2>&1
ls -la /var/lib/kubelet/ > "$LOGDIR/var-lib-kubelet.txt" 2>&1
ls -la /etc/cni/net.d/ > "$LOGDIR/cni-config.txt" 2>&1
echo "Collecting crictl info..."
crictl info > "$LOGDIR/crictl-info.txt" 2>&1
crictl ps -a > "$LOGDIR/crictl-ps.txt" 2>&1
crictl pods > "$LOGDIR/crictl-pods.txt" 2>&1
tar -czf "$LOGDIR.tar.gz" -C /tmp "$(basename $LOGDIR)"
echo "Logs collected: $LOGDIR.tar.gz"
Quick Reference: Component Locations
| Component | Config | Logs |
|---|
| kubelet | /var/lib/kubelet/config.yaml | journalctl -u kubelet |
| containerd | /etc/containerd/config.toml | journalctl -u containerd |
| API Server | /etc/kubernetes/manifests/kube-apiserver.yaml | crictl logs <id> |
| etcd | /etc/kubernetes/manifests/etcd.yaml | crictl logs <id> |
| Certificates | /etc/kubernetes/pki/ | N/A |
| kubeconfig | /etc/kubernetes/*.conf | N/A |
| CNI | /etc/cni/net.d/ | Varies by CNI |