k8s-platform-operations

Use when performing cluster health checks, responding to incidents and alerts, planning and managing capacity, conducting maintenance operations, managing backups and recovery, or creating and following runbooks

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "k8s-platform-operations" with this command: npx skills add foxj77/claude-code-skills/foxj77-claude-code-skills-k8s-platform-operations

Kubernetes Platform Operations

Manage day-2 operations of Kubernetes platforms including monitoring, incident response, capacity planning, and operational excellence.

Keywords

kubernetes, operations, monitoring, incident, capacity, maintenance, backup, recovery, health check, runbook, on-call, escalation, node, pod, troubleshooting, performing, responding, planning, managing, conducting, creating

When to Use This Skill

  • Performing cluster health checks
  • Responding to incidents and alerts
  • Planning and managing capacity
  • Conducting maintenance operations
  • Managing backups and recovery
  • Creating and following runbooks

Quick Reference

Task              Command
Cluster health    kubectl get nodes && kubectl get --raw='/healthz?verbose'
Pods not Running  kubectl get pods -A | grep -v Running
Resource usage    kubectl top nodes && kubectl top pods -A
Recent events     kubectl get events -A --sort-by='.lastTimestamp'

Health Monitoring

Cluster Health Checks

# Node status and versions
kubectl get nodes -o wide
# Node resource usage (requires metrics-server)
kubectl top nodes
# System and platform pods
kubectl get pods -n kube-system
kubectl get pods -n platform-system
# API server health endpoints
kubectl get --raw='/healthz?verbose'
kubectl get --raw='/livez?verbose'
kubectl get --raw='/readyz?verbose'
# etcd health as seen by the API server
kubectl get --raw='/healthz/etcd'
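
The endpoint checks above can be wrapped in a small script that flags any failing probe. A minimal sketch, assuming this style of loop is acceptable for your on-call tooling (the probe_endpoints function is illustrative, not part of the skill):

```shell
# probe_endpoints: runs the given probe command against each endpoint read
# from stdin, prints ok/FAIL per endpoint, and returns non-zero on any failure.
probe_endpoints() {
  fail=0
  while read -r ep; do
    if "$@" "$ep" >/dev/null 2>&1; then
      echo "ok   $ep"
    else
      echo "FAIL $ep"
      fail=1
    fi
  done
  return $fail
}

# Live usage against a cluster (probes the API server's raw endpoints):
# printf '%s\n' /healthz /livez /readyz /healthz/etcd | probe_endpoints kubectl get --raw
```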

Key Metrics to Monitor

Metric           Warning   Critical   Action
Node CPU         >70%      >85%       Scale or optimize
Node Memory      >75%      >90%       Scale or evict
Pod restarts     >3/hr     >10/hr     Investigate
API latency p99  >500ms    >1s        Check etcd/load
etcd disk        >70%      >85%       Expand/compact
PVC usage        >75%      >90%       Expand or clean
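
The node CPU and memory thresholds above can be applied mechanically to kubectl top output. A sketch, assuming the thresholds from the table (the flag_nodes function is illustrative):

```shell
# flag_nodes: reads "name cpu% mem%" lines and prints WARN/CRIT per the
# node CPU (>70/>85) and memory (>75/>90) thresholds in the table above.
flag_nodes() {
  awk '{
    cpu = $2 + 0; mem = $3 + 0
    if (cpu > 85)      print "CRIT", $1, "cpu=" cpu "%"
    else if (cpu > 70) print "WARN", $1, "cpu=" cpu "%"
    if (mem > 90)      print "CRIT", $1, "mem=" mem "%"
    else if (mem > 75) print "WARN", $1, "mem=" mem "%"
  }'
}

# Live usage (CPU% is column 3, MEMORY% column 5 of kubectl top nodes):
# kubectl top nodes --no-headers | awk '{print $1, $3, $5}' | flag_nodes
```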

Incident Response

Severity Levels

Level  Impact             Response time  Examples
P1     Platform down      15 min         API unreachable, etcd failure
P2     Major degradation  30 min         Node failures, ingress down
P3     Partial impact     2 hours        Single tenant affected
P4     Minor issue        24 hours       Non-critical alerts

Incident Workflow

1. DETECT    → Alert fires or user report
2. TRIAGE    → Assess severity and impact
3. COMMUNICATE → Update status page/channel
4. INVESTIGATE → Gather evidence
5. MITIGATE  → Restore service (temporary fix OK)
6. RESOLVE   → Fix root cause
7. REVIEW    → Post-incident analysis

Escalation Path

L1 (On-call) → 15 min no progress
    ↓
L2 (Senior SRE) → 30 min no progress
    ↓
L3 (Platform Lead) → Critical impact
    ↓
Management (if customer-facing outage)

On-Call Handoff Format

Include: Active Issues (severity + status), Recent Changes (what + when), Upcoming Maintenance, Watchlist, Notes

Common Runbooks

Node NotReady

# 1. Check node status
kubectl describe node ${NODE_NAME}

# 2. Check kubelet (SSH to node)
journalctl -u kubelet -n 100 --no-pager

# 3. Check resources
kubectl top node ${NODE_NAME}

# 4. If unrecoverable
kubectl cordon ${NODE_NAME}
kubectl drain ${NODE_NAME} --ignore-daemonsets --delete-emptydir-data

Pod CrashLoopBackOff

# 1. Get details
kubectl describe pod ${POD} -n ${NS}

# 2. Check logs
kubectl logs ${POD} -n ${NS}
kubectl logs ${POD} -n ${NS} --previous

# 3. Check events
kubectl get events -n ${NS} --sort-by='.lastTimestamp' | grep ${POD}

# 4. Check resources
kubectl get pod ${POD} -n ${NS} -o jsonpath='{.spec.containers[*].resources}'
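
The container's last exit code often points at the cause. A sketch of a helper mapping common exit codes to likely causes (explain_exit is illustrative; the mapping follows common Linux/Kubernetes signal conventions):

```shell
# Last exit code of the first container (live cluster):
# kubectl get pod ${POD} -n ${NS} \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# explain_exit: maps common container exit codes to likely causes.
explain_exit() {
  case "$1" in
    0)   echo "completed normally (check restartPolicy)" ;;
    1)   echo "application error (check logs)" ;;
    137) echo "SIGKILL, often OOMKilled (check memory limits)" ;;
    143) echo "SIGTERM (graceful shutdown, check probes/preStop hooks)" ;;
    *)   echo "exit code $1 (check application docs)" ;;
  esac
}
```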

High Memory Pressure

# 1. Find top consumers
kubectl top pods -A --sort-by=memory | head -20

# 2. Check evictions
kubectl get pods -A --field-selector=status.phase=Failed | grep Evicted

# 3. Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}{end}'

# 4. Identify leaks (high restarts)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} restarts={.status.containerStatuses[0].restartCount}{"\n"}{end}' | sort -t= -k2 -rn | head -20
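
A related check under memory pressure: list pods whose last termination reason was OOMKilled. A sketch assuming the two-field "ns/pod reason" output shown in the live pipeline comment (filter_oom is illustrative):

```shell
# filter_oom: keeps lines whose second field is OOMKilled and prints the pod.
filter_oom() { awk '$2 == "OOMKilled" {print $1}'; }

# Live usage: emit "ns/pod last-termination-reason" per pod, then filter:
# kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' | filter_oom
```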

Capacity Planning

Resource Analysis

# Cluster summary
kubectl top nodes --no-headers | awk '{cpu+=$3; mem+=$5} END {print "Avg CPU:", cpu/NR"%", "Avg Mem:", mem/NR"%"}'

# Namespace consumption
kubectl top pods -A --no-headers | awk '{ns[$1]+=$3} END {for(n in ns) print n, ns[n]"m"}' | sort -k2 -rn

# Pod CPU requests (first container only); compare against kubectl top pods
# output to spot over-provisioning (requests far above actual usage)
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} req={.spec.containers[0].resources.requests.cpu}{"\n"}{end}'
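
To actually compare requests with live usage, the requests output has to be joined with kubectl top pods per pod. A rough sketch of the comparison step, operating on pre-joined "pod request_m usage_m" lines (flag_overprovisioned and the 2x ratio are illustrative assumptions, not part of the skill):

```shell
# flag_overprovisioned: reads "pod request_m usage_m" lines and flags pods
# whose CPU request is more than double their observed usage.
flag_overprovisioned() {
  awk '{
    req = $2 + 0; use = $3 + 0
    if (use > 0 && req > 2 * use)
      printf "%s request=%sm usage=%sm (%.1fx)\n", $1, req, use, req / use
  }'
}

# A live pipeline would join per-pod requests (command above) with
# kubectl top pods output before calling flag_overprovisioned.
```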

Capacity Thresholds

  • Green: <60% - Healthy headroom
  • Yellow: 60-80% - Plan scaling
  • Red: >80% - Scale immediately
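
The bands above can be applied to the cluster averages from the Resource Analysis commands. A minimal sketch (capacity_band is illustrative):

```shell
# capacity_band: maps a utilization percentage to the thresholds above.
capacity_band() {
  pct=${1%\%}   # strip a trailing % if present
  if   [ "$pct" -lt 60 ]; then echo "Green: healthy headroom"
  elif [ "$pct" -le 80 ]; then echo "Yellow: plan scaling"
  else                         echo "Red: scale immediately"
  fi
}

# e.g. feed it the cluster average from the summary command above:
# capacity_band 72
```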

Maintenance Operations

Node Maintenance

# 1. Cordon (prevent new pods)
kubectl cordon ${NODE}

# 2. Drain (evict pods)
kubectl drain ${NODE} \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=5m

# 3. Perform maintenance...

# 4. Uncordon
kubectl uncordon ${NODE}

Rolling Restart

kubectl rollout restart deployment/${NAME} -n ${NS}
kubectl rollout status deployment/${NAME} -n ${NS}

Certificate Check

kubeadm certs check-expiration
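
To surface certificates expiring soon, the RESIDUAL TIME column of that output can be filtered. A rough sketch that assumes kubeadm's tabular format, where residual time appears as a day count like "29d" (expiring_soon and the roughly-30-day window are illustrative):

```shell
# Flag certificates with a 1-2 digit day count remaining under 30 days
# (matches " 7d " or " 29d " but not " 364d "):
expiring_soon() { grep -E ' ([0-9]|[12][0-9])d( |$)'; }

# Live usage:
# kubeadm certs check-expiration | expiring_soon
```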

Backup & Recovery

etcd Backup

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
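
A backup is only useful if it is readable, so verify each snapshot after saving it. A minimal sanity-check sketch (verify_snapshot is illustrative; the etcdctl snapshot status command is the authoritative check and requires etcdctl on the node):

```shell
# verify_snapshot: fails fast if the snapshot file is missing or empty.
verify_snapshot() {
  f=$1
  if [ -s "$f" ]; then
    echo "OK $f ($(wc -c < "$f") bytes)"
  else
    echo "MISSING_OR_EMPTY $f"
    return 1
  fi
}

# Fuller check, on a node with etcdctl installed:
# ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-$(date +%Y%m%d).db -w table
```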

Velero Backup

velero backup create platform-$(date +%Y%m%d) \
  --include-namespaces platform-system \
  --ttl 720h

velero schedule create daily-platform \
  --schedule="0 2 * * *" \
  --include-namespaces platform-system

Operational Checklists

Daily

  • Review alerting dashboard
  • Check node health
  • Verify backup completion
  • Review capacity metrics

Weekly

  • Audit resource quotas
  • Check certificate expiry
  • Review pending updates
  • Update runbooks

Monthly

  • Capacity planning review
  • Security patch assessment
  • Cost optimization review
  • Tenant usage reports

Post-Incident Review Format

Structure reviews with: Summary (duration, severity, impact), Timeline (time + event table), Root Cause, What Went Well, What Could Be Improved, Action Items (action, owner, due date)
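
The format above can be kept as a copyable skeleton for on-call engineers. A minimal sketch that prints one (review_template is illustrative; the headings mirror the structure described, everything else is placeholder):

```shell
# Print a post-incident review skeleton to stdout (redirect to a file to use).
review_template() {
cat <<'EOF'
Post-Incident Review: <title>

Summary
Duration: | Severity: | Impact:

Timeline
Time | Event

Root Cause

What Went Well

What Could Be Improved

Action Items
Action | Owner | Due Date
EOF
}
```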

Common Mistakes

  • Draining a node without explicitly cordoning first: kubectl drain does cordon the node itself, but running kubectl cordon first lets you confirm scheduling has stopped and assess impact before evictions begin.
  • Skipping the COMMUNICATE step in incidents: stakeholders make assumptions and duplicate investigations start. Instead, update the status page or channel before deep-diving.
  • Running etcd backups without verifying the restore procedure: the backup may be corrupt or incompatible, and you won't find out until you need it. Instead, periodically test a restore to a non-production cluster.
  • Rotating certificates without checking dependent services: services using the old certificate break silently. Instead, inventory certificate consumers before rotation.
  • Ignoring "Warning" events because pods are Running: warnings often precede failures (e.g., mount issues, throttling). Instead, review kubectl get events as part of daily checks.

MCP Tools

  • mcp__flux-operator-mcp__get_kubernetes_resources
  • mcp__flux-operator-mcp__get_kubernetes_logs
  • mcp__flux-operator-mcp__get_kubernetes_metrics

