skill:kubernetes-specialist - Kubernetes Cluster Management & Orchestration
Version: 1.0.0
Purpose
The kubernetes-specialist skill designs and manages production-ready Kubernetes clusters, implements deployment strategies, creates custom operators, and configures autoscaling, networking, security, and observability for containerized workloads.
Use this skill when:
-
Setting up production Kubernetes clusters (AKS, EKS, GKE, self-hosted)
-
Designing microservices deployment strategies
-
Creating custom Kubernetes operators
-
Implementing autoscaling (HPA, VPA, Cluster Autoscaler)
-
Configuring Kubernetes networking (CNI, Ingress, Service Mesh)
-
Securing Kubernetes workloads (RBAC, Network Policies, Pod Security)
-
Troubleshooting cluster and application issues
-
Migrating workloads to Kubernetes
Produces:
-
Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets)
-
Helm charts for application packaging
-
Custom Resource Definitions (CRDs) and operators
-
Cluster configuration and setup scripts
-
Autoscaling and resource optimization recommendations
-
Security policies and RBAC configurations
-
Troubleshooting guides and runbooks
File Structure
skills/kubernetes-specialist/ ├── SKILL.md (this file) ├── examples.md └── templates/ └── k8s_architecture_template.md
Interface References
-
Context: Loaded via ContextProvider Interface
-
Memory: Accessed via MemoryStore Interface
-
Shared Patterns: Shared Loading Patterns
-
Schemas: Validated against context_metadata.schema.json and memory_entry.schema.json
Mandatory Workflow
IMPORTANT: Execute ALL steps in order. Do not skip any step.
Step 1: Initial Analysis
-
Gather cluster requirements:
-
Workload type: Stateless web apps, stateful databases, batch jobs, ML training
-
Scale expectations: Number of services, pods, nodes, requests/sec
-
Cloud provider: AKS (Azure), EKS (AWS), GKE (Google), or self-hosted
-
High availability: Multi-zone, multi-region requirements
-
Compliance: Security standards (CIS Kubernetes Benchmark, PCI-DSS)
-
Budget constraints: Node costs, storage costs, network egress
-
Team expertise: Kubernetes experience level
-
Detect existing Kubernetes configuration:
-
Analyze k8s/ , kubernetes/ , manifests/ , .yaml files
-
Review helm/ charts and kustomize/ overlays
-
Check for existing operators and CRDs
-
Identify current deployment patterns
-
Determine project name for memory lookup
Step 2: Load Memory
Follow Standard Memory Loading with skill="kubernetes-specialist" and domain="azure" (or detected domain).
Load project-specific memory:
memoryStore.getSkillMemory("kubernetes-specialist", "{project-name}")
Check for cross-skill insights:
memoryStore.getByProject("{project-name}")
Review memory for:
-
Previous cluster configurations and architecture decisions
-
Deployment patterns and strategies used
-
Autoscaling configurations and effectiveness
-
Security policies and RBAC configurations
-
Performance benchmarks and resource utilization
-
Troubleshooting history and incident learnings
Step 3: Load Context
Follow Standard Context Loading for the azure domain and other relevant domains. Stay within the file budget declared in frontmatter.
Use context indexes:
contextProvider.getDomainIndex("azure") contextProvider.getDomainIndex("docker") contextProvider.getDomainIndex("engineering") contextProvider.getDomainIndex("security")
Load relevant context files based on project needs:
-
Azure AKS patterns if using Azure Kubernetes Service
-
Docker/container patterns for image management
-
CI/CD patterns for deployment automation
-
Security guidelines for cluster hardening
Budget: 6 files maximum
Step 4: Cluster Architecture Design
-
Design Kubernetes cluster architecture:
-
Control Plane: Managed (AKS/EKS/GKE) vs self-hosted
-
Node Pools: System nodes, user workload nodes, GPU nodes
-
Node Sizing: VM sizes based on workload requirements
-
Availability Zones: Multi-AZ for high availability
-
Networking: CNI plugin (Azure CNI, AWS VPC CNI, Calico, Cilium)
-
Storage: StorageClasses for persistent volumes (Azure Disk/Files, EBS, GCE PD)
-
Ingress: Ingress controller selection (NGINX, Traefik, Kong, Istio Gateway)
-
DNS: CoreDNS configuration and external DNS integration
-
Choose Kubernetes distribution and justify:
-
AKS: Best for Azure workloads, managed control plane, AAD integration
-
EKS: Best for AWS workloads, managed control plane, IAM integration
-
GKE: Best for GCP workloads, best-in-class Kubernetes experience
-
Self-hosted: Best for on-premise, maximum control, air-gapped environments
-
Define cluster sizing and scaling strategy
-
Plan for disaster recovery and backup (Velero)
Step 5: Application Deployment Strategy
-
Design deployment patterns:
-
Deployment vs StatefulSet vs DaemonSet: Choose based on workload type
-
ReplicaSet sizing: Initial replicas and scaling bounds
-
Update strategy: RollingUpdate vs Recreate
-
PodDisruptionBudgets: Ensure availability during updates
-
Resource requests/limits: CPU and memory allocation
-
Implement deployment strategies:
-
Rolling Updates: Zero-downtime gradual rollout
-
Blue/Green: Complete environment switch
-
Canary: Progressive traffic shifting (Flagger, Argo Rollouts)
-
A/B Testing: Feature-based traffic routing
-
Design service exposure:
-
ClusterIP: Internal-only services
-
LoadBalancer: External cloud load balancer
-
NodePort: Direct node access (dev/testing)
-
Ingress: HTTP/S routing with path/host rules
-
Configure health checks:
-
Liveness probes: Restart unhealthy pods
-
Readiness probes: Remove unhealthy pods from load balancing
-
Startup probes: Handle slow-starting containers
Step 6: Autoscaling Configuration
-
Implement horizontal autoscaling:
-
Horizontal Pod Autoscaler (HPA):
-
Metrics: CPU, memory, custom metrics (queue depth, requests/sec)
-
Target utilization thresholds
-
Min/max replica counts
-
Scale-up/down behavior and stabilization windows
-
KEDA (Event-driven autoscaling):
-
Scale based on external metrics (Azure Queue, AWS SQS, Kafka)
-
Scale to zero for cost optimization
-
Implement vertical autoscaling:
-
Vertical Pod Autoscaler (VPA):
-
Automatic resource request/limit optimization
-
Update mode: Auto vs Recreate vs Initial vs Off
-
Resource policies for critical workloads
-
Implement cluster autoscaling:
-
Cluster Autoscaler: Add/remove nodes based on pending pods
-
Node pool configuration: Min/max node counts per pool
-
Scale-down policies: Graceful node draining
-
Cost optimization: Mix of spot/preemptible and on-demand nodes
Step 7: Networking & Service Mesh
-
Configure networking:
-
CNI Plugin: Choose and configure network plugin
-
Azure CNI: Azure VNET integration
-
AWS VPC CNI: AWS VPC integration
-
Calico: Network policies and performance
-
Cilium: eBPF-based networking and security
-
Network Policies: Pod-to-pod traffic rules
-
Service mesh evaluation: Istio, Linkerd, Consul
-
Implement ingress strategy:
-
Ingress Controller: NGINX, Traefik, Kong, Contour
-
TLS termination: Cert-manager for automatic certificate management
-
Rate limiting: Protect backend services
-
Path-based routing: Multiple services behind single domain
-
Sticky sessions: Session affinity when needed
-
Service mesh implementation (if needed):
-
Traffic management: Canary deployments, A/B testing, circuit breaking
-
Observability: Distributed tracing, metrics, service graph
-
Security: mTLS between services, authorization policies
-
Resilience: Retries, timeouts, fault injection
Step 8: Security Hardening
-
Implement RBAC (Role-Based Access Control):
-
Roles and ClusterRoles: Define permissions
-
RoleBindings and ClusterRoleBindings: Assign to users/groups
-
ServiceAccounts: Pod identity and permissions
-
Principle of least privilege: Minimal permissions required
-
Configure pod security:
-
Pod Security Standards: Privileged, Baseline, Restricted
-
SecurityContext: runAsNonRoot, readOnlyRootFilesystem, capabilities
-
AppArmor/SELinux: Mandatory access control
-
Seccomp profiles: Syscall filtering
-
Implement network security:
-
Network Policies: Default deny, allow specific traffic
-
Private cluster: No public API endpoint
-
Authorized networks: IP whitelisting for API access
-
Secrets management:
-
Kubernetes Secrets: Base64 encoded (not encrypted by default)
-
External Secrets Operator: Sync from Azure Key Vault, AWS Secrets Manager
-
Sealed Secrets: Encrypted secrets in Git
-
Secret rotation: Automated credential updates
-
Image security:
-
Private container registry: ACR, ECR, GCR, Harbor
-
Image scanning: Trivy, Anchore, Snyk for vulnerabilities
-
Image signing: Cosign, Notary for supply chain security
-
Admission controllers: OPA Gatekeeper, Kyverno for policy enforcement
Step 9: Observability & Monitoring
-
Implement logging:
-
Container logs: stdout/stderr collection
-
Log aggregation: Fluentd, Fluent Bit, Promtail
-
Log storage: Elasticsearch, Loki, Azure Log Analytics, CloudWatch
-
Structured logging: JSON format for parsing
-
Implement metrics:
-
Prometheus: Metrics collection and storage
-
Metrics-server: Resource metrics for HPA
-
Custom metrics: Application-specific metrics via Prometheus exporter
-
Grafana: Visualization dashboards
-
Pre-built dashboards: Cluster health, node utilization, pod metrics
-
Implement distributed tracing:
-
Jaeger or Tempo: Trace collection and storage
-
OpenTelemetry: Instrumentation standard
-
Service dependency graph: Visualize service interactions
-
Implement alerting:
-
Prometheus Alertmanager: Alert routing and deduplication
-
Alert rules: CPU/memory threshold, pod restarts, deployment failures
-
Notification channels: Slack, PagerDuty, email, webhook
-
Runbook automation: Link alerts to troubleshooting docs
Step 10: Custom Operators (if needed)
-
Evaluate operator need:
-
Stateful applications: Databases, message queues, caches
-
Complex lifecycle: Backup, restore, upgrade automation
-
Multi-resource coordination: Related Kubernetes resources
-
Operator implementation:
-
Operator Framework: Operator SDK, Kubebuilder
-
Custom Resource Definitions (CRDs): Define custom resources
-
Controller logic: Reconciliation loop
-
Idempotency: Safe to run multiple times
-
Finalizers: Cleanup on resource deletion
-
Operator deployment:
-
Operator Lifecycle Manager (OLM): Operator installation and updates
-
OperatorHub: Discover pre-built operators
-
Helm chart: Package operator for deployment
-
RBAC: Operator service account permissions
Step 11: Package Management & GitOps
-
Implement Helm charts:
-
Chart structure: templates/, values.yaml, Chart.yaml
-
Templating: Parameterize configurations
-
Dependencies: Manage chart dependencies
-
Versioning: Semantic versioning for charts
-
Repository: Helm chart repository (ChartMuseum, Artifact Hub)
-
Implement Kustomize overlays:
-
Base manifests: Common configuration
-
Overlays: Environment-specific customization (dev, staging, prod)
-
Patches: JSON/YAML patches for modifications
-
ConfigMap/Secret generators: Generate from files
-
Implement GitOps:
-
Git as source of truth: All manifests in version control
-
Argo CD or Flux: Continuous deployment to cluster
-
Automated sync: Git commit triggers deployment
-
Drift detection: Alert on manual cluster changes
-
Rollback: Git revert for deployment rollback
Step 12: Generate Output
-
Save Kubernetes documentation to /claudedocs/kubernetes-specialist_{project}_{YYYY-MM-DD}.md
-
Follow naming conventions in ../OUTPUT_CONVENTIONS.md
-
Use template from templates/k8s_architecture_template.md if available
-
Include:
-
Cluster architecture diagram
-
Complete Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets, Ingress)
-
Helm charts or Kustomize overlays
-
Autoscaling configurations (HPA, VPA, Cluster Autoscaler)
-
RBAC policies and security configurations
-
Monitoring and alerting setup
-
Custom operators (if applicable)
-
Troubleshooting guide and runbooks
-
Disaster recovery and backup procedures
-
Next steps and optimization recommendations
Step 13: Update Memory
Follow Standard Memory Update for skill="kubernetes-specialist" .
Store learned insights:
memoryStore.updateSkillMemory("kubernetes-specialist", "{project-name}", { cluster_patterns: [...], deployment_strategies: [...], autoscaling_configs: [...], security_policies: [...], lessons_learned: [...] })
Update memory with:
-
Cluster architecture decisions and rationale
-
Deployment patterns and strategies
-
Autoscaling configurations and effectiveness
-
Networking and service mesh choices
-
Security policies and RBAC configurations
-
Performance metrics and resource utilization
-
Incident history and troubleshooting learnings
-
Operator implementations and learnings
Compliance Checklist
Before completing, verify:
-
All mandatory workflow steps executed in order
-
Standard Memory Loading pattern followed (Step 2)
-
Standard Context Loading pattern followed (Step 3)
-
Cluster architecture designed with HA and scaling (Step 4)
-
Deployment strategy defined with health checks (Step 5)
-
Autoscaling configured (HPA, VPA, Cluster Autoscaler) (Step 6)
-
Networking and ingress configured (Step 7)
-
Security hardening implemented (RBAC, Pod Security, Network Policies) (Step 8)
-
Observability stack configured (Logging, Metrics, Tracing, Alerting) (Step 9)
-
Custom operators implemented if needed (Step 10)
-
GitOps and package management configured (Step 11)
-
Output saved with standard naming convention (Step 12)
-
Standard Memory Update pattern followed (Step 13)
Kubernetes Expertise Areas
- Cluster Management
-
Multi-tenant clusters with namespace isolation
-
Cluster upgrades and maintenance windows
-
Node pool management (system, user, GPU)
-
Cluster backup and disaster recovery (Velero)
-
Multi-cluster management (Rancher, Anthos, Azure Arc)
- Workload Deployment
-
Deployment strategies (rolling, blue/green, canary)
-
StatefulSets for stateful applications
-
DaemonSets for node-level services
-
Jobs and CronJobs for batch processing
-
Init containers and sidecar patterns
- Networking
-
Service types (ClusterIP, NodePort, LoadBalancer, ExternalName)
-
Ingress controllers and routing rules
-
Network policies for pod-to-pod communication
-
Service mesh (Istio, Linkerd, Consul)
-
DNS and service discovery
- Storage
-
Persistent Volumes and Persistent Volume Claims
-
StorageClasses for dynamic provisioning
-
Volume snapshots and cloning
-
CSI drivers for cloud storage
-
StatefulSet volume management
- Autoscaling
-
Horizontal Pod Autoscaler (HPA) - pod count
-
Vertical Pod Autoscaler (VPA) - resource requests
-
Cluster Autoscaler - node count
-
KEDA - event-driven autoscaling
-
Predictive autoscaling with custom metrics
- Security
-
RBAC for access control
-
Pod Security Standards (Privileged, Baseline, Restricted)
-
Network Policies for traffic control
-
Secrets management (External Secrets, Sealed Secrets)
-
Image scanning and admission control (OPA, Kyverno)
- Observability
-
Prometheus and Grafana for metrics
-
ELK/Loki stack for logging
-
Jaeger/Tempo for distributed tracing
-
Alertmanager for alerting
-
Service Level Objectives (SLOs) and SLIs
Deployment Strategy Comparison
Strategy Downtime Rollback Speed Resource Overhead Complexity Best For
Rolling Update Zero Fast (revert) Low (gradual) Low Most applications
Blue/Green Zero Instant (switch) High (2x resources) Medium Critical apps
Canary Zero Fast (instant) Medium (extra pods) High Risk mitigation
Recreate Yes Fast (redeploy) None Very Low Dev/test only
Ingress Controller Comparison
Controller Features Performance TLS Complexity Best For
NGINX Mature, feature-rich Excellent cert-manager Medium General purpose
Traefik Dynamic config, middleware Very Good Built-in LE Low Modern apps
Kong API gateway, plugins Excellent Advanced High API-heavy
Istio Gateway Service mesh integration Good Advanced Very High Service mesh
Contour Envoy-based, simple Excellent cert-manager Low Simplicity
Service Mesh Comparison
Feature Istio Linkerd Consul Best Choice
Complexity High Low Medium Linkerd (simplicity)
Performance Good Excellent Good Linkerd (lowest overhead)
Features Most complete Essential only Good Istio (feature-rich)
Observability Excellent Excellent Good Istio/Linkerd (tie)
Multi-cluster Excellent Good Excellent Istio/Consul
Learning Curve Steep Gentle Moderate Linkerd (easiest)
Resource Usage High Low Medium Linkerd (lightest)
Managed Kubernetes Service Comparison
Feature AKS (Azure) EKS (AWS) GKE (Google)
Kubernetes Version Latest-1 Latest-2 Latest (fastest)
Control Plane Cost Free $0.10/hour Free
Upgrade Experience Good Manual Excellent (auto)
Integration Azure services AWS services GCP services
Networking Azure CNI VPC CNI GKE CNI
RBAC Integration Azure AD IAM Google IAM
Monitoring Azure Monitor CloudWatch Cloud Monitoring
Best For Azure workloads AWS workloads Best K8s experience
Common Kubernetes Patterns
Pattern 1: Sidecar Container
-
Purpose: Extend/enhance main container functionality
-
Use Cases: Log shipping, service mesh proxy, config reload
-
Example: Envoy proxy alongside application container
Pattern 2: Init Container
-
Purpose: Setup tasks before main container starts
-
Use Cases: Database migrations, config download, wait for dependencies
-
Example: Wait for database to be ready before starting app
Pattern 3: Ambassador Container
-
Purpose: Proxy connections to external services
-
Use Cases: Database connection pooling, protocol translation
-
Example: Cloud SQL proxy for secure database access
Pattern 4: Adapter Container
-
Purpose: Standardize output from main container
-
Use Cases: Log format conversion, metrics normalization
-
Example: Convert app logs to structured JSON for Fluentd
Pattern 5: Multi-Container Pod
-
Purpose: Tightly coupled containers sharing resources
-
Use Cases: Web server + log shipper, app + monitoring agent
-
Example: NGINX + Fluentd log collector
Resource Management Best Practices
- Set Resource Requests and Limits
resources: requests: cpu: 100m # Minimum guaranteed memory: 128Mi limits: cpu: 500m # Maximum allowed memory: 512Mi
- Use Quality of Service (QoS) Classes
-
Guaranteed: requests = limits (highest priority)
-
Burstable: requests < limits (medium priority)
-
BestEffort: no requests/limits (lowest priority, evicted first)
- Configure LimitRanges
-
Set default requests/limits for namespaces
-
Prevent resource hogging
-
Enforce organizational standards
- Use ResourceQuotas
-
Limit total resources per namespace
-
Prevent single team from consuming all resources
-
Track usage and billing
Version History
Version Date Changes
1.0.0 2026-02-12 Initial release with comprehensive Kubernetes orchestration capabilities