Kubernetes Management (Kubernetes 管理)

为运维人员提供 Kubernetes 集群管理和监控功能，包括查看 Pod 状态、日志、资源使用等。

Provides Kubernetes cluster management and monitoring for DevOps engineers, including pod status, logs, resource usage, and more.

Quick Start

View Pods (查看 Pod)

显示 production namespace 中的所有 pods List all pods in the production namespace

Check Pod Logs (查看 Pod 日志)

查看 pod api-server-abc123 的日志 Get logs from pod api-server-abc123

Check Resource Usage (查看资源使用)

显示集群资源使用情况 Show cluster resource usage

Key Features

Pod Management (Pod 管理)

View and monitor pods across namespaces:

Common Operations:

List pods

kubectl get pods -n <namespace>

Get pod details

kubectl describe pod <pod-name> -n <namespace>

View pod logs

kubectl logs <pod-name> -n <namespace>

Follow logs

kubectl logs -f <pod-name> -n <namespace>

Service and Deployment (服务与部署)

Monitor services and deployments:

Common Operations:

List services

kubectl get services -n <namespace>

List deployments

kubectl get deployments -n <namespace>

Check deployment status

kubectl rollout status deployment/<name> -n <namespace>

Resource Monitoring (资源监控)

Check cluster and pod resource usage:

Common Operations:

Node resource usage

kubectl top nodes

Pod resource usage

kubectl top pods -n <namespace>

Check cluster info

kubectl cluster-info

Common Workflows

Workflow 1: Troubleshoot Pod Issues

User: "production namespace 中有哪些 pod 出问题了？"
List pods: kubectl get pods -n production
Identify pods with issues (Not Running, CrashLoopBackOff)
Get pod details: kubectl describe pod <problem-pod>
Check logs: kubectl logs <problem-pod>
Analyze error messages
Suggest solutions

Workflow 2: Check Deployment Health

User: "检查 api-server deployment 的健康状态"
Get deployment: kubectl get deployment api-server -n production
Check replicas: desired vs available
Check rollout status
View recent events
Report status and any issues

Workflow 3: Monitor Resource Usage

User: "集群资源使用情况如何？"
Get node metrics: kubectl top nodes
Get pod metrics: kubectl top pods --all-namespaces
Calculate utilization percentages
Identify resource-intensive pods
Provide recommendations

Examples (示例)

Example 1: List Pods (列出 Pods)

User (Chinese):

显示 production namespace 中的所有 pods

Assistant Response:

我会列出 production namespace 中的所有 pods。

<executes: kubectl get pods -n production>

Production Namespace - Pods:

Name	Status	Restarts	Age
api-server-7d4f8b9c-abc123	Running	0	5d
api-server-7d4f8b9c-def456	Running	1	5d
web-app-5c8a7b-ghi789	Running	0	3d
worker-queue-9f3e2d-jkl012	CrashLoopBackOff	10	2h
database-6a4b5c-mno345	Running	0	15d

Summary:

Total: 5 pods
Running: 4 pods
Issues: 1 pod (worker-queue)

⚠️ Alert: worker-queue-9f3e2d-jkl012 is in CrashLoopBackOff状态。需要检查日志吗？

Example 2: View Pod Logs (查看 Pod 日志)

User (English):

Get logs from pod worker-queue-9f3e2d-jkl012 in production namespace

Assistant Response:

I'll retrieve the logs from the pod.

<executes: kubectl logs worker-queue-9f3e2d-jkl012 -n production --tail=50>

Pod Logs: worker-queue-9f3e2d-jkl012

[2026-01-16 10:15:23] INFO: Worker starting... [2026-01-16 10:15:24] INFO: Connecting to Redis... [2026-01-16 10:15:25] ERROR: Connection failed: redis://redis-service:6379 [2026-01-16 10:15:25] ERROR: Error: getaddrinfo ENOTFOUND redis-service [2026-01-16 10:15:26] INFO: Retrying connection (1/10)... [2026-01-16 10:15:27] ERROR: Connection failed: redis://redis-service:6379 [2026-01-16 10:15:28] FATAL: Max retries exceeded, exiting

Analysis:

Issue: Cannot connect to Redis service
Root Cause: DNS resolution failure for redis-service
Possible Reasons:
1. Redis service not running in the namespace
2. Service name mismatch
3. Network policy blocking connection

Recommendations:

Check if Redis service exists: kubectl get svc redis-service -n production
Verify service endpoints: kubectl get endpoints redis-service -n production
Check network policies
Review deployment configuration

Example 3: Resource Usage (资源使用)

User (Chinese):

显示集群各节点的资源使用情况

Assistant Response:

我会显示集群节点的资源使用情况。

<executes: kubectl top nodes>

Cluster Resource Usage:

Node	CPU (cores)	CPU %	Memory (GB)	Memory %
node-1	3.2 / 8	40%	12.5 / 32	39%
node-2	5.6 / 8	70%	18.2 / 32	57%
node-3	2.1 / 8	26%	8.7 / 32	27%
node-4	6.8 / 8	85%	24.3 / 32	76%

Analysis:

node-4 is running at high utilization (85% CPU, 76% Memory)
node-3 has the most available capacity
Average cluster utilization: 55% CPU, 50% Memory

Recommendations:

Consider redistributing pods from node-4
Monitor node-4 for potential issues
Check if any pods need resource limit adjustments
Review horizontal pod autoscaling settings

Important Notes

Access Control (访问控制)

Requires kubectl configured with cluster access
User permissions defined by RBAC roles
Some operations may be restricted based on role
Always verify namespace access before operations

Safety Guidelines (安全指南)

Read operations only through this skill
No pod deletion or modification
No deployment changes
No configuration updates
For changes, use standard deployment processes

Best Practices (最佳实践)

Always specify namespace
Use labels for filtering
Check resource limits before scaling
Monitor metrics before making decisions
Document all investigations

Prerequisites

kubectl Configuration

Ensure kubectl is configured to access your clusters:

Verify kubectl is installed

kubectl version --client

List available contexts

kubectl config get-contexts

Switch context if needed

kubectl config use-context <context-name>

Verify access

kubectl get nodes

GKE Authentication (GCP)

For GKE clusters:

Authenticate with GCP

gcloud auth login

Get cluster credentials

gcloud container clusters get-credentials <cluster-name> --region=<region>

Verify access

kubectl cluster-info

Required Permissions

Minimum RBAC permissions needed:

pods/list
List pods
pods/get
Get pod details
pods/log
Read pod logs
services/list
List services
deployments/list
List deployments
nodes/list
List nodes

Limitations

Current Limitations

Read-only operations via kubectl commands
No direct MCP server integration (uses kubectl CLI)
No real-time monitoring dashboards
No automated remediation
Limited to configured clusters

Future Enhancements

Direct Kubernetes API integration
Real-time pod monitoring
Automated issue detection
Integration with monitoring systems (Prometheus, Grafana)
Pod restart and scaling capabilities
Log aggregation and search

Troubleshooting

Issue 1: "kubectl: command not found"

Solutions:

Install kubectl: https://kubernetes.io/docs/tasks/tools/
Add to PATH
Restart Claude Code

Issue 2: "Unable to connect to cluster"

Solutions:

Verify kubectl context: kubectl config current-context
Check cluster credentials
Verify network access
Re-authenticate if needed

Issue 3: "Forbidden: User cannot list pods"

Solutions:

Check RBAC permissions
Contact cluster admin
Verify service account
Switch to correct namespace

Related Skills

cloud-resources : Manage cloud infrastructure
Future: monitoring-alerts , incident-response

Safety & Compliance

This skill provides read-only access to Kubernetes clusters:

✅ View pod status and logs
✅ Monitor resource usage
✅ Check service health
❌ Cannot delete pods
❌ Cannot modify deployments
❌ Cannot change configurations

For any modifications, use established CI/CD pipelines and change management processes.

k8s-management

Safety Notice

Copy this and send it to your AI assistant to learn

List pods

Get pod details

View pod logs

Follow logs

List services

List deployments

Check deployment status

Node resource usage

Pod resource usage

Check cluster info

Verify kubectl is installed

List available contexts

Switch context if needed

Verify access

Authenticate with GCP

Get cluster credentials

Verify access

Source Transparency

Related Skills

lark-docs

slurm

lark-messages

mac-setup