RKE2 Operations
Overview
RKE2 is a FIPS-compliant Kubernetes distribution that manages its own TLS certificates and provides built-in upgrade mechanisms. Understanding certificate lifecycle and upgrade procedures is essential for maintaining cluster health and security.
Core principle: Always upgrade control plane (server) nodes before worker (agent) nodes. Never skip certificate inspection before rotation, and never skip pre-upgrade health checks before version upgrades.
When to Use
- Inspecting or rotating RKE2 TLS certificates
- Upgrading RKE2 cluster versions (manual or automated)
- Deploying the System Upgrade Controller for automated rolling upgrades
- Troubleshooting certificate expiration warnings or TLS errors
- Planning maintenance windows for certificate or version operations
Not for: Initial RKE2 installation (use rke2-deployment), Kubespray-managed clusters (use kubespray-operations), Rancher UI-driven upgrades (use Rancher documentation)
Certificate Management
Certificate Validity
| Certificate Type | Default Validity | Notes |
|---|---|---|
| Client certificates | 365 days | API server, scheduler, controller-manager, kubelet, kube-proxy, etcd |
| Server certificates | 365 days | All serving certificates |
| CA certificates | 10 years | Root trust anchors, not rotated automatically |
All RKE2 components communicate over TLS. Every API call, etcd transaction, and kubelet heartbeat uses mutual TLS authentication.
Auto-Renewal Behavior
RKE2 checks certificate expiration on every service start. If any certificate is within 120 days of expiry, RKE2 automatically renews it during startup. RKE2 also emits CertificateExpirationWarning events when certificates are less than 120 days from expiry.
Implication: If your cluster runs continuously without service restarts for longer than 245 days (365 minus 120), certificates will NOT be auto-renewed. Regular maintenance restarts or manual rotation are required.
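The renewal window can also be checked directly with openssl, independent of the rke2 CLI. A minimal sketch (the TLS path below is the default RKE2 server location; the 120-day threshold mirrors the auto-renewal window):

```shell
# Report days until each server certificate expires; flag anything
# already inside the 120-day auto-renewal window. The path is the
# default RKE2 server TLS directory -- adjust for your install.
days_until_expiry() {  # $1 = path to a PEM certificate
  end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

for crt in /var/lib/rancher/rke2/server/tls/*.crt; do
  [ -f "$crt" ] || continue
  d=$(days_until_expiry "$crt")
  status=ok; [ "$d" -lt 120 ] && status=expiring
  printf '%-55s %5s days  %s\n' "$crt" "$d" "$status"
done
```

Note that date -d is GNU date syntax; on non-GNU systems substitute the local equivalent.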
Inspecting Certificates
rke2 certificate check --output table
Output columns:
| Column | Description |
|---|---|
| FILENAME | Path to the certificate file |
| SUBJECT | Certificate subject (CN and O fields) |
| USAGES | Key usage (client auth, server auth, or both) |
| EXPIRES | Expiration date and time |
| RESIDUAL TIME | Time remaining until expiry |
| STATUS | ok or expiring (within 120 days) |
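On large clusters the table can be noisy; a small filter keeps only the rows that need attention (a sketch, assuming STATUS is the final column as in the layout above):

```shell
# Keep the header plus any certificate already inside the renewal
# window. Assumes STATUS is the last column of the table output.
expiring_rows() {  # reads `rke2 certificate check --output table` on stdin
  awk 'NR == 1 || $NF == "expiring"'
}
rke2 certificate check --output table | expiring_rows
```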
Server Node Certificates
A server (control plane) node holds certificates for all components:
| Component | Purpose |
|---|---|
| kube-apiserver | API server serving and client certificates |
| kube-scheduler | Scheduler client certificate for API server auth |
| kube-controller-manager | Controller manager client certificate |
| kubelet | Kubelet client and serving certificates |
| kube-proxy | Proxy client certificate |
| etcd | etcd peer, server, and client certificates |
| rke2-supervisor | Supervisor API serving certificate |
Agent Node Certificates
An agent (worker) node holds a smaller set:
| Component | Purpose |
|---|---|
| kubelet | Kubelet client and serving certificates |
| kube-proxy | Proxy client certificate |
| rke2-controller | Agent controller client certificate |
Manual Certificate Rotation
Use manual rotation when certificates are approaching expiry and you cannot rely on a service restart triggering auto-renewal, or when you need to rotate certificates immediately for security reasons.
Step-by-Step Procedure
Step 1: Stop the RKE2 server service
systemctl stop rke2-server
Step 2: Rotate all certificates
rke2 certificate rotate
This command:
- Generates new certificates for all components
- Backs up old certificates to a timestamped directory (e.g., /var/lib/rancher/rke2/server/tls-YYYY-MM-DDTHH-MM-SS/)
- The backup allows rollback if anything goes wrong
Step 3: Verify new certificate dates
rke2 certificate check --output table
Confirm that the EXPIRES column shows dates approximately 365 days from now and that STATUS shows ok for all entries.
Step 4: Start the RKE2 server service
systemctl start rke2-server
Step 5: Update your local kubeconfig
The rotation generates a new admin client certificate embedded in rke2.yaml. Copy it to your working kubeconfig:
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config
If accessing the cluster remotely, also update the server: field in the kubeconfig to the correct external address.
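The copy-and-repoint step can be wrapped in a small helper. A sketch: the external URL is a placeholder for your load balancer or server node, and the default in-cluster address https://127.0.0.1:6443 written by RKE2 into rke2.yaml is assumed:

```shell
# Copy the regenerated kubeconfig and point it at an external address.
# The default rke2.yaml targets https://127.0.0.1:6443; the URL passed
# in is a placeholder for your environment.
update_kubeconfig() {  # $1 = source rke2.yaml, $2 = destination, $3 = external URL
  cp "$1" "$2" || return 1
  sed -i "s#server: https://127.0.0.1:6443#server: $3#" "$2"
}
# Example:
# update_kubeconfig /etc/rancher/rke2/rke2.yaml ~/.kube/config \
#   https://rke2.example.com:6443
```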
Step 6: Verify cluster health
# Nodes should be Ready
kubectl get nodes
# All system pods running
kubectl get pods -n kube-system
# API server responsive
kubectl get --raw='/readyz?verbose'
Worker Node Behavior After Rotation
Worker (agent) nodes automatically reconnect to the server and receive new certificates. No manual action is required on agent nodes. The agent detects the trust chain change on its next heartbeat and re-enrolls.
Multi-Server Rotation
For HA clusters with multiple server nodes, rotate certificates on each server node one at a time:
# On server-1
systemctl stop rke2-server
rke2 certificate rotate
rke2 certificate check --output table
systemctl start rke2-server
# Wait for server-1 to fully rejoin before proceeding
# On server-2
systemctl stop rke2-server
rke2 certificate rotate
rke2 certificate check --output table
systemctl start rke2-server
# Wait for server-2, then proceed to server-3, etc.
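The per-server sequence above lends itself to a small SSH loop. A hedged sketch: host names are placeholders, passwordless sudo on each server is assumed, and kubectl is assumed to run from a workstation with a valid kubeconfig:

```shell
# Rotate certificates on one HA server and wait for it to rejoin
# before the caller moves on to the next. Host names and sudo access
# are assumptions about your environment.
rotate_server() {  # $1 = server host name (also the node name)
  ssh "$1" 'sudo systemctl stop rke2-server &&
            sudo rke2 certificate rotate &&
            sudo systemctl start rke2-server' || return 1
  kubectl wait --for=condition=Ready "node/$1" --timeout=10m
}
# Example, strictly one server at a time:
# for s in server-1 server-2 server-3; do rotate_server "$s" || break; done
```

The kubectl wait gate is what enforces "fully rejoin before proceeding": a failed rotation stops the loop instead of cascading to the next server.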
Manual Version Upgrade
Pre-Upgrade Monitoring
Before starting any upgrade, establish baseline monitoring in separate terminals:
# Terminal 1: Watch application availability
watch -n 2 'curl -s -o /dev/null -w "%{http_code}" http://<app-endpoint>'
# Terminal 2: Watch pod status
watch -n 2 'kubectl get pods -A -o wide'
# Terminal 3: Watch node status
watch -n 2 'kubectl get nodes -o wide'
# Terminal 4: Check etcd cluster health
ETCDCTL_API=3 etcdctl member list \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--endpoints=https://127.0.0.1:2379
Check Available Versions
curl -s https://update.rke2.io/v1-release/channels | jq '.data[] | {id, latest}'
This shows all release channels and their latest resolved versions.
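To resolve a single channel rather than scanning the full list, the same JSON shape (.data[] entries with id and latest fields, as shown by the command above) can be queried directly:

```shell
# Pull the latest resolved version for one channel out of the
# channels JSON (shape: .data[] with id and latest fields).
resolve_channel() {  # $1 = channel id, e.g. stable or v1.34
  jq -r --arg id "$1" '.data[] | select(.id == $id) | .latest'
}
curl -s https://update.rke2.io/v1-release/channels | resolve_channel stable
```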
Version Skew Policy
Since Kubernetes 1.28, worker kubelets may run up to 3 minor versions older than the control plane (earlier versions allowed 2). During an upgrade from v1.33 to v1.34, workers still on v1.33 therefore continue to function normally while the control plane runs v1.34.
However, best practice is to upgrade workers promptly after the control plane to minimize the skew window.
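The skew between two version strings can be checked mechanically (a sketch; assumes standard vMAJOR.MINOR.PATCH version strings like those in the VERSION column of kubectl get nodes):

```shell
# Minor-version skew between two Kubernetes version strings,
# e.g. the API server's and a kubelet's (positive = first is newer).
minor() { echo "$1" | cut -d. -f2; }
skew()  { echo $(( $(minor "$1") - $(minor "$2") )); }

skew v1.34.1+rke2r1 v1.33.3+rke2r1   # -> 1, within the allowed window
```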
Upgrade Order
Always upgrade server (control plane) nodes first, then agent (worker) nodes.
server-1 (v1.33 -> v1.34)
server-2 (v1.33 -> v1.34)
server-3 (v1.33 -> v1.34)
|
v (CP fully upgraded, then workers)
agent-1 (v1.33 -> v1.34)
agent-2 (v1.33 -> v1.34)
agent-3 (v1.33 -> v1.34)
Server (Control Plane) Upgrade
Step 1: Run the RKE2 installer with the target channel
curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.34 sh -
On RPM-based systems this upgrades the packages in place (rke2-common, rke2-server); on tarball installs it replaces the binaries. The service is not restarted automatically.
Step 2: Restart the RKE2 server
systemctl restart rke2-server
Step 3: Verify the server is running the new version
kubectl get nodes -o wide
# VERSION column should show the new Kubernetes version for this server node
Step 4: Repeat for each additional server node
Wait for each server node to fully rejoin and show Ready status before proceeding to the next server.
Agent (Worker) Upgrade
After ALL server nodes are upgraded and healthy:
Step 1: Run the RKE2 installer for the agent
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh -
This upgrades the packages (rke2-common, rke2-agent) in place, again without restarting the service.
Step 2: Restart the RKE2 agent
systemctl restart rke2-agent
Step 3: Verify the agent is running the new version
kubectl get nodes -o wide
# VERSION column should show the new Kubernetes version for this agent node
Step 4: Repeat for each additional agent node
For production clusters, drain each agent before restarting and uncordon after:
# Drain the worker
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Upgrade and restart
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh -
systemctl restart rke2-agent
# Uncordon after the node is Ready
kubectl uncordon <node-name>
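The drain, upgrade, uncordon sequence can be wrapped per node and looped. A hedged sketch: node names are placeholders assumed to be ssh-reachable under the same name, passwordless sudo is assumed, and v1.34 stands in for your target channel:

```shell
# Drain, upgrade, and uncordon one worker; callers loop over nodes
# sequentially. Host names, sudo access, and the v1.34 channel are
# assumptions -- adjust for your environment.
upgrade_agent() {  # $1 = node name (assumed ssh-reachable by the same name)
  kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data || return 1
  ssh "$1" 'curl -sfL https://get.rke2.io |
              sudo INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh - &&
            sudo systemctl restart rke2-agent' || return 1
  kubectl wait --for=condition=Ready "node/$1" --timeout=10m || return 1
  kubectl uncordon "$1"
}
# Example:
# for n in agent-1 agent-2 agent-3; do upgrade_agent "$n" || break; done
```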
Post-Upgrade Verification
# All nodes on new version
kubectl get nodes -o wide
# All system pods running
kubectl get pods -n kube-system
# API server health
kubectl get --raw='/readyz?verbose'
# etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
--cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--endpoints=https://127.0.0.1:2379
# Application still responding
curl -s -o /dev/null -w "%{http_code}" http://<app-endpoint>
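A scripted check that every node landed on the target version can close out the window (a sketch; assumes VERSION is the fifth column, as in the default kubectl get nodes -o wide layout):

```shell
# Exit nonzero and name any node whose VERSION column does not start
# with the target prefix. Assumes VERSION is column 5, as in the
# default `kubectl get nodes -o wide` layout.
all_on_version() {  # $1 = target prefix, e.g. v1.34; node list on stdin
  awk -v t="$1" 'NR > 1 && index($5, t) != 1 {bad = 1; print $1, "still on", $5}
                 END {exit bad}'
}
kubectl get nodes -o wide | all_on_version v1.34 && echo "all nodes upgraded"
```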
Automated Upgrade with System Upgrade Controller
The System Upgrade Controller (SUC) automates RKE2 version upgrades using Kubernetes-native Plan CRDs. It creates Jobs that run on each node to perform the actual upgrade.
Install the System Upgrade Controller
Step 1: Apply the CRD and controller manifests
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml \
-f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
Step 2: Verify the installation
# Namespace created
kubectl get namespace system-upgrade
# Controller running
kubectl get deploy -n system-upgrade system-upgrade-controller
# CRD registered
kubectl get crd plans.upgrade.cattle.io
What Gets Created
| Resource | Purpose |
|---|---|
| system-upgrade namespace | Isolates upgrade controller resources |
| system-upgrade-controller Deployment | Watches Plan CRDs and creates upgrade Jobs |
| system-upgrade ServiceAccount | Identity for the controller |
| system-upgrade-controller ClusterRoleBinding | Grants permissions to manage nodes and jobs |
| Drainer ClusterRole | Allows the controller to cordon and drain nodes |
| plans.upgrade.cattle.io CRD | Custom resource for defining upgrade plans |
Upgrade Plans
Two Plan resources are needed: one for server nodes (control plane) and one for agent nodes (workers). The agent plan references the server plan in its prepare step, ensuring the control plane is fully upgraded before any worker upgrade begins.
Server Plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: server-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: In
values:
- "true"
serviceAccountName: system-upgrade
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
- key: node-role.kubernetes.io/etcd
operator: Exists
effect: NoExecute
upgrade:
image: rancher/rke2-upgrade
channel: https://update.rke2.io/v1-release/channels/latest
Key fields:
- concurrency: 1 -- Upgrade one server node at a time to maintain quorum
- cordon: true -- Mark node as unschedulable during upgrade
- nodeSelector -- Targets only nodes with the node-role.kubernetes.io/control-plane: "true" label
- channel -- The controller resolves the latest version from this URL
- image: rancher/rke2-upgrade -- Container image that performs the actual RKE2 binary upgrade
Agent Plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: agent-plan
namespace: system-upgrade
spec:
concurrency: 2
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/control-plane
operator: DoesNotExist
prepare:
image: rancher/rke2-upgrade
args:
- prepare
- server-plan
serviceAccountName: system-upgrade
upgrade:
image: rancher/rke2-upgrade
channel: https://update.rke2.io/v1-release/channels/latest
Key fields:
- nodeSelector with DoesNotExist -- Targets nodes WITHOUT the control-plane label (i.e., workers only)
- prepare step -- References server-plan by name; the agent plan waits until the server plan has completed on all server nodes before starting
- concurrency: 2 -- Can upgrade two workers in parallel (adjust based on cluster capacity)
Apply the Plans
kubectl apply -f server-plan.yaml
kubectl apply -f agent-plan.yaml
Monitor Upgrade Progress
# Watch plan status
kubectl get plans -n system-upgrade -w
# Watch upgrade jobs
kubectl get jobs -n system-upgrade -w
# Check node versions as they upgrade
watch -n 5 'kubectl get nodes -o wide'
How the Upgrade Works Internally
- The controller reads the channel URL and resolves the latest version
- For each node matching the plan's nodeSelector, the controller creates a Job
- The upgrade pod runs with elevated privileges:
  - Mounts the host root filesystem (/) with read-write access
  - Uses host IPC, NET, and PID namespaces
  - Has CAP_SYS_BOOT capability (to reboot the node if needed)
- The pod replaces RKE2 binaries on the host and restarts the RKE2 service
- The node comes back with the new version
- The controller marks the node as upgraded and proceeds to the next
Cleanup of System Upgrade Controller
After the upgrade is complete and verified, remove the SUC resources:
# Delete the plans first
kubectl delete plan -n system-upgrade server-plan agent-plan
# Delete the controller and RBAC
kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/system-upgrade-controller.yaml
# Delete the CRD (removes all Plan resources if any remain)
kubectl delete -f https://github.com/rancher/system-upgrade-controller/releases/latest/download/crd.yaml
Verify cleanup:
# Namespace should be gone or empty
kubectl get all -n system-upgrade
# CRD should be removed
kubectl get crd plans.upgrade.cattle.io
# Expected: Error from server (NotFound)
Quick Reference
Certificate Commands
| Action | Command |
|---|---|
| Inspect all certificates | rke2 certificate check --output table |
| Rotate all certificates | rke2 certificate rotate (with service stopped) |
| Check certificate events | kubectl get events --field-selector reason=CertificateExpirationWarning |
Manual Upgrade Commands
| Action | Command |
|---|---|
| Check available versions | curl -s https://update.rke2.io/v1-release/channels \| jq .data |
| Upgrade server binary | curl -sfL https://get.rke2.io \| INSTALL_RKE2_CHANNEL=v1.34 sh - |
| Upgrade agent binary | curl -sfL https://get.rke2.io \| INSTALL_RKE2_TYPE=agent INSTALL_RKE2_CHANNEL=v1.34 sh - |
| Restart server | systemctl restart rke2-server |
| Restart agent | systemctl restart rke2-agent |
System Upgrade Controller Commands
| Action | Command |
|---|---|
| Install SUC | kubectl apply -f .../crd.yaml -f .../system-upgrade-controller.yaml |
| Check plans | kubectl get plans -n system-upgrade |
| Watch upgrade jobs | kubectl get jobs -n system-upgrade -w |
| Remove SUC | Delete plans, then controller, then CRD (see Cleanup section) |
Common Errors (Searchable)
x509: certificate has expired or is not yet valid
Cause: RKE2 certificates have expired. The service ran for more than 245 days without a restart, missing the 120-day auto-renewal window. Fix: Stop the service, run rke2 certificate rotate, verify with rke2 certificate check --output table, then start the service.
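The fix can be scripted as one guarded sequence (a sketch of the rotation steps described earlier; each step runs only if the previous one succeeded):

```shell
# Expired-certificate recovery on a server node: stop, rotate,
# verify, start -- each step gated on the previous one.
recover_certs() {
  systemctl stop rke2-server &&
  rke2 certificate rotate &&
  rke2 certificate check --output table &&
  systemctl start rke2-server
}
```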
CertificateExpirationWarning
Cause: Kubernetes event indicating a certificate is within 120 days of expiry. Fix: Schedule a maintenance window to restart RKE2 (triggers auto-renewal) or manually rotate certificates.
Unable to connect to the server: x509: certificate signed by unknown authority
Cause: kubeconfig contains an old client certificate after rotation. Fix: Copy the updated kubeconfig: cp /etc/rancher/rke2/rke2.yaml ~/.kube/config.
error: error upgrading connection: error dialing backend: x509: certificate is valid for <old-names>, not <new-name>
Cause: SAN mismatch after node hostname or IP change. The certificate was issued for different Subject Alternative Names. Fix: Rotate certificates to regenerate with current node identity.
level=error msg="unable to start controller: tls: failed to find any PEM data in certificate input"
Cause: Certificate file is empty or corrupted, possibly from a failed rotation. Fix: Check the timestamped backup directory under /var/lib/rancher/rke2/server/tls-*/, restore the previous certificates, and retry the rotation.
Error from server (NotFound): plans.upgrade.cattle.io "server-plan" not found
Cause: The Plan CRD is not installed or the plan was not applied. Fix: Ensure the CRD is installed with kubectl get crd plans.upgrade.cattle.io, then apply the plan YAML.
Job has reached the specified backoff limit
Cause: The upgrade job on a node failed repeatedly. Fix: Check the job pod logs with kubectl logs -n system-upgrade <pod-name>. Common issues: node disk full, network issues pulling the upgrade image, or insufficient permissions.
node "<node-name>" already has a newer version
Cause: The plan targets a node that already runs a version equal to or newer than the channel's resolved version. Fix: No action needed; the controller skips nodes that are already at or above the target version.
error: unable to drain node: cannot evict pod as it would violate the pod's disruption budget
Cause: A PodDisruptionBudget prevents draining the node during upgrade. Fix: Audit PDBs with kubectl get pdb -A, adjust maxUnavailable if set to 0, or temporarily delete the blocking PDB for the upgrade window.
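The PDB audit can be narrowed to the budgets that actually block drains (a sketch; assumes the default kubectl get pdb -A column layout, where ALLOWED DISRUPTIONS is the second-to-last column):

```shell
# Print namespace/name of every PDB currently allowing zero
# disruptions -- these are the ones that block `kubectl drain`.
zero_disruption_pdbs() {  # reads `kubectl get pdb -A --no-headers` on stdin
  awk '$(NF-1) == 0 {print $1 "/" $2}'
}
kubectl get pdb -A --no-headers 2>/dev/null | zero_disruption_pdbs
```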
rke2-server.service: Failed with result 'exit-code'
Cause: RKE2 server failed to start after upgrade or certificate rotation. Fix: Check full logs with journalctl -xeu rke2-server. Common causes: port conflicts, corrupted certificates, or incompatible configuration after version upgrade.
Common Mistakes
| Mistake | Consequence |
|---|---|
| Upgrading agents before servers | Agent kubelet version newer than API server; unsupported skew, potential API incompatibilities |
| Running rke2 certificate rotate without stopping the service first | Rotation may fail or produce inconsistent state; always systemctl stop rke2-server first |
| Not copying updated kubeconfig after certificate rotation | kubectl commands fail with x509 errors because the local kubeconfig has old client certificates |
| Letting the cluster run 245+ days without a service restart | Certificates pass the 120-day auto-renewal window and expire at 365 days, causing cluster outage |
| Applying agent-plan without server-plan | Workers upgrade but control plane stays on the old version; reversed version skew breaks the cluster |
| Setting SUC agent-plan concurrency too high | Too many workers drain simultaneously; workloads have nowhere to schedule, causing application downtime |
| Not running rke2 certificate check after rotation | Rotation may have partially failed; unverified certificates lead to surprise outages |
| Skipping pre-upgrade monitoring setup | No visibility into whether the upgrade caused application downtime; problems discovered too late |
| Forgetting tolerations on server-plan | Upgrade pods cannot schedule on control plane nodes that have taints; upgrade never starts |
| Not cleaning up the System Upgrade Controller after upgrade | Leftover controller may trigger unintended upgrades when a new version appears in the channel |
| Upgrading RKE2 without draining workers in production | Pods on the node are abruptly terminated during restart; causes brief application unavailability |
| Not verifying etcd health before starting upgrade | Starting an upgrade with a degraded etcd cluster risks total data loss |