Cluster Upgrades and Day-2 Operations

First PublishedApr 24, 2026ByAtif Alam

Upgrades are where skew, deprecated APIs, and add-on behavior show up. This page is a generic runbook; managed offerings (for example EKS) still follow the same mental model: control plane first, then data plane, with validation between waves.

Control plane vs worker nodes

Layer	Typical order	Why
Control plane	Upgrade first (or let the cloud upgrade it)	API version skew rules; nodes should not run a kubelet ahead of what the API supports
Add-ons	After control plane, before or with node waves	CNI, DNS, and metrics depend on API and node capabilities
Worker nodes / node pools	Rolling waves	Keeps capacity; limits blast radius

Always read your vendor’s supported skew table (for example how many minor versions behind the control plane nodes may be).

Add-on validation matrix

After any upgrade touching the control plane or CNI, run a short smoke pass before declaring success:

Area	Quick check
Pod networking	`kubectl run` debug pod; ping/DNS across nodes and namespaces
CoreDNS	Resolve `kubernetes.default`; watch CoreDNS logs under load
kube-proxy / datapath	Service ClusterIP and NodePort reachability; if you use eBPF/Cilium, run their health CLI
Ingress / Gateway	Hit a known route; TLS handshake succeeds
Metrics	`kubectl top nodes`; Prometheus targets UP if installed
Autoscaling	HPA sees metrics; cluster autoscaler or Karpenter logs clean

Draining nodes with PodDisruptionBudgets

Cordon the node: kubectl cordon <node> — no new pods scheduled there.
Drain in batches: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (add --grace-period as needed).
Respect PDBs — if eviction is blocked, either add capacity, temporarily relax PDBs with change control, or fix workloads that cannot move.
Never use --disable-eviction casually; it bypasses PDBs and can take out more availability than intended.

See also Production patterns for PDB basics.

CRD upgrades when operators lag the cluster

Maintain a compatibility matrix: cluster version vs operator chart vs CRD apiVersion bundles.
Order matters: often install/upgrade the operator (or Helm chart that owns CRDs) in a window tested by the vendor; avoid removing deprecated APIs until every controller that watches them is upgraded.
Scan for deprecated APIs before upgrade: community tools such as Pluto compare manifests in Git and live objects to removed/changed APIs. Older runbooks sometimes mention kubectl convert; it is deprecated/removed on many modern kubectl builds — prefer Pluto, kubectl explain, and vendor migration guides for each version jump.
If Helm manages CRDs, understand crds hook behavior — blind helm upgrade can strand you with CRDs newer than controllers (or vice versa).

Platform-wide certificate inventory

Treat certificates as a fleet, not only Ingress TLS:

Certificate class	Where it lives	Renewal signal
API server / etcd	Control plane (often vendor-managed)	Expiry alerts from cloud or PKI
Kubelet client/server	Per node	Node NotReady, kubelet logs, rotation docs
Ingress / Gateway	cert-manager `Certificate`, cloud ACM	cert-manager metrics, ACM expiry
Service mesh	Istio/Linkerd CA and workload certs	mesh dashboards, `istioctl` warnings

Automate renewal where possible; alert at 30/14/7 days before expiry; run a rotation game day twice a year.

Multi-region and fleet upgrades

Running many clusters (regions or environments) adds coordination, not just repetition:

Skew policy — align with your vendor’s supported kubelet ↔ API skew across regions; avoid a state where one region’s workers are ahead of another region’s control plane during a staggered rollout.
Wave order — common pattern: upgrade non-production clusters first per region, then canary production, then remainder; never all regions at once without a rollback story.
PodDisruptionBudgets per region — PDBs are per cluster; validate that each region still meets availability SLO when nodes drain during the upgrade window.
Global traffic — shift DNS / GSLB / mesh weight toward regions that finished validation; keep automatic rollback (traffic revert) until error budgets recover.
Communication — one change calendar entry per fleet wave; explicit owner for add-ons (CNI, ingress, metrics) in every region.
Managed control planes (EKS example) — AWS upgrades the control plane on a cadence you influence; you still own node groups, add-ons, and CRD/operator compatibility. Treat each cluster as the unit of validation, then correlate dashboards across regions.

Pair with Multi-cluster management for GitOps across clusters and Troubleshooting and debugging for incident-first steps.

Rollback checkpoints and soak

Take a checkpoint: Git tag or Argo CD app revision, Helm release history number, and a short etcd backup (if you manage etcd).
Soak the first upgraded pool or AZ at low traffic; watch error rate, scheduling, and DNS for one or two release windows.
If regressions appear, roll back nodes before chasing app bugs — restores a known-good platform layer.

EKS overview — AWS-managed control plane and a short EKS-flavored upgrade callout.
Troubleshooting and debugging — Incident triage order.
Architecture review answers — Prompts this page deepens.