
Cluster Upgrades and Day-2 Operations

By Atif Alam

Upgrades are where version skew, deprecated APIs, and add-on incompatibilities surface. This page is a generic runbook; managed offerings (for example EKS) follow the same mental model: control plane first, then data plane, with validation between waves.

| Layer | Typical order | Why |
| --- | --- | --- |
| Control plane | Upgrade first (or let the cloud upgrade it) | API version skew rules; nodes should not run a kubelet ahead of what the API supports |
| Add-ons | After control plane, before or with node waves | CNI, DNS, and metrics depend on API and node capabilities |
| Worker nodes / node pools | Rolling waves | Keeps capacity; limits blast radius |

Always read your vendor’s supported skew table (for example, how many minor versions worker nodes may lag behind the control plane).
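The skew rule can be checked mechanically before a wave starts. The following is a minimal sketch, not a vendor tool: the function names are hypothetical, it assumes both versions share the same major version, and `max_minors_behind` must come from your vendor's skew table rather than from this example.

```python
import re


def minor_version(version: str) -> int:
    """Extract the minor number from strings such as 'v1.29.3' or '1.29'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(2))


def kubelet_skew_ok(api_server: str, kubelet: str, max_minors_behind: int) -> bool:
    """Return True if the kubelet is not newer than the API server and
    lags it by at most max_minors_behind minor versions.

    Assumption: both versions have the same major version.
    """
    lag = minor_version(api_server) - minor_version(kubelet)
    return 0 <= lag <= max_minors_behind
```

For example, `kubelet_skew_ok("v1.29.3", "v1.28.7", 2)` passes, while a kubelet at `v1.29.1` against an API server at `v1.28.0` fails because the node would be ahead of the control plane.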

After any upgrade touching the control plane or CNI, run a short smoke pass before declaring success:

| Area | Quick check |
| --- | --- |
| Pod networking | kubectl run debug pod; ping/DNS across nodes and namespaces |
| CoreDNS | Resolve kubernetes.default; watch CoreDNS logs under load |
| kube-proxy / datapath | Service ClusterIP and NodePort reachability; if you use eBPF/Cilium, run their health CLI |
| Ingress / Gateway | Hit a known route; TLS handshake succeeds |
| Metrics | kubectl top nodes; Prometheus targets UP if installed |
| Autoscaling | HPA sees metrics; cluster autoscaler or Karpenter logs clean |
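A few of these checks can be scripted so the smoke pass produces a pass/fail summary. The sketch below is illustrative only: the pod name `smoke-dns`, the `busybox:1.36` image, and the assumption that CoreDNS runs as `deployment/coredns` in `kube-system` may not match your cluster, and the check list covers only a subset of the table.

```python
import subprocess
from typing import Callable, Dict, List, Sequence, Tuple

Check = Tuple[str, List[str]]

# Assumed names: adjust the pod name, image, and CoreDNS deployment to your cluster.
SMOKE_CHECKS: List[Check] = [
    ("dns", ["kubectl", "run", "smoke-dns", "--rm", "-i", "--restart=Never",
             "--image=busybox:1.36", "--", "nslookup", "kubernetes.default"]),
    ("coredns-rollout", ["kubectl", "-n", "kube-system", "rollout", "status",
                         "deployment/coredns", "--timeout=60s"]),
    ("metrics", ["kubectl", "top", "nodes"]),
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code (non-zero on timeout or missing binary)."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=120).returncode
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return 1


def run_smoke_checks(
    checks: Sequence[Check],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Dict[str, bool]:
    """Map each check name to True when its command exits with status 0."""
    return {name: runner(argv) == 0 for name, argv in checks}
```

Calling `run_smoke_checks(SMOKE_CHECKS)` against a live cluster returns a dictionary such as one entry per check; treat any `False` value as a reason to pause the wave. The `runner` parameter exists so the logic can be exercised without a cluster.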

When draining nodes during a rolling wave:

  1. Cordon the node: kubectl cordon <node>, so no new pods are scheduled there.
  2. Drain in batches: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (add --grace-period as needed).
  3. Respect PDBs: if eviction is blocked, add capacity, temporarily relax PDBs under change control, or fix workloads that cannot move.
  4. Never use --disable-eviction casually; it bypasses PDBs and can take out more availability than intended.
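The cordon and drain steps above can be sketched as a small planner that only builds the kubectl command lines, leaving execution and PDB handling to you. The function names, the default `5m` timeout, and the batch size are assumptions for illustration; `--disable-eviction` is deliberately never emitted.

```python
from typing import List, Optional, Sequence


def drain_plan(node: str, grace_period_s: Optional[int] = None,
               timeout: str = "5m") -> List[List[str]]:
    """Return the cordon and drain commands for one node, in order."""
    cordon = ["kubectl", "cordon", node]
    drain = ["kubectl", "drain", node, "--ignore-daemonsets",
             "--delete-emptydir-data", f"--timeout={timeout}"]
    if grace_period_s is not None:
        drain.append(f"--grace-period={grace_period_s}")
    return [cordon, drain]


def batches(nodes: Sequence[str], size: int) -> List[List[str]]:
    """Split nodes into batches so only `size` nodes drain at a time."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    return [list(nodes[i:i + size]) for i in range(0, len(nodes), size)]
```

A caller would iterate over `batches(node_names, 2)`, run the commands from `drain_plan` for each node, and stop at the first drain that times out, since a timeout usually means a PDB is blocking eviction.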

See also Production patterns for PDB basics.

CRD upgrades when operators lag the cluster

  • Maintain a compatibility matrix: cluster version vs operator chart vs CRD apiVersion bundles.
  • Order matters: upgrade the operator (or the Helm chart that owns the CRDs) within the version range the vendor has tested, and do not remove deprecated APIs until every controller that watches them is upgraded.
  • Scan for deprecated APIs before upgrading: community tools such as Pluto check manifests in Git and live objects against removed or changed APIs. Older runbooks sometimes mention kubectl convert; it is no longer part of the core kubectl binary and is distributed as the separate kubectl-convert plugin, so prefer Pluto, kubectl explain, and vendor migration guides for each version jump.
  • If Helm manages CRDs, understand how it treats them: Helm 3 installs CRDs from a chart's crds/ directory on first install but does not upgrade or delete them, so a blind helm upgrade can leave CRDs out of step with the controllers that depend on them.
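To make the deprecated-API scan concrete, here is a toy version of what such a scanner does. It is not Pluto and is no substitute for it: the table lists only a handful of well-known removals, the function name is hypothetical, and manifests are assumed to be already parsed into dictionaries.

```python
from typing import Dict, Iterable, List, Tuple

# (apiVersion, kind) -> first Kubernetes 1.x minor that no longer serves the API.
# Deliberately incomplete; consult the upstream deprecation guide for the full list.
REMOVED_IN_MINOR: Dict[Tuple[str, str], int] = {
    ("extensions/v1beta1", "Ingress"): 22,
    ("networking.k8s.io/v1beta1", "Ingress"): 22,
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): 26,
}


def removed_apis(manifests: Iterable[dict], target_minor: int) -> List[str]:
    """List manifests whose apiVersion is no longer served at 1.<target_minor>."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion", ""), manifest.get("kind", ""))
        removed = REMOVED_IN_MINOR.get(key)
        if removed is not None and target_minor >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name}: {key[0]} removed in 1.{removed}")
    return findings
```

Running it over the manifests in Git before each version jump gives a list to fix first; an empty list only means none of the table's entries matched, not that the cluster is clean.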

Treat certificates as a fleet, not only Ingress TLS:

| Certificate class | Where it lives | Renewal signal |
| --- | --- | --- |
| API server / etcd | Control plane (often vendor-managed) | Expiry alerts from cloud or PKI |
| Kubelet client/server | Per node | Node NotReady, kubelet logs, rotation docs |
| Ingress / Gateway | cert-manager Certificate, cloud ACM | cert-manager metrics, ACM expiry |
| Service mesh | Istio/Linkerd CA and workload certs | mesh dashboards, istioctl warnings |

Automate renewal where possible; alert at 30/14/7 days before expiry; run a rotation game day twice a year.
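The 30/14/7-day alerting rule can be expressed as a small classifier over a certificate's expiry time. This is a sketch under assumptions: the function name and message strings are hypothetical, and obtaining `not_after` (from cert-manager, ACM, or the certificate itself) is left to your tooling.

```python
from datetime import datetime, timedelta
from typing import Optional

# Alert thresholds in days before expiry, taken from the rule above.
ALERT_DAYS = (30, 14, 7)


def expiry_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the tightest alert bucket the certificate falls into, or None."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    # Check the smallest threshold first so the most urgent bucket wins.
    for days in sorted(ALERT_DAYS):
        if remaining <= timedelta(days=days):
            return f"expires within {days} days"
    return None
```

Pass timezone-aware datetimes for both arguments (or naive ones for both); mixing the two raises a `TypeError` in Python.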

Running many clusters (regions or environments) adds coordination, not just repetition:

  • Skew policy — align with your vendor’s supported kubelet ↔ API skew across regions; avoid a state where one region’s workers are ahead of another region’s control plane during a staggered rollout.
  • Wave order — common pattern: upgrade non-production clusters first per region, then canary production, then remainder; never all regions at once without a rollback story.
  • PodDisruptionBudgets per region — PDBs are per cluster; validate that each region still meets availability SLO when nodes drain during the upgrade window.
  • Global traffic — shift DNS / GSLB / mesh weight toward regions that finished validation; keep automatic rollback (traffic revert) until error budgets recover.
  • Communication — one change calendar entry per fleet wave; explicit owner for add-ons (CNI, ingress, metrics) in every region.
  • Managed control planes (EKS example) — AWS upgrades the control plane on a cadence you influence; you still own node groups, add-ons, and CRD/operator compatibility. Treat each cluster as the unit of validation, then correlate dashboards across regions.
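The wave order described in the list above (non-production first, then canary production, then the remainder) can be sketched as a simple planner. The `Cluster` fields and function names are hypothetical; a real fleet inventory would likely carry more attributes, such as version and owner.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Cluster:
    name: str
    region: str
    production: bool
    canary: bool = False  # marks a production cluster chosen as the canary


def wave_of(cluster: Cluster) -> int:
    """0 = non-production, 1 = canary production, 2 = remaining production."""
    if not cluster.production:
        return 0
    return 1 if cluster.canary else 2


def plan_waves(clusters: Iterable[Cluster]) -> List[List[str]]:
    """Group cluster names into ordered waves, sorted by region then name."""
    waves: List[List[str]] = [[], [], []]
    for cluster in sorted(clusters, key=lambda c: (c.region, c.name)):
        waves[wave_of(cluster)].append(cluster.name)
    return [wave for wave in waves if wave]
```

Each returned wave is a natural unit for one change calendar entry; validation and traffic shifting happen between waves, not inside this function.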

Pair with Multi-cluster management for GitOps across clusters and Troubleshooting and debugging for incident-first steps.

  • Take a checkpoint: Git tag or Argo CD app revision, Helm release revision number, and an etcd snapshot (if you manage etcd).
  • Soak the first upgraded pool or AZ at low traffic; watch error rate, scheduling, and DNS for one or two release windows.
  • If regressions appear, roll back nodes before chasing app bugs; this restores a known-good platform layer.
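The soak decision can be written down as an explicit gate so the rollback call is not made ad hoc. This is a sketch with invented names: `SoakSample`, its fields, and `max_error_ratio` are hypothetical, and the actual thresholds should come from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoakSample:
    error_rate: float   # fraction of failed requests, e.g. 0.01 for 1%
    pending_pods: int   # pods stuck unscheduled
    dns_failures: int   # failed lookups seen during the window


def should_roll_back(baseline: SoakSample, observed: SoakSample,
                     max_error_ratio: float) -> bool:
    """Return True if error rate, scheduling, or DNS regressed against the baseline."""
    error_regressed = observed.error_rate > baseline.error_rate * max_error_ratio
    scheduling_regressed = observed.pending_pods > baseline.pending_pods
    dns_regressed = observed.dns_failures > baseline.dns_failures
    return error_regressed or scheduling_regressed or dns_regressed
```

The baseline is captured at checkpoint time, before the first pool is upgraded, and the observed sample is taken at the end of each soak window.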