Cluster Upgrades and Day-2 Operations
Upgrades are where skew, deprecated APIs, and add-on behavior show up. This page is a generic runbook; managed offerings (for example EKS) still follow the same mental model: control plane first, then data plane, with validation between waves.
Control plane vs worker nodes
| Layer | Typical order | Why |
|---|---|---|
| Control plane | Upgrade first (or let the cloud upgrade it) | API version skew rules; nodes should not run a kubelet ahead of what the API supports |
| Add-ons | After control plane, before or with node waves | CNI, DNS, and metrics depend on API and node capabilities |
| Worker nodes / node pools | Rolling waves | Keeps capacity; limits blast radius |
Always read your vendor's supported skew table (for example, how many minor versions the nodes may lag behind the control plane).
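The skew rule can be checked mechanically before a node wave. The following is a minimal Python sketch, assuming version strings shaped like `v1.29.3`. The function names and the `max_minor_lag` parameter are hypothetical, and the allowed lag has to come from your vendor's skew table, not from this example.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as 'v1.29.3' or '1.29'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def skew_ok(control_plane: str, kubelet: str, max_minor_lag: int) -> bool:
    """True when the kubelet is not newer than the control plane and lags
    by at most max_minor_lag minor versions (value taken from the vendor)."""
    cp_major, cp_minor = parse_version(control_plane)
    kl_major, kl_minor = parse_version(kubelet)
    if cp_major != kl_major:
        return False
    lag = cp_minor - kl_minor
    return 0 <= lag <= max_minor_lag
```

A check like this fits as a gate before each node wave: feed it the control plane version and every node's kubelet version, and stop the wave if any pair fails.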
Add-on validation matrix
After any upgrade touching the control plane or CNI, run a short smoke pass before declaring success:
| Area | Quick check |
|---|---|
| Pod networking | Start a debug pod with kubectl run; test ping and DNS across nodes and namespaces |
| CoreDNS | Resolve kubernetes.default; watch CoreDNS logs under load |
| kube-proxy / datapath | Service ClusterIP and NodePort reachability; if you use eBPF/Cilium, run their health CLI |
| Ingress / Gateway | Hit a known route; TLS handshake succeeds |
| Metrics | kubectl top nodes; Prometheus targets UP if installed |
| Autoscaling | HPA sees metrics; cluster autoscaler or Karpenter logs clean |
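One way to keep the smoke pass honest is to run every check even after one fails, then report all failing areas together. The sketch below assumes each row of the matrix is wrapped in a callable that returns `True` on success. The `run_smoke_pass` name and the example checks are hypothetical placeholders, not real probes.

```python
from typing import Callable, Dict, List, Tuple


def run_smoke_pass(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check and return (all_passed, failed_areas)."""
    failed: List[str] = []
    for area, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(area)
    return (not failed, failed)


# Hypothetical stand-ins; real checks would wrap the commands in the table.
example_checks = {
    "Pod networking": lambda: True,
    "CoreDNS": lambda: True,
    "Metrics": lambda: False,
}
```

Collecting every failure in one pass avoids the pattern where fixing the first broken area only reveals the next one an hour later.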
Draining nodes with PodDisruptionBudgets
- Cordon the node: `kubectl cordon <node>`, so no new pods are scheduled there.
- Drain in batches: `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data` (add `--grace-period` as needed).
- Respect PDBs: if eviction is blocked, either add capacity, temporarily relax PDBs with change control, or fix workloads that cannot move.
- Never use `--disable-eviction` casually; it bypasses PDBs and can take out more availability than intended.
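Whether a drain will be blocked comes down to simple arithmetic on the PDB. The sketch below reflects how the allowed disruption count is generally derived (healthy pods minus the number that must stay healthy). It assumes integer `minAvailable` or `maxUnavailable` values only, ignores percentages, and uses a hypothetical function name.

```python
from typing import Optional


def disruptions_allowed(
    current_healthy: int,
    expected_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """How many pods could be evicted right now without violating the PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    # Never negative: an already-degraded workload simply allows zero evictions.
    return max(0, current_healthy - desired_healthy)
```

A result of zero for any PDB that covers pods on the node means the drain will wait until capacity is added or the workload recovers.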
See also Production patterns for PDB basics.
CRD upgrades when operators lag the cluster
- Maintain a compatibility matrix: cluster version vs operator chart vs CRD `apiVersion` bundles.
- Order matters: often install/upgrade the operator (or Helm chart that owns CRDs) in a window tested by the vendor; avoid removing deprecated APIs until every controller that watches them is upgraded.
- Scan for deprecated APIs before upgrade: community tools such as Pluto compare manifests in Git and live objects to removed/changed APIs. Older runbooks sometimes mention `kubectl convert`; it is deprecated or removed on many modern kubectl builds, so prefer Pluto, `kubectl explain`, and vendor migration guides for each version jump.
- If Helm manages CRDs, understand how the chart handles `crds` (Helm 3 installs CRDs from a chart's `crds/` directory on first install but does not upgrade them afterward); a blind `helm upgrade` can strand you with CRDs newer than controllers (or vice versa).
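The idea behind a deprecated-API scan can be illustrated in a few lines. This is not how Pluto is implemented; it is a minimal sketch with a deliberately incomplete lookup table and a hypothetical `find_removed` helper that takes already-parsed manifests as dictionaries.

```python
from typing import Dict, List, Tuple

# Deliberately incomplete: (apiVersion, kind) -> Kubernetes 1.x minor version
# in which that API was removed. Use a maintained tool for real scans.
REMOVED_IN: Dict[Tuple[str, str], int] = {
    ("extensions/v1beta1", "Ingress"): 22,
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): 22,
    ("batch/v1beta1", "CronJob"): 25,
    ("policy/v1beta1", "PodDisruptionBudget"): 25,
}


def find_removed(manifests: List[dict], target_minor: int) -> List[str]:
    """List manifests whose apiVersion no longer exists at the target minor."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and target_minor >= removed:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name}: {key[0]} removed in 1.{removed}")
    return findings
```

Running the same scan against both Git manifests and live objects matters, because objects created by operators may never appear in Git.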
Platform-wide certificate inventory
Treat certificates as a fleet, not only Ingress TLS:
| Certificate class | Where it lives | Renewal signal |
|---|---|---|
| API server / etcd | Control plane (often vendor-managed) | Expiry alerts from cloud or PKI |
| Kubelet client/server | Per node | Node NotReady, kubelet logs, rotation docs |
| Ingress / Gateway | cert-manager Certificate, cloud ACM | cert-manager metrics, ACM expiry |
| Service mesh | Istio/Linkerd CA and workload certs | mesh dashboards, istioctl warnings |
Automate renewal where possible; alert at 30/14/7 days before expiry; run a rotation game day twice a year.
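The 30/14/7-day alert ladder can be expressed as a small function. This sketch assumes the certificate's expiry (`not_after`) has already been read from wherever your inventory lives; the function and constant names are hypothetical.

```python
from datetime import datetime
from typing import Optional

ALERT_DAYS = (30, 14, 7)  # thresholds from the runbook, in days before expiry


def alert_tier(not_after: datetime, now: datetime) -> Optional[int]:
    """Return the tightest threshold (in days) the certificate has crossed,
    or None when expiry is further away than every threshold."""
    days_left = (not_after - now).days
    crossed = [days for days in ALERT_DAYS if days_left <= days]
    return min(crossed) if crossed else None
```

Already-expired certificates also land in the tightest tier here; a real alerting rule would probably treat expiry as its own, more urgent state.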
Multi-region and fleet upgrades
Running many clusters (regions or environments) adds coordination, not just repetition:
- Skew policy — align with your vendor’s supported kubelet ↔ API skew across regions; avoid a state where one region’s workers are ahead of another region’s control plane during a staggered rollout.
- Wave order — common pattern: upgrade non-production clusters first per region, then canary production, then remainder; never all regions at once without a rollback story.
- PodDisruptionBudgets per region — PDBs are per cluster; validate that each region still meets availability SLO when nodes drain during the upgrade window.
- Global traffic — shift DNS / GSLB / mesh weight toward regions that finished validation; keep automatic rollback (traffic revert) until error budgets recover.
- Communication — one change calendar entry per fleet wave; explicit owner for add-ons (CNI, ingress, metrics) in every region.
- Managed control planes (EKS example) — AWS upgrades the control plane on a cadence you influence; you still own node groups, add-ons, and CRD/operator compatibility. Treat each cluster as the unit of validation, then correlate dashboards across regions.
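The wave order described above can be sketched as a planning function. It assumes each cluster is tagged as production or not, with at most a few marked as canaries. The `Cluster` type and `plan_waves` function are hypothetical, and putting each remaining production region in its own wave is one possible reading of "never all regions at once".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Cluster:
    name: str
    region: str
    production: bool
    canary: bool = False


def plan_waves(clusters: List[Cluster]) -> List[List[str]]:
    """Non-production first, then canary production, then the remaining
    production clusters one region at a time (regions sorted by name)."""
    non_prod = [c.name for c in clusters if not c.production]
    canary = [c.name for c in clusters if c.production and c.canary]
    remainder: Dict[str, List[str]] = {}
    for c in clusters:
        if c.production and not c.canary:
            remainder.setdefault(c.region, []).append(c.name)
    waves = [non_prod, canary] + [remainder[region] for region in sorted(remainder)]
    return [wave for wave in waves if wave]  # drop empty waves
```

Each wave would still need its own validation pass and a rollback decision before the next one starts.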
Pair with Multi-cluster management for GitOps across clusters and Troubleshooting and debugging for incident-first steps.
Rollback checkpoints and soak
- Take a checkpoint: Git tag or Argo CD app revision, Helm release revision number, and an etcd snapshot (if you manage etcd).
- Soak the first upgraded pool or AZ at low traffic; watch error rate, scheduling, and DNS for one or two release windows.
- If regressions appear, roll back nodes before chasing app bugs; this restores a known-good platform layer.
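The soak decision can be made explicit by comparing observed signals against a pre-upgrade baseline. The sketch below is only an illustration: the `Signals` fields, the `should_roll_back` name, and the allowed increases used in the example are assumptions, not recommended thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    pending_pods: int        # pods stuck unscheduled
    dns_failure_rate: float  # fraction of failed DNS lookups


def should_roll_back(baseline: Signals, observed: Signals, allowed: Signals) -> bool:
    """True when any signal rose above its baseline by more than the
    allowed increase for that signal."""
    return (
        observed.error_rate - baseline.error_rate > allowed.error_rate
        or observed.pending_pods - baseline.pending_pods > allowed.pending_pods
        or observed.dns_failure_rate - baseline.dns_failure_rate > allowed.dns_failure_rate
    )


# Hypothetical allowances for one soak window.
example_allowed = Signals(error_rate=0.01, pending_pods=5, dns_failure_rate=0.01)
```

Agreeing on the allowances before the upgrade starts keeps the rollback call from turning into a debate during the soak.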
Related
- EKS overview: AWS-managed control plane and a short EKS-flavored upgrade callout.
- Troubleshooting and debugging: Incident triage order.
- Architecture review answers: Prompts this page deepens.