Kubernetes Architecture Review Questions

First published by Atif Alam

Use this page when you run architecture reviews or readiness checks with someone who owns platform work and production blast radius.

These are prompts, not answers. Good follow-ups: “What broke last time?”, “Show the YAML”, “What metric proves it?”, “Who owns rollback?”

  • Demand specifics — Names of controllers, objects, ports, failure symptoms, and commands they actually run.
  • Force trade-offs — Cost, complexity, operability, security, and time-to-recover under pressure.
  • Start from incidents — Ask them to replay a real outage or near-miss; watch for honest unknowns.
  • Stay out of trivia — Memorized kubectl flags matter less than reasoning about the control plane and data paths.
  • Close with ownership — Who changes defaults, who gets paged, and how drift is detected.

Reference answers (same domains, concise explanations): Kubernetes architecture review answers.

  • Walk through what happens between kubectl apply and a Pod running on a node.
  • Which control-plane components must be healthy for scheduling to succeed, and how would you prove each is healthy?
  • How does the API server enforce admission, and what happens when an admission webhook is slow or failing?
  • Explain etcd’s role and what symptoms you expect if etcd is impaired.
  • How would you safely rotate control-plane certificates in your environment?

Reference answers: Cluster architecture and control plane

  • How does the scheduler choose a node, and what common constraints cause Pending?
  • How do you debug a pod that never schedules when resources “look fine”?
  • How do requests and limits differ, and what failure modes come from setting limits without requests?
  • When would you use taints and tolerations versus affinity rules?
  • How do you detect and mitigate noisy neighbor CPU/memory contention on a node?
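
For the requests-versus-limits prompt above, a strong answer usually mentions that a limit with no matching request is copied into the request, which also affects the pod's QoS class. The sketch below models that rule in plain Python. It is a simplified illustration, not the kubelet's implementation: the dictionary shape and the function names are assumptions made here, quantities are compared as literal strings (so "500m" and "0.5" would not match), and init containers are ignored.

```python
from typing import Dict, List

Resources = Dict[str, str]  # e.g. {"cpu": "500m", "memory": "256Mi"}


def effective_requests(container: dict) -> Resources:
    """Copy each limit into the request when no request was given for it."""
    requests = dict(container.get("requests", {}))
    for name, value in container.get("limits", {}).items():
        requests.setdefault(name, value)
    return requests


def qos_class(containers: List[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort (simplified)."""
    # BestEffort: no container sets any request or limit.
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    # Guaranteed: every container has cpu and memory limits equal to its requests.
    for c in containers:
        limits = c.get("limits", {})
        requests = effective_requests(c)
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                return "Burstable"
    return "Guaranteed"


limits_only = [{"limits": {"cpu": "500m", "memory": "256Mi"}}]
print(effective_requests(limits_only[0]))  # {'cpu': '500m', 'memory': '256Mi'}
print(qos_class(limits_only))              # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}}]))  # Burstable
print(qos_class([{}]))                     # BestEffort
```

A useful follow-up: if limits-only pods reserve their full limit, how much schedulable capacity does that remove from each node?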

Reference answers: Scheduling and capacity

  • Explain how a ClusterIP Service routes to endpoints; what updates when pods churn?
  • How does DNS for Services work inside the cluster, and what breaks it first under load?
  • Compare Ingress, LoadBalancer Services, and NodePort for north-south traffic; when is each wrong?
  • How would you debug intermittent 503 errors between two in-cluster services?
  • What is the difference between NetworkPolicy as “default deny” versus selective allow lists?

Reference answers: Networking

  • How does RBAC interact with authentication in your clusters (cert, OIDC, cloud IAM exec)?
  • What is the blast radius of a compromised namespace-scoped RoleBinding versus a ClusterRoleBinding?
  • How do you manage secrets at rest and in Git for cluster config?
  • What guardrails prevent a workload from talking to the metadata service or the API server when it should not?
  • How would you design pod security standards (or equivalent) for a multi-team cluster?

Reference answers: Security and identity

  • Compare ephemeral volumes, PVCs, and StatefulSet identity; when must you use each?
  • What causes a PVC to stay Pending, and how do you read describe output for storage issues?
  • How do you choose a StorageClass strategy for stateful apps across zones or single-node labs?
  • What happens to data when a StatefulSet pod is deleted versus rescheduled?
  • How would you back up and restore a stateful workload on Kubernetes?

Reference answers: Storage

  • Compare Deployment rolling updates versus StatefulSet ordered rollout; failure modes of each.
  • How do liveness, readiness, and startup probes differ, and how can a bad probe cause an incident?
  • What signals tell you a rollout is unhealthy before users complain?
  • How do you safely roll back a bad deployment in production?
  • When would you choose Jobs or CronJobs instead of long-running Deployments?

Reference answers: Workloads and rollouts

  • What is your first dashboard or query when CPU spikes cluster-wide?
  • How do you distinguish “the network is slow” from “the app is slow” with metrics you already have?
  • How would you wire application metrics into Prometheus in a way that survives Helm upgrades?
  • What do you alert on versus what you only dashboard?
  • How do you keep cardinality under control in Prometheus labels?
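
For the cardinality prompt above, it helps to ask the candidate for the arithmetic. A rough upper bound on the number of time series for one metric is the product of the distinct values each label can take. The sketch below shows that calculation; the label names and counts are hypothetical, and real series counts are often lower because not every combination occurs.

```python
from math import prod
from typing import Dict


def max_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single HTTP request counter.
bounded = {"method": 5, "status_code": 8, "route": 40}
with_user_id = {**bounded, "user_id": 10_000}

print(max_series(bounded))       # 1600
print(max_series(with_user_id))  # 16000000
```

One unbounded label, such as a user or request ID, multiplies the total, which is why reviewers often ask which labels have a fixed set of values.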

Reference answers: Observability

  • How do you prevent a GitOps controller from overwriting manual hotfixes?
  • What is your promotion model between environments, and how are secrets handled?
  • How do you detect and recover from sync loops or partial applies?
  • What policy checks run before manifests reach the cluster?
  • How do you coordinate Helm releases with GitOps without fighting state?

Reference answers: GitOps and delivery

  • What is your Kubernetes upgrade strategy for control plane versus nodes?
  • How do you validate add-ons (CNI, ingress, metrics) after an upgrade?
  • What is your process for draining nodes and handling PDBs during maintenance?
  • How do you manage CRD upgrades when operators lag the cluster version?
  • What is your story for certificate expiry across the platform?
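
For the drain-and-PDB prompt above, a concrete answer should explain why a drain can hang. The sketch below models how many voluntary evictions a PodDisruptionBudget permits, assuming integer values only (percentage values and their rounding are left out). The function and parameter names are illustrative, not part of any Kubernetes client library.

```python
from typing import Optional


def disruptions_allowed(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Evictions a PDB would permit right now (integer values only)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(0, expected_pods - max_unavailable)
    # Never negative: an unhealthy workload simply allows zero evictions.
    return max(0, healthy_pods - desired_healthy)


# minAvailable equal to the replica count blocks every eviction.
print(disruptions_allowed(3, 3, min_available=3))    # 0
print(disruptions_allowed(3, 3, min_available=2))    # 1
# One pod is already unhealthy, so maxUnavailable=1 is used up.
print(disruptions_allowed(3, 2, max_unavailable=1))  # 0
```

The zero cases are the ones worth probing: ask who notices a stuck drain and what the agreed override is.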

Reference answers: Upgrades and day-2 operations

  • How do you isolate noisy teams: namespaces, quotas, network policy, admission?
  • What admission policies do you enforce cluster-wide, and how do you roll them out safely?
  • How do you handle shared ingress controllers fairly between tenants?
  • What is your default for ResourceQuota and how do you tune it without blocking legitimate work?
  • How do you audit who changed security-sensitive objects?

Reference answers: Multi-tenancy and policy

Use these when reviewing a design doc or a new platform capability.

  • What is the smallest failure domain you can contain this change to?
  • What happens if this component is down for 5 minutes? For an hour?
  • What manual steps remain if automation fails mid-flight?
  • What are the safe defaults for teams who do not read the README?
  • What telemetry proves this feature is healthy in production?
  • What runbook steps exist for the on-call engineer at 3 a.m.?
  • What is the rollback plan, and how long does it take to execute?
  • How do you migrate existing workloads without coordinated downtime?
  • What version skew issues appear between clients, API server, and node components?
  • What does this cost in engineer time and cloud spend at steady state?
  • What simpler design did you reject, and why?
  • Where is TLS terminated, and who owns certificate lifecycle?
  • How does traffic fail over if the ingress tier degrades?
  • What data crosses trust boundaries, and how is it protected?
  • What evidence would satisfy an auditor that controls are effective?
  • Where is authoritative state stored, and how is it backed up?
  • What consistency guarantees does the application need versus what Kubernetes provides?
  • What SLIs do you need before launch, and what error budget policy applies?
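
For the SLI and error budget prompt above, it is reasonable to ask for the numbers behind the policy. The sketch below converts an availability SLO into minutes of allowed unavailability and the fraction of budget left. The 30-day window and the function names are assumptions for illustration; a request-based SLO would count failed requests instead of minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75
```

A follow-up that tends to separate real policies from aspirational ones: what changes in the release process when the remaining fraction goes negative?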

Production Operations and Incident Readiness Questions

Use these when validating operational maturity.

  • What is your first 10-minute checklist when user-facing latency spikes?
  • How do you decide between rollback, traffic shift, and hotfix under pressure?
  • How do you communicate status without slowing down mitigation?
  • What prevents a bad manifest from reaching production?
  • What prevents a single team from starving the cluster?
  • Where do runbooks live, and how often are they tested?
  • What incident types repeat, and what permanent fixes are still open?
  • How do you forecast node capacity, and what triggers a scale-up?
  • How do you handle regional or zonal impairment if applicable?
  • What is your blameless postmortem template, and who must attend?
  • How do you track action items to completion?

  1. Describe the path from kubectl apply to a running Pod and where it can fail.
  2. How do you debug a Pending pod that never schedules?
  3. How does a ClusterIP Service route traffic when endpoints churn?
  4. How do requests and limits change scheduling and runtime behavior?
  5. What is your rollout and rollback strategy for a bad deployment?
  6. How do you prove control-plane health during an incident?
  7. What is your secret handling model for cluster and GitOps config?
  8. How do you prevent or detect noisy neighbor issues on a node?
  9. What is your Kubernetes upgrade strategy for control plane versus workers?
  10. What is the first query or dashboard you open for cluster-wide latency?