Kubernetes Architecture Review Questions

First published Apr 24, 2026 by Atif Alam

Use this page when you need to run architecture reviews or readiness checks with someone who owns platform work and production blast radius.

These are prompts, not answers. Good follow-ups: “What broke last time?”, “Show the YAML”, “What metric proves it?”, “Who owns rollback?”

How to Run a Good Session

Demand specifics — Names of controllers, objects, ports, failure symptoms, and commands they actually run.
Force trade-offs — Cost, complexity, operability, security, and time-to-recover under pressure.
Start from incidents — Ask them to replay a real outage or near-miss; watch for honest unknowns.
Stay out of trivia — Memorized kubectl flags matter less than reasoning about the control plane and data paths.
Close with ownership — Who changes defaults, who gets paged, and how drift is detected.

Discussion Prompts by Domain

Reference answers (same domains, concise explanations): Kubernetes architecture review answers.

Cluster Architecture and Control Plane

Walk through what happens between kubectl apply and a Pod running on a node.
Which control-plane components must be healthy for scheduling to succeed, and how would you prove each is healthy?
How does the API server enforce admission, and what happens when an admission webhook is slow or failing?
Explain etcd’s role and what symptoms you expect if etcd is impaired.
How would you safely rotate control-plane certificates in your environment?

Reference answers: Cluster architecture and control plane
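For the reviewer, a minimal sketch of the slow-or-failing webhook prompt. It is a simplified model, not API server code: it assumes a single validating webhook and models only the two failurePolicy values, Fail and Ignore. The names WebhookCall and admit are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WebhookCall:
    """Outcome of one admission webhook call (hypothetical model)."""
    allowed: Optional[bool]        # None means the call errored or timed out
    failure_policy: str = "Fail"   # "Fail" or "Ignore"


def admit(call: WebhookCall) -> bool:
    """Return True if the API request is admitted under this simplified model."""
    if call.allowed is not None:
        # The webhook answered, so its verdict decides.
        return call.allowed
    if call.failure_policy == "Ignore":
        # Errors and timeouts are skipped; the request proceeds unchecked.
        return True
    if call.failure_policy == "Fail":
        # Errors and timeouts reject the request.
        return False
    raise ValueError(f"unknown failurePolicy: {call.failure_policy!r}")
```

Under Fail, an unavailable webhook blocks every write it matches; under Ignore, writes bypass the check while it is down. A strong answer names which trade-off the team chose and which namespaces are excluded from the webhook.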

Scheduling and Capacity

How does the scheduler choose a node, and what common constraints cause Pending?
How do you debug a pod that never schedules when resources “look fine”?
How do requests and limits differ, and what failure modes come from setting limits without requests?
When would you use taints and tolerations versus affinity rules?
How do you detect and mitigate noisy neighbor CPU/memory contention on a node?

Reference answers: Scheduling and capacity
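For the reviewer, a minimal sketch of why a pod can stay Pending when resources "look fine". It assumes a heavily simplified filter step: only CPU, memory, and NoSchedule-style taints are modeled, taints are matched by key alone, and a limit with no request is treated as the request (which assumes no LimitRange default applies). The types and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int            # millicores
    allocatable_mem_mi: int           # MiB
    requested_cpu_m: int = 0          # sum of requests of pods already bound
    requested_mem_mi: int = 0
    taints: List[str] = field(default_factory=list)  # taint keys (simplified)


@dataclass
class Pod:
    cpu_request_m: Optional[int] = None
    cpu_limit_m: Optional[int] = None
    mem_request_mi: Optional[int] = None
    mem_limit_mi: Optional[int] = None
    tolerations: List[str] = field(default_factory=list)


def effective_request(request: Optional[int], limit: Optional[int]) -> int:
    """A limit with no request counts as the request; neither set counts as 0."""
    if request is not None:
        return request
    return limit if limit is not None else 0


def filter_reasons(pod: Pod, nodes: List[Node]) -> Dict[str, str]:
    """Map each node to 'fits' or the first reason it was filtered out."""
    cpu = effective_request(pod.cpu_request_m, pod.cpu_limit_m)
    mem = effective_request(pod.mem_request_mi, pod.mem_limit_mi)
    result = {}
    for node in nodes:
        untolerated = [t for t in node.taints if t not in pod.tolerations]
        if untolerated:
            result[node.name] = f"untolerated taint {untolerated[0]}"
        elif node.requested_cpu_m + cpu > node.allocatable_cpu_m:
            result[node.name] = "insufficient cpu"
        elif node.requested_mem_mi + mem > node.allocatable_mem_mi:
            result[node.name] = "insufficient memory"
        else:
            result[node.name] = "fits"
    return result
```

The point to listen for: the fit check compares summed requests against allocatable, not live usage, so a node that looks idle on a dashboard can still be full from the scheduler's point of view.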

Networking

Explain how a ClusterIP Service routes to endpoints; what updates when pods churn?
How does DNS for Services work inside the cluster, and what breaks it first under load?
Compare Ingress, LoadBalancer Services, and NodePort for north-south traffic; when is each wrong?
How would you debug intermittent 503 responses between two in-cluster services?
What is the difference between a “default deny” NetworkPolicy posture and selective allow lists?

Reference answers: Networking
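For the reviewer, a minimal sketch behind the DNS-under-load prompt. It assumes a glibc-style resolver and the common pod resolv.conf shape (search list of namespace.svc.cluster.local, svc.cluster.local, cluster.local, with ndots:5); node-level search domains and other resolver implementations are ignored. The function name dns_candidates is hypothetical.

```python
from typing import List


def dns_candidates(
    name: str,
    namespace: str,
    cluster_domain: str = "cluster.local",
    ndots: int = 5,
) -> List[str]:
    """List the fully qualified names a resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute; the search list is skipped.
        return [name]
    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as given.
    return expanded + [absolute]
```

Under these assumptions, a lookup for an external name such as example.com from namespace shop tries three cluster-suffixed names before the real one, which is one reason cluster DNS load can be several times the application's apparent lookup rate.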

Security and Identity

How does RBAC interact with authentication in your clusters (cert, OIDC, cloud IAM exec)?
What is the blast radius of a compromised namespace-scoped RoleBinding versus a ClusterRoleBinding?
How do you manage secrets at rest and in Git for cluster config?
What guardrails prevent a workload from talking to the metadata service or the API server when it should not?
How would you design pod security standards (or equivalent) for a multi-team cluster?

Reference answers: Security and identity
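For the reviewer, a minimal sketch of the blast-radius comparison. It models RBAC as purely additive grants and treats a binding with no namespace as a ClusterRoleBinding; cluster-scoped resources, resourceNames, API groups, and aggregation are left out. The Binding type and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    subject: str
    verbs: frozenset
    resources: frozenset
    namespace: Optional[str]  # None models a ClusterRoleBinding


def allowed(bindings: List[Binding], subject: str, verb: str,
            resource: str, namespace: str) -> bool:
    """RBAC is additive: any matching binding allows; there are no deny rules."""
    for b in bindings:
        if b.subject != subject:
            continue
        if b.namespace is not None and b.namespace != namespace:
            continue
        verb_ok = verb in b.verbs or "*" in b.verbs
        resource_ok = resource in b.resources or "*" in b.resources
        if verb_ok and resource_ok:
            return True
    return False


def reachable_namespaces(bindings: List[Binding], subject: str, verb: str,
                         resource: str, all_namespaces: List[str]) -> List[str]:
    """Namespaces in which the subject can perform verb on resource."""
    return [ns for ns in all_namespaces
            if allowed(bindings, subject, verb, resource, ns)]
```

Running reachable_namespaces for "get" on "secrets" against each subject gives a rough picture of how far a stolen credential reaches in this model.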

Storage

Compare ephemeral volumes, PVCs, and StatefulSet identity; when must you use each?
What causes a PVC to stay Pending, and how do you read describe output for storage issues?
How do you choose a StorageClass strategy for stateful apps across zones or single-node labs?
What happens to data when a StatefulSet pod is deleted versus rescheduled?
How would you back up and restore a stateful workload on Kubernetes?

Reference answers: Storage
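For the reviewer, a minimal sketch of StatefulSet identity, useful when probing what happens to data on delete versus reschedule. It assumes ordinals start at 0 and a single volumeClaimTemplate; the function names are hypothetical.

```python
from typing import List


def statefulset_pod_names(name: str, replicas: int) -> List[str]:
    """Pods are named <statefulset>-<ordinal>."""
    return [f"{name}-{i}" for i in range(replicas)]


def statefulset_pvc_names(name: str, replicas: int, claim_template: str) -> List[str]:
    """Claims are named <volumeClaimTemplate>-<statefulset>-<ordinal>."""
    return [f"{claim_template}-{name}-{i}" for i in range(replicas)]
```

Because the claim name is derived from the pod's stable name, a recreated pod with the same ordinal reattaches to the same claim. A good answer also covers what deletes the claim itself.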

Workloads and Rollouts

Compare Deployment rolling updates versus StatefulSet ordered rollout; failure modes of each.
How do liveness, readiness, and startup probes differ, and how can a bad probe cause an incident?
What signals tell you a rollout is unhealthy before users complain?
How do you safely roll back a bad deployment in production?
When would you choose Jobs or CronJobs instead of long-running Deployments?

Reference answers: Workloads and rollouts
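For the reviewer, a minimal sketch of the arithmetic behind a Deployment rolling update. It assumes the documented rounding (maxSurge percentages round up, maxUnavailable percentages round down) and the 25% defaults; the case where both resolve to zero is not modeled and raises instead. The function names are hypothetical.

```python
from typing import Dict, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rolling_update_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> Dict[str, int]:
    """Upper bound on total pods and lower bound on available pods mid-rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("both values resolve to zero; not modeled in this sketch")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

With 10 replicas and the defaults, this gives at most 13 pods and at least 8 available during the rollout, which is a useful check on whether the candidate has sized node headroom and probe timing for the surge.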

Observability

What is your first dashboard or query when CPU spikes cluster-wide?
How do you distinguish “the network is slow” from “the app is slow” with metrics you already have?
How would you wire application metrics into Prometheus in a way that survives Helm upgrades?
What do you alert on, and what do you only dashboard?
How do you keep cardinality under control in Prometheus labels?

Reference answers: Observability
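For the reviewer, a minimal sketch of the cardinality prompt. It assumes the worst case, where every combination of label values occurs, so the series count for one metric is the product of distinct values per label. The label names and counts in the usage note are made up for illustration.

```python
from math import prod
from typing import Dict


def worst_case_series(label_value_counts: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())
```

For example, labels with 5 methods, 10 status codes, and 200 pods bound one metric at 10,000 series; adding a hypothetical user_id label with 50,000 values multiplies that bound by 50,000. Real counts are usually lower because not every combination occurs.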

GitOps and Delivery

How do you prevent a GitOps controller from overwriting manual hotfixes?
What is your promotion model between environments, and how are secrets handled?
How do you detect and recover from sync loops or partial applies?
What policy checks run before manifests reach the cluster?
How do you coordinate Helm releases with GitOps without fighting state?

Reference answers: GitOps and delivery

Upgrades and Day-2 Operations

What is your Kubernetes upgrade strategy for control plane versus nodes?
How do you validate add-ons (CNI, ingress, metrics) after an upgrade?
What is your process for draining nodes and handling PDBs during maintenance?
How do you manage CRD upgrades when operators lag the cluster version?
What is your story for certificate expiry across the platform?

Reference answers: Upgrades and day-2 operations
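For the reviewer, a minimal sketch of why a node drain can stall on a PodDisruptionBudget. It assumes integer minAvailable or maxUnavailable values only (percentages are not modeled) and computes allowed disruptions as healthy pods minus the required healthy count, floored at zero. The function name is hypothetical.

```python
from typing import Optional


def allowed_disruptions(
    current_healthy: int,
    expected_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """How many voluntary evictions the budget permits right now (simplified)."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    return max(0, current_healthy - desired_healthy)
```

A budget with min_available equal to the replica count yields zero allowed disruptions, so evictions are refused until someone intervenes. Ask how the team detects and handles that case during maintenance.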

Multi-Tenancy and Policy

How do you isolate noisy teams: namespaces, quotas, network policy, admission?
What admission policies do you enforce cluster-wide, and how do you roll them out safely?
How do you handle shared ingress controllers fairly between tenants?
What is your default for ResourceQuota and how do you tune it without blocking legitimate work?
How do you audit who changed security-sensitive objects?

Reference answers: Multi-tenancy and policy

Architecture and Design Review Questions

Use these when reviewing a design doc or a new platform capability.

Boundaries and Blast Radius

What is the smallest failure domain you can contain this change to?
What happens if this component is down for 5 minutes? For an hour?
What manual steps remain if automation fails mid-flight?

Defaults and Operability

What are the safe defaults for teams who do not read the README?
What telemetry proves this feature is healthy in production?
What runbook steps exist for the on-call engineer at 3 a.m.?

Migration and Compatibility

What is the rollback plan, and how long does it take to execute?
How do you migrate existing workloads without coordinated downtime?
What version skew issues appear between clients, API server, and node components?

Cost and Complexity

What does this cost in engineer time and cloud spend at steady state?
What simpler design did you reject, and why?

Networking and Edge

Where is TLS terminated, and who owns certificate lifecycle?
How does traffic fail over if the ingress tier degrades?

Security and Compliance

What data crosses trust boundaries, and how is it protected?
What evidence would satisfy an auditor that controls are effective?

Data and State

Where is authoritative state stored, and how is it backed up?
What consistency guarantees does the application need versus what Kubernetes provides?

Observability and SLOs

What SLIs do you need before launch, and what error budget policy applies?
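For the reviewer, a minimal sketch of the error budget arithmetic behind this prompt. It assumes a time-based availability SLO over a rolling window; the function names and the 30-day default are illustrative choices, not a prescribed policy.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a target such as 0.999 over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining_fraction(slo_target: float, bad_minutes: float,
                              window_days: int = 30) -> float:
    """Fraction of the budget left; negative means the budget is overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

A 99.9% target over 30 days allows roughly 43 minutes of unavailability. An error budget policy then states what changes (for example, a release freeze) once the remaining fraction drops below an agreed threshold.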

Production Operations and Incident Readiness Questions

Use these when validating operational maturity.

Triage and Response

What is your first 10-minute checklist when user-facing latency spikes?
How do you decide between rollback, traffic shift, and hotfix under pressure?
How do you communicate status without slowing down mitigation?

Guardrails and Prevention

What prevents a bad manifest from reaching production?
What prevents a single team from starving the cluster?

Runbooks and Knowledge

Where do runbooks live, and how often are they tested?
What incident types repeat, and what permanent fixes are still open?

Capacity and Scaling

How do you forecast node capacity, and what triggers a scale-up?
How do you handle regional or zonal impairment if applicable?

Post-Incident Learning

What is your blameless postmortem template, and who must attend?
How do you track action items to completion?

If You Only Ask Ten Questions

Describe the path from kubectl apply to a running Pod and where it can fail.
How do you debug a Pending pod that never schedules?
How does a ClusterIP Service route traffic when endpoints churn?
How do requests and limits change scheduling and runtime behavior?
What is your rollout and rollback strategy for a bad deployment?
How do you prove control-plane health during an incident?
What is your secret handling model for cluster and GitOps config?
How do you prevent or detect noisy neighbor issues on a node?
What is your Kubernetes upgrade strategy for control plane versus workers?
What is the first query or dashboard you open for cluster-wide latency?