Kubernetes Architecture Review Questions
Use this page when you run architecture reviews or readiness checks with someone who owns platform work and production blast radius.
These are prompts, not answers. Good follow-ups: “What broke last time?”, “Show the YAML”, “What metric proves it?”, “Who owns rollback?”
How to Run a Good Session
- Demand specifics: names of controllers, objects, ports, failure symptoms, and commands they actually run.
- Force trade-offs: cost, complexity, operability, security, and time-to-recover under pressure.
- Start from incidents: ask them to replay a real outage or near-miss; watch for honest unknowns.
- Stay out of trivia: memorized `kubectl` flags matter less than reasoning about the control plane and data paths.
- Close with ownership: who changes defaults, who gets paged, and how drift is detected.
Discussion Prompts by Domain
Reference answers (same domains, concise explanations): Kubernetes architecture review answers.
Cluster Architecture and Control Plane
- Walk through what happens between `kubectl apply` and a Pod running on a node.
- Which control-plane components must be healthy for scheduling to succeed, and how would you prove each is healthy?
- How does the API server enforce admission, and what happens when an admission webhook is slow or failing?
- Explain etcd’s role and what symptoms you expect if etcd is impaired.
- How would you safely rotate control-plane certificates in your environment?
Reference answers: Cluster architecture and control plane
Scheduling and Capacity
- How does the scheduler choose a node, and what common constraints cause `Pending`?
- How do you debug a pod that never schedules when resources “look fine”?
- How do requests and limits differ, and what failure modes come from setting limits without requests?
- When would you use taints and tolerations versus affinity rules?
- How do you detect and mitigate noisy neighbor CPU/memory contention on a node?
Reference answers: Scheduling and capacity
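When listening to an answer on requests and limits, it can help to have the QoS classification rules at hand. The following is a minimal Python sketch, not the kubelet's implementation: the function names are hypothetical, quantity strings are compared literally (so `500m` and `0.5` would not match), and init containers are ignored.

```python
from typing import Dict, List

# One container's resources, e.g. {"requests": {"cpu": "100m"}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]

# Only CPU and memory count toward the QoS class.
QOS_RESOURCES = ("cpu", "memory")


def effective_resources(container: Resources) -> Resources:
    """Keep CPU/memory only and copy a limit into any missing request."""
    limits = {k: v for k, v in container.get("limits", {}).items() if k in QOS_RESOURCES}
    requests = {k: v for k, v in container.get("requests", {}).items() if k in QOS_RESOURCES}
    for name, value in limits.items():
        requests.setdefault(name, value)  # limit without request: request = limit
    return {"requests": requests, "limits": limits}


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    effective = [effective_resources(c) for c in containers]
    if all(not c["requests"] and not c["limits"] for c in effective):
        return "BestEffort"
    guaranteed = all(
        name in c["limits"] and c["requests"].get(name) == c["limits"][name]
        for c in effective
        for name in QOS_RESOURCES
    )
    return "Guaranteed" if guaranteed else "Burstable"
```

Under these assumptions, a pod whose only container sets limits for both CPU and memory but no requests classifies as Guaranteed, because the requests are defaulted from the limits. That is a useful probe for the "limits without requests" prompt.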
Networking
- Explain how a `ClusterIP` Service routes to endpoints; what updates when pods churn?
- How does DNS for Services work inside the cluster, and what breaks it first under load?
- Compare Ingress, `LoadBalancer` Services, and NodePort for north-south traffic; when is each wrong?
- How would you debug intermittent `503` between two in-cluster services?
- What is the difference between NetworkPolicy as “default deny” versus selective allow lists?
Reference answers: Networking
Security and Identity
- How does RBAC interact with authentication in your clusters (cert, OIDC, cloud IAM exec)?
- What is the blast radius of a compromised namespace-scoped `RoleBinding` versus a cluster-wide `ClusterRoleBinding`?
- How do you manage secrets at rest and in Git for cluster config?
- What guardrails prevent a workload from talking to the metadata service or the API server when it should not?
- How would you design pod security standards (or equivalent) for a multi-team cluster?
Reference answers: Security and identity
Storage
- Compare ephemeral volumes, PVCs, and StatefulSet identity; when must you use each?
- What causes a PVC to stay `Pending`, and how do you read `describe` output for storage issues?
- How do you choose a `StorageClass` strategy for stateful apps across zones or single-node labs?
- What happens to data when a StatefulSet pod is deleted versus rescheduled?
- How would you back up and restore a stateful workload on Kubernetes?
Reference answers: Storage
Workloads and Rollouts
- Compare Deployment rolling updates versus StatefulSet ordered rollout; failure modes of each.
- How do liveness, readiness, and startup probes differ, and how can a bad probe cause an incident?
- What signals tell you a rollout is unhealthy before users complain?
- How do you safely roll back a bad deployment in production?
- When would you choose Jobs or CronJobs instead of long-running Deployments?
Reference answers: Workloads and rollouts
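For the rolling-update prompts, a quick way to check someone's arithmetic is to compute the pod-count bounds a Deployment rollout may reach. This is a hedged Python sketch with hypothetical helper names. It assumes the documented rounding (a percentage `maxSurge` rounds up, a percentage `maxUnavailable` rounds down) and mirrors the controller's fallback of treating `maxUnavailable` as 1 when both resolve to zero.

```python
from typing import Tuple, Union

IntOrPercent = Union[int, str]  # e.g. 2 or "25%"


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string into a pod count."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas  # integer math avoids float error
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        unavailable = 1  # otherwise the rollout could never make progress
    return replicas - unavailable, replicas + surge
```

With the defaults, `rollout_bounds(10)` returns `(8, 13)`: at least 8 pods stay available and at most 13 exist at once. Asking what those numbers mean for node headroom and quota is a natural follow-up.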
Observability
- What is your first dashboard or query when CPU spikes cluster-wide?
- How do you distinguish “the network is slow” from “the app is slow” with metrics you already have?
- How would you wire application metrics into Prometheus in a way that survives Helm upgrades?
- What do you alert on versus what you only dashboard?
- How do you keep cardinality under control in Prometheus labels?
Reference answers: Observability
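The cardinality prompt gets more concrete when the answer names which label is exploding. As a rough illustration only, the Python sketch below counts distinct values per label for each metric from an in-memory list of series. The `Series` shape and the function name are assumptions made for this example, not a Prometheus API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical input shape: (metric name, label set) per time series.
Series = Tuple[str, Dict[str, str]]


def label_cardinality(series: Iterable[Series]) -> Dict[str, Dict[str, int]]:
    """Count distinct values seen for each label of each metric."""
    values: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for metric, labels in series:
        for name, value in labels.items():
            values[metric][name].add(value)
    return {
        metric: {name: len(seen) for name, seen in per_label.items()}
        for metric, per_label in values.items()
    }
```

A label whose count grows with users, request IDs, or pod names is usually the one to drop or aggregate; a good answer explains where that decision is enforced.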
GitOps and Delivery
- How do you prevent a GitOps controller from overwriting manual hotfixes?
- What is your promotion model between environments, and how are secrets handled?
- How do you detect and recover from sync loops or partial applies?
- What policy checks run before manifests reach the cluster?
- How do you coordinate Helm releases with GitOps without fighting state?
Reference answers: GitOps and delivery
Upgrades and Day-2 Operations
- What is your Kubernetes upgrade strategy for control plane versus nodes?
- How do you validate add-ons (CNI, ingress, metrics) after an upgrade?
- What is your process for draining nodes and handling PDBs during maintenance?
- How do you manage CRD upgrades when operators lag the cluster version?
- What is your story for certificate expiry across the platform?
Reference answers: Upgrades and day-2 operations
Multi-Tenancy and Policy
- How do you isolate noisy teams: namespaces, quotas, network policy, admission?
- What admission policies do you enforce cluster-wide, and how do you roll them out safely?
- How do you handle shared ingress controllers fairly between tenants?
- What is your default for `ResourceQuota`, and how do you tune it without blocking legitimate work?
- How do you audit who changed security-sensitive objects?
Reference answers: Multi-tenancy and policy
Architecture and Design Review Questions
Use these when reviewing a design doc or a new platform capability.
Boundaries and Blast Radius
- What is the smallest failure domain you can contain this change to?
- What happens if this component is down for 5 minutes? For an hour?
- What manual steps remain if automation fails mid-flight?
Defaults and Operability
- What are the safe defaults for teams who do not read the README?
- What telemetry proves this feature is healthy in production?
- What runbook steps exist for the on-call engineer at 3 a.m.?
Migration and Compatibility
- What is the rollback plan, and how long does it take to execute?
- How do you migrate existing workloads without coordinated downtime?
- What version skew issues appear between clients, API server, and node components?
Cost and Complexity
- What does this cost in engineer time and cloud spend at steady state?
- What simpler design did you reject, and why?
Networking and Edge
- Where is TLS terminated, and who owns certificate lifecycle?
- How does traffic fail over if the ingress tier degrades?
Security and Compliance
- What data crosses trust boundaries, and how is it protected?
- What evidence would satisfy an auditor that controls are effective?
Data and State
- Where is authoritative state stored, and how is it backed up?
- What consistency guarantees does the application need versus what Kubernetes provides?
Observability and SLOs
- What SLIs do you need before launch, and what error budget policy applies?
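An error budget answer is easier to evaluate with the arithmetic written down. The sketch below, in Python with hypothetical function names, assumes an availability-style SLO over a fixed window and a simple good/total event count. Real policies often use rolling windows and burn-rate alerts, which this does not model.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget used; above 1.0 means the SLO is missed."""
    if total == 0:
        return 0.0  # no traffic, nothing consumed
    allowed_bad = (1.0 - slo_target) * total
    return (total - good) / allowed_bad
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of budget. The policy question is what the team stops doing once `budget_consumed` approaches 1.0.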
Production Operations and Incident Readiness Questions
Use these when validating operational maturity.
Triage and Response
- What is your first 10-minute checklist when user-facing latency spikes?
- How do you decide between rollback, traffic shift, and hotfix under pressure?
- How do you communicate status without slowing down mitigation?
Guardrails and Prevention
- What prevents a bad manifest from reaching production?
- What prevents a single team from starving the cluster?
Runbooks and Knowledge
- Where do runbooks live, and how often are they tested?
- What incident types repeat, and what permanent fixes are still open?
Capacity and Scaling
- How do you forecast node capacity, and what triggers a scale-up?
- How do you handle regional or zonal impairment if applicable?
Post-Incident Learning
- What is your blameless postmortem template, and who must attend?
- How do you track action items to completion?
Pointers to deeper runbooks on this site
- Troubleshooting and Debugging
- Production Patterns
- Production Platform Checklist
- Incident response and on-call
If You Only Ask Ten Questions
- Describe the path from `kubectl apply` to a running Pod and where it can fail.
- How do you debug a `Pending` pod that never schedules?
- How does a `ClusterIP` Service route traffic when endpoints churn?
- How do requests and limits change scheduling and runtime behavior?
- What is your rollout and rollback strategy for a bad deployment?
- How do you prove control-plane health during an incident?
- What is your secret handling model for cluster and GitOps config?
- How do you prevent or detect noisy neighbor issues on a node?
- What is your Kubernetes upgrade strategy for control plane versus workers?
- What is the first query or dashboard you open for cluster-wide latency?