Kubernetes Architecture Review Answers
Companion to Kubernetes architecture review questions.
This page repeats each original prompt and gives a concise reference answer under it.
Cluster Architecture and Control Plane
Walk through what happens between kubectl apply and a Pod running on a node.
Answer: The API server validates and stores objects in etcd, controllers reconcile desired state (for example Deployment -> ReplicaSet -> Pod), the scheduler binds Pods to nodes, and kubelet starts containers via the runtime.
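The reconcile step in that flow can be pictured as a comparison of desired and observed state that yields the actions needed to converge. The following Python sketch is a toy model under that assumption, not the controller manager's actual code; the name plan_actions and the action tuples are hypothetical.

```python
def plan_actions(desired_replicas, running_pods):
    """Return the create/delete actions that move observed state toward desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few Pods: request the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many Pods: pick the surplus to delete (here simply by name order).
        surplus = sorted(running_pods)[:-diff]
        return [("delete", name) for name in surplus]
    return []  # Already converged; nothing to do.
```

A real controller runs this kind of comparison repeatedly in response to watch events, so a missed or failed action is retried on the next pass.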
Which control-plane components must be healthy for scheduling to succeed, and how would you prove each is healthy?
Answer: API server, etcd, scheduler, and controller manager must be healthy. Prove with component health endpoints/logs, scheduling events on Pending Pods, and successful control-plane CRUD operations.
How does the API server enforce admission, and what happens when an admission webhook is slow or failing?
Answer: Requests pass through authn/authz, then mutating and validating admission chains. Slow/failing webhooks cause request latency, timeouts, and possible create/update failures depending on failure policy.
Explain etcd’s role and what symptoms you expect if etcd is impaired.
Answer: etcd is the control-plane source of truth for API objects. Impairment shows as API timeouts, stale watches, reconciliation lag, and possible cluster instability.
How would you safely rotate control-plane certificates in your environment?
Answer: Follow distro/vendor runbook, back up etcd first, rotate certs in a planned order that preserves quorum, restart components carefully, and verify node/control-plane connectivity after rotation.
See also: Architecture, Scheduling and placement, etcd and control plane health, Admission controllers, Kubeconfig and authentication
Scheduling and Capacity
How does the scheduler choose a node, and what common constraints cause Pending?
Answer: Scheduler filters then scores nodes based on resources, constraints, and policies. Pending commonly comes from insufficient resources, taints/tolerations mismatch, affinity rules, or volume constraints.
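The filter-then-score idea can be illustrated with a small Python model. This is a simplified sketch, not the real scheduler: the Node and Pod classes, the taint representation as plain strings, and the scoring heuristic (most leftover capacity wins) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cpu_m: int    # CPU not yet requested on the node, in millicores
    free_mem_mi: int   # memory not yet requested on the node, in MiB
    taints: set = field(default_factory=set)

@dataclass
class Pod:
    cpu_request_m: int
    mem_request_mi: int
    tolerations: set = field(default_factory=set)

def feasible(pod, node):
    """Filter phase: requests must fit and every taint must be tolerated."""
    fits = (node.free_cpu_m >= pod.cpu_request_m
            and node.free_mem_mi >= pod.mem_request_mi)
    return fits and node.taints <= pod.tolerations

def score(pod, node):
    """Score phase (toy heuristic): prefer the node with the most capacity left."""
    return ((node.free_cpu_m - pod.cpu_request_m)
            + (node.free_mem_mi - pod.mem_request_mi))

def pick_node(pod, nodes):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name
```

A None result corresponds to the Pending cases listed above: nothing passed the filter phase, so scoring never ran.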
How do you debug a pod that never schedules when resources “look fine”?
Answer: Start with kubectl describe pod events, then inspect requests/limits, taints, affinity/topology spread rules, PDB interactions, and scheduler logs if needed.
How do requests and limits differ, and what failure modes come from setting limits without requests?
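Answer: Requests drive scheduling and reserved capacity; limits cap runtime usage. When a limit is set without a request, Kubernetes defaults the request to the limit, which can over-reserve capacity and leave Pods Pending. Requests set far below limits have the opposite effect: nodes get overpacked, and contention shows up as CPU throttling or OOM kills.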
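How requests and limits combine can be shown with a short Python sketch of request defaulting and QoS classification. It assumes the documented behavior that a missing request defaults to its limit, and it compares quantities as plain strings, which is a simplification (real quantities such as 500m and 0.5 are equal). The function names are hypothetical.

```python
def effective_resources(requests, limits):
    """Return (requests, limits) after defaulting each missing request to its limit."""
    resolved = dict(requests)
    for name, limit in limits.items():
        resolved.setdefault(name, limit)
    return resolved, dict(limits)

def qos_class(containers):
    """Classify a Pod from a list of (requests, limits) dicts keyed by 'cpu' and 'memory'."""
    resolved = [effective_resources(r, l) for r, l in containers]
    if all(not r and not l for r, l in resolved):
        return "BestEffort"   # nothing set anywhere
    guaranteed = all(
        all(k in l and r.get(k) == l[k] for k in ("cpu", "memory"))
        for r, l in resolved
    )
    # Guaranteed needs CPU and memory limits on every container, with requests equal to them.
    return "Guaranteed" if guaranteed else "Burstable"
```

The QoS class matters under memory pressure: BestEffort Pods are the first eviction candidates and Guaranteed Pods the last.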
When would you use taints and tolerations versus affinity rules?
Answer: Use taints/tolerations to keep general workloads off protected/specialized nodes; use affinity/anti-affinity to express placement preferences or hard colocation/separation logic.
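The matching rule between a toleration and a taint can be sketched in Python. The dictionaries below mirror the key, operator, value, and effect fields of the Kubernetes API, but the functions themselves are an illustrative model with hypothetical names, not scheduler code.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False              # an empty effect matches every effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False              # an empty key (with Exists) matches every key
    if toleration.get("operator", "Equal") == "Exists":
        return True               # Exists ignores the value
    return toleration.get("value", "") == taint.get("value", "")

def untolerated_taints(node_taints, pod_tolerations):
    """Return the hard taints (NoSchedule/NoExecute) that no toleration covers."""
    hard = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return [t for t in hard
            if not any(tolerates(tol, t) for tol in pod_tolerations)]
```

An empty result means taints do not block the Pod from the node; it does not attract the Pod there, which is why taints are usually paired with node affinity for dedicated pools.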
How do you detect and mitigate noisy neighbor CPU/memory contention on a node?
Answer: Detect via node/pod saturation metrics, throttling, and OOM events; mitigate with sane requests/limits, better placement, quotas, and dedicated pools for heavy tenants.
See also: Scheduling and placement, Workload types, Production patterns, Troubleshooting and debugging
Networking
Explain how a ClusterIP Service routes to endpoints; what updates when pods churn?
Answer: The Service's virtual IP is backed by EndpointSlices, and kube-proxy (or an eBPF datapath) programs forwarding rules from them on each node. EndpointSlices update as Pods become ready/unready or restart.
How does DNS for Services work inside the cluster, and what breaks it first under load?
Answer: CoreDNS resolves service names to ClusterIP or pod records (headless). Under load, DNS saturation, upstream latency, and cache/misconfiguration issues tend to fail first.
Compare Ingress, LoadBalancer Services, and NodePort for north-south traffic; when is each wrong?
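Answer: Ingress provides L7 HTTP(S) routing through an ingress controller; a LoadBalancer Service exposes a single Service through an external (usually cloud) load balancer, typically at L4; NodePort opens a static port on every node and is rarely a good internet edge on its own. Ingress is the wrong fit for non-HTTP protocols, a LoadBalancer per Service is the wrong fit when many Services need shared routing or cost control, and NodePort is the wrong fit when clients need stable addresses and standard ports.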
How would you debug intermittent 503 between two in-cluster services?
Answer: Check readiness flaps, EndpointSlice churn, service/ingress timeouts, network policies, app retries, and upstream dependency saturation.
What is the difference between NetworkPolicy as “default deny” versus selective allow lists?
Answer: Default deny blocks all traffic until explicitly allowed; selective allow lists define permitted flows and are the practical implementation of least privilege.
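The additive semantics behind default deny plus allow lists can be modeled in a few lines of Python. This sketch covers only ingress, only label selectors within one namespace, and uses a simplified policy shape (pod_selector, allow_from) that is an assumption, not the NetworkPolicy schema.

```python
def matches(selector, labels):
    """An empty selector matches every Pod; otherwise all key/value pairs must match."""
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(policies, dst_labels, src_labels):
    """Decide whether traffic from the source Pod to the destination Pod is allowed."""
    selecting = [p for p in policies if matches(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the destination, so it is not isolated
    # Policies are additive: one matching allow rule in any selecting policy is enough.
    return any(matches(peer, src_labels)
               for p in selecting for peer in p["allow_from"])

# Default deny: selects every Pod and allows nothing.
DENY_ALL = {"pod_selector": {}, "allow_from": []}
# Selective allow: Pods labeled app=db accept traffic from Pods labeled app=api.
ALLOW_API_TO_DB = {"pod_selector": {"app": "db"}, "allow_from": [{"app": "api"}]}
```

Because policies only ever add permissions, the default-deny policy never needs editing; each new flow is a new allow rule.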
See also: Networking, Services and endpoints, Network policies, Ingress controllers
Security and Identity
How does RBAC interact with authentication in your clusters (cert, OIDC, cloud IAM exec)?
Answer: Authentication establishes identity; RBAC authorizes what that identity can do. Effective access is the combination of auth method, group mapping, and bindings.
What is the blast radius of a compromised namespace-scoped RoleBinding versus a cluster ClusterRoleBinding?
Answer: Namespace RoleBinding generally limits damage to one namespace; a privileged ClusterRoleBinding can grant broad cluster-wide access.
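The difference in scope between the two binding types can be made concrete with a toy authorization check in Python. The data shapes are assumptions for illustration: a binding with namespace None stands in for a ClusterRoleBinding, and roles are reduced to sets of verbs and resources.

```python
def is_allowed(user, groups, verb, resource, namespace, roles, bindings):
    """Return True if any binding grants the verb on the resource in the namespace."""
    identities = {user, *groups}
    for binding in bindings:
        if not identities & set(binding["subjects"]):
            continue  # binding does not name this user or any of their groups
        scope = binding["namespace"]
        if scope is not None and scope != namespace:
            continue  # namespaced binding for a different namespace
        for rule in roles[binding["role"]]:
            if verb in rule["verbs"] and resource in rule["resources"]:
                return True
    return False  # RBAC is deny-by-default: no matching rule means no access
```

In this model a compromised namespaced binding can only ever return True for its own namespace, while a cluster-scoped binding passes the scope check everywhere.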
How do you manage secrets at rest and in Git for cluster config?
Answer: Encrypt at rest (etcd/KMS), avoid plaintext in Git, and use external secret managers or encrypted Git workflows (for example SOPS/SealedSecrets).
What guardrails prevent a workload from talking to the metadata service or the API server when it should not?
Answer: Use NetworkPolicy egress restrictions, least-privilege service accounts, hardened node/metadata settings, and admission controls to block risky pod specs.
How would you design pod security standards (or equivalent) for a multi-team cluster?
Answer: Define baseline/restricted profiles, enforce via Pod Security Admission or policy engine, roll out in audit/warn first, then enforce with clear exception governance.
See also: RBAC, Pod Security Standards, Network policies, Kubeconfig and authentication, Production platform checklist
Storage
Compare ephemeral volumes, PVCs, and StatefulSet identity; when must you use each?
Answer: Ephemeral storage for scratch data, PVC for persistent state, StatefulSet when stable pod identity plus persistent volume mapping is required.
What causes a PVC to stay Pending, and how do you read describe output for storage issues?
Answer: Common causes include a missing or mismatched (default) StorageClass, topology/zone constraints, quota, or provisioner failure. The events in kubectl describe pvc output usually identify the exact blocker.
How do you choose a StorageClass strategy for stateful apps across zones or single-node labs?
Answer: In multi-zone, choose topology-aware classes and replication where needed; in single-node labs, simpler local/default classes are acceptable with explicit durability limits.
What happens to data when a StatefulSet pod is deleted versus rescheduled?
Answer: Deleting a StatefulSet Pod does not normally delete its PVC or the data behind it; the replacement Pod reattaches the same claim, provided the storage backend can attach the volume in the target zone or node.
How would you back up and restore a stateful workload on Kubernetes?
Answer: Combine volume snapshots/object backup with app-consistent backup logic where needed, test restores regularly, and document RPO/RTO.
See also: Storage, Stateful backup and restore
Workloads and Rollouts
Compare Deployment rolling updates versus StatefulSet ordered rollout; failure modes of each.
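Answer: Deployments optimize stateless rolling replacement; StatefulSets preserve order and identity for stateful apps. Typical failures differ: a Deployment rollout stalls when new Pods never become ready, a StatefulSet rollout blocks behind a single unready Pod because updates proceed in order, and a misplaced partition setting or an incompatible version change can leave replicas on mixed versions.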
How do liveness, readiness, and startup probes differ, and how can a bad probe cause an incident?
Answer: Startup gates initial boot, readiness controls traffic eligibility, liveness triggers restart. Misconfigured probes cause restart loops or route traffic to unhealthy pods.
What signals tell you a rollout is unhealthy before users complain?
Answer: Rising error rate/latency, readiness failure spikes, crash loops, pending replicas, and burn-rate alerts.
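Burn rate, mentioned in the answer above, is the observed error rate divided by the error budget implied by the SLO. A minimal Python sketch follows; the function name and the request-count inputs are assumptions, and real alerting rules usually evaluate this over several time windows.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being consumed (1.0 = exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    error_budget = 1.0 - slo_target      # allowed fraction of failed requests
    return (failed / total) / error_budget
```

For example, 10 failures in 1000 requests against a 99.9% SLO is a 1% error rate against a 0.1% budget, a burn rate of about 10.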
How do you safely roll back a bad deployment in production?
Answer: Use rollout undo/canary rollback, keep prior image/tag immutable, verify dependencies, and monitor SLO recovery before closing incident.
When would you choose Jobs or CronJobs instead of long-running Deployments?
Answer: Use Jobs/CronJobs for finite or scheduled batch work; use Deployments for continuously serving workloads.
See also: Workload types, Production patterns
Observability
What is your first dashboard or query when CPU spikes cluster-wide?
Answer: Start with cluster/node saturation and top namespaces/workloads, then narrow to culprit pods and recent deploy/config changes.
How do you distinguish “the network is slow” from “the app is slow” with metrics you already have?
Answer: Compare transport-level indicators (drops/retransmits/latency) against app latency, queue depth, and error metrics; correlate with traces if available.
How would you wire application metrics into Prometheus in a way that survives Helm upgrades?
Answer: Use stable ServiceMonitor/PodMonitor patterns and chart values under version control; avoid ad hoc manual scraping config drift.
What do you alert on versus what you only dashboard?
Answer: Alert on actionable SLO-impacting conditions; dashboard exploratory, trend, or non-actionable context metrics.
How do you keep cardinality under control in Prometheus labels?
Answer: Avoid unbounded labels, aggregate where possible, enforce instrumentation conventions, and review top-cardinality series routinely.
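A routine cardinality review amounts to counting distinct values per label and looking at the top of the list. The Python sketch below does this over label sets already exported from a metrics store; the input format (a list of label dictionaries) and the function name are assumptions.

```python
from collections import defaultdict

def label_cardinality(series):
    """Return (label, distinct_value_count) pairs, highest cardinality first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    # Sort by count descending, then by label name for a stable order.
    return sorted(((name, len(vals)) for name, vals in values.items()),
                  key=lambda item: (-item[1], item[0]))
```

Labels whose count grows with traffic or users (request IDs, user IDs, raw paths) are the usual candidates to drop or aggregate.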
See also: Incident first look, Observability setup, Prometheus, Scaling Prometheus
GitOps and Delivery
How do you prevent a GitOps controller from overwriting manual hotfixes?
Answer: Treat Git as source of truth; either codify hotfix quickly or use controlled temporary pause/override with clear expiry and reconciliation plan.
What is your promotion model between environments, and how are secrets handled?
Answer: Promote immutable artifacts across envs with environment-specific config overlays; fetch secrets from external managers instead of committing plaintext.
How do you detect and recover from sync loops or partial applies?
Answer: Watch controller health/status events, diff desired vs live state, fix invalid resources/order dependencies, then resync in controlled scope.
What policy checks run before manifests reach the cluster?
Answer: Schema validation, policy-as-code checks, security scans/signature checks, and admission dry-runs in CI.
How do you coordinate Helm releases with GitOps without fighting state?
Answer: Pick a single reconciliation owner model (GitOps drives Helm or GitOps applies rendered manifests), define ownership boundaries, and avoid dual-writer drift.
See also: Helm, Helm vs operators vs GitOps, GitOps, Policy as code
Upgrades and Day-2 Operations
What is your Kubernetes upgrade strategy for control plane versus nodes?
Answer: Upgrade control plane first (per vendor-supported skew), then worker pools in waves with soak and rollback checkpoints.
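A pre-upgrade skew check can be scripted so that each wave is gated on it. The Python sketch below is an illustration only: the allowed skew is passed in as max_minor_skew because the supported value depends on the Kubernetes version and vendor, and the function names are hypothetical.

```python
def major_minor(version):
    """Parse 'v1.30.2' or '1.30.2' into (1, 30)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])

def kubelet_skew_ok(apiserver_version, kubelet_version, max_minor_skew):
    """True if the kubelet is not newer than the API server and within the allowed skew."""
    api_major, api_minor = major_minor(apiserver_version)
    kubelet_major, kubelet_minor = major_minor(kubelet_version)
    if api_major != kubelet_major:
        return False
    # The kubelet may lag the API server by a bounded number of minors, never lead it.
    return 0 <= api_minor - kubelet_minor <= max_minor_skew
```

Running this against every node pool before a control-plane upgrade shows which pools must be upgraded first to stay within the supported range.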
How do you validate add-ons (CNI, ingress, metrics) after an upgrade?
Answer: Run smoke tests for networking, ingress routes, DNS, metrics ingestion, and autoscaling (HPA signals, adapters, node capacity) before broad rollout.
What is your process for draining nodes and handling PDBs during maintenance?
Answer: Cordon/drain in batches, respect PDBs, pre-check disruption budgets, and maintain enough capacity for safe eviction.
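Pre-checking a disruption budget comes down to simple arithmetic on healthy and required Pod counts. The Python sketch below handles integer minAvailable and maxUnavailable values only (percentages are omitted), and the function name is an assumption.

```python
def disruptions_allowed(expected, healthy, min_available=None, max_unavailable=None):
    """Return how many Pods can be evicted right now without violating the budget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        required_healthy = min_available
    else:
        required_healthy = expected - max_unavailable
    # Never negative: an already-degraded workload simply allows zero evictions.
    return max(0, healthy - required_healthy)
```

A result of zero before a drain means eviction will block, so either capacity is added first or the unhealthy replicas are fixed.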
How do you manage CRD upgrades when operators lag the cluster version?
Answer: Track compatibility matrix, upgrade operators/CRDs in tested order, and avoid API removals until all controllers are compatible.
What is your story for certificate expiry across the platform?
Answer: Inventory certs, automate renewal where possible, alert ahead of expiry, and periodically test rotation runbooks.
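Alerting ahead of expiry needs only an inventory of certificate names and their not-after timestamps. The Python sketch below assumes that inventory already exists as a dictionary; how it is collected, the 30-day default, and the function name are all assumptions.

```python
from datetime import datetime, timedelta, timezone

def expiring_certs(inventory, now, warn_days=30):
    """Return names of certificates that expire within warn_days or have already expired."""
    cutoff = now + timedelta(days=warn_days)
    return sorted(name for name, not_after in inventory.items()
                  if not_after <= cutoff)

# Usage with a hypothetical inventory; 'now' is passed in so results are reproducible.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
inventory = {
    "apiserver": now + timedelta(days=10),
    "etcd-peer": now + timedelta(days=90),
    "kubelet-client": now - timedelta(days=1),
}
soon = expiring_certs(inventory, now)
```

Here soon contains apiserver and kubelet-client; etcd-peer is outside the 30-day window.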
See also: Cluster upgrades, Troubleshooting and debugging, Production patterns, EKS overview, Production platform checklist
Multi-Tenancy and Policy
How do you isolate noisy teams: namespaces, quotas, network policy, admission?
Answer: Use namespace boundaries, resource quotas/limits, deny-by-default network policies, and admission controls to enforce baseline standards.
What admission policies do you enforce cluster-wide, and how do you roll them out safely?
Answer: Start in audit/warn mode, phase enforcement by environment, measure false positives, and provide clear exception process.
How do you handle shared ingress controllers fairly between tenants?
Answer: Define per-tenant constraints (rate, class, namespace ownership), monitor controller saturation, and split controllers when needed.
What is your default for ResourceQuota and how do you tune it without blocking legitimate work?
Answer: Set sensible defaults from observed usage, review regularly, and provide rapid exception workflow with post-hoc right-sizing.
How do you audit who changed security-sensitive objects?
Answer: Enable audit logs, centralize and retain them, and alert on high-risk object changes (RBAC, webhook configs, security policy resources).
See also: Multi-tenancy and policy, Multi-cluster management, Admission controllers, Network policies, Production platform checklist, Operators