
Production Scenarios

First published by Atif Alam

These scenarios are for operational practice. The goal is not a single “correct” answer, but a clear, low-risk reasoning path.

Scenario 1: Error rate spikes after rollout


Context: A deployment completes in one namespace. Within minutes, 5xx error rate increases and latency worsens.

  • Confirm scope: one service, one namespace, or downstream impact.
  • Compare before/after rollout timings in metrics and logs.
  • Check readiness/startup probes and container restart behavior.
  • Validate config/secret/schema compatibility across old/new versions.
  • Decide mitigation quickly: rollback, pause rollout, or reduce traffic to new pods.
  • Document what changed and when for post-incident follow-up.
# pause in-progress rollout while you investigate
kubectl rollout pause deployment/my-app -n <namespace>
# resume if healthy after checks
kubectl rollout resume deployment/my-app -n <namespace>
# rollback to previous revision
kubectl rollout undo deployment/my-app -n <namespace>
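The config/secret compatibility check from the list above can be sketched as a plain diff of the rendered config from each revision. The file contents here are hypothetical stand-ins for what you would pull with `kubectl get configmap ... -o yaml` from the old and new versions:

```shell
# Hypothetical rendered config from the old and new revisions.
printf 'db_host=db-1\ntimeout=30\n' > /tmp/cfg-old
printf 'db_host=db-1\ntimeout=5\nfeature_x=on\n' > /tmp/cfg-new

# diff exits non-zero when files differ, so guard it to keep scripts running
changes=$(diff /tmp/cfg-old /tmp/cfg-new || true)
echo "$changes"
```

Any line that only appears on the new side (here a tightened timeout and a new flag) is a candidate cause if old and new pods are serving traffic simultaneously during the rollout.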

For traffic reduction patterns during rollout (blue/green, canary weights, feature flags), see Deployment Strategies.
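Before reaching for mesh weights, it helps to remember that a plain replica-based split behind one Service routes traffic roughly in proportion to ready pod counts. A sketch with hypothetical replica numbers:

```shell
# Hypothetical replica counts for a coarse canary split (no mesh weights):
# traffic share is roughly proportional to ready pods behind the Service.
stable_replicas=9
canary_replicas=1
canary_pct=$(( 100 * canary_replicas / (stable_replicas + canary_replicas) ))
echo "canary receives ~${canary_pct}% of traffic"
```

This coarse split is often enough to contain blast radius while you investigate; finer-grained weights need an ingress or mesh that supports them.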

Scenario 2: Pods stay Pending during traffic growth


Context: HPA increases desired replicas, but new pods remain Pending and request latency rises.

  • Inspect scheduler events for resource-fit failures.
  • Compare requests/limits against the actual workload profile.
  • Check node capacity and cluster-autoscaler behavior.
  • Review taints, tolerations, affinity/anti-affinity, and topology constraints.
  • Look for PVC binding delays or storage-class constraints on stateful workloads.
  • Check namespace quotas or limit ranges blocking admission.
# see why pods are Pending
kubectl describe pod <pending-pod> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# short-term stabilization options
kubectl scale deployment/my-app -n <namespace> --replicas=<known-safe-count>
kubectl patch deployment my-app -n <namespace> -p '{"spec":{"template":{"spec":{"priorityClassName":"high-priority"}}}}'
# inspect capacity pressure
kubectl top nodes
kubectl get pods -n <namespace> -o wide
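The requests-vs-reality check can be sketched with hypothetical figures, taking the per-pod CPU request from the Deployment spec and observed usage from `kubectl top pods`:

```shell
# Hypothetical numbers: CPU requested per pod vs. observed usage (millicores).
requested_m=2000
observed_m=300
verdict="requests look reasonable"
# A large gap means the scheduler runs out of schedulable "fit" long before
# nodes run out of real capacity, leaving new pods Pending.
if [ "$requested_m" -gt $((observed_m * 3)) ]; then
  verdict="requests look inflated: ${requested_m}m requested vs ~${observed_m}m used"
fi
echo "$verdict"
```

The 3x threshold here is an arbitrary illustration; the point is that Pending pods under load are often a requests problem, not a capacity problem.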

After immediate stabilization, apply deployment-specific guardrails in production: resource tuning, autoscaler settings, and quotas.
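The quota-admission check reduces to simple arithmetic once you have the namespace's ResourceQuota and the new pod's request (all figures below are hypothetical):

```shell
# Hypothetical ResourceQuota check: how many more pods can the namespace admit?
quota_cpu_m=10000     # namespace CPU quota (10 cores, in millicores)
used_cpu_m=9200       # CPU already requested by existing pods
per_pod_cpu_m=500     # CPU request of each new pod
desired_new_pods=3
admitted=$(( (quota_cpu_m - used_cpu_m) / per_pod_cpu_m ))
echo "quota admits $admitted of $desired_new_pods new pods"
```

When `admitted` is lower than the HPA's desired increase, the remaining pods are rejected at admission and never reach the scheduler, which looks deceptively like a capacity problem.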

Scenario 3: Intermittent 503s after networking change


Context: Traffic mostly works, but clients see intermittent 503s after an Ingress or service-mesh rule update.

  • Check Service endpoints and target pod readiness at the failure times.
  • Test DNS resolution and connectivity from an in-cluster debug pod.
  • Review Ingress/VirtualService/DestinationRule route-match precedence.
  • Look for NetworkPolicy changes that might block only some paths.
  • Check for mTLS mode mismatches, certificate validity, or policy drift.
  • Roll back the networking rule change first if impact is active.
# confirm service endpoints and pod readiness
kubectl get svc,endpoints -n <namespace>
kubectl get pods -n <namespace> -l app=my-app
# inspect ingress or gateway object changes
kubectl describe ingress <ingress-name> -n <namespace>
kubectl get ingress -n <namespace> -o yaml
# run in-cluster connectivity test
kubectl run net-debug --image=busybox -it --rm -n <namespace> -- /bin/sh
# if needed, reapply the previous ingress definition from version control
kubectl apply -n <namespace> -f <previous-ingress-manifest>.yaml
# rollout undo only helps if the ingress controller itself was upgraded
kubectl rollout undo deployment/<ingress-controller-deployment> -n <ingress-namespace>
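Inside the debug pod, intermittent failures are easiest to reason about as a sampled failure rate rather than single probes. The status codes below are a hypothetical sample standing in for the results of a curl loop against the Service:

```shell
# Hypothetical status codes sampled from a request loop in the debug pod;
# intermittent 503s show up as a partial, not total, failure rate.
codes="200 200 503 200 200 503 200 200"
total=0
fails=0
for c in $codes; do
  total=$((total + 1))
  if [ "$c" = "503" ]; then
    fails=$((fails + 1))
  fi
done
echo "saw $fails/$total sampled requests fail"
```

A partial rate like this usually points at one bad endpoint, one route branch, or one policy path, rather than a fully broken Service.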

If you use service mesh routing, also validate VirtualService/DestinationRule revisions before restoring traffic weights.

Across all scenarios:

  • Time-box triage and pick the highest-signal checks first.
  • Prefer reversible mitigations while impact is ongoing.
  • Record assumptions explicitly and verify them with evidence.
  • Close with concrete follow-ups: alerts, runbooks, tests, and rollout guardrails.