Production Scenarios
These scenarios are for operational practice. The goal is not a single “correct” answer, but a clear, low-risk reasoning path.
Scenario 1: Error rate spikes after rollout
Context: A deployment completes in one namespace. Within minutes, the 5xx error rate increases and latency worsens.
Angles to check
- Confirm scope: one service, one namespace, or downstream impact.
- Compare before/after rollout timings in metrics and logs.
- Check readiness/startup probes and container restart behavior.
- Validate config/secret/schema compatibility across old/new versions.
- Decide mitigation quickly: rollback, pause rollout, or reduce traffic to new pods.
- Document what changed and when for post-incident follow-up.
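The probe and compatibility checks above pair naturally with a conservative rollout configuration that keeps old pods serving until new pods pass readiness. A minimal sketch, assuming a standard Deployment named my-app; the image, /healthz path, port, and probe thresholds are illustrative assumptions, not values from this runbook:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep old pods serving until replacements are Ready
      maxSurge: 1         # roll out one new pod at a time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:stable  # assumed image
          readinessProbe:          # gates traffic until the app answers
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

With maxUnavailable set to 0, a new version that never becomes Ready stalls the rollout instead of degrading serving capacity, which buys time for the checks above.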
Common mitigation commands (Kubernetes)
```sh
# pause in-progress rollout while you investigate
kubectl rollout pause deployment/my-app -n <namespace>

# resume if healthy after checks
kubectl rollout resume deployment/my-app -n <namespace>

# rollback to previous revision
kubectl rollout undo deployment/my-app -n <namespace>
```

For traffic reduction patterns during rollout (blue/green, canary weights, feature flags), see Deployment Strategies.
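If you reduce traffic to new pods via canary weights, a mesh-level route shift is one common pattern. A hedged sketch assuming an Istio-style mesh; the stable/canary subset names are illustrative and must match subsets defined in an existing DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: stable   # assumed subset name
          weight: 90
        - destination:
            host: my-app
            subset: canary   # assumed subset name
          weight: 10
```

Shifting weight back to 100/0 is a fast, reversible mitigation while you investigate the new revision.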
Scenario 2: Pods stay Pending during traffic growth
Context: HPA increases desired replicas, but new pods remain Pending and request latency rises.
Angles to check
- Scheduler events for resource fit failures.
- Requests/limits realism vs actual workload profile.
- Node capacity and autoscaler behavior.
- Taints, tolerations, affinity/anti-affinity, and topology constraints.
- PVC binding delays or storage class constraints for stateful workloads.
- Namespace quotas or limit ranges blocking admission.
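The quota check matters because admission failures look different from scheduling failures: a ResourceQuota rejects pod creation outright (a Forbidden error on the ReplicaSet), while insufficient node capacity leaves pods Pending. A minimal sketch of a quota that can silently cap HPA scale-out; the namespace name and limits are illustrative assumptions:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-namespace   # assumed namespace
spec:
  hard:
    requests.cpu: "20"      # total CPU requests across the namespace
    requests.memory: 40Gi   # total memory requests
    pods: "50"              # hard cap on pod count
```

If the HPA's desired replica count times per-pod requests exceeds these limits, new pods are never created, so the symptom appears in ReplicaSet events rather than scheduler events.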
Common mitigation commands (Kubernetes)
```sh
# see why pods are Pending
kubectl describe pod <pending-pod> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# short-term stabilization options
kubectl scale deployment/my-app -n <namespace> --replicas=<known-safe-count>
kubectl patch deployment my-app -n <namespace> -p '{"spec":{"template":{"spec":{"priorityClassName":"high-priority"}}}}'

# inspect capacity pressure
kubectl top nodes
kubectl get pods -n <namespace> -o wide
```

Use deployment-specific guardrails in production (resource tuning, autoscaler settings, and quotas) after immediate stabilization.
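The patch command above assumes a PriorityClass named high-priority already exists in the cluster; patching a pod spec to reference a missing class fails at admission. A minimal sketch of such a class, where the value shown is an illustrative assumption to be chosen consistently with your cluster's priority scheme:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000         # higher values may preempt lower-priority pods
globalDefault: false   # do not apply to pods that omit priorityClassName
description: "For latency-critical workloads under capacity pressure."
```

Note that preemption evicts lower-priority pods to make room, so treat this as a deliberate trade-off rather than a free capacity gain.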
Scenario 3: Intermittent 503s after networking change
Context: Traffic mostly works, but clients see intermittent 503s after an Ingress or service-mesh rule update.
Angles to check
- Service/endpoints and target pod readiness at failure times.
- DNS resolution and connectivity from an in-cluster debug pod.
- Ingress/VirtualService/DestinationRule route match precedence.
- NetworkPolicy changes that might block only some paths.
- mTLS mode mismatches, certificate validity, or policy drift.
- Roll back networking rule change first if impact is active.
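For the mTLS mismatch check, one reversible mitigation in an Istio-style mesh is to relax peer authentication to PERMISSIVE while you diagnose, so both mTLS and plaintext traffic are accepted. A hedged sketch; the namespace and scope are illustrative assumptions, and you should re-tighten to STRICT once the drift is fixed:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace   # assumed namespace; omit for mesh-wide scope
spec:
  mtls:
    mode: PERMISSIVE   # accept both mTLS and plaintext during diagnosis
```

Intermittent 503s are characteristic of partial mTLS drift, where only some workloads or sidecars enforce STRICT mode.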
Common mitigation commands (Kubernetes)
```sh
# confirm service endpoints and pod readiness
kubectl get svc,endpoints -n <namespace>
kubectl get pods -n <namespace> -l app=my-app

# inspect ingress or gateway object changes
kubectl describe ingress <ingress-name> -n <namespace>
kubectl get ingress -n <namespace> -o yaml

# run an in-cluster connectivity test from a throwaway pod
kubectl run net-debug --image=busybox -it --rm -n <namespace> -- /bin/sh

# if needed, roll the ingress controller deployment back to its previous revision
kubectl rollout undo deployment/<ingress-controller-deployment> -n <ingress-namespace>
```

If you use service mesh routing, also validate VirtualService/DestinationRule revisions before restoring traffic weights.
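When validating mesh routing, confirm that every subset a VirtualService references actually exists in the matching DestinationRule; a route to a missing or mislabeled subset commonly surfaces as intermittent 503s. A minimal sketch assuming an Istio-style mesh, with illustrative subset and version-label names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: stable     # assumed subset name referenced by routes
      labels:
        version: v1    # assumed pod label selecting the stable revision
    - name: canary
      labels:
        version: v2
```

Subsets select pods by label, so a rollout that changes pod labels can break routing even when the VirtualService itself is untouched.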
How to use these scenarios
- Time-box triage and pick the highest-signal checks first.
- Prefer reversible mitigations while impact is ongoing.
- Record assumptions explicitly and verify them with evidence.
- Close with concrete follow-ups: alerts, runbooks, tests, and rollout guardrails.