Production Scenarios
These scenarios are for operational practice. The goal is not a single “correct” answer, but a clear, low-risk reasoning path.
Scenario 1: Error rate spikes after rollout
Context: A deployment completes in one namespace. Within minutes, the 5xx error rate increases and latency worsens.
Angles to check
- Confirm scope: one service, one namespace, or downstream impact.
- Compare before/after rollout timings in metrics and logs.
- Check readiness/startup probes and container restart behavior.
- Validate config/secret/schema compatibility across old/new versions.
- Decide mitigation quickly: rollback, pause rollout, or reduce traffic to new pods.
- Document what changed and when for post-incident follow-up.
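The probe and compatibility checks above pair naturally with a conservative rollout configuration that keeps old pods serving until new pods pass readiness. A minimal sketch, assuming a standard Deployment named my-app; the image, /healthz path, port, and probe thresholds are illustrative assumptions, not values from this runbook:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep old pods serving until replacements are Ready
      maxSurge: 1         # roll out one new pod at a time
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:stable  # assumed image
          readinessProbe:          # gates traffic until the app answers
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

With maxUnavailable set to 0, a new version that never becomes Ready stalls the rollout instead of degrading serving capacity, which buys time for the checks above.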
Common mitigation commands (Kubernetes)
```sh
# pause in-progress rollout while you investigate
kubectl rollout pause deployment/my-app -n <namespace>

# resume if healthy after checks
kubectl rollout resume deployment/my-app -n <namespace>

# rollback to previous revision
kubectl rollout undo deployment/my-app -n <namespace>
```

For traffic reduction patterns during rollout (blue/green, canary weights, feature flags), see Deployment Strategies.
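If you reduce traffic to new pods via canary weights, a mesh-level route shift is one common pattern. A hedged sketch assuming an Istio-style mesh; the stable/canary subset names are illustrative and must match subsets defined in an existing DestinationRule:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: stable   # assumed subset name
          weight: 90
        - destination:
            host: my-app
            subset: canary   # assumed subset name
          weight: 10
```

Shifting weight back to 100/0 is a fast, reversible mitigation while you investigate the new revision.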
Scenario 2: Pods stay Pending during traffic growth
Context: HPA increases desired replicas, but new pods remain Pending and request latency rises.
Angles to check
- Scheduler events for resource fit failures.
- Requests/limits realism vs actual workload profile.
- Node capacity and autoscaler behavior.
- Taints, tolerations, affinity/anti-affinity, and topology constraints.
- PVC binding delays or storage class constraints for stateful workloads.
- Namespace quotas or limit ranges blocking admission.
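The quota check matters because admission failures look different from scheduling failures: a ResourceQuota rejects pod creation outright (a Forbidden error on the ReplicaSet), while insufficient node capacity leaves pods Pending. A minimal sketch of a quota that can silently cap HPA scale-out; the namespace name and limits are illustrative assumptions:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-namespace   # assumed namespace
spec:
  hard:
    requests.cpu: "20"      # total CPU requests across the namespace
    requests.memory: 40Gi   # total memory requests
    pods: "50"              # hard cap on pod count
```

If the HPA's desired replica count times per-pod requests exceeds these limits, new pods are never created, so the symptom appears in ReplicaSet events rather than scheduler events.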
Common mitigation commands (Kubernetes)
```sh
# see why pods are Pending
kubectl describe pod <pending-pod> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# short-term stabilization options
kubectl scale deployment/my-app -n <namespace> --replicas=<known-safe-count>
kubectl patch deployment my-app -n <namespace> -p '{"spec":{"template":{"spec":{"priorityClassName":"high-priority"}}}}'

# inspect capacity pressure
kubectl top nodes
kubectl get pods -n <namespace> -o wide
```

Use deployment-specific guardrails in production (resource tuning, autoscaler settings, and quotas) after immediate stabilization.
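The patch command above assumes a PriorityClass named high-priority already exists in the cluster; patching a pod spec to reference a missing class fails at admission. A minimal sketch of such a class, where the value shown is an illustrative assumption to be chosen consistently with your cluster's priority scheme:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000         # higher values may preempt lower-priority pods
globalDefault: false   # do not apply to pods that omit priorityClassName
description: "For latency-critical workloads under capacity pressure."
```

Note that preemption evicts lower-priority pods to make room, so treat this as a deliberate trade-off rather than a free capacity gain.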
Scenario 3: Intermittent 503s after networking change
Context: Traffic mostly works, but clients see intermittent 503s after an Ingress or service-mesh rule update.
Angles to check
- Service/endpoints and target pod readiness at failure times.
- DNS resolution and connectivity from an in-cluster debug pod.
- Ingress/VirtualService/DestinationRule route match precedence.
- NetworkPolicy changes that might block only some paths.
- mTLS mode mismatches, certificate validity, or policy drift.
- Roll back networking rule change first if impact is active.
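For the mTLS mismatch check, one reversible mitigation in an Istio-style mesh is to relax peer authentication to PERMISSIVE while you diagnose, so both mTLS and plaintext traffic are accepted. A hedged sketch; the namespace and scope are illustrative assumptions, and you should re-tighten to STRICT once the drift is fixed:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: my-namespace   # assumed namespace; omit for mesh-wide scope
spec:
  mtls:
    mode: PERMISSIVE   # accept both mTLS and plaintext during diagnosis
```

Intermittent 503s are characteristic of partial mTLS drift, where only some workloads or sidecars enforce STRICT mode.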
Common mitigation commands (Kubernetes)
```sh
# confirm service endpoints and pod readiness
kubectl get svc,endpoints -n <namespace>
kubectl get pods -n <namespace> -l app=my-app

# inspect ingress or gateway object changes
kubectl describe ingress <ingress-name> -n <namespace>
kubectl get ingress -n <namespace> -o yaml

# run an in-cluster connectivity test from a throwaway pod
kubectl run net-debug --image=busybox -it --rm -n <namespace> -- /bin/sh

# if needed, roll the ingress controller deployment back to its previous revision
kubectl rollout undo deployment/<ingress-controller-deployment> -n <ingress-namespace>
```

If you use service mesh routing, also validate VirtualService/DestinationRule revisions before restoring traffic weights.
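When validating mesh routing, confirm that every subset a VirtualService references actually exists in the matching DestinationRule; a route to a missing or mislabeled subset commonly surfaces as intermittent 503s. A minimal sketch assuming an Istio-style mesh, with illustrative subset and version-label names:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: stable     # assumed subset name referenced by routes
      labels:
        version: v1    # assumed pod label selecting the stable revision
    - name: canary
      labels:
        version: v2
```

Subsets select pods by label, so a rollout that changes pod labels can break routing even when the VirtualService itself is untouched.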
How to use these scenarios
- Time-box triage and pick the highest-signal checks first.
- Prefer reversible mitigations while impact is ongoing.
- Record assumptions explicitly and verify them with evidence.
- Close with concrete follow-ups: alerts, runbooks, tests, and rollout guardrails.