Troubleshooting and Debugging
Use this page when production behavior changes and you need a fast, structured path to isolate the cause and reduce impact.
Triage order (first 10-15 minutes)
- Confirm user impact: what is broken, for whom, and since when.
- Check recent changes: deploys, config, secret rotations, policy updates, node maintenance.
- Scope blast radius: one pod, one deployment, one namespace, or cluster-wide.
- Stabilize first: rollback, scale, traffic shift, or disable risky path if needed.
- Then investigate deeply with metrics, logs, traces, and Kubernetes events.
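When "stabilize first" means undoing a bad deploy, a rollback sketch might look like this (the deployment and namespace names are placeholders):

```sh
# Roll the deployment back to its previous revision
kubectl rollout undo deployment/<deploy> -n <ns>

# Watch the rollout until pods settle
kubectl rollout status deployment/<deploy> -n <ns>

# Inspect revision history if you need a specific earlier revision
kubectl rollout history deployment/<deploy> -n <ns>
```

Rolling back buys time to investigate; it does not replace root-cause analysis with the checks below.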
For incident coordination patterns, see Incident response and on-call.
Workload-level checks
Start with the workload and pod conditions before diving into cluster internals.
```sh
kubectl get deploy,pods -n <ns>
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --tail=200
kubectl logs <pod> -n <ns> --previous   # if the container restarted
kubectl get events -n <ns> --sort-by='.lastTimestamp'
```

Common pod states and likely causes
| Symptom | Common causes | First checks |
|---|---|---|
| CrashLoopBackOff | app crash, bad env var/secret, probe mismatch | container logs, previous logs, startup/readiness paths |
| ImagePullBackOff | wrong image tag, auth issue, registry outage | image name/tag, pull secret, node egress |
| Pending | insufficient resources, affinity/taints, PVC not bound, quota | scheduler events, requests/limits, PVC state, namespace quota |
| OOMKilled | memory limit too low, leak, traffic spike | limits vs usage, kubectl top, recent load |
| Readiness failing | dependency unavailable, bad readiness path, startup too slow | probe config and app dependency health |
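To confirm which of these states applies without scanning the full describe output, you can pull a container's restart count and last termination reason directly with a JSONPath query (the pod and namespace names are placeholders):

```sh
# Prints restart count, then the last termination reason (e.g. OOMKilled, Error)
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```

A reason of `OOMKilled` points at the memory-limit row above; an empty reason with a high restart count usually means the app is crashing on startup.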
Platform and cluster checks
If many workloads are affected, verify node and control-plane-adjacent signals.
```sh
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl get events --sort-by='.lastTimestamp'
```

Focus areas:
- Node health: NotReady, disk/memory pressure, container runtime issues.
- Scheduling constraints: taints/tolerations, affinity, resource fragmentation.
- Policy/admission changes: new webhook or policy rejections.
- Namespace constraints: resource quota and limit range changes.
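Two of these focus areas can be checked in one pass each; for example, a custom-columns query surfaces taints across all nodes, and describing the quota shows remaining headroom (the namespace name is a placeholder):

```sh
# List every node with its taint keys to spot scheduling blockers at a glance
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Show used vs. hard limits for resource quotas in the namespace
kubectl describe resourcequota -n <ns>
```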
See Architecture for control-plane context.
Networking checks in Kubernetes
When symptoms are 5xx/timeouts or partial reachability:
```sh
kubectl get svc,endpoints -n <ns>
kubectl describe svc <service> -n <ns>
kubectl get networkpolicy -n <ns>
kubectl run debug --image=busybox -it --rm -- /bin/sh
```

From the debug shell, test DNS and service reachability. Then validate Ingress/service-mesh rules if present.
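Inside the busybox debug shell, the DNS and reachability tests typically look like this (the service, namespace, port, and `/healthz` path are placeholders for your own values):

```sh
# Resolve the service name through cluster DNS
nslookup <service>.<ns>.svc.cluster.local

# Fetch the service's health endpoint with a short timeout
# (busybox wget uses -T for the timeout in seconds)
wget -qO- -T 3 http://<service>.<ns>.svc.cluster.local:<port>/healthz
```

If DNS resolves but the fetch times out, suspect NetworkPolicy, a missing endpoint (no ready pods behind the Service), or a port mismatch rather than DNS.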
See Kubernetes Networking and Network troubleshooting flow.
GitOps and Helm drift checks
- Compare desired state in Git with live cluster objects.
- Verify Helm-rendered manifests vs what is running.
- Inspect recent sync history and health if using Argo CD/Flux.
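A minimal Helm drift check, assuming you have the chart source available locally (the release, chart path, and namespace are placeholders):

```sh
# Export the release's current values, re-render the chart with them,
# and diff the rendered manifests against what is live in the cluster
helm get values <release> -n <ns> -o yaml > /tmp/values.yaml
helm template <release> <chart-path> -n <ns> -f /tmp/values.yaml | kubectl diff -n <ns> -f -
```

An empty diff means the cluster matches the rendered chart; any output is drift worth explaining before you change anything else.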
Related: