
Troubleshooting and Debugging

First published by Atif Alam

Use this page when production behavior changes and you need a fast, structured path to isolate the cause and reduce impact.

  1. Confirm user impact: what is broken, for whom, and since when.
  2. Check recent changes: deploys, config, secret rotations, policy updates, node maintenance.
  3. Scope blast radius: one pod, one deployment, one namespace, or cluster-wide.
  4. Stabilize first: rollback, scale, traffic shift, or disable risky path if needed.
  5. Then investigate deeply with metrics, logs, traces, and Kubernetes events.
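Steps 2 and 4 above map to a few stock kubectl commands. A minimal sketch, assuming the suspect workload is a Deployment named `web` in namespace `prod` (both names are placeholders for illustration):

```sh
# Step 2: review recent rollouts to see whether a deploy lines up with the symptom.
kubectl rollout history deployment/web -n prod

# Step 4: stabilize first — revert to the previous revision and wait for it to settle.
kubectl rollout undo deployment/web -n prod
kubectl rollout status deployment/web -n prod

# Or buy headroom while investigating by scaling out.
kubectl scale deployment/web -n prod --replicas=6
```

Rolling back before digging in is usually the cheaper trade: you preserve the evidence (events, previous logs) while restoring service.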

For incident coordination patterns, see Incident response and on-call.

Start with the workload and pod conditions before diving into cluster internals.

```sh
kubectl get deploy,pods -n <ns>
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --tail=200
kubectl logs <pod> -n <ns> --previous   # if the container restarted
kubectl get events -n <ns> --sort-by='.lastTimestamp'
```

| Symptom | Common causes | First checks |
| --- | --- | --- |
| CrashLoopBackOff | app crash, bad env var/secret, probe mismatch | container logs, previous logs, startup/readiness paths |
| ImagePullBackOff | wrong image tag, auth issue, registry outage | image name/tag, pull secret, node egress |
| Pending | insufficient resources, affinity/taints, PVC not bound, quota | scheduler events, requests/limits, PVC state, namespace quota |
| OOMKilled | memory limit too low, leak, traffic spike | limits vs usage, kubectl top, recent load |
| Readiness failing | dependency unavailable, bad readiness path, startup too slow | probe config and app dependency health |
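To see which row of the table applies, the container status on the pod usually names the cause directly. A quick sketch using jsonpath:

```sh
# Why did the last container terminate? (e.g. OOMKilled, Error)
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'

# Restart count plus the current waiting reason
# (CrashLoopBackOff, ImagePullBackOff, ...)
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{" "}{.status.containerStatuses[0].state.waiting.reason}{"\n"}'
```

Both paths read the first container's status; for multi-container pods, drop the `[0]` index and inspect the full `containerStatuses` list.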

If many workloads are affected, verify node and control-plane-adjacent signals.

```sh
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl get events --sort-by='.lastTimestamp'
```

Focus areas:

  • Node health: NotReady, disk/memory pressure, container runtime issues.
  • Scheduling constraints: taints/tolerations, affinity, resource fragmentation.
  • Policy/admission changes: new webhook or policy rejections.
  • Namespace constraints: resource quota and limit range changes.

See Architecture for control-plane context.

When symptoms are 5xx/timeouts or partial reachability:

```sh
kubectl get svc,endpoints -n <ns>
kubectl describe svc <service> -n <ns>
kubectl get networkpolicy -n <ns>
kubectl run debug --image=busybox -it --rm -- /bin/sh
```

From the debug shell, test DNS and service reachability. Then validate Ingress/service-mesh rules if present.
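Inside the debug shell, busybox's `nslookup` and `wget` are enough for a first pass. A sketch, where `my-svc` stands in for the service you are testing:

```sh
# Inside the busybox debug pod:
nslookup kubernetes.default.svc.cluster.local   # is cluster DNS answering at all?
nslookup my-svc.<ns>.svc.cluster.local          # does the service name resolve?
wget -qO- --timeout=2 http://my-svc.<ns>:80/    # is the service reachable on its port?
```

If DNS resolves but the request times out, suspect missing endpoints (selector mismatch, no ready pods) or a NetworkPolicy blocking the path; if DNS fails, look at CoreDNS itself.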

See Kubernetes Networking and Network troubleshooting flow.

  • Compare desired state in Git with live cluster objects.
  • Verify Helm-rendered manifests vs what is running.
  • Inspect recent sync history and health if using Argo CD/Flux.
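The drift checks above can be sketched with standard tooling, assuming the `helm`, `argocd`, and `flux` CLIs are installed where relevant:

```sh
# Desired vs live for plain manifests tracked in Git
kubectl diff -f manifests/

# Helm-rendered manifests vs what is actually running
helm get manifest <release> -n <ns> | kubectl diff -f -

# Sync history and health, if using Argo CD or Flux
argocd app get <app>
flux get kustomizations -A
```

`kubectl diff` exits non-zero when differences are found, which makes it easy to wire into a quick drift check during an incident.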

Related: