Troubleshooting and Debugging
Use this page when production behavior changes and you need a fast, structured path to isolate the cause and reduce impact.
Triage order (first 10-15 minutes)
- Confirm user impact: what is broken, for whom, and since when.
- Check recent changes: deploys, config, secret rotations, policy updates, node maintenance.
- Scope blast radius: one pod, one deployment, one namespace, or cluster-wide.
- Stabilize first: rollback, scale, traffic shift, or disable risky path if needed.
- Then investigate deeply with metrics, logs, traces, and Kubernetes events.
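When "stabilize first" means undoing a bad deploy, a rollback sketch might look like this (the deployment and namespace names are placeholders):

```sh
# Roll the deployment back to its previous revision
kubectl rollout undo deployment/<deploy> -n <ns>

# Watch the rollout until pods settle
kubectl rollout status deployment/<deploy> -n <ns>

# Inspect revision history if you need a specific earlier revision
kubectl rollout history deployment/<deploy> -n <ns>
```

Rolling back buys time to investigate; it does not replace root-cause analysis with the checks below.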
For incident coordination patterns, see Incident response and on-call.
Workload-level checks
Start with the workload and pod conditions before diving into cluster internals.
```sh
kubectl get deploy,pods -n <ns>
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --tail=200
kubectl logs <pod> -n <ns> --previous   # if the container restarted
kubectl get events -n <ns> --sort-by='.lastTimestamp'
```

Common pod states and likely causes
| Symptom | Common causes | First checks |
|---|---|---|
| CrashLoopBackOff | app crash, bad env var/secret, probe mismatch | container logs, previous logs, startup/readiness paths |
| ImagePullBackOff | wrong image tag, auth issue, registry outage | image name/tag, pull secret, node egress |
| Pending | insufficient resources, affinity/taints, PVC not bound, quota | scheduler events, requests/limits, PVC state, namespace quota |
| OOMKilled | memory limit too low, leak, traffic spike | limits vs usage, kubectl top, recent load |
| Readiness failing | dependency unavailable, bad readiness path, startup too slow | probe config and app dependency health |
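To confirm which of these states applies without scanning the full describe output, you can pull a container's restart count and last termination reason directly with a JSONPath query (the pod and namespace names are placeholders):

```sh
# Prints restart count, then the last termination reason (e.g. OOMKilled, Error)
kubectl get pod <pod> -n <ns> \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```

A reason of `OOMKilled` points at the memory-limit row above; an empty reason with a high restart count usually means the app is crashing on startup.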
Platform and cluster checks
If many workloads are affected, verify node and control-plane-adjacent signals.
```sh
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl get events --sort-by='.lastTimestamp'
```

Focus areas:
- Node health: NotReady, disk/memory pressure, container runtime issues.
- Scheduling constraints: taints/tolerations, affinity, resource fragmentation.
- Policy/admission changes: new webhook or policy rejections.
- Namespace constraints: resource quota and limit range changes.
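Two of these focus areas can be checked in one pass each; for example, a custom-columns query surfaces taints across all nodes, and describing the quota shows remaining headroom (the namespace name is a placeholder):

```sh
# List every node with its taint keys to spot scheduling blockers at a glance
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Show used vs. hard limits for resource quotas in the namespace
kubectl describe resourcequota -n <ns>
```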
See Architecture for control-plane context.
Networking checks in Kubernetes
When symptoms are 5xx/timeouts or partial reachability:
```sh
kubectl get svc,endpoints -n <ns>
kubectl describe svc <service> -n <ns>
kubectl get networkpolicy -n <ns>
kubectl run debug --image=busybox -it --rm -- /bin/sh
```

From the debug shell, test DNS and service reachability. Then validate Ingress/service-mesh rules if present.
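Inside the busybox debug shell, the DNS and reachability tests typically look like this (the service, namespace, port, and `/healthz` path are placeholders for your own values):

```sh
# Resolve the service name through cluster DNS
nslookup <service>.<ns>.svc.cluster.local

# Fetch the service's health endpoint with a short timeout
# (busybox wget uses -T for the timeout in seconds)
wget -qO- -T 3 http://<service>.<ns>.svc.cluster.local:<port>/healthz
```

If DNS resolves but the fetch times out, suspect NetworkPolicy, a missing endpoint (no ready pods behind the Service), or a port mismatch rather than DNS.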
See Kubernetes Networking and Network troubleshooting flow.
GitOps and Helm drift checks
- Compare desired state in Git with live cluster objects.
- Verify Helm-rendered manifests vs what is running.
- Inspect recent sync history and health if using Argo CD/Flux.
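A minimal Helm drift check, assuming you have the chart source available locally (the release, chart path, and namespace are placeholders):

```sh
# Export the release's current values, re-render the chart with them,
# and diff the rendered manifests against what is live in the cluster
helm get values <release> -n <ns> -o yaml > /tmp/values.yaml
helm template <release> <chart-path> -n <ns> -f /tmp/values.yaml | kubectl diff -n <ns> -f -
```

An empty diff means the cluster matches the rendered chart; any output is drift worth explaining before you change anything else.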
Related: