EKS Troubleshooting Cheat Sheet
Symptom-driven debugging reference for EKS SRE work. Focus areas: networking and autoscaling.
Universal First Moves
Before diving into any specific symptom, run these:

    # Pod-level
    kubectl describe pod <pod> -n <ns>
    kubectl logs <pod> -n <ns> --previous          # last crash output
    kubectl logs <pod> -n <ns> -c <container> -f   # multi-container

    # Cluster-level events (most recent)
    kubectl get events -A --sort-by='.lastTimestamp' | tail -50

    # Node health
    kubectl get nodes -o wide
    kubectl describe node <node> | grep -A5 Conditions

    # Quick resource view
    kubectl top nodes
    kubectl top pods -A --sort-by=memory

Drop into a debug pod for in-cluster network testing:

    kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

For a stuck or dying container, use ephemeral debug containers:

    kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -n <ns>

Symptom 1: Pod Stuck in Pending

    kubectl describe pod <pod> -n <ns>   # check Events at the bottom
    kubectl get events -n <ns> --sort-by='.lastTimestamp'

| Event message | Cause | Fix |
|---|---|---|
| Insufficient cpu/memory | No node has capacity | Scale nodes or reduce requests |
| node(s) had untolerated taint | Taint mismatch | Add toleration or remove taint |
| node(s) didn't match node selector | Label/affinity mismatch | Check nodeSelector, nodeAffinity |
| pod has unbound immediate PVCs | PVC not bound | Check StorageClass and PV availability |
| 0/N nodes are available | All filtered out | Read the full message; usually a combination of constraints |
| No events / scheduler silent | Karpenter or CAS not provisioning | See Autoscaling section below |
EKS-specific check for IP exhaustion that appears as Pending:

    kubectl logs -n kube-system -l k8s-app=aws-node --tail=100 | grep -i "no available IP"

Symptom 2: CrashLoopBackOff

    kubectl logs <pod> -n <ns> --previous
    kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"

| Exit code | Meaning |
|---|---|
| 0 | Clean exit, but Kubernetes expected a long-running process |
| 1 | Generic application error; inspect app logs |
| 137 | SIGKILL (128 + 9), most often OOMKilled |
| 139 | Segfault, SIGSEGV (128 + 11) |
| 143 | SIGTERM (128 + 15); a shutdown that outlasts the grace period ends in SIGKILL (137) instead |
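
The signal rows in the table follow one rule: a container ended by signal N reports exit code 128 + N. A minimal sketch of that arithmetic follows; `exit_code_signal` is a hypothetical helper name used only for illustration, not a kubectl feature.

```shell
# Hypothetical helper: map a container exit code to the signal that ended it.
# Prints the signal number, or 0 when the code is an ordinary exit status.
exit_code_signal() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0
  fi
}

exit_code_signal 137   # prints 9  (SIGKILL)
exit_code_signal 143   # prints 15 (SIGTERM)
```

In bash, `kill -l <number>` prints the signal name for a number, which helps with less common codes.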
Liveness probe killing healthy pods:

    kubectl describe pod <pod> -n <ns> | grep -A3 Liveness
    # Look for "Liveness probe failed" in events

Common fix: increase initialDelaySeconds, raise failureThreshold, or add a startupProbe for slow-booting apps.
Symptom 3: ImagePullBackOff (EKS)

    kubectl describe pod <pod> -n <ns> | grep -A3 -i "failed to pull"

EKS-specific causes (in typical frequency order):
- Node IAM role missing ECR permissions (`AmazonEC2ContainerRegistryReadOnly`)
- Cross-account ECR pull missing source repo policy
- Architecture mismatch (`amd64` image on `arm64` node, or reverse)
- Docker Hub rate limit
- Missing `imagePullSecret` in the same namespace
    # Verify ECR auth from a host that has both the AWS CLI and Docker
    # (EKS-optimized AMIs for Kubernetes 1.24 and later run containerd only)
    aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

Symptom 4: Service Not Reachable

Endpoints check (always step 1):

    kubectl get endpoints <svc> -n <ns>
    kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc>

If endpoints are empty:
- Selector does not match pod labels -> `kubectl get pods -n <ns> --show-labels`
- Pods are not `Ready` -> endpoints include only ready pods
In-cluster DNS test:

    kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
    # Inside:
    nslookup <svc>.<ns>.svc.cluster.local
    curl -v http://<svc>.<ns>.svc.cluster.local:<port>

CoreDNS checks:

    kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
    kubectl get cm coredns -n kube-system -o yaml

Symptom 5: Node NotReady

    kubectl describe node <node> | grep -A10 Conditions

Condition quick meanings:
- MemoryPressure: eviction risk
- DiskPressure: disk cleanup / image GC pressure
- PIDPressure: process count pressure
- NetworkUnavailable: CNI not initialized
Get onto the node (EKS):

    aws ssm start-session --target <instance-id>
    sudo journalctl -u kubelet -f
    sudo crictl ps -a
    sudo crictl logs <container-id>

EKS join failures (node never becomes Ready):
- Check `/var/log/cloud-init-output.log`
- Verify `aws-auth` ConfigMap mapping (legacy) or EKS Access Entries (new model)
- Check VPC CNI pod startup and node/control-plane SG rules
Networking Deep Dive (EKS)
VPC CNI Mental Model

AWS VPC CNI assigns each pod a real VPC IP via node ENIs:
- Pods become first-class VPC IPs
- Pod density is limited by instance ENI/IP limits
- Subnet IP planning directly controls pod scale
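
The density limit can be worked out by hand. Without prefix delegation, the commonly documented figure is ENIs x (IPv4 addresses per ENI - 1) + 2: each ENI's primary address is not handed to pods, and the + 2 covers host-network pods such as aws-node and kube-proxy. The sketch below only does that arithmetic; `max_pods` is a hypothetical helper, and the ENI figures in the usage lines are assumptions taken from the published EC2 limits, so verify them for your instance type.

```shell
# Hypothetical helper: default VPC CNI pod ceiling for one node
# (secondary-IP mode, no prefix delegation).
max_pods() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 10   # m5.large, assuming 3 ENIs x 10 IPv4 each: prints 29
max_pods 3 6    # t3.medium, assuming 3 ENIs x 6 IPv4 each: prints 17
```

Compare the result with the `allocatable.pods` value the node reports; a mismatch usually points at a custom max-pods setting or at prefix delegation being enabled.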
Check current IP allocation:

    kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable_pods: .status.allocatable.pods}'

    # On aws-node pod:
    kubectl exec -n kube-system <aws-node-pod> -- /app/grpc-health-probe -addr=:50051
    kubectl logs -n kube-system <aws-node-pod> | grep -i "ip address"

IP Exhaustion Playbook

Symptom: pods stuck in ContainerCreating with CNI errors, or no assignable pod IPs.

    kubectl describe pod <pod> -n <ns> | grep -i "failed to assign"
    kubectl logs -n kube-system -l k8s-app=aws-node --tail=200 | grep -iE "ip|eni|exhaust"

Remediations (prefer in this order):
- Prefix delegation on supported instances:

      kubectl set env ds aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

- Custom networking (pod subnets via ENIConfig):

      kubectl set env ds aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

- Add secondary VPC CIDR for pod subnets
- Switch to an overlay-based approach if VPC-native pod IP is not required
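
Whichever remediation is chosen, subnet size caps the number of pod IPs that can exist. AWS reserves five addresses in every subnet (the first four and the last), so the usable count for a prefix length is 2^(32 - prefix) - 5. A small sketch of that sizing arithmetic; `subnet_usable_ips` is a hypothetical helper name.

```shell
# Hypothetical helper: usable IPv4 addresses in an AWS subnet of the given
# prefix length (AWS reserves 5 addresses per subnet).
subnet_usable_ips() {
  local prefix_len=$1
  echo $(( (1 << (32 - prefix_len)) - 5 ))
}

subnet_usable_ips 24   # prints 251
subnet_usable_ips 20   # prints 4091
```

For the live figure rather than the ceiling, `aws ec2 describe-subnets --subnet-ids <id> --query 'Subnets[].AvailableIpAddressCount'` reports how many addresses are still free.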
Pod-to-Pod Traffic Path (Cross-Node)

    Pod A (eth0) -> veth -> host namespace -> host route table / ENI routing -> node A ENI -> VPC routing -> node B ENI -> host route -> pod B veth -> Pod B (eth0)

No overlay and no pod-to-pod NAT in the default VPC CNI path, so VPC Flow Logs show pod IPs.
kube-proxy Modes

    kubectl get cm kube-proxy-config -n kube-system -o yaml | grep mode

- iptables: default, rule-walk based
- ipvs: hash lookup model, better at larger service counts
- eBPF datapath (for example Cilium replacing kube-proxy): improved performance/visibility trade-offs
LoadBalancer / Ingress Issues (AWS Load Balancer Controller)

    kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=200
    kubectl describe svc <svc> -n <ns>
    kubectl describe ingress <ing> -n <ns>

| Symptom | Likely cause |
|---|---|
| ALB not provisioning | Missing subnet tags (kubernetes.io/role/elb or kubernetes.io/role/internal-elb) |
| Targets unhealthy | SG rules or readiness path mismatch |
| ALB 504 | ALB idle timeout lower than backend response time |
| Wrong target type | instance mode hairpins via NodePort; use ip mode where supported |
Subnet tag checklist:

    aws ec2 describe-subnets --subnet-ids <id> --query 'Subnets[].Tags'

Required tags:
- kubernetes.io/cluster/<cluster-name> = shared|owned
- kubernetes.io/role/elb = 1 (public)
- kubernetes.io/role/internal-elb = 1 (internal)
NetworkPolicy on EKS
The VPC CNI does not enforce Kubernetes NetworkPolicy unless its native network policy feature is enabled. Common options:
- Calico policy-only mode with VPC CNI
- VPC CNI native Network Policy feature (where enabled)
- Cilium for broader L3-L7 policy

Verify enforcement from inside a restricted namespace:

    kubectl run test --rm -it --image=nicolaka/netshoot -n <restricted-ns> -- curl <target>

Autoscaling Deep Dive

For concepts, install paths, and metric pipelines, see Autoscaling on EKS. Custom metrics: Prometheus Adapter for HPA, Container Insights for HPA.
Three orthogonal layers:

    HPA / VPA / KEDA -> pod count and size
    CAS / Karpenter  -> node count and type
    EC2 / Fargate    -> underlying compute supply

HPA (Horizontal Pod Autoscaler)

    kubectl get hpa -A
    kubectl describe hpa <name> -n <ns>

Debug “HPA not scaling”:

| Symptom in describe | Likely cause |
|---|---|
| unable to get metrics | metrics-server unavailable |
| current: <unknown> | missing pod resource requests |
| desired equals current under load | target threshold/metric mismatch |
| scaling thrash | behavior windows not tuned |

    kubectl top pods -A
    kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

For queue/event-driven scaling, KEDA is commonly simpler than custom adapter plumbing. Metric install guides: Prometheus Adapter, Container Insights.
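
When `describe hpa` shows desired equal to current under load, it helps to redo the controller's arithmetic by hand. The documented core formula is desired = ceil(currentReplicas x currentMetric / targetMetric), and no change is made while the ratio stays within a tolerance (0.1 by default). The sketch below is a simplification under those assumptions: `hpa_desired_replicas` is a hypothetical helper that uses integer metric values and ignores min/max bounds, pod readiness, and stabilization windows.

```shell
# Hypothetical helper: approximate the HPA replica calculation.
# Arguments: current replica count, current metric value, target metric value.
hpa_desired_replicas() {
  local replicas=$1 current=$2 target=$3
  local diff=$(( current > target ? current - target : target - current ))
  # Within the assumed default 10% tolerance: the replica count is left alone.
  if [ $(( diff * 10 )) -le "$target" ]; then
    echo "$replicas"
    return
  fi
  # Integer ceiling of replicas * current / target.
  echo $(( (replicas * current + target - 1) / target ))
}

hpa_desired_replicas 4 90 60   # prints 6: 50% over target, scale up
hpa_desired_replicas 4 63 60   # prints 4: 5% over target, inside tolerance
```

If the observed metric sits only slightly above the target, as in the second call, the HPA is behaving as designed and the fix is a lower target or a different metric.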
VPA (Vertical Pod Autoscaler)

Modes:
- Off: recommendations only
- Initial: set on pod creation
- Auto: can evict/recreate for resize
Avoid running HPA and VPA on the same signal (for example both on CPU).
Cluster Autoscaler vs Karpenter

| | Cluster Autoscaler | Karpenter |
|---|---|---|
| Provisioning unit | ASG/node groups | direct EC2 launches |
| Instance selection | pre-defined by group | per-pod-fit selection |
| Scale-up speed | usually slower | usually faster |
| Best fit | predictable regulated environments | dynamic cost-optimized environments |
Karpenter Debugging

    # Adjust -n if Karpenter is installed in another namespace (for example kube-system)
    kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=200
    kubectl get nodepool
    kubectl get ec2nodeclass
    kubectl get nodeclaim

If pods are Pending and not provisioning:
- Check NodePool requirements and matching
- Check pod constraints (`nodeSelector`, taints, topology)
- Check EC2 quotas
- Check subnet IP capacity
- Check NodeClass IAM/permissions and events

    kubectl describe pod <pending-pod>

Consolidation blockers commonly include restrictive PDBs, non-evictable pods, and long termination windows.
Cluster Autoscaler Debugging

    kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200
    kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

Common gotchas:
- ASG max size reached
- mixed instance simulation mismatch
- missing autoscaler discovery tags
- scale-down blocked by local-storage pods, restrictive PDBs, or the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation
Spot Interruption Handling

- Karpenter-managed environments: native interruption handling and drain path
- CAS-based environments: verify Node Termination Handler daemonset

    kubectl get ds -n kube-system aws-node-termination-handler

Symptom 6: Intermittent 5xx (Layered Approach)

    # 1. Are specific pods unhealthy?
    kubectl get pods -n <ns> -o wide
    kubectl top pods -n <ns>

    # 2. CPU throttling (cgroup v2 path; on cgroup v1 nodes use /sys/fs/cgroup/cpu/cpu.stat)
    kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu.stat | grep throttled

    # 3. Recent restarts ([]? tolerates pods that have no containerStatuses yet)
    kubectl get pods -n <ns> -o json | jq '.items[] | {name: .metadata.name, restarts: [.status.containerStatuses[]?.restartCount]}'

    # 4. ALB target health
    aws elbv2 describe-target-health --target-group-arn <arn>
    # 5. Correlate with deploys
    kubectl rollout history deployment/<name> -n <ns>

Symptom 7: IRSA Not Working

    # Service account annotation
    kubectl get sa <name> -n <ns> -o yaml | grep eks.amazonaws.com/role-arn

    # Pod service account usage
    kubectl get pod <pod> -n <ns> -o yaml | grep serviceAccountName

    # Projected token / env
    kubectl describe pod <pod> -n <ns> | grep -A3 "aws-iam-token"
    kubectl exec <pod> -n <ns> -- env | grep -E "AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE"

    # Effective identity
    kubectl exec <pod> -n <ns> -- aws sts get-caller-identity

The trust policy must match the exact OIDC provider, service account subject, and audience:

    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:sub": "system:serviceaccount:NAMESPACE:SA_NAME",
          "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:aud": "sts.amazonaws.com"
        }
      }
    }

Alternative model to know: EKS Pod Identity.
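
For IRSA, a trust-policy mismatch is easiest to spot by comparing the claims in the pod's projected token with the policy conditions. The projected token is a JWT, so its middle segment is base64url-encoded JSON carrying `sub` and `aud`. The sketch below is one way to read it; `jwt_payload` and `irsa_expected_sub` are hypothetical helper names, and the decoding assumes a `base64` binary that accepts `-d`.

```shell
# Hypothetical helper: print the JSON payload of a JWT read from stdin.
jwt_payload() {
  local payload pad
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so base64 -d accepts the input.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    payload="${payload}="
    pad=$((pad - 1))
  done
  printf '%s' "$payload" | base64 -d
}

# Hypothetical helper: the subject string the trust policy must list
# for a given namespace and service account.
irsa_expected_sub() {
  echo "system:serviceaccount:$1:$2"
}
```

Typical use is `kubectl exec <pod> -n <ns> -- sh -c 'cat "$AWS_WEB_IDENTITY_TOKEN_FILE"' | jwt_payload`, then checking that the printed `sub` equals `irsa_expected_sub <ns> <name>` and that `aud` contains the audience from the policy.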
Symptom 8: API Server / Control Plane Slow
EKS manages the control plane, but workloads can still overload the API server.
If enabled, inspect EKS CloudWatch control-plane logs:
- audit
- api
- authenticator
- controllerManager
- scheduler
In-cluster API metrics samples:

    kubectl get --raw /metrics | grep apiserver_request_duration_seconds | grep -v "#"
    # Busiest series last: sort numerically on the sample value (second whitespace-separated field)
    kubectl get --raw /metrics | grep '^apiserver_request_total' | sort -g -k2 | tail -20

Frequent stressors:
- high-churn operators watching broad scope
- oversized ConfigMaps or Secrets
- many CRDs with expensive reconciliation loops
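
To see which verbs and resources account for the request volume behind these stressors, the `apiserver_request_total` samples can be summed per verb and resource. The sketch below is one way to do that; `top_api_requests` is a hypothetical helper that reads Prometheus text format on stdin and assumes each series carries `verb` and `resource` labels.

```shell
# Hypothetical helper: sum apiserver_request_total by verb and resource.
# Reads metrics text on stdin; optional first argument limits the rows (default 20).
top_api_requests() {
  grep '^apiserver_request_total{' |
  awk '{
    verb = "?"; resource = "?"
    # Anchor on "{" or "," so "subresource" is not mistaken for "resource".
    if (match($0, /[{,]verb="[^"]*"/))     verb = substr($0, RSTART + 7, RLENGTH - 8)
    if (match($0, /[{,]resource="[^"]*"/)) resource = substr($0, RSTART + 11, RLENGTH - 12)
    total[verb " " resource] += $NF
  }
  END { for (k in total) printf "%d %s\n", total[k], k }' |
  sort -rn | head -n "${1:-20}"
}
```

Usage: `kubectl get --raw /metrics | top_api_requests 10`. Counters are cumulative since API server start, so compare two runs a few minutes apart to see the current rate.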
3 AM Incident Playbook

- Acknowledge and declare incident ownership
- Establish blast radius (namespace/service/customer impact)
- Check recent changes quickly:

      # rollout history is namespaced, so run it per namespace
      kubectl rollout history deployment -n <ns> | grep -v REVISION
      # Cluster-wide view: the newest ReplicaSets mark the most recent rollouts
      kubectl get rs -A --sort-by=.metadata.creationTimestamp | tail -20

- Check AWS Service Health
- Roll back suspicious changes rapidly:

      kubectl rollout undo deployment/<name> -n <ns>

- Communicate status every 15 minutes
- Capture evidence for post-incident analysis
Quick Reference: High-Frequency Commands

    kubectl get events -A --sort-by='.lastTimestamp' | tail -50
    kubectl describe pod <pod> -n <ns>
    kubectl logs <pod> -n <ns> --previous
    kubectl logs <pod> -n <ns> -f --tail=100
    kubectl get pods -n <ns> -o wide
    kubectl get endpoints <svc> -n <ns>
    kubectl top pods -A --sort-by=memory
    kubectl top nodes
    kubectl describe node <node>
    kubectl get hpa -A
    kubectl describe hpa <name> -n <ns>
    kubectl rollout status deployment/<name> -n <ns>
    kubectl rollout history deployment/<name> -n <ns>
    kubectl rollout undo deployment/<name> -n <ns>
    kubectl exec -it <pod> -n <ns> -- sh
    kubectl debug -it <pod> -n <ns> --image=nicolaka/netshoot --target=<container>
    kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
    kubectl get nodepool
    kubectl get nodeclaim
    kubectl logs -n kube-system -l k8s-app=aws-node
    kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter