
EKS Troubleshooting Cheat Sheet

By Atif Alam

Symptom-driven debugging reference for EKS SRE work. Focus areas: networking and autoscaling.


Before diving into any specific symptom, run these:

Terminal window
# Pod-level
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous # last crash output
kubectl logs <pod> -n <ns> -c <container> -f # multi-container
# Cluster-level events (most recent)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Node health
kubectl get nodes -o wide
kubectl describe node <node> | grep -A10 Conditions
# Quick resource view
kubectl top nodes
kubectl top pods -A --sort-by=memory

Drop into a debug pod for in-cluster network testing:

Terminal window
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

For a stuck or dying container, use ephemeral debug containers:

Terminal window
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -n <ns>

Terminal window
kubectl describe pod <pod> -n <ns> # check Events at the bottom
kubectl get events -n <ns> --sort-by='.lastTimestamp'
Event message | Cause | Fix
Insufficient cpu/memory | No node has capacity | Scale nodes or reduce requests
node(s) had untolerated taint | Taint mismatch | Add toleration or remove taint
node(s) didn't match node selector | Label/affinity mismatch | Check nodeSelector, nodeAffinity
pod has unbound immediate PVCs | PVC not bound | Check StorageClass and PV availability
0/N nodes are available | All filtered out | Read full message, usually a combination of constraints
No events / scheduler silent | Karpenter or CAS not provisioning | See Autoscaling section below

EKS-specific check for IP exhaustion that appears as Pending:

Terminal window
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100 | grep -i "no available IP"

Terminal window
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"
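Exit code | Meaning
0 | Clean exit, but Kubernetes expected a long-running process
1 | Generic application error, inspect app logs
137 | SIGKILL (128 + 9): usually OOMKilled, or force-killed after the termination grace period expired
139 | Segfault (SIGSEGV, 128 + 11)
143 | SIGTERM (128 + 15): process was asked to stop, e.g. pod deletion, eviction, or rollout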
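Exit codes above 128 follow the convention 128 + signal number, which is where 137, 139, and 143 come from. A minimal sketch of that decoding; `explain_exit_code` is a hypothetical helper name, not a kubectl feature:

```shell
# Hypothetical helper: classify a container exit code.
# Assumes the common convention that codes 129-255 mean
# "killed by signal (code - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ] && [ "$code" -lt 256 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application-defined error"
  fi
}

explain_exit_code 137   # signal 9  (SIGKILL)
explain_exit_code 139   # signal 11 (SIGSEGV)
explain_exit_code 143   # signal 15 (SIGTERM)
```

The code to decode can be read with kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'. For 137, also check the terminated reason field to confirm whether it was OOMKilled.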

Liveness probe killing healthy pods:

Terminal window
kubectl describe pod <pod> -n <ns> | grep -A3 Liveness
# Look for "Liveness probe failed" in events

Common fix: increase initialDelaySeconds, raise failureThreshold, or add startupProbe for slow-booting apps.
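A startupProbe holds off liveness checks until it succeeds, and the time it allows is roughly failureThreshold × periodSeconds. A quick arithmetic sketch for sizing it; the function name and sample values are illustrative, and initialDelaySeconds and probe timeouts are ignored:

```shell
# Sketch: approximate seconds a startupProbe allows before the
# container is restarted. Both arguments are values you choose
# in the probe spec.
startup_budget_seconds() {
  failure_threshold=$1
  period_seconds=$2
  echo $((failure_threshold * period_seconds))
}

startup_budget_seconds 30 10   # 300 seconds (5 minutes)
startup_budget_seconds 12 5    # 60 seconds
```

Pick a budget comfortably above the slowest observed boot time, then keep the liveness probe itself tight.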


Terminal window
kubectl describe pod <pod> -n <ns> | grep -A3 -i "failed to pull"

EKS-specific causes (in typical frequency order):

  1. Node IAM role missing ECR permissions (AmazonEC2ContainerRegistryReadOnly)
  2. Cross-account ECR pull missing source repo policy
  3. Architecture mismatch (amd64 image on arm64 node, or reverse)
  4. Docker Hub rate limit
  5. Missing imagePullSecret in the same namespace
Terminal window
# Verify ECR auth from a node
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

Endpoints check (always step 1):

Terminal window
kubectl get endpoints <svc> -n <ns>
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc>

If endpoints are empty:

  • Selector does not match pod labels -> kubectl get pods -n <ns> --show-labels
  • Pods are not Ready -> endpoints include only ready pods

In-cluster DNS test:

Terminal window
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
# Inside:
nslookup <svc>.<ns>.svc.cluster.local
curl -v http://<svc>.<ns>.svc.cluster.local:<port>

CoreDNS checks:

Terminal window
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
kubectl get cm coredns -n kube-system -o yaml

Terminal window
kubectl describe node <node> | grep -A10 Conditions

Condition quick meanings:

  • MemoryPressure: eviction risk
  • DiskPressure: disk cleanup / image GC pressure
  • PIDPressure: process count pressure
  • NetworkUnavailable: CNI not initialized

Get onto the node (EKS):

Terminal window
aws ssm start-session --target <instance-id>
sudo journalctl -u kubelet -f
sudo crictl ps -a
sudo crictl logs <container-id>

EKS join failures (node never becomes Ready):

  • Check /var/log/cloud-init-output.log
  • Verify aws-auth ConfigMap mapping (legacy) or EKS Access Entries (new model)
  • Check VPC CNI pod startup and node/control-plane SG rules

AWS VPC CNI assigns each pod a real VPC IP via node ENIs:

  • Pods get first-class, routable VPC IPs
  • Pod density is limited by instance ENI/IP limits
  • Subnet IP planning directly controls pod scale
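The pod density limit can be estimated from the instance's ENI count and IPv4 addresses per ENI. The sketch below uses the commonly cited formula for the default secondary-IP mode, ENIs × (IPs per ENI − 1) + 2. The function name is hypothetical, and the ENI/IP figures should be taken from the EC2 documentation for your instance type:

```shell
# Sketch: max pods per node in default (secondary IP) mode.
# One IP per ENI is the ENI's own primary address; the +2 covers
# host-network pods (aws-node, kube-proxy) that use no pod IP.
max_pods() {
  enis=$1
  ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 10   # m5.large: 3 ENIs x 10 IPs -> 29
max_pods 3 6    # t3.medium: 3 ENIs x 6 IPs -> 17
```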

Check current IP allocation:

Terminal window
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable_pods: .status.allocatable.pods}'
# On aws-node pod:
kubectl exec -n kube-system <aws-node-pod> -- /app/grpc-health-probe -addr=:50051
kubectl logs -n kube-system <aws-node-pod> | grep -i "ip address"

Symptom: pods stuck in ContainerCreating with CNI errors, or no assignable pod IPs.

Terminal window
kubectl describe pod <pod> -n <ns> | grep -i "failed to assign"
kubectl logs -n kube-system -l k8s-app=aws-node --tail=200 | grep -iE "ip|eni|exhaust"

Remediations (prefer in this order):

  1. Prefix delegation on supported instances:
    Terminal window
    kubectl set env ds aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
  2. Custom networking (pod subnets via ENIConfig):
    Terminal window
    kubectl set env ds aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
  3. Add secondary VPC CIDR for pod subnets
  4. Switch to an overlay-based approach if VPC-native pod IP is not required
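With prefix delegation, each secondary IP slot on an ENI holds a /28 prefix (16 addresses) instead of a single IP, so raw address capacity per node grows by roughly 16×. A sketch of the arithmetic, using m5.large-style ENI figures and a max-pods cap passed in as a parameter (the cap values shown are assumptions for illustration; use whatever your node configuration sets):

```shell
# Sketch: pod IP capacity with prefix delegation enabled.
# Each of the (ips_per_eni - 1) slots holds a /28 prefix = 16 IPs.
# The result is limited by the kubelet max-pods cap you pass in.
prefix_delegation_capacity() {
  enis=$1
  ips_per_eni=$2
  cap=$3
  raw=$(( enis * (ips_per_eni - 1) * 16 + 2 ))
  if [ "$raw" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$raw"
  fi
}

prefix_delegation_capacity 3 10 110    # raw 434, capped to 110
prefix_delegation_capacity 3 10 1000   # 434 (cap not reached)
```

In practice the max-pods setting, not the address math, becomes the limit once prefix delegation is on, so subnet space is consumed in /28 blocks rather than single IPs.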
Pod A (eth0) -> veth -> host namespace
-> host route table / ENI routing
-> node A ENI -> VPC routing -> node B ENI
-> host route -> pod B veth -> Pod B (eth0)

No overlay and no pod-to-pod NAT in the default VPC CNI path, so VPC Flow Logs show pod IPs.

Terminal window
kubectl get cm kube-proxy-config -n kube-system -o yaml | grep mode
  • iptables: default, rule-walk based
  • ipvs: hash lookup model, better at larger service counts
  • eBPF datapath (for example Cilium replacing kube-proxy): improved performance/visibility trade-offs

LoadBalancer / Ingress Issues (AWS Load Balancer Controller)

Terminal window
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=200
kubectl describe svc <svc> -n <ns>
kubectl describe ingress <ing> -n <ns>
Symptom | Likely cause
ALB not provisioning | Missing subnet tags (kubernetes.io/role/elb or kubernetes.io/role/internal-elb)
Targets unhealthy | SG rules or readiness path mismatch
ALB 504 | ALB idle timeout lower than backend response time
Wrong target type | instance mode hairpins via NodePort; use ip mode where supported

Subnet tag checklist:

Terminal window
aws ec2 describe-subnets --subnet-ids <id> --query 'Subnets[].Tags'

Required tags:

  • kubernetes.io/cluster/<cluster-name> = shared|owned
  • kubernetes.io/role/elb = 1 (public)
  • kubernetes.io/role/internal-elb = 1 (internal)

By default, the VPC CNI does not enforce Kubernetes NetworkPolicy; policies are accepted by the API but have no effect until an enforcement engine is in place. Common options:

  • Calico policy-only mode with VPC CNI
  • VPC CNI native Network Policy feature (where enabled)
  • Cilium for broader L3-L7 policy
Terminal window
kubectl run test --rm -it --image=nicolaka/netshoot -n <restricted-ns> -- curl <target>

For concepts, install paths, and metric pipelines, see Autoscaling on EKS. Custom metrics: Prometheus Adapter for HPA, Container Insights for HPA.

Three layers that are orthogonal:

HPA / VPA / KEDA -> pod count and size
CAS / Karpenter -> node count and type
EC2 / Fargate -> underlying compute supply
Terminal window
kubectl get hpa -A
kubectl describe hpa <name> -n <ns>

Debug “HPA not scaling”:

Symptom in describe | Likely cause
unable to get metrics | metrics-server unavailable
current: <unknown> | missing pod resource requests
desired equals current under load | target threshold/metric mismatch
scaling thrash | behavior windows not tuned
Terminal window
kubectl top pods -A
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
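When desired equals current under load, it helps to redo the controller's arithmetic by hand. HPA's documented core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch using integer math; the function name is hypothetical, and tolerance, min/max bounds, and behavior windows are deliberately ignored:

```shell
# Sketch: HPA core formula with integer ceiling division.
# Metric values must share a unit (e.g. both in millicores or both in %).
hpa_desired_replicas() {
  current_replicas=$1
  current_metric=$2
  target_metric=$3
  echo $(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
}

hpa_desired_replicas 4 200 100   # 8  (metric at 2x target doubles replicas)
hpa_desired_replicas 3 90 60     # 5  (ceil of 4.5)
hpa_desired_replicas 4 100 100   # 4  (on target, no change)
```

If the hand-computed value differs from what describe shows, look at maxReplicas and the behavior stabilization windows next.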

For queue/event-driven scaling, KEDA is commonly simpler than custom adapter plumbing. Metric install guides: Prometheus Adapter, Container Insights.

VPA update modes:

  • Off: recommendations only
  • Initial: set on pod creation
  • Auto: can evict/recreate for resize

Avoid running HPA and VPA on the same signal (for example both on CPU).

Aspect | Cluster Autoscaler | Karpenter
Provisioning unit | ASG/node groups | direct EC2 launches
Instance selection | pre-defined by group | per-pod-fit selection
Scale-up speed | usually slower | usually faster
Best fit | predictable regulated environments | dynamic cost-optimized environments
Terminal window
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=200
kubectl get nodepool
kubectl get ec2nodeclass
kubectl get nodeclaim

If pods are Pending and not provisioning:

  1. Check NodePool requirements and matching
  2. Check pod constraints (nodeSelector, taints, topology)
  3. Check EC2 quotas
  4. Check subnet IP capacity
  5. Check NodeClass IAM/permissions and events
Terminal window
kubectl describe pod <pending-pod>
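For the subnet IP capacity check, AWS reserves five addresses in every subnet, so usable IPv4 addresses are 2^(32 − prefix length) − 5. A sketch of that arithmetic (hypothetical function name):

```shell
# Sketch: usable IPv4 addresses in an AWS subnet of a given prefix length.
# AWS reserves the first four addresses and the last one in each subnet.
subnet_usable_ips() {
  prefix_len=$1
  echo $(( (1 << (32 - prefix_len)) - 5 ))
}

subnet_usable_ips 24   # 251
subnet_usable_ips 20   # 4091
subnet_usable_ips 28   # 11
```

The live free count for a subnet is reported by aws ec2 describe-subnets --subnet-ids <id> --query 'Subnets[].AvailableIpAddressCount'; compare it with the number of pods and nodes you expect to add.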

Consolidation blockers commonly include restrictive PDBs, non-evictable pods, and long termination windows.

Terminal window
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

Common gotchas:

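  • ASG max size reached
  • mixed-instance ASG whose instance types differ in CPU/memory (CAS simulates scale-up from a single node template, so sizes must match)
  • missing autoscaler discovery tags (k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name>)
  • scale-down blocked by local-storage pods, restrictive PDBs, or the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation

Interruption handling:

  • Karpenter-managed environments: check Karpenter's native interruption handling and drain path
  • CAS-based environments: verify the Node Termination Handler daemonset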
Terminal window
kubectl get ds -n kube-system aws-node-termination-handler

Symptom 6: Intermittent 5xx (Layered Approach)

Terminal window
# 1. Are specific pods unhealthy?
kubectl get pods -n <ns> -o wide
kubectl top pods -n <ns>
# 2. CPU throttling (cgroup v2 path; on cgroup v1 nodes read /sys/fs/cgroup/cpu/cpu.stat)
kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu.stat | grep throttled
# 3. Recent restarts
kubectl get pods -n <ns> -o json | jq '.items[] | {name: .metadata.name, restarts: [.status.containerStatuses[]?.restartCount]}'
# 4. ALB target health
aws elbv2 describe-target-health --target-group-arn <arn>
# 5. Correlate with deploys
kubectl rollout history deployment/<name> -n <ns>
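The throttled counters from step 2 are easier to judge as a ratio. A sketch that turns cpu.stat output into the percentage of CFS periods that were throttled; `throttle_pct` is a hypothetical helper, and the sample numbers are made up for illustration:

```shell
# Sketch: percentage of CFS periods in which the container was throttled.
# Reads cpu.stat text on stdin and uses the nr_periods / nr_throttled fields.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", throttled * 100 / periods
      else print 0
    }'
}

# Made-up sample input: 50 of 200 periods throttled -> 25
printf 'nr_periods 200\nnr_throttled 50\nthrottled_usec 123456\n' | throttle_pct
```

Against a live pod, pipe the output of the cat command from step 2 into throttle_pct instead of the sample. The counters are cumulative since container start, so sample twice and compare if you need a recent rate.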

Terminal window
# Service account annotation
kubectl get sa <name> -n <ns> -o yaml | grep eks.amazonaws.com/role-arn
# Pod service account usage
kubectl get pod <pod> -n <ns> -o yaml | grep serviceAccountName
# Projected token / env
kubectl describe pod <pod> -n <ns> | grep -A3 "aws-iam-token"
kubectl exec <pod> -n <ns> -- env | grep -E "AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE"
# Effective identity
kubectl exec <pod> -n <ns> -- aws sts get-caller-identity

Trust policy must match exact OIDC provider, service account subject, and audience:

{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:sub": "system:serviceaccount:NAMESPACE:SA_NAME",
      "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:aud": "sts.amazonaws.com"
    }
  }
}
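The two strings most often mistyped in this policy are the provider path and the sub value. A sketch that derives both from inputs you already have; the function names and sample values (namespace payments, service account api, cluster ID EXAMPLE) are hypothetical:

```shell
# Sketch: build the strings the trust policy must match exactly.

# Provider path = OIDC issuer URL without the https:// scheme.
oidc_provider_path() {
  issuer_url=$1
  echo "${issuer_url#https://}"
}

# Subject claim for a service account token.
irsa_sub() {
  namespace=$1
  sa_name=$2
  echo "system:serviceaccount:${namespace}:${sa_name}"
}

oidc_provider_path "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
irsa_sub payments api
```

The issuer URL for a real cluster comes from aws eks describe-cluster --name <cluster> --query 'cluster.identity.oidc.issuer' --output text. Compare the derived strings character by character with the Federated ARN suffix and the condition keys in the role's trust policy.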

Alternative model to know: EKS Pod Identity.


Symptom 8: API Server / Control Plane Slow


Control plane is managed in EKS, but workload behavior can still overwhelm API usage.

If enabled, inspect EKS CloudWatch control-plane logs:

  • audit
  • api
  • authenticator
  • controllerManager
  • scheduler

In-cluster API metrics samples:

Terminal window
kubectl get --raw /metrics | grep apiserver_request_duration_seconds | grep -v "#"
kubectl get --raw /metrics | grep apiserver_request_total | sort -t'"' -k4 | tail -20

Frequent stressors:

  • high-churn operators watching broad scope
  • oversized ConfigMaps or Secrets
  • many CRDs with expensive reconciliation loops

  1. Acknowledge and declare incident ownership
  2. Establish blast radius (namespace/service/customer impact)
  3. Check recent changes quickly:
    Terminal window
    kubectl get rs -A --sort-by=.metadata.creationTimestamp | tail -20 # newest ReplicaSets = most recent rollouts
  4. Check AWS Service Health
  5. Roll back suspicious changes rapidly:
    Terminal window
    kubectl rollout undo deployment/<name> -n <ns>
  6. Communicate status every 15 minutes
  7. Capture evidence for post-incident analysis

Terminal window
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -f --tail=100
kubectl get pods -n <ns> -o wide
kubectl get endpoints <svc> -n <ns>
kubectl top pods -A --sort-by=memory
kubectl top nodes
kubectl describe node <node>
kubectl get hpa -A
kubectl describe hpa <name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
kubectl exec -it <pod> -n <ns> -- sh
kubectl debug -it <pod> -n <ns> --image=nicolaka/netshoot --target=<container>
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
kubectl get nodepool
kubectl get nodeclaim
kubectl logs -n kube-system -l k8s-app=aws-node
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter