EKS Troubleshooting Cheat Sheet

First published: Apr 24, 2026. Last updated: May 18, 2026. By Atif Alam.

Symptom-driven debugging reference for EKS SRE work. Focus areas: networking and autoscaling.

Universal First Moves

Before diving into any specific symptom, run these:

# Pod-level
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous          # last crash output
kubectl logs <pod> -n <ns> -c <container> -f   # multi-container

# Cluster-level events (most recent)
kubectl get events -A --sort-by='.lastTimestamp' | tail -50

# Node health
kubectl get nodes -o wide
kubectl describe node <node> | grep -A5 Conditions

# Quick resource view
kubectl top nodes
kubectl top pods -A --sort-by=memory

Drop into a debug pod for in-cluster network testing:

kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

For a stuck or dying container, use ephemeral debug containers:

kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> -n <ns>

Symptom 1: Pod Stuck in Pending

kubectl describe pod <pod> -n <ns>             # check Events at the bottom
kubectl get events -n <ns> --sort-by='.lastTimestamp'

Event message	Cause	Fix
`Insufficient cpu/memory`	No node has capacity	Scale nodes or reduce requests
`node(s) had untolerated taint`	Taint mismatch	Add toleration or remove taint
`node(s) didn't match node selector`	Label/affinity mismatch	Check `nodeSelector`, `nodeAffinity`
`pod has unbound immediate PVCs`	PVC not bound	Check StorageClass and PV availability
`0/N nodes are available`	All filtered out	Read full message, usually a combination of constraints
No events / scheduler silent	Karpenter or CAS not provisioning	See Autoscaling section below

EKS-specific check: VPC IP exhaustion leaves pods in phase Pending (usually shown as ContainerCreating) even though scheduling succeeded:

kubectl logs -n kube-system -l k8s-app=aws-node --tail=100 | grep -i "no available IP"

Symptom 2: CrashLoopBackOff

kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns> | grep -A5 "Last State"

Exit code	Meaning
0	Clean exit, but Kubernetes expected long-running process
1	Generic application error, inspect app logs
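137	`SIGKILL` (128 + 9): usually OOMKilled, or force-killed after the termination grace period expired; confirm the reason under Last State
139	Segfault (`SIGSEGV`, 128 + 11)
143	`SIGTERM` (128 + 15): the container was asked to stop (rollout, eviction, scale-down) and exited on the signal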
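
Exit codes above 128 encode a signal number (the code minus 128), which is where 137, 139, and 143 in the table come from. A minimal sketch of that decoding follows; `decode_exit_code` is a hypothetical helper name, not a kubectl feature.

```shell
#!/bin/sh
# decode_exit_code: hypothetical helper that explains a container exit code.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
decode_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal" ;;
    esac
    echo "$name ($sig)"
  else
    echo "exit status $code"
  fi
}

decode_exit_code 137   # prints: SIGKILL (9)
decode_exit_code 143   # prints: SIGTERM (15)
decode_exit_code 1     # prints: exit status 1
```

The signal alone does not say who sent it, so the Last State reason in `kubectl describe pod` is still needed to tell an OOM kill from a forced stop.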

Liveness probe killing healthy pods:

kubectl describe pod <pod> -n <ns> | grep -A3 Liveness
# Look for "Liveness probe failed" in events

Common fix: increase initialDelaySeconds, raise failureThreshold, or add startupProbe for slow-booting apps.
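
When sizing a startupProbe, the boot time it tolerates is roughly failureThreshold × periodSeconds (ignoring any initialDelaySeconds); liveness checks start only after the startup probe succeeds. A quick sketch of that arithmetic, where `startup_budget_seconds` is a hypothetical helper and 30 and 10 are illustrative values rather than settings from a real manifest:

```shell
#!/bin/sh
# startup_budget_seconds FAILURE_THRESHOLD PERIOD_SECONDS
# Approximate upper bound on boot time a startupProbe tolerates before the
# kubelet restarts the container: failureThreshold * periodSeconds.
startup_budget_seconds() {
  failure_threshold="$1"
  period_seconds="$2"
  echo $((failure_threshold * period_seconds))
}

# Assumed example values: failureThreshold=30, periodSeconds=10
startup_budget_seconds 30 10   # prints: 300
```

If the slowest observed boot is close to the printed budget, raising failureThreshold is usually the gentler knob, since it does not slow down detection once the app is up.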

Symptom 3: ImagePullBackOff (EKS)

kubectl describe pod <pod> -n <ns> | grep -A3 -i "failed to pull"

EKS-specific causes (in typical frequency order):

Node IAM role missing ECR permissions (AmazonEC2ContainerRegistryReadOnly)
Cross-account ECR pull missing source repo policy
Architecture mismatch (amd64 image on arm64 node, or reverse)
Docker Hub rate limit
Missing imagePullSecret in the same namespace

# Verify ECR auth from a node (needs the docker CLI, which containerd-only nodes may not have)
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com

Symptom 4: Service Not Reachable

Endpoints check (always step 1):

kubectl get endpoints <svc> -n <ns>
kubectl get endpointslices -n <ns> -l kubernetes.io/service-name=<svc>

If endpoints are empty:

Selector does not match pod labels -> kubectl get pods -n <ns> --show-labels
Pods are not Ready -> endpoints include only ready pods
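
A Service picks up a pod only when every key=value pair in its selector appears among the pod's labels; extra pod labels do no harm. The sketch below runs that subset check on comma-separated strings, the same shape `--show-labels` prints. `selector_matches` is a hypothetical helper and the label values are made up.

```shell
#!/bin/sh
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists.
# Returns 0 when every selector pair is present in the label list.
selector_matches() {
  selector="$1"
  labels=",$2,"          # pad with commas so each pair can be matched whole
  old_ifs="$IFS"
  IFS=','
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                          # pair found, keep checking
      *) IFS="$old_ifs"; return 1 ;;           # one missing pair means no match
    esac
  done
  IFS="$old_ifs"
  return 0
}

if selector_matches "app=web,tier=frontend" "app=web,tier=frontend,pod-template-hash=abc123"; then
  echo "match"
else
  echo "no match"
fi
```

A single typo in one selector value is enough to empty the endpoints list, which is why comparing the two strings side by side is worth the few seconds.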

In-cluster DNS test:

kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
# Inside:
nslookup <svc>.<ns>.svc.cluster.local
curl -v http://<svc>.<ns>.svc.cluster.local:<port>

CoreDNS checks:

kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
kubectl get cm coredns -n kube-system -o yaml

Symptom 5: Node NotReady

kubectl describe node <node> | grep -A10 Conditions

Condition quick meanings:

MemoryPressure: eviction risk
DiskPressure: disk cleanup / image GC pressure
PIDPressure: process count pressure
NetworkUnavailable: CNI not initialized

Get onto the node (EKS):

aws ssm start-session --target <instance-id>
sudo journalctl -u kubelet -f
sudo crictl ps -a
sudo crictl logs <container-id>

EKS join failures (node never becomes Ready):

Check /var/log/cloud-init-output.log
Verify aws-auth ConfigMap mapping (legacy) or EKS Access Entries (new model)
Check VPC CNI pod startup and node/control-plane SG rules

Networking Deep Dive (EKS)

VPC CNI Mental Model

AWS VPC CNI assigns each pod a real VPC IP via node ENIs:

Pods become first-class VPC IPs
Pod density is limited by instance ENI/IP limits
Subnet IP planning directly controls pod scale
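
In the default secondary-IP mode, the density limit follows directly from the instance's ENI count and addresses per ENI: max pods = ENIs × (IPs per ENI − 1) + 2. A sketch of that arithmetic is below; `max_pods` is a hypothetical helper, and the m5.large and m5.xlarge figures are examples that should be checked against the EC2 limits for the instance types actually in use.

```shell
#!/bin/sh
# max_pods ENIS IPS_PER_ENI
# Secondary-IP mode: each ENI's primary address is not handed to pods (hence -1);
# the +2 covers host-network pods such as aws-node and kube-proxy.
max_pods() {
  enis="$1"
  ips_per_eni="$2"
  echo $((enis * (ips_per_eni - 1) + 2))
}

max_pods 3 10    # m5.large  (3 ENIs, 10 IPs each): prints 29
max_pods 4 15    # m5.xlarge (4 ENIs, 15 IPs each): prints 58
```

Comparing this number with the `allocatable_pods` value a node reports is a quick way to spot a node that was bootstrapped with the wrong max-pods setting.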

Check current IP allocation:

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable_pods: .status.allocatable.pods}'

# On aws-node pod:
kubectl exec -n kube-system <aws-node-pod> -- /app/grpc-health-probe -addr=:50051
kubectl logs -n kube-system <aws-node-pod> | grep -i "ip address"

IP Exhaustion Playbook

Symptom: pods stuck in ContainerCreating with CNI errors, or no assignable pod IPs.

kubectl describe pod <pod> -n <ns> | grep -i "failed to assign"
kubectl logs -n kube-system -l k8s-app=aws-node --tail=200 | grep -iE "ip|eni|exhaust"

Remediations (prefer in this order):

Prefix delegation on supported instances:

kubectl set env ds aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true

Custom networking (pod subnets via ENIConfig):

kubectl set env ds aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

Add secondary VPC CIDR for pod subnets
Switch to an overlay-based approach if VPC-native pod IP is not required
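
Before choosing among these remediations, it helps to know how many free addresses each pod subnet still has. The sketch below flags subnets under a threshold from `subnet-id count` lines. `low_ip_subnets` and the threshold of 50 are illustrative assumptions, and the sample input stands in for real AWS CLI output.

```shell
#!/bin/sh
# low_ip_subnets THRESHOLD
# Reads "subnet-id<whitespace>available-ip-count" lines on stdin and prints
# the subnets whose free IP count is below THRESHOLD.
low_ip_subnets() {
  awk -v min="$1" '$2 + 0 < min + 0 { print $1, $2 }'
}

# Made-up sample input. Against a live account, the same two columns can come from:
#   aws ec2 describe-subnets --filters Name=vpc-id,Values=<vpc-id> \
#     --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text
printf 'subnet-aaa\t12\nsubnet-bbb\t4000\n' | low_ip_subnets 50   # prints: subnet-aaa 12
```

With prefix delegation enabled, free addresses matter in contiguous /28 blocks, so a subnet can be fragmented enough to fail allocation while still reporting a healthy count here.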

Pod-to-Pod Traffic Path (Cross-Node)

Pod A (eth0) -> veth -> host namespace
             -> host route table / ENI routing
             -> node A ENI -> VPC routing -> node B ENI
             -> host route -> pod B veth -> Pod B (eth0)

No overlay and no pod-to-pod NAT in the default VPC CNI path, so VPC Flow Logs show pod IPs.

kube-proxy Modes

kubectl get cm kube-proxy-config -n kube-system -o yaml | grep mode

iptables: default, rule-walk based
ipvs: hash lookup model, better at larger service counts
eBPF datapath (for example Cilium replacing kube-proxy): improved performance/visibility trade-offs

LoadBalancer / Ingress Issues (AWS Load Balancer Controller)

kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=200
kubectl describe svc <svc> -n <ns>
kubectl describe ingress <ing> -n <ns>

Symptom	Likely cause
ALB not provisioning	Missing subnet tags (`kubernetes.io/role/elb` or `kubernetes.io/role/internal-elb`)
Targets unhealthy	SG rules or readiness path mismatch
ALB 504	ALB idle timeout lower than backend response time
Wrong target type	`instance` mode hairpins via NodePort; use `ip` mode where supported

Subnet tag checklist:

aws ec2 describe-subnets --subnet-ids <id> --query 'Subnets[].Tags'

Required tags:

kubernetes.io/cluster/<cluster-name> = shared|owned
kubernetes.io/role/elb = 1 (public)
kubernetes.io/role/internal-elb = 1 (internal)

NetworkPolicy on EKS

VPC CNI by itself does not always enforce Kubernetes NetworkPolicy semantics. Common options:

Calico policy-only mode with VPC CNI
VPC CNI native Network Policy feature (where enabled)
Cilium for broader L3-L7 policy

kubectl run test --rm -it --image=nicolaka/netshoot -n <restricted-ns> -- curl <target>

Autoscaling Deep Dive

For concepts, install paths, and metric pipelines, see Autoscaling on EKS. Custom metrics: Prometheus Adapter for HPA, Container Insights for HPA.

Three layers that are orthogonal:

HPA / VPA / KEDA       -> pod count and size
CAS / Karpenter        -> node count and type
EC2 / Fargate          -> underlying compute supply

HPA (Horizontal Pod Autoscaler)

kubectl get hpa -A
kubectl describe hpa <name> -n <ns>

Debug “HPA not scaling”:

Symptom in `describe`	Likely cause
`unable to get metrics`	metrics-server unavailable
`current: <unknown>`	missing pod resource requests
desired equals current under load	target threshold/metric mismatch
scaling thrash	behavior windows not tuned
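
When desired equals current under load, it is worth recomputing by hand what the HPA should ask for. The core rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below implements only that rule with integer math; `hpa_desired_replicas` is a hypothetical helper, and it deliberately ignores the controller's tolerance band, min/max bounds, and behavior windows.

```shell
#!/bin/sh
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Integer ceiling of current_replicas * current_metric / target_metric.
hpa_desired_replicas() {
  replicas="$1"
  current="$2"
  target="$3"
  echo $(( (replicas * current + target - 1) / target ))
}

hpa_desired_replicas 4 90 60   # 4 pods at 90% vs a 60% target: prints 6
hpa_desired_replicas 3 70 50   # 3 pods at 70% vs a 50% target: prints 5
hpa_desired_replicas 4 60 60   # already on target: prints 4
```

If the hand-computed value differs from what `kubectl describe hpa` reports, the usual suspects are a maxReplicas cap or a metric whose unit differs from the target's.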

kubectl top pods -A
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

For queue/event-driven scaling, KEDA is commonly simpler than custom adapter plumbing. Metric install guides: Prometheus Adapter, Container Insights.

VPA (Vertical Pod Autoscaler)

Modes:

Off: recommendations only
Initial: set on pod creation
Auto: can evict/recreate for resize

Avoid running HPA and VPA on the same signal (for example both on CPU).

Cluster Autoscaler vs Karpenter

	Cluster Autoscaler	Karpenter
Provisioning unit	ASG/node groups	direct EC2 launches
Instance selection	pre-defined by group	per-pod-fit selection
Scale-up speed	usually slower	usually faster
Best fit	predictable regulated environments	dynamic cost-optimized environments

Karpenter Debugging

kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=200
kubectl get nodepool
kubectl get ec2nodeclass
kubectl get nodeclaim

If pods are Pending and not provisioning:

Check NodePool requirements and matching
Check pod constraints (nodeSelector, taints, topology)
Check EC2 quotas
Check subnet IP capacity
Check NodeClass IAM/permissions and events

kubectl describe pod <pending-pod>

Consolidation blockers commonly include restrictive PDBs, non-evictable pods, and long termination windows.

Cluster Autoscaler Debugging

kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

Common gotchas:

ASG max size reached
mixed instance simulation mismatch
missing autoscaler discovery tags
scale-down blocked by local-storage pods, restrictive PDBs, or the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Spot Interruption Handling

Karpenter-managed environments: native interruption handling and drain path
CAS-based environments: verify Node Termination Handler daemonset

kubectl get ds -n kube-system aws-node-termination-handler

Symptom 6: Intermittent 5xx (Layered Approach)

# 1. Are specific pods unhealthy?
kubectl get pods -n <ns> -o wide
kubectl top pods -n <ns>

# 2. CPU throttling (cgroup v2 path; on cgroup v1 nodes use /sys/fs/cgroup/cpu/cpu.stat)
kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu.stat | grep throttled

# 3. Recent restarts ([]? tolerates pods that have no container statuses yet)
kubectl get pods -n <ns> -o json | jq '.items[] | {name: .metadata.name, restarts: [.status.containerStatuses[]?.restartCount]}'

# 4. ALB target health
aws elbv2 describe-target-health --target-group-arn <arn>

# 5. Correlate with deploys
kubectl rollout history deployment/<name> -n <ns>
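
The raw counters from the CPU throttling step are easier to judge as a ratio: the share of CFS periods in which the container was throttled. The sketch below computes it from cpu.stat content on stdin. `throttle_percent` is a hypothetical helper and the sample numbers are made up.

```shell
#!/bin/sh
# throttle_percent: reads cpu.stat content on stdin and prints the share of
# CFS periods in which the cgroup was throttled, as a whole-number percent.
throttle_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", (throttled * 100) / periods
      else print "n/a"
    }
  '
}

# Made-up sample in place of a live pod. Against a cluster, pipe in:
#   kubectl exec <pod> -n <ns> -- cat /sys/fs/cgroup/cpu.stat
printf 'nr_periods 2000\nnr_throttled 500\nthrottled_usec 123456\n' | throttle_percent   # prints: 25
```

The counters are cumulative since the container started, so sampling twice and comparing the deltas gives a truer picture of throttling during the incident window.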

Symptom 7: IRSA Not Working

# Service account annotation
kubectl get sa <name> -n <ns> -o yaml | grep eks.amazonaws.com/role-arn

# Pod service account usage
kubectl get pod <pod> -n <ns> -o yaml | grep serviceAccountName

# Projected token / env
kubectl describe pod <pod> -n <ns> | grep -A3 "aws-iam-token"
kubectl exec <pod> -n <ns> -- env | grep -E "AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE"

# Effective identity (requires the AWS CLI in the container image)
kubectl exec <pod> -n <ns> -- aws sts get-caller-identity

Trust policy must match exact OIDC provider, service account subject, and audience:

{
  "Effect": "Allow",
  "Principal": {
    "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID"
  },
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:sub": "system:serviceaccount:NAMESPACE:SA_NAME",
      "oidc.eks.REGION.amazonaws.com/id/CLUSTER_ID:aud": "sts.amazonaws.com"
    }
  }
}
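
Most trust-policy mismatches come down to the `sub` condition: the key is the cluster's OIDC issuer without the `https://` scheme, and the value is `system:serviceaccount:<namespace>:<service-account>`. The sketch below prints the pair a policy should contain so it can be compared with the role's actual trust policy. `irsa_condition` is a hypothetical helper, and the issuer URL, namespace, and service account name are placeholders.

```shell
#!/bin/sh
# irsa_condition CLUSTER_ISSUER_URL NAMESPACE SERVICE_ACCOUNT
# Prints the "sub" condition key and value the role trust policy must contain.
irsa_condition() {
  issuer="${1#https://}"      # IAM condition keys omit the URL scheme
  echo "${issuer}:sub = system:serviceaccount:${2}:${3}"
}

# Placeholder issuer URL. A real one can be read with:
#   aws eks describe-cluster --name <cluster> --query cluster.identity.oidc.issuer --output text
irsa_condition "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE" payments api
```

A role copied from another cluster typically fails here, because the issuer ID in its condition key still belongs to the old cluster.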

Alternative model to know: EKS Pod Identity.

Symptom 8: API Server / Control Plane Slow

Control plane is managed in EKS, but workload behavior can still overwhelm API usage.

If enabled, inspect EKS CloudWatch control-plane logs:

audit
api
authenticator
controllerManager
scheduler

In-cluster API metrics samples:

kubectl get --raw /metrics | grep apiserver_request_duration_seconds | grep -v "#"
kubectl get --raw /metrics | grep apiserver_request_total | sort -t'"' -k4 | tail -20

Frequent stressors:

high-churn operators watching broad scope
oversized ConfigMaps or Secrets
many CRDs with expensive reconciliation loops

3 AM Incident Playbook

Acknowledge and declare incident ownership
Establish blast radius (namespace/service/customer impact)

Check recent changes quickly:

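# kubectl rollout history works per namespace; recently created ReplicaSets across all namespaces are a quick proxy for recent rollouts
kubectl get rs -A --sort-by=.metadata.creationTimestamp | tail -20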

Check AWS Service Health

Roll back suspicious changes rapidly:

kubectl rollout undo deployment/<name> -n <ns>

Communicate status every 15 minutes
Capture evidence for post-incident analysis

Quick Reference: High-Frequency Commands

kubectl get events -A --sort-by='.lastTimestamp' | tail -50
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -f --tail=100
kubectl get pods -n <ns> -o wide
kubectl get endpoints <svc> -n <ns>
kubectl top pods -A --sort-by=memory
kubectl top nodes
kubectl describe node <node>
kubectl get hpa -A
kubectl describe hpa <name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>
kubectl rollout history deployment/<name> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
kubectl exec -it <pod> -n <ns> -- sh
kubectl debug -it <pod> -n <ns> --image=nicolaka/netshoot --target=<container>
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash
kubectl get nodepool
kubectl get nodeclaim
kubectl logs -n kube-system -l k8s-app=aws-node
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter