
Scheduling and Placement

By Atif Alam

The kube-scheduler assigns each schedulable Pod to exactly one node by running a two-phase pipeline: filter (predicates — node must pass all) then score (priorities — pick the best among survivors). The result is written to spec.nodeName; kubelet then starts the workload.

This page is the vocabulary you need for SME-level scheduling reviews and production design discussions. For control-plane context, see Architecture. For Pending pods and capacity, see EKS troubleshooting cheat sheet — Symptom 1: Pod stuck in Pending and Workload types (DaemonSets and node placement).

Predicates answer: can this Pod legally run on this node? If any required predicate fails, the node is dropped. Common built-in themes (exact names vary by version and profile):

| Theme | What it checks |
| --- | --- |
| Resource fit | Enough allocatable CPU/memory (and hugepages if requested) after subtracting the requests of pods already bound to the node. |
| Node selector / affinity | Hard (requiredDuringSchedulingIgnoredDuringExecution) node affinity and nodeSelector must match the node's labels. |
| Taints and tolerations | Pod must tolerate every NoSchedule and NoExecute taint on the node. PreferNoSchedule taints are soft and do not exclude the node. |
| Volume topology | For PVCs with topology constraints or WaitForFirstConsumer binding, nodes must satisfy storage and zone constraints. |
| Pod affinity/anti-affinity | Hard inter-pod rules (e.g. “must sit in same zone as cache”) must be satisfiable. |
| Ports and hostNetwork | Host port conflicts and similar collisions. |

When no node passes filtering, the Pod stays Pending; kubectl describe pod shows a concise message (for example 0/3 nodes are available: 3 Insufficient cpu).
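The filtering step can be sketched as a toy model. This is not the kube-scheduler's code: the `Node` and `Pod` classes, the function names, and the reduction of taints to bare keys are assumptions made for illustration, and the reason strings are only modeled on the messages the scheduler emits.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable_cpu_m: int                      # allocatable CPU, millicores
    requested_cpu_m: int = 0                    # sum of requests of pods already bound
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)    # hard taint keys only (simplified)


@dataclass
class Pod:
    cpu_request_m: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)  # taint keys the pod tolerates


def filter_reason(pod, node):
    """Return None if the node is feasible, else the first failing reason."""
    if any(node.labels.get(k) != v for k, v in pod.node_selector.items()):
        return "node(s) didn't match Pod's node affinity/selector"
    if node.taints - pod.tolerations:
        return "node(s) had untolerated taint"
    if node.requested_cpu_m + pod.cpu_request_m > node.allocatable_cpu_m:
        return "Insufficient cpu"
    return None


def run_filter(pod, nodes):
    """Split nodes into feasible ones and a per-reason count of rejections."""
    feasible, reasons = [], Counter()
    for node in nodes:
        reason = filter_reason(pod, node)
        if reason is None:
            feasible.append(node)
        else:
            reasons[reason] += 1
    return feasible, reasons


def pending_message(nodes, reasons):
    """Summarize rejections in the style of a FailedScheduling event."""
    parts = ", ".join(f"{n} {reason}" for reason, n in sorted(reasons.items()))
    return f"0/{len(nodes)} nodes are available: {parts}"


nodes = [Node(f"node-{i}", allocatable_cpu_m=1000, requested_cpu_m=900) for i in range(3)]
feasible, reasons = run_filter(Pod(cpu_request_m=500), nodes)
print(pending_message(nodes, reasons))  # 0/3 nodes are available: 3 Insufficient cpu
```

Note that the check uses requests, not limits or live usage: a node can look idle in monitoring and still fail resource fit.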

Priorities answer: among feasible nodes, which is best? The scheduler assigns a score per node; the highest wins (with ties broken pseudo-randomly for spread). Examples of what scoring tends to reward:

  • Spread — avoid piling too many pods of the same workload on one node.
  • Preferred (soft) affinity — “like” to be in the same rack or region, without excluding nodes if impossible.
  • Resource balance — prefer nodes with more headroom after the pod lands.
  • Image locality — slight preference if the image is already pulled on the node.

Scheduler profiles can enable, disable, or reweight individual scoring functions, so the set that actually runs depends on configuration; managed distributions may ship a tuned profile.
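A minimal sketch of weighted scoring, assuming two hypothetical scoring functions (`headroom` and `image_locality`) and made-up weights. The real scheduler's plugins, normalization, and default weights differ; the point is only that changing the profile can change the winner.

```python
import random

POD_CPU_M = 500  # CPU request of the pod being placed (illustrative value)


def headroom(node):
    """0..100: share of allocatable CPU still free after the pod lands."""
    free_after = node["allocatable_cpu_m"] - node["requested_cpu_m"] - POD_CPU_M
    return 100 * free_after / node["allocatable_cpu_m"]


def image_locality(node):
    """0 or 100: reward nodes that already have the image pulled."""
    return 100 if node["has_image"] else 0


def pick_node(candidates, profile, rng=random):
    """profile is a list of (weight, score_fn); ties are broken randomly."""
    totals = {
        node["name"]: sum(weight * fn(node) for weight, fn in profile)
        for node in candidates
    }
    best = max(totals.values())
    winners = [name for name, total in totals.items() if total == best]
    return rng.choice(winners), totals


feasible = [
    {"name": "a", "allocatable_cpu_m": 4000, "requested_cpu_m": 1000, "has_image": False},
    {"name": "b", "allocatable_cpu_m": 4000, "requested_cpu_m": 3000, "has_image": True},
]

# Profile with only the headroom function: the emptier node wins.
print(pick_node(feasible, [(1, headroom)])[0])                       # a
# Adding image locality with equal weight flips the decision.
print(pick_node(feasible, [(1, headroom), (1, image_locality)])[0])  # b
```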

PriorityClass vs scheduler “priorities”


These names collide in conversation:

| Concept | What it is |
| --- | --- |
| Scheduler priority / scoring | Internal numeric ranking of nodes during normal scheduling. |
| PriorityClass | A cluster-scoped object, referenced from the Pod by priorityClassName, that sets the Pod's numeric priority. Pod priority orders the scheduling queue and drives preemption: a pending higher-priority Pod can evict lower-priority Pods to free room on a node. The scheduler tries to avoid violating PDBs when choosing victims, but only on a best-effort basis. |

When debugging unexpected evictions, check PriorityClass, PDBs, and recent changes to priority values — not only resource requests.
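To make the preemption rule concrete, here is a deliberately simplified sketch of victim selection on a single node. It is not the scheduler's actual algorithm; `RunningPod`, `pick_victims`, and the `pdb_blocked` flag are hypothetical names, and the only rules modeled are that victims must have strictly lower priority and that PDB-protected pods are tried last.

```python
from dataclasses import dataclass


@dataclass
class RunningPod:
    name: str
    priority: int
    cpu_m: int
    pdb_blocked: bool = False  # True if evicting this pod would violate a PDB


def pick_victims(pending_priority, pending_cpu_m, free_cpu_m, running):
    """Return names of pods to evict, or None if preemption cannot help."""
    # Only strictly lower-priority pods are eligible victims.
    candidates = [p for p in running if p.priority < pending_priority]
    # Prefer victims that do not violate a PDB, then the lowest priority.
    candidates.sort(key=lambda p: (p.pdb_blocked, p.priority))
    victims = []
    for pod in candidates:
        if free_cpu_m >= pending_cpu_m:
            break
        victims.append(pod.name)
        free_cpu_m += pod.cpu_m
    return victims if free_cpu_m >= pending_cpu_m else None


running = [
    RunningPod("batch", priority=0, cpu_m=500),
    RunningPod("web", priority=1000, cpu_m=500),
    RunningPod("db", priority=100000, cpu_m=1000),
]
print(pick_victims(10000, 800, 100, running))  # ['batch', 'web']
print(pick_victims(0, 500, 100, running))      # None
```

In this model, raising a workload's priority value is enough to turn previously safe pods into victims, which is why recent PriorityClass changes belong on the checklist.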

  1. kubectl describe pod <pod> -n <ns>: read Events at the bottom; scheduler messages are usually explicit.
  2. Requests vs allocatable: kubectl describe node for Allocatable and running pods’ requests (not just limits).
  3. Taints / tolerations / affinity: compare node labels, pod template, and any webhook-injected fields.
  4. PVCs: kubectl get pvc, StorageClass volumeBindingMode, and topology; see Storage.
  5. Cluster-wide events: kubectl get events -A --sort-by='.lastTimestamp' | tail -50 for quota, webhook, or autoscaler signals.
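When triaging many Pending pods, it can help to turn the scheduler message from step 1 into per-reason counts. The sketch below is a hypothetical helper, not part of kubectl; it assumes the "N/M nodes are available: ..." shape shown earlier and an optional trailing "preemption:" clause. Exact wording varies by Kubernetes version, so treat the parsing as best effort.

```python
import re


def parse_unschedulable(message):
    """Parse a FailedScheduling message into counts; None if it does not match."""
    match = re.match(r"\s*(\d+)/(\d+) nodes are available: (.*)", message)
    if not match:
        return None
    available, total = int(match.group(1)), int(match.group(2))
    # Drop the optional preemption summary that some versions append.
    rest = match.group(3).split(" preemption:")[0].rstrip(". ")
    reasons = {}
    for part in rest.split(", "):
        count, _, reason = part.strip().partition(" ")
        if count.isdigit() and reason:
            reasons[reason] = int(count)
    return {"available": available, "total": total, "reasons": reasons}


print(parse_unschedulable("0/3 nodes are available: 3 Insufficient cpu"))
# {'available': 0, 'total': 3, 'reasons': {'Insufficient cpu': 3}}
```

Feeding it the Events text from kubectl describe pod gives a quick view of whether the dominant blocker is capacity, taints, or affinity.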

For node provisioning (Karpenter, Cluster Autoscaler) when the scheduler is “silent,” see Autoscaling on EKS and the Pending playbook linked above.