
Saturation and Monitoring Frameworks

By Atif Alam

For Kubernetes environments, this page maps node, pod, and cluster saturation to USE, RED, and the Four Golden Signals.

Use it when you shape on-call triage, capacity reasoning, dashboards, and alerts.

Metric names below are typical of stacks built around kube-prometheus, node_exporter, and cAdvisor-style scrapes.

Your exact series names and labels will differ by version and scrape config, so always confirm them in the Prometheus Targets and Graph UI before wiring alerts.

Related definitions: short USE / RED / Golden overviews live in Grafana dashboard design patterns. This page goes deeper on saturation and how the frameworks fit together.

Why These Frameworks Complement Each Other

| Framework | Lens | Best for |
| --- | --- | --- |
| USE (Gregg) | Each resource (CPU, memory, disk, network, kernel tables) | Utilization, saturation (queues, wait), errors: bottom-up infrastructure |
| RED (Wilkie) | Each service handling requests | Rate, errors, duration: user-visible request path |
| Four Golden Signals (Google SRE) | User-facing system | Latency, traffic, errors, saturation: RED plus explicit saturation |

They are not competing checklists. RED and Golden Signals tell you that users feel pain (latency, errors, traffic shape). USE helps explain which resource is constraining the system once you know something is wrong. Golden Signals add saturation explicitly so executive-style dashboards stay honest about capacity, not only RPS and error rate.

Tie severity and burn to SLOs, SLIs, and error budgets when you have them — the same signals should back both alerting and post-incident review.

A practical ordering when triaging or designing observability:

  1. Request path (RED / Golden) — Ingress and service metrics: rate, error ratio, latency percentiles. Answers: are we out of SLO, and for whom?
  2. Workload and node (USE) — For each resource class, utilization plus saturation (throttles, PSI, queue depth, drops) plus hard errors (OOM, disk failures).
  3. Cluster and control plane: pending pods, scheduler pressure, API latency, etcd health (where you own it), CNI IP pools on cloud networks. Answers: is the cluster out of scheduling or network identity capacity?

Golden rule: RED/Golden flags the problem bucket; USE walks the resources until a saturation signal lights up.
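As a rough illustration of the request-path step, the sketch below computes the three RED numbers for one window of requests. The `Request` record, the choice of status >= 500 as "error", and the nearest-rank quantile are assumptions made for the example; in practice these values usually come from histogram metrics at the ingress or service, not from raw request lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code (assumed transport for this sketch)


def red_summary(requests, window_s, quantile=0.99):
    """Return (rate per second, error ratio, latency quantile) for one window."""
    if not requests:
        return 0.0, 0.0, 0.0
    rate = len(requests) / window_s
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    error_ratio = errors / len(requests)
    # Nearest-rank quantile: the smallest sample with at least
    # `quantile` of all samples at or below it.
    ordered = sorted(r.duration_s for r in requests)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return rate, error_ratio, ordered[rank - 1]
```

If the returned error ratio or latency quantile is outside the SLO, that is the cue to start the USE walk on the workloads behind the affected route.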

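Node-level signals usually come from node_exporter (and kernel PSI if exposed). Treat load averages relative to CPU count: node_load1 (or node_load5 / node_load15) staying above the core count for sustained intervals often means runnable threads are queuing on CPU. On Linux the load average also counts tasks in uninterruptible I/O wait, so check iowait before blaming CPU.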

| Area | Signals (examples) | Notes |
| --- | --- | --- |
| CPU | node_cpu_seconds_total (especially mode="iowait"), load averages, node_pressure_cpu_waiting_seconds_total (PSI) | High iowait with low user CPU points at disk or remote storage, not application math. |
| Memory | node_memory_MemAvailable_bytes, swap usage, node_pressure_memory_* (PSI), OOM counters from logs or node metrics | Available memory dropping toward zero is saturation of RAM; swap churn amplifies latency. |
| Disk | node_disk_io_time_seconds_total, node_disk_io_time_weighted_seconds_total, inode usage, filesystem free space, node_pressure_io_* (PSI) | Queue-heavy disks show up as latency under load even when CPU looks idle. |
| Network | node_network_transmit_bytes_total / node_network_receive_bytes_total vs link capacity, node_network_*_drop_total, TCP retransmits, nf_conntrack_count vs nf_conntrack_max (node_nf_conntrack_entries vs node_nf_conntrack_entries_limit in node_exporter) | Drops and retransmits are saturation or path problems; conntrack exhaustion looks like random connection failures. |
| Kubelet | Running pod count vs configured max pods per node, PLEG latency, image pull backlog | Hitting the pod cap or a slow PLEG looks like NotReady or delayed status updates, not always CPU. |

PSI (/proc/pressure/*), when exported, is often the clearest saturation signal: it measures time tasks spent stalled waiting for a resource — aligned with the S in USE.
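To make the PSI format concrete, here is a minimal sketch that parses the text of a /proc/pressure/<resource> file and applies a stall threshold. The kernel format (`some` / `full` lines with `avg10`, `avg60`, `avg300` percentages and a cumulative `total` in microseconds) is standard; the function names and the 10% default threshold are hypothetical and should be tuned per workload.

```python
def parse_psi(text):
    """Parse the content of a /proc/pressure/<resource> file.

    Returns a dict like {"some": {"avg10": 1.5, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=1.50" or "total=123456".
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_stalled(psi, kind="some", window="avg60", threshold_pct=10.0):
    """True when the chosen stall percentage meets a (hypothetical) threshold."""
    return psi.get(kind, {}).get(window, 0.0) >= threshold_pct
```

Reading the real file is then `parse_psi(open("/proc/pressure/memory").read())` on a host where PSI is enabled. The `some` line covers time when at least one task was stalled; `full` covers time when all non-idle tasks were stalled at once.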

Pod- and container-level signals usually come from cAdvisor-style container_* metrics (names and labels vary by runtime and scrape path).

| Area | Signals (examples) | Notes |
| --- | --- | --- |
| CPU limits | container_cpu_cfs_throttled_seconds_total, throttled period counters | Throttling means the cgroup hit its CPU quota: saturation against the limit, not necessarily against the whole machine. |
| Memory limits | container_memory_working_set_bytes vs limit, page fault behavior, OOMKilled in pod status | Working set near limit risks OOM; limits set too low create throttle-like latency without high node CPU. |
| Disk I/O | container_fs_io_time_seconds_total, read/write bytes vs volume limits | Noisy neighbors on shared storage show here before node disk averages move. |
| Network | Pod-scoped drops, retransmits, socket backlog where exposed | May require CNI or sidecar metrics in addition to node counters. |
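
A common way to read the throttled period counters is as a ratio: throttled CFS periods divided by total CFS periods over the same interval. The sketch below assumes you have two samples of a throttled-periods counter and a total-periods counter (in cAdvisor these are typically container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total, but confirm the names in your stack). The function name is hypothetical.

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods in which the cgroup was throttled between two samples."""
    periods = periods_end - periods_start
    if periods <= 0:
        # No periods elapsed, or the counter reset (container restart).
        return 0.0
    throttled = max(0, throttled_end - throttled_start)
    return throttled / periods
```

A ratio that stays high while node CPU is mostly idle is the pattern the CPU limits row describes: the container is saturated against its own quota, not against the machine.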

Cluster-level signals:

| Signal | What it indicates |
| --- | --- |
| kube_pod_status_phase{phase="Pending"} with unschedulable reasons | No node with enough free CPU or memory, taints, or volume binding failures: scheduler saturation or misconfiguration. |
| Scheduler latency / queue depth (where your stack exports them) | Control plane or scheduler overload at scale. |
| Allocatable vs sum of requests | Commitment ratio: once summed requests approach allocatable, new pods cannot land even if instantaneous usage looks low. |
| CNI IP pool (e.g. AWS VPC CNI ipamd style metrics) | IP exhaustion in the pod address pool: new pods cannot get a sandbox address; classic on large node counts or dense DaemonSets. |
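
The commitment ratio can be sketched as simple arithmetic on one resource of one node. The example below uses CPU in millicores; the function names and the pod figures are hypothetical, and it deliberately ignores everything else the real scheduler considers (memory, taints, affinity, topology spread).

```python
def commitment_ratio(requests_by_pod, allocatable):
    """Sum of pod requests divided by node allocatable, for a single resource."""
    total = sum(requests_by_pod.values())
    return total / allocatable if allocatable > 0 else float("inf")


def fits(new_request, requests_by_pod, allocatable):
    """True if a new pod's request still fits under allocatable.

    Scheduling is based on requests, so actual usage does not enter here.
    """
    return sum(requests_by_pod.values()) + new_request <= allocatable
```

For a node with 4000m allocatable and pods requesting 1500m and 2000m, the ratio is 0.875: a 500m pod still lands, a 600m pod stays Pending, regardless of how idle the node looks on a utilization graph.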

Always correlate with events (kubectl get events) and recent deploys, HPA, or quota changes — see Incident first look.

Worked Example: Latency Spike Without Obvious Node CPU

  1. Ingress or service RED — p99 latency up, errors may still be low. Confirm which route and dependency.
  2. Pod USE, CPU: container_cpu_cfs_throttled_seconds_total rising for the hot pods → quota saturation. If throttling is high but node CPU is low, raise limits or reduce per-replica work before scaling out.
  3. If throttling is not the story, memory PSI or working set vs limit → memory pressure or GC under constraint.
  4. If CPU and memory look healthy, disk wait (node or container I/O time) then network (retransmits, drops, conntrack).
  5. If pods flap Pending, check scheduler messages and CNI IP metrics before chasing application code.

That walk narrows a latency incident without guessing.
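
The ordering of steps 2 to 5 can be written down as a small function that returns the first resource whose saturation signal crosses a threshold. Everything here is illustrative: the `PodSignals` fields, the threshold values, and the `first_suspect` name are assumptions for the sketch, not recommended defaults.

```python
from dataclasses import dataclass


@dataclass
class PodSignals:
    throttle_ratio: float        # throttled CFS periods / total periods
    memory_used_fraction: float  # working set / memory limit
    io_wait_fraction: float      # share of time spent waiting on disk
    retransmit_fraction: float   # retransmitted / sent TCP segments
    pending_pods: int            # pods of this workload stuck in Pending


# Hypothetical thresholds; tune them against your own baselines.
THRESHOLDS = {
    "throttle_ratio": 0.25,
    "memory_used_fraction": 0.90,
    "io_wait_fraction": 0.20,
    "retransmit_fraction": 0.01,
}


def first_suspect(s, t=THRESHOLDS):
    """Walk CPU quota, memory, disk, network, scheduling in that order."""
    checks = [
        ("cpu-quota", s.throttle_ratio >= t["throttle_ratio"]),
        ("memory", s.memory_used_fraction >= t["memory_used_fraction"]),
        ("disk", s.io_wait_fraction >= t["io_wait_fraction"]),
        ("network", s.retransmit_fraction >= t["retransmit_fraction"]),
        ("scheduling", s.pending_pods > 0),
    ]
    for name, saturated in checks:
        if saturated:
            return name
    return None  # nothing saturated: look at dependencies or application code
```

Returning only the first hit mirrors the walk above: fix or rule out the earliest saturated resource, then re-check, since later signals are often side effects of the first.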

  • Dashboards separate RED/Golden at the edge from USE rows per node pool.
  • Alerts fire on symptom (SLO burn, latency, errors) and saturation (throttle, PSI, pending pods), not only average CPU.
  • Throttling panels exist anywhere CPU limits are enforced.
  • CNI or IP-pool metrics are on the board if you have ever hit address exhaustion.
  • Metric names in this doc have been verified at least once against your live Prometheus.