Saturation and Monitoring Frameworks

First published: May 1, 2026. Last updated: May 18, 2026. By Atif Alam.

For Kubernetes environments, this page maps node, pod, and cluster saturation to USE, RED, and the Four Golden Signals.

Use it when you shape on-call triage, capacity reasoning, dashboards, and alerts.

Metric names below are typical of stacks built around kube-prometheus, node_exporter, and cAdvisor-style scrapes.

Exact series names and labels differ by version and scrape config, so always confirm them on the Prometheus Targets page and in the Graph UI before wiring alerts.

Related definitions: short USE / RED / Golden overviews live in Grafana dashboard design patterns. This page goes deeper on saturation and how the frameworks fit together.

Why These Frameworks Complement Each Other

Framework	Lens	Best for
USE (Gregg)	Each resource (CPU, memory, disk, network, kernel tables)	Utilization, saturation (queues, wait), errors — bottom-up infrastructure
RED (Wilkie)	Each service handling requests	Rate, errors, duration — user-visible request path
Four Golden Signals (Google SRE)	User-facing system	Latency, traffic, errors, saturation — RED plus explicit saturation

They are not competing checklists. RED and Golden Signals tell you that users feel pain (latency, errors, traffic shape). USE helps explain which resource is constraining the system once you know something is wrong. Golden Signals add saturation explicitly so executive-style dashboards stay honest about capacity, not only RPS and error rate.

Tie severity and burn to SLOs, SLIs, and error budgets when you have them — the same signals should back both alerting and post-incident review.

Layered Monitoring Guidance

A practical ordering when triaging or designing observability:

1. Request path (RED / Golden): ingress and service metrics (rate, error ratio, latency percentiles). Answers: are we out of SLO, and for whom?
2. Workload and node (USE): for each resource class, utilization plus saturation (throttles, PSI, queue depth, drops) plus hard errors (OOM, disk failures). Answers: which resource is constraining the workload?
3. Cluster and control plane: pending pods, scheduler pressure, API latency, etcd health (where you own it), pod IP pools managed by a cloud CNI. Answers: is the cluster out of scheduling or network identity capacity?

Golden rule: RED/Golden flags the problem bucket; USE walks the resources until a saturation signal lights up.
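
As a rough illustration of the request-path layer, the sketch below derives rate, error ratio, and a p99 duration from raw request records for one window. The `Request` type, its field names, and the choice to count only 5xx responses as errors are assumptions made for this example; in practice these values usually come from histogram and counter series rather than individual records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request. Field names are illustrative, not from any library."""
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Return rate, error ratio, and nearest-rank p99 duration for one window."""
    if not requests or window_s <= 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile using integer math: rank = ceil(0.99 * n).
    rank = -(-99 * len(durations) // 100)
    return {
        "rate": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p99_s": durations[rank - 1],
    }
```

If this summary shows a window out of SLO, the next step is the USE walk over the pods and nodes serving that route.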

Node-Level Saturation Signals

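These usually come from node_exporter (and kernel PSI if exposed). Treat load averages relative to CPU count: node_load1 (or node_load5 / node_load15) staying above the core count for sustained intervals often means runnable threads are queuing on CPU. On Linux, load averages also count tasks in uninterruptible sleep (typically blocked on I/O), so confirm with iowait or PSI before blaming CPU.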

Area	Signals (examples)	Notes
CPU	`node_cpu_seconds_total` (especially `mode="iowait"`), load averages, `node_pressure_cpu_waiting_seconds_total` (PSI)	High iowait with low user CPU points at disk or remote storage, not application math.
Memory	`node_memory_MemAvailable_bytes`, swap usage, `node_pressure_memory_*` (PSI), OOM counters from logs or node metrics	Available memory dropping toward zero is saturation of RAM; swap churn amplifies latency.
Disk	`node_disk_io_time_seconds_total`, `node_disk_io_time_weighted_seconds_total`, inode usage, filesystem free space, `node_pressure_io_*` (PSI)	Queue-heavy disks show up as latency under load even when CPU looks idle.
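Network	`node_network_transmit_bytes_total` / `node_network_receive_bytes_total` vs link capacity, `node_network_*_drop_total`, TCP retransmits, conntrack entries vs limit (`node_nf_conntrack_entries` vs `node_nf_conntrack_entries_limit` in node_exporter; `nf_conntrack_count` vs `nf_conntrack_max` in the kernel)	Drops and retransmits are saturation or path problems; conntrack exhaustion looks like random connection failures.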
Kubelet	Running pod count vs configured max pods per node, PLEG latency, image pull backlog	Hitting the pod cap or a slow PLEG looks like NotReady or delayed status updates, not always CPU.
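
The load-average rule of thumb at the top of this section can be sketched as a small check over a series of samples. The function name, the threshold of 1.0 load per core, and the five-sample streak are placeholders chosen for illustration, not recommended values.

```python
def cpu_queueing_suspected(
    load_samples: list[float],
    cores: int,
    threshold: float = 1.0,      # placeholder: load per core that counts as "above"
    min_consecutive: int = 5,    # placeholder: samples in a row that count as "sustained"
) -> bool:
    """Return True when load per core stays above threshold for a sustained streak."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    streak = 0
    for load in load_samples:
        # Reset the streak as soon as one sample drops back under the threshold.
        streak = streak + 1 if load / cores > threshold else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring a streak rather than a single sample keeps short bursts from reading as saturation.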

PSI (/proc/pressure/*), when exported, is often the clearest saturation signal: it measures time tasks spent stalled waiting for a resource — aligned with the S in USE.
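
For reference, each PSI file holds one or two lines (`some`, and for most resources `full`) with `avg10`, `avg60`, and `avg300` percentages plus a cumulative `total` in microseconds. A minimal parser, assuming that layout and using a hypothetical function name, might look like this:

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse the contents of a /proc/pressure/<resource> file.

    Returns a mapping such as {"some": {"avg10": 1.5, ..., "total": 123456.0}}.
    """
    result: dict[str, dict[str, float]] = {}
    for line in text.strip().splitlines():
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result
```

On a Linux host with PSI enabled, passing the text of `/proc/pressure/memory` to this function gives the stall percentages directly; a nonzero `some` `avg10` means at least one task was stalled on that resource during the last ten seconds.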

Pod and Container Saturation Signals

Usually from cAdvisor-style container_* metrics (names and labels vary by runtime and scrape path).

Area	Signals (examples)	Notes
CPU limits	`container_cpu_cfs_throttled_seconds_total`, throttled period counters	Throttling means the cgroup hit its CPU quota — saturation against the limit, not necessarily against the whole machine.
Memory limits	`container_memory_working_set_bytes` vs limit, page fault behavior, `OOMKilled` in pod status	Working set near limit risks OOM; a memory limit set too low forces reclaim and GC churn, which adds latency much like CPU throttling does, without high node CPU.
Disk I/O	`container_fs_io_time_seconds_total`, read/write bytes vs volume limits	Noisy neighbors on shared storage show here before node disk averages move.
Network	Pod-scoped drops, retransmits, socket backlog where exposed	May require CNI or sidecar metrics in addition to node counters.
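
Two of the rows above reduce to simple ratios. The sketch below assumes you already have counter deltas over a window (for example from the throttled-period and total-period counters that usually accompany `container_cpu_cfs_throttled_seconds_total`) and a working set and limit in bytes. The function names are hypothetical.

```python
from typing import Optional


def throttle_ratio(throttled_periods_delta: float, periods_delta: float) -> float:
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    if periods_delta <= 0:
        # No periods elapsed (or no CPU limit enforced) in this window.
        return 0.0
    return throttled_periods_delta / periods_delta


def working_set_fraction(working_set_bytes: float, limit_bytes: float) -> Optional[float]:
    """Working set as a fraction of the memory limit, or None when no limit is set."""
    if limit_bytes <= 0:
        return None
    return working_set_bytes / limit_bytes
```

A ratio is easier to alert on than raw throttled seconds because it stays comparable across pods with different quotas.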

Cluster and Scheduling Saturation

Signal	What it indicates
`kube_pod_status_phase{phase="Pending"}` with unschedulable reasons	No node has enough unreserved CPU or memory, taints are not tolerated, or volume binding fails: scheduling capacity saturation or misconfiguration.
Scheduler latency / queue depth (where your stack exports them)	Control plane or scheduler overload at scale.
Allocatable vs sum of requests	Commitment ratio: once summed requests approach allocatable, new pods cannot land even if instantaneous usage looks low.
CNI IP pool (e.g. AWS VPC CNI `ipamd` style metrics)	IP exhaustion in the pod address pool: new pods cannot get a sandbox address; classic on large node counts or dense DaemonSets.

Always correlate with events (kubectl get events) and recent deploys, HPA, or quota changes — see Incident first look.
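
The commitment ratio from the table above is plain arithmetic over requests, not usage. A minimal sketch, using integer millicores and hypothetical function names:

```python
def commitment_ratio(requests_millicores: list[int], allocatable_millicores: int) -> float:
    """Summed pod CPU requests divided by the node's allocatable CPU."""
    if allocatable_millicores <= 0:
        raise ValueError("allocatable must be positive")
    return sum(requests_millicores) / allocatable_millicores


def request_fits(
    new_request_millicores: int,
    requests_millicores: list[int],
    allocatable_millicores: int,
) -> bool:
    """True if the new request fits in what existing requests leave unreserved."""
    return sum(requests_millicores) + new_request_millicores <= allocatable_millicores
```

For a node with 4000m allocatable and pods requesting 500m, 1000m, and 1500m, the ratio is 0.75 and a further 1000m request still fits, while 1001m does not, regardless of how idle the node looks.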

Worked Example: Latency Spike Without Obvious Node CPU

1. Ingress or service RED: p99 latency is up, errors may still be low. Confirm which route and dependency.
2. Pod USE, CPU: container_cpu_cfs_throttled_seconds_total rising for the hot pods points to quota saturation. If throttling is high but node CPU is low, raise limits or reduce per-replica work before scaling out.
3. If throttling is not the story, check memory PSI or working set vs limit for memory pressure or GC under constraint.
4. If CPU and memory look healthy, check disk wait (node or container I/O time), then network (retransmits, drops, conntrack).
5. If pods flap Pending, check scheduler messages and CNI IP metrics before chasing application code.

That walk narrows a latency incident without guessing.
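
The same walk can be written down as an ordered set of checks. This is only a sketch: the `Signals` fields and every threshold are hypothetical placeholders, and real values depend on your workloads and SLOs.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Snapshot for the hot pods and their nodes. All field names are hypothetical."""
    throttle_ratio: float        # throttled CFS periods / total periods
    working_set_fraction: float  # working set / memory limit
    memory_psi_avg10: float      # PSI "some" avg10 for memory, in percent
    io_psi_avg10: float          # PSI "some" avg10 for io, in percent
    net_errors_per_s: float      # drops plus retransmits per second
    pending_pods: int


def first_suspect(s: Signals) -> str:
    """Check resources in the order of the worked example; thresholds are placeholders."""
    if s.throttle_ratio > 0.25:
        return "cpu quota"
    if s.working_set_fraction > 0.90 or s.memory_psi_avg10 > 10.0:
        return "memory"
    if s.io_psi_avg10 > 10.0:
        return "disk"
    if s.net_errors_per_s > 10.0:
        return "network"
    if s.pending_pods > 0:
        return "scheduling or CNI IP pool"
    return "no saturation signal found"
```

Returning only the first match mirrors the triage order above; a dashboard would normally show every signal side by side.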

Checklist

Dashboards separate RED/Golden at the edge from USE rows per node pool.
Alerts fire on symptom (SLO burn, latency, errors) and saturation (throttle, PSI, pending pods), not only average CPU.
Throttling panels exist anywhere CPU limits are enforced.
CNI or IP-pool metrics are on the board if you have ever hit address exhaustion.
Metric names in this doc have been verified at least once against your live Prometheus.

Grafana — USE, RED, and Golden dashboard patterns
Prometheus — PromQL and scrape model
Exporters — Node exporter and instrumentation
Incident first look — First five minutes of a cluster-wide spike
Alerting — Routing and noise control
Troubleshooting and debugging — Kubernetes triage flow
Autoscaling on EKS — HPA, node autoscalers, and metric pipelines
Production patterns — Limits, probes, capacity planning