Saturation and Monitoring Frameworks
For Kubernetes environments, this page maps node, pod, and cluster saturation to USE, RED, and the Four Golden Signals.
Use it when you shape on-call triage, capacity reasoning, dashboards, and alerts.
Metric names below are typical of stacks built around kube-prometheus, node_exporter, and cAdvisor-style scrapes.
Your exact series names and labels may differ by version and scrape config, so confirm them in the Prometheus Targets and Graph UI before wiring alerts.
Related definitions: short USE / RED / Golden overviews live in Grafana dashboard design patterns. This page goes deeper on saturation and how the frameworks fit together.
Why These Frameworks Complement Each Other
| Framework | Lens | Best for |
|---|---|---|
| USE (Gregg) | Each resource (CPU, memory, disk, network, kernel tables) | Utilization, saturation (queues, wait), errors — bottom-up infrastructure |
| RED (Wilkie) | Each service handling requests | Rate, errors, duration — user-visible request path |
| Four Golden Signals (Google SRE) | User-facing system | Latency, traffic, errors, saturation — RED plus explicit saturation |
They are not competing checklists. RED and Golden Signals tell you that users feel pain (latency, errors, traffic shape). USE helps explain which resource is constraining the system once you know something is wrong. Golden Signals add saturation explicitly so executive-style dashboards stay honest about capacity, not only RPS and error rate.
Tie severity and burn to SLOs, SLIs, and error budgets when you have them — the same signals should back both alerting and post-incident review.
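As a rough illustration of tying burn to an SLO, the sketch below computes a burn rate from event counts. The function name, argument names, and the 99.9% target in the usage line are assumptions for illustration, not values taken from this page.

```python
def burn_rate(bad_events: float, total_events: float, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget would last exactly the SLO window;
    higher values mean the budget runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_ratio = bad_events / total_events
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    return error_ratio / error_budget


# Hypothetical window: 5 bad requests out of 1000 against a 99.9% SLO.
print(round(burn_rate(5, 1000, 0.999), 2))
```

The same ratio can back both a paging alert and the post-incident review, which keeps the two views consistent.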
Layered Monitoring Guidance
A practical ordering when triaging or designing observability:
- Request path (RED / Golden): ingress and service metrics such as rate, error ratio, and latency percentiles. Answers: are we out of SLO, and for whom?
- Workload and node (USE): for each resource class, utilization plus saturation (throttles, PSI, queue depth, drops) plus hard errors (OOM, disk failures). Answers: which resource is the constraint?
- Cluster and control plane: pending pods, scheduler pressure, API latency, etcd health (where you own it), and CNI IP pools on cloud CNIs. Answers: is the cluster out of scheduling or network identity capacity?
Golden rule: RED/Golden flags the problem bucket; USE walks the resources until a saturation signal lights up.
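To make the request-path layer concrete, here is a minimal sketch that derives rate, error ratio, and a nearest-rank p99 from raw request records. The `RequestSample` type and the "5xx counts as an error" rule are assumptions for illustration; real stacks usually derive these from counters and histograms rather than raw records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestSample:
    duration_s: float  # time taken to serve the request
    status: int        # HTTP-style status code (assumed)


def red_summary(samples: List[RequestSample], window_s: float) -> dict:
    """Summarize Rate, Errors, Duration for one service over one window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    n = len(samples)
    if n == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_s": None}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_s for s in samples)
    # Nearest-rank p99: ceil(0.99 * n), computed with integer math.
    rank = -(-n * 99 // 100)
    return {
        "rate": n / window_s,
        "error_ratio": errors / n,
        "p99_s": durations[rank - 1],
    }
```

If this layer looks unhealthy for a given route, the USE walk over the pods and nodes behind that route is the next step.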
Node-Level Saturation Signals
These usually come from node_exporter (and kernel PSI if exposed). Treat load averages relative to CPU count: node_load1 (or 5/15) above core count for sustained intervals often means runnable threads are queuing on CPU.
| Area | Signals (examples) | Notes |
|---|---|---|
| CPU | node_cpu_seconds_total (especially mode="iowait"), load averages, node_pressure_cpu_waiting_seconds_total (PSI) | High iowait with low user CPU points at disk or remote storage, not application math. |
| Memory | node_memory_MemAvailable_bytes, swap usage, node_pressure_memory_* (PSI), OOM counters from logs or node metrics | Available memory dropping toward zero is saturation of RAM; swap churn amplifies latency. |
| Disk | node_disk_io_time_seconds_total, node_disk_io_time_weighted_seconds_total, inode usage, filesystem free space, node_pressure_io_* (PSI) | Queue-heavy disks show up as latency under load even when CPU looks idle. |
| Network | node_network_transmit_bytes_total / node_network_receive_bytes_total vs link capacity, node_network_*_drop_total, TCP retransmits, node_nf_conntrack_entries vs node_nf_conntrack_entries_limit | Drops and retransmits are saturation or path problems; conntrack exhaustion looks like random connection failures. |
| Kubelet | Running pod count vs configured max pods per node, PLEG latency, image pull backlog | Hitting the pod cap or a slow PLEG looks like NotReady or delayed status updates, not always CPU. |
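The "load relative to CPU count" rule mentioned above the table can be sketched as follows. The function names and the 1.0-per-core threshold are assumptions; the inputs stand in for a window of `node_load1` samples and the node's CPU count.

```python
from typing import Sequence


def load_per_core(load_avg: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs on the node."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def sustained_cpu_queueing(load_samples: Sequence[float], cpu_count: int,
                           threshold: float = 1.0) -> bool:
    """True only if every sample in the window exceeds the per-core threshold."""
    if not load_samples:
        return False  # no data is not evidence of saturation
    return all(load_per_core(s, cpu_count) > threshold for s in load_samples)
```

On Linux the load average also counts tasks in uninterruptible sleep, so a high value is worth cross-checking against iowait or PSI before blaming CPU.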
PSI (/proc/pressure/*), when exported, is often the clearest saturation signal: it measures time tasks spent stalled waiting for a resource — aligned with the S in USE.
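Because PSI is exported as a cumulative counter of stalled seconds, the useful number is its rate. The sketch below turns two samples of a counter such as `node_pressure_cpu_waiting_seconds_total` into a stall fraction; the function name and the counter-reset handling are assumptions for illustration.

```python
def stall_fraction(prev_seconds: float, curr_seconds: float,
                   interval_s: float) -> float:
    """Fraction of the interval during which tasks were stalled.

    prev_seconds / curr_seconds are two samples of a cumulative
    stall-time counter taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_seconds - prev_seconds
    if delta < 0:
        # Counter went backwards (for example after a reboot):
        # treat the current value as the increase since the reset.
        delta = curr_seconds
    return min(delta / interval_s, 1.0)
```

A value of 0.1 means tasks spent roughly 10% of the interval waiting on that resource, which is often easier to alert on than raw utilization.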
Pod and Container Saturation Signals
Usually from cAdvisor-style container_* metrics (names and labels vary by runtime and scrape path).
| Area | Signals (examples) | Notes |
|---|---|---|
| CPU limits | container_cpu_cfs_throttled_seconds_total, throttled period counters | Throttling means the cgroup hit its CPU quota — saturation against the limit, not necessarily against the whole machine. |
| Memory limits | container_memory_working_set_bytes vs limit, page fault behavior, OOMKilled in pod status | Working set near limit risks OOM; limits set too low create throttle-like latency without high node CPU. |
| Disk I/O | container_fs_io_time_seconds_total, read/write bytes vs volume limits | Noisy neighbors on shared storage show here before node disk averages move. |
| Network | Pod-scoped drops, retransmits, socket backlog where exposed | May require CNI or sidecar metrics in addition to node counters. |
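Saturation against container limits is usually expressed as a ratio. The sketch below assumes you already have per-window increases of the throttled and total CFS period counters (in cAdvisor these are typically `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`) and a working-set and limit pair. The function names, and the convention that a limit of 0 means "no limit set", are assumptions to verify against your own scrape.

```python
from typing import Optional


def throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Share of CFS periods in which the container was throttled."""
    if total_periods <= 0:
        return 0.0  # no enforcement periods observed in the window
    return throttled_periods / total_periods


def memory_limit_usage(working_set_bytes: float,
                       limit_bytes: float) -> Optional[float]:
    """Working set as a fraction of the memory limit, or None without a limit."""
    if limit_bytes <= 0:
        return None  # assumption: 0 means the container has no memory limit
    return working_set_bytes / limit_bytes
```

Both ratios are relative to the container's own limit, so they can be high while the node as a whole looks idle.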
Cluster and Scheduling Saturation
| Signal | What it indicates |
|---|---|
| kube_pod_status_phase{phase="Pending"} with unschedulable reasons | No node with enough free CPU/memory, taints, or volume binding failures: scheduler saturation or misconfiguration. |
| Scheduler latency / queue depth (where your stack exports them) | Control plane or scheduler overload at scale. |
| Allocatable vs sum of requests | Commitment ratio: if requests exceed allocatable, new pods cannot land even if instantaneous usage looks low. |
| CNI IP pool (e.g. AWS VPC CNI ipamd-style metrics) | IP exhaustion in the pod address pool: new pods cannot get a sandbox address; classic on large node counts or dense DaemonSets. |
Always correlate with events (kubectl get events) and recent deploys, HPA, or quota changes — see Incident first look.
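The commitment ratio from the table can be sketched as below, together with a per-node fit check that shows why a pod can stay Pending even when the cluster-wide ratio is under 1. The function names and node names are hypothetical, and the check deliberately ignores taints, affinity, and other scheduler constraints.

```python
from typing import Mapping


def commitment_ratio(requested: Mapping[str, float],
                     allocatable: Mapping[str, float]) -> float:
    """Cluster-wide sum of requests divided by sum of allocatable."""
    total_allocatable = sum(allocatable.values())
    if total_allocatable <= 0:
        raise ValueError("allocatable must sum to a positive value")
    return sum(requested.values()) / total_allocatable


def can_schedule(pod_request: float,
                 requested: Mapping[str, float],
                 allocatable: Mapping[str, float]) -> bool:
    """True if at least one node has enough unrequested capacity for the pod."""
    return any(
        allocatable[node] - requested.get(node, 0.0) >= pod_request
        for node in allocatable
    )


# Hypothetical two-node cluster, values in CPU cores.
allocatable = {"node-a": 4.0, "node-b": 4.0}
requested = {"node-a": 3.5, "node-b": 3.0}
print(commitment_ratio(requested, allocatable))   # 0.8125
print(can_schedule(1.5, requested, allocatable))  # False
```

In the example 1.5 cores are free in total, but no single node has 1.5 cores unrequested, so the pod cannot land.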
Worked Example: Latency Spike Without Obvious Node CPU
- Ingress or service RED: p99 latency up, errors may still be low. Confirm which route and dependency.
- Pod USE, CPU: container_cpu_cfs_throttled_seconds_total rising for the hot pods → quota saturation. If throttling is high but node CPU is low, raise limits or reduce per-replica work before scaling out.
- If throttling is not the story, memory PSI or working set vs limit → memory pressure or GC under constraint.
- If CPU and memory look healthy, disk wait (node or container I/O time) then network (retransmits, drops, conntrack).
- If pods flap Pending, check scheduler messages and CNI IP metrics before chasing application code.
That walk narrows a latency incident without guessing.
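The resource part of that walk can be sketched as an ordered scan that stops at the first saturated signal. The signal names, the thresholds, and the snapshot shape are all hypothetical; tune them against your own baselines.

```python
from typing import Mapping, Optional

# Hypothetical thresholds, not recommendations.
THRESHOLDS = {
    "throttle_ratio": 0.25,
    "memory_limit_usage": 0.90,
    "io_stall_fraction": 0.10,
    "tcp_retransmit_ratio": 0.01,
}

# Order mirrors the walk above: CPU quota, memory, disk, then network.
WALK_ORDER = [
    "throttle_ratio",
    "memory_limit_usage",
    "io_stall_fraction",
    "tcp_retransmit_ratio",
]


def first_saturated(snapshot: Mapping[str, float]) -> Optional[str]:
    """Return the first signal in walk order that exceeds its threshold."""
    for signal in WALK_ORDER:
        value = snapshot.get(signal)
        if value is not None and value > THRESHOLDS[signal]:
            return signal
    return None  # nothing saturated, or the signals were not collected


snapshot = {"throttle_ratio": 0.05, "memory_limit_usage": 0.95,
            "io_stall_fraction": 0.30}
print(first_saturated(snapshot))  # memory_limit_usage
```

Stopping at the first hit keeps the investigation ordered; later signals may also be elevated as a side effect of the earlier one.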
Checklist
- Dashboards separate RED/Golden at the edge from USE rows per node pool.
- Alerts fire on symptom (SLO burn, latency, errors) and saturation (throttle, PSI, pending pods), not only average CPU.
- Throttling panels exist anywhere CPU limits are enforced.
- CNI or IP-pool metrics are on the board if you have ever hit address exhaustion.
- Metric names in this doc have been verified at least once against your live Prometheus.
Related
- Grafana: USE, RED, and Golden dashboard patterns
- Prometheus: PromQL and scrape model
- Exporters: Node exporter and instrumentation
- Incident first look: First five minutes of a cluster-wide spike
- Alerting: Routing and noise control
- Troubleshooting and debugging: Kubernetes triage flow
- Autoscaling on EKS: HPA, node autoscalers, and metric pipelines
- Production patterns: Limits, probes, capacity planning