Incident first look

By Atif Alam

Use this page when something big changed — cluster-wide CPU, memory pressure, latency step-change, or error budget burn — and you need a repeatable first pass before deep dives.

kubectl get events -A --sort-by='.lastTimestamp' | tail -50
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -30
kubectl get pods -A -o wide | head -80

Look for pods in CrashLoopBackOff, Pending, or OOMKilled, and for rollouts that landed in the same window as the metric move.
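To make that scan repeatable, the pod listing can be filtered by its STATUS column. This is a minimal sketch: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status). An OOM kill may show up only in a container's last terminated state, so a pod that has already restarted can look Running here.

```shell
# Filter `kubectl get pods -A --no-headers` output (read from stdin) down to
# pods whose STATUS column (4th field) matches one of the trouble states.
unhealthy_pods() {
  awk '$4 ~ /CrashLoopBackOff|Pending|OOMKilled/ { print $1 "/" $2 ": " $4 }'
}

# Usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods
#
# Recent rollouts: the newest ReplicaSets sort to the bottom.
#   kubectl get replicasets -A --sort-by=.metadata.creationTimestamp | tail -20
```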

Prometheus and Grafana: where to look first

  1. Node saturation: node_cpu_seconds_total, node_memory_*, plus disk and network pressure panels (if using node exporter).
  2. Top namespaces: sum CPU or memory by namespace to find a noisy-neighbor namespace.
  3. Top workloads: sum by (pod, namespace)(rate(container_cpu_usage_seconds_total[5m])) (adjust metric names to your scrape labels).
  4. Deploy correlation: align the timestamp of the metric change with Argo CD / Flux syncs, Helm releases, or HPA scale events.
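The first three steps can be sketched as PromQL expressions sent to the Prometheus HTTP API. The helper names `first_look_query` and `prom_query`, the `PROM_URL` default, and the `topk(10, ...)` cutoff are assumptions for illustration. The metric names are the common node exporter and cAdvisor ones and may differ in your setup.

```shell
# Assumption: Prometheus is reachable at PROM_URL; the default below is a
# hypothetical local port-forward, not a value from this page.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print a PromQL expression for one of the first-look steps.
# If your scrape also includes pod-level cgroup series, the container_*
# sums can double-count; a container!="" matcher is one way to avoid that.
first_look_query() {
  case "$1" in
    node-cpu)      echo '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))' ;;
    namespace-cpu) echo 'topk(10, sum by (namespace) (rate(container_cpu_usage_seconds_total[5m])))' ;;
    namespace-mem) echo 'topk(10, sum by (namespace) (container_memory_working_set_bytes))' ;;
    workload-cpu)  echo 'topk(10, sum by (pod, namespace) (rate(container_cpu_usage_seconds_total[5m])))' ;;
    *) echo "unknown query: $1" >&2; return 1 ;;
  esac
}

# Run an instant query through the Prometheus HTTP API (/api/v1/query).
prom_query() {
  curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query "$(first_look_query namespace-cpu)"` returns the JSON result for the busiest namespaces by CPU.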

If you use kube-prometheus-stack, start from the USE Method / Node dashboards, then drill into Kubernetes / Compute Resources / Namespace.

| Signal type | Often points to |
| --- | --- |
| TCP retransmits, SYN backlog, RTT spikes, interface drops | Network path, NIC saturation, overlay issues, DNS latency |
| HTTP 5xx rate, queue depth, GC pauses, thread pool exhaustion | Application or dependency behavior |
| Both rising together | Overload; capacity or autoscaling is the first lever |
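The table reads as a small decision rule. As a rough sketch, the hypothetical helper `triage_hint` below takes two yes/no observations and prints the matching row. The "neither" branch is not in the table and is only a placeholder.

```shell
# Map two observations onto the table's three rows.
# $1: 1 if network-level signals (retransmits, RTT, drops) are rising, else 0
# $2: 1 if application-level signals (5xx, queue depth, GC) are rising, else 0
triage_hint() {
  case "$1$2" in
    11) echo "overload: check capacity and autoscaling first" ;;
    10) echo "network: check path, NIC saturation, overlay, DNS latency" ;;
    01) echo "application: check app or dependency behavior" ;;
    *)  echo "inconclusive: neither signal group is rising" ;;
  esac
}
```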

Correlate with traces (Tempo/Jaeger) when available: a flat trace with long client spans often indicates network or DNS; long server spans often indicate app work.

If traffic enters via an ALB, check target health and the HTTPCode_Target_5XX_Count CloudWatch metric in the same window as cluster metrics. Sometimes the cluster is healthy while the edge is not, or vice versa.
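A possible way to run the edge check from a terminal, assuming the AWS CLI is installed and configured. The function names are hypothetical, and the target group ARN, load balancer dimension, and time window are placeholders you supply. In CloudWatch the target 5xx metric is published as `HTTPCode_Target_5XX_Count` under the `AWS/ApplicationELB` namespace.

```shell
# Keep only targets that are not "healthy" from "<target-id> <state>" lines on stdin.
unhealthy_targets() {
  awk '$2 != "healthy" { print $1 ": " $2 }'
}

# List target id and health state for one target group.
# $1: target group ARN (placeholder, supply your own)
alb_target_states() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
    --output text
}

# Per-minute sum of target-generated 5xx responses.
# $1: LoadBalancer dimension value, in the form app/<name>/<id>
# $2: start time, $3: end time (ISO 8601, UTC)
alb_target_5xx() {
  aws cloudwatch get-metric-statistics --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$1" \
    --start-time "$2" --end-time "$3" --period 60 --statistics Sum
}

# Usage:
#   alb_target_states "$TG_ARN" | unhealthy_targets
```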