Incident first look

First published: Apr 24, 2026 | Last updated: May 1, 2026 | By Atif Alam

Use this page when something big changed — cluster-wide CPU, memory pressure, latency step-change, or error budget burn — and you need a repeatable first pass before deep dives.

First five minutes: Kubernetes signals

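# Most recent cluster events; sorted oldest first, so the newest are at the bottom
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Node-level CPU and memory usage (requires Metrics Server)
kubectl top nodes
# Heaviest memory consumers; kubectl top sorts descending, so read from the top
kubectl top pods -A --sort-by=memory | head -30
# Pod status and node placement across all namespaces
kubectl get pods -A -o wide | head -80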

Look for CrashLoop, Pending, OOMKilled, and recent rollouts in the same window as the metric move.
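
The status scan described above can be partly scripted. This is a minimal sketch, not a definitive tool: `unhealthy_pods` and `recent_rollouts` are hypothetical helper names, and the status patterns are an assumption about what appears in the STATUS column of `kubectl get pods -A`. OOMKilled in particular is often visible only in a container's last state, so an empty result does not rule it out.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `kubectl get pods -A` output on stdin and keep
# the header plus any row whose STATUS column (4th field) suggests trouble.
unhealthy_pods() {
  awk 'NR == 1 { print; next }
       $4 ~ /CrashLoopBackOff|Pending|OOMKilled|Error|ImagePullBackOff|Evicted/ { print }'
}

# Hypothetical helper: list the most recently created ReplicaSets, used here
# as a rough proxy for recent rollouts. $1 = number of rows (default 20).
recent_rollouts() {
  kubectl get replicasets -A --sort-by=.metadata.creationTimestamp | tail -n "${1:-20}"
}
```

Typical use would be `kubectl get pods -A | unhealthy_pods`, followed by `recent_rollouts 10` to compare creation times against the moment the metric moved.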

Prometheus and Grafana: where to look first

Node saturation — node_cpu_seconds_total, node_memory_*, disk and network pressure panels (if using node exporter).
Top namespaces — sum CPU or memory by namespace to find a noisy neighbor namespace.
Top workloads — sum by (pod, namespace)(rate(container_cpu_usage_seconds_total[5m])) (adjust metric names to your scrape labels).
Deploy correlation — align the timestamp with Argo CD / Flux syncs, Helm releases, or HPA scale events.
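
The namespace-level sums in the list above can be run directly against the Prometheus HTTP API. The sketch below is hedged: the helper names are hypothetical, `PROM_URL` is an assumed address (for example a local port-forward), and the metric and label names assume cAdvisor metrics scraped from the kubelet, so adjust them to your scrape labels.

```shell
#!/usr/bin/env bash

# Assumption: Prometheus is reachable here, e.g. through a port-forward.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print a PromQL query ranking namespaces by CPU usage rate over 5 minutes.
top_namespaces_cpu_query() {
  printf '%s' 'topk(10, sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))'
}

# Print a PromQL query ranking namespaces by working-set memory.
top_namespaces_memory_query() {
  printf '%s' 'topk(10, sum by (namespace) (container_memory_working_set_bytes{container!=""}))'
}

# Run an instant query against the Prometheus HTTP API (/api/v1/query).
# $1 = PromQL expression; the JSON response is written to stdout.
prom_query() {
  curl -sG --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}
```

For example, `prom_query "$(top_namespaces_cpu_query)"` returns the ten busiest namespaces as JSON. Keeping query construction separate from the HTTP call makes it easy to paste the same expression into Grafana Explore.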

If you use kube-prometheus-stack, start from the USE Method / Node dashboards, then drill into Kubernetes / Compute Resources / Namespace.

Network slow vs app slow

Signal type	Often points to
TCP retransmits, SYN backlog, RTT spikes, interface drops	Network path, NIC saturation, overlay issues, DNS latency
HTTP 5xx rate, queue depth, GC pauses, thread pool exhaustion	Application or dependency behavior
Both rising together	Overload — capacity or autoscaling is the first lever

Correlate with traces (Tempo/Jaeger) when available: long client spans wrapped around short server spans often indicate network or DNS delay; long server spans often indicate application work.
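
To put a number on the "TCP retransmits" row of the table, one option on a Linux node is to read the kernel's TCP counters. This is a sketch under stated assumptions: `tcp_retrans_percent` is a hypothetical helper, it expects the `/proc/net/snmp` layout (a `Tcp:` header line followed by a `Tcp:` values line), and the counters are cumulative since boot, so sampling twice and comparing is more telling during an incident than a single reading.

```shell
#!/usr/bin/env bash

# Hypothetical helper: percentage of sent TCP segments that were
# retransmissions, from a /proc/net/snmp-style file (Linux only).
# $1 = file to read (default /proc/net/snmp).
tcp_retrans_percent() {
  awk '
    # First "Tcp:" line is the header: remember each column position by name.
    $1 == "Tcp:" && !have_header {
      for (i = 2; i <= NF; i++) col[$i] = i
      have_header = 1
      next
    }
    # Second "Tcp:" line carries the counter values.
    $1 == "Tcp:" {
      out = $(col["OutSegs"]); retrans = $(col["RetransSegs"])
      if (out > 0) printf "%.2f\n", 100 * retrans / out
      else print "0.00"
      exit
    }
  ' "${1:-/proc/net/snmp}"
}
```

Run it on the affected node, or in a debug pod that shares the host network namespace, and compare against a healthy node rather than against a fixed threshold.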

Load balancer edge (AWS example)

If traffic enters via an ALB, check target health and the CloudWatch metric HTTPCode_Target_5XX_Count in the same window as cluster metrics; sometimes the cluster is healthy while the edge is not, or vice versa.
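
Both edge checks can be made from the AWS CLI. The following is a minimal sketch, assuming the CLI is configured for the right account and region: the helper names are hypothetical, the target group ARN and load balancer dimension are values you supply, and the CloudWatch metric is `HTTPCode_Target_5XX_Count` in the `AWS/ApplicationELB` namespace.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print each target's ID, health state, and reason.
# $1 = target group ARN.
alb_target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}

# Hypothetical helper: per-minute sum of 5xx responses returned by targets.
# $1 = LoadBalancer dimension value (the app/<name>/<id> part of the ARN)
# $2 = start time, $3 = end time (ISO 8601, e.g. 2026-05-01T10:00:00Z)
alb_target_5xx() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_Target_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$1" \
    --start-time "$2" --end-time "$3" \
    --period 60 --statistics Sum
}
```

Passing the time window explicitly keeps the edge view aligned with whatever window you used for the cluster metrics.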

Related

Saturation and monitoring frameworks: Layering USE, RED, and Golden Signals with node, pod, and cluster saturation metrics.
Troubleshooting and debugging: Structured Kubernetes triage.
Observability setup: ServiceMonitor, PodMonitor, and Helm release labels.
Prometheus: PromQL building blocks.
Architecture review answers: Prompts this page deepens.
