Incident first look
Use this page when something big has changed (a cluster-wide CPU or memory-pressure shift, a latency step change, or error budget burn) and you need a repeatable first pass before deep dives.
First five minutes: Kubernetes signals
    kubectl get events -A --sort-by='.lastTimestamp' | tail -50
    kubectl top nodes
    kubectl top pods -A --sort-by=memory | head -30
    kubectl get pods -A -o wide | head -80

Look for CrashLoopBackOff, Pending, and OOMKilled pods, and for rollouts that landed in the same window as the metric move. Events sort oldest first, so `tail` shows the most recent ones; `kubectl top` sorts highest usage first, so `head` shows the heaviest pods.
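The pod status scan above can be narrowed with a small filter. This is a minimal sketch, assuming the default `kubectl get pods -A` column layout in which STATUS is the fourth column; `filter_problem_pods` is a hypothetical helper name, not part of kubectl.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and keeps
# the header row plus any pod whose STATUS column (assumed to be column 4)
# is one of the states this page tells you to look for.
filter_problem_pods() {
  awk 'NR == 1 || $4 ~ /^(CrashLoopBackOff|Pending|OOMKilled)$/'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -A -o wide | filter_problem_pods
```

Extend the pattern if your incident involves other states, such as image pull failures.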
Prometheus and Grafana: where to look first
- Node saturation: `node_cpu_seconds_total`, `node_memory_*`, and the disk and network pressure panels (if using node exporter).
- Top namespaces: sum CPU or memory by `namespace` to find a noisy-neighbor namespace.
- Top workloads: `sum by (pod, namespace) (rate(container_cpu_usage_seconds_total[5m]))` (adjust metric names to your scrape labels).
- Deploy correlation: align the timestamp with Argo CD / Flux syncs, Helm releases, or HPA scale events.
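The "top namespaces" check can also be run from a terminal through the Prometheus HTTP API. The following is a sketch under stated assumptions: `PROM_URL` is a placeholder for your Prometheus address (the default shown is a guess at a local port-forward), both function names are hypothetical, and the metric name is the cAdvisor one used above, which may differ in your setup.

```shell
# Hypothetical helper: prints a PromQL expression for the top-N namespaces
# by CPU usage rate. Arguments: N (default 5) and rate window (default 5m).
top_namespaces_cpu_query() {
  n="${1:-5}"
  window="${2:-5m}"
  printf 'topk(%s, sum by (namespace) (rate(container_cpu_usage_seconds_total[%s])))' "$n" "$window"
}

# Hypothetical helper: sends an instant query to the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own Prometheus endpoint.
prom_query() {
  curl -sS --get --data-urlencode "query=$1" \
    "${PROM_URL:-http://localhost:9090}/api/v1/query"
}

# Usage (requires network access to Prometheus):
#   prom_query "$(top_namespaces_cpu_query 5 5m)"
```

Keeping the query builder separate from the HTTP call makes it easy to paste the same expression into Grafana Explore.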
If you use kube-prometheus-stack, start from the USE Method / Node dashboards, then drill into Kubernetes / Compute Resources / Namespace.
Network slow vs app slow
| Signal type | Often points to |
|---|---|
| TCP retransmits, SYN backlog, RTT spikes, interface drops | Network path, NIC saturation, overlay issues, DNS latency |
| HTTP 5xx rate, queue depth, GC pauses, thread pool exhaustion | Application or dependency behavior |
| Both rising together | Overload — capacity or autoscaling is the first lever |
Correlate with traces (Tempo/Jaeger) when available: a flat trace with long client spans often indicates network or DNS; long server spans often indicate app work.
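For the TCP retransmit signal in the table above, a quick node-level check is possible without any dashboard. This is a rough sketch, assuming a Linux node where `/proc/net/snmp` is readable; `tcp_retrans_pct` is a hypothetical helper name. The counters are cumulative since boot, so the result is a lifetime ratio rather than a current rate; sample it twice and compare if you need the trend.

```shell
# Hypothetical helper: reads /proc/net/snmp content on stdin and prints
# RetransSegs as a percentage of OutSegs (cumulative since boot).
tcp_retrans_pct() {
  awk '
    # First "Tcp:" line is the header: remember each column name position.
    $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
    # Second "Tcp:" line holds the values for those columns.
    $1 == "Tcp:" {
      out = $(col["OutSegs"]); re = $(col["RetransSegs"])
      if (out > 0) printf "%.2f\n", 100 * re / out; else print "0.00"
    }
  '
}

# Usage on a node (or in a debug pod with host networking):
#   tcp_retrans_pct < /proc/net/snmp
```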
Load balancer edge (AWS example)
If traffic enters via an ALB, check target health and the `HTTPCode_Target_5XX_Count` metric in the same window as the cluster metrics. Sometimes the cluster is healthy while the edge is not, or vice versa.
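The target health check can be scripted with the AWS CLI. This is a minimal sketch, assuming the CLI is configured with credentials and that `TG_ARN` holds your target group ARN (a placeholder you must set yourself); `unhealthy_targets` is a hypothetical helper name.

```shell
# Hypothetical helper: reads "<target-id> <state>" lines on stdin and prints
# only the targets whose state is anything other than "healthy".
unhealthy_targets() {
  awk 'NF >= 2 && $2 != "healthy"'
}

# Usage (requires AWS CLI access; TG_ARN is an assumption):
#   aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
#     --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
#     --output text | unhealthy_targets
```

An empty result means every registered target reports healthy, which points the investigation back toward the cluster or the application.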
Related
- Saturation and monitoring frameworks: layering USE, RED, and Golden Signals with node, pod, and cluster saturation metrics.
- Troubleshooting and debugging: structured Kubernetes triage.
- Observability setup: ServiceMonitor, PodMonitor, and Helm release labels.
- Prometheus: PromQL building blocks.
- Architecture review answers: prompts this page deepens.