Observability Overview
Observability is the ability to understand what’s happening inside your systems by examining their outputs. When something breaks at 3 AM, observability is what lets you figure out why without guessing.
The Three Pillars
| Pillar | What It Is | Example Tools |
|---|---|---|
| Metrics | Numeric measurements over time (CPU, memory, request rate, error rate, latency) | Prometheus, Datadog, CloudWatch |
| Logs | Timestamped text records of events (app errors, access logs, audit trails) | Loki, Elasticsearch, CloudWatch Logs |
| Traces | End-to-end request paths through distributed services | Jaeger, Tempo, Zipkin (backends); OpenTelemetry (instrumentation) |
Each pillar answers different questions:
- Metrics → “Is something wrong?” (alerting, dashboards)
- Logs → “What exactly happened?” (debugging, audit)
- Traces → “Where is the bottleneck?” (latency analysis across services)
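
To make the three questions concrete, here is a minimal standard-library sketch in which one request emits all three signals. The in-memory `request_counts`, `log_lines`, and `spans` containers, the `handle_request` function, and the field names are hypothetical stand-ins for a real metrics backend, log store, and trace store. They are not any library's API.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-memory stand-ins for a metrics backend, log store, and trace store.
request_counts = Counter()  # metric: request count per (route, status)
log_lines = []              # logs: one JSON string per event
spans = []                  # traces: one dict per timed step


@contextmanager
def span(trace_id, name):
    """Record how long a named step took as part of one trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(route, fail=False):
    """Handle one illustrative request and emit a metric, a log, and spans."""
    trace_id = uuid.uuid4().hex
    status = 200
    with span(trace_id, "handle_request"):
        with span(trace_id, "db_query"):
            if fail:
                status = 500
                # The log line carries the trace ID so it can be joined to the trace.
                log_lines.append(json.dumps({
                    "ts": time.time(),
                    "level": "error",
                    "route": route,
                    "trace_id": trace_id,
                    "msg": "database query failed",
                }))
    request_counts[(route, status)] += 1
    return trace_id, status


ok_trace, ok_status = handle_request("/checkout")
bad_trace, bad_status = handle_request("/checkout", fail=True)
```

The counter answers "is something wrong?", the log line says what happened, and the spans show where the time went. The shared `trace_id` is what lets you move from one signal to the next.
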
Network evidence complements the three pillars for platform and connectivity incidents: VPC flow logs show allow/deny decisions and traffic volume at the cloud boundary (Flow logs and network RCA); packet captures show TCP/TLS behavior on hosts (Packet capture). Use them once metrics have narrowed down the time window and the owning service.
Why Monitoring Matters
- Detect problems before users do: alerts on error rates, latency spikes, and resource exhaustion.
- Reduce mean time to resolution (MTTR): dashboards and logs help you find the root cause quickly.
- Capacity planning: metrics over time show growth trends, so you can scale before hitting limits.
- SLOs and SLAs: SLOs, SLIs, and error budgets tie metrics to agreed targets and release tradeoffs; SLAs are often contractual.
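
As a worked example of how an SLO target becomes an error budget, the sketch below converts an availability target into allowed downtime and into a share of budget remaining. The function names and the sample numbers are illustrative assumptions, not values from this guide.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
downtime = error_budget_minutes(0.999, 30)

# 250 failures out of 1,000,000 requests spends about a quarter of a 99.9% budget.
remaining = budget_remaining(0.999, 1_000_000, 250)
```

A negative result from `budget_remaining` means the budget is overspent, which is the usual signal to slow releases in favor of reliability work.
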
The Prometheus + Grafana Ecosystem
This section focuses on the open-source stack most commonly used for Kubernetes and cloud-native monitoring:

                    scrape                      query metrics
    ┌───────────┐ ◄────────── ┌──────────────┐ ◄────────────── ┌─────────────┐
    │ Your App  │             │  Prometheus  │                 │             │
    │ (metrics) │             │    (TSDB)    │────► Alert Rules│   Grafana   │
    └───────────┘             └──────────────┘ ──► Alertmgr    │ (dashboards)│
                                                               │             │
                   push logs                     query logs    │             │
    ┌───────────┐ ──────────► ┌──────────────┐ ◄────────────── │             │
    │ Your App  │             │     Loki     │                 │             │
    │  (logs)   │             │ (log store)  │                 └─────────────┘
    └───────────┘             └──────────────┘

                OpenTelemetry / OTLP                  query traces
    ┌───────────┐ ────────────────► ┌───────────────┐ ◄──────────── ┌─────────────┐
    │ Your App  │                   │ Grafana Tempo │               │   Grafana   │
    │ (traces)  │                   │ (trace store) │               │  (Explore)  │
    └───────────┘                   └───────────────┘               └─────────────┘

| Component | Role |
|---|---|
| Prometheus | Scrapes and stores metrics, evaluates alert rules |
| Grafana | Visualizes metrics, logs, and traces; dashboards and Explore |
| Alertmanager | Routes, groups, and delivers alerts (Slack, PagerDuty, email) |
| Loki | Stores and queries logs; indexes labels rather than full log text, similar to the Prometheus label model |
| Grafana Tempo | Stores traces; often fed by OpenTelemetry; query from Grafana |
| Exporters | Expose metrics from systems that don’t natively support Prometheus (Node exporter, MySQL exporter, etc.) |
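
To show what Prometheus reads when it scrapes an app or exporter, here is a minimal sketch that renders one metric family in the Prometheus text exposition format. The helper names `render_metric` and `escape_label_value` and the metric `app_requests_total` are hypothetical. Escaping of the HELP text is left out for brevity, and a real service would more likely use an official client library such as `prometheus_client` than hand-rolled rendering.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family as Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line-oriented, so end with a trailing newline.
    return "\n".join(lines) + "\n"


body = render_metric(
    "app_requests_total",
    "counter",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body)
```

An exporter does the same job for systems that cannot produce this output themselves: it translates their internal statistics into these lines and serves them over HTTP for Prometheus to scrape.
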
Monitoring vs Observability
- Monitoring is a subset of observability: it tells you that something is wrong (dashboards, alerts).
- Observability goes further: it lets you ask arbitrary questions about your system to understand why it is wrong (ad-hoc queries, correlation across metrics, logs, and traces).
In practice, the term “observability” is used broadly to cover both.
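
The distinction can be sketched with a handful of structured events. In this illustrative example, `alert_fires` plays the role of monitoring (a question decided ahead of time), while `error_rate_by` plays the role of an ad-hoc observability query (a slice chosen during the investigation). The event fields, function names, and threshold are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical structured request events; field names are illustrative only.
events = [
    {"route": "/checkout", "region": "eu-west", "version": "v2", "status": 500},
    {"route": "/checkout", "region": "eu-west", "version": "v2", "status": 500},
    {"route": "/checkout", "region": "us-east", "version": "v1", "status": 200},
    {"route": "/search", "region": "eu-west", "version": "v1", "status": 200},
    {"route": "/search", "region": "us-east", "version": "v1", "status": 200},
    {"route": "/checkout", "region": "us-east", "version": "v2", "status": 500},
]


def error_rate(rows):
    """Share of rows with a 5xx status."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["status"] >= 500) / len(rows)


def alert_fires(rows, threshold=0.25):
    """Monitoring-style check: a fixed question with a fixed threshold."""
    return error_rate(rows) > threshold


def error_rate_by(rows, field):
    """Observability-style query: slice by any field chosen at investigation time."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return {key: error_rate(group) for key, group in groups.items()}


fired = alert_fires(events)                  # something is wrong
by_region = error_rate_by(events, "region")  # errors spread across regions
by_version = error_rate_by(events, "version")  # errors concentrated in one version
```

Here the alert only says the error rate is high. Slicing by `version` shows every failure comes from `v2`, while slicing by `region` does not separate good from bad, which points to a bad release rather than a regional outage.
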
Suggested reading order
- Prometheus: metrics model and PromQL.
- Grafana: dashboards and data sources.
- SLOs, SLIs, and error budgets: how metrics become targets and budgets.
- Alerting: Alertmanager, noise, and severity.
- Exporters: node, blackbox, and app metrics.
- Loki: logs and LogQL.
- Observability Setup: compose or Kubernetes bring-up.
- OpenTelemetry: unified instrumentation.
- Distributed Tracing: spans, Tempo/Jaeger, propagation.
- Scaling Prometheus: Thanos, Mimir, long-term storage.
Skip steps you already know; use the list as a skills path for the Prometheus–Grafana–Loki–Tempo–OTel stack.
Topics in This Section
Start with Prometheus (metrics collection), then Grafana (visualization), then SLOs, alerting, exporters, Loki (logs), setup, OpenTelemetry, tracing, and scaling.
- Prometheus — Architecture, scrape config, metric types, and PromQL queries.
- Grafana — Data sources, dashboards, panels, variables, and visualization types.
- SLOs, SLIs, and error budgets — SLIs, SLO targets, error budgets, and tradeoffs with velocity and cost.
- Alerting — Alertmanager routing, Grafana alerts, alert design best practices.
- Exporters — Node exporter, blackbox exporter, application instrumentation, and custom exporters.
- Loki — Log aggregation, LogQL, labels, and Grafana integration.
- Observability Setup — Docker Compose and Kubernetes setup for the full stack from zero to dashboards.
- OpenTelemetry — Unified instrumentation for metrics, logs, and traces with the OTel SDK, auto-instrumentation, and the OTel Collector.
- Distributed Tracing — Following requests across services with spans, context propagation, Jaeger, and Grafana Tempo.
- Scaling Prometheus — Long-term storage and global querying with Thanos, Cortex, and Grafana Mimir.
Related
- AIOps: AI-assisted anomaly detection, correlation, and diagnostics that build on metrics, logs, and traces.