Observability for Systems

First published May 5, 2026, by Atif Alam

This page does not replace the Observability library; it summarizes what design reviews typically expect, so that service boundaries ship with measurable behavior.

Structured Logs

Emit structured fields (for example, JSON with stable keys): request id, user or tenant identifiers (when policy allows), latency, outcome code. Queryable logs support debugging and auditing in ways that raw printf-style strings cannot.
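As an illustration of the fields listed above, here is a minimal sketch using only Python's standard `logging` and `json` modules. The key names (`request_id`, `tenant_id`, `latency_ms`, `outcome`), the `JsonFormatter` class, and the `checkout` logger name are assumptions made for this example, not a schema prescribed by this page.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with stable keys."""

    # Hypothetical field names; align them with your own log schema.
    FIELDS = ("request_id", "tenant_id", "latency_ms", "outcome")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        for key in self.FIELDS:
            # Values passed through `extra=` become attributes on the record.
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    start = time.perf_counter()
    # ... handle the request here ...
    log.info(
        "request finished",
        extra={
            "request_id": "req-123",
            "tenant_id": "tenant-42",  # include only when policy allows
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            "outcome": "ok",
        },
    )
```

Because every record carries the same keys, a log store can filter on `request_id` or aggregate on `outcome` without parsing free-form text.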

Deep guides: Loki, Observability overview.

Metrics: Percentiles, Not Only Averages

Averages mislead when the latency distribution has a heavy tail. Report histograms or percentiles (p95, p99) for latency, and pair them with error rates. Saturated resources rarely reveal themselves in averages alone.
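To make the gap between an average and a tail percentile concrete, the sketch below computes both over a synthetic sample. The nearest-rank `percentile` helper and the sample values are assumptions chosen for illustration; production systems usually derive percentiles from histogram buckets rather than raw samples.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it.

    `pct` is an integer between 1 and 100.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of len * pct / 100, avoiding float rounding.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [20] * 90 + [2000] * 10

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p95: ", percentile(latencies_ms, 95))
    print("p99: ", percentile(latencies_ms, 99))
```

On this sample the mean is 218 ms, a latency no request actually experienced, while the median is 20 ms and both p95 and p99 are 2000 ms. The percentiles show that one request in ten is slow; the mean hides it.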

Traces Across Service Boundaries

Distributed tracing links spans across hops, so that a slow request reveals which dependency dominates its latency. Instrumentation belongs in outbound clients and on synchronous paths.
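The linking of spans across hops can be sketched without any tracing library by passing a W3C Trace Context `traceparent` header (version `00`, a 32-hex-character trace id, a 16-hex-character span id, and a flags byte) on each outbound call. The `SpanContext`, `start_span`, and `outbound_headers` names below are hypothetical helpers for this example; in practice a tracing SDK such as OpenTelemetry handles propagation for you.

```python
import secrets
from dataclasses import dataclass
from typing import Optional

_HEX = set("0123456789abcdef")


def _is_hex(value: str, length: int) -> bool:
    return len(value) == length and all(c in _HEX for c in value)


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 lowercase hex chars, shared by every span in the trace
    span_id: str   # 16 lowercase hex chars, unique per span
    sampled: bool = True

    def to_traceparent(self) -> str:
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    parts = header.strip().split("-")
    if len(parts) != 4:
        return None
    version, trace_id, span_id, flags = parts
    if version != "00":
        return None
    if not (_is_hex(trace_id, 32) and _is_hex(span_id, 16) and _is_hex(flags, 2)):
        return None
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 1))


def start_span(incoming_headers: dict) -> SpanContext:
    """Continue the caller's trace if one is present, otherwise start a new one."""
    parent = parse_traceparent(incoming_headers.get("traceparent", ""))
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


def outbound_headers(span: SpanContext) -> dict:
    """Headers an outbound client would attach so the next hop joins the trace."""
    return {"traceparent": span.to_traceparent()}


if __name__ == "__main__":
    frontend = start_span({})                          # edge service, no parent
    backend = start_span(outbound_headers(frontend))   # downstream hop
    print(frontend.to_traceparent())
    print(backend.to_traceparent())
```

Both printed headers share the same trace id but carry different span ids, which is what lets a tracing backend stitch the hops of one request together.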

Hands-on pointers: Distributed tracing, OpenTelemetry.

Alerts on Symptoms

Tune alerts toward customer-visible degradation (error rate spikes, SLO burn) rather than only noisy low-level signals. Correlation: error counts often spike before latency percentiles balloon, so surface both.
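One way to express "SLO burn" as an alert condition is a burn rate: the observed error rate divided by the error budget (1 minus the SLO target). The sketch below is an assumption-laden example rather than a recommended policy. The `burn_rate` and `should_page` helpers are hypothetical, and the default threshold of 14.4 is a commonly cited value for a one-hour window against a 30-day SLO; tune it for your own targets.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly at the pace the SLO allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    budget = 1.0 - slo_target
    return (errors / total) / budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn fast.

    Each window is an (errors, total) pair. The long window shows the burn is
    significant; the short window shows it is still happening right now.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


if __name__ == "__main__":
    slo = 0.999  # 99.9% of requests succeed
    print(burn_rate(50, 10_000, slo))                    # roughly 5x budget pace
    print(should_page((200, 10_000), (30, 1_000), slo))  # sustained and ongoing
    print(should_page((200, 10_000), (0, 1_000), slo))   # already recovered
```

Requiring both windows to exceed the threshold keeps the alert tied to customer-visible impact: a brief spike that has already recovered does not page anyone.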

Framework context: Saturation and monitoring frameworks, RED (rate, errors, duration) and USE (utilization, saturation, errors).

Dashboards

Standard golden signals view: QPS, errors, latency, and saturation, per service and per dependency. Saturation includes queue depth, thread pool usage, and connection pool waits, not just CPU.

Postmortems and Learning

After incidents, blameless postmortems capture timeline, root cause, and action items with owners — the feedback loop that turns observability data into durable reliability work.

Related: Incident response and on-call, SLOs, SLIs, and error budgets, Alerting, glossary.