
Observability for Systems

First Published By Atif Alam

This page does not replace the Observability library; it summarizes what design reviews typically expect, so that services ship with measurable behavior at their boundaries.

Emit structured fields (for example, JSON with stable keys): request ID, user or tenant identifiers (when policy allows), latency, and outcome code. Queryable logs underpin debugging and auditing in ways that raw printf strings resist.
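As a minimal sketch of structured logging with stable keys, the following uses only Python's standard `logging` and `json` modules. The field names (`request_id`, `tenant_id`, `latency_ms`, `outcome`) and the logger name `checkout` are assumptions chosen for illustration, not a schema this page prescribes.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a stable key set."""

    # Hypothetical field names; align these with your own log schema.
    FIELDS = ("request_id", "tenant_id", "latency_ms", "outcome")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key in self.FIELDS:
            # Emit every key on every line so queries never hit a missing field.
            payload[key] = getattr(record, key, None)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write one line to an in-memory stream and parse it back.
buf = io.StringIO()
log = build_logger(buf)
log.info(
    "order placed",
    extra={"request_id": "req-123", "latency_ms": 42, "outcome": "ok"},
)
entry = json.loads(buf.getvalue())
```

Fields passed through `extra` become attributes on the log record, which is how the formatter picks them up. Keys that a call site omits (here `tenant_id`) are still emitted as `null`, so the shape of every line stays the same.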

Deep guides: Loki, Observability overview.

Averages lie under heavy tails. Report histograms or percentiles (p95, p99) for latency, and pair them with error rates. Saturated resources rarely show their story in averages alone.
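A small worked example may make the heavy-tail point concrete. The latency samples below are invented for illustration: 94 fast requests and 6 very slow ones. The sketch uses the standard `statistics` module (Python 3.8 or later).

```python
import statistics

# Hypothetical latency samples in milliseconds: 94 fast requests, 6 very slow.
latencies_ms = [20] * 94 + [2000] * 6

mean_ms = statistics.fmean(latencies_ms)

# quantiles(n=100) returns the 99 cut points p1..p99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms:.1f} ms  p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

With this data the mean lands well above what a typical request sees and well below what the slowest requests see, so it describes neither group. The median and the tail percentiles together show both.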

Distributed tracing links spans across hops, so a slow request exposes which dependency dominates its latency. Instrumentation belongs in outbound clients and on synchronous paths.
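To illustrate the idea of linked spans, here is a standard-library-only sketch. It is not the OpenTelemetry API; a real service would normally use an established tracing SDK. The names `Span`, `span`, and `FINISHED`, along with the client names, are hypothetical.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

_current_span = contextvars.ContextVar("current_span", default=None)
FINISHED = []  # in-memory stand-in for a trace exporter


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    duration_s: float = 0.0


@contextmanager
def span(name):
    """Open a span that inherits the trace ID of the enclosing span, if any."""
    parent = _current_span.get()
    current = Span(
        name=name,
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        parent_id=parent.span_id if parent else None,
    )
    token = _current_span.set(current)
    start = time.perf_counter()
    try:
        yield current
    finally:
        current.duration_s = time.perf_counter() - start
        _current_span.reset(token)
        FINISHED.append(current)


# Usage: wrap each outbound call so the request's time is attributed per hop.
with span("GET /checkout") as root:
    with span("inventory-client"):
        time.sleep(0.001)  # simulated fast dependency
    with span("payments-client"):
        time.sleep(0.05)   # simulated slow dependency

children = [s for s in FINISHED if s.parent_id == root.span_id]
slowest = max(children, key=lambda s: s.duration_s)
print(f"slowest dependency: {slowest.name} ({slowest.duration_s * 1000:.0f} ms)")
```

Because every child records its parent's span ID and shares the root's trace ID, grouping by parent is enough to see which outbound call accounts for most of the request's time. Across process boundaries, the same IDs would have to travel in request headers, which this sketch leaves out.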

Hands-on pointers: Distributed tracing, OpenTelemetry.

Tune alerts toward customer-visible degradation (error-rate spikes, SLO burn) rather than noisy low-level chatter alone. Correlation: error counts often spike before latency percentiles balloon, so surface both.
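One way to express "SLO burn" as an alert condition is a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a short and a long window burn fast, which tends to filter brief blips. The function names, the sample counts, and the threshold of 14.4 are illustrative assumptions to be tuned, not values taken from this page.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how many times faster than budgeted the error budget is burning."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(short_window, long_window, slo_target, threshold=14.4):
    """Page only if both windows exceed the threshold.

    Each window is an (errors, requests) tuple. The threshold is an
    assumed example value; choose one that fits your SLO period.
    """
    short_burn = burn_rate(*short_window, slo_target)
    long_burn = burn_rate(*long_window, slo_target)
    return short_burn > threshold and long_burn > threshold


# Usage with hypothetical counts against a 99.9% availability SLO.
sustained = should_page((200, 10_000), (1_500, 100_000), slo_target=0.999)
blip = should_page((200, 10_000), (500, 100_000), slo_target=0.999)
print(sustained, blip)
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO period; higher values mean it runs out sooner. Requiring the long window to agree keeps a short spike from paging on its own.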

Framework context: Saturation and monitoring frameworks, RED (rate, errors, duration) and USE (utilization, saturation, errors).

Standard golden-signals view: QPS, errors, latency, and saturation per service and dependency. Saturation includes queue depth, thread pool usage, and connection pool waits, not just CPU.
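As a rough sketch of non-CPU saturation signals, the snippet below samples a bounded work queue and a worker pool. `SaturationSnapshot` and `take_snapshot` are hypothetical names, and the busy-worker count is assumed to be tracked by the application itself.

```python
import queue
from dataclasses import dataclass


@dataclass
class SaturationSnapshot:
    queue_depth: int
    queue_fill_ratio: float   # 0.0 to 1.0; reported as 0.0 for unbounded queues
    worker_busy_ratio: float  # fraction of pool workers currently busy


def take_snapshot(work_queue, busy_workers, max_workers):
    """Sample saturation gauges; qsize() is approximate under concurrency."""
    depth = work_queue.qsize()
    capacity = work_queue.maxsize  # 0 means the queue has no upper bound
    fill = depth / capacity if capacity > 0 else 0.0
    return SaturationSnapshot(depth, fill, busy_workers / max_workers)


# Usage: a queue that is 80% full behind a fully busy pool of 16 workers.
pending = queue.Queue(maxsize=100)
for job in range(80):
    pending.put_nowait(job)

snap = take_snapshot(pending, busy_workers=16, max_workers=16)
print(snap)
```

A service in this state can show modest CPU usage while requests already wait in line, which is why queue depth and pool usage are worth exporting as their own gauges.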

After incidents, blameless postmortems capture the timeline, root cause, and action items with owners. This is the feedback loop that turns observability data into durable reliability work.

Related: Incident response and on-call, SLOs, SLIs, and error budgets, Alerting, glossary.