# SLOs, SLIs, and error budgets
Reliability is not “zero incidents”—it is agreed behavior over time. SLIs measure that behavior; SLOs set the target; error budgets turn the gap between perfect and the target into a shared quantity you can spend (releases) or protect (hold the line).
This page is framework-agnostic. You implement SLIs with metrics—often in Prometheus—and policies with people and process. See Alerting for routing and noise control.
| Term | Meaning |
|---|---|
| SLI (service level indicator) | A measurable signal of good service from the user’s perspective (e.g. successful HTTP requests, latency under a threshold). |
| SLO (service level objective) | A target for an SLI over a window (e.g. “99.9% of requests succeed per 30 days”). |
| Error budget | The allowable bad events implied by the SLO: if availability is 99.9%, you have 0.1% “budget” for errors in that window. |
| SLA (service level agreement) | Often a contract with customers (refunds, credits). SLOs are usually internal targets that may be stricter than the SLA. |
## Good SLIs are user-centric
Prefer signals that reflect real user or business impact:
- Availability — proportion of requests that succeed (or meet latency) over a period.
- Latency — proportion of requests faster than a threshold (e.g. p99 under 300 ms).
Avoid infrastructure-only SLIs unless they clearly proxy user pain (e.g. “Kafka lag” only if it directly drives user-visible delay).
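A latency SLI of this shape can be sketched from a Prometheus histogram. The metric name `http_request_duration_seconds` and its bucket boundaries are assumptions here; substitute your own metrics and labels:

```promql
# Proportion of requests completing under 300 ms over the last 5 minutes.
# Assumes a histogram metric with a 0.3-second bucket boundary (le="0.3").
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```

This only works if `0.3` is an actual bucket boundary of the histogram; otherwise pick the nearest boundary or adjust the instrumentation.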
## From SLI to SLO
- Choose the SLI — e.g. `good_requests / total_requests` for HTTP.
- Pick a time window — rolling 30 days is common; calendar months work for reporting.
- Set the target — e.g. 99.9% availability. Higher targets cost more engineering and infrastructure; cost vs reliability is a real tradeoff.
- Measure with the same math everywhere — dashboard, alert rules, and postmortems should agree on the definition.
Example (conceptual): if `total_requests` and `failed_requests` exist as counters, availability over a window is roughly:

`availability ≈ 1 - (failed_requests / total_requests)`

In PromQL, you typically use `rate()` or `increase()` over a range and compare ratios — see your metrics’ exact names and labels.
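As a minimal sketch, assuming a counter named `http_requests_total` with a `code` label for the HTTP status (a common convention, but check your own instrumentation):

```promql
# Availability over a rolling 30 days: good (non-5xx) requests / all requests.
sum(increase(http_requests_total{code!~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
```

Whatever expression you settle on, reuse it verbatim in dashboards and alert rules so everyone computes the same number.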
## Error budget
If your SLO is 99.9% good requests in 30 days, the error budget is 0.1% of requests in that window — requests that may fail without breaking the SLO.
- Budget remaining → you can accept more risk (experiments, faster deploys) if policy allows.
- Budget exhausted or burning fast → slow releases, freeze non-critical changes, or invest in reliability until the burn slows.
Velocity vs reliability: shipping faster often consumes budget faster. Cost: tighter SLOs usually mean more redundancy, testing, and on-call attention—explicit targets make those tradeoffs discussable.
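For a 99.9% SLO over 30 days, the budget is 0.1% of requests — equivalently, about 43 minutes if spent as full downtime (0.001 × 30 × 24 × 60 ≈ 43.2). The fraction of budget remaining can be sketched in PromQL, again assuming a `http_requests_total` counter with a `code` label:

```promql
# Fraction of a 99.9% SLO's error budget still unspent over 30 days.
# (bad requests / total requests) is the observed bad fraction;
# dividing by 0.001 expresses it as a share of the allowed budget.
1 - (
  (
    sum(increase(http_requests_total{code=~"5.."}[30d]))
    /
    sum(increase(http_requests_total[30d]))
  ) / 0.001
)
```

A value of 1 means the budget is untouched; 0 or below means the SLO is blown for the window.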
## Alerting on SLOs and burn
Alert on symptoms and budget burn, not every blip:
- Fast burn — error rate is high enough that you will miss the SLO soon unless you act (page).
- Slow burn — trend will miss the SLO over the window if it continues (ticket or lower urgency).
Google’s multi-window / multi-burn-rate approach is a common pattern; your Alertmanager routes can mirror severity. Keep pages actionable—see Alerting.
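A fast-burn page in that style can be sketched as a single PromQL expression. The 14.4× factor comes from the multi-burn-rate pattern (burning roughly 2% of a 30-day budget in one hour); metric names and labels are assumptions as before:

```promql
# Fast burn for a 99.9% SLO: page when the error ratio exceeds
# 14.4x the budget rate over BOTH a long (1h) and short (5m) window,
# so the alert fires quickly but also clears quickly once recovery starts.
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
```

A slow-burn companion would use longer windows (e.g. 6h/30m) with a lower factor and route to a ticket rather than a page.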
## Checklist
- SLI definitions are documented and match dashboards and alerts.
- SLO targets are agreed with product and engineering (not only SRE).
- Error budget policy says what happens when budget is low (e.g. release freeze, reliability sprint).
- Postmortems reference whether the incident consumed budget and what will change.
## Related
- Prometheus — metrics and PromQL
- Alerting — routing and alert quality
- Grafana — SLO dashboards and recording rules in practice
- Deployment strategies — canaries and risk vs release speed
- QA and reliability guide — testing and incidents in context