Requirements and SLAs

First Published By Atif Alam

Well-scoped requirements keep design reviews factual. Spend the first slice of discussion agreeing on what “good” means in numbers and semantics before drawing boxes; otherwise every later debate replays unstated assumptions.

Establish the ratio of reads to writes early. Read-heavy workloads push you toward caches, replicas, and denormalized read models; write-heavy workloads push you toward partitioning, queues, and conflict handling. Misstating this ratio is one of the most expensive mistakes because it cascades through storage, API shape, and cost.

If stakeholders only think in “daily users,” derive expected actions per user per day before capacity work.
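As a rough illustration of turning "daily users" into a read/write ratio, the sketch below sums a per-user action mix. The action names and counts are hypothetical placeholders, not figures from any real product; substitute whatever mix your stakeholders agree on.

```python
# Hypothetical per-user daily action mix: name -> (kind, count per user per day).
# Every name and number here is an illustrative assumption.
ACTIONS_PER_USER_PER_DAY = {
    "view_feed": ("read", 20),
    "open_item": ("read", 8),
    "post_comment": ("write", 1.5),
    "update_profile": ("write", 0.5),
}


def read_write_ratio(actions):
    """Return (reads_per_user, writes_per_user, reads_per_write)."""
    reads = sum(count for kind, count in actions.values() if kind == "read")
    writes = sum(count for kind, count in actions.values() if kind == "write")
    if writes == 0:
        raise ValueError("no writes in the action mix; the ratio is undefined")
    return reads, writes, reads / writes


reads, writes, ratio = read_write_ratio(ACTIONS_PER_USER_PER_DAY)
print(f"{reads} reads and {writes} writes per user per day, roughly {ratio:.0f}:1")
```

With this assumed mix the workload is clearly read-heavy, which is the kind of conclusion worth confirming with the group before any storage discussion.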

Capture order-of-magnitude targets you can defend:

  • DAU (daily active users) or equivalent actor count.
  • Actions per user per day that hit your service (not every click — the ones that matter for load).
  • Peak QPS if you have traffic shape data; otherwise derive a first pass in capacity estimation.
  • Storage: working set size, retention, and growth.

These numbers do not need three significant figures; they need shared agreement so capacity math is reviewable.
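A first-pass calculation from the targets above might look like the following sketch. The function name, the peak factor, and every input value are assumptions chosen for illustration; the point is that each number is visible and can be challenged in review.

```python
SECONDS_PER_DAY = 86_400


def first_pass_capacity(dau, actions_per_user_per_day, peak_factor,
                        writes_per_user_per_day, bytes_per_write, retention_days):
    """Order-of-magnitude load and storage estimate from agreed inputs."""
    avg_qps = dau * actions_per_user_per_day / SECONDS_PER_DAY
    # peak_factor is an assumed multiplier over the daily average;
    # replace it with measured traffic shape when you have it.
    peak_qps = avg_qps * peak_factor
    daily_bytes = dau * writes_per_user_per_day * bytes_per_write
    retained_bytes = daily_bytes * retention_days
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "daily_bytes": daily_bytes,
        "retained_bytes": retained_bytes,
    }


# All inputs below are hypothetical.
estimate = first_pass_capacity(
    dau=1_000_000,
    actions_per_user_per_day=30,
    peak_factor=3,
    writes_per_user_per_day=2,
    bytes_per_write=2_000,
    retention_days=365,
)
print(f"average ~{estimate['avg_qps']:.0f} QPS, peak ~{estimate['peak_qps']:.0f} QPS")
print(f"~{estimate['daily_bytes'] / 1e9:.0f} GB/day, "
      f"~{estimate['retained_bytes'] / 1e12:.2f} TB retained")
```

The retained figure ignores replication, indexes, and compression, so treat it as a floor for raw data rather than a provisioning number.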

Functional requirements describe behavior users or systems can observe (what the API does). Non-functional requirements cover latency, availability, durability, compliance, and cost constraints. Non-functional needs often decide between SQL and NoSQL, sync versus async, and multi-region cost.

Define SLA-style or SLO-style targets (availability, p99 latency, durability) before committing to a topology. The same architecture can be “correct” for 99.5% and wrong for 99.99% or strict p99 budgets. Tie targets to product impact so tradeoffs are negotiable.
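To make the gap between availability targets concrete, this small sketch converts a percentage into allowed downtime over a window. The 30-day window and the function name are assumptions; use whatever measurement window your targets actually specify.

```python
from datetime import timedelta


def downtime_budget_minutes(availability_pct, window=timedelta(days=30)):
    """Minutes of allowed unavailability in the window for a given target."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    window_minutes = window.total_seconds() / 60
    return window_minutes * (100 - availability_pct) / 100


for target in (99.5, 99.9, 99.99):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, 99.5% leaves hours of slack per month while 99.99% leaves only a few minutes, which is why the same topology can pass one target and fail the other.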

See also: SLOs, SLIs, and error budgets; service readiness checklist.

Decide where strong consistency is non-negotiable (balances, inventory, idempotent billing) versus where eventual visibility is acceptable (feeds, analytics, recommendations). That choice belongs in requirements, not as an afterthought when replicas lag.
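One way to keep this decision from becoming an afterthought is to record it as data that code and reviewers can both read. The sketch below is a hypothetical registry; the domain names mirror the examples above, and `may_read_from_replica` is an illustrative helper, not part of any library.

```python
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"
    EVENTUAL = "eventual"


# Hypothetical requirements record: data domain -> agreed consistency level.
CONSISTENCY_REQUIREMENTS = {
    "account_balance": Consistency.STRONG,
    "inventory_count": Consistency.STRONG,
    "billing_charge": Consistency.STRONG,
    "activity_feed": Consistency.EVENTUAL,
    "analytics_rollup": Consistency.EVENTUAL,
    "recommendations": Consistency.EVENTUAL,
}


def may_read_from_replica(domain):
    """Allow lagging replicas only for domains explicitly marked EVENTUAL.

    Unknown domains default to STRONG, so a missing entry fails closed.
    """
    level = CONSISTENCY_REQUIREMENTS.get(domain, Consistency.STRONG)
    return level is Consistency.EVENTUAL


print(may_read_from_replica("activity_feed"))
print(may_read_from_replica("account_balance"))
```

Defaulting unlisted domains to strong consistency is a design choice in this sketch: it costs some latency but avoids silently serving stale data for something nobody classified.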

Single region simplifies consistency and operations; global or multi-region adds latency, failover, and often data residency rules. Record which regions matter and whether users must be pinned to a geography for policy reasons.

Bound the initial requirements pass so the group does not drift. Agree on success metrics (what would prove the design works in production) up front; they anchor later validation and observability design.

Related: Capacity estimation, design review checklist, Consistent data at the database layer.