
Design Review Checklist

First Published By Atif Alam

Use this page for a quick pass before a design review or request for comments (RFC). Expand any section via the linked topic page. Full term list: glossary.


  • Establish read versus write mix first; it drives caching, replication, and storage shape.
  • Capture scale targets: daily active users (DAU), expected queries per second (QPS), and storage growth.
  • Split functional needs from non-functional (latency, availability, durability).
  • Agree on service level agreement (SLA) and service level objective (SLO) targets before committing to a diagram.
  • Decide strong versus eventual consistency where it matters for user-visible behavior.
  • Single region versus multi-region (latency, failover, data residency).
  • Timebox the requirements pass; capture success metrics up front.

Details →


  • Order-of-magnitude queries per second (QPS) from daily active users (DAU) and actions per user per day (QPS ≈ DAU × actions / 86400).
  • Assume writes are roughly 1–10% of reads unless data says otherwise.
  • Storage ≈ rows × average row size × retention window.
  • Bandwidth ≈ queries per second (QPS) × typical payload size.
  • Show reasoning, not fake precision; identify the likely bottleneck early.
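
The formulas above can be sketched as a small calculator. Every input below (1 million DAU, 20 actions per user per day, a 10% write share, 500-byte rows, 2 KB payloads, one year of retention) is a made-up illustrative assumption, not a figure from this checklist. Storage is interpreted here as rows written per day × average row size × retention days.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau, actions_per_user_per_day):
    """Average QPS ≈ DAU × actions / 86400 (peak is usually higher)."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def estimate_storage_bytes(rows_per_day, avg_row_bytes, retention_days):
    """Storage ≈ rows × average row size × retention window."""
    return rows_per_day * avg_row_bytes * retention_days


def estimate_bandwidth_bytes_per_sec(qps, payload_bytes):
    """Bandwidth ≈ QPS × typical payload size."""
    return qps * payload_bytes


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the arithmetic.
    dau = 1_000_000
    actions = 20
    write_share = 0.10

    total_qps = estimate_qps(dau, actions)
    rows_per_day = int(dau * actions * write_share)
    storage = estimate_storage_bytes(rows_per_day, 500, 365)
    bandwidth = estimate_bandwidth_bytes_per_sec(total_qps, 2_000)

    # Round aggressively: the goal is an order of magnitude, not precision.
    print(f"average QPS   ~{total_qps:,.0f}")
    print(f"write QPS     ~{total_qps * write_share:,.0f}")
    print(f"storage/year  ~{storage / 1e9:,.0f} GB")
    print(f"bandwidth     ~{bandwidth / 1e6:,.2f} MB/s")
```

With numbers this small, a single database node is unlikely to be the bottleneck, which is the kind of conclusion the estimate is meant to surface early.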

Details →


  • Structured Query Language (SQL) for relational invariants; Not only SQL (NoSQL) where write volume or schema flexibility dominates.
  • Normalize first; denormalize for read-heavy, known access paths.
  • Index foreign keys and real query predicates.
  • Partition by time or hash when tables outgrow single-node comfort.
  • Prefer soft deletes when audit trails matter.
  • Pick shard keys that minimize cross-shard queries.

Details →


  • Default to representational state transfer (REST); justify gRPC (remote procedure calls) or GraphQL (graph query language) when the problem benefits.
  • Hypertext Transfer Protocol (HTTP) methods: POST creates, PUT replaces, PATCH partially updates. Status codes: HTTP 4xx for client errors, HTTP 5xx for server errors.
  • Paginate lists; prefer cursor-based pagination at scale.
  • Version application programming interfaces (APIs) from the start (for example /v1/...).
  • Idempotency keys on retryable writes.
  • Rate limit every public surface.
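
As one way to picture idempotency keys on retryable writes, here is a minimal in-memory sketch. `OrderService` and `create_order` are hypothetical names invented for illustration; a real service would more likely persist keys in a durable store with an expiry and also guard against concurrent requests carrying the same key.

```python
import itertools
from typing import Dict


class OrderService:
    """Hypothetical write endpoint that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._by_key: Dict[str, dict] = {}   # idempotency key -> stored response
        self.orders: Dict[int, dict] = {}    # order id -> order

    def create_order(self, idempotency_key: str, item: str) -> dict:
        # A retry with the same key returns the original response
        # instead of creating a second order.
        previous = self._by_key.get(idempotency_key)
        if previous is not None:
            return previous

        order = {"id": next(self._next_id), "item": item}
        self.orders[order["id"]] = order
        self._by_key[idempotency_key] = order
        return order


if __name__ == "__main__":
    service = OrderService()
    first = service.create_order("key-123", "book")
    retry = service.create_order("key-123", "book")  # e.g. client timed out and retried
    print(first == retry, len(service.orders))
```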

Details →


  • In-memory cache before leaning on read replicas for hot keys.
  • Cache-aside for most app-owned caches; set time to live (TTL); stale often beats absent.
  • Content delivery network (CDN) for static and cacheable edge responses.
  • Fan-out on write for read-heavy timelines and feeds.
  • Materialized views (or equivalents) for expensive aggregates.
  • If hit rate stays well below ~80%, revisit key design or cache placement.
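
A minimal sketch of cache-aside with a TTL and a hit-rate counter, assuming a single process and a caller-supplied `loader` function (a stand-in for the database read). The class name `CacheAside` and the injectable `clock` are illustrative choices, not part of any particular cache library.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """App-owned cache: read the cache first, fill from the loader on a miss."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: load from the source of truth and refill.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read refills from the source."""
        self._store.pop(key, None)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate()` per key family is one way to notice the low-hit-rate situation described in the last bullet.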

Details →


  • Queues absorb bursts and decouple producers from consumers.
  • Batch writes to cut round trips where semantics allow.
  • Asynchronous persistence for non-critical paths; avoid synchronous dual-writes across systems.
  • Connection pooling at every tier that opens database (DB) connections.
  • Shard by user or tenant identifier (ID) when horizontal write scale is real.
  • Measure before sharding; fix the real bottleneck first.
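
If measurement does point to sharding, routing by user or tenant ID can be sketched as below. The function name `shard_for` and the choice of SHA-256 with a modulo are assumptions made for illustration. A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib


def shard_for(tenant_id: str, num_shards: int) -> int:
    """Map a tenant or user ID to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce by modulo.
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for tenant in ("tenant-a", "tenant-b", "tenant-c"):
        print(tenant, "->", shard_for(tenant, 16))
```

Plain modulo routing remaps most keys when the shard count changes, so consistent hashing or a lookup directory are common alternatives when resharding is expected.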

Details →


  • Cache reads, not writes, as the default pattern.
  • Cache-aside: app fills on miss; write-through: higher consistency cost, more latency on write path.
  • Least recently used (LRU) eviction is a common default; tune per workload.
  • Mitigate cache stampede (probabilistic early expiration, locking, or single-flight).
  • Mitigate thundering herd (warm paths, staggered time to live (TTL) expirations).
  • Redis Cluster (or similar) when a single cache node is not enough.
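
Of the stampede mitigations listed above, single-flight is the easiest to show in a few lines. This is a sketch for one process using threads. `SingleFlight`, `do`, and `waiters` are names chosen here for illustration, not an existing library API. Concurrent callers for the same key share one load instead of each hitting the backing store.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared by the leader and the followers of one in-flight load."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None
        self.waiters = 0


class SingleFlight:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, fn: Callable[[], Any]) -> Any:
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
            else:
                call.waiters += 1

        if leader:
            # Only the first caller runs the expensive load.
            try:
                call.result = fn()
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            # Followers block until the leader finishes, then share its outcome.
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result

    def waiters(self, key: str) -> int:
        """Number of followers currently waiting on this key."""
        with self._lock:
            call = self._inflight.get(key)
            return call.waiters if call else 0
```

Across multiple processes or hosts the same idea needs a shared lock or lease, for example in the cache itself.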

Details → (also overlaps read scaling)


  • Strong: reads see the latest committed writes; eventual: replicas may lag, so reads can briefly return stale data.
  • Prefer sagas and compensations over two-phase commit for distributed transactions at scale.
  • Idempotency on consumers to survive duplicates.
  • Optimistic locking when conflicts are rare.
  • Explain CAP (consistency, availability, partition tolerance) and PACELC (partition / latency tradeoffs beyond CAP) choices per operation, not as a one-line slogan for the whole system.
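
Optimistic locking can be sketched as a version check at write time. `VersionedStore` and `VersionConflict` are hypothetical names, and this in-memory version is not thread-safe. A database would typically perform the compare and the update atomically, for example with an `UPDATE ... WHERE version = ?` statement.

```python
from typing import Any, Dict, Tuple


class VersionConflict(Exception):
    """Raised when the row changed since the caller read it."""


class VersionedStore:
    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[Any, int]] = {}  # key -> (value, version)

    def read(self, key: str) -> Tuple[Any, int]:
        # Missing rows are reported as version 0.
        return self._rows.get(key, (None, 0))

    def write(self, key: str, value: Any, expected_version: int) -> int:
        _, current = self._rows.get(key, (None, 0))
        if current != expected_version:
            # Someone else wrote first; the caller should re-read and retry.
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current}"
            )
        self._rows[key] = (value, current + 1)
        return current + 1


if __name__ == "__main__":
    store = VersionedStore()
    _, version = store.read("cart:1")
    store.write("cart:1", ["book"], expected_version=version)
    try:
        # A second writer still holding the old version is rejected.
        store.write("cart:1", ["pen"], expected_version=version)
    except VersionConflict as exc:
        print("conflict:", exc)
```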

Details →


  • Assume every dependency fails; specify behavior when it does.
  • Retries with backoff and jitter; cap total attempts.
  • Circuit breakers to stop cascade overload.
  • Degrade features instead of hard failing the whole product when possible.
  • Serve stale reads when the primary data path is down and the product allows it.
  • Blue/green and similar patterns for safer releases; chaos-style testing in controlled environments.
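
The retry bullet above can be sketched as capped exponential backoff with full jitter. The helper name `retry`, its defaults, and the injectable `sleep` and `rng` parameters are illustrative assumptions. Only retry operations that are safe to repeat.

```python
import random
import time
from typing import Any, Callable, Tuple, Type


def retry(fn: Callable[[], Any],
          max_attempts: int = 4,
          base: float = 0.1,
          cap: float = 5.0,
          retry_on: Tuple[Type[BaseException], ...] = (Exception,),
          sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> Any:
    """Call fn, retrying on failure with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # total attempts are capped; surface the last error
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
            sleep(rng() * min(cap, base * 2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky), "after", state["calls"], "calls")
```

Jitter spreads retries from many clients over time, so a recovering dependency is not hit by synchronized waves. A circuit breaker would sit around this helper to stop calling altogether while the dependency is known to be down.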

Details →


  • Structured logs (for example JavaScript Object Notation (JSON) fields) rather than unstructured printf-style strings wherever you need to query them.
  • Metrics: favor percentiles (95th percentile (p95), 99th percentile (p99)) over simple averages for latency.
  • Traces across service boundaries for latency and dependency insight.
  • Alert on user-visible symptoms first; alert on low-level causes only when they predict user impact.
  • Dashboards: queries per second (QPS), errors, latency, saturation — RED (rate, errors, duration) and USE (utilization, saturation, errors) framing.
  • Postmortems and documented follow-ups close the feedback loop after incidents.
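
To illustrate why percentiles are favored over averages for latency, here is a small sketch using the nearest-rank method. The latency samples are invented for illustration, and production metrics systems usually compute percentiles from histograms or sketches instead of raw sorted lists.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: most requests are fast, a few are very slow.
    latencies_ms = [10] * 94 + [2000] * 6

    print("mean:", statistics.mean(latencies_ms))
    print("p50: ", percentile(latencies_ms, 50))
    print("p95: ", percentile(latencies_ms, 95))
    print("p99: ", percentile(latencies_ms, 99))
```

In this sample the mean lands between the two clusters and describes no real request, while p50 and p95 show that typical requests are fast and that a noticeable tail is very slow.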

Details →


  • Trust boundaries, authentication (authn) / authorization (authz), secrets handling, encryption in transit and at rest.
  • Safe deployment and rollback; align with continuous integration and continuous delivery (CI/CD) guardrails.
  • Write down tradeoffs and rejected alternatives so future readers understand the constraint surface.

Details →