Design Review Checklist

First published May 5, 2026 by Atif Alam

Use this page for a quick pass before a design review or request for comments (RFC). Expand any section via the linked topic page. Full term list: glossary.

1. Requirements

Establish read versus write mix first; it drives caching, replication, and storage shape.
Capture scale targets: daily active users (DAU), expected queries per second (QPS), and storage growth.
Split functional needs from non-functional (latency, availability, durability).
Agree on service level agreement (SLA) and service level objective (SLO) targets before committing to a diagram.
Decide strong versus eventual consistency where it matters for user-visible behavior.
Choose single region versus multi-region based on latency, failover, and data residency.
Timebox the requirements pass; capture success metrics up front.

Details →

2. Capacity

Estimate order-of-magnitude queries per second (QPS) from daily active users (DAU) and actions per user per day (QPS ≈ DAU × actions / 86400, where 86400 is seconds per day).
Assume writes are roughly 1–10% of reads unless data says otherwise.
Storage ≈ rows written per unit time × average row size × retention window.
Bandwidth ≈ queries per second (QPS) × typical payload size.
Show reasoning, not fake precision; identify the likely bottleneck early.
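
The estimates above can be sketched as a few lines of arithmetic. This is a minimal sketch, not a sizing tool: the function names and every input value (1M DAU, 20 actions per user, 10% writes, 1 KB rows, 2 KB payloads, one year of retention) are hypothetical placeholders, and it assumes every action is a read.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, actions_per_user_per_day: float) -> float:
    """QPS ~= DAU x actions / 86400 (average, not peak)."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def estimate_storage_bytes(rows_per_day: float, avg_row_bytes: int, retention_days: int) -> float:
    """Storage ~= rows written per day x average row size x retention window in days."""
    return rows_per_day * avg_row_bytes * retention_days


def estimate_bandwidth_bytes_per_s(qps: float, payload_bytes: int) -> float:
    """Bandwidth ~= QPS x typical payload size."""
    return qps * payload_bytes


# Hypothetical inputs; replace with your own numbers.
dau = 1_000_000
actions_per_user_per_day = 20
write_fraction = 0.10  # writes assumed to be 10% of reads

read_qps = estimate_qps(dau, actions_per_user_per_day)
write_qps = read_qps * write_fraction
rows_per_day = dau * actions_per_user_per_day * write_fraction
storage_bytes = estimate_storage_bytes(rows_per_day, avg_row_bytes=1_000, retention_days=365)
bandwidth = estimate_bandwidth_bytes_per_s(read_qps, payload_bytes=2_000)

print(f"read QPS  ~ {read_qps:,.0f}")
print(f"write QPS ~ {write_qps:,.0f}")
print(f"storage   ~ {storage_bytes / 1e9:,.0f} GB per year of retention")
print(f"bandwidth ~ {bandwidth / 1e6:,.2f} MB/s")
```

These are daily averages; a review would usually also ask for a peak multiplier, which this sketch leaves out.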

Details →

3. Data model

Structured Query Language (SQL) for relational invariants; Not only SQL (NoSQL) where write volume or schema flexibility dominates.
Normalize first; denormalize for read-heavy, known access paths.
Index foreign keys and real query predicates.
Partition by time or hash when tables outgrow single-node comfort.
Prefer soft deletes when audit trails matter.
Pick shard keys that minimize cross-shard queries.
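
As one illustration of hash partitioning with a shard key, the sketch below maps a key to a shard with a stable hash. The function name `shard_for`, the key format, and the shard count are assumptions made for the example.

```python
import hashlib


def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and so is not stable across restarts.
    """
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical usage: all rows keyed by the same user land on one shard,
# so per-user queries never cross shards.
NUM_SHARDS = 8
print(shard_for("user:42", NUM_SHARDS))
print(shard_for("user:42", NUM_SHARDS) == shard_for("user:42", NUM_SHARDS))
```

Plain modulo hashing moves most keys when the shard count changes, so a design that expects to reshard may want consistent hashing or a lookup table instead.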

Details →

4. API

Default to representational state transfer (REST); justify gRPC (remote procedure calls) or GraphQL (graph query language) when the problem benefits.
Use Hypertext Transfer Protocol (HTTP) methods consistently: POST creates, PUT replaces, PATCH partially updates; return HTTP 4xx for client errors and HTTP 5xx for server errors.
Paginate lists; prefer cursor-based pagination at scale.
Version application programming interfaces (APIs) from the start (for example /v1/...).
Idempotency keys on retryable writes.
Rate limit every public surface.
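
Cursor-based pagination can be sketched as follows. This is an in-memory illustration only: `ITEMS`, `list_items`, and the cursor encoding are hypothetical, and a real endpoint would run the equivalent keyset query against its database.

```python
import base64
import json
from typing import Optional

# Hypothetical data set, already ordered by a unique, increasing id.
ITEMS = [{"id": i, "name": f"item-{i}"} for i in range(1, 8)]


def encode_cursor(last_id: int) -> str:
    """Wrap the position in an opaque token so clients do not depend on its shape."""
    raw = json.dumps({"last_id": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))["last_id"]


def list_items(cursor: Optional[str] = None, limit: int = 3) -> dict:
    last_id = decode_cursor(cursor) if cursor else 0
    # Roughly: SELECT ... WHERE id > :last_id ORDER BY id LIMIT :limit + 1
    rows = [row for row in ITEMS if row["id"] > last_id][: limit + 1]
    page = rows[:limit]
    has_more = len(rows) > limit  # the extra row only signals another page
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"items": page, "next_cursor": next_cursor}


# Walk every page until the server stops returning a cursor.
cursor = None
while True:
    response = list_items(cursor)
    print([item["id"] for item in response["items"]])
    cursor = response["next_cursor"]
    if cursor is None:
        break
```

Unlike offset pagination, the query cost does not grow with page depth, and rows inserted during a walk do not shift later pages.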

Details →

5. Scale reads

In-memory cache before leaning on read replicas for hot keys.
Cache-aside for most app-owned caches; set time to live (TTL); stale often beats absent.
Content delivery network (CDN) for static and cacheable edge responses.
Fan-out on write for read-heavy timelines and feeds.
Materialized views (or equivalents) for expensive aggregates.
If hit rate stays well below ~80%, revisit key design or cache placement.
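
A minimal cache-aside sketch with a TTL and a hit-rate counter is shown below. The `CacheAside` class and its `loader` callback are hypothetical names; a production cache would normally be Redis, Memcached, or similar rather than a process-local dict.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """App-owned cache: read the cache first, fall back to the loader on a miss."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # e.g. a database read (assumed)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._store: Dict[str, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired: load from the source of truth and fill the cache.
        self.misses += 1
        value = self._loader(key)
        self._store[key] = (now + self._ttl, value)
        return value

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CacheAside(loader=lambda key: key.upper(), ttl_seconds=60)
for key in ["a", "a", "b", "a"]:
    cache.get(key)
print(f"hit rate: {cache.hit_rate():.0%}")
```

Tracking hits and misses next to the cache makes the hit-rate check a measurement rather than a guess.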

Details →

6. Scale writes

Queues absorb bursts and decouple producers from consumers.
Batch writes to cut round trips where semantics allow.
Asynchronous persistence for non-critical paths; avoid synchronous dual-writes across systems.
Connection pooling at every tier that opens database (DB) connections.
Shard by user or tenant identifier (ID) when horizontal write scale is real.
Measure before sharding; fix the real bottleneck first.
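
The queue-plus-batching idea can be sketched with the standard library. `drain_in_batches` and `write_batch` are hypothetical names; `write_batch` stands in for whatever bulk insert the datastore offers.

```python
import queue
from typing import Callable, List


def drain_in_batches(q: queue.Queue, write_batch: Callable[[List[dict]], None],
                     max_batch: int = 100) -> int:
    """Pull queued writes and flush them in batches; returns the number of batches written."""
    batches = 0
    while True:
        batch: List[dict] = []
        while len(batch) < max_batch:
            try:
                batch.append(q.get_nowait())
            except queue.Empty:
                break
        if not batch:
            return batches
        write_batch(batch)  # one round trip for up to max_batch rows
        batches += 1


# Producers enqueue quickly; the consumer writes in bulk.
events: queue.Queue = queue.Queue()
for i in range(250):
    events.put({"event_id": i})

written: List[List[dict]] = []
print(drain_in_batches(events, written.append, max_batch=100))
print([len(batch) for batch in written])
```

A long-running consumer would block on the queue and flush on a timer as well, so a partial batch does not wait indefinitely.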

Details →

7. Cache

Cache reads, not writes, as the default pattern.
Cache-aside: app fills on miss; write-through: higher consistency cost, more latency on write path.
Least recently used (LRU) eviction is a common default; tune per workload.
Mitigate cache stampede (probabilistic early expiration, locking, or single-flight).
Mitigate thundering herd (warm paths, staggered time to live (TTL) expirations).
Redis Cluster (or similar) when a single cache node is not enough.
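
Staggered TTL expirations can be as simple as adding random jitter to a base TTL, so keys written together do not all expire together. The helper name `jittered_ttl` and the 10% spread are assumptions for illustration.

```python
import random


def jittered_ttl(base_ttl_s: float, spread: float = 0.1, rng=random) -> float:
    """Return base_ttl_s adjusted by a random factor in [-spread, +spread]."""
    return base_ttl_s * (1 + rng.uniform(-spread, spread))


# Hypothetical usage: 5 keys cached in the same instant get different lifetimes.
rng = random.Random(7)  # seeded only so the example is repeatable
for key in ["k1", "k2", "k3", "k4", "k5"]:
    print(key, round(jittered_ttl(300, spread=0.1, rng=rng), 1))
```

Jitter spreads expirations out; it does not stop many callers from reloading one hot key at once, which is what locking or single-flight addresses.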

Details → (also overlaps read scaling)

8. Consistency

Strong: reads see the latest committed writes; eventual: replicas may lag, so reads can be briefly stale.
Prefer sagas and compensations over two-phase commit for distributed transactions at scale.
Idempotency on consumers to survive duplicates.
Optimistic locking when conflicts are rare.
Explain CAP (consistency, availability, partition tolerance) and PACELC (partition / latency tradeoffs beyond CAP) choices per operation, not as a one-line slogan for the whole system.
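
Optimistic locking is commonly implemented with a version column that every update checks and increments. The sketch below uses an in-memory SQLite database; the `account` table, its columns, and `update_balance` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100, 0)")
conn.commit()


def read_account(conn: sqlite3.Connection, account_id: int) -> tuple:
    """Return (balance, version) as seen by this reader."""
    return conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()


def update_balance(conn: sqlite3.Connection, account_id: int,
                   new_balance: int, expected_version: int) -> bool:
    """Apply the write only if nobody changed the row since it was read."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means a conflicting write won


# Two writers read the same version, then both try to write.
balance_a, version_a = read_account(conn, 1)
balance_b, version_b = read_account(conn, 1)
print(update_balance(conn, 1, balance_a - 30, version_a))  # first writer
print(update_balance(conn, 1, balance_b - 50, version_b))  # second writer, stale version
print(read_account(conn, 1))
```

The losing writer is expected to re-read and retry or surface the conflict; that stays cheap only while conflicts are rare.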

Details →

9. Failures

Assume every dependency fails; specify behavior when it does.
Retries with backoff and jitter; cap total attempts.
Circuit breakers to stop cascade overload.
Degrade features instead of hard failing the whole product when possible.
Serve stale reads when the primary data path is down and the product allows it.
Blue/green and similar patterns for safer releases; chaos-style testing in controlled environments.
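
Retries with exponential backoff, jitter, and a capped attempt count might look like the sketch below. The function name `retry` and its defaults are assumptions, and it uses "full jitter" (a random delay between zero and the backoff ceiling) as one common variant.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], max_attempts: int = 5, base_delay_s: float = 0.1,
          max_delay_s: float = 5.0, sleep: Callable[[float], None] = time.sleep,
          rng=random) -> T:
    """Call fn until it succeeds or max_attempts is reached, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Sketch only: real code should retry only errors known to be transient.
            if attempt == max_attempts:
                raise
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng.uniform(0, ceiling))  # full jitter spreads retries out
    raise RuntimeError("unreachable when max_attempts >= 1")


# Hypothetical flaky dependency: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(retry(flaky_call, sleep=lambda seconds: None))  # sleep stubbed for the demo
print(attempts["count"])
```

Retries are only safe for idempotent operations, and they pair with a circuit breaker so a failing dependency is not hit with even more traffic.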

Details →

10. Observe

Structured logs (for example JavaScript Object Notation (JSON) fields), not unstructured printf-only strings where you need queries.
Metrics: favor percentiles (95th percentile (p95), 99th percentile (p99)) over simple averages for latency.
Traces across service boundaries for latency and dependency insight.
Alert on user-visible symptoms first; alert on low-level causes only when they predict impact.
Dashboards: queries per second (QPS), errors, latency, and saturation, framed with RED (rate, errors, duration) and USE (utilization, saturation, errors).
Postmortems and documented follow-ups close the feedback loop after incidents.
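
The case for percentiles over averages shows up in a small, made-up latency sample. The `percentile` helper uses the nearest-rank definition, which is one of several in use; metrics systems typically estimate percentiles from histograms instead.

```python
import math
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones, in milliseconds.
latencies_ms = [20] * 90 + [2000] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms} ms")
print(f"p50  = {percentile(latencies_ms, 50)} ms")
print(f"p95  = {percentile(latencies_ms, 95)} ms")
print(f"p99  = {percentile(latencies_ms, 99)} ms")
```

In this sample the mean (218 ms) describes no real request: half of requests take 20 ms, while p95 and p99 expose the 2000 ms tail that one in ten users hits.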

Details →

11. Security and rollout

Trust boundaries, authentication (authn) / authorization (authz), secrets handling, encryption in transit and at rest.
Safe deployment and rollback; align with continuous integration and continuous delivery (CI/CD) guardrails.
Write down tradeoffs and rejected alternatives so future readers understand the constraint surface.

Details →