Design Review Checklist
Use this page for a quick pass before a design review or request for comments (RFC). Expand any section via the linked topic page. Full term list: glossary.
1. Requirements
- Establish read versus write mix first; it drives caching, replication, and storage shape.
- Capture scale targets: daily active users (DAU), expected queries per second (QPS), and storage growth.
- Split functional needs from non-functional (latency, availability, durability).
- Agree on service level agreement (SLA) and service level objective (SLO) targets before committing to a diagram.
- Decide strong versus eventual consistency where it matters for user-visible behavior.
- Choose single region versus multi-region based on latency, failover, and data residency.
- Timebox the requirements pass; capture success metrics up front.
2. Capacity
- Estimate order-of-magnitude queries per second (QPS) from daily active users (DAU) and actions per user per day (QPS ≈ DAU × actions / 86400).
- Assume writes are roughly 1–10% of reads unless data says otherwise.
- Storage ≈ rows written per day × average row size × retention window in days.
- Bandwidth ≈ queries per second (QPS) × typical payload size.
- Show reasoning, not fake precision; identify the likely bottleneck early.
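The formulas in this section can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name, parameter names, and every input value are hypothetical, and "rows" is interpreted here as rows written per day so that the retention window multiplies a rate.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(dau, actions_per_user_per_day, write_fraction,
                      payload_bytes, rows_per_day, avg_row_bytes,
                      retention_days):
    """Return rough, order-of-magnitude estimates from the checklist formulas."""
    avg_qps = dau * actions_per_user_per_day / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,
        # Assumption: writes are a fixed fraction of total traffic.
        "write_qps": avg_qps * write_fraction,
        "bandwidth_bytes_per_s": avg_qps * payload_bytes,
        "storage_bytes": rows_per_day * avg_row_bytes * retention_days,
    }


# Hypothetical inputs, chosen only to illustrate the arithmetic.
estimate = estimate_capacity(
    dau=1_000_000,
    actions_per_user_per_day=20,
    write_fraction=0.05,
    payload_bytes=2_000,
    rows_per_day=1_000_000,
    avg_row_bytes=500,
    retention_days=365,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

With these made-up inputs the average comes out to roughly 231 QPS and about 182.5 GB of storage per year of retention. These are averages; peak traffic is usually higher, so treat the result as a starting point for finding the bottleneck rather than a sizing decision.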
3. Data model
- Structured Query Language (SQL) for relational invariants; Not only SQL (NoSQL) where write volume or schema flexibility dominates.
- Normalize first; denormalize for read-heavy, known access paths.
- Index foreign keys and real query predicates.
- Partition by time or hash when tables outgrow single-node comfort.
- Prefer soft deletes when audit trails matter.
- Pick shard keys that minimize cross-shard queries.
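As one possible illustration of hash partitioning on a shard key, the sketch below maps a user identifier to a shard number. The function name `shard_for` and the choice of SHA-256 are assumptions made for this example; the point is only that the mapping must be stable across processes, which Python's built-in `hash()` is not for strings.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shards.
    return int.from_bytes(digest[:8], "big") % num_shards


# All rows keyed by the same user land on the same shard, so per-user
# queries stay on a single shard.
print(shard_for("user-42", 8))
```

Plain modulo hashing like this reshuffles most keys when the shard count changes; schemes such as consistent hashing reduce that movement and may be worth considering if resharding is expected.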
4. API
- Default to representational state transfer (REST); justify gRPC (remote procedure calls) or GraphQL (graph query language) when the problem benefits.
- Use Hypertext Transfer Protocol (HTTP) methods consistently: POST creates, PUT replaces, PATCH partially updates; return HTTP 4xx for client errors and HTTP 5xx for server errors.
- Paginate lists; prefer cursor-based pagination at scale.
- Version application programming interfaces (APIs) from the start (for example /v1/...).
- Idempotency keys on retryable writes.
- Rate limit every public surface.
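The idempotency-key item above can be sketched as follows. This is a minimal in-memory illustration, not a production design: `ChargeService`, `create_charge`, and the field names are hypothetical, and a real service would keep the key-to-response mapping in durable storage with an expiry.

```python
class ChargeService:
    """Toy write endpoint that deduplicates retries by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # stands in for the real datastore

    def create_charge(self, idempotency_key, amount_cents):
        # A retry with a known key returns the stored response and does
        # not perform the write a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        charge = {"id": len(self.charges) + 1, "amount_cents": amount_cents}
        self.charges.append(charge)
        self._responses[idempotency_key] = charge
        return charge


service = ChargeService()
first = service.create_charge("key-abc", 500)
retry = service.create_charge("key-abc", 500)  # client retried after a timeout
print(first == retry, len(service.charges))
```

The client generates the key once per logical operation and reuses it on every retry, so a timeout followed by a retry cannot create a second charge.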
5. Scale reads
- In-memory cache before leaning on read replicas for hot keys.
- Cache-aside for most app-owned caches; set time to live (TTL); stale often beats absent.
- Content delivery network (CDN) for static and cacheable edge responses.
- Fan-out on write for read-heavy timelines and feeds.
- Materialized views (or equivalents) for expensive aggregates.
- If hit rate stays well below ~80%, revisit key design or cache placement.
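A minimal sketch of cache-aside with a TTL and a hit-rate counter, assuming a single-process in-memory dictionary as the cache. The class name `CacheAside`, the `loader` callback, and the injectable `clock` are hypothetical names introduced for this example.

```python
import time


class CacheAside:
    """Cache-aside: the application fills the cache on a miss."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # fetches from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so tests can control time
        self._entries = {}         # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: load from the backing store and refill.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate()` per cache makes the ~80% guideline above something that can be checked rather than guessed.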
6. Scale writes
- Queues absorb bursts and decouple producers from consumers.
- Batch writes to cut round trips where semantics allow.
- Asynchronous persistence for non-critical paths; avoid synchronous dual-writes across systems.
- Connection pooling at every tier that opens database (DB) connections.
- Shard by user or tenant identifier (ID) when horizontal write scale is real.
- Measure before sharding; fix the real bottleneck first.
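The batching item above might look like the following sketch, which buffers rows and hands them to a sink in groups. `BatchWriter`, `flush_fn`, and `max_batch` are hypothetical names; the sink stands in for a bulk insert or a queue publish.

```python
class BatchWriter:
    """Buffer individual writes and flush them to a sink in batches."""

    def __init__(self, flush_fn, max_batch=100):
        if max_batch <= 0:
            raise ValueError("max_batch must be positive")
        self._flush_fn = flush_fn  # called with a list of buffered rows
        self._max_batch = max_batch
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        # Swap the buffer out before calling the sink so rows added during
        # the flush go into a fresh buffer.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)


batches = []
writer = BatchWriter(batches.append, max_batch=3)
for row in range(7):
    writer.add(row)
writer.flush()  # flush the remainder, for example on shutdown or a timer
print(batches)
```

Rows sitting in the buffer are lost if the process crashes before a flush, which is why batching suits paths where the semantics allow some delay or loss, as the list notes.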
7. Cache
- Cache reads, not writes, as the default pattern.
- Cache-aside: app fills on miss; write-through: higher consistency cost, more latency on write path.
- Least recently used (LRU) eviction is a common default; tune per workload.
- Mitigate cache stampede (probabilistic early expiration, locking, or single-flight).
- Mitigate thundering herd (warm paths, staggered time to live (TTL) expirations).
- Redis Cluster (or similar) when a single cache node is not enough.
Note: this section also overlaps with read scaling (section 5).
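Staggered TTL expirations, mentioned in the list above, can be approximated by adding random jitter to a base TTL so that keys written together do not all expire together. The helper name `jittered_ttl` and the default 10% spread are assumptions for this sketch.

```python
import random


def jittered_ttl(base_seconds, spread=0.1, rng=random):
    """Return base_seconds scaled by a random factor in [1 - spread, 1 + spread]."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    return base_seconds * (1 + rng.uniform(-spread, spread))


# Keys cached in the same burst get slightly different lifetimes.
ttls = [jittered_ttl(300) for _ in range(5)]
print([round(t, 1) for t in ttls])
```

Jitter spreads out expirations but does not stop many callers from rebuilding the same hot key at once; that case is what locking or single-flight addresses.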
8. Consistency
- Strong: reads see latest committed writes; eventual: replicas may lag visibility.
- Prefer sagas and compensations over two-phase commit for distributed transactions at scale.
- Idempotency on consumers to survive duplicates.
- Optimistic locking when conflicts are rare.
- Explain CAP (consistency, availability, partition tolerance) and PACELC (partition / latency tradeoffs beyond CAP) choices per operation, not as a one-line slogan for the whole system.
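Optimistic locking, as listed above, is commonly implemented with a version number that every write must match. The sketch below uses an in-memory dictionary; `VersionedStore` and `VersionConflict` are hypothetical names, and in a relational database the same check would typically be a conditional update on a version column.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Missing keys read as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._rows[key] = (new_version, value)
        return new_version


store = VersionedStore()
version, _ = store.read("profile:1")
store.write("profile:1", version, {"name": "Ada"})       # succeeds
try:
    store.write("profile:1", version, {"name": "Grace"})  # stale version
except VersionConflict as exc:
    print("conflict:", exc)
```

On a conflict the caller re-reads, reapplies its change, and tries again, which is cheap when conflicts are rare and wasteful when they are not.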
9. Failures
- Assume every dependency fails; specify behavior when it does.
- Retries with backoff and jitter; cap total attempts.
- Circuit breakers to stop cascade overload.
- Degrade features instead of hard failing the whole product when possible.
- Serve stale reads when the primary data path is down and the product allows it.
- Blue/green and similar patterns for safer releases; chaos-style testing in controlled environments.
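Retries with backoff, jitter, and a cap on attempts can be sketched as below. The function name `retry_with_backoff`, its defaults, and the "full jitter" choice (sleep a random amount between zero and the exponential ceiling) are assumptions for this example rather than a prescribed policy.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random):
    """Call fn until it succeeds or max_attempts is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            # Exponential ceiling, capped, with full jitter underneath it.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng.uniform(0, ceiling))
```

In practice the `except` clause would be narrowed to errors known to be transient, and retried writes should carry an idempotency key so a duplicate attempt is harmless.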
10. Observe
- Structured logs (for example JavaScript Object Notation (JSON) fields), not unstructured printf-only strings where you need queries.
- Metrics: favor percentiles (95th percentile (p95), 99th percentile (p99)) over simple averages for latency.
- Traces across service boundaries for latency and dependency insight.
- Alert on user-visible symptoms first; alert on low-level causes only when they predict user impact.
- Dashboards: queries per second (QPS), errors, latency, and saturation, framed with RED (rate, errors, duration) and USE (utilization, saturation, errors).
- Postmortems and documented follow-ups close the feedback loop after incidents.
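To see why the list above favors percentiles over averages for latency, the sketch below compares the two on a made-up sample in which 5% of requests are slow. The `percentile` helper uses the nearest-rank method, which is one of several common definitions, and the latency values are hypothetical.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests, 5 slow ones.
latencies_ms = [10] * 95 + [1000] * 5
print("mean:", statistics.mean(latencies_ms))
print("p95:", percentile(latencies_ms, 95))
print("p99:", percentile(latencies_ms, 99))
```

Here the mean lands between the two groups and describes neither, while p99 reports the full one-second latency that the slowest users actually experience.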
11. Security and rollout
- Trust boundaries, authentication (authn) / authorization (authz), secrets handling, encryption in transit and at rest.
- Safe deployment and rollback; align with continuous integration and continuous delivery (CI/CD) guardrails.
- Write down tradeoffs and rejected alternatives so future readers understand the constraint surface.