Fault Tolerance
Production systems fail partially far more often than they fail totally. Fault-tolerance sections of a design should read like explicit failure contracts: what callers experience when each dependency disappears.
Design for Failure
Assume timeouts, brownouts, crashes, and bad deploys. State the fallback behavior for each: degraded UI, cached answers, queued work, or hard errors with clear remediation.
Retries With Backoff and Jitter
Transient failures merit bounded retries with exponential backoff plus full jitter to avoid synchronized retry storms. Cap retry attempts at both clients and intermediaries so that retries do not multiply across layers. Non-idempotent calls such as POST need idempotency keys before they can be retried safely.
See API design.
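The retry policy described in this section can be sketched as follows. This is a minimal illustration, not a production client: `TransientError`, `backoff_delay`, `call_with_retries`, and the default values for `base`, `cap`, and `max_attempts` are all hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(op, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run op(), retrying only TransientError, at most max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            sleep(backoff_delay(attempt, base, cap, rng))
```

Only errors explicitly classified as transient are retried, and the attempt cap bounds the extra load a single caller can add. For a non-idempotent operation, `op` would need to send the same idempotency key on every attempt.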
Circuit Breakers
After repeated failures to a downstream, open the circuit: fail fast briefly while it recovers. This stops cascading overload across services sharing thread pools. Half-open probing re-admits trial traffic gradually.
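A minimal, single-threaded sketch of the closed, open, and half-open states might look like the following. The class name, `failure_threshold`, and `reset_timeout` are assumptions made for illustration. A real breaker would also need locking and a limit on concurrent probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of tying up a thread on a dead call.
                raise CircuitOpenError("downstream marked unhealthy")
            # Timeout elapsed: half-open, let this call through as a probe.
        try:
            result = op()
        except Exception:
            self.failures += 1
            probe_failed = self.opened_at is not None
            if probe_failed or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and switch to whatever fallback the design specifies for that dependency.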
Graceful Degradation
Prefer keeping a subset of functionality alive (read-only catalog, stale recommendations) over returning 503 for everything. Document product expectations for degraded modes.
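As a small illustration of keeping a subset of functionality alive, the sketch below treats the catalog as required and recommendations as optional. The function name, the two fetcher parameters, and the page shape are hypothetical.

```python
def build_product_page(product_id, fetch_catalog, fetch_recommendations):
    """Assemble a page; only the catalog lookup is allowed to fail the request."""
    page = {"product": fetch_catalog(product_id), "degraded": []}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # Optional dependency is down: render without it and record the fact.
        page["recommendations"] = []
        page["degraded"].append("recommendations")
    return page
```

Recording which parts were degraded lets the UI and the dashboards show that the page was served in a reduced mode rather than hiding the failure.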
Stale Reads When the Database Path Is Impaired
If read replicas, cache, or static bundles stay available when the authoritative writer path is degraded, carefully allow explicitly stale reads rather than catastrophic failure. Disclose staleness bounds internally and audit whether money-critical paths bypass this mode.
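One way to make stale reads explicit and bounded is sketched below. It assumes an in-process last-known-good store. `StaleReadCache`, `max_staleness`, and `allow_stale` are illustrative names, and the response shape is an assumption.

```python
import time


class StaleReadCache:
    """Keeps the last good value per key and serves it, flagged, on failure."""

    def __init__(self, max_staleness=300.0, clock=time.monotonic):
        self.max_staleness = max_staleness
        self.clock = clock
        self._store = {}  # key -> (value, fetched_at)

    def read(self, key, fetch_primary, allow_stale=True):
        try:
            value = fetch_primary(key)
        except Exception:
            cached = self._store.get(key)
            if not allow_stale or cached is None:
                raise  # nothing safe to serve: propagate the primary failure
            value, fetched_at = cached
            age = self.clock() - fetched_at
            if age > self.max_staleness:
                raise  # older than the disclosed bound: refuse to serve it
            return {"value": value, "stale": True, "age_seconds": age}
        self._store[key] = (value, self.clock())
        return {"value": value, "stale": False, "age_seconds": 0.0}
```

A money-critical caller would pass `allow_stale=False` so that it fails loudly instead of acting on old data, and the `stale` flag gives every other caller a way to disclose the age of what it shows.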
Safer Releases and Blue/Green Patterns
Techniques such as blue/green deployments (two environments, switched traffic), canary releases, and feature flags reduce blast radius. Pair with rollback automation and observable health gates.
Operational detail lives in Deployment strategies alongside pipeline design.
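To make "observable health gate" concrete, here is a hedged sketch of a canary decision based on error rates. The function name, the thresholds (`min_requests`, `max_ratio`, `noise_floor`, `max_abs_rate`), and the three return values are assumptions for illustration, not a recommended policy.

```python
def canary_gate(baseline_errors, baseline_total, canary_errors, canary_total,
                min_requests=500, max_ratio=1.5, noise_floor=0.002,
                max_abs_rate=0.05):
    """Return "wait", "promote", or "rollback" from request and error counts."""
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    canary_rate = canary_errors / canary_total
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    # Allow some multiple of the baseline, never less than a small noise floor
    # and never more than an absolute ceiling.
    allowed = min(max_abs_rate, max(baseline_rate * max_ratio, noise_floor))
    return "rollback" if canary_rate > allowed else "promote"
```

Rollback automation would call a gate like this on a schedule during the rollout and revert traffic as soon as it returns `"rollback"`.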
Chaos and Game Days
Controlled fault injection in non-production (game days) surfaces missing timeouts, backoff bugs, and mis-sized pools before customers find them. If you rarely exercise failure drills, outages become the rehearsal.
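For a sense of what controlled fault injection can look like at the code level, the sketch below wraps a dependency call with configurable latency and errors. `with_faults`, `InjectedFault`, and the parameters are hypothetical, and the wrapper is meant for non-production drills only.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the game-day wrapper in place of a real downstream error."""


def with_faults(op, error_rate=0.0, extra_latency=0.0,
                rng=random.random, sleep=time.sleep):
    """Wrap op so a fraction of calls fail and every call is slowed down."""
    def wrapped(*args, **kwargs):
        if extra_latency > 0:
            sleep(extra_latency)  # simulate a brownout
        if rng() < error_rate:
            raise InjectedFault("injected by game-day harness")
        return op(*args, **kwargs)
    return wrapped
```

Running existing callers against a wrapped dependency shows whether their timeouts, retries, and fallbacks behave as the design claims.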
List the top five dependencies whose failure deserves a scripted response in runbooks.
Related: Observability for systems, Incident response and on-call, QA.