Fault Tolerance

First published May 5, 2026, by Atif Alam

Production systems fail partially far more often than they fail totally. Fault-tolerance sections of a design should read like explicit failure contracts: what callers experience when each dependency disappears.

Design for Failure

Assume timeouts, brownouts, crashes, and bad deploys. For each one, state the fallback behavior: degraded UI, cached answers, queued work, or hard errors with clear remediation.

Retries With Backoff and Jitter

Transient failures merit bounded retries with exponential backoff plus full jitter to avoid synchronized retry storms. Cap retry attempts at clients and intermediaries. Non-idempotent POST calls need idempotency keys before blind retries.

See API design.
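
The retry policy above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the defaults (4 attempts, 100 ms base, 5 s cap) are assumptions chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Run operation(), retrying TransientError for at most max_attempts total tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap))
```

Only wrap calls that are idempotent or that carry an idempotency key; the sketch retries blindly and cannot tell whether a failed request was already applied downstream.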

Circuit Breakers

After repeated failures to a downstream, open the circuit: fail fast for a short period while it recovers. This stops overload from cascading across services that share thread pools. Half-open probing then re-admits trial traffic gradually.
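
A minimal sketch of the closed, open, and half-open states, assuming a single-threaded caller. The class name, the threshold of 3 consecutive failures, and the 30-second cool-down are illustrative assumptions. The half-open state here simply lets the next call through as a probe, so a real breaker would also need locking and a limit on concurrent probes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast while downstream recovers")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        was_probing = self.opened_at is not None
        self.failures += 1
        if was_probing or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cool-down
```

Callers should treat `CircuitOpenError` as a signal to use their documented fallback, not as something to retry immediately.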

Graceful Degradation

Prefer keeping a subset of functionality alive (read-only catalog, stale recommendations) over returning 503 for everything. Document product expectations for degraded modes.

Stale Reads When the Database Path Is Impaired

If read replicas, cache, or static bundles stay available when the authoritative writer path is degraded, carefully allow explicitly stale reads rather than catastrophic failure. Disclose staleness bounds internally and audit whether money-critical paths bypass this mode.
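
One way to make staleness explicit is to return it alongside the value. The sketch below is an assumption-laden illustration: `StaleReadFallback`, `ReadResult`, the in-memory dictionary standing in for a cache or replica, and the 300-second bound are all hypothetical. The `allow_stale` flag is how a money-critical path would opt out.

```python
import time
from dataclasses import dataclass


@dataclass
class ReadResult:
    value: object
    stale: bool
    age_seconds: float


class StaleReadFallback:
    def __init__(self, max_staleness=300.0, clock=time.monotonic):
        self.max_staleness = max_staleness
        self.clock = clock
        self._cache = {}  # key -> (value, stored_at); stand-in for a real cache

    def read(self, key, load_primary, allow_stale=True):
        try:
            value = load_primary(key)
        except Exception:
            if not allow_stale or key not in self._cache:
                raise  # no acceptable fallback: fail loudly
            cached, stored_at = self._cache[key]
            age = self.clock() - stored_at
            if age > self.max_staleness:
                raise  # older than the disclosed bound: refuse to serve it
            return ReadResult(cached, stale=True, age_seconds=age)
        self._cache[key] = (value, self.clock())
        return ReadResult(value, stale=False, age_seconds=0.0)
```

Because every result carries `stale` and `age_seconds`, callers can label degraded data in the UI or log it for the audit described above.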

Safer Releases and Blue/Green Patterns

Techniques such as blue/green deployments (two environments, switched traffic), canary releases, and feature flags reduce blast radius. Pair them with rollback automation and observable health gates.
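
A health gate can be as simple as comparing the canary's error rate against the baseline before promoting. The function below is a hypothetical sketch: the name `canary_decision` and every threshold (200 requests minimum, 1% error ceiling, 2x the baseline rate) are assumptions for illustration, and real gates usually also look at latency and saturation.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    min_requests=200, max_error_rate=0.01, max_ratio=2.0):
    """Return "wait", "rollback", or "promote" from request and error counts."""
    if canary_requests < min_requests or baseline_requests < min_requests:
        return "wait"  # not enough traffic to judge either side
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    # Roll back only when the canary is bad in absolute terms and clearly
    # worse than the baseline serving the same traffic mix.
    if canary_rate > max_error_rate and canary_rate > max_ratio * baseline_rate:
        return "rollback"
    return "promote"
```

Wiring the `"rollback"` result to an automated traffic switch is what turns this check into the rollback automation mentioned above.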

Operational detail lives in Deployment strategies alongside pipeline design.

Chaos and Game Days

Controlled fault injection in non-production (game days) surfaces missing timeouts, backoff bugs, and mis-sized pools before customers find them. If you rarely exercise failure drills, outages become the rehearsal.
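
For a first game day, fault injection does not need dedicated tooling. A wrapper like the hypothetical `with_faults` below, applied to a dependency client in a non-production environment, adds latency and random errors at a configurable rate. The names and parameters are assumptions for this sketch.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during a drill."""


def with_faults(operation, error_rate=0.0, extra_latency=0.0,
                rng=random.random, sleep=time.sleep):
    """Wrap operation so calls are delayed and fail with probability error_rate."""
    def wrapped(*args, **kwargs):
        if extra_latency > 0:
            sleep(extra_latency)  # simulate a brownout
        if rng() < error_rate:
            raise InjectedFault("injected by game-day harness")
        return operation(*args, **kwargs)
    return wrapped
```

Setting `extra_latency` above a caller's timeout is a quick way to find call sites that have no timeout at all.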

List the top five dependencies whose failure deserves a scripted response in runbooks.