
Fault Tolerance

First Published By Atif Alam

Production systems fail partially far more often than they fail totally. Fault-tolerance sections of a design should read like explicit failure contracts: what callers experience when each dependency disappears.

Assume timeouts, brownouts, crashes, and bad deploys. State the fallback behavior for each: degraded UI, cached answers, queued work, or hard errors with clear remediation.

Transient failures merit bounded retries with exponential backoff plus full jitter to avoid synchronized retry storms. Cap retry attempts at clients and intermediaries. Non-idempotent POST calls need idempotency keys before blind retries.

See API design.
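
The retry guidance above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `TransientError`, `backoff_delays`, `call_with_retries`, and the default values for `base`, `cap`, and `max_attempts` are all hypothetical names and numbers chosen for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one full-jitter delay per retry (max_attempts - 1 delays in total)."""
    for attempt in range(max_attempts - 1):
        # Exponential growth, bounded by cap so late retries do not wait forever.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) so clients desynchronize.
        yield rng() * ceiling


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Run operation(); retry only on TransientError, up to max_attempts calls."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the error to the caller
            sleep(delay)
```

For a non-idempotent POST, generate the idempotency key once, outside `operation`, so every attempt sends the same key and the server can deduplicate.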

After repeated failures to a downstream, open the circuit: fail fast briefly while it recovers. This stops cascading overload across services that share thread pools. Half-open probing then re-admits trial traffic gradually.
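
A circuit breaker of this kind might look like the sketch below. It is a single-threaded simplification under stated assumptions: the class name, `failure_threshold`, and `reset_timeout` are hypothetical, and a production breaker would also limit how many probes run concurrently in the half-open state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to fail fast before probing
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast while the downstream recovers")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers should treat `CircuitOpenError` as a signal to take the documented fallback path rather than as something to retry immediately.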

Prefer keeping a subset of functionality alive (read-only catalog, stale recommendations) over returning 503 for everything. Document product expectations for degraded modes.
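
As an illustration of partial degradation, the sketch below treats the catalog lookup as essential and recommendations as optional. The function name, the page shape, and the `fallback_recs` parameter are assumptions made for the example, not part of any particular framework.

```python
def product_page(product_id, fetch_catalog, fetch_recommendations, fallback_recs=()):
    """Build a page dict; only the catalog lookup is allowed to fail the request."""
    # Essential dependency: if this raises, the caller gets a hard error.
    page = {"product": fetch_catalog(product_id), "degraded": []}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # Optional dependency: serve a precomputed or empty list instead of a 503,
        # and record the degradation so the UI and metrics can reflect it.
        page["recommendations"] = list(fallback_recs)
        page["degraded"].append("recommendations")
    return page
```

Recording which features were degraded keeps the behavior visible, which makes it easier to hold the product to its documented expectations.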

Stale Reads When the Database Path Is Impaired


If read replicas, caches, or static bundles stay available while the authoritative writer path is degraded, serve explicitly stale reads rather than failing catastrophically. Disclose staleness bounds internally, and audit whether money-critical paths bypass this mode.
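
One way to make the staleness bound explicit is sketched below. `StaleReadCache`, `max_staleness`, and the `allow_stale` switch are hypothetical; the point is that every read reports its age, and that a money-critical caller can opt out of stale data entirely.

```python
import time


class StaleReadCache:
    def __init__(self, max_staleness=60.0, clock=time.monotonic):
        self.max_staleness = max_staleness  # oldest age, in seconds, we will serve
        self.clock = clock
        self._entries = {}  # key -> (value, stored_at)

    def read(self, key, load_primary, allow_stale=True):
        """Return (value, age_seconds); an age of 0.0 means fresh from the primary."""
        try:
            value = load_primary(key)
        except Exception:
            entry = self._entries.get(key)
            if not allow_stale or entry is None:
                raise  # no stale copy permitted or available: fail loudly
            value, stored_at = entry
            age = self.clock() - stored_at
            if age > self.max_staleness:
                raise  # past the disclosed bound: an error beats a misleading answer
            return value, age
        # Primary path is healthy: remember the value for later degraded reads.
        self._entries[key] = (value, self.clock())
        return value, 0.0
```

A balance or payment path would call `read(..., allow_stale=False)` so it fails instead of acting on old data, which is the bypass the audit should confirm.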

Techniques such as blue/green deployments (two environments, switched traffic), canary releases, and feature flags reduce blast radius. Pair them with rollback automation and observable health gates.

Operational detail lives in Deployment strategies alongside pipeline design.
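
To make the blast-radius idea concrete, here is a small sketch of a percentage-based feature flag. The `in_rollout` helper and its hashing scheme are assumptions for illustration, not the behavior of any specific flag service.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Hash flag and user together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because the bucket is derived from a stable hash, a user who is included at 10% stays included as the rollout widens, and setting `percent` to 0 acts as an immediate rollback.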

Controlled fault injection in non-production environments (game days) surfaces missing timeouts, backoff bugs, and mis-sized pools before customers find them. If you rarely run failure drills, outages become the rehearsal.

List the top five dependencies whose failure deserves a scripted response in runbooks.

Related: Observability for systems, Incident response and on-call, QA.