
Databases Overview

First Published By Atif Alam

This section is the cloud-agnostic home for database reliability and data on-call patterns. It complements the cloud-specific pages — AWS databases and Azure databases — which cover provisioning and vendor features. Pages here cover what breaks at 3 AM and how to recover.

The focus is operations for relational stores (PostgreSQL, MySQL, SQL Server, Oracle, and managed equivalents). NoSQL is touched only where the on-call patterns differ.

Most cloud DB documentation answers “how do I provision an instance?” — but the questions that matter on-call are different:

  • The pool is exhausted; what is the right knob to turn?
  • A read replica is 30 seconds behind; do users see stale data, or worse?
  • A failover happened during a deploy; is the new primary the right one?
  • A schema migration locked a hot table; can we abort safely?
  • We need to restore last night’s backup; how long will it take, and have we ever tested this?

These are vendor-agnostic patterns that show up on RDS, Cloud SQL, Aurora, on-prem Postgres, and Azure SQL alike.
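To make the replica-lag question above concrete, here is a minimal sketch of a staleness-budget check that works the same way regardless of vendor. It assumes lag has already been measured in seconds by whatever monitoring is in place (how to measure it is vendor-specific and not shown). The names `ReplicaStatus`, `choose_read_target`, and the `"primary"` label are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical snapshot of one read replica, as reported by monitoring."""
    name: str
    lag_seconds: float


def choose_read_target(
    replicas: Iterable[ReplicaStatus],
    max_staleness_seconds: float,
    primary: str = "primary",
) -> str:
    """Pick where a read should go.

    Returns the name of the least-lagged replica whose lag is within the
    staleness budget, or falls back to the primary when none qualifies.
    """
    fresh = [r for r in replicas if r.lag_seconds <= max_staleness_seconds]
    if not fresh:
        return primary
    return min(fresh, key=lambda r: r.lag_seconds).name


# Example: one replica is 30 seconds behind, another 2 seconds behind.
replicas = [ReplicaStatus("replica-a", 30.0), ReplicaStatus("replica-b", 2.0)]
target = choose_read_target(replicas, max_staleness_seconds=5.0)
fallback = choose_read_target(replicas[:1], max_staleness_seconds=5.0)
```

With a 5-second budget, the read goes to the replica that is 2 seconds behind; if only the 30-second replica is available, the read falls back to the primary. The budget itself is a product decision (how stale may a user-visible read be?), which is why it is a parameter here rather than a constant.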

  • RDBMS Reliability and On-Call — Connection pools, replication lag, failover and split-brain awareness, backup and restore drills, schema migration risk, and observability for the database boundary.

More pages may be added over time (NoSQL on-call, caching, queueing); for now, the entry point is the RDBMS page above.

Topic                               Where to Go
Cloud DB provisioning (AWS)         AWS databases
Cloud DB provisioning (Azure)       Azure databases
Metrics, dashboards, alerts         Observability, Alerting
Reliability targets                 SLOs, SLIs, error budgets
Service readiness gates             Service readiness checklist
Stateful workloads on Kubernetes    Stateful backup and restore