RDBMS Reliability and On-Call
This page covers operational patterns for relational databases (PostgreSQL, MySQL, SQL Server, Oracle, and managed equivalents). Provisioning is covered in the cloud-specific pages — AWS databases and Azure databases. The focus here is what breaks at 3 AM and how to recover.
The patterns generalize across self-managed Postgres, RDS, Aurora, Cloud SQL, Azure SQL, and on-prem MySQL. Numbers and knob names differ; the failure modes do not.
Related: Observability, Alerting, SLOs and error budgets, Service readiness checklist, Incident response and on-call, Stateful backup and restore on Kubernetes.
The Database Boundary
Section titled “The Database Boundary”Most outages that look like a database problem are really one of four things:
| Class | Symptom | Where to Look |
|---|---|---|
| Saturation | Latency rising; connections exhausted; locks growing. | Connection pool, slow query log, active queries. |
| Replication issue | Stale reads; replica behind; failover storm. | Replication lag, primary/replica state. |
| Capacity | Disk full; CPU pinned; memory pressure. | Host metrics, query plans, autovacuum activity (Postgres). |
| Schema or migration | New errors after a deploy; lock waits; broken queries. | Recent migrations, application logs, lock waits. |
The first triage move is to place the symptom in one of these classes. Different classes have very different mitigations.
Connection Pools
Section titled “Connection Pools”A pool is a cache of open database connections held by an application or a proxy (PgBouncer, ProxySQL, RDS Proxy). When the pool is exhausted, application requests block on getting a connection — which looks identical to slow queries from the user’s side.
Sizing the Pool
Section titled “Sizing the Pool”A common mistake is “more is always better.” Too many connections create lock contention and CPU saturation in the database itself.
| Layer | Reasonable Starting Point |
|---|---|
| Per app instance | 5–20 connections, depending on workload. |
| Per pool / proxy | (cores × 2) to (cores × 4) on the database, divided across apps. |
| Database max | Set above expected usage but below the point where the database struggles. |
For Postgres specifically: as a rule of thumb, total active connections should not exceed (cores × 2 to 4); use a connection pooler (PgBouncer, RDS Proxy) to fan out to many app instances.
On-Call Triage for Pool Exhaustion
Section titled “On-Call Triage for Pool Exhaustion”Symptom: "All requests are timing out" │ ▼Are connections being acquired? ──────► No → pool exhausted │ │ │ ▼ ▼ Why? Slow queries?Yes, but slow │ │ ▼ ▼ Long-running transaction?Slow query? Lock wait? │ ▼ App code holding a connection?The expensive mitigation is “raise the pool size.” The right one is usually “find the slow query or long-running transaction and stop the bleeding.”
Useful Queries (Postgres example)
Section titled “Useful Queries (Postgres example)”-- Currently active queries, sorted by how long they've been runningSELECT pid, now() - query_start AS duration, state, queryFROM pg_stat_activityWHERE state != 'idle'ORDER BY duration DESCLIMIT 20;
-- Lock waitsSELECT blocked_locks.pid AS blocked_pid, blocking_locks.pid AS blocking_pid, blocked_activity.query AS blocked_query, blocking_activity.query AS blocking_queryFROM pg_locks blocked_locksJOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pidJOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktypeJOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pidWHERE NOT blocked_locks.granted AND blocking_locks.granted;For MySQL, the equivalents are SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS.
Replication Lag and Read-After-Write
Section titled “Replication Lag and Read-After-Write”Most production setups have a primary (writes) and one or more read replicas. Replicas always lag behind the primary by some amount — usually milliseconds, sometimes seconds, sometimes minutes during incidents.
Why Lag Matters
Section titled “Why Lag Matters”| Scenario | What Lag Causes |
|---|---|
| Read after write | User saves a record on the primary, then reads from a replica — and sees stale data. |
| Reporting / analytics | Long-running query on a replica delays replication and amplifies lag. |
| Failover decision | Promoting a replica that is far behind the primary loses recent writes. |
Mitigations
Section titled “Mitigations”- Read-after-write paths (e.g. user just saved their profile) read from the primary, not a replica. Wire this into your data-access layer.
- Stale-OK paths (dashboards, search) can tolerate replicas; mark them in code.
- Lag alerts — page when lag exceeds your tolerance (e.g. 30 seconds), critical when it exceeds the data-loss tolerance for failover (e.g. 5 minutes).
- Long queries on replicas — kill or quarantine; replicas are not your analytics warehouse.
Lag-Driven Failover Trade-offs
Section titled “Lag-Driven Failover Trade-offs”Promoting a lagging replica is a data-loss decision in disguise. The on-call playbook should make this explicit:
Primary down. Replica lag = 4 seconds. ├── Promote replica now? → ~4 seconds of writes lost ├── Wait for primary recovery? → likely longer outage, fewer lost writes └── Replay primary WAL/binlog onto replica before promote? → longest outage, no data lossThe right answer depends on the application. Decide and document it before the incident, not during.
Failover and Split-Brain Awareness
Section titled “Failover and Split-Brain Awareness”Managed services (RDS Multi-AZ, Aurora, Cloud SQL HA, Azure SQL) do automatic failover with seconds-to-minutes downtime. Self-managed setups use Patroni, repmgr, MHA, Orchestrator, or similar.
Failover Modes
Section titled “Failover Modes”| Mode | What It Looks Like |
|---|---|
| Synchronous replica → primary | Zero data loss; lower write throughput (waiting for replica ack). |
| Asynchronous replica → primary | Some data loss possible; higher write throughput. |
| Manual | Operator decides; safest in ambiguous failures. |
| Automatic with consensus | Etcd/Consul/ZooKeeper makes the call; faster but can misfire on network partitions. |
Split-Brain
Section titled “Split-Brain”Split-brain happens when two nodes think they are the primary and both accept writes. The conflict is messy to recover from: divergent data, broken replication, human cleanup.
Defenses:
- Fencing — when a node loses leadership, its writes are rejected (or its connections are killed) by something other than itself.
- Quorum — leader election requires a majority; a minority partition cannot promote.
- Witness or arbiter — a separate node breaks ties in 2-node clusters.
- STONITH (“shoot the other node in the head”) — kill the suspected old primary at the infrastructure layer before promoting a new one.
For managed services, fencing is the vendor’s problem; for self-managed, it is yours.
Backup, Restore, and Restore Drills
Section titled “Backup, Restore, and Restore Drills”A backup that has never been restored is a hope, not a backup.
Backup Types
Section titled “Backup Types”| Type | What It Is | Recovery Properties |
|---|---|---|
| Full | Complete snapshot of the database. | Slowest to take; simplest to restore. |
| Incremental | Changes since the last full or incremental. | Faster to take; restore replays a chain. |
| WAL / binlog archiving | Continuous stream of write-ahead-log files. | Enables point-in-time recovery. |
Logical (pg_dump, mysqldump) | SQL representation. | Cross-version friendly; slower for large DBs. |
| Physical / block | File-system or block-level snapshot. | Fastest for large DBs; tied to engine version. |
RPO and RTO
Section titled “RPO and RTO”| Term | Meaning | Example |
|---|---|---|
| RPO (Recovery Point Objective) | How much data can you lose? | ”We can lose 5 minutes of writes.” |
| RTO (Recovery Time Objective) | How long can recovery take? | ”We can be down for 1 hour.” |
These drive backup frequency and infrastructure choices. A 1-minute RPO requires continuous WAL archiving or synchronous replication; a 24-hour RPO can use nightly snapshots.
Restore Drills
Section titled “Restore Drills”A restore drill is a scheduled exercise:
- Pick a backup (last night, last week, or random).
- Restore to a separate environment.
- Verify the database starts and a known query returns expected results.
- Time the entire process and compare to RTO.
- Note anything that surprised you. Update the runbook.
Cadence: at least quarterly. Document the last drill date in the service catalog. If you have not done one in the last six months, schedule one before the next time you would need it for real.
See also Stateful backup and restore on Kubernetes for backup tooling specific to K8s-hosted databases (Velero, CSI snapshots).
Point-in-Time Recovery
Section titled “Point-in-Time Recovery”Most managed services and well-configured self-managed setups support point-in-time recovery (PITR) — restore to any second within the WAL/binlog retention window.
Friday 10:23 UTC: bad migration wipes a tableFriday 10:25 UTC: discovered │ ▼PITR target: Friday 10:22:30 UTC (just before the migration)PITR usually runs into a fresh database (cannot rewind the existing one). The recovery flow is: spin up a new instance from the backup → replay WAL up to the target time → switch the application connection (or extract the missing data and reapply).
Schema Migrations
Section titled “Schema Migrations”Schema changes are one of the most common sources of database incidents. The risk profile is very different from application deploys.
High-Risk Migration Patterns
Section titled “High-Risk Migration Patterns”| Pattern | Why It’s Risky |
|---|---|
ALTER TABLE on a hot table | Default behavior in many engines locks the table or rewrites it; long lock waits. |
| Adding a non-nullable column with default | Some engines rewrite the entire table. |
| Dropping a column | Breaks any code path still using it. |
| Renaming | The application uses both names during deploy. |
| Index creation | Long; locks vary by engine. Use CONCURRENTLY (Postgres) or online DDL (MySQL 8). |
Safer Patterns
Section titled “Safer Patterns”- Backwards-compatible always: never break the previous app version’s queries in a single deploy. Two-phase: add the new column, deploy code that writes both, backfill, deploy code that reads new, drop old.
- Online DDL: Postgres
CREATE INDEX CONCURRENTLY, MySQL 8ALGORITHM=INPLACE, LOCK=NONEwhere supported. - Migration tooling that supports retries and timeouts (Flyway, Liquibase, Alembic, gh-ost, pt-online-schema-change).
- Lock timeouts in the migration session (
SET lock_timeout = '5s'in Postgres). Better to fail and retry than to block all writes. - Off-peak window for the migration step that actually carries risk.
When a Migration Goes Wrong
Section titled “When a Migration Goes Wrong”Symptom: app errors after deploy │ ▼Was a migration part of the deploy? ──── No → roll back app │ ▼ YesDid the migration succeed? ──── No → migration tool will tell you │ ▼ YesAre queries hitting locks? ──── Yes → check pg_locks / SHOW PROCESSLIST │ ▼ NoIs the schema actually what you expect? → describe / SHOW CREATE TABLERolling back a migration is often harder than rolling forward — schema changes are not symmetric. This is why backwards-compatible patterns matter.
Database Observability
Section titled “Database Observability”The same three pillars apply: metrics, logs, traces. A few signals matter most for relational databases.
Golden Signals at the Database Boundary
Section titled “Golden Signals at the Database Boundary”| Signal | Examples |
|---|---|
| Latency | pg_stat_statements mean exec time; per-query p99 in your APM. |
| Throughput | Transactions per second; queries per second. |
| Errors | Connection refused, lock wait timeouts, deadlocks, query failures. |
| Saturation | Active connections vs max; CPU; disk I/O wait; cache hit ratio. |
Useful Database-Specific Metrics
Section titled “Useful Database-Specific Metrics”| Engine | Watch |
|---|---|
| Postgres | Replication lag (pg_stat_replication), bloat, autovacuum activity, long transactions, lock waits. |
| MySQL | Innodb buffer pool hit rate, slave/replica lag, lock waits, open table count. |
| All | Slow query log, error log, connection count vs max. |
Useful Alerts
Section titled “Useful Alerts”- Replication lag above tolerance, with critical threshold set below the failover-data-loss tolerance.
- Connection saturation (e.g. >80% of max for 5 minutes).
- Disk free below threshold (e.g. less than 20% remaining, critical below 10%).
- Backup age — alert if no successful backup in the last 24 hours (or your RPO window).
- Restore drill staleness — alert if no drill in the last quarter.
Slow Query Discipline
Section titled “Slow Query Discipline”Enable the slow query log in production with a sane threshold (e.g. 1 second). Review the top offenders weekly. Most database “performance problems” are 5–10 specific queries that account for 80% of database load.
On-Call Runbook Skeleton
Section titled “On-Call Runbook Skeleton”A reusable runbook structure:
Service: <name>Owners: <team>Database: <engine, version, hosting>Replicas: <count, lag tolerance>Backups: <type, frequency, retention>RPO/RTO: <values>Last restore drill: <date>
Symptoms and First Steps- Latency spike → see "Slow query triage" below- Connection errors → see "Pool exhaustion" below- Replica lag alert → see "Replica lag" below- Disk full alert → see "Disk full" below
Slow Query Triage1. ...
Pool Exhaustion1. ...
Replica Lag1. ...
Disk Full1. ...
Failover Procedure (manual)1. ...
Rollback / Restore (PITR)1. ...
Escalation- L2: <team>- L3: <vendor / DBA team>Each section is short. Long runbooks are not read during incidents.
Checklist
Section titled “Checklist”- Connection pool sizing is documented and matches database max-connections math.
- Read-after-write paths are routed to the primary; stale-OK paths use replicas.
- Replication lag has both warning and critical alerts with rationale.
- Failover behavior (automatic vs manual; data-loss tolerance) is documented in the runbook.
- Backups run at the cadence required by RPO; a monitor alerts on backup age.
- At least one restore drill has happened in the last quarter; results are documented.
- PITR target retention covers your RPO window.
- Migrations use a tool with timeouts; high-risk patterns are flagged in code review.
- Slow query log is on in production; top queries are reviewed regularly.
- On-call runbook exists, is short, and was last verified within the last quarter.
Related
Section titled “Related”- AWS databases — RDS, Aurora, DynamoDB provisioning and features
- Azure databases — Azure SQL, Cosmos DB, managed Postgres/MySQL
- Service readiness checklist — gates a database-backed service should pass
- Stateful backup and restore on Kubernetes — Velero, CSI snapshots, restore drills for K8s-hosted DBs
- Observability — metrics, logs, and traces; ties to alerts above
- Incident response and on-call — incident command for database outages