RDBMS Reliability and On-Call

First published Apr 29, 2026 by Atif Alam

This page covers operational patterns for relational databases (PostgreSQL, MySQL, SQL Server, Oracle, and managed equivalents). Provisioning is covered in the cloud-specific pages — AWS databases and Azure databases. The focus here is what breaks at 3 AM and how to recover.

The patterns generalize across self-managed Postgres, RDS, Aurora, Cloud SQL, Azure SQL, and on-prem MySQL. Numbers and knob names differ; the failure modes do not.

The Database Boundary

Most outages that look like a database problem are really one of four things:

Class	Symptom	Where to Look
Saturation	Latency rising; connections exhausted; locks growing.	Connection pool, slow query log, active queries.
Replication issue	Stale reads; replica behind; failover storm.	Replication lag, primary/replica state.
Capacity	Disk full; CPU pinned; memory pressure.	Host metrics, query plans, autovacuum activity (Postgres).
Schema or migration	New errors after a deploy; lock waits; broken queries.	Recent migrations, application logs, lock waits.

The first triage move is to place the symptom in one of these classes. Different classes have very different mitigations.

Connection Pools

A pool is a cache of open database connections held by an application or a proxy (PgBouncer, ProxySQL, RDS Proxy). When the pool is exhausted, application requests block on getting a connection — which looks identical to slow queries from the user’s side.
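To make the exhaustion behavior concrete, here is a minimal sketch of a fixed-size pool with an acquire timeout. `BoundedPool` and `PoolExhausted` are hypothetical names used for illustration, not the API of PgBouncer, ProxySQL, or any driver; the "connections" can be any pre-opened objects.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Toy fixed-size pool (hypothetical; for illustration only)."""

    def __init__(self, connections, acquire_timeout=1.0):
        self._free = queue.Queue()
        for conn in connections:
            self._free.put(conn)
        self._timeout = acquire_timeout

    @contextmanager
    def connection(self):
        try:
            # Blocks here when every connection is checked out. From the
            # caller's side this wait looks the same as a slow query.
            conn = self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection within timeout") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised.
            self._free.put(conn)
```

Failing fast with a distinct error, rather than blocking indefinitely, is one way to make pool exhaustion distinguishable from slow queries in application logs.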

Sizing the Pool

A common mistake is “more is always better.” Too many connections create lock contention and CPU saturation in the database itself.

Layer	Reasonable Starting Point
Per app instance	5–20 connections, depending on workload.
Per pool / proxy	(cores × 2) to (cores × 4) on the database, divided across apps.
Database max	Set above expected usage but below the point where the database struggles.

For Postgres specifically: as a rule of thumb, keep total active connections at or below roughly 2–4 × the database’s core count, and use a connection pooler (PgBouncer, RDS Proxy) to fan out to many app instances.
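The arithmetic behind that rule of thumb can be sketched as follows. The function names and the default multiplier are assumptions for illustration; treat the result as a starting point to tune, not a definitive size.

```python
def per_instance_pool_size(db_cores, app_instances, multiplier=2, floor=1):
    """Split a connection budget of (cores x multiplier) across app instances."""
    if db_cores < 1 or app_instances < 1:
        raise ValueError("db_cores and app_instances must be positive")
    budget = db_cores * multiplier
    # Integer division keeps the total at or below the budget,
    # except when the floor applies (see needs_pooler below).
    return max(floor, budget // app_instances)


def needs_pooler(db_cores, app_instances, multiplier=2):
    """True when the budget cannot give each instance even one connection."""
    return db_cores * multiplier < app_instances
```

For example, a 16-core database shared by 4 app instances gives 8 connections per instance at a multiplier of 2. With 40 instances on a 4-core database the budget is smaller than the instance count, which is the case where a pooler in front of the database is the usual answer.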

On-Call Triage for Pool Exhaustion

Symptom: "All requests are timing out"
   │
   ▼
Are connections being acquired? ──────► No → pool exhausted
   │                                        │
   │                                        ▼
   ▼                              Why? Slow queries?
Yes, but slow                              │
   │                                        ▼
   ▼                              Long-running transaction?
Slow query? Lock wait?                    │
                                           ▼
                                  App code holding a connection?

The expensive mitigation is “raise the pool size.” The right one is usually “find the slow query or long-running transaction and stop the bleeding.”

Useful Queries (Postgres example)

-- Currently active queries, sorted by how long they've been running
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid <> pg_backend_pid()  -- exclude this monitoring session
ORDER BY duration DESC
LIMIT 20;

-- Lock waits: each blocked session paired with the session(s) blocking it.
-- Joining pg_locks on locktype alone pairs unrelated locks, so this uses
-- pg_blocking_pids(), which requires PostgreSQL 9.6 or later.
SELECT blocked.pid    AS blocked_pid,
       blocking.pid   AS blocking_pid,
       blocked.query  AS blocked_query,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));

For MySQL, the equivalents are SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS.

Replication Lag and Read-After-Write

Most production setups have a primary (writes) and one or more read replicas. Replicas always lag behind the primary by some amount — usually milliseconds, sometimes seconds, sometimes minutes during incidents.

Why Lag Matters

Scenario	What Lag Causes
Read after write	User saves a record on the primary, then reads from a replica — and sees stale data.
Reporting / analytics	Long-running query on a replica delays replication and amplifies lag.
Failover decision	Promoting a replica that is far behind the primary loses recent writes.

Mitigations

Read-after-write paths (e.g. user just saved their profile) read from the primary, not a replica. Wire this into your data-access layer.
Stale-OK paths (dashboards, search) can tolerate replicas; mark them in code.
Lag alerts: page when lag exceeds your tolerance (e.g. 30 seconds), and escalate to critical before it reaches the data-loss tolerance for failover (e.g. 5 minutes).
Long queries on replicas — kill or quarantine; replicas are not your analytics warehouse.
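One way to wire the first two mitigations into a data-access layer is sketched below. It uses a common variant that this page does not prescribe: pin a user's reads to the primary for a short window after that user writes. `ReadRouter`, the 5-second window, and the `"primary"`/`"replica"` labels are all assumptions for illustration.

```python
import time


class ReadRouter:
    """Chooses a read target per request (hypothetical sketch)."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self._pin_seconds = pin_seconds  # assumed window; tune to observed lag
        self._clock = clock
        self._last_write = {}            # user_id -> time of last write

    def record_write(self, user_id):
        self._last_write[user_id] = self._clock()

    def target_for_read(self, user_id, stale_ok=False):
        # Paths explicitly marked stale-OK always use a replica.
        if stale_ok:
            return "replica"
        last = self._last_write.get(user_id)
        # A recent write by this user means a replica may not have it yet.
        if last is not None and self._clock() - last < self._pin_seconds:
            return "primary"
        return "replica"
```

The window only helps if it is longer than typical replication lag; during a lag incident, a fixed window like this can still return stale data, which is why the lag alerts matter.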

Lag-Driven Failover Trade-offs

Promoting a lagging replica is a data-loss decision in disguise. The on-call playbook should make this explicit:

Primary down. Replica lag = 4 seconds.
   ├── Promote replica now? → ~4 seconds of writes lost
   ├── Wait for primary recovery? → likely longer outage, fewer lost writes
   └── Replay primary WAL/binlog onto replica before promote? → longest outage, no data loss

The right answer depends on the application. Decide and document it before the incident, not during.
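A documented decision can be as small as one agreed number. The sketch below assumes a hypothetical `FailoverPolicy` holding the data loss the application owners have accepted in advance, and treats replica lag in seconds as an approximation of the writes that would be lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    # Agreed with the application owners before any incident (assumed field).
    max_acceptable_loss_seconds: float


def promote_decision(policy, replica_lag_seconds):
    """Return "promote" if the lag is within the agreed loss, else "escalate"."""
    if replica_lag_seconds <= policy.max_acceptable_loss_seconds:
        return "promote"
    # Beyond the agreed tolerance, a human weighs waiting or log replay.
    return "escalate"
```

With a 5-second tolerance, the 4-second lag in the example above would return `"promote"`, while a 30-second lag would return `"escalate"`.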

Failover and Split-Brain Awareness

Managed services (RDS Multi-AZ, Aurora, Cloud SQL HA, Azure SQL) do automatic failover with seconds-to-minutes downtime. Self-managed setups use Patroni, repmgr, MHA, Orchestrator, or similar.

Failover Modes

Mode	What It Looks Like
Synchronous replica → primary	Zero data loss; lower write throughput (waiting for replica ack).
Asynchronous replica → primary	Some data loss possible; higher write throughput.
Manual	Operator decides; safest in ambiguous failures.
Automatic with consensus	etcd, Consul, or ZooKeeper makes the call; faster but can misfire on network partitions.

Split-Brain

Split-brain happens when two nodes think they are the primary and both accept writes. The conflict is messy to recover from: divergent data, broken replication, human cleanup.

Defenses:

Fencing — when a node loses leadership, its writes are rejected (or its connections are killed) by something other than itself.
Quorum — leader election requires a majority; a minority partition cannot promote.
Witness or arbiter — a separate node breaks ties in 2-node clusters.
STONITH (“shoot the other node in the head”) — kill the suspected old primary at the infrastructure layer before promoting a new one.

For managed services, fencing is the vendor’s problem; for self-managed, it is yours.
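The quorum and fencing rules reduce to two small checks. This is a sketch of the logic only, with hypothetical function names; real tools such as Patroni implement it through a consensus store rather than a local function.

```python
def has_quorum(votes_received, cluster_size):
    """A strict majority is required; exactly half is not enough."""
    return votes_received > cluster_size // 2


def may_promote(reachable_nodes, cluster_size, old_primary_fenced):
    """Promote only with a majority AND confirmation the old primary is fenced."""
    return has_quorum(reachable_nodes, cluster_size) and old_primary_fenced
```

The strict inequality is why a 2-node cluster needs a witness: one node out of two is never a majority, so neither side of a partition can promote on its own.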

Backup, Restore, and Restore Drills

A backup that has never been restored is a hope, not a backup.

Backup Types

Type	What It Is	Recovery Properties
Full	Complete snapshot of the database.	Slowest to take; simplest to restore.
Incremental	Changes since the last full or incremental.	Faster to take; restore replays a chain.
WAL / binlog archiving	Continuous stream of write-ahead-log files.	Enables point-in-time recovery.
Logical (`pg_dump`, `mysqldump`)	SQL representation.	Cross-version friendly; slower for large DBs.
Physical / block	File-system or block-level snapshot.	Fastest for large DBs; tied to engine version.

RPO and RTO

Term	Meaning	Example
RPO (Recovery Point Objective)	How much data can you lose?	“We can lose 5 minutes of writes.”
RTO (Recovery Time Objective)	How long can recovery take?	“We can be down for 1 hour.”

These drive backup frequency and infrastructure choices. A 1-minute RPO requires continuous WAL archiving or synchronous replication; a 24-hour RPO can use nightly snapshots.
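The relationship between backup schedule and RPO can be checked with simple arithmetic. The sketch below is a simplification under stated assumptions: it ignores how long a backup takes to complete and upload, and it treats the WAL archive interval as the longest gap between archived segments. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, wal_archive_interval=None):
    """Longest window of writes that could be lost if the primary is destroyed."""
    if wal_archive_interval is not None:
        # With log archiving, loss is bounded by the archive gap.
        return min(backup_interval, wal_archive_interval)
    return backup_interval


def meets_rpo(rpo, backup_interval, wal_archive_interval=None):
    return worst_case_data_loss(backup_interval, wal_archive_interval) <= rpo
```

Nightly snapshots alone satisfy a 24-hour RPO but not a 1-minute one; adding log archiving with a gap of 30 seconds would satisfy both.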

Restore Drills

A restore drill is a scheduled exercise:

Pick a backup (last night, last week, or random).
Restore to a separate environment.
Verify the database starts and a known query returns expected results.
Time the entire process and compare to RTO.
Note anything that surprised you. Update the runbook.

Cadence: at least quarterly. Document the last drill date in the service catalog. If you have not done one in the last six months, schedule one before the next time you would need it for real.
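The timing and verification steps of a drill can be wrapped in a small harness. This is a sketch: `restore` and `verify` are hypothetical callables you would supply (for example, a script that restores into a scratch environment and a function that runs the known query), and the returned dictionary shape is an assumption.

```python
import time
from datetime import timedelta


def run_timed_drill(restore, verify, rto, clock=time.monotonic):
    """Run a restore drill, time it end to end, and compare against the RTO."""
    start = clock()
    restore()                  # hypothetical: restore into a separate environment
    verified = bool(verify())  # hypothetical: known query returns expected rows
    elapsed = timedelta(seconds=clock() - start)
    return {
        "verified": verified,
        "elapsed": elapsed,
        # A fast restore that fails verification does not count.
        "within_rto": verified and elapsed <= rto,
    }
```

Recording the result alongside the drill date gives the service catalog entry something concrete to show.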

See also Stateful backup and restore on Kubernetes for backup tooling specific to K8s-hosted databases (Velero, CSI snapshots).

Point-in-Time Recovery

Most managed services and well-configured self-managed setups support point-in-time recovery (PITR) — restore to any second within the WAL/binlog retention window.

Friday 10:23 UTC: bad migration wipes a table
Friday 10:25 UTC: discovered
   │
   ▼
PITR target: Friday 10:22:30 UTC (just before the migration)

PITR usually restores into a fresh instance, because the existing database cannot be rewound in place. The recovery flow is: spin up a new instance from the backup → replay WAL up to the target time → switch the application connection (or extract the missing data and reapply it to the existing database).
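The first step of that flow, choosing which backup to start from, can be sketched as below. The function name and arguments are hypothetical; the logic assumes that logs must be retained continuously from the chosen backup up to the target time.

```python
from datetime import datetime, timezone


def pick_base_backup(backup_times, target, oldest_wal):
    """Return the latest backup at or before `target` that logs can replay from."""
    candidates = [b for b in backup_times if oldest_wal <= b <= target]
    if not candidates:
        raise LookupError("no usable base backup for this target")
    # The latest usable backup minimizes how much log must be replayed.
    return max(candidates)


# Hypothetical nightly backups at 02:00 UTC and a target just before an incident.
utc = timezone.utc
backups = [datetime(2026, 4, 30, 2, 0, tzinfo=utc),
           datetime(2026, 5, 1, 2, 0, tzinfo=utc)]
target = datetime(2026, 5, 1, 10, 22, 30, tzinfo=utc)
base = pick_base_backup(backups, target,
                        oldest_wal=datetime(2026, 4, 29, 0, 0, tzinfo=utc))
```

If the retention window no longer reaches back to any backup before the target, the function raises rather than guessing, which mirrors the real constraint: PITR is only possible inside the retention window.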

Schema Migrations

Schema changes are one of the most common sources of database incidents. The risk profile is very different from application deploys.

High-Risk Migration Patterns

Pattern	Why It’s Risky
`ALTER TABLE` on a hot table	Default behavior in many engines locks the table or rewrites it; long lock waits.
Adding a non-nullable column with default	Some engines rewrite the entire table.
Dropping a column	Breaks any code path still using it.
Renaming	During a rolling deploy, old and new application versions expect different names, so one of them breaks.
Index creation	Long; locks vary by engine. Use `CONCURRENTLY` (Postgres) or online DDL (MySQL 8).

Safer Patterns

Backwards-compatible always: never break the previous app version’s queries in a single deploy. Two-phase: add the new column, deploy code that writes both, backfill, deploy code that reads new, drop old.
Online DDL: Postgres CREATE INDEX CONCURRENTLY, MySQL 8 ALGORITHM=INPLACE, LOCK=NONE where supported.
Migration tooling that supports retries and timeouts (Flyway, Liquibase, Alembic, gh-ost, pt-online-schema-change).
Lock timeouts in the migration session (SET lock_timeout = '5s' in Postgres). Better to fail and retry than to block all writes.
Off-peak window for the migration step that actually carries risk.
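The "fail and retry" idea behind lock timeouts can be sketched as a retry wrapper with exponential backoff. `LockTimeout` and `run_step` are hypothetical: in practice `run_step` would execute the DDL in a session with a lock timeout set, and the exception would be whatever your driver raises when that timeout fires.

```python
import time


class LockTimeout(Exception):
    """Hypothetical: raised by run_step when the DDL hits its lock timeout."""


def run_with_retries(run_step, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a migration step that gave up waiting for a lock."""
    for attempt in range(1, attempts + 1):
        try:
            return run_step()
        except LockTimeout:
            if attempt == attempts:
                raise  # out of retries; surface the failure to the operator
            # Back off so the blocking workload has a chance to finish.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only lock timeouts are retried here; any other error propagates immediately, since retrying a genuinely broken migration only adds noise.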

When a Migration Goes Wrong

Symptom: app errors after deploy
   │
   ▼
Was a migration part of the deploy? ──── No → roll back app
   │
   ▼ Yes
Did the migration succeed? ──── No → migration tool will tell you
   │
   ▼ Yes
Are queries hitting locks? ──── Yes → check pg_locks / SHOW PROCESSLIST
   │
   ▼ No
Is the schema actually what you expect? → describe / SHOW CREATE TABLE

Rolling back a migration is often harder than rolling forward — schema changes are not symmetric. This is why backwards-compatible patterns matter.

Database Observability

The same three pillars apply: metrics, logs, traces. A few signals matter most for relational databases.

Golden Signals at the Database Boundary

Signal	Examples
Latency	`pg_stat_statements` mean exec time; per-query p99 in your APM.
Throughput	Transactions per second; queries per second.
Errors	Connection refused, lock wait timeouts, deadlocks, query failures.
Saturation	Active connections vs max; CPU; disk I/O wait; cache hit ratio.

Useful Database-Specific Metrics

Engine	Watch
Postgres	Replication lag (`pg_stat_replication`), bloat, autovacuum activity, long transactions, lock waits.
MySQL	InnoDB buffer pool hit rate, slave/replica lag, lock waits, open table count.
All	Slow query log, error log, connection count vs max.

Useful Alerts

Replication lag above tolerance, with critical threshold set below the failover-data-loss tolerance.
Connection saturation (e.g. >80% of max for 5 minutes).
Disk free below threshold (e.g. less than 20% remaining, critical below 10%).
Backup age — alert if no successful backup in the last 24 hours (or your RPO window).
Restore drill staleness — alert if no drill in the last quarter.
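Most of these alerts share one shape: a value, a warning threshold, and a critical threshold. A minimal sketch of that shape follows; the function name is hypothetical, and the sustained-duration part of an alert ("for 5 minutes") is left to the monitoring system.

```python
def classify(value, warn_at, crit_at, higher_is_worse=True):
    """Map a metric to "ok", "warning", or "critical"."""
    if not higher_is_worse:
        # For metrics like free disk, flip signs so the same comparison works.
        value, warn_at, crit_at = -value, -warn_at, -crit_at
    if value >= crit_at:
        return "critical"
    if value >= warn_at:
        return "warning"
    return "ok"


# Replication lag in seconds: warn at 30, critical at 240, which sits
# below an assumed 300-second failover data-loss tolerance.
lag_state = classify(45, warn_at=30, crit_at=240)

# Disk free in percent: lower is worse.
disk_state = classify(8, warn_at=20, crit_at=10, higher_is_worse=False)
```

Keeping the critical lag threshold below the failover tolerance means the page arrives while promoting a replica is still within the agreed data-loss budget.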

Slow Query Discipline

Enable the slow query log in production with a sane threshold (e.g. 1 second). Review the top offenders weekly. In practice, most database “performance problems” come down to 5–10 specific queries that account for roughly 80% of database load.
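The weekly review can start from a ranked list. The sketch below assumes you have exported rows of (query, calls, mean time in ms), for instance from `pg_stat_statements`; the function name and row shape are hypothetical. It ranks by total time rather than mean time, because a fast query called very often can outweigh a slow one that runs rarely.

```python
def top_offenders(stats, share=0.8):
    """Return the heaviest queries that together reach `share` of total time."""
    totals = sorted(
        ((query, calls * mean_ms) for query, calls, mean_ms in stats),
        key=lambda item: item[1],
        reverse=True,
    )
    grand_total = sum(total for _, total in totals)
    picked, running = [], 0.0
    for query, total in totals:
        if grand_total == 0 or running >= share * grand_total:
            break
        picked.append(query)
        running += total
    return picked
```

The returned list is the review agenda: if it is short, fixing those queries addresses most of the load.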

On-Call Runbook Skeleton

A reusable runbook structure:

Service: <name>
Owners: <team>
Database: <engine, version, hosting>
Replicas: <count, lag tolerance>
Backups: <type, frequency, retention>
RPO/RTO: <values>
Last restore drill: <date>

Symptoms and First Steps
- Latency spike → see "Slow query triage" below
- Connection errors → see "Pool exhaustion" below
- Replica lag alert → see "Replica lag" below
- Disk full alert → see "Disk full" below

Slow Query Triage
1. ...

Pool Exhaustion
1. ...

Replica Lag
1. ...

Disk Full
1. ...

Failover Procedure (manual)
1. ...

Rollback / Restore (PITR)
1. ...

Escalation
- L2: <team>
- L3: <vendor / DBA team>

Each section is short. Long runbooks are not read during incidents.

Related Pages

AWS databases — RDS, Aurora, DynamoDB provisioning and features
Azure databases — Azure SQL, Cosmos DB, managed Postgres/MySQL
Service readiness checklist — gates a database-backed service should pass
Stateful backup and restore on Kubernetes — Velero, CSI snapshots, restore drills for K8s-hosted DBs
Observability — metrics, logs, and traces; ties to alerts above
Incident response and on-call — incident command for database outages