Skip to content

RDBMS Reliability and On-Call

First PublishedByAtif Alam

This page covers operational patterns for relational databases (PostgreSQL, MySQL, SQL Server, Oracle, and managed equivalents). Provisioning is covered in the cloud-specific pages — AWS databases and Azure databases. The focus here is what breaks at 3 AM and how to recover.

The patterns generalize across self-managed Postgres, RDS, Aurora, Cloud SQL, Azure SQL, and on-prem MySQL. Numbers and knob names differ; the failure modes do not.

Related: Observability, Alerting, SLOs and error budgets, Service readiness checklist, Incident response and on-call, Stateful backup and restore on Kubernetes.

Most outages that look like a database problem are really one of four things:

ClassSymptomWhere to Look
SaturationLatency rising; connections exhausted; locks growing.Connection pool, slow query log, active queries.
Replication issueStale reads; replica behind; failover storm.Replication lag, primary/replica state.
CapacityDisk full; CPU pinned; memory pressure.Host metrics, query plans, autovacuum activity (Postgres).
Schema or migrationNew errors after a deploy; lock waits; broken queries.Recent migrations, application logs, lock waits.

The first triage move is to place the symptom in one of these classes. Different classes have very different mitigations.

A pool is a cache of open database connections held by an application or a proxy (PgBouncer, ProxySQL, RDS Proxy). When the pool is exhausted, application requests block on getting a connection — which looks identical to slow queries from the user’s side.

A common mistake is “more is always better.” Too many connections create lock contention and CPU saturation in the database itself.

LayerReasonable Starting Point
Per app instance5–20 connections, depending on workload.
Per pool / proxy(cores × 2) to (cores × 4) on the database, divided across apps.
Database maxSet above expected usage but below the point where the database struggles.

For Postgres specifically: as a rule of thumb, total active connections should not exceed (cores × 2 to 4); use a connection pooler (PgBouncer, RDS Proxy) to fan out to many app instances.

Symptom: "All requests are timing out"
Are connections being acquired? ──────► No → pool exhausted
│ │
│ ▼
▼ Why? Slow queries?
Yes, but slow │
│ ▼
▼ Long-running transaction?
Slow query? Lock wait? │
App code holding a connection?

The expensive mitigation is “raise the pool size.” The right one is usually “find the slow query or long-running transaction and stop the bleeding.”

-- Currently active queries, sorted by how long they've been running
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC
LIMIT 20;
-- Lock waits
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted AND blocking_locks.granted;

For MySQL, the equivalents are SHOW PROCESSLIST and SHOW ENGINE INNODB STATUS.

Most production setups have a primary (writes) and one or more read replicas. Replicas always lag behind the primary by some amount — usually milliseconds, sometimes seconds, sometimes minutes during incidents.

ScenarioWhat Lag Causes
Read after writeUser saves a record on the primary, then reads from a replica — and sees stale data.
Reporting / analyticsLong-running query on a replica delays replication and amplifies lag.
Failover decisionPromoting a replica that is far behind the primary loses recent writes.
  • Read-after-write paths (e.g. user just saved their profile) read from the primary, not a replica. Wire this into your data-access layer.
  • Stale-OK paths (dashboards, search) can tolerate replicas; mark them in code.
  • Lag alerts — page when lag exceeds your tolerance (e.g. 30 seconds), critical when it exceeds the data-loss tolerance for failover (e.g. 5 minutes).
  • Long queries on replicas — kill or quarantine; replicas are not your analytics warehouse.

Promoting a lagging replica is a data-loss decision in disguise. The on-call playbook should make this explicit:

Primary down. Replica lag = 4 seconds.
├── Promote replica now? → ~4 seconds of writes lost
├── Wait for primary recovery? → likely longer outage, fewer lost writes
└── Replay primary WAL/binlog onto replica before promote? → longest outage, no data loss

The right answer depends on the application. Decide and document it before the incident, not during.

Managed services (RDS Multi-AZ, Aurora, Cloud SQL HA, Azure SQL) do automatic failover with seconds-to-minutes downtime. Self-managed setups use Patroni, repmgr, MHA, Orchestrator, or similar.

ModeWhat It Looks Like
Synchronous replica → primaryZero data loss; lower write throughput (waiting for replica ack).
Asynchronous replica → primarySome data loss possible; higher write throughput.
ManualOperator decides; safest in ambiguous failures.
Automatic with consensusEtcd/Consul/ZooKeeper makes the call; faster but can misfire on network partitions.

Split-brain happens when two nodes think they are the primary and both accept writes. The conflict is messy to recover from: divergent data, broken replication, human cleanup.

Defenses:

  • Fencing — when a node loses leadership, its writes are rejected (or its connections are killed) by something other than itself.
  • Quorum — leader election requires a majority; a minority partition cannot promote.
  • Witness or arbiter — a separate node breaks ties in 2-node clusters.
  • STONITH (“shoot the other node in the head”) — kill the suspected old primary at the infrastructure layer before promoting a new one.

For managed services, fencing is the vendor’s problem; for self-managed, it is yours.

A backup that has never been restored is a hope, not a backup.

TypeWhat It IsRecovery Properties
FullComplete snapshot of the database.Slowest to take; simplest to restore.
IncrementalChanges since the last full or incremental.Faster to take; restore replays a chain.
WAL / binlog archivingContinuous stream of write-ahead-log files.Enables point-in-time recovery.
Logical (pg_dump, mysqldump)SQL representation.Cross-version friendly; slower for large DBs.
Physical / blockFile-system or block-level snapshot.Fastest for large DBs; tied to engine version.
TermMeaningExample
RPO (Recovery Point Objective)How much data can you lose?”We can lose 5 minutes of writes.”
RTO (Recovery Time Objective)How long can recovery take?”We can be down for 1 hour.”

These drive backup frequency and infrastructure choices. A 1-minute RPO requires continuous WAL archiving or synchronous replication; a 24-hour RPO can use nightly snapshots.

A restore drill is a scheduled exercise:

  1. Pick a backup (last night, last week, or random).
  2. Restore to a separate environment.
  3. Verify the database starts and a known query returns expected results.
  4. Time the entire process and compare to RTO.
  5. Note anything that surprised you. Update the runbook.

Cadence: at least quarterly. Document the last drill date in the service catalog. If you have not done one in the last six months, schedule one before the next time you would need it for real.

See also Stateful backup and restore on Kubernetes for backup tooling specific to K8s-hosted databases (Velero, CSI snapshots).

Most managed services and well-configured self-managed setups support point-in-time recovery (PITR) — restore to any second within the WAL/binlog retention window.

Friday 10:23 UTC: bad migration wipes a table
Friday 10:25 UTC: discovered
PITR target: Friday 10:22:30 UTC (just before the migration)

PITR usually runs into a fresh database (cannot rewind the existing one). The recovery flow is: spin up a new instance from the backup → replay WAL up to the target time → switch the application connection (or extract the missing data and reapply).

Schema changes are one of the most common sources of database incidents. The risk profile is very different from application deploys.

PatternWhy It’s Risky
ALTER TABLE on a hot tableDefault behavior in many engines locks the table or rewrites it; long lock waits.
Adding a non-nullable column with defaultSome engines rewrite the entire table.
Dropping a columnBreaks any code path still using it.
RenamingThe application uses both names during deploy.
Index creationLong; locks vary by engine. Use CONCURRENTLY (Postgres) or online DDL (MySQL 8).
  • Backwards-compatible always: never break the previous app version’s queries in a single deploy. Two-phase: add the new column, deploy code that writes both, backfill, deploy code that reads new, drop old.
  • Online DDL: Postgres CREATE INDEX CONCURRENTLY, MySQL 8 ALGORITHM=INPLACE, LOCK=NONE where supported.
  • Migration tooling that supports retries and timeouts (Flyway, Liquibase, Alembic, gh-ost, pt-online-schema-change).
  • Lock timeouts in the migration session (SET lock_timeout = '5s' in Postgres). Better to fail and retry than to block all writes.
  • Off-peak window for the migration step that actually carries risk.
Symptom: app errors after deploy
Was a migration part of the deploy? ──── No → roll back app
▼ Yes
Did the migration succeed? ──── No → migration tool will tell you
▼ Yes
Are queries hitting locks? ──── Yes → check pg_locks / SHOW PROCESSLIST
▼ No
Is the schema actually what you expect? → describe / SHOW CREATE TABLE

Rolling back a migration is often harder than rolling forward — schema changes are not symmetric. This is why backwards-compatible patterns matter.

The same three pillars apply: metrics, logs, traces. A few signals matter most for relational databases.

SignalExamples
Latencypg_stat_statements mean exec time; per-query p99 in your APM.
ThroughputTransactions per second; queries per second.
ErrorsConnection refused, lock wait timeouts, deadlocks, query failures.
SaturationActive connections vs max; CPU; disk I/O wait; cache hit ratio.
EngineWatch
PostgresReplication lag (pg_stat_replication), bloat, autovacuum activity, long transactions, lock waits.
MySQLInnodb buffer pool hit rate, slave/replica lag, lock waits, open table count.
AllSlow query log, error log, connection count vs max.
  • Replication lag above tolerance, with critical threshold set below the failover-data-loss tolerance.
  • Connection saturation (e.g. >80% of max for 5 minutes).
  • Disk free below threshold (e.g. less than 20% remaining, critical below 10%).
  • Backup age — alert if no successful backup in the last 24 hours (or your RPO window).
  • Restore drill staleness — alert if no drill in the last quarter.

Enable the slow query log in production with a sane threshold (e.g. 1 second). Review the top offenders weekly. Most database “performance problems” are 5–10 specific queries that account for 80% of database load.

A reusable runbook structure:

Service: <name>
Owners: <team>
Database: <engine, version, hosting>
Replicas: <count, lag tolerance>
Backups: <type, frequency, retention>
RPO/RTO: <values>
Last restore drill: <date>
Symptoms and First Steps
- Latency spike → see "Slow query triage" below
- Connection errors → see "Pool exhaustion" below
- Replica lag alert → see "Replica lag" below
- Disk full alert → see "Disk full" below
Slow Query Triage
1. ...
Pool Exhaustion
1. ...
Replica Lag
1. ...
Disk Full
1. ...
Failover Procedure (manual)
1. ...
Rollback / Restore (PITR)
1. ...
Escalation
- L2: <team>
- L3: <vendor / DBA team>

Each section is short. Long runbooks are not read during incidents.

  • Connection pool sizing is documented and matches database max-connections math.
  • Read-after-write paths are routed to the primary; stale-OK paths use replicas.
  • Replication lag has both warning and critical alerts with rationale.
  • Failover behavior (automatic vs manual; data-loss tolerance) is documented in the runbook.
  • Backups run at the cadence required by RPO; a monitor alerts on backup age.
  • At least one restore drill has happened in the last quarter; results are documented.
  • PITR target retention covers your RPO window.
  • Migrations use a tool with timeouts; high-risk patterns are flagged in code review.
  • Slow query log is on in production; top queries are reviewed regularly.
  • On-call runbook exists, is short, and was last verified within the last quarter.