Running a database in production (the parts they don't teach)

Your CS class taught you B-trees, MVCC, and Raft, but what wakes you up at 3 a.m. is a pg_dump that has been silently truncated, a 600 GB table that needs a new column on Diwali eve, or a query plan that flipped from Index Scan to Seq Scan after a minor version bump. This chapter is the operational layer no syllabus covers, distilled from how Razorpay, Zerodha, Flipkart, and the broader SRE / DBRE world actually keep databases alive.

A database in production is judged on six things you cannot derive from a paper: backups you have restored, migrations that did not lock a hot table, capacity headroom, observability into slow queries, an incident response that does not need the senior engineer, and plan stability across upgrades. Everything else is theatre. Why these six: each one represents a class of failure where the database is technically up but the business is down — corruption you cannot reverse, locks that block writes, disk that fills overnight, queries that ran in 5 ms yesterday and 5 s today.

The gap between staging and production

A database that passes its test suite is a database that has answered easy questions. A database in production has to answer the same questions while the disk fills, the plan flips, connections storm in after a deploy, and the schema changes underneath live traffic.

Charity Majors has a line that captures this: the test that matters is the one your production traffic runs. Andy Pavlo's CMU DBRE lectures put it more bluntly — most database outages are not bugs in the database. They are misuse, misconfiguration, and missing operational rituals. This chapter is those rituals.

The shape of what you need:

[Figure: Six pillars of running a database in production. Six labelled pillars stand on a baseline labelled "production database", each with a one-line failure mode underneath. From left: backups (restore drills, RPO/RTO, PITR — "we have backups"), migrations (expand/contract, no exclusive locks — "ALTER TABLE locked us"), capacity (disk slope, connections, IOPS budget — "disk full at 2 a.m."), observability (slow log, pg_stat_statements — "DB is slow today"), incident response (runbooks, on-call, postmortems — "who has the password"), plan stability (stats freshness, version pinning — "upgrade broke a query").]
The six pillars. Knock any one out and the database appears to be working until the moment it isn't.

Backups: the only test that matters is the restore

Riya at Zerodha runs pg_dump every night to S3. Every dump completes. The S3 bucket has 480 files. Riya has never restored one of them. Riya does not have backups. Riya has a folder of files that might be backups.

This is the single most-violated rule in production databases, and Charity Majors' framing is the right one: if you have not restored a backup in the last 30 days, you do not have backups. The backup pipeline is a system. Systems break silently. The dump can be truncated by an OOM kill, encrypted with a key you no longer have, or written to a region that has just been quarantined. You will only learn this on the day you need it.

The drill — what to actually do, monthly:

# 1. Pick a real backup at random (not the latest — the latest is the one you trust)
aws s3 ls s3://zerodha-pg-backups/2026/03/ | shuf | head -1
# 2026-03-14_03-15-00_full.dump.gz

# 2. Pull it to a fresh box
aws s3 cp s3://zerodha-pg-backups/2026/03/2026-03-14_03-15-00_full.dump.gz .

# 3. Restore into a throwaway Postgres.
#    (Parallel pg_restore needs a seekable file, not a pipe, and the target
#    database must already exist.)
DB=restore_test_$(date +%s)
createdb "$DB"
gunzip 2026-03-14_03-15-00_full.dump.gz
pg_restore -d "$DB" -j 4 2026-03-14_03-15-00_full.dump

# 4. Run a smoke query that proves business data is there, not just schemas
psql "$DB" -c "SELECT count(*), max(created_at) FROM orders WHERE created_at::date = '2026-03-13';"
# count: 84217  max: 2026-03-13 23:59:58

That last step is non-negotiable. A restore that succeeds but lands an empty orders table is the worst kind of failure — silent, plausible, and only detected when you needed the data. Why a smoke query and not a row count alone: row counts can match across two snapshots if the dump captured an empty replica or a sharded subset. A semantically meaningful query (a known order, a known user, a known sum-of-payments-on-a-date) verifies that the content survived, not just the cardinality.

RPO and RTO: the two numbers leadership cares about

A 600 GB Postgres database takes ~3 hours to restore from a pg_dump on a fast NVMe box. If your RTO is 30 minutes, pg_dump is the wrong primitive — you need WAL-shipped physical backups via pgBackRest or a managed PITR window via RDS / Aurora. Pick the primitive after you have written down RPO and RTO, not before.
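
The arithmetic is worth writing down, not just feeling. A sketch — the restore throughput is an assumption you should replace with a number measured in your own drill:

```python
# Back-of-envelope RTO check: can a logical-dump restore meet the target?
# Throughput is an assumption -- measure yours with a real restore drill.

def restore_hours(dump_gb: float, restore_gb_per_hour: float) -> float:
    """Estimated wall-clock hours to restore a dump of the given size."""
    return dump_gb / restore_gb_per_hour

def meets_rto(dump_gb: float, restore_gb_per_hour: float, rto_hours: float) -> bool:
    """True if the estimated restore fits inside the RTO."""
    return restore_hours(dump_gb, restore_gb_per_hour) <= rto_hours

# 600 GB at ~200 GB/hour (fast NVMe, parallel pg_restore) is ~3 hours:
print(restore_hours(600, 200))    # 3.0
# A 30-minute RTO is not achievable with this primitive:
print(meets_rto(600, 200, 0.5))   # False
```

If this check fails on paper, it will fail in the incident; that is the moment to switch to physical backups, not the incident itself.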

Point-in-time recovery is a different animal from backups

A nightly snapshot lets you restore to last midnight. Point-in-time recovery (PITR) lets you restore to the moment before the bad UPDATE ran at 14:32:05. PITR works by combining a base backup with the continuous stream of WAL segments that follow it. Postgres ships these to an archive with archive_command. The archive must be tested independently of the base backups — a corrupt WAL chain is silent until you replay it. Schedule a quarterly PITR drill: pick a random commit from yesterday, restore the cluster to one second before it, verify the row that the commit changed is in its old shape.
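
What this looks like in stock Postgres, as a sketch — the parameter names are real, the paths and the target timestamp are placeholders:

```ini
# postgresql.conf — continuous WAL archiving
archive_mode = on
archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f'

# For the drill: restore the base backup, create recovery.signal (Postgres 12+),
# and point recovery at one second before the bad commit:
restore_command = 'cp /wal-archive/%f %p'
recovery_target_time = '2026-04-25 14:32:04'
```

In production the cp would be a pgBackRest or WAL-G command shipping to object storage; the local-path version is only to illustrate the mechanism.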

Schema migrations on hot tables

A 200 GB payments table at PhonePe. You need to add a merchant_country column. Junior engineer writes:

ALTER TABLE payments ADD COLUMN merchant_country text DEFAULT 'IN' NOT NULL;

This is a five-second statement on an empty table and a forty-minute disaster on a hot one. In Postgres versions before 11, ADD COLUMN ... DEFAULT ... NOT NULL rewrites every row to fill the default. The rewrite holds an ACCESS EXCLUSIVE lock for the entire duration. Every read, every write, every transaction blocks. UPI traffic backs up. Customers see "transaction failed." On-call gets paged.

The rule, drilled into every DBRE: never run a migration that takes any lock stronger than SHARE UPDATE EXCLUSIVE on a hot table during business hours. Use the expand-and-contract pattern instead.

[Figure: Expand-and-contract migration, three phases left to right, each annotated with the lock level it needs; none of them blocks writes. 1. Expand: `ALTER TABLE ... ADD COLUMN merchant_country text;` with no default; app dual-writes old and new; lock: instant. 2. Backfill: batched, throttled `UPDATE ... SET merchant_country = ... WHERE id BETWEEN ...` (10k rows, sleep 1 s); lock: row-level. 3. Contract: app reads the new column only, then `ALTER TABLE ... DROP COLUMN old_merchant;` — deploy first, drop later; lock: brief exclusive.]
Expand-and-contract. The lock-heavy step is split into three lock-light steps that the application straddles.

The rules in detail:

1. Expand. `ALTER TABLE payments ADD COLUMN merchant_country text;` with no default is a catalog-only change in Postgres — the ACCESS EXCLUSIVE lock is held for milliseconds, not minutes. The application then dual-writes the old and new columns.
2. Backfill. A background job fills the new column for existing rows in batches (say 10k rows, then sleep 1 s), taking only row-level locks and leaving IOPS headroom for foreground traffic.
3. Contract. Deploy the application change that reads only the new column first; only then run `ALTER TABLE ... DROP COLUMN`, which needs a brief exclusive lock but no rewrite.

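The backfill phase is where people go wrong by issuing one giant UPDATE that touches millions of rows at once. A batched sketch, reusing the payments example — `id_batches` is the part worth copying:

```python
import time

def id_batches(lo: int, hi: int, size: int):
    """Yield inclusive (start, end) id ranges covering [lo, hi]."""
    start = lo
    while start <= hi:
        end = min(start + size - 1, hi)
        yield (start, end)
        start = end + 1

def backfill(conn, lo: int, hi: int, batch_size: int = 10_000, pause_s: float = 1.0):
    """Fill merchant_country batch by batch: row locks only, never a table lock."""
    with conn.cursor() as cur:
        for start, end in id_batches(lo, hi, batch_size):
            cur.execute(
                "UPDATE payments SET merchant_country = 'IN' "
                "WHERE id BETWEEN %s AND %s AND merchant_country IS NULL",
                (start, end),
            )
            conn.commit()        # release row locks before the next batch
            time.sleep(pause_s)  # throttle so OLTP traffic keeps its IOPS

# id_batches(1, 25_000, 10_000) -> (1, 10000), (10001, 20000), (20001, 25000)
```

The commit per batch matters as much as the batching: it keeps lock hold times short and lets replicas keep up instead of accumulating a single enormous transaction.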
Tools that enforce this

For MySQL, gh-ost (GitHub's triggerless online schema migration tool — see references) and Percona's pt-online-schema-change automate the expand-and-contract dance. For Postgres, the minimum discipline is a `lock_timeout` of a few seconds on every migration session, so an ALTER that cannot acquire its lock aborts cleanly instead of queueing up and blocking everything behind it.

The single rule that supersedes everything: run every migration on a copy of production-shape data first, with EXPLAIN and pg_locks watching, and time how long it takes. If it takes 12 minutes on staging with the same row count, it will take longer on production with a hot buffer pool.

Capacity planning, in numbers

Capacity in production is three resources, in this order: disk, connections, IOPS. CPU and RAM matter, but they are rarely what kills you first.

Disk: track the slope, not the level

A graph of disk_used_bytes over time has a slope (bytes/day) and a level. The slope is what matters. If your level is 60% and your slope is 4 GB/day, you have plenty of time. If your slope just doubled to 8 GB/day with no traffic increase, something has changed — probably index bloat, a runaway log table, or a VACUUM that has fallen behind. The level matters only for the moment you cross 85%, at which point Postgres autovacuum starts thrashing, write amplification climbs, and your slope steepens.

# capacity_alarm.py — alert when projected days-until-full drops below 14
import psycopg2  # `cur` below is a cursor from a psycopg2 connection

def days_until_full(cursor, total_bytes):
    cursor.execute("""
        SELECT EXTRACT(EPOCH FROM (now() - measured_at)) / 86400 AS days_ago,
               disk_used_bytes
        FROM disk_metrics
        WHERE measured_at > now() - interval '14 days'
        ORDER BY measured_at
    """)
    rows = cursor.fetchall()
    # Least-squares slope of bytes vs days_ago. Because x counts backwards
    # (older samples have larger days_ago), a growing disk gives a NEGATIVE slope.
    n = len(rows)
    sx = sum(r[0] for r in rows)
    sy = sum(r[1] for r in rows)
    sxy = sum(r[0] * r[1] for r in rows)
    sxx = sum(r[0] * r[0] for r in rows)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    if slope >= 0:
        return float("inf")  # flat or shrinking: never fills
    current = rows[-1][1]    # most recent sample
    free = total_bytes - current
    return free / abs(slope)

# alert if < 14 days (page_oncall is whatever pages your rotation)
if days_until_full(cur, 600 * 1024**3) < 14:
    page_oncall("disk projected full in <14d")

Run this once an hour. The number it prints is the only capacity metric leadership and on-call need to share.

Connections: pgBouncer is not optional

A Postgres connection costs ~10 MB of RAM and a process. Five thousand application processes opening a connection each is a 50 GB working set just for connections. The fix, on every Postgres deployment that exists, is pgBouncer in transaction-pooling mode (covered in connection lifecycle, pooling, pgBouncer). Without pgBouncer, you will hit max_connections long before you hit any other limit, and the database will refuse new connections while existing ones idle.
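
A minimal transaction-pooling setup, sketched — the host and database names are placeholders, the keys are standard pgbouncer.ini settings:

```ini
[databases]
; clients connect to pgBouncer on 6432 instead of Postgres on 5432
payments = host=pg-primary-1.internal port=5432 dbname=payments

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction     ; a server connection is borrowed per transaction
max_client_conn = 5000      ; the app herd connects here, cheaply
default_pool_size = 40      ; real Postgres connections per user/database pair
```

Five thousand clients now share forty server connections, and the 50 GB of per-connection overhead becomes roughly 400 MB.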

IOPS: the budget you didn't know you had

On AWS RDS, an io2 volume gives you a fixed IOPS quota — say, 3000. A VACUUM FULL, a backup, and a noisy analytics query can together exhaust the IOPS budget. When you exhaust it, every other query suddenly takes 10× longer, because each disk read queues. The signal is disk_queue_depth > 1 for sustained periods. Track it. Cap your background jobs (analytics, backups, vacuums) so the foreground OLTP traffic always has IOPS headroom.

Observability: making "the database is slow" actionable

"The database is slow today" is the worst possible bug report. You cannot act on it. The point of database observability is to convert it into "the orders.created_at index lost its statistics yesterday at 4:13 a.m. and the planner is now full-scanning 200 GB on every checkout."

Three primitives every Postgres deployment must have configured before it sees production traffic:

1. The slow query log

Set log_min_duration_statement = 100 in postgresql.conf. Every statement over 100 ms gets logged with parameters. Cost: a few KB/min in normal operation, more when the database is in trouble. Benefit: when an outage hits, you can grep for the slow queries that started 10 minutes before users noticed.

2026-04-25 14:32:05.122 IST LOG:  duration: 12483.221 ms  statement:
    SELECT * FROM orders WHERE merchant_id = 4821 AND created_at > '2026-04-01';

That single line tells you the query, the parameters, and the wall-clock time. Couple it with a logger that ships to CloudWatch / Loki / ELK and you can answer "what was slow at 14:32" without SSH-ing into the box.

2. pg_stat_statements

The single most useful Postgres extension. Enable it in shared_preload_libraries, run CREATE EXTENSION pg_stat_statements, and query:

SELECT
    substring(query, 1, 60) AS q,
    calls,
    mean_exec_time::int AS mean_ms,
    total_exec_time::int AS total_ms,
    rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

This gives you the queries that ate the most database time, across all calls. It turns "the database is slow" into "this one query, called 4 million times, accounts for 38% of database time, and it can be fixed with one index." Andy Pavlo's CMU lectures call this the single biggest win you can hand a junior engineer doing their first DBRE rotation.
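
That "38% of database time" framing is just one query's total_exec_time over the sum across all queries. A tiny helper if you pull the rows into Python — the data here is invented for illustration:

```python
def share_of_db_time(rows):
    """rows: [(query, total_exec_time_ms), ...] -> [(query, fraction), ...], largest first."""
    total = sum(t for _, t in rows)
    shares = [(q, t / total) for q, t in rows]
    return sorted(shares, key=lambda s: s[1], reverse=True)

rows = [("SELECT * FROM orders WHERE ...", 380_000),
        ("INSERT INTO audit_log ...",      120_000),
        ("SELECT 1",                       500_000)]
top = share_of_db_time(rows)
print(top[0])   # ('SELECT 1', 0.5)
```

In practice you would compute the same thing in SQL with a window function over pg_stat_statements; the Python version only makes the arithmetic explicit.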

3. EXPLAIN (ANALYZE, BUFFERS) on the suspect

Once you have the suspect query from pg_stat_statements, run:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE merchant_id = 4821 AND created_at > '2026-04-01';

The ANALYZE runs the query for real. BUFFERS shows how many shared-buffer pages were hit and read from disk. The output looks like:

Seq Scan on orders  (cost=0.00..145821.00 rows=82147 width=104)
                    (actual time=0.012..12483.212 rows=84217 loops=1)
  Filter: (merchant_id = 4821 AND created_at > '2026-04-01')
  Rows Removed by Filter: 198427134
  Buffers: shared hit=1205 read=2143217
Planning Time: 0.412 ms
Execution Time: 12483.221 ms

Three things to read: the chosen access path (Seq Scan here — bad, you wanted Index Scan), the rows removed by filter (198M rows scanned to return 84K — terrible selectivity), and read=2143217 (each read is an 8 KB page from disk = ~17 GB of disk reads). This query needs an index on (merchant_id, created_at). Why a composite, not two single-column indexes: a composite on (merchant_id, created_at) lets the planner seek to the merchant's section and then range-scan within it. Two separate indexes force a BitmapAnd step that reads more pages and works less well for highly correlated columns.

For the same job in MySQL, the equivalents are slow_query_log, performance_schema, and EXPLAIN ANALYZE (8.0+). For RDS specifically, Performance Insights ships an aggregated view of waits and top SQL with no extension needed — this is the right starting point for teams that don't want to roll their own dashboards.

Incident response: the runbook is the API

3 a.m. The on-call gets a page: pg-primary disk_used_pct > 92. They are nine weeks into the job. They have not seen this alert before. What happens next is determined entirely by what is in the runbook.

A good runbook is not a wiki article that explains how Postgres works. It is a sequence of exact commands the on-call can paste, in order, and a decision tree that tells them when to escalate. Camille Fournier's The Manager's Path and the Google SRE book both make the same point: runbooks are the API your senior engineers expose to your on-calls.

Anatomy of a working runbook entry:

## Alert: pg-primary disk_used_pct > 92

### Triage (5 minutes)
1. SSH: `ssh oncall@pg-primary-1.razorpay.internal`
2. Confirm the alert: `df -h /var/lib/postgresql/data`
3. Check WAL size: `du -sh /var/lib/postgresql/data/pg_wal`
   - If pg_wal > 50 GB, this is the cause. Go to "WAL backed up" runbook.
4. Check largest tables:
   psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
            FROM pg_class ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;"

### Mitigate (15 minutes)
- If WAL backed up: restart standby or break replication slot (see WAL runbook).
- If a table is bloated: schedule pg_repack for off-hours, NOT now.
- If audit_log table is huge: truncate partitions older than 90d (safe).

### Escalate
- If disk > 96% and you have not freed space: page @dbre-lead.
- If you cannot SSH: page @sre-lead.

Two features worth copying. First, time budgets ("Triage 5 min, Mitigate 15 min"). They tell the on-call when to escalate without forcing a judgment call. Second, exact commands, not "check disk usage" — the on-call types the command, not their interpretation of the command. Charity Majors writes about this in the context of Honeycomb's on-call: the runbook is debugged by every on-call who runs it, and corrected by the on-call after the incident.

The same applies to postmortems. After every incident worth more than 15 minutes of downtime, write one. Public, blameless, with a timeline and action items. The Indian fintech ecosystem (Razorpay, PhonePe, Zerodha) has a healthy culture of publishing select postmortems publicly — read those when you find them. They are the closest thing to a textbook this discipline has.

Plan stability across upgrades

Your query was 5 ms yesterday. Today it is 5 seconds. Nothing in the code changed. The DBA bumped Postgres from 14.7 to 14.10 over the weekend.

This is plan instability, and it is one of the most demoralising bugs a database engineer encounters. The query planner uses statistics (collected by ANALYZE) and a cost model, and both can change across a version bump: planner fixes ship even in minor releases, and a major upgrade via pg_upgrade leaves statistics empty until you re-run ANALYZE.

The defence is not "never upgrade." The defence is to detect the regression and have a rollback plan.

Detection

Capture EXPLAIN plans for your top 50 queries (from pg_stat_statements) before and after every upgrade. A simple diff catches plan flips:

# plan_diff.py — compare query plans across upgrade
import subprocess, hashlib

QUERIES = [...]  # top 50 from pg_stat_statements

for q in QUERIES:
    plan = subprocess.check_output(["psql", "-c", f"EXPLAIN {q}"]).decode()
    # normalize: strip costs and row estimates, keep the access path shape
    shape = "\n".join(line.split("(cost=")[0] for line in plan.splitlines())
    h = hashlib.md5(shape.encode()).hexdigest()[:8]
    print(f"{h}  {q[:60]}")

Hash the shape (access paths, join order) and compare. Any hash that changed is a plan that flipped. Investigate before users do.
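
Comparing the two runs is then a dictionary diff. A sketch, using hashes shaped like the ones the script above prints:

```python
def plan_flips(before: dict, after: dict):
    """before/after map query -> plan-shape hash. Return queries whose shape changed."""
    return sorted(q for q in before if q in after and before[q] != after[q])

before = {"q1": "a1b2c3d4", "q2": "11112222", "q3": "deadbeef"}
after  = {"q1": "a1b2c3d4", "q2": "33334444", "q3": "deadbeef"}
print(plan_flips(before, after))   # ['q2']
```

Run the capture script against the old binary, stash the output, run it again after the upgrade, and diff. An empty flip list is your green light; a non-empty one is a regression to investigate before users find it.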

Mitigation

Common confusions

Backups and PITR are different capabilities: a nightly snapshot restores to last midnight, PITR restores to 14:32:04, and having one does not give you the other. "We have backups" and "we have restores" are different claims — only the second one counts. And a DBA who writes SQL is not yet a DBRE; the role is automation, observability, and capacity.

Going deeper

The DBRE role and where it sits

Database Reliability Engineering (DBRE) is the term Laine Campbell and Charity Majors codified in 2017 for the practice of running databases the way SRE runs services. The role exists at scale (Netflix, Stripe, GitHub, Razorpay, Flipkart, Zerodha) because at scale a "DBA" who only writes SQL cannot keep up — the work shifts to automation, observability, and capacity. The seminal text is Campbell and Majors, Database Reliability Engineering (O'Reilly, 2017). Read it once, then re-read chapter 5 (operational visibility) every year. Andy Pavlo's CMU 15-721 has lectures on the subject ("Database Engineering Anti-Patterns") that are publicly available — they are a free DBRE bootcamp.

Connection storms and the thundering herd

A particular failure mode every database operator meets eventually: an application restarts, all 5,000 worker processes wake up simultaneously, and each opens a fresh database connection in the same millisecond. Postgres rejects them with too many connections, the workers retry, the herd thunders harder, and the database dies. The fix is not on the database — it is jittered retries in the client and pgBouncer in front. The pattern is general; see caching patterns and the thundering herd problem for the cache version.
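
The client-side fix is exponential backoff with full jitter. A sketch — the base and cap are tuning knobs, not gospel:

```python
import random

def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Sleep time before retry `attempt` (0-based): uniform over
    [0, min(cap, base * 2**attempt)]. The randomness is the point --
    it spreads 5,000 reconnects across the window instead of landing
    them all in the same millisecond."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# attempt 0 -> up to 0.1 s, attempt 5 -> up to 3.2 s, attempt 10 -> capped at 10 s
```

Without the jitter, exponential backoff alone just synchronises the herd into evenly spaced waves; with it, the retries arrive as a smear the pooler can absorb.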

Long-running transactions and replication lag

A single open transaction on the primary blocks VACUUM from cleaning up dead rows everywhere, and on replicas blocks WAL replay (because the replica might still be reading rows the primary wants to delete). At Razorpay-scale, a developer's BEGIN; SELECT * FROM payments; left open in their psql session has caused production incidents. The defence: idle_in_transaction_session_timeout = '5min' in postgresql.conf. The detection: monitor pg_stat_activity for transactions older than 60 seconds. The cultural norm: any BEGIN in a psql session is a code smell — either commit it or roll it back inside one screen of typing.
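
Detection can be a cron job against pg_stat_activity — the view and its xact_start column are standard Postgres; the filtering helper and the sample rows here are illustrative:

```python
# Query for transactions open longer than 60 s (run via any Postgres client):
LONG_TXN_SQL = """
    SELECT pid, usename, state,
           EXTRACT(EPOCH FROM (now() - xact_start)) AS age_s,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '60 seconds'
    ORDER BY xact_start
"""

def offenders(rows, threshold_s: float = 60.0):
    """rows: (pid, user, state, age_s, query). Keep transactions older than threshold."""
    return [r for r in rows if r[3] > threshold_s]

# A psql session left at `BEGIN; SELECT * FROM payments;` shows up as
# state = 'idle in transaction' with age_s climbing -- page before VACUUM stalls.
rows = [(101, "app", "active", 4.2, "UPDATE ..."),
        (202, "dev", "idle in transaction", 1900.0, "SELECT * FROM payments")]
print(offenders(rows))   # the dev session only
```

Wire the non-empty case to the same pager as the disk alarm; an idle-in-transaction session is a disk-full incident in slow motion.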

The "noisy neighbour" problem on managed databases

On RDS, Aurora, and Cloud SQL, your database lives on shared hardware. A neighbour's IOPS spike can degrade your tail latency. You will see this as p99 latency that climbs without any change in your traffic pattern. The defences are limited — you can pay for a larger instance class (which gets dedicated hardware above some tier), enable Provisioned IOPS, or move to single-tenant infrastructure. Most teams don't, and instead build their SLOs to absorb the noise. Marc Brooker's "metastable failures" essay is the canonical writeup of how this kind of noise turns into outages.

The on-call rotation as a learning system

Camille Fournier's framing is the right one: the on-call rotation is not a punishment, it is the team's main feedback loop into the operational quality of the system. Every page is a vote for "fix this." Track pages per week; track the same page recurring; track time-to-mitigate. If the same alert pages three times in a month, the action item is not a better runbook — it is to fix the root cause. A team that lets the same alert page weekly will eventually lose its senior engineers to burnout. Organisations that take this seriously (PagerDuty, Honeycomb, Cloudflare) publish their on-call guidelines; read them.

Where this leads next

References

  1. Laine Campbell & Charity Majors, Database Reliability Engineering (O'Reilly, 2017) — the foundational text for the role and its rituals. oreilly.com.
  2. Google SRE, Site Reliability Engineering (O'Reilly, 2016), Ch. 11 Being On-Call and Ch. 14 Managing Incidents — the runbook and incident-management chapters every on-call should read. sre.google/books.
  3. Andy Pavlo, CMU 15-721 Advanced Database Systems — public lecture series including DBRE anti-patterns and observability material. 15721.courses.cs.cmu.edu.
  4. Postgres documentation, pg_stat_statements — official extension reference for the most-used DBRE tool in Postgres. postgresql.org/docs.
  5. GitHub Engineering, gh-ost: triggerless online schema migration for MySQL — the design document for the most-copied online migration tool. github.blog.
  6. GitLab, Postmortem of database outage of January 31 (2017) — the canonical example of "we had backups, none worked," and the cultural model for blameless postmortems. about.gitlab.com.
  7. Marc Brooker, Metastable Failures in the Wild — how noise on shared infrastructure turns into runaway outages. brooker.co.za.
  8. The Append-Only Log — the substrate that every backup, replication, and PITR system in this chapter ultimately writes to.