Running a database in production (the parts they don't teach)
Your CS class taught you B-trees, MVCC, and Raft, but what wakes you up at 3 a.m. is a pg_dump that is silently truncating, a 600 GB table that needs a new column on Diwali eve, or a query plan that flipped from Index Scan to Seq Scan after a minor version bump. This chapter is the operational layer no syllabus covers, distilled from how Razorpay, Zerodha, Flipkart, and the broader SRE / DBRE world actually keep databases alive.
A database in production is judged on six things you cannot derive from a paper: backups you have restored, migrations that did not lock a hot table, capacity headroom, observability into slow queries, an incident response that does not need the senior engineer, and plan stability across upgrades. Everything else is theatre. Why these six: each one represents a class of failure where the database is technically up but the business is down — corruption you cannot reverse, locks that block writes, disk that fills overnight, queries that ran in 5 ms yesterday and 5 s today.
The five-promise gap between staging and production
A database that passes its test suite is a database that has answered easy questions. A database in production has to answer the same questions while:
- the disk is 84% full,
- a junior engineer just deployed a migration that takes an ACCESS EXCLUSIVE lock,
- one replica is 47 seconds behind master because of a long VACUUM,
- the on-call for tonight has been an SRE for nine weeks,
- and Razorpay's payment flow depends on it returning under 50 ms p99.
Charity Majors has a line that captures this: the test that matters is the one your production traffic runs. Andy Pavlo's CMU DBRE lectures put it more bluntly — most database outages are not bugs in the database. They are misuse, misconfiguration, and missing operational rituals. This chapter is those rituals.
The rest of this chapter works through those rituals, one at a time.
Backups: the only test that matters is the restore
Riya at Zerodha runs pg_dump every night to S3. Every dump completes. The S3 bucket has 480 files. Riya has never restored one of them. Riya does not have backups. Riya has a folder of files that might be backups.
This is the single most-violated rule in production databases, and Charity Majors' framing is the right one: if you have not restored a backup in the last 30 days, you do not have backups. The backup pipeline is a system. Systems break silently. The dump can be truncated by an OOM kill, encrypted with a key you no longer have, or written to a region that has just been quarantined. You will only learn this on the day you need it.
The drill — what to actually do, monthly:
# 1. Pick a real backup at random (not the latest — the latest is the one you trust)
aws s3 ls s3://zerodha-pg-backups/2026/03/ | shuf | head -1
# 2026-03-14_03-15-00_full.dump.gz
# 2. Pull it to a fresh box
aws s3 cp s3://zerodha-pg-backups/2026/03/2026-03-14_03-15-00_full.dump.gz .
# 3. Restore into a throwaway Postgres. Decompress first: pg_restore -j
#    needs a seekable file, not a pipe, and -d expects the DB to exist.
gunzip 2026-03-14_03-15-00_full.dump.gz
DB=restore_test_$(date +%s)
createdb "$DB"
pg_restore -d "$DB" -j 4 2026-03-14_03-15-00_full.dump
# 4. Run a smoke query that proves business data is there, not just schemas
psql restore_test_xxx -c "SELECT count(*), max(created_at) FROM orders WHERE created_at::date = '2026-03-13';"
# count: 84,217 max: 2026-03-13 23:59:58
That last step is non-negotiable. A restore that succeeds but lands an empty orders table is the worst kind of failure — silent, plausible, and only detected when you needed the data. Why a smoke query and not a row count alone: row counts can match across two snapshots if the dump captured an empty replica or a sharded subset. A semantically meaningful query (a known order, a known user, a known sum-of-payments-on-a-date) verifies that the content survived, not just the cardinality.
RPO and RTO: the two numbers leadership cares about
- RPO (Recovery Point Objective) — how much data are you allowed to lose? For Razorpay payments, this is "zero, ever." For an analytics warehouse, "24 hours" may be fine. RPO drives backup frequency and log-shipping.
- RTO (Recovery Time Objective) — how fast must you be back up? For UPI rails, minutes. For a marketing dashboard, hours. RTO drives backup format (logical dump vs. physical snapshot vs. continuous PITR vs. hot standby).
A 600 GB Postgres database takes ~3 hours to restore from a pg_dump on a fast NVMe box. If your RTO is 30 minutes, pg_dump is the wrong primitive — you need WAL-shipped physical backups via pgBackRest or a managed PITR window via RDS / Aurora. Pick the primitive after you have written down RPO and RTO, not before.
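That arithmetic is worth automating: restore time is roughly dump size divided by the sustained pg_restore throughput you measured in your last drill. A rough sketch (the 57 MB/s figure is illustrative; measure your own during the monthly restore):

```python
def restore_hours(db_gb: float, throughput_mb_s: float) -> float:
    """Estimated wall-clock hours to restore a logical dump.

    throughput_mb_s is the sustained rate you measured during a real
    restore drill on your own hardware -- not a vendor spec sheet number.
    """
    return (db_gb * 1024) / throughput_mb_s / 3600

# 600 GB at ~57 MB/s sustained is roughly the 3-hour figure above.
print(round(restore_hours(600, 57), 1))
```

If the printed number exceeds your written RTO, you have your answer about pg_dump before the outage does.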
Point-in-time recovery is a different animal from backups
A nightly snapshot lets you restore to last midnight. Point-in-time recovery (PITR) lets you restore to the moment before the bad UPDATE ran at 14:32:05. PITR works by combining a base backup with the continuous stream of WAL segments that follow it. Postgres ships these to an archive with archive_command. The archive must be tested independently of the base backups — a corrupt WAL chain is silent until you replay it. Schedule a quarterly PITR drill: pick a random commit from yesterday, restore the cluster to one second before it, verify the row that the commit changed is in its old shape.
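On Postgres 12+, the recovery target is a handful of settings on the restored node plus an empty recovery.signal file. A sketch, assuming a pgBackRest archive (the stanza name and timestamp are illustrative):

```ini
# postgresql.conf on the node being recovered (Postgres 12+)
restore_command        = 'pgbackrest --stanza=main archive-get %f "%p"'
recovery_target_time   = '2026-04-25 14:32:04+05:30'  # one second before the bad UPDATE
recovery_target_action = 'promote'                    # open for writes once the target is reached
# then: touch recovery.signal in the data directory and start the server
```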
Schema migrations on hot tables
A 200 GB payments table at PhonePe. You need to add a merchant_country column. Junior engineer writes:
ALTER TABLE payments ADD COLUMN merchant_country text DEFAULT 'IN' NOT NULL;
This is a five-second statement on an empty table and a forty-minute disaster on a hot one. In Postgres versions before 11, ADD COLUMN ... DEFAULT ... NOT NULL rewrites every row to fill the default. The rewrite holds an ACCESS EXCLUSIVE lock for the entire duration. Every read, every write, every transaction blocks. UPI traffic backs up. Customers see "transaction failed." On-call gets paged.
The rule, drilled into every DBRE: never run a migration that takes any lock stronger than SHARE UPDATE EXCLUSIVE on a hot table during business hours. Use the expand-and-contract pattern instead.
The rules in detail:
- ADD COLUMN without a default: instant in Postgres 11+, just a catalog change. Add the column nullable, deploy the app to dual-write.
- ADD COLUMN ... DEFAULT: in Postgres 11+, non-volatile defaults are stored as a metadata default and do not rewrite the table. A volatile default (random(), clock_timestamp()) still rewrites every row. Read the docs before you assume.
- Backfill in batches: never UPDATE big_table SET col = ... in one statement. That holds row locks for the duration of the transaction and bloats WAL. Loop over primary key ranges of 10,000 rows, commit, sleep 100 ms.
- Drop columns in a separate deploy from the code change: deploy "stop reading the column" first, wait a full deploy cycle, then deploy the DROP COLUMN. This way a rollback never lands the application on a schema it does not match.
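The batched backfill above can be sketched in a few lines. The table and column names follow the chapter's running example, and `conn` is assumed to be an open psycopg2-style connection:

```python
import time

def key_ranges(min_id: int, max_id: int, batch: int = 10_000):
    """Yield inclusive (lo, hi) primary-key ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        yield lo, min(lo + batch - 1, max_id)
        lo += batch

def backfill(conn, min_id: int, max_id: int, batch: int = 10_000, pause_s: float = 0.1):
    # One short transaction per batch keeps lock hold times and WAL
    # bursts small; the sleep lets replication and autovacuum breathe.
    for lo, hi in key_ranges(min_id, max_id, batch):
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE payments SET merchant_country = 'IN' "
                "WHERE id BETWEEN %s AND %s AND merchant_country IS NULL",
                (lo, hi),
            )
        conn.commit()       # release row locks before the next batch
        time.sleep(pause_s)

print(list(key_ranges(1, 25, 10)))  # [(1, 10), (11, 20), (21, 25)]
```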
Tools that enforce this
- pg_repack: rewrites a bloated table without holding an exclusive lock for the duration (only brief locks at the final swap). Used at Heroku, Instagram, and others to compact tables hit by long-running transactions.
- gh-ost (GitHub's online schema tool, MySQL): does the expand-and-contract pattern via triggerless binlog tailing. Used at Flipkart and others for InnoDB migrations.
- Sqitch / Flyway / Liquibase: migration frameworks that version your DDL the way Git versions code. Without one of these, your "migrations" are a folder of .sql files nobody can reorder safely.
The single rule that supersedes everything: run every migration on a copy of production-shape data first, with EXPLAIN and pg_locks watching, and time how long it takes. If it takes 12 minutes on staging with the same row count, it will take longer on production with a hot buffer pool.
Capacity planning, in numbers
Capacity in production is three resources, in this order: disk, connections, IOPS. CPU and RAM matter, but they are rarely what kills you first.
Disk: track the slope, not the level
A graph of disk_used_bytes over time has a slope (bytes/day) and a level. The slope is what matters. If your level is 60% and your slope is 4 GB/day, you have plenty of time. If your slope just doubled to 8 GB/day with no traffic increase, something has changed — probably index bloat, a runaway log table, or an autovacuum that has fallen behind. The level matters only for the moment you cross 85%, at which point Postgres autovacuum starts thrashing, write amplification climbs, and your slope steepens.
# capacity_alarm.py — alert when projected days-until-full drops below 14
# `cur` (an open cursor) and `page_oncall` are assumed to exist elsewhere.

def days_until_full(cursor, total_bytes):
    cursor.execute("""
        SELECT EXTRACT(EPOCH FROM (now() - measured_at)) / 86400 AS days_ago,
               disk_used_bytes
        FROM disk_metrics
        WHERE measured_at > now() - interval '14 days'
        ORDER BY measured_at
    """)
    rows = cursor.fetchall()
    # Least-squares slope of bytes vs. days_ago. x decreases toward now,
    # so a growing disk shows up as a NEGATIVE slope in this frame.
    n = len(rows)
    sx = sum(r[0] for r in rows)
    sy = sum(r[1] for r in rows)
    sxy = sum(r[0] * r[1] for r in rows)
    sxx = sum(r[0] * r[0] for r in rows)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    if slope >= 0:
        return float("inf")  # flat or shrinking: never fills
    current = rows[-1][1]
    free = total_bytes - current
    return free / abs(slope)

# alert if projected full in < 14 days
if days_until_full(cur, 600 * 1024**3) < 14:
    page_oncall("disk projected full in <14d")
Run this once an hour. The number it prints is the only capacity metric leadership and on-call need to share.
Connections: pgBouncer is not optional
A Postgres connection costs ~10 MB of RAM and a process. Five thousand application processes opening a connection each is a 50 GB working set just for connections. The fix, on every Postgres deployment that exists, is pgBouncer in transaction-pooling mode (covered in connection lifecycle, pooling, pgBouncer). Without pgBouncer, you will hit max_connections long before you hit any other limit, and the database will refuse new connections while existing ones idle.
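A minimal transaction-pooling configuration, as a sketch (host, port, and pool sizes are illustrative; tune default_pool_size to your cores and workload):

```ini
; pgbouncer.ini — transaction pooling in front of one primary
[databases]
app = host=pg-primary-1 port=5432 dbname=app

[pgbouncer]
listen_port       = 6432
pool_mode         = transaction  ; connection returns to the pool at COMMIT/ROLLBACK
max_client_conn   = 5000         ; app-side sockets pgBouncer will accept
default_pool_size = 40           ; real Postgres connections per database/user pair
```

Five thousand application sockets collapse into forty server connections; the 50 GB working set becomes 400 MB.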
IOPS: the budget you didn't know you had
On AWS RDS, an io2 volume gives you a fixed IOPS quota — say, 3000. A VACUUM FULL, a backup, and a noisy analytics query can together exhaust the IOPS budget. When you exhaust it, every other query suddenly takes 10× longer, because each disk read queues. The signal is disk_queue_depth > 1 for sustained periods. Track it. Cap your background jobs (analytics, backups, vacuums) so the foreground OLTP traffic always has IOPS headroom.
Observability: making "the database is slow" actionable
"The database is slow today" is the worst possible bug report. You cannot act on it. The point of database observability is to convert it into "the orders.created_at index lost its statistics yesterday at 4:13 a.m. and the planner is now full-scanning 200 GB on every checkout."
Three primitives every Postgres deployment must have configured before it sees production traffic:
1. The slow query log
Set log_min_duration_statement = 100 in postgresql.conf. Every statement over 100 ms gets logged with parameters. Cost: a few KB/min in normal operation, more when the database is in trouble. Benefit: when an outage hits, you can grep for the slow queries that started 10 minutes before users noticed.
2026-04-25 14:32:05.122 IST LOG: duration: 12483.221 ms statement:
SELECT * FROM orders WHERE merchant_id = 4821 AND created_at > '2026-04-01';
That single line tells you the query, the parameters, and the wall-clock time. Couple it with a logger that ships to CloudWatch / Loki / ELK and you can answer "what was slow at 14:32" without SSH-ing into the box.
2. pg_stat_statements
The single most useful Postgres extension. Enable it in shared_preload_libraries, run CREATE EXTENSION pg_stat_statements, and query:
SELECT
substring(query, 1, 60) AS q,
calls,
mean_exec_time::int AS mean_ms,
total_exec_time::int AS total_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
This gives you the queries that ate the most database time, across all calls. It turns "the database is slow" into "this one query, called 4 million times, accounts for 38% of database time, and it can be fixed with one index." Andy Pavlo's CMU lectures call this the single biggest win you can hand a junior engineer doing their first DBRE rotation.
3. EXPLAIN (ANALYZE, BUFFERS) on the suspect
Once you have the suspect query from pg_stat_statements, run:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE merchant_id = 4821 AND created_at > '2026-04-01';
The ANALYZE runs the query for real. BUFFERS shows how many shared-buffer pages were hit and read from disk. The output looks like:
Seq Scan on orders  (cost=0.00..145821.00 rows=82147 width=104)
                    (actual time=0.012..12483.212 rows=84217 loops=1)
  Filter: ((merchant_id = 4821) AND (created_at > '2026-04-01'))
  Rows Removed by Filter: 198427134
  Buffers: shared hit=1205 read=2143217
Planning Time: 0.412 ms
Execution Time: 12483.221 ms
Three things to read: the chosen access path (Seq Scan here — bad, you wanted an Index Scan), the rows removed by filter (198M rows scanned to return 84K — terrible selectivity), and read=2143217 (each read is an 8 KB page from disk, roughly 17 GB of disk reads). This query needs an index on (merchant_id, created_at). Why a composite, not two single-column indexes: a composite on (merchant_id, created_at) lets the planner seek to the merchant's section and then range-scan within it. Two separate indexes force a BitmapAnd step that reads more pages and works less well for highly correlated columns.
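Building that index without repeating the locking mistake from the migrations section looks like this (the index name is illustrative):

```sql
-- CONCURRENTLY builds the index without an ACCESS EXCLUSIVE lock:
-- slower, and it cannot run inside a transaction block, but reads
-- and writes on orders continue throughout the build.
CREATE INDEX CONCURRENTLY idx_orders_merchant_created
    ON orders (merchant_id, created_at);
```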
For the same job in MySQL, the equivalents are slow_query_log, performance_schema, and EXPLAIN ANALYZE (8.0+). For RDS specifically, Performance Insights ships an aggregated view of waits and top SQL with no extension needed — this is the right starting point for teams that don't want to roll their own dashboards.
Incident response: the runbook is the API
3 a.m. The on-call gets a page: pg-primary disk_used_pct > 92. They are nine weeks into the job. They have not seen this alert before. What happens next is determined entirely by what is in the runbook.
A good runbook is not a wiki article that explains how Postgres works. It is a sequence of exact commands the on-call can paste, in order, and a decision tree that tells them when to escalate. Camille Fournier's The Manager's Path and the Google SRE book both make the same point: runbooks are the API your senior engineers expose to your on-calls.
Anatomy of a working runbook entry:
## Alert: pg-primary disk_used_pct > 92
### Triage (5 minutes)
1. SSH: `ssh oncall@pg-primary-1.razorpay.internal`
2. Confirm the alert: `df -h /var/lib/postgresql/data`
3. Check WAL size: `du -sh /var/lib/postgresql/data/pg_wal`
- If pg_wal > 50 GB, this is the cause. Go to "WAL backed up" runbook.
4. Check largest tables:
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;"
### Mitigate (15 minutes)
- If WAL backed up: restart standby or break replication slot (see WAL runbook).
- If a table is bloated: schedule pg_repack for off-hours, NOT now.
- If audit_log table is huge: truncate partitions older than 90d (safe).
### Escalate
- If disk > 96% and you have not freed space: page @dbre-lead.
- If you cannot SSH: page @sre-lead.
Two features worth copying. First, time budgets ("Triage 5 min, Mitigate 15 min"). They tell the on-call when to escalate without forcing a judgment call. Second, exact commands, not "check disk usage" — the on-call types the command, not their interpretation of the command. Charity Majors writes about this in the context of Honeycomb's on-call: the runbook is debugged by every on-call who runs it, and corrected by the on-call after the incident.
The same applies to postmortems. After every incident worth more than 15 minutes of downtime, write one. Public, blameless, with a timeline and action items. The Indian fintech ecosystem (Razorpay, PhonePe, Zerodha) has a healthy culture of publishing select postmortems publicly — read those when you find them. They are the closest thing to a textbook this discipline has.
Plan stability across upgrades
Your query was 5 ms yesterday. Today it is 5 seconds. Nothing in the code changed. The DBA bumped Postgres from 14.7 to 14.10 over the weekend.
This is plan instability, and it is one of the most demoralising bugs a database engineer encounters. The query planner uses statistics (collected by ANALYZE) and a cost model. Both can change across minor versions:
- A bug fix to the cost model can flip an Index Scan to a Seq Scan or vice versa.
- An ANALYZE after the upgrade may have collected different statistics due to sampling.
- A new planner default (e.g., enable_partitionwise_join) may turn on automatically.
The defence is not "never upgrade." The defence is to detect the regression and have a rollback plan.
Detection
Capture EXPLAIN plans for your top 50 queries (from pg_stat_statements) before and after every upgrade. A simple diff catches plan flips:
# plan_diff.py — compare query plans across an upgrade
import hashlib
import subprocess

QUERIES = [...]  # top 50 from pg_stat_statements

for q in QUERIES:
    # -Atc: unaligned, tuples-only output — no header/footer noise to hash
    plan = subprocess.check_output(["psql", "-Atc", f"EXPLAIN {q}"]).decode()
    # Normalize: strip costs and row estimates, keep the access-path shape
    shape = "\n".join(line.split("(cost=")[0] for line in plan.splitlines())
    h = hashlib.md5(shape.encode()).hexdigest()[:8]
    print(f"{h} {q[:60]}")
Hash the shape (access paths, join order) and compare. Any hash that changed is a plan that flipped. Investigate before users do.
Mitigation
- Pin the planner: pg_hint_plan (Postgres) and optimizer hints (MySQL) let you force a specific access path for a specific query. Use sparingly — they are duct tape, not a fix — but they are the right duct tape to keep a critical query on its working plan during an incident.
- Refresh statistics after upgrade: ANALYZE every table after a major or minor upgrade. Skipping this leaves the planner with stale pg_statistic data.
- Read the release notes: every Postgres release calls out planner changes in its release notes. Greg Sabino Mullane's blog and the pgsql-general mailing list flag the high-impact ones. Make reading these part of the upgrade checklist.
Common confusions
- "We have backups, so we're safe." Not until you have restored one, recently, end-to-end, with a smoke query that proves the data made it back. The dump completing is a necessary condition, not a sufficient one. Multiple companies have learned this the hard way: GitLab's 2017 outage and the smaller incidents at Indian startups every quarter.
- "pg_dump and a snapshot are interchangeable." They are not. pg_dump is a logical dump (replays as SQL) and is portable across versions. A snapshot is a block-level copy and is only restorable on the same Postgres major version and architecture. Use logical dumps for upgrades and portability; use physical snapshots (plus archived WAL) for fast restores and PITR. Mixing them up is how teams lose data.
- "ALTER TABLE ADD COLUMN is fast in modern Postgres." It is fast only when the default is non-volatile or null. With a volatile default, a check constraint that needs validation, or a type change, it still rewrites the entire table under an ACCESS EXCLUSIVE lock. Read the lock level the statement takes (the Postgres ALTER TABLE docs list it) before assuming.
- "More indexes are always better." Each index has a write cost — every INSERT and UPDATE touches every index on the table. A table with 14 indexes does ~14× the index maintenance per row change. The right number is "exactly the indexes that serve real query plans," not "every column anyone has ever filtered on."
- "Vacuum is something the database does for me." Autovacuum is on by default, but its default threshold (autovacuum_vacuum_scale_factor = 0.2) means a table only triggers a vacuum after 20% of its rows are dead — on a 1 TB table, roughly 200 GB of dead rows. On hot tables you must lower that threshold per-table or the table bloats indefinitely. Tune it; don't trust the default.
- "EXPLAIN shows me what the query does." EXPLAIN shows the planner's guess. EXPLAIN ANALYZE runs the query for real and shows what actually happened. The two often disagree spectacularly when statistics are stale. When debugging, always use ANALYZE. When the query is destructive, wrap it in a transaction you roll back.
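The per-table autovacuum override mentioned above is a single statement; the thresholds here are illustrative starting points, not recommendations:

```sql
-- Vacuum after ~0.5% of rows are dead instead of the default 20%
ALTER TABLE payments SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_threshold    = 10000
);
```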
Going deeper
The DBRE role and where it sits
Database Reliability Engineering (DBRE) is the term Laine Campbell and Charity Majors codified in 2017 for the practice of running databases the way SRE runs services. The role exists at scale (Netflix, Stripe, GitHub, Razorpay, Flipkart, Zerodha) because at scale a "DBA" who only writes SQL cannot keep up — the work shifts to automation, observability, and capacity. The seminal text is Campbell and Majors, Database Reliability Engineering (O'Reilly, 2017). Read it once, then re-read chapter 5 (operational visibility) every year. Andy Pavlo's CMU 15-721 has lectures on the subject ("Database Engineering Anti-Patterns") that are publicly available — they are a free DBRE bootcamp.
Connection storms and the thundering herd
A particular failure mode every database operator meets eventually: an application restarts, all 5,000 worker processes wake up simultaneously, and each opens a fresh database connection in the same millisecond. Postgres rejects them with too many connections, the workers retry, the herd thunders harder, and the database dies. The fix is not on the database — it is jittered retries in the client and pgBouncer in front. The pattern is general; see caching patterns and the thundering herd problem for the cache version.
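The client-side half of the fix can be sketched as full-jitter exponential backoff (the base and cap values are illustrative):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0):
    """'Full jitter' backoff: each retry sleeps a uniform random time in
    [0, min(cap, base * 2**n)], so a herd of 5,000 workers spreads its
    reconnects across seconds instead of aligning on the same millisecond."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

# A worker sleeps through its own schedule between reconnect attempts:
#   for delay in backoff_delays(6): time.sleep(delay); try_connect()
print(backoff_delays(6))
```

The key property is that no two workers draw the same schedule, so the retry wave decorrelates instead of thundering again.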
Long-running transactions and replication lag
A single open transaction on the primary blocks VACUUM from cleaning up dead rows everywhere, and on replicas blocks WAL replay (because the replica might still be reading rows the primary wants to delete). At Razorpay-scale, a developer's BEGIN; SELECT * FROM payments; left open in their psql session has caused production incidents. The defence: idle_in_transaction_session_timeout = '5min' in postgresql.conf. The detection: monitor pg_stat_activity for transactions older than 60 seconds. The cultural norm: any BEGIN in a psql session is a code smell — either commit it or roll it back inside one screen of typing.
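The detection query against pg_stat_activity is short enough to run from a monitoring agent every minute; a sketch:

```sql
-- Transactions open for more than 60 seconds, oldest first
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '60 seconds'
ORDER BY xact_age DESC;
```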
The "noisy neighbour" problem on managed databases
On RDS, Aurora, and Cloud SQL, your database lives on shared hardware. A neighbour's IOPS spike can degrade your tail latency. You will see this as p99 latency that climbs without any change in your traffic pattern. The defences are limited — you can pay for a larger instance class (which gets dedicated hardware above some tier), enable Provisioned IOPS, or move to single-tenant infrastructure. Most teams don't, and instead build their SLOs to absorb the noise. Marc Brooker's "metastable failures" essay is the canonical writeup of how this kind of noise turns into outages.
The on-call rotation as a learning system
Camille Fournier's framing is the right one: the on-call rotation is not a punishment, it is the team's main feedback loop into the operational quality of the system. Every page is a vote for "fix this." Track pages per week; track the same page recurring; track time-to-mitigate. If the same alert pages three times in a month, the action item is not a better runbook — it is to fix the root cause. A team that lets the same alert page weekly will eventually lose its senior engineers to burnout. Organisations that take this seriously (PagerDuty, Honeycomb, Cloudflare) publish their on-call guidelines; read them.
Where this leads next
- Connection lifecycle, pooling, and pgBouncer — the connection layer that the capacity section gestured at.
- Persistence: RDB snapshots and AOF — Redis's take on backups, with a different RPO/RTO trade-off.
- Replication: sentinel and Redis Cluster — failover automation, the operational sibling of backup-restore.
- Checkpointing: bounding recovery time — why RTO is a function of how often you checkpoint.
- Caching patterns and the thundering herd problem — the cache-side mirror of connection-storm failures.
- Polyglot persistence: picking the right DB per workload — once you have one database in production, the question becomes how many.
References
- Laine Campbell & Charity Majors, Database Reliability Engineering (O'Reilly, 2017) — the foundational text for the role and its rituals. oreilly.com.
- Google SRE, Site Reliability Engineering (O'Reilly, 2016), Ch. 11 Being On-Call and Ch. 14 Managing Incidents — the runbook and incident-management chapters every on-call should read. sre.google/books.
- Andy Pavlo, CMU 15-721 Advanced Database Systems — public lecture series including DBRE anti-patterns and observability material. 15721.courses.cs.cmu.edu.
- Postgres documentation, pg_stat_statements — official extension reference for the most-used DBRE tool in Postgres. postgresql.org/docs.
- GitHub Engineering, gh-ost: triggerless online schema migration for MySQL — the design document for the most-copied online migration tool. github.blog.
- GitLab, Postmortem of database outage of January 31 (2017) — the canonical example of "we had backups, none worked," and the cultural model for blameless postmortems. about.gitlab.com.
- Marc Brooker, Metastable Failures in the Wild — how noise on shared infrastructure turns into runaway outages. brooker.co.za.
- The Append-Only Log — the substrate that every backup, replication, and PITR system in this chapter ultimately writes to.