In short
Point-in-Time Recovery (PITR) is the operational answer to the one question every replication strategy fails to answer: what if the write was intentional, committed, and wrong? No amount of synchronous replication saves you from DELETE FROM users; without a WHERE clause — every replica applied it faithfully. PITR reaches back in time and produces a database as it was one second before the mistake.
The trick is two artifacts kept together. A base backup is a full physical copy of the data files at some past moment — usually taken nightly, stored in object storage. A WAL archive is every WAL segment the primary has generated since, streamed continuously to the same storage. Together they give you an unbroken chain: restore the base backup, replay WAL forward to any timestamp within the archive's retention window, stop there. Postgres exposes this through recovery_target_time, recovery_target_lsn, and recovery_target_xid.
The operational picture around that mechanism decides whether it saves you or not. Cadence: base backups daily (or more often for large databases), WAL archived every segment as it fills (~1-minute granularity under load). Retention: 30 days of base + WAL is a typical floor; regulated industries hold years. Tooling: pgBackRest, Barman, and WAL-G do this properly — parallel, compressed, deduplicated, verified. The 3-2-1 rule: three copies, two media, one offsite. And the discipline: an untested backup is not a backup. Monthly restore drills to a staging environment are the difference between a documented recovery and a 3 AM discovery that the WAL archive has a six-hour gap nobody noticed.
This chapter builds the mental model of the restore sequence, walks real retention numbers for a 1 TB database, covers the 3-2-1 rule applied to Postgres, and distinguishes PITR from what it is not (pg_dump, replicas, RAID). It closes with the tooling ecosystem and a concrete 500 GB accidental-UPDATE scenario worked end to end.
It is 3:47 PM on a Tuesday. An on-call engineer is cleaning up a staging dataset and, out of muscle memory, runs the query against production:
DELETE FROM users;
No WHERE. The transaction commits. Forty-eight million rows, gone in six seconds. The application begins returning 500s for every request that joins against users — which is almost every request. At 3:52 PM somebody notices. At 3:55 PM the engineer realises what happened and opens Slack.
Replication does not help. The replica in the neighbouring AZ applied the DELETE eight milliseconds after the primary; the one in Dublin applied it half a second later. All three databases agree: users is empty. There is no transaction to roll back — the COMMIT landed cleanly. The primary's WAL contains the deletion, durable on disk and in every replica's disk.
The only thing that saves the company is a backup archive sitting in S3 — a base backup from 02:00 that morning plus every WAL segment generated between 02:00 and 15:47. The recovery procedure restores the base, replays WAL forward to 2026-04-21 15:46:59 IST, and promotes. The database resumes exactly as it was one second before the fatal query. The company loses the handful of writes that landed in the second after 15:46:59 — fifty or so INSERTs, UPDATEs, and harmless DELETEs — in exchange for not losing forty-eight million users.
This chapter is about the machinery that makes that paragraph possible, and the operational practices that make it reliable rather than aspirational.
The backup components
PITR rests on three artifacts, and understanding them separately is the whole lesson.
Base backup. A full physical snapshot of the data files at some past LSN. In Postgres this is the contents of $PGDATA at the moment of the snapshot, produced by pg_basebackup or a tool that speaks the same protocol. The snapshot is consistent: the pg_backup_start() / pg_backup_stop() protocol brackets the copy so that replaying WAL from the start LSN to the stop LSN produces a recoverable state even though individual files were copied while the database was running. You do not need to stop the primary to take a base backup; that is the point.
Cadence: nightly for most systems, every few hours for very high-change-rate databases. Time to take one is roughly (size / read bandwidth) — a 1 TB database at 500 MB/s takes ~35 minutes; parallelised and compressed by modern tools, 10-15 minutes.
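That arithmetic is worth having at your fingertips. A two-line sketch of the estimate — sizes and bandwidths here are the chapter's example figures, not measurements:

```python
def backup_minutes(size_gb: float, read_mb_per_s: float) -> float:
    """Naive base-backup duration: size divided by sequential read bandwidth."""
    return size_gb * 1024 / read_mb_per_s / 60

# 1 TB at 500 MB/s: roughly half an hour, before parallelism and compression.
print(round(backup_minutes(1024, 500)))  # → 35
```

Parallel workers and compression divide that figure, which is how modern tools get a 1 TB full down to 10-15 minutes.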
WAL archive. Every WAL segment file the primary generates, shipped off-cluster as soon as it is full. Postgres segments are 16 MB by default; under moderate write load a new segment fills every 30-120 seconds. The mechanism is the archive_command setting — a shell command Postgres invokes once per filled segment, whose contract is "this segment is now durably stored elsewhere; return 0". In practice the command is a wrapper from your backup tool: pgbackrest archive-push, barman-wal-archive, wal-g wal-push. The archive lives in S3, GCS, Azure Blob, or any object store with atomic-put semantics.
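To make the archive_command contract concrete, here is a minimal sketch in Python — a hypothetical wrapper, not any real tool's implementation. It stores the segment durably before returning 0, is idempotent on re-delivery, and publishes atomically via rename; production tools add retries, checksums, and object-storage uploads on top of the same contract:

```python
import os
import shutil

def archive_segment(wal_path: str, archive_dir: str) -> int:
    """Contract of archive_command: return 0 only once the segment is durably stored."""
    seg_name = os.path.basename(wal_path)
    tmp = os.path.join(archive_dir, seg_name + ".tmp")
    final = os.path.join(archive_dir, seg_name)
    if os.path.exists(final):
        return 0  # already archived — re-delivery must be harmless (idempotent)
    try:
        shutil.copy2(wal_path, tmp)
        with open(tmp, "rb") as f:
            os.fsync(f.fileno())   # force the copy to stable storage
        os.rename(tmp, final)      # atomic publish: readers never see a partial segment
        return 0
    except OSError:
        return 1  # non-zero tells Postgres to keep the segment and retry later

# Postgres would invoke a script like this once per filled segment:
#   archive_command = '/usr/local/bin/archive.py %p'
```

A silently-swallowed error here is exactly the six-hour-gap failure mode discussed later: return 0 without durable storage and Postgres happily recycles the segment.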
Why the WAL archive is the critical artifact: the base backup freezes the database at one moment; the WAL archive is the tape recording of everything that happened after. You can always take another base backup. You cannot regenerate lost WAL — if a segment is missing from the archive, the timeline from that point onward is unreachable. A base backup without its downstream WAL is useless for PITR; a WAL archive without a base to start from is useless by itself. They are a set.
Timeline tracking. When a database is restored to a past point and promoted, it begins a new timeline — Postgres increments the timeline ID and stamps all subsequent WAL with the new ID. The WAL from the old timeline is preserved in the archive: if the recovery target was wrong (15:46:59 restored, 15:40 was actually the bad moment), you restart from the base backup and replay forward on timeline 1 again. Timelines make the archive a branching history rather than a linear one. Every production backup tool tracks them automatically.
The restore sequence
The recovery recipe, once the components are in place, is mechanical. Postgres-specific but the shape holds for every system:
- Provision a fresh host. Do not overwrite the damaged primary; you may need to inspect it later.
- Restore the base backup. Copy the snapshot files from object storage to the new $PGDATA.
- Configure the recovery target. In postgresql.conf:

```
restore_command = 'pgbackrest --stanza=prod archive-get %f "%p"'
recovery_target_time = '2026-04-21 15:46:59 IST'
recovery_target_action = 'pause'
```

restore_command tells Postgres how to fetch each WAL segment from the archive on demand. recovery_target_time is the timestamp to stop at; recovery_target_lsn and recovery_target_xid exist for finer control.
- Start Postgres in recovery mode. Touch recovery.signal in $PGDATA and start the server. Postgres enters recovery, pulls WAL from the archive, and replays forward from the base backup's stop LSN.
- Wait for the target. When replay crosses recovery_target_time it pauses. The database is now live read-only at exactly the requested moment. Run sanity queries — count rows, check recent timestamps, verify the bad change is absent.
- Promote. Run pg_ctl promote. Postgres exits recovery, opens a new timeline, becomes read-write. If the paused state is wrong, shut down and adjust recovery_target_time — moving the target later just replays more WAL; moving it earlier means starting over from the base backup, because replay cannot run backwards.
In production this is scripted. You do not type these commands at 4 PM on the day of the incident; you run pgbackrest --stanza=prod --delta --type=time --target="2026-04-21 15:46:59" restore from a runbook.
Why this works — WAL contains the full deterministic history
The reason PITR is exact rather than approximate is the same reason async replication is byte-identical rather than divergent (see async log shipping): WAL is the complete, deterministic history of every committed change.
Every INSERT, UPDATE, DELETE, COMMIT, index update, and sequence tick lands in WAL before anything else (hint-bit flips too, once wal_log_hints or data checksums are enabled). WAL records are physical — "page 17, offset 2048, these 73 bytes" — so replaying them on any disk that starts in the right state produces a byte-identical result. No interpretation, no re-execution of SQL, no dependency on wall clocks or sequences.
Why the timestamp-based target works even though WAL records are addressed by LSN: every commit record embeds the wall-clock timestamp at which the COMMIT landed on the primary. Recovery reads those timestamps as it replays forward and stops when it sees one at or after the target. Your recovery granularity is one transaction, not one second.
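A toy model of that stop condition, assuming a stream of (LSN, commit-timestamp) pairs standing in for commit records:

```python
from datetime import datetime

def replay_until(records, target: datetime):
    """Apply commit records in LSN order; stop at the first commit at/after target."""
    applied = []
    for lsn, commit_time in records:
        if commit_time >= target:
            break  # this commit — the bad one — is never applied
        applied.append(lsn)
    return applied

wal = [
    (0x10, datetime(2026, 4, 21, 15, 46, 58)),
    (0x11, datetime(2026, 4, 21, 15, 46, 59)),  # first commit at the target
    (0x12, datetime(2026, 4, 21, 15, 47, 0)),   # DELETE FROM users;
]
print(replay_until(wal, datetime(2026, 4, 21, 15, 46, 59)))  # → [16]
```

Only LSN 0x10 is applied; the boundary transaction and everything after it are excluded. That is why the granularity is one transaction, not one second.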
The only limit on how far back you can restore is the archive's retention. A base backup from 20 days ago plus 20 days of WAL reaches every moment in between.
The backup cadence — a real-world example
Concrete numbers for a 1 TB Postgres primary handling ~5,000 TPS:
- Base backup: daily at 02:00, Asia/Kolkata. pgbackrest backup --type=full weekly on Sunday, --type=incr the other six nights. Full takes ~25 minutes across 8 parallel workers with zstd compression; incrementals take 3-8 minutes.
- WAL archive: continuous. At 5,000 TPS the primary fills a 16 MB segment every ~45 seconds; each uploads in 1-2 seconds. Archive lag is bounded at tens of seconds.
- Retention: 30 days. S3 Standard-IA: roughly 30 × 250 GB incremental base + 30 × ~15 GB WAL per day ≈ 8 TB total, ~$80/month.
- Restore time for 1 TB: base restore across 8 parallel workers saturates the instance's 2.5 GB/s EBS throughput, ~7 minutes. Twelve hours of WAL (~9 GB) replays in ~30 seconds. End to end: 8-10 minutes for a recent target.
- Verification: daily. An automated job restores the most recent base + WAL into a sandbox, runs a checksum across canonical tables, pages if row counts drift or the restore fails.
These scale roughly linearly until you hit network bandwidth (cross-region archives) or apply throughput. A 10 TB database shifts the base-backup window into hours and pushes you towards filesystem snapshots (ZFS, EBS) as the base mechanism with WAL archive on top.
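The retention arithmetic above is easy to re-derive for your own numbers. A sketch, with the storage price as an explicit assumption (here $0.01/GB-month, in the neighbourhood of S3 Standard-IA pricing):

```python
def archive_tb(days: int, base_gb_per_day: float, wal_gb_per_day: float) -> float:
    """Total archive footprint over the retention window, in TB."""
    return days * (base_gb_per_day + wal_gb_per_day) / 1024

def monthly_cost(total_tb: float, usd_per_gb_month: float) -> float:
    """Storage bill at a flat per-GB price; egress and requests not included."""
    return total_tb * 1024 * usd_per_gb_month

size = archive_tb(30, 250, 15)           # the chapter's 1 TB / 30-day numbers
print(round(size, 1))                    # → 7.8
print(round(monthly_cost(size, 0.01)))   # → 80  (assumed $0.01/GB-month)
```

Swap in 10 TB figures and the same two functions show why large databases push towards incrementals and snapshot-based bases.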
The 3-2-1 rule
The 3-2-1 rule predates databases; it is the standard hygiene for any irreplaceable data. Phrased minimally:
- 3 copies of the data. The original plus at least two backups.
- 2 different storage media. Not both on the same disk, not both in the same filesystem, not both in the same availability zone.
- 1 offsite. At least one copy in a physically separate location — another region, another cloud provider, another country.
Applied to a production Postgres setup this maps cleanly onto components you already have:
- Copy 1: the primary's local disk.
- Copy 2: a streaming replica in a different AZ.
- Copy 3: the base backup + WAL archive in object storage in a different region.
Two media: block storage on primary and replica, object storage for the archive. One offsite: the cross-region S3 bucket. The rule is satisfied incidentally by a well-configured HA setup with a working PITR archive.
The common failure mode is conflating copies that are not independent. Two replicas in the same AZ are one copy from a fire-in-the-building standpoint. An RDS instance with automatic snapshots is one system with internal redundancy, not three copies — if the AWS account is compromised or the region has a bad day, you can lose the primary and every snapshot in the same incident. Serious setups cross-replicate the S3 archive to a bucket in a separate AWS account with write-only IAM permissions, so a compromised primary cannot also delete its backups.
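The independence test can even be made mechanical. A hypothetical checker — it collapses copies sharing an AZ and account into one, approximates "offsite" as a separate account/failure domain, and then applies 3-2-1:

```python
from typing import NamedTuple

class Copy(NamedTuple):
    medium: str    # "block" or "object"
    location: str  # availability zone or region
    account: str   # cloud account / blast-radius domain

def satisfies_3_2_1(copies: list) -> bool:
    # Copies in the same AZ under the same account are one copy, not several.
    independent = {(c.location, c.account) for c in copies}
    return (len(independent) >= 3
            and len({c.medium for c in copies}) >= 2
            and len({c.account for c in copies}) >= 2)  # the "offsite" boundary

setup = [
    Copy("block", "ap-south-1a", "prod"),   # primary's local disk
    Copy("block", "ap-south-1b", "prod"),   # streaming replica, different AZ
    Copy("object", "eu-west-1", "backup"),  # cross-account, cross-region archive
]
print(satisfies_3_2_1(setup))  # → True
```

Feed it two replicas in one AZ and it correctly reports False — the fire-in-the-building case made executable.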
Backup testing — the forgotten practice
The most common and most demoralising backup failure is not "the backup did not exist" — it is "the backup existed but did not restore". Causes are depressingly varied:
- An archive_command wrapped in a shell script that silently suppressed errors; six hours of WAL missing.
- A Postgres major version upgrade changed the WAL format; the pre-upgrade archive cannot be replayed by the new binary.
- An S3 lifecycle rule quietly moved old WAL to Glacier, so restore needs a 12-hour thaw.
- Backup tool configuration drifted after a team handoff; backups are running but to the wrong bucket, unmonitored.
Every one of these is invisible until you try to restore. The only defence is exercising the restore path regularly.
The operational standard is a monthly restore drill at minimum, weekly for regulated environments: spin up a sandbox, restore the most recent base backup, replay WAL to a timestamp roughly 1 hour before the drill, promote, run row-count and checksum queries against canonical tables, compare against the primary or a replica, tear down, alert if any step fails.
pgbackrest restore --type=time --target=... plus pg_isready and a checksum script is enough to automate this as a nightly cron job. Barman and WAL-G have equivalents. The pipeline is not complicated — what is hard is the discipline of running it and caring when it fails. A restore-drill failure three Tuesdays in a row that nobody pages on is the same as having no backups at all.
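The comparison step of such a drill reduces to a small pure function. A sketch — the table names and the 2% tolerance are illustrative; some tolerance is needed because the sandbox is restored to a slightly older timestamp than the live primary it is compared against:

```python
def verify_restore(primary_counts: dict, restored_counts: dict,
                   max_drift: float = 0.02) -> list:
    """Compare canonical-table row counts; return failures (empty list = drill passed)."""
    failures = []
    for table, expected in primary_counts.items():
        got = restored_counts.get(table)
        if got is None:
            failures.append(f"{table}: missing from restored database")
        elif expected and abs(got - expected) / expected > max_drift:
            failures.append(f"{table}: {got} rows vs {expected} on primary")
    return failures

# A passing drill: counts within 2% of the live primary.
print(verify_restore({"users": 48_000_000, "orders": 9_100_000},
                     {"users": 47_950_000, "orders": 9_099_000}))  # → []
```

The point is not the arithmetic; it is that the drill ends in a machine-checkable pass/fail that can page someone.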
Common backup tools
Four tools dominate Postgres PITR:
- pgBackRest. The current production default. Parallel backup and restore, strong checksums, incremental and differential backups, bandwidth throttling. Operates against local filesystems or S3/GCS/Azure. Reputation built on reliability and scale — runs well on multi-terabyte databases.
- Barman. Feature-rich backup manager from 2ndQuadrant/EDB. Strong retention policies, incremental backup via rsync, detailed catalog commands. Common in regulated environments wanting a dedicated backup host.
- WAL-G. Cloud-first, built at Citus Data / Microsoft. Integrates with S3, GCS, and Azure Blob from day one; supports MySQL and SQL Server too. Low operational overhead if you are already on cloud object storage.
- pg_basebackup. Built into Postgres. Takes a base backup and nothing else — no archive management, no retention, no incrementals. Adequate for one-off snapshots; inadequate as a PITR strategy. If your "backup strategy" is pg_basebackup in cron, you do not have a backup strategy.
All four speak the same underlying protocol; differences are operational ergonomics, not capability. Pick one, commit to it, learn its failure modes, test its restore path.
```
# pgBackRest: take a full backup, then run PITR to 15:46:59 IST
pgbackrest --stanza=prod --type=full backup
pgbackrest --stanza=prod \
    --type=time \
    --target="2026-04-21 15:46:59+05:30" \
    --target-action=promote \
    restore

# Verify archive completeness
pgbackrest --stanza=prod check

# Inspect the backup and archive catalog
pgbackrest --stanza=prod info
```
Logical backups — pg_dump is a complement, not a substitute
pg_dump produces a logical backup: a stream of SQL statements that, replayed, reconstruct the database. Useful but not PITR.
Where logical backups win: cross-version moves (pg_dump from PG15 restores cleanly into PG17; physical base backups do not — WAL formats change), dev seeding (schema plus sampled rows into a dev environment), cross-flavour migration (portable SQL imports into MySQL, Redshift with massaging), selective restore (pg_restore -t one_table; physical is all-or-nothing).
Where they fail: speed — restoring a 1 TB pg_dump takes 12-24 hours (single-threaded SQL, index rebuilds, FK revalidation); physical base + WAL for the same database is 10-15 minutes. Fidelity — a dump captures logical state at one moment; no in-flight transactions, no LSNs, no sub-dump granularity. PITR to "three seconds before the incident" is impossible with pg_dump.
Standard posture: physical PITR as primary strategy; nightly pg_dump to a separate bucket as a version-portable second copy. The dump is your fallback if something corrupted the physical archive; the physical archive is your go-to for everything else.
The restore-to-standby pattern
PITR's second-most-common use, after disaster recovery, is building a fresh standby. Restore base + WAL to a new host with no recovery target, then instead of promoting, configure the host as a streaming replica and restart; it connects to the live primary and resumes streaming from the archive's last LSN.
This bootstraps a new replica without running pg_basebackup against the primary — useful for large databases where a fresh base-backup run would spike primary I/O. It is also how you spin up branch databases (restore to yesterday, promote for a migration dry run), major-version upgrades (restore to a host running the new version, pg_upgrade, test, switch over), and forensics (restore to before an incident to inspect state).
pgbackrest restore --type=standby and wal-g backup-fetch --restore-replica automate the "don't promote, become a replica" variant. Cheap, low-risk, does not touch the live primary.
A 500 GB PITR drill, end to end
You run a Postgres 16 primary on an r6i.2xlarge, 500 GB, 2,000 TPS writes. At 15:47 a contractor with momentarily-excessive privileges runs:
UPDATE orders SET status = 'cancelled';
Six hundred thousand orders marked cancelled. Shipping, billing, and email pipelines all consume status — cancellation emails start going out within minutes.
At 16:30 the on-call detects the problem; by 16:45 the decision to PITR is made. Recovery target: 2026-04-21 15:46:45 IST (15 seconds of margin before the bad UPDATE).
Components in place: pgBackRest stanza prod, daily 02:00 base backup, continuous WAL archive, 30-day retention. Yesterday's base sits in S3. WAL segments from 02:00 through 16:47 are in the same bucket.
The restore:
```
pgbackrest --stanza=prod \
    --type=time \
    --target="2026-04-21 15:46:45+05:30" \
    --target-action=promote \
    --process-max=8 \
    restore
# Streams 500 GB from S3 with 8 parallel workers.
# Saturates ~1.25 GB/s -> ~7 min for base.
# Replays ~14 h of WAL (~11 GB) -> ~40 s.
# Total: ~8 min.

sudo -u postgres psql -c "SELECT count(*) FROM orders WHERE status = 'cancelled';"
# Expect ~12,000 (historical baseline), not 600,000.
```
Damage summary: restore completed at 17:00, incident-to-recovery 73 minutes. Transactions lost: everything between 15:46:45 and the UPDATE (~15 seconds × 2,000 TPS = ~30,000), plus everything committed between 15:47 and 16:45 that cannot coexist with the restored past (~7 million). These must be replayed from application-side event logs or accepted as business loss.
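The loss arithmetic generalises to any incident:

```python
def lost_transactions(tps: int, margin_s: int, post_incident_min: int):
    """Writes discarded by restoring to (incident - margin): the safety margin
    before the bad statement, plus everything committed between the incident
    and the moment recovery begins."""
    before = tps * margin_s
    after = tps * post_incident_min * 60
    return before, after

margin, after = lost_transactions(2_000, 15, 58)  # 15:47 incident, 16:45 decision
print(margin)  # → 30000
print(after)   # → 6960000
```

The second number dwarfs the first, which is the real argument for fast detection and a rehearsed decision process: the mechanism's RPO is seconds, but the organisation's RPO is however long it dithers.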
Follow-up: cut application traffic over to the restored database and fence off the old primary, re-establish the replica topology from the same archive (restore-to-standby), examine audit trails for writes between 15:46:45 and the UPDATE that must be preserved, update the runbook and the permission model, and schedule a blameless post-incident review.
The PITR mechanism did exactly what it was designed to do, in eight minutes. Everything around it — decision time, permissions, application replay, topology reconstruction — is where hours go. That operational overhead is part of the strategy, not an afterthought.
Common confusions
- "Async replication is a backup." No. A replica applies every change the primary commits — including DELETE FROM users;. Replicas are for availability and read scaling. A backup is a point-in-time snapshot preserved against operator error — which replicas, by construction, cannot be.
- "pg_dump is enough." pg_dump captures logical state at the moment it ran. The best you can restore to is "whenever the last dump ran" — typically last night at 02:00. For most commercial contexts that 14-hour data loss is worse than the incident itself.
- "Cloud storage is automatically replicated, so I don't need backups." S3 replicates against hardware failure. It does not replicate against you: an aws s3 rm --recursive runs just as faithfully on the replicated copies. Cloud storage's durability is for hardware; backups are for humans.
- "RAID is a backup." RAID is availability. RAID-1 tolerates a disk failure; RAID-6 tolerates two. Neither tolerates rm -rf, a corrupt OS upgrade, or ransomware. RAID rebuilds the exact state just written, including the wrong state. Backups are history; RAID is not.
- "My cloud provider takes snapshots, so I don't need WAL archive." Provider snapshots are a valid base-backup mechanism. Without a continuous WAL archive, though, the best you get is the snapshot's moment — 1-24 hours ago. PITR requires the WAL in between snapshots; that piece is yours to run.
- "Backups are the DBA's problem." The DBA sets up the pipeline; the SRE team runs drills; application engineers define acceptable RPO. Backup strategy is cross-functional — when it fails, the blast radius is every service that touched the database.
Going deeper
WAL compression and delta encoding
WAL is repetitive — same record types, same page numbers, similar tuples — so general-purpose compression reliably yields 3-5× on the archive. Postgres 9.5 added wal_compression = on, which compresses full-page images inside WAL records on the primary before the record is written to disk. A primary generating 100 GB/day shrinks to ~25 GB/day at a few-percent CPU cost. pgBackRest and WAL-G apply a second zstd pass at archive-push; together they hit ~6× on OLTP.
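The claim that repetitive streams compress several-fold is easy to demonstrate with a general-purpose codec. zlib stands in for zstd here, and the synthetic "WAL" merely mimics the record structure — fixed headers, nearby page numbers, similar payloads:

```python
import random
import struct
import zlib

random.seed(7)
# Synthetic record stream: 8-byte header (type, page, length) + 73-byte payload.
records = b"".join(
    struct.pack(">HIH", 0x0A01, random.randrange(1000, 1100), 73) + b"x" * 73
    for _ in range(10_000)
)
compressed = zlib.compress(records, level=6)
ratio = len(records) / len(compressed)
print(ratio > 3)  # → True — repetitive structure compresses well
```

Real WAL is less uniform than this toy, which is why the observed archive ratio is 3-5× rather than the absurd factor this demo achieves.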
Incremental base backups
Classic nightly full backups ship the entire database each time — wasteful when 99% of the data did not change overnight. pgBackRest has shipped differential and incremental backups for years. Postgres 17 added incremental backup as a core feature (pg_basebackup --incremental, pointed at the previous backup's manifest, with pg_combinebackup to reassemble). For a 10 TB database with 1% daily change, the archive stores 10 TB once per week plus 6 × 100 GB incrementals — 10.6 TB/week instead of 70 TB.
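The storage arithmetic, parameterised:

```python
def weekly_archive_tb(db_tb: float, daily_change_pct: float):
    """Weekly archive footprint: seven nightly fulls vs one full + six incrementals."""
    fulls_only = 7 * db_tb
    incremental = db_tb + 6 * db_tb * daily_change_pct / 100
    return fulls_only, incremental

full, incr = weekly_archive_tb(10, 1)  # 10 TB database, 1% daily change
print(full)            # → 70
print(round(incr, 1))  # → 10.6
```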
Cross-region DR with geographically distributed WAL
For tier-1 systems that must survive a whole-region outage, the archive itself must live in a second region. Two patterns: S3 Cross-Region Replication asynchronously copies every object from the primary bucket to a secondary-region bucket (10-60 s lag); dual-push archive writes the WAL segment to both buckets synchronously (higher archive-push latency, lower RPO). pgBackRest supports multiple repositories natively.
The CRR gotcha: a malicious or buggy delete on the primary bucket replicates to the DR bucket. Defence is object-lock or versioning on the DR bucket so delete events are recorded but not destructive.
Where this leads next
You now have the full backup picture for a single database: base backup, WAL archive, restore procedure, 3-2-1 rule, tools, testing discipline, restore-to-standby. Patterns generalise — MySQL's equivalents are mysqldump, binlog archive, and Percona XtraBackup; Oracle has RMAN; SQL Server has differential backups and log shipping to URL. Shape: physical snapshot plus log archive equals PITR.
- Application consistency and the backup illusion — chapter 73. PITR restores the database, but real systems are multiple stores — Postgres plus Redis plus S3 plus Kafka plus external providers. Restoring to 15:46:59 does not un-send the 15:47:00 email, un-write the 15:47:03 row in ClickHouse, or un-deliver the 15:47:10 webhook. Chapter 73 designs around that mismatch.
- Build 10: Leaderless replication. The models of Build 9 (async, sync, semi-sync, quorum) all have a distinguished primary. Dynamo-style and Cassandra-style systems do not. The backup question has a different answer there — per-node snapshots, anti-entropy repair, hinted handoffs.
PITR has bad historical press because most people only think about it after it has already failed them. Done correctly — cadence, monthly drills, 3-2-1 coverage, engineers who have actually executed the runbook — it is the single highest-leverage piece of operational infrastructure a production database has.
References
- PostgreSQL documentation, Chapter 26: Backup and Restore — the canonical reference for base backups, continuous archiving, archive_command, restore_command, recovery_target_time, and timelines.
- pgBackRest User Guide — configuration, stanza layout, parallel restore, retention policies, and the full command reference for the most widely used Postgres backup tool in production.
- 2ndQuadrant, Barman documentation — the equivalent reference for Barman, with detailed coverage of incremental backup via rsync, the backup catalog, and the barman recover command.
- Citus Data / Microsoft, WAL-G GitHub repository — README, configuration, and architecture notes for the cloud-native alternative, with first-class support for S3, GCS, and Azure Blob.
- NIST Special Publication 800-34, Contingency Planning Guide for Federal Information Systems — the formal backing for the 3-2-1 rule and for restore testing as a required control in regulated environments.
- Smith, PostgreSQL 9.0 High Performance, Packt 2010, chapter 14 — Gregory Smith's original book-length treatment of Postgres backup strategy, including the observation that an untested backup is indistinguishable from no backup at all, which the industry has spent fifteen years re-learning.