In short
A write-ahead log that has been appended to for six months holds every UPDATE record the database has ever written. If you have to replay all of them on crash recovery, the recovery itself takes longer than the uptime it protects — a database that takes four hours to come back is not a database, it is a tombstone. The fix is the checkpoint: a periodic WAL record that stores, at the LSN where it lands, two small tables: the active-transaction table (xids of transactions that are in flight at checkpoint time) and the dirty-page table (page IDs and the rec-LSNs of their earliest unflushed modifications). On recovery, the system reads the last checkpoint record — not the first record of the log — and begins its REDO pass from the earliest rec-LSN in the dirty-page table. Every record before that earliest rec-LSN is guaranteed to describe a modification that is already on disk, so replaying it would be wasted work; the checkpoint is the "you can safely ignore everything before this" guarantee. Consistent checkpoints pause all transactions while the checkpoint runs — simple, correct, but a full stall. Fuzzy checkpoints (which every real database uses) let transactions keep running while the checkpoint builds its tables; the resulting checkpoint record reflects a snapshot slightly in the past, and the recovery algorithm is written to handle that lag. The user-tunable knob is frequency. Too rare: recovery is slow. Too frequent: the checkpoint's page-flush phase causes an IO spike that hurts steady-state throughput. Postgres exposes this trade-off directly via checkpoint_timeout (default 5 minutes) and checkpoint_completion_target (default 0.9, which spreads the flush work across 90% of the interval). The checkpoint is the mechanism that turns "recovery time" from an unbounded property of the log's age into a bounded property of the last few minutes of activity.
A database that crashes is expected. A database that cannot come back is not. The implicit contract a storage engine makes with its operator is not just "your committed data is durable" — it is also "recovery completes in a time that is small compared to the uptime it protects." A log that has been quietly growing for six months has tens of terabytes of records in it. If recovery replays every one of them, it takes hours. Hours of downtime to replay six months of uptime is not a good ratio.
This chapter is about the mechanism that breaks the link between "how long the log is" and "how long recovery takes." It is called the checkpoint, and it is the reason a production database boots in thirty seconds instead of six hours.
Without checkpoints, recovery has no bound
Recall the WAL invariant from Log Records, LSNs, and the Log Sequence: every data page on disk has a page-LSN in its header, and for any page p written to the data file, page-LSN(p) ≤ flush-LSN. That invariant is what makes recovery correct — it guarantees that for every change on a data page there is a durable log record describing that change. But by itself it tells recovery nothing about where to start.
Absent other information, recovery is forced into a simple, catastrophic algorithm: open the log at byte 0 and replay every record to the end. For every UPDATE record it reads, it must fetch the target page, compare page-LSN to the record's LSN, and apply the after-image if the page is behind. For a log file that has been growing for six months at 50 MB/s, that is roughly 780 TB of log records. Even if your disk can read sequentially at 2 GB/s, you are looking at 100+ hours of pure log-read time before you do any random page I/O. A database that takes four days to come back after a crash is not a database.
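A quick sanity check of those numbers (pure arithmetic; `naive_recovery_read_hours` is just an illustration, not a real tool):

```python
def naive_recovery_read_hours(months: float, wal_rate_mb_s: float,
                              read_speed_gb_s: float) -> float:
    """Hours of sequential log reading if recovery replays the whole log."""
    seconds = months * 30 * 24 * 3600
    log_bytes = seconds * wal_rate_mb_s * 1e6
    return log_bytes / (read_speed_gb_s * 1e9) / 3600

# Six months at 50 MB/s, read back at 2 GB/s: about 108 hours of
# sequential log reading before any random page I/O even begins.
```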
The observation that saves you is that most of the early log is already redundant. Every UPDATE record in that six-month log describes a modification to a page — and the vast majority of those pages have long since been flushed to disk with the modification included. Replaying their records would just re-compute page-LSN(p) < record.lsn, find it false, and skip the record. The work is wasted.
What recovery really needs is someone to tell it, at some recent point, "from here backward, every record describes a modification that is already on a data page; from here forward, you actually have to replay." That someone is the checkpoint.
A checkpoint is a small WAL record with two tables
A CHECKPOINT record is written to the WAL like any other record — same LSN machinery, same fsync path. Its payload is two small tables.
- Active-transaction table (ATT) — one row per transaction that was still in flight at checkpoint time. Each row holds the transaction's xid and the LSN of its most recent WAL record (so UNDO on recovery can find where to start walking backward).
- Dirty-page table (DPT) — one row per buffer-pool page that was dirty (modified in memory but not yet flushed to disk) at checkpoint time. Each row holds the page ID and its rec-LSN — the LSN of the earliest WAL record whose modifications to this page have not yet been written to disk.
The rec-LSN is the critical field. Take the minimum rec-LSN across the whole dirty-page table: that is the earliest LSN that could possibly still need replay. Every record before that LSN describes a modification whose effect is already on a data page — because either the page was dirty at some prior time and has since been flushed (so it is no longer in the DPT), or the record's modification has already been absorbed into a later page flush.
Recovery therefore begins its REDO pass at min(rec-LSN over DPT) rather than at the beginning of the log. The ATT, meanwhile, tells UNDO which transactions were in flight and therefore candidates for rollback.
```python
# checkpoint.py — writing a checkpoint record and using it on recovery.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import struct, zlib

CHECKPOINT = 4  # same type byte as in the previous chapter

@dataclass
class BufferPool:
    # page_id -> (bytes, page_lsn, dirty_rec_lsn_or_None)
    pages: Dict[int, Tuple[bytes, int, int | None]] = field(default_factory=dict)

    def dirty_page_table(self) -> List[Tuple[int, int]]:
        """Pairs of (page_id, rec_lsn) for pages whose earliest unflushed
        modification lives at rec_lsn. Pages that are clean are omitted."""
        return [(pid, rl) for pid, (_, _, rl) in self.pages.items() if rl is not None]

@dataclass
class TransactionTable:
    # xid -> last_lsn
    active: Dict[int, int] = field(default_factory=dict)

    def snapshot(self) -> List[Tuple[int, int]]:
        return list(self.active.items())

def write_checkpoint(log_manager, buffer_pool: BufferPool,
                     txn_table: TransactionTable) -> int:
    """Write a CHECKPOINT record. Returns the LSN at which it lands.

    This is the FUZZY variant: we snapshot the tables under a very brief latch,
    release the latch, and then write the record. New transactions and new
    dirty pages can appear while the record is being serialized and fsynced.
    The rec-LSN conservatism (take the earliest LSN in the table) handles
    the small lag correctly.
    """
    # Brief critical section: copy the tables. Release the latch immediately.
    active = txn_table.snapshot()
    dirty = buffer_pool.dirty_page_table()

    body = struct.pack(">BI", CHECKPOINT, len(active))
    for xid, last_lsn in active:
        body += struct.pack(">QQ", xid, last_lsn)
    body += struct.pack(">I", len(dirty))
    for pid, rec_lsn in dirty:
        body += struct.pack(">QQ", pid, rec_lsn)

    crc = zlib.crc32(body) & 0xFFFFFFFF
    framed = struct.pack(">I", len(body)) + body + struct.pack(">I", crc)
    lsn = log_manager.append_and_flush(framed)  # durable before we return
    # Record the checkpoint's LSN in a small "control file" so recovery can
    # find the most recent checkpoint without scanning the whole log.
    log_manager.control_file.write_last_checkpoint(lsn)
    return lsn
```
Two things worth naming in that routine.
The control file. Recovery does not scan the log forward looking for the last CHECKPOINT record; that would defeat the whole purpose. Instead, each successful checkpoint updates a small control file (Postgres calls it pg_control, InnoDB the system tablespace header) with the LSN of the most recent checkpoint. Recovery opens the control file first, reads that LSN, seeks directly there in the log, and starts from the checkpoint record. The control file is tiny (a few hundred bytes), but it is the entry point for every recovery run.
The fsync of the checkpoint record. The checkpoint record must itself be durable before the control file is updated. Otherwise a crash between "record appended" and "record fsynced" would leave the control file pointing at an LSN whose record is not on disk. The order is always: write checkpoint record → fsync log → update control file → fsync control file.
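That ordering can be sketched against plain files. A minimal sketch, with ordinary files standing in for the log and the control file; `install_checkpoint` and `read_last_checkpoint` are illustrative names, not any engine's API:

```python
import os, struct

def install_checkpoint(log_path: str, control_path: str,
                       framed_record: bytes, record_lsn: int) -> None:
    """Make record_lsn the checkpoint recovery will use, in the only safe
    order: append record, fsync log, overwrite control file, fsync it.
    A crash between any two steps leaves the control file pointing at a
    checkpoint whose record is already durable on disk."""
    with open(log_path, "ab") as log:
        log.write(framed_record)                  # 1. append the CHECKPOINT record
        log.flush()
        os.fsync(log.fileno())                    # 2. record is durable
    with open(control_path, "wb") as ctl:
        ctl.write(struct.pack(">Q", record_lsn))  # 3. repoint the control file
        ctl.flush()
        os.fsync(ctl.fileno())                    # 4. pointer is durable

def read_last_checkpoint(control_path: str) -> int:
    """What recovery does first: read the entry-point LSN from the control file."""
    with open(control_path, "rb") as ctl:
        return struct.unpack(">Q", ctl.read(8))[0]
```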
How recovery uses the checkpoint
The recovery algorithm, in skeleton, is four steps.
```python
def recover(log_manager, buffer_pool, data_files, control_file):
    """Skeleton recovery driver. The full ARIES algorithm is in the next three chapters."""
    # Step 1 — find the entry point.
    ckpt_lsn = control_file.read_last_checkpoint()  # from the tiny control file
    ckpt = log_manager.read_record(ckpt_lsn)        # the CHECKPOINT record

    # Step 2 — ANALYSIS: scan forward from the checkpoint, updating the DPT and ATT
    # with records written *after* the checkpoint, so the tables reflect the state
    # at crash, not at checkpoint.
    dpt = {pid: rec_lsn for pid, rec_lsn in ckpt.dirty_pages}
    att = {xid: last_lsn for xid, last_lsn in ckpt.active_xids}
    for rec in log_manager.scan_from(ckpt_lsn + ckpt.size):
        if rec.type == "UPDATE":
            dpt.setdefault(rec.page_id, rec.lsn)  # rec-LSN = first mod since last flush
            att[rec.xid] = rec.lsn
        elif rec.type == "COMMIT":
            att.pop(rec.xid, None)
        elif rec.type == "BEGIN":
            att[rec.xid] = rec.lsn

    # Step 3 — REDO: from the earliest rec-LSN in the DPT, replay forward,
    # skipping records whose page-LSN on disk is already >= record.lsn.
    redo_start = min(dpt.values()) if dpt else ckpt_lsn
    for rec in log_manager.scan_from(redo_start):
        if rec.type == "UPDATE":
            page = buffer_pool.fetch(rec.page_id)
            if page.page_lsn < rec.lsn:
                page.apply_after_image(rec)
                page.page_lsn = rec.lsn

    # Step 4 — UNDO: for each xid still in ATT (never committed), walk prev_lsn
    # chain backward, applying before-images. (Full algorithm in the UNDO chapter.)
    for xid in att:
        undo_transaction(xid, log_manager, buffer_pool)
```
The headline number — "how long does recovery take" — is determined almost entirely by step 3's redo_start. The further redo_start is from the crash, the more records have to be read in step 3, the longer recovery takes. And redo_start is the smallest rec-LSN in the DPT — which is dominated by whichever dirty page has been sitting unflushed the longest. Checkpoints exist to keep that age bounded.
Fuzzy vs consistent checkpoints — who has to be paused
A checkpoint's job is to produce a self-consistent ATT and DPT. A naïve implementation pauses all transactions while it walks the buffer pool and transaction table. This is a consistent checkpoint: the world stops, the tables are snapshotted, the record is written, the world restarts. It is simple and correct, but every consistent checkpoint is a full stall of the database — on a busy OLTP system, a 200 ms checkpoint means 200 ms during which no transaction can make progress. That is unacceptable for a production system.
Every real database therefore uses a fuzzy checkpoint, in which transactions keep running while the checkpoint's tables are being built. The tables end up reflecting a snapshot slightly in the past (new transactions begun during the checkpoint are not in the ATT; new dirty pages are not in the DPT), and the recovery algorithm is written to handle that lag.
The key insight that makes fuzzy checkpoints safe: recovery's analysis pass re-scans the log from the checkpoint record to the end, re-applying every BEGIN, UPDATE, and COMMIT it sees to the DPT and ATT it inherited from the checkpoint. So if a transaction began while the checkpoint was in progress (after the table snapshot but before the record was appended), its BEGIN record — say at LSN 1250 — is replayed by analysis, the xid gets added to the ATT, and the scheme is correct. The checkpoint's tables are the starting point of the analysis pass, not the result of it.
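The argument can be checked with a toy analysis pass, with records reduced to `(lsn, type, xid)` tuples and nothing engine-specific:

```python
def analysis(ckpt_att: dict, tail: list) -> dict:
    """Rebuild the ATT at crash time from the checkpoint's snapshot plus the
    log tail written after it. Records are (lsn, type, xid) tuples here."""
    att = dict(ckpt_att)
    for lsn, rtype, xid in tail:
        if rtype in ("BEGIN", "UPDATE"):
            att[xid] = lsn          # in flight; remember its latest record
        elif rtype == "COMMIT":
            att.pop(xid, None)      # committed; no longer an UNDO candidate
    return att

# Checkpoint snapshotted when only xid 7 was active (last record at LSN 1200)...
ckpt_att = {7: 1200}
# ...but xid 9 began at LSN 1250, during the checkpoint, and xid 7 then committed.
tail = [(1250, "BEGIN", 9), (1260, "UPDATE", 9), (1270, "COMMIT", 7)]
# analysis(ckpt_att, tail) -> {9: 1260}: exactly the in-flight set at crash time,
# despite the checkpoint's snapshot never having seen xid 9.
```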
The recovery-time metric and checkpoint frequency
Call the worst-case time to complete recovery after a crash the recovery-time metric, or sometimes Recovery Time Objective (RTO). In WAL-based systems, the metric is almost perfectly proportional to the number of log bytes between the last checkpoint and the crash.
If checkpoints fire every T seconds, then on average the crash catches the system T/2 seconds after the last checkpoint, with a worst case of just under T seconds. At a WAL rate of R bytes per second, recovery has to read roughly T · R bytes of log, so:
worst-case recovery time ≈ (T · R) / log_read_speed + random_page_IO_time
For T = 300 s (5 min), R = 50 MB/s (a healthy OLTP load), log_read_speed = 2 GB/s: recovery reads 15 GB of log in 7.5 seconds, plus maybe 20–30 seconds of random I/O to fetch and update dirty pages. Total: roughly 30 seconds. That is the number you want.
Make T = 30 minutes and the same calculation gives 3 minutes of recovery — usually still acceptable. Make T = 12 hours and recovery is an hour — your on-call engineer is now very unhappy.
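The back-of-envelope formula is easy to put in executable form. `worst_case_recovery_s` is a hypothetical helper, and every parameter is an operator's estimate:

```python
def worst_case_recovery_s(ckpt_interval_s: float, wal_rate_mb_s: float,
                          log_read_gb_s: float, page_io_s: float = 0.0) -> float:
    """worst-case recovery time ~= (T * R) / log_read_speed + random_page_IO_time."""
    log_bytes = ckpt_interval_s * wal_rate_mb_s * 1e6
    return log_bytes / (log_read_gb_s * 1e9) + page_io_s

# T = 300 s, R = 50 MB/s, reads at 2 GB/s, ~25 s of random page IO:
# 7.5 s of log reading plus 25 s of page IO, about 32.5 s in total.
```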
The temptation is to make T as small as possible. Why not checkpoint every 10 seconds and recover in 2 seconds? The answer is the IO cost of the checkpoint itself, which the next section unpacks.
Bounding recovery on a 1 TB Postgres
A production Postgres cluster on a single 1 TB NVMe SSD runs an OLTP workload writing 40 MB/s of WAL. It configures checkpoint_timeout = 5min and checkpoint_completion_target = 0.9. A crash happens 4 minutes and 20 seconds after the most recent checkpoint completed.
Log bytes to replay. 260 seconds × 40 MB/s = 10.4 GB of WAL.
Log read time. The SSD reads sequentially at ~3 GB/s; 10.4 GB / 3 GB/s ≈ 3.5 seconds.
Pages touched during replay. Suppose 15% of records reference pages not already in the buffer pool after restart (the buffer pool is cold). At ~200 bytes average record size, 10.4 GB holds about 52 million records, of which 7.8 million need a page fetch. At 8 KiB per page and a random read of 80 µs on this NVMe, that is 7.8M × 80 µs = 624 seconds of random I/O. That dominates.
Mitigation. Real recovery pipelines prefetch pages in parallel — Postgres's recovery prefetch (recovery_prefetch, added in 15) issues posix_fadvise(POSIX_FADV_WILLNEED) hints ahead of the replay position, overlapping many random reads and cutting total random-read time by 10–20× on hardware like this. With prefetch, the 624 seconds drops to ~40 seconds.
Total. 3.5 s (log read) + 40 s (prefetched page I/O) + a few seconds of book-keeping ≈ ~45 s worst-case recovery. On this system, reducing checkpoint_timeout to 2 minutes would cut recovery to ~20 s but would roughly double the average checkpoint IO rate. The operator chose 5 minutes because 45 s of recovery after a crash is acceptable and 20 s of IO headroom during steady-state is not.
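The random-IO arithmetic from the worked example, as a function (illustrative only; `replay_random_io_s` is not a real tool):

```python
def replay_random_io_s(wal_bytes: float, avg_record_bytes: float,
                       cold_fraction: float, read_us: float) -> float:
    """Seconds of random page I/O during replay: records in the WAL window,
    times the fraction needing a cold page fetch, times per-read latency."""
    records = wal_bytes / avg_record_bytes
    fetches = records * cold_fraction
    return fetches * read_us / 1e6

# 10.4 GB of WAL, 200-byte records, 15% cold, 80 us per read -> 624 s.
```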
The IO-spike tradeoff — why you cannot checkpoint too often
A checkpoint does two things: it writes a checkpoint record (tiny, cheap — one WAL append and one fsync), and it flushes the dirty pages referenced by the DPT to the data files. That second part is the expensive one. A buffer pool holding 4 GiB of dirty pages, when checkpointed, has to write 4 GiB out to the data files as a burst of I/O.
If checkpoints fire every 5 minutes and the database has averaged 500 MB of new dirty pages per minute between them, the checkpoint has to move 2.5 GB. If the SSD's write bandwidth is 1 GB/s, that is 2.5 seconds of 100%-full write queue — during which every non-checkpoint write is fighting for bandwidth, and latency on foreground transactions spikes badly.
The obvious fix is to spread the flush across the interval instead of bursting at the end. Postgres does exactly this via checkpoint_completion_target, a float in [0, 1] (default 0.9). It means: if checkpoint_timeout is 300 seconds and the target is 0.9, the background writer paces the dirty-page flush to complete by 270 seconds into the interval — i.e., to spread the IO across 90% of the checkpoint window. The IO rate becomes (dirty_bytes_at_checkpoint_start) / (checkpoint_timeout * completion_target) — far smoother than a burst.
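The pacing arithmetic can be sketched in a few lines. This is illustrative scheduling math in the spirit of checkpoint_completion_target, not Postgres's actual implementation (the function name and defaults are assumptions):

```python
def buffers_due_by(elapsed_s: float, total_buffers: int,
                   timeout_s: float = 300.0, completion_target: float = 0.9) -> int:
    """How many of the checkpoint's dirty buffers should already be written
    elapsed_s seconds into the interval, if the flush is to finish by
    timeout_s * completion_target. The writer sleeps whenever it is ahead
    of this schedule, which is what smooths the IO curve."""
    deadline = timeout_s * completion_target       # e.g. 300 s * 0.9 = 270 s
    progress = min(elapsed_s / deadline, 1.0)      # clamp once the deadline passes
    return int(total_buffers * progress)

# Halfway to the 270 s deadline (135 s in), half of 10,000 buffers are due.
```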
Once you see checkpoint_completion_target, you see the whole trade-off:
- checkpoint_timeout too small. Checkpoints run back-to-back; the background writer is never idle; the IO budget is dominated by checkpoint traffic; foreground transactions compete for the same bandwidth; throughput drops.
- checkpoint_timeout too large. Recovery time grows; long crashes leave users waiting minutes; the DPT at checkpoint time is huge, so when the checkpoint does fire, even a spread flush is a lot of IO.
Most production Postgres deployments land in the range checkpoint_timeout = 5min to 15min with checkpoint_completion_target = 0.9. A rule of thumb: if your pg_stat_bgwriter.checkpoints_req (requested, meaning WAL-volume-triggered rather than timeout-triggered) is a significant fraction of checkpoints_timed, your max_wal_size is too small and checkpoints are firing faster than the timeout would dictate — the cure is a bigger max_wal_size, not a shorter timeout.
Common confusions
- "Does a checkpoint flush the WAL, or flush the data pages?" Both, but for different reasons. The WAL flush is small and always happens: the CHECKPOINT record itself must be durable. The data-page flush is large and variable: every page in the DPT must be written to its data file. The whole point of the checkpoint is that second flush — it is what advances the earliest rec-LSN and shrinks the recovery window for the next crash.
- "Isn't the rec-LSN just the page-LSN?" No, though they are related. The page-LSN (in the page header) is the LSN of the most recent WAL record that modified the page — it can change multiple times between flushes. The rec-LSN (in the DPT) is the LSN of the earliest WAL record whose modifications to this page have not yet reached disk — it is set when the page first transitions from clean to dirty, and it is cleared only when the page is flushed. One page, one rec-LSN, many possible page-LSNs over the page's lifetime.
- "If I take a checkpoint every 10 seconds, recovery is fast, right?" Fast to run, yes — but the 10-second interval is now dominated by checkpoint IO. The database spends a significant fraction of its IO budget flushing pages that foreground transactions are still modifying, and foreground throughput drops. Recovery time is only half of the equation; steady-state throughput is the other half.
- "What if the database crashes during a checkpoint?" Harmless. A checkpoint becomes the "most recent" only after its record is fsynced to the log and the control file has been updated to point at it. A crash mid-checkpoint leaves either the record not fsynced (in which case the control file still points at the previous checkpoint) or the record fsynced but the control file not yet updated (same result). Recovery uses the previous checkpoint, replays a little more of the log, and converges.
- "Does a checkpoint need to pause commits?" No. A fuzzy checkpoint takes only a brief (sub-millisecond) latch to snapshot the ATT and DPT, then releases it. New commits can fly through the log while the checkpoint record is being serialized and fsynced. The snapshot's staleness is handled by the analysis pass on recovery.
- "Is the checkpoint the same as an fsync?" No, though both push data to stable storage. fsync is an OS-level primitive that forces a file's dirty pages from the OS page cache to disk. A checkpoint is a database-level protocol that writes a WAL record, flushes dirty buffer-pool pages (using fsync internally as needed), and updates the control file. fsync is one tool the checkpoint uses; they are not the same thing.
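The rec-LSN vs page-LSN lifecycle described above can be captured in a tiny state machine (field names are this chapter's vocabulary, not any particular engine's):

```python
class Page:
    """One buffer-pool page, tracking only the two LSNs in question."""
    def __init__(self):
        self.page_lsn = 0    # most recent modification; changes on every write
        self.rec_lsn = None  # earliest unflushed modification; set once per dirty episode

    def modify(self, lsn: int):
        if self.rec_lsn is None:   # clean -> dirty: remember where redo must start
            self.rec_lsn = lsn
        self.page_lsn = lsn        # always tracks the latest change

    def flush(self):
        self.rec_lsn = None        # clean again; page_lsn stays where it was

p = Page()
p.modify(100); p.modify(150); p.modify(200)
# p.rec_lsn == 100 (first modification of the episode), p.page_lsn == 200 (latest)
p.flush()
# p.rec_lsn is None (page left the DPT), p.page_lsn still 200
```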
Going deeper
Postgres — the background writer and spread checkpoints
Postgres's checkpointer is a separate background process (checkpointer), spawned at startup. It loops: sleep until either checkpoint_timeout has elapsed or WAL volume since the last checkpoint has hit max_wal_size, then start a checkpoint. The checkpoint's first action is to snapshot the transaction state and the dirty-buffer list; its second is to write the XLOG_CHECKPOINT_ONLINE record; its third — the long one — is the buffer sync, which iterates over the buffer pool and writes out every dirty buffer whose page-LSN is ≤ the checkpoint's LSN.
The buffer sync is paced by checkpoint_completion_target. Before writing each buffer, the checkpointer computes how far into the checkpoint interval it should be by now (based on wall-clock or WAL progress, whichever is ahead), compares that to how many buffers it has actually written, and sleeps if it is ahead. This produces the smooth IO curve you see in iostat on a healthy Postgres — peaks at checkpoint edges, valleys between, but nothing catastrophic.
After the buffer sync, the checkpointer updates pg_control with the new checkpoint's LSN (via UpdateControlFile(), which writes and fsyncs the 8 KiB control file atomically), and finally calls RemoveOldXlogFiles() to recycle WAL segments older than the checkpoint (they are no longer needed for crash recovery, though they may be kept for replication or point-in-time recovery).
InnoDB — fuzzy checkpoint with a separate LSN for the flushed list
InnoDB does not write a single CHECKPOINT record. Instead, the master thread and the page cleaners constantly flush dirty pages from the buffer pool to disk, and InnoDB separately maintains a flushed-up-to LSN in the redo-log header — the LSN up to which every dirty page has been flushed. Recovery starts from this LSN (read from ib_logfile0's header on startup) rather than from a checkpoint record in the log body.
Conceptually, InnoDB's "checkpoint" is continuous: every page flush allows the flushed-up-to LSN to advance to the oldest rec-LSN of any remaining dirty page. The knobs you tune are innodb_io_capacity and innodb_max_dirty_pages_pct, which together pace the page cleaners to keep the buffer pool from filling up with dirty pages and to keep the flushed-up-to LSN close to the current LSN. The effect is the same as Postgres's spread checkpoints — smooth IO, bounded recovery — but the mechanism is different. InnoDB's design prioritises the continuous model; Postgres's prioritises a discrete checkpoint event you can reason about in isolation.
Azure SQL Database — continuous checkpointing and accelerated database recovery
Azure SQL Database uses Indirect Checkpoints, which are fuzzy checkpoints spread continuously by a target recovery time (target_recovery_time_in_seconds) configured per database. The SQL Server engine computes, at every moment, the age of the oldest dirty page in the buffer pool; if that age exceeds the configured target, the background writer accelerates its flush rate. The checkpoint is no longer an event — it is a steady-state property the engine maintains.
Azure goes further with Accelerated Database Recovery (ADR), which replaces the traditional UNDO phase with a Persisted Version Store — every update's before-image is kept in a versioned structure on disk, so UNDO on a long-running transaction becomes a constant-time operation rather than proportional to transaction length. ADR breaks the other half of the recovery-time problem: not just "how long is the log" but "how long can one transaction be". With ADR, a 10-hour transaction that rolls back after a crash no longer imposes 10 hours of UNDO.
The checkpoint record in ARIES — a sharp-edged specification
The 1992 ARIES paper specifies the checkpoint record with precision. The begin_chkpt record is written; then, while transactions continue, the active-transaction table and dirty-page table are snapshotted; then the end_chkpt record is written, carrying both tables. Recovery inherits the tables from end_chkpt but starts its analysis scan at begin_chkpt, re-applying every record from there to the end of the log — capturing everything that happened during the checkpoint itself. Most real engines have collapsed the begin/end distinction (Postgres writes a single XLOG_CHECKPOINT_ONLINE), but the ARIES begin/end pattern is conceptually cleaner and you will see it in textbooks.
Where this leads next
You now have three of the four mechanisms a WAL-based engine is built from: log records and LSNs, fsync (via group commit), and checkpoints. With a checkpoint in hand, recovery is no longer "replay from the Big Bang"; it is "replay from the last few minutes." The remaining question is: how, precisely, does recovery use the checkpoint's ATT and DPT to reconstruct the state-at-crash?
The answer is the ARIES analysis pass — the first of three passes that together recover the database. Analysis reads the log from the checkpoint to the tail, updating the ATT and DPT with every record it sees, so by the time it hits the end-of-log it knows exactly which transactions were in flight at the crash and which pages need redo. Then comes redo (replay forward from the earliest rec-LSN), then undo (walk uncommitted transactions backward via prev_lsn). The next chapter, ARIES: the analysis pass, is where the checkpoint's tables are first put to work.
References
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §4 formally defines fuzzy checkpoints, the begin_chkpt / end_chkpt record pair, and the analysis-pass use of the tables.
- PostgreSQL Global Development Group, WAL Configuration — checkpoint_timeout and checkpoint_completion_target, PostgreSQL 16 documentation — the operational reference for Postgres's spread-checkpoint mechanism.
- Oracle Corporation, InnoDB Checkpoints, MySQL 8.0 reference manual — InnoDB's fuzzy checkpoint, the flushed-up-to LSN, and the tuning knobs (innodb_io_capacity, innodb_max_dirty_pages_pct).
- Microsoft, Accelerated Database Recovery in SQL Server and Azure SQL, SQL Server 2022 documentation — the Persisted Version Store design and the continuous-checkpoint model.
- Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — §10.3 on checkpoint trade-offs and §10.4 on their interaction with recovery time.
- Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5.4 gives a vendor-neutral account of fuzzy checkpointing and the recovery-time metric.