In short

A write-ahead log that has been appended to for six months holds every UPDATE record the database has ever written. If you have to replay all of them on crash recovery, the recovery itself takes longer than the uptime it protects — a database that takes four hours to come back is not a database, it is a tombstone. The fix is the checkpoint: a periodic WAL record that stores, at the LSN where it lands, two small tables: the active-transaction table (xids of transactions that are in flight at checkpoint time) and the dirty-page table (page IDs and the rec-LSNs of their earliest unflushed modifications). On recovery, the system reads the last checkpoint record — not the first record of the log — and begins its REDO pass from the earliest rec-LSN in the dirty-page table. Every record before that earliest rec-LSN is guaranteed to describe a modification that is already on disk, so replaying it would be wasted work; the checkpoint is the "you can safely ignore everything before this" guarantee. Consistent checkpoints pause all transactions while the checkpoint runs — simple, correct, but a full stall. Fuzzy checkpoints (which every real database uses) let transactions keep running while the checkpoint builds its tables; the resulting checkpoint record reflects a snapshot slightly in the past, and the recovery algorithm is written to handle that lag. The user-tunable knob is frequency. Too rare: recovery is slow. Too frequent: the checkpoint's page-flush phase causes an IO spike that hurts steady-state throughput. Postgres exposes this trade-off directly via checkpoint_timeout (default 5 minutes) and checkpoint_completion_target (default 0.9, which spreads the flush work across 90% of the interval). The checkpoint is the mechanism that turns "recovery time" from an unbounded property of the log's age into a bounded property of the last few minutes of activity.

A database that crashes is expected. A database that cannot come back is not. The implicit contract a storage engine makes with its operator is not just "your committed data is durable" — it is also "recovery completes in a time that is small compared to the uptime it protects." A log that has been quietly growing for six months can hold hundreds of terabytes of records. If recovery replays every one of them, it takes hours. Hours of downtime to replay six months of uptime is not a good ratio.

This chapter is about the mechanism that breaks the link between "how long the log is" and "how long recovery takes." It is called the checkpoint, and it is the reason a production database boots in thirty seconds instead of six hours.

Without checkpoints, recovery has no bound

Recall the WAL invariant from Log Records, LSNs, and the Log Sequence: every data page on disk has a page-LSN in its header, and for any page p written to the data file, page-LSN(p) ≤ flush-LSN. That invariant is what makes recovery correct — it guarantees that for every change on a data page there is a durable log record describing that change. But by itself it tells recovery nothing about where to start.

Absent other information, recovery is forced into a simple, catastrophic algorithm: open the log at byte 0 and replay every record to the end. For every UPDATE record it reads, it must fetch the target page, compare page-LSN to the record's LSN, and apply the after-image if the page is behind. For a log file that has been growing for six months at 50 MB/s, that is roughly 780 TB of log records. Even if your disk can read sequentially at 2 GB/s, you are looking at 100+ hours of pure log-read time before you do any random page I/O. A database that takes four days to come back after a crash is not a database.
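Back-of-the-envelope arithmetic behind those figures (six months taken as 180 days):

```python
log_bytes = 180 * 24 * 3600 * 50e6   # six months of WAL at 50 MB/s
print(log_bytes / 1e12)              # → 777.6   (terabytes of log)
print(log_bytes / 2e9 / 3600)        # → 108.0   (hours to read at 2 GB/s)
```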

[Figure: recovery without vs with checkpoints. A bar representing the WAL stretches from "BEGIN of time (log file created 6 months ago)" to "crash (today, 3am)"; without a checkpoint, recovery must replay the entire bar — estimated 100+ hours. A second bar shows the same log with a checkpoint near the end: recovery replays only from "last checkpoint (5 minutes before crash)" to the crash — estimated 30 seconds.]
Top: recovery without a checkpoint must replay the entire WAL — every UPDATE record back to log creation. For a six-month-old log this is ~100 hours. Bottom: with a checkpoint a few minutes before the crash, recovery replays only the tail of the log and completes in seconds. The checkpoint is the single mechanism that turns recovery time from an unbounded property of the log's age into a bounded property of the last checkpoint interval.

The observation that saves you is that most of the early log is already redundant. Every UPDATE record in that six-month log describes a modification to a page — and the vast majority of those pages have long since been flushed to disk with the modification included. Replaying their records would just re-compute page-LSN(p) < record.lsn, find it false, and skip the record. The work is wasted.

What recovery really needs is someone to tell it, at some recent point, "from here backward, every record describes a modification that is already on a data page; from here forward, you actually have to replay." That someone is the checkpoint.

A checkpoint is a small WAL record with two tables

A CHECKPOINT record is written to the WAL like any other record — same LSN machinery, same fsync path. Its payload is two small tables: the active-transaction table (ATT), mapping each in-flight transaction's xid to the LSN of its most recent log record, and the dirty-page table (DPT), mapping each dirty page's ID to its rec-LSN — the LSN of the earliest modification to that page that has not yet been flushed to the data file.

The rec-LSN is the critical field. Take the minimum rec-LSN across the whole dirty-page table: that is the earliest LSN that could possibly still need replay. Every record before that LSN describes a modification whose effect is already on a data page — because either the page was dirty at some prior time and has since been flushed (so it is no longer in the DPT), or the record's modification has already been absorbed into a later page flush.

Recovery therefore begins its REDO pass at min(rec-LSN over DPT) rather than at the beginning of the log. The ATT, meanwhile, tells UNDO which transactions were in flight and therefore candidates for rollback.
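As a concrete illustration (hypothetical LSN values, not from the chapter's running example), the REDO starting point falls out of the dirty-page table in one line:

```python
# Hypothetical dirty-page table snapshot: page_id -> rec-LSN of the
# earliest unflushed modification to that page.
dpt = {17: 9_840, 3: 9_112, 254: 10_003}

# Everything before the minimum rec-LSN is already reflected on disk,
# so REDO can safely begin there instead of at byte 0 of the log.
redo_start = min(dpt.values())
print(redo_start)   # → 9112
```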

# checkpoint.py — writing a checkpoint record and using it on recovery.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import struct, zlib

CHECKPOINT = 4   # same type byte as in the previous chapter

@dataclass
class BufferPool:
    # page_id -> (bytes, page_lsn, dirty_rec_lsn_or_None)
    pages: Dict[int, Tuple[bytes, int, int | None]] = field(default_factory=dict)

    def dirty_page_table(self) -> List[Tuple[int, int]]:
        """Pairs of (page_id, rec_lsn) for pages whose earliest unflushed
        modification lives at rec_lsn. Pages that are clean are omitted."""
        return [(pid, rl) for pid, (_, _, rl) in self.pages.items() if rl is not None]

@dataclass
class TransactionTable:
    # xid -> last_lsn
    active: Dict[int, int] = field(default_factory=dict)

    def snapshot(self) -> List[Tuple[int, int]]:
        return list(self.active.items())

def write_checkpoint(log_manager, buffer_pool: BufferPool,
                     txn_table: TransactionTable) -> int:
    """Write a CHECKPOINT record. Returns the LSN at which it lands.

    This is the FUZZY variant: we snapshot the tables under a very brief latch,
    release the latch, and then write the record. New transactions and new
    dirty pages can appear while the record is being serialized and fsynced.
    The rec-LSN conservatism (take the earliest LSN in the table) handles
    the small lag correctly.
    """
    # Brief critical section: copy the tables. Release the latch immediately.
    active = txn_table.snapshot()
    dirty  = buffer_pool.dirty_page_table()

    body = struct.pack(">BI", CHECKPOINT, len(active))
    for xid, last_lsn in active:
        body += struct.pack(">QQ", xid, last_lsn)
    body += struct.pack(">I", len(dirty))
    for pid, rec_lsn in dirty:
        body += struct.pack(">QQ", pid, rec_lsn)

    crc = zlib.crc32(body) & 0xFFFFFFFF
    framed = struct.pack(">I", len(body)) + body + struct.pack(">I", crc)
    lsn = log_manager.append_and_flush(framed)   # durable before we return
    # Record the checkpoint's LSN in a small "control file" so recovery can
    # find the most recent checkpoint without scanning the whole log.
    log_manager.control_file.write_last_checkpoint(lsn)
    return lsn

Two things worth naming in that routine.

The control file. Recovery does not scan the log forward looking for the last CHECKPOINT record; that would defeat the whole purpose. Instead, each successful checkpoint updates a small control file (Postgres calls it pg_control, InnoDB the system tablespace header) with the LSN of the most recent checkpoint. Recovery opens the control file first, reads that LSN, seeks directly there in the log, and starts from the checkpoint record. The control file is tiny (a few hundred bytes), but it is the entry point for every recovery run.

The fsync of the checkpoint record. The checkpoint record must itself be durable before the control file is updated. Otherwise a crash between "record appended" and "record fsynced" would leave the control file pointing at an LSN whose record is not on disk. The order is always: write checkpoint record → fsync log → update control file → fsync control file.
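A minimal sketch of the control-file side of that ordering, assuming a hypothetical 12-byte layout (8-byte LSN plus a CRC32); real control files such as pg_control carry far more state than this:

```python
import os, struct, zlib

def write_last_checkpoint(path: str, ckpt_lsn: int) -> None:
    """Durably record the checkpoint's LSN. Called only AFTER the
    checkpoint record itself has been fsynced into the log."""
    body = struct.pack(">Q", ckpt_lsn)
    blob = body + struct.pack(">I", zlib.crc32(body) & 0xFFFFFFFF)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())      # control bytes durable before the swap
    os.replace(tmp, path)         # atomic rename into place

def read_last_checkpoint(path: str) -> int:
    """Recovery's very first read: where is the most recent checkpoint?"""
    with open(path, "rb") as f:
        blob = f.read()
    body, (crc,) = blob[:-4], struct.unpack(">I", blob[-4:])
    assert (zlib.crc32(body) & 0xFFFFFFFF) == crc, "torn control file"
    return struct.unpack(">Q", body)[0]
```

A fully paranoid version would also fsync the containing directory after the rename, so the new directory entry itself is durable.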

How recovery uses the checkpoint

The recovery algorithm, in skeleton, has four steps.

def recover(log_manager, buffer_pool, data_files, control_file):
    """Skeleton recovery driver. The full ARIES algorithm is in the next three chapters."""
    # Step 1 — find the entry point.
    ckpt_lsn = control_file.read_last_checkpoint()        # from the tiny control file
    ckpt = log_manager.read_record(ckpt_lsn)              # the CHECKPOINT record

    # Step 2 — ANALYSIS: scan forward from the checkpoint, updating the DPT and ATT
    # with records written *after* the checkpoint, so the tables reflect the state
    # at crash, not at checkpoint.
    dpt = {pid: rec_lsn for pid, rec_lsn in ckpt.dirty_pages}
    att = {xid: last_lsn for xid, last_lsn in ckpt.active_xids}
    for rec in log_manager.scan_from(ckpt_lsn + ckpt.size):
        if rec.type == "UPDATE":
            dpt.setdefault(rec.page_id, rec.lsn)         # rec-LSN = first mod since last flush
            att[rec.xid] = rec.lsn
        elif rec.type == "COMMIT":
            att.pop(rec.xid, None)
        elif rec.type == "BEGIN":
            att[rec.xid] = rec.lsn

    # Step 3 — REDO: from the earliest rec-LSN in the DPT, replay forward,
    # skipping records whose page-LSN on disk is already >= record.lsn.
    redo_start = min(dpt.values()) if dpt else ckpt_lsn
    for rec in log_manager.scan_from(redo_start):
        if rec.type == "UPDATE":
            page = buffer_pool.fetch(rec.page_id)
            if page.page_lsn < rec.lsn:
                page.apply_after_image(rec)
                page.page_lsn = rec.lsn

    # Step 4 — UNDO: for each xid still in ATT (never committed), walk prev_lsn
    # chain backward, applying before-images. (Full algorithm in the UNDO chapter.)
    for xid in att:
        undo_transaction(xid, log_manager, buffer_pool)

The headline number — "how long does recovery take" — is determined almost entirely by step 3's redo_start. The further redo_start is from the crash, the more records have to be read in step 3, the longer recovery takes. And redo_start is the smallest rec-LSN in the DPT — which is dominated by whichever dirty page has been sitting unflushed the longest. Checkpoints exist to keep that age bounded.

Fuzzy vs consistent checkpoints — who has to be paused

A checkpoint's job is to produce a self-consistent ATT and DPT. A naïve implementation pauses all transactions while it walks the buffer pool and transaction table. This is a consistent checkpoint: the world stops, the tables are snapshotted, the record is written, the world restarts. It is simple and correct, but every consistent checkpoint is a full stall of the database — on a busy OLTP system, a 200 ms checkpoint means 200 ms during which no transaction can make progress. That is unacceptable for a production system.

Every real database therefore uses a fuzzy checkpoint, in which transactions keep running while the checkpoint's tables are being built. The tables end up reflecting a snapshot slightly in the past (new transactions begun during the checkpoint are not in the ATT; new dirty pages are not in the DPT), and the recovery algorithm is written to handle that lag.

[Figure: consistent vs fuzzy checkpoints — two timelines. Top (consistent): transactions T1–T3 halt during a 200 ms "STALL: all transactions paused" block between t1 and t2; the CHECKPOINT record lands at t2. Bottom (fuzzy): T1–T3 run continuously; a ~1 ms latch at t1 snapshots the tables, the CHECKPOINT record lands at t2, and the snapshot is stale by t2−t1, with the intervening activity visible in the log after the snapshot.]
Top: a consistent checkpoint stalls every transaction for the duration of the checkpoint — easy to reason about, unusable in production. Bottom: a fuzzy checkpoint takes a brief latch at t1 to snapshot the tables, releases it, and lets transactions run through the rest of the checkpoint write. The snapshot in the CHECKPOINT record is slightly stale (it does not reflect t1→t2 activity), and recovery handles the staleness by replaying those records in its analysis pass.

The key insight that makes fuzzy checkpoints safe: recovery's analysis pass re-scans the log from the checkpoint record to the end, re-applying every BEGIN, UPDATE, and COMMIT it sees to the DPT and ATT it inherited from the checkpoint. So if a transaction started between the checkpoint latch and the checkpoint record, its BEGIN record lands in the log after the checkpoint's snapshot, analysis replays it, the xid gets added to the ATT, and the scheme is correct. The checkpoint's tables are the starting point of the analysis pass, not the result of it.

The recovery-time metric and checkpoint frequency

Call the worst-case time to complete recovery after a crash the recovery-time metric, or sometimes Recovery Time Objective (RTO). In WAL-based systems, the metric is almost perfectly proportional to the number of log bytes between the last checkpoint and the crash.

If checkpoints fire every T seconds, then on average the crash catches the system T/2 seconds after the last checkpoint, with a worst case of just under T seconds. At a WAL rate of R bytes per second, recovery has to read roughly T · R bytes of log, so:

worst-case recovery time  ≈  (T · R) / log_read_speed  +  random_page_IO_time

For T = 300 s (5 min), R = 50 MB/s (a healthy OLTP load), log_read_speed = 2 GB/s: recovery reads 15 GB of log in 7.5 seconds, plus maybe 20–30 seconds of random I/O to fetch and update dirty pages. Total: roughly 30 seconds. That is the number you want.
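The same arithmetic as a throwaway helper (the function name and the 25 s page-I/O figure — the midpoint of the 20–30 s range above — are mine):

```python
def worst_case_recovery_s(interval_s: float, wal_rate_Bps: float,
                          log_read_Bps: float, page_io_s: float) -> float:
    """Worst case: the crash lands just under one full checkpoint interval
    after the last checkpoint, so ~interval_s * wal_rate_Bps bytes of log
    must be read, plus the random page I/O to apply them."""
    return (interval_s * wal_rate_Bps) / log_read_Bps + page_io_s

# T = 300 s, R = 50 MB/s, sequential log reads at 2 GB/s, ~25 s of
# random page I/O: 7.5 s of log reading plus the page work.
print(worst_case_recovery_s(300, 50e6, 2e9, 25))   # → 32.5
```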

Make T = 30 minutes and the same calculation gives 3 minutes of recovery — usually still acceptable. Make T = 12 hours and recovery is an hour — your on-call engineer is now very unhappy.

The temptation is to make T as small as possible. Why not checkpoint every 10 seconds and recover in 2 seconds? The answer is the IO cost of the checkpoint itself, which the next section unpacks.

Bounding recovery on a 1 TB Postgres

A production Postgres cluster on a single 1 TB NVMe SSD runs an OLTP workload writing 40 MB/s of WAL. It configures checkpoint_timeout = 5min and checkpoint_completion_target = 0.9. A crash happens 4 minutes and 20 seconds after the most recent checkpoint completed.

Log bytes to replay. 260 seconds × 40 MB/s = 10.4 GB of WAL.

Log read time. The SSD reads sequentially at ~3 GB/s; 10.4 GB / 3 GB/s ≈ 3.5 seconds.

Pages touched during replay. Records average ~200 bytes, so 10.4 GB of WAL holds about 52 million records. Suppose 15% of them reference pages not already in the buffer pool after restart (the buffer pool is cold): 7.8 million records need a page fetch. At 8 KiB per page and a random read of 80 µs on this NVMe, that is 7.8M × 80 µs = 624 seconds of random I/O. That dominates.

Mitigation. Real recovery pipelines prefetch pages in parallel — Postgres's recovery prefetch (added in version 15) issues posix_fadvise(POSIX_FADV_WILLNEED) hints ahead of the replay position and can cut effective random-read latency by 10–20×. With prefetch, the 624 seconds drops to ~40 seconds.

Total. 3.5 s (log read) + 40 s (prefetched page I/O) + a few seconds of book-keeping ≈ ~45 s worst-case recovery. On this system, reducing checkpoint_timeout to 2 minutes would cut recovery to ~20 s but would roughly double the average checkpoint IO rate. The operator chose 5 minutes because 45 s of recovery after a crash is acceptable and 20 s of IO headroom during steady-state is not.
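The example's numbers, reproduced as a script. Every constant comes from the text above; the 15× prefetch factor is simply the midpoint of the quoted 10–20× range:

```python
elapsed_s    = 4 * 60 + 20            # crash 4m20s after the last checkpoint
log_bytes    = elapsed_s * 40e6       # 40 MB/s of WAL -> 10.4 GB
log_read_s   = log_bytes / 3e9        # ~3 GB/s sequential -> ~3.5 s

records      = log_bytes / 200        # ~200 B/record -> 52 million records
cold_fetches = records * 0.15         # 15% miss the cold buffer pool
random_io_s  = cold_fetches * 80e-6   # 80 us per random page read
print(round(random_io_s))             # → 624

prefetched_s = random_io_s / 15       # recovery prefetch, ~10-20x faster
print(round(log_read_s + prefetched_s))   # → 45
```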

The IO-spike tradeoff — why you cannot checkpoint too often

A checkpoint does two things: it writes a checkpoint record (tiny, cheap — one WAL append and one fsync), and it flushes the dirty pages referenced by the DPT to the data files. That second part is the expensive one. A buffer pool holding 4 GiB of dirty pages, when checkpointed, has to write 4 GiB out to the data files as a burst of I/O.

If checkpoints fire every 5 minutes and the database has averaged 500 MB of new dirty pages per minute between them, the checkpoint has to move 2.5 GB. If the SSD's write bandwidth is 1 GB/s, that is 2.5 seconds of 100%-full write queue — during which every non-checkpoint write is fighting for bandwidth, and latency on foreground transactions spikes badly.

The obvious fix is to spread the flush across the interval instead of bursting at the end. Postgres does exactly this via checkpoint_completion_target, a float in [0, 1] (default 0.9). It means: if checkpoint_timeout is 300 seconds and the target is 0.9, the background writer paces the dirty-page flush to complete by 270 seconds into the interval — i.e., to spread the IO across 90% of the checkpoint window. The IO rate becomes (dirty_bytes_at_checkpoint_start) / (checkpoint_timeout * completion_target) — far smoother than a burst.
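The pacing logic is small enough to sketch. This is a toy model of the idea behind checkpoint_completion_target, not Postgres's actual checkpointer code, and the names are mine:

```python
import time

def spread_checkpoint_flush(dirty_pages, write_page,
                            timeout_s: float = 300.0,
                            completion_target: float = 0.9) -> None:
    """Flush dirty pages paced so the last one lands at roughly
    completion_target * timeout_s into the checkpoint interval."""
    budget_s = timeout_s * completion_target
    start = time.monotonic()
    total = len(dirty_pages)
    for i, page in enumerate(dirty_pages, start=1):
        write_page(page)
        elapsed = time.monotonic() - start
        # If we're ahead of schedule (fraction of pages written exceeds
        # fraction of the flush window elapsed), sleep until it catches up.
        if i / total > elapsed / budget_s:
            time.sleep(max(0.0, budget_s * (i / total) - elapsed))
```

With timeout_s=300 and completion_target=0.9, a 2.5 GB backlog flushes at a steady ~9 MB/s over 270 seconds instead of a 2.5-second burst at 1 GB/s.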

Once you see checkpoint_completion_target, you see the whole trade-off: the checkpoint interval bounds recovery time, and the completion target decides how gently the flush cost of each checkpoint is paid within that interval.

Most production Postgres deployments land in the range checkpoint_timeout = 5min to 15min with checkpoint_completion_target = 0.9. A rule of thumb: if your pg_stat_bgwriter.checkpoints_req (requested, meaning WAL-volume-triggered rather than timeout-triggered) is a significant fraction of checkpoints_timed, your max_wal_size is too small and checkpoints are firing faster than the timeout would dictate — the cure is a bigger max_wal_size, not a shorter timeout.

Common confusions

A checkpoint does not make anything durable that was not durable already. Durability comes from the fsync at commit time; the checkpoint only bounds how much log recovery must replay.

REDO does not start at the checkpoint record's LSN. It starts at the minimum rec-LSN in the dirty-page table, which can be well before the checkpoint itself — a page dirtied long ago and never flushed drags the REDO start back with it.

A fuzzy checkpoint's stale tables are not a correctness hole. The analysis pass re-scans from the checkpoint to the end of the log and folds every post-snapshot record into the ATT and DPT, so anything the snapshot missed is recovered from the log itself.

Going deeper

Postgres — the background writer and spread checkpoints

Postgres's checkpointer is a separate background process (checkpointer), spawned at startup. It loops: sleep until either checkpoint_timeout has elapsed or WAL volume since the last checkpoint has hit max_wal_size, then start a checkpoint. The checkpoint first establishes the redo pointer and snapshots the transaction state; then comes the long part, the buffer sync, which iterates over the buffer pool and writes out every buffer that was already dirty when the checkpoint began; only once the sync completes does it write the XLOG_CHECKPOINT_ONLINE record.

The buffer sync is paced by checkpoint_completion_target. Before writing each buffer, the checkpointer computes how far into the checkpoint interval it should be by now (based on wall-clock or WAL progress, whichever is ahead), compares that to how many buffers it has actually written, and sleeps if it is ahead. This produces the smooth IO curve you see in iostat on a healthy Postgres — peaks at checkpoint edges, valleys between, but nothing catastrophic.

After the buffer sync, the checkpointer updates pg_control with the new checkpoint's LSN (via UpdateControlFile(), which writes and fsyncs the 8 KiB control file atomically), and finally calls RemoveOldXlogFiles() to recycle WAL segments older than the checkpoint (they are no longer needed for crash recovery, though they may be kept for replication or point-in-time recovery).

InnoDB — fuzzy checkpoint with a separate LSN for the flushed list

InnoDB does not write a single CHECKPOINT record. Instead, the master thread and the page cleaners constantly flush dirty pages from the buffer pool to disk, and InnoDB separately maintains a flushed-up-to LSN in the redo-log header — the LSN up to which every dirty page has been flushed. Recovery starts from this LSN (read from ib_logfile0's header on startup) rather than from a checkpoint record in the log body.

Conceptually, InnoDB's "checkpoint" is continuous: every page flush advances the flushed-up-to LSN by the oldest rec-LSN of any remaining dirty page. The knob you tune is innodb_io_capacity and innodb_max_dirty_pages_pct, which together pace the page cleaners to keep the buffer pool from filling up with dirty pages and to keep the flushed-up-to LSN close to the current LSN. The effect is the same as Postgres's spread checkpoints — smooth IO, bounded recovery — but the mechanism is different. InnoDB's design prioritises the continuous model; Postgres's prioritises a discrete checkpoint event you can reason about in isolation.

Azure SQL Database — continuous checkpointing and accelerated database recovery

Azure SQL Database uses Indirect Checkpoints, which are fuzzy checkpoints spread continuously by a target recovery time (target_recovery_time_in_seconds) configured per database. The SQL Server engine computes, at every moment, the age of the oldest dirty page in the buffer pool; if that age exceeds the configured target, the background writer accelerates its flush rate. The checkpoint is no longer an event — it is a steady-state property the engine maintains.

Azure goes further with Accelerated Database Recovery (ADR), which replaces the traditional UNDO phase with a Persisted Version Store — every update's before-image is kept in a versioned structure on disk, so UNDO on a long-running transaction becomes a constant-time operation rather than proportional to transaction length. ADR breaks the other half of the recovery-time problem: not just "how long is the log" but "how long can one transaction be". With ADR, a 10-hour transaction that rolls back after a crash no longer imposes 10 hours of UNDO.

The checkpoint record in ARIES — a sharp-edged specification

The 1992 ARIES paper specifies the checkpoint record with precision. The begin_chkpt record is written; then, while transactions continue, the active-transaction table and dirty-page table are snapshotted; then the end_chkpt record is written, carrying both tables. Recovery uses end_chkpt.LSN as its starting point, inheriting the tables from the record and then re-applying records from begin_chkpt.LSN + 1 to the end-of-log in the analysis pass — capturing everything that happened during the checkpoint. Most real engines have collapsed the begin/end distinction (Postgres writes a single XLOG_CHECKPOINT_ONLINE), but the ARIES begin/end pattern is conceptually cleaner and you will see it in textbooks.

Where this leads next

You now have the third of the four mechanisms a WAL-based engine is built from: records, LSNs, fsync (via group commit), and checkpoints. With a checkpoint in hand, recovery is no longer "replay from the Big Bang"; it is "replay from the last few minutes." The remaining question is: how, precisely, does recovery use the checkpoint's ATT and DPT to reconstruct the state-at-crash?

The answer is the ARIES analysis pass — the first of three passes that together recover the database. Analysis reads the log from the checkpoint to the tail, updating the ATT and DPT with every record it sees, so by the time it hits the end-of-log it knows exactly which transactions were in flight at the crash and which pages need redo. Then comes redo (replay forward from the earliest rec-LSN), then undo (walk uncommitted transactions backward via prev_lsn). The next chapter, ARIES: the analysis pass, is where the checkpoint's tables are first put to work.

References

  1. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §4 formally defines fuzzy checkpoints, the begin_chkpt / end_chkpt record pair, and the analysis-pass use of the tables.
  2. PostgreSQL Global Development Group, WAL Configuration — checkpoint_timeout and checkpoint_completion_target, PostgreSQL 16 documentation — the operational reference for Postgres's spread-checkpoint mechanism.
  3. Oracle Corporation, InnoDB Checkpoints, MySQL 8.0 reference manual — InnoDB's fuzzy checkpoint, the flushed-up-to LSN, and the tuning knobs (innodb_io_capacity, innodb_max_dirty_pages_pct).
  4. Microsoft, Accelerated Database Recovery in SQL Server and Azure SQL, SQL Server 2022 documentation — the Persisted Version Store design and the continuous-checkpoint model.
  5. Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — §10.3 on checkpoint trade-offs and §10.4 on their interaction with recovery time.
  6. Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5.4 gives a vendor-neutral account of fuzzy checkpointing and the recovery-time metric.