In short

A write-ahead log is one file, opened once, appended to forever. Every record in it is immutable: once flushed, its bytes never change. Each record has a Log Sequence Number (LSN) — most commonly the byte offset at which the record starts, sometimes a pure 64-bit counter. LSNs are monotonically increasing: if record B was appended after record A, then LSN(B) > LSN(A). The four record types you need to make recovery work are BEGIN (a transaction starts, carrying its transaction ID), UPDATE (a modification, carrying both the before-image — what was there — and the after-image — what it became — so you can both redo forward and undo backward), COMMIT (the transaction is durable the moment this record's LSN has been flushed), and CHECKPOINT (a periodic snapshot of which transactions are live and which dirty pages exist, so recovery does not have to scan the whole log from the beginning of time). Every data page in the buffer pool and on disk carries one extra 8-byte field in its header: the page-LSN, the LSN of the last WAL record that modified this page. The write-ahead rule restated in one line is an invariant over two numbers: page-LSN(p) ≤ flush-LSN for every page p that reaches the disk. The flush-LSN is the highest LSN whose record has been fsynced to the log. If the inequality holds at the moment a page is written to its data file, then the record describing the modification is already durable, and recovery can replay it. If the inequality is ever violated, you have written a page whose modification is not yet in the log — a torn-write-style hazard of a different kind, invisible until the crash. The LSN is the passport every modification carries; the page-LSN is the stamp the page keeps; the flush-LSN is the border the WAL has reached; the invariant is the one rule the border guard enforces.

Open any serious database's data directory on a running system. You will find the data files — the tablespace, the heap files, the index files. And you will find one or more files named like pg_wal/000000010000000000000042 or ib_logfile0 or db.wal. These are the log files. If the data files are the library, the log is the diary of every change the library has ever undergone, in the order the changes happened.

This chapter is about the diary: what goes in each entry, how the entries are numbered, how those numbers get remembered by the pages the entries describe, and what one-line invariant the whole scheme rests on.

The log is a file of immutable records

Forget, for a moment, the sophistication of what the log will eventually do. At its bottom layer it is just a file. Opened once, at database startup. Appended to, record after record, for the life of the process. Never seeked backwards and overwritten — only appended, fsynced, appended, fsynced. When the file gets long, it is rotated: the old file is archived or deleted (after the data it describes is safely in the data files), a new file is opened, and appending continues.

Why append-only: an append never overwrites bytes that might be read during recovery. If the process crashes during an append, the partially-written record at the tail is easy to detect (checksum fails, or the record length extends past end-of-file) and easy to discard — truncate the file to the last good record boundary. In contrast, an in-place update anywhere in the file would require torn-write defences like those you saw in Torn Writes and the Need for a Log. The log sidesteps the hazard by never mutating anything it has already written.

Every record is immutable once the log manager has handed it back to the caller. The bytes of the BEGIN record for transaction 42 are the same at 11pm as they are at 3am the following Tuesday after the server has rebooted twice. This is what makes the log a trustworthy source of truth: the diary never rewrites itself.

The write-ahead log as a byte sequence with LSNs at record boundariesA long horizontal rectangle labelled "wal file" is divided into variable-width segments, each representing one log record. The segments are labelled, left to right: BEGIN (xid=42), UPDATE (xid=42, page 7), UPDATE (xid=42, page 9), COMMIT (xid=42), BEGIN (xid=43), CHECKPOINT. Below each segment boundary is an LSN — LSN 1000, LSN 1048, LSN 1140, LSN 1232, LSN 1272, LSN 1320 — shown as the byte offset into the file. The LSN values are monotonically increasing and spaced in proportion to the record width. To the right of the file, a cursor labelled "flush-LSN = 1232" marks the highest LSN that has been fsynced. Records to the right of the cursor (BEGIN xid=43, CHECKPOINT) are shaded lighter to indicate they are in the in-memory log buffer and not yet durable.one WAL file, appended left to rightBEGINxid=42UPDATExid=42, p=7UPDATExid=42, p=9COMMITxid=42BEGINxid=43CHECKPT(unused tail — next append lands here)LSN 100010481140123212721320flush-LSN = 1232LSN = byte offset at the record's start. Records to the left of flush-LSN are fsynced (durable).Records to the right of flush-LSN are in the in-memory log buffer — lost on crash.
One WAL file, left to right, with six records appended so far. Each record's LSN is the byte offset at which it starts — LSNs are monotonically increasing because appends can only grow the offset. The vertical cursor marks flush-LSN = 1232: the highest LSN whose record has been fsynced. Records past the cursor (BEGIN xid=43, CHECKPOINT) are still in RAM and would vanish on a crash right now.

The file, end-to-end, gives a total order on every modification the database has ever made. Two records sitting in the same file can always be compared: this one happened before that one, by comparing offsets. That total order is the backbone of recovery — replay the log in order, and you have reconstructed every change up to the last durable record.

The Log Sequence Number — what exactly is it?

The LSN is the name (and the sort key) of a record. Two common implementations.

Byte-offset LSN — the LSN of a record is the byte position at which the record starts in the virtual log. "Virtual" because the log is spread across multiple physical files, but the offsets continue across rotations: if file 1 is 16 MiB and file 2 starts at offset 16 MiB, a record at byte 100 of file 2 has LSN = 16 MiB + 100. This is how Postgres does it — its LSN is written 0/1A2B3C40 (16 hex digits, a 64-bit integer), and pg_current_wal_lsn() returns the current tail.

Counter-based LSN — a pure monotonic 64-bit counter incremented on every new record, independent of where the record lands in the file. This is conceptually cleaner (an LSN is a "record number," not an offset) but loses one convenient property of the offset scheme: given a byte offset, you do not know which LSN lives there without a separate index. Most engines use the offset scheme for exactly that reason.

Either way, the LSN satisfies three properties that recovery relies on:

  1. Monotonic. If record B was appended after record A, then LSN(B) > LSN(A). No ties.
  2. Unique. Every record has its own LSN; no two records share one.
  3. Comparable. Given two LSNs, you can tell which record came first by comparing their values.

Why 64 bits and not 32: because the log is a high-volume stream. A busy OLTP database writes tens of megabytes of WAL per second; a 32-bit byte-offset LSN would wrap in a matter of days, and a 32-bit counter would wrap in a matter of minutes. A 64-bit LSN at 1 GB/s of WAL would take 584 years to wrap. You are never going to see it wrap in production. (If you think you might — you will not.)

The four record types that make recovery work

Recovery needs to answer two questions about every transaction the log describes: (1) did it commit? (2) what did it change? The four record types below, taken together, answer both.

BEGIN — a transaction starts

A BEGIN record records only that a transaction has begun and its transaction ID (xid). It is the anchor against which every subsequent record for the same xid is grouped.

UPDATE — a modification, with before- and after-images

An UPDATE record carries the identity of the page and offset being modified, the before-image (the bytes that were there), and the after-image (the bytes that are now there). Two images, not one, because recovery might need to go either direction.

An update-record that omits the before-image is called a redo-only record; it cannot be undone. An update-record that omits the after-image is undo-only; it cannot be redone. The general-purpose record carries both, which is what ARIES requires. You will see the three variants formally in a later chapter; for now, assume both images are present.

COMMIT — the transaction is now durable

A COMMIT record is small: xid and nothing else (sometimes a timestamp). Its importance lies in its placement, not its contents. A transaction is considered durable if and only if its COMMIT record has been flushed to the log — i.e., its LSN is ≤ the flush-LSN. If the crash happens with the UPDATE records on disk but the COMMIT record still in the log buffer, recovery will undo the transaction using the before-images: it never committed.

CHECKPOINT — a snapshot of what is live

A CHECKPOINT record is a periodic meta-record that says, in effect: at this LSN, the following transactions were active (listed by xid) and the following dirty pages existed (listed with their page-LSNs) in the buffer pool. Recovery starts by finding the most recent CHECKPOINT and reading its contents — that gives it a place to start the replay from, rather than starting at the very beginning of the log. We will look at checkpoints in detail in Checkpointing: bounding recovery time.

The records in Python

# log_records.py — the four WAL record types and their serialization.
from __future__ import annotations
from dataclasses import dataclass
import struct
from typing import List, Tuple

# Record-type discriminator (first byte of every serialized record).
BEGIN, UPDATE, COMMIT, CHECKPOINT = 1, 2, 3, 4

@dataclass(frozen=True)
class BeginRecord:
    lsn: int          # assigned by the log manager when the record is appended
    xid: int          # transaction id this record begins

@dataclass(frozen=True)
class UpdateRecord:
    lsn: int
    xid: int
    page_id: int      # which data page is being modified
    offset: int       # byte offset within the page
    before: bytes     # bytes that were there (enables UNDO)
    after: bytes      # bytes that are there now (enables REDO)
    prev_lsn: int     # LSN of the previous record for the same xid; 0 if this is the first

@dataclass(frozen=True)
class CommitRecord:
    lsn: int
    xid: int

@dataclass(frozen=True)
class CheckpointRecord:
    lsn: int
    active_xids: List[int]                    # transactions in flight at checkpoint
    dirty_pages: List[Tuple[int, int]]        # (page_id, rec_lsn) — page and its earliest
                                              # unflushed modification's LSN

# --- serialization (fixed header + variable payload) ----------------------

def serialize(rec) -> bytes:
    """Encode a record to bytes. Length-prefixed so the reader can frame records."""
    if isinstance(rec, BeginRecord):
        body = struct.pack(">BQ", BEGIN, rec.xid)
    elif isinstance(rec, UpdateRecord):
        body = struct.pack(">BQQIIQ", UPDATE, rec.xid, rec.page_id,
                           rec.offset, len(rec.before), rec.prev_lsn)
        body += rec.before + rec.after        # both images; equal lengths in practice
    elif isinstance(rec, CommitRecord):
        body = struct.pack(">BQ", COMMIT, rec.xid)
    elif isinstance(rec, CheckpointRecord):
        body = struct.pack(">BI", CHECKPOINT, len(rec.active_xids))
        for x in rec.active_xids:
            body += struct.pack(">Q", x)
        body += struct.pack(">I", len(rec.dirty_pages))
        for pid, rec_lsn in rec.dirty_pages:
            body += struct.pack(">QQ", pid, rec_lsn)
    else:
        raise TypeError(f"unknown record type: {type(rec).__name__}")

    # Prepend 4-byte length and append 4-byte CRC32 so the reader can
    # detect a torn tail (the last record of a crashed process).
    import zlib
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + body + struct.pack(">I", crc)

Three details in that Python worth naming.

prev_lsn links every record back to the last record for the same xid. That chain is what lets UNDO walk a transaction's history backward without scanning the whole log: follow prev_lsn from the latest UPDATE back to the BEGIN, applying before-images along the way.

The before-image and after-image have the same length in a typical fixed-length-row engine. An UPDATE that changes a 200-byte row has a 200-byte before-image and a 200-byte after-image. For variable-length rows, each gets its own length prefix; real engines do this, but the simplified version above assumes equal lengths.

Every record ends with a CRC32. The log reader computes the CRC on each record it parses; if the CRC does not match, the record is assumed to be torn — this is the last record in a crashed process, the one whose append was interrupted. Recovery truncates the file at that point and begins replay from whatever lies before.

Every data page carries the LSN of its last modification

A log by itself does not tell you whether a particular data page on disk is up-to-date with respect to a given WAL record. To answer that, every data page reserves a small region of its header for one 8-byte field: the page-LSN.

A data page with a page-LSN in its headerA tall rectangle represents a single 16 KiB data page. The top 64 bytes are shaded and labelled "header". Inside the header, named fields are shown in a stack: checksum (4 bytes), page-LSN (8 bytes), flags (2 bytes), lower-free-pointer (2 bytes), upper-free-pointer (2 bytes), special-space-pointer (2 bytes). The page-LSN field is highlighted with a box and an arrow labelled "LSN of the last WAL record that modified this page". The rest of the page body is labelled "tuples, index entries, or whatever the page type stores". To the right, an arrow points from the page-LSN to the WAL file (drawn smaller, as the log sequence from the earlier figure), landing on an UPDATE record at LSN 1140.a 16 KiB data page — the page-LSN in its headerheader (64 bytes)checksum4 Bpage-LSN8 Bflags2 Blower / upper / special6 Btuples & index entries(the page body)points toWAL fileBEGIN1000UPDATE1048 (p=7)UPDATE1140 (p=9)COMMIT1232page 9's page-LSN field holdsLSN 1140 — the LSN of themost recent record thatmodified page 9.
A 16 KiB data page with one distinguished field in its header: the page-LSN, 8 bytes. It holds the LSN of the last WAL record that modified this page. The arrow connects the page-LSN to the record at LSN 1140 in the WAL. If more updates to page 9 are appended later, the page-LSN is overwritten (in memory) with the newer LSN. The page-LSN on disk reflects the state that was flushed.

Three things to notice about the page-LSN.

It is written on the page, not next to it. When the page is flushed to disk, the page-LSN goes with it. When the page is read back, the page-LSN comes back. This makes the page self-describing with respect to the log: you can tell how up-to-date a page is by looking only at the page.

It is updated in memory before the page is flushed. When the buffer manager applies an UPDATE (copying the new bytes into the page in the pool), it also writes the LSN of the WAL record describing that update into the page-LSN field. The update to the page body and the update to the page-LSN happen together, under the same latch.

On recovery, it tells you which records to skip. Suppose recovery is replaying WAL record at LSN 1140, which modifies page 9. It reads page 9 from disk; page 9's page-LSN is 1140 or higher. Then the modification is already on this page; replaying the record would apply the change twice. Recovery skips it. If page 9's page-LSN is less than 1140, the record has not yet been applied to the on-disk page, and recovery applies it.

The page-LSN is therefore the idempotency marker of recovery: the scheme that makes replay safe to do repeatedly, even if the process crashes during replay itself.

The WAL rule, restated in one line

You saw the write-ahead rule in The Write-Ahead Rule: no dirty page is written to disk before its describing log record has been durably written to the log. That rule, recast in the vocabulary of LSNs, is an inequality over two numbers.

Define:

The rule becomes:

For every page p written to its data file: page-LSN(p) ≤ flush-LSN.

That is the entire WAL invariant. One line. Every other recovery rule in ARIES — the analysis pass, the redo pass, the undo pass, checkpoints, the dirty-page table — is machinery to preserve this one inequality or to exploit the fact that it holds.

Why the inequality works: if page-LSN(p) ≤ flush-LSN, then the WAL record describing the modification on page p is durable. If the crash happens right after the page is flushed, recovery reads the record from the WAL and sees the modification. Whether the page on disk has the modification or not (because the flush could have been interrupted by the crash or might have been a CoW/double-write scenario), recovery can reconstruct the intended state by replaying the record. Conversely, if page-LSN(p) > flush-LSN, the page on disk carries a modification whose WAL record is not durable — if the crash happens, the page has information the log does not, and recovery cannot know it should exist. The invariant is the knot that ties the two durable artefacts together.

A walk through six records and one page-LSN

A transaction inserts two rows: one into page 7, one into page 9. The WAL evolves as follows.

Step 1 — the transaction begins. The log manager assigns LSN 1000 to the BEGIN record. The record is written into the in-memory log buffer. No fsync yet.

WAL: [BEGIN xid=42 lsn=1000] (in buffer)
flush-LSN = 0  (or the previous batch's tail)

Step 2 — the first insert modifies page 7. The log manager assigns LSN 1048 to the UPDATE record. The buffer manager applies the change to page 7 in the pool and sets page_7.page_lsn = 1048. The WAL buffer now holds BEGIN + UPDATE.

WAL: [BEGIN 1000] [UPDATE xid=42 page=7 lsn=1048] (in buffer)
page 7 in pool: page-LSN = 1048, contents changed
flush-LSN = 0

Step 3 — the second insert modifies page 9. LSN 1140. Page 9 in pool now has page-LSN 1140.

WAL: ... [UPDATE xid=42 page=9 lsn=1140] (in buffer)
page 7 in pool: page-LSN = 1048
page 9 in pool: page-LSN = 1140
flush-LSN = 0

Step 4 — the transaction commits. LSN 1232 for the COMMIT record. Before the client is told "ok", the log manager flushes the WAL buffer up to and including LSN 1232. An fsync returns. Now:

WAL on disk: [BEGIN 1000] [UPDATE 1048] [UPDATE 1140] [COMMIT 1232]
flush-LSN = 1232
pages still in pool (not yet flushed)

The transaction is now durable — even though pages 7 and 9 are still in the buffer pool and have not been written to the data file. This is the whole point of WAL: commit latency depends only on the sequential log flush, not on the random-I/O data-page writes.

Step 5 — sometime later, the buffer manager decides to flush page 7. Before it writes the page to disk, it checks: is page_7.page_lsn ≤ flush_lsn? 1048 ≤ 1232 — yes. The page is safe to flush. The page hits the disk with page-LSN = 1048 in its header.

Step 6 — a checkpoint fires. The checkpoint record at LSN 1320 records that xid=43 is active (suppose another transaction started after COMMIT) and that page 9 is still dirty with rec-LSN = 1140. Now if the process crashes, recovery starts from the checkpoint, sees page 9's rec-LSN = 1140, reads page 9 from disk, compares its page-LSN against 1140 — if disk page-LSN < 1140, replay the UPDATE; otherwise skip.

The page-LSN is the breadcrumb that makes step 6 correct.

Common confusions

Going deeper

Postgres XLog records — rmgr dispatch and the page image

Postgres's WAL record has a fixed header (XLogRecord) followed by a variable number of block-data sections (XLogRecordBlockHeader + payload) and a variable-length main-data section. The header includes a resource-manager ID (xl_rmid) — a single byte identifying which subsystem owns this record: Heap, Btree, Hash, Gin, Transaction, Standby, etc. Each rmgr has its own redo function, selected by a dispatch table in rmgrlist.h. When recovery reads a record, it looks up xl_rmid, calls that rmgr's rm_redo(record), and the rmgr knows how to apply the change to its pages.

Block-data sections can each carry a full-page image (XLR_BLOCK_ID_FPI) if the page is being logged for the first time since the last checkpoint (the full_page_writes machinery). The LSN is a 64-bit value written as two 32-bit words (xlogid:xrecoff) in textual form; programmatically it is an XLogRecPtr. The flush-LSN is called XLogFlushRecPtr and is accessible via pg_current_wal_flush_lsn().

InnoDB log records — MLOG types and Mini-Transactions

InnoDB's log is organised around mini-transactions (MTRs). An MTR is a short sequence of log records that together modify one or more pages atomically — for instance, a B-tree page split is one MTR spanning records for the old page and the new page. Each record has a type byte from the MLOG_* enum: MLOG_REC_INSERT, MLOG_REC_UPDATE_IN_PLACE, MLOG_REC_DELETE, MLOG_UNDO_INSERT, MLOG_1BYTE, MLOG_2BYTES, MLOG_4BYTES, MLOG_8BYTES, MLOG_WRITE_STRING, MLOG_CHECKPOINT, and many more. The small-typed variants (MLOG_1BYTE etc.) exist because InnoDB logs fine-grained modifications — changing a single 2-byte field in a page header emits an MLOG_2BYTES rather than a whole-page image.

The InnoDB LSN is a 64-bit byte offset into the redo log (ib_logfile0, ib_logfile1, ... in older MySQL; a single #innodb_redo directory in MySQL 8.0.30+). The flush-LSN is exposed as innodb_os_log_written. MTRs are framed with an MLOG_MULTI_REC_END sentinel so recovery can tell when a multi-record MTR is complete — if a crash leaves the MTR's records on disk without the sentinel, recovery discards the incomplete MTR's effects.

LSN space partitioning in distributed databases

A distributed database cannot use a single byte-offset LSN: there is no single log file. Different systems solve this differently.

The pattern is: for a single-node engine, the LSN is literally a byte offset in one file; for a distributed engine, the LSN is either scoped per shard (with a higher-level mechanism for global order) or globally assigned by a central service. Both schemes preserve the local property that a page-LSN on a given replica is comparable to the flush-LSN on that replica — the WAL invariant still holds, just per-shard.

Why LSNs and page-LSNs suffice — a one-paragraph ARIES preview

With LSNs and page-LSNs in hand, ARIES recovery becomes three passes. Analysis: read the log from the last checkpoint to the end, reconstructing the dirty-page table and active-transaction table. Redo: for each log record from the earliest rec-LSN in the dirty-page table to the end, compare the record's LSN against the target page's page-LSN on disk; apply if record.lsn > page.lsn, skip otherwise — this is idempotent replay. Undo: for each transaction that was active at crash (never committed), walk its records backward via prev_lsn, applying before-images to reverse its effects. Every branch of every pass reduces to the same primitive: compare an LSN to another LSN. The WAL invariant is the only thing that makes those comparisons meaningful.

Where this leads next

You now have the vocabulary of a write-ahead log: records, LSNs, page-LSNs, flush-LSN, and the one inequality that holds it all together. The next question is a performance question, not a correctness one.

Every COMMIT requires an fsync on the WAL — and fsync costs milliseconds. A system that commits a thousand small transactions per second would spend all its time fsyncing if each commit issued its own fsync. The answer is group commit: amortise one fsync across many COMMIT records, trading a little latency for a lot of throughput. That is the next chapter, Group Commit: Amortizing fsync.

After that, checkpoints (so recovery does not start at the Big Bang of your log), then the ARIES algorithm itself (the 1992 paper that formalised every rule in this chapter), then the analysis, redo, and undo passes. By the end of Build 5 you will be able to write a crash-safe storage engine from first principles, and every line of it will be an incarnation of the inequality page-LSN ≤ flush-LSN.

References

  1. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — the paper that defines LSNs, page-LSNs, and the WAL invariant in its modern form.
  2. PostgreSQL Global Development Group, WAL Internals — XLogRecord and LSN, PostgreSQL 16 documentation — the current reference on Postgres's record format and LSN representation.
  3. Oracle Corporation, InnoDB Redo Log, MySQL 8.0 reference manual — the structure of InnoDB's log, MTR framing, and LSN semantics.
  4. Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — chapter 9 treats log records, LSNs, and the WAL invariant at textbook depth.
  5. Verbitski et al., Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 — an example of LSN semantics in a disaggregated-log architecture.
  6. Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5 gives a concise, vendor-neutral treatment of WAL and recovery that complements the ARIES paper.