In short
A write-ahead log is one file, opened once, appended to forever. Every record in it is immutable: once flushed, its bytes never change. Each record has a Log Sequence Number (LSN) — most commonly the byte offset at which the record starts, sometimes a pure 64-bit counter. LSNs are monotonically increasing: if record B was appended after record A, then LSN(B) > LSN(A).
The four record types you need to make recovery work are BEGIN (a transaction starts, carrying its transaction ID), UPDATE (a modification, carrying both the before-image — what was there — and the after-image — what it became — so you can both redo forward and undo backward), COMMIT (the transaction is durable the moment this record's LSN has been flushed), and CHECKPOINT (a periodic snapshot of which transactions are live and which dirty pages exist, so recovery does not have to scan the whole log from the beginning of time).
Every data page in the buffer pool and on disk carries one extra 8-byte field in its header: the page-LSN, the LSN of the last WAL record that modified this page. The write-ahead rule restated in one line is an invariant over two numbers: page-LSN(p) ≤ flush-LSN for every page p that reaches the disk. The flush-LSN is the highest LSN whose record has been fsynced to the log. If the inequality holds at the moment a page is written to its data file, then the record describing the modification is already durable, and recovery can replay it. If the inequality is ever violated, you have written a page whose modification is not yet in the log — a torn-write-style hazard of a different kind, invisible until the crash. The LSN is the passport every modification carries; the page-LSN is the stamp the page keeps; the flush-LSN is the border the WAL has reached; the invariant is the one rule the border guard enforces.
Open any serious database's data directory on a running system. You will find the data files — the tablespace, the heap files, the index files. And you will find one or more files named like pg_wal/000000010000000000000042 or ib_logfile0 or db.wal. These are the log files. If the data files are the library, the log is the diary of every change the library has ever undergone, in the order the changes happened.
This chapter is about the diary: what goes in each entry, how the entries are numbered, how those numbers get remembered by the pages the entries describe, and what one-line invariant the whole scheme rests on.
The log is a file of immutable records
Forget, for a moment, the sophistication of what the log will eventually do. At its bottom layer it is just a file. Opened once, at database startup. Appended to, record after record, for the life of the process. Never seeked backwards and overwritten — only appended, fsynced, appended, fsynced. When the file gets long, it is rotated: the old file is archived or deleted (after the data it describes is safely in the data files), a new file is opened, and appending continues.
Why append-only: an append never overwrites bytes that might be read during recovery. If the process crashes during an append, the partially-written record at the tail is easy to detect (checksum fails, or the record length extends past end-of-file) and easy to discard — truncate the file to the last good record boundary. In contrast, an in-place update anywhere in the file would require torn-write defences like those you saw in Torn Writes and the Need for a Log. The log sidesteps the hazard by never mutating anything it has already written.
Every record is immutable once the log manager has handed it back to the caller. The bytes of the BEGIN record for transaction 42 are the same at 11pm as they are at 3am the following Tuesday after the server has rebooted twice. This is what makes the log a trustworthy source of truth: the diary never rewrites itself.
The file, end-to-end, gives a total order on every modification the database has ever made. Two records sitting in the same file can always be compared: this one happened before that one, by comparing offsets. That total order is the backbone of recovery — replay the log in order, and you have reconstructed every change up to the last durable record.
The Log Sequence Number — what exactly is it?
The LSN is the name (and the sort key) of a record. Two common implementations.
Byte-offset LSN — the LSN of a record is the byte position at which the record starts in the virtual log. "Virtual" because the log is spread across multiple physical files, but the offsets continue across rotations: if file 1 is 16 MiB and file 2 starts at offset 16 MiB, a record at byte 100 of file 2 has LSN = 16 MiB + 100. This is how Postgres does it — its LSN is written 0/1A2B3C40 (two hex words forming one 64-bit integer), and pg_current_wal_lsn() returns the current tail.
Counter-based LSN — a pure monotonic 64-bit counter incremented on every new record, independent of where the record lands in the file. This is conceptually cleaner (an LSN is a "record number," not an offset) but loses one convenient property of the offset scheme: given a byte offset, you do not know which LSN lives there without a separate index. Most engines use the offset scheme for exactly that reason.
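Both schemes fit in a few lines. A sketch, in which the 16 MiB segment size, the helper names, and the CounterLsn class are all illustrative rather than taken from any particular engine:

```python
# lsn_schemes.py — the two LSN schemes side by side (illustrative names).

SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MiB per physical log file, as in the example

def virtual_lsn(segment_index: int, offset_in_segment: int) -> int:
    """Byte-offset LSN: a position in the virtual log spanning all segments."""
    return segment_index * SEGMENT_SIZE + offset_in_segment

def locate(lsn: int) -> tuple[int, int]:
    """The inverse mapping: which physical segment holds this LSN, and where."""
    return divmod(lsn, SEGMENT_SIZE)

class CounterLsn:
    """Counter-based LSN: a record number, independent of file position."""
    def __init__(self) -> None:
        self._next = 1
    def assign(self) -> int:
        lsn, self._next = self._next, self._next + 1
        return lsn
```

The locate() inverse is the convenience the counter scheme gives up: with a byte-offset LSN, a single divmod tells you which file to open and where to seek.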
Either way, the LSN satisfies three properties that recovery relies on:
- Monotonic. If record B was appended after record A, then LSN(B) > LSN(A). No ties.
- Unique. Every record has its own LSN; no two records share one.
- Comparable. Given two LSNs, you can tell which record came first by comparing their values.
Why 64 bits and not 32: because the log is a high-volume stream. A busy OLTP database writes tens of megabytes of WAL per second; a 32-bit byte-offset LSN would wrap in a matter of minutes, and even a 32-bit record counter would wrap within hours. A 64-bit LSN at 1 GB/s of WAL would take 584 years to wrap. You are never going to see it wrap in production. (If you think you might — you will not.)
The four record types that make recovery work
Recovery needs to answer two questions about every transaction the log describes: (1) did it commit? (2) what did it change? The four record types below, taken together, answer both.
BEGIN — a transaction starts
A BEGIN record records only that a transaction has begun and its transaction ID (xid). It is the anchor against which every subsequent record for the same xid is grouped.
UPDATE — a modification, with before- and after-images
An UPDATE record carries the identity of the page and offset being modified, the before-image (the bytes that were there), and the after-image (the bytes that are now there). Two images, not one, because recovery might need to go either direction.
- After-image (redo) — what the page should look like after the update. If recovery is rolling a committed transaction forward from a pre-crash data page, it applies the after-image.
- Before-image (undo) — what the page looked like before the update. If recovery is rolling an uncommitted transaction backward (because the crash happened before COMMIT), it applies the before-image.
An UPDATE record that omits the before-image is called a redo-only record; it cannot be undone. One that omits the after-image is undo-only; it cannot be redone. The general-purpose record carries both, which is what ARIES requires. You will see the three variants formally in a later chapter; for now, assume both images are present.
COMMIT — the transaction is now durable
A COMMIT record is small: xid and nothing else (sometimes a timestamp). Its importance lies in its placement, not its contents. A transaction is considered durable if and only if its COMMIT record has been flushed to the log — i.e., its LSN is ≤ the flush-LSN. If the crash happens with the UPDATE records on disk but the COMMIT record still in the log buffer, recovery will undo the transaction using the before-images: it never committed.
CHECKPOINT — a snapshot of what is live
A CHECKPOINT record is a periodic meta-record that says, in effect: at this LSN, the following transactions were active (listed by xid) and the following dirty pages existed (listed with their page-LSNs) in the buffer pool. Recovery starts by finding the most recent CHECKPOINT and reading its contents — that gives it a place to start the replay from, rather than starting at the very beginning of the log. We will look at checkpoints in detail in Checkpointing: bounding recovery time.
The records in Python
```python
# log_records.py — the four WAL record types and their serialization.
from __future__ import annotations

import struct
import zlib
from dataclasses import dataclass
from typing import List, Tuple

# Record-type discriminator (first byte of every serialized record).
BEGIN, UPDATE, COMMIT, CHECKPOINT = 1, 2, 3, 4


@dataclass(frozen=True)
class BeginRecord:
    lsn: int  # assigned by the log manager when the record is appended
    xid: int  # transaction id this record begins


@dataclass(frozen=True)
class UpdateRecord:
    lsn: int
    xid: int
    page_id: int   # which data page is being modified
    offset: int    # byte offset within the page
    before: bytes  # bytes that were there (enables UNDO)
    after: bytes   # bytes that are there now (enables REDO)
    prev_lsn: int  # LSN of the previous record for the same xid; 0 if the first


@dataclass(frozen=True)
class CommitRecord:
    lsn: int
    xid: int


@dataclass(frozen=True)
class CheckpointRecord:
    lsn: int
    active_xids: List[int]              # transactions in flight at checkpoint
    dirty_pages: List[Tuple[int, int]]  # (page_id, rec_lsn) — page and its
                                        # earliest unflushed modification's LSN


# --- serialization (fixed header + variable payload) ----------------------

def serialize(rec) -> bytes:
    """Encode a record to bytes. Length-prefixed so the reader can frame records."""
    if isinstance(rec, BeginRecord):
        body = struct.pack(">BQ", BEGIN, rec.xid)
    elif isinstance(rec, UpdateRecord):
        body = struct.pack(">BQQIIQ", UPDATE, rec.xid, rec.page_id,
                           rec.offset, len(rec.before), rec.prev_lsn)
        body += rec.before + rec.after  # both images; equal lengths in practice
    elif isinstance(rec, CommitRecord):
        body = struct.pack(">BQ", COMMIT, rec.xid)
    elif isinstance(rec, CheckpointRecord):
        body = struct.pack(">BI", CHECKPOINT, len(rec.active_xids))
        for x in rec.active_xids:
            body += struct.pack(">Q", x)
        body += struct.pack(">I", len(rec.dirty_pages))
        for pid, rec_lsn in rec.dirty_pages:
            body += struct.pack(">QQ", pid, rec_lsn)
    else:
        raise TypeError(f"unknown record type: {type(rec).__name__}")
    # Prepend 4-byte length and append 4-byte CRC32 so the reader can
    # detect a torn tail (the last record of a crashed process).
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + body + struct.pack(">I", crc)
```
Three details in that Python worth naming.
prev_lsn links every record back to the last record for the same xid. That chain is what lets UNDO walk a transaction's history backward without scanning the whole log: follow prev_lsn from the latest UPDATE back to the BEGIN, applying before-images along the way.
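That backward walk can be sketched in a few lines. Here log_by_lsn is a hypothetical index from LSN to record (represented as plain dicts); a real engine seeks in the log file directly, since a byte-offset LSN is a file position:

```python
# A sketch of how UNDO follows the prev_lsn chain for one transaction.
# `log_by_lsn` and the dict-shaped records are illustrative stand-ins.

def undo_chain(log_by_lsn: dict, last_lsn: int):
    """Yield a transaction's UPDATE records newest-first by following prev_lsn."""
    lsn = last_lsn
    while lsn != 0:                       # prev_lsn == 0 marks the chain's start
        rec = log_by_lsn[lsn]
        if rec["type"] == "UPDATE":
            yield rec                     # caller applies rec["before"] (the undo)
        lsn = rec.get("prev_lsn", 0)
```

The chain visits only this transaction's records, which is the whole point: UNDO never has to scan records belonging to other transactions.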
The before-image and after-image have the same length in a typical fixed-length-row engine. An UPDATE that changes a 200-byte row has a 200-byte before-image and a 200-byte after-image. For variable-length rows, each gets its own length prefix; real engines do this, but the simplified version above assumes equal lengths.
Every record ends with a CRC32. The log reader computes the CRC on each record it parses; if the CRC does not match, the record is assumed to be torn — this is the last record in a crashed process, the one whose append was interrupted. Recovery truncates the file at that point and begins replay from whatever lies before.
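The reader side of that framing can be sketched as follows. The function name read_bodies is illustrative, but the layout matches the serializer above (4-byte length, body, 4-byte CRC32):

```python
# log_reader.py — frame records out of raw log bytes, stopping at a torn tail.
import struct
import zlib

def read_bodies(data: bytes):
    """Yield each intact record body; stop at the first torn record."""
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, pos)
        end = pos + 4 + length + 4                 # length prefix + body + CRC
        if end > len(data):
            break                                  # record extends past EOF: torn
        body = data[pos + 4 : pos + 4 + length]
        (crc,) = struct.unpack_from(">I", data, pos + 4 + length)
        if zlib.crc32(body) & 0xFFFFFFFF != crc:
            break                                  # checksum mismatch: torn
        yield body
        pos = end
```

Everything yielded before the break is trustworthy; everything at and after it is the interrupted tail that recovery truncates.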
Every data page carries the LSN of its last modification
A log by itself does not tell you whether a particular data page on disk is up-to-date with respect to a given WAL record. To answer that, every data page reserves a small region of its header for one 8-byte field: the page-LSN.
Three things to notice about the page-LSN.
It is written on the page, not next to it. When the page is flushed to disk, the page-LSN goes with it. When the page is read back, the page-LSN comes back. This makes the page self-describing with respect to the log: you can tell how up-to-date a page is by looking only at the page.
It is updated in memory before the page is flushed. When the buffer manager applies an UPDATE (copying the new bytes into the page in the pool), it also writes the LSN of the WAL record describing that update into the page-LSN field. The update to the page body and the update to the page-LSN happen together, under the same latch.
On recovery, it tells you which records to skip. Suppose recovery is replaying the WAL record at LSN 1140, which modifies page 9, and reads page 9 from disk. If page 9's page-LSN is 1140 or higher, the modification is already on the page; replaying the record would apply the change twice, so recovery skips it. If page 9's page-LSN is less than 1140, the record has not yet been applied to the on-disk page, and recovery applies it.
The page-LSN is therefore the idempotency marker of recovery: the scheme that makes replay safe to do repeatedly, even if the process crashes during replay itself.
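That skip-or-apply decision fits in one function. A sketch, in which the page object and the apply_change callback are illustrative stand-ins:

```python
# The idempotent-replay check: apply a record to a page at most once.

def redo_one(page, record_lsn: int, apply_change) -> bool:
    """Apply the change only if the page has not seen it yet; True if applied."""
    if page.page_lsn >= record_lsn:
        return False                  # modification already on the page: skip
    apply_change(page)                # copy the after-image into the page
    page.page_lsn = record_lsn        # stamp the page with the record's LSN
    return True
```

Because the stamp is updated together with the body, running this twice for the same record is a no-op the second time, which is exactly what makes replay safe to repeat after a mid-recovery crash.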
The WAL rule, restated in one line
You saw the write-ahead rule in The Write-Ahead Rule: no dirty page is written to disk before its describing log record has been durably written to the log. That rule, recast in the vocabulary of LSNs, is an inequality over two numbers.
Define:
- page-LSN(p) — the LSN stored in page p's header; the LSN of the last WAL record that modified p.
- flush-LSN — the highest LSN whose record has been fsynced to the WAL. Records with LSN ≤ flush-LSN are durable; records with LSN > flush-LSN are still in the in-memory log buffer.
The rule becomes:
For every page p written to its data file:
page-LSN(p) ≤ flush-LSN.
That is the entire WAL invariant. One line. Every other recovery rule in ARIES — the analysis pass, the redo pass, the undo pass, checkpoints, the dirty-page table — is machinery to preserve this one inequality or to exploit the fact that it holds.
Why the inequality works: if page-LSN(p) ≤ flush-LSN, then the WAL record describing the modification on page p is durable. If the crash happens right after the page is flushed, recovery reads the record from the WAL and sees the modification. Whether the page on disk has the modification or not (because the flush could have been interrupted by the crash or might have been a CoW/double-write scenario), recovery can reconstruct the intended state by replaying the record. Conversely, if page-LSN(p) > flush-LSN, the page on disk carries a modification whose WAL record is not durable — if the crash happens, the page has information the log does not, and recovery cannot know it should exist. The invariant is the knot that ties the two durable artefacts together.
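The point at which a buffer manager enforces the inequality is the page flush. A sketch with stand-in names (Wal, flush_to, flush_page), not any engine's real API:

```python
# Enforcing page-LSN(p) <= flush-LSN at the moment a page goes to disk.

class Wal:
    """Minimal stand-in for a log manager's flush bookkeeping."""
    def __init__(self) -> None:
        self.flush_lsn = 0
    def flush_to(self, lsn: int) -> None:
        self.flush_lsn = max(self.flush_lsn, lsn)  # fsync the log up to lsn

def flush_page(page, wal: Wal, write_page) -> None:
    """Write a dirty page, flushing the WAL first if the invariant demands it."""
    if page.page_lsn > wal.flush_lsn:
        wal.flush_to(page.page_lsn)            # the log must reach the page first
    assert page.page_lsn <= wal.flush_lsn      # the WAL invariant, made explicit
    write_page(page)                           # only now may the page hit disk
```

The "write ahead" in write-ahead logging is that first branch: the log is pushed ahead of the page, never the reverse.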
A walk through six records and one page-LSN
A transaction inserts two rows: one into page 7, one into page 9. The WAL evolves as follows.
Step 1 — the transaction begins. The log manager assigns LSN 1000 to the BEGIN record. The record is written into the in-memory log buffer. No fsync yet.
WAL: [BEGIN xid=42 lsn=1000] (in buffer)
flush-LSN = 0 (or the previous batch's tail)
Step 2 — the first insert modifies page 7. The log manager assigns LSN 1048 to the UPDATE record. The buffer manager applies the change to page 7 in the pool and sets page_7.page_lsn = 1048. The WAL buffer now holds BEGIN + UPDATE.
WAL: [BEGIN 1000] [UPDATE xid=42 page=7 lsn=1048] (in buffer)
page 7 in pool: page-LSN = 1048, contents changed
flush-LSN = 0
Step 3 — the second insert modifies page 9. LSN 1140. Page 9 in pool now has page-LSN 1140.
WAL: ... [UPDATE xid=42 page=9 lsn=1140] (in buffer)
page 7 in pool: page-LSN = 1048
page 9 in pool: page-LSN = 1140
flush-LSN = 0
Step 4 — the transaction commits. LSN 1232 for the COMMIT record. Before the client is told "ok", the log manager flushes the WAL buffer up to and including LSN 1232. An fsync returns. Now:
WAL on disk: [BEGIN 1000] [UPDATE 1048] [UPDATE 1140] [COMMIT 1232]
flush-LSN = 1232
pages still in pool (not yet flushed)
The transaction is now durable — even though pages 7 and 9 are still in the buffer pool and have not been written to the data file. This is the whole point of WAL: commit latency depends only on the sequential log flush, not on the random-I/O data-page writes.
Step 5 — sometime later, the buffer manager decides to flush page 7. Before it writes the page to disk, it checks: is page_7.page_lsn ≤ flush_lsn? 1048 ≤ 1232 — yes. The page is safe to flush. The page hits the disk with page-LSN = 1048 in its header.
Step 6 — a checkpoint fires. The checkpoint record at LSN 1320 records that xid=43 is active (suppose another transaction started after COMMIT) and that page 9 is still dirty with rec-LSN = 1140. Now if the process crashes, recovery starts from the checkpoint, sees page 9's rec-LSN = 1140, reads page 9 from disk, compares its page-LSN against 1140 — if disk page-LSN < 1140, replay the UPDATE; otherwise skip.
The page-LSN is the breadcrumb that makes step 6 correct.
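The six steps can be replayed as a tiny simulation. The Wal class and the two page-LSN variables are minimal stand-ins for a log manager and buffer pool, nothing more:

```python
# The walkthrough's numbers, checked in code.

class Wal:
    """In-memory log buffer plus a flush-LSN."""
    def __init__(self) -> None:
        self.buffer, self.flush_lsn = [], 0
    def append(self, lsn: int, rec: str) -> None:
        self.buffer.append((lsn, rec))
    def flush(self) -> None:           # "fsync": everything buffered is durable
        if self.buffer:
            self.flush_lsn = self.buffer[-1][0]

wal = Wal()
page7_lsn = page9_lsn = 0              # page-LSNs of the two pool pages

wal.append(1000, "BEGIN xid=42")                        # step 1
wal.append(1048, "UPDATE page=7"); page7_lsn = 1048     # step 2
wal.append(1140, "UPDATE page=9"); page9_lsn = 1140     # step 3
wal.append(1232, "COMMIT xid=42"); wal.flush()          # step 4: fsync, then "ok"
assert wal.flush_lsn == 1232   # durable, though no data page has been written
assert page7_lsn <= wal.flush_lsn   # step 5: page 7 passes the flush check
assert page9_lsn <= wal.flush_lsn   # page 9 would pass too, whenever it flushes
```

Note what the asserts say: durability arrives at step 4, before any data page touches the disk, and the flush check at step 5 is a pure LSN comparison.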
Common confusions
- "Is the LSN the transaction ID?" No. The LSN names a record in the log; the xid names a transaction. One transaction writes many records (BEGIN, several UPDATEs, COMMIT), and each of those records has its own distinct LSN. You can go from a record to its xid by reading the xid field in the record; you cannot go from an xid to a single LSN because a transaction does not have one LSN.
- "Can LSNs wrap?" Not in any practical database. LSNs are 64-bit. A database writing 1 GB/s of WAL takes 584 years to fill 2^64 bytes. At 100 MB/s it takes 5,840 years. If you see an engine that uses 32-bit LSNs (a toy, a student project), the answer is different — 32-bit LSNs wrap in hours on a busy system, and the engine must handle it. Production engines all use 64 bits.
- "Are the before-image and after-image the whole page?" No — that would make every UPDATE record carry two full 16 KiB images. The images are the changed bytes only, anchored by an offset. An UPDATE that changes a 200-byte row emits a 400-byte record (plus headers), not a 32 KiB record. Full-page images do appear in the WAL in a separate mechanism — Postgres's full_page_writes, which you saw in Torn Writes and the Need for a Log — but that is a torn-write defence, not the standard UPDATE record.
- "If page-LSN is always less than flush-LSN, how does the database commit anything?" The invariant page-LSN ≤ flush-LSN is required at the moment a page is written to the data file, not at all times. In memory, a page in the buffer pool can (and usually does) have page-LSN > flush-LSN — its modification is in the WAL buffer but not yet fsynced. The buffer manager simply does not flush that page yet; it waits until the log has caught up. That wait is what enforces the rule.
- "Does the page-LSN take space I could use for data?" Yes, 8 bytes per page. On a 16 KiB page that is 0.05%. The overhead is negligible compared to what the LSN buys.
- "COMMIT contains the changes, right?" No. COMMIT is a tiny record — xid and a timestamp. The changes are in the UPDATE records that precede it. COMMIT's job is to mark "these prior UPDATEs count"; if COMMIT is missing from the log at recovery, the UPDATEs are undone.
Going deeper
Postgres XLog records — rmgr dispatch and the page image
Postgres's WAL record has a fixed header (XLogRecord) followed by a variable number of block-data sections (XLogRecordBlockHeader + payload) and a variable-length main-data section. The header includes a resource-manager ID (xl_rmid) — a single byte identifying which subsystem owns this record: Heap, Btree, Hash, Gin, Transaction, Standby, etc. Each rmgr has its own redo function, selected by a dispatch table in rmgrlist.h. When recovery reads a record, it looks up xl_rmid, calls that rmgr's rm_redo(record), and the rmgr knows how to apply the change to its pages.
Block-data sections can each carry a full-page image (XLR_BLOCK_ID_FPI) if the page is being logged for the first time since the last checkpoint (the full_page_writes machinery). The LSN is a 64-bit value written as two 32-bit words (xlogid:xrecoff) in textual form; programmatically it is an XLogRecPtr. The flush-LSN is returned internally by GetFlushRecPtr() and exposed to SQL as pg_current_wal_flush_lsn().
InnoDB log records — MLOG types and Mini-Transactions
InnoDB's log is organised around mini-transactions (MTRs). An MTR is a short sequence of log records that together modify one or more pages atomically — for instance, a B-tree page split is one MTR spanning records for the old page and the new page. Each record has a type byte from the MLOG_* enum: MLOG_REC_INSERT, MLOG_REC_UPDATE_IN_PLACE, MLOG_REC_DELETE, MLOG_UNDO_INSERT, MLOG_1BYTE, MLOG_2BYTES, MLOG_4BYTES, MLOG_8BYTES, MLOG_WRITE_STRING, MLOG_CHECKPOINT, and many more. The small-typed variants (MLOG_1BYTE etc.) exist because InnoDB logs fine-grained modifications — changing a single 2-byte field in a page header emits an MLOG_2BYTES rather than a whole-page image.
The InnoDB LSN is a 64-bit byte offset into the redo log (ib_logfile0, ib_logfile1, ... in older MySQL; a single #innodb_redo directory in MySQL 8.0.30+). The flush-LSN is exposed as innodb_os_log_written. MTRs are framed with an MLOG_MULTI_REC_END sentinel so recovery can tell when a multi-record MTR is complete — if a crash leaves the MTR's records on disk without the sentinel, recovery discards the incomplete MTR's effects.
LSN space partitioning in distributed databases
A distributed database cannot use a single byte-offset LSN: there is no single log file. Different systems solve this differently.
- CockroachDB uses per-range Raft logs; each range's log has its own LSN space. Transactions that span ranges coordinate via the distributed-SQL layer and a two-phase-commit-like protocol; the global ordering of "all modifications in system-wide time order" is maintained by hybrid-logical-clock timestamps, not by a global LSN.
- Spanner uses TrueTime timestamps as its global order. Each Paxos group has a local log; global ordering comes from timestamps whose bounded uncertainty is exposed by TrueTime.
- FoundationDB uses a centralised sequencer that assigns a monotonic commit version to every transaction; that version plays the role of a global LSN.
- Aurora decouples the log from the compute nodes entirely — the log is a distributed service, and page servers replay the log to serve reads. LSNs are global in the sense that the log service assigns them centrally.
The pattern is: for a single-node engine, the LSN is literally a byte offset in one file; for a distributed engine, the LSN is either scoped per shard (with a higher-level mechanism for global order) or globally assigned by a central service. Both schemes preserve the local property that a page-LSN on a given replica is comparable to the flush-LSN on that replica — the WAL invariant still holds, just per-shard.
Why LSNs and page-LSNs suffice — a one-paragraph ARIES preview
With LSNs and page-LSNs in hand, ARIES recovery becomes three passes. Analysis: read the log from the last checkpoint to the end, reconstructing the dirty-page table and active-transaction table. Redo: for each log record from the earliest rec-LSN in the dirty-page table to the end, compare the record's LSN against the target page's page-LSN on disk; apply if record.lsn > page.lsn, skip otherwise — this is idempotent replay. Undo: for each transaction that was active at crash (never committed), walk its records backward via prev_lsn, applying before-images to reverse its effects. Every branch of every pass reduces to the same primitive: compare an LSN to another LSN. The WAL invariant is the only thing that makes those comparisons meaningful.
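The analysis pass, for instance, reduces to a short fold over the log tail. The dict-shaped records and field names below are illustrative; redo and undo follow the same compare-two-LSNs shape:

```python
# ARIES pass 1 (analysis), sketched over dict-shaped records.

def analysis(checkpoint: dict, tail: list):
    """Rebuild the active-transaction table and dirty-page table from the
    checkpoint plus the records that follow it in the log."""
    active = set(checkpoint["active_xids"])
    dirty = dict(checkpoint["dirty_pages"])       # page_id -> rec_lsn
    for rec in tail:                              # records after the checkpoint
        if rec["type"] == "BEGIN":
            active.add(rec["xid"])
        elif rec["type"] == "COMMIT":
            active.discard(rec["xid"])
        elif rec["type"] == "UPDATE":
            dirty.setdefault(rec["page_id"], rec["lsn"])  # first touch = rec_lsn
    return active, dirty
```

The two tables it returns feed the other passes: dirty tells redo where to start replaying, and active tells undo which transactions to roll back.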
Where this leads next
You now have the vocabulary of a write-ahead log: records, LSNs, page-LSNs, flush-LSN, and the one inequality that holds it all together. The next question is a performance question, not a correctness one.
Every COMMIT requires an fsync on the WAL — and fsync costs milliseconds. A system that commits a thousand small transactions per second would spend all its time fsyncing if each commit issued its own fsync. The answer is group commit: amortise one fsync across many COMMIT records, trading a little latency for a lot of throughput. That is the next chapter, Group Commit: Amortizing fsync.
After that, checkpoints (so recovery does not start at the Big Bang of your log), then the ARIES algorithm itself (the 1992 paper that formalised every rule in this chapter), then the analysis, redo, and undo passes. By the end of Build 5 you will be able to write a crash-safe storage engine from first principles, and every line of it will be an incarnation of the inequality page-LSN ≤ flush-LSN.
References
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — the paper that defines LSNs, page-LSNs, and the WAL invariant in its modern form.
- PostgreSQL Global Development Group, WAL Internals — XLogRecord and LSN, PostgreSQL 16 documentation — the current reference on Postgres's record format and LSN representation.
- Oracle Corporation, InnoDB Redo Log, MySQL 8.0 reference manual — the structure of InnoDB's log, MTR framing, and LSN semantics.
- Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — chapter 9 treats log records, LSNs, and the WAL invariant at textbook depth.
- Verbitski et al., Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 — an example of LSN semantics in a disaggregated-log architecture.
- Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5 gives a concise, vendor-neutral treatment of WAL and recovery that complements the ARIES paper.