In short
The analysis pass told you two things: the dirty-page table (DPT) — which pages may be out of date on disk, each tagged with its rec-LSN, the earliest WAL record whose change is not yet reflected on disk — and the active-transaction table — which transactions were in flight at the crash. The redo pass uses the DPT. It begins the replay at min(rec-LSN) across all dirty pages — the single earliest LSN whose effect might not be on disk. From that LSN to the tail of the log, it walks forward, record by record. For each UPDATE (or insert, or delete) record, it asks one question: is the affected page's rec-LSN ≤ this record's LSN and is the on-disk page-LSN < this record's LSN? If yes, apply the after-image — the modification was not durably on the page, reinstall it. If no, skip — either the page is not dirty, or the modification is already on disk. The rule is two LSN comparisons. Nothing else. It is idempotent: replaying the same log ten times produces the same final page state as replaying it once, because each apply bumps the page-LSN past the record's LSN, so the next replay skips it. And — the part that surprises everyone the first time they read the ARIES paper — the redo pass replays even uncommitted transactions. The losers' changes get put back, exactly as they were at crash time. Only then does the undo pass reverse them. ARIES calls this "repeating history": the database is rebuilt to its exact pre-crash state first, then the work of uncommitted transactions is rolled back from that known state. It is more work in the abstract, but far simpler to reason about — and because undo itself is logged (with compensation log records), the reasoning is the only thing that matters.
The analysis pass was a reading exercise. You swept the log from the last checkpoint to the end, filling in two tables — one listing dirty pages and their earliest unflushed modification (the DPT), the other listing transactions that were live at the crash (the ATT). You made no changes to any data page. Not one byte of the database was written.
The redo pass is where writing begins. You are going to walk the log again, this time forward from the earliest point the DPT tells you might have work outstanding, and for each record you are going to ask: does the on-disk page already have this modification? If it does not, install it. If it does, skip. That is the entire pass.
It sounds too simple to be a recovery algorithm. The simplicity is the point: ARIES reduced a notoriously subtle problem — how do you put a database back together after a crash, when pages on disk are at different stages of a multi-transaction workload — to two LSN comparisons per record. Everything else, including the fact that min(rec-LSN) is a safe starting point and that idempotence holds, is a consequence of the invariants you built in the earlier chapters.
The goal, stated precisely
Before the crash, every data page in the buffer pool was at some state, and the WAL had recorded every modification leading to that state. The crash did two destructive things. It wiped the buffer pool — all in-memory dirty pages are gone. And it may have left the data files partially updated — some dirty pages had been flushed before the crash (or by the OS, opportunistically), others had not.
The goal of the redo pass is narrow and precise:
For every page on disk, end the redo pass with the page holding exactly the modifications that the log says it should have held at the instant of the crash.
Not just committed modifications. Every modification — including those made by transactions that were uncommitted at crash time. You will reverse the uncommitted ones in the next pass, but first you must have the pre-crash state reconstructed in full. Otherwise you would have to reason about which parts of the log were flushed, which parts of the page file were flushed, and how the two interleave. ARIES sidesteps that reasoning by insisting: first, reconstruct the world as it was; then, surgically undo the losers.
Why reconstruct the world first: consider an uncommitted transaction T that modified page 7, and suppose page 7 was flushed before the crash. The disk page has T's change. The log has T's UPDATE record but no COMMIT. If you tried to undo T directly without a redo pass, you would have to apply T's before-image to the on-disk page — fine. But what if page 7 was not flushed, and the disk page does not have T's change? Now you must not apply the before-image (there is nothing to undo). The decision — "is T's change on disk or not?" — requires exactly the LSN comparison the redo pass already does. Rather than split the logic between redo and undo, ARIES puts all of it in redo: first, bring every page to its pre-crash state (so now you know T's change is on page 7). Then undo can be unconditional.
Start from min(rec-LSN)
The dirty-page table from the analysis pass maps each dirty page to its rec-LSN — the LSN of the earliest WAL record whose change is not yet on disk for that page. If the DPT says {page 7: rec-LSN 1048, page 9: rec-LSN 1140, page 12: rec-LSN 1205}, then:
- Everything on page 7 up to LSN 1047 is already on disk.
- Everything on page 9 up to LSN 1139 is already on disk.
- Everything on page 12 up to LSN 1204 is already on disk.
- Records before LSN 1048 either (a) modify pages not in the DPT at all (so those pages are clean — already durable), or (b) have already been applied to their target pages (so replay would be a no-op).
The earliest LSN from which replay could possibly change any on-disk page is therefore min(rec-LSN) = 1048. Records earlier than that are irrelevant to redo — they describe modifications that are already durable. Records at or after min(rec-LSN) are candidates for replay, and each one is filtered by the skip-or-apply rule below.
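The starting-point arithmetic and the filter-1 skip can be sketched in a few lines of Python. The dpt dictionary and the needs_fetch helper below are illustrative stand-ins, not part of the book's codebase:

```python
# The example dirty-page table from the analysis pass: page_id -> rec_lsn.
dpt = {7: 1048, 9: 1140, 12: 1205}

# The redo pass starts at the earliest LSN that might be missing on disk.
start_lsn = min(dpt.values())
assert start_lsn == 1048

def needs_fetch(dpt, page_id, record_lsn):
    """Filter 1 of the skip-or-apply rule: True only when the page is in the
    DPT and its rec-LSN is at or before this record -- i.e. the record's
    change might still be missing on disk, so the page must be read."""
    return page_id in dpt and dpt[page_id] <= record_lsn

# A record at LSN 1100 targeting clean page 5: skipped without a fetch.
assert not needs_fetch(dpt, 5, 1100)
# A record at LSN 1100 targeting page 9 (rec-LSN 1140): its change is
# already on disk, otherwise the rec-LSN would be 1100 or earlier. Skipped.
assert not needs_fetch(dpt, 9, 1100)
# A record at LSN 1140 targeting page 9: a candidate -- the page must be read.
assert needs_fetch(dpt, 9, 1140)
```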
Why min-rec-LSN is safe: the DPT is a complete list of pages whose on-disk state may lag the log. Any page not in the DPT is clean — its on-disk state is up-to-date with every log record that ever touched it. So a log record that modifies a page not in the DPT can be skipped outright; its effect is already on disk. For pages in the DPT, the rec-LSN is the earliest LSN that might need to be replayed for that page. Taking the minimum across the DPT gives the earliest LSN that might need to be replayed for any page. Records before min-rec-LSN change only pages that are either (a) not dirty (skip) or (b) dirty but at a later rec-LSN than this record's LSN (skip — the change is on disk, otherwise the rec-LSN would be this record's LSN or earlier). Either way: before min-rec-LSN, nothing to do.
The skip-or-apply rule
The redo pass walks forward from min(rec-LSN) to the end of the log. For each record, it runs a decision that is almost tautological in its simplicity.
The two filters before the page-LSN comparison are the important optimisations. The DPT check avoids reading pages that were never dirty. The rec-LSN check avoids reading pages whose earliest unflushed change is later than this record — the record's change is already on disk by the invariant. Only when both filters pass does the redo pass pay the cost of a page read.
The redo pass in Python
```python
# aries_redo.py — the redo pass, given a WAL and a dirty-page table.
from __future__ import annotations

from typing import Dict

# Types from the earlier chapters:
#   UpdateRecord(lsn, xid, page_id, offset, before, after, prev_lsn)
#   dpt: Dict[int, int] mapping page_id -> rec_lsn (from the analysis pass)


def redo_pass(log: "LogReader",
              dpt: Dict[int, int],
              buffer_pool: "BufferPool") -> None:
    """Walk the log forward from min(rec_lsn) and replay what is missing."""
    if not dpt:
        return  # no dirty pages — nothing to redo

    start_lsn = min(dpt.values())  # earliest LSN that might be missing on disk
    for record in log.scan_from(start_lsn):
        # Only records that modify a page are candidates.
        if not _modifies_a_page(record):
            continue  # BEGIN, COMMIT, CHECKPOINT — skip

        pid = record.page_id

        # Filter 1: is the page dirty at all, and is its rec-LSN early enough
        # that this record could still be missing on disk?
        if pid not in dpt:
            continue  # page was never dirty — on disk already
        if dpt[pid] > record.lsn:
            continue  # dirtiness starts after this record —
                      # this record's change is already on disk

        # Filter 2: does the on-disk page already have this modification?
        # Reading the page loads it into the buffer pool; if it is the first
        # time the page is touched in recovery, the page-LSN is whatever was
        # persisted at crash time.
        page = buffer_pool.fetch(pid)
        if page.page_lsn >= record.lsn:
            # The on-disk page-LSN already covers this record. Do not re-apply
            # — that would double the effect for non-idempotent changes. Also,
            # correct the DPT: the earliest unflushed LSN for this page cannot
            # be earlier than the page-LSN we just read.
            dpt[pid] = max(dpt[pid], page.page_lsn + 1)
            continue

        # Apply the after-image. The page is in the pool; this write is
        # to memory. The buffer manager will flush it out at some later point
        # — or the end-of-recovery checkpoint will force it.
        page.apply_after_image(record.offset, record.after)
        page.page_lsn = record.lsn
        page.mark_dirty()

        # Note: rec-LSN in the DPT is NOT updated here. It tracks the earliest
        # unflushed LSN and is set by the analysis pass / buffer manager at
        # the moment the page first becomes dirty.


def _modifies_a_page(record) -> bool:
    """True for UPDATE / INSERT / DELETE records — the ones that carry an
    after-image and a target page. BEGIN / COMMIT / CHECKPOINT are metadata
    and are skipped. (CLRs — next chapter — also carry after-images and are
    replayed by redo, exactly like UPDATEs.)"""
    from log_records import UpdateRecord  # reusing the dataclasses
    return isinstance(record, UpdateRecord)
```
The structure is flat: one loop, two guards, one apply. No recursion, no lookbacks, no cross-record state. Every record is processed in isolation, using only the DPT, the log record itself, and the target page's current page-LSN. That locality is why redo parallelises well in production engines — different pages have independent decisions, and replaying page 7 does not interfere with replaying page 9.
Two details in the code are worth noticing.
The dpt[pid] > record.lsn guard. This is the optimisation that lets recovery read a log record without reading the page. If the DPT says page 9 is dirty starting at rec-LSN 1140, and we are processing record LSN 1100 that modifies page 9, we know immediately: the change at LSN 1100 was flushed before the page became dirty. Skip — no page fetch needed. Without this guard, every UPDATE record would force a page read just to check its page-LSN. With it, the redo pass can burn through enormous stretches of the log without touching the data files.
The DPT correction when page-LSN is already ahead. If we read the page and find that its on-disk page-LSN is newer than the record we are processing — which can happen when the DPT was built from an inexact analysis (for example, the checkpoint recorded an out-of-date rec-LSN, or the OS flushed a page the checkpoint had no way to know about) — we nudge the DPT's rec-LSN up to one past the page-LSN. This does not affect the current record (we already skipped it) but it may let subsequent records skip the fetch via filter 1.
"Repeating history" — why uncommitted transactions are redone too
Here is the line from the ARIES paper that trips up every reader on first pass: "redo is repeating history." Even transactions that did not commit — that were still running at crash time, or that would have aborted — have their UPDATEs replayed in the redo pass. The database is reconstructed to its exact pre-crash state, losers and all.
This is not a performance decision. It is a correctness decision, and the alternative is worse than it looks.
Imagine recovery tried to be clever: during redo, skip any record belonging to a transaction that never committed (as determined from the ATT). Call this "selective redo." It seems smarter — why bother reinstating changes we are about to undo?
The problem: some of the uncommitted transaction's pages might have been flushed, and some might not. Selective redo leaves the flushed ones with the change (because it did not redo, but the disk page carries the change) and leaves the non-flushed ones without the change (because they were lost in the buffer pool and selective redo did not put them back). Now the undo pass has to decide, page by page, whether to apply the before-image (page is carrying the change — undo it) or not apply it (page never had the change — undoing a nonexistent change would corrupt the page). That decision requires, for every undo, exactly the page-LSN comparison that redo would have done. You have saved no work; you have only moved the comparison from one pass to another while making the logic harder to state.
Why repeating history simplifies undo: after redo, every page on disk is at its pre-crash state. The uncommitted transaction T has its changes on every page it touched — if the page was flushed, the change was already there; if the page was not flushed, redo just put it back. Now undo can walk T's records backward and apply each before-image unconditionally. There is no "did this change make it to disk?" question — it is on disk now, because redo put it there. Undo's logic collapses to: for each UPDATE record of T, write the before-image. The skip-or-apply rule of redo is not repeated in undo; it is not needed.
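As a preview, that collapsed undo loop fits in a few lines — a minimal sketch with CLR logging (next chapter) omitted and all names illustrative:

```python
from collections import namedtuple

# A toy UPDATE record; pages is page_id -> bytearray.
Update = namedtuple("Update", "lsn page_id offset before after")

def undo_transaction(records, pages):
    """Walk one loser transaction's UPDATE records newest-first and write
    each before-image back -- no page-LSN check, because redo has already
    guaranteed every one of these changes is present on the page.
    (In the real pass the walk follows prev_lsn, and every reversal is
    logged as a CLR; both are omitted in this sketch.)"""
    for rec in reversed(records):
        page = pages[rec.page_id]
        page[rec.offset : rec.offset + len(rec.before)] = rec.before

# After redo, page 7 carries the loser's change...
pages = {7: bytearray(b"hello ARIES")}
t_records = [Update(1048, 7, 6, b"world", b"ARIES")]
undo_transaction(t_records, pages)
assert bytes(pages[7]) == b"hello world"  # ...and undo reverses it blindly
```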
There is a second, subtler reason repeating history is the right choice: undo itself is logged (with compensation log records, CLRs — next chapter). If undo has to do a page-LSN comparison before writing, those writes get logged too, but now the log has conditional apply-or-skip records whose replayability depends on whether the pre-redo state matched what the analysis pass inferred. You would have recovery's correctness depending on the analysis pass's correctness depending on the checkpoint's correctness — a chain of conditionals. ARIES breaks the chain by insisting: redo first, rebuild the pre-crash state exactly, then undo from a known-good starting point. Every piece of logic after "redo finishes" can assume the world is back to its pre-crash configuration.
Idempotence — why replaying the replay is safe
A recovery algorithm that is not idempotent is a liability. Suppose redo is halfway through the log when the process crashes again — power cut in the data centre, a kernel panic, a careless operator pulling a cable. The database restarts. The analysis pass rebuilds the DPT and ATT from the log. The redo pass starts over from min(rec-LSN). Every record before the second crash gets replayed a second time.
If redo were not idempotent, this would corrupt the database. Every record that had already been applied once before the second crash would be applied again, doubling its effect.
Redo is idempotent by construction. Walk through the two cases for a single record R, page P, where R was applied once before the second crash.
Case 1: page P was flushed between the first apply and the second crash. On disk, page P has page-LSN(P) = R.lsn (the apply set it) and the change is in the page body. The second recovery restarts, builds a new DPT — but P is no longer dirty (it was flushed successfully), so P is either not in the new DPT at all, or its rec-LSN is higher than R.lsn. Either way, filter 1 of the redo rule skips R for page P. No double-apply.
Case 2: page P was not flushed between the first apply and the second crash. The in-memory modification is lost; the on-disk page is whatever it was before the first apply, with page-LSN(P) < R.lsn. The second recovery starts, re-applies R, and sets page-LSN(P) = R.lsn. The page is now in its correct post-R state — the same state it would have been in if the first apply had completed, the page had been flushed, and no second crash had occurred.
In both cases, after redo completes, the page carries R's modification exactly once. That is idempotence: applying the rule to the same (record, page) pair any number of times produces the same final page state.
Why the page-LSN bump matters for idempotence: when redo applies R to page P, it also writes page-LSN(P) = R.lsn atomically with the change. The "atomically" is important — the two writes must happen under the same latch, so a reader cannot see a page with R's new bytes but the old page-LSN (which would trick a subsequent redo into re-applying). In the buffer pool, that is trivial: the pool gives you an exclusive latch on the page, you change the bytes and the page-LSN together, you release the latch. The difficulty is on disk: the page-LSN must land with the same atomicity as the bytes, which is why torn-write defences (full-page writes, double-write buffer) were introduced in earlier chapters. Idempotence of redo depends on the atomicity of the page flush — and that is the one place recovery still leans on the storage engine's durability primitives.
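The two cases compress into a toy demonstration — a bytearray page with a page-LSN, and the apply rule run ten times over. Page and redo_apply are illustrative stand-ins, assuming (as the invariant requires) that the page-LSN bump happens together with the byte write:

```python
# A toy page: a byte array plus a page-LSN, mirroring the invariant that
# the bytes and the page-LSN change together under one latch.
class Page:
    def __init__(self, data, page_lsn):
        self.data = bytearray(data)
        self.page_lsn = page_lsn

def redo_apply(page, lsn, offset, after):
    """The skip-or-apply rule for one record against one page."""
    if page.page_lsn >= lsn:
        return  # already covered -- skip, never double-apply
    page.data[offset:offset + len(after)] = after
    page.page_lsn = lsn  # the bump that makes every later replay skip

page = Page(b"hello world", page_lsn=1020)
for _ in range(10):                # replaying ten times...
    redo_apply(page, 1048, 6, b"ARIES")
assert bytes(page.data) == b"hello ARIES"  # ...equals replaying once
assert page.page_lsn == 1048
```

The first iteration applies and bumps the page-LSN; the other nine hit the skip branch. That is the whole idempotence argument in executable form.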
A trace through three records and three pages
The analysis pass finishes and hands you:
DPT = { page 7:  rec-LSN 1048,
        page 9:  rec-LSN 1140,
        page 12: rec-LSN 1205 }
min(rec-LSN) = 1048
The log from LSN 1048 onward contains (simplified):
LSN 1048 UPDATE xid=42 page=7 after="..."
LSN 1100 UPDATE xid=42 page=5 after="..." <-- page 5 is not in DPT
LSN 1140 UPDATE xid=42 page=9 after="..."
LSN 1170 UPDATE xid=43 page=7 after="..." <-- second update to page 7
LSN 1205 UPDATE xid=43 page=12 after="..."
LSN 1232 COMMIT xid=42
LSN 1260 UPDATE xid=43 page=9 after="..."
At crash time, pages 7 and 9 were at various disk states. Walk through redo record by record.
LSN 1048, page 7. Filter 1: page 7 in DPT, rec-LSN = 1048 ≤ 1048. Proceed. Fetch page 7 from disk. Suppose its on-disk page-LSN is 1020 (the last record that touched page 7 before xid=42's update, already durable). Filter 2: 1020 < 1048. Apply the after-image. Set page-LSN(7) = 1048.
LSN 1100, page 5. Filter 1: page 5 not in DPT. Skip — page 5 is clean, the modification at LSN 1100 is already durable on disk. No page fetch.
LSN 1140, page 9. Filter 1: page 9 in DPT, rec-LSN = 1140 ≤ 1140. Proceed. Fetch page 9 from disk. Its on-disk page-LSN is 1090 (some earlier modification that had flushed). Filter 2: 1090 < 1140. Apply. Set page-LSN(9) = 1140.
LSN 1170, page 7. Filter 1: page 7 in DPT, rec-LSN = 1048 ≤ 1170. Proceed. Page 7 is already in the buffer pool (we fetched it at LSN 1048 and applied to it). Its page-LSN is now 1048. Filter 2: 1048 < 1170. Apply. Set page-LSN(7) = 1170. The buffer pool caches the page across records, so this is a memory-only operation.
LSN 1205, page 12. Filter 1: page 12 in DPT, rec-LSN = 1205 ≤ 1205. Proceed. Fetch page 12. Suppose on-disk page-LSN = 1205 already (the page had been flushed between the buffer manager dirtying it and the crash). Filter 2: 1205 < 1205 is false. Skip — already on disk. Nudge the DPT: dpt[12] = max(1205, 1205 + 1) = 1206, meaning any future record with LSN < 1206 that targets page 12 can skip without a fetch.
LSN 1232, COMMIT. Not a page-modifying record. Skip. (The ATT will mark xid=42 as committed, which matters for the undo pass — xid=42 is a winner, not to be undone. But redo does not care.)
LSN 1260, page 9. Filter 1: page 9 in DPT, rec-LSN = 1140 ≤ 1260. Proceed. Page 9 is cached (we applied LSN 1140 to it). Its page-LSN is 1140. Filter 2: 1140 < 1260. Apply. Set page-LSN(9) = 1260.
Redo finishes. The buffer pool now holds pages 7, 9, and 12 in their exact pre-crash states. Importantly, page 9 carries xid=43's update at LSN 1260 — and xid=43 never committed. That is history being repeated. The undo pass will walk xid=43 backward and reverse the LSN 1260 change (and the LSN 1170 and 1205 changes) using the before-images. But that is the next chapter.
The tally across the seven records: three page fetches from disk (pages 7, 9, 12). The COMMIT was skipped before the filters even ran; the LSN 1100 record was rejected by filter 1 without a fetch; the LSN 1205 record was fetched and then skipped by filter 2; the LSN 1170 and 1260 records applied to pages already cached in the pool. The redo pass touched disk only where it had to.
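The whole trace fits in a short, self-contained harness — toy Update/Commit records and a fetch-counting pool (all names illustrative), replaying the seven records above:

```python
from dataclasses import dataclass

@dataclass
class Update:
    lsn: int
    page_id: int

@dataclass
class Commit:
    lsn: int

class Page:
    def __init__(self, page_lsn):
        self.page_lsn = page_lsn

class Pool:
    """Toy buffer pool: on-disk page-LSNs from the trace; counts real fetches."""
    def __init__(self, disk):
        self.disk, self.cache, self.fetches = disk, {}, 0
    def fetch(self, pid):
        if pid not in self.cache:
            self.fetches += 1
            self.cache[pid] = Page(self.disk[pid])
        return self.cache[pid]

dpt = {7: 1048, 9: 1140, 12: 1205}
disk_page_lsn = {5: 1100, 7: 1020, 9: 1090, 12: 1205}
log = [Update(1048, 7), Update(1100, 5), Update(1140, 9),
       Update(1170, 7), Update(1205, 12), Commit(1232), Update(1260, 9)]

pool = Pool(disk_page_lsn)
for rec in log:
    if not isinstance(rec, Update):
        continue                          # COMMIT: not page-modifying
    if rec.page_id not in dpt or dpt[rec.page_id] > rec.lsn:
        continue                          # filter 1: no fetch needed
    page = pool.fetch(rec.page_id)
    if page.page_lsn >= rec.lsn:
        dpt[rec.page_id] = max(dpt[rec.page_id], page.page_lsn + 1)
        continue                          # filter 2: already on disk
    page.page_lsn = rec.lsn               # apply (byte change elided here)

assert pool.fetches == 3                  # pages 7, 9, 12 -- page 5 never read
assert pool.cache[7].page_lsn == 1170
assert pool.cache[9].page_lsn == 1260
assert dpt[12] == 1206                    # the nudge from the LSN-1205 skip
```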
Common confusions
- "Why does redo apply uncommitted transactions? Surely that is wasted work." Because the alternative — selective redo — requires exactly the same page-LSN comparison, moved into the undo pass, making undo's logic harder without saving any work. After redo repeats history, undo becomes a simple backward walk applying before-images unconditionally. The total CPU and I/O is comparable; the reasoning is much simpler. See the "repeating history" section above.
- "What if the log has records before min(rec-LSN)?" Those records are not replayed. By the DPT invariant, every page modified by a record at LSN < min(rec-LSN) either (a) is not in the DPT at all — the page is clean, all its modifications are on disk — or (b) is in the DPT with a rec-LSN strictly greater than that record's LSN — the record's modification is already on disk (otherwise the rec-LSN would be that record's LSN or earlier). Skipping them is safe and makes recovery dramatically faster on a long-running database.
- "Does redo fsync after each apply?" No. Redo applies changes into the buffer pool; it does not flush data pages as it goes. A single fsync of the WAL at the start of recovery (to make sure the log tail itself is durable) and a checkpoint at the end of recovery (to flush the reconstructed dirty pages) are typically all the fsyncs the redo pass involves. The point is not to make the log durable again — it already is, which is why recovery can trust it — but to bring the data pages into line with what the log says.
- "What if a record targets a page that is no longer in the database?" This happens if the page was deallocated (a DROP TABLE, a free-space reclaim) after the record was written. The redo code has to handle it — either by recognising deallocated-page records and skipping ordinary UPDATEs to deallocated pages, or by logging the deallocation itself as a separate record type. Production engines do the latter: deallocation has its own log record, and replaying the log in order will deallocate the page at the right moment, so subsequent UPDATEs to it are skipped (the page is no longer valid).
- "Is the buffer pool empty when redo starts?" Effectively yes. The crash cleared the pool. The redo pass fetches pages from disk as it needs them and caches them — when the next record targets the same page, the fetch is already cached. By the end of redo, the hot pages are all in the pool, warm for post-recovery work.
- "What stops redo from running forever?" Redo stops at the tail of the log — the last record whose CRC validates. The log reader, walking forward, reaches a record whose CRC fails or whose length extends past end-of-file; that record is the torn tail from the crash. Redo stops just before it. The analysis pass has already noted the end-LSN, so redo knows where to stop without any side-channel signal.
- "If redo re-reads the page before applying, is it slow?" The page read is a single buffer-pool fetch (or a single disk I/O if the page is cold). For a long log with many records per page, the cost is amortised — one fetch, many applies. For a log with one record per page, recovery really is slow — that is why checkpoints matter and why you tune max_wal_size and checkpoint_timeout. Recovery time in ARIES is roughly O(records to replay) in CPU plus one I/O per unique dirty page, and the checkpoint interval bounds the first factor.
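The torn-tail detection described above can be sketched against a hypothetical frame layout — a 4-byte little-endian length, the payload, then a CRC-32 of the payload. The layout is an assumption for illustration, not the book's actual log format:

```python
import struct
import zlib

def scan_to_tail(buf):
    """Walk a byte buffer of framed log records (hypothetical frame:
    4-byte length, payload, 4-byte CRC-32 of the payload) and return the
    offset of the torn tail -- the first frame that fails to validate."""
    off = 0
    while off + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, off)
        end = off + 4 + length + 4
        if end > len(buf):
            return off              # frame runs past end-of-file: torn write
        payload = buf[off + 4 : off + 4 + length]
        (crc,) = struct.unpack_from("<I", buf, off + 4 + length)
        if crc != zlib.crc32(payload):
            return off              # CRC mismatch: torn write
        off = end
    return off

def frame(payload):
    return (struct.pack("<I", len(payload)) + payload
            + struct.pack("<I", zlib.crc32(payload)))

log = frame(b"rec-1") + frame(b"rec-2")
torn = log + frame(b"rec-3")[:7]        # crash mid-write of the third record
assert scan_to_tail(torn) == len(log)   # redo stops exactly after rec-2
```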
Going deeper
Physical vs logical vs physiological redo
A log record describes a modification. How the modification is described determines how redo replays it.
Physical redo — the record contains the exact bytes that should appear on the page after the modification. Replay is "write these bytes at this offset." This is the simplest form: it is obviously idempotent, obviously bit-for-bit deterministic, and requires no structural understanding of the page. Postgres's log is predominantly physical (with a twist — see physiological below). The cost is record size: a single-field update on a 200-byte row emits a record carrying 200 bytes of before-image and 200 bytes of after-image, even if only 4 bytes of the row changed.
Logical redo — the record contains a high-level description of the modification: "insert (key=5, value='hello') into table T." Replay is "execute this insert against the current state of the page." This is compact — the record size does not depend on the row size — but it has a fatal problem for physical recovery: the outcome depends on the current state of the page. If redo applies the insert to a page whose free space has shifted because of earlier replays, the insert might land in a different byte offset, which breaks every subsequent record that targets that offset. Pure logical redo is infeasible for a page-oriented storage engine. It is used in higher layers (Postgres's logical replication, for instance, which targets tuples not pages) but not in the redo pass.
Physiological redo — the middle ground that real engines actually use. A record is logical within a page (describing the modification in terms of a slot number or a key) and physical across pages (naming the specific page). Replay says "apply this insert to slot 3 of page 9." The page's internal layout can change between runs of the engine — a VACUUM or a reorganisation might have moved slot 3 to a different offset — and replay still works because it refers to the slot by number, not by byte position. InnoDB's redo log is almost entirely physiological: MLOG_REC_INSERT names the page, the type of record, and the logical insertion point; MLOG_2BYTES names a 2-byte field within the page by its offset.
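The "logical within a page" half can be sketched with a toy slotted page where the record names a slot number, not a byte offset. SlottedPage and apply_rec_insert are illustrative, loosely modelled on what an MLOG_REC_INSERT apply function does:

```python
# A toy slotted page: records stored in a Python list, addressed by slot
# number rather than byte offset. (A real page has a slot directory and
# free-space pointers; the list stands in for both.)
class SlottedPage:
    def __init__(self, page_lsn, records):
        self.page_lsn = page_lsn
        self.records = list(records)

def apply_rec_insert(page, lsn, slot, row):
    """Replay a physiological insert: 'insert row at slot N of this page'.
    Where the bytes land inside the page is the page code's concern, not
    the log record's -- the page layout can differ between runs and the
    replay still lands in the right logical place."""
    if page.page_lsn >= lsn:
        return                  # the same skip-or-apply rule as physical redo
    page.records.insert(slot, row)
    page.page_lsn = lsn

page = SlottedPage(1090, ["row-a", "row-c"])
apply_rec_insert(page, 1140, 1, "row-b")   # insert at slot 1
assert page.records == ["row-a", "row-b", "row-c"]
```

Note that the skip-or-apply guard is what keeps this idempotent: a list insert, unlike a byte overwrite, is not naturally idempotent, so the page-LSN check is doing real work here.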
ARIES as described in the 1992 paper supports all three kinds, and practical engines pick the one that matches their page format. Postgres uses physical-with-full-page-images. InnoDB uses physiological. LMDB, being copy-on-write and not log-based, sidesteps the question entirely.
InnoDB's physiological records — MLOG types in redo
The redo pass in InnoDB is organised around replaying MLOG records within mini-transactions. An MTR is framed by MLOG_MULTI_REC_END; during redo, the pass reads all records of an MTR into a buffer, verifies the frame, then applies them to their target pages in order. If the MTR is incomplete (crash occurred in the middle of its records), the whole MTR is discarded.
A record like MLOG_REC_INSERT carries: the page ID (space + page number), the type of the index whose record is being inserted, the position at which to insert (in terms of the logical record sequence within the page), and the row data. Redo reads the page, walks to the insertion position, and inserts the record — without knowing or caring what byte offset the insertion happens at. The page's internal slot directory and free-space pointers are updated by the insert code, exactly as they would be at runtime.
Because the log record describes what to do rather than where the bytes go, InnoDB's redo log is compact: a 32-byte row insertion might cost 40–50 bytes of log, compared to Postgres's ~100+ bytes (a physical record plus its metadata). The cost is that InnoDB's redo code is more complex — every log record type has a dedicated apply function that understands the page's internal structure.
The page-LSN comparison is identical to the physical case: MLOG_REC_INSERT at log LSN L applies to page P only if P's on-disk page-LSN is less than L. The skip-or-apply rule is structural; the apply step is what varies by log-record family.
Parallel redo
On a large log (tens of gigabytes of replay), sequential redo is slow. Modern engines parallelise. The insight: redo decisions for different pages are independent — replaying LSN 1048 on page 7 and LSN 1170 on page 9 can happen concurrently. The engine partitions the log by target page, hashes (page-ID → worker), and each worker replays its own partition's records in LSN order.
Coordination is required at two points. First, cross-page records (B-tree page splits, which modify two pages atomically) must replay both pages' records in lockstep — either by sending the record to both workers with a synchronisation barrier, or by centralising multi-page records on a single worker. Second, the end-of-redo barrier: before the undo pass starts, all workers must finish, so undo sees a fully reconstructed buffer pool.
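The partitioning step can be sketched as a hash of the target page — a toy, single-process illustration of the invariant that each page's records land in exactly one partition, still in LSN order:

```python
from collections import defaultdict, namedtuple

Rec = namedtuple("Rec", "lsn page_id")

def partition_log(records, n_workers):
    """Assign each page-modifying record to a worker by hashing its target
    page. Records arrive in LSN order, so each partition stays in LSN
    order -- and all of one page's records share a partition, which is
    what lets workers replay concurrently without coordination (multi-page
    records, not modelled here, would need a barrier)."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec.page_id) % n_workers].append(rec)
    return parts

log = [Rec(1048, 7), Rec(1140, 9), Rec(1170, 7), Rec(1205, 12), Rec(1260, 9)]
parts = partition_log(log, 2)

# No page appears in two partitions, and per-partition LSN order holds.
owner = {}
for worker, recs in parts.items():
    for r in recs:
        assert owner.setdefault(r.page_id, worker) == worker
assert all(recs == sorted(recs) for recs in parts.values())
```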
Postgres does not yet parallelise replay in core; Postgres 15 added WAL prefetching (recovery_prefetch), which reads the pages a record will need ahead of the single applying process and removes much of the I/O stall. InnoDB parallelised redo apply in the MySQL 8.0 series. Aurora's log-structured storage is effectively parallel redo by construction: each page server replays the log for its own pages independently.
Where this leads next
Redo repeats history. The database is now in its exact pre-crash state — committed transactions' changes are on the pages, and uncommitted transactions' changes are on the pages too. The next pass's job is to remove the second group.
Undo walks each transaction in the ATT backward via the prev_lsn chain. For every UPDATE record it finds, it writes the before-image to the page — unconditionally, because redo has already ensured the change is present. And critically, undo logs its own writes: every reversal emits a compensation log record (CLR), a redo-only record that records what was undone. CLRs make the undo pass itself crash-recoverable: if the process crashes during undo, the next recovery's redo pass replays the CLRs, completing the reversal that the previous run started.
That is ARIES: the undo pass and compensation log records — the third and final act of the recovery algorithm, and the one that makes ARIES robust to crashes during recovery itself.
References
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §4 formalises the redo pass, the skip-or-apply rule, and the "repeating history" paradigm.
- PostgreSQL Global Development Group, WAL Internals — Recovery, PostgreSQL 16 documentation — the current reference on Postgres's redo implementation and rmgr dispatch.
- Oracle Corporation, InnoDB Recovery, MySQL 8.0 reference manual — parallel redo, MLOG-record apply, MTR framing.
- Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — chapter 10 covers redo logging at textbook depth with physical/logical/physiological variants.
- Härder and Reuter, Principles of Transaction-Oriented Database Recovery, ACM Computing Surveys 1983 — the taxonomy paper that named the STEAL/NO-STEAL and FORCE/NO-FORCE buffer-management classes on which redo's and undo's obligations depend.
- Verbitski et al., Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases, SIGMOD 2017 — an example of parallel redo in a disaggregated-storage architecture, where page servers replay the log independently.