In short

The redo pass left you with a buffer pool reconstructed to its exact pre-crash state. Every page holds every modification the log said it should — including the modifications made by transactions that had not committed at crash time. Those transactions are the losers, listed in the transaction table (TT) that analysis built. The undo pass's job is to cancel them. For each loser, undo starts at its last-LSN (recorded in the TT) and walks backward through the prev_LSN chain that threads every record the transaction emitted. For each UPDATE record it visits, it writes the record's before-image onto the target page — unconditionally, because redo has already guaranteed the after-image is there to be reversed. But undo does one more thing: every time it writes a before-image, it emits a compensation log record (CLR) recording the undo action. Like a regular UPDATE, a CLR carries an after-image — specifically, the before-image of the record it undoes, now being re-written to the page — plus an undo_next_LSN pointer to the next record in the chain that undo intends to visit. CLRs are redo-only: if the next crash happens mid-undo, the next recovery's redo pass replays the CLRs (putting the reversals back on disk), and the next undo pass picks up from wherever the last CLR's undo_next_LSN points — not from the end of the losers' chains. CLRs are never themselves undone; that is what makes undo a crash-recoverable operation rather than a single-shot gamble. Put the three passes together and you have an algorithm that survives arbitrarily many crashes during recovery itself, and the database comes up in the same final state every time.

The redo pass finished with the buffer pool holding every page at its pre-crash state. Committed transactions' changes are there, because they were logged and redo put them back. Uncommitted transactions' changes are also there — because ARIES repeats history, and it does that deliberately, so that undo's logic can be simple.

Now you have to take the uncommitted changes out.

The transaction table from the analysis pass lists every transaction that was in flight at crash time — no COMMIT record was found, no ABORT record either. These are the losers. For each one, the TT records the LSN of the most recent log entry that transaction emitted. That last-LSN is your starting point for walking backward.

The prev_LSN chain threads a transaction's records

Every log record belonging to a transaction carries a prev_LSN field pointing to the previous record from the same transaction. The BEGIN record has prev_LSN = NULL. Each subsequent record points back to the one before it — a singly linked list embedded in the log, with all the records of one transaction scattered among the records of every other concurrent transaction, but reachable in order from either end.

[Figure: the prev_LSN chain for a single transaction, read from tail to head. A horizontal strip shows the WAL as a series of log records; those belonging to xid=43 are highlighted, others muted. A prev_LSN arrow runs from each xid=43 record backward to the previous xid=43 record, skipping unrelated records: BEGIN at LSN 1095 (prev=NULL), UPDATE p=7 at LSN 1170 (prev=1095), UPDATE p=12 at LSN 1205 (prev=1170), UPDATE p=9 at LSN 1260 (prev=1205), with records of xid=42 and xid=44 interleaved but irrelevant to xid=43's undo. Undo starts at the rightmost xid=43 record, labelled "last-LSN from TT", and walks left — visiting LSN 1260, then 1205, then 1170, then BEGIN — writing a CLR for each UPDATE and stopping at prev=NULL.]
The prev_LSN chain. Every record belonging to xid=43 points back to the previous record of xid=43, skipping over unrelated records in between. Undo enters the chain at the tail (the last-LSN the TT recorded) and walks to the head (the BEGIN record, whose prev_LSN is NULL). The interleaved records from other transactions are ignored — undo only cares about its own chain.

Walking prev_LSN is how ARIES gets away with not keeping any in-memory per-transaction state after the crash. The chain is the transaction's history, woven into the log. Give undo the tail of the chain (from the TT) and it reconstructs the transaction's record list by following pointers backward, fetching one record at a time.
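The backward walk is a few lines when sketched against a toy log. Here `log` is just a dict and `Rec` an illustrative stand-in for the record classes of earlier chapters — not the real engine API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rec:
    lsn: int
    xid: int
    prev_lsn: Optional[int]

def walk_chain(log: dict, last_lsn: int):
    """Yield one transaction's records newest-first by following prev_LSN."""
    lsn = last_lsn
    while lsn is not None:
        rec = log[lsn]
        yield rec
        lsn = rec.prev_lsn

# xid=43's chain from the figure: 1095 <- 1170 <- 1205 <- 1260.
log = {lsn: Rec(lsn, 43, prev)
       for lsn, prev in [(1095, None), (1170, 1095), (1205, 1170), (1260, 1205)]}
visited = [r.lsn for r in walk_chain(log, last_lsn=1260)]
# visited == [1260, 1205, 1170, 1095]
```

Note that the walk fetches one record at a time — exactly the access pattern that lets undo run with no per-transaction state beyond a cursor.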

What undo does at each record

For each record the walk visits, undo has to reverse whatever that record did.

Why undo does not fetch and compare the page-LSN before writing: because redo already did the comparison. After redo, the page carries the after-image of the record undo is about to reverse. Whether the original after-image was on disk before the crash or not, redo has put it there now. So undo writes the before-image unconditionally — no filter, no decision, no fetch-and-check. The simplicity is bought by redo's repeating-history. This is the concrete payoff you were promised two chapters ago.
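A toy contrast makes the asymmetry concrete. The record and page here are plain dicts, not the real engine API, and the real undo would also log a CLR and set the page-LSN to the CLR's LSN — both elided in this sketch:

```python
def redo_apply(page, record):
    # redo's skip-or-apply: replay only if the page has not seen this LSN yet
    if page["page_lsn"] < record["lsn"]:
        page["value"] = record["after"]
        page["page_lsn"] = record["lsn"]

def undo_apply(page, record):
    # no comparison: redo has guaranteed the after-image is present, so the
    # before-image can be installed unconditionally
    page["value"] = record["before"]

page = {"page_lsn": 0, "value": None}
upd = {"lsn": 1260, "before": 0, "after": 42}
redo_apply(page, upd)   # page holds 42, the after-image
undo_apply(page, upd)   # page back to 0 — no fetch-and-check needed
```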

Compensation log records — logging the undo

Here is the problem CLRs solve. Suppose undo has rolled back 3 of the 5 UPDATEs belonging to xid=43 — installed the before-image on page 9 (LSN 1260's target), on page 12 (LSN 1205's), and on page 7 (LSN 1170's). It is about to walk from LSN 1170 to the BEGIN at 1095. At this moment, the data-centre power fails again.

When the database restarts, analysis runs, redo runs — and now what? The log has xid=43's original UPDATEs at LSNs 1170, 1205, 1260. If we blindly run the undo pass again, we will reverse these records a second time. But the first reversal already installed the before-images on disk (or at least wrote them to the buffer pool and redo's repeat will put them back). Reversing twice means we write the before-image's before-image, which is the original after-image — re-installing the uncommitted change that we were trying to get rid of.

The fix: every time undo writes a before-image, it emits a log record describing what it did. That record is the compensation log record (CLR). Structurally a CLR looks like an ordinary UPDATE — it names a page and offset, carries an after-image (the bytes it just wrote, which are the before-image of the record it undid), and has a prev_LSN pointing to the previous record of the transaction being rolled back. But it carries one extra field:

[Figure: CLRs written during undo — the log as it grows. A horizontal timeline shows xid=43's original records at LSN 1095 (BEGIN), 1170 (UPD p=7), 1205 (UPD p=12), 1260 (UPD p=9), then a crash marker. During undo, CLRs are appended: LSN 1400 (CLR p=9, undo_next=1205), LSN 1410 (CLR p=12, undo_next=1170), LSN 1420 (CLR p=7, undo_next=1095, the BEGIN). After LSN 1420 the chain reaches BEGIN, and an END record for xid=43 is written at LSN 1425. CLRs carry the before-image as their after-image; undo_next_LSN points to the next original record to undo.]
What the log looks like after undo runs to completion for xid=43. For each of xid=43's three UPDATE records, undo appended a CLR — each CLR's undo_next_LSN points to the prev_LSN of the record it undid, forming a second chain that leapfrogs over the reversed region. After the last CLR, an END record marks the transaction as fully rolled back.
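As a dataclass, the CLR just described might look like this — a sketch of the shape the code later in the chapter assumes; a real record would also carry type tags, lengths, and checksums:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CLR:
    lsn: int                      # assigned when the record is appended
    xid: int                      # the transaction being rolled back
    page_id: int
    offset: int
    after: bytes                  # the before-image of the record it undoes
    prev_lsn: Optional[int]       # previous record of this xid in the log
    undo_next_lsn: Optional[int]  # the one extra field: next record undo
                                  # will visit; None means only END remains

# CLR 1400 from the figure: undoing the UPDATE at 1260, pointing at 1205.
clr = CLR(lsn=1400, xid=43, page_id=9, offset=0,
          after=b"\x00", prev_lsn=1260, undo_next_lsn=1205)
```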

If the process crashes while undo is in the middle of this sequence — say, between writing CLR 1410 and CLR 1420 — here is what happens on restart. Analysis scans the log. It finds the BEGIN at LSN 1095, it finds LSN 1170, 1205, 1260 (all UPDATEs of xid=43). Then it finds CLR 1400 and CLR 1410 (the two CLRs already emitted). The TT entry for xid=43 has its last-LSN updated to 1410 — the most recent record for the transaction. The transaction is still in flight (no END), so it remains a loser. Redo walks forward from min(rec-LSN) and replays every record — including the CLRs. After redo, the before-images installed by 1400 and 1410 are on disk (because CLRs are redo-only and got replayed exactly like UPDATEs). Undo starts at LSN 1410 (the TT's last-LSN). It sees a CLR. It does not reverse it; it follows undo_next_LSN = 1170 and resumes undo from there. Reverse LSN 1170, emit CLR 1420 (or 1430 — whatever the new LSN assignment is), write END, done.

The whole mechanism works because CLRs are treated specially in two ways.

CLRs are redo-only — they do NOT get undone

The invariant that makes crash-during-undo safe is simple to state:

A CLR is redo-only. Redo applies it. Undo skips it (following its undo_next_LSN to leapfrog over already-undone records). Nothing ever reverses a CLR.

Why is this the right rule? Because the whole point of a CLR is to record an undo. If we could reverse a CLR, we would be un-undoing — putting the uncommitted change back, which is precisely what undo exists to get rid of. A CLR represents progress toward rolling back a loser, and that progress must survive all future crashes.

Why undoing a CLR would break correctness: suppose undo emits CLR 1400 reversing the UPDATE at LSN 1260. The CLR's after-image is the before-image of LSN 1260 — say, the value 0 (the row existed as 0 before xid=43 modified it to 42). Redo of CLR 1400 writes 0 to page 9 at the right offset. If undo then tried to reverse CLR 1400, it would need a before-image of CLR 1400 — but the before-image of CLR 1400 is 42 (the value that was on the page when the CLR was about to overwrite it). Writing 42 back to the page reintroduces xid=43's uncommitted change. The reversal would have exactly cancelled the work the CLR did. A transaction that should end up rolled back would end up un-rolled-back. The only way to break this cycle is to declare that CLRs, once emitted, are never themselves reversed.
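The same 0-versus-42 example as a few lines of toy Python, with plain values standing in for page bytes:

```python
page = 42                          # xid=43's uncommitted after-image, post-redo
update = {"before": 0, "after": 42}

# Undo reverses the UPDATE; the CLR it logs carries only an after-image: 0.
clr = {"after": update["before"]}
page = clr["after"]                # page is now 0 — rolled back

# Hypothetically reversing the CLR would need its before-image, which is 42
# (the value on the page just before the CLR overwrote it):
would_write = update["after"]
# Writing 42 back would reinstate the uncommitted change — hence the rule:
# a CLR, once emitted, is never reversed.
```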

This is implemented concretely in two places.

In the undo walk. When undo fetches the record at its current position, it checks the record type. If it is a CLR, undo does not apply a before-image (there isn't one stored in the CLR — a CLR has only an after-image). Instead, undo reads the CLR's undo_next_LSN field and jumps there. The transaction's "current undo position" is now whatever the CLR pointed to — typically the prev_LSN of the record the CLR was undoing.

In the redo pass. The redo pass does apply CLRs. A CLR is effectively an UPDATE from redo's perspective — it has a page-ID, an offset, an after-image, and an LSN. The skip-or-apply rule runs on it unchanged: if the on-disk page-LSN is less than the CLR's LSN, apply; otherwise skip. Nothing about redo knows or cares that this record was emitted by a previous run's undo pass.
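A minimal sketch of that uniformity — records and pages as plain dicts, with redo's single rule applied to an UPDATE and a CLR alike:

```python
def redo(records, pages):
    """Replay in forward log order; the type tag is never consulted."""
    for rec in records:
        page = pages[rec["page_id"]]
        if page["page_lsn"] < rec["lsn"]:   # the one and only rule
            page["data"] = rec["after"]     # CLR or UPDATE — same field
            page["page_lsn"] = rec["lsn"]

pages = {9: {"page_lsn": 0, "data": None}}
log = [
    {"lsn": 1260, "page_id": 9, "after": 42, "type": "UPDATE"},
    {"lsn": 1400, "page_id": 9, "after": 0,  "type": "CLR"},
]
redo(log, pages)
# pages[9]["data"] == 0 — the CLR's reversal survived the replay
```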

The undo pass in Python

# aries_undo.py — the undo pass, given a TT and a log.
from __future__ import annotations
import heapq
from typing import Dict, Optional

# Reusing types from earlier chapters:
#   UpdateRecord(lsn, xid, page_id, offset, before, after, prev_lsn)
#   BeginRecord(lsn, xid, prev_lsn=None)
#   CLR(lsn, xid, page_id, offset, after, prev_lsn, undo_next_lsn)
#   EndRecord(lsn, xid, prev_lsn)
#   TT: Dict[int, TxnEntry] where TxnEntry has last_lsn

def undo_pass(log: "LogWriter",
              tt: Dict[int, "TxnEntry"],
              buffer_pool: "BufferPool") -> None:
    """Walk every loser backward. Emit a CLR for every UPDATE reversed.
       Emit an END when the chain runs out."""

    # A priority queue ordered by LSN (descending) — at each step we undo
    # the loser with the latest outstanding record. This is not strictly
    # required for correctness, but matches ARIES's prescription and keeps
    # the buffer pool hot for the most recently touched pages.
    to_undo = [(-entry.last_lsn, xid) for xid, entry in tt.items()]
    heapq.heapify(to_undo)

    # Per-xid cursor: the LSN of the next record to visit for this xid.
    cursor: Dict[int, Optional[int]] = {xid: entry.last_lsn
                                        for xid, entry in tt.items()}
    # Per-xid tail of the transaction's log chain: the LSN of the most
    # recent record this xid has in the log. CLRs and the final END thread
    # their prev_lsn here, so the chain stays intact as undo appends.
    chain_tail: Dict[int, int] = {xid: entry.last_lsn
                                  for xid, entry in tt.items()}

    def finish(xid: int) -> None:
        # The loser is fully rolled back: log END, drop all per-xid state.
        log.append(EndRecord(xid=xid, prev_lsn=chain_tail[xid]))
        del tt[xid]
        del cursor[xid]
        del chain_tail[xid]

    while to_undo:
        neg_lsn, xid = heapq.heappop(to_undo)
        lsn = -neg_lsn
        if cursor.get(xid) != lsn:
            # Stale entry (the cursor advanced without removing this one).
            continue

        record = log.read(lsn)

        if isinstance(record, BeginRecord):
            # Chain exhausted. Emit END, remove from TT, do not re-enqueue.
            finish(xid)
            continue

        if isinstance(record, CLR):
            # Leapfrog: the CLR was written by a previous recovery pass.
            # Skip to the record it said to visit next. Do NOT reverse it.
            next_lsn = record.undo_next_lsn
            if next_lsn is None:
                # undo_next of None means the previous run had reversed
                # every UPDATE; all that remained was writing END.
                finish(xid)
            else:
                cursor[xid] = next_lsn
                heapq.heappush(to_undo, (-next_lsn, xid))
            continue

        if isinstance(record, UpdateRecord):
            # Reverse the update: install the before-image, emit a CLR.
            page = buffer_pool.fetch(record.page_id)
            page.apply_after_image(record.offset, record.before)
            # The CLR's LSN is assigned by log.append; its after-image is
            # the before-image we just installed; its undo_next_LSN is the
            # prev_lsn of the record we reversed (where undo goes next).
            clr = CLR(xid=xid,
                      page_id=record.page_id,
                      offset=record.offset,
                      after=record.before,
                      prev_lsn=chain_tail[xid],   # thread into xid's chain
                      undo_next_lsn=record.prev_lsn)
            clr_lsn = log.append(clr)
            chain_tail[xid] = clr_lsn   # the CLR is now the chain's tail
            page.page_lsn = clr_lsn
            page.mark_dirty()

            # Advance the cursor to the reversed record's prev_lsn — the
            # same LSN the CLR's undo_next_lsn records, which is where a
            # future recovery would resume if we crash right now.
            next_lsn = record.prev_lsn
            if next_lsn is None:
                # The reversed record had no predecessor (no materialized
                # BEGIN record); nothing is left to do but emit END.
                finish(xid)
            else:
                cursor[xid] = next_lsn
                heapq.heappush(to_undo, (-next_lsn, xid))
            continue

        raise AssertionError(f"unexpected record type during undo: {record!r}")

The code is short and — by design — free of filters. No DPT check, no page-LSN comparison, no conditional "is this change already on disk?" logic. Undo writes the before-image to the buffer pool, logs the CLR, and advances. Every decision it would otherwise have had to make was already made by redo.

The heapq is one implementation detail worth calling out. ARIES specifies that undo processes the losers in reverse LSN order across all losers at each step — it repeatedly reverses the record with the highest LSN among all active cursors. Doing so keeps pages hot in the buffer pool (the most recently touched transaction's page is likely still cached), and it matches the "walk the log tail-first" mental model. A naive implementation that finishes one transaction completely before starting the next would also be correct, but ARIES's published form is the cross-transaction reverse-LSN walk.
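The ordering is easy to see in miniature — three loser cursors, a negated-LSN heap, and the pops come out in strictly descending LSN order across all transactions:

```python
import heapq

cursors = {43: 1260, 44: 1290, 45: 1188}      # xid -> current undo cursor
heap = [(-lsn, xid) for xid, lsn in cursors.items()]
heapq.heapify(heap)                           # max-heap via negated LSNs

order = []
while heap:
    neg_lsn, xid = heapq.heappop(heap)
    order.append((-neg_lsn, xid))
# order == [(1290, 44), (1260, 43), (1188, 45)] — highest LSN first,
# regardless of which transaction it belongs to
```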

Putting all three passes together — a crash test harness

Here is the top-level recovery function, stitching analysis, redo, and undo into one call. This is the shape every real engine's recovery has, with different tuning parameters.

# recover.py — the three-pass ARIES recovery.

def recover(log_manager, control_file, buffer_pool) -> None:
    """Crash recovery. Runs analysis, then redo, then undo.
       Safe to call any number of times, on any interleaving of crashes."""

    # 1. ANALYSIS — read the log, build DPT and TT.
    dpt, tt = analysis_pass(log_manager, control_file)

    # 2. REDO — replay history. After this, buffer pool = pre-crash state.
    redo_pass(log_manager.log_reader(), dpt, buffer_pool)

    # 3. UNDO — reverse every loser. Emits CLRs as it goes.
    undo_pass(log_manager.log_writer(), tt, buffer_pool)

    # 4. Checkpoint so the next recovery does not have to replay
    #    what we just did. The checkpoint flushes all reconstructed
    #    dirty pages and writes a CHECKPOINT record to the log.
    log_manager.checkpoint(buffer_pool)

A crash during undo and the recovery that follows

xid=43 is a loser with three UPDATEs at LSNs 1170, 1205, 1260, and a BEGIN at 1095. TT[43].last_lsn = 1260.

First recovery attempt.

  • Analysis: DPT = {p=7: 1170, p=9: 1260, p=12: 1205}. TT = {43: {last_lsn: 1260}}.
  • Redo: walks forward from 1170. Reinstalls the after-images of 1170, 1205, 1260 on pages 7, 12, 9 (every apply was needed — none had flushed).
  • Undo starts. The highest-LSN cursor is 1260 (xid=43). Fetch record 1260 (UPDATE p=9). Install its before-image on page 9. Emit CLR 1400 with undo_next_LSN = 1205. Cursor for xid=43 advances to 1205.
  • Next iteration: fetch record 1205 (UPDATE p=12). Install before-image on page 12. Emit CLR 1410 with undo_next_LSN = 1170. Cursor advances to 1170.
  • Power fails here. Undo has emitted CLR 1400 and CLR 1410, but not yet reversed the UPDATE at LSN 1170. The END record for xid=43 has not been written.

Second recovery attempt.

  • Analysis scans the log. It sees BEGIN 1095, UPDATE 1170, UPDATE 1205, UPDATE 1260, CLR 1400, CLR 1410. For xid=43, the analysis pass sets last_lsn = 1410 (the most recent record belonging to xid=43; a CLR counts as belonging to its xid). xid=43 is still in flight — no END was logged. It remains a loser.
    • The DPT may have shifted — if pages 9 and 12 were flushed before the second crash, they drop out of the DPT or re-enter with later rec-LSNs; if not, their entries are whatever the checkpoint record and the log scan reproduce, same as before. Either way, the DPT is correct.
  • Redo walks forward. It applies every UPDATE whose after-image is not on disk, and every CLR whose after-image is not on disk — CLRs are redo-only, which means redo treats them identically to UPDATEs. By the time redo finishes, pages 9 and 12 carry the before-images (because CLRs 1400 and 1410 have been replayed). Page 7 still carries the uncommitted LSN 1170 after-image (because redo replayed LSN 1170, and no CLR for it exists yet).
  • Undo starts. TT[43].last_lsn = 1410. Fetch record 1410. It is a CLR. Do not reverse it — read its undo_next_LSN = 1170 and jump there. Cursor = 1170.
  • Fetch record 1170 (UPDATE p=7). Install before-image on page 7. Emit CLR 1420 with undo_next_LSN = 1095 (the prev_LSN of 1170, which is BEGIN). Cursor = 1095.
  • Fetch record 1095 (BEGIN). Chain exhausted. Emit END. Remove xid=43 from TT.

Final state: every page that held xid=43's uncommitted changes now holds the before-image. The log contains the original records plus CLRs 1400, 1410, 1420 and an END — a full audit trail of the rollback. If a third crash happened between CLR 1420 and END, a third recovery would see the CLR, follow its undo_next_LSN to the BEGIN, and emit END. The algorithm is crash-safe recursively — any number of crashes during recovery is fine, because every step of undo is itself logged as a redo-only record.

Going deeper

InnoDB undo logs in their own tablespace

InnoDB implements undo differently from the ARIES textbook. Instead of weaving undo information into the redo log as CLRs, InnoDB keeps a separate undo log stored in its own tablespace (historically ibdata1, since MySQL 8 an undo-dedicated tablespace — undo_001, undo_002, ...). Every row modification writes two things: a redo record (for crash recovery) and an undo record (for rollback and MVCC read views).

The split means InnoDB's crash recovery is somewhat simpler than the ARIES-prescribed form: redo replays the redo log to reconstruct the buffer pool, and rollback of in-flight transactions at recovery time is done by replaying the undo log for each active transaction — walking the undo log's per-transaction chain and applying each undo record to the corresponding page. Because the undo log lives in ordinary pages of a durable tablespace, and because the redo log covers modifications to those pages like any other page update, InnoDB does not need CLRs in the redo stream: rollback progress becomes durable through the same redo mechanism that protects everything else.

The trade-off is space: InnoDB pays for two parallel logs instead of one. The benefit is that undo records double as MVCC history (older transactions can read the pre-update state by chasing undo pointers). ARIES's unified log, by contrast, has to keep committed transactions' records around long enough for any concurrent reader to reconstruct old versions — which most ARIES descriptions do not even try to address, because they predate MVCC's dominance.

Postgres's different approach — no undo, just MVCC rollback

Postgres does not have an undo pass at all. Its rollback mechanism is radically different: when a transaction inserts or updates a row, it writes a new tuple version to the page, tagged with the transaction's xid. The old version remains in place, also tagged. When a transaction aborts or is a loser at crash time, Postgres does not rewrite any page bytes — it just marks the xid as aborted in the commit log (pg_xact). Every reader's MVCC snapshot check then filters out the aborted transaction's tuples as invisible.

This means Postgres's recovery is two passes, not three: analysis and redo. There is no undo. Loser transactions are abandoned in place; their tuples sit on disk as dead tuples, reclaimed later by VACUUM. For a pure-recovery workload this is enormously simpler — no CLRs, no prev_LSN walk, no two-chain log. For a steady-state workload the cost is tuple bloat and the need for VACUUM to reclaim space.

The Postgres approach works because Postgres is MVCC-native: every tuple carries its xid, and visibility is checked per-read against the commit log. In an engine that updates rows in place — InnoDB does, despite also being MVCC; its old row versions live in the undo log, not on the data page — a loser transaction's bytes cannot simply be "left on the page and ignored": a subsequent reader or writer would see the uncommitted value directly, with no filter. Such an engine must physically reverse the bytes, which is what ARIES's undo pass (or InnoDB's undo-log replay) does.

Nested transactions and savepoints use the same CLR machinery

ARIES extends cleanly to savepoints. When a transaction declares SAVEPOINT s1, the engine records the LSN of s1 (call it save_lsn). If the transaction later issues ROLLBACK TO s1, the engine walks the transaction's prev_LSN chain backward from the current last-LSN, applies before-images, emits CLRs — until the cursor reaches save_lsn. At that point, it stops. The transaction continues as if nothing between s1 and the rollback had happened. The CLRs record the partial rollback in the log, so a crash during ROLLBACK TO s1 is recoverable by the same mechanism that handles a crash during full undo.
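A sketch of ROLLBACK TO as a variant of the undo walk — `log`, `apply_before`, and `append_clr` here are illustrative stand-ins, not a real API; the only change from full undo is the stopping condition:

```python
def rollback_to(log, last_lsn, save_lsn, apply_before, append_clr):
    """Partial rollback: the same walk as full undo, stopped at save_lsn."""
    lsn = last_lsn
    while lsn is not None and lsn > save_lsn:
        rec = log[lsn]
        if rec["type"] == "CLR":
            lsn = rec["undo_next_lsn"]      # leapfrog, exactly as in full undo
            continue
        apply_before(rec)                   # install the before-image
        append_clr(rec)                     # log the reversal as a CLR
        lsn = rec["prev_lsn"]
    # Stop here: the transaction keeps running — no END record is written.

# xid=43 takes SAVEPOINT s1 right after LSN 1170, then updates twice more.
log = {
    1170: {"lsn": 1170, "type": "UPDATE", "prev_lsn": 1095},
    1205: {"lsn": 1205, "type": "UPDATE", "prev_lsn": 1170},
    1260: {"lsn": 1260, "type": "UPDATE", "prev_lsn": 1205},
}
undone = []
rollback_to(log, last_lsn=1260, save_lsn=1170,
            apply_before=undone.append, append_clr=lambda rec: None)
# undone LSNs: [1260, 1205] — 1170 and everything before it survive
```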

Partial rollbacks are one of the features the ARIES paper explicitly called out in its title ("fine-granularity locking and partial rollbacks"), and CLRs are the enabling machinery. The log is doing double duty: recording the forward work of transactions, and recording the reversal work as well.

Where this leads next

You now have the full ARIES algorithm. Analysis, redo, undo — three passes, each with its own data structures, each with a tight locality of reasoning. Together they survive any crash pattern: crash during normal operation, crash during recovery, crash during recovery of a recovery. Every crash leads to a recovery that ends in the same final state, with every committed transaction's work durable and every uncommitted transaction's work gone.

The next chapter is the test. You will build a crash-loop harness — a test framework that runs your database, sends transactions, and sends kill -9 at random points during both normal operation and during recovery itself. Then it checks that after every restart the database comes up in a consistent, expected state. If your implementation of WAL, checkpointing, and the three ARIES passes is correct, the harness runs forever. If any of them is subtly wrong, the harness finds a case where the database comes up corrupted or refuses to come up at all.

Building that harness, seeing it find your bugs, and fixing them until it runs clean for hours — that is how a recovery implementation earns the right to be called correct. It is the subject of kill -9 in a loop: building a recovery test harness.

References

  1. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §5 formalises the undo pass, CLRs, and the undo_next_LSN mechanism.
  2. Mohan, Repeating History Beyond ARIES, VLDB 1999 — retrospective on why CLRs and repeating-history made ARIES crash-recursive.
  3. Oracle Corporation, InnoDB Undo Logs, MySQL 8.0 reference manual — the separate-tablespace approach to undo.
  4. PostgreSQL Global Development Group, MVCC and Vacuuming, PostgreSQL 16 documentation — Postgres's no-undo rollback model.
  5. Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — chapter 11 on crash recovery covers savepoints, partial rollbacks, and CLR equivalents in the broader transaction-processing literature.
  6. Graefe, A Survey of B-tree Logging and Recovery Techniques, ACM TODS 2012 — modern survey of how ARIES extensions handle B-tree structure modifications and nested top actions, both of which lean on CLR-like machinery.