In short

ARIES recovery is three passes — analysis, redo, undo — and this chapter is the first. When the database process comes back up after a crash, the world is a mess. The data files hold pages at unknown ages: some pages were flushed before the crash (with every committed update already baked in), some pages were flushed mid-transaction (with half of an uncommitted transaction's writes already on disk), some pages are ancient (they have not been dirtied in months). The log is intact up to its last good record. The checkpoint record in the log tells recovery what the dirty-page table and transaction table looked like at some instant shortly before the crash. What recovery does not yet know is what changed between that checkpoint and the crash — new transactions that began, old transactions that committed or aborted, and new pages that became dirty. The analysis pass fills in exactly that gap. It scans the log forward, starting at the checkpoint's LSN, and reads every record up to the end-of-log. For each BEGIN it encounters, it adds an entry to the transaction table (TT). For each COMMIT or ABORT, it removes the matching entry. For each UPDATE, it refreshes the last-LSN of the transaction and — if the target page is not already in the dirty-page table (DPT) — it adds it, with the UPDATE's LSN as the page's rec-LSN. When the scan reaches end-of-log, the DPT holds every page that might still need redo (pages whose modification is in the log but whose LSN has not definitely reached disk), and the TT holds every transaction that was in flight at the crash — the ones that never committed and therefore must be undone. Analysis does not touch any data page, does not fsync anything, does not modify the log. It is a pure read pass whose output is two in-memory tables that set the stage for redo and undo. By the end of this chapter you will be able to walk through a small log sample and produce, by hand, the DPT and TT that the next two passes will consume.

At 3:14am the server powers back on. systemd spawns postgres; Postgres reads pg_control and finds a state = in production field — the database did not shut down cleanly. Recovery mode begins.

What does the recovery process see? The data directory is on disk, exactly as it was the instant the power went out. The WAL is there, fsynced up to some LSN. The buffer pool is empty — every page that was in memory is gone. There is no in-memory state at all: no transaction table, no dirty-page table, no list of active xids, no list of held locks. Just files.

The first question every recovery procedure must answer is what transactions were in flight, and what pages might need updates? ARIES calls the pass that answers this the analysis pass, and it is the subject of this chapter.

Recovery is three passes and analysis is first

The ARIES algorithm splits crash recovery into three phases, executed in order. Each one does a different job and leaves a different artefact behind.

The three passes of ARIES recoveryThree boxes arranged horizontally representing the three passes of ARIES recovery. The first box is labelled "1. ANALYSIS" and under it says "scan forward from last checkpoint to end-of-log; build DPT and TT". The second box is labelled "2. REDO" and says "scan forward from earliest rec-LSN in DPT; reapply every record whose page-LSN on disk is older". The third box is labelled "3. UNDO" and says "walk every TT entry backward via prev_lsn; apply before-images to cancel in-flight transactions". Arrows connect the three boxes in sequence. Below, three arrows indicate what each pass consumes and produces: analysis consumes the log and produces DPT + TT; redo consumes DPT + log and produces data-file pages; undo consumes TT + log and produces data-file pages.the three passes of ARIES recovery1. ANALYSISscan forward fromlast checkpoint to end-of-logproduces DPT, TT2. REDOscan forward from earliestrec-LSN in DPT to end-of-logreplays committed work3. UNDOwalk TT entries backwardvia prev_lsn chaincancels in-flight txnsinputs and outputsanalysis:log + checkpointDPT, TT (in memory)redo:DPT + logdata-file pages updatedundo:TT + logdata-file pages rolled backanalysis writes nothing to disk; it is a read pass whose output is two tables.
The three passes of ARIES. Analysis reads the log and produces two in-memory tables. Redo reads the log and the DPT and mutates data-file pages. Undo reads the log and the TT and mutates data-file pages. Only analysis is pure — it does not change anything on disk. That is what lets it run unconditionally, every time recovery starts, with no risk of making things worse.

The separation matters. Analysis is cheap (one sequential log read, tiny in-memory tables), and it has no side effects on durable storage. If it crashes halfway through, the next boot simply runs it again from the beginning — idempotent, safe, repeatable. Redo and undo, in contrast, mutate data-file pages. They must be designed so that their mutations are idempotent (replaying a redo twice on the same page is a no-op), but the easier the precondition analysis gives them, the simpler their implementation.

What analysis has to produce

Analysis is a function from log + checkpoint to (DPT, TT). The two output tables are the same ones you saw in Checkpointing: Bounding Recovery Time, with one difference: the checkpoint captured them at some instant before the crash, whereas analysis must deliver them as they were at the crash itself.

The TT tells undo who to roll back. The DPT tells redo where to start reading the log and which pages to pay attention to. If the two tables are wrong, redo and undo are wrong; if the two tables are right, redo and undo are simple.

Analysis as a forward log scan

Here is the whole algorithm in one function.

# analysis.py — the ARIES analysis pass.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict

# Transaction states tracked during analysis.
TXN_IN_FLIGHT = "in_flight"

@dataclass
class TxnEntry:
    xid: int
    state: str           # "in_flight"; committed/aborted txns are removed from the table
    last_lsn: int        # LSN of the most recent record for this xid

@dataclass
class DptEntry:
    page_id: int
    rec_lsn: int         # LSN of the earliest record for this page that might not be on disk

def analysis_pass(log_manager, control_file) -> tuple[Dict[int, DptEntry], Dict[int, TxnEntry]]:
    """Scan forward from the last checkpoint to end-of-log. Returns (DPT, TT)."""
    # 1. Find the last checkpoint and read its tables.
    ckpt_lsn = control_file.read_last_checkpoint()
    ckpt     = log_manager.read_record(ckpt_lsn)

    dpt: Dict[int, DptEntry]  = {p: DptEntry(p, r) for p, r in ckpt.dirty_pages}
    tt:  Dict[int, TxnEntry]  = {x: TxnEntry(x, TXN_IN_FLIGHT, l) for x, l in ckpt.active_xids}

    # 2. Scan every record after the checkpoint, updating DPT and TT appropriately.
    for rec in log_manager.scan_from(ckpt_lsn + ckpt.size):
        if rec.type == "BEGIN":
            # A new transaction started after the checkpoint was taken. Add it to TT.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)

        elif rec.type == "UPDATE":
            # The transaction is (still) active; refresh its last-LSN.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)
            # If this page is not yet in the DPT, add it with rec-LSN = this record's LSN.
            # The earliest-LSN-wins rule: setdefault leaves an existing rec-LSN unchanged.
            if rec.page_id not in dpt:
                dpt[rec.page_id] = DptEntry(rec.page_id, rec.lsn)

        elif rec.type == "COMMIT":
            # The transaction committed before the crash; remove it from TT.
            # Its updates still need redo (they are on the log, maybe not on disk),
            # but it does not need undo.
            tt.pop(rec.xid, None)

        elif rec.type == "ABORT":
            # The transaction was explicitly rolled back before the crash. Same handling as
            # commit for TT purposes: remove it. Any CLRs it wrote are already in the log
            # and will be reapplied by the redo pass.
            tt.pop(rec.xid, None)

        elif rec.type == "CHECKPOINT":
            # An unusual case: a checkpoint record written AFTER the one we started from.
            # Treat it as authoritative — its DPT/TT reflect a later state than ours.
            # (This happens if a checkpoint began but the control file was not yet updated
            # at crash time.)
            dpt = {p: DptEntry(p, r) for p, r in rec.dirty_pages}
            tt  = {x: TxnEntry(x, TXN_IN_FLIGHT, l) for x, l in rec.active_xids}

        elif rec.type == "CLR":
            # Compensation log record — produced by undo during a previous recovery or
            # by an in-process rollback. It applies the before-image of some UPDATE.
            # Treat it like an UPDATE for DPT purposes: it modifies a page.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)
            if rec.page_id not in dpt:
                dpt[rec.page_id] = DptEntry(rec.page_id, rec.lsn)

    # 3. When the scan hits end-of-log, DPT and TT reflect the state at crash time.
    return dpt, tt

The routine is fewer than forty executable lines and has one obvious structure: a big for loop that dispatches on record type. Every branch does exactly one thing to each table.

setdefault for the DPT. When an UPDATE refers to page p, the page is added to the DPT only if it is not already there, and its rec-LSN is set to the UPDATE's own LSN. That is the earliest-LSN-wins rule. If page p was already in the DPT (inherited from the checkpoint or added by an earlier UPDATE in this scan), its existing rec-LSN is kept — because that earlier LSN is the earliest record whose modification might not be on disk, which is what rec-LSN means.

Remove on COMMIT, not on UPDATE. A transaction stays in the TT until it commits or aborts. Only when the scan encounters the COMMIT or ABORT record does analysis remove the entry. Until then, even after many UPDATEs, the transaction is "in flight" by definition.

CLRs look like updates for bookkeeping. A Compensation Log Record is the trace left by a rollback — either an in-process rollback that happened before the crash, or the output of a previous recovery's undo pass. For analysis purposes a CLR modifies a page, so it must update the DPT the same way an UPDATE does. You will see CLRs in detail in the undo chapter.

A log timeline — what analysis sees

Imagine the log at the moment of the crash. It runs from log creation on the left to the crashed tail on the right. Somewhere near the tail is the last checkpoint record. Analysis starts there and walks rightward.

The log timeline as seen by the analysis passA horizontal timeline from left to right representing the WAL. Near the right end, a vertical marker labelled "last checkpoint (LSN 2000)" divides the timeline into two regions. The right region is shaded and labelled "analysis scans this". Within the right region, colored markers show records: a green dot labelled "COMMIT xid=41", two blue dots labelled "UPDATE xid=42 p=7" and "UPDATE xid=42 p=9", a green dot "COMMIT xid=42", a yellow dot "BEGIN xid=43", two blue dots "UPDATE xid=43 p=12" and "UPDATE xid=43 p=7", a red dot "ABORT xid=44", a yellow dot "BEGIN xid=45", a blue dot "UPDATE xid=45 p=9". The timeline terminates at a crash icon at "LSN 2480: crash — end of log". Below the timeline, three outcomes are shown: "committed (41, 42) → no undo, maybe redo", "aborted (44) → remove from TT", "in flight at crash (43, 45) → undo".the log timeline analysis walks throughearlier log(ignored)checkpoint (LSN 2000)analysis scans this region →COMMIT 41UPD 42 p=7UPD 42 p=9COMMIT 42BEGIN 43UPD 43 p=12UPD 43 p=7ABORT 44BEGIN 45UPD 45 p=9CRASH (2480)what analysis concludes at end-of-logcommitted before crash: 41, 42 — no undo, their UPDATEs may need redoaborted before crash: 44 — removed from TT (its CLRs, if any, handled by redo)in flight at crash: 43, 45 — remain in TT, must be undoneDPT after scan:{ p=7: rec-LSN min(ckpt,2080), p=9: rec-LSN min(ckpt,2120), p=12: rec-LSN 2280 }TT after scan:{ xid=43: last_lsn=2320, xid=45: last_lsn=2480 }
The WAL at the moment of crash. Analysis inherits DPT and TT from the checkpoint at LSN 2000, then walks every record to the end-of-log at LSN 2480. Each BEGIN adds a row to TT; each COMMIT or ABORT removes one; each UPDATE refreshes the transaction's last-LSN and, if the page is new, adds it to DPT with the current record's LSN as rec-LSN. At the end, TT holds only the in-flight transactions (43 and 45) and DPT holds every page that might need redo.

Notice how the scan treats committed, aborted, and in-flight transactions differently. A committed transaction leaves no mark on the TT (the COMMIT record removes it), but its UPDATE records still put pages into the DPT — because those pages might not yet be on disk. An aborted transaction is also gone from TT, but if it wrote any CLRs before aborting, those CLRs have pages of their own in the DPT. An in-flight transaction — one whose BEGIN is in the log but whose COMMIT or ABORT is not — stays in TT at end-of-log, and that is exactly how undo recognises it.

Walking a concrete trace

Here is a small log, rendered as a table. Suppose the checkpoint at LSN 2000 contained: active = { 41, 42 }, dirty = { p=5: rec-LSN 1820 }. Walk through analysis record-by-record.

LSN Record Action on TT Action on DPT
2000 CHECKPOINT initialise TT = { 41:?, 42:? } initialise DPT = { 5: rec-LSN 1820 }
2040 COMMIT xid=41 remove 41 from TT (unchanged)
2080 UPDATE xid=42 page=7 tt[42].last_lsn = 2080 setdefault p=7 → rec-LSN 2080
2120 UPDATE xid=42 page=9 tt[42].last_lsn = 2120 setdefault p=9 → rec-LSN 2120
2160 COMMIT xid=42 remove 42 from TT (unchanged)
2200 BEGIN xid=43 tt[43] = (in_flight, 2200) (unchanged)
2280 UPDATE xid=43 page=12 tt[43].last_lsn = 2280 setdefault p=12 → rec-LSN 2280
2320 UPDATE xid=43 page=7 tt[43].last_lsn = 2320 p=7 already present; leave rec-LSN = 2080
2400 ABORT xid=44 remove 44 (if present; ignored if absent) (unchanged)
2440 BEGIN xid=45 tt[45] = (in_flight, 2440) (unchanged)
2480 UPDATE xid=45 page=9 tt[45].last_lsn = 2480 p=9 already present; leave rec-LSN = 2120

End of log. Final state:

TT  = { 43: last_lsn = 2320, 45: last_lsn = 2480 }
DPT = { 5:  rec-LSN = 1820,
        7:  rec-LSN = 2080,
        9:  rec-LSN = 2120,
        12: rec-LSN = 2280 }

Two things worth pausing on.

Page 7 is modified twice — by xid 42 at LSN 2080 and by xid 43 at LSN 2320 — but its DPT rec-LSN is 2080, not 2320. The rec-LSN is the earliest LSN whose modification might not yet be on disk for this page. The later modification at LSN 2320 is already covered by starting redo from 2080: if redo replays from 2080, it will reach 2320 and apply it too.

Transaction 44 never appears on any row the scan added — but there is an ABORT for it. This is not a bug; it is a race: xid 44 may have begun before the checkpoint (so it was in the checkpoint's ATT) and aborted after. The tt.pop(rec.xid, None) with the None default handles the absence gracefully, so a spurious ABORT for an xid we do not know about is a no-op. (In the table above, 44 is shown as not previously present, illustrating this case. If it were in the checkpoint's ATT, the pop would remove it.)

The final TT has exactly the transactions that will be undone. The final DPT has exactly the pages redo will inspect. Analysis is done.

What analysis does not do

A few things to rule out, because they are easy to assume.

Analysis does not read any data page. The only input is the log. This is deliberate: data-page reads are random I/O and slow, and analysis needs to be fast so the system can move on to redo. Analysis is pure log scan.

Analysis does not apply any UPDATE to any page. It records the fact that an UPDATE exists (by putting the target page in the DPT) but does not execute it. Execution is the redo pass's job.

Analysis does not fsync anything or write to the log. The log file is opened read-only during analysis. (It will be opened for append after redo and undo have converged, when normal operation resumes.)

Analysis does not know whether any specific UPDATE is already on disk. For that, redo will compare each UPDATE's LSN to the page-LSN on the actual on-disk page. Analysis assumes the worst: if a page is dirty in the DPT, maybe its modifications are not on disk. If a page is not in the DPT (it was clean at checkpoint and no UPDATE has referenced it since), it is guaranteed to be up-to-date and redo can skip it entirely — that is the economy the DPT buys.

End-of-log detection and torn-tail records

Analysis keeps reading the log until log_manager.scan_from(...) tells it there are no more records. How does the iterator know where to stop? Each record's frame has a length prefix and a CRC32 (from Log Records, LSNs, and the Log Sequence). The iterator reads the length, reads that many bytes, computes the CRC, and compares it to the trailing CRC field.

Three outcomes at the tail:

  1. End-of-file cleanly — the iterator tried to read the next record's length prefix and got 0 bytes. Analysis stops.
  2. Partial record — the length prefix says "450 bytes of body follow" but only 200 bytes are left in the file before EOF. The append was interrupted by the crash. Analysis truncates the log at the start of this partial record and stops.
  3. Body complete but CRC fails — rarer, this is a torn write within a single record (the disk wrote the length prefix and some of the body but not all the body before powerloss). Same handling: truncate at the start of this record.

After analysis finishes, the log manager atomically truncates the log to the last good record's end offset. The next append (which comes later, after all three passes complete) extends from there. The torn tail is gone; the surviving records are self-consistent.

Common confusions

Going deeper

The 1992 ARIES paper and the distinction of in-doubt transactions

Mohan et al.'s original ARIES paper (TODS 1992) treats analysis with more nuance than the three-state sketch above. Their TT carries a richer per-transaction state enum: U (unprepared, pure in-flight), P (prepared — two-phase commit has passed the prepare phase but not yet received a global commit/abort), C (committed), and E (aborted/ended). States C and E are transient during the scan — as soon as a COMMIT or ABORT is seen, the entry is removed — but state P is retained because a prepared transaction is in doubt: its fate depends on a coordinator's decision that may not yet be in this log. After analysis, in-doubt transactions are queued for the transaction manager to resolve via the prepared-transaction resolution protocol.

The paper also specifies the UndoNxtLSN field for CLRs — a forward-pointer from a CLR to the next record that needs undo, skipping the CLR's own target. This is a subtle but crucial optimisation: without it, undo would re-do-and-undo the same record indefinitely. Analysis records this field on a per-CLR basis as it scans, so undo can consult it without re-reading the log.

The canonical presentation of analysis is §6 of the ARIES paper, and it runs to about eight pages. What we have covered in this chapter is the textbook skeleton; §6.3 of the paper is the authoritative source when you need to build a production implementation.

How Postgres's StartupXLOG compresses analysis and redo

Postgres does not have three cleanly separated passes. Its recovery entry point, StartupXLOG (in src/backend/access/transam/xlog.c), runs a single forward scan from the checkpoint that interleaves what ARIES calls analysis and redo. As each WAL record is read, the dispatcher (RmgrTable[xl_rmid].rm_redo(record)) calls the resource manager's redo function — which both updates any in-memory state (like the xlogreader state that tracks in-flight xids) and applies the change to the target page (if the page-LSN is old enough).

There is no explicit DPT in Postgres. Instead, every page read during redo is checked against the record's LSN directly: if page->pd_lsn < record->lsn, redo applies; otherwise it skips. This works because Postgres does not need the DPT to compute the starting LSN (it just starts at the checkpoint's redo pointer, which is precomputed when the checkpoint is written). The TT also lives in a derived form: the KnownAssignedXids structure records every xid seen without a COMMIT/ABORT, and the set of xids still in it at end-of-log are the "in flight" transactions undo will roll back.

The two-pass (analysis+redo) fusion costs Postgres some conceptual clarity but saves one whole log scan. For a multi-hour-old checkpoint, that single-scan saving is significant — several minutes of recovery time.

How InnoDB's recv_recovery_from_checkpoint_start does it

InnoDB's recovery entry point is recv_recovery_from_checkpoint_start in storage/innobase/log/log0recv.cc. It reads the flushed-up-to LSN from the redo-log header (not a separate CHECKPOINT record; see the checkpoint chapter for why InnoDB differs), then scans forward. As it reads records it populates two hash tables:

The interesting difference from textbook ARIES is that InnoDB gathers the records into per-page lists during "analysis," so redo is not a second log scan but a per-page replay of the pre-computed record lists. The trade-off: more memory during analysis (all redo records are held in memory per page) but fewer disk reads (redo applies records in page order, converting random page writes into sequential reads of page groups).

SQLite — recovery without analysis

SQLite's rollback-journal mode (the default until about 2010) does not use ARIES at all. Its recovery procedure has one pass: read the journal, and for every page-image in the journal, write the image back to the main database file — i.e., it is pure undo of the journal. There is no DPT, no TT, no analysis. This works because SQLite only has one writer at a time and commits are serialised, so "in flight at crash" is at most one transaction and "which pages need redo" is answered by "whichever pages are not in the journal are unchanged."

WAL-mode SQLite (the default since 2010) is a little more complex but still simpler than ARIES: it rolls back partial writes by truncating the WAL to the last committed frame and treats uncommitted frames as never having happened. Again, no analysis pass — the WAL header carries enough metadata to find the last committed frame directly.

The take-away: ARIES analysis pays off when the system supports concurrent transactions and fine-grained locking. For single-writer or whole-page-logged engines, simpler recovery algorithms suffice.

Where this leads next

Analysis hands two tables to the next pass:

The immediate next step is redo. Starting from min(rec-LSN over DPT), redo scans forward through the log and reapplies every UPDATE whose target page has a page-LSN on disk that is older than the record's LSN. The effect is to bring every page on disk into the state it would have been in if the crash had never happened — including the writes of transactions that were in flight at the crash. That sounds wrong (you do not want in-flight transactions' writes to persist!) but it is actually the right order: redo first re-establishes the pre-crash state exactly, and then undo walks the TT backward to cancel the in-flight transactions' writes. Doing the two passes in that order is subtle; it is the subject of the next chapter, which explains why redo before undo is not a mistake.

After redo comes undo, and then the database reopens for normal operation.

References

  1. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §6 is the authoritative treatment of the analysis pass, including the transaction-table state enum and CLR handling.
  2. PostgreSQL Global Development Group, StartupXLOG and Recovery Internals, PostgreSQL 16 documentation — the operational description of Postgres's fused analysis+redo pass.
  3. Oracle Corporation, InnoDB Recovery, MySQL 8.0 reference manual — InnoDB's recv_recovery_from_checkpoint_start and the per-page record grouping that differs from textbook ARIES.
  4. Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — §10.3 covers analysis in the broader context of restart recovery.
  5. Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5.4.2 sketches the three-pass structure concisely for newcomers.
  6. SQLite Consortium, Atomic Commit in SQLite — the journal-based recovery model, a useful contrast to ARIES for understanding when the full three-pass structure is and is not needed.