In short
ARIES recovery is three passes — analysis, redo, undo — and this chapter is the first. When the database process comes back up after a crash, the world is a mess. The data files hold pages at unknown ages: some pages were flushed before the crash (with every committed update already baked in), some pages were flushed mid-transaction (with half of an uncommitted transaction's writes already on disk), some pages are ancient (they have not been dirtied in months). The log is intact up to its last good record. The checkpoint record in the log tells recovery what the dirty-page table and transaction table looked like at some instant shortly before the crash. What recovery does not yet know is what changed between that checkpoint and the crash — new transactions that began, old transactions that committed or aborted, and new pages that became dirty. The analysis pass fills in exactly that gap. It scans the log forward, starting at the checkpoint's LSN, and reads every record up to the end-of-log. For each BEGIN it encounters, it adds an entry to the transaction table (TT). For each COMMIT or ABORT, it removes the matching entry. For each UPDATE, it refreshes the last-LSN of the transaction and — if the target page is not already in the dirty-page table (DPT) — it adds it, with the UPDATE's LSN as the page's rec-LSN. When the scan reaches end-of-log, the DPT holds every page that might still need redo (pages whose modification is in the log but whose LSN has not definitely reached disk), and the TT holds every transaction that was in flight at the crash — the ones that never committed and therefore must be undone. Analysis does not touch any data page, does not fsync anything, does not modify the log. It is a pure read pass whose output is two in-memory tables that set the stage for redo and undo. By the end of this chapter you will be able to walk through a small log sample and produce, by hand, the DPT and TT that the next two passes will consume.
At 3:14am the server powers back on. systemd spawns postgres; Postgres reads pg_control and finds the state field still set to "in production": the database did not shut down cleanly. Recovery mode begins.
What does the recovery process see? The data directory is on disk, exactly as it was the instant the power went out. The WAL is there, fsynced up to some LSN. The buffer pool is empty — every page that was in memory is gone. There is no in-memory state at all: no transaction table, no dirty-page table, no list of active xids, no list of held locks. Just files.
The first question every recovery procedure must answer is: what transactions were in flight, and which pages might still need their logged updates applied? ARIES calls the pass that answers this the analysis pass, and it is the subject of this chapter.
Recovery is three passes and analysis is first
The ARIES algorithm splits crash recovery into three phases, executed in order. Each one does a different job and leaves a different artefact behind.
The separation matters. Analysis is cheap (one sequential log read, tiny in-memory tables), and it has no side effects on durable storage. If it crashes halfway through, the next boot simply runs it again from the beginning — idempotent, safe, repeatable. Redo and undo, in contrast, mutate data-file pages. They must be designed so that their mutations are idempotent (replaying a redo twice on the same page is a no-op), but the cleaner the starting state analysis hands them, the simpler their implementations can be.
What analysis has to produce
Analysis is a function from log + checkpoint to (DPT, TT). The two output tables are the same ones you saw in Checkpointing: Bounding Recovery Time, with one difference: the checkpoint captured them at some instant before the crash, whereas analysis must deliver them as they were at the crash itself.
- Transaction table (TT) — one row per transaction that was in flight at the crash. A transaction is in flight if it began and has not yet committed or aborted. Each row stores at minimum the xid, the transaction's state (the only state that survives analysis is "in flight" — all committed ones are gone), and the LSN of the transaction's most recent log record. That last-LSN is the anchor undo will walk backward from.
- Dirty-page table (DPT) — one row per page that might have a WAL record whose modification has not yet reached the on-disk page. Each row stores the page ID and its rec-LSN: the LSN of the earliest WAL record for this page whose modification might not yet be on disk.
The TT tells undo who to roll back. The DPT tells redo where to start reading the log and which pages to pay attention to. If the two tables are wrong, redo and undo are wrong; if the two tables are right, redo and undo are simple.
Analysis as a forward log scan
Here is the whole algorithm in one function.
```python
# analysis.py — the ARIES analysis pass.
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict

# Transaction states tracked during analysis.
TXN_IN_FLIGHT = "in_flight"


@dataclass
class TxnEntry:
    xid: int
    state: str     # "in_flight"; committed/aborted txns are removed from the table
    last_lsn: int  # LSN of the most recent record for this xid


@dataclass
class DptEntry:
    page_id: int
    rec_lsn: int   # LSN of the earliest record for this page that might not be on disk


def analysis_pass(log_manager, control_file) -> tuple[Dict[int, DptEntry], Dict[int, TxnEntry]]:
    """Scan forward from the last checkpoint to end-of-log. Returns (DPT, TT)."""
    # 1. Find the last checkpoint and read its tables.
    ckpt_lsn = control_file.read_last_checkpoint()
    ckpt = log_manager.read_record(ckpt_lsn)
    dpt: Dict[int, DptEntry] = {p: DptEntry(p, r) for p, r in ckpt.dirty_pages}
    tt: Dict[int, TxnEntry] = {x: TxnEntry(x, TXN_IN_FLIGHT, l) for x, l in ckpt.active_xids}

    # 2. Scan every record after the checkpoint, updating DPT and TT appropriately.
    for rec in log_manager.scan_from(ckpt_lsn + ckpt.size):
        if rec.type == "BEGIN":
            # A new transaction started after the checkpoint was taken. Add it to TT.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)
        elif rec.type == "UPDATE":
            # The transaction is (still) active; refresh its last-LSN.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)
            # If this page is not yet in the DPT, add it with rec-LSN = this record's LSN.
            # The earliest-LSN-wins rule: setdefault leaves an existing rec-LSN unchanged.
            dpt.setdefault(rec.page_id, DptEntry(rec.page_id, rec.lsn))
        elif rec.type == "COMMIT":
            # The transaction committed before the crash; remove it from TT.
            # Its updates still need redo (they are in the log, maybe not on disk),
            # but it does not need undo.
            tt.pop(rec.xid, None)
        elif rec.type == "ABORT":
            # The transaction was explicitly rolled back before the crash. Same handling
            # as commit for TT purposes: remove it. Any CLRs it wrote are already in the
            # log and will be reapplied by the redo pass.
            tt.pop(rec.xid, None)
        elif rec.type == "CHECKPOINT":
            # An unusual case: a checkpoint record written AFTER the one we started from.
            # Treat it as authoritative — its DPT/TT reflect a later state than ours.
            # (This happens if a checkpoint began but the control file was not yet
            # updated at crash time.)
            dpt = {p: DptEntry(p, r) for p, r in rec.dirty_pages}
            tt = {x: TxnEntry(x, TXN_IN_FLIGHT, l) for x, l in rec.active_xids}
        elif rec.type == "CLR":
            # Compensation log record — produced by undo during a previous recovery or
            # by an in-process rollback. It applies the before-image of some UPDATE.
            # Treat it like an UPDATE for DPT purposes: it modifies a page.
            tt[rec.xid] = TxnEntry(rec.xid, TXN_IN_FLIGHT, rec.lsn)
            dpt.setdefault(rec.page_id, DptEntry(rec.page_id, rec.lsn))

    # 3. When the scan hits end-of-log, DPT and TT reflect the state at crash time.
    return dpt, tt
```
The routine is fewer than forty executable lines and has one obvious structure: a big for loop that dispatches on record type. Every branch does exactly one thing to each table.
setdefault for the DPT. When an UPDATE refers to page p, the page is added to the DPT only if it is not already there, and its rec-LSN is set to the UPDATE's own LSN. That is the earliest-LSN-wins rule. If page p was already in the DPT (inherited from the checkpoint or added by an earlier UPDATE in this scan), its existing rec-LSN is kept — because that earlier LSN is the earliest record whose modification might not be on disk, which is what rec-LSN means.
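The earliest-LSN-wins rule is exactly what Python's dict.setdefault gives us. A two-line demonstration, with bare integer LSNs standing in for DptEntry records:

```python
dpt = {}                  # page_id -> rec-LSN, bare ints for illustration
dpt.setdefault(7, 2080)   # first UPDATE of page 7: rec-LSN becomes 2080
dpt.setdefault(7, 2320)   # later UPDATE of page 7: the existing rec-LSN is kept
print(dpt[7])             # 2080
```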
Remove on COMMIT, not on UPDATE. A transaction stays in the TT until it commits or aborts. Only when the scan encounters the COMMIT or ABORT record does analysis remove the entry. Until then, even after many UPDATEs, the transaction is "in flight" by definition.
CLRs look like updates for bookkeeping. A Compensation Log Record is the trace left by a rollback — either an in-process rollback that happened before the crash, or the output of a previous recovery's undo pass. For analysis purposes a CLR modifies a page, so it must update the DPT the same way an UPDATE does. You will see CLRs in detail in the undo chapter.
A log timeline — what analysis sees
Imagine the log at the moment of the crash. It runs from log creation on the left to the crashed tail on the right. Somewhere near the tail is the last checkpoint record. Analysis starts there and walks rightward.
Notice how the scan treats committed, aborted, and in-flight transactions differently. A committed transaction leaves no mark on the TT (the COMMIT record removes it), but its UPDATE records still put pages into the DPT — because those pages might not yet be on disk. An aborted transaction is also gone from TT, but if it wrote any CLRs before aborting, those CLRs have pages of their own in the DPT. An in-flight transaction — one whose BEGIN is in the log but whose COMMIT or ABORT is not — stays in TT at end-of-log, and that is exactly how undo recognises it.
Walking a concrete trace
Here is a small log, rendered as a table. Suppose the checkpoint at LSN 2000 contained: active = { 41, 42 }, dirty = { p=5: rec-LSN 1820 }. Walk through analysis record-by-record.
| LSN | Record | Action on TT | Action on DPT |
|---|---|---|---|
| 2000 | CHECKPOINT | initialise TT = { 41:?, 42:? } | initialise DPT = { 5: rec-LSN 1820 } |
| 2040 | COMMIT xid=41 | remove 41 from TT | (unchanged) |
| 2080 | UPDATE xid=42 page=7 | tt[42].last_lsn = 2080 | setdefault p=7 → rec-LSN 2080 |
| 2120 | UPDATE xid=42 page=9 | tt[42].last_lsn = 2120 | setdefault p=9 → rec-LSN 2120 |
| 2160 | COMMIT xid=42 | remove 42 from TT | (unchanged) |
| 2200 | BEGIN xid=43 | tt[43] = (in_flight, 2200) | (unchanged) |
| 2280 | UPDATE xid=43 page=12 | tt[43].last_lsn = 2280 | setdefault p=12 → rec-LSN 2280 |
| 2320 | UPDATE xid=43 page=7 | tt[43].last_lsn = 2320 | p=7 already present; leave rec-LSN = 2080 |
| 2400 | ABORT xid=44 | remove 44 (if present; ignored if absent) | (unchanged) |
| 2440 | BEGIN xid=45 | tt[45] = (in_flight, 2440) | (unchanged) |
| 2480 | UPDATE xid=45 page=9 | tt[45].last_lsn = 2480 | p=9 already present; leave rec-LSN = 2120 |
End of log. Final state:
TT = { 43: last_lsn = 2320, 45: last_lsn = 2480 }
DPT = { 5: rec-LSN = 1820,
7: rec-LSN = 2080,
9: rec-LSN = 2120,
12: rec-LSN = 2280 }
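The final state can be checked mechanically. Here is a minimal replay of the trace, with plain dicts standing in for the TxnEntry and DptEntry records (the checkpoint's last-LSNs for xids 41 and 42 are elided in the table, so None stands in for them):

```python
# tt maps xid -> last_lsn; dpt maps page_id -> rec-LSN.
tt = {41: None, 42: None}   # from the checkpoint at LSN 2000
dpt = {5: 1820}

records = [
    (2040, "COMMIT", 41, None),
    (2080, "UPDATE", 42, 7),
    (2120, "UPDATE", 42, 9),
    (2160, "COMMIT", 42, None),
    (2200, "BEGIN",  43, None),
    (2280, "UPDATE", 43, 12),
    (2320, "UPDATE", 43, 7),
    (2400, "ABORT",  44, None),
    (2440, "BEGIN",  45, None),
    (2480, "UPDATE", 45, 9),
]

for lsn, kind, xid, page in records:
    if kind in ("BEGIN", "UPDATE"):
        tt[xid] = lsn                  # refresh the transaction's last-LSN
        if page is not None:
            dpt.setdefault(page, lsn)  # earliest-LSN-wins
    else:                              # COMMIT or ABORT
        tt.pop(xid, None)

assert tt == {43: 2320, 45: 2480}
assert dpt == {5: 1820, 7: 2080, 9: 2120, 12: 2280}
```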
Two things worth pausing on.
Page 7 is modified twice — by xid 42 at LSN 2080 and by xid 43 at LSN 2320 — but its DPT rec-LSN is 2080, not 2320. The rec-LSN is the earliest LSN whose modification might not yet be on disk for this page. The later modification at LSN 2320 is already covered by starting redo from 2080: if redo replays from 2080, it will reach 2320 and apply it too.
Transaction 44 never appears on any row the scan added, yet there is an ABORT for it. This is not a bug; it is a fuzzy-checkpoint race: xid 44's BEGIN may have landed after the checkpoint's table snapshot was taken but before the checkpoint record itself, so neither the snapshot nor the scan (which starts after the checkpoint record) ever sees it. The tt.pop(rec.xid, None) with the None default handles the absence gracefully: an ABORT for an xid we do not know about is a no-op. (If 44 had instead begun early enough to appear in the checkpoint's ATT, the pop would simply remove it, just as it removes 41 and 42 at their COMMITs.)
The final TT has exactly the transactions that will be undone. The final DPT has exactly the pages redo will inspect. Analysis is done.
What analysis does not do
A few things to rule out, because they are easy to assume.
Analysis does not read any data page. The only input is the log. This is deliberate: data-page reads are random I/O and slow, and analysis needs to be fast so the system can move on to redo. Analysis is pure log scan.
Analysis does not apply any UPDATE to any page. It records the fact that an UPDATE exists (by putting the target page in the DPT) but does not execute it. Execution is the redo pass's job.
Analysis does not fsync anything or write to the log. The log file is opened read-only during analysis. (It will be opened for append after redo and undo have converged, when normal operation resumes.)
Analysis does not know whether any specific UPDATE is already on disk. For that, redo will compare each UPDATE's LSN to the page-LSN on the actual on-disk page. Analysis assumes the worst: if a page is dirty in the DPT, maybe its modifications are not on disk. If a page is not in the DPT (it was clean at checkpoint and no UPDATE has referenced it since), it is guaranteed to be up-to-date and redo can skip it entirely — that is the economy the DPT buys.
End-of-log detection and torn-tail records
Analysis keeps reading the log until log_manager.scan_from(...) tells it there are no more records. How does the iterator know where to stop? Each record's frame has a length prefix and a CRC32 (from Log Records, LSNs, and the Log Sequence). The iterator reads the length, reads that many bytes, computes the CRC, and compares it to the trailing CRC field.
Three outcomes at the tail:
- End-of-file cleanly — the iterator tried to read the next record's length prefix and got 0 bytes. Analysis stops.
- Partial record — the length prefix says "450 bytes of body follow" but only 200 bytes are left in the file before EOF. The append was interrupted by the crash. Analysis truncates the log at the start of this partial record and stops.
- Body complete but CRC fails — rarer: a torn write within a single record (the disk wrote the length prefix and part of the body, but not all of it, before power loss). Same handling: truncate at the start of this record.
After analysis finishes, the log manager atomically truncates the log to the last good record's end offset. The next append (which comes later, after all three passes complete) extends from there. The torn tail is gone; the surviving records are self-consistent.
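The stopping conditions can be sketched directly. A minimal frame scanner, assuming the frame layout from the log-format chapter is a little-endian length prefix, then the body, then a CRC32 of the body (the exact field widths here are illustrative, not normative):

```python
import struct
import zlib

def scan_frames(buf: bytes):
    """Walk frames of [len u32][body][crc32 u32]. Stops at a clean EOF,
    a partial record, or a CRC mismatch. Returns (records, truncate_offset)."""
    records, off = [], 0
    while True:
        header = buf[off:off + 4]
        if len(header) < 4:                 # clean EOF (or torn length prefix)
            break
        (length,) = struct.unpack("<I", header)
        end = off + 4 + length + 4
        if end > len(buf):                  # partial record: append was interrupted
            break
        body = buf[off + 4:off + 4 + length]
        (crc,) = struct.unpack("<I", buf[end - 4:end])
        if crc != zlib.crc32(body):         # torn write inside a single record
            break
        records.append(body)
        off = end                           # end offset of the last good record
    return records, off

def frame(body: bytes) -> bytes:
    """Build a well-formed frame for testing."""
    return struct.pack("<I", len(body)) + body + struct.pack("<I", zlib.crc32(body))

# Two good frames, then a torn tail: a prefix promising 450 bytes, only 7 present.
buf = frame(b"rec1") + frame(b"rec2") + struct.pack("<I", 450) + b"only 7b"
records, truncate_at = scan_frames(buf)
print(records, truncate_at)   # [b'rec1', b'rec2'] 24
```

The returned truncate_at is exactly the offset the log manager would truncate to before normal operation resumes.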
Common confusions
- "Does analysis write to disk?" No. Analysis is a pure read pass. It reads the control file (to find the last checkpoint's LSN), reads the log file (sequentially), and builds two tables in memory. No fsync, no page write, no log append.
- "Do I need the checkpoint at all? Can I just scan the whole log?" You can, and it will be correct — but it will be slow. Without a checkpoint, analysis must scan from byte 0 of the log (or from log creation), which is unbounded. The checkpoint gives analysis an initial DPT and TT that are already mostly populated, and lets it start at a recent LSN. The speedup is what checkpointing is for.
- "Why does analysis keep the DPT rec-LSN as the earliest LSN, not the latest?" Because rec-LSN answers the question "from what LSN forward might there be log records for this page that are not on disk?" If the page became dirty at LSN 2080 and was dirtied again at 2320, the earliest unflushed modification is at 2080. Redo must start no later than 2080 to replay the first modification; if it started at 2320, it would skip the 2080 modification, which might not be on disk.
- "Does analysis care whether the page is actually dirty right now? The buffer pool is empty after a crash." Correct — the buffer pool is empty, so there is no "dirty" in the classical sense. But "dirty" here means logically dirty: the page may have a WAL record whose modification has not reached the data file. The distinction does not depend on whether the page is currently in a buffer pool; it depends only on whether the WAL record has been flushed past the corresponding data-page write. Analysis reconstructs that logical-dirty state from the log.
- "What if the checkpoint's tables are wrong?" They are not wrong; they are stale. A fuzzy checkpoint captures the tables at a slightly earlier instant than when its record lands, so the checkpoint record's tables reflect the state at time t_snap, while the record's LSN is t_record ≥ t_snap. Analysis then replays the log from the scan's starting point onward (full ARIES starts the scan at the begin-checkpoint record precisely so that activity between t_snap and t_record is replayed too). Staleness is absorbed by the scan.
- "What happens to CLRs during analysis?" They behave like UPDATEs. A CLR means "this transaction's rollback modified page p at LSN L" — so page p goes into the DPT (via setdefault) and the transaction's last-LSN is refreshed. CLRs do not remove transactions from the TT; only COMMIT and ABORT do that.
- "Does analysis need the full log of the crashed process, or only the tail?" Only the tail after the last checkpoint, because the checkpoint's DPT+TT already encode the state before that point. Everything earlier is by definition already accounted for.
Going deeper
The 1992 ARIES paper and the distinction of in-doubt transactions
Mohan et al.'s original ARIES paper (TODS 1992) treats analysis with more nuance than the three-state sketch above. Their TT carries a richer per-transaction state enum: U (unprepared, pure in-flight), P (prepared — two-phase commit has passed the prepare phase but not yet received a global commit/abort), C (committed), and E (aborted/ended). States C and E are transient during the scan — as soon as a COMMIT or ABORT is seen, the entry is removed — but state P is retained because a prepared transaction is in doubt: its fate depends on a coordinator's decision that may not yet be in this log. After analysis, in-doubt transactions are queued for the transaction manager to resolve via the prepared-transaction resolution protocol.
The paper also specifies the UndoNxtLSN field for CLRs — a pointer from each CLR to the next record of its transaction that still needs undo, skipping the update the CLR has already compensated. This is a subtle but crucial optimisation: without it, a crash during undo would leave the next recovery free to undo an already-undone update, piling compensation on compensation. Undo reads the field straight off each CLR as it walks a transaction's chain backward, so no work is compensated twice no matter how many times recovery itself crashes and restarts.
The canonical presentation of analysis is §6 of the ARIES paper, and it runs to about eight pages. What we have covered in this chapter is the textbook skeleton; §6.3 of the paper is the authoritative source when you need to build a production implementation.
How Postgres's StartupXLOG compresses analysis and redo
Postgres does not have three cleanly separated passes. Its recovery entry point, StartupXLOG (in src/backend/access/transam/xlog.c), runs a single forward scan from the checkpoint that interleaves what ARIES calls analysis and redo. As each WAL record is read, the dispatcher (RmgrTable[xl_rmid].rm_redo(record)) calls the resource manager's redo function — which both updates any in-memory state (like the xlogreader state that tracks in-flight xids) and applies the change to the target page (if the page-LSN is old enough).
There is no explicit DPT in Postgres. Instead, every page read during redo is checked against the record's LSN directly: if page->pd_lsn < record->lsn, redo applies; otherwise it skips. This works because Postgres does not need the DPT to compute the starting LSN — it simply starts at the checkpoint's redo pointer, which is precomputed when the checkpoint is written. A shadow of the TT does exist: during replay, the KnownAssignedXids structure tracks every xid seen without a COMMIT or ABORT. But it serves hot-standby snapshots, not an undo pass, because Postgres has no undo: transactions that were in flight at the crash are simply never marked committed in the commit log, and MVCC visibility rules hide their rows, so nothing needs to be physically rolled back.
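None of this is Postgres source, but the fused shape is easy to see in a toy. A sketch of a one-pass analysis+redo scan, with simplified tuple records and an in-memory page dict standing in for the buffer manager (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Page:
    lsn: int = 0
    value: object = None

def fused_recovery(records, pages):
    """One forward scan that both tracks uncommitted xids and applies each
    record via the page-LSN skip rule. A toy: records are tuples of
    (lsn, type, xid, page_id, value)."""
    in_flight = set()
    for lsn, kind, xid, page_id, value in records:
        if kind in ("BEGIN", "UPDATE"):
            in_flight.add(xid)                 # xid seen, no COMMIT/ABORT yet
        if kind in ("COMMIT", "ABORT"):
            in_flight.discard(xid)
        if kind == "UPDATE":
            page = pages.setdefault(page_id, Page())
            if page.lsn < lsn:                 # page-LSN comparison replaces the DPT
                page.value = value             # redo applied inline
                page.lsn = lsn
    return in_flight                           # uncommitted xids at end-of-log

pages = {7: Page(lsn=2080, value="old")}       # page 7 already absorbed LSN 2080
records = [
    (2080, "UPDATE", 42, 7, "new"),   # skipped: the page-LSN says it is on disk
    (2160, "COMMIT", 42, None, None),
    (2280, "UPDATE", 43, 12, "x"),    # applied; xid 43 never commits
]
in_flight = fused_recovery(records, pages)
print(in_flight)          # {43}
print(pages[7].value)     # old
```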
Fusing analysis and redo into a single pass costs Postgres some conceptual clarity but saves one whole log scan. When the last checkpoint is hours old, that saved scan can shave minutes off recovery time.
How InnoDB's recv_recovery_from_checkpoint_start does it
InnoDB's recovery entry point is recv_recovery_from_checkpoint_start in storage/innobase/log/log0recv.cc. It reads the flushed-up-to LSN from the redo-log header (not a separate CHECKPOINT record; see the checkpoint chapter for why InnoDB differs), then scans forward. As it reads records it populates two hash tables:
- recv_sys->addr_hash: keyed by (space_id, page_no), holding lists of records to apply to each page. This is analogous to the DPT but richer — it contains the actual records to replay, not just page IDs.
- trx_sys->rw_trx_list: the in-memory transaction table, populated as BEGINs are read and trimmed as COMMITs are seen.
The interesting difference from textbook ARIES is that InnoDB gathers the records into per-page lists during "analysis," so redo is not a second log scan but a per-page replay of the pre-computed record lists. The trade-off: more memory during analysis (all redo records are held in memory per page) but fewer disk reads (redo applies records in page order, converting random page writes into sequential reads of page groups).
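The gather-then-apply shape is easy to see in miniature (hypothetical record tuples, not InnoDB's types):

```python
from collections import defaultdict

def group_by_page(records):
    """Bucket redo records per page, in the spirit of recv_sys->addr_hash.
    records are illustrative (lsn, page_id, payload) tuples in log order."""
    addr_hash = defaultdict(list)
    for lsn, page_id, payload in records:
        addr_hash[page_id].append((lsn, payload))
    return addr_hash

records = [(2080, 7, "a"), (2120, 9, "b"), (2320, 7, "c")]
groups = group_by_page(records)
print(groups[7])   # [(2080, 'a'), (2320, 'c')]
```

Redo then visits each page once and applies its whole list in LSN order, rather than re-reading the log and touching pages in log order.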
SQLite — recovery without analysis
SQLite's rollback-journal mode (still the default today; WAL mode arrived in 2010) does not use ARIES at all. Its recovery procedure has one pass: read the journal, and for every page-image in the journal, write the image back to the main database file — pure undo from before-images. There is no DPT, no TT, no analysis. This works because SQLite allows only one writer at a time and commits are serialised, so at most one transaction can be in flight at a crash, and there is nothing to redo: a transaction's pages are written into the database file itself before its journal is deleted at commit.
WAL-mode SQLite is a little more complex but still simpler than ARIES: it discards partial writes by treating everything after the last committed frame in the WAL as never having happened. Again, no analysis pass — the WAL header and per-frame checksums carry enough metadata to find the last committed frame directly.
The take-away: ARIES analysis pays off when the system supports concurrent transactions and fine-grained locking. For single-writer or whole-page-logged engines, simpler recovery algorithms suffice.
Where this leads next
Analysis hands two tables to the next pass:
- DPT = { page_id → rec-LSN } — the pages that might need redo, and the LSN to start replaying from.
- TT = { xid → last_lsn } — the in-flight transactions that must be undone.
The immediate next step is redo. Starting from min(rec-LSN over DPT), redo scans forward through the log and reapplies every UPDATE whose target page has a page-LSN on disk that is older than the record's LSN. The effect is to bring every page on disk into the state it would have been in if the crash had never happened — including the writes of transactions that were in flight at the crash. That sounds wrong (you do not want in-flight transactions' writes to persist!) but it is actually the right order: redo first re-establishes the pre-crash state exactly, and then undo walks the TT backward to cancel the in-flight transactions' writes. Doing the two passes in that order is subtle; it is the subject of the next chapter, which explains why redo before undo is not a mistake.
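In code terms, the hand-off is tiny. Using the DPT from the walk-through (bare ints for illustration), redo's starting point is just:

```python
def redo_start_lsn(dpt):
    """Redo begins at the smallest rec-LSN in the DPT: the earliest record
    anywhere in the log whose modification might not be on disk."""
    return min(dpt.values())

dpt = {5: 1820, 7: 2080, 9: 2120, 12: 2280}   # page_id -> rec-LSN
print(redo_start_lsn(dpt))                     # 1820
```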
After redo comes undo, and then the database reopens for normal operation.
References
- Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992 — §6 is the authoritative treatment of the analysis pass, including the transaction-table state enum and CLR handling.
- PostgreSQL Global Development Group, StartupXLOG and Recovery Internals, PostgreSQL 16 documentation — the operational description of Postgres's fused analysis+redo pass.
- Oracle Corporation, InnoDB Recovery, MySQL 8.0 reference manual — InnoDB's recv_recovery_from_checkpoint_start and the per-page record grouping that differs from textbook ARIES.
- Gray and Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann 1992 — §10.3 covers analysis in the broader context of restart recovery.
- Hellerstein, Stonebraker, Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007 — §5.4.2 sketches the three-pass structure concisely for newcomers.
- SQLite Consortium, Atomic Commit in SQLite — the journal-based recovery model, a useful contrast to ARIES for understanding when the full three-pass structure is and is not needed.