In short

The three major production MVCC engines — PostgreSQL, Oracle, and MySQL's InnoDB — all implement the same high-level idea from chapter 58: keep multiple row versions alive, tag each with the creator's transaction id, and let readers pick the version visible to their snapshot. Readers never block writers and writers never block readers; the only real contention is writer-writer on the same row. That part is universal.

Their physical layouts diverge radically. Postgres keeps old versions in the table heap itself — each row is a chain of versions in the same pages as the live data, a background VACUUM reclaims tombstones. Oracle keeps old versions in a separate undo segment — the heap holds only the current version, the undo segment is a circular log of "how to roll back to any past SCN." InnoDB is Oracle-in-spirit — a rollback segment outside the clustered index, hidden columns on every row (DB_TRX_ID, DB_ROLL_PTR) link to undo records, a purge thread reclaims them.

Those three choices ripple everywhere. Postgres's heap bloats under long-running transactions and needs aggressive autovacuum; Oracle hits the famous ORA-01555: snapshot too old when undo is reclaimed before a long query finishes; InnoDB's purge thread can fall behind and stall. Each engine also implements SERIALIZABLE differently — Postgres does SSI (true serialisability, chapter 60), Oracle ships snapshot isolation under the label (allows write skew, chapter 59), InnoDB falls back to strict 2PL, turning even plain reads into locking reads.

The practical upshot for you: which engine you are on changes how you configure pooling, tune cleanup, bound long queries, and reason about invariants. Moving an application from Postgres to Oracle at the same isolation level name silently changes its correctness guarantees. This chapter closes Build 7 by mapping the three engines side-by-side so you can pick, tune, and port with eyes open.

A million-row UPDATE, three engines, three outcomes

Start with a concrete scenario. You have a table orders with 1,000,000 rows, each holding a modest status column among other fields. You run a single transaction that updates every row:

BEGIN;
UPDATE orders SET status = 'archived';
COMMIT;

One statement. A million rows touched. What happens physically?

Postgres. Every UPDATE is really an insert-of-new-version plus a tombstone-of-old-version, both in the heap. After the commit, the orders table's heap pages now contain 2,000,000 row versions: the 1M new rows marked current, and the 1M old rows marked superseded (xmax pointing to your transaction id). None of the old versions are freed yet — they may still be visible to any reader whose snapshot predates your commit. Autovacuum will eventually walk all 2M versions, check visibility, and mark dead ones as reclaimable. Your table's on-disk footprint doubled for minutes or hours.

Oracle. The heap still has 1M rows — the current versions, updated in place. A parallel structure, the undo segment, grew by 1M entries: one undo record per row, each describing how to reverse the change (the old status value, plus the metadata to reconstruct the pre-update row). Old undo entries expire based on undo_retention; after that they become eligible to be overwritten by future undo traffic.

InnoDB. Similar to Oracle in shape. The clustered index on orders still has 1M rows — the current versions, updated in place. The rollback segment grew by 1M entries linked via DB_ROLL_PTR on each updated row; a background purge thread reclaims them once no snapshot can still see them.

The single biggest architectural fork in MVCC engine design is where dead versions live: in-heap with the live data, or in a separate undo structure. Everything else — visibility check cost, vacuum behaviour, long-running-transaction failure modes, HOT updates, connection pooling — is downstream of that choice.

The dimension sheet

Ten dimensions, three engines, thirty cells. Keep this table open when tuning production.

Dimension | PostgreSQL | Oracle | InnoDB (MySQL)
--------- | ---------- | ------ | ---------------
Where old versions live | In-heap — chain of tuples within table pages | Undo segment — separate tablespace, circular | Rollback segment — separate area, linked-list
Where xmin/xmax live | 24-byte per-row header in heap tuple | In undo log entries (heap has current SCN only) | Hidden columns: DB_TRX_ID (6B), DB_ROLL_PTR (7B)
How old versions get cleaned | VACUUM (autovacuum background workers) | Automatic undo management based on undo_retention + space pressure | Purge thread (configurable innodb_purge_threads)
Long-running-transaction failure mode | Table and index bloat — vacuum blocked past oldest snapshot | ORA-01555 "snapshot too old" — undo reclaimed under a live reader | Purge lag — undo grows, potential stall on writes
Write amplification of UPDATE | Full new tuple written (unless HOT); indexes rewritten | In-place update + undo record | In-place update + undo record
SERIALIZABLE implementation | SSI (Cahill 2008) — genuine serialisability | Snapshot isolation — allows write skew despite the name | Strict 2PL — plain SELECTs become locking reads
READ COMMITTED snapshot behaviour | Fresh snapshot per statement | Fresh snapshot per statement | Fresh snapshot per statement
REPEATABLE READ behaviour | Transaction-level snapshot (SI); phantoms eliminated by snapshot | No such level — SERIALIZABLE supplies transaction-level read consistency | Transaction-level snapshot + gap locks against phantoms
Hot-row update optimisation | HOT updates — same-page, no index maintenance | No equivalent named optimisation | Change buffer — defers secondary index writes
Visibility check cost | O(chain length) walk in heap | O(undo walk) from current row back to snapshot SCN | O(DB_ROLL_PTR chain walk) into rollback segment

Two axes do most of the work. The first is the heap vs undo column — where old versions live dictates how cleanup works, how long queries fail, and what visibility lookup costs. The second is the SERIALIZABLE column — the same three-word SQL command invokes genuinely different algorithms and permits genuinely different anomaly profiles. Every row below builds on these two.

[Figure: After a 1M-row UPDATE, two storage pictures. Left (PostgreSQL, in-heap versions): heap pages for orders hold 1M current and 1M dead row versions interleaved in the same pages, a doubled footprint with dead tuples awaiting VACUUM. Right (Oracle/InnoDB, undo style): the heap/clustered index holds only the 1M current rows, updated in place; a separate undo/rollback segment holds 1M undo records reclaimed ring-buffer style, with rollback pointers back into the heap rows.]
The same 1M-row UPDATE produces two very different on-disk footprints. Postgres doubles the heap until VACUUM reclaims the dead tuples. Oracle and InnoDB keep the heap compact and push the old-version load into a separate undo structure that gets reclaimed as undo retention expires. The choice determines which failure modes bite you under long-running transactions.

Postgres — old versions in the heap

Every row in a Postgres table carries a 24-byte heap-tuple header with the fields that make MVCC possible: xmin (creating transaction), xmax (superseding transaction, or zero), cmin/cmax (command counters within the creating/deleting transaction), ctid (a (page, offset) pointer to the next version in the chain, if any), plus flags. The whole header sits inside the table's own storage pages, alongside the column data.

An UPDATE does not overwrite. It does three things, in order:

  1. Allocate a new tuple. Find space on the same page as the old tuple if possible, otherwise on a different heap page. Write the new column values with header xmin = current_xid, xmax = 0, ctid = <self>.
  2. Stamp the old tuple. Set the old tuple's xmax = current_xid and its ctid to point at the new tuple's location.
  3. Update indexes. For each index on the table, insert a new index entry pointing at the new tuple's ctid. The old index entry is left in place (it still points at the old tuple, now dead).

Why indexes get full new entries: Postgres indexes store ctid pointers directly into the heap. Index entries are not version-aware. Every heap-tuple location gets its own index entry. This is what makes the index bloat problem as real as the heap bloat problem — a table with 5 indexes turns every UPDATE into 6 writes by default, all of which the eventual vacuum must clean up.

Readers execute a visibility walk. Given the ctid of the row they reached (via an index or a sequential scan), they load the heap tuple, check visibility of the header (xmin, xmax) against their snapshot (see chapter 58 for the exact rule), and if invisible, follow ctid to the next tuple in the chain. The walk terminates at a visible version or an end-of-chain marker.

VACUUM and autovacuum

Dead tuples cannot be reclaimed as soon as the transaction that killed them commits. A reader whose snapshot predates that commit can still see them. So "dead to everyone" is a stricter condition: every running snapshot must have been taken after the superseding transaction committed. Concretely, a tuple whose xmax committed before the globally oldest running snapshot's xmin is safe to reclaim.

Autovacuum is the background process that checks this. For each heap page it walks, it:

  1. Examines every tuple's header.
  2. Determines which are dead-to-everyone using the oldest-snapshot threshold.
  3. Marks those tuples' line pointers as LP_DEAD and rewrites the free-space map so future inserts land in the reclaimed slots.
  4. For secondary-index entries, schedules removal of the corresponding index entries.
  5. Optionally truncates the file tail if trailing pages become empty.

Autovacuum triggers when a table's dead-tuple count crosses autovacuum_vacuum_scale_factor * live_tuples + autovacuum_vacuum_threshold (defaults 0.2 and 50). On a heavily updated table this default is too lazy: by the time 20% of rows are dead you are measuring response-time regressions. Hot-table tuning typically drops the scale factor to 0.05 or lower.
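The trigger formula is plain arithmetic; a quick sketch shows why the defaults are too lazy for a hot table:

```python
def autovacuum_trigger(live_tuples: int,
                       scale_factor: float = 0.2,
                       threshold: int = 50) -> int:
    """Dead-tuple count at which autovacuum wakes for this table
    (the formula from the text; defaults mirror Postgres's defaults)."""
    return round(threshold + scale_factor * live_tuples)

# defaults on a 1M-row table: ~200k rows must die before vacuum even starts
assert autovacuum_trigger(1_000_000) == 200_050
# hot-table tuning at scale factor 0.05: vacuum starts after ~50k dead rows
assert autovacuum_trigger(1_000_000, scale_factor=0.05) == 50_050
```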

The long-running-transaction bloat problem

One long reader is all it takes to stall the whole cleanup machinery. Imagine an analyst runs a repeatable-read report that takes an hour. Its snapshot pins xmin_horizon at the xid it observed. Every UPDATE across the whole database during that hour generates dead tuples that autovacuum cannot reclaim, because the analyst's snapshot might still need them.

A busy OLTP workload emitting 10k updates/sec over an hour generates 36 million dead tuples autovacuum must wait on. When the long reader finishes, autovacuum now has a mountain of work; throughput sags while it catches up; indexes remain bloated even after heap reclaim (index VACUUM is a separate pass). The operational playbook is: monitor pg_stat_activity for long-running transactions; kill or alert on anything older than your bloat tolerance; push analytics to replicas with hot_standby_feedback = off so the primary's autovacuum is not held hostage.

Transaction-ID wraparound

Postgres xids are 32 bits and compared circularly, so only about 2 billion xids of history are distinguishable at any moment; high-throughput systems burn through that in days. Without intervention, an old row would eventually appear to have xmin in the future — invisible to everyone. Postgres avoids this with freezing: VACUUM rewrites old xmin as FrozenTransactionId, always visible. A database that goes long enough without vacuum will refuse new writes to protect itself — the dreaded ERROR: database is not accepting commands to avoid wraparound data loss.
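The circular comparison is the heart of the problem. A sketch of the modulo arithmetic, not Postgres source (the real code also special-cases equality and the reserved low xids):

```python
XID_SPACE = 2**32

def xid_precedes(a: int, b: int) -> bool:
    """Circular 32-bit order: a is 'in the past of' b when the forward
    distance from a to b is less than half the xid space."""
    return (b - a) % XID_SPACE < 2**31

assert xid_precedes(100, 200)                # ordinary case
assert xid_precedes(XID_SPACE - 5, 10)       # counter wrapped: old xid near the top
# from the far side the same pair inverts: more than half the space ahead
# means "in the future", which is how an ancient row becomes invisible
assert not xid_precedes(10, XID_SPACE - 5)
```

Freezing replaces an ancient xmin with a sentinel that compares as in-the-past to every snapshot, opting the tuple out of this arithmetic entirely.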

HOT — the optimisation that makes it bearable

The Heap-Only Tuple optimisation (Postgres 8.3, 2008) addresses the common case where an UPDATE modifies no indexed column. If the updated columns are all non-indexed and the new version fits on the same heap page as the old, Postgres performs a HOT update: the old tuple's ctid points to the new tuple on the same page; no index maintenance happens; existing index entries still point at the old tuple, and the heap chases the in-page ctid chain to reach the new version.

In practice, HOT converts a huge fraction of OLTP updates into zero-index-write operations. The only cost is leaving enough free space per page for new versions — hence the advice to set fillfactor = 90 or lower on heavy-update tables, reserving 10% of each page for HOT.
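The two HOT preconditions — no indexed column changed, and room on the same page — can be expressed as a tiny eligibility check. The names here are hypothetical, a model of the rule rather than Postgres internals:

```python
def hot_eligible(changed_cols: set[str], indexed_cols: set[str],
                 page_free_bytes: int, new_tuple_bytes: int) -> bool:
    """HOT update: new version chains in-page via ctid, all index
    maintenance skipped. Requires both conditions to hold."""
    return (changed_cols.isdisjoint(indexed_cols)
            and new_tuple_bytes <= page_free_bytes)

# status is unindexed and fillfactor left room on the page: HOT path
assert hot_eligible({"status"}, {"id", "customer_id"}, 800, 160)
# updating an indexed column forces the full new-tuple-plus-index-entries path
assert not hot_eligible({"customer_id"}, {"id", "customer_id"}, 800, 160)
# page packed full (fillfactor 100): new version must land elsewhere, no HOT
assert not hot_eligible({"status"}, {"id", "customer_id"}, 0, 160)
```

The third case is why fillfactor matters: the check can only pass if earlier inserts deliberately left slack on each page.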

Oracle — old versions in undo segments

Oracle takes the opposite approach. The heap (or "data segment") stores only current versions — each UPDATE overwrites in place. The undo segment is a separate tablespace structured as a circular ring buffer of undo records: one record per modified row, describing how to reverse the change to a previous System Change Number (SCN).

Every block header in the heap carries an Interested Transaction List (ITL): a small array of (transaction_id, undo_segment_pointer) entries for transactions currently modifying rows on that block. When transaction T updates row R on block B, Oracle:

  1. Adds an ITL entry for T on block B (if not already present).
  2. Writes an undo record to T's undo segment containing the pre-image of R.
  3. Overwrites R in place with the new column values.
  4. Flags R's row header with the ITL slot number pointing at T's undo record.

A reader arriving at R whose snapshot SCN is older than T's commit SCN follows the chain: row R → ITL entry → undo record → reconstructed prior version of R. If that prior version's own ITL pointer still predates the reader's SCN, follow that record's undo chain further back, and so on, until a version at or before the reader's SCN is reconstructed.
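The walk can be sketched with SCNs instead of the snapshot sets used by the simulators later in this chapter. A toy model, not Oracle internals: a row is a (commit_scn, value, undo_ptr) triple and each undo record has the same shape.

```python
def read_consistent(row, undo_log, snapshot_scn):
    """Oracle-style read: walk undo records backwards until a version
    whose commit SCN is at or before the reader's snapshot SCN."""
    scn, value, undo_ptr = row
    while scn > snapshot_scn:
        if undo_ptr is None:                  # undo already overwritten: dead end
            raise RuntimeError("ORA-01555: snapshot too old")
        scn, value, undo_ptr = undo_log[undo_ptr]
    return value

undo_log = [(100, "open", None)]    # pre-image saved by the UPDATE at SCN 120
row = (120, "archived", 0)          # current version, points at undo_log[0]

assert read_consistent(row, undo_log, 130) == "archived"  # snapshot after commit
assert read_consistent(row, undo_log, 110) == "open"      # reconstructs the past
try:                                # same old reader, but the undo was reclaimed
    read_consistent((120, "archived", None), undo_log, 110)
except RuntimeError as e:
    assert "ORA-01555" in str(e)
```

The dead-end branch is exactly the failure mode of the next section: the reader's snapshot is fine, but the chain it needs to walk no longer exists.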

Why this works: the undo log is not just for ROLLBACK — it is also Oracle's version store. The same data that lets a failed transaction reverse itself lets a long-running reader reconstruct the past. Oracle calls this unified story read consistency, and it dates to the early 1980s, predating the term "MVCC" by years in common use.

ORA-01555, the classic Oracle error

The undo segment is finite and cyclic. Old undo records are overwritten once they are older than undo_retention and the segment is under space pressure. If a reader's SCN is older than any undo record still retained, the reconstruction chain dead-ends mid-walk, and the reader receives the canonical Oracle error:

ORA-01555: snapshot too old: rollback segment number N with name "..."
           too small

The name is a historical artefact — the segment was originally called the rollback segment in Oracle 7 and before; the modern name is undo segment, but the error message preserves the old nomenclature.

Fixes in decreasing order of preference: (a) enlarge undo_retention to cover your longest expected read; (b) enlarge the undo tablespace so the retention policy has room to honour itself; (c) split long queries into smaller batches; (d) run analytics against a read-only standby. The same problem Postgres calls "autovacuum stall due to long snapshots", Oracle calls "ORA-01555" — different failure modes from the same underlying tension between long readers and aggressive cleanup.

The heap stays compact

The payoff for the undo-segment complexity is that the heap never bloats. After a 1M-row UPDATE, the heap occupies the same pages as before, with the same row count — only the column values changed. No vacuum needs to walk it. This matters for scan-heavy reporting workloads where cache efficiency depends on data density. Oracle's heap on an update-heavy table stays dense; Postgres's needs vacuum to remain so.

InnoDB — rollback segments

InnoDB (MySQL's default storage engine since 5.5) is architecturally closer to Oracle than to Postgres. The clustered index — the primary-key-ordered B-tree that holds the row data — stores only current versions. Each row carries two hidden columns not visible to SQL: DB_TRX_ID, the 6-byte id of the transaction that last modified the row, and DB_ROLL_PTR, a 7-byte pointer to the undo record that can reconstruct the previous version.

Undo records in the rollback segment form a linked list: each undo record carries its own DB_TRX_ID and DB_ROLL_PTR fields, chaining back to the previous version of that row. A reader arriving at a clustered-index row whose DB_TRX_ID is not visible to the reader's snapshot (its read view) follows DB_ROLL_PTR into the rollback segment, reconstructs the prior version by applying the undo delta, and checks visibility again. Iterate until a visible version is reconstructed or the chain ends.

The purge thread (or threads, configurable via innodb_purge_threads) removes undo records once no active read view can still see them. The mechanism mirrors Postgres's oldest-snapshot horizon: the purge thread tracks the oldest read view's start point and reclaims any undo records whose owning transaction committed before that point.

Gap locks at REPEATABLE READ

InnoDB's REPEATABLE READ is a notable divergence. Plain snapshot reads cannot see phantoms, because the snapshot is frozen; but locking reads bypass the snapshot and read current data, so on their own they could. InnoDB therefore adds gap locks on top of MVCC at REPEATABLE READ: when you run SELECT ... WHERE indexed_col > 100 FOR UPDATE or another locking read inside a transaction, InnoDB locks the gaps in the index between existing keys above 100. A concurrent INSERT into that range blocks until your transaction commits.

The effect is to prevent phantoms even for locking reads and certain non-repeatable-read edge cases Postgres's SI tolerates. The cost is reduced concurrency on insert-heavy workloads and a class of deadlocks unique to gap-lock semantics. Many production MySQL shops drop to READ COMMITTED precisely to disable gap locks.
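Which gaps get locked is easy to compute from the index keys. A sketch for the range predicate above — a toy model of the covered intervals, not InnoDB's actual lock-table representation (real next-key locks attach to the key record plus the gap below it):

```python
import math

def locked_gaps(index_keys, low):
    """Intervals a locking read WHERE col > low would protect at
    REPEATABLE READ: every gap between matching keys, plus the
    open-ended gap above the last matching key."""
    bound = low
    gaps = []
    for k in sorted(index_keys):
        if k > low:
            gaps.append((bound, k))   # gap below k (the key itself is row-locked)
            bound = k
    gaps.append((bound, math.inf))    # supremum gap: no insert above the last key
    return gaps

# keys 90 and 100 fall outside the predicate; any INSERT of a value > 100 blocks
assert locked_gaps([90, 100, 105, 120], 100) == [(100, 105), (105, 120), (120, math.inf)]
```

The supremum gap is the one that surprises people: inserting 1,000,000 blocks even though no existing key is anywhere near it.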

InnoDB's SERIALIZABLE is 2PL

InnoDB's SERIALIZABLE is not SSI. It drops the MVCC-snapshot reading for non-locking reads and upgrades every plain SELECT into SELECT ... LOCK IN SHARE MODE. Readers take shared row locks; writers take exclusive locks; the protocol reduces to strict two-phase locking as described in chapter 53. Writes still go through MVCC for the version store (so readers at lower levels see a consistent snapshot), but the SERIALIZABLE level itself is lock-based, not optimistic.

Consequence: InnoDB SERIALIZABLE genuinely prevents write skew (2PL is sound), but throughput drops dramatically under contention, mirroring classical 2PL's lock-queue pathology. Postgres's SSI is optimistic and retries; InnoDB's SERIALIZABLE is pessimistic and blocks. Same guarantee, different operational profile.
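The operational difference reduces to the S/X compatibility matrix of chapter-53-style 2PL versus lock-free snapshot reads. A one-function sketch:

```python
def blocks(held: str, requested: str) -> bool:
    """Classical 2PL compatibility: only shared-shared coexists;
    every other combination makes the requester wait."""
    return not (held == "S" and requested == "S")

assert not blocks("S", "S")   # readers never block readers
assert blocks("S", "X")       # InnoDB SERIALIZABLE: a plain SELECT's S lock stalls writers
assert blocks("X", "S")       # ...and an in-flight UPDATE's X lock stalls SERIALIZABLE readers
```

Under Postgres SSI the reads in the last two cases take no row lock at all; the conflict surfaces later as a retryable serialisation failure rather than as blocking. Same final guarantee, opposite latency profile.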

Cross-engine differences that matter in practice

Five operational differences that touch your hands, not just your diagrams.

Postgres bloat management is mandatory. Running Postgres without monitoring pg_stat_user_tables (n_dead_tup, last_autovacuum) is like running a car without a fuel gauge. For hot tables drop autovacuum_vacuum_scale_factor to 0.05, raise autovacuum_max_workers, and schedule manual VACUUM on the hottest ones.

Oracle undo retention must match your longest read. If you run hour-long reports on an Oracle OLTP system, set undo_retention to at least 3600, size the undo tablespace for an hour of undo traffic, and consider RETENTION GUARANTEE so retention is enforced even under space pressure. Skipping this is how ORA-01555 bugs show up in Jira every quarter.

InnoDB purge lag is the silent killer. Monitor information_schema.innodb_metrics: trx_rseg_history_len is the depth of the undo history list. On a healthy system it sits in the hundreds; millions means the purge thread is behind. The root cause is almost always a forgotten long-running transaction; SHOW ENGINE INNODB STATUS lists the oldest.

SERIALIZABLE semantics are not portable. Port an application from Postgres to Oracle keeping the same SET TRANSACTION ISOLATION LEVEL SERIALIZABLE statements, and your write-skew-sensitive invariants silently break — Oracle's SERIALIZABLE is SI. Port the other way and you gain correctness plus a new requirement: handle 40001 serialisation-failure retries (chapter 57).

Connection pooling is Postgres's operational cliff. Postgres uses process-per-connection — each session is a full OS process (10-20 MB) with a fork cost at connect time. Above a few hundred concurrent connections Postgres stalls under context-switch and memory pressure. The fix is an external pooler (PgBouncer, pgcat) bounding backend count. Oracle and InnoDB use threads-per-connection and scale to thousands of connections natively — pooling is still good hygiene but not operationally critical.

The SERIALIZABLE semantics gotcha, again

Worth saying twice: the word SERIALIZABLE in SQL means different things to different engines. In Postgres it is SSI: genuine serialisability, enforced by aborting dangerous transactions. In Oracle it is snapshot isolation: write skew is permitted. In InnoDB it is strict 2PL: genuine serialisability, enforced by blocking.

The invariant: never port a write-skew-sensitive application across engines without re-verifying behaviour. The same SQL produces different executions, sometimes different final states.

Connection and pooling implications

The MVCC choice subtly constrains pooling strategy. Postgres's process-per-connection model combines with the snapshot-and-vacuum interaction to create a sharp operational profile: an idle backend holding an old snapshot blocks autovacuum. PgBouncer in transaction mode is the near-universal default — the pooler returns a backend between every transaction, so snapshots live only for the duration of one transaction. The catch: transaction-mode pooling breaks features that rely on cross-transaction session state (prepared statements with some clients, advisory locks). Most ORMs manage this; hand-written psycopg2 code often doesn't.

Oracle and InnoDB, with thread-per-connection and compact undo, tolerate idle connections better. Pooling is still best practice but the failure mode of "no pool at all" is less catastrophic. A Java app pooled via HikariCP at 100 connections to Oracle works fine; 100 unpooled connections to Postgres saturates the machine's memory.

A minimal MVCC simulator for each style

Two Python classes, same visibility predicate, different storage. Both are pedagogical — production engines have thousands of additional lines for durability, locking, and corner cases — but the storage contrast is exactly the Postgres-vs-Oracle/InnoDB split.

# concurrency/heap_mvcc.py — Postgres-style: versions in the heap
from dataclasses import dataclass, field

@dataclass
class HeapTuple:
    xmin: int
    xmax: int | None
    ctid: int | None            # next version's slot in the page
    value: object

class HeapMVCC:
    """Old versions live alongside current versions in the same 'heap page'.
    An UPDATE allocates a new tuple and links the old one forward via ctid."""

    def __init__(self):
        self.heap: list[HeapTuple] = []       # the 'page'
        self.committed: set[int] = set()

    def _visible(self, t: HeapTuple, snap: set[int]) -> bool:
        if t.xmin not in snap: return False
        if t.xmax is None: return True
        return t.xmax not in snap

    def read(self, key_slot: int, snap: set[int]):
        slot = key_slot
        while slot is not None:                # walk ctid chain
            t = self.heap[slot]
            if self._visible(t, snap): return t.value
            slot = t.ctid
        return None

    def update(self, key_slot: int, new_value, xid: int):
        new_slot = len(self.heap)
        self.heap.append(HeapTuple(xmin=xid, xmax=None, ctid=None, value=new_value))
        old = self.heap[key_slot]
        old.xmax, old.ctid = xid, new_slot      # tombstone + forward pointer

Twenty-five lines. The dead tuples are in self.heap forever until a hypothetical VACUUM walks the list and drops entries whose xmax is in every snapshot's past.

# concurrency/undo_mvcc.py — Oracle/InnoDB-style: current heap + undo log
from dataclasses import dataclass

@dataclass
class UndoRecord:
    xid: int                   # tx that created this undo (i.e. did the update)
    prev_xid: int              # tx that owned the row BEFORE this update
    prev_undo: int | None      # pointer to previous undo record, or None
    prev_value: object

class UndoMVCC:
    """Heap holds ONLY current versions; undo log stores how to roll back.
    Readers that can't see the current version walk the undo chain backwards."""

    def __init__(self):
        self.heap: dict[int, tuple[int, int | None, object]] = {}
        # row_id -> (xid_of_current_version, roll_ptr, current_value)
        self.undo: list[UndoRecord] = []
        self.committed: set[int] = set()

    def read(self, row_id: int, snap: set[int]):
        xid, roll_ptr, value = self.heap[row_id]
        if xid in snap: return value           # current version is visible
        while roll_ptr is not None:            # walk undo chain backwards
            rec = self.undo[roll_ptr]
            if rec.prev_xid in snap: return rec.prev_value
            roll_ptr = rec.prev_undo
        return None                            # older than anything retained

    def update(self, row_id: int, new_value, xid: int):
        old_xid, old_roll, old_value = self.heap[row_id]
        undo_idx = len(self.undo)
        self.undo.append(UndoRecord(xid, old_xid, old_roll, old_value))
        self.heap[row_id] = (xid, undo_idx, new_value)   # in-place update

Same visibility semantics, entirely different storage. The heap dictionary in UndoMVCC never grows on UPDATE — only its current values change. The undo list grows by one entry per UPDATE, and a real engine would periodically truncate it once no snapshot can still need its early entries.

A million-row UPDATE under each engine

Setup: the orders table has 1M rows, all on 20,000 heap pages (50 rows per 8 KB page). One transaction, T_big, updates every row's status column. Post-commit state:

Postgres without HOT. Each UPDATE writes a new tuple on a different page (or the same page if fillfactor left room, but assume the table was at fillfactor 100). The heap now contains 2M tuples on roughly 40,000 pages. Every secondary index has 1M new entries pointing at new locations plus 1M old entries pointing at tombstones. Autovacuum kicks in when dead fraction exceeds 20% — here it is 50%, so it starts immediately. Walking 40k pages to reclaim 1M dead tuples takes minutes; index vacuum is another pass.
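The footprint numbers follow directly from the setup arithmetic:

```python
rows, rows_per_page = 1_000_000, 50     # ~8 KB pages, per the setup above

pages_before = rows // rows_per_page
assert pages_before == 20_000

# without HOT, every row now has a live version and a dead version in the heap
tuples_after = rows * 2
pages_after = tuples_after // rows_per_page
assert pages_after == 40_000            # on-disk footprint doubled

# dead fraction 50%, far past the 20% autovacuum default: vacuum fires at once
dead_fraction = rows / tuples_after
assert dead_fraction == 0.5
```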

Postgres with HOT. If status is not indexed and the page had room (fillfactor < 100), most updates are HOT. New version lives in the same page, chained via in-page ctid. No index maintenance. The heap still has 2M tuples temporarily, but they are concentrated and autovacuum reclaims them page-at-a-time efficiently. This is why production Postgres schemas set fillfactor 80-90 on heavily updated tables — the HOT path survives the UPDATE storm.

Oracle. Heap has 1M rows, unchanged in count; each row's block now has an ITL entry for T_big and its row pointer moved through the ITL to an undo record. Undo segment grew by ~1M records. If undo_retention is 900 seconds and the table is churned further, T_big's undo eventually becomes eligible for overwrite. Any reader whose snapshot SCN predates T_big's commit and takes longer than undo_retention seconds to finish hits ORA-01555.

InnoDB. Clustered index has 1M rows, unchanged in count; each row's DB_TRX_ID now equals T_big's id and its DB_ROLL_PTR points at an undo record in the rollback segment. Rollback segment grew by ~1M records. The purge thread starts reclaiming them as soon as no read view can still see pre-T_big state. trx_rseg_history_len spikes to ~1M and drains over the next minutes.

Three engines, one workload, three different physical outcomes and three different operator concerns. This is why "MVCC" as a single word hides the most consequential architectural choice the engine has made.

Common confusions

"All MVCC engines behave the same." They do not. Same visibility rule, three storage layouts, three cleanup strategies, three SERIALIZABLE implementations. The name MVCC covers a family, not a protocol. Cross-engine portability of isolation semantics cannot be assumed.

"VACUUM is a Postgres oddity." VACUUM is a specific manifestation of "old versions must eventually be reclaimed". Oracle reclaims via undo-retention-based ring-buffer overwrite; InnoDB reclaims via purge thread. Every MVCC engine has a version-reclaim story; Postgres's is the most externally visible because of how often it misbehaves, but "VACUUM is weird" is the wrong read. The better one: in-heap versions require explicit, batchy cleanup; undo versions get continuous, implicit cleanup. Both have failure modes.

"Oracle doesn't have write skew because it has SERIALIZABLE." The single most expensive misconception in the database world. Oracle's SERIALIZABLE is snapshot isolation. Write skew is the canonical SI anomaly. If you rely on Oracle SERIALIZABLE to prevent write skew, your invariants will silently break. Either use SELECT ... FOR UPDATE over the read set, materialise the invariant as a constraint, or move to Postgres for SSI.

"Long-running read-only transactions are free." They are not free on any MVCC engine. On Postgres they prevent vacuum. On Oracle they risk ORA-01555. On InnoDB they stall purge. A one-hour report on a busy OLTP primary is an operational event, not a zero-cost action.

"If my workload is fine on Postgres it will be fine on Oracle." Not if you use long queries (different failure mode) or SERIALIZABLE (different guarantee) or many connections (different overhead profile). Performance equivalence across engines requires workload-specific benchmarking, not pattern-matching.

Going deeper

Three directions the field is moving in 2025 and where to follow.

Postgres's pluggable table storage — an escape hatch from the heap model

Postgres 12 introduced pluggable table access methods, letting an extension ship a different storage layout while reusing the indexes, planner, and executor. Experimental zheap (an out-of-tree EnterpriseDB project) reimplements Postgres storage along Oracle/InnoDB lines — heap holds current versions only, undo log holds rollback data, no VACUUM in the classical sense. zheap has not merged into core, but the pluggable TAM means a future Postgres could support multiple storage engines the way MySQL does.

Oracle flashback queries

Oracle exposes the undo machinery directly to SQL via AS OF SCN or AS OF TIMESTAMP. The query engine reconstructs table state at the requested point using the same undo walk live readers use. undo_retention bounds how far back you can flashback; extend retention and your production DB becomes a limited time-machine. Amazon Aurora MySQL added a comparable feature called "backtrack" in 2018; vanilla Postgres does not expose versioned queries out of the box.

CockroachDB and Spanner — distributed MVCC, undo-style

CockroachDB and Google Spanner both use MVCC with versions stored separately from the current row, closer to Oracle's style than Postgres's. Each version is tagged with a hybrid logical clock timestamp; storage is a keyspace with versioned keys (key@ts1, key@ts2, ...) backed by RocksDB or Colossus. A garbage-collection loop reclaims versions older than the GC TTL (default 25 hours in CockroachDB, used to support AS OF SYSTEM TIME queries). The distributed version of the long-running-reader problem is the same — a query holding a low-timestamp snapshot prevents GC across every range.
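A versioned-keyspace read is a small variation on the undo walk: keep every version keyed by timestamp and pick the newest one at or below the reader's timestamp. A sketch with plain integers standing in for hybrid logical clocks, and a GC pass that mimics the TTL-based reclaim described above:

```python
def mvcc_get(store: dict, key: str, read_ts: int):
    """Return the newest version of key at or below read_ts, else None."""
    versions = store.get(key, {})
    visible = [ts for ts in versions if ts <= read_ts]
    return versions[max(visible)] if visible else None

def gc(store: dict, gc_threshold: int):
    """Drop versions superseded before gc_threshold. The newest version at or
    below the threshold survives: it still answers reads as of the threshold."""
    for key, versions in store.items():
        at_or_below = [ts for ts in versions if ts <= gc_threshold]
        keep_from = max(at_or_below) if at_or_below else None
        store[key] = {ts: v for ts, v in versions.items()
                      if keep_from is None or ts >= keep_from}

store = {"k": {10: "a", 20: "b", 30: "c"}}
assert mvcc_get(store, "k", 25) == "b"
assert mvcc_get(store, "k", 5) is None       # older than anything written

gc(store, gc_threshold=25)                   # version at ts=10 is reclaimable
assert store["k"] == {20: "b", 30: "c"}
assert mvcc_get(store, "k", 25) == "b"       # threshold-age reads still succeed
```

A query pinned below the GC threshold is the distributed twin of ORA-01555: its versions can vanish mid-walk, which is exactly why a low-timestamp snapshot holds GC back across every range.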

Where this leads next

This chapter closes Build 7 on concurrency control. You now have the full stack: locking (ch.52-53), the anomaly catalogue (ch.56), isolation levels and their standards mess (ch.57), MVCC mechanics (ch.58), snapshot isolation and write skew (ch.59), SSI (ch.60), and three production engines' concrete implementations (this chapter). You can pick an isolation level on any mainstream database and defend the choice; you can tune cleanup mechanisms with an understanding of why they exist; you can port applications between engines without silently corrupting invariants.

Build 8 turns from "how do transactions coexist while running" to "what survives when the machine dies mid-transaction".

MVCC answered "how do I run many transactions at once without serialising everything". Build 8 answers "how do those transactions survive a power loss halfway through, and how do they stay consistent across machines". Those two questions together are why relational databases are the ones you put your money in.

References

  1. PostgreSQL documentation, Chapter 13: Concurrency Control — Transaction Isolation and MVCC — the production reference for Postgres's MVCC, covering the xmin/xmax/ctid model, snapshot representation, and the explicit lock modes that supplement MVCC. The "Routine Vacuuming" chapter adjoining it is the operator's guide to managing in-heap dead tuples.
  2. Oracle documentation, Data Concurrency and Consistency — Oracle's treatment of read consistency, undo segments, SCNs, the SERIALIZABLE isolation level (which is snapshot isolation), and the ORA-01555 error. Also covers flashback queries and AS OF SCN semantics.
  3. MySQL documentation, InnoDB Locking and Transaction Model — the canonical reference for InnoDB's clustered-index-with-hidden-columns MVCC, gap locking at REPEATABLE READ, purge threads, and the strict-2PL SERIALIZABLE implementation. Essential reading for MySQL operators.
  4. Wu, Arulraj, Lin, Xian, Pavlo, An Empirical Evaluation of In-Memory Multi-Version Concurrency Control, VLDB 2017 — measured comparison of MVCC variants across Postgres, Oracle, MySQL, SQL Server Hekaton, HyPer, and MemSQL on uniform workloads. Where the Postgres-vs-Oracle-vs-InnoDB trade-offs are quantified rather than narrated.
  5. Suzuki, The Internals of PostgreSQL — free online book, deeply technical treatment of Postgres storage, MVCC, vacuum, and transaction-id wraparound. Chapters 5 (Concurrency Control) and 6 (VACUUM Processing) are the single best deep dive on Postgres's in-heap version model.
  6. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 7 — the accessible engineering treatment of MVCC, snapshot isolation, write skew, and the engine-specific differences covered here. Pairs well with the primary-source engine docs for readers coming from a general software-engineering background.