In short

Multi-version concurrency control (MVCC) is the technique that makes Postgres, Oracle, MySQL's InnoDB, and essentially every modern OLTP engine feel fast under mixed read/write load. The previous two chapters (2PL, anomalies) made one hard assumption: there is exactly one stored value per row at any instant, so a reader and a writer of the same row must coordinate through locks. MVCC rejects that assumption. Instead of overwriting a row in place, a writer creates a new version tagged with its transaction ID; the old version stays alive until no running transaction can still see it. A reader picks the version that was already committed as of its own transaction's start time. The effect: readers never block writers, writers never block readers, and the only real blocking is writer-writer contention on the same row.

The cost is space (dead tuples accumulate until vacuum reclaims them) and complexity (every row access runs a visibility check). The win is enormous — on read-heavy OLTP workloads MVCC scales linearly with cores while 2PL saturates at the hot-row lock queue.

The Python implementation is about fifty lines. Each row becomes a list of versions, each version tagged (xmin, xmax, value). Each transaction captures a snapshot at BEGIN: the set of transaction IDs that had committed at that instant. A version v is visible to transaction tx if and only if v.xmin is in tx.snapshot and v.xmax is either null or not in tx.snapshot. That pair of conditions is the entire visibility rule. Postgres's production implementation is hundreds of pages of C, but the core is the whiteboard rule above, layered over storage, index, and vacuum machinery.

You finished chapter 53 with a working lock-based database and chapter 56 with a survey of the anomalies weaker isolation levels permit. Between those is a trap. On a read-heavy OLTP workload — an e-commerce catalog at 95% reads — 2PL forces every reader to queue behind any writer touching a nearby row. The workload could run in parallel across 64 cores; the lock manager makes it run on roughly one.

The fix is not a better lock. The fix is to stop locking on the read path. When a writer modifies a row, do not wait for readers to leave, and do not make readers wait. Keep the old value around for anyone still looking at it, and let the writer append a new one. This is MVCC — introduced in David Reed's 1978 MIT thesis, shipped by Oracle since version 4 in 1984 under the name "read consistency", and in Postgres continuously since version 6.5 in 1999.

The throughput gap MVCC closes

Benchmark 2PL against MVCC on a contested workload: one hot row — a counter total_orders — read by 100 concurrent reporting queries and incremented by a single writer at 1 Hz.

Under strict 2PL, the writer takes a write lock and blocks all 100 readers for the write plus commit (say 5 ms). During that 5 ms the readers queue; through the next second the queue drains. The writer's 5 ms wipes out 500 reader-milliseconds per second elapsed. The system is largely idle.

Under MVCC, the writer creates a new version; readers continue reading the old one. Neither side takes a lock covering the other. The 100 reader threads saturate the hardware — roughly 10,000 read TPS at 10 ms per read transaction. The writer slides in its one update per second without blocking anyone.

The Postgres 6.5 release notes from June 1999 — which introduced MVCC — reported roughly 8× to 10× read throughput on contended workloads compared to the previous locking implementation. On modern 64-core hardware, MVCC scales to millions of read TPS while strict-2PL flat-lines below 100k. The point is not that 2PL is bad; MVCC uses 2PL internally on the write path. The point is that 2PL is the wrong protocol for readers.

The core idea — keep old versions

Every lock-based concurrency protocol assumes one thing from the storage layer: rows are mutable. UPDATE accounts SET balance = 400 WHERE id = 'alice' overwrites the bytes of Alice's row. Locks exist because two threads cannot safely overwrite the same bytes without coordination, and because a reader cannot observe bytes that are mid-overwrite.

MVCC's insight: do not overwrite. When T1 updates Alice's balance from 500 to 400, the engine does not replace the row. It performs two distinct operations:

  1. Mark the existing version as ending at T1. The row stores a field xmax. Before T1, Alice's row has xmax = None — "still current." T1 sets xmax = T1 — "superseded by T1."
  2. Append a new version. A new physical row is written with xmin = T1, xmax = None, value = 400.

After the update, Alice has two versions in storage:

v1: xmin=T0, xmax=T1,   value=500
v2: xmin=T1, xmax=None, value=400

A reader looking at Alice's row sees a version chain. Which version it reads depends on its snapshot. A reader that started before T1 committed sees v1; a reader that started after sees v2.

Figure: version chain for row alice across three updates. Versions v1 (xmin=T0, xmax=T1, value=500), v2 (T1, T2, 400), v3 (T2, T5, 350), and v4 (T5, None, 500, current) stacked along a real-time axis, with the commits of T1, T2, and T5 marked below.
Four physical versions of Alice's row after three updates. Each version records the transaction that created it (xmin) and the transaction that superseded it (xmax, or None for the current version). Readers pick whichever version is visible under their snapshot — the current version for readers that started after T5 committed, v3 for readers that started between T2 and T5, v2 between T1 and T2, v1 before T1.

Why this eliminates reader/writer conflicts: two agents can never observe a "partially updated" row because rows are not partially updated. A write is two separate append-style actions — each atomic on its own. A reader that arrives mid-write sees either the old version with xmax = None (new version not landed yet) or the new version with xmin = T1 (old already superseded). There is no intermediate state.

The xmin/xmax pair is the entire on-disk machinery MVCC needs. Everything else — snapshots, visibility checks, vacuum — is logic that interprets these two fields.

A Python implementation

Build MVCC bottom-up in real Python. Start with a Version dataclass, an MVCCStore that holds version chains, and a Transaction class that carries a snapshot.

# concurrency/mvcc_store.py
from dataclasses import dataclass
from collections import defaultdict
from typing import Any
import threading, itertools

class WriteConflict(Exception):
    """Raised when first-writer-wins detects a concurrent writer."""

@dataclass
class Version:
    xmin: int                  # tx that created this version
    xmax: int | None           # tx that superseded/deleted it (None = live)
    value: Any

class MVCCStore:
    """A key-value store where every key maps to an append-only list of
    versions. Writes never overwrite; they mark old versions' xmax and
    append a new version. Readers consult the visibility predicate."""

    def __init__(self):
        self.rows: dict[str, list[Version]] = defaultdict(list)
        self.committed: set[int] = set()     # committed xids
        self.aborted: set[int] = set()       # aborted xids
        self._xid_gen = itertools.count(1)   # monotonic xid generator
        self._running: set[int] = set()      # in-flight xids
        self._lock = threading.Lock()        # guards all the above

Fifteen lines give you the state every MVCC engine tracks: version table, committed xids, aborted xids, a monotonic xid counter, and currently-running transactions. The lock self._lock guards the bookkeeping, not the data — held for the microseconds needed to flip a bit in committed or append to a version list. Very different from a 2PL lock manager, where locks are held for milliseconds or seconds.

Now the visibility predicate — the heart of the protocol:

def visible(v: Version, tx_snapshot: set[int], committed: set[int]) -> bool:
    """Is version v visible to a transaction whose snapshot is tx_snapshot?

    A version is visible iff:
      (1) its creator (xmin) had already committed at the tx's start, AND
      (2) it has not been superseded by a tx that had already committed
          at the tx's start (xmax is None, or xmax is not in the snapshot).

    (committed is unused in this minimal rule; production checks also
    consult commit status for aborted-transaction bookkeeping.)
    """
    if v.xmin not in tx_snapshot:
        return False                           # creator was not yet committed
    if v.xmax is None:
        return True                            # still the current version
    if v.xmax in tx_snapshot:
        return False                           # superseded by a committed tx
    return True                                # xmax is uncommitted/in-flight

Seven lines, four branches. Every production MVCC engine's visibility check is essentially this function plus extra bookkeeping for aborted transactions and sub-transaction IDs.

  1. If v.xmin is not in the snapshot, the creator had not committed when the reader started — invisible, even if it has since committed.
  2. If v.xmax is None, no one has superseded this version — visible.
  3. If v.xmax is in the snapshot, the supersession was already visible at snapshot time — read the newer version instead.
  4. Otherwise v.xmax exists but is not in the snapshot — supersession has not yet become visible — still visible.

Why both conditions matter: (1) alone would make uncommitted writes visible the moment the snapshot is taken — a dirty read. (2) alone would make every committed version of every row visible to every reader. The two together pin down a single version per row per reader — the one whose lifetime interval [xmin_commit, xmax_commit) contains the reader's snapshot point.

Read and write operations wire these pieces together:

def read(self, tx, key):
    with self._lock:
        versions = self.rows.get(key, [])
        for v in reversed(versions):           # newest first
            if visible(v, tx.snapshot, self.committed):
                return v.value
        return None                            # no visible version = not found

def write(self, tx, key, value):
    with self._lock:
        versions = self.rows[key]
        if versions:
            current = versions[-1]             # most recent version
            # First-writer-wins: conflict if the newest version was made
            # by a transaction we cannot see (still running, or committed
            # after our snapshot began) and that did not abort.
            if (current.xmin != tx.xid
                    and current.xmin not in tx.snapshot
                    and current.xmin not in self.aborted):
                raise WriteConflict(key)
            if current.xmax is not None and current.xmax not in self.aborted:
                raise WriteConflict(key)       # already superseded
            current.xmax = tx.xid              # mark as superseded by us
        versions.append(Version(tx.xid, None, value))

read scans newest-first and returns the first version the predicate accepts. write marks the current version's xmax, then appends a new version. If another non-aborted transaction has already claimed the row, write raises WriteConflict: first-writer-wins.

Commit and abort are trivial:

def commit(self, tx):
    with self._lock:
        self._running.discard(tx.xid)
        self.committed.add(tx.xid)

def abort(self, tx):
    with self._lock:
        self._running.discard(tx.xid)
        self.aborted.add(tx.xid)
        # Any versions with xmin == tx.xid become invisible to everyone
        # automatically, because xmin is never in anyone's future snapshot
        # once tx is in aborted rather than committed.

Notice what is not here. No lock acquisition on read. No transactional lock on write — only a brief latch on the bookkeeping. No blocking, no lock manager, no queues. The data structure is the protocol.

The Transaction class just captures a snapshot:

class Transaction:
    def __init__(self, store: MVCCStore):
        self.store = store
        with store._lock:
            self.xid = next(store._xid_gen)
            store._running.add(self.xid)
            # Snapshot is the set of committed xids as of BEGIN.
            self.snapshot = set(store.committed)

Three lines of work at BEGIN. The snapshot is frozen — a set copy at this instant. Future commits add xids to store.committed, but self.snapshot does not see them. That frozen-ness gives the transaction its stable view.

The visibility rules, precisely

The predicate is only a few lines, but its meaning deserves slow reading, because every subtlety of MVCC — snapshot isolation, write skew, SSI, vacuum safety — falls out of it.

Formally: a version v = (xmin, xmax, value) is visible to transaction tx iff:

  V1. v.xmin ∈ tx.snapshot — the creating transaction had committed before tx began.
  V2. v.xmax = None, or v.xmax ∉ tx.snapshot — no superseding transaction had committed before tx began.

V1 says the creator had committed before tx began. Snapshots contain committed xids only, so V1 fails if the creator had not yet started, was running, or had aborted. Why snapshots contain only committed xids: the snapshot models "what has actually happened" as of start time. Running transactions might yet abort; aborted ones never happened. Excluding both is what prevents dirty reads. Aborted transactions are never added to committed, so they are never in any snapshot — which is what "aborted" means.

V2 says either nobody has superseded v yet (xmax = None) or the supersession was not visible at snapshot time. Why xmax ∈ tx.snapshot makes v invisible: if the superseder had committed before tx started, the newer version already exists from tx's point of view — v is old news. tx reads the newer version, which V1 accepts on its own xmin. Conversely, if xmax is in committed now but was not in committed at tx's start, the supersession happened after tx began; the frozen snapshot does not see it.

In one sentence: a version is visible to a reader iff its lifetime interval — from the commit of its creator to the commit of its superseder — straddles the reader's snapshot point. That is the whole MVCC correctness story. Every subsequent subtlety is about writes, not visibility.

Why readers never block writers

The consequences of the visibility rule fall out immediately.

A read never acquires a transactional lock. The read method takes the internal _lock for microseconds — long enough to walk the chain — but that is data-structure latching, not transactional locking. It does not conflict with writers on the row.

A write never blocks a read. A writer holds _lock for the append only. The reader does not wait for the writer's transaction to commit; it reads whatever is visible under its own snapshot. The writer's new version is invisible until the writer commits and future readers' snapshots include the writer's xid.

Writer-writer conflict on the same row is the only real blocking. If T1 and T2 both update Alice, one marks xmax first; the other finds xmax already set and aborts (or waits for T1 to resolve, then retries). Postgres takes the wait-then-retry route at READ COMMITTED; the blocking scope is two transactions on one row — it does not cascade.

This asymmetry is MVCC's central claim. In 2PL, reads and writes share one lock space — a write lock excludes every reader. In MVCC, reads are predicate-only (visibility check) and writes are latch-protected appends. They do not share a lock. Writers contest xmax — not a lock manager.

The snapshot — how a transaction fixes its view

self.snapshot = set(store.committed) is correct but wasteful: on a billion-transaction history, it copies a billion-element set at every BEGIN. Production engines use a compact representation.

In Postgres, a snapshot is a triple (xmin, xmax, xip_list):

  • xmin — the oldest xid still running when the snapshot was taken; every xid below it has finished.
  • xmax — the first xid not yet assigned; every xid at or above it belongs to the future.
  • xip_list — the xids in progress at snapshot time; all of them fall in [xmin, xmax).

The visibility check for xid v.xmin becomes:

def xid_visible(xid: int, snap, committed: set[int]) -> bool:
    if xid < snap.xmin:
        return xid in committed                # finished long ago: commit or abort?
    if xid >= snap.xmax:
        return False                           # future transaction
    # xid is in [xmin, xmax) — check xip_list
    if xid in snap.xip_list:
        return False                           # was running at snapshot time
    return xid in committed                    # finished inside the bracket

A binary search over xip_list makes the check O(log n) in the number of currently-running transactions — typically tens to hundreds. A billion-transaction history has a snapshot size of O(concurrency), not O(history).

Why three-state works: xids are monotonic, so every past xid is committed, aborted, or running. The running ones are a small minority. xmin/xmax bracket where running xids could exist; xip_list enumerates them; everything outside the bracket is resolved by the normal committed set.

Snapshots are taken once per statement (READ COMMITTED) or once per transaction (REPEATABLE READ, SERIALIZABLE). Cost O(concurrency), visibility checks O(log concurrency). This is why MVCC's read path beats 2PL's: 2PL acquire-release is a lock-table lookup plus potential blocking; MVCC visibility is a snapshot membership test plus two set lookups.
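The three-state bracket can be sketched in a few lines. The Snapshot dataclass below is an illustrative stand-in for the Postgres struct, not its actual layout:

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    xmin: int            # oldest xid still running at snapshot time
    xmax: int            # first xid not yet assigned at snapshot time
    xip_list: set[int]   # xids running at snapshot time (xmin <= xid < xmax)

def xid_visible(xid: int, snap: Snapshot, committed: set[int]) -> bool:
    if xid < snap.xmin:
        return xid in committed   # finished before the bracket: commit or abort?
    if xid >= snap.xmax:
        return False              # assigned after the snapshot was taken
    if xid in snap.xip_list:
        return False              # was still running at snapshot time
    return xid in committed       # finished inside the bracket

# History: xids 1..9 assigned; 4 and 7 were running when the snapshot was taken.
committed = {1, 2, 3, 5, 6, 8, 9}
snap = Snapshot(xmin=4, xmax=10, xip_list={4, 7})

assert xid_visible(3, snap, committed)        # committed before the bracket
assert not xid_visible(4, snap, committed)    # running at snapshot time
assert xid_visible(5, snap, committed)        # committed, not in xip_list
assert not xid_visible(7, snap, committed)    # running at snapshot time
assert not xid_visible(12, snap, committed)   # future transaction
```

The snapshot stays O(concurrency) no matter how long the history: only the in-flight xids need enumerating.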

The cost — dead versions and vacuum

Nothing is free. Every update produces a new version and leaves the old behind. A table taking 10k updates per second accumulates 10k dead versions per second — 36 million per hour. Without reclamation, storage grows unbounded and every visibility scan takes longer as chains lengthen.

MVCC engines reclaim dead versions through a background process. The design differs by engine:

  • Postgres: VACUUM (and the autovacuum daemon) scans heap pages and removes tuples no snapshot can see, marking the space reusable.
  • Oracle: old versions live in undo segments, recycled automatically once their retention window expires.
  • InnoDB: purge threads delete undo-log records once no read view still needs them.

The common pattern: versions live until no snapshot can still see them — once the transaction that superseded a version has committed before every running transaction's snapshot, the version is unreachable.

Tuning matters. Under-configured autovacuum is the most common Postgres performance problem in the field. Heavy-update tables with lazy autovacuum accumulate bloat: heap and indexes grow, query plans degrade. Tuning autovacuum_vacuum_scale_factor to 0.05 on hot tables is standard practice. Transaction-ID wraparound is the other classical Postgres risk: xids are 32-bit, and without vacuum-freeze rewriting ancient xmin as FrozenXID, a row could eventually appear to be "in the future." A Postgres database that goes years without VACUUM stops accepting writes to protect itself.

Writer-writer conflicts — the only real blocking

The one situation MVCC makes you wait is two writers modifying the same row. T1 and T2 both want Alice's balance; both see the same current version v. T1 sets v.xmax = T1 first; T2 arrives and finds v.xmax already set.

T2's choice:

  1. First-writer-wins (abort): T2 raises WriteConflict immediately. This is what Postgres does at REPEATABLE READ and SERIALIZABLE; the application retries.
  2. First-writer-wins (wait): T2 waits for T1 to commit or abort. If T1 commits, T2 aborts (T2's snapshot excludes T1's write). If T1 aborts, T2 proceeds. This is what Postgres does at READ COMMITTED.

Either way, blocking is two transactions on one row. It does not cascade to readers of that row or writers of any other row. The worst case is a hot row (counter, sequence) — an application design problem, not an MVCC one.

Real OLTP workloads touch mostly disjoint rows — different users editing different profiles, different orders inserting different ids. Schema reviews flag the hot rows that do exist (counters, sequences, aggregates) and replace them with UUIDs, partitioned counters, or materialised views. MVCC's production success rests on this empirical fact: writer-writer row contention is rare in well-designed schemas.

Read-only transactions get the best of it

A transaction that only issues SELECT statements — a report, an analytics scan, a dashboard — pays almost nothing under MVCC. It takes a snapshot at BEGIN (O(concurrency), microseconds). It walks version chains with the predicate (O(chain length), typically 1 or 2 for well-vacuumed tables). It never acquires a lock. It never aborts — a read-only transaction cannot lose a write race.

This is why analytical queries run safely against OLTP primaries in MVCC systems. The analyst's hour-long query does not block transactions. The worst it does is hold back vacuum — if its snapshot is old, vacuum cannot reclaim versions superseded after the snapshot. Hence Postgres's hot_standby_feedback and Oracle's "snapshot too old" — different responses to the same tension between long reads and aggressive cleanup.

A concrete timeline — two transactions around a single update

Three transactions on a single row r, value initially v1. Track snapshot contents and what each transaction sees.

t=0:  T0 committed in the past; committed = {T0}; r's versions = [(T0, None, v1)].

t=1:  T1 begins. snapshot_T1 = {T0}. Running = {T1}.

t=2:  T2 begins. snapshot_T2 = {T0}. Running = {T1, T2}.
      T2's snapshot does NOT include T1 — T1 is still running.

t=3:  T1 writes r = v2.
      r's versions become:
        [(T0, T1, v1),    # old version, xmax=T1
         (T1, None, v2)]  # new version, xmin=T1

t=4:  T2 reads r.
      Walk versions newest-first:
        v=(T1, None, v2): V1 says T1 ∉ snapshot_T2 = {T0}, so invisible.
        v=(T0, T1, v1):   V1 says T0 ∈ {T0}: ok. V2 says T1 ∉ {T0}: ok.
      T2 reads v1. Correct — T1 is in-flight, so its write is invisible.

t=5:  T1 commits. committed = {T0, T1}. Running = {T2}.

t=6:  T3 begins. snapshot_T3 = {T0, T1}. Running = {T2, T3}.

t=7:  T3 reads r.
      Walk versions newest-first:
        v=(T1, None, v2): V1: T1 ∈ {T0, T1} ok. V2: xmax None, ok. VISIBLE.
      T3 reads v2.

t=8:  T2 (still running) reads r again.
      snapshot_T2 is frozen at {T0}. Re-run visibility:
        v=(T1, None, v2): V1: T1 ∉ {T0}, invisible.
        v=(T0, T1, v1):   V1: T0 ∈ {T0}, V2: T1 ∉ {T0}, visible.
      T2 reads v1 — same as at t=4. Repeatable read.

Three simultaneous readers, three internally consistent views of r:

  • T3, which started after T1 committed, sees v2.
  • T2, started while T1 was running, sees v1 throughout its lifetime.
  • T1 itself, after its write, sees v2 (own writes are visible to the writing transaction, a special case the visible() sketch omits: treat any version with xmin == tx.xid as visible).

No locks taken, no one blocked. T2's view is pinned by the snapshot, not by a lock. v1 is not reclaimable yet — T2's snapshot still needs it. Once T2 finishes, v1 becomes eligible for vacuum.

Figure: timeline of T1, T2, T3 on row r under MVCC. T1 runs from t=1 to t=5, writing v2 at t=3 and committing at t=5. T2 runs from t=2 past t=8 and sees v1 throughout. T3 starts at t=6 and sees v2. Version bars below show v1 live from t=0 (superseded at t=3 but still visible to T2) and v2 live from t=3 onward.
T1 writes v2 at t=3 and commits at t=5. T2, begun at t=2, sees v1 for its entire lifetime — its snapshot is frozen at the moment before T1 committed. T3, begun at t=6 after T1 committed, sees v2. Both versions coexist in storage; neither reader blocks the writer, and the writer does not block either reader.

Common confusions

  • "MVCC is an isolation level." It is a mechanism, not a level. READ COMMITTED, REPEATABLE READ, and SSI all run on the same version-chain machinery; they differ in when snapshots are taken and what is checked at commit.
  • "Readers cost nothing." Readers never block writers, but a long-lived snapshot holds back vacuum, so dead versions accumulate until it finishes.
  • "Snapshot isolation is serialisable." It is not — write skew is the counterexample, covered below.

Going deeper

The basics above cover 80% of what you will ever need to know about MVCC in a production system. The remaining 20% is where the research frontier sits — in-memory engines, HOT updates, and time-travel queries.

Hekaton — lock-free MVCC in memory

SQL Server's Hekaton in-memory engine (shipped 2014) implements MVCC with no lock manager at all. Every row is a pointer in a lock-free hash table; every update is a compare-and-swap on the row pointer that installs a new version. Readers walk version chains via atomic pointer loads; writers CAS the head of the chain. If the CAS fails (another writer got there first), the losing writer aborts. No lock acquisition, no condition variables, no kernel transitions — just atomic pointer operations.

Hekaton reports 30-100× throughput over SQL Server's disk-based engine on in-memory OLTP workloads. The secret sauce is not the algorithm (standard MVCC) but the implementation (lock-free, cache-line-aware, no latches on the fast path). The lesson: MVCC's theoretical lock-freedom for readers was, for decades, defeated by the storage-layer latches needed to read version chains safely. Lock-free data structures let the theoretical win materialise.

HOT updates in Postgres

Naive MVCC has an unpleasant consequence: every update creates a new heap tuple and a new index entry for every index on the table. A table with 5 indexes turns every UPDATE into 6 writes (heap + 5 indexes). Index bloat becomes a first-order problem.

Postgres's Heap-Only Tuple (HOT) optimisation (introduced in 8.3, 2008) addresses the common case: updates that do not change any indexed column. If the updated columns are none of the indexed ones, the new version can live in the same heap page as the old, and the old version's ctid points to it. Existing index entries still point to the old version; the heap chases the ctid to find the new one. Index maintenance is skipped entirely.

In practice HOT converts a huge fraction of OLTP updates to zero-index-write operations. The catch: it only works if the same page has room for the new version. Hence the Postgres recommendation to leave fillfactor below 100 on heavily updated tables — carve out free space for HOT updates upfront.

Time-travel queries

MVCC's version chains are a time-travel record of the database. CockroachDB exposes this directly: SELECT ... AS OF SYSTEM TIME '2024-01-01 12:00:00' returns a snapshot as of that time, using the same version-chain machinery; Google Spanner offers the equivalent through timestamp-bounded stale reads. Because every version is stamped with its commit timestamp, the query engine can pick the version that was current at the requested time, provided vacuum has not reclaimed it.

Postgres does not expose this out of the box — versions are identified by xid, not by wall-clock time, and there is no canonical "find the version at time T" API. Extensions like pg_dirtyread walk the heap directly to recover dead versions, but this is forensic recovery, not supported time-travel. The difference is architectural: systems designed for time-travel (CockroachDB, Spanner, Datomic) retain versions by default; systems designed for OLTP throughput (Postgres, Oracle, InnoDB) reclaim versions aggressively. The machinery is the same; the retention policy is a different axis.

SI is strictly weaker than serialisability

Snapshot isolation guarantees that every transaction sees a consistent snapshot and that no two transactions can concurrently write the same row (first-writer-wins on write). It does not guarantee serialisability. The canonical counterexample is write skew:

Two doctors, alice and bob, are both on call. An invariant requires at least one to remain on call. Each checks the table — SELECT COUNT(*) FROM duty WHERE oncall = true — sees the count is 2, concludes it is safe to sign off, and writes their own row to oncall = false. Both transactions ran against a snapshot where the invariant held; both wrote to disjoint rows (their own); snapshot isolation allows both commits. Post-commit, no doctor is on call.

No serial ordering of these two transactions could produce this outcome — whichever ran second would see count=1 and not sign off. But MVCC's visibility rule is local to each version's xmin/xmax; it has no machinery for detecting "your snapshot's predicates are now stale." The fix is serialisable snapshot isolation (SSI), which tracks predicate dependencies at commit time and aborts transactions whose snapshot predicates have been invalidated by concurrent writers. That is chapter 60.
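The anomaly plays out mechanically in a toy snapshot model. Plain dicts stand in for the table, and writes are buffered until commit — both are illustrative simplifications:

```python
# Each transaction reads from a snapshot of the table taken at BEGIN and
# buffers its writes until commit. The two writes touch disjoint rows, so
# first-writer-wins never fires and snapshot isolation admits both commits.
duty = {"alice": True, "bob": True}

class Tx:
    def __init__(self, table: dict):
        self.snapshot = dict(table)     # frozen consistent view at BEGIN
        self.writes: dict = {}

    def oncall_count(self) -> int:
        return sum(self.snapshot.values())

    def sign_off(self, who: str) -> None:
        self.writes[who] = False

    def commit(self, table: dict) -> None:
        table.update(self.writes)       # disjoint rows: no write conflict

t_alice, t_bob = Tx(duty), Tx(duty)     # both begin before either commits
assert t_alice.oncall_count() == 2      # alice: "safe to sign off"
assert t_bob.oncall_count() == 2        # bob concludes the same
t_alice.sign_off("alice")
t_bob.sign_off("bob")
t_alice.commit(duty)
t_bob.commit(duty)
assert sum(duty.values()) == 0          # invariant violated: no one on call
```

Each transaction's snapshot was consistent and each write was conflict-free under the row-level rule; only the cross-transaction predicate ("at least one on call") was violated — exactly the gap SSI closes.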

Where this leads next

You now have the core of MVCC: version chains, snapshots, the visibility predicate, writer-writer conflict detection, vacuum semantics. The rest of Build 7 fills in the remaining corners.

The conceptual arc across Build 7 is a ladder of trade-offs: global lock (chapter 52) for simplicity, 2PL (53) for fine-grained correctness, isolation levels (57) for throughput at known-weakness cost, MVCC (this chapter) for read scalability, SSI (60) for MVCC without write skew. No single point on the ladder is right for every workload. The point of working through the ladder is that every production database picks some combination — Postgres does MVCC + SSI, Oracle does MVCC + SI with explicit locking, InnoDB does MVCC + strict 2PL on writes — and understanding any of them means understanding the ladder.

MVCC is the step of the ladder where readers finally stop being penalised for writers existing. Everything after this chapter is about making the writers behave correctly at scale.

References

  1. Bernstein, Hadzilacos, Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley 1987 — chapter 5 is the original textbook treatment of multi-version concurrency control, formalising version chains, timestamp-based visibility, and the connection to serialisability.
  2. Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, A Critique of ANSI SQL Isolation Levels, SIGMOD 1995 — the paper that named snapshot isolation, explained why it is weaker than serialisability, and framed the write-skew problem that shaped the next decade of research.
  3. Wu, Arulraj, Lin, Xian, Pavlo, An Empirical Evaluation of In-Memory Multi-Version Concurrency Control, VLDB 2017 — measured comparison of the MVCC variants used by Postgres, Oracle, MySQL, SQL Server Hekaton, HyPer, and MemSQL across uniform workloads. Where the trade-offs come into focus empirically.
  4. PostgreSQL documentation, MVCC in Chapter 13: Concurrency Control — the production reference. Covers transactional semantics, the xmin/xmax/xip_list snapshot representation, and the explicit locking commands that supplement MVCC for the cases it does not cover.
  5. Oracle documentation, Data Concurrency and Consistency — Oracle's treatment of read consistency via undo segments, snapshot too old errors, and the SERIALIZABLE isolation level (which is actually snapshot isolation).
  6. Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 7 — the accessible engineering treatment of MVCC, snapshot isolation, write skew, and the production trade-offs. The best place to send a colleague new to concurrency control.