In short

Serializable Snapshot Isolation (SSI) is the algorithm Postgres has used for its SERIALIZABLE isolation level since version 9.1 (2011). It was introduced by Cahill, Röhm, and Fekete at SIGMOD 2008 in a paper titled Serializable Isolation for Snapshot Databases. The entire design rests on one theorem from Fekete, Liarokapis, O'Neil, O'Neil, Shasha (2005): every non-serialisable execution under plain snapshot isolation contains, somewhere in its dependency graph, a cycle with at least two consecutive rw-antidependency edges meeting at a pivot transaction. That is the "dangerous structure" — necessary and sufficient for any SI anomaly, including every write skew you saw in chapter 59 and the read-only anomaly that caught everyone by surprise in 2004.

SSI turns that theorem into an implementation. Every running transaction carries two Boolean flags — inConflict (I have an incoming rw-antidependency) and outConflict (I have an outgoing rw-antidependency). Every row a transaction reads gets a lightweight SIREAD lock — not a blocking lock, just a marker saying "this transaction read this row at this version." When a concurrent transaction later overwrites a SIREAD-marked row, the engine detects the rw-edge and flips the two flags on the two transactions involved. When any transaction tries to commit with both flags set and at least one of its conflict partners has already committed or is committing, the engine aborts one of them. The pivot is killed before it can commit, the cycle is broken, the history is serialisable.

Overhead on plain SI is 10-20% for typical OLTP — detection runs in the MVCC read path (add to a predicate lock table) and the commit path (scan outgoing rw-edges). Readers still never block writers. Postgres's implementation lives in predicate.c at ~4000 lines; MySQL InnoDB and Oracle have not shipped SSI and still brand plain SI as SERIALIZABLE.

You finished chapter 59 with a working model of SI and a clean demonstration of write skew. Two doctors, each checking the count, each clocking out, hospital dark. You want true serialisability without paying strict 2PL's read-blocks-write tax on a workload that is 95% reads. Cahill's algorithm adds the minimal bookkeeping Fekete's theorem says is sufficient — not one instruction more — and gets you serialisability without giving up MVCC's central promise.

Why 2PL's price is too high — the motivation for SSI

The obvious route from SI to serialisability is to throw away MVCC and go back to two-phase locking: shared lock on every read, exclusive on every write, hold until commit. That works. Rigorous 2PL is provably serialisable. But a reporting query that scans a million rows now holds a million shared locks for seconds to minutes, and every row it touched is untouchable by any writer. On a read-heavy workload this is ruinous — chapter 58 measured MVCC beating 2PL by 8-10× on contended reads; throwing that away to get serialisability is surrender, not a trade.

SSI asks a sharper question. SI already has MVCC's read path (no locks, pure snapshot lookup) and first-committer-wins at commit. The only thing missing is detection of rw-edges, and Fekete's theorem says you don't even need to detect all of them — only the pivot pattern, two consecutive rw-edges meeting at a single transaction. SSI's bet is that this detection is cheap: two Boolean flags per transaction and a non-blocking "I read this row" marker. The bet pays off at 10-20% overhead over SI.

Why this matters architecturally: SSI is the only known way to get serialisability with readers that never block writers. Every other serialisable protocol — 2PL, timestamp ordering, optimistic concurrency with full validation — either locks reads or aborts heavily under load. SSI threads the needle by exploiting a structural property of SI histories Fekete identified.

The Fekete 2005 theorem

The paper you need to internalise is Making Snapshot Isolation Serializable by Fekete, Liarokapis, O'Neil, O'Neil, Shasha (ACM TODS 30(2), 2005). It contains the only result you need to know to understand SSI. Everything else follows mechanically.

Theorem (informal). Let H be any history under snapshot isolation. Construct the multi-version serialisation graph MVSG(H) with edges T_a → T_b for each dependency:

  • ww: T_a writes a version of x and T_b writes the next version of x.
  • wr: T_a writes a version of x and T_b reads that version.
  • rw (antidependency): T_a reads a version of x and T_b writes the next version of x — T_a must serialise before T_b even though neither blocked the other.

Then H is serialisable iff MVSG(H) is acyclic. Moreover — and this is the SSI-enabling part — every cycle in MVSG(H) for a history admitted by SI contains at least two consecutive rw-antidependency edges.

Call the transaction where the two rw-edges meet the pivot. In a path T_a --rw--> T_b --rw--> T_c, T_b is the pivot: it overwrote something T_a read (incoming rw-edge) and read something T_c then overwrote (outgoing rw-edge). T_b is the transaction that makes the cycle non-serialisable. Abort T_b and the cycle breaks.

A strengthening in the same paper: the dangerous structure produces an observable anomaly only for some commit orders — in every non-serialisable SI history the pivot's outgoing neighbour commits first of the three. SSI's detection fires regardless of commit order — conservative but correct.

Why "two consecutive" is the right granularity: a lone rw-edge T_a --rw--> T_b is harmless on its own — serialise T_a before T_b. And a cycle built only of ww/wr edges cannot arise under SI at all, because first-committer-wins and snapshot visibility force those edges to follow commit order. Only rw-edges can point "backwards in time," so any cycle SI admits must route through a vertex with both an incoming and an outgoing rw-edge — a pivot that overwrote something a concurrent transaction read, and whose own read was concurrently overwritten. Watching for that one local pattern therefore suffices to catch every anomaly.

This is the exact pattern from chapter 59's doctor example. T1 read Bob (T2 then wrote), T2 read Alice (T1 then wrote). Two rw-edges: T1 --rw--> T2 and T2 --rw--> T1. Either transaction is a pivot. SSI kills one.
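The dangerous-structure condition is easy to check mechanically. Here is a toy sketch — edge lists and transaction names are invented for illustration, not Postgres internals:

```python
# Toy dependency-graph check: edges are (source, target, kind) triples,
# kind in {"ww", "wr", "rw"}. A pivot is any vertex with both an incoming
# and an outgoing rw-antidependency -- the local pattern SSI's two flags track.
def pivots(edges):
    rw_in  = {t for s, t, kind in edges if kind == "rw"}
    rw_out = {s for s, t, kind in edges if kind == "rw"}
    return rw_in & rw_out

# Doctor example: each transaction read the row the other overwrote.
doctor = [("T1", "T2", "rw"),   # T1 read bob, T2 then wrote bob
          ("T2", "T1", "rw")]   # T2 read alice, T1 then wrote alice
assert pivots(doctor) == {"T1", "T2"}   # either one can be aborted

# A lone rw-edge followed by a wr-edge has no pivot -- serialisable.
assert pivots([("T1", "T2", "rw"), ("T2", "T3", "wr")]) == set()
```

The two set comprehensions are exactly what inConflict and outConflict approximate at runtime, without ever materialising the graph.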

SSI's runtime algorithm

Cahill's 2008 paper turns the theorem into code. The algorithm has four moving parts:

1. SIREAD locks on the read path. Every read records a SIREAD lock on the (transaction, row) pair. SIREAD is a misleading name — it does not block. It is metadata: "T read this row at this version." The table is indexed by row so a future writer can look up all readers of the row it is about to modify.

2. Two flags per transaction. Each running T carries T.inConflict (an rw-edge points into T — T overwrote something a concurrent transaction read) and T.outConflict (an rw-edge points out of T — something T read was overwritten by a concurrent transaction). Both start false; flags flip on the write path.

3. Detection on the write path. When W writes row r, the engine scans SIREAD locks on r and for each concurrent reader R records the edge R --rw--> W: set R.outConflict = True and W.inConflict = True.

4. Abort on commit. When T commits with both flags true AND at least one conflict partner has committed or is committing, T is the pivot (or part of one). Abort T. Return SQLSTATE 40001. Application retries.

Here is Cahill's algorithm in ~40 lines of Python, directly on top of the SI engine from chapter 59.

# concurrency/ssi_engine.py — SSI on top of the SI engine
from dataclasses import dataclass, field
from concurrency.si_engine import SIEngine, Tx, AbortError

@dataclass
class SsiTx(Tx):
    in_conflict:  bool = False        # incoming rw-edge observed
    out_conflict: bool = False        # outgoing rw-edge observed
    siread: set[str] = field(default_factory=set)   # rows this tx read
    commit_pending: bool = False

class SSIEngine(SIEngine):
    def __init__(self):
        super().__init__()
        self.live: dict[int, SsiTx] = {}                  # xid -> SsiTx
        self.siread_by_row: dict[str, set[int]] = {}      # row -> {xids}

    def begin(self) -> SsiTx:
        tx = SsiTx(xid=self.next_xid, snapshot=self.committed.copy())
        self.next_xid += 1
        self.live[tx.xid] = tx
        return tx

    def read(self, tx: SsiTx, key: str):
        value = super().read(tx, key)
        # record SIREAD lock; does NOT block
        tx.siread.add(key)
        self.siread_by_row.setdefault(key, set()).add(tx.xid)
        return value

    def write(self, tx: SsiTx, key: str, value):
        super().write(tx, key, value)
        # for every concurrent reader R of this row, mark rw-edge R --rw--> tx
        for rxid in list(self.siread_by_row.get(key, set())):
            # simplification: only live readers are checked; real SSI keeps
            # SIREAD entries around for a while after the reader commits
            if rxid == tx.xid or rxid not in self.live:
                continue
            reader = self.live[rxid]
            if tx.xid not in reader.snapshot:   # they are concurrent
                reader.out_conflict = True
                tx.in_conflict = True

    def commit(self, tx: SsiTx):
        tx.commit_pending = True
        if tx.in_conflict and tx.out_conflict:
            # pivot pattern: two consecutive rw-edges meet at tx
            self._release(tx); raise AbortError("SSI: pivot abort 40001")
        super().commit(tx); self._release(tx)

    def _release(self, tx):
        for k in tx.siread: self.siread_by_row.get(k, set()).discard(tx.xid)
        self.live.pop(tx.xid, None)

That is the complete algorithm — roughly forty lines inheriting from the SI engine of chapter 59. The only new machinery is the SIREAD table, the two flags, and the pivot check at commit.

Why tx.xid not in reader.snapshot is the concurrency test: a reader's snapshot contains exactly the transactions that were committed at the reader's BEGIN. If the writer's xid is not in that snapshot, the writer was not yet committed when the reader began — they overlap in time. That is the definition of "concurrent" under SI, and it is precisely when rw-edges matter: a reader that already sees the writer's commit is not on the snapshot side of an rw-edge, and a writer whose whole lifetime was before the reader began is not either.
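A minimal illustration of that test, with made-up xids (a snapshot here is just the set of xids committed at the reader's BEGIN):

```python
# A snapshot is the set of xids already committed when the reader began.
def concurrent(reader_snapshot: set[int], writer_xid: int) -> bool:
    # writer not in the snapshot => writer had not committed at the
    # reader's BEGIN => their lifetimes overlap
    return writer_xid not in reader_snapshot

snap = {1, 2, 3}                 # reader began after xids 1-3 committed
assert concurrent(snap, 7)       # xid 7 still running: concurrent
assert not concurrent(snap, 2)   # xid 2 finished first: not concurrent
```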

Real-world SSI differs from this sketch in two ways: it checks that a conflict partner has committed or is committing before aborting (fewer false positives than the sketch's bare two-flag test), and it can fire the abort as soon as the dangerous pattern completes rather than waiting for the pivot's own commit. The sketch's simpler rule is strictly more conservative — every history it admits is serialisable — at the cost of extra aborts.

Walking through the doctor example under SSI

Re-run the doctor scenario from chapter 59, this time on the SSI engine. T1 is Alice's clock-out; T2 is Bob's. Both begin concurrently with the bootstrap state alice = True, bob = True.

t=0. Bootstrap. BOOTSTRAP writes alice = True, bob = True, commits. committed = {BOOTSTRAP}.

t=1, t=2. T1 and T2 begin. Both snapshots are {BOOTSTRAP}. All flags false. Both are live and concurrent (neither's xid is in the other's snapshot).

t=3, t=4. Both read alice and bob. SIREAD locks: siread_by_row['alice'] = {T1, T2}, same for 'bob'. Both see True, True.

t=5. T1 writes alice = False. The write path scans siread_by_row['alice'] = {T1, T2}. T1 is itself, skip. T2 is concurrent. Record rw-edge T2 --rw--> T1. Set T2.out_conflict = True, T1.in_conflict = True.

t=6. T2 writes bob = False. Scan siread_by_row['bob']. T1 is concurrent. Record rw-edge T1 --rw--> T2. Set T1.out_conflict = True, T2.in_conflict = True. Both transactions now sit at the pivot of a two-edge rw-cycle.

t=7. T1 calls commit(). Both T1 flags true → abort with SQLSTATE 40001. Staged write to alice discarded. SIREAD locks released.

t=8. T2 commits. Postgres may or may not abort T2 too depending on exact conflict-partner state; the guarantee is that at most one of {T1, T2} persists. Say T2 commits.

t=9. T1 retries. Application catches serialization_failure, restarts T1 against a fresh snapshot that includes T2. T1 now reads alice = True, bob = False. Application logic sees count = 1, correctly refuses to clock Alice out. Invariant preserved.

The crucial property: application code did not change. No SELECT ... FOR UPDATE, no summary-row rewriting, no new constraints. Just a try/except around the transaction and a retry on 40001. SSI does the rest.

False positives — the cost of conservatism

SSI's detection is not precise. It catches every real pivot pattern — the theorem guarantees that — but it also catches some patterns that look like pivots but do not actually form a cycle in the final history. These are false positives: transactions aborted that would have committed safely under true serialisability.

The reason is that SSI flips the flags on any concurrent rw-edge, regardless of whether the resulting structure actually cycles. The engine cannot afford to compute the full dependency graph at runtime — that would cost as much as strict validation. So it settles for a local, over-approximate rule: if your inbound and outbound rw-flags are both set, you might be a pivot, abort to be safe.

In typical OLTP workloads the false-positive rate is under 5%. On pathological workloads — long-running reporting queries concurrent with OLTP writes, or write-heavy transactions touching overlapping read sets — the rate can climb to 20% or more. Applications using SSI must implement a retry loop. Postgres returns SQLSTATE 40001 (serialization_failure) and the client is expected to catch it, possibly with exponential backoff, and rerun the transaction.
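The retry loop is mechanical enough to sketch. SerializationFailure here is a stand-in for whatever exception your driver raises on SQLSTATE 40001, and the transaction body is a toy that fails once:

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for the driver error carrying SQLSTATE 40001."""

def with_retry(run_transaction, attempts=5, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return run_transaction()
        except SerializationFailure:
            # exponential backoff with jitter before retrying
            time.sleep(base_delay * (2 ** attempt) * random.random())
    raise RuntimeError("gave up after repeated serialization failures")

# Simulate a transaction that is pivot-aborted once, then succeeds.
calls = {"n": 0}
def txn():
    calls["n"] += 1
    if calls["n"] == 1:
        raise SerializationFailure("40001")
    return "committed"

print(with_retry(txn))   # prints: committed
```

The jitter matters under contention: if every aborted transaction retries on the same schedule, the same pivots re-form and abort again.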

Why exact detection is too expensive: precise pivot detection requires traversing the dependency graph at commit time to check for actual cycles, which on large read sets with many concurrent transactions is O(n²) or worse. Cahill's two-flag approximation is O(1) at commit — constant work per transaction — and the false-positive cost is empirically small. The paper reports that the approximation approach achieves within 10% of the throughput of an exact-detection variant on TPC-C, with far simpler implementation.

Postgres has shipped one specific optimisation for this: read-only deferrable transactions. SET TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE will wait (if necessary) until it can take a snapshot that is guaranteed not to contribute to any anomaly, then run without any SSI overhead at all. Useful for long analytical reports on a serialisable database.

The SIREAD lock is not a lock

The terminology in the SSI literature is unfortunate. "SIREAD lock" suggests blocking. It does not block. It is pure metadata: an entry in a hash table keyed by row, value "transaction T read this row." A writer looking up the row finds the entry, reads off the transaction ID, flips the rw-flags, and proceeds. Inserting and reading SIREAD entries are microsecond operations that participate in no compatibility matrix.

Compare to a 2PL shared lock: held in a lock manager tracking compatibility, wait queues, deadlock detection. A writer must wait for shared locks to release. Expensive and serialising.

SSI's SIREAD entries exist in a separate code path. A concurrent writer does not block on them; it uses them to update its flags and proceeds. This is precisely why SSI keeps MVCC's "readers never block writers" property while still achieving serialisability. In Postgres the data structure is called PREDICATELOCKTARGET, not LOCK, and it lives in an entirely separate manager from the shared/exclusive lock machinery. The word "lock" in "SIREAD lock" is historical baggage from Cahill's paper; mentally substitute "SSI read marker" and the model snaps into place.

Predicate reads and phantom prevention in SSI

A query like SELECT SUM(amount) FROM orders WHERE region='south' AND status='pending' reads a set defined by a predicate. Under pure SI, phantoms are handled by the snapshot — concurrent inserts are invisible. Under SSI you need more: track the rw-edge that would form if the concurrent insert had been visible, because semantically the reader "read" the phantom rows it would have returned.

SSI's solution: SIREAD locks on predicates, not just rows. Since storing arbitrary predicates is infeasible (deciding whether two arbitrary predicates overlap is undecidable in general), production engines take SIREAD locks on the physical access paths the query used. Postgres's approach: B-tree index scan → SIREAD on index pages touched; sequential scan → SIREAD on the whole relation. Any insert landing on one of those physical locations triggers an rw-edge.

Postgres adapts granularity dynamically. Default is tuple-level; when max_predicate_locks_per_transaction (default 64) is exceeded, aggregate to page-level; further pressure aggregates to relation-level. Classic locking-granularity trade-off: finer is more accurate with more memory, coarser is cheaper with more false positives.

Why coarsening is sound: aggregating fine-grained SIREAD entries into a single page-level entry can only increase the set of writes that trigger rw-edges, never decrease it. Any real pivot pattern is still caught; only false positives grow. Correctness preserved, precision degraded.
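The escalation rule itself fits in a few lines. This is a sketch with a hypothetical cap and entry layout, not Postgres's actual shared-memory structures:

```python
# Granularity escalation sketch: when a transaction holds too many
# tuple-level SIREAD entries on one page, collapse them into a single
# page-level entry. Coverage only grows, so no real conflict is missed.
CAP_PER_PAGE = 3   # stand-in for max_predicate_locks_per_* pressure

def escalate(entries):
    """entries: set of ('tuple', page, line) or ('page', page) markers."""
    by_page = {}
    for e in entries:
        if e[0] == "tuple":
            by_page.setdefault(e[1], set()).add(e)
    out = set(entries)
    for page, tuples in by_page.items():
        if len(tuples) > CAP_PER_PAGE:
            out -= tuples                 # drop the fine-grained entries
            out.add(("page", page))       # one coarse entry covers them all
    return out

locks = {("tuple", 7, i) for i in range(5)} | {("tuple", 8, 1)}
print(escalate(locks))   # page 7 escalated; page 8's lone tuple kept
```

Note the one-way trade: the escalated entry matches every write to page 7, including writes to tuples the transaction never read — exactly the extra false positives the text describes.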

Postgres's SSI implementation

Postgres's SSI lives in src/backend/storage/lmgr/predicate.c, about 4000 lines of C. State is distributed across three structures:

  • SERIALIZABLEXACT — one entry per serialisable transaction, holding the conflict flags and links to its rw-conflict partners.
  • PREDICATELOCKTARGET and PREDICATELOCK — shared hash tables mapping lockable objects (tuple, page, or relation) to the transactions holding SIREAD entries on them.
  • The RWConflict pool — the recorded rw-antidependency edges between pairs of serialisable transactions.

Memory is the main operational concern: on large scans, per-transaction SIREAD entries explode. Postgres caps them via max_predicate_locks_per_transaction and escalates to page-level or relation-level when exceeded. When the global pool exhausts, fallback is aggressive relation-level locking — more false positives but the system stays up.

Performance: Cahill reports 10-20% overhead over SI on TPC-C-like workloads. Ports & Grittner (VLDB 2012) echo this on Postgres — 10-15% for typical OLTP, higher for large-read-set workloads where predicate locks escalate. Oracle and MySQL InnoDB have not implemented SSI — Oracle's SERIALIZABLE is plain SI, InnoDB's is 2PL next-key locking. As of 2026, Postgres is the only major open-source OLTP database shipping true SSI.

Three transactions, one pivot, step by step

A finance application with a savings and a checking account owned by the same person. Invariant: sum of balances non-negative. Initial state: savings = 500, checking = 300, sum = 800.

Three concurrent transactions:

  • T1 (read-only reporter): reads both balances for a statement.
  • T2: reads both, withdraws 600 from savings if sum after withdrawal ≥ 0.
  • T3: reads both, withdraws 600 from checking if sum after withdrawal ≥ 0.

Under plain SI all three commit and sum ends at -400 — classic write skew between T2 and T3, with T1 possibly observing a state no serial order admits.

Under SSI:

  1. All three begin, flags false.
  2. All three read savings and checking. SIREAD tables now: savings → {T1, T2, T3}, checking → {T1, T2, T3}.
  3. T2 writes savings = -100. Scan SIREAD(savings): T1 and T3 concurrent. Edges T1 --rw--> T2 and T3 --rw--> T2. Set T1.out, T3.out, T2.in.
  4. T3 writes checking = -300. Scan SIREAD(checking): T1 and T2 concurrent. Edges T1 --rw--> T3 and T2 --rw--> T3. Set T1.out (already true), T2.out, T3.in.

State: T2 has in=out=True. T3 has in=out=True. T1 has out=True, in=False.

  5. T2 commits → both flags set → abort 40001.
  6. T3 commits — both flags set, but its pivot partner T2 just aborted. Cahill's canonical algorithm would commit T3; Postgres's implementation may still abort conservatively.
  7. T1 (read-only) commits safely — in stays false, pivot condition fails.
  8. Application retries T2. Fresh snapshot sees checking = -300. Sum after a 600 withdrawal would be -400; logic refuses. Correct.
[Figure: the pivot pattern — two consecutive rw-antidependencies meeting at T2. T1 (read-only reporter, out=T, in=F) and T3 (checking -= 600, in=T, out=T) flank the pivot T2 (savings -= 600, in=T, out=T). Arrows: rw T1 → T2 (T1 read savings, T2 wrote savings), rw T2 → T3 (T2 read checking, T3 wrote checking), plus the symmetric edges T3 → T2 and T1 → T3. Both of T2's flags set → pivot detected → abort T2, breaking the cycle.]
The pivot pattern. T2 sits at the meeting point of two rw-antidependencies: T1 and T3 both read rows that T2 overwrote (incoming rw), and T2 read a row that T3 overwrote (outgoing rw — symmetrically T2 also has the mirror edge into T3). T2.inConflict and T2.outConflict both become true by the time T2 tries to commit. SSI's pivot check fires and T2 is aborted with SQLSTATE 40001. The application retries T2 against a fresh snapshot and either succeeds (if the new state allows) or correctly refuses (as here). T1, a read-only transaction, never has an incoming rw-edge, so its pivot check fails and it commits safely — a built-in optimisation for readers.

Common confusions

  • "SERIALIZABLE means the same thing everywhere." It does not. Oracle's SERIALIZABLE is plain SI; InnoDB's is 2PL with next-key locking; only Postgres's SERIALIZABLE is SSI.
  • "SIREAD locks block." They never block anyone. They are markers a writer consults to flip the conflict flags; the word "lock" is historical baggage.
  • "A 40001 abort means the application did something wrong." It is the protocol working as designed — possibly on a false positive — and the correct response is to retry the transaction.
  • "SSI detects cycles exactly." It detects the local pivot pattern via two flags, which over-approximates: every real anomaly is caught, but some transactions that would have been safe are aborted.

Going deeper

Three corners of SSI worth understanding because they inform when to use it and when to reach for something else.

Read-only transactions for free

A read-only transaction never writes, so no concurrent reader can be on the wrong side of its write. Hence T.in_conflict stays false for any read-only T. The pivot condition requires both flags true; if in_conflict cannot become true, T cannot be the pivot, and SSI never aborts it for pivoting. Read-only transactions under SSI run with zero pivot-abort risk — they can still be conflict partners of writers, but they themselves are never the aborted one.
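This falls straight out of the flag rules. A minimal model — hypothetical edge lists, not engine code:

```python
# Minimal flag model: rw-edges are (reader, writer) pairs. inConflict is
# set only when tx appears as the WRITER of an edge -- which requires a
# write. A read-only transaction therefore can never satisfy the pivot
# condition (both flags true).
def flags(tx, edges):
    inc = any(w == tx for _, w in edges)   # tx overwrote someone's read
    out = any(r == tx for r, _ in edges)   # tx's read was overwritten
    return inc, out

edges = [("T1", "T2"), ("T2", "T3")]   # T1 is read-only: never a writer
assert flags("T1", edges) == (False, True)   # cannot be the pivot
assert flags("T2", edges) == (True, True)    # pivot: both flags set
```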

Postgres's READ ONLY DEFERRABLE goes further: the transaction waits until it can take a snapshot at a moment when no pivot-pattern involving it is possible, then runs with no SSI bookkeeping at all. Perfect for long analytical reports on a serialisable primary.

2PL comparison, sharpened

Under rigorous 2PL, reads take shared locks that block concurrent writers. Cost is proportional to read-set-size × writer rate, paid as latency. Zero aborts.

Under SSI, reads take non-blocking SIREAD markers; writers scan them to update flags. Same dimensional cost, but paid as work (constant-time per entry), not as queueing. Aborts arise only from pivot detection, bounded by the false-positive rate.

Qualitative swap: 2PL pays in latency with no aborts; SSI pays in aborts with no waits. Aborts are usually cheaper than waits because an aborted transaction releases its resources immediately — no convoy effects. This is why MVCC+SSI has eaten 2PL's lunch in OLTP since 2011.

Distributed SSI and why Spanner went a different way

SSI's pivot detection requires a consistent view of which transactions are concurrent. Across nodes, building that view is as hard as building a central transaction manager — the very bottleneck SSI avoided on a single node. Google Spanner chose TrueTime-based external consistency with Paxos + 2PC instead; CockroachDB uses Hybrid Logical Clocks plus optimistic timestamp advancement; YugabyteDB takes a similar approach. Distributed SSI has been studied (Gupta et al.) but has not reached wide production. Once you pay for a consensus layer, timestamp-ordering usually wins because it also gives real-time (strict-serialisable) guarantees that SSI does not. Single-node Postgres is the environment SSI was designed for and where it still dominates.

Where this leads next

You now have Build 7's concurrency story end-to-end: 2PL was fine-grained but locked reads; isolation levels traded safety for throughput; MVCC freed readers from locks; snapshot isolation was what MVCC naturally gave and revealed the write-skew gap; SSI closes that gap at 10-20% overhead with no reader blocking.

Chapter 61 (How Postgres, Oracle, and InnoDB do MVCC differently) compares the three production implementations: Postgres's version-chain-per-row with VACUUM, Oracle's undo segments with ORA-01555, InnoDB's cluster-index-with-hidden-columns. Chapter 62 closes Build 7 with the retry-and-backoff pattern every serialisable application needs when it catches a 40001.

Build 8 pivots to transport: connection pooling, the Postgres wire protocol, prepared statements, pipelining. Everything you have learned about MVCC and SSI becomes invisible through the network lens — what the client sees is a TCP connection that occasionally returns 40001 — but knowing what is happening underneath is what lets you debug it when it breaks.

References

  1. Cahill, Röhm, Fekete, Serializable Isolation for Snapshot Databases, ACM SIGMOD 2008 — the SSI algorithm. Introduces the two-Boolean pivot-detection scheme, quantifies overhead at 10-20% over plain SI, and demonstrates true serialisability without 2PL's read blocking. The canonical reference; read this first.
  2. Fekete, Liarokapis, O'Neil, O'Neil, Shasha, Making Snapshot Isolation Serializable, ACM TODS 30(2), 2005 — the theorem that makes SSI possible. Every non-serialisable SI history contains a cycle with at least two consecutive rw-antidependencies at a pivot transaction. The theoretical foundation chapter 59 foreshadowed and this chapter exploits.
  3. Ports, Grittner, Serializable Snapshot Isolation in PostgreSQL, VLDB 2012 — the Postgres implementation paper. Describes the predicate-lock manager, the memory-bounded granularity escalation, the read-only optimisations, and the measured overhead on real workloads. Essential reading before tuning Postgres SSI in production.
  4. Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, A Critique of ANSI SQL Isolation Levels, SIGMOD 1995 — the paper that named snapshot isolation and write skew, exposing the gap that SSI would close 13 years later. Background for why SSI was needed in the first place.
  5. PostgreSQL Wiki, Serializable Snapshot Isolation (SSI) — the practitioner-facing documentation of SSI as shipped in Postgres. Covers SQLSTATE 40001 handling, retry guidance, READ ONLY DEFERRABLE, and the specific behavioural differences between Postgres SERIALIZABLE and Oracle/MySQL SERIALIZABLE.
  6. Adya, Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions, MIT PhD thesis 2000 — the dependency-graph formalism Fekete 2005 uses to state the pivot theorem. If you want to work formally with isolation levels, Adya is the foundation.