In short
Snapshot isolation (SI) is the isolation model MVCC naturally hands you. When a transaction calls BEGIN, the engine records a consistent point-in-time view of the database — every committed transaction up to that instant is visible, nothing after. Every read in that transaction goes through this snapshot, regardless of what other transactions commit in the meantime. Writes go into private version chains. At commit, the engine runs one check: first-committer-wins — if any row you modified was also modified by a concurrent transaction that committed before you, you abort. Otherwise you commit.
This single mechanism kills three of the four classical anomalies for free. Dirty reads vanish because snapshots only include committed versions. Non-repeatable reads vanish because the snapshot is frozen — the same row read twice in one transaction returns the same value, no matter what anyone else does. Phantoms vanish because the snapshot includes the set of rows that existed at BEGIN; concurrent inserts are invisible, so re-running the same predicate returns the same set.
But SI is not serialisable. It permits write skew: two transactions each read a set of rows, each make a decision based on what they read, and each update different rows in a way that is fine on its own but together violates an invariant the two reads established. Two doctors each decide they can clock out because the other is on call; both commit; the hospital goes dark.
The reason SI cannot catch this is structural, not a bug you can patch. First-committer-wins asks "did any transaction modify the same row I modified?". Write skew's writes touch different rows. The dangerous dependency is between a read and a write, not between two writes, and SI tracks only the latter. This chapter builds SI from scratch in Python, exhibits write skew, proves why first-committer-wins is blind to it, and sets up ch.60 on SSI, which closes the gap by tracking the missing read-write edges.
The hospital that goes dark
A hospital runs a shift-management system with one invariant every nurse knows by heart: at any given instant, at least one doctor must be on call. The doctors table has a row per doctor with a boolean on_call column. This evening there are two doctors on call, Alice and Bob. Both are exhausted. Both open the staff portal at the same time to clock out.
Each click triggers the same transaction:
BEGIN;
SELECT count(*) FROM doctors WHERE on_call = true;
-- application code: if count >= 2, OK to clock out
UPDATE doctors SET on_call = false WHERE name = <me>;
COMMIT;
Under snapshot isolation, both transactions begin within a millisecond of each other. Both snapshots see the same state — two doctors on call. Both applications see count = 2, which comfortably satisfies the invariant "leave at least one". Alice's transaction updates her row and commits. Bob's transaction, running concurrently, updates his row and commits. Neither update touched the row the other updated. First-committer-wins does not trigger.
The final state: Alice.on_call = false, Bob.on_call = false. No doctor on call. The invariant the application so carefully checked is violated. An hour later a patient crashes and there is nobody in the building to respond.
This is write skew. It is not a bug in the application — the checks are correct, the SQL is correct, the isolation level is documented as preventing every ANSI anomaly. It is a gap in the isolation model itself. Understanding why SI cannot catch this is the central point of this chapter.
What snapshot isolation actually guarantees
Snapshot isolation is three mechanical rules.
Rule 1 — snapshot at BEGIN. When T executes BEGIN, the engine records the set of transactions committed at that instant. Call this set T.snapshot. Every subsequent read in T sees only row versions whose creator (xmin) is in T.snapshot and whose overwriter (xmax), if any, is not. Why a set and not a timestamp: MVCC version chains are labelled with transaction IDs, and sets cleanly handle in-progress transactions that may commit in any order. A timestamp works too if IDs are assigned in commit order, but sets are what Postgres uses.
Rule 2 — snapshot is immutable. T.snapshot is taken once and never mutates. If S commits after T began, S is still not in T.snapshot — T never sees S's writes, even after a ten-minute pause. This eliminates non-repeatable reads and phantoms.
Rule 3 — first-committer-wins at commit. When T calls COMMIT, the engine checks each row in T.write_set: is there any version whose xmin committed after T.snapshot was taken? If yes, abort. If no, commit.
That is the whole mechanism. Reads are snapshot lookups, writes are staged, commit makes one pass over the write set.
The formal statement is narrower than the intuition. SI guarantees no two concurrent committed transactions modified the same row. It does not guarantee reads are consistent with other transactions' committed writes — only with the reader's own snapshot.
Why this is weaker than it sounds: "consistent with my snapshot" does not mean the world I decide about is the world I am writing into. If I read x=5 from my snapshot and commit y = x+1 = 6, but x has since become 10, I have written y=6 into a database where x=10. Each individual read and write is consistent — but the compound read-decide-write reasons about an x that is no longer current. Write skew is this gap.
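That gap fits in a few lines: reads resolve against the snapshot, never the current tip. A toy standalone sketch (not the engine built later in the chapter):

```python
# version history of x: (committing_xid, value), newest last
x_versions = [(1, 5), (2, 10)]

def read_x(snapshot: set[int]) -> int:
    # newest version whose creator is in the snapshot
    return next(v for xid, v in reversed(x_versions) if xid in snapshot)

snapshot = {1}            # tx 2 committed after our BEGIN
x = read_x(snapshot)      # 5, even though the current tip of x is 10
y = x + 1                 # decision based on x = 5
# committing y = 6 is "consistent with my snapshot", but it lands in a
# database where x = 10 — the read-decide-write gap
```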
Why SI kills the three ANSI anomalies
The ANSI SQL-92 standard names three anomalies (see ch.57). SI forbids all three as a consequence of its three rules.
Dirty read — eliminated by rule 1. T1.snapshot contains only committed transactions, so an uncommitted write from T2 (whose xmin = T2.xid is not in T1.snapshot) is invisible. The read resolves to an earlier committed version or finds none.
Non-repeatable read — eliminated by rule 2. The snapshot is frozen at BEGIN. Whichever version was visible the first time is visible the second time, regardless of intervening commits. Re-reads are always stable.
Phantom read — eliminated by rule 2 extended to predicate reads. A concurrent insert creates a version with xmin ∉ T1.snapshot, so it is invisible on every read. The predicate returns the same result set every time.
Why this phantom guarantee is stronger than lock-based REPEATABLE READ: 2PL prevents phantoms with range locks that block concurrent inserts. SI uses no locks — concurrent inserts proceed; they simply happen in a world the reader cannot see. Readers never block writers, writers never block readers, and phantoms are nonetheless gone. This is why multi-version reads felt revolutionary when Oracle brought them to market in the 1980s.
Three of the four classical anomalies, gone for the price of MVCC plus a commit-time check. The question is what is left.
Why SI does not kill write skew
Here is the precise place SI fails. First-committer-wins is a check on the write set. At commit, the engine asks: "did any concurrent transaction modify any row I modified?". The unit of comparison is a row.
In the doctor scenario, T1.write_set = {Alice} and T2.write_set = {Bob}. The rows are disjoint. The check passes for both.
The dependency that matters is not write-write. It is read-write. T1 read Bob (inside the count(*) predicate) and made a decision conditional on Bob being on-call; T2 then overwrote Bob. Symmetrically, T2 read Alice and T1 overwrote Alice. Each transaction made a decision based on a version of a row the other one later invalidated.
This is an rw-antidependency: edge T_a --rw--> T_b when T_a read a version T_b subsequently overwrote. The doctor case has two such edges — T1 --rw--> T2 (T1 read Bob, T2 wrote Bob) and T2 --rw--> T1 (T2 read Alice, T1 wrote Alice) — forming a cycle. The classical serialisability theorem (Bernstein, Hadzilacos, Goodman 1987) says a history is serialisable iff its dependency graph is acyclic; the doctor history has a cycle. SI permits it because its commit check sees only ww edges — it is structurally blind to rw edges.
Why you cannot just broaden first-committer-wins: if you aborted whenever a concurrent transaction overwrote a row you read, almost every transaction would abort under read-heavy contention. The right fix is to track rw-edges lazily and only abort on a dangerous pattern — two consecutive rw edges — which is SSI's job. That retrofit is the whole point of Cahill 2008.
Write skew, formally
The clean statement comes from Fekete, Liarokapis, O'Neil, O'Neil, Shasha 2005 ("Making Snapshot Isolation Serializable", ACM TODS):
Theorem (Fekete et al. 2005). Any non-serialisable history admitted by snapshot isolation contains a cycle in the dependency graph with at least two consecutive rw-antidependencies. The transaction at the meeting point of the two edges is the pivot.
The doctor case is the minimal instance: two transactions, two rw edges in opposite directions, either transaction is the pivot.
This is the cornerstone of SSI. To upgrade SI to serialisability you do not need to track every conflict — you need to detect just one pattern: two adjacent rw edges meeting at some pivot T_p. Aborting T_p (or a neighbour) restores serialisability. SI has zero rw-edge tracking; SSI adds exactly the tracking the theorem identifies as sufficient. Every other conflict SI already handles (ww via first-committer-wins, wr via snapshot reads). The rw gap is the entire gap.
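Under a simplifying assumption — every transaction is concurrent with every other, so any read of a key another transaction wrote is an rw edge — the pattern the theorem names can be detected in a few lines. A sketch; the per-transaction read and write sets are assumed given:

```python
def rw_edges(txns: dict[str, tuple[set, set]]) -> set[tuple[str, str]]:
    # txns maps name -> (read_set, write_set); all assumed concurrent.
    # Edge a --rw--> b when a read a key that b wrote.
    return {(a, b) for a, (ra, _) in txns.items()
                   for b, (_, wb) in txns.items()
                   if a != b and ra & wb}

def pivots(edges: set[tuple[str, str]]) -> set[str]:
    # a pivot has an incoming AND an outgoing rw edge
    return {b for _, b in edges} & {a for a, _ in edges}

doctors = {'T1': ({'alice', 'bob'}, {'alice'}),
           'T2': ({'alice', 'bob'}, {'bob'})}
edges = rw_edges(doctors)              # two rw edges forming a cycle
assert edges == {('T1', 'T2'), ('T2', 'T1')}
assert pivots(edges) == {'T1', 'T2'}   # either transaction is the pivot
```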
Write skew is therefore not a bug in SI but a design boundary. SI is cheap because it only tracks ww conflicts; its cost is that rw conflicts slip through.
Building SI in Python
Snapshot isolation is not hard to implement once you have MVCC. Take a simple MVCC store from the previous chapter and add the two SI-specific pieces: snapshot capture at BEGIN, first-committer-wins at COMMIT.
# concurrency/si_engine.py — snapshot isolation on top of MVCC
from dataclasses import dataclass, field

class AbortError(Exception):
    pass

@dataclass
class Version:
    xmin: int                 # creator transaction id
    xmax: int | None          # overwriter, or None if current tip
    value: object

@dataclass
class Tx:
    xid: int
    snapshot: set[int] = field(default_factory=set)           # committed-at-BEGIN
    writes: dict[str, object] = field(default_factory=dict)   # staged
The snapshot is a set of transaction IDs that were committed at BEGIN. The write set is a dict from key to the new value being staged.
# concurrency/si_engine.py — continued
class SIEngine:
    def __init__(self):
        self.rows: dict[str, list[Version]] = {}   # key -> version chain
        self.committed: set[int] = set()           # globally committed xids
        self.next_xid = 1

    def begin(self) -> Tx:
        tx = Tx(xid=self.next_xid, snapshot=self.committed.copy())
        self.next_xid += 1
        return tx

    def read(self, tx: Tx, key: str):
        # first check this tx's own staged writes
        if key in tx.writes:
            return tx.writes[key]
        # else walk the version chain for the newest version visible to tx
        for v in reversed(self.rows.get(key, [])):
            if v.xmin in tx.snapshot and (v.xmax is None or v.xmax not in tx.snapshot):
                return v.value
        return None

    def write(self, tx: Tx, key: str, value):
        tx.writes[key] = value
Reads walk the version chain and pick the newest version whose xmin is visible to this transaction and whose xmax (if any) is not visible. Writes are staged into the transaction's local dict — nothing hits self.rows until commit.
# concurrency/si_engine.py — commit with first-committer-wins
    def commit(self, tx: Tx):
        for key in tx.writes:
            for v in self.rows.get(key, []):
                # a version of this key was created by someone who committed
                # after tx began -> concurrent write-write conflict
                if v.xmin not in tx.snapshot and v.xmin in self.committed:
                    raise AbortError(f"first-committer-wins on {key}")
        # no conflict: materialise the writes into version chains
        for key, value in tx.writes.items():
            chain = self.rows.setdefault(key, [])
            if chain and chain[-1].xmax is None:
                chain[-1].xmax = tx.xid   # supersede current tip
            chain.append(Version(xmin=tx.xid, xmax=None, value=value))
        self.committed.add(tx.xid)
The commit loop is the entire SI validation. For each key you wrote, look at every version in the chain: is there a version whose creator committed after your snapshot was taken? v.xmin not in tx.snapshot means the creator was not yet committed at your BEGIN; v.xmin in self.committed means they have since committed. That is exactly the first-committer-wins condition. If any key triggers it, abort. Otherwise append your versions and mark yourself committed.
Why this simple check is enough to avoid dirty reads, non-repeatable reads, and phantoms but not write skew: dirty reads are avoided because read filters on v.xmin in tx.snapshot, which excludes uncommitted versions. Non-repeatable reads are avoided because the snapshot is frozen at begin — every read resolves against the same set, so the same key returns the same value. Phantoms are avoided for the same reason at the predicate level (any full-table scan returns only versions visible to the snapshot). Write skew is not avoided because the commit check iterates over tx.writes — keys I wrote — not over keys I read. A concurrent transaction that wrote a key I only read never triggers the check.
Forty lines of Python and you have the isolation model that was, until Postgres shipped SSI in 2011, the strongest level any mainstream MVCC engine offered without falling back to locking.
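First-committer-wins never fires in the write-skew scenario that follows, so it is worth seeing it fire once: aim two transactions at the same key, the classic lost update. A self-contained sketch — the engine is restated compactly here so the snippet runs on its own, with the same behaviour as si_engine.py above:

```python
from dataclasses import dataclass, field

class AbortError(Exception): pass

@dataclass
class Version: xmin: int; xmax: "int | None"; value: object

@dataclass
class Tx:
    xid: int
    snapshot: set = field(default_factory=set)
    writes: dict = field(default_factory=dict)

class SIEngine:                          # compact restatement of the engine above
    def __init__(self):
        self.rows, self.committed, self.next_xid = {}, set(), 1
    def begin(self):
        tx = Tx(self.next_xid, self.committed.copy()); self.next_xid += 1
        return tx
    def read(self, tx, key):
        if key in tx.writes: return tx.writes[key]
        for v in reversed(self.rows.get(key, [])):
            if v.xmin in tx.snapshot and (v.xmax is None or v.xmax not in tx.snapshot):
                return v.value
    def write(self, tx, key, value): tx.writes[key] = value
    def commit(self, tx):
        for key in tx.writes:
            if any(v.xmin not in tx.snapshot and v.xmin in self.committed
                   for v in self.rows.get(key, [])):
                raise AbortError(f"first-committer-wins on {key}")
        for key, value in tx.writes.items():
            chain = self.rows.setdefault(key, [])
            if chain and chain[-1].xmax is None: chain[-1].xmax = tx.xid
            chain.append(Version(tx.xid, None, value))
        self.committed.add(tx.xid)

db = SIEngine()
setup = db.begin(); db.write(setup, 'counter', 0); db.commit(setup)
T1, T2 = db.begin(), db.begin()
db.write(T1, 'counter', db.read(T1, 'counter') + 1)   # both stage counter = 1
db.write(T2, 'counter', db.read(T2, 'counter') + 1)
db.commit(T1)                                          # first committer wins
try:
    db.commit(T2)                                      # same key: must abort
except AbortError as e:
    print(e)                                           # first-committer-wins on counter
```

Same key in both write sets, so the commit check fires and one increment survives intact — contrast with the disjoint write sets of the doctor scenario below.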
Exhibiting write skew on the Python engine
Run the doctor scenario against the engine you just wrote.
# tests/test_write_skew.py
from concurrency.si_engine import SIEngine, AbortError

def test_doctor_write_skew():
    db = SIEngine()
    # setup: bootstrap both doctors to on_call=True
    setup = db.begin()
    db.write(setup, 'alice', True)
    db.write(setup, 'bob', True)
    db.commit(setup)

    T1 = db.begin()
    T2 = db.begin()
    # T1's decision: at least two on call? then I can clock out
    if sum([db.read(T1, 'alice'), db.read(T1, 'bob')]) >= 2:
        db.write(T1, 'alice', False)
    # T2's decision: symmetric
    if sum([db.read(T2, 'alice'), db.read(T2, 'bob')]) >= 2:
        db.write(T2, 'bob', False)

    db.commit(T1)   # succeeds: only key 'alice' in write set
    db.commit(T2)   # ALSO succeeds: only key 'bob' in write set

    check = db.begin()
    assert db.read(check, 'alice') is False
    assert db.read(check, 'bob') is False
    # invariant violated: no doctor on call
Step through commit(T1). T1.writes = {'alice': False}. The chain for 'alice' has one version — bootstrap, with xmin = setup.xid ∈ T1.snapshot. The check v.xmin not in tx.snapshot and v.xmin in self.committed evaluates to False. No abort. T1 commits.
Now commit(T2). T2.writes = {'bob': False}. The chain for 'bob' has only the bootstrap version, whose xmin ∈ T2.snapshot. No concurrent writer. T2 commits. Crucially, the chain for 'alice' — now holding T1's new version — is never inspected, because T2 never wrote 'alice'.
Final state: both False. Invariant violated by an isolation level that rejected every ANSI anomaly.
Other write skew scenarios in the wild
Write skew is not a hospital-only curiosity. The same pattern — overlapping reads, disjoint writes, a cross-row invariant — shows up in every class of application with shared state.
- Inventory booking. Two customers read qty = 1 for the same SKU, both insert an orders row. Under SI both commit; inventory is oversold. Reads are of items, writes are of orders; no row is written twice.
- Username reservation. Two signups run SELECT count(*) FROM accounts WHERE username = 'arjun', both see 0, both insert. Writes are of different new rows; first-committer-wins sees no collision.
- Financial invariants. Rule: two linked accounts' balances must sum to ≥ 0. T1 and T2 each read both balances, each withdraws from a different account judging the sum healthy; the combined withdrawal overshoots. Disjoint writes, violated invariant.
- Approval workflows. Each approver reads the approval count, sees it below threshold, inserts. Simultaneous approvers push the count past the threshold.
- Meeting-room booking. Two users check for time-range overlap, see none on their snapshots, insert. Both commit; the room is double-booked. Postgres can prevent this particular one via EXCLUDE USING gist, but not via isolation.
The pattern is always: read a set, decide using an invariant over the set, write a new or different row. As long as the writes don't collide on the same key, SI passes them all through.
The read-only serialisation anomaly
There is a subtler SI anomaly, exhibited by Fekete, O'Neil, and O'Neil in 2004: even a transaction that performs only reads can observe a state under SI that is inconsistent with every serial schedule. The construction requires three transactions and is worth understanding as a reminder that SI's gap is not about writes alone.
The minimal example has two accounts, x and y, both holding 0. A "depositor" T1 credits 20 to y. A "withdrawer" T2, concurrent with T1, withdraws 10 from x; in T2's snapshot x + y = 0, so the withdrawal overdraws the pair and incurs a penalty of 1, and T2 writes x = -11. A read-only "reporter" T3 begins after T1 commits but before T2 does, and reads x = 0, y = 20. SI lets all three commit: T1 and T2 wrote different rows, and T3 wrote nothing. But no serial order produces what T3 saw. T3's view places T1 before T2, so T2 must serialise after T1; a T2 running after T1 would see x + y = 20, incur no penalty, and write x = -10, not the -11 the database ends up with. The read-only transaction is itself what breaks serialisability: drop T3 and the remaining history is equivalent to T2 followed by T1.
The takeaway: write skew is the commonest SI anomaly but not the only one. The general statement from the theorem above — two consecutive rw-antidependencies in a cycle — covers this case too. SSI's detection catches it.
What you do about write skew under SI
Four strategies in decreasing order of elegance.
1 — Make the schema enforce the invariant. A UNIQUE(username) constraint turns username-reservation write skew into a constraint violation the DB catches regardless of isolation level. CHECK handles row-local invariants; Postgres's EXCLUDE USING gist handles range-overlap invariants. Every invariant you move from application code into the schema becomes invisible to write skew. Why this works: constraints are checked at commit against the materialised database, not any one transaction's snapshot — two concurrent inserts of the same unique key race, one lands first, the second's constraint check sees the winner and fails.
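A sketch of strategy 1 in Postgres terms — the table and column names are illustrative, not from any particular application:

```sql
-- username-reservation skew becomes a unique-constraint violation
CREATE TABLE accounts (
  id       serial PRIMARY KEY,
  username text NOT NULL UNIQUE
);

-- meeting-room double-booking becomes an exclusion-constraint violation
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for the = operator in gist
CREATE TABLE bookings (
  room   int,
  during tsrange,
  EXCLUDE USING gist (room WITH =, during WITH &&)
);
```

Both constraints are checked against the materialised database at write time, so two concurrent transactions race to the constraint and exactly one loses — regardless of isolation level.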
2 — Materialise the invariant into a shared row. For the doctor case, keep an on_call_count row that every toggle updates. Both clock-outs now write the same key; first-committer-wins fires cleanly; one aborts. You have reshaped a read-write conflict into a write-write conflict SI can see. The cost is the discipline of updating the summary row on every toggle, usually enforced with triggers.
3 — Explicit row locks on the read set. SELECT ... FOR UPDATE takes write locks on read rows, held until commit. Concurrent clock-outs serialise because each acquires the same locks. Cost: lock contention, potential deadlocks, and the requirement that every participating transaction uses the same lock protocol. Miss one caller and the skew sneaks in from the unlocked path.
4 — Upgrade to SSI. If your engine offers SSI (Postgres ≥ 9.1), flip the level and add a retry loop. Application code stays unchanged; the engine catches dangerous rw cycles at commit and aborts one side with SQLSTATE 40001. Overhead ~10-20%. This is chapter 60.
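Strategy 4's retry loop is mechanical. Sketched here against the chapter's AbortError as a stand-in abort signal; a real driver would catch its own serialisation-failure error (SQLSTATE 40001 in Postgres) instead:

```python
import random
import time

class AbortError(Exception): pass          # stand-in for SQLSTATE 40001

def run_with_retry(txn_fn, attempts=5):
    # retry with jittered exponential backoff; re-raise once the budget is spent
    for attempt in range(attempts):
        try:
            return txn_fn()
        except AbortError:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))

# hypothetical transaction that loses the rw-cycle race twice, then succeeds
calls = 0
def flaky_clock_out():
    global calls
    calls += 1
    if calls < 3:
        raise AbortError("serialisation failure")
    return "clocked out"

assert run_with_retry(flaky_clock_out) == "clocked out"
assert calls == 3
```

The transaction body must be safe to re-execute from BEGIN — no side effects outside the database before COMMIT returns.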
Prefer 1 and 4. Strategy 2 is a stopgap. Strategy 3 is where most applications actually land, because changing isolation levels is scary and invariants are often discovered after the schema is frozen.
The doctor scenario, full timeline and the rw cycle
Two transactions T1 (Alice's clock-out) and T2 (Bob's clock-out), running concurrently on Postgres REPEATABLE READ (which is SI).
Real time →
BOOTSTRAP ─╮
alice=true │
bob=true │
commit ─╯
T1.BEGIN ────────────╮ snapshot = {BOOTSTRAP}
read alice → true │ (from BOOTSTRAP's version)
read bob → true │
count = 2, ≥ 1, OK │
│
T2.BEGIN ─────────╮ │ snapshot = {BOOTSTRAP}
read alice → true│ │ (from BOOTSTRAP's version)
read bob → true│ │
count = 2, ≥ 1, OK│
│ │
write alice = false │ │ staged in T1.writes
write bob = false │ │ staged in T2.writes
T1.COMMIT → check alice chain │ │
only version is BOOTSTRAP, │ │
xmin ∈ snapshot, OK │ │
append (xmin=T1, value=false)│ │
commit ─╯ │
T2.COMMIT → check bob │
only version is │
BOOTSTRAP, xmin ∈ │
snapshot, OK │
append (xmin=T2, │
value=false) │
commit ─╯
POST-COMMIT STATE
alice chain: [BOOTSTRAP(true, xmax=T1), T1(false, xmax=None)]
bob chain: [BOOTSTRAP(true, xmax=T2), T2(false, xmax=None)]
read latest → alice=false, bob=false
invariant "at least one on call" — violated
The dependency graph has exactly the structure Fekete's theorem describes: two rw-antidependencies forming a cycle.
The repair under SI is any of the four strategies in the previous section. Under SSI the code above is literally unchanged; Postgres aborts T2 with SQLSTATE 40001 at commit time, the application retries, the retry sees count = 1 and correctly refuses to clock Bob out.
Common confusions
"SI is serialisable." No. SI forbids dirty reads, non-repeatable reads, and phantoms but admits write skew. Write skew is non-serialisable. SI is strictly weaker than serialisability.
"SI is the same as ANSI REPEATABLE READ." Depends which ANSI text and which engine. Postgres 9.1+ calls SI its REPEATABLE READ and reserves SERIALIZABLE for SSI. Oracle historically called SI its SERIALIZABLE. MySQL InnoDB's REPEATABLE READ is not SI at all — it is gap-lock-based. Never reason about isolation by level name across engines (see ch.57).
"Write skew is rare." Rare on TPC benchmarks. Common in applications with cross-row invariants — inventory, scheduling, balances, approvals, username reservation. In production under load, a top source of incident reports.
"Aborting on write-write conflict is enough." The mistake SI makes. First-committer-wins is a ww check; the doctor case has no ww conflict. A ww detector cannot catch rw dependencies, however clever.
"SELECT ... FOR UPDATE promotes SI to SSI." Only for the specific rows locked, and only if every participating transaction locks them. Miss one caller — a background job that updates doctors without the clock-out check — and skew sneaks in from the unlocked path. FOR UPDATE also cannot lock rows that don't exist yet (the username-reservation case).
"SI is bad because it admits write skew." SI is excellent — it made MVCC a practical engine architecture. Readers never block writers, writers never block readers, commits are O(write-set). The point is to know the one gap so you know when to reach for more.
Going deeper
The theory behind SI's gap and the techniques that close it.
Cahill, Röhm, Fekete 2008 — Serialisable Snapshot Isolation
"Serializable Isolation for Snapshot Databases" turns Fekete's 2005 theorem into a production algorithm. Maintain per-transaction in_conflict and out_conflict booleans, set them on every observed rw edge, abort any transaction whose both flags are set when it commits. Postgres adopted this in 9.1 (2011) in predicate.c, ~4000 lines of production SSI. Chapter 60 covers it in depth.
Adya 2000 thesis — the formal framework
Adya's MIT thesis reformulates isolation levels using dependency graphs over multi-version histories — precise, implementation-independent definitions for SI, serialisability, and everything between. Fekete 2005 uses Adya's framework; Cahill 2008 is justified within it.
Jepsen's write-skew tests
Kyle Kingsbury's Jepsen project has spent a decade testing distributed databases' isolation claims under partitions, clock skew, and node failures. Recurring finding: many vendors advertising "SERIALIZABLE" ship SI under the hood and fail write-skew tests. When a vendor claims serialisable, check whether Jepsen has tested it; if not, assume SI.
Where this leads next
SI is the weakest level that is also a useful default. Chapter 60 — Serialisable Snapshot Isolation (SSI) — closes the write-skew gap by adding the rw-edge tracking Fekete's theorem identifies as sufficient. Chapter 61 — How Postgres, Oracle, and InnoDB do MVCC differently — compares the engine-specific implementations of SI (and the levels layered on top), covering Postgres's version-chain-per-row, Oracle's undo-segment roll-forward, and InnoDB's cluster-index-with-hidden-columns approach. Chapter 62 closes Build 7 with the retry-and-backoff pattern every SERIALIZABLE application needs.
The arc of Build 7 has been: locking (too slow), optimistic validation (too aborty), MVCC (the middle path), snapshot isolation (what MVCC naturally gives), write skew (where SI falls short), SSI (the fix). After this you know enough to pick an isolation level on any production database and defend the choice.
References
- Berenson, Bernstein, Gray, Melton, O'Neil, O'Neil, A Critique of ANSI SQL Isolation Levels, SIGMOD 1995 — the paper that named write skew and snapshot isolation and showed ANSI SQL-92's definitions could not distinguish SI from true serialisability. Introduces the A5B write-skew anomaly formally. Free PDF from Microsoft Research.
- Fekete, Liarokapis, O'Neil, O'Neil, Shasha, Making Snapshot Isolation Serializable, ACM TODS 30(2), 2005 — the theorem: every SI anomaly contains a cycle with two consecutive rw-antidependencies. The theoretical foundation SSI is built on. Also proves that certain application-level constraints force SI histories to be serialisable even without runtime tracking.
- Cahill, Röhm, Fekete, Serializable Isolation for Snapshot Databases, SIGMOD 2008 — the SSI algorithm. ~10-20% overhead to upgrade SI to true serialisability. Implemented in Postgres 9.1 (2011). The canonical reference for the algorithm chapter 60 covers.
- Adya, Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions, MIT PhD thesis 2000 — the formal framework for isolation levels using dependency graphs over version orders. Precise, implementation-independent, the specification the ANSI standard should have been. Required reading for formal work on isolation.
- Kleppmann, Designing Data-Intensive Applications, O'Reilly 2017, chapter 7 "Transactions" — the clearest pitched-at-engineers treatment of SI and write skew in print. Worked examples covering the doctor on-call, meeting-room booking, and username-reservation patterns. Start here if the 1995/2000/2005 papers are too dense.
- PostgreSQL documentation, Transaction Isolation — the practitioner-facing writeup of SI and SSI as implemented in Postgres. Includes the exact SQLSTATE codes (40001 for serialisation failure), the retry guidance, and the read-only / deferrable optimisations unique to Postgres's SSI implementation.