In short
Percolator is the protocol Google built in 2010 to add incremental processing — and, almost as a side-effect, full ACID multi-row transactions — on top of Bigtable, a sharded key-value store with no transactional support past a single row. The insight has shaped a decade of distributed-database design: you do not need the storage layer to know about transactions. You can build the whole protocol in the client, encoding all of its state as ordinary KV writes, as long as the underlying store gives you single-row atomicity.
Each logical row is split across three column families. Data holds the value at a given start-timestamp. Lock holds an in-flight transaction's lock (transaction id, primary's location). Write holds a commit-time pointer from commit_ts back to start_ts — the entry that makes a write visible. A reader at timestamp T looks up the most recent Write entry with commit_ts ≤ T, follows the pointer to the matching Data cell, and returns the value. No read locks. Snapshot isolation falls out for free.
The protocol is an optimistic 2PC, driven entirely by the client, with one twist that resolves the blocking problem: one of the writes is designated the PRIMARY, and its lock is the commit point. Prewrite writes Data + Lock for every key (primary first, secondaries pointing back at the primary). Commit gets a commit_ts from the Timestamp Oracle, then writes the Write entry on the PRIMARY and removes its lock in a single atomic per-row operation — that instant is the transaction's commit. The client then commits the secondaries asynchronously; if it dies mid-way, the next reader who hits a stale secondary lock chases the primary pointer, sees the primary committed, and rolls forward the secondary. If the primary is still locked and timed out, the reader rolls back. Recovery is cooperative — every reader is also a janitor. The Timestamp Oracle is centralised but batched (millions of timestamps per second); in TiDB / TiKV it lives inside the Raft-replicated Placement Driver.
You finished chapter 109 with a working two-phase commit and chapter 110 with a clear-eyed view of why textbook 2PC blocks the moment the coordinator dies between phases. The standard production answer is "make the coordinator a Paxos- or Raft-replicated state machine" — Spanner and CockroachDB go that way and chapter 112 walks through it. But there is an older, cheekier answer, and it is the one that taught the industry that the storage layer does not need to know about transactions at all.
In 2010 Daniel Peng and Frank Dabek published the Percolator paper, describing the system that powered incremental updates to Google's web index. The web grows continuously; rebuilding the entire index from scratch on every crawl is wasteful. Percolator made the index incremental by giving the indexers a transactional view over Bigtable, so a small change (a new outbound link) could be applied as a multi-row ACID update. The paper, almost in passing, gave the world a recipe for distributed transactions on a non-transactional sharded KV — now the heart of TiKV / TiDB and conceptually informing every client-coordinated transaction system in production. This chapter teaches the recipe.
The shape of the problem
Bigtable is a sharded, log-structured KV. Its single transactional guarantee is single-row atomicity: a write that touches one row lands entirely or not at all. A write touching two rows is two independent operations; one can succeed and the other fail.
You want multi-row transactions on top of that, with snapshot isolation — point-in-time consistent reads, first-committer-wins on conflicts, no read locks. Why snapshot isolation rather than serialisability: serialisability requires tracking read-write dependencies, which is expensive on a distributed KV. SI is the strongest isolation level you can get cheaply on MVCC, and for the index-rebuild workload Percolator targeted, write skew was tolerable. The same trade-off shaped TiDB and CockroachDB defaults — SI by default, opt-in serialisable on top.
The Percolator move: encode the transaction protocol state inside the rows themselves, using extra column families the storage layer does not interpret. The storage layer thinks it is storing bytes. Clients, reading and writing those bytes by convention, run the protocol cooperatively. There is no transaction manager, no coordinator process. Every client coordinates its own transactions. Every reader is also a recovery agent for transactions other clients abandoned.
The three column families
Every row in a Percolator-managed table has three column families per logical column. If your schema has a balance column on row A, the storage physically holds A:balance.lock, A:balance.write, A:balance.data.
Lock CF — (start_ts) → (primary_pointer, kind). While a transaction is in-flight, it places a lock entry keyed by its own start_ts. The value names the primary of the transaction and a kind tag (PRIMARY or SECONDARY). A row with a non-empty Lock CF is "in doubt" — somebody has prewritten it but the outcome is not yet decided.
Write CF — (commit_ts) → start_ts. The visibility table. On commit, the transaction writes one entry per affected row of the form commit_ts → start_ts. The presence of this entry is what makes the write visible. Crucially, the Write entry stores only a pointer (the start_ts), not the value. Why split the visibility entry from the value: this is the MVCC trick. The value is written eagerly at prewrite (when commit_ts does not exist yet), and visibility is flipped on later by writing a tiny pointer keyed by commit_ts. Without the split you'd write the full value at commit time too, doubling write amplification.
Data CF — (start_ts) → value. The actual value, keyed by the writer's start_ts. Multiple versions accumulate; old ones are reclaimed by GC.
A reader at timestamp T asking for row A's balance does this: check A:balance.lock for any entry with start_ts ≤ T (if found, recover and retry); scan A:balance.write for the largest commit_ts ≤ T, follow its pointer; read A:balance.data at that key. No read locks. The reader's snapshot is defined entirely by T.
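That read path is small enough to sketch directly. In the sketch below, one cell's three column families are plain dicts keyed by timestamp — a toy shape for illustration, not Bigtable's API:

```python
def snapshot_read(cell, t):
    """Read one cell at snapshot timestamp t.
    cell = {"lock": {start_ts: lock_info},
            "write": {commit_ts: start_ts},
            "data": {start_ts: value}}"""
    # 1. A lock at or before t means the cell is in doubt: a real client
    #    would run recovery via the primary pointer and then retry.
    if any(lock_ts <= t for lock_ts in cell["lock"]):
        raise RuntimeError("in doubt: resolve the lock via its primary, retry")
    # 2. Newest committed version visible at t.
    visible = [c for c in cell["write"] if c <= t]
    if not visible:
        return None
    # 3. Follow the commit_ts -> start_ts pointer into the data CF.
    return cell["data"][cell["write"][max(visible)]]

cell = {"lock": {}, "write": {3: 2, 15: 10}, "data": {2: 2000, 10: 1500}}
print(snapshot_read(cell, 14))  # 2000 — the commit at ts 15 is not yet visible
print(snapshot_read(cell, 20))  # 1500
```

Note the reader never writes anything on the happy path; only the in-doubt branch triggers mutation, and only via recovery.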
The protocol — prewrite, then commit, with a primary
The protocol is a client-driven optimistic 2PC. The client is the only coordinator. There is no separate transaction manager process. The "decision point" of 2PC — the irrevocable instant at which the transaction switches from undecided to committed — is folded into a single atomic write on one specific row.
Step by step:
Step 1 — Get start_ts. The client asks the Timestamp Oracle (TSO) for a globally unique, monotonic timestamp — the snapshot the transaction reads from. Why a centralised TSO and not Lamport clocks: snapshot isolation requires a total order on commit timestamps across all clients and all rows. Local clocks (even loosely synchronised) can disagree enough that a transaction at start_ts=100 could miss a write committed at wall-clock 99 on a different machine. The TSO is the cheapest way to get a true total order. Spanner pays with TrueTime; Percolator pays with a replicated counter.
Step 2 — Prewrite the primary. The client picks one of the writes as the PRIMARY (deterministic choice, e.g. lex-min). For that key, it issues an atomic single-row op: only if there is no existing Lock and no Write entry with commit_ts ≥ start_ts does it write Data at start_ts and Lock at start_ts marking this cell as PRIMARY. Conditional failure → abort.
Step 3 — Prewrite the secondaries. For each remaining key, the same conditional write — Data + Lock at start_ts — but the Lock value is a pointer back to the primary. Any failure → abort and clean up.
End of prewrite: every row has a Data version and a Lock. Nothing is yet visible (no Write entry). The transaction is fully prepared.
Step 4 — Get commit_ts. Another TSO timestamp, by construction larger than every concurrent prewrite's start_ts.
Step 5 — Commit the primary. The irrevocable moment. The client runs an atomic single-row op on the primary's row: if the Lock at start_ts is still there, replace it with a Write entry commit_ts → start_ts. Lock gone, Write entry present, atomically. After this instant, the transaction is committed — readers at ≥ commit_ts see the new value on the primary, and the secondaries are now committable on demand. If the CAS fails (somebody removed the lock during recovery), the client must abort.
Step 6 — Commit the secondaries. For each, write Write@commit_ts = start_ts, delete Lock@start_ts. Async, idempotent, can be done in any order, can even be skipped — the next reader hitting a stale lock will clean it up.
The asymmetry between primary and secondary commit is the entire reason Percolator does not block. The primary commit is an atomic flip from "locked at start_ts" to "visible at commit_ts" on one row. Anyone reading that row can determine the outcome by inspection.
Recovery — every reader is a janitor
Suppose a client crashes after step 5 (primary committed) but before step 6 finishes. Secondaries still have stale Lock@start_ts entries. A reader at timestamp T who tries to read one finds the stale lock and cannot ignore it — the lock means "this is in doubt; you do not know which value to return".
The reader reads the lock's primary pointer, atomically inspects the primary's Lock and Write columns at the relevant start_ts, and decides one of three ways. If the primary has a Write entry pointing to start_ts and no Lock, the transaction committed at that Write entry's commit_ts: roll the secondary forward (write the same Write entry on the secondary, delete its Lock). If the primary has neither a Write entry nor a Lock, the transaction was aborted: roll the secondary back (delete the Data cell, delete the Lock). If the primary still holds its Lock, the transaction is in flight: when the lock is younger than the TTL (a few seconds), back off; when older, the client is presumed dead — atomically remove the primary lock (a conditional single-row op), which is the abort decision, then roll back the secondary.
Races are handled by single-row atomicity. Two readers might both decide to abort the same transaction; the CAS on the primary lock means exactly one succeeds, and both apply the same idempotent secondary cleanup.
Why this works without a separate recovery process: the primary lock + the Write entry on the primary together form a consensus on the outcome that any party can read. The single atomic write at step 5 (replace Lock with Write entry) is the moment of agreement. Before it, every reader sees an in-flight lock; after it, every reader sees the same committed answer. There is no possibility of disagreement because there is no replicated state — one row, one storage layer, one atomic operation.
Snapshot isolation falls out for free
A reader at timestamp T follows only Write entries with commit_ts ≤ T. The visibility cutoff is exactly the commit timestamp: every write with commit_ts ≤ T is visible — either its Write entry already exists, or the reader finds its prewrite lock (placed at start_ts < commit_ts ≤ T) and resolves it forward — and no write with commit_ts > T leaks in, because the reader does not follow it. That is point-in-time consistency.
Conflict detection — first-committer-wins — is also free. At step 2 (prewrite of primary), the conditional write rejects if any Write entry with commit_ts > start_ts exists on that row. So if any other transaction committed a change to your row after your snapshot was taken, your prewrite fails and you abort. Exactly the SI commit rule. (See the SI chapter for the broader story.)
What Percolator does not give you is freedom from write skew. SI permits write skew by construction; the hospital scheduling bug from chapter 59 happens on Percolator the same way it happens on Postgres-SI. Fixing it requires SELECT ... FOR UPDATE-style explicit row locks (TiDB supports this via the same Lock CF) or a stricter isolation level layered on top.
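The skew is easy to see against the prewrite rule itself. A toy sketch (dict cells, the hospital example from chapter 59; helper names are illustrative): first-committer-wins checks only the keys a transaction writes, so two transactions that read each other's rows but write disjoint rows both pass.

```python
def prewrite_conflict(cell, start_ts):
    # Percolator's per-key check: locked, or a commit landed at/after our snapshot.
    return bool(cell["lock"]) or any(c >= start_ts for c in cell["write"])

# Two doctors on call; the invariant "at least one on call" is read, not written.
rows = {"alice": {"lock": {}, "write": {3: 2}, "data": {2: "on-call"}},
        "bob":   {"lock": {}, "write": {3: 2}, "data": {2: "on-call"}}}

# Txn 1 (start_ts=10) read both rows, writes only alice -> off-call.
# Txn 2 (start_ts=11) read both rows, writes only bob   -> off-call.
t1_ok = not prewrite_conflict(rows["alice"], 10)  # passes: alice untouched since ts 3
t2_ok = not prewrite_conflict(rows["bob"], 11)    # passes: bob untouched since ts 3
print(t1_ok and t2_ok)  # True — both commit, and nobody is on call: write skew
```

Neither transaction's write set covers the row the other changed, so neither prewrite sees a conflict; the invariant both of them read is silently broken.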
Real Python — commit and recovery
Here is the core of the protocol. The KV is a dictionary of dictionaries simulating Bigtable's row-keyed CFs; in production the same logic runs over actual RPCs to TiKV or Bigtable.
import time, threading
from dataclasses import dataclass

# A single-row atomic CAS operation is the only primitive we assume.
# In Bigtable / TiKV this is provided natively.

LOCK_CF, WRITE_CF, DATA_CF = "lock", "write", "data"
LOCK_TTL = 5.0  # seconds

@dataclass
class LockValue:
    primary_row: str
    primary_col: str
    kind: str         # "PRIMARY" or "SECONDARY"
    wall_time: float  # for TTL checks during recovery

class TSO:
    """Timestamp Oracle — globally monotonic counter."""
    def __init__(self):
        self._t = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            self._t += 1
            return self._t

class KV:
    """Sharded KV with single-row atomicity. Each row holds three CFs."""
    def __init__(self):
        self.rows = {}   # row -> col -> {LOCK_CF: {ts: LockValue}, WRITE_CF: {ts: ts}, DATA_CF: {ts: value}}
        self.locks = {}  # row -> threading.Lock for single-row atomicity

    def _row_lock(self, row):
        return self.locks.setdefault(row, threading.Lock())

    def get_cell(self, row, col):
        return self.rows.setdefault(row, {}).setdefault(
            col, {LOCK_CF: {}, WRITE_CF: {}, DATA_CF: {}})

    def atomic(self, row, fn):
        with self._row_lock(row):
            return fn()

class PercolatorTxn:
    def __init__(self, kv, tso):
        self.kv = kv
        self.tso = tso
        self.start_ts = tso.next()
        self.writes = []  # list of (row, col, value)

    def get(self, row, col):
        cell = self.kv.get_cell(row, col)
        # 1. A lock at or before start_ts means the cell is in doubt:
        #    resolve it via the primary before reading. (A production
        #    client backs off and retries while the lock is still PENDING.)
        for lock_ts, lv in list(cell[LOCK_CF].items()):
            if lock_ts <= self.start_ts:
                resolve_lock(self.kv, row, col, lock_ts, lv)
        # 2. Find the largest commit_ts <= start_ts, follow to data.
        candidate = None
        for commit_ts, start_ts in cell[WRITE_CF].items():
            if commit_ts <= self.start_ts:
                if candidate is None or commit_ts > candidate[0]:
                    candidate = (commit_ts, start_ts)
        if candidate is None:
            return None
        return cell[DATA_CF].get(candidate[1])

    def set(self, row, col, value):
        self.writes.append((row, col, value))

    def commit(self):
        if not self.writes:
            return True
        primary = self.writes[0]  # any deterministic choice works
        secondaries = self.writes[1:]
        # Phase 1 — Prewrite primary.
        if not prewrite(self.kv, primary, primary, self.start_ts, "PRIMARY"):
            return False
        # Phase 1 — Prewrite secondaries.
        for w in secondaries:
            if not prewrite(self.kv, w, primary, self.start_ts, "SECONDARY"):
                rollback_all(self.kv, [primary] + secondaries, self.start_ts)
                return False
        # Phase 2 — Get commit_ts and atomically commit the PRIMARY.
        commit_ts = self.tso.next()
        if not commit_primary(self.kv, primary, self.start_ts, commit_ts):
            return False  # someone else aborted us
        # Phase 2 — Commit secondaries (best-effort; recovery handles failures).
        for w in secondaries:
            commit_secondary(self.kv, w, self.start_ts, commit_ts)
        return True

def prewrite(kv, write, primary, start_ts, kind):
    row, col, value = write
    p_row, p_col, _ = primary
    cell = kv.get_cell(row, col)

    def attempt():
        # Conflict checks: no lock at all, no committed write at/after start_ts.
        if cell[LOCK_CF]:
            return False
        for commit_ts in cell[WRITE_CF].keys():
            if commit_ts >= start_ts:
                return False
        cell[DATA_CF][start_ts] = value
        cell[LOCK_CF][start_ts] = LockValue(p_row, p_col, kind, time.time())
        return True

    return kv.atomic(row, attempt)

def commit_primary(kv, primary, start_ts, commit_ts):
    row, col, _ = primary
    cell = kv.get_cell(row, col)

    def attempt():
        if start_ts not in cell[LOCK_CF]:
            return False  # somebody removed our lock — txn was aborted
        cell[WRITE_CF][commit_ts] = start_ts
        del cell[LOCK_CF][start_ts]
        return True

    return kv.atomic(row, attempt)

def commit_secondary(kv, secondary, start_ts, commit_ts):
    row, col, _ = secondary
    cell = kv.get_cell(row, col)

    def attempt():
        if start_ts not in cell[LOCK_CF]:
            return  # already cleaned up by recovery — fine
        cell[WRITE_CF][commit_ts] = start_ts
        del cell[LOCK_CF][start_ts]

    kv.atomic(row, attempt)

def rollback_all(kv, writes, start_ts):
    for row, col, _ in writes:
        cell = kv.get_cell(row, col)

        def attempt(c=cell):
            c[LOCK_CF].pop(start_ts, None)
            c[DATA_CF].pop(start_ts, None)

        kv.atomic(row, attempt)

def resolve_lock(kv, row, col, lock_ts, lv):
    """Called by readers who hit a stale lock. Chases the primary pointer."""
    p_cell = kv.get_cell(lv.primary_row, lv.primary_col)

    def inspect_primary():
        # Look for a Write entry that points to lock_ts (i.e., committed at some commit_ts).
        for commit_ts, start_ts in p_cell[WRITE_CF].items():
            if start_ts == lock_ts:
                return ("COMMITTED", commit_ts)
        if lock_ts not in p_cell[LOCK_CF]:
            return ("ABORTED", None)
        # Primary is still locked. Check TTL.
        primary_lv = p_cell[LOCK_CF][lock_ts]
        if time.time() - primary_lv.wall_time > LOCK_TTL:
            del p_cell[LOCK_CF][lock_ts]  # removing the primary lock IS the abort
            return ("ABORTED", None)
        return ("PENDING", None)

    outcome = kv.atomic(lv.primary_row, inspect_primary)
    if outcome[0] == "COMMITTED":
        commit_ts = outcome[1]
        cell = kv.get_cell(row, col)

        def roll_forward():
            if lock_ts in cell[LOCK_CF]:
                cell[WRITE_CF][commit_ts] = lock_ts
                del cell[LOCK_CF][lock_ts]

        kv.atomic(row, roll_forward)
    elif outcome[0] == "ABORTED":
        cell = kv.get_cell(row, col)

        def roll_back():
            cell[LOCK_CF].pop(lock_ts, None)
            cell[DATA_CF].pop(lock_ts, None)

        kv.atomic(row, roll_back)
    # PENDING: caller backs off and retries

def cleanup_stale_locks(kv):
    """Background sweeper: scan all rows, resolve any expired locks."""
    for row, cols in list(kv.rows.items()):
        for col, cell in list(cols.items()):
            for lock_ts, lv in list(cell[LOCK_CF].items()):
                if time.time() - lv.wall_time > LOCK_TTL:
                    resolve_lock(kv, row, col, lock_ts, lv)
A few things to notice. commit_primary is the entire commit decision — the atomic transition from "lock present" to "Write entry present, lock removed" happens inside one kv.atomic(row, ...) call, a single-row CAS. It is the only step in the protocol that needs atomicity across more than one cell, and that atomicity spans only the column families of a single row. The secondaries are best-effort and idempotent. resolve_lock is the recovery routine, and it can be called by any reader, by a background sweeper, or by both racing on the same lock; the inspection of the primary is itself a single-row atomic operation, and the cleanup of the secondary is idempotent.
Worked example — the ₹500 transfer at the column-family level
Account A starts at ₹2,000 on row "A", column "balance". Account B starts at ₹500 on row "B". The application runs transfer(A, B, 500). Initial state (one earlier write to each row at start_ts=2, commit_ts=3): both rows have Data {2: <bal>}, Write {3: 2}, Lock {}.
Step 1 — start_ts = 10. Client gets start_ts = 10 from the TSO.
Step 2 — prewrite the primary (A). Client picks A as primary. The prewrite CAS on row A checks that Lock is empty and that no Write entry has commit_ts ≥ 10 — both hold — then atomically writes Data@10 = 1500 and Lock@10 = PRIMARY. Row A is now Data {2:2000, 10:1500}, Write {3:2}, Lock {10: PRIMARY}. Row B unchanged.
Step 3 — prewrite the secondary (B). Same conditional on row B, with a SECONDARY lock pointing back to A. Row B becomes Data {2:500, 10:1000}, Write {3:2}, Lock {10: SECONDARY → A.balance}. The transaction is fully prepared. Neither row's new value is yet visible — no Write entry at any commit_ts for either. A concurrent reader at T=11 looking at row A finds Lock@10 and either waits or triggers recovery.
Step 4 — commit_ts = 15. Another TSO call.
Step 5 — commit the primary atomically. commit_primary on row A: if Lock@10 is still there, atomically write Write@15 = 10 and delete Lock@10. After:
Row A: Data {2:2000, 10:1500} Write {3:2, 15:10} Lock {}
Row B: Data {2:500, 10:1000} Write {3:2} Lock {10: SECONDARY → A}
The transaction has committed. A reader at T=20 on row A finds Write entry 15 → 10, follows to Data@10, returns ₹1500. A reader at T=14 finds the largest Write with commit_ts ≤ 14 is 3 → 2, returns ₹2000. Pre-commit value still visible to old snapshots — exactly MVCC.
Step 6 — commit the secondary. Client runs commit_secondary on row B: write Write@15 = 10, delete Lock@10. Row B becomes Data {2:500, 10:1000}, Write {3:2, 15:10}, Lock {}. Money conserved across the snapshot pair T < 15 (old values) and T ≥ 15 (new values). Invariant holds.
The crash variant. Replay from the end of step 5, but the client is killed before step 6. State:
Row A: Data {2:2000, 10:1500} Write {3:2, 15:10} Lock {}
Row B: Data {2:500, 10:1000} Write {3:2} Lock {10: SECONDARY → A.balance}
A different transaction at T=20 reads row B, finds Lock@10, calls resolve_lock. The lookup at row A scans Write CF for any entry whose value equals 10, finds 15 → 10, returns ("COMMITTED", 15). resolve_lock then runs roll_forward on row B: write Write@15 = 10, delete Lock@10. The reader retries, finds Write entry 15 → 10, follows to Data@10, returns ₹1000. The transaction committed via the reader's recovery action; the original client is still dead, but the bookkeeping is complete.
The abort variant. Suppose the client crashes between step 3 and step 5 — both rows hold prewrite locks but no primary Write entry exists. After LOCK_TTL seconds, a reader hits row B, chases to row A, sees Lock@10 still present and expired. It atomically removes Lock@10 from row A. The primary now has neither a Lock nor a Write entry for start_ts=10, so the outcome is ABORTED. The reader rolls back row B (deletes Lock@10 and Data@10); a separate sweep eventually does the same on row A. No money moved, every reader after recovery sees consistent state. The crash that would block 2PC indefinitely resolves itself in seconds in Percolator, without operator intervention.
The TSO — centralised but batched
The Timestamp Oracle is a single point of contention: every transaction needs two timestamps. If the TSO is one machine handing out one timestamp per RPC, it caps global throughput.
The fix is batching. The TSO does not hand out one timestamp at a time — clients batch their requests, the TSO batches across clients, and one round-trip carries hundreds of timestamps. Google reported the TSO handing out 2 million timestamps per second on a single machine. TiDB's PD plays the same role at similar rates. For fault tolerance, in TiDB / TiKV the PD is a Raft-replicated cluster; the leader hands out timestamps, and a leader change rolls the timestamp counter forward by a safety margin. Why monotonicity across leader changes matters: snapshot isolation breaks if a transaction at start_ts=100 can ever see a write committed at commit_ts=99 from before its snapshot. If a new leader could hand out a smaller timestamp than the old one's last issued, this guarantee is lost. Persisting last_issued + safety_margin on takeover ensures monotonicity at the cost of small gaps in the sequence.
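The batching-plus-safety-margin idea can be sketched in a few lines. `BatchedTSO` below is hypothetical — a single-process stand-in where one "durable" step (here just a variable) reserves a whole block of timestamps, and recovery resumes past the persisted bound, trading small gaps for monotonicity:

```python
import threading

class BatchedTSO:
    def __init__(self, batch=1000):
        self._next = 0
        self._persisted_high = 0  # stand-in for the durably replicated bound
        self._batch = batch
        self._mu = threading.Lock()

    def next(self):
        with self._mu:
            if self._next >= self._persisted_high:
                # One durable write covers the next `batch` timestamps.
                self._persisted_high += self._batch
            self._next += 1
            return self._next

    def recover(self):
        # New leader: resume strictly past the persisted bound. Timestamps
        # in the gap are lost, but monotonicity is preserved.
        with self._mu:
            self._next = self._persisted_high

tso = BatchedTSO(batch=10)
print([tso.next() for _ in range(5)])  # [1, 2, 3, 4, 5]
tso.recover()
print(tso.next())  # 11 — a gap, never a repeat
```

The invariant that matters is the one the text states: no timestamp handed out after recovery may be smaller than any handed out before it. Gaps are harmless; regressions are fatal to SI.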
This is the bargain. You get distributed transactions over a sharded KV with cooperative recovery, in exchange for a hard dependency on a centralised (but replicated and batched) timestamp service. Spanner's answer to the same problem is synchronised wall clocks — chapter 112's story.
Where Percolator lives in production
TiKV / TiDB is the most prominent direct heir. TiKV is a Raft-replicated sharded KV; TiDB is the SQL layer on top. Their transaction model is Percolator with refinements: a pessimistic locking mode (SELECT ... FOR UPDATE acquires locks at read time), async commit (return success after primary commit, before secondaries land), and parallel commits. The Lock / Write / Data column families exist in TiKV as CF_LOCK, CF_WRITE, CF_DEFAULT.
CockroachDB is conceptually inspired by Percolator but uses a Hybrid Logical Clock instead of a centralised TSO, accepting small uncertainty windows for less coordination. The intent-resolution recipe — "if you find a write intent, look up the transaction record to decide its fate" — is the same shape as Percolator's primary-pointer recovery.
The pattern — KV that knows nothing about transactions, client library that encodes the protocol in extra column families — is general. Anyone building a transactional layer over an existing KV with single-key atomicity has Percolator as a working blueprint.
Common confusions
"The PRIMARY is a special node." No — the primary is a chosen key within the transaction, not a node. Any key in the write set can be the primary; the choice is a deterministic local decision by the client (e.g., lex-min).
"Recovery requires a separate process." No — every reader of a row that finds a stale lock is a recovery agent for that lock. A background sweeper helps with rows that are not being read, but the protocol is correct without one.
"Two readers race on the same lock." They might, and it is fine. Both look at the primary; both read the same outcome; both apply the same idempotent fix to the secondary. Worst case, two harmless duplicate writes.
"This is the same as 2PC." It is a 2PC, optimistic and client-driven, with the commit decision encoded as a single-row atomic write rather than a coordinator log fsync + broadcast. Textbook 2PC blocks because the coordinator's decision exists only in the coordinator's log; if the coordinator is gone, no one can read the decision. Percolator's decision exists in the primary row, which any reader can inspect — that is the structural reason it does not block.
Going deeper
Why Percolator's commit point is structurally non-blocking
Recall from chapter 110 that 2PC blocks because of the gap between deciding to commit and broadcasting the decision. If the coordinator dies in that gap, no participant can determine the outcome — they all have prepared state but no decision, and the decision lives only in the coordinator's log, which is unreachable.
Percolator collapses that gap to zero. The "decision" and the "first piece of broadcasting" are the same single-row atomic write on the primary. There is no instant at which the transaction is decided but the decision is invisible to other parties — the decision is the visibility. Every other party (every reader of the primary) can determine the outcome by inspection.
The cost: the protocol assumes the storage layer guarantees single-row atomicity. Bigtable does (each row is served by exactly one tablet server, which applies a row mutation atomically); TiKV does (via Raft on a region); any KV that does not provide this primitive cannot host Percolator. The protocol pushes the atomicity requirement down the stack onto a per-row primitive that distributed storage can already provide cheaply, instead of requiring an across-row primitive that nothing provides cheaply.
Async commit and the "minimum commit ts"
A production refinement TiDB / TiKV ship is async commit: the client returns success to the user once all prewrites have succeeded, without waiting for the commit phase at all. (Base Percolator already commits the secondaries asynchronously; async commit moves the user-visible success even earlier, to the end of prewrite.) This roughly halves perceived latency on multi-key transactions. It requires one careful addition: the primary's lock value encodes the list of all secondary keys and a min_commit_ts. A reader who later hits a leftover lock can determine the outcome without contacting the client — it follows the lock to the primary, reads the secondary-key list, and checks whether every key was successfully prewritten; if so, the transaction is committed at the largest min_commit_ts across its keys, otherwise it is aborted. The protocol's recovery story carries through unchanged.
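A much-simplified sketch of async-commit lock resolution from prewrite state alone. `AsyncLock` is a hypothetical record, and real TiKV is subtler (a missing lock can mean "already resolved" rather than "never prewritten", and min_commit_ts interacts with reads); this only shows the shape of the decision:

```python
from dataclasses import dataclass, field

@dataclass
class AsyncLock:
    min_commit_ts: int
    secondaries: list = field(default_factory=list)  # filled in on the primary only

def resolve_async(locks, primary_key):
    """locks maps key -> AsyncLock for every key whose prewrite landed."""
    primary = locks.get(primary_key)
    if primary is None:
        return ("ABORTED", None)  # no primary prewrite -> cannot have committed
    keys = [primary_key] + primary.secondaries
    if any(k not in locks for k in keys):
        return ("ABORTED", None)  # some prewrite never landed
    # Every prewrite is durable: committed, at the largest min_commit_ts.
    return ("COMMITTED", max(locks[k].min_commit_ts for k in keys))

full = {"k1": AsyncLock(20, ["k2", "k3"]), "k2": AsyncLock(22), "k3": AsyncLock(21)}
print(resolve_async(full, "k1"))                    # ('COMMITTED', 22)
print(resolve_async({"k1": AsyncLock(20, ["k2"])}, "k1"))  # ('ABORTED', None)
```

The key shift from base Percolator: the commit point is no longer a single write on the primary but "all prewrites durable", which is why the primary lock must name every secondary.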
Long-running reads and GC
Percolator runs a background garbage collector that drops Data and Write entries older than the oldest live snapshot — the minimum start_ts across all in-flight read transactions. A long-running read holds back GC for the entire database. TiDB exposes a tidb_gc_life_time knob with a 10-minute default. Cluster ops teams hate this knob exactly the way Postgres ops teams hate vacuum. Same trade-off in different clothing.
Why optimistic by default
Percolator's prewrite acquires locks only at commit time, not at read time. Two transactions that read the same rows but write different ones never see each other; conflicts surface only at prewrite. For Google's index-rebuild workload — rare conflicts, latency-sensitive reads — optimistic was the right default. TiDB later added a pessimistic mode (acquire locks at read time, like 2PL on the prewrite path) for OLTP workloads where conflicts are common; the protocol is the same, only the lock-acquisition timing changes.
Where this leads next
You have seen one of the two great answers to the 2PC blocking problem: encode the decision in the storage so any reader can recover it, with help from a centralised timestamp service. Chapter 112 takes the other path: make the coordinator a Paxos-replicated state machine, and replace the centralised timestamp service with synchronised wall clocks (TrueTime). That is Spanner's answer, and it gives you external consistency — a stronger guarantee than Percolator's snapshot isolation.
The lesson Percolator left the industry with is bigger than the protocol itself. You do not need to design transactions into your storage layer. You can build them in the client, encode their state in the data, and use single-row atomicity as the only primitive. That insight made TiKV / TiDB possible; it informs CockroachDB; it shows up in FoundationDB's client design. Once you see it once, you see it everywhere.
References
- Peng and Dabek, Large-scale Incremental Processing Using Distributed Transactions and Notifications, OSDI 2010 — the original Percolator paper. Sections 2 and 3 are the protocol; Section 4 is the notifications layer that powered the incremental web index. Read the protocol section twice; the prewrite/commit asymmetry only clicks on the second pass.
- PingCAP, TiDB Optimistic Transaction Model and TiKV Percolator — the production implementation guide. Includes the modifications TiDB made (pessimistic mode, async commit, parallel commits) and operational notes on Lock CF cleanup.
- Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 2006 — the storage substrate Percolator runs on. Section 6 covers the single-row atomicity guarantee that makes Percolator's commit point work.
- Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — the contrast read. Spanner replaces Percolator's TSO with TrueTime and gets external consistency in exchange. Read after this chapter to see the trade-offs concretely.
- Cockroach Labs, CockroachDB vs TiDB — a direct architectural comparison of the two Percolator-influenced systems, with a careful discussion of the HLC-vs-TSO trade-off and intent resolution.
- Huang et al., TiDB: A Raft-based HTAP Database, VLDB 2020 — the academic paper on TiDB, with Section 3 covering the transaction model and the production refinements layered on top of base Percolator.