Two-Phase Commit: The Protocol

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A bank transfer of ₹500 from Account A on shard 1 to Account B on shard 2 must be atomic — both updates land or neither does. Single-shard atomicity is free (the local WAL handles it). Cross-shard atomicity needs a protocol, and the textbook one is two-phase commit (2PC).

The cast: one coordinator that drives the protocol, and N participants (the shards / resource managers that hold the data). The protocol has exactly two phases.

Phase 1 — Prepare. The coordinator sends PREPARE to every participant. Each participant performs the write locally, durably writes a PREPARE log record (this is the load-bearing fsync of the entire protocol), takes whatever locks it needs to keep the prepared state intact, and votes YES if it can promise to commit on demand, NO otherwise.

Phase 2 — Commit/Abort. The coordinator collects all votes. If every vote is YES, the coordinator durably writes a COMMIT decision record to its own log (the second load-bearing fsync) and then sends COMMIT to every participant. If any vote is NO, or any participant times out, the coordinator writes nothing and sends ABORT to every participant. On receiving the verdict, each participant applies it, releases locks, and acknowledges.

Two fsyncs carry the entire correctness argument. A participant that votes YES has promised — across a crash, across a restart, across an arbitrary delay — to be able to commit when the coordinator says so. The coordinator that has flushed its COMMIT decision has irrevocably decided; if it crashes before broadcasting, recovery re-broadcasts. The presumed-abort optimisation says the coordinator only persists commit decisions, not aborts: if a participant later asks "what was the outcome of T?" and the coordinator has no record, the answer is ABORT.

The protocol is correct on the happy path. It has one ugly failure: if the coordinator crashes after Phase 1 but before broadcasting the decision, every prepared participant is stuck holding locks and cannot decide on its own. That is the blocking problem, and it is what the next chapter (110) takes apart in detail. This chapter teaches the protocol; the next chapter teaches its limit.

You read chapter 98 and accepted that sharding does not give you cross-shard atomicity for free. The CFO wants the ₹500 transfer to either land on both sides or land on neither side. Today you will write the protocol the textbook gives you for that — two-phase commit, first described by Jim Gray in 1978, formalised by Bernstein in the 1980s, standardised as X/Open XA in the 1990s, and still the literal protocol that runs underneath every distributed transaction in MySQL XA, PostgreSQL prepared transactions, Oracle distributed transactions, and Java's JTA.

The protocol is short. The implementation is shorter than you expect — a few hundred lines of Python. The reason it has filled half a textbook for forty years is not the happy path; it is the corner cases. This chapter shows you the protocol and gets you to a working coordinator + two participants doing the ₹500 transfer. Chapter 110 then takes the protocol apart at the failure modes — and that is where the protocol's reputation comes from.

The problem, concretely

Account A lives on shard 1. Account B lives on shard 2. They are different physical machines, possibly in different racks, possibly in different availability zones. The application says: debit ₹500 from A and credit ₹500 to B. The invariant the bank cares about is money conservation — the sum of A and B before the transfer equals the sum after.

Naive code:

def transfer(a, b, amount):
    shard1.debit(a, amount)   # local txn on shard 1
    shard2.credit(b, amount)  # local txn on shard 2

Most days both succeed. Some days the second line raises — shard 2 is down, the network ate the request, the lock was held too long. Now ₹500 has vanished from the books. The reconciliation team finds it the next morning and the operations director writes you a memo.

You cannot fix this with retries alone. If shard 2 is genuinely unreachable, you cannot retry forever; while you wait, A is short and B is unchanged. Why retries are insufficient: retries solve transient failure, not partial commitment. The window between the two local commits is unprotected — a third party reading the books in that window sees an inconsistent state. Atomicity is about the window, not just about the eventual outcome.

What you need is a protocol that gives both shards a chance to refuse before either of them commits, and a single point that decides the outcome for everyone. That is two-phase commit.

The cast

The coordinator drives the protocol; the participants do the work. Two messages out, votes back, decision message out, acks back. The two dashed boxes that say "fsync" carry the entire correctness argument — every other step is recoverable from those two durable records.

There are exactly two roles in 2PC.

The coordinator is the process that initiates the transaction and drives it to a decision. In a sharded SQL system this is usually the proxy that received the SQL from the application; in MySQL XA it is the application itself; in JTA it is the transaction manager (Atomikos, Narayana, Bitronix). The coordinator does not hold any of the data; it only holds the protocol state.

The participants are the data shards — sometimes called resource managers in the XA spec, cohorts in the original Gray paper. Each participant owns the rows it is responsible for, runs its own local transaction system (its own WAL, its own lock manager), and exposes two RPC handlers: handle_prepare(txn_id, ops) and handle_commit(txn_id) (plus handle_abort).

That is the entire cast. No clock, no quorum, no leader election. Just one coordinator and N participants.

Why one coordinator and not a quorum: 2PC's job is to make N independent log managers agree on one bit (commit or abort). The coordinator is the place where that bit is decided. Replacing the coordinator with a quorum is exactly what consensus-based commit (Paxos Commit, the rest of Build 14) does — and it is what makes the protocol non-blocking. For the classical protocol you only need one coordinator, and the price you pay for that simplicity is the failure mode chapter 110 will dissect.

Phase 1 — Prepare

The coordinator opens by tagging the transaction with a globally unique id T and recording a BEGIN(T, participants={p1, p2}) entry in its own log. This entry says "I have started transaction T, these are the participants involved." The coordinator does not yet need to fsync this — it is just bookkeeping.

Then the coordinator sends PREPARE(T, ops) to each participant. The ops payload contains the actual writes that participant must perform: for shard 1, "debit A by 500"; for shard 2, "credit B by 500".

Each participant, on receiving PREPARE(T, ops):

Acquires the locks the operations need (row lock on A for shard 1, row lock on B for shard 2). If a lock cannot be acquired in time, the participant votes NO.
Performs the operation against its in-memory state (or its undo/redo log machinery, depending on whether the engine is steal or no-steal — the WAL chapter covers the distinction).
Writes a PREPARE(T, ops) record to its WAL and fsyncs it. This is the load-bearing step.
Replies VOTE-YES to the coordinator.

If anything in steps 1–3 fails — the row is missing, the balance would go negative, the disk is full, the lock cannot be acquired — the participant writes nothing durable, releases any locks it grabbed, and replies VOTE-NO.

Why the prepare record must be durable before voting YES: voting YES is a promise. The participant is telling the coordinator "no matter what happens to me — I crash, I restart, I lose memory — I will be able to commit T when you tell me to." That promise is only honourable if the prepared state survives a crash, and the only way to make state survive a crash is to fsync it. If the participant voted YES on the basis of in-memory state and then crashed, on restart it would not know about T at all; the coordinator might receive the YES, decide to commit, broadcast COMMIT, and the participant would have no idea what to do. Durability before vote is the contract.

The participant is now in the prepared state. This is a strange in-between state — the changes are not yet committed (no COMMIT record in the WAL), but the changes are also not abortable on the participant's own initiative. The locks are held. Other transactions that need those rows wait. The participant has handed over its decision authority to the coordinator and is waiting for instructions.

Phase 2 — Commit or abort

Success: every participant votes YES, the coordinator durably records its commit decision, and broadcasts COMMIT. Each participant writes its own COMMIT log record, applies the prepared changes, releases locks, and acknowledges. Money conserved.

The coordinator collects votes from every participant. Three outcomes are possible.

Every vote is YES. The coordinator writes a COMMIT(T) record to its own log and fsyncs it. This is the second load-bearing fsync of the protocol. Once this record is on disk, T is decided — even if the coordinator crashes immediately afterward, recovery will read the COMMIT record and re-broadcast the decision. Then the coordinator sends COMMIT(T) to each participant.

Each participant, on COMMIT(T):

Writes a COMMIT(T) record to its own WAL (fsync recommended; the prepare record is the load-bearing one — losing the participant's commit record only forces it to ask the coordinator on recovery, which is recoverable).
Applies the prepared changes to the durable state (or marks the prepared state as committed, depending on the engine).
Releases the locks held by T.
Sends ACK to the coordinator.

After all acks arrive, the coordinator may write a FORGET(T) record (or simply drop the entry) — once every participant has acknowledged, the coordinator no longer needs to remember T's outcome.

Any vote is NO, or any participant times out. The coordinator writes nothing about commit. (See the presumed-abort optimisation below — the coordinator deliberately does not log abort decisions.) It sends ABORT(T) to every participant. Each participant rolls back the prepared state, releases locks, and acknowledges.

Abort: a single NO vote (or a timeout) is enough. The coordinator does not write a durable abort decision — it relies on the presumed-abort rule. Any participant that recovers from a crash and asks "what about T?" gets ABORT by default.

The durability discipline

Strip the protocol of message detail and what remains is a discipline about which records must be on disk before which messages are sent.

Rule 1 — A participant fsyncs its PREPARE record before sending VOTE-YES. The prepared state must survive a crash; otherwise the vote is a lie.

Rule 2 — The coordinator fsyncs its COMMIT decision before sending COMMIT. The decision must survive a crash; otherwise the broadcast can be lost and the participants disagree.

Rule 3 — Until both rules are satisfied for every relevant party, the protocol is in a recoverable in-flight state. A participant that has fsynced PREPARE but not received the verdict can re-ask the coordinator on recovery. A coordinator that has fsynced COMMIT but not finished broadcasting can resume broadcasting on recovery.

Everything else — the bookkeeping BEGIN record, the participant's COMMIT record, the FORGET record, the ACK messages — is performance, not correctness. Why those two fsyncs and no others: the protocol is solving a "decide then enact" problem. The "decide" is the coordinator's COMMIT record (the moment the outcome becomes irrevocable). The "enact" is each participant's commitment to honour the decision (the PREPARE record). Lose either one and the protocol breaks: lose PREPARE and a participant could renege; lose COMMIT and the coordinator could forget what it decided. Every other record can be reconstructed from these two by re-asking.

This is also why 2PC is slow on writes. Every cross-shard transaction pays at least three round-trips of message latency (PREPARE, vote, COMMIT) plus at least two fsyncs (one on each prepared participant, one on the coordinator). On commodity hardware that is several milliseconds; on geo-distributed deployments it is tens to hundreds. The protocol is not free.

The presumed-abort optimisation

The protocol as described logs both commit and abort decisions on the coordinator. But abort decisions are vastly more numerous — every conflict, every timeout, every constraint violation produces an abort — and most of them are short-lived. Logging all of them is wasteful.

The presumed-abort optimisation says: the coordinator only durably logs commit decisions. Aborts are not logged. If a participant recovers from a crash and asks the coordinator "what was the outcome of T?", and the coordinator has no record of T, the coordinator answers ABORT. Why this is safe: a participant that has fsynced PREPARE for T and is now asking for the outcome must have voted YES at some point. If the coordinator has no record of T, then either (a) T was aborted (no commit log was ever written, no record persists), or (b) T was committed and forgotten only after every participant acknowledged — but the asking participant clearly did not ack, so this case cannot apply. Therefore "no record" implies "aborted", and answering ABORT is correct.

Two consequences. First, the coordinator's log is much smaller — only commits, plus the BEGIN records that get garbage-collected on FORGET. Second, the coordinator can drop its memory of a transaction the moment the last participant acks; it does not need to remember aborts at all. Almost every production 2PC implementation runs presumed-abort. The X/Open XA spec calls it out; MySQL XA, PostgreSQL prepared transactions, and the Java JTA all use it.

There is a dual optimisation called presumed-commit which logs aborts and assumes commit on missing records. It saves a fsync in the rare case where a transaction touches many participants and they all vote YES, but pays for it on every abort. Presumed-abort is what the world settled on.

Real Python — coordinator and participant

Here is a working sketch. Two files: a Participant that owns a tiny in-memory ledger and a WAL on disk, and a Coordinator that drives the protocol over direct method calls (a real implementation would use RPC; the protocol logic is identical).

# participant.py
import json
import os
from enum import Enum

class TxnState(Enum):
    NONE = "none"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"

class Participant:
    def __init__(self, name, accounts, wal_path):
        self.name = name
        self.accounts = dict(accounts)        # account_id -> balance
        self.locks = set()                    # account_ids currently locked
        self.txns = {}                        # txn_id -> (TxnState, ops)
        self.wal_path = wal_path
        self.wal = open(wal_path, "ab")       # append-only
        self._replay()                        # crash recovery on startup

    def _wal_append(self, record):
        self.wal.write((json.dumps(record) + "\n").encode())
        self.wal.flush()
        os.fsync(self.wal.fileno())           # the load-bearing call

    def _replay(self):
        with open(self.wal_path, "rb") as f:
            for line in f:
                rec = json.loads(line)
                t = rec["txn"]
                if rec["op"] == "PREPARE":
                    self.txns[t] = (TxnState.PREPARED, rec["ops"])
                    for op in rec["ops"]:
                        self.locks.add(op["account"])
                elif rec["op"] == "COMMIT":
                    state, ops = self.txns[t]
                    for op in ops:
                        self.accounts[op["account"]] += op["delta"]
                        self.locks.discard(op["account"])
                    self.txns[t] = (TxnState.COMMITTED, ops)
                elif rec["op"] == "ABORT":
                    _, ops = self.txns.get(t, (None, []))
                    for op in ops:
                        self.locks.discard(op["account"])
                    self.txns[t] = (TxnState.ABORTED, ops)

    def handle_prepare(self, txn_id, ops):
        # 1. check we can apply (locks free, balances will not go negative)
        for op in ops:
            if op["account"] in self.locks:
                return "VOTE-NO"   # someone else holds the row
            projected = self.accounts.get(op["account"], 0) + op["delta"]
            if projected < 0:
                return "VOTE-NO"   # would overdraw
        # 2. acquire locks
        for op in ops:
            self.locks.add(op["account"])
        # 3. fsync the PREPARE record BEFORE voting YES
        self._wal_append({"op": "PREPARE", "txn": txn_id, "ops": ops})
        self.txns[txn_id] = (TxnState.PREPARED, ops)
        return "VOTE-YES"

    def handle_commit(self, txn_id):
        state, ops = self.txns[txn_id]
        if state != TxnState.PREPARED:
            return "ACK"   # idempotent — already committed/aborted
        self._wal_append({"op": "COMMIT", "txn": txn_id})
        for op in ops:
            self.accounts[op["account"]] += op["delta"]
            self.locks.discard(op["account"])
        self.txns[txn_id] = (TxnState.COMMITTED, ops)
        return "ACK"

    def handle_abort(self, txn_id):
        state, ops = self.txns.get(txn_id, (TxnState.NONE, []))
        if state == TxnState.COMMITTED:
            raise RuntimeError("cannot abort committed txn — protocol bug")
        self._wal_append({"op": "ABORT", "txn": txn_id})
        for op in ops:
            self.locks.discard(op["account"])
        self.txns[txn_id] = (TxnState.ABORTED, ops)
        return "ACK"

The coordinator:

# coordinator.py
import json, os, uuid

class Coordinator:
    def __init__(self, log_path, participants):
        self.participants = participants     # {name: Participant}
        self.log_path = log_path
        self.log = open(log_path, "ab")
        self.committed = self._load_committed()

    def _log_commit(self, txn_id, names):
        rec = {"op": "COMMIT", "txn": txn_id, "participants": names}
        self.log.write((json.dumps(rec) + "\n").encode())
        self.log.flush()
        os.fsync(self.log.fileno())          # the second load-bearing call

    def _load_committed(self):
        out = {}
        if not os.path.exists(self.log_path):
            return out
        with open(self.log_path, "rb") as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "COMMIT":
                    out[rec["txn"]] = rec["participants"]
        return out

    def commit(self, ops_per_participant):
        # ops_per_participant: {name: [op, op, ...]}
        txn_id = str(uuid.uuid4())
        names = list(ops_per_participant.keys())

        # Phase 1 — PREPARE
        votes = {}
        for name, ops in ops_per_participant.items():
            votes[name] = self.participants[name].handle_prepare(txn_id, ops)

        if all(v == "VOTE-YES" for v in votes.values()):
            # Phase 2a — COMMIT
            self._log_commit(txn_id, names)          # decision durable
            for name in names:
                self.participants[name].handle_commit(txn_id)
            return ("committed", txn_id)
        else:
            # Phase 2b — ABORT (presumed abort — no log)
            for name in names:
                self.participants[name].handle_abort(txn_id)
            return ("aborted", txn_id, votes)

    def query_outcome(self, txn_id):
        # Used by participants on recovery — presumed abort applies
        if txn_id in self.committed:
            return "COMMIT"
        return "ABORT"

A few notes on what this code does and does not do.

What it does: the two load-bearing fsyncs are in the right places. Participant._wal_append is called before handle_prepare returns YES. Coordinator._log_commit is called before handle_commit is dispatched to any participant. Crash recovery on the participant replays the WAL and reconstructs the prepared state. query_outcome implements presumed-abort.

What it does not do: the coordinator does not implement timeouts, retries, or recovery-on-restart re-broadcast. The participants do not re-query the coordinator on recovery for in-doubt transactions. There is no concurrency — handle_prepare calls are serialised by Python's GIL and the test driver. A real implementation handles all of this and chapter 110 is where you start to see why those omissions matter.

Worked example — the ₹500 transfer

The ₹500 transfer, traced step by step

Account A starts at ₹2,000 on shard 1. Account B starts at ₹500 on shard 2. The application requests transfer(A, B, 500). There is one coordinator, two participants.

# Setup
p1 = Participant("shard1", {"A": 2000}, "wal_shard1.log")
p2 = Participant("shard2", {"B": 500},  "wal_shard2.log")
coord = Coordinator("coord.log", {"shard1": p1, "shard2": p2})

# The transfer
result = coord.commit({
    "shard1": [{"account": "A", "delta": -500}],
    "shard2": [{"account": "B", "delta": +500}],
})
# result == ("committed", "<txn_id>")

Step-by-step trace, in chronological order:

t=0 (Application calls coord.commit). Coordinator generates txn_id = T1. Nothing on any disk yet for T1.

t=1 (Coordinator → P1). Coordinator calls p1.handle_prepare(T1, [{A: -500}]).

t=2 (P1 inside handle_prepare). P1 checks: A is not locked, projected balance is 2000 - 500 = 1500, which is non-negative. P1 acquires the lock on A. P1 appends to its WAL: {"op":"PREPARE","txn":"T1","ops":[{"account":"A","delta":-500}]} and fsyncs. State on disk now: wal_shard1.log contains the prepare record. P1 returns "VOTE-YES".

t=3 (Coordinator → P2). Coordinator calls p2.handle_prepare(T1, [{B: +500}]).

t=4 (P2 inside handle_prepare). P2 checks: B is not locked, projected balance is 500 + 500 = 1000. P2 acquires the lock on B. P2 appends and fsyncs its prepare record. State on disk now: wal_shard2.log contains the prepare record. P2 returns "VOTE-YES".

t=5 (Coordinator collects votes). Both YES. Coordinator appends to its log: {"op":"COMMIT","txn":"T1","participants":["shard1","shard2"]} and fsyncs. State on disk now: coord.log contains the commit decision. T1 is now decided — even if the coordinator's process is killed at this exact instant, recovery reads coord.log and re-broadcasts.

t=6 (Coordinator → P1). p1.handle_commit(T1). P1 appends {"op":"COMMIT","txn":"T1"} to its WAL, applies the delta (A becomes 1500), releases the lock on A, returns ACK.

t=7 (Coordinator → P2). p2.handle_commit(T1). P2 appends commit record, applies the delta (B becomes 1000), releases the lock on B, returns ACK.

t=8 (Final state). A = 1500, B = 1000. Sum = 2500 = original sum. Money conserved. All locks released. The coordinator can now drop T1 from memory.

On-disk state at every moment — the column shows what is on disk for that file:

t	coord.log	wal_shard1.log	wal_shard2.log
0	(empty)	(empty)	(empty)
2	(empty)	PREPARE T1	(empty)
4	(empty)	PREPARE T1	PREPARE T1
5	COMMIT T1	PREPARE T1	PREPARE T1
6	COMMIT T1	PREPARE T1, COMMIT T1	PREPARE T1
7	COMMIT T1	PREPARE T1, COMMIT T1	PREPARE T1, COMMIT T1

The bold transition at t=5 is the irrevocable moment. Before that the coordinator could still abort. After that the coordinator must commit, even across crashes — recovery would re-broadcast the COMMIT to any participant that had not yet acknowledged.

Now consider what changes in the abort path. Suppose at t=4, account B is locked by another transaction. P2 returns "VOTE-NO". The coordinator sees one NO, writes nothing to its log (presumed-abort), and calls handle_abort(T1) on both participants. P1's prepared state is rolled back: the lock on A is released, and a tombstone ABORT record is appended for crash-recovery clarity. The final balances are unchanged from the start. No money moved.

The trace also makes clear why the protocol blocks. Suppose t=5 happens — coord.log contains COMMIT T1. Now the coordinator process dies before t=6. P1 and P2 are both in the PREPARED state, holding locks. They cannot commit (they don't know the decision) and they cannot abort (the coordinator might have committed). They wait. Other transactions that need account A or account B wait behind them. If the coordinator never comes back, the locks are held forever. If the coordinator comes back and its disk is intact, recovery reads COMMIT T1 from coord.log and re-broadcasts — the protocol resumes. If the coordinator's disk is gone, recovery is impossible without an operator decision. That is the blocking problem, and the next chapter pulls it apart in detail.

Where 2PC fits in real systems

Two-phase commit is not a hypothetical. It is the literal protocol underneath several production systems you have probably used.

MySQL XA. MySQL implements the X/Open XA standard for distributed transactions across multiple MySQL instances (or between MySQL and other XA-compliant resource managers like message brokers). The commands XA START, XA END, XA PREPARE, XA COMMIT, XA ROLLBACK map one-to-one onto the protocol described above. The application or a transaction manager plays the coordinator role. Production deployments are rare because of the blocking problem, but the protocol is shipped and supported.

PostgreSQL prepared transactions. PostgreSQL has PREPARE TRANSACTION 'gid' and COMMIT PREPARED 'gid' / ROLLBACK PREPARED 'gid'. Same protocol. Disabled by default (max_prepared_transactions = 0) because of the operational risk — a coordinator that dies leaves prepared transactions holding locks indefinitely.

Java JTA / X/Open XA. The X/Open XA spec from 1991 is the formal API contract for the participant side. JTA is the Java binding. Every Java application server (WebLogic, WildFly, Tomcat with Atomikos) ships an XA transaction manager that drives 2PC across the JDBC and JMS resources the application enlists.

Vitess --enable-2pc. Vitess (the YouTube-built MySQL sharding proxy) supports cross-shard transactions via 2PC, opt-in per query. The Vitess team's documentation explicitly warns that the protocol is not recommended for high-throughput workloads, and points at the blocking failure mode as the reason.

Spanner and CockroachDB. Both use 2PC for cross-range transactions, but the coordinator is itself replicated by Paxos (Spanner) or Raft (CockroachDB), which gives them a non-blocking coordinator. The atomic commit protocol is still 2PC; the consensus replication of the coordinator is what makes it production-safe at scale. Build 14 walks through how this works.

The pattern: classical 2PC (single coordinator) is the right protocol with the wrong availability story. Production systems that want atomic cross-shard commit at scale use the same protocol with a consensus-replicated coordinator. Get the protocol right first; the next chapters show how to make it non-blocking.

Common confusions

"PREPARE just means lock the rows." No — PREPARE means "perform the work, fsync a record promising to commit, and report whether you can keep that promise." The locks are part of it but are not the whole of it. A participant that voted YES has done the work and made a durable promise; it is in a different state from "locks held".

"The coordinator's log is the source of truth for the data." No — the participant's WAL is the source of truth for the data. The coordinator's log only holds the decision (commit or implicitly abort). If you lose the coordinator's log, you lose the ability to resolve in-doubt transactions; you do not lose the data.

"Two-phase commit and two-phase locking are the same thing." They are not — and this is the most common interview confusion. Two-phase locking (2PL) is a concurrency control protocol within one node: locks are acquired in a growing phase and released in a shrinking phase. Two-phase commit (2PC) is an atomic commit protocol across multiple nodes: prepare, then decide. Different problem, different protocol, similar name.

"Presumed-abort means abort is the default in the protocol." Both YES-everyone and any-NO are explicit decisions. Presumed-abort is only a recovery rule: if a participant asks the coordinator "what happened?" and the coordinator has no record, the answer is ABORT. It is a log-size optimisation, not a default behaviour during normal operation.

"2PC works as long as no one crashes." It works in any failure model where messages are eventually delivered and crashed nodes eventually recover with their disks intact. The protocol is correct against arbitrary message loss and process crashes — it just blocks in some of those cases. "Doesn't work" and "blocks until the coordinator comes back" are different.

Going deeper

Why the coordinator must fsync the COMMIT decision before broadcasting

The protocol gives the coordinator an ordering rule: fsync COMMIT, then send COMMIT. Suppose you reverse it — send COMMIT first, then fsync.

If the coordinator crashes between sending and fsyncing, the participants commit and the coordinator forgets the decision. On recovery, the coordinator does not see T in its log; presumed-abort says T was aborted. But the participants already committed. The next time anyone asks the coordinator about T, it answers "aborted". The two perspectives are now permanently inconsistent.

Persisting the decision before broadcasting is what guarantees that the coordinator never asserts an outcome to a participant that the coordinator could later forget. Once the coordinator has spoken, the coordinator's disk has spoken too. This is the property the protocol is built around.

How XA standardises the protocol

The X/Open XA specification (1991) is the contract that lets a transaction manager drive 2PC across resource managers from different vendors. It defines the participant-side API as a set of C functions: xa_start, xa_end, xa_prepare, xa_commit, xa_rollback, xa_recover, xa_forget. A resource manager that implements this API correctly can be coordinated by any compliant transaction manager.

The spec also formalises the recovery handshake: on transaction-manager startup, it calls xa_recover on every resource manager to obtain the list of in-doubt transactions (transactions in the prepared state that haven't received a verdict). The transaction manager then consults its own log to decide each one. Resource managers implement xa_recover by scanning their WALs for prepared-but-not-resolved records.

Hierarchical 2PC and tree coordinators

The protocol generalises to a tree. The root coordinator's children can themselves be coordinators of their own sub-trees. A non-leaf coordinator forwards PREPARE down its sub-tree, collects votes, and forwards a single vote up. The root makes the global decision and the protocol fans down the tree. This is how multi-region distributed transaction systems sometimes structure themselves — a regional coordinator in each region, a global coordinator over the regions — to bound the wide-area round-trips.

What 2PC does not give you

2PC is an atomic commit protocol. It does not give you isolation. Two concurrent 2PC transactions can interleave through the locks of their participants in ways that produce non-serialisable schedules. To get serialisability you need to combine 2PC with a concurrency-control protocol (2PL on each participant, or MVCC with a consistent snapshot timestamp picked by the coordinator). Production systems combine both: each participant runs 2PL or MVCC locally, and 2PC composes their local commits.

It also does not give you opacity to the application — sessions and isolation levels still apply. A reader on a participant in the prepared state for T sees the pre-T values (because T's writes are not yet committed) but cannot acquire conflicting locks (because T holds them). Reads block on the prepared state. This is the source of the lock-contention complaints about 2PC even on the happy path.

Where this leads next

This chapter taught the protocol on the happy path. Chapter 110 is the failure-mode chapter — what exactly breaks when the coordinator dies after Phase 1, why the prepared participants cannot decide on their own, what operators actually do at 2 AM when this happens in production, and why this single failure mode drove the industry to consensus-based atomic commit. Chapters 111 and onward then build the non-blocking version: Percolator's snapshot-isolation 2PC over a sharded KV, Spanner's TrueTime-coordinated commits, and the parallel-commits optimisation that collapses the prepare and commit phases into one fsync.

The protocol you wrote today is the foundation of everything that follows. The reason there is a Build 14 at all is that this protocol, as written, blocks in the one place that matters most — and the rest of the build is how the industry crossed that gap.

References

Gray, Notes on Data Base Operating Systems, Compustar Research RJ2188, 1978 — the original write-up of the two-phase commit protocol, written when the question of how to commit a transaction across multiple nodes was still genuinely open. The diagrams in this chapter follow Gray's original swim-lane convention.
Bernstein, Hadzilacos, and Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley 1987 (free PDF) — Chapter 7 is the textbook treatment of 2PC, including presumed-abort, presumed-commit, and the recovery rules. Still the clearest reference forty years on.
X/Open Company Ltd., Distributed Transaction Processing: The XA Specification, 1991 — the API spec that every JTA, MySQL XA, and PostgreSQL prepared-transactions implementation conforms to. The recovery section is the operational counterpart to the protocol.
Corbett et al., Spanner: Querion's Globally-Distributed Database, OSDI 2012 — Section 4 ("Concurrency Control") shows how Spanner runs 2PC over Paxos-replicated coordinators, the production answer to the blocking problem this chapter teases.
Oracle / MySQL, MySQL XA Transactions — the canonical end-user view of how 2PC commands surface in a SQL dialect, including the "in-doubt transaction" recovery commands operators run when a coordinator dies.
Lampson, How to Build a Highly Available System Using Consensus, 1996 — the framing chapter for why "make the coordinator a consensus-replicated state machine" is the right answer to 2PC's blocking problem; the intellectual hand-off from this chapter to the rest of Build 14.