In short

The previous chapter ended with a teaser: 2PC is correct on the happy path but blocks in one specific failure. This chapter takes that failure apart.

The danger window is small but lethal. Phase 1 completes — every participant has voted YES, fsynced its PREPARE record, and is holding locks. The coordinator is now obligated to write its COMMIT decision, fsync it, and broadcast. It crashes between those steps. From a participant's point of view, the coordinator's silence is indistinguishable from "the coordinator decided ABORT and the message is in flight" or "the coordinator decided COMMIT and the message is in flight" or "the coordinator's disk is intact and it will be back in five minutes" or "the coordinator's disk is gone and it will never tell anyone what it decided." The participant cannot guess. So it waits. With locks held.

"Blocking" sounds abstract until you trace what happens to the rest of the system. Every other transaction that needs the row locked by the prepared transaction also waits. If those waiters hold their own locks, transactions waiting on them also wait. The dependency graph fans out and a single coordinator crash on one cross-shard transfer can stall every transaction touching the affected rows on every shard. The bank's payment pipeline goes from "transactions per second" to "zero" until an operator manually resolves the in-doubt transactions or the coordinator returns.

The participant-side recovery story is: on restart, a participant in PREPARED state must contact the coordinator to learn the outcome. If the coordinator is reachable and has a COMMIT log record, commit. If the coordinator is reachable and has no record (presumed-abort), abort. If the coordinator is unreachable — block. As a partial workaround, participants can ask each other: if any peer has already heard the verdict, propagate it. This is the cooperative-termination protocol. It helps when the coordinator dies after at least one participant has been told, but if the coordinator dies before telling anyone, every peer is in the dark and the workaround does nothing.

Three-phase commit (3PC), Skeen 1981, tries to fix this by inserting a PRE-COMMIT phase between PREPARE and COMMIT. A participant in PRE-COMMIT knows everyone else voted YES, so on coordinator death it can safely commit unilaterally. Beautiful in a synchronous network. Useless in an asynchronous one — a partition between coordinator and a subset of participants can still produce inconsistency. 3PC trades blocking for unsafety; nobody runs it in production.

The modern answer is structural: make the coordinator itself a Raft or Paxos group. A coordinator that is a replicated state machine cannot crash to permanent silence — its replicas elect a new leader and continue. Spanner, CockroachDB, YugabyteDB, FoundationDB all do this. The atomic-commit protocol is still 2PC; the coordinator is a consensus group. The single point of failure has been moved from "one process" to "a majority of a Paxos quorum", which under reasonable assumptions does not fail.

You wrote a working two-phase commit in the previous chapter. The ₹500 transfer worked. The trace table showed every disk record at every moment. You closed the chapter with a footnote: what if the coordinator dies after Phase 1?

This chapter is that footnote. It is also the chapter that explains why your DBA refuses to enable PostgreSQL's prepared transactions, why MySQL XA is technically supported but operationally feared, and why Spanner — which runs the same 2PC you wrote — is considered production-safe and your handwritten 2PC is not.

The danger window, framed precisely

[Figure: The danger window — coordinator crashes after Phase 1, before broadcast. Sequence diagram: PREPARE(T) goes to both participants; each fsyncs its PREPARE record, takes its row lock, and returns VOTE-YES; the coordinator crashes before fsync(COMMIT), so no decision message is ever sent. Both participants are stuck in PREPARED — unable to commit (no COMMIT received), unable to abort (the verdict might already be COMMIT) — with the locks on A and B held indefinitely.]
The danger window. Phase 1 has completed — both participants have fsynced PREPARE and voted YES. The coordinator is the only entity that knows the verdict, and the verdict is still in its head (or worse, half-written to disk). It crashes here. Both participants are now in `PREPARED` state holding locks, with no path forward on their own.

The protocol's correctness argument depends on the coordinator's COMMIT decision being either (a) durable on disk and broadcast or (b) not durable on disk and not broadcast. The forbidden state is "decided in memory but not yet on disk and not yet sent." A crash inside that window leaves the participants in the prepared state without a verdict.

There are actually three sub-windows, and they have different recovery stories:

Window A — coordinator crashes before fsyncing COMMIT. The decision was never made durable. On recovery, the coordinator's log has no record of T. By the presumed-abort rule, T should be aborted. The participants are still in PREPARED, holding locks, waiting. They can recover correctly if they can reach the coordinator and ask.

Window B — coordinator crashes after fsyncing COMMIT but before sending any message. The decision is on disk. On recovery, the coordinator reads its log, sees COMMIT T1, and re-broadcasts. Participants commit. Recovery works if the coordinator's disk is intact.

Window C — coordinator crashes after sending some COMMIT messages but before sending all. Some participants have committed and released locks; others are still PREPARED. On recovery, the coordinator re-broadcasts to everyone (`handle_commit` must be idempotent — note the `if state != TxnState.PREPARED: return "ACK"` guard in the previous chapter's code). The committed participants ack again; the prepared ones now commit.

Why all three windows look the same to a participant: a participant cannot tell whether the coordinator hasn't sent the message yet, has sent it but the network ate it, or has sent it to someone else but not to me. From the participant's vantage point, "I voted YES and haven't heard back" is the only observable. Window A vs B vs C is invisible. The participant must therefore behave the same way in all three: contact the coordinator (if possible) and learn the outcome; if not possible, block.
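On the coordinator side, the three windows collapse into one rule: every verdict is re-derived from the durable log alone. A minimal, self-contained sketch (the log format and function names here are illustrative, not the previous chapter's code):

```python
# Sketch of coordinator restart logic for windows A, B, C.
# The log is modelled as a list of lines; names are illustrative.

def read_committed_txns(log_lines):
    """A transaction is committed iff its COMMIT record reached disk
    before the crash."""
    return {line.split()[1] for line in log_lines
            if line.startswith("COMMIT")}

def recover_coordinator(log_lines, in_doubt):
    """Re-derive every verdict from the durable log alone.

    Window A (crash before fsync(COMMIT)): no record on disk, so
    presumed-abort applies and the query answer is ABORT.
    Windows B and C (COMMIT fsynced): the record survives, so the
    answer is COMMIT and the coordinator re-broadcasts; idempotent
    participant handlers make duplicate deliveries harmless.
    """
    committed = read_committed_txns(log_lines)
    return {t: ("COMMIT" if t in committed else "ABORT")
            for t in in_doubt}

# Window A: empty log — presumed abort.
print(recover_coordinator([], ["T1"]))             # {'T1': 'ABORT'}
# Window B/C: the COMMIT record survived the crash.
print(recover_coordinator(["COMMIT T1"], ["T1"]))  # {'T1': 'COMMIT'}
```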

What turns this from a theoretical bug into an operational nightmare is what "block" means in practice.

What "blocking" actually costs

A prepared participant holds locks on the rows it touched. In our running example, participant 1 holds a row lock on account A, participant 2 holds a row lock on account B. As long as the transaction is in PREPARED state without a verdict, those locks are held. Forever, if the coordinator never returns.

Now consider what happens to the rest of the system. Account A is a hot row — it is the source account for many transfers. The next transaction that wants to debit A waits behind the lock. The transaction after that also waits. Within seconds you have a queue of dozens of transactions stalled on a single row. Each of those transactions might itself hold locks on other rows that its successors are waiting on. The wait-for graph fans out exponentially. Within a minute, an entire shard's transaction throughput can grind to zero — not because the shard is overloaded but because every transaction is trapped behind a single in-doubt 2PC.
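The cascade is mechanical enough to compute. A toy illustration (the lock table and transaction names here are invented for the example) of how one held lock transitively blocks a chain of waiters:

```python
# Toy wait-for-graph walk: each waiter's own held locks block its
# waiters in turn, so one stuck row can stall a whole chain.
from collections import deque

def transitively_blocked(holders, waits_for, stuck_row):
    """holders: row -> txn currently holding its lock.
    waits_for: txn -> row it is waiting on.
    Returns every transaction that cannot proceed while stuck_row's
    lock is held."""
    blocked, frontier = set(), deque([stuck_row])
    while frontier:
        row = frontier.popleft()
        for txn, waiting_on in waits_for.items():
            if waiting_on == row and txn not in blocked:
                blocked.add(txn)
                # txn's own locks now block its waiters in turn
                frontier.extend(r for r, h in holders.items() if h == txn)
    return blocked

holders = {"A": "T1", "C": "T7", "D": "T9"}     # T1 is the in-doubt txn
waits_for = {"T7": "A", "T9": "C", "T12": "D"}  # chain behind row A
print(sorted(transitively_blocked(holders, waits_for, "A")))
# ['T12', 'T7', 'T9'] — the whole chain stalls behind one prepared txn
```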

In a real production incident at a bank running MySQL XA, a coordinator crash during evening peak load locked 14 hot accounts; within 90 seconds, every account in the same hot-cold cohort was unable to process payments because every payment touched at least one of those 14 accounts and queued behind them. The coordinator restart took 4 minutes. Total business impact: 5.5 minutes of zero-throughput payments during peak hour. The post-mortem made the case for moving off classical 2PC.

Why "just timeout and abort" doesn't work: a participant in PREPARED state that times out the coordinator and unilaterally aborts is unsafe. The coordinator might have already fsynced COMMIT and might have already told other participants to commit. If this participant aborts, other participants commit, and the atomicity invariant is broken — money disappears. The participant cannot abort without permission, and the only entity that can grant permission is the coordinator (or, in 3PC, the protocol itself; we'll get there).

This is also where presumed-abort, which seemed like a clean optimisation in the previous chapter, becomes a subtle hazard. If the coordinator's disk is destroyed and the coordinator restarts with an empty log, every in-doubt transaction is presumed-aborted on the next query — but participants may have committed some of them already (Window C above). The presumed-abort rule is correct only as long as the coordinator's log survives the crash. Lose the log and you lose atomicity.

Participant-side recovery: query the coordinator

A real participant, unlike the toy in the previous chapter, must know how to recover from its own crash. On restart, it replays its WAL and discovers a set of transactions in PREPARED state with no COMMIT or ABORT record. These are in-doubt transactions. The participant cannot serve traffic on the affected rows until they are resolved.

The textbook recovery procedure is: contact the coordinator, ask for the outcome of each in-doubt transaction, apply the answer.

# participant.py — added to the Participant class from the previous chapter

def recover(self, coordinator_client, peer_clients=None, timeout=30):
    """Resolve in-doubt transactions left over from a crash.

    Called once on startup, after _replay() has rebuilt in-memory state.
    Blocks until every in-doubt transaction has a definitive answer or
    until `timeout` elapses, in which case the in-doubt set remains and
    the affected rows stay locked.
    """
    in_doubt = [t for t, (state, _) in self.txns.items()
                if state == TxnState.PREPARED]
    for txn_id in in_doubt:
        outcome = self._resolve_one(txn_id, coordinator_client,
                                    peer_clients or [], timeout)
        if outcome == "COMMIT":
            self.handle_commit(txn_id)
        elif outcome == "ABORT":
            self.handle_abort(txn_id)
        else:
            # outcome == "UNKNOWN" — coordinator unreachable AND no peer
            # knows the verdict. Locks remain held; row is unavailable.
            log.warning("txn %s unresolved; locks remain held", txn_id)

def _resolve_one(self, txn_id, coord, peers, timeout):
    # Step 1 — ask the coordinator.
    try:
        return coord.query_outcome(txn_id, timeout=timeout)
    except (NetworkError, Timeout):
        pass
    # Step 2 — ask peers (cooperative termination).
    for peer in peers:
        try:
            verdict = peer.query_peer_outcome(txn_id, timeout=5)
            if verdict in ("COMMIT", "ABORT"):
                return verdict
        except (NetworkError, Timeout):
            continue
    # Step 3 — nobody knows. Stay blocked.
    return "UNKNOWN"

A few things this code does that the previous chapter's toy did not:

The coordinator is asked first, by name. It is the only entity that owns the COMMIT decision (or its absence, under presumed-abort), and it is the authoritative source. A peer might know the outcome only because the coordinator already told it — what a peer knows is a strict subset of what the coordinator knows — so if the coordinator is reachable, asking it is always strictly better than asking any peer.

Peers are asked second, as a fallback. This is the cooperative-termination protocol from the Bernstein–Hadzilacos–Goodman text. If any peer has already received the verdict, the recovering participant can use that verdict. Cooperative termination shrinks the stuck case: the participant blocks only when every peer is equally in the dark, not merely whenever the coordinator alone is unreachable.

There is an explicit UNKNOWN outcome. Locks remain held. The row is offline. An operator must intervene — typically by manually committing or aborting the in-doubt transaction (XA COMMIT / XA ROLLBACK in MySQL) based on out-of-band evidence (other shard's logs, the application's own state). This manual intervention is the operational tax of 2PC.

The peer-discovery limit

Cooperative termination is a clever workaround but it has a hard limit. Consider: the coordinator decides COMMIT, fsyncs, then crashes before sending any message. None of the participants has heard the verdict. They ask each other. Each one knows only its own state (PREPARED), not the coordinator's decision.

[Figure: Participant stuck in PREPARED — locks held, peers also blind. Participant 1's WAL holds PREPARE T1 (debit A 500); the exclusive lock on row A has been held since t=2 with a growing wait queue (T7, T9, T12, T15, ...); the coordinator is down and does not answer the outcome query; the peer, also PREPARED, answers "I don't know either". All paths return UNKNOWN; row A is unavailable to all transactions and an operator must intervene by hand.]
Peer discovery only helps when at least one peer has heard the verdict. If the coordinator dies before notifying anyone, every peer's answer is "I don't know either", the cooperative termination protocol degenerates to silence, and the participant remains in PREPARED with locks held until either the coordinator returns with its log intact or an operator intervenes.

This is the residual hardness of 2PC. Cooperative termination strictly improves the situation but does not eliminate it. There exists a window — coordinator decides, fsyncs, crashes before sending — in which no peer has any information and the cluster is genuinely stuck.

It is exactly this residual case that motivated Skeen's three-phase commit.

Three-phase commit — Skeen, 1981

Dale Skeen's PhD thesis (Berkeley 1982) and the earlier Skeen–Stonebraker paper argued that the blocking problem in 2PC is structural: there is a state (PREPARED) from which a participant cannot decide what to do without external information, and if the external source disappears, the participant is permanently stuck. Skeen's solution was to insert another phase between PREPARE and COMMIT, putting participants in a state where they could unilaterally commit if the coordinator died.

[Figure: 2PC vs 3PC — adding a PRE-COMMIT phase to escape blocking. Top: in 2PC a participant in PREPARED cannot decide alone, so a coordinator crash in the danger window blocks it. Bottom: 3PC inserts a PRE-COMMIT round ("all voted YES") between the votes and COMMIT; a participant in PRE-COMMIT knows everyone agreed, so on coordinator death the survivors can elect a new coordinator and commit. The non-blocking proof holds only under a synchronous network with bounded delays; with real asynchronous partitions 3PC can produce inconsistent outcomes, and nobody runs it in production.]
3PC inserts a PRE-COMMIT phase between PREPARE and COMMIT. Once a participant has acknowledged PRE-COMMIT, it knows all peers also voted YES; if the coordinator dies, surviving participants can elect a new coordinator and commit. The protocol is non-blocking under synchronous-network assumptions. In an asynchronous network — which is what every real network is — partitions break the safety proof.

The 3PC message flow is:

  1. Coordinator → all: PREPARE. Participants vote, fsync prepare records, return YES/NO.
  2. If all YES: coordinator → all: PRE-COMMIT. Participants record "ready to commit", ack.
  3. Coordinator → all: COMMIT. Participants commit.

The new state — let us call it READY-TO-COMMIT — has a special property: a participant in this state knows that every participant has voted YES, because the coordinator only sends PRE-COMMIT after collecting all votes. So if the coordinator dies after some participants have entered READY-TO-COMMIT, the surviving participants can elect a new coordinator (typically by a deterministic ordering among themselves) and commit unilaterally — they all agree the answer is COMMIT.

Symmetrically, if a participant is still in PREPARED (not yet in READY-TO-COMMIT) and the coordinator dies, the surviving participants can deduce that no one has been told to commit yet, and they can safely abort.

The key claim: there is no state in which a participant cannot make progress on its own after a coordinator failure. Therefore 3PC is non-blocking.
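That case analysis fits in a few lines. A sketch of the 3PC termination rule as a pure function, using this section's state names:

```python
# The 3PC termination rule as a pure function (a sketch; state names
# follow this section, not any real implementation).

def terminate_after_coordinator_death(survivor_states):
    """Decide the outcome among surviving participants once the
    coordinator is declared dead (which 3PC assumes is detectable).

    If any survivor reached READY-TO-COMMIT, the coordinator must have
    collected unanimous YES votes, so COMMIT is safe. If no survivor
    got past PREPARED, no one can have been told to commit, so ABORT
    is safe. The case analysis is exhaustive — hence "non-blocking" —
    but only if "the coordinator is dead" is actually knowable.
    """
    if any(s == "READY-TO-COMMIT" for s in survivor_states):
        return "COMMIT"
    return "ABORT"

print(terminate_after_coordinator_death(["PREPARED", "READY-TO-COMMIT"]))  # COMMIT
print(terminate_after_coordinator_death(["PREPARED", "PREPARED"]))         # ABORT
```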

The trouble is that this proof assumes a synchronous network. The election algorithm relies on the surviving participants being able to definitively distinguish "the coordinator has crashed" from "the coordinator is slow". In a synchronous network with a known maximum message delay, you can: wait long enough and absence of message means crash. In an asynchronous network — which every real network is — you cannot: the coordinator might be alive on the other side of a partition, talking to its half of the participants, while the other half has elected a new coordinator and is talking to itself. Now you have two coordinators, two decisions, and possible inconsistency.

Why 3PC is unsafe under partitions: imagine 5 participants. The coordinator has delivered PRE-COMMIT to two of them when a network partition cuts the coordinator and those two off from the other three. The three isolated participants time out the coordinator, elect a new one among themselves, observe that they are all in PREPARED and none in READY-TO-COMMIT, and decide ABORT. Meanwhile the two participants on the coordinator's side are in READY-TO-COMMIT and the coordinator continues with COMMIT. After the partition heals, some participants have committed and some have aborted — an atomicity violation. 3PC trades blocking for unsafety. In a system that can't tolerate either, this trade is worse than the original.
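A partition scenario of this shape can be replayed through the termination rule each side would run locally (a sketch; the participant split and states are illustrative):

```python
# Both sides of the partition apply the same 3PC termination rule to
# the states they can see — and reach opposite verdicts.

def terminate(survivor_states):
    return ("COMMIT" if any(s == "READY-TO-COMMIT" for s in survivor_states)
            else "ABORT")

# The coordinator delivered PRE-COMMIT only to P4 and P5 before the
# network partitioned them (plus the coordinator) away from P1-P3.
side_with_coordinator = ["READY-TO-COMMIT", "READY-TO-COMMIT"]  # P4, P5
isolated_side         = ["PREPARED", "PREPARED", "PREPARED"]    # P1-P3

# Each side, believing the other dead, decides on its local view:
print(terminate(side_with_coordinator))  # COMMIT
print(terminate(isolated_side))          # ABORT — atomicity violated
```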

This is why 3PC, despite being a famous textbook protocol, is essentially never deployed. Skeen himself in subsequent work moved toward consensus-based commit, which solves the problem properly.

The modern fix — replicate the coordinator

The structural insight that took the field thirty years to fully internalise: the blocking problem in 2PC is not a bug in the protocol, it is a single-point-of-failure problem with the role. The protocol is fine. The coordinator is the bottleneck. So replicate the coordinator.

Replace the single coordinator process with a Raft or Paxos consensus group. The "coordinator" becomes a replicated state machine. The decision (COMMIT T1) is committed by Raft/Paxos before being broadcast to participants. If any individual coordinator replica crashes, a majority is still alive and the group elects a new leader; the new leader reads the replicated log, sees COMMIT T1, and continues broadcasting. The participants see this as a brief delay, not a permanent stall.

The protocol from the participant's point of view is unchanged. It still receives PREPARE, fsyncs, votes, waits for COMMIT or ABORT, applies the verdict. The participant does not know — and does not need to know — that the coordinator is actually a Paxos quorum behind a single endpoint.
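The composition can be sketched in miniature. `ConsensusLog` below is a toy stand-in for a real Raft/Paxos log — an entry counts as decided once a majority of replicas hold it — and is not any real library's API:

```python
# Sketch: the COMMIT decision is majority-replicated before any
# participant hears it, so a new leader can always re-derive it.

class ConsensusLog:
    """Toy stand-in for Raft/Paxos: an entry is decided once a
    majority of replicas have stored it."""
    def __init__(self, n_replicas):
        self.replicas = [[] for _ in range(n_replicas)]

    def replicate(self, entry, reachable):
        for i in reachable:
            self.replicas[i].append(entry)
        return len(reachable) > len(self.replicas) // 2  # majority?

class ReplicatedCoordinator:
    def __init__(self, log):
        self.log = log

    def decide_commit(self, txn_id, reachable_replicas, participants):
        # The verdict exists only once a majority holds it durably —
        # a new leader elected from that majority re-reads the log
        # and continues the broadcast.
        if not self.log.replicate(("COMMIT", txn_id), reachable_replicas):
            raise RuntimeError("no majority; decision not made")
        for p in participants:
            p.handle_commit(txn_id)

acks = []
class P:  # trivial participant that records the broadcast
    def handle_commit(self, t):
        acks.append(t)

coord = ReplicatedCoordinator(ConsensusLog(3))
coord.decide_commit("T1", reachable_replicas=[0, 1], participants=[P(), P()])
print(acks)  # ['T1', 'T1'] — decision was majority-durable before broadcast
```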

What changes is the failure model. The blocking-window failure of classical 2PC required a single process (the coordinator) to crash at a specific instant. The blocking-window failure of consensus-replicated 2PC requires a majority of the coordinator quorum to crash at a specific instant. With a 5-node Paxos group across 3 availability zones, the probability of losing a majority simultaneously approaches the probability of the data centre burning down. For practical purposes, the coordinator does not single-point-fail.
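The arithmetic behind that claim is worth making explicit. Assuming an independent per-node unavailability p (real failures are correlated, so treat the numbers as optimistic), the probability that a majority is down simultaneously:

```python
# Back-of-the-envelope for the replicated-coordinator failure model.
# p is an assumed, illustrative per-node unavailability.
from math import comb

def p_majority_down(n, p):
    """Probability that at least (n // 2 + 1) of n nodes are down,
    treating node failures as independent."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

p = 1e-3  # a node is unavailable 0.1% of the time (illustrative)
print(f"single coordinator unavailable: {p:.1e}")
print(f"majority of 3 down:             {p_majority_down(3, p):.1e}")
print(f"majority of 5 down:             {p_majority_down(5, p):.1e}")
```

Each pair of extra replicas multiplies the exponent: losing a majority of 5 requires three simultaneous failures, which is vanishingly unlikely compared to a single-process crash.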

This is the architecture of every production system that runs 2PC at scale — Spanner, CockroachDB, YugabyteDB, and FoundationDB among them.

The pattern is: 2PC for atomicity, consensus for availability, composed. Build 14 walks through the details of how Spanner and Percolator make this work; this chapter is the chapter that explains why they have to.

Worked example — coordinator crash mid-transfer

The ₹500 transfer with a coordinator crash at t=5

Account A starts at ₹2,000 on shard 1. Account B starts at ₹500 on shard 2. The application requests transfer(A, B, 500). The coordinator and both participants are using the code from the previous chapter. Things proceed exactly as the happy-path trace until t=5.

t=0–4 (Phase 1 completes successfully). Same as the happy-path trace. P1 fsyncs PREPARE T1 for debit A 500. P2 fsyncs PREPARE T1 for credit B 500. Both vote YES. Both have row locks on their respective accounts.

t=5 (Coordinator about to write COMMIT). Coordinator has collected both YES votes. It is now in the danger window — about to call _log_commit which will append to disk and fsync. The coordinator process is killed by an OOM-killer / hardware failure / kubectl delete pod / electricity outage. It does not return.

State at the moment of crash:

Disk file        Contents
coord.log        (empty — COMMIT was never written)
wal_shard1.log   PREPARE T1
wal_shard2.log   PREPARE T1

State of locks:

  • P1: row lock on account A — held.
  • P2: row lock on account B — held.

t=10 (Application timeout). The application that initiated the transfer times out waiting for the coordinator. It returns an error to the user. The user retries, hitting a different coordinator endpoint or rolling back manually. But the locks on A and B are still held by P1 and P2. Account A and account B are now unavailable for any new transaction.

t=15 (New transfers queue up). Other customers attempt to debit account A (a popular salary-disbursement account). Each of those transactions sends PREPARE to P1, which tries to acquire a lock on A, fails (held by T1), and either waits or votes NO. The wait queue grows.

t=60 (Operations team notices). The on-call engineer sees prepared_transactions_held metric spike. They check P1 and P2's logs and find the in-doubt T1 in PREPARED state on both. They check the coordinator — process is gone. Nothing in the coordinator's log because the crash happened before _log_commit.

t=120 (Operator decision). With no coordinator log to consult, the operator has two choices:

  • Manually abort T1 on both participants (p1.handle_abort(T1), p2.handle_abort(T1)). Safe in this specific case because the coordinator never fsynced COMMIT, so the only consistent verdict was ABORT under presumed-abort. Money was never moved. Locks released.
  • Manually commit T1 if the operator has out-of-band evidence the verdict was supposed to be commit (other shard logs, application records). Risky — getting it wrong breaks atomicity.

In practice the operator chooses ABORT because it is the safer default under presumed-abort.

t=180 (Coordinator restarted on a new pod). The coordinator's pod is rescheduled. On startup, it loads coord.log — empty. It has no record of T1. If P1 or P2 had still been in-doubt, they could now call query_outcome(T1) and get ABORT back via presumed-abort. But the operator has already manually resolved them.

Now a worse variant. Suppose the coordinator had successfully fsynced COMMIT at t=5 (window B in the framing above) and then crashed before sending any message. coord.log now contains COMMIT T1. The operator looks at coord.log, sees the commit decision, and knows the verdict is COMMIT. They manually commit T1 on both participants. Money moves correctly. But this took 5 minutes of human attention.

Worst variant. Suppose the coordinator had successfully fsynced COMMIT and its disk is then destroyed (bad SSD, loss of the EBS volume, ransomware). On restart, coord.log is empty, so by presumed-abort T1 is ABORT — even though the real verdict was COMMIT. P1 and P2, still in PREPARED, have no way to know this; with no human in the loop they call query_outcome, get ABORT, and abort. Money never moved, which at least preserves atomicity. But if the coordinator had told one participant COMMIT before the disk was lost (window C), that participant has already committed and the other now aborts — atomicity broken, money disappeared. Only an attentive operator with access to the other shards' logs could detect the COMMIT and resolve it by hand. This is the disaster case.

Now run the same scenario with the coordinator as a Raft group of 3 nodes. At t=5 one of the three coordinator nodes (the Raft leader) crashes after replicating COMMIT T1 to the Raft log. The other two nodes elect a new leader (typically within 1–2 seconds via Raft's election timeout). The new leader reads the Raft log, sees COMMIT T1, broadcasts COMMIT to P1 and P2. P1 and P2 commit. Total downtime: ~2 seconds. No operator paged. No locks held longer than the election timeout.

The contrast — 5 minutes of operator-driven recovery in the disaster case for classical 2PC versus 2 seconds of automatic recovery for consensus-replicated 2PC — is the entire production case for the consensus approach.

Common confusions

"3PC fixes the problem so we should run it." It fixes the problem under a synchronous-network assumption that does not hold in the real world. Under realistic asynchronous networks with partitions, 3PC can produce inconsistent commits, which is worse than blocking. Production systems do not run 3PC.

"Cooperative termination is enough." It strictly improves availability but does not eliminate the failure case where the coordinator dies before notifying any participant. If you need guaranteed liveness, cooperative termination is necessary but not sufficient.

"Just use very fast coordinator restart." Even with a 1-second coordinator restart, you have 1 second of zero throughput for every cross-shard transaction touching affected rows. For a payments processor that is unacceptable; for a low-traffic admin system it is fine. The acceptable answer depends on your SLO.

"The coordinator's disk is fine because it's on RAID/SSD/EBS." Disk loss is rare but it does happen — the EBS volume gets detached, the controller fries, the file system gets corrupted, the entire AZ becomes unreachable. Architectures that depend on a single disk surviving forever are gambling against tail probability. Replicating the coordinator's log via consensus eliminates this assumption.

"Spanner is just 2PC over Paxos." Yes — and that is the entire architectural insight. The atomic-commit protocol is unchanged from Gray 1978; the coordinator is replaced with a Paxos state machine. What looks like "Spanner" is really "the obvious composition of 2PC and Paxos that took the industry 25 years to settle on."

"My in-doubt transaction count is zero so I'm fine." Good — until it isn't. The metric that matters is "what happens during a coordinator crash". Run a chaos test: kill your coordinator process between Phase 1 and Phase 2 and observe how long the locks are held. That number is your worst-case latency floor for cross-shard transactions.
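That chaos test can be sketched with toy stand-ins (`StubParticipant` below is a stub invented for the sketch, not the chapter's Participant class):

```python
# Self-contained sketch of the chaos test: kill the coordinator
# between Phase 1 and Phase 2, measure how long locks are held.
import time

class StubParticipant:
    def __init__(self):
        self.locked = False

    def handle_prepare(self, txn):
        self.locked = True        # fsync PREPARE, take the row lock
        return "YES"

    def handle_abort(self, txn):
        self.locked = False       # operator-driven resolution

def chaos_run(participants, operator_delay_s):
    votes = [p.handle_prepare("T1") for p in participants]
    assert all(v == "YES" for v in votes)
    t0 = time.monotonic()
    # Coordinator "crashes" here: no COMMIT is ever logged or sent.
    time.sleep(operator_delay_s)  # locks held until someone intervenes
    for p in participants:
        p.handle_abort("T1")      # safe here: COMMIT was never fsynced
    return time.monotonic() - t0

ps = [StubParticipant(), StubParticipant()]
held = chaos_run(ps, operator_delay_s=0.2)
print(f"locks held for {held:.1f}s after coordinator death")
```

In the real test the `time.sleep` is replaced by whatever resolves the in-doubt transaction — coordinator restart, cooperative termination, or a paged operator — and the measured hold time is your worst-case latency floor.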

Going deeper

Pat Helland's escape route — get out of distributed transactions entirely

There is a separate school of thought, articulated most forcefully by Pat Helland (architect of Microsoft's distributed transaction systems and later Amazon and Salesforce systems). It argues that distributed atomic commit is the wrong primitive for most application-level work in the first place.

The Helland argument

The reasoning runs: 2PC is only ever necessary because two pieces of data are on different shards and the application wants them updated atomically. But that requirement is usually a relic of a single-machine relational schema being mechanically sharded. If you redesign the application around entities — single units of consistency that map to single shards — then most transactions become single-shard and need no 2PC. The few cases that genuinely need cross-entity atomicity are typically business-domain transactions (settlement, reconciliation) where the application can use higher-level idioms: idempotent operations, sagas with compensating actions, or workflow engines that drive a multi-step business process to completion without holding database-level locks across steps.
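The saga idiom mentioned above can be sketched in a few lines (illustrative, not Helland's own formulation): each step carries a compensating action, and a failure undoes completed steps in reverse order instead of holding cross-shard locks:

```python
# Minimal saga: compensation in reverse order, no locks held across steps.

def run_saga(steps):
    """steps: list of (do, undo) callables. Returns True on success;
    on any failure, compensates the completed steps and returns False."""
    done = []
    for do, undo in steps:
        try:
            do()
            done.append(undo)
        except Exception:
            for comp in reversed(done):
                comp()            # compensation, not rollback-by-lock
            return False
    return True

ledger = {"A": 2000, "B": 500}
def debit():       ledger["A"] -= 500
def undebit():     ledger["A"] += 500
def credit_fail(): raise RuntimeError("shard 2 unavailable")

ok = run_saga([(debit, undebit), (credit_fail, lambda: None)])
print(ok, ledger)  # False {'A': 2000, 'B': 500} — the debit was compensated
```

Note what the saga gives up: between the debit and the compensation, other transactions can observe the intermediate state. That is the trade — atomicity-by-compensation at the business level instead of isolation-by-locks at the storage level.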

In Helland's framing, distributed transactions are a database leaking its concurrency-control problem into the application. The right architectural response is to redesign the data model so the leak does not occur, not to make the leak more efficient.

This argument is influential. Most large internet companies (Amazon, Google's ads systems, Facebook, large fintechs) have moved away from cross-shard 2PC for the application data path. The 2PC implementations in Spanner, Cockroach, and Yugabyte exist as a safety net — they let you reach across shards when you must — but the everyday workload is overwhelmingly single-shard transactions plus saga-style multi-step workflows.

Where this leaves the chapter

You now know exactly why classical 2PC blocks, what "blocks" means in operational terms, why the textbook fix (3PC) does not work in real networks, and what the production answer looks like (consensus-replicated coordinator). The next three chapters walk through that production answer in detail: Percolator builds 2PC over a sharded KV with snapshot isolation; Spanner uses TrueTime to give external consistency; the parallel-commits optimisation collapses two fsyncs into one for latency-critical workloads. The blocking problem is solved — by structural replacement, not by clever protocol design.

References

  1. Skeen and Stonebraker, A Formal Model of Crash Recovery in a Distributed System, IEEE TSE 1983 — the analysis that first formalised 2PC's blocking property as a structural feature of the state machine, and motivated the search for non-blocking alternatives.
  2. Skeen, Nonblocking Commit Protocols, ACM SIGMOD 1981 — the original 3PC paper. Introduces the PRE-COMMIT phase and proves non-blocking under synchronous-network assumptions; a careful read also exposes the assumptions that fail in real asynchronous networks.
  3. Bernstein, Hadzilacos, and Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley 1987 — Chapter 7 covers the cooperative-termination protocol, the in-doubt resolution discipline, and the limits of peer-discovery as a workaround.
  4. Lampson, Atomic Transactions, distributed systems lecture notes — the cleanest short statement of why the coordinator's role must be made fault-tolerant by consensus rather than by clever protocol surgery.
  5. Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — Section 4.2.1 ("Read-Write Transactions") describes how Spanner runs 2PC where each participant and the coordinator are themselves Paxos groups, the production embodiment of the structural fix this chapter ends with.
  6. Helland, Life Beyond Distributed Transactions: An Apostate's Opinion, CIDR 2007 — the architectural argument that the right response to 2PC's blocking problem is often to redesign the data model around single-entity transactions plus sagas, rather than to make distributed atomic commit more efficient.