Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

3PC and why it doesn't help

Roopa, a backend engineer at PaySetu, finished reading the 2PC chapter at 11pm and typed the obvious follow-up question into her notes: "if 2PC blocks because PREPARED participants can't tell what the dead coordinator decided, why don't we add a third phase that announces the decision before committing? Then a participant in PRE-COMMIT knows the decision was COMMIT — even if the coordinator dies, the participants can finish the transaction among themselves." This is the question Dale Skeen asked in 1981. The protocol he designed — three-phase commit — is the textbook answer. It is also the reason most working distributed-systems engineers have never deployed it: 3PC's correctness proof rests on a single assumption that real networks violate every day. This chapter walks through the protocol, the proof of why it eliminates blocking under that assumption, and the precise way the assumption fails. The goal is not to teach you 3PC — you will not run 3PC in production. The goal is to teach you why people stopped trying.

Three-phase commit splits 2PC's commit phase into PRE-COMMIT and COMMIT, so that any participant past PRE-COMMIT knows the coordinator decided to commit and can finish the transaction with peers if the coordinator dies. This eliminates the "PREPARED but undecided" blocking case — but only under a synchronous network model, where message delays are bounded and a timeout reliably distinguishes a crashed node from a slow one. On a real partial-synchronous network (TCP retransmits, GC pauses, switch reboots), a partition can split nodes into groups that disagree about which phase has been reached, and 3PC can produce inconsistent decisions. In practice, 3PC trades blocking for the chance of a split-brain commit — and split-brain is operationally worse than blocking. Real systems use Paxos-commit (replicate the coordinator) or sagas (give up atomicity, accept compensations) instead.

What 3PC actually is — the protocol step by step

Three-phase commit takes the structure of 2PC and inserts a propagation phase between vote-collection and commit. The protocol assumes the same coordinator + N participants, with durable logs. The phases are PREPARE, PRE-COMMIT, and COMMIT.

Phase 1 — Prepare. Identical to 2PC's prepare. Coordinator sends PREPARE(txn_id). Each participant does its local work, durably writes a PREPARED log record, replies YES (or aborts and replies NO).

Phase 2 — Pre-commit. This is the new phase. After collecting all YES votes, the coordinator does not commit. Instead, it durably writes a PRE-COMMIT record and broadcasts PRE-COMMIT(txn_id) to every participant. Each participant durably writes its own PRE-COMMIT record and acks. The PRE-COMMIT record means: "the coordinator has decided to commit, and I have heard about that decision". No locks are released yet.

Phase 3 — Commit. After collecting all PRE-COMMIT acks, the coordinator durably writes a COMMIT record and broadcasts COMMIT(txn_id). Each participant applies the WAL entry, releases locks, writes COMMITTED, and acks.

Why the extra phase changes the blocking story: in 2PC, a participant in PREPARED that has lost contact with the coordinator cannot decide unilaterally — the coordinator might have already broadcast COMMIT to others. In 3PC, a participant in PRE-COMMIT knows every other participant voted YES (otherwise the coordinator would not have started PRE-COMMIT). So if the coordinator dies, the PRE-COMMIT participants can run a termination protocol: elect a new coordinator from among themselves, and that new coordinator can either commit (if it sees any PRE-COMMIT) or abort (if no participant has reached PRE-COMMIT). The state PRE-COMMIT carries enough information to make a safe decision without the original coordinator.

Three-phase commit message-sequence chartVertical lifelines for the coordinator and three participants A, B, C. Time flows downward. The coordinator sends PREPARE; participants reply YES. Coordinator sends PRE-COMMIT; participants ack. Coordinator sends COMMIT; participants commit. Three round-trips total. The PRE-COMMIT phase is highlighted as the new phase 3PC adds compared to 2PC. Annotations on the right indicate which states allow safe unilateral recovery and which do not. Illustrative — not measured data. 3PC: prepare, pre-commit, commit — three round-trips Coordinator A B C PREPARE YES PRE-CO PRE-COMMIT (new!) PRE-COMMIT records (this is the new state) ack COMMIT COMMIT PREPARED: bound, must wait for decision PRE-COMMIT: knows decision is COMMIT Illustrative — not measured data
3PC inserts the PRE-COMMIT phase between YES-votes and COMMIT. A participant in PRE-COMMIT carries enough information to finish the transaction without the original coordinator — provided every other participant in PRE-COMMIT can be reached.

The state machine — and the proof that 3PC is non-blocking under synchrony

The state machines for 3PC are similar to 2PC's, with one extra state per role.

Coordinator: INITWAIT (sent PREPAREs) → PRE-COMMIT (logged after all YES votes) → COMMIT (logged after all PRE-COMMIT acks) → DONE. Aborts can happen from WAIT (any NO vote, any timeout) but not after PRE-COMMIT is logged.

Participant: INITPREPARED (voted YES) → PRE-COMMIT-RECEIVEDCOMMITTED. Or INITABORTED from PREPARED if a NO occurred or the coordinator timed out before PRE-COMMIT.

Skeen's correctness argument rests on a simple invariant: at any moment, the participants' states differ by at most one step. If any participant is in COMMITTED, every other participant is in either PRE-COMMIT-RECEIVED or COMMITTED. If any participant is in PRE-COMMIT-RECEIVED, every other participant is in PREPARED, PRE-COMMIT-RECEIVED, or COMMITTED — never in INIT or ABORTED.

Why this invariant gives non-blocking recovery: when the coordinator dies, surviving participants run a termination protocol. They elect a new coordinator (any participant; deterministic election by lowest node-id is fine). The new coordinator polls everyone reachable for their state. If any participant reports COMMITTED or PRE-COMMIT-RECEIVED, the decision was COMMIT — broadcast COMMIT to all. If any reports ABORTED, the decision was ABORT. If everyone reports PREPARED and nobody reports PRE-COMMIT-RECEIVED, the new coordinator can safely abort: by the invariant, if any participant had reached PRE-COMMIT, at least one reachable participant would have too. The protocol is non-blocking because the PRE-COMMIT state carries the decision-bit forward without requiring the original coordinator.

Why this proof needs a synchronous network: the termination protocol's "if everyone reports PREPARED, abort" rule depends on the new coordinator being able to reach every participant that has reached PRE-COMMIT. If the network is partitioned and the new coordinator can only see a subset, it might see all-PREPARED in its subset while the partition's other side has all-PRE-COMMIT. Both sides run the termination protocol independently. One side aborts; the other commits. Atomicity broken. Skeen handles this by assuming a synchronous network: messages are either delivered within bound Δ or the sender knows they were dropped. With that assumption, the new coordinator can deterministically tell "no PRE-COMMIT exists anywhere" from "I cannot see them"; without that assumption, it cannot.

# Simulate 3PC's state machine and termination protocol under (a) synchronous
# network and (b) asymmetric partition. Demonstrates split-brain commit.
# pip install (no deps)
import enum, random

class S(enum.Enum):
    INIT = 0; PREPARED = 1; PRE_COMMIT = 2; COMMITTED = 3; ABORTED = -1

class Node:
    def __init__(self, name): self.name = name; self.state = S.INIT

def termination_protocol(reachable_nodes):
    """New coordinator's decision based on the states it can see."""
    states = [n.state for n in reachable_nodes]
    if any(s == S.COMMITTED for s in states): return "COMMIT"
    if any(s == S.PRE_COMMIT for s in states): return "COMMIT"
    if any(s == S.ABORTED for s in states): return "ABORT"
    return "ABORT"  # all PREPARED; safe under synchrony

# Setup: 5 participants, all voted YES, coordinator broadcast PRE-COMMIT
# but only A and B received it before the coordinator died.
nodes = [Node(c) for c in "ABCDE"]
for n in nodes[:2]: n.state = S.PRE_COMMIT     # A, B got PRE-COMMIT
for n in nodes[2:]: n.state = S.PREPARED       # C, D, E did not

# Synchronous network: new coordinator sees ALL nodes — invariant holds.
print("synchronous: new coord sees all 5 →", termination_protocol(nodes))

# Asymmetric partition: side-1 sees {A, B}; side-2 sees {C, D, E}.
side1 = nodes[:2]; side2 = nodes[2:]
print(f"partition side-1 sees {[n.name for n in side1]} →",
      termination_protocol(side1))
print(f"partition side-2 sees {[n.name for n in side2]} →",
      termination_protocol(side2))
print("→ split-brain: side-1 commits, side-2 aborts. Atomicity violated.")

Output:

synchronous: new coord sees all 5 → COMMIT
partition side-1 sees ['A', 'B'] → COMMIT
partition side-2 sees ['C', 'D', 'E'] → ABORT
→ split-brain: side-1 commits, side-2 aborts. Atomicity violated.

Per-line walkthrough:

  • termination_protocol is the recovery rule a new coordinator runs when the original dies: scan reachable participants, decide based on what state any of them has reached. The rule is correct iff the new coordinator can see every participant that reached PRE-COMMIT.
  • The synchronous case sees all 5 nodes — at least one reports PRE_COMMIT, so the decision is COMMIT, matching what the original coordinator would have done. This is the case Skeen's proof covers.
  • The asymmetric partition is the killer. Side-1 sees A and B (both PRE_COMMIT) and decides COMMIT. Side-2 sees C, D, E (all PREPARED, none PRE_COMMIT) and decides ABORT. Both decisions are "correct" by the rule; the rule's premise — full visibility — is what failed. The system has split-brained: a transaction is committed on one side and aborted on the other.

This is not a bug in the simulation. It is exactly what 3PC does on a real network. The fault is structural.

Why real networks break 3PC's assumption

The synchrony assumption says: there exists a known bound Δ such that any message between two non-failed nodes arrives within Δ. Equivalently, a timeout greater than Δ reliably distinguishes a crashed node from a slow one. Real networks violate this in three ways every distributed-systems engineer has experienced.

Garbage-collection pauses. A JVM stop-the-world GC at PaySetu's old wallet service routinely paused processes for 800 ms — sometimes 4 seconds. During that pause, the process cannot send heartbeats, cannot ack messages, cannot do anything. From the outside, it looks crashed. If you set the 3PC timeout below 4 s, the new coordinator declares the GC-paused node "failed" and runs the termination protocol without it. When the GC finishes, the paused node's state is still PRE_COMMIT — it joins the network with a state that contradicts the new coordinator's decision. Same outcome as the partition above: split-brain.

TCP retransmit storms. A switch in CricStream's VPC reboots; for the next 18 seconds, packets are queued at the upstream switch and arrive in a burst when the link comes back. During those 18 s, every TCP connection treats the silence as a partition and the burst as a recovery. If 3PC's timeout fired during the silence, the recovery from the burst delivers PRE-COMMIT messages to nodes whose new-coordinator already decided ABORT. They cannot agree.

Asymmetric partitions. A misconfigured firewall rule lets traffic from rack-A reach rack-B but blocks rack-B's replies. The coordinator on rack-A thinks rack-B is dead (no replies); rack-B can hear the coordinator just fine and thinks the coordinator is alive. Both sides start their own termination protocol with different views of who is reachable. This is the most common real-world cause of distributed-system disagreement, and 3PC's correctness proof has nothing to say about it because the proof assumes symmetric reachability.

The deeper point is from the FLP impossibility result (Fischer, Lynch, Paterson, 1985): in an asynchronous network with even one possible faulty process, no deterministic protocol can guarantee both safety and liveness for consensus-style problems. 3PC tries to be non-blocking (live) and atomic (safe). FLP says you can't have both without strengthening the network model. 3PC strengthens it to "synchronous", which is exactly the assumption the real network does not provide.

3PC split-brain commit under network partitionTwo side-by-side panels show the same 5-node 3PC cluster. Left panel ("synchronous network") shows the new coordinator reaching all 5 nodes; the decision is COMMIT. Right panel ("asymmetric partition") shows a vertical dashed line dividing the cluster; side-1 has 2 PRE-COMMIT nodes, side-2 has 3 PREPARED nodes. Each side independently runs the termination protocol and decides differently — side-1 COMMITs, side-2 ABORTs — producing inconsistent outcomes. Illustrative. Same protocol, two networks: only one is safe Synchronous network new coordinator reaches all 5 participants A PRE-COMMIT B PRE-COMMIT C PREPARED D E decision: COMMIT (sees A,B in PRE-COMMIT) Asymmetric partition two sides cannot reach each other A PRE-COMMIT B C PREPARED D E side-1: COMMIT side-2: ABORT split-brain — atomicity broken
Same five nodes, same protocol. With a synchronous network the termination protocol gives the right answer; with an asymmetric partition it gives two different answers on two different sides.

Production stories — what people actually use instead

KapitalKite's brief 3PC experiment. In 2017, KapitalKite (a fictional stockbroker) had a senior engineer who'd read the Skeen-Stonebraker papers in graduate school and prototyped 3PC for the order-book ↔ settlement transaction. He benchmarked it against 2PC: 3PC ran at ~62% of 2PC's throughput because of the extra round-trip (47 ms median for 2PC, 76 ms for 3PC on the same hardware), which he was prepared to accept. What he was not prepared to accept was the staging-environment incident in week 3: a deliberately-induced asymmetric partition (iptables rule blocking outbound from rack-2's switch but allowing inbound) caused a single trade to be marked SETTLED on the order-book side and UNSETTLED on the settlement side. The reconciliation script flagged it; the engineer reverted to 2PC + Consul-replicated coordinator within a sprint. His post-mortem to the team is still cited internally: "3PC trades blocking for inconsistency. Blocking is observable and pages on-call. Inconsistency is silent and shows up in a regulator's audit two months later." The lesson generalised: KapitalKite has not run 3PC since.

Spanner does not use 3PC. Google's Spanner is the canonical "atomic distributed transactions at planet scale" system, and it explicitly chooses Paxos-commit over 3PC. The Paxos paper (Gray and Lamport, 2006) frames the choice plainly: replicate the coordinator's decision through a Paxos quorum, and the decision survives any minority of failures without requiring synchrony. Spanner pays one Paxos round-trip for the commit decision (~5 ms intra-region, ~80 ms cross-region). It cannot split-brain because Paxos is provably safe under asynchrony — only liveness depends on partial synchrony, and "liveness fails" means "the transaction blocks until partition heals", which is operationally tolerable. 3PC's failure mode is "atomicity fails", which is not.

MealRush's saga-shaped order pipeline. When a hungry customer at MealRush places an order, the platform must reserve inventory at the restaurant, charge the customer, and assign a delivery rider — three services, three databases. The team explicitly chose a saga over any commit protocol. The forward path is reserve_inventory → charge_card → assign_rider → deliver. If assign_rider fails (no riders available within 8 minutes), compensating steps run: refund_card → release_inventory. The saga is not atomic — there is a moment when the card is charged but the rider hasn't been assigned, and during that moment the customer's app shows "preparing" while the system is mid-saga. But the saga survives partition: each step is independent, idempotent, and recoverable; no synchrony assumption anywhere. The team's architectural review note: "we considered 3PC for ~30 minutes in the design phase. We rejected it because rejecting it took 30 minutes; choosing 2PC would have required us to re-litigate the operational assumptions, and choosing Paxos-commit would have meant building a Paxos library before shipping the order pipeline."

Common confusions

  • "3PC fixes 2PC's blocking problem." Only under a synchronous network. On a real partial-synchronous network, 3PC trades 2PC's blocking for the possibility of split-brain commit. Operationally that is a worse trade — blocked transactions page on-call within seconds; split-brain takes hours or days to detect. See /wiki/2pc-in-detail-including-failure-modes for the blocking case 3PC was meant to solve.
  • "Add more phases and you get more safety." Skeen-Stonebraker proved the opposite: no atomic-commit protocol with a finite number of phases can be non-blocking on an asynchronous network. Adding phases trades round-trip latency for the ability to handle one extra failure scenario, but the FLP impossibility puts a hard ceiling on what protocols can guarantee. The way out is not more phases; it is replicating the decision (Paxos-commit) or giving up atomicity (sagas).
  • "3PC is non-blocking because PRE-COMMIT carries the decision." PRE-COMMIT carries the intended decision, but only the coordinator can authoritatively make the decision. If the coordinator dies between logging PRE-COMMIT and broadcasting it, and the partition divides those who heard from those who didn't, the surviving sides can both think they are the authority. PRE-COMMIT carries a decision-bit; it does not authorise unilateral decisions.
  • "You can deploy 3PC if your network is reliable enough." Reliability is not a binary. Even AWS's intra-AZ network, which is among the most reliable production environments in the world, has measurable rates of GC-pause-induced asymmetric "partitions" and TCP-retransmit-induced delivery delays. The synchrony assumption is violated even in good networks, just rarely; 3PC's failure mode is rare and silent, which is the worst combination.
  • "3PC is just 2PC with an extra ack — what's the harm?" The harm is that the extra ack creates a state (PRE-COMMIT) where participants believe they have enough information to act unilaterally. Their belief is correct under synchrony and wrong under partition. 2PC's PREPARED participants don't believe they can act unilaterally, so they block; blocking is louder but safer than acting on a stale belief.
  • "Modern systems use 3PC under another name." They don't. Paxos-commit, Raft-replicated coordinators, two-phase commit with replicated leaders, and sagas are all distinct protocols with different correctness properties. None of them is 3PC. If you see a system labelled "3PC" in production, it is almost certainly either a misnamed Paxos-commit or a research prototype.

Going deeper

The Skeen-Stonebraker proof and where it breaks

Skeen and Stonebraker's 1983 paper formalises the 3PC argument. Their model: N participants, one coordinator, durable logs, and a synchronous network where any non-failed node receives any sent message within bound Δ. Under this model, they prove the buffer property: at any moment, the participants' states cannot differ by more than one phase. This bounds the recovery work — a new coordinator polling reachable participants needs only to find the highest state any of them has reached. The proof relies on Δ being known and finite; if Δ is unbounded, the buffer property fails because a slow message between phases can outlive any timeout. Real networks have unbounded Δ (a TCP connection can stall for minutes during a routing flap), and that is exactly where 3PC's safety proof stops covering it. The Dwork-Lynch-Stockmeyer (1988) work on partial synchrony is the formal framework that explains why protocols that assume bounded delay fail when delay is unbounded but eventually-bounded.

Paxos-commit — the protocol that replaced 3PC in practice

Gray and Lamport (2006) proposed Paxos-commit as a non-blocking atomic-commit protocol that does not require synchrony. The structure: each participant's vote is itself a Paxos-replicated value, and the commit decision is computed from the quorum of votes. If the original "coordinator" (more accurately, the leader of the Paxos group) dies, the next leader recovers the vote-quorum from the Paxos log and computes the same decision. The protocol is safe under asynchrony (Paxos is) and live under partial synchrony (Paxos is). The cost: each phase becomes a Paxos round (multiple message exchanges), so commit latency is higher than 2PC's. Spanner uses this. Percolator uses a related design. Google's Megastore used it before Spanner. The pattern in industry is consistent: when teams need non-blocking atomic commit, they replicate the decision through consensus, not through extra phases.

Why no production database ships 3PC

Audit the major production databases — Postgres, MySQL, MongoDB, CockroachDB, Spanner, FoundationDB, TiDB, YugabyteDB. None of them implements 3PC. Most implement either standard 2PC (Postgres prepared transactions, MySQL XA) or Paxos/Raft-based atomic commit (CockroachDB, Spanner, YugabyteDB). The reason is unanimous in the design docs: 3PC's synchrony assumption cannot be guaranteed in deployments these systems target (cloud VPCs with GC-pausing runtimes, geo-distributed clusters with WAN latency, container orchestrators that pause processes during scheduling). The synchrony assumption is testable in academic simulations and not testable in production. Database authors who care about correctness pick protocols whose assumptions match their deployment realities.

Where this leads next

The structural lesson of 3PC is that adding phases doesn't escape FLP — only changing the network model or weakening the guarantee does. The two productive directions are:

The next chapters in this part follow these two directions. 3PC is the protocol you study to understand why neither direction is optional.

References

  • Skeen, D. (1981). "Nonblocking commit protocols". SIGMOD '81. The original 3PC paper. https://dl.acm.org/doi/10.1145/582318.582339
  • Skeen, D., & Stonebraker, M. (1983). "A formal model of crash recovery in a distributed system". IEEE TSE. The synchrony assumption made explicit.
  • Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of distributed consensus with one faulty process". JACM. The FLP result. https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf
  • Dwork, C., Lynch, N., & Stockmeyer, L. (1988). "Consensus in the presence of partial synchrony". JACM. The framework that explains why 3PC's network model is unrealistic.
  • Gray, J., & Lamport, L. (2006). "Consensus on transaction commit". ACM TODS. Paxos-commit; the modern replacement. https://www.microsoft.com/en-us/research/publication/consensus-on-transaction-commit/
  • Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency Control and Recovery in Database Systems. Addison-Wesley. Chapter 7 covers 2PC and 3PC and is the standard textbook reference.
  • Kingsbury, K. "Jepsen analyses". The Jepsen test reports document real-world consistency failures in distributed databases; many trace back to synchrony assumptions like 3PC's. https://jepsen.io/analyses
  • /wiki/2pc-in-detail-including-failure-modes — the protocol 3PC was meant to fix.