Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Two-phase commit (2PC) in detail, including failure modes

At 02:14 IST, Karan at PaySetu was staring at a single row in the audit-events Postgres replica that read state='PREPARED', txn_id='tx-9f2a-...', held_since='02:14:33'. The wallet service had the same row. The ledger service had the same row. All three had voted YES. None of them had committed. The coordinator process — a Java service on a host called coord-payments-3 — had received the third YES at 02:14:33.214, written its commit-decision log entry, and then segfaulted. The three participants were now in a state called prepared but undecided: they had promised to commit if asked, they could not safely abort, and they could not look at each other to figure it out. They could only wait for the coordinator's stand-by to take over and tell them what the dead coordinator had decided. For 89 seconds, every transaction touching the merchant's wallet, the ledger's payment-events table, or the audit's event-log queued up. The 02:14 page was not a 2PC bug. The 02:14 page was 2PC working exactly as designed.

Two-phase commit is the canonical atomic-commit protocol: a coordinator asks every participant "can you commit?" (PREPARE), every participant durably votes YES or NO, and only after all YES votes does the coordinator issue COMMIT. It guarantees atomicity in the absence of failures and remains atomic across most failures, but it pays for that with blocking — a participant that has voted YES cannot unilaterally decide what to do if the coordinator dies, so it holds locks indefinitely. Every production 2PC system layers timeouts, presumed-abort/presumed-commit optimisations, and a replicated coordinator on top to make blocking rare; none of them eliminates it.

What 2PC actually is — the protocol step by step

The protocol is named for its two phases: prepare and commit. It assumes a coordinator (one process, one machine) and N participants (the nodes whose state will change). Each participant has a local write-ahead log; each participant can durably persist a vote. The coordinator has its own log. The protocol uses these logs to make decisions survive crashes.

Phase 1 — Prepare. The coordinator sends PREPARE(txn_id) to every participant. Each participant does whatever local work it needs (acquires row locks, writes the proposed changes to its WAL in an "uncommitted" form, runs constraint checks). If the local work succeeds, the participant writes a PREPARED log record durably and replies YES. If anything fails (constraint violation, deadlock detected, disk full), the participant writes an ABORTED record locally and replies NO. Crucially, after replying YES, the participant gives up its right to abort unilaterally — the row locks stay held, and the only way out is the coordinator's decision.

Phase 2 — Commit (or abort). The coordinator collects votes. If every participant replied YES, the coordinator writes a COMMIT record to its own log durably (this is the commit point — the moment the transaction's outcome becomes determined) and sends COMMIT(txn_id) to every participant. Each participant applies the WAL entry, releases locks, writes a COMMITTED record, and acks. If any participant replied NO, or any participant timed out before voting, the coordinator writes an ABORT record and sends ABORT(txn_id) to every participant; each participant rolls back its WAL entry, releases locks, writes an ABORTED record, and acks.

Why the coordinator's log entry — not the messages going out — is the commit point: messages can be lost, but the log entry is durable. Once COMMIT is in the coordinator's log, the coordinator (or its stand-by, after a crash) is committed to driving the protocol forward to completion. If the coordinator crashes after writing COMMIT but before sending the messages, the recovery process reads the log, sees the decision, and re-sends. The participants are idempotent on COMMIT(txn_id) — receiving the same commit twice is a no-op. The commit point is the durable log entry; everything else is delivery mechanics.
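The delivery mechanics can be made concrete with a small sketch. The names here (FlakyParticipant, drive_to_completion) are illustrative, not from any real implementation; the point is that re-sending a durable decision is always safe, because participants treat a duplicate delivery as a no-op:

```python
class FlakyParticipant:
    """Acks a decision, but drops the first delivery to simulate a lost message."""
    def __init__(self, name):
        self.name, self.state, self.drops_left = name, "PREPARED", 1

    def deliver(self, txn_id, decision):
        if self.drops_left:            # simulate message loss
            self.drops_left -= 1
            return False               # no ack: the coordinator must retry
        if self.state == "PREPARED":   # first successful delivery: apply it
            self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return True                    # duplicate deliveries are no-ops

def drive_to_completion(txn_id, decision, participants):
    """Retry the broadcast until every participant has acked the decision."""
    pending = set(participants)
    while pending:
        pending = {p for p in pending if not p.deliver(txn_id, decision)}

ps = [FlakyParticipant(n) for n in "ABC"]
drive_to_completion("tx-1", "COMMIT", ps)
print([p.state for p in ps])   # every participant ends COMMITTED
```

The loop terminates only when every participant has acked, which is exactly the coordinator's post-commit-point obligation.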

[Figure: Two-phase commit message-sequence chart. Vertical lifelines for the coordinator and three participants A, B, C; time flows downward. The coordinator sends PREPARE to each participant; each participant writes a durable PREPARED log record and replies YES. The coordinator writes its COMMIT log record (the commit point, where the decision becomes durable), then sends COMMIT to each participant; each writes COMMITTED and acks. Annotations mark the lock-held window per participant, from PREPARED to COMMIT. Illustrative, not measured data.]
The commit point is the coordinator's durable log entry, not the broadcast. Lock-held windows on participants extend from PREPARED to COMMIT — the longer this window, the more transactions queue behind these locks.

The state machine — what each role can be in

Every 2PC analysis comes back to the state machines. The coordinator and the participants each move through a small set of states, and the failure modes are exactly the states from which neither party can make progress alone.

Coordinator states. INIT (no transaction yet) → WAIT (sent PREPAREs, waiting for votes) → either ABORT (some vote was NO or some timeout fired) or COMMIT (all votes YES, decision logged) → DONE (all participants acked the decision).

Participant states. INIT → either ABORTED (voted NO) or PREPARED (voted YES, locks held, awaiting decision) → COMMITTED or ABORTED (decision received, locks released).

The dangerous state is PREPARED on the participant side. A participant in PREPARED is bound — it has promised the coordinator that it will follow the coordinator's decision, and it is holding locks while waiting. It cannot decide unilaterally because some other participant might have crashed before voting; if this participant aborts and another commits, the transaction is no longer atomic.

Why a PREPARED participant cannot just time out and abort: the coordinator may have already received this participant's YES, collected all other YES votes, written COMMIT to its log, and crashed before sending COMMIT messages. The decision is COMMIT — durable, recoverable. If this participant unilaterally aborts, then when the coordinator's stand-by recovers and broadcasts the COMMIT decision to other participants who didn't time out, those participants will commit and this one will have aborted. Atomicity broken. The only safe action in PREPARED is to wait for the coordinator (or its replacement) to tell you the decision.
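Both state machines are small enough to write down as data. Here is a sketch with names of my own choosing: a transition table per role plus a checker that rejects illegal moves. Note that PREPARED does have an outgoing ABORTED edge, but only a coordinator decision may take it, never a local timeout:

```python
# Legal state transitions for each role in 2PC. Table and function names are
# illustrative, not from any particular implementation.
COORDINATOR = {
    "INIT": {"WAIT"},
    "WAIT": {"COMMIT", "ABORT"},
    "COMMIT": {"DONE"},
    "ABORT": {"DONE"},
    "DONE": set(),
}
PARTICIPANT = {
    "INIT": {"PREPARED", "ABORTED"},
    "PREPARED": {"COMMITTED", "ABORTED"},  # both edges require a coordinator decision
    "COMMITTED": set(),
    "ABORTED": set(),
}

def step(table, state, nxt):
    """Advance a state machine; raise on any transition the protocol forbids."""
    if nxt not in table[state]:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt

s = step(PARTICIPANT, "INIT", "PREPARED")   # voted YES, locks held
s = step(PARTICIPANT, s, "COMMITTED")       # coordinator said COMMIT
print(s)
```

The terminal states have no outgoing edges: once COMMITTED or ABORTED is logged, the outcome for that participant is final.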

# A working 2PC implementation, including coordinator and participant logs.
# stdlib only; no third-party dependencies
import enum, json, os

class Vote(enum.Enum):
    YES = "yes"
    NO = "no"

class Participant:
    """A participant with a durable WAL — survives a crash via log replay."""
    def __init__(self, name, log_path, fail_in_prepare=False):
        self.name = name
        self.log_path = log_path
        self.fail_in_prepare = fail_in_prepare
        self.state = "INIT"
        self.locks = set()

    def _log(self, record):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush(); os.fsync(f.fileno())  # durable

    def prepare(self, txn_id, write_set):
        if self.fail_in_prepare:
            self._log({"txn": txn_id, "state": "ABORTED", "reason": "local-fail"})
            self.state = "ABORTED"; return Vote.NO
        # Acquire locks; on any conflict, release what we took and vote NO.
        acquired = set()
        for k in write_set:
            if k in self.locks:
                self.locks -= acquired  # release partial acquisitions
                self._log({"txn": txn_id, "state": "ABORTED", "reason": "lock-conflict"})
                self.state = "ABORTED"; return Vote.NO
            self.locks.add(k); acquired.add(k)
        self._log({"txn": txn_id, "state": "PREPARED", "writes": list(write_set)})
        self.state = "PREPARED"
        return Vote.YES

    def commit(self, txn_id):
        if self.state == "COMMITTED":
            return  # duplicate COMMIT: idempotent no-op
        assert self.state == "PREPARED", f"{self.name} cannot commit from {self.state}"
        self._log({"txn": txn_id, "state": "COMMITTED"})
        self.state = "COMMITTED"; self.locks.clear()

    def abort(self, txn_id):
        if self.state == "ABORTED":
            return  # already aborted (e.g. voted NO): idempotent no-op
        self._log({"txn": txn_id, "state": "ABORTED"})
        self.state = "ABORTED"; self.locks.clear()

class Coordinator:
    def __init__(self, log_path): self.log_path = log_path
    def _log(self, record):
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n"); f.flush(); os.fsync(f.fileno())
    def run(self, txn_id, participants, write_sets):
        votes = [p.prepare(txn_id, write_sets[p.name]) for p in participants]
        if all(v == Vote.YES for v in votes):
            self._log({"txn": txn_id, "decision": "COMMIT"})  # COMMIT POINT
            for p in participants: p.commit(txn_id)
            return "COMMITTED"
        else:
            self._log({"txn": txn_id, "decision": "ABORT"})
            for p in participants: p.abort(txn_id)
            return "ABORTED"

# Demo: a 3-participant transfer; B will fail in prepare.
os.makedirs("/tmp/2pc", exist_ok=True)
A = Participant("A", "/tmp/2pc/A.log")
B = Participant("B", "/tmp/2pc/B.log", fail_in_prepare=True)
C = Participant("C", "/tmp/2pc/C.log")
coord = Coordinator("/tmp/2pc/coord.log")
result = coord.run("tx-1", [A, B, C],
                   {"A": {"wallet:karan"}, "B": {"ledger:row-9"}, "C": {"audit:evt-1"}})
print(f"transaction tx-1 result: {result}")
print(f"A.state={A.state}  B.state={B.state}  C.state={C.state}")

Output:

transaction tx-1 result: ABORTED
A.state=ABORTED  B.state=ABORTED  C.state=ABORTED

Walkthrough:

  • _log calls f.flush() then os.fsync(...) because that's what makes the record survive a crash. A write without fsync sits in the page cache; if the kernel panics, the record is gone. Real production 2PC implementations (Postgres prepared transactions, MySQL XA) use the same pattern. The fsync cost is exactly why 2PC commits are slow — every participant pays one fsync at PREPARE and one at COMMIT, and the coordinator pays one at the commit point.
  • coord.run writes decision: COMMIT only after all votes are YES — that line is the commit point. If you crash the process between the _log call and the loop sending commit messages, the recovery procedure reads the log, sees COMMIT, and replays the broadcast. If you crash before the _log call, the recovery procedure sees no decision and aborts.
  • The assert self.state == "PREPARED" in commit catches the out-of-order bug where a confused coordinator tries to commit a participant that didn't vote YES. Real implementations log this and refuse — silent acceptance would be a correctness violation.

Failure modes — the six places it can break

A 2PC protocol can fail in six places. The classification is from the original Skeen-Stonebraker paper and is the foundation of every recovery procedure.

1. Participant fails before voting. No votes received, coordinator times out, decision is ABORT. Safe — the failed participant didn't promise anything; on recovery it sees no PREPARED record for the txn, treats it as never-started, no action needed.

2. Participant fails after voting NO. The coordinator has enough information to abort. On recovery the participant sees ABORTED in its log; there are no locks to release, since its in-memory lock state died with the process. No action needed.

3. Participant fails after voting YES, before receiving decision. This is the dangerous case. On recovery, the participant reads its log, sees PREPARED, and is in the bound state. It must contact the coordinator (or any peer) to learn the decision. If the coordinator is also dead, the participant blocks until coordinator recovery. Production systems mitigate via a "termination protocol": the recovering participant asks every other participant — if any has COMMITTED, the decision was COMMIT; if any has ABORTED and is reachable, ABORT; if all are PREPARED, block.

4. Coordinator fails before logging COMMIT. No decision was made durable. On recovery, the coordinator reads its log, sees no decision for the txn, and aborts. It broadcasts ABORT to all participants. Safe.

5. Coordinator fails after logging COMMIT, before sending COMMIT messages. This is the case the durable log is designed for. On recovery, the coordinator reads COMMIT, broadcasts COMMIT to all participants. Participants are idempotent — they apply it once. Safe but introduces latency.

6. Coordinator fails after some COMMIT messages sent. Some participants are committed, others are still PREPARED. On recovery, the coordinator broadcasts COMMIT to all (idempotent on already-committed participants). Eventually all commit. Safe.

The problem case is #3 + #5 simultaneously: coordinator dies after logging COMMIT, and one participant also dies. The other participants are PREPARED, the coordinator's stand-by takes over and starts broadcasting COMMIT, but the dead participant cannot receive it. The other participants commit. The dead participant comes back, finds itself PREPARED, queries the others, sees they committed, commits itself. The system is eventually consistent — but during the window between coordinator-recovery and dead-participant-recovery, the dead participant's data is stale, and any reader of its data is stale. This is the partial-availability window of 2PC — and it is bounded only by how fast the dead participant recovers.
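The termination protocol from case 3 fits in a few lines. This is a sketch (function name hypothetical): the recovering participant passes in the states of the peers it could reach, and unreachable peers are simply absent:

```python
def cooperative_terminate(reachable_peer_states):
    """Decide a PREPARED participant's fate from its reachable peers' states.

    - any COMMITTED peer proves the decision was COMMIT
    - any ABORTED peer (or a peer that never voted) proves COMMIT was never
      decided, so ABORT is safe
    - if every reachable peer is also PREPARED, the decision may exist only
      in the dead coordinator's log, and the only safe move is to block
    """
    if "COMMITTED" in reachable_peer_states:
        return "COMMIT"
    if {"ABORTED", "INIT"} & set(reachable_peer_states):
        return "ABORT"
    return "BLOCK"

print(cooperative_terminate(["PREPARED", "COMMITTED"]))  # COMMIT
print(cooperative_terminate(["PREPARED", "ABORTED"]))    # ABORT
print(cooperative_terminate(["PREPARED", "PREPARED"]))   # BLOCK
```

The INIT case is a standard extension of the rule in case 3: a peer that never voted means the coordinator cannot have collected all YES votes, so abort is safe.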

[Figure: six failure points on the 2PC timeline, one panel each. Panels 1, 2, 4, 5, and 6 are green ("safe: recoverable without blocking"): participant fails pre-vote; participant fails after voting NO; coordinator fails before logging COMMIT; coordinator fails after logging COMMIT; coordinator fails mid-broadcast. Panel 3 is amber ("blocks until the coordinator recovers: the structural cost"): participant fails after voting YES, with its log saying PREPARED. Illustrative.]
Five of the six failure modes recover safely. Mode 3 — participant fails after voting YES — is the structural cost: it blocks until the coordinator (or its stand-by) tells the participant the decision.

Production stories — what 2PC actually does at scale

PaySetu's locked-merchant incident, in detail. The 02:14 page traced to coordinator process coord-payments-3 segfaulting at 02:14:33.214 immediately after writing its COMMIT record. The three participants — wallet, ledger, audit — were all in PREPARED, holding row locks on the merchant's wallet row, the ledger's payment-event row, and the audit's event-log partition. The stand-by coordinator coord-payments-3-standby had been receiving a tail of the WAL at ~30 ms lag. It detected the master's death via Consul session expiry at 02:14:48 (a 15-second TTL — the ops team later argued whether to drop this to 5 s or accept the false-positive rate). It started recovery: read the WAL, found the COMMIT record, broadcast COMMIT to all three. By 02:14:52, all three participants had committed and released locks. Total blocking window: 89 seconds, of which 15 s were Consul TTL, 19 s were WAL replay (the stand-by had been lagging due to a backup process competing for IOPS), and 55 s were retry / acknowledgment. During that 89 s, ~14,000 merchant-wallet transactions queued. The ops team's runbook now says "if 2PC blocking exceeds 30 s, force takeover via Consul session-kill"; the architectural fix shipped six months later was Paxos-commit for the merchant flow, eliminating the single-coordinator dependency.

KapitalKite's order-book vs settlement 2PC. KapitalKite (a fictional stockbroker) uses 2PC between the order-book service and the settlement service for trade confirmations. They measured 99.9th-percentile commit latency at 47 ms intra-region (single AZ, ~0.5 ms RTT) and 180 ms cross-AZ. They explicitly do not use 2PC across the broker → exchange boundary — that's a saga with at-least-once delivery and idempotent settlement. The reasoning: 2PC blocking is acceptable when both participants are in the broker's blast radius (a coordinator failure pages the broker's on-call, who fixes both); 2PC blocking is not acceptable when one participant is a third-party exchange, because the broker cannot force the exchange's coordinator to recover. The lesson generalises: 2PC is a within-trust-boundary protocol. Across trust boundaries, you want a saga.

MySQL XA in production — and why most teams turn it off. MySQL's XA transactions implement the X/Open 2PC protocol, allowing 2PC across multiple MySQL instances or across MySQL and other XA-capable resources. In practice, MySQL XA has a known bug history (XA transactions surviving server restarts incorrectly, prepared transactions not always replicated to replicas) and significant performance overhead (each prepare / commit forces an fsync, cutting write throughput by 3-5×). Most production MySQL deployments at scale (BharatBazaar's checkout service, MealRush's order pipeline) explicitly disable XA and use application-level sagas instead — the operational complexity of a stuck XA transaction at 03:00 IST exceeds the consistency benefit. Postgres prepared transactions (PREPARE TRANSACTION, COMMIT PREPARED) are healthier but still discouraged for cross-database use; most Postgres-on-Postgres 2PC stories in the wild are within a single application talking to two schemas.

Common confusions

  • "2PC is the same as Paxos." 2PC is an atomic-commit protocol with a single coordinator; Paxos is a consensus protocol that replicates a value to a quorum. 2PC blocks if its single coordinator dies; Paxos doesn't, because the value is replicated. Saying "we use 2PC" tells you nothing about whether the coordinator is itself fault-tolerant. Paxos-commit (Gray-Lamport 2006) wraps the 2PC decision in a Paxos quorum to eliminate blocking — it is 2PC with a non-blocking coordinator.
  • "PREPARED means committed." No. PREPARED means "I have promised to commit if asked, but I have not yet committed." The transaction's effect is not yet visible to readers (locks are held but the changes are uncommitted). A reader on a participant in PREPARED state sees the pre-transaction value, not the post-transaction value. Visibility happens at COMMIT, not PREPARE.
  • "You can avoid blocking by adding more participants." No, the opposite. Adding more participants increases the probability that some participant fails during PREPARE, increases the lock-held window (slowest participant determines commit latency), and increases the blast radius if the coordinator dies (more PREPARED participants stuck). 2PC scales poorly with participant count for exactly this reason.
  • "3PC fixes 2PC's blocking." 3PC reduces blocking only under a stronger network-synchrony assumption (bounded message delay). On a real partial-synchronous network, 3PC can still block, and it doubles the message count. In practice, nobody runs 3PC; the alternatives are Paxos-commit or sagas. See /wiki/3pc-and-why-it-doesnt-help.
  • "If the coordinator is replicated, blocking goes away." A replicated coordinator (Paxos / Raft) eliminates blocking on coordinator crashes, but blocking on participant crashes during PREPARE still exists — the coordinator must wait for the participant or time out and abort. Replicated-coordinator 2PC is faster to recover, not faster to commit.
  • "Sending COMMIT before all participants ack PREPARE is a small optimisation." It is a correctness violation. The protocol's atomicity proof depends on the coordinator collecting all YES votes before logging COMMIT. Optimising "early commit" is one of the most common ways production 2PC implementations get atomicity wrong in incident reports.

Going deeper

Presumed abort and presumed commit — the two classical optimisations

Vanilla 2PC requires both the coordinator and every participant to log every state change durably and to keep state for every transaction until all acks are received. For high transaction volumes this is expensive. Two optimisations from the original Mohan-Lindsay-Obermarck paper:

Presumed abort. If the coordinator forgets a transaction (purges its log), and a participant later asks "what was the decision for txn-X?", the coordinator answers ABORT by default. This means the coordinator can forget aborted transactions immediately after broadcasting ABORT — no need to wait for acks, no need to log. Saves coordinator log space and reduces post-abort messages. The cost: committed transactions still require full logging.
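A minimal sketch of the presumed-abort rule (class and method names are mine, not from any real system): the coordinator keeps state only for the decisions it must remember, and anything it has forgotten defaults to ABORT:

```python
class PresumedAbortCoordinator:
    """Presumed abort: a query about a forgotten transaction answers ABORT."""
    def __init__(self):
        self.decisions = {}  # stands in for the coordinator's durable decision log

    def log_commit(self, txn_id):
        # COMMIT must be remembered until every participant has acked it.
        self.decisions[txn_id] = "COMMIT"

    def abort_and_forget(self, txn_id):
        # No log write, no ack collection: forgetting *is* the abort record,
        # because later queries fall through to the ABORT default.
        self.decisions.pop(txn_id, None)

    def query(self, txn_id):
        return self.decisions.get(txn_id, "ABORT")

c = PresumedAbortCoordinator()
c.log_commit("tx-1")
c.abort_and_forget("tx-2")
print(c.query("tx-1"), c.query("tx-2"), c.query("tx-never-seen"))
```

The asymmetry is the whole optimisation: aborts become free to forget, while commits still pay for full logging and ack collection.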

Presumed commit. Symmetric — the coordinator answers COMMIT on missing transactions. This requires the coordinator to log its own decision before sending PREPAREs (so a forgotten transaction's default answer is correct), but it skips the per-commit ack collection. Saves messages on the common path (most transactions commit). The cost: every transaction pays an extra log write up front.

Most production systems (Postgres prepared transactions, X/Open XA, the original DEC Acta) implement presumed abort. Presumed commit is more aggressive but harder to recover correctly when the coordinator's pre-PREPARE log is lost.

The replicated coordinator — Paxos-commit and Spanner

If the single coordinator is the blocking risk, replicate it. Paxos-commit (Gray-Lamport 2006) replaces the coordinator with a Paxos group: the commit decision is itself a value agreed by a quorum. Now there is no single coordinator to crash — the decision is replicated, and if a leader dies, the next leader recovers the decision from the quorum.

Spanner takes this to production scale. Every Spanner write is a 2PC across the participating shards' Paxos groups, and the commit decision is itself replicated by Paxos. Single-region writes pay ~5 ms (one Paxos round-trip). Multi-region writes pay ~80 ms (cross-region Paxos plus the TrueTime "wait out the uncertainty" bound, typically ~7 ms). The trade-off is deliberate: Spanner pays this latency on every commit because blocking on a coordinator failure is operationally worse than slow commits. Spanner is the existence proof that you can have atomic commit at planet scale; you just pay the consensus latency.

Why heuristic decisions are the worst possible answer

Some XA implementations allow a "heuristic decision": a stuck PREPARED participant, after a configurable timeout, is allowed to unilaterally commit or abort. This is operationally tempting — you don't want a participant blocked forever — but it breaks atomicity by design. If the participant heuristically commits and the coordinator's recovered decision was ABORT, the system is inconsistent: a transaction that was aborted by the protocol is committed by one participant. Most XA tools log "heuristic mismatch" alerts when this happens, but the inconsistency is real and must be reconciled by hand.
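A sketch of why a heuristic mismatch is unfixable after the fact (names hypothetical): once a participant has heuristically committed, a recovered ABORT decision can only be flagged for humans, not undone:

```python
def reconcile(heuristic_outcome, recovered_decision):
    """Compare a participant's heuristic choice against the protocol's decision.

    A mismatch cannot be repaired automatically: the heuristic commit already
    released locks and exposed its effects to readers."""
    if heuristic_outcome == recovered_decision:
        return "consistent (the guess happened to match)"
    return "HEURISTIC_MISMATCH: manual reconciliation required"

# A participant stuck in PREPARED times out and heuristically commits...
print(reconcile("COMMIT", "COMMIT"))  # lucky: no visible damage
print(reconcile("COMMIT", "ABORT"))   # atomicity broken; page a human
```

The lucky branch is what makes heuristics seductive in testing; the unlucky branch is what shows up in incident reports.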

The right answer is not heuristic decisions. The right answer is either (a) wait longer, or (b) use a protocol where blocking can't happen (Paxos-commit, sagas). Heuristics are a sign that 2PC was the wrong choice for the workload.

Reproduce a stuck PREPARED participant on your laptop

# Simulate: coordinator dies after writing COMMIT but before broadcasting.
# The stand-by must read the log and finish the broadcast.
# stdlib only; no third-party dependencies
import json, os

os.makedirs("/tmp/2pc-recover", exist_ok=True)
COORD_LOG = "/tmp/2pc-recover/coord.log"
A_LOG = "/tmp/2pc-recover/A.log"

# Simulate: write coordinator's COMMIT record, but never broadcast.
with open(COORD_LOG, "w") as f:
    f.write(json.dumps({"txn": "tx-7", "decision": "COMMIT"}) + "\n")
with open(A_LOG, "w") as f:
    f.write(json.dumps({"txn": "tx-7", "state": "PREPARED",
                        "writes": ["wallet:karan"]}) + "\n")

# A is now in PREPARED, holding a lock on wallet:karan. Coordinator is "dead".
# Stand-by recovery: read coord log, find decision, broadcast COMMIT.
def recover():
    with open(COORD_LOG) as f:
        records = [json.loads(line) for line in f if line.strip()]
    decisions = {r["txn"]: r["decision"] for r in records if "decision" in r}
    # For each prepared participant, send the decision.
    with open(A_LOG) as f:
        a_records = [json.loads(line) for line in f if line.strip()]
    for r in a_records:
        if r.get("state") == "PREPARED" and r["txn"] in decisions:
            d = decisions[r["txn"]]
            with open(A_LOG, "a") as f:
                f.write(json.dumps({"txn": r["txn"],
                                    "state": "COMMITTED" if d == "COMMIT" else "ABORTED"}) + "\n")
            print(f"recovered {r['txn']} on A: {d}")
recover()
print(open(A_LOG).read())

Output:

recovered tx-7 on A: COMMIT
{"txn": "tx-7", "state": "PREPARED", "writes": ["wallet:karan"]}
{"txn": "tx-7", "state": "COMMITTED"}

Why this works only because the coordinator's COMMIT log was durable: if the coordinator had crashed before fsyncing the COMMIT record, the recovery procedure would find no decision, and the only safe choice would be to ABORT — which is correct, because a non-durable COMMIT means the protocol never reached the commit point. The fsync is the protocol's atomicity boundary; everything that happened before it is recoverable, everything that happened after it is determined.
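To see the other side of that boundary, rerun the recovery sketch with an empty coordinator log, simulating a crash before the COMMIT record reached disk. Under the same hypothetical file layout (in a fresh directory), the stand-by finds no decision, and the only safe outcome for the PREPARED participant is ABORT:

```python
import json, os

os.makedirs("/tmp/2pc-no-commit", exist_ok=True)
COORD_LOG = "/tmp/2pc-no-commit/coord.log"
A_LOG = "/tmp/2pc-no-commit/A.log"

# Coordinator crashed before its COMMIT record became durable: empty log.
open(COORD_LOG, "w").close()
with open(A_LOG, "w") as f:
    f.write(json.dumps({"txn": "tx-8", "state": "PREPARED"}) + "\n")

def recover():
    with open(COORD_LOG) as f:
        records = [json.loads(line) for line in f if line.strip()]
    decisions = {r["txn"]: r["decision"] for r in records if "decision" in r}
    with open(A_LOG) as f:
        a_records = [json.loads(line) for line in f if line.strip()]
    for r in a_records:
        if r.get("state") == "PREPARED":
            # No durable decision means the commit point was never reached,
            # so ABORT is the only safe (and correct) outcome.
            d = decisions.get(r["txn"], "ABORT")
            with open(A_LOG, "a") as out:
                out.write(json.dumps({"txn": r["txn"],
                                      "state": "COMMITTED" if d == "COMMIT" else "ABORTED"}) + "\n")
            print(f"recovered {r['txn']} on A: {d}")

recover()  # prints: recovered tx-8 on A: ABORT
```

Same recovery logic, opposite outcome, and both outcomes are correct: the durable log, not the crash timing, determines what the stand-by does.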

Where this leads next

The next chapter, /wiki/3pc-and-why-it-doesnt-help, walks through three-phase commit — the obvious "just add another phase" attempt to fix 2PC's blocking — and shows formally why it fails to deliver under realistic network assumptions. After that, /wiki/sagas-forward-and-compensating takes the saga approach in detail, including how to design compensations that actually compose correctly.

Beyond that, /wiki/percolator covers Google's Percolator design, which uses snapshot isolation plus a 2PC-like protocol over Bigtable, and /wiki/spanner-and-truetime covers Spanner's Paxos-commit + TrueTime architecture — the production answer to 2PC's blocking problem.

By the end of Part 14, every "transactional" claim in your stack will be visibly one of these shapes — 2PC, saga, or Paxos-commit — plus a marketing name on top. Recognising the shape is the goal.

References