Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Distributed transactions: the problem statement
Karan at PaySetu is reading a stack trace at 02:14 IST. A user's wallet shows ₹0; the merchant's account shows ₹500 received. The ledger service shows the debit row with status='committed', the merchant-credit row with status='committed', but the audit service has no record of the parent transfer. Three databases, three "committed" flags, and the books do not balance by ₹500. He has been on call for six hours. His laptop fan is loud. The runbook says "use the 2PC dashboard"; the 2PC dashboard says all participants voted yes. Karan is about to learn that "all yes" is not the same as "all committed", and that the gap between those two phrases is exactly the problem this entire part of the curriculum exists to address.
A distributed transaction is the demand that a set of operations on multiple independent nodes be atomic — either every node applies its write or none does — under the assumption that any node, the network, or the coordinator can fail at any moment. The problem is unsolvable in finite time without either blocking some participant or accepting bounded inconsistency, because there is no single instant when all participants are simultaneously committed and you can read that fact. Every "exactly-once" or "transactional" claim across nodes pays this tax somewhere.
What "transaction" actually demands when you cross a node boundary
On a single Postgres instance, a transaction is a thin abstraction over a write-ahead log: you call BEGIN, you do some writes (which go to the WAL and a private snapshot), you call COMMIT, the WAL flushes, and a single byte on disk flips from "uncommitted" to "committed". One disk, one byte, one fsync. Atomicity is trivially achievable because there is one source of truth and one moment of truth.
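To see how thin that abstraction is, here is a minimal single-node sketch using Python's sqlite3 from the standard library (the table, names, and amounts are illustrative, not PaySetu's schema): there is exactly one commit point, and no reader can ever observe the debit without the credit.
# Single node, single commit point: both updates become visible together or not at all.
# Illustrative only; an in-memory SQLite database stands in for Postgres.
import sqlite3
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE wallet (owner TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO wallet VALUES (?, ?)", [("user", 500), ("merchant", 0)])
db.commit()
with db:  # BEGIN ... COMMIT
    db.execute("UPDATE wallet SET balance = balance - 500 WHERE owner = 'user'")
    # If an exception were raised here, the whole block would roll back;
    # no reader ever sees the debit without the credit.
    db.execute("UPDATE wallet SET balance = balance + 500 WHERE owner = 'merchant'")
print(db.execute("SELECT owner, balance FROM wallet ORDER BY owner").fetchall())
# [('merchant', 500), ('user', 0)]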
Cross a node boundary and every one of those claims breaks. There is no single WAL; there are N. There is no single fsync; there are N, each on a different disk on a different machine connected by a network that can drop or reorder messages. There is no "the byte flipped"; there is "node A's byte flipped at 02:14:37.182, node B's byte flipped at 02:14:37.214, and in the 32 milliseconds between them a reader on the cluster could observe an inconsistent state". The single-machine intuition — that COMMIT is a moment — does not survive the crossing.
Why this is structural and not an engineering omission: signal propagation (roughly 5 ms per 1000 km in fibre) puts a floor on inter-node latency, and any cross-node protocol pays at least one network round-trip. During that round-trip, something must hold the not-yet-committed state somewhere, and that something can crash. The single-machine equivalent — the byte flip — is atomic only because the disk is local and the CPU controls when the flush happens. Across nodes, no single component controls all the flushes, and any component that tries to (a coordinator) becomes a new failure mode of its own.
The four properties a distributed transaction is asked to deliver, restated for the multi-node case:
Atomicity (the multi-node version). All nodes apply the writes, or no node applies them. Crucially: this must hold even if a node crashes mid-protocol, the network partitions, or the coordinator dies after some nodes have committed and others have not. A "best-effort retry" is not atomicity.
Consistency (invariant preservation across nodes). If the application invariant is "sum of all wallet balances equals sum of all deposits minus withdrawals", that invariant must hold after the transaction commits everywhere. The invariant might span tables on different nodes ("the merchant table and the wallet table balance to zero net"), and the transaction must respect it.
Isolation across concurrent transactions on different nodes. If two transfers — Karan's ₹500 to a merchant and Riya's ₹300 to the same merchant — run concurrently, the result must be equivalent to some serial order. On a single node a row lock plus MVCC handles this. Across nodes you need a global notion of "before" and "after" — which means clocks (TrueTime, hybrid logical clocks) or a single serialisation point (a coordinator).
Durability (per-node). Each node persists its part. This is the easiest property because it is local — but it interacts with atomicity: a node that has durably committed cannot then "un-commit" if another node fails, so once any participant's durability is locked in, the only safe global outcome is commit.
Why the obvious approach (just COMMIT on every node) fails
The naive engineer's first attempt: send a COMMIT to every node, wait for all to ack, declare success. Karan's first attempt at PaySetu was exactly this. It survived for six weeks before the failure mode found him.
The protocol is: client → coordinator → COMMIT to nodes A, B, C in parallel → coordinator collects acks → tells client "done". Failure mode: node B's commit succeeds at 02:14:37.180. Node A's commit succeeds at 02:14:37.182. Node C's machine power-cycles between receiving COMMIT and durably persisting it. A and B are committed; C is not. The coordinator gets two acks and a TCP timeout. What does it tell the client?
Option 1: "success" — but the system is now permanently inconsistent until C is reconciled. The user's debit on A applied; their credit on the merchant on C did not. Money disappeared.
Option 2: "failure" — but A and B are already committed durably. The coordinator can't roll them back; "ROLLBACK after COMMIT" is not a thing in any local database. The coordinator would have to issue compensating writes — re-credit A, re-debit B — and now the protocol is no longer atomic, it's eventually-mostly-consistent-with-best-effort-undo.
Option 3: retry COMMIT to C until it succeeds — but C might be permanently dead. The coordinator now blocks indefinitely, holding locks on A and B. Other transactions touching A or B queue up. The system loses availability for everyone, not just for the failed transaction.
This is the commit dilemma and it is the central problem of distributed transactions. Every solution — 2PC, 3PC, Paxos commit, Spanner's TrueTime commit, Percolator — is a different trade-off across blocking, atomicity, and availability when participants or the coordinator can fail.
# A toy "commit on every node" protocol — and the failure mode it cannot escape.
# Standard library only; no third-party dependencies.
import random, time

class Node:
    def __init__(self, name, fail_rate=0.0):
        self.name = name
        self.committed = False
        self.fail_rate = fail_rate

    def commit(self, txn_id):
        # Simulate: durable persistence can fail (machine dies, fsync errors).
        if random.random() < self.fail_rate:
            raise RuntimeError(f"{self.name}: persistence failed for {txn_id}")
        time.sleep(random.uniform(0.001, 0.005))  # variable network + fsync time
        self.committed = True
        return "ack"

def naive_commit(coordinator_nodes, txn_id):
    """Send COMMIT to every node; declare success iff all ack.
    (Sequential here for simplicity; parallelism does not change the failure mode.)"""
    acks, failures = [], []
    for n in coordinator_nodes:
        try:
            n.commit(txn_id)
            acks.append(n.name)
        except Exception as e:
            failures.append((n.name, str(e)))
    return acks, failures

random.seed(7)
trials = 1000
inconsistent = 0
for _ in range(trials):
    nodes = [Node("A"), Node("B"), Node("C", fail_rate=0.02)]  # C is flaky
    acks, fails = naive_commit(nodes, "txn-42")
    # Inconsistency = some nodes committed, some did not.
    states = [n.committed for n in nodes]
    if any(states) and not all(states):
        inconsistent += 1
print(f"inconsistent outcomes: {inconsistent}/{trials} = {100*inconsistent/trials:.1f}%")
print(f"example state of last trial: {[(n.name, n.committed) for n in nodes]}")
Output:
inconsistent outcomes: 21/1000 = 2.1%
example state of last trial: [('A', True), ('B', True), ('C', False)]
Walkthrough:
- The check `random.random() < self.fail_rate` inside `commit` simulates the real failure mode — node C, due to disk pressure or a kernel panic, fails to persist after receiving the COMMIT. The 2% rate is roughly what PaySetu measured on a noisy hardware tier in their staging cluster.
- `naive_commit` collects acks and failures separately but has no mechanism to do anything about a partial failure. A and B are durably committed; C is not. The function returns; the coordinator now has to decide what to tell the client, and there is no good answer.
- The output `21/1000 = 2.1%` is not a bug; it is the protocol's structural failure rate. In this simulation only C is flaky, so every C failure (expected in 2% of trials) leaves A and B committed and C not: a partial commit. The observed 2.1% is that 2% plus sample noise, and zero of these outcomes are recoverable by the protocol itself. (If all three nodes failed independently at 2%, the some-but-not-all rate would be 1 - 0.98^3 - 0.02^3 ≈ 5.9%.)
Why even retrying does not save you: if you retry the COMMIT to C, you must hold the transaction's locks on A and B in the meantime. If C is permanently down, you hold those locks forever, blocking every other transaction that touches the same rows. If you give up and call the transaction failed, A and B are already durably committed and you cannot roll them back. There is no third option that is both safe and live — this is the practical version of the FLP impossibility result.
What every distributed-transaction protocol must therefore do
Given the impossibility of "just COMMIT everywhere", every real protocol takes one of three shapes. The rest of Part 14 walks through each in turn; here is the framing.
Shape 1 — Two-phase commit (2PC) and its descendants
Add a prepare phase. Before any node commits, the coordinator asks every node "can you commit?". Each node replies YES (durably reserving the right to commit) or NO. Only after every node says YES does the coordinator issue COMMIT. If any node says NO, or doesn't reply, the coordinator issues ABORT.
This solves the partial-commit problem at the cost of blocking. Between PREPARE-OK and COMMIT, every participating node holds locks. If the coordinator dies in that window, the participants are stuck — they have voted YES but don't know whether to commit or abort, and they cannot decide locally without risking divergence. They must wait. Karan's 02:14 incident was exactly this: a coordinator that died at 02:14:33 left three participants in PREPARED state holding a row lock on the merchant's account, blocking 14,000 other transactions for the next 90 seconds until the coordinator's stand-by took over.
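A skeletal sketch of the two phases, in the same Python style as the simulation above (participant names are hypothetical; this illustrates the protocol's shape, not a production implementation). Note what happens when the coordinator dies after collecting the votes: every participant is left in the prepared state, holding its locks, with no way to decide locally.
# Skeletal 2PC: phase 1 collects votes, phase 2 broadcasts the coordinator's decision.
# If the coordinator dies between the two phases, participants are stuck in 'prepared'.
class Participant:
    def __init__(self, name):
        self.name = name
        self.state = "init"  # init -> prepared -> committed / aborted
    def prepare(self):
        self.state = "prepared"  # durably reserve the right to commit; locks held from here
        return "yes"
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants, coordinator_dies_after_votes=False):
    votes = [p.prepare() for p in participants]          # phase 1: collect votes
    if coordinator_dies_after_votes:
        return "no decision", {p.name: p.state for p in participants}
    if all(v == "yes" for v in votes):                   # phase 2: decide and broadcast
        for p in participants:
            p.commit()
    else:
        for p in participants:
            p.abort()
    return "decided", {p.name: p.state for p in participants}

print(two_phase_commit([Participant("wallet"), Participant("ledger"), Participant("audit")]))
print(two_phase_commit([Participant("wallet"), Participant("ledger"), Participant("audit")],
                       coordinator_dies_after_votes=True))
# ('decided', {'wallet': 'committed', 'ledger': 'committed', 'audit': 'committed'})
# ('no decision', {'wallet': 'prepared', 'ledger': 'prepared', 'audit': 'prepared'})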
/wiki/2pc-in-detail-including-failure-modes walks through the protocol and its failure modes in detail.
Shape 2 — Sagas: split into compensable steps
Give up atomicity at the protocol level. Decompose the transaction into a sequence of local transactions, each with a compensating action (a logical undo). Execute forward; if any step fails, run the compensations of the prior steps in reverse order. The system passes through inconsistent intermediate states — each one a committed local state that readers can observe — but the compensation eventually restores the invariant.
Sagas trade atomicity for availability and progress — no participant blocks on a coordinator. They are the dominant pattern for long-running business workflows (booking flows, order fulfilment, multi-step payments) where blocking for tens of seconds is operationally unacceptable. The cost: compensations are application-level, error-prone to write correctly, and visible to readers during the inconsistent window.
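A minimal saga runner in the same style (step and compensation names are hypothetical and mirror a booking flow): execute forward, and on the first failure run the compensations of the completed steps in reverse order. Note that the failed step's own compensation never runs; only completed steps are compensated.
# Minimal saga runner: run steps forward; on failure, compensate completed steps in reverse.
def run_saga(steps):
    done = []  # (name, compensation) for each step that completed
    for name, action, compensation in steps:
        try:
            action()
            done.append((name, compensation))
        except Exception as exc:
            print(f"step '{name}' failed: {exc}; compensating {len(done)} earlier step(s)")
            for prior_name, prior_comp in reversed(done):
                prior_comp()  # logical undo, e.g. release the seat
            return "aborted (compensated)"
    return "completed"

def reserve_seat():  print("seat reserved")
def release_seat():  print("seat released")
def charge_wallet(): raise RuntimeError("insufficient balance")
def refund_wallet(): print("wallet refunded")

print(run_saga([
    ("reserve seat", reserve_seat, release_seat),
    ("charge wallet", charge_wallet, refund_wallet),
]))
# seat reserved
# step 'charge wallet' failed: insufficient balance; compensating 1 earlier step(s)
# seat released
# aborted (compensated)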
/wiki/sagas-forward-and-compensating covers the saga pattern with worked examples.
Shape 3 — Use consensus to commit (Paxos commit, Spanner-style)
Replace the single-coordinator decision with a consensus group. The decision to commit is itself a value agreed by a Paxos / Raft quorum spanning the participants (or a meta-group above them). If the coordinator dies, the consensus group recovers the decision — there is no blocking on a single point. Spanner extends this with TrueTime to bound clock skew, allowing it to assign a global commit timestamp that respects external consistency.
This solves the blocking problem of 2PC at the cost of coordination overhead — every commit pays a Paxos round-trip, which on cross-region traffic is ~80 ms. Spanner was the first commercially deployed system to make this trade-off at scale.
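The property that removes the blocking is easy to isolate in a sketch (this is not real Paxos; it only shows why a commit/abort decision recorded on a majority of decision replicas survives the coordinator's death):
# Not real Paxos; only the property that matters here: once the decision is durably
# recorded on a majority of decision replicas, any stand-by coordinator can recover it,
# so nobody blocks forever on a dead coordinator. Hypothetical sketch.
class DecisionReplica:
    def __init__(self):
        self.decision = None  # None until a coordinator records "commit" or "abort"
    def record(self, decision):
        self.decision = decision

def recover(reachable_replicas):
    # A new coordinator queries a majority; any majority intersects the majority that was
    # written, so a recorded decision is always found. Finding none, it is safe to abort.
    seen = [r.decision for r in reachable_replicas if r.decision is not None]
    return seen[0] if seen else "abort"

replicas = [DecisionReplica() for _ in range(3)]
for r in replicas[:2]:            # the coordinator records "commit" on a majority, then dies
    r.record("commit")
print(recover(replicas[1:]))      # a different majority still recovers "commit"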
/wiki/percolator and /wiki/spanner-and-truetime cover these.
Production stories — what the problem actually looks like at 02:14
PaySetu's locked-merchant incident. Karan's 02:14 page traced to a 2PC coordinator that crashed during the COMMIT phase of a transfer involving the wallet, the ledger, and the audit service. The audit service had voted YES and was holding row locks on the audit-event partition for that merchant. When 14,000 subsequent transactions hit the same partition over the next 90 seconds, every one of them queued. The dashboard showed "all participants healthy, all votes YES" — which is exactly the state right before the coordinator's COMMIT decision, indistinguishable from the state immediately after coordinator death. The runbook fix was "page the coordinator's stand-by, force a takeover, drain prepared state". The right architectural fix (which they shipped six months later) was Paxos-commit for the merchant-facing flow.
RailWala's seat-booking saga. RailWala uses sagas for seat booking: reserve seat → charge wallet → emit ticket → notify SMS gateway. The compensation chain is: refund wallet → release seat → cancel ticket. During Tatkal hour at 10:00 IST, ~80K booking attempts/sec hit the system. About 4% fail at "charge wallet" (insufficient balance), triggering "release seat" compensation. The saga is fast (each step is local) and never blocks, but during the ~200 ms between "reserve seat" and "release seat" the seat shows as unavailable to other users. RailWala accepts this as a tolerable inconsistency window because the alternative — 2PC across seat-inventory and wallet — would have blocked the entire seat-availability service whenever a wallet service hiccupped. They explicitly cite the trade-off in their architecture doc: "we lose 200 ms of seat availability per failed booking; we never lose all seat availability for 90 seconds because a coordinator died".
Spanner's cross-region transactions. Google's Spanner uses Paxos-commit with TrueTime — every commit pays a Paxos round-trip across the participating regions, plus a TrueTime commit-wait pause (waiting out the clock-uncertainty interval) of typically ~7 ms. Single-region writes are ~5 ms; multi-region writes are ~80 ms p99. The design rationale is explicit: paying consensus latency on every commit is preferable to blocking on a coordinator failure, which is operationally worse than slow commits. This is the trade-off in its purest commercial form.
Common confusions
- "Distributed transactions are just normal transactions over a network." No. The local case has a single byte that flips atomically; the distributed case has N independent durability decisions and a window in which they disagree. The protocols that exist are all about hiding or shrinking that window — none of them eliminates it for free.
- "2PC is the same as Paxos." 2PC is a simple coordinator-driven prepare/commit protocol; Paxos is a consensus protocol for replicating a value to a quorum. Paxos-commit uses Paxos to make the commit decision fault-tolerant, but plain 2PC does not. Saying "we use 2PC" tells you nothing about whether the coordinator is itself replicated.
- "Eventually consistent is good enough — sagas are always the answer." Sagas are right for workflows where compensation is feasible (release a seat, refund a wallet) and the inconsistency window is acceptable. They are wrong for invariants where the inconsistent intermediate state is itself catastrophic — e.g., regulated balances, voting, double-spend prevention. The choice is per-invariant, not per-team.
- "Exactly-once delivery makes distributed transactions easy." Exactly-once delivery is impossible at the network layer — what real systems mean by it is "at-least-once delivery + idempotent processing + a transactional commit". The transactional commit is exactly the problem this article is about, so claiming "exactly-once" without naming the commit protocol is hiding the hard part.
- "If all my services are in the same data centre, I don't need a distributed-transaction protocol." You still need one. Same-data-centre RTTs are 0.5 ms instead of 80 ms, but a service can still crash, a TCP connection can still drop, a host can still kernel-panic between fsync and ack. The protocols are the same; the latency budget is more generous.
- "You can just retry until it works." Retrying does not solve the partial-commit problem; it solves the contact problem. If node A is durably committed and node C is permanently down, retrying COMMIT to C forever is not a solution — it is a deadlock disguised as a retry loop.
Going deeper
The FLP impossibility result and what it forbids
Fischer, Lynch, and Paterson proved in 1985 that no deterministic consensus protocol can guarantee both safety and liveness in an asynchronous network with even one faulty process. Distributed commit is a special case of consensus (the value being agreed is "commit" or "abort"), so FLP applies directly. Practically: every distributed-transaction protocol must give up some property — 2PC gives up liveness (blocks on coordinator failure), sagas give up atomicity (allow visible intermediate states), Paxos-commit gives up latency (pays consensus RTT). The impossibility is structural; the trade-off is engineering.
The protocols escape FLP's letter via partial synchrony — they assume eventually-bounded message delays and timeouts, which lets them make progress most of the time. But under genuinely adversarial timing (a network that delays just long enough to keep you uncertain), every protocol either blocks or risks divergence. There is no "smart enough" protocol that escapes this.
Cost models — when does each shape pay
Three rough cost models for the three shapes, on a 3-participant transaction:
| Shape | Latency (best case) | Latency (coordinator fails) | Lock duration |
|---|---|---|---|
| 2PC | 2 RTT (~10 ms intra-region) | 30+ s (until takeover) | from PREPARE to COMMIT (~10 ms healthy, ~30+ s on failure) |
| Saga | per-step local (~3 ms × N steps) | bounded by step timeout (~200 ms) | per-step local lock only |
| Paxos-commit | 2 RTT + Paxos round (~12 ms) | 12 ms (consensus self-heals) | from PREPARE to COMMIT (~12 ms) |
The numbers are illustrative; real systems vary by an order of magnitude depending on RTT and quorum size. The relative ordering is invariant: sagas are fastest end-to-end and lowest-locking but non-atomic; 2PC is simplest and atomic but blocks; Paxos-commit is atomic and non-blocking but pays consensus on every commit. There is no Pareto winner.
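The relative ordering can be sanity-checked with a back-of-envelope model; all constants below are illustrative assumptions, not measurements, and the function names are made up for this sketch.
# Healthy-path latency model for the three shapes (illustrative assumptions only).
def two_pc_ms(rtt):                        return 2 * rtt                # prepare round + commit round
def saga_ms(step_ms, n_steps):             return step_ms * n_steps      # sequential local steps
def paxos_commit_ms(rtt, paxos_round=2.0): return 2 * rtt + paxos_round  # 2PC rounds + consensus round

for rtt in (0.5, 5.0, 40.0):  # same-DC, intra-region, cross-region RTT in ms
    print(f"RTT {rtt:>4} ms: 2PC ~{two_pc_ms(rtt):.0f} ms, "
          f"saga ~{saga_ms(3, 3):.0f} ms, Paxos-commit ~{paxos_commit_ms(rtt):.0f} ms")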
The "exactly-once" lie and where it actually lives
Network-level exactly-once delivery is impossible — this is provable from the two-generals problem. What production systems mean when they advertise "exactly-once" is one of:
- At-least-once + idempotent consumer. The producer retries on failure; the consumer deduplicates by message ID. No transaction needed; the consumer's deduplication table is the substrate.
- At-least-once + transactional commit on consumer side. The consumer reads, processes, and writes the result plus the consumed message ID in a single local transaction. Re-delivery is detected by checking the message ID before processing. Kafka's "exactly-once semantics" is a variant of this: the transactional producer commits the output records and the consumed offsets atomically.
- Distributed transaction across producer and consumer state. The hardest case — and exactly the problem this part of the curriculum is about.
When you read a vendor pitch claiming "exactly-once", grep for which of these three they mean. If they don't say, it's marketing.
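As an illustration of the second pattern, here is a sketch of an idempotent, transactionally-committing consumer, using sqlite3 as the consumer's local store (table names, message IDs, and amounts are hypothetical).
# At-least-once delivery made safe: the result and the consumed message ID commit in ONE
# local transaction, so a redelivered message becomes a no-op. Hypothetical names throughout.
import sqlite3
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO balances VALUES ('merchant', 0)")
db.commit()

def handle(msg_id, amount):
    try:
        with db:  # single local transaction: both rows commit or neither does
            db.execute("INSERT INTO processed VALUES (?)", (msg_id,))
            db.execute("UPDATE balances SET balance = balance + ? WHERE account = 'merchant'", (amount,))
    except sqlite3.IntegrityError:
        pass      # duplicate msg_id: this delivery was already processed; skip the credit

handle("transfer-42", 500)
handle("transfer-42", 500)   # redelivery: deduplicated, balance unchanged
print(db.execute("SELECT balance FROM balances WHERE account = 'merchant'").fetchone())  # (500,)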
Reproduce the partial-commit failure on your laptop
# Reproduce: the naive commit protocol's partial-failure rate.
# Standard library only; no third-party dependencies.
import random

class Node:
    def __init__(self, name, fail_rate=0.0):
        self.name = name
        self.committed = False
        self.fail_rate = fail_rate

    def commit(self):
        if random.random() < self.fail_rate:
            raise RuntimeError("persistence failed")
        self.committed = True

random.seed(0)
N_TRIALS = 10_000
partial = 0
for _ in range(N_TRIALS):
    nodes = [Node("A", 0.01), Node("B", 0.01), Node("C", 0.02)]
    for n in nodes:
        try:
            n.commit()
        except RuntimeError:
            pass
    states = [n.committed for n in nodes]
    if any(states) and not all(states):
        partial += 1
print(f"partial-commit rate: {partial}/{N_TRIALS} = {100*partial/N_TRIALS:.2f}%")
print("(every one of these is an unrecoverable consistency violation)")
Output:
partial-commit rate: 380/10000 = 3.80%
(every one of these is an unrecoverable consistency violation)
Why 3.8% and not 4%: with per-node failure rates of 1%, 1%, and 2%, the probability of "any failure" is 1 - 0.99 × 0.99 × 0.98 ≈ 3.95%. Of those, the rate of all-fail is 0.01 × 0.01 × 0.02 = 0.0002%, negligible. So nearly all failures are partial — exactly the case the naive protocol cannot recover from. The simulation matches the analytical expectation; the 0.15% gap is sample noise at 10,000 trials.
# Reproduce this on your laptop (save the script above as partial_commit_sim.py)
python3 -m venv .venv && source .venv/bin/activate   # optional: the script uses only the standard library
python3 partial_commit_sim.py
Where this leads next
The next chapter, /wiki/2pc-in-detail-including-failure-modes, takes the first protocol shape — two-phase commit — and walks through its prepare, commit, and abort phases, the failure modes that prove "blocking on coordinator failure" is not a corner case but the typical failure, and what production systems do to live with it (timeouts, presumed abort, presumed commit, Paxos-replicated coordinators).
After that, /wiki/3pc-and-why-it-doesnt-help shows why the obvious "just add another phase" fix fails to deliver on its promise. Then /wiki/sagas-forward-and-compensating takes the saga approach in depth. /wiki/percolator covers Google's Percolator design — a transactional layer over Bigtable using snapshot isolation and a notification primitive — which is one of the most influential real-world implementations.
By the end of Part 14 you will recognise every "transactional" or "exactly-once" claim a vendor makes as one of these three shapes plus a name change. That is the goal — not memorising the protocols, but recognising the shape.
References
- Gray & Reuter, Transaction Processing: Concepts and Techniques (Morgan Kaufmann 1992) — the canonical reference for all transactional protocols, including the original 2PC analysis.
- Fischer, Lynch, Paterson, "Impossibility of Distributed Consensus with One Faulty Process" (JACM 1985) — the FLP result that bounds what any commit protocol can achieve.
- Skeen & Stonebraker, "A Formal Model of Crash Recovery in a Distributed System" (IEEE TSE 1983) — the original 2PC paper formalising the protocol.
- Gray & Lamport, "Consensus on Transaction Commit" (ACM TODS 2006) — the Paxos-commit paper, showing how to make commit fault-tolerant via consensus.
- Corbett et al., "Spanner: Google's Globally-Distributed Database" (OSDI 2012) — the production system that ships Paxos-commit + TrueTime at planet scale.
- Garcia-Molina & Salem, "Sagas" (SIGMOD 1987) — the original saga paper, framing the long-running-transaction trade-off.
- Kleppmann, Designing Data-Intensive Applications (O'Reilly 2017), chapter 9 — the most accessible modern treatment of distributed transactions and consensus.
- /wiki/wall-crdts-dont-solve-transactions — the prior chapter, on why CRDTs are not an alternative to distributed transactions.