Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: atomic broadcast needs ordering
PaySetu's payments-ledger team spent six months building distributed transactions. They shipped 2PC for the cross-region transfer path, sagas for the long-running disbursement workflow, Calvin-style deterministic ordering for the high-contention hot-row case, and a Spanner-derivative for the global SQL surface. Four protocols, four codebases, four sets of failure modes. In the post-mortem, an engineer drew a single diagram on the whiteboard — every protocol, stripped to its core, was the same shape: agree on the next operation, replicate it everywhere, apply it in that order. The protocols differed only in how they agreed and what counted as an operation. Underneath all of them sat one primitive: atomic broadcast. And the part of atomic broadcast that nobody could engineer away was the ordering.
This is the wall that closes Part 14. The previous seven chapters showed protocols. This chapter is the reason every one of those protocols looked, at a deep level, like the same idea. If you finish Part 14 with one phrase in your head, make it this: distributed transactions are atomic broadcast in disguise, and atomic broadcast is consensus in disguise. Each disguise hides different costs, but no disguise hides the ordering work itself.
Atomic broadcast is the distributed-systems primitive that delivers each message to every replica in the same order. Once you have it, state-machine replication is trivial: feed the ordered messages into a deterministic state machine, get identical replicas. Once you have it across shards, distributed transactions are trivial: order the transactions, apply them, done. The hard part — provably equivalent in difficulty to consensus — is the ordering. Every protocol in Part 14 (2PC, sagas, Percolator, Calvin, Spanner) is a different way of paying for that ordering: 2PC pays per-transaction, Calvin pays once at the front, Spanner pays via a clock. None of them skip the ordering.
Why the five protocols collapse into one
Walk through what each protocol in the previous chapters actually does. 2PC acquires locks on every shard, votes, then commits — but the commit decision is itself a record that every shard must apply, and the global serial order is the order of those commit decisions. Sagas execute a sequence of local transactions in a fixed order with compensating actions on failure — the order is explicit in the saga definition. Percolator uses a centralised TSO to assign monotonically-increasing commit timestamps — the order is the timestamp order. Calvin runs a sequencer that decides the global order before any shard touches the data — the order is decided at the front. Spanner uses TrueTime + commit-wait so that timestamp order matches real-time order — the order is the TrueTime-derived timestamp.
Five protocols. Five flavours of one thing: agree on a total order of operations and replicate that order. Why this collapse is forced and not a coincidence: a distributed transaction by definition mutates state on multiple replicas. For the post-transaction state to be unambiguous, every replica must end in the same state. For every replica to end in the same state from a deterministic state machine, every replica must apply the same operations in the same order. That requirement is independent of the protocol — it is a consequence of "the system has more than one replica and clients see one logical state". Atomic broadcast is the name of that requirement. Every protocol is just a different way of implementing it.
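To make the shared shape concrete, here is a minimal Python sketch. The OrderingLayer and CentralSequencer names, and the replica enqueue API, are illustrative inventions rather than anything from the five codebases; the point is that only assign_position differs between the protocols, while the replicate-then-apply loop is the same everywhere.
# Illustrative sketch of the shape shared by all five protocols. Only
# assign_position() differs between them; the replicate-then-apply loop does not.
from abc import ABC, abstractmethod

class OrderingLayer(ABC):
    """Stand-in for 2PC commit records, Percolator's TSO, Calvin's sequencer,
    or Spanner's TrueTime-derived timestamps."""
    @abstractmethod
    def assign_position(self, op: dict) -> int:
        """Agree on where op sits in the global order (the consensus-hard step)."""

class CentralSequencer(OrderingLayer):
    """One flavour: a single box hands out positions (TSO / Calvin-sequencer style)."""
    def __init__(self):
        self._next = 0
    def assign_position(self, op: dict) -> int:
        self._next += 1
        return self._next

def submit(ordering: OrderingLayer, replicas: list, op: dict) -> None:
    pos = ordering.assign_position(op)   # 1. agree on the operation's place in the order
    for r in replicas:                   # 2. replicate the ordered operation everywhere
        r.enqueue(pos, op)               # 3. each replica applies in position order (hypothetical replica API)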
What atomic broadcast actually requires
Atomic broadcast — sometimes called total-order broadcast — is a primitive with four properties:
- Validity: if a correct process broadcasts message m, every correct process eventually delivers m.
- Uniform agreement: if any process (correct or faulty) delivers m, every correct process eventually delivers m.
- Integrity: each m is delivered at most once, and only if it was previously broadcast.
- Total order: if processes p and q both deliver m1 and m2, they deliver them in the same order.
The first three are reliable broadcast — easy in a partial-synchrony model with retries and message IDs. The fourth one — total order — is what makes atomic broadcast hard. And total-order broadcast is provably equivalent to consensus: you can build either from the other in finite steps. This was shown by Chandra & Toueg in their 1996 paper Unreliable Failure Detectors for Reliable Distributed Systems and is the reason every atomic-broadcast implementation in production — Raft, Paxos, Zab, Viewstamped Replication — is a consensus protocol underneath.
Why uniform agreement (delivery to all correct processes if any process — including a faulty one — delivered) is the strict, non-negotiable form: a weaker variant called regular agreement would let a faulty process deliver m and crash before the message reaches the rest, leaving the correct processes ignorant. That is fatal for state-machine replication, because the application state at the crashed process — say, a balance debit visible to a client — is now in storage somewhere with no way to make the rest of the cluster agree on it. Uniform agreement closes this hole: once anyone delivers, the cluster is collectively committed to delivering. The price is that uniform agreement forces a quorum-write before any single process delivers, which is one of the two reasons consensus protocols pay a quorum-RTT per decision.
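A one-function sketch of that constraint, assuming a crash-stop model and a simple majority quorum:
# Sketch: uniform agreement means a process may deliver m only after a majority
# holds m durably; otherwise a lone deliverer could expose state to a client and crash.
def may_deliver(ack_count: int, cluster_size: int) -> bool:
    majority = cluster_size // 2 + 1
    return ack_count >= majority

# In a 5-node cluster, 3 durable copies are needed before anyone delivers.
assert may_deliver(3, 5) and not may_deliver(2, 5)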
Why total order is the consensus-hard bit: deciding that m1 is delivered before m2 at every replica is itself a distributed agreement problem. Any two processes that disagree about the order produce diverging state machines. Reaching that agreement under failures (a process crashes, the network partitions, a message is delayed) requires a quorum-based decision — which is consensus. The other three properties can be added on top with retries and dedup tables; the ordering cannot.
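That claim can be checked mechanically. Below is a small sketch of a total-order checker over per-replica delivery logs; the log format (replica name mapped to message ids in delivery order) is an assumption for the example.
# Sketch of a total-order check over per-replica delivery logs. The log format
# (replica name -> message ids in delivery order) is assumed for the example.
from itertools import combinations
from typing import Dict, List

def violates_total_order(logs: Dict[str, List[str]]) -> bool:
    """True if any two replicas delivered some pair of messages in opposite orders."""
    position = {name: {m: i for i, m in enumerate(log)} for name, log in logs.items()}
    for (_, pa), (_, pb) in combinations(position.items(), 2):
        for m1, m2 in combinations(pa.keys() & pb.keys(), 2):
            if (pa[m1] < pa[m2]) != (pb[m1] < pb[m2]):
                return True        # the two replicas' state machines will diverge
    return False

# r0 and r1 swapped op-2 and op-3: retries and dedup tables cannot repair this.
print(violates_total_order({"r0": ["op-1", "op-2", "op-3"],
                            "r1": ["op-1", "op-3", "op-2"]}))   # True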
You may have noticed this is the same wall the FLP impossibility result hits. FLP says no deterministic protocol can guarantee consensus in an asynchronous network with even one faulty process. Since atomic broadcast is consensus, FLP applies to atomic broadcast. The way out — for both — is the same: assume partial synchrony, use timeouts, use leader leases, accept that under sufficiently bad network conditions the system will pause rather than diverge. There is no protocol that gets total-order broadcast for free.
A walkthrough — building state-machine replication on top of atomic broadcast
Once atomic broadcast exists, replication is mechanical. Every replica is a deterministic state machine. The atomic-broadcast layer delivers messages in a globally-agreed order. Each replica feeds those messages into its state machine and gets identical state. This is state-machine replication (SMR), Lamport's 1984 framing of replicated services.
The "deterministic" part is doing more work than it looks. A state machine that calls time.time() is non-deterministic across replicas — different wall-clocks produce different transitions. A state machine that calls random.random() is non-deterministic. A state machine that iterates over a Python dict produces different orders across versions. A state machine that does a floating-point summation across a hash-map is non-deterministic on different CPU architectures. Production SMR systems work hard to keep state transitions purely a function of the input message: timestamps are baked into the message before broadcast, randomness comes from a seed in the message, iteration orders are sorted, monetary arithmetic is integer-only. Every leak from "purely a function of input" produces divergent replicas — the bug is silent, the divergence is permanent, and the recovery path is "reseed from a snapshot and replay the log".
Here is a minimal SMR ledger built on a stub atomic-broadcast layer. The atomic broadcast is simulated; the point is to show how trivial the SMR layer becomes once ordering is provided.
# State-machine replication on top of atomic broadcast.
# The atomic_broadcast() function is a stub that delivers messages in
# a globally-agreed order to every replica. In production this is Raft / Paxos / Zab.
import threading, queue, hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
@dataclass
class Replica:
name: str
balances: Dict[str, int] = field(default_factory=dict)
applied: List[str] = field(default_factory=list)
inbox: "queue.Queue[Tuple[int, dict]]" = field(default_factory=queue.Queue)
def apply(self, op):
if op["type"] == "credit":
self.balances[op["acct"]] = self.balances.get(op["acct"], 0) + op["amt"]
elif op["type"] == "debit":
cur = self.balances.get(op["acct"], 0)
if cur < op["amt"]:
self.applied.append(f"reject:{op['id']}"); return
self.balances[op["acct"]] = cur - op["amt"]
self.applied.append(op["id"])
def run(self):
while True:
seq, op = self.inbox.get()
if op is None: return # shutdown
self.apply(op)
class AtomicBroadcast:
"""Single global sequencer — stand-in for Raft/Paxos. The whole point is
that every replica gets the same (seq, op) sequence."""
def __init__(self, replicas):
self.replicas = replicas
self.seq = 0
self.lock = threading.Lock()
def broadcast(self, op):
with self.lock:
self.seq += 1
for r in self.replicas:
r.inbox.put((self.seq, op)) # totally-ordered delivery
# DEMO — three replicas of PaySetu's micro-ledger
replicas = [Replica(f"r{i}") for i in range(3)]
threads = [threading.Thread(target=r.run, daemon=True) for r in replicas]
for t in threads: t.start()
ab = AtomicBroadcast(replicas)
ab.broadcast({"id": "op-1", "type": "credit", "acct": "alice", "amt": 1000})
ab.broadcast({"id": "op-2", "type": "credit", "acct": "bob", "amt": 500})
ab.broadcast({"id": "op-3", "type": "debit", "acct": "alice", "amt": 300})
ab.broadcast({"id": "op-4", "type": "debit", "acct": "bob", "amt": 900}) # rejects
import time; time.sleep(0.1)
for r in replicas:
digest = hashlib.sha256(repr(sorted(r.balances.items())).encode()).hexdigest()[:12]
print(f"{r.name}: bal={r.balances} applied={r.applied} digest={digest}")
Realistic output:
r0: bal={'alice': 700, 'bob': 500} applied=['op-1', 'op-2', 'op-3', 'reject:op-4'] digest=4e2c8f1d9a07
r1: bal={'alice': 700, 'bob': 500} applied=['op-1', 'op-2', 'op-3', 'reject:op-4'] digest=4e2c8f1d9a07
r2: bal={'alice': 700, 'bob': 500} applied=['op-1', 'op-2', 'op-3', 'reject:op-4'] digest=4e2c8f1d9a07
Walk through the load-bearing pieces. AtomicBroadcast.broadcast is the only synchronisation in the entire program. It assigns a sequence number under a lock and pushes the same (seq, op) to every replica's inbox. In production this lock is replaced by a Raft round (or Paxos, or Zab); the contract — every replica receives the same sequence — is preserved. Replica.apply is purely local. It does not coordinate, retry, or check anything across replicas; it just runs the deterministic transition. reject:op-4 is the deterministic rejection of an over-debit — every replica reaches the same conclusion because they all see bob's balance at the same point in the sequence. The matching digests at the end are the property atomic broadcast bought us: identical state across replicas, with no replica-to-replica chatter at all. Replace the stub with a real Raft cluster and the SMR layer is unchanged.
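If you do swap in a real cluster, the digest line is worth keeping. Here is a sketch of how it could serve as a divergence alarm; the mechanism that collects digests from replicas at a common log index is assumed, not shown.
# Sketch of a divergence check between replicas that have applied the same log prefix.
# A mismatch means a determinism leak; recovery is "reseed from snapshot, replay the log".
import hashlib
from typing import Dict, List

def state_digest(balances: Dict[str, int], applied_upto: int) -> str:
    canonical = repr((applied_upto, sorted(balances.items())))   # iteration-order independent
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def diverged_replicas(digests: Dict[str, str]) -> List[str]:
    """digests: replica name -> digest reported at the same applied log index."""
    reference = next(iter(digests.values()))
    return [name for name, d in digests.items() if d != reference]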
Where this leaves the reader of Part 14
Part 14 began with the problem statement — distributed transactions are hard. It walked through 2PC, 3PC, sagas, Percolator, Calvin, and Spanner. Each protocol added a different cost shape to the same fundamental requirement. The wall is this: you cannot have correct distributed transactions without an atomic-broadcast equivalent somewhere in the stack. The only question is where you pay for it.
- 2PC pays inside each cross-shard transaction (per-transaction quorum).
- Sagas pay inside the application layer (the saga's order is the broadcast).
- Percolator pays at the central TSO (every transaction touches one box).
- Calvin pays at the front (the sequencer is one big atomic broadcast).
- Spanner pays via TrueTime (commit-wait + Paxos = atomic broadcast over time).
A protocol designer's job is to pick the cost shape that matches the workload. KapitalKite's stockbroker workload — high contention, short transactions, single region — chose Calvin-style sequencing because the front-loaded ordering cost was the cheapest per-transaction. CricStream's video-metadata workload — low contention, large blobs, multi-region — chose Spanner-style commit-wait because the per-transaction cost dominated and TrueTime kept it small. PaySetu's mixed workload ended up with a hybrid: Calvin for the hot-row settlement path, Spanner-derivative for the global ledger queries. Both layers used Paxos as the underlying atomic-broadcast primitive. The ordering, however it was paid for, was non-negotiable.
The reason this matters for an engineer reading Part 14 is that "which transaction protocol should I use" is the wrong question to bring to a design review. The right question is "which cost shape do I want to pay for the ordering?" The cost is real, the cost is roughly fixed, and the only thing under your control is where it shows up — in the client's tail latency, in a central sequencer's CPU, in the application's complexity, in the clock infrastructure's operational burden. Phrase the question that way and the protocol falls out of the workload's contention pattern, blast radius, and SLA. Phrase it the other way — picking 2PC because "we know 2PC" — and you ship a system whose latency distribution is a surprise.
Common confusions
- "Atomic broadcast is the same as reliable broadcast." It is not. Reliable broadcast guarantees every correct process gets every message; atomic broadcast adds the requirement that they get them in the same order. The order is the consensus-hard bit. A multicast over UDP with retries gets you reliable broadcast; you need a Raft/Paxos/Zab layer underneath to upgrade it to atomic broadcast.
- "If I have eventual consistency I do not need atomic broadcast." Correct — and that is exactly the trade-off eventual-consistency makes. You skip total-order broadcast and accept that replicas temporarily diverge. But for distributed transactions — operations that mutate multiple shards atomically — you cannot skip ordering and still call the result a transaction.
- "Kafka gives me atomic broadcast for free." Kafka gives you total order per partition, which is atomic broadcast as long as your transaction lives entirely within one partition. Cross-partition ordering is your problem; this is why Kafka transactions are limited and why Kafka Streams ships its own state-store-per-partition model.
- "Consensus and atomic broadcast are different problems." They are formally equivalent — Chandra & Toueg's 1996 reduction shows you can build either from the other. In practice every atomic-broadcast implementation is a consensus protocol underneath because it is the simplest known way to implement total order under failures.
- "Sagas avoid atomic broadcast because they have no global lock." Sagas avoid the implementation of atomic broadcast at the database layer — but the saga definition itself is the atomic broadcast. The order of saga steps is total, deterministic, and applied identically by every executor. The ordering cost has been moved to the application designer's head, not removed.
- "Spanner avoids atomic broadcast because TrueTime gives it order." TrueTime is the clock used to derive timestamps; the atomic broadcast still happens — it is the Paxos round inside each shard's group, plus the 2PC layer above. TrueTime makes commit-wait short; it does not eliminate the ordering work.
Going deeper
Lamport's state-machine replication paper
Lamport's 1984 paper Using Time Instead of Timeout for Fault-Tolerant Distributed Systems introduces state-machine replication and proves that atomic broadcast is sufficient for replicating any deterministic service. The construction: feed the broadcast output as input to identical state machines, get identical replicas. Forty years later this remains the canonical framing. Every distributed-systems lecture course re-derives it; every production replication system implements it under a different name (replicated WAL, change-data-capture, log shipping). The paper is unusually short — six pages — and worth reading once you have the consensus background, because the proof reduces to "deterministic transitions on identical inputs produce identical outputs", which is so obvious it hides the work being done by the broadcast layer.
Chandra-Toueg and the equivalence proof
Chandra & Toueg's Unreliable Failure Detectors for Reliable Distributed Systems (JACM 1996) proves the formal equivalence between consensus and atomic broadcast in asynchronous systems augmented with failure detectors. The reduction in one direction — atomic broadcast from consensus — is a sequence of single-value consensus instances, each deciding the next message in the order. The reduction in the other direction — consensus from atomic broadcast — is "broadcast your value, take the first delivered value as the decision". The proof is short; the consequence is huge. Every paper that claims to give you a "weaker than consensus" replication primitive is implicitly claiming to weaken the ordering, and it pays for that with weaker semantics (eventual consistency, conflict resolution, application-level reconciliation).
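The shape of both reductions fits in a few lines. In the sketch below, consensus() and abcast are assumed black boxes, so this is the outline of the argument rather than a working protocol.
# Sketch of the two reductions; consensus() and abcast are assumed black boxes.
def atomic_broadcast_from_consensus(consensus, proposals, deliver):
    """Direction 1: a sequence of single-value consensus instances,
    instance i deciding the i-th batch of messages in the total order."""
    for instance, proposal in enumerate(proposals):
        decided_batch = consensus(instance, proposal)   # the only agreement step
        for m in decided_batch:
            deliver(m)               # every correct process delivers in the same order

def consensus_from_atomic_broadcast(abcast, my_value):
    """Direction 2: broadcast your value, then decide on the first value delivered."""
    abcast.broadcast(my_value)
    return abcast.next_delivered()   # total order: everyone sees the same first value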
Zab — atomic broadcast in production
Zookeeper's atomic-broadcast protocol, Zab, is the textbook example of a production atomic-broadcast layer optimised for high-throughput state-machine replication. Zab differs from Raft in two practical ways: it preserves primary order (the order the leader proposed messages, not the order it received client requests), and it has an explicit recovery phase that synchronises a new leader with the cluster's log before serving writes. The recovery phase is what allows Zab to give primary-order atomic broadcast — a strictly stronger guarantee than total-order broadcast. Most users of Zookeeper (Kafka pre-KRaft, HBase, Solr, BharatBazaar's coordination layer) consume Zab indirectly through the Zookeeper API; the property they rely on is that watches and ephemeral nodes are ordered consistently with writes, which falls out of primary-order broadcast.
The Raft log is atomic broadcast — a concrete mapping
It helps to make the abstraction concrete against a system you have actually run. A 5-node Raft cluster running etcd is an atomic-broadcast implementation. The broadcast m is a key-value write. Total order is the Raft log index — every replica applies entries in log-index order. Validity holds because the leader replicates committed entries to all followers, retrying until acked. Uniform agreement holds because Raft requires a quorum-write before considering an entry committed, and the quorum overlap argument prevents a committed entry from being lost on leader change. Integrity holds because each entry is uniquely identified by (term, index). The four atomic-broadcast properties are the four Raft safety properties under different names. RailWala's seat-inventory cluster — a 5-node etcd deployment fronting their booking write path — is, mechanically, an atomic-broadcast service: every booking is a broadcast message, every node delivers in log-index order, every node's local state machine ends in the same place. The fact that the team thinks of it as "etcd transactions" rather than "atomic broadcast" is a vocabulary gap, not a structural one.
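A sketch of the consumer side of that mapping (not etcd's actual code): committed entries are applied strictly in log-index order, which is the "deliver in total order" step of atomic broadcast.
# Sketch, not etcd's code: applying committed Raft entries in log-index order is the
# "deliver" step of atomic broadcast; log-index order is the total order.
from typing import Dict, List

def apply_committed(state: Dict[str, str], log: List[dict],
                    last_applied: int, commit_index: int) -> int:
    for index in range(last_applied + 1, commit_index + 1):
        entry = log[index]                        # uniquely identified by (term, index)
        state[entry["key"]] = entry["value"]      # deterministic transition per entry
    return commit_index                           # the new last_applied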
The cost shape PaySetu actually paid
PaySetu's settlement engine ran a stress test with all five Part-14 protocols against the same 50-shard ledger workload, 200K transactions. The total CPU-seconds spent on ordering was within 12% across all five protocols — the cost was real and roughly equal. The differences were in latency distribution. Calvin's p50 was 4 ms, p99 was 6 ms (the sequencer is fast and uniform). 2PC's p50 was 8 ms, p99 was 95 ms (cross-region prepare RTT dominates the tail). Spanner-style p50 was 11 ms (commit-wait + Paxos), p99 was 18 ms (commit-wait makes the tail tight). The team picked Calvin for the hot-row path because of the tight p99. The same workload, the same total ordering cost, very different operational characteristics — that is the lesson Part 14 leaves the reader with. The protocol choice is a latency-distribution choice, not a cost-elimination choice.
What an "atomic broadcast bypass" actually buys you
Some replication systems claim to skip atomic broadcast — Dynamo-style sloppy quorums, gossip-based eventually-consistent stores, CRDT-replicated databases. They are not lying, but the claim is narrower than it sounds. What they skip is the strong-ordering form of atomic broadcast; what they keep is reliable broadcast plus a per-key conflict-resolution rule (last-writer-wins, vector-clock merge, CRDT join). For workloads where every key is independently mutable and the application can tolerate temporary divergence — shopping-cart adds, like-counts, presence — this works and is dramatically cheaper than consensus. For workloads where two operations interact across keys — "transfer ₹500 from account A to account B if A has ₹500" — the divergence becomes a correctness violation and the bypass falls apart. The protocol is not the bug; the workload chose the wrong replication primitive. This is the cleanest way to describe what eventual consistency is: it is the explicit decision to skip total-order broadcast, accept divergence, and push the resolution into the application or the data type. It is a perfectly valid choice for the right workload — but it is a choice, not a free upgrade.
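A sketch of both sides of that trade, using last-writer-wins as the per-key rule; the accounts and amounts are made up.
# Sketch: per-key last-writer-wins merge is fine for independent keys, but it cannot
# express the cross-key invariant "debit A only if A has the funds".
def lww_merge(a, b):
    """Each value is (timestamp, amount); the later write wins deterministically."""
    return a if a[0] >= b[0] else b

# Two partitioned replicas each saw A = 500 and independently approved a 500 transfer.
replica_1 = {"A": (2, 0), "B": (2, 500)}   # debited A, credited B
replica_2 = {"A": (3, 0), "C": (3, 500)}   # debited A, credited C

merged = {k: lww_merge(replica_1.get(k, (0, 0)), replica_2.get(k, (0, 0)))
          for k in replica_1.keys() | replica_2.keys()}
print(merged)   # A ends at 0 while B and C are both credited: 500 appeared from nowhere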
Where this leads next
Atomic broadcast is the upper bound on replication strength: every replica sees every operation in the same order, applies it deterministically, and produces identical state. It is also the upper bound on replication cost — you pay quorum-RTT-per-decision unless you amortise via leader leases or sequencers. Part 15 picks up the thread by treating the broadcast itself as the product — the atomic-broadcast log becomes a streaming substrate that downstream consumers replay at their own pace.
The next chapter, /wiki/queues-vs-streams-the-fundamental-split, reframes the topic: where Part 14 saw atomic broadcast as a means to replicated state, Part 15 sees it as an end in itself. Kafka, Pulsar, Redpanda, and NATS all expose the broadcast log as the primary API. The state machine — if there is one — is the consumer's problem.
If you want to come back to the transactions thread, the practical chapters are /wiki/2pc-in-detail-including-failure-modes, /wiki/calvin-deterministic-concurrency, and /wiki/spanner-style-txns-with-truetime. The theoretical chapters that anchor this wall are /wiki/flp-impossibility and /wiki/causal-consistency.
References
- Lamport, Using Time Instead of Timeout for Fault-Tolerant Distributed Systems, TOPLAS 1984 — the SMR paper.
- Chandra & Toueg, Unreliable Failure Detectors for Reliable Distributed Systems, JACM 1996 — the consensus / atomic-broadcast equivalence.
- Junqueira, Reed & Serafini, Zab: High-performance broadcast for primary-backup systems, DSN 2011 — Zookeeper's atomic-broadcast protocol.
- Schneider, Implementing fault-tolerant services using the state machine approach, ACM Computing Surveys 1990 — the canonical SMR survey.
- Cachin, Guerraoui & Rodrigues, Introduction to Reliable and Secure Distributed Programming (2011) — the textbook on broadcast hierarchies.
- /wiki/flp-impossibility — why total-order broadcast cannot be solved deterministically in pure asynchrony.
- /wiki/spanner-style-txns-with-truetime — the previous chapter; one specific implementation of the ordering primitive this article generalises.
- /wiki/calvin-deterministic-concurrency — the front-loaded sequencer alternative.
- /wiki/2pc-in-detail-including-failure-modes — the per-transaction quorum approach.
- /wiki/eventual-consistency — the explicit decision to skip total-order broadcast and accept divergence.
- /wiki/causal-consistency — a strictly weaker ordering primitive that some workloads can use instead of total order.