In short
Build 12 gave you sharding. Throughput scales linearly with the number of shards, single-shard queries stay fast when the shard key covers them, and most of your traffic never crosses a shard boundary. For a while everything looks like a clean win.
Then real applications happen. Money transfers cross shards. An order touches inventory on one shard and payment on another. A workflow updates three services in three databases. A social-graph follow writes both follower → followee and followee → follower. Each of these is a distributed transaction whether you call it that or not, and chapter 97 already showed you the four ways the cluster can corrupt itself when local ACID does not compose.
You have four answers and three of them are inadequate. Two-phase commit with a single coordinator works in the happy path and freezes the cluster when the coordinator crashes between deciding and announcing. Sagas never block but expose partial states that a banking app cannot legally show. Avoiding cross-shard operations is theoretically clean and practically impossible the moment a CFO asks for an atomic transfer. The fourth answer — consensus-based atomic commit — is the one production NewSQL systems (Spanner, CockroachDB, YugabyteDB, TiDB) actually use. The coordinator itself is replicated by Raft or Paxos so that no single failure can strand the protocol.
That is what the rest of this book is about. Build 13 teaches consensus — Raft and Paxos, the protocols that let a group of nodes agree on a sequence of operations despite failures. Build 14 applies consensus to atomic commit and gets you transparent ACID across shards, the same SQL you wrote on a single Postgres in chapter 1, scaled across an entire cluster. This chapter is the wall — the moment when sharding stops being enough and the path forward narrows to one direction.
The CFO walks into your sprint review uninvited. She is holding a printout. "We are losing money. Customers are filing chargebacks because their multi-shard transfers debit the source account but the destination credit never lands. The reconciliation team is paying refunds out of operating margin. I want this fixed this quarter, and the engineering blog post about how we sharded last year is no longer a good answer. What is the plan?"
You sharded last year. The launch was clean. P99 latencies dropped. The product team shipped three features that would have been impossible on a single Postgres. Single-customer reads — SELECT * FROM accounts WHERE user_id = ? — are blisteringly fast because the shard key covers them. Single-customer writes are the same.
The transfer endpoint is what broke. The application code is the same code you wrote when there was one Postgres:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 'alice';
UPDATE accounts SET balance = balance + 100 WHERE user_id = 'bob';
COMMIT;
On the sharded cluster Alice's row lives on shard 3 and Bob's row lives on shard 7. The proxy splits the transaction into two independent local transactions. Most days both succeed. A handful of days a year — during a shard 7 failover, during a network partition, during a rolling restart — only the first one commits. Money disappears. Chapter 97 named this; the CFO is now asking what you are going to do about it.
This chapter argues that you have already used up your good options. The path forward is consensus, which is what Build 13 and Build 14 teach. Everything else is a stalling tactic.
Why sharding is not enough
Sharding solves throughput. It does not solve atomicity, isolation, global indexing, or referential integrity across shards. None of those came along for free.
What sharding gives you:
- Horizontal write throughput proportional to shard count, as long as the workload distributes across shards.
- Fast single-shard queries when the shard key is in the predicate.
- Storage scaling by adding more nodes rather than buying a bigger box.
- Operational isolation — a slow query on one shard does not slow the others.
What sharding does not give you, by default:
- Atomic commit across shards. Two independent local transactions are not the same thing as one distributed transaction. This is the entire subject of chapter 97.
- Cross-shard isolation. A read that touches multiple shards has no consistent snapshot — by the time you finish reading shard 3, shard 7 has moved on.
- Read-your-own-writes when the read and the write hit different shards. Your write to shard 3 has nothing to say about whether shard 7 is up to date with whatever you wrote there earlier.
- Foreign-key enforcement across shards. The database engine enforces FKs within one node; once orders is on shard A and users is on shard B, your ON DELETE CASCADE is application code or it does not exist.
- UNIQUE constraints across shards. A username is unique only to the shard that holds it; the same username can be inserted on two different shards before either learns about the other.
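The missing-snapshot problem is easy to see in a toy sketch — two dicts standing in for shards, a transfer that commits on each independently, and a reader that scans them one at a time. All names here are illustrative.

```python
# Each "shard" commits on its own; a reader scanning them in sequence can
# observe a total that never existed at any single point in time.

shard3 = {"alice": 100}
shard7 = {"bob": 100}

def transfer(amount):
    # Two independent local writes -- no shared snapshot, no coordination.
    shard3["alice"] -= amount
    shard7["bob"] += amount

seen_alice = shard3["alice"]   # reader scans shard 3: sees 100
transfer(30)                   # a transfer lands between the two reads
seen_bob = shard7["bob"]       # reader scans shard 7: sees 130

print(seen_alice + seen_bob)   # 230 -- money apparently created
```

The actual total is still 200; only the observation is wrong, which is exactly the read-skew a cross-shard snapshot would prevent.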
Why none of this composes from local ACID: each shard is its own isolated transaction system with its own commit decision, its own lock manager, its own MVCC snapshot. There is no shared clock, no shared lock space, no shared transaction id. The pieces are independently correct and collectively unaware of each other. Composition would require a layer above them that coordinates — and that coordinator is the thing chapter 97 already showed you cannot make non-blocking without consensus.
You can paper over individual gaps. Push uniqueness into a separate locking service. Push cross-shard reads into a stale snapshot service. Push referential integrity into application code. Each patch works for a narrow workload. None of them gives you back the property the CFO is asking for, which is "money does not disappear during a transfer".
The three inadequate answers
Before reaching for the answer that actually works, walk through the three that don't, because every team tries them first.
Answer 1 — Two-phase commit with a single coordinator
Pick a coordinator. The coordinator runs the transaction. It sends PREPARE to every shard the transaction touches. Every shard locks the rows, writes a prepare record to its WAL, and votes YES or NO. If all vote YES, the coordinator writes its own COMMIT record and tells every shard to commit. If any vote NO, it tells every shard to abort. Chapter 97 walked through this in detail.
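The steps above can be sketched in a few lines. This is a toy happy-path model, not an implementation — Shard is a hypothetical stand-in for a participant, and the comment marks the crash window that makes the protocol dangerous.

```python
# Minimal single-coordinator 2PC sketch. A real participant would lock
# rows and write a prepare record to its WAL; here a flag stands in.

class Shard:
    def __init__(self):
        self.prepared = False
        self.committed = False

    def prepare(self):
        self.prepared = True      # locks held from this point on
        return "YES"

    def commit(self):
        self.committed = True
        self.prepared = False     # locks released

    def abort(self):
        self.prepared = False

def two_phase_commit(shards):
    votes = [s.prepare() for s in shards]      # phase 1: PREPARE
    if all(v == "YES" for v in votes):
        # The coordinator writes its COMMIT record here. A crash after
        # that write but before the broadcast below leaves every shard
        # prepared, holding locks, unable to decide on its own.
        for s in shards:
            s.commit()                         # phase 2: COMMIT
        return "committed"
    for s in shards:
        s.abort()
    return "aborted"

print(two_phase_commit([Shard(), Shard()]))    # committed
```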
This works in the happy path. It is also the answer many sharded SQL systems ship — Vitess and Citus both support 2PC for opt-in cross-shard transactions.
The failure mode is the one chapter 97 spent half its pages on: the coordinator crashes between writing the COMMIT record and broadcasting it to the shards. Some shards have their commit message and have applied the writes. Others are still in the prepared state, holding locks. They cannot commit on their own — what if the coordinator decided to abort? They cannot abort on their own — what if the coordinator decided to commit? They wait. The locks they hold block every other transaction that needs the same rows.
In production this becomes a 2 AM page. The cross-shard transaction was a transfer between two big customer accounts that get hammered all day. Twenty downstream transactions queue behind the held locks. Throughput on those rows goes to zero until an operator finds the coordinator's log on disk, reads the decision, and force-resolves the prepared state on each shard. If the coordinator's machine is gone — disk failure, terminated cloud instance, evicted Kubernetes pod with no persistent volume — you are now in a recovery procedure that requires a human to reason about which outcome is safe.
Sharded systems that ship 2PC mostly ship it disabled. Vitess defaults to single-shard transactions; cross-shard 2PC is a per-query opt-in with a documented warning that it is not recommended for high-throughput workloads. The reason is exactly the blocking story above. 2PC is correct on the happy path and dangerous on the failure path, and you cannot tell from inside the application which day you are having.
Answer 2 — Saga with compensating actions
A saga turns one distributed transaction into a sequence of local transactions, each on a single shard, with a compensating action defined for each forward step. If step k+1 fails, you run the compensations for steps 1 through k in reverse to undo the visible effects.
For the transfer:
def transfer_saga(workflow, alice, bob, amount):
    workflow.run(
        forward=lambda: shard_for(alice).debit(alice, amount),
        compensate=lambda: shard_for(alice).credit(alice, amount),
    )
    workflow.run(
        forward=lambda: shard_for(bob).credit(bob, amount),
        compensate=lambda: shard_for(bob).debit(bob, amount),
    )
If step 1 succeeds and step 2 fails, the workflow runs the compensation for step 1 — credit Alice back. End state: no transfer happened. Money conserved, no held locks, no stuck cluster. The saga never blocks because there is no consensus protocol to stall.
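The workflow.run API in the sketch is left undefined. One minimal runner consistent with it — a hypothetical sketch, not a real library — records each forward step's compensation and, on failure, replays the recorded compensations in reverse:

```python
# Minimal saga runner: each successful forward step registers its
# compensation; a failing step triggers the registered compensations
# in reverse order before re-raising.

class SagaWorkflow:
    def __init__(self):
        self._compensations = []

    def run(self, forward, compensate):
        try:
            result = forward()
        except Exception:
            # Undo steps 1..k in reverse, then propagate the failure.
            for undo in reversed(self._compensations):
                undo()
            raise
        self._compensations.append(compensate)
        return result
```

If the credit step raises, the debit's compensation runs and the caller sees the original exception; the end state is "no transfer happened". Real saga engines (Temporal, for instance) add durability so the compensations survive a crash of the runner itself.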
The cost is partial-state visibility. Between the debit on shard 3 and the credit on shard 7, a third party reading both balances sees Alice 100 short and Bob unchanged. For some applications this is fine — most user-facing flows tolerate a brief window where "the order is being processed". For a bank reconciliation that runs every minute, it is not fine. A regulator looking at the books at the wrong instant sees money missing, and "it will be back in 800 ms" is not a defence.
Worse: compensation is sometimes impossible. You cannot un-send an email. You cannot un-publish a tweet to a million followers. You can ship an apology, but the state machine has already moved past the point where undo restores the original state. Sagas work when forward steps are reversible. They fail open when they are not.
The microservice world reaches for sagas because the alternative — Paxos commit across services owned by different teams — is too heavy. The financial-systems world rejects sagas for the same operations because the visibility window is illegal. The right tool for one is the wrong tool for the other.
Answer 3 — Avoid cross-shard transactions entirely
The cleanest answer in theory: design the data model so that no transaction ever crosses a shard. Co-locate everything that needs to be modified together. If transfers happen between accounts, put both accounts on the same shard. If orders touch inventory, put orders and inventory on the same shard. Pick the shard key to absorb the join.
This works when there is a natural locality. A SaaS app that shards by tenant_id rarely needs cross-tenant transactions; almost everything a tenant does stays within their own data. WhatsApp sharding by user — every user's chats and contacts on one shard — almost never needs a cross-user atomic write because messages are append-only and conflict-free.
It collapses the moment your workload has a many-to-many relationship that cannot be flattened. A bank cannot put every pair of accounts on the same shard — the pairs form a quadratic explosion. An e-commerce checkout cannot put user, cart, inventory, and payment on the same shard if inventory is sharded by SKU and payment is sharded by transaction id. A social network cannot put every follower-followee pair on the same shard because the graph is dense.
You can sometimes mitigate with denormalisation — store both sides of the relationship redundantly so that the cross-shard join becomes a pair of single-shard reads. This is the Cassandra and DynamoDB design pattern. It works for read-heavy access where the redundant copies can be eventually consistent. It fails for write-heavy operations that need both copies to update atomically, because keeping two redundant copies in sync across shards is the same atomic-commit problem you started with.
"Don't do cross-shard transactions" is good advice that you can follow for some of your workload some of the time. It is not a strategy for the parts of the workload that genuinely require cross-shard atomicity. Those parts always exist by the time you grow large enough to shard.
The real answer — consensus-based atomic commit
The fourth answer is the one chapter 97 hinted at and the one Spanner, CockroachDB, YugabyteDB, and TiDB all ship: make the coordinator a consensus-replicated state machine.
The blocking failure of classical 2PC is a single-coordinator problem. The coordinator's commit decision lives in one place; if that place becomes unreachable, no participant has the authority to decide. Replace the single coordinator with a group of nodes that run a consensus protocol — Raft or Paxos — and the decision lives in the group. Any majority of the group can drive the protocol forward. A minority failing or partitioning away does not stall anything.
This is not a hand-wave. It is what production NewSQL systems do at the level of every transaction. CockroachDB stores each transaction's status in a Raft-replicated key. Spanner uses Paxos groups for both data ranges and for the transaction record itself. YugabyteDB has a transaction-status tablet that is itself replicated by Raft. The mechanism is the same in all three: every step the coordinator would take is run through consensus before it takes effect.
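The reason replication removes the blocking failure is the majority-overlap argument: once a majority of coordinator replicas have durably stored the outcome, any surviving majority overlaps that write set and can recover it. The toy model below shows only that argument — it elides all of Raft (leader election, log replication, ordering) and every name in it is illustrative.

```python
# Toy model of a consensus-replicated commit decision. The decision is
# final once a majority of replicas store it; any surviving majority
# then contains at least one replica that knows the outcome.

REPLICAS = 5

def record_decision(stores, decision):
    acks = 0
    for store in stores:
        store["decision"] = decision    # durable write on this replica
        acks += 1
        if acks > REPLICAS // 2:        # majority reached: decision is final
            return True
    return False

def recover_decision(surviving_stores):
    # Any majority of 5 overlaps the majority that acked the write.
    for store in surviving_stores:
        if "decision" in store:
            return store["decision"]
    return None

stores = [{} for _ in range(REPLICAS)]
record_decision(stores, "COMMIT")
# Two replicas fail; any three survivors still know the outcome.
print(recover_decision(stores[2:]))     # COMMIT
```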
Build 13 spends seven chapters on consensus. The protocols are intricate — leader election, log replication, safety invariants, snapshotting, membership change. Build 14 then applies consensus to atomic commit and gets you the property the CFO is asking for: a money transfer that is atomic across shards, with no single failure able to stall the cluster, and with the application API still being plain SQL.
The systems you will end up using are:
- Spanner (Google, 2012) — Paxos groups, TrueTime for external consistency, runs Google's ad and Play infrastructure. Available as a managed service on GCP.
- CockroachDB (Cockroach Labs, 2014) — Raft per data range, distributed transactions on top, open source. Postgres-wire-compatible.
- YugabyteDB (2017) — DocDB storage engine with Raft per tablet, a Postgres front-end and a Cassandra front-end on top.
- TiDB (PingCAP, 2016) — Percolator-style transactions over a Raft-replicated key-value store (TiKV). MySQL-wire-compatible.
These are the systems Build 14 covers in detail. They share the architecture; they differ in clock model, protocol details, and which SQL surface they expose.
The path from sharded to distributed SQL
The industry took twenty years to climb this hill. Reading the timeline is the cleanest way to see what each step bought.
Sharded SQL — Vitess (YouTube 2010), Citus (2011). Take Postgres or MySQL, split data across shards by a chosen key, route queries through a proxy that knows the shard map. The big win is throughput. The big loss is exactly what chapter 97 named — cross-shard atomicity is either disabled or opt-in 2PC with a single coordinator.
Sharded SQL with bolted-on 2PC — XA, Vitess --enable-2pc, Citus distributed transactions. Add a coordinator process that drives 2PC across the shards. The protocol works on the happy path. Operators who run it in production at scale describe the same set of incidents: a coordinator crash, prepared transactions on every shard, locks held until manual intervention. The opt-in flag exists because some workloads genuinely need it, not because it is the recommended default.
Spanner (Google, 2012). The first widely-deployed system that solved the coordinator-failure problem properly. Each data range is a Paxos group. Cross-shard transactions run 2PC, but the coordinator role is itself a Paxos-replicated state machine — every coordinator decision (received YES from shard A, decided to commit T) is a Paxos round. A coordinator-replica failure is not fatal; the surviving majority continues. The atomic-commit problem stops being a single-point-of-failure problem and becomes a consensus problem.
The NewSQL wave — CockroachDB (2014), TiDB (2016), YugabyteDB (2017). Open-source recreations of Spanner's architecture, each with its own twist on the clock model and protocol details. Same core idea: Raft per shard, Raft-replicated transaction records, atomic commit on top.
The pattern is clear. Sharding alone gives you throughput. Sharding plus a single coordinator gives you atomic commit at the price of a blocking failure mode. Sharding plus a consensus-replicated coordinator gives you atomic commit with no single-failure blocking. The wall is the gap between the middle answer and the right answer, and crossing it requires Raft or Paxos.
Why this is the wall
Earlier chapters in Build 12 added features. Hash sharding, range sharding, scatter-gather queries, secondary indexes. Each chapter added something the previous design did not have, but the underlying machinery was the same — independent local databases coordinated by a routing layer.
Cross-shard atomic commit cannot be added the same way. It is not a feature on top of independent local databases; it is a property that requires the local databases to coordinate at the level of their commit decisions, and the only way to make that coordination non-blocking is to replicate the coordinator's state by consensus. Consensus is genuinely new machinery. It has its own protocols, its own correctness proofs, its own performance characteristics. A team that has not implemented Raft cannot implement non-blocking atomic commit.
That is why this chapter is "the wall". You cannot bolt the answer onto the architecture you have. You need a different architecture, and the entry ticket to that architecture is Build 13.
Python sketch — what you cannot easily do without consensus
Here is the transfer the CFO is asking you to make atomic. On a single Postgres this is one transaction and three lines of code. On a sharded cluster without consensus, the most natural-looking version is broken:
def transfer(alice, bob, amount):
    shard_for(alice).debit(alice, amount)
    shard_for(bob).credit(bob, amount)
    # If the credit fails, alice has lost the money. No atomic guarantee.
You can dress it up with retries, but that does not help — if shard B is genuinely down, you cannot retry forever, and during the retry window Alice is short and Bob is unchanged. You can add idempotency tokens, but they only help once the shard comes back. You can add a saga compensation:
def transfer_with_compensation(alice, bob, amount):
    shard_for(alice).debit(alice, amount)
    try:
        shard_for(bob).credit(bob, amount)
    except Exception:
        shard_for(alice).credit(alice, amount)  # compensate
        raise
This is a saga. It avoids the case where money disappears, at the cost of a window where Alice is debited and Bob has not yet been credited. A regulator scanning the books at that instant sees money missing. A user reading their own balance during the window sees a transfer that has not finished. For most fintech workloads in India, that visibility window is unacceptable — RBI requires the books to balance at all times, not eventually.
The version that satisfies the CFO is the one you cannot write yourself without consensus:
def transfer_atomic(alice, bob, amount):
    with cluster.distributed_transaction() as txn:
        txn.debit(alice, amount)
        txn.credit(bob, amount)
    # Both apply atomically, or neither does.
    # The coordinator that decides is a Raft group.
cluster.distributed_transaction() is a stand-in for the surface CockroachDB and Spanner actually expose, which is an ordinary SQL transaction. Underneath it, every step is run through a Raft-replicated transaction record; the commit decision survives any single failure; locks are released the moment the decision is committed in the Raft log. Build 14 shows how it is wired up. The point of this chapter is that you need it.
Real-world scenarios requiring cross-shard atomicity
The CFO scenario is not contrived. Every real application accumulates operations of this shape.
Bank transfers. Source account on one shard, destination on another. The atomic-commit invariant is "money is conserved". RBI explicitly requires this. UPI transfers between two banks are a multi-system version of the same problem.
E-commerce checkout. A successful checkout writes to orders (sharded by order id), inventory (sharded by SKU), payments (sharded by transaction id), and users.last_order_id (sharded by user id). Four shards, four writes, all-or-nothing. If payments succeeds and inventory fails, you charged a customer for stock you do not have.
Multi-step workflows. A travel booking writes flights, hotel, car rental, and payment. A loan disbursal writes the customer record, the account ledger, the regulatory report, and the email-notification queue. A school admission writes the student record, the fee payment, the class allocation, and the parent-portal credentials. Each step is on its own shard for sensible scaling reasons. The workflow has to be atomic.
Social-graph updates. A follow writes both followers[bob] (containing Alice) and following[alice] (containing Bob). If the followers are sharded by followee and the followings by follower, the two writes are on different shards. The application invariant is "follow is symmetric in the graph"; without atomic commit, half the reads see a relationship the other half do not.
Inventory reservations. A flash sale reserves seats across multiple inventory pools. Either the user gets exactly the seats they paid for, or the reservation fails and they are refunded. Partial reservation — three out of four seats, with payment for four — is the failure mode that brings tickets to the CEO's desk.
In every case the operation crosses shards because the data legitimately lives on different shards for legitimate reasons. You cannot fix it by re-sharding. You need atomic commit.
What you gain by moving to distributed SQL
Distributed SQL — Spanner, CockroachDB, YugabyteDB, TiDB — gives you a strictly stronger contract than sharded SQL.
Transparent atomic commit across shards. The application writes ordinary SQL: BEGIN; UPDATE accounts ... ; UPDATE accounts ... ; COMMIT;. The engine routes the writes to the right ranges, runs 2PC over Raft groups, and the commit either applies everywhere or nowhere. There is no opt-in flag, no warning about latency, no operator runbook for stuck prepared transactions.
Strong-consistency global secondary indexes. A unique constraint on username is enforced across the cluster, not within one shard. The database engine handles the cross-range coordination using the same transaction machinery.
Cross-shard joins with snapshot isolation. You can join orders and users even when they are on different ranges, and the join sees a consistent snapshot of both as of the same logical time. Snapshot isolation falls out of the consensus protocol — every Raft group has a linearisable timeline, and the cross-range coordinator picks a snapshot timestamp that is consistent across all of them.
Full SQL semantics. Subqueries, foreign keys, triggers, views — most of the things that worked on Postgres still work, modulo a handful of features that distributed SQL deliberately does not implement (cursors over arbitrary result sets, certain stored-procedure semantics).
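The snapshot-timestamp choice mentioned under cross-shard joins can be sketched, assuming each range exposes a "closed" timestamp up to which its linearisable timeline is sealed. This is a simplification of what CockroachDB-style engines actually track, and the numbers are illustrative.

```python
# Pick one snapshot timestamp every range can serve consistently: the
# snapshot must be <= each range's closed timestamp, and as fresh as
# possible, so take the minimum across the ranges the query touches.

def snapshot_timestamp(closed_timestamps):
    return min(closed_timestamps)

ranges = {"orders": 1041, "users": 1037, "payments": 1040}
ts = snapshot_timestamp(ranges.values())
print(ts)   # 1037 -- every range can serve a consistent read at this time
```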
The cost is real and worth naming.
Consensus latency on every write. A write commits only when it is durable on a majority of the Raft group. In practice this is one extra round-trip within a region — about 1 ms — and several round-trips for cross-region replication. Read-mostly workloads barely notice. Write-heavy workloads pay this on every transaction.
Operational complexity. A distributed-SQL cluster has more moving parts than a sharded Postgres. Range splits, leaseholder rebalancing, Raft snapshots, version upgrades that respect the consensus protocol — all of these are new operational concerns.
Cost. Spanner is a managed service with managed-service pricing. CockroachDB and YugabyteDB are open source but require more nodes (a Raft group has at least three replicas) and more disk than a sharded primary-replica setup with the same effective storage.
For the workloads where atomic commit is non-negotiable, the cost is the price of correctness. For the workloads that can live without it, paying the cost is a waste.
Why some workloads still don't use distributed SQL
Not every system in the world should be on Spanner.
Workloads that genuinely don't need cross-shard atomicity. A telemetry pipeline that ingests trillions of events per day, each event independent, never reads anything that requires a cross-shard view — Cassandra and DynamoDB are better fits. The cost of consensus on every write is wasted; you have nothing to coordinate.
Workloads that tolerate eventual consistency. A product-catalogue service whose updates propagate through CDN caches anyway gets nothing from atomic commit at the storage layer. A recommendations system that recomputes hourly does not care if a single write is delayed by a few seconds.
Cost-sensitive workloads at moderate scale. A small fintech with a million users and one Postgres primary is fine on Postgres. Sharding hurts for this size; distributed SQL hurts more. The wall this chapter describes only matters once you have already decided sharding is necessary.
Latency-critical workloads inside one region with strong availability. If 99.99% availability and millisecond latency are the goal and the data fits comfortably on one big Postgres with a hot standby, that is the right choice. Distributed SQL adds latency you cannot make smaller than the consensus round-trip.
The honest framing is: distributed SQL is the right answer when you have both the scale that requires sharding and the consistency requirements that require atomic commit. Plenty of large systems have only one of those. Plenty of small systems have neither. The wall is real for the workloads that have both.
Flipkart payments — three architectures, one decision
Flipkart's payments service has the CFO's exact problem. A purchase debits the user's wallet, credits the merchant's account, writes a payment record for reconciliation, and posts a ledger entry for tax. Four writes across four logical tables. At Indian e-commerce scale — Big Billion Days peaks at hundreds of thousands of transactions per second — single-Postgres throughput is the first thing that breaks.
Walk three architectures.
Architecture 1 — single Postgres. All four tables on one primary. Every checkout is one local transaction; ACID is automatic. Fails the moment write throughput exceeds what one machine can sustain. Big Billion Days lasts six days, so six days a year this architecture is the bottleneck.
Architecture 2 — Vitess. Shard by user id. Wallet writes scale; merchant writes scale; payment records scale. Most reads are single-shard and fast. Cross-shard transactions opt-in to 2PC. Two consequences in production: latency on cross-shard transfers roughly doubles because of the 2PC round-trips, and once a year a coordinator crash leaves prepared transactions on five shards and pages the on-call SRE at 2 AM. The architecture is mostly fine; the failure mode is the one chapter 97 named.
Architecture 3 — CockroachDB. Range-sharded by primary key, Raft per range, distributed transactions on top. Every checkout is one logical SQL transaction; the database engine routes the four writes to four ranges, runs 2PC over Raft, and commits atomically. No opt-in flag, no operator runbook for stuck prepared transactions. Latency is slightly higher than Architecture 2's single-shard path because every write goes through a Raft round, but the cross-shard latency is comparable to 2PC — and the failure mode is gone.
The decision Flipkart-shaped companies make in 2026: CockroachDB or YugabyteDB for payments and accounting, where atomic commit is non-negotiable; Vitess or Cassandra for catalogue, search, and recommendations, where eventual consistency is fine and the consensus tax is wasted. Different stores for different invariants. The naive engineer wants one database for everything; the experienced one matches each workload to the cheapest store that gives the guarantees that workload needs.
Common confusions
"Sharding is enough." True for the 99% of traffic that does not cross shards. False for the 1% that does, which is precisely the operations the business depends on most — payments, transfers, workflows. The wall is hit by that 1%, not by averages.
"Sagas are always the right answer for distributed transactions." Sagas are the right answer when partial-state visibility is acceptable to the business. They are the wrong answer when a regulator or a financial counterparty is reading the books. "Always" is doing a lot of work in this sentence; the right framing is "depending on what the application can show".
"Consensus is too slow." Consensus inside a single region adds about 1 ms to every write. For most applications that is invisible. Cross-region consensus adds tens of milliseconds, which matters for the wide-area writes but not for read-heavy paths. "Too slow" is what people say before they have measured.
"Build it yourself — 2PC is just a few hundred lines." The happy path is a few hundred lines. The failure modes — coordinator crash recovery, network partition, clock skew, log truncation, leader change — take years of production hardening. The reason CockroachDB and YugabyteDB exist as products is that the work is genuinely hard and most teams who try to write it themselves end up with a system that corrupts data once a year. Use the systems that have been hardened.
"NewSQL is just sharded Postgres." The user-facing surface is similar; the internals are entirely different. NewSQL systems are Raft-per-range from the storage layer up. The transaction system, the SQL planner, the failover model — all distinct from a sharded primary-replica deployment. Treating them as Postgres with a different operational model misses what they actually do.
"My workload doesn't need atomic commit." Maybe. Audit the operations that touch more than one entity. Walk through what happens if each step succeeds and the next one fails. If the answer is ever "data is corrupted in a way the business cannot tolerate", you need atomic commit somewhere — possibly in a saga with strong compensations, possibly in distributed SQL, but somewhere.
Going deeper
From Bigtable to Spanner — the arc that defined NewSQL
In 2006 Google published the Bigtable paper. Bigtable was a sharded wide-column store that explicitly disclaimed cross-row transactions; the design assumption was that applications would do without. For five years Google built on top of that assumption. By 2010 they had hit the wall this chapter describes — the ad system needed cross-row atomicity, the Play Store needed inventory invariants, and applications were each reinventing partial atomic-commit machinery on top of Bigtable.
The internal solution was Megastore (2011), a layer above Bigtable that added Paxos-replicated commit. It worked but was slow — every transaction was several Paxos rounds. Spanner (2012) was the rewrite that made it fast: Paxos groups for the data ranges themselves, 2PC over those groups for cross-range transactions, and a TrueTime-based clock that gave external consistency without an extra round-trip on reads. The 2012 OSDI paper is the document that defined NewSQL as a category.
The open-source NewSQL wave — CockroachDB, YugabyteDB, TiDB — is essentially the project of recreating Spanner without the proprietary clock hardware. Each system makes a different choice for the clock model: hybrid logical clocks with bounded uncertainty in CockroachDB; a centralised timestamp authority in TiDB; loosely synchronised wall-clock in YugabyteDB. The underlying Raft-per-range and 2PC-over-Raft architecture is shared.
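The hybrid-logical-clock rule that CockroachDB uses in place of TrueTime comes from Kulkarni et al. (2014): a timestamp is a (wall, logical) pair that tracks physical time, never goes backwards, and always exceeds any timestamp the node has received. A minimal sketch of the update rule:

```python
def hlc_local(clock, pt):
    # Local or send event: jump to physical time if it is ahead,
    # otherwise bump the logical counter.
    wall, logical = clock
    if pt > wall:
        return (pt, 0)
    return (wall, logical + 1)

def hlc_receive(clock, msg, pt):
    # Receive event: the new timestamp must exceed the local clock,
    # the message's timestamp, and track physical time.
    wall, logical = clock
    mwall, mlogical = msg
    new_wall = max(wall, mwall, pt)
    if new_wall == wall == mwall:
        return (new_wall, max(logical, mlogical) + 1)
    if new_wall == wall:
        return (new_wall, logical + 1)
    if new_wall == mwall:
        return (new_wall, mlogical + 1)
    return (new_wall, 0)

clk = (100, 0)
clk = hlc_local(clk, 105)               # physical time ahead
clk = hlc_receive(clk, (110, 3), 106)   # message from a faster clock
print(clk)   # (110, 4)
```

The logical component absorbs clock skew between nodes: causality is preserved even when a message arrives carrying a wall-clock value ahead of the receiver's physical time.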
Paxos Commit (Gray and Lamport 2006)
The single paper that most clearly explains why consensus is the right answer to atomic commit is Gray and Lamport's Consensus on Transaction Commit. The paper reframes 2PC as a degenerate consensus problem — one where the participants vote and the coordinator is the only one who decides — and shows that replacing the coordinator with a Paxos group gives you a non-blocking protocol with the same correctness properties. Every NewSQL system is a descendant of this paper's design, even when the implementation uses Raft instead of Paxos for engineering reasons.
The key insight: if you accept that atomic commit reduces to consensus, then all the work you put into making consensus practical pays off twice — once for ordinary log replication (Build 13), once for atomic commit (Build 14). That is why Build 13 and Build 14 are next to each other.
Where this leads next
Build 13 — Consensus. Seven chapters on Raft and Paxos. Leader election, log replication, safety, snapshotting, membership change, multi-Raft, the geography of consensus latency. By the end you understand the protocol that the rest of the book assumes.
Build 14 — Atomic commit via consensus. Six chapters that wire consensus into the storage engine. Paxos commit, parallel commits, transaction status records, retries and back-offs, hybrid logical clocks, the Spanner external-consistency story.
Build 15 — NewSQL systems in detail. Spanner, CockroachDB, YugabyteDB, TiDB — each gets a chapter on architecture, a chapter on transactions, a chapter on operations.
The CFO walks back into your sprint review three quarters later. Transfers are atomic. Reconciliation is back to balanced books. The migration from Vitess to CockroachDB took two quarters and one production incident; the incident was a configuration mistake, not a correctness bug. She does not say thank you. She moves on to the next problem.
That is the shape of the rest of the book.
References
- Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — the foundational NewSQL paper that introduced 2PC-on-Paxos with TrueTime. The "Transactions over Paxos groups" section is the precise architectural answer this chapter points toward.
- Cockroach Labs, CockroachDB Architecture: Transaction Layer — the clearest open-source treatment of how a distributed-SQL engine wires Raft-replicated transaction records to atomic commit, including the parallel-commits optimisation that collapses prepare and commit phases.
- Gray and Lamport, Consensus on Transaction Commit, ACM TODS 2006 — the paper that reframes atomic commit as a consensus problem and introduces Paxos Commit. The intellectual root of every system in this chapter.
- Kleppmann, Designing Data-Intensive Applications, Chapter 7 — Transactions, and Chapter 9 — Consistency and Consensus, O'Reilly 2017 — the textbook treatment of why local ACID does not compose across shards and why consensus is the load-bearing primitive for atomic commit.
- Aslett, What We Talk About When We Talk About NewSQL, 451 Research 2011 — the post that named the NewSQL category, predating Spanner. Useful historical context for how the industry framed the wall before the production answers existed.
- Brewer, CAP Twelve Years Later: How the Rules Have Changed, IEEE Computer 2012 — the reflection on CAP that explicitly acknowledges Spanner-style systems and reframes the trade-off as latency rather than impossibility, which is the framing distributed SQL operates inside.