Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Spanner-style transactions with TrueTime
PaySetu runs a global ledger that mirrors balances across ap-south-1, ap-southeast-1, and us-east-1. A reconciliation job reads every region at the same logical instant and asserts the sums match. Last quarter the assertion failed three times; each failure traced back to the same shape — two writes committing within 4 ms of each other on different continents, and a reader in a third region observing them in an order the application had never produced. The team migrated the ledger to a Spanner-style transactional layer: 2PC across Paxos groups, with commit timestamps drawn from a TrueTime API that exposes its own uncertainty. The reconciliation job has not failed since. This chapter walks through what that protocol actually does — how 2PC, Paxos, and TrueTime fit together, what commit-wait is, and why the design buys external consistency at the cost of a few milliseconds per commit.
Spanner-style transactions layer 2PC over Paxos replication groups and use TrueTime — a clock API that returns an interval [earliest, latest] — to assign commit timestamps without a central oracle. Each shard is a Paxos group; the transaction's coordinator picks s = TT.now().latest, replicates prepared records via Paxos, then waits until TT.now().earliest > s before releasing locks. That wait — commit-wait, typically 1–7 ms — is what guarantees external consistency across regions. The reader gets the entire SQL surface of an ACID database, globally distributed, with no central transaction manager.
The shape of the system — Paxos groups, then 2PC on top
A Spanner database is sharded into thousands of tablets. Each tablet is replicated across 3, 5, or 7 nodes — usually one per zone — running a Paxos state machine. A write to a single tablet is a single Paxos round: the leader proposes, a majority acks, the entry is committed. That is one consistent shard. The hard problem is what happens when a transaction touches two or more tablets, possibly in different regions.
For multi-shard transactions, Spanner runs two-phase commit on top of the Paxos groups. The participants in 2PC are not individual nodes but entire Paxos groups — each represented by its current leader. The transaction picks one of those leaders as the coordinator; the rest are participants. The participants prepare by Paxos-replicating a prepared record into their tablet's log; once a majority of replicas in each participant group have logged prepared, that group is durably promised to the transaction. The coordinator then collects all prepared votes, picks a commit timestamp, Paxos-replicates a committed record into its own tablet's log, and notifies the participants. Each participant Paxos-replicates a committed record into its log, releases its locks, and the transaction is done.
[Figure: the 2PC-over-Paxos message flow — each prepare and commit record is itself replicated via Paxos within its group. Illustrative, not measured timings.]

Why 2PC over Paxos rather than Paxos alone: Paxos within a group commits writes to that group atomically. But a single write that spans two groups is two independent Paxos rounds, and there is no atomicity across them — a crash between rounds could commit one shard's write and abort the other's. 2PC is the wrapper that makes the cross-group commit atomic; Paxos within each group is what makes 2PC's prepared and committed records survive node failure.
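To make the atomicity gap concrete, here is a toy sketch — not part of the chapter's simulator, with hypothetical shard dicts and key names — of a cross-shard transfer done as two independent rounds, with a crash injected between them:

```python
# Illustrative only: each dict stands in for one shard's state after a
# "Paxos round" that commits locally but knows nothing of the other shard.
def transfer_without_2pc(shard_a, shard_b, amount, crash_between=False):
    shard_a["alice"] -= amount          # round 1: shard A commits its half
    if crash_between:
        raise RuntimeError("coordinator crashed between Paxos rounds")
    shard_b["bob"] += amount            # round 2: never happens after a crash

a = {"alice": 1000}; b = {"bob": 0}
try:
    transfer_without_2pc(a, b, 500, crash_between=True)
except RuntimeError:
    pass
# Money has vanished: shard A committed, shard B did not.
assert a["alice"] + b["bob"] == 500     # the invariant (sum == 1000) is broken
```

2PC's prepared records are what let a recovery process finish or roll back both halves together instead of leaving this half-applied state behind.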
TrueTime — what TT.now() actually returns
TrueTime is Spanner's clock API. Unlike time.time(), TT.now() returns a struct with two fields, earliest and latest, satisfying the invariant: the absolute true time at the moment of the call lies somewhere in [earliest, latest]. The width latest - earliest is the uncertainty window, denoted ε. Inside a Google datacentre, ε is kept small by GPS receivers and atomic clocks on time masters, against which each machine's local time daemon disciplines its clock; published values are 1–7 ms, with most calls sub-2 ms.
Two derived calls matter for the transaction protocol:
- TT.after(t) returns true if TT.now().earliest > t — that is, if the present is provably after t.
- TT.before(t) returns true if TT.now().latest < t — provably before t.
A Spanner transaction commit picks s such that s ≥ TT.now().latest at the moment of choice, then waits in user code until TT.after(s) returns true before releasing locks and acknowledging the client. That wait is the price of external consistency.
Why this works: at the moment the coordinator picked s, true time was somewhere ≤ TT.now().latest = s. By waiting until TT.now().earliest > s, the coordinator proves that true time is now strictly after s. So any later transaction T2 that begins at real time t2 > s will pick its own commit timestamp s' ≥ TT.now().latest ≥ t2 > s. The order of timestamps is the order of real time — that is external consistency. Without commit-wait, T2 could start in real time after T1 commits but pick a numerically smaller timestamp, and a third-region reader would see them in the wrong order.
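The argument can be checked with concrete numbers. A small sketch — all timestamps and the ε value are illustrative; tt_latest models where a given server's reading falls inside its uncertainty interval:

```python
EPS = 4.0  # hypothetical uncertainty width, ms

def tt_latest(true_ms, err_ms):
    # TrueTime's invariant guarantees true_ms <= latest <= true_ms + EPS;
    # err_ms models where this server's reading falls in that range.
    assert 0 <= err_ms <= EPS
    return true_ms + err_ms

# T1 commits on a server whose clock reads at the late edge of its interval:
s1 = tt_latest(100.0, 4.0)            # s1 = 104.0
# With commit-wait, T1 holds its locks until true time has passed s1 (104 ms),
# so any T2 that starts afterwards picks latest >= its true start time > s1:
s2_with_wait = tt_latest(104.5, 0.0)  # 104.5 > s1 — real-time order preserved
assert s2_with_wait > s1
# Without commit-wait, T1 could release at true time 100 ms; a T2 starting at
# 100.5 ms on a tightly-synchronised server picks a *smaller* timestamp:
s2_no_wait = tt_latest(100.5, 0.0)    # 100.5 < s1 — ordering inversion
assert s2_no_wait < s1
```

The wait converts "my clock might be up to ε wrong" into "real time is provably past s", which is exactly what the external-consistency argument needs.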
A walkthrough simulator — 2PC + Paxos + TrueTime in Python
This is a reduced Spanner-style coordinator written against an in-memory Paxos-group simulation. It implements prepare, commit-timestamp picking, commit-wait, and the participant notification sequence. PaySetu's payments-ledger team uses an internal harness shaped exactly like this for protocol tests; the simulation here mirrors that shape.
# Spanner-style 2PC + Paxos + TrueTime simulator.
# Each group is a Paxos abstraction; commit timestamps come from a TrueTime API.
import time, random
from dataclasses import dataclass, field
from typing import Dict, List

EPSILON_MS = 4.0  # TrueTime window width

@dataclass
class TrueTime:
    skew_ms: float = 0.0
    def now(self):
        t = time.time() * 1000 + self.skew_ms
        return (t - EPSILON_MS / 2, t + EPSILON_MS / 2)
    def after(self, ts_ms): return self.now()[0] > ts_ms

@dataclass
class PaxosGroup:  # one shard, replicated
    name: str
    log: List[dict] = field(default_factory=list)
    locks: Dict[str, str] = field(default_factory=dict)  # key -> txn_id
    def replicate(self, entry):
        time.sleep(0.0008 + random.random() * 0.0006)  # ~1 ms Paxos round
        self.log.append(entry)
        return True
    def prepare(self, txn_id, keys, prep_ts):
        for k in keys:
            if self.locks.get(k, txn_id) != txn_id: return None
            self.locks[k] = txn_id
        self.replicate({"type": "prepared", "txn": txn_id, "ts": prep_ts})
        return prep_ts
    def release(self, txn_id, keys):
        for k in keys:
            if self.locks.get(k) == txn_id: self.locks.pop(k)
    def apply_commit(self, txn_id, keys, commit_ts):
        self.replicate({"type": "committed", "txn": txn_id, "ts": commit_ts})
        self.release(txn_id, keys)

def spanner_commit(txn_id, writes, coord, participants, tt):
    # writes: {group_name: [keys]}; coord is a PaxosGroup, participants is a list.
    prep_ts = tt.now()[1]  # prepare-time floor
    votes = {coord.name: coord.prepare(txn_id, writes[coord.name], prep_ts)}
    for p in participants:
        votes[p.name] = p.prepare(txn_id, writes[p.name], prep_ts)
    if any(v is None for v in votes.values()):
        for g in [coord] + participants:  # abort: drop any locks we took
            g.release(txn_id, writes[g.name])
        return False
    s = max(max(votes.values()), tt.now()[1])  # commit_ts ≥ TT.latest
    coord.replicate({"type": "committed", "txn": txn_id, "ts": s})  # log commit
    while not tt.after(s):  # ← COMMIT-WAIT
        time.sleep(0.0005)
    coord.release(txn_id, writes[coord.name])  # commit already logged; drop locks
    for p in participants: p.apply_commit(txn_id, writes[p.name], s)
    return s

# DEMO — Riya transfers ₹500 across PaySetu's ap-south-1 and ap-southeast-1 ledgers.
tt = TrueTime()
g_in = PaxosGroup("ledger-in"); g_sg = PaxosGroup("ledger-sg")
t0 = time.time()
ts = spanner_commit("txn-9173", {"ledger-in": ["alice"], "ledger-sg": ["bob"]},
                    g_in, [g_sg], tt)
elapsed_ms = (time.time() - t0) * 1000
print(f"committed at ts={ts:.2f} elapsed={elapsed_ms:.2f} ms (incl. commit-wait)")
print(f"in-log: {len(g_in.log)} entries sg-log: {len(g_sg.log)} entries")
Realistic output:
committed at ts=1714291837412.31 elapsed=8.74 ms (incl. commit-wait)
in-log: 2 entries sg-log: 2 entries
Walk through five lines.

- prep_ts = tt.now()[1] — the coordinator records a floor that every participant's prepared record must be at least as large as; this is the first half of the timestamp constraint.
- for k in keys: ... self.locks[k] = txn_id — prepare acquires per-key locks under Paxos. The locks survive a leader change because the prepared log entry is replicated.
- s = max(max(votes.values()), tt.now()[1]) — the coordinator picks s ≥ TT.now().latest and s ≥ every participant's prep_ts. The second condition is what stops a future read from missing a write that was prepared but not yet committed.
- while not tt.after(s): time.sleep(0.0005) — commit-wait. The protocol blocks here. The wait is short because ε is small, but it is non-zero by construction.
- The commit sequence — the coordinator Paxos-logs its committed record, and after the wait each group drops its locks; the participants' apply_commit is their second Paxos round, recording the committed entry.

The DEMO transfers ₹500 across two regions; the elapsed ≈ 8 ms is dominated by the two Paxos rounds per group plus the commit-wait of ~ε.
Read-only transactions — the reason commit-wait is worth it
A Spanner read-only transaction does not need any locks. The client picks a snapshot timestamp s_read (typically TT.now().latest for "as of now" reads), and every shard the read touches simply returns the value at s_read. Because of commit-wait, s_read is provably after every transaction that committed before the read began (in real time). This is the payoff: read-only transactions are free — no locks, no 2PC, no Paxos round on the read path — but they still see a globally consistent snapshot.
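A minimal sketch of such a lock-free snapshot read, assuming a simple MVCC store keyed by commit timestamp — the versions and values here are hypothetical and independent of the simulator above:

```python
# key -> list of (commit_ts, value) versions; timestamps are illustrative.
shard = {"alice": [(90.0, 1000), (104.0, 500)]}

def snapshot_read(shard, key, s_read):
    """Return the newest value with commit_ts <= s_read. No locks are taken:
    commit-wait guarantees every write committed (in real time) before the
    read began has a timestamp <= s_read, so the snapshot is consistent."""
    versions = [(ts, v) for ts, v in shard[key] if ts <= s_read]
    return max(versions)[1] if versions else None

assert snapshot_read(shard, "alice", 95.0) == 1000   # before the transfer
assert snapshot_read(shard, "alice", 110.0) == 500   # after it
```

Every shard answers the same question — "value as of s_read" — locally, which is why the read path needs no 2PC and no coordinator.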
What the protocol handles, and what it does not
Network partition between coordinator and a participant. The coordinator's prepare RPC times out; the coordinator either retries (if the participant's prepared record is still there, the retry idempotently completes) or aborts the transaction. Any aborted prepare leaves a prepared log entry that a recovery process eventually rolls back. Participants never block forever — they detect coordinator-loss via lease timeouts and run a coordinator-recovery query against the coordinator group's Paxos log.
Slow node masquerading as failure. Each Paxos group can elect a new leader if the current leader is slow, without affecting the 2PC layer; the new leader inherits the group's prepared records from the Paxos log. The 2PC coordinator is itself a Paxos group leader, so coordinator failover is the same mechanism.
Clock skew exceeding ε. This is the mode TrueTime is built to detect. If a node's clock disagrees with the time master by more than ε, the time daemon raises an alarm and the node refuses to serve TrueTime calls — better to fail loudly than silently violate external consistency. Spanner reportedly evicts such nodes from leadership within seconds.
Commit-wait under high load. Commit-wait blocks one fibre, not the whole node. Spanner pipelines transactions, so commit-wait overlaps with prepare RPCs and Paxos rounds for the next transaction. Throughput is bottlenecked by Paxos round-trip time, not by commit-wait.
Byzantine participants. Spanner does not handle Byzantine failures — it assumes participants follow the protocol. A malicious participant could lie about its prepared timestamp. This is the same threat model as Paxos.
Why the protocol does not block on commit-wait at every replica: only the coordinator's leader runs commit-wait, in user-space, after the committed Paxos entry has been replicated. The followers simply replay the log and apply the commit at timestamp s; they do not re-wait. The wait is per-transaction, not per-replica — and because it overlaps with the next transaction's prepare phase via pipelining, the system's steady-state throughput is bounded by the Paxos round-trip, not by ε.
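A back-of-envelope calculation makes the claim concrete. All latency and concurrency figures below are hypothetical:

```python
# With pipelining, commit-wait overlaps the next transactions' prepare phases,
# so steady-state throughput is set by round-trips and concurrency, not by ε.
paxos_rtt_ms = 2.0      # one intra-region Paxos round (illustrative)
commit_wait_ms = 4.0    # ~ε (illustrative)
in_flight = 100         # pipelined transactions per leader (illustrative)

per_txn_latency_ms = 2 * paxos_rtt_ms + commit_wait_ms   # prepare + commit + wait

# Serial: one transaction at a time pays the full latency each time.
serial_tps = 1000.0 / per_txn_latency_ms                 # 125 txn/s
# Pipelined upper bound (Little's law): throughput = concurrency / latency.
pipelined_tps = in_flight * 1000.0 / per_txn_latency_ms  # 12,500 txn/s
print(f"serial: {serial_tps:.0f} txn/s  pipelined: {pipelined_tps:.0f} txn/s")
```

The per-transaction latency still includes commit-wait — the client sees it — but the leader's throughput scales with the number of overlapping transactions, not with ε.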
Common confusions
- "TrueTime is just NTP." It is not. NTP gives you a single time value with no published uncertainty bound. TrueTime gives you [earliest, latest] and guarantees the true time lies in that interval — a guarantee maintained by GPS receivers, atomic clocks, and time-master disciplining at every datacentre. The interval is what makes commit-wait sound.
- "Spanner uses Paxos for everything, so 2PC is redundant." Paxos is per-shard; 2PC is across shards. A single-shard transaction is one Paxos round and skips 2PC entirely. Multi-shard transactions need 2PC because no single Paxos group's log can durably commit a write that spans groups.
- "Commit-wait makes Spanner slow." Commit-wait is 1–7 ms — comparable to a single cross-region Paxos round, and pipelined with the next transaction's prepare phase. In practice the bottleneck is the cross-region Paxos round, not commit-wait. Single-region Spanner clusters routinely achieve sub-10 ms p99 commits.
- "Spanner gives you serialisable isolation by default." It gives you external consistency, which is strictly stronger than serialisable. Serialisable says "the result is some serial order"; external consistency says "the result is the serial order matching real-time commit order". If T1 commits before T2 starts, every observer sees T1 first.
- "Read-only transactions still need locks." They do not. A read-only transaction at s_read reads each shard's snapshot at s_read. Because of commit-wait, every committed write at s ≤ s_read is visible everywhere. No locks, no 2PC, no extra round-trips.
- "Spanner-style txns require Google's hardware." They require a clock with a bounded, published uncertainty. Google built one with GPS+atomic-clock time masters; CockroachDB approximates it with HLC + a configurable max_offset; YugabyteDB does the same. The protocol is portable; the ε is bigger on commodity hardware (50–500 ms), which makes commit-wait longer.
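The HLC primitive mentioned above can be sketched in a few lines. This is a simplified rendering of the standard hybrid-logical-clock algorithm, not CockroachDB's or YugabyteDB's actual implementation:

```python
import time

class HLC:
    """Hybrid logical clock: (wall_ms, logical) pairs, compared as tuples.
    Timestamps are monotone even when physical clocks are skewed."""
    def __init__(self):
        self.l, self.c = 0, 0
    def now(self):                      # local event or message send
        pt = int(time.time() * 1000)
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1                 # physical clock stalled: bump logical
        return (self.l, self.c)
    def update(self, lm, cm):           # on receiving remote timestamp (lm, cm)
        pt = int(time.time() * 1000)
        l_new = max(self.l, lm, pt)
        if l_new == self.l == lm:   self.c = max(self.c, cm) + 1
        elif l_new == self.l:       self.c += 1
        elif l_new == lm:           self.c = cm + 1
        else:                       self.c = 0
        self.l = l_new
        return (self.l, self.c)

h = HLC()
a = h.now()
b = h.update(a[0] + 5000, 0)  # message from a node 5 s "in the future"
assert b > a and h.now() > b  # causal order survives the skew
```

Causality is preserved by the logical component; the physical component keeps timestamps close to wall time so the max_offset bound can be enforced.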
Going deeper
The original Spanner paper and what is sometimes mis-summarised
Corbett et al.'s Spanner: Google's Globally-Distributed Database (OSDI 2012) is one of the most-cited distributed-systems papers of the 2010s. The popular summary — "GPS clocks make consistency cheap" — understates the protocol design. The paper's actual contribution is the combination: 2PC over Paxos, with TrueTime providing the timestamps that allow the system to skip a global timestamp oracle and still achieve external consistency. Removing any one piece breaks the protocol. Without 2PC, multi-shard transactions are not atomic. Without Paxos, prepared records do not survive failure. Without TrueTime, the timestamps could violate real-time order.
CockroachDB's approximation — HLC + max_offset
CockroachDB uses hybrid logical clocks plus a configured max_clock_offset (default 500 ms). Instead of waiting out an interval, it uses uncertainty restarts: a transaction reads a value with a timestamp inside its uncertainty window and restarts itself with a higher start timestamp. This converts commit-wait into occasional client-visible retries, which is acceptable for many workloads but breaks down under high contention. KapitalKite ran benchmarks showing that under p99.9 hot-row contention, CockroachDB's retry rate climbed sharply where Spanner's commit-wait stayed flat — different cost structures, same protocol family.
YugabyteDB and the same trick on commodity time
YugabyteDB also uses HLC and a max_clock_skew_usec (default 500 ms). Their protocol is closer to Spanner's: they wait out the uncertainty window during commit, rather than restarting. Operating teams report ε-equivalents of 100–250 ms on AWS without dedicated GPS hardware, vs Spanner's 1–7 ms inside a Google datacentre. The trade-off is that every multi-shard commit pays a 100–250 ms wait — the protocol works but transactions are slower.
PaySetu's deployment — the three knobs that mattered
PaySetu runs a YugabyteDB cluster across ap-south-1, ap-southeast-1, and us-east-1 for the global ledger. Their initial deployment had p99 multi-region commit latency of 340 ms. Three changes brought it to 95 ms. (1) Reduce max_clock_skew_usec from 500 ms to 150 ms after measuring AWS Time Sync's actual ε across regions over 30 days — the default was unnecessarily conservative. (2) Co-locate the txn coordinator with the most-active region, removing one cross-region RTT from the prepare phase. (3) Enable read-only transactions for the reconciliation job so reads bypass 2PC entirely. The biggest single win was the third; the first two were incremental but additive.
The FLP boundary and why Spanner does not violate it
Spanner does not solve consensus in the FLP model — it relies on synchrony assumptions about the network and the clock. If TrueTime's ε explodes (e.g. all GPS receivers fail simultaneously), the time daemon halts the local node rather than serving stale TrueTime, and the affected Paxos group loses quorum. This is the safety-vs-availability trade-off being made explicit: when the synchrony assumption breaks, Spanner sacrifices availability rather than consistency. See /wiki/flp-impossibility-no-consensus-with-one-faulty-node for the formal statement.
Where this leads next
Spanner-style transactions are the upper bound of what 2PC + Paxos + TrueTime achieves: globally-distributed ACID with external consistency, at the cost of one commit-wait per transaction. Variations down the cost curve include /wiki/calvin-deterministic-concurrency, which avoids 2PC entirely by deterministically ordering transactions in advance, and /wiki/percolator, which trades external consistency for snapshot isolation and a centralised TSO.
The next chapters in Part 14 cover deterministic concurrency, FaunaDB's Calvin-derived design, and the broader frontier of cross-shard transactions without 2PC. After that, Part 17 picks up the multi-region thread — what changes when one of those Paxos groups is on a different continent, and the network RTT dominates the protocol.
If you operate a Spanner-derivative in production, the operational chapters worth reading next are /wiki/2pc-in-detail-including-failure-modes, /wiki/leader-leases-and-fencing-tokens, and the chapters on multi-region replication.
References
- Corbett et al., Spanner: Google's Globally-Distributed Database, OSDI 2012 — the canonical Spanner paper.
- Bacon et al., Spanner: Becoming a SQL System, SIGMOD 2017 — the SQL surface added to the original key-value Spanner.
- Brewer, Spanner, TrueTime and the CAP Theorem, Google whitepaper 2017 — Brewer's own analysis of where Spanner sits in CAP/PACELC.
- CockroachDB blog — Living without atomic clocks — how CRDB approximates TrueTime with HLC + max_offset.
- YugabyteDB docs — Distributed transactions — the protocol's commodity-cloud variant.
- Lamport, Paxos Made Simple, 2001 — the consensus primitive Spanner builds on.
- /wiki/truetime-spanner-and-physical-logical-hybrids — TrueTime as a clock primitive, in detail.
- /wiki/percolator — the older sibling that traded external consistency for a centralised TSO.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip # no third-party deps; stdlib only
python3 spanner_sim.py # the simulator script above, saved to a file; prints ts and elapsed