In short
You finished Percolator with a centralised Timestamp Oracle handing out monotonic commit timestamps. That works at one site; it does not scale to a planet. Spanner is Google's globally distributed SQL database, and its central engineering insight is that you do not need a centralised oracle if every machine has a clock you can trust within a known bound.
The contract Spanner promises is external consistency — a strictly stronger guarantee than serialisability. If transaction T_1 commits before transaction T_2 starts (in real wall-clock time, as observed by any outside party), then T_1's commit timestamp is strictly less than T_2's. The order of timestamps in the database matches the order of events in the universe. With this property, snapshot reads at a timestamp T are linearizable, multi-region SQL just works, and Gmail can ask Spanner "what was the global state at noon UTC?" and get a real answer.
The mechanism is TrueTime: Google equipped every datacentre with GPS receivers and rubidium atomic clocks, and a daemon called the Time Master publishes a bounded-uncertainty interval to every machine. The TrueTime API is one call: TT.now() returns [earliest, latest], with the guarantee that the true absolute time is inside that interval. Typical \epsilon = (\text{latest} - \text{earliest}) / 2 is 1 to 7 ms.
External consistency is enforced by commit-wait. At commit time, a Paxos leader picks s = \text{TT.now().latest} as the commit timestamp, then deliberately waits until TT.now().earliest > s before releasing locks and acknowledging the client. After that wait, every other process's TrueTime interval is strictly after s. Any later transaction starting now will pick a commit timestamp \geq \text{TT.now().latest} > s. The order is preserved by physics, not by gossip.
The cost is honest: commit latency includes the \epsilon wait, around 5–10 ms. The win is enormous: globally serialisable SQL with no central coordinator, no consensus on every transaction, no Lamport-clock dependency tracking. Spanner powers AdWords, Gmail metadata, Google Photos. The architecture stack is layered cleanly: each shard ("tablet") is replicated by Paxos across 3–5 datacentres; cross-shard transactions ride 2PC across Paxos groups; TrueTime assigns the timestamps. Every layer does one thing.
Open-source descendants traded TrueTime's GPS hardware for software-only Hybrid Logical Clocks — CockroachDB, YugabyteDB, FaunaDB. They get a weaker guarantee (serialisable, not externally consistent unless you wait out the maximum NTP skew), but they run on commodity clouds. This chapter teaches the original.
Why a centralised oracle does not scale
Percolator's Timestamp Oracle is one process. It hands out monotonic 64-bit integers in batches — millions per second is fine, and a TSO that lives inside a Raft group survives single-node failure. It is the right answer for a single datacentre.
The problem is the speed of light. Round-trip latency from Mumbai to Iowa is around 200 ms; from Bengaluru to São Paulo, 280 ms. Every transaction in your São Paulo datacentre that has to phone an Iowa-resident oracle just to learn its start_ts is paying 280 ms of dead latency before any real work begins. You can replicate the oracle, but then you need consensus on every timestamp it issues to keep them monotonic across replicas — and Paxos at 280 ms RTT is the same latency in disguise.
Worse, the oracle is the only entity that knows the global order. If the link to it is partitioned, the entire database stalls. Percolator's TSO is acceptable because Bigtable lives in one cluster; Spanner targets a database that spans Iowa, Belgium, Singapore, and Tokyo simultaneously, and it must keep working when any one region is unreachable.
So you give up on the oracle. Every machine assigns its own timestamps locally. The question is: how do you keep the timestamps globally ordered when the clocks of different machines disagree?
Why local clocks alone are not safe: NTP-synchronised wall clocks across a fleet typically agree to within ~10 ms, but the worst-case skew under network instability or a misbehaving NTP server can be seconds. Suppose a transaction in Mumbai reads its own clock as 12:00:00.000 and commits, while a Belgium machine's clock reads 11:59:59.500 a millisecond later in real time. A transaction starting in Belgium and reading its own clock would then pick a smaller timestamp than the already-committed Mumbai one. External consistency would be violated; snapshot reads at a given timestamp would miss writes that committed earlier; the whole MVCC abstraction collapses.
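The failure mode can be made concrete with a toy sketch (illustrative only — the 500 ms skew and the class name are assumptions, not Spanner code):

```python
class SkewedClock:
    """A wall clock offset from true time by a fixed skew (hypothetical model)."""
    def __init__(self, skew_ms: float):
        self.skew_ms = skew_ms

    def now(self, true_time_ms: float) -> float:
        # What this machine *believes* the time is at a given true instant.
        return true_time_ms + self.skew_ms

mumbai = SkewedClock(skew_ms=0.0)      # Mumbai's clock happens to be accurate
belgium = SkewedClock(skew_ms=-500.0)  # Belgium's clock runs 500 ms behind

# T1 commits on the Mumbai machine at true time t = 0, stamped from its clock.
s1 = mumbai.now(true_time_ms=0.0)
# T2 begins on the Belgium machine 1 ms later in real time — strictly after T1.
s2 = belgium.now(true_time_ms=1.0)

assert s2 < s1  # timestamp order contradicts real-time order: violation
```

The assertion passes, which is exactly the problem: the database's timestamps claim T2 preceded T1 even though the universe says otherwise.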
What you actually need is a clock with a known bound on its uncertainty. Not a perfect clock — you can never have one. A clock that knows it is wrong by at most \epsilon, and tells you so. That is TrueTime.
What TrueTime is
TrueTime is a Google-internal infrastructure service. The hardware foundation is two independent time sources installed in every datacentre: GPS receivers (on a set of dedicated time master machines, with antennas on the roof) and atomic clocks (rubidium oscillators, on a separate set of "Armageddon" time masters). Two sources because they fail in different ways: GPS receivers are cheap and accurate but can be jammed, spoofed, or lose signal during atmospheric anomalies; atomic clocks are immune to RF problems but drift continuously between disciplinings. Combining them gives you a fault-tolerant ground truth. Why two sources rather than just GPS: in January 2016 the decommissioning of GPS satellite SVN-23 introduced a 13-microsecond error into the time broadcast by much of the constellation. A production system trusting only GPS would have ingested that jump as ground truth. With two independent source types, the time masters compare sources and reject outliers, so a discrepancy like that is caught rather than propagated — the redundancy is what lets Spanner ride out such events.
The time master machines serve as the per-datacentre reference; every other machine runs a timeslave daemon that polls a variety of masters (GPS-fed and atomic, local and remote), rejects outliers, and computes a conservative interval [earliest, latest] that is guaranteed to contain the true absolute time. Between polls — roughly every 30 seconds — the interval widens at an assumed local-oscillator drift rate of 200 microseconds per second, which is what produces the 1–7 ms sawtooth in \epsilon.
The TrueTime API exposed to application code is brutally simple:
```python
class TrueTime:
    @staticmethod
    def now() -> tuple[float, float]:
        """Returns (earliest, latest). True time lies inside, typically with
        epsilon = (latest - earliest) / 2 in the range 1 to 7 milliseconds."""
        ...

    @staticmethod
    def after(t: float) -> bool:
        """Has the true time definitely passed t? True iff TT.now().earliest > t."""
        return TrueTime.now()[0] > t

    @staticmethod
    def before(t: float) -> bool:
        """Is the true time definitely before t? True iff TT.now().latest < t."""
        return TrueTime.now()[1] < t
```
That's it. Three calls. The system that builds on it is Spanner.
External consistency, defined precisely
The guarantee Spanner provides is stronger than serialisability and is called external consistency (also known as strict serialisability or linearizability for transactions). The definition fits in one sentence:
If transaction T_1 commits before transaction T_2 begins — in real wall-clock time, as observed by any outside party — then T_1's commit timestamp s_1 is strictly less than T_2's commit timestamp s_2.
Three things are worth unpacking.

- "Commits before begins" is a real-time relation, not an internal-clock relation. If your application thread on phone A receives the COMMIT acknowledgement for T_1, and you then send a Bluetooth message to phone B which immediately starts T_2, the database must order T_2 after T_1, even though no two database servers were involved in that handoff.
- "As observed by any outside party" rules out clever explanations like "well, that handoff used a different network than the database's, so the database can't be expected to know about it." External consistency must hold regardless of the channel through which the happens-before relation was established.
- "Strictly less than" is what makes timestamps useful for snapshot reads — a read at timestamp T sees exactly the writes with s \leq T, with no ambiguity.
Serialisability alone does not give you this. A serialisable database is allowed to claim that T_2 happened before T_1 even when the wall clock contradicts that, as long as the result is equivalent to some serial order. That equivalent order may not be the real-time order. For a database powering Gmail, where a user might mark an email as read on their phone (transaction T_1) and then refresh on their laptop (transaction T_2), you cannot have the laptop's read miss the phone's write. External consistency is the mathematical statement of the user's expectation: causality is preserved through the database, even when the causal link goes through the user's brain rather than through the network.
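The definition translates directly into a checkable predicate. Here is a hypothetical helper — the trace format is invented for illustration — that verifies a set of committed transactions against it:

```python
# External consistency as a predicate over a trace. Each transaction records
# its real-time start and ack instants plus its assigned commit timestamp.
# (The dict shape and the function name are invented for this illustration.)
def externally_consistent(trace: list[dict]) -> bool:
    for t1 in trace:
        for t2 in trace:
            # If T1's ack precedes T2's start in real time, T1 must carry
            # a strictly smaller commit timestamp.
            if t1 is not t2 and t1["ack"] < t2["start"]:
                if not (t1["commit_ts"] < t2["commit_ts"]):
                    return False
    return True

# T1 acked at real time 10; T2 started at 12 — the database must order them.
ok = [{"start": 0, "ack": 10, "commit_ts": 105},
      {"start": 12, "ack": 25, "commit_ts": 118}]
# Same real-time order, but T2 got the smaller timestamp: this history may
# still be serialisable, yet it is not externally consistent.
bad = [{"start": 0, "ack": 10, "commit_ts": 105},
       {"start": 12, "ack": 25, "commit_ts": 99}]

assert externally_consistent(ok)
assert not externally_consistent(bad)
```

The `bad` trace is exactly the case the Gmail example describes: the laptop's transaction is ordered before the phone's write it should have seen.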
Commit-wait — buying external consistency for the price of \epsilon
The mechanism is one of the most beautiful tricks in distributed systems. Each Spanner shard is led by a Paxos leader. When the leader is ready to commit a transaction, it proceeds in four steps:
1. Pick the timestamp. At the start of commit, the leader calls s \leftarrow \text{TT.now().latest}. By the TrueTime contract, s is at or after the true wall-clock time. Why latest, not earliest: the correctness argument needs s to be no earlier than the true time at the instant it is picked — only then does waiting for \text{TT.now().earliest} > s prove that the true time has genuinely passed s. Picking latest is the most conservative choice that guarantees this.
2. Replicate via Paxos. The leader runs the normal Paxos write to commit the transaction's writes durably across replicas. This takes one cross-datacentre round trip — a few milliseconds within a region, tens of milliseconds across continents.
3. Commit-wait. Before releasing locks and acknowledging the client, the leader polls TrueTime and waits until \text{TT.now().earliest} > s. This blocks for the uncertainty interval of TrueTime — typically 5–10 ms total. During this wait the leader holds the row locks but the data is already durably committed.
4. Release locks, acknowledge. Once \text{TT.now().earliest} > s, the leader knows that on every machine in the world, a fresh TT.now() would return an interval strictly after s. It releases locks and sends the COMMIT ACK to the client.
The argument for external consistency follows directly. Suppose T_2 begins after T_1's ACK arrives at the client. The client cannot have observed the ACK before time s (because the leader did not send it before the commit-wait completed, which happened after true time exceeded s). So T_2's leader, when it picks its own timestamp later, sees a TT.now() whose latest is greater than s. Therefore s_2 > s_1. External consistency holds, and the only thing it cost was the commit-wait — typically less than the network round-trip the Paxos write already paid for.
This is the algorithm in full. There is no shared global counter, no central oracle, no consensus protocol invented just for timestamps. The Paxos already existed for replication; the 2PC already existed for cross-shard atomicity; TrueTime is an external infrastructure investment. The commit-wait is two extra lines of code in the leader. The rest is buying GPS antennas.
Real Python: TrueTime simulator and a commit-wait coordinator
Let's actually build it. Below is a minimal TrueTime simulator that models a clock with bounded uncertainty, and a coordinator that demonstrates commit-wait.
```python
import time, threading

class TrueTime:
    """A simulated TrueTime. Real Spanner uses GPS+atomic; we model the API."""
    EPSILON_MS = 5.0  # half-width of the uncertainty interval

    @classmethod
    def now(cls) -> tuple[float, float]:
        # Real wall clock in ms since epoch — the "true time" we're bracketing.
        t = time.time() * 1000
        return (t - cls.EPSILON_MS, t + cls.EPSILON_MS)

    @classmethod
    def after(cls, s: float) -> bool:
        return cls.now()[0] > s

    @classmethod
    def sleep_until_after(cls, s: float, poll_ms: float = 0.5) -> None:
        while not cls.after(s):
            time.sleep(poll_ms / 1000.0)

class SpannerLeader:
    """A Paxos leader for one shard. Implements commit-wait."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.committed = []  # list of (commit_ts, txn_id, writes)

    def commit_with_wait(self, txn_id, writes):
        # Step 1: pick s = TT.now().latest
        _, latest = TrueTime.now()
        s = latest
        # Step 2: (in real Spanner) run Paxos to replicate the commit record.
        # We elide it here; assume it took some milliseconds.
        time.sleep(0.002)  # 2 ms for Paxos
        # Step 3: commit-wait
        TrueTime.sleep_until_after(s)
        # Step 4: release locks and record
        with self.lock:
            self.committed.append((s, txn_id, writes))
        return s

# Demo: T1 commits, then T2 begins; verify s2 > s1.
leader = SpannerLeader("shard-A")
s1 = leader.commit_with_wait("T1", {"A": 500})
# T2 begins now — anywhere after the ACK from T1.
_, latest2 = TrueTime.now()
s2 = latest2  # T2 picks its timestamp the same way
assert s2 > s1, f"external consistency violated: s2={s2} ≤ s1={s1}"
print(f"T1.commit_ts = {s1:.2f}ms")
print(f"T2.commit_ts = {s2:.2f}ms (gap = {s2 - s1:.2f}ms)")
```
Run this and you will see something like T1.commit_ts = 1714050000123.45ms, T2.commit_ts = 1714050000133.78ms, with a gap of a little over 10 ms. That is 2\epsilon plus polling slack: s_1 was picked at the latest edge of the interval, and the wait does not end until the earliest edge has passed it, so the true clock must advance the full width of the interval before the ACK goes out. The gap is the price of external consistency, paid in milliseconds.
A worked timeline
Suppose T_1 wants to commit at wall-clock time 100 ms and \epsilon = 5 ms.
| Wall clock | Event | TT.now() | Notes |
|---|---|---|---|
| 100 ms | T_1 leader calls TT.now() | [95, 105] | leader picks s_1 = 105 |
| 100–102 ms | Paxos round trip | — | commit replicated durably |
| 102 ms | leader enters commit-wait | [97, 107] | earliest = 97, not yet > 105 |
| 105 ms | poll | [100, 110] | earliest = 100, still not > 105 |
| 110 ms | poll | [105, 115] | earliest = 105, not strictly > 105 |
| 110.1 ms | poll | [105.1, 115.1] | earliest > 105 — leave wait |
| 110.1 ms | release locks, ACK client | — | T_1 has s_1 = 105, durably committed |
| 110.5 ms | T_2 begins on a different leader | [105.5, 115.5] | T_2 picks s_2 = 115.5 |
Check: s_2 = 115.5 > 105 = s_1. External consistency is preserved. The total commit latency for T_1 was 10.1 ms — 2 ms of Paxos plus 8 ms of commit-wait. If T_2 had begun even sooner, say at 106 ms, its TT.now() would have been around [101, 111], and it would have picked s_2 = 111 > 105. The commit-wait guarantees there is no race window in which a later transaction could pick a smaller timestamp.
The full Spanner stack — Paxos, 2PC, TrueTime
Concretely, a cross-region transaction that touches Shard A (rows in Iowa) and Shard B (rows in Belgium) flows like this:
1. Begin. The client picks a Spanner front-end and asks for a transaction. The front-end becomes the 2PC coordinator for this transaction.
2. Reads. Reads at start_ts go to the nearest replica of the relevant shard. If the local replica's data is fresh enough relative to TrueTime, it answers; otherwise it forwards to the leader. (Spanner's "safe time" mechanism on each replica uses TrueTime to know whether it can answer at a given timestamp.)
3. Writes. Writes are buffered at the coordinator until commit.
4. Prepare. The coordinator sends PREPARE to each shard leader. Each leader runs Paxos to durably replicate a PREPARE record to its own followers and acquires the relevant row locks. Each leader picks its own prepare timestamp — larger than any timestamp it has previously assigned — and replies with that.
5. Commit-time decision. The coordinator picks the transaction's commit_ts as \max(\text{TT.now().latest}, \max(\text{prepare\_ts across shards})). Why max with prepare timestamps: external consistency must hold not just relative to wall clock but also relative to anything already committed. Each shard's prepare_ts already accounts for its own history; taking the max ensures the new commit_ts is later than every committed write the shards know about.
6. Commit-wait. The coordinator waits until \text{TT.now().earliest} > \text{commit\_ts}.
7. Commit broadcast. The coordinator sends COMMIT(commit_ts) to each shard leader. Each leader runs another Paxos round to commit the write at the chosen timestamp, releases locks, and acks. The coordinator then acks the client.
The 2PC coordinator does not maintain a log of its own — its state lives in the Paxos group of the first participant shard, so coordinator failure is handled by leader election in that group. This eliminates the textbook 2PC blocking problem we saw in chapter 110.
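The commit-timestamp decision and the commit-wait can be sketched in a few lines. This is a simplified illustration — the helper names are invented, and the simulated TT.now() reuses the \epsilon = 5 ms model from the simulator earlier:

```python
import time

EPSILON_MS = 5.0  # assumed TrueTime half-width, as in the simulator above

def tt_now() -> tuple[float, float]:
    """Simulated TT.now(): brackets the wall clock by ±EPSILON_MS."""
    t = time.time() * 1000
    return (t - EPSILON_MS, t + EPSILON_MS)

def coordinator_commit_ts(prepare_timestamps: list[float]) -> float:
    """Hypothetical sketch of the coordinator's commit-time decision:
    commit_ts = max(TT.now().latest, max prepare_ts), then commit-wait."""
    _, latest = tt_now()
    commit_ts = max([latest] + prepare_timestamps)
    # Commit-wait: block until the true time is provably past commit_ts.
    while not tt_now()[0] > commit_ts:
        time.sleep(0.0005)
    return commit_ts

# Two shards replied with prepare timestamps; the decision dominates both.
s = coordinator_commit_ts([tt_now()[1] - 3.0, tt_now()[1] - 1.0])
assert tt_now()[0] > s  # after the wait, every fresh interval is past commit_ts
```

Taking the max over prepare timestamps is what makes the chosen commit_ts safe against both the wall clock and every shard's already-committed history, as step 5 above argues.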
Reads under TrueTime
The other half of the story is reads, and TrueTime makes them surprisingly elegant. A snapshot read at timestamp T returns the database state as of T. To answer such a read, a replica must know that it has applied every write with commit_ts ≤ T. Each replica tracks a safe time — the latest timestamp at which it can answer reads — and updates it from its Paxos log.
A strong read (read latest) is implemented as: pick T = TT.now().latest, then do a snapshot read at T, with a small commit-wait equivalent if needed. Because TrueTime guarantees T is at or after the true wall-clock time, the read sees every transaction that has externally finished. No quorum read, no read-your-writes hack — just a single replica round trip if the local replica is fresh enough.
This is a major operational win. In a 5-replica Spanner deployment spanning Iowa, Belgium, and Singapore, a Bengaluru client sending a strong read can be served from the Singapore replica without any cross-continent traffic, and still get a linearizable result. Latency: a few milliseconds. Without TrueTime, a linearizable read needs a quorum of replicas, which means cross-continent — hundreds of milliseconds.
What TrueTime is not
TrueTime is widely misunderstood. Three clarifications:
TrueTime does not synchronise clocks. It tells you how wrong they currently are. Two machines might both have \epsilon = 5 ms but their clocks could be 4 ms apart from each other; both intervals contain the true time, so both are consistent with TrueTime's contract.
TrueTime is not a replacement for consensus. Spanner still uses Paxos for replication. TrueTime gives you a clock, not a log. The log is what makes writes durable; the clock is what makes them ordered.
TrueTime is not free. The hardware (GPS receivers, atomic clocks, redundant antennas, redundant power for clock racks) costs Google real money, and the commit-wait costs every transaction real latency. Spanner accepts both because the alternative — globally distributed serialisable transactions on commodity clocks — turns into a research problem with no clean answer.
Open-source descendants and the HLC trade-off
Most teams cannot deploy GPS receivers in their datacentres. The descendants of Spanner replaced TrueTime with a software construct called the Hybrid Logical Clock (HLC), introduced in a 2014 paper by Sandeep Kulkarni et al. An HLC combines the local NTP clock with a Lamport-style logical counter: each timestamp is a tuple (physical_ms, logical_counter), ordered lexicographically. Whenever a node receives a message tagged with a timestamp greater than its own clock, it bumps its HLC to keep the logical counter ahead. The result is a clock that is close to physical time but always provides a total order.
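A minimal HLC following the update rules in the Kulkarni et al. paper; the two-node setup at the bottom is an invented illustration:

```python
class HLC:
    """Hybrid Logical Clock: timestamps are (physical_ms, logical_counter)
    tuples, ordered lexicographically. pt is the local wall clock."""

    def __init__(self, pt):
        self.pt = pt  # callable returning local physical time in ms
        self.l = 0.0  # largest physical component seen so far
        self.c = 0    # logical counter breaking ties within l

    def send(self):
        """Local event or message send: advance past the wall clock."""
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, l_m, c_m):
        """Message receipt: bump past the sender's timestamp too."""
        l_old = self.l
        self.l = max(l_old, l_m, self.pt())
        if self.l == l_old and self.l == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0
        return (self.l, self.c)

# A node whose wall clock lags by 500 ms still orders itself after a message
# from a faster node: the HLC jumps forward instead of going backwards.
fast = HLC(pt=lambda: 1000.0)
slow = HLC(pt=lambda: 500.0)
ts1 = fast.send()         # (1000.0, 0)
ts2 = slow.receive(*ts1)  # (1000.0, 1) — ahead of ts1 despite the skew
assert ts2 > ts1
```

Compare this with the skewed-clock failure earlier in the chapter: the same 500 ms lag that broke plain wall-clock timestamps is absorbed by the logical counter, at the cost of timestamps that no longer carry a real-time guarantee.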
CockroachDB, YugabyteDB, and FaunaDB all use HLCs. They get serialisable transactions cheaply but not external consistency by default — to prevent stale reads from violating real-time order, CockroachDB asks you to configure a --max-offset (typically 500 ms) and uses an uncertainty interval on reads: a read at timestamp T may have to retry if it encounters a value in [T, T + max_offset]. This costs occasional read amplification rather than constant commit-wait, but the worst-case latency on a clock skew event can be huge. AWS recently announced time-sync hardware in EC2 instances using PTP (Precision Time Protocol), shrinking the uncertainty enough that cloud-native HLC databases can offer tighter bounds — a quiet movement of TrueTime-style ideas into the commodity cloud.
The trade is honest. TrueTime gives you a clean model and bounded latency at the cost of GPS hardware. HLC gives you a software-only deployment at the cost of more complex correctness arguments and occasional latency spikes. For a database serving Gmail at Google, the GPS hardware pays itself back in engineering simplicity. For a startup deploying on AWS, HLC is a reasonable substitute.
Summary
Spanner's contribution is not a new algorithm; it is a new resource. By treating bounded-uncertainty time as a first-class infrastructure primitive, Google decoupled the ordering problem from the consensus problem. Consensus (Paxos) replicates each shard's log durably. 2PC composes shards into transactions atomically. TrueTime orders the transactions globally, by giving every leader a clock honest enough to enforce a real-time invariant via a small wait. The resulting system gives you globally serialisable SQL with a contract — external consistency — that matches the user's intuition exactly: if it happened first, it happened first.
The deeper engineering lesson is the value of buying yourself a primitive instead of inventing one. Decades of distributed-systems papers tried to define total order via Lamport clocks, vector clocks, version vectors, and matrix clocks, with each scheme paying in metadata size or in coordination overhead. Spanner's authors looked at the problem and said: clocks are physics; physics costs money; let's spend the money. Then the algorithm above the physics becomes a one-pager.
The next chapter, Calvin: deterministic ordering, explores the opposite design choice — pre-ordering all transactions through a single sequencer so that the execution itself becomes deterministic, removing the need for clocks or 2PC at the cost of a different set of trade-offs. Spanner buys time from physics; Calvin replaces time with a queue. Both are valid; both are in production.
References
- Corbett, J. C. et al. Spanner: Google's Globally Distributed Database. OSDI 2012. research.google/pubs/spanner-googles-globally-distributed-database/
- Bacon, D. F. et al. Spanner: Becoming a SQL System. SIGMOD 2017. research.google/pubs/spanner-becoming-a-sql-system/
- Brewer, E. Spanner, TrueTime and the CAP Theorem. Google Research, 2017. research.google/pubs/spanner-truetime-and-the-cap-theorem/
- Kulkarni, S. S. et al. Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases. OPODIS 2014. cse.buffalo.edu/tech-reports/2014-04.pdf
- Microsecond-accurate clocks on Amazon EC2 instances. AWS Compute Blog, 2023. aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/
- CockroachDB transaction layer architecture. cockroachlabs.com/docs/stable/architecture/transaction-layer.html