TrueTime (Spanner) and physical-logical hybrids

Riya runs the global-balance reconciliation job at PaySetu. Every night at 02:30 IST it scans 14 region-shards of the user-balance table — ap-south-1, ap-southeast-1, eu-west-1, three regions in the US, four in EMEA, the rest in APAC — and asserts that the sum of balances is exactly equal to the previous night's sum minus net withdrawals plus net deposits. Last Tuesday the assertion failed by ₹ 47,18,920. The shards individually balanced. The aggregate did not. After eight hours of debugging, the cause turned out to be three writes that committed within 6 ms of each other on three different continents, with NTP-disciplined wall-clock timestamps that placed them in an order inconsistent with the order in which the application had observed them. The HLC layer had given each write a monotonic-per-node timestamp, but across regions the HLCs were not externally consistent — a read in ap-south-1 had returned balance B before the write that produced B was visible in eu-west-1's HLC ordering. This is the wall against which Spanner's TrueTime was built. TrueTime is the radical idea that the clock should expose its own uncertainty: every call returns an interval [earliest, latest] rather than a point, and Spanner uses that interval to wait out the ambiguity before committing — buying external consistency at a global scale that HLC alone cannot.

TrueTime is Google's clock API that returns TT.now() = [earliest, latest] — a time interval whose width epsilon is bounded by GPS receivers and atomic clocks at every datacentre. Spanner uses TrueTime by picking a commit timestamp s = TT.now().latest and then waiting until TT.now().earliest > s before releasing locks (commit-wait), guaranteeing that any later transaction's timestamp is strictly greater. The result is external consistency — if transaction T1 commits before T2 starts in real time, T1's timestamp is less than T2's everywhere in the cluster. The cost is a commit-wait of roughly 2ε (single-digit milliseconds) per commit. HLC and HLC-plus-uncertainty hybrids approximate this without the hardware budget.

What HLC cannot do, and what TrueTime promises

Hybrid logical clocks give you a 64-bit timestamp that is monotonic per node, captures causality through message exchange, and stays close to wall-clock time. They are an excellent primitive for snapshot isolation within a single cluster. What they do not give you is external consistency: the property that if transaction T1 finished (returned commit-OK to the client) before transaction T2 started (the client called begin), then every observer everywhere sees T1's effects before T2's. HLC's ordering depends on message exchange — if T1 commits in ap-south-1 and T2 starts in eu-west-1 without any message between them carrying T1's HLC, T2 might pick an HLC numerically smaller than T1's, and a third reader in us-east-1 might observe T2's writes "before" T1's. The application sees a balance update get reverted, then re-applied; reconciliation fails by ₹ 47,18,920.
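The inversion is easy to reproduce with a toy HLC. The sketch below assumes a minimal two-field clock (physical milliseconds, logical counter); production encodings such as CockroachDB's differ, and the skew and time values are invented for illustration:

```python
# Toy HLC node: monotonic per node, but only as accurate as its wall clock.
class HLCNode:
    def __init__(self, skew_ms):
        self.skew_ms = skew_ms  # this node's wall-clock error vs true time
        self.pt = 0             # physical component of the last timestamp
        self.lc = 0             # logical counter, breaks ties

    def now(self, true_time_ms):
        wall = true_time_ms + self.skew_ms
        if wall > self.pt:
            self.pt, self.lc = wall, 0
        else:
            self.lc += 1
        return (self.pt, self.lc)

# T1 commits on a node running 4 ms fast; 6 ms of real time later, T2
# begins on a node running 4 ms slow. No message connects them.
fast, slow = HLCNode(skew_ms=+4), HLCNode(skew_ms=-4)
t1 = fast.now(true_time_ms=1000)  # T1's commit timestamp
t2 = slow.now(true_time_ms=1006)  # T2's begin timestamp, later in real time
print(t1, t2, t2 < t1)  # (1004, 0) (1002, 0) True
```

T2 sorts before T1 even though T1 finished 6 ms before T2 began in real time; a reader ordering by HLC sees exactly the inversion the reconciliation job tripped over.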

Spanner's authors named this property external consistency to distinguish it from linearizability (which is about a single object) and serializability (which is about a transaction order, not necessarily real-time-respecting). External consistency says: the database's transaction order respects the real-time partial order of begin/commit events as observed by clients. To preserve it, the database must assign timestamps such that commit_time(T1) < start_time(T2) whenever T1 actually finished before T2 started.

The naive way is a global timestamp service — a single Paxos-elected leader that hands out monotonically-increasing numbers. Every transaction does a round-trip to the leader. At Spanner scale, that leader becomes a bottleneck (millions of transactions per second, oceans between datacentres) and a single point of failure. The Spanner team rejected it.

The TrueTime way is to give every node a calibrated clock with bounded uncertainty, then write the database to embrace the uncertainty rather than pretend it doesn't exist. Each node's TT.now() returns [earliest, latest] where latest - earliest = 2 * epsilon and epsilon is the maximum drift since the last GPS/atomic synchronisation. Spanner's timestamp authority is not a process — it is the physical world, mediated by GPS satellites and caesium clocks.

[Figure: HLC versus TrueTime — point timestamps versus intervals, same three commits. Top row: three HLC point timestamps at t = 0, 5, 10, 14 ms, where T2 begins at t = 5 ms but picks HLC = 4 under skew, so T2 < T1 — externally inconsistent. Bottom row: three TrueTime intervals T1: [0, 4], T2: [5, 9], T3: [10, 14], strictly ordered because commit-wait closes each uncertainty window before the next transaction begins.]
Illustrative — HLC's point timestamps can place T2 before T1 if T2's node is behind on skew, even though T1 finished before T2 began in real time. TrueTime exposes uncertainty as an interval and Spanner inserts a commit-wait equal to the interval width, so T1's interval closes before T2's begins. Width drawn at ε ≈ 4 ms.

Why exposing uncertainty changes everything: if you treat the clock as a single number you must either trust it (and accept inversions under skew) or distrust it entirely (and fall back to logical clocks, losing real-time ordering). An interval is the only honest answer. The clock cannot tell you the exact time; it can tell you the time up to ε milliseconds. Once you accept that, you can wait out the uncertainty before committing, and the rest of the database design follows mechanically.

How TrueTime is built — GPS, atomic clocks, and time masters

TrueTime is implemented as a daemon (the timeslave) on every Spanner machine, which talks to a small number of time masters in the same datacentre. Each datacentre has two kinds of time masters: GPS time masters (with rooftop GPS antennas) and Armageddon masters (with rubidium or caesium atomic clocks). The two kinds exist so that a single failure mode — GPS spoofing, antenna failure, or a regional GPS outage — doesn't take down all time masters simultaneously. The atomic clocks are independent of GPS and drift only ~1 ms over weeks; the GPS masters re-sync to satellites every few seconds and have ε of <1 ms.

The timeslave on each machine polls multiple time masters every 30 seconds, computes a time interval representing its current uncertainty, and exposes that interval through TT.now(). Between polls, the interval widens linearly at the drift rate of the local quartz oscillator — typically 200 microseconds per second (200 ppm), giving 6 ms of growth in 30 seconds plus the masters' own ε. After a successful sync, ε drops back to a small value: in Google's published numbers, ε sawtooths between roughly 1 and 7 ms in steady state, with the daemon's polling-and-Marzullo logic compressing the interval back down at each refresh.
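The sawtooth is simple enough to sketch. The numbers below are the ones quoted above (a 200 ppm drift bound, roughly 1 ms of master ε); they are illustrative, not measured:

```python
# epsilon between polls: the masters' own uncertainty plus accumulated local drift.
def epsilon_ms(seconds_since_sync, master_eps_ms=1.0, drift_ppm=200):
    # 200 ppm = 200 microseconds of possible drift per second of elapsed time
    return master_eps_ms + seconds_since_sync * drift_ppm / 1000.0

for s in (0, 10, 30):
    print(f"{s:2d} s after sync: eps ~ {epsilon_ms(s):.1f} ms")
# prints 1.0 ms, 3.0 ms, 7.0 ms; the next successful poll resets the sawtooth
```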

TT.now()  →  [earliest, latest]
where  latest - earliest = 2 * ε  (current uncertainty)
       earliest ≤ true_time ≤ latest  (with overwhelming probability)

The TrueTime API exposes three methods: TT.now() returning the interval, TT.after(t) returning true iff t < TT.now().earliest (i.e., t is definitely in the past), and TT.before(t) returning true iff t > TT.now().latest (i.e., t is definitely in the future). Spanner uses TT.after for commit-wait and TT.before to verify that snapshot reads are reading a state that has fully committed.
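The semantics of the three calls reduce to two comparisons against the interval's endpoints. A minimal sketch (not Google's implementation; the interval values are made up):

```python
# One TrueTime interval [earliest, latest] with eps = 4 ms.
interval = (1000.0, 1008.0)

def tt_after(t, iv):
    return t < iv[0]   # t is definitely in the past

def tt_before(t, iv):
    return t > iv[1]   # t is definitely in the future

print(tt_after(999.0, interval))    # True: before the whole interval
print(tt_before(1010.0, interval))  # True: after the whole interval
print(tt_after(1004.0, interval))   # False: inside the interval
```

A timestamp that falls inside the interval is neither definitely past nor definitely future, which is precisely the case commit-wait exists to wait out.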

Why two independent time-source families (GPS + atomic) and not just GPS: GPS is vulnerable to spoofing and to regional outages (Google has experienced multi-hour GPS issues in single datacentres). Atomic clocks are independent of any external signal — a caesium oscillator's frequency is defined by physics. Pairing them means the timeslave can detect when the two disagree and refuse to use a corrupted source, falling back to whichever family is healthy. The composite uncertainty interval grows in proportion to the maximum of the two families' uncertainty, not the sum, because both are estimating the same physical quantity.

Commit-wait — the cost of external consistency

Spanner's commit protocol is the elegant part. When a read-write transaction is ready to commit, the coordinator picks a commit timestamp s and then waits until that timestamp is definitely in the past from every observer's perspective.

# spanner_commit.py — the commit-wait protocol in 60 lines
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # ms
    latest: float    # ms

class TrueTime:
    """Simulated TrueTime: ε grows between syncs, drops after sync."""
    def __init__(self, max_eps_ms=4.0):
        self.max_eps_ms = max_eps_ms
        # Pretend 30 s have passed since the last master sync, so ε starts
        # at its cap: the worst case just before the next poll. This is the
        # regime the sample run below assumes.
        self.last_sync = time.time() * 1000 - 30_000

    def now(self):
        # Real wall time, plus simulated growing uncertainty
        t = time.time() * 1000
        # ε grows linearly with time since last sync, capped
        elapsed = t - self.last_sync
        eps = min(self.max_eps_ms, 0.5 + elapsed * 0.0002)  # 200 ppm drift
        return TTInterval(t - eps, t + eps)

    def after(self, ts_ms):
        return self.now().earliest > ts_ms

class Spanner:
    def __init__(self, tt):
        self.tt = tt
        self.committed_writes = []  # (timestamp, key, value)

    def commit(self, key, value):
        # 1. Pick s = TT.now().latest — guaranteed >= true time now
        s = self.tt.now().latest
        # 2. Apply the write at timestamp s (made visible internally)
        self.committed_writes.append((s, key, value))
        # 3. COMMIT-WAIT: hold locks until TT.after(s) is true
        while not self.tt.after(s):
            time.sleep(0.0005)  # 0.5 ms poll
        # 4. Now safe to release locks and ack the client
        return s

    def read_at(self, ts_ms, key):
        # Return the most recent write to `key` with timestamp <= ts_ms
        candidates = [(t, v) for (t, k, v) in self.committed_writes
                      if k == key and t <= ts_ms]
        return max(candidates, key=lambda x: x[0]) if candidates else None

if __name__ == "__main__":
    tt = TrueTime(max_eps_ms=4.0)
    db = Spanner(tt)
    # Three transactions in real-time order: T1 then T2 then T3
    print(f"start: TT.now() = [{tt.now().earliest:.2f}, {tt.now().latest:.2f}]")
    s1 = db.commit("balance:riya", 12500)
    print(f"T1 committed at s={s1:.2f}, locks released")
    s2 = db.commit("balance:karan", 8400)
    print(f"T2 committed at s={s2:.2f}, locks released")
    s3 = db.commit("balance:asha", 31200)
    print(f"T3 committed at s={s3:.2f}, locks released")
    print(f"\nExternal consistency: s1 < s2 < s3 = {s1 < s2 < s3}")
    print(f"Gap (s2 - s1) = {s2 - s1:.2f} ms (≥ 2ε)")

Sample run:

start: TT.now() = [1714003814000.50, 1714003814008.50]
T1 committed at s=1714003814008.51, locks released
T2 committed at s=1714003814017.04, locks released
T3 committed at s=1714003814025.58, locks released

External consistency: s1 < s2 < s3 = True
Gap (s2 - s1) = 8.53 ms (≥ 2ε)

The output shows the three commits separated by ~8.5 ms each — slightly more than 2ε = 8 ms. That gap is exactly the commit-wait. T1 picks s = latest (which is ε = 4 ms ahead of true time). It then waits until TT.now().earliest > s; since earliest trails true time by another ε, that takes roughly 2ε = 8 ms — true time must first catch up to s, then advance far enough that even the pessimistic earliest clears it. When T2 starts, TT.now() has advanced past s1, so T2's latest is strictly greater than s1, and T2's commit-wait closes a non-overlapping window. By induction, every commit timestamp is greater than every prior commit's timestamp, even across machines that never exchanged a message — the only thing that matters is that wall-clock time is moving forward and ε is bounded.

Why picking s = latest (not earliest) is critical: if Spanner picked s = earliest, then TT.after(s) would be true immediately (because earliest < true_time already), and there would be no wait — but s would be in the past, and a later transaction starting in real time could pick a smaller timestamp under skew. Picking latest guarantees s ≥ true_time, and waiting until earliest > s guarantees true_time > s from every observer's perspective. The wait length is exactly 2ε, which is why Google invests so heavily in keeping ε small — every millisecond saved is a millisecond off every commit's tail latency.

Why this gives external consistency without any cross-region message exchange: when T1's commit-wait finishes, TT.after(s1) holds on T1's coordinator, so true time itself has passed s1. If T2 begins at any later real time, true time is still past s1 — and every clock's TT.now().latest is, by the interval guarantee, an upper bound on true time, so T2 picks s2 = latest > s1. The proof is local to each transaction — no coordination required between regions. This is the crucial property: external consistency is not paid for with a global Paxos round.

A production story — Spanner deployment at Google and the ε budget

Spanner runs across Google's global fleet. The published numbers (Corbett et al., OSDI 2012, plus follow-up papers) show ε in the 1–7 ms range across thousands of machines, with an average around 4 ms. The timeslave daemon uses a variant of Marzullo's algorithm to fuse readings from multiple time masters and reject outliers; it polls every 30 seconds and tightens the interval each cycle. Google's investment in this: rooftop GPS antennas at every datacentre, atomic-clock backups in case of GPS spoofing, low-latency network paths between time masters and serving machines, and a custom synchronisation protocol that achieves roughly 10× tighter ε than commodity NTP.

CockroachDB and YugabyteDB looked at this and made a different choice. CockroachDB's HLC layer assumes ε ≈ 500 ms and relies on commodity NTP — no GPS required, runs on AWS, GCP, on-premises, anywhere. The trade is real: CockroachDB does not offer external consistency by default. It offers serializable isolation within a cluster, with a configuration flag (linearizable=true) that adds a commit-wait step similar to Spanner's, sized to max_offset (500 ms by default — much larger than Spanner's 4 ms). At PaySetu, the SRE team enables linearizable=true for the balance-reconciliation pipeline and disables it for low-stakes analytical queries — paying the 500 ms tail latency only where it's load-bearing.

Aditi, the SRE who once shipped a transaction system on top of plain HLC and got bitten by external-inconsistency bugs, runs the math: at 500 ms commit-wait, throughput per single-row hot key drops to 2 commits/sec/key. For PaySetu's low-frequency reconciliation jobs that's fine. For the user-facing payment write-path, that would destroy latency budgets. Spanner's 4 ms ε means a ~8 ms commit-wait, or about 125 commits/sec/key — nearly two orders of magnitude better, but still too slow for hot keys. Spanner's solution is to shard hot keys at the application level, which Spanner-native customers (AdWords, Photos) already do for unrelated reasons.

[Figure: bar chart — single-hot-key commit throughput vs ε (commit-wait dominates, throughput = 1/(2ε)): ε = 1 ms → 500/s (Spanner best); ε = 4 ms → 125/s (Spanner typical); ε = 50 ms → 10/s (tight NTP); ε = 500 ms → 2/s (CRDB default).]
Illustrative — single-hot-key commit throughput is bounded by `1 / (2ε)` because each commit holds locks for a commit-wait equal to `2ε`. Spanner's hardware investment in GPS+atomic clocks pays for itself in two orders of magnitude of throughput compared with CockroachDB's commodity-NTP `max_offset = 500 ms` setting.

The deeper lesson: TrueTime's hardware budget is not exotic. A pair of GPS antennas and a rubidium clock cost a few thousand dollars per datacentre. The expensive part is the engineering: writing a timeslave that handles GPS outages, atomic-clock drift, master failover, and the Marzullo fusion correctly, and writing the database around an interval API rather than a point API. Most companies will not build TrueTime; they will use Spanner, or they will accept HLC's weaker semantics, or they will invest in an HLC-plus-uncertainty hybrid (CockroachDB's linearizable=true mode, MongoDB's $clusterTime with bounded drift, YugabyteDB's HLC with explicit skew detection). The hybrids are the practical middle ground.

Going deeper

The Marzullo algorithm and why fusion matters

The TrueTime daemon polls multiple time masters and must combine their readings into a single interval. The naive answer (intersect all the intervals) is wrong: a single misbehaving master with a tight interval that doesn't actually contain true time would shrink the result to an incorrect range. The Marzullo algorithm (Keith Marzullo, 1984) and its refinement by Mills (the intersection algorithm in NTPv4) handle this: take the smallest interval that contains the majority of the input intervals. If five masters report [a₁, b₁], …, [a₅, b₅] and three of them overlap in [A, B], the result is [A, B] — the assumption being that no more than half the masters are misbehaving. The algorithm is O(n log n) and runs every 30 seconds in the timeslave; it is the same algorithm NTP uses, with the difference that TrueTime applies it to GPS+atomic masters rather than internet NTP servers.
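The fusion step fits in a few lines. The sketch below is one way to implement Marzullo's sweep over interval endpoints, under the same no-more-than-half-misbehaving assumption; it is not Spanner's code, and the sample readings are invented:

```python
def marzullo(intervals):
    """Smallest interval consistent with the largest number of sources."""
    events = []
    for lo, hi in intervals:
        events.append((lo, -1))  # interval opens (-1 sorts before +1)
        events.append((hi, +1))  # interval closes
    events.sort()
    best = count = 0
    best_lo = best_hi = None
    for i, (offset, kind) in enumerate(events):
        if kind == -1:
            count += 1
            if count > best:  # a new deepest overlap starts here...
                best, best_lo = count, offset
                best_hi = events[i + 1][0]  # ...and ends at the next endpoint
        else:
            count -= 1
    return best_lo, best_hi, best

# Three honest masters overlap in [10.2, 10.4]; a fourth reports a tight
# but wrong interval around 11.0 and is outvoted.
print(marzullo([(10.0, 10.4), (10.1, 10.5), (10.2, 10.6), (11.0, 11.1)]))
# (10.2, 10.4, 3)
```

Note how the liar's tight interval is simply ignored: naive intersection would have returned an empty (and wrong) result, while the majority-overlap rule returns the honest consensus.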

Commit-wait optimisation — overlap with replication

A transaction's total commit latency is replicate_phase + commit_wait, but Spanner overlaps them: the coordinator picks s and starts the Paxos accept phase, and the commit-wait runs in parallel with the Paxos round-trip. By the time the Paxos majority acks, several milliseconds have passed; the remaining commit-wait is max(0, 2ε - paxos_rtt). For cross-region writes where paxos_rtt is 50–200 ms, commit-wait is effectively free — the round-trip subsumes it. For single-region writes where paxos_rtt is 1–2 ms, commit-wait dominates. This is why TrueTime's engineering pays back disproportionately for low-latency single-region workloads, where you might otherwise expect HLC to suffice.
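The overlap arithmetic in one line (the ε and RTT values are the assumed figures from the paragraph above):

```python
# Visible commit-wait once it runs in parallel with the Paxos round-trip.
def residual_commit_wait_ms(eps_ms, paxos_rtt_ms):
    return max(0.0, 2 * eps_ms - paxos_rtt_ms)

print(residual_commit_wait_ms(4, 150))  # 0.0 (cross-region: fully hidden)
print(residual_commit_wait_ms(4, 1.5))  # 6.5 (single-region: wait dominates)
```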

What HLC-with-uncertainty hybrids look like (CockroachDB's linearizable=true)

CockroachDB exposes an opt-in mode, --linearizable=true (set at the cluster level), that adds a commit-wait of max_offset to every transaction's commit. Internally it uses HLC, not an interval API, but the commit-wait sized to max_offset gives a similar guarantee under the skew assumption. The difference from Spanner: ε is conservative-and-static (500 ms by default) rather than measured-and-dynamic (1–7 ms). The math works out: under the assumption that no clock skews more than max_offset, TT.now().latest = HLC + max_offset and TT.now().earliest = HLC - max_offset work as a software-only TrueTime substitute. The 500-ms cost is the price of not having GPS+atomic hardware. YugabyteDB's --max_clock_skew_usec parameter is the same idea, defaulting to 500,000 microseconds (also 500 ms) but tunable down to 50–100 ms with high-quality NTP. Mehdi et al., "I Can't Believe It's Not Causal! Scalable Causal Consistency with No Slowdown in Partially Replicated Systems" (NSDI 2017), explores a related use of loosely synchronised physical clocks to get consistency without slowing down writes.
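A minimal software-only substitute in this style. The class and parameter names are invented for illustration; only the idea (a static, conservative interval around a scalar clock) is taken from the systems above, not either vendor's actual API:

```python
import time

class StaticUncertaintyClock:
    """Wrap a scalar clock in a fixed +/- max_offset interval."""
    def __init__(self, max_offset_ms=500.0):
        self.max_offset_ms = max_offset_ms  # assumed bound on cluster skew

    def now(self):
        t = time.time() * 1000
        return (t - self.max_offset_ms, t + self.max_offset_ms)

    def after(self, ts_ms):
        earliest, _ = self.now()
        return earliest > ts_ms  # ts_ms is definitely in the past

clock = StaticUncertaintyClock(max_offset_ms=50.0)  # tuned-down NTP setting
earliest, latest = clock.now()
print(latest - earliest)  # ~100.0 ms: the static 2ε every commit must wait out
```

Because max_offset never shrinks, the commit-wait is the full static width on every commit; that is the throughput cost Spanner's dynamic, measured ε avoids.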

The "TrueTime is just an API" lesson

The deepest engineering insight in TrueTime is the API shape. Every database before Spanner exposed clocks as scalar values: now() returns int64. Every consumer of those clocks pretended the value was exact and silently broke under skew. Spanner's TT.now(), by returning an interval, forced every consumer to confront uncertainty at the type level — there is no "exact time" in the API, so there is no way to write code that pretends. This is a type-system enforcement of correct behaviour: by removing the cheap-but-wrong abstraction, the API makes wrong code harder to write than right code. Modern systems borrowing from TrueTime — FoundationDB's versionstamps, Aurora DSQL's distributed snapshot timestamps, even some Kafka transactional designs — all expose intervals or version-vectors rather than scalars, partly because the TrueTime experience showed how much hidden bug surface scalar timestamps create.

Reproduce on your laptop

# Reproduce the Spanner-style commit-wait simulation
python3 spanner_commit.py

# Sweep ε from 1ms to 50ms; observe throughput collapse
python3 -c "
from spanner_commit import TrueTime, Spanner
import time
for eps in [1, 4, 10, 50]:
    tt = TrueTime(max_eps_ms=eps)
    db = Spanner(tt)
    start = time.time()
    for i in range(20):
        db.commit(f'k{i}', i)
    elapsed = time.time() - start
    tps = 20 / elapsed
    print(f'ε={eps}ms: 20 commits in {elapsed*1000:.1f}ms → {tps:.1f} TPS')
"

Where this leads next

TrueTime sits at the apex of Part 3's clock theory. Its key implication — that exposing uncertainty is more honest than pretending precision — propagates into nearly every later chapter.

The unifying takeaway: when you find yourself reasoning about clocks across regions, ask whether your design needs external consistency (real-time-respecting global order) or only serializability (some valid order). If only the latter, HLC is enough and the hardware budget evaporates. If the former — financial reconciliation, regulatory audit trails, distributed locks with real-time semantics — you are paying for TrueTime or one of its software-only hybrids. There is no free version.

References

  1. Corbett et al., "Spanner: Google's Globally-Distributed Database" — OSDI 2012. The canonical TrueTime + Spanner paper. §3 defines the TrueTime API; §4 derives commit-wait and external consistency.
  2. Brewer, "Spanner, TrueTime and the CAP theorem" — Google Research 2017. Eric Brewer reflects on how Spanner's TrueTime relates to his own CAP theorem and to PACELC.
  3. Marzullo, "Maintaining the Time in a Distributed System" — Operating Systems Review 1985. The fusion algorithm TrueTime uses to combine multiple time-master readings into a single interval.
  4. Kulkarni et al., "Logical Physical Clocks" — OPODIS 2014. The HLC paper; references TrueTime as the hardware-backed alternative HLC trades against.
  5. CockroachDB linearizable mode and clock-offset enforcement — production HLC-with-uncertainty hybrid.
  6. Meta's Time Appliance Project — PTP-based open-source path toward TrueTime-quality time service in commodity datacentres.
  7. Hybrid logical clocks — internal cross-link to the chapter that motivates TrueTime's hardware investment.