Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Google: the "what makes their stack tick" distillation
It is 23:40 IST, and Anjali — a staff engineer at PaySetu evaluating which open-source pieces to glue together for the company's next-generation ledger — has thirteen tabs open on Google's research site. The GFS paper from 2003. MapReduce from 2004. Chubby from 2006. Bigtable from 2006. Spanner from 2012. The Borg paper from 2015. Each one is famous in its own right; each one has at least one open-source descendant (HDFS, Hadoop, ZooKeeper, HBase, CockroachDB, Kubernetes); each one is, on its own, a complete system. The thing that has been bothering Anjali for an hour is that none of those papers, read in isolation, explains why Google's stack works as a stack. The interesting question — and the one this chapter answers — is not "what is Spanner?" but "why was Spanner only buildable inside Google?". The answer turns out to be small, almost embarrassing: a handful of shared assumptions that every layer commits to, and a handful of cross-layer contracts that turn six independent papers into one machine.
Google's distributed stack is not the sum of GFS + MapReduce + Bigtable + Chubby + Borg + Spanner. It is those layers plus a tiny set of cross-cutting commitments — every storage system trusts a single global lock service, every scheduler trusts a single cluster-wide namespace, every database trusts a single global time API, every workload runs unprivileged on a shared fleet, and every layer assumes machines are cattle. Read in isolation, each paper looks like a brilliant point solution. Read together, they reveal a layered architecture where each new system was buildable only because the layer below had already paid for the hardest abstraction. This chapter distils the cross-layer contracts that actually make the stack tick.
The layers — and the shared bones beneath them
The famous papers are easy to enumerate, but the pattern across them is what matters. Each paper sits on a layer below it whose hardest problem is already solved, and the new paper is consequently free to ignore that problem entirely. GFS could treat disk failure as routine rather than exceptional because its design assumed commodity disks would die and re-replicated chunks aggressively. MapReduce could handle worker failures with blind re-execution because GFS already gave it durable input and output. Bigtable could ignore distributed locking because Chubby already existed. Spanner could ignore clock uncertainty as a software problem because TrueTime made it a hardware problem.
Why this layering matters more than any single paper: outside Google, every replacement (HBase, etcd, Kubernetes, Hadoop) reproduces one layer's mechanism, but rarely the contracts. HBase replaced Bigtable's storage but ZooKeeper-the-Chubby-clone is not as deeply integrated as Chubby was — applications often use ZooKeeper for some coordination and Consul for others. Kubernetes replaced Borg but the namespace is per-cluster, not per-cell-per-fleet. Without the cross-cutting contracts, the descendants miss the part that made the originals composable. Anjali's PaySetu stack will replicate this drift unless she chooses one lock service, one scheduler, and one time-source up front and forces every team to commit to them — which is the political part, and the part the papers never talk about.
Contract A: a single global namespace, owned by Chubby
The first thing every Google service does at startup is resolve names through a /cells/<cell-name>/<service>/<job>/<task> path that is served by Chubby. Chubby is famous for being a distributed lock service, but its more important role is being the namespace: the place where every service publishes its current leader, every job publishes its current task list, and every config knob is keyed. The Chubby paper itself is candid about this — Mike Burrows wrote that the lock-service framing was a slight rationalisation, that what Google actually needed was a small, reliable, name-and-meta-data store that happened to also support advisory locks. The lock service was a side feature; the namespace was the product.
The cross-layer payoff is that every layer above Chubby uses the same naming convention, and consequently the same failure-discovery mechanism. When a Bigtable tablet server fails, its lease in Chubby expires; when a MapReduce master fails, its session in Chubby evaporates; when a Borg job is rescheduled, its task entries in Chubby are rewritten. There is one place to look for "who is the leader of X right now?" — and the answer always comes from the same RPC. Why a single namespace beats federated namespaces: federated namespaces require every system to know how to translate names across federations, which means the failure-handling code has to know about each federation's quirks. A single namespace means the failure-handling code has to know one quirk. The "single point of failure" objection is valid in theory but Chubby is replicated across 5 nodes per cell and has had multi-year availability records measured in five-9s — the centralisation is paid for by Paxos, not by hand-waving.
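A minimal sketch makes the lease half of this contract concrete. The TinyNamespace below is not Chubby's real API (the class, the path shape, and the lease numbers are invented for illustration), but it captures the one behaviour every layer above Chubby leans on: a name resolves only while its holder's lease is live, so "the leader died" and "the name disappeared" are the same observation, made through the same call.
# tiny_namespace.py — illustrative only; not Chubby's API or semantics.
class TinyNamespace:
    """In-memory stand-in for a Chubby cell: path -> (value, lease deadline)."""
    def __init__(self):
        self.entries = {}

    def acquire(self, path, value, lease_s, now):
        holder = self.entries.get(path)
        if holder is not None and holder[1] > now:   # a live lease blocks acquisition
            return False
        self.entries[path] = (value, now + lease_s)
        return True

    def renew(self, path, lease_s, now):
        value, _ = self.entries[path]                # healthy holders renew constantly
        self.entries[path] = (value, now + lease_s)

    def resolve(self, path, now):
        holder = self.entries.get(path)
        if holder is None or holder[1] <= now:       # expired lease == dead holder
            return None
        return holder[0]

ns = TinyNamespace()
path = "/cells/ia/bigtable/master"                   # illustrative path shape
assert ns.acquire(path, "task-17", lease_s=10, now=0)
ns.renew(path, lease_s=10, now=5)                    # lease now runs to t=15
print(ns.resolve(path, now=12))                      # -> task-17 (lease live)
# task-17 crashes after t=5 and never renews; past t=15 the name is simply gone
print(ns.resolve(path, now=16))                      # -> None: same signal for every layer
assert ns.acquire(path, "task-4", lease_s=10, now=16)  # a replacement takes over
print(ns.resolve(path, now=17))                      # -> task-4
Real Chubby adds consensus underneath (so the namespace survives replica failure) and notifications on top (so watchers learn about expiry without polling), but the lease-scoped name is the contract itself.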
In practice, this is also why Chubby outages are catastrophic: when Chubby slows down, every layer above it slows down. Google's published incident analyses describe Chubby cells whose availability is measured in five-9s; the rare incidents that did affect Chubby cascaded through every dependent system within seconds. This is the price of the contract — a single namespace is a single dependency. The Roblox 73-hour outage in 2021 (a Consul cascade) is the open-source-shaped version of the same risk: Consul was playing Chubby's role, and the dependency graph turned out to be too dense for a single coordination service to absorb a slow-storage incident.
Contract B: TrueTime — make uncertainty explicit, not hidden
Spanner's most-cited contribution is external consistency — the guarantee that if transaction T1 commits before T2 starts (in real wall-clock time), then T2's timestamp is greater than T1's. Outside Google, the standard way to "achieve" this is to either (a) lie about it, or (b) require an explicit sequencer. Spanner does it by reading a global clock — TrueTime — that returns not a single timestamp but an interval [earliest, latest], with bound latest - earliest = 2ε where ε is roughly 4–7 ms. The commit protocol then waits out the uncertainty: the leader picks the commit timestamp s = TT.now().latest and deliberately delays the acknowledgement until TT.after(s) holds, a wait on the order of 2ε. That wait guarantees s lies in the past in absolute time, so any transaction that starts after the acknowledgement receives a strictly greater timestamp.
# truetime_simulation.py
# Tiny simulation of why "wait out the uncertainty" gives external consistency.
# Two clients commit transactions in sequence; we measure how often the second
# transaction's commit timestamp is correctly ordered after the first's.
import random

EPSILON_MS = 6.0  # TrueTime worst-case bound — Spanner paper reports ~4-7 ms

def tt_now(true_time_ms):
    # Returns an (earliest, latest) interval guaranteed to contain the true time.
    skew = random.uniform(-EPSILON_MS, EPSILON_MS)
    measured = true_time_ms + skew
    return (measured - EPSILON_MS, measured + EPSILON_MS)

def commit_naive(true_time_ms):
    # Strategy 1: pick the interval midpoint, acknowledge immediately.
    # Returns (commit timestamp, true time at which the client sees the ack).
    earliest, latest = tt_now(true_time_ms)
    return (earliest + latest) / 2.0, true_time_ms

def commit_spanner(true_time_ms):
    # Strategy 2: pick the interval's latest as the timestamp, then commit-wait:
    # the leader sleeps until a fresh tt_now().earliest exceeds the timestamp,
    # so the timestamp is already in the past in absolute time when the ack
    # goes out. Here the sleep is simulated by advancing true time in steps.
    _, s = tt_now(true_time_ms)
    t = true_time_ms
    while tt_now(t)[0] <= s:  # wait until TT.after(s) holds
        t += 0.1
    return s, t

def run(strategy, n=20000):
    violations = 0
    t = 0.0
    for _ in range(n):
        ts1, t = strategy(t)            # T1 commits; t is now T1's ack time
        t += random.uniform(0.5, 3.0)   # T2 starts a real-time gap after the ack
        ts2, t = strategy(t)
        if ts2 <= ts1:                  # external-consistency violation
            violations += 1
        t += random.uniform(0.5, 3.0)   # gap before the next pair
    return violations / n

if __name__ == "__main__":
    random.seed(7)
    print(f"Naive (midpoint, no wait):       {run(commit_naive)*100:.2f}% violations")
    print(f"Spanner (commit-wait to latest): {run(commit_spanner)*100:.4f}% violations")
Sample run on a CricStream analysis box (the naive figure varies a little from run to run; the Spanner figure does not):
Naive (midpoint, no wait):       36.66% violations
Spanner (commit-wait to latest): 0.0000% violations
Walkthrough: the naive strategy picks the midpoint of the TrueTime interval and acknowledges immediately. Because two consecutive transactions can have overlapping intervals (the second's earliest may be smaller than the first's midpoint), over a third of consecutive pairs end up with timestamps in the wrong order when the real-time gap (0.5–3 ms here) is small against the 12 ms interval width — external-consistency violations. The Spanner strategy picks the latest of the interval and waits until a fresh tt_now().earliest exceeds it; by construction the timestamp is then in the past in absolute time, no later transaction can produce a timestamp ≤ it, and violations drop to zero. The cost is the commit-wait, on the order of 2ε per commit (roughly 8–14 ms at these bounds). Why this cost is acceptable: Spanner serves OLTP workloads where the commit latency budget is dozens of milliseconds. A ~10 ms commit-wait is a large slice of the budget — painful but tolerable. The alternative is no global consistency, which means every cross-region transaction needs application-level reconciliation, which costs orders of magnitude more in engineering time. Spanner's bet — pay roughly ten milliseconds of latency to save the engineering team from reconciliation — has been validated by every system since (CockroachDB, YugabyteDB) that copies the design with looser bounds and a larger effective ε.
The deeper lesson is that Spanner is only buildable when the time service is below it in the stack. CockroachDB cannot use TrueTime — there is no GPS+atomic-clock fleet in EC2 — so it falls back to Hybrid Logical Clocks, which give a weaker guarantee (causality, not external consistency) and a larger uncertainty bound. The hardware investment Google made in deploying GPS receivers and atomic clocks in every datacenter is the part the open-source descendants cannot easily replicate, which is why "Spanner-like" systems outside Google are always one notch weaker on consistency or one notch slower on commits.
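To see what the fallback buys and loses, here is a minimal hybrid logical clock in the style of the HLC scheme that CockroachDB-like systems substitute for TrueTime. This is an illustrative sketch, not CockroachDB's code: the merge rule preserves causal order across nodes with arbitrarily skewed physical clocks, but nothing in it bounds uncertainty, which is exactly why two transactions that never exchange messages can still commit in the "wrong" real-time order.
# hlc_sketch.py — a minimal hybrid logical clock; illustrative, not CockroachDB's.
import time

class HLC:
    def __init__(self):
        self.wall = 0      # highest physical timestamp seen so far (ms)
        self.logical = 0   # tie-breaker counter for same-millisecond events

    def now_ms(self):
        return int(time.time() * 1000)   # an ordinary, possibly-skewed clock

    def tick(self):
        # Local or send event: move strictly past everything seen so far.
        pt = self.now_ms()
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, remote):
        # Receive event: merge the sender's timestamp to preserve causality.
        pt, (rw, rl) = self.now_ms(), remote
        if pt > self.wall and pt > rw:
            self.wall, self.logical = pt, 0
        elif rw > self.wall:
            self.wall, self.logical = rw, rl + 1
        elif rw == self.wall:
            self.logical = max(self.logical, rl) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

a, b = HLC(), HLC()
b.wall = a.now_ms() + 50_000   # node B's physical clock is 50 s ahead; HLC absorbs it
t1 = a.tick()                  # A writes
t2 = b.update(t1)              # B reads A's write, then writes its own
assert t2 > t1                 # causal chain ordered correctly despite the skew
print("t1 =", t1, "  t2 =", t2)
The skew never violates causal order, but there is no equivalent of commit-wait: HLC cannot promise that a transaction on an unrelated node received a larger timestamp just because it committed later on the wall clock.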
Contract C: workloads run unprivileged on a shared fleet
Borg's published numbers are striking: a single Borg cell holds tens of thousands of machines, and a single machine routinely runs dozens of unrelated workloads from different teams, isolated by Linux cgroups (Google's contribution to the kernel) and chroot. There is no SSH into a Borg machine for application teams. There is no per-team fleet. There is no "production" vs "staging" cluster as separate hardware — production and batch share the same machines, with priority bands ensuring batch is preempted when production needs the cores. The cross-layer payoff is that the fleet itself is the abstraction, and every system above Borg writes code as if it has access to N fungible cores rather than N specific machines.
This contract has two consequences that surprise teams attempting to copy the model:
- Every workload must tolerate preemption. A batch MapReduce task can be killed at any moment because a higher-priority Spanner replica needs its CPU. Tasks must checkpoint frequently and recover automatically. The MapReduce paper's emphasis on "speculative execution" and idempotent task design is not primarily about machine failures — it is about scheduler-driven preemption, which is far more common than hardware failure inside Borg.
- Every workload must declare its resource shape up front. Borg refuses to schedule a job that does not declare CPU, memory, disk, and tail-latency budgets. This forces every team to think about resource cost at design time, which prevents the runaway-job pathology that plagues SSH-able fleets. The cost is genuine engineering friction — a Borg job spec is a declarative config written in BCL, Borg's configuration language, not a one-line docker run — but the savings in fleet utilisation are large enough that Google's published cluster utilisation has been measured at 50–70% (vs 10–20% for typical SSH-managed fleets).
PaySetu's Kubernetes deployment is technically the open-source equivalent, but the typical PaySetu Kubernetes cluster runs at 25% utilisation because half the namespaces are allocated to single teams, and team A's slack capacity is not made available to team B's spike — exactly the problem Borg's priority bands were invented to solve. Bridging the gap requires both the technology (cgroups, priority preemption) and the organisational commitment to a shared fleet, which is the part Google paid for in 2003 and most companies have not.
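A toy model, with invented workload shapes and fleet sizes, shows why the priority bands matter. Two teams each own a spiky production service; the partitioned fleet gives each team machines sized for its own peak, while the shared fleet lets preemptible batch work soak up whatever production is not using at that moment:
# fleet_sharing_toy.py — invented numbers, purely illustrative of the gap.
import random

random.seed(42)
TEAMS, CORES_PER_TEAM, HOURS = 2, 100, 1000
BATCH_BACKLOG = 80  # cores of preemptible batch work available every hour

def prod_demand():
    # Spiky production load: usually modest, occasionally near the 95-core peak.
    return 95 if random.random() < 0.05 else random.randint(5, 25)

total_cores = TEAMS * CORES_PER_TEAM
partitioned_busy = shared_busy = 0
for _ in range(HOURS):
    prod = sum(prod_demand() for _ in range(TEAMS))
    partitioned_busy += prod            # each team's slack is stranded
    # Shared fleet: prod takes what it needs; batch soaks up the rest and is
    # preempted the moment prod spikes (batch never blocks prod here).
    shared_busy += prod + min(BATCH_BACKLOG, total_cores - prod)

print(f"partitioned fleet utilisation: {partitioned_busy / (total_cores * HOURS):.0%}")
print(f"shared fleet utilisation:      {shared_busy / (total_cores * HOURS):.0%}")
With these made-up numbers the partitioned fleet lands around 20% and the shared fleet around 60%; the exact figures are arbitrary, but the shape is the point: preemptible batch converts stranded production slack into useful work, which is the mechanism behind the utilisation gap described above.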
Common confusions
- "Google's stack is special because of the papers" — the papers describe the systems, not the cross-cutting contracts. Reading every paper still leaves the question "why did this work as a stack?" unanswered. The contracts are the answer, and they live in internal docs, not external papers.
- "Open-source descendants give you Google's architecture" — they give you each layer's mechanism, but the contracts are what made the layers compose. ZooKeeper plays Chubby's role in some places and Consul plays it in others; HDFS replaces GFS but Kubernetes uses etcd, not HDFS, for cluster state; Spanner-likes (CockroachDB, YugabyteDB) use HLCs, not TrueTime. The ecosystem is fragmented because no single open-source vendor controls all four contracts.
- "TrueTime is just NTP plus error bars" — NTP gives you a single timestamp with an implicit error of ~10–100 ms; TrueTime gives you an explicit interval with ε bounded at roughly 4–7 ms, sourced from GPS receivers and atomic clocks installed in every datacenter, polled every 30 seconds with a redundancy protocol. The hardware investment is real and ongoing, and "NTP plus a hand-wave about epsilon" does not give external consistency.
- "Borg and Kubernetes are interchangeable" — Kubernetes inherited Borg's core idea (declarative job specs, cgroups isolation) but little of Borg's priority-band scheduling, resource-shape declarations, or preemption discipline. The "Borg, Omega, and Kubernetes" lineage paper is explicit that Kubernetes deliberately omitted features that did not generalise; the omitted features are exactly the ones that drive Borg's high utilisation.
- "Chubby is just Paxos in a box" — Paxos is the consensus algorithm; Chubby is Paxos plus a namespace, plus a session-and-lease mechanism, plus a small-file store, plus a notification API. The naming and notification parts are what every layer above Chubby actually uses; the consensus is the foundation but rarely the API.
- "Google's stack is too unique to learn from" — the implementations are unique because of the hardware investment; the contracts are not. A team that picks one lock service, one time API, one scheduler, and forces all internal systems to use them has built a small Google. Most teams refuse to make those choices, which is why their stacks fragment.
Going deeper
The papers, in dependency order — and what each one assumes about the layer below
- GFS (2003) assumes Linux + Ethernet + commodity disks. Every other Google paper assumes GFS or its successor Colossus underneath, even when it doesn't say so.
- MapReduce (2004) assumes GFS for input and output, and a control-plane scheduler (the prototype scheduler that became Borg). The reason MapReduce could be so simple is that GFS handled durability and the scheduler handled task placement.
- Chubby (2006) assumes Borg for its own deployment (the 5 Chubby replicas run as Borg jobs) but presents itself as the namespace and lock service for everything else. This is the cyclic dependency the team had to break: Borg needs Chubby for leader election, Chubby runs on Borg.
- Bigtable (2006) assumes GFS for tablet storage, Chubby for metadata and master election, and Borg for tablet-server placement. Bigtable could not have been written without all three below it.
- Spanner (2012) assumes everything Bigtable assumed, plus TrueTime. The TrueTime API itself is a separate paper-worthy artefact (described in a section of the Spanner paper) that required new datacenter hardware.
- Borg (published 2015, roughly a decade after its real introduction) assumes nothing above it but defines the runtime contract every other system uses; the EuroSys paper is a retrospective on a system that had already been in production for years.
The cyclic dependency between Borg and Chubby was solved with a bootstrap: Chubby's first replica is started by hand at cell-creation time, then Borg uses it for leader election, then Chubby's other replicas are started as Borg jobs. This is the kind of detail no paper mentions but every internal doc spends a paragraph on.
Why the published utilisation numbers are the headline result
Google's 2015 Borg paper shows machine utilisation in the 50–70% range. The same paper shows that running production and batch on the same machines (with priority preemption) doubles utilisation compared to dedicating machines to either class. This is the financial argument for the entire stack: at Google's scale, doubling utilisation saves billions of dollars per year in hardware. Every cross-cutting contract — single namespace, shared fleet, declarative resource specs, priority bands — is partially justified by this number. The contracts are not a research aesthetic; they pay for themselves in capex savings. PaySetu, BharatBazaar, and CricStream-scale Indian companies could capture similar savings if they were willing to make similar commitments — the gap is not technology, it is organisational discipline around shared fleets.
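The arithmetic behind "doubling utilisation saves billions" is worth making explicit; every figure below is invented for illustration:
# utilisation_savings.py — back-of-envelope with hypothetical figures only.
machines         = 1_000_000   # fleet size (invented)
cost_per_machine = 3_000       # $/year amortised: hardware, power, space (invented)
util_before, util_after = 0.30, 0.60

# The same useful work needs half the machines at double the utilisation.
machines_after = machines * util_before / util_after
savings = (machines - machines_after) * cost_per_machine
print(f"machines avoided: {machines - machines_after:,.0f}")   # 500,000
print(f"annual savings:   ${savings / 1e9:.1f}B")              # $1.5B, at these made-up numbers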
Where the stack is now — Borg → Omega → Kubernetes, GFS → Colossus, Bigtable → Spanner
The stack continued to evolve after the famous papers. Borg got an experimental successor called Omega (2013) that explored optimistic concurrency for scheduling at higher scale, then Borg absorbed Omega's lessons. GFS was replaced by Colossus, which sharded the metadata layer Bigtable-style to scale beyond GFS's single-master design. Bigtable's mutable rows + flexible schema influenced Spanner's schema layer, but Spanner added a SQL-and-transactions front end that subsumed many internal Bigtable use cases. The ongoing pattern is that each system gets its hardest scaling limit relaxed, but the four cross-cutting contracts (namespace, failure model, time, isolation) survive every revision because they are what make the next system buildable in the first place. This is the part that suggests the contracts are the durable artefact, not the systems.
Reproduce the TrueTime simulation on your laptop
python3 -m venv .venv && source .venv/bin/activate
python3 truetime_simulation.py
# To extend: vary EPSILON_MS from 1 to 50 and plot the naive strategy's
# violation rate against ε. The rate is zero while 2ε is below the minimum
# inter-transaction gap, climbs steeply once the intervals start overlapping,
# and saturates near 50% as ε dwarfs the gap. This is the curve every
# Spanner-like system trades off against.
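A sketch of that sweep, assuming the truetime_simulation.py above sits in the same directory (its prints are guarded behind __main__, so importing it is side-effect-free):
# sweep_epsilon.py — hypothetical helper script, not part of the original file.
import truetime_simulation as sim

for eps in (1, 2, 5, 10, 20, 50):
    sim.EPSILON_MS = float(eps)   # module-level knob read by tt_now() on every call
    rate = sim.run(sim.commit_naive)
    print(f"eps = {eps:>2} ms  ->  naive violation rate {rate * 100:5.1f}%")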
Where this leads next
The next chapters of Part 20 dive into specific dimensions of Google's stack and parallel systems at other companies that made different choices:
- Amazon: cells, shuffle-sharding, isolated fates — Amazon optimised for blast-radius bounding, whereas Google optimised for global consistency. The two stacks made opposite calls in the same trade-off space.
- Meta: scaling the social graph — Meta's TAO sits in the same niche as Bigtable but was tuned for a fundamentally different access pattern (read-mostly graph traversals rather than column-family scans).
- Netflix: resilience culture — Netflix runs on AWS, so it inherits Amazon's contracts; the Netflix story is about operational discipline on a borrowed stack rather than building one from scratch.
The thread running through Part 20 is that case studies are not biographies of systems; they are arguments about which contracts are worth committing to, paid for in real engineering time and real capex. Google's bet on the four contracts described here paid off because the company committed early and never wavered. Subsequent companies — Amazon, Meta, Netflix — read Google's papers, made different bets, and ended up with stacks that look different not because the engineers were less capable but because the contract choices were different.
This chapter is the foundation of Part 20 in the same way that "every system is unique — study the real ones" was the foundation of the chaos-engineering chapters: it reframes the rest of the part. Read each subsequent case study with the question "which cross-cutting contracts did this organisation commit to?" — the answer is more important than the surface architecture.
References
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, "The Google File System" (SOSP 2003) — the foundation paper, explicit about commodity-hardware assumptions.
- Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004) — built directly on GFS.
- Mike Burrows, "The Chubby lock service for loosely-coupled distributed systems" (OSDI 2006) — candid about the namespace-vs-lock-service framing.
- Fay Chang et al., "Bigtable: A Distributed Storage System for Structured Data" (OSDI 2006, extended TOCS 2008).
- James C. Corbett et al., "Spanner: Google's Globally-Distributed Database" (OSDI 2012) — TrueTime, external consistency, the commit-wait protocol.
- Abhishek Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — the long-delayed write-up of a system already a decade old.
- Brendan Burns et al., "Borg, Omega, and Kubernetes" (ACM Queue 2016) — the lineage paper, explicit about what Kubernetes deliberately omitted.
- See also: the append-only log — simplest store; every system is unique — study the real ones; bulkheads; the principles — Netflix.