FoundationDB's layered design

Most distributed databases bundle SQL, documents, indexes, queues, and storage into one binary you have to take or leave. FoundationDB takes the opposite bet: ship a tiny sorted-KV kernel that does only transactions, and let every other data model live as a client-side library on top.

The FDB kernel does one thing — strictly serialisable, sorted, multi-key transactions on byte keys, capped at 5 seconds and 10 MB per transaction. SQL, MongoDB-compatible documents, Apple's CloudKit Record Layer, and Snowflake's metadata catalogue all sit above the kernel as in-process libraries that translate their data model into ordinary KV reads and writes. The smallness of the kernel is what lets the deterministic simulator test it to a degree no monolithic database can match.

The thesis: one kernel, many libraries

A document is a set of (doc_id, field) → value entries. A row is (table, pk, column) → value. A queue is (queue, monotonic_id) → message. A graph adjacency list is (node, neighbour) → edge. If your KV store has sorted keys and supports range scans, every one of those data models reduces to "pick a key encoding, then read and write bytes."
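That reduction can be made concrete with plain Python tuples, which already sort the way order-preserving packed keys do. A toy sketch — the dict stands in for the sorted-KV kernel, and every name here is illustrative:

```python
# Toy sketch: four data models as key encodings over one sorted KV.
# Python tuples compare lexicographically, like packed FDB keys.
kv = {}

# document: (collection, doc_id, field) -> value
kv[("users", 42, "name")] = "Riya"
kv[("users", 42, "city")] = "Mangaluru"
# row: (table, pk, column) -> value
kv[("orders", 7, "total")] = 499
# queue: (queue, monotonic_id) -> message
kv[("emails", 1001)] = "welcome"

def scan(prefix):
    """A range scan: every entry whose key starts with `prefix`."""
    return {k: v for k, v in sorted(kv.items())
            if k[:len(prefix)] == prefix}

# "fetch document 42" and "drain the queue" are both prefix scans:
doc = scan(("users", 42))   # both fields of the document
queue = scan(("emails",))   # messages in monotonic-id order
```

Every data model above is served by the same two operations: write a key, scan a prefix.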

FoundationDB's founders, Dave Rosenthal and Nick Lavezzo, looked at the data-model wars of 2009 — KV vs. document vs. graph vs. column-family — and bet that all of them are projections of the same underlying object. So they built that object and refused to ship anything else.

The kernel exposes: get, set, and clear on byte keys; range reads over the sorted keyspace; and strictly serialisable multi-key transactions under the 5-second and 10 MB caps.

And nothing else. No SQL. No JSON. No secondary indexes. No triggers. No stored procedures. Why refuse those features even when users beg for them: every feature added to the kernel is a feature the simulator has to model and the correctness story has to cover. Adding JSON drags in UTF-8 normalisation; adding SQL drags in a parser, a planner, and 200 edge cases per release. Pushing each data model out into a library keeps the kernel small enough to fully simulate, and a buggy library can never corrupt the storage cluster — the worst it can do is write malformed bytes into a key the next library will refuse to read.

[Figure: FoundationDB's layered architecture. A vertical stack. At the bottom, the FDB kernel — the Storage and Transaction layers, the only server processes (Sequencer, Proxies, Resolvers, Logs, StorageServers; sorted KV, multi-version, strict serialisability, 5 s / 10 MB caps). Above it, the three foundation libraries every higher layer depends on — Tuple (pack tuples into order-preserving bytes), Subspace (prefix-namespaced keyspaces), Directory (name → short prefix, cheap renames). Above those, the data-model layers — Document Layer (MongoDB wire-compatible), Record Layer (CloudKit; typed records and indexes), SQL Layer (Tigris and others; parse, plan, KV) — each a client-side library in the application process. Kernel = sorted KV + transactions. Everything else = library.]
The two boxes at the bottom are the only thing FDB itself ships as servers. Everything above is a library that runs in your application process and translates its data model into ordinary range reads and key writes against the kernel.

The boundary line in that diagram — between Transaction (server) and Tuple (library) — is the most consequential design choice in FoundationDB. Crossing the boundary means RPC, replication, and simulation testing. Above the boundary it is just code in your binary, iterable on a weekly release cadence by people who never touch the storage cluster.

The commit path: Sequencer, Proxy, Resolver, Log

The kernel runs five process roles. To understand how a transaction commits, walk through what each one does.

[Figure: FoundationDB's optimistic commit path with Resolver conflict checking. Five swimlanes — Client, Sequencer, Proxy, Resolver, Log + StorageServer — with time flowing downward. The client gets a read version from the Sequencer (RV=1000), does snapshot reads at that version, and buffers writes locally. On commit it ships its read ranges and writes to a Proxy; the Proxy gets a commit version (CV=1100) from the Sequencer, fans the read ranges out to Resolvers (sharded by key), which scan recently committed ranges and answer CLEAN or CONFLICT. On CLEAN the Proxy writes the mutations durably to the Log and acks the client: committed at CV=1100. Mutations propagate asynchronously from Log to StorageServers.]
The seven steps of an FDB transaction. Steps 1–2 happen at start. Steps 3–6 happen at commit. Step 7 (Log → StorageServers, async) explains why a freshly committed key can take a few milliseconds to appear in a StorageServer's response.

Walk through it.

Step 1 — read version. The client asks the Sequencer (a single elected process per cluster) for a read version. The Sequencer hands out monotonically increasing 64-bit versions; in production it batches thousands of requests per round-trip. Why a single Sequencer rather than a Lamport clock or a hybrid logical clock: strict serialisability requires a total order on commits. One elected process is the simplest way to get one. The Sequencer is not a bottleneck — every request through it is a few bytes of integer state, and FDB benchmarks have it pushing a million-plus versions per second on commodity hardware. If the Sequencer dies, recovery elects a new one in well under a second; in-flight transactions retry.

Step 2 — snapshot reads. The client reads from StorageServers at the read version. StorageServers retain a small window of multi-version history, so a read at RV=1000 returns whatever was committed at version ≤ 1000 — a consistent snapshot.
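The multi-version window fits in a few lines. A toy model, not FDB's storage engine — `MVCCStore` and its methods are invented for illustration:

```python
class MVCCStore:
    """Toy StorageServer: retains per-key version history so any
    read version inside the retained window sees a consistent snapshot."""
    def __init__(self):
        self.hist = {}   # key -> [(commit_version, value)], ascending

    def apply(self, key, value, version):
        """Mutations arrive from the Log in version order."""
        self.hist.setdefault(key, []).append((version, value))

    def read(self, key, rv):
        """Newest value committed at version <= rv, else None."""
        best = None
        for v, val in self.hist.get(key, []):
            if v <= rv:
                best = val
        return best

s = MVCCStore()
s.apply(b"k", b"old", 900)
s.apply(b"k", b"new", 1050)
# a reader holding RV=1000 still sees the pre-1050 value:
assert s.read(b"k", 1000) == b"old"
```

The 5-second cap bounds how much of this history each StorageServer must retain.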

Step 3 — buffer writes. Writes are kept entirely in client memory until commit. There is no write intent on any StorageServer; nothing is visible to other transactions yet.

Step 4 — commit. The client ships the read conflict ranges (the spans of keys it actually read or range-scanned) and the write set (the mutations) to a Proxy. The Proxy gets a commit version from the Sequencer.

Step 5 — Resolver check. The Proxy fans the read ranges out to Resolvers, sharded by key range. Each Resolver maintains an in-memory log of recently committed conflict ranges — bounded to the last 5 seconds, the same as the transaction time limit. For each incoming read range, the Resolver answers one question: did any committed range overlap this read range, with a commit version greater than the transaction's read version? If yes → CONFLICT. Otherwise → CLEAN. Why ranges and not individual keys: range scans are first-class in FDB. If you read "all keys with prefix users/2/", a write to users/2/47 between your RV and CV must abort you, even though users/2/47 was not in your individual read set. Tracking conflict ranges on both sides makes the check correct without you having to enumerate every key the scan might have touched.
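The Resolver's decision is a pure interval-overlap question, which makes it easy to sketch. A toy model with invented names, assuming half-open byte-key ranges:

```python
class Resolver:
    """Toy Resolver: remembers recently committed write ranges and
    answers CLEAN/CONFLICT for an incoming transaction."""
    def __init__(self):
        self.committed = []   # (begin, end, commit_version)

    def record_commit(self, write_ranges, cv):
        for b, e in write_ranges:
            self.committed.append((b, e, cv))

    def resolve(self, read_ranges, rv):
        """CLEAN (True) unless some committed range overlaps a read
        range at a version newer than the transaction's read version."""
        for rb, rend in read_ranges:
            for cb, ce, cv in self.committed:
                overlaps = rb < ce and cb < rend   # half-open overlap
                if overlaps and cv > rv:
                    return False                   # CONFLICT
        return True

r = Resolver()
# someone committed a write to users/2/47 at CV=1050
r.record_commit([(b"users/2/47", b"users/2/47\x00")], 1050)
# a transaction that range-scanned the users/2/ prefix at RV=1000
# must abort, even though it never read users/2/47 by name:
assert r.resolve([(b"users/2/", b"users/2\xff")], 1000) is False
```

Because the check compares ranges rather than keys, the scan-vs-point-write case in the text falls out for free.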

Step 6 — Log durability. If every Resolver returns CLEAN, the Proxy writes the mutations to the Log — a small set of replicated, append-only processes. Once a majority of Logs has acked the durable write, the Proxy returns success to the client. The transaction is now committed at the version assigned in step 4.

Step 7 — apply to StorageServers (async). Logs stream their mutations to StorageServers in version order. There is a small window where a committed write is durable in the Log but not yet visible on a StorageServer; clients reading at versions newer than the last applied version either wait briefly or are served from a Proxy-side cache.

The 5-second cap is what keeps Resolver and StorageServer memory bounded: each Resolver remembers 5 seconds of committed conflict ranges; each StorageServer keeps 5 seconds of MVCC history. Why 5 seconds rather than infinity: arbitrary transaction durations make conflict tracking unbounded in memory and turn garbage collection into a research problem. The cap converts a hard correctness problem into a knob with a fixed memory budget. If you need a longer-running operation, batch it into multiple short transactions — the discipline this forces produces better distributed code anyway.
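The batching discipline looks like this in practice. A sketch against a toy in-memory store — in real code each `delete_chunk` call would be one `@fdb.transactional` function, and every name here is illustrative:

```python
class SortedStore:
    """Toy stand-in for the FDB kernel: a sorted byte-key KV."""
    def __init__(self):
        self.data = {}

    def get_range(self, begin, end, limit):
        return sorted(k for k in self.data if begin <= k < end)[:limit]

    def delete(self, key):
        del self.data[key]

def delete_chunk(store, begin, end, limit):
    """One short 'transaction': delete at most `limit` keys, then
    return where the next transaction should resume (or None)."""
    keys = store.get_range(begin, end, limit)
    for k in keys:
        store.delete(k)
    return keys[-1] + b"\x00" if keys else None

def clear_large_range(store, begin, end, limit=1000):
    """Many short transactions instead of one giant one that would
    blow the 5 s / 10 MB caps; safe to resume after any failure."""
    while begin is not None:
        begin = delete_chunk(store, begin, end, limit)
```

Each chunk commits independently, so a crash mid-way leaves a resumable, consistent keyspace rather than a half-applied giant transaction.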

A tiny Tuple-layer-style packer in Python

Layers in FDB are libraries — code in your client process that turns a higher-level data model into KV operations. The simplest one is the Tuple Layer: it packs Python tuples into byte strings that sort lexicographically the way a human would expect. (2, 5, 100) packs to a byte sequence that compares less than (2, 5, 101), which compares less than (2, 6). With order-preserving keys, hierarchical models like (library, shelf, book_id) get free range scans on any prefix.

Here is a simplified pack/unpack you could write yourself in twenty minutes. Open your editor — type this in.

# tinytuple.py — a stripped-down version of FoundationDB's Tuple Layer
# Goal: pack Python tuples to bytes whose lexicographic order matches
#       the natural tuple order. Range scans on a prefix become trivial.

TAG_NULL     = 0x00
TAG_BYTES    = 0x01
TAG_STRING   = 0x02
TAG_INT_ZERO = 0x14
TAG_INT_POS  = 0x15  # followed by 1 byte length, then big-endian int

def pack_int(n: int) -> bytes:
    if n == 0:
        return bytes([TAG_INT_ZERO])
    if n > 0:
        nbytes = (n.bit_length() + 7) // 8
        return bytes([TAG_INT_POS, nbytes]) + n.to_bytes(nbytes, "big")
    raise NotImplementedError("negative ints elided for brevity")

def pack_str(s: str) -> bytes:
    # 0x02 tag, UTF-8 bytes, 0x00 terminator. Real FDB escapes inner 0x00.
    return bytes([TAG_STRING]) + s.encode("utf-8") + b"\x00"

def pack(tup) -> bytes:
    out = bytearray()
    for v in tup:
        if v is None:        out.append(TAG_NULL)
        elif isinstance(v, int):  out.extend(pack_int(v))
        elif isinstance(v, str):  out.extend(pack_str(v))
        else: raise TypeError(type(v))
    return bytes(out)

def range_for_prefix(prefix_tuple):
    """Every key starting with this tuple lies in [begin, end)."""
    p = pack(prefix_tuple)
    return p, p + b"\xff"

# --- try it ---
k1 = pack(("lib", 2, 5, 200))
k2 = pack(("lib", 2, 5, 201))
k3 = pack(("lib", 2, 6, 1))
assert k1 < k2 < k3, "tuple order must survive packing"

begin, end = range_for_prefix(("lib", 2, 5))
# In real FDB code:  for k, v in tx.get_range(begin, end): ...
print(begin.hex(), end.hex())

Output on a 2024 laptop:

$ python tinytuple.py
026c696200150102150105 026c696200150102150105ff

That is the entire Tuple Layer in spirit. The real one handles negative integers, floats, UUIDs, nested tuples, and binary escaping for embedded 0x00 bytes — but the shape is the same: a function from Python objects to bytes that preserves order. Once you have it, every higher data model (rows, documents, graphs) is a key-encoding choice plus a range scan. That is the engine of the layered design.
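For completeness, here is the inverse direction. The tag constants are repeated so the sketch stands alone; the real layer's decoder also handles the escaped 0x00 bytes this one ignores:

```python
# Inverse of tinytuple's pack(): walk the tag bytes, rebuild the tuple.
TAG_NULL, TAG_STRING = 0x00, 0x02
TAG_INT_ZERO, TAG_INT_POS = 0x14, 0x15

def unpack(b: bytes) -> tuple:
    out, i = [], 0
    while i < len(b):
        tag = b[i]
        if tag == TAG_NULL:
            out.append(None); i += 1
        elif tag == TAG_STRING:
            j = b.index(0x00, i + 1)              # find the terminator
            out.append(b[i + 1:j].decode("utf-8")); i = j + 1
        elif tag == TAG_INT_ZERO:
            out.append(0); i += 1
        elif tag == TAG_INT_POS:
            n = b[i + 1]                          # 1-byte length
            out.append(int.from_bytes(b[i + 2:i + 2 + n], "big"))
            i += 2 + n
        else:
            raise ValueError(f"unknown tag {tag:#04x}")
    return tuple(out)

# round-trips the key packed earlier:
assert unpack(bytes.fromhex("026c6962001501021501051501c8")) \
       == ("lib", 2, 5, 200)
```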

A worked transaction: moving a book in a Karnataka library catalogue

A library network across Bengaluru, Mysuru, and Mangaluru

The Karnataka State Central Library system runs a catalogue across roughly 700 branches. Books live on shelves; shelves live in branches; the catalogue records which book is where. The query patterns are: list every book on shelf 5 of branch 2; move a book atomically from shelf 5 of branch 2 to shelf 3 of branch 7; find a specific book by ID across the whole network.

Model each book as one KV entry, keyed by (branch, shelf, book_id):

pack(("lib", 1, 3, 101)) → "Sapiens · Yuval Noah Harari"
pack(("lib", 2, 5, 200)) → "Discovery of India · Nehru"
pack(("lib", 2, 5, 201)) → "Annihilation of Caste · Ambedkar"
pack(("lib", 2, 5, 202)) → "Midnight's Children · Rushdie"
pack(("lib", 7, 3, 999)) → "Things Fall Apart · Achebe"

"List shelf 5 of branch 2" is one prefix scan, one round trip:

@fdb.transactional
def books_on_shelf(tx, branch, shelf):
    begin, end = range_for_prefix(("lib", branch, shelf))
    return [(k, v.decode()) for k, v in tx.get_range(begin, end)]

The interesting query is the atomic move. Riya, a librarian in Mangaluru, wants to move book 201 from (branch=2, shelf=5) to (branch=7, shelf=3). That is a delete at the old key and a put at the new key. They must happen together or not at all.

@fdb.transactional
def move_book(tx, book_id, src_branch, src_shelf, dst_branch, dst_shelf):
    src = pack(("lib", src_branch, src_shelf, book_id))
    metadata = tx[src]                       # snapshot read at RV
    if metadata is None:
        raise KeyError("book not found at source")
    dst = pack(("lib", dst_branch, dst_shelf, book_id))
    del tx[src]                               # buffered locally
    tx[dst] = metadata                        # buffered locally
    # On commit: Proxy gets CV; Resolver checks src and dst weren't
    # touched since RV; if CLEAN → Log → ack. Otherwise retry.

Now suppose Riya in Mangaluru and Asha in Bengaluru both attempt to move book 201 at the same instant — Riya to (7, 3), Asha to (4, 1). Both transactions read src at their respective read versions and stage their writes. The first to reach a Proxy commits at, say, CV=1100. When the second hits a Resolver, the Resolver sees that the key range covering src was modified at CV=1100, which is greater than the second transaction's RV of 1099 → CONFLICT → abort. The retry reads the new value of src (which is now None, because the first move already happened), raises KeyError, and surfaces a clean error in Asha's UI. No lost book, no duplicate book, no manual locking code anywhere in the catalogue service. A single @fdb.transactional decorator paid for all of that correctness.

Where FoundationDB ended up: Snowflake, CloudKit, Tigris

FoundationDB started as a startup in 2009. Apple acquired it in March 2015 and pulled the public downloads offline; for three years it disappeared from view, used internally to build CloudKit. In April 2018 Apple open-sourced it again under Apache 2.0. Today it is the metadata store under Snowflake — every table schema, every micro-partition manifest, every transaction commit record lives in FDB — as well as the persistence layer under iCloud, Tigris, Wavefront, and a long list of internal services at large companies.

Two case studies make the layered design's payoff concrete.

Snowflake's metadata catalogue. Snowflake's bulk data lives in object storage (S3, GCS, Azure Blob). The metadata — what micro-partitions exist, which warehouse holds which lock, which schema applies at which time-travel version — needs ACID transactions across many keys, often across regions. That fits FDB perfectly: many small KV transactions per second, strict serialisability, range scans for "give me every micro-partition of this table." Snowflake's engineering blog describes the role in detail. The choice to put every byte of Snowflake's metadata in FDB — at the throughput of one of the largest data warehouses in the industry — is the strongest endorsement the design has received.

Apple CloudKit. The Record Layer on top of FDB is the substrate for CloudKit, which is the substrate for iCloud. Every photo you sync, every Reminders entry, every Safari tab handoff transitively touches an FDB cluster in an Apple datacentre. The Record Layer adds typed records (Protobuf-defined schemas), secondary indexes (computed and stored as ordinary KV entries under different prefixes), and a small query planner — all as a library on top of the same kernel. A team adding new index types ships a new Record Layer release; the FDB kernel keeps running unchanged.

The point worth stealing from both stories: a layer-as-library is cheaper to evolve than a layer-as-server. CloudKit ships features on its own cadence, Snowflake's metadata team ships features on its cadence, and neither team has to coordinate kernel upgrades with the other. One cluster, many independent products built on top.

Common confusions

Going deeper

Deterministic simulation as the load-bearing methodology

A small kernel becomes trustworthy only if you can convince yourself it is correct. FoundationDB's answer was deterministic simulation testing, built from day one. The team wrote the database in Flow, an actor-style C++ extension that compiles to single-threaded asynchronous code. Every external dependency — network, disk, clock, filesystem — sits behind a virtual interface with two implementations: a real one that talks to the OS, and a simulated one that talks to an in-memory model.

In simulation mode, the entire cluster — Sequencer, Proxies, Resolvers, Logs, StorageServers — runs as actors inside one process, on one thread, scheduled by a deterministic event loop driven by a seed. The simulated network drops packets, reorders them, partitions the cluster; the simulated disk corrupts blocks, slows to a crawl, returns ENOSPC; the simulated clock jumps backwards. Each CPU-hour of simulation replays days of cluster operation under continuous fault injection. When a bug fires, the seed reproduces it bit-for-bit. The "buggify" pattern tags every code site that could plausibly fail and lets the simulator coin-flip whether to inject a fault there. By 2014 FoundationDB was widely cited as the most-tested distributed system in the industry — not because the team was unusually careful, but because the methodology let them be cheaply careful at scale.
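The shape of the methodology fits in a dozen lines. A toy sketch — `Sim`, `buggify`, and `send` are invented names; the point is that one seed drives every fault coin-flip, so a failing run replays bit-for-bit:

```python
import random

class Sim:
    """Toy deterministic simulator: all nondeterminism flows through
    one seeded RNG, so a run is a pure function of its seed."""
    def __init__(self, seed):
        self.rng = random.Random(seed)

    def buggify(self, p=0.25):
        """A tagged fault site: coin-flip whether to inject a failure."""
        return self.rng.random() < p

    def send(self, packet):
        if self.buggify():
            return None        # simulated network drop
        return packet

# The same seed produces the identical fault schedule every run,
# so any bug the schedule exposes is reproducible from the seed alone:
s1, s2 = Sim(42), Sim(42)
run1 = [s1.send(i) for i in range(20)]
run2 = [s2.send(i) for i in range(20)]
assert run1 == run2
```

Real Flow code routes network, disk, and clock through virtual interfaces the same way, which is what makes whole-cluster replays possible.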

The kernel's smallness is what makes the simulator tractable. Adding a SQL parser to the kernel would force the simulator to model SQL parsing too; the surface would explode beyond what one team could maintain. The cost of testing the kernel is what enforces its smallness, and the smallness is what makes the testing cheap. The two facts hold each other up.

Why layers cannot push computation down to data

The honest cost of the layered design is no pushdown. A SQL WHERE column_x = 7 on the SQL Layer cannot tell a StorageServer "filter on column_x" — the StorageServer has no idea what a column is. So the SQL Layer range-scans a wider region of the keyspace and filters in client memory. CockroachDB's DistSQL, by contrast, ships the predicate down to the storage node and filters there, saving bytes on the wire. For analytical scans this matters; for OLTP point reads and small range scans, it does not. FoundationDB explicitly accepted slower analytics in exchange for a simpler kernel.

The mitigation is to keep transactions small and reads narrow. The Record Layer's secondary indexes, for example, store an extra KV entry per indexed value, so that a "find by index" lookup is one prefix scan rather than a table scan. Trading writes for reads is the layer's job, not the kernel's.
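The write-for-read trade is visible in a few lines. `kv` stands in for the writes inside one transaction; the index prefix and names are illustrative, not the Record Layer's actual encoding:

```python
kv = {}   # stand-in for writes inside one FDB transaction

def put_book(book_id, branch, shelf, title, author):
    # primary record, keyed by physical location
    kv[("lib", branch, shelf, book_id)] = f"{title} · {author}"
    # secondary index: one extra KV write per indexed value,
    # committed atomically with the primary record
    kv[("idx_author", author, book_id)] = (branch, shelf)

def books_by_author(author):
    """Find-by-author is a prefix scan on the index — one narrow
    range read instead of a scan of the whole catalogue."""
    prefix = ("idx_author", author)
    return [k[2] for k in sorted(kv) if k[:2] == prefix]

put_book(200, 2, 5, "Discovery of India", "Nehru")
put_book(201, 2, 5, "Annihilation of Caste", "Ambedkar")
assert books_by_author("Ambedkar") == [201]
```

Because index and record land in one transaction, the index can never reference a record that was not written — the kernel's serialisability does the bookkeeping.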

Versionstamps: turning a write into a stable monotonic ID

A versionstamp is a 10-byte token the kernel substitutes into a key or value at commit time, equal to the commit version plus a within-batch sequence number. Once committed, the versionstamp is globally monotonic across the cluster. Layers use them to build queues (the versionstamp becomes the queue position), change feeds (read all keys with versionstamp greater than X), and idempotency keys (the versionstamp is a unique stable ID for the work the transaction did). It is the FDB-specific equivalent of CockroachDB's HLC timestamps, but generated server-side at commit and visible to clients.
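The substitution mechanics can be sketched in a toy model — this is not FDB's API (real clients use the versionstamped-key atomic operation); `ToyKernel` and the placeholder convention are invented for illustration:

```python
import struct

INCOMPLETE = b"\xff" * 10    # placeholder the 'kernel' fills at commit

class ToyKernel:
    """Toy model of commit-time versionstamp substitution."""
    def __init__(self):
        self.version = 1000
        self.data = {}

    def commit(self, writes):
        self.version += 1
        for order, (key, value) in enumerate(writes):
            # 8-byte commit version + 2-byte within-batch order = 10 bytes
            stamp = struct.pack(">QH", self.version, order)
            self.data[key.replace(INCOMPLETE, stamp)] = value

k = ToyKernel()
# enqueue two messages; their queue position exists only at commit time
k.commit([(b"q/" + INCOMPLETE, b"msg-a"),
          (b"q/" + INCOMPLETE, b"msg-b")])
# the stamped keys now sort in commit order — a free durable queue:
assert [k.data[key] for key in sorted(k.data)] == [b"msg-a", b"msg-b"]
```

Big-endian packing is what makes the stamped keys sort by commit order, which is the whole trick behind queues and change feeds.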

Where FDB does not fit

FoundationDB is the wrong choice when your workload is a single giant analytical scan over terabytes — Spark on object storage will be cheaper and faster. It is wrong when you want stored procedures or triggers, because the kernel does not run user code. It is wrong when your team cannot tolerate the conceptual overhead of writing the data model as KV — for many teams, "just use Postgres" remains the right call. FDB only pays off when operational scale or multi-model requirements force the discussion. When the fit is right — Snowflake's metadata, CloudKit's records, Tigris's multi-model API — it does what no other system does as well: a small, transactional, sorted-KV substrate that other systems can bet their whole product on without owning the storage layer themselves.

Comparison to CockroachDB, Spanner, and Percolator

CockroachDB bundles SQL into the storage process and uses Multi-Raft for replication; it is a monolith optimised for one data model (Postgres-compatible SQL) and gets pushdown for free at the cost of a much larger kernel. Spanner uses TrueTime hardware (atomic clocks, GPS) to assign commit timestamps without a single Sequencer; the kernel is smaller than CockroachDB's but still SQL-aware. Percolator builds distributed transactions on top of a non-transactional KV (Bigtable) by stuffing protocol state into user-visible keys; the kernel does no transactional work itself but the layer is fragile to clock skew and abandoned coordinator nodes. FoundationDB is closest in spirit to "Spanner with a single Sequencer instead of TrueTime, and no SQL in the engine."

Where this leads next

In chapter 117 you will plot the systems you have studied — Spanner, CockroachDB, TiDB, FoundationDB, Cassandra, DynamoDB — onto the real CAP and PACELC map of production deployments, and see where each one trades latency for consistency. After that, chapter 118 is the honest counter-essay: when the right answer is a single-node Postgres on a beefy box.

The ideas to carry forward:

References