In short

FoundationDB is the cleanest expression in the industry of one design idea: build a tiny, simple, correct, transactional, ordered key-value store; then build every higher-level data model as a thin library on top. The kernel does one thing — sorted KV with strictly serialisable multi-key transactions, capped at 5 seconds and 10 MB per transaction — and refuses to do anything else. There is no SQL in the engine. There are no documents in the engine. There are no queues in the engine. Those are layers: client-side libraries that translate their data model into KV reads and writes the kernel already supports.

The kernel earned its reputation through deterministic simulation testing. Every external dependency — network sockets, disks, clocks, threads — is abstracted behind an interface, and a simulator implementation of each runs the entire cluster (Sequencer, Proxies, Resolvers, Logs, Storage) inside one process with a single deterministic event loop. The simulator injects faults systematically: machines disappear, disks return wrong data, packets reorder, clocks jump. A failure replays bit-for-bit from a seed. FoundationDB was famously the most-tested distributed system in the industry by 2014; the team would joke that they spent more compute on testing the database than on running it.

The transaction protocol is optimistic: read, validate at commit, then write. The client gets a read version from a single Sequencer process, reads consistent values at that version from StorageServers, buffers writes locally, and at commit ships read-key ranges and writes to a Proxy. The Proxy fans the read ranges out to Resolvers (sharded by key), each of which checks whether anything in your read range was modified since your read version. If all Resolvers say clean, the Proxy writes the mutations to the Log (durable, replicated), assigns a commit version, and acks the client; the new mutations flow lazily out to StorageServers. The 5-second cap is what keeps the Resolver's "recently committed" memory bounded.

Layers transform every other data model into KV operations. The Tuple Layer packs (user_id, post_id) into bytes that sort lexicographically the way humans expect. The Subspace and Directory layers prefix-namespace keys. The Document Layer speaks the MongoDB wire protocol. The Record Layer powers Apple's CloudKit with typed records, secondary indexes, and queries — every iCloud user transitively touches FDB. The SQL Layer is what Tigris and others built on top.

FoundationDB started in 2009, was acquired by Apple in 2015 (and immediately disappeared from public view), and was open-sourced again in 2018. Today it is the metadata store under Snowflake — every table schema, every micro-partition manifest, every transaction commit lives in FDB — as well as the persistence layer under iCloud, Wavefront, Tigris, and a long list of internal services at large companies.

By chapter 115 you have seen Percolator build distributed transactions by stuffing protocol state into a non-transactional KV, and CockroachDB build them by stamping write intents and a transaction record on a Multi-Raft KV. Both designs put SQL, documents, and the rest of the user-visible data model on top of the same monolith — one binary, many features. FoundationDB rejects that. It says: if the kernel is small enough to be provably correct and fast enough to be cheap, then everything else is a library, and the library writers can innovate without touching the kernel. This chapter walks through that bet — the kernel's deliberately tiny surface, the simulation-testing methodology that made the bet survivable, the transaction protocol, and the layers that grew on top.

The thesis: the kernel is sorted KV; everything else is a library

Most databases bundle a storage engine, a transaction manager, a query planner, a wire-protocol parser, and a half-dozen data models into one process. The bundle is convenient — install one binary, get SQL — but it is also why those systems are so hard to test, so hard to extend, and so unstable when you push them past their original use case. FoundationDB's founders, watching the data-model wars of 2009 (key-value vs. document vs. graph vs. column-family), made a bet that all of those are projections of the same underlying object: an ordered, transactional map from byte-strings to byte-strings.

If the kernel exposes that object cleanly, then a document is just a set of keys (doc_id, field) → value; a table row is (table, primary_key, column) → value; a queue is (queue, monotonic_id) → message; a graph adjacency list is (node, neighbour) → edge_data. Range scans on the sorted keys give you "all fields of this document", "all columns of this row", "all neighbours of this node" in a single operation.
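The projection idea can be sketched in a few lines. This is a toy in-memory stand-in — a plain dict playing the role of the ordered KV kernel, and a deliberately crude `pack()` standing in for the Tuple Layer shown later in the chapter:

```python
# A sketch: one sorted map of byte-string keys can host several data models.
# pack() here is a toy stand-in for FDB's Tuple Layer (covered below); it is
# only order-correct for non-negative ints and plain strings.

def pack(tup):
    parts = []
    for v in tup:
        # Zero-pad ints so their text form sorts numerically.
        parts.append(str(v).zfill(12).encode() if isinstance(v, int) else str(v).encode())
    return b"\x00".join(parts)

store = {}  # pretend this dict is the ordered KV kernel

# A "document" is a set of (doc_id, field) keys:
store[pack(("doc", 7, "name"))] = b"Asha"
store[pack(("doc", 7, "city"))] = b"Pune"
# A "table row" is (table, pk, column) keys:
store[pack(("tbl", "users", 42, "email"))] = b"asha@example.com"
# A "queue" is (queue, monotonic_id) keys:
store[pack(("q", "jobs", 1))] = b"send-email"

def range_scan(prefix_tuple):
    p = pack(prefix_tuple) + b"\x00"
    return [(k, v) for k, v in sorted(store.items()) if k.startswith(p)]

# "All fields of document 7" is a single range scan over the sorted keys:
fields = range_scan(("doc", 7))
```

Three data models, one store, one scan primitive — which is the whole thesis in miniature.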

So the kernel ships exactly that — and only that. Strictly serialisable, the gold-standard isolation level, equivalent to running every transaction one at a time in a global order. Sorted keys, so range scans are cheap. Multi-key transactions spanning the whole keyspace — not just within a row, not just within a partition. And nothing else: no SQL parser, no document model, no secondary indexes, no triggers, no stored procedures. Why refuse those features even though users ask for them: every feature added to the kernel is a feature the simulator must model and the correctness proofs must cover. Document support requires JSON parsing, which requires UTF-8 handling, which requires Unicode tables; the surface explodes. By forcing every feature to live as a library, the kernel stays small enough to fully simulate, and library bugs cannot crash the storage cluster. The trade-off is real: a layer cannot push computation down to data the way DistSQL does, so analytical queries on FDB are slower than on CockroachDB. The team explicitly accepted that.

FoundationDB's strict layering: a tiny KV kernel with libraries on top. A vertical stack. At the bottom: Storage layer — a sorted, replicated key-value store. Above it: Transaction layer — versionstamps, conflict ranges, optimistic concurrency. Above that: a row of three Layers side by side as libraries: Tuple Layer, Subspace Layer, Directory Layer. Above those: Document Layer (Mongo-compatible), Record Layer (CloudKit), SQL Layer (Tigris). An arrow on the side shows that everything above the Transaction layer is just a library — it runs in the client, not the server.
The FoundationDB stack. The two boxes at the bottom — Storage and Transaction — are the only thing FoundationDB itself ships as servers. Everything above (Tuple, Subspace, Directory, then Document, Record, SQL) is a *library* that runs in the client process and translates its data model into ordinary KV reads and writes. Add a new data model? Write a new library. The kernel never changes.

The boundary line in that diagram — between Transaction (kernel, server) and Tuple (library, client) — is the most consequential design choice in the system. Crossing the boundary means RPC, replication, simulation testing. Above the boundary it is just code in your binary.

The simulation testing that earned the kernel its reputation

A kernel cannot be small and trusted unless you can convince yourself it is correct. FoundationDB's answer was to invest, from day one, in deterministic simulation testing. The investment dwarfs the database itself; for years the simulator was a larger codebase than the production code.

The trick: write the database in Flow, an actor-style C++ extension the team built that compiles transparently to single-threaded asynchronous code. Then abstract every external dependency — network sockets, disk I/O, clocks, threads, file system — behind a virtual interface. There are two implementations of each interface: a real one that talks to the OS, and a simulated one that talks to an in-memory model.

In simulation mode, the entire cluster — Sequencer, Proxies, Resolvers, Logs, StorageServers, the lot — runs as actors inside one process, on one thread, scheduled by a deterministic event loop driven by a seed. The simulated network introduces latency, drops packets, reorders, partitions. The simulated disk corrupts blocks, returns ENOSPC, slows to a crawl. The simulated clock jumps backwards. The simulator runs at thousands of times wall-clock speed — a single CPU-hour replays days of cluster operation under continuous fault injection.

When a bug fires, the seed reproduces it bit-for-bit. The team called this "buggify": every site in the code that could plausibly fail is tagged, and the simulator flips a coin at every tagged site to decide whether to inject a fault. Combined with property-based workloads (random transactions checked against a single-process oracle), this caught classes of bugs other distributed systems are still discovering in production.
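The core mechanic is simple enough to sketch. This is not Flow or FDB's actual simulator — just a minimal illustration of the principle that if every nondeterministic choice (scheduling order, fault injection) flows from one seeded RNG, the same seed replays the identical execution:

```python
import random

# A minimal sketch of deterministic simulation: one event loop, one seeded
# RNG driving both scheduling and "buggify"-style fault injection.

def simulate(seed, n_messages=20):
    rng = random.Random(seed)
    in_flight = [f"msg-{i}" for i in range(n_messages)]
    delivered, trace = [], []
    while in_flight:
        # Deterministically "random" scheduling: pick which message fires next.
        msg = in_flight.pop(rng.randrange(len(in_flight)))
        if rng.random() < 0.2:          # buggify-style coin flip: inject a drop
            trace.append(("drop", msg))
        else:
            trace.append(("deliver", msg))
            delivered.append(msg)
    return trace, delivered

# Same seed → bit-identical trace; a failing seed is a perfect reproducer.
t1, _ = simulate(seed=42)
t2, _ = simulate(seed=42)
assert t1 == t2
```

In the real system the "messages" are actor wakeups, disk completions, and network packets, and the faults include corruption, partitions, and clock jumps — but the replay-from-a-seed property works exactly this way.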

Why this matters for the layered thesis: the simulator only models the kernel. Layers are out of scope — they live in client code and are tested with conventional unit tests. The kernel staying small is what kept simulation tractable. If you added a SQL parser to the kernel, you would have to model SQL parsing in the simulator too, and the surface would explode beyond what one team could maintain. The smallness of the kernel is enforced by the cost of testing it.

The transaction protocol — Sequencer, Proxies, Resolvers, Logs

FoundationDB's transaction commit path with conflict-range Resolvers. Five swimlanes left to right: Client, Sequencer, Proxy, Resolver(s), Log + StorageServers. Time flows top to bottom. Step 1: the client asks the Sequencer for a read version (e.g. 1000). Step 2: the client reads from StorageServers at version 1000 and buffers writes locally. Step 3: at commit, the client sends the Proxy the read conflict ranges and the write set. Step 4: the Proxy gets a commit version (e.g. 1100) from the Sequencer and fans read ranges out to Resolvers, each sharded by key range. Step 5: each Resolver checks its in-memory list of recently committed conflict ranges; if any committed range with version greater than 1000 overlaps a read range, it returns CONFLICT, else CLEAN. Step 6: if all CLEAN, the Proxy writes the mutations to the Log, durable and replicated, and acks the client at commit version 1100. Step 7: mutations flow lazily from the Log to StorageServers in version order.
The commit path. The Sequencer hands out monotonic read and commit versions; the Proxy is the per-transaction coordinator; the Resolvers are the conflict-detection layer, each owning a key shard and remembering recently committed conflict ranges; the Log is the durability layer; StorageServers eventually catch up by tailing the Log. Steps 1–2 happen at transaction start. Steps 3–6 happen at commit. Step 7 is asynchronous and explains why a freshly committed key may take a few milliseconds to appear in a StorageServer read.

Walk through it step by step.

Step 1 — read version. The client asks the Sequencer (a single elected process per cluster) for a read version. The Sequencer hands out monotonically increasing 64-bit versions; in production it batches thousands of requests per round-trip to keep up. Why one Sequencer rather than a Lamport clock or HLC: FoundationDB chose strict serialisability, which requires a total order on commit versions. A single elected Sequencer is the simplest way to get one. The Sequencer is not a bottleneck because every operation through it is a few bytes of integer state — millions of versions per second per process. If the Sequencer dies, a new one is elected in under a second; transactions in flight retry.

Step 2 — snapshot reads. The client reads from the StorageServers at the read version. StorageServers keep a small window of multi-version history, so a read at RV=1000 returns whatever was committed at version ≤ 1000 — a consistent snapshot.
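The multi-version read in step 2 can be sketched as a per-key history list. This is an assumption-laden simplification — a real StorageServer stores versions in a far more compact structure — but the lookup rule is the same: latest value with commit version at or below the read version.

```python
import bisect

# Sketch of the MVCC window a StorageServer keeps (simplified: a per-key
# sorted list of (commit_version, value) pairs, newest last).

class MVStore:
    def __init__(self):
        self.history = {}  # key -> sorted list of (commit_version, value)

    def apply(self, key, version, value):
        self.history.setdefault(key, []).append((version, value))

    def read(self, key, read_version):
        versions = self.history.get(key, [])
        # Latest entry with commit_version <= read_version.
        i = bisect.bisect_right(versions, (read_version, b"\xff" * 8)) - 1
        return versions[i][1] if i >= 0 else None

s = MVStore()
s.apply(b"k", 900, b"old")
s.apply(b"k", 1100, b"new")
# A transaction with RV=1000 sees the state as of version 1000:
assert s.read(b"k", 1000) == b"old"
assert s.read(b"k", 1200) == b"new"
```

The 5-second transaction cap is what lets this history list stay short: anything older than the window can be garbage-collected.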

Step 3 — buffer writes. Writes are kept entirely in client memory until commit. There is no write intent on the StorageServers; nothing is visible to other transactions yet.

Step 4 — commit. The client ships the read conflict ranges (the spans of keys it actually read or range-scanned) and the write set (the mutations) to a Proxy. The Proxy gets a commit version from the Sequencer.

Step 5 — Resolver check. The Proxy fans the read ranges out to the Resolvers, sharded by key. Each Resolver maintains an in-memory log of recently committed conflict ranges (anything committed in the last 5 seconds — the transaction time limit). For each incoming read range, the Resolver checks: did any committed range overlap this read range, with a commit version greater than my read version? If yes → CONFLICT. Otherwise CLEAN. Why ranges and not individual keys: range scans are first-class in FDB, so you can read "all keys with prefix users/2/". The conflict checker has to detect a write to users/2/47 as conflicting with that read even though users/2/47 was not in your read set. Tracking conflict ranges makes the check correct without you enumerating every key you might have read.

Step 6 — Log durability. If all Resolvers say CLEAN, the Proxy writes the mutations to the Log (a small set of replicated, append-only processes). Once the Log majority has acked, the Proxy returns success to the client. The transaction is committed at the commit version assigned in step 4.

Step 7 — apply to StorageServers. Asynchronously, the Logs stream their mutations to the StorageServers in version order. Each StorageServer applies the mutations into its sorted KV. There is a small window where a committed write is durable in the Log but not yet visible on a StorageServer; a StorageServer asked to read at a version it has not yet caught up to waits briefly for the mutations to arrive, or fails the request so the client retries on a replica that has, so reads stay consistent.

The 5-second transaction limit shows up in two places: the Resolver only needs to remember 5 seconds of committed conflict ranges (bounded memory), and the StorageServers only need to keep 5 seconds of MVCC history (bounded disk). Why 5 seconds rather than infinity: Spanner-class systems pay for arbitrary transaction durations with garbage collection complexity, deadlock detection, and unbounded conflict-tracking memory. FoundationDB chose smallness and capped the duration. If you need a longer-running operation, batch it as multiple short transactions. The cap is part of the discipline that keeps the kernel simple.
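The Resolver's check and the 5-second trim fit in a short sketch. This is a simplified model under stated assumptions — real Resolvers are sharded by key and use a specialised in-memory structure, not a Python list — but the logic is the one described above: conflict if any committed range newer than your read version overlaps a range you read.

```python
import time

# Sketch of a Resolver's conflict check with bounded-window memory.

class Resolver:
    WINDOW = 5.0  # seconds of committed history to remember

    def __init__(self):
        self.committed = []  # list of (wall_time, commit_version, begin, end)

    def check_and_commit(self, read_version, read_ranges, commit_version, write_ranges):
        now = time.monotonic()
        # Trim anything older than the window — this is why memory stays bounded.
        self.committed = [c for c in self.committed if now - c[0] <= self.WINDOW]
        for _, cv, b, e in self.committed:
            if cv <= read_version:
                continue  # committed before our snapshot: we already saw it
            for rb, re_ in read_ranges:
                if b < re_ and rb < e:  # half-open interval overlap
                    return "CONFLICT"
        for wb, we in write_ranges:
            self.committed.append((now, commit_version, wb, we))
        return "CLEAN"

r = Resolver()
# T1 commits a write to [b"k2", b"k3") at version 1100.
assert r.check_and_commit(1000, [], 1100, [(b"k2", b"k3")]) == "CLEAN"
# T2 range-read [b"k1", b"k9") at RV=1000 — overlaps T1's newer write: aborted.
assert r.check_and_commit(1000, [(b"k1", b"k9")], 1150, []) == "CONFLICT"
# T3 read the same range at RV=1100 — its snapshot includes T1's write: clean.
assert r.check_and_commit(1100, [(b"k1", b"k9")], 1200, []) == "CLEAN"
```

Note that the check is on ranges, not keys — which is how a write to users/2/47 conflicts with a scan of the users/2/ prefix even though that exact key was never enumerated.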

A tiny "Tuple-layer-like" wrapper in Python

Layers in FDB are libraries — code in your client process that turns a higher-level data model into KV operations. The simplest one is the Tuple Layer: it packs Python tuples into byte strings that sort lexicographically the way you intend. (2, 5, 100) packs to a byte sequence that, when compared with (2, 5, 101), comes first; and (2, 6) comes after both. That gives you free range scans on hierarchical structures.

Here is a simplified pack/unpack you could write yourself in twenty minutes:

# Type tags chosen so packed bytes sort the right way
TAG_NULL = 0x00
TAG_BYTES = 0x01
TAG_STRING = 0x02
TAG_INT_NEG = 0x0B  # negative int (placeholder, simplified)
TAG_INT_ZERO = 0x14
TAG_INT_POS = 0x15  # positive int (1 byte length follows)

def pack_int(n):
    if n == 0:
        return bytes([TAG_INT_ZERO])
    if n > 0:
        # Big-endian, minimum bytes needed
        nbytes = (n.bit_length() + 7) // 8
        return bytes([TAG_INT_POS, nbytes]) + n.to_bytes(nbytes, "big")
    raise NotImplementedError("negative ints elided for brevity")

def pack_str(s):
    # 0x02 tag, UTF-8 bytes, 0x00 terminator (with 0x00 inside escaped — elided)
    return bytes([TAG_STRING]) + s.encode("utf-8") + b"\x00"

def pack(tup):
    out = bytearray()
    for v in tup:
        if v is None:
            out.append(TAG_NULL)
        elif isinstance(v, int):
            out.extend(pack_int(v))
        elif isinstance(v, str):
            out.extend(pack_str(v))
        else:
            raise TypeError(type(v))
    return bytes(out)

def range_for_prefix(prefix_tuple):
    """Range scan helper — every key starting with this tuple."""
    p = pack(prefix_tuple)
    return (p, p + b"\xff")  # (begin_inclusive, end_exclusive)

# --- Use it ---
k1 = pack(("users", 2, 47))
k2 = pack(("users", 2, 48))
k3 = pack(("users", 3, 1))
assert k1 < k2 < k3   # lexicographic order matches tuple order

begin, end = range_for_prefix(("users", 2))
# In real FDB: tx.get_range(begin, end) returns every key under ("users", 2)

That is the entire Tuple Layer in spirit. The real one handles negative integers, floats, UUIDs, nested tuples, and binary escaping, but the shape is the same: a function from Python objects to bytes that preserves order. Once you have it, hierarchical models like (tenant, user, post, comment) pack into single keys, and a range scan over pack((tenant, user)) gives you everything for that user in sorted order.

Worked example: a library catalogue

A library catalogue across thousands of branches and millions of books

A network of public libraries across India — say, the Karnataka State Central Library system — wants a single transactional catalogue. Books live on shelves; shelves live in libraries; the catalogue records which book is where. The query patterns are: list every book on shelf 5 of library 2; move a book from shelf 5 of library 2 to shelf 3 of library 7 atomically; find a specific book by ID across the whole network.

Model each book as one KV entry, keyed by (library_id, shelf_id, book_id):

Library catalogue keyed by (library, shelf, book). A sorted KV store with rows showing keys like (1, 3, 101) → "Sapiens", (1, 3, 102) → "Wings of Fire", (2, 5, 200) → "Discovery of India", (2, 5, 201) → "Annihilation of Caste", (2, 5, 202) → "Midnight's Children", (7, 3, 999) → "Things Fall Apart". A bracket on the left shows that all (2, 5, *) keys form a contiguous range, scanned with one prefix scan.
Three books on shelf 5 of library 2 form a contiguous range in the sorted KV. A single prefix scan — `tx.get_range(pack(("lib",2,5)), pack(("lib",2,5)) + b"\xff")` — fetches all of them in one round trip.

The "list shelf 5 of library 2" query is one range scan, one round trip:

@fdb.transactional
def books_on_shelf(tx, library_id, shelf_id):
    begin = pack(("lib", library_id, shelf_id))
    end = begin + b"\xff"
    # unpack (not shown here) is the inverse of pack — bytes back to a tuple
    return [(unpack(k), v.decode()) for k, v in tx.get_range(begin, end)]

The interesting one is the atomic move. Moving book 201 from (lib, 2, 5) to (lib, 7, 3) is two writes — a delete at the old key and a put at the new key — and they must happen together or not at all. In FDB this is the default; you wrap them in a single @fdb.transactional and the kernel either commits both or neither.

@fdb.transactional
def move_book(tx, book_id, src_lib, src_shelf, dst_lib, dst_shelf):
    src_key = pack(("lib", src_lib, src_shelf, book_id))
    metadata = tx[src_key]            # read at the snapshot RV
    if not metadata.present():        # fdb's Value: .present() is False for absent keys
        raise KeyError("book not found at source shelf")
    dst_key = pack(("lib", dst_lib, dst_shelf, book_id))
    del tx[src_key]                   # buffered locally
    tx[dst_key] = metadata            # buffered locally
    # On commit: Proxy gets CV; Resolver checks if anyone else
    # touched src_key or dst_key since RV; if clean → Log → ack.

Now imagine two librarians simultaneously try to move the same book to two different destinations. Both transactions read src_key at their respective RVs and stage their writes. The first to reach the Proxy commits. When the second hits the Resolver, the Resolver sees that src_key was modified at a commit version greater than the second transaction's RV → CONFLICT → the second transaction is aborted and retried. The retry reads the new value at src_key (which is None, because the first transaction already moved the book), raises KeyError, and surfaces a clean error to the second librarian. No corruption, no duplicate book, no manual locking code anywhere in the library catalogue service.
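The retry machinery that makes this safe is the @fdb.transactional decorator. Its shape — and this is a sketch of the idea, not the real implementation, which also backs off and distinguishes retryable error codes — is a loop that reruns the whole body from a fresh snapshot on conflict:

```python
# Sketch of what @fdb.transactional does, using a toy DB whose commit
# conflicts a configurable number of times before succeeding.

class ConflictError(Exception):
    pass

class ToyDB:
    def __init__(self, conflicts_before_success):
        self.remaining = conflicts_before_success
        self.attempts = 0

    def begin(self):
        self.attempts += 1   # a fresh "transaction" (new read version)
        return self

    def commit(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise ConflictError()

def run_transactional(db, fn, max_attempts=10):
    # Retry the whole body on conflict; each attempt reads a new snapshot,
    # which is why the body must be safe to re-execute from scratch.
    for _ in range(max_attempts):
        tx = db.begin()
        try:
            result = fn(tx)
            tx.commit()
            return result
        except ConflictError:
            continue
    raise RuntimeError("too many conflicts")

db = ToyDB(conflicts_before_success=2)
assert run_transactional(db, lambda tx: "moved") == "moved"
assert db.attempts == 3  # two conflicted attempts, then one clean commit
```

This is why move_book above can simply raise KeyError on retry: the rerun body reads the post-conflict state and reacts to it, with no locking code anywhere.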

The Apple acquisition arc and the Snowflake metadata role

FoundationDB's history: founding, Apple acquisition, open-source, Snowflake. A horizontal timeline from 2009 to 2024 with five labelled milestones. 2009: founded as FoundationDB Inc. by Dave Rosenthal and Nick Lavezzo. 2013: 1.0 release; reputation for simulation testing established. March 2015: acquired by Apple; public downloads and the community version pulled offline; the team works internally on CloudKit and other Apple services. April 2018: Apple open-sources the engine on GitHub under Apache 2.0. 2019 onward: Snowflake confirms FDB powers its global metadata store; iCloud, Tigris, Wavefront and others build on top. Below the timeline, two boxes show where FDB ended up: "Snowflake metadata store — every table schema, micro-partition manifest, transaction commit" and "Apple CloudKit — typed records via the Record Layer; iCloud user data".
The arc. Six years as an independent company, three years dark inside Apple, six years and counting as open-source infrastructure under several of the largest cloud services in the world. Snowflake's choice to put every byte of metadata in FDB is the strongest endorsement the design has received — Snowflake's transactional metadata throughput is one of the largest single workloads in the industry.

A few things worth noting about the Snowflake fit. Snowflake's data is in object storage (S3, GCS, Azure Blob); the metadata — what micro-partitions exist, which warehouse owns which lock, what schema applies to which table at which time-travel version — needs ACID transactions across many keys, often across regions. That is exactly what FoundationDB does well: many small KV transactions per second, strict serialisability, range scans for "give me all partitions of this table". Snowflake's engineering blog describes the role in detail.

Apple's use is wider: the Record Layer on top of FDB is the substrate for CloudKit, which is the substrate for iCloud. Every photo you sync, every Reminders entry, every Safari tab handoff transitively touches an FDB cluster somewhere in an Apple datacentre. The Record Layer adds typed records (Protobuf-defined schemas), secondary indexes (computed and stored as ordinary KV entries under different prefixes), and a query planner — all as a library on top of the same kernel.
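The "secondary indexes as ordinary KV entries" trick deserves a concrete sketch. This is a toy illustration of the Record Layer idea, not its API — the `pack()` helper is the same crude Tuple-Layer stand-in used earlier, reproduced here so the block is self-contained:

```python
# A secondary index is just a second set of keys under a different prefix,
# written in the same transaction as the primary copy — so it can never
# drift out of sync with the data.

def pack(tup):
    return b"\x00".join(
        str(v).zfill(12).encode() if isinstance(v, int) else str(v).encode()
        for v in tup
    )

store = {}  # stand-in for the transactional KV

def put_book(book_id, title, author):
    # Primary copy under one prefix...
    store[pack(("book", book_id))] = f"{title} · {author}".encode()
    # ...and an index entry under another, value empty: the key IS the index.
    store[pack(("idx", "author", author, book_id))] = b""

def books_by(author):
    # Query by author = one prefix scan over the index subspace.
    prefix = pack(("idx", "author", author)) + b"\x00"
    return [k for k in sorted(store) if k.startswith(prefix)]

put_book(1, "Discovery of India", "Nehru")
put_book(2, "Glimpses of World History", "Nehru")
put_book(3, "Annihilation of Caste", "Ambedkar")
assert len(books_by("Nehru")) == 2
```

The real Record Layer adds Protobuf schemas, index maintenance hooks, and a planner on top, but every index it builds bottoms out in exactly this pattern: extra keys, same transaction.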

Going deeper — what the layered design buys and what it costs

For the engineer designing their own storage stack

The layered design pays a cost and buys a benefit. The cost is no pushdown. A SQL WHERE clause running on the SQL Layer cannot tell the StorageServer "filter on column X = 7" — the StorageServer does not know what a column is. So the SQL Layer has to range-scan a wider region and filter in client memory. CockroachDB's DistSQL pushes the filter down and is faster on analytical scans for that reason. FoundationDB makes you accept that and counters with operational simplicity.
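The no-pushdown cost is easiest to see in miniature. In this toy stand-in (an in-memory dict playing the kernel), the "server" can only answer range requests on keys; the predicate has to run in the client, over data that already crossed the network:

```python
# (table, pk, column) → value, laid out as a SQL layer would pack them.
rows = {
    ("orders", 1, "status"): "open",
    ("orders", 1, "total"):  "120",
    ("orders", 2, "status"): "closed",
    ("orders", 2, "total"):  "75",
    ("orders", 3, "status"): "open",
    ("orders", 3, "total"):  "40",
}

def get_range(prefix):
    # All the kernel offers: keys in a range. It cannot evaluate "status = open"
    # because it does not know what a column, or a status, is.
    return [(k, v) for k, v in sorted(rows.items()) if k[:len(prefix)] == prefix]

# SELECT pk FROM orders WHERE status = 'open' — the layer pulls every status
# cell over the wire and filters locally:
open_orders = [k[1] for k, v in get_range(("orders",))
               if k[2] == "status" and v == "open"]
assert open_orders == [1, 3]
```

A DistSQL-style engine would have shipped the predicate to the data and returned two rows; here the scan returned six cells and the client threw four away. Fine for metadata workloads, painful for analytics.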

What you get in return

The benefit is composability. If you want to add a graph data model, you write a graph layer; nobody on the kernel team needs to do anything. The kernel keeps shipping the same correctness guarantees, and your layer is just a Python library you can iterate on weekly. Multiple data models can coexist on the same cluster, sharing the operational footprint — one cluster, one set of dashboards, one backup pipeline, but Document, Record, and SQL all running side by side.

The benefit also includes blast-radius isolation. A bug in a layer cannot corrupt the storage. The worst it can do is write malformed bytes into a key, which a different layer will refuse to read. Compared to a monolithic database where a bug in the SQL planner can occasionally produce wrong results from clean storage, the layered model is much harder to corrupt.

Why layers are libraries and not server processes

A subtle point: layers are in-process libraries, not separate server processes. You import the Document Layer into your application, and your application talks directly to the FDB cluster. There is no Document Layer server. Why not run layers as servers: a layer-as-server adds a network hop and another component to operate. By making layers libraries, FDB pushes the layer's compute to the application, which scales naturally with the application's CPU. The trade-off is that the layer has to be available in your language; the Tuple/Subspace/Directory layers are in every supported binding, but the Document and Record layers are mostly Java/Go.

The 5-second cap as discipline

Many users discover the 5-second transaction limit and complain. The team's response is: if your transaction needs more than 5 seconds, you are not building it right; batch it. The cap forces you to design idempotent, restartable units of work, which turns out to be the right discipline for distributed systems generally. CockroachDB inherited a similar discipline through transaction restarts, and Spanner through bounded-staleness reads. The cap is uncomfortable until you accept it; then it is liberating.
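What "batch it" looks like in practice: persist a cursor in the store itself, and make each batch one short transaction that reads the cursor, does a slice of work, and advances it. A toy sketch with a plain dict standing in for the transactional KV (in real FDB each call to run_one_batch would be one @fdb.transactional function):

```python
# Batching a long-running job into restartable sub-5-second units of work.

store = {f"item-{i:04d}": 0 for i in range(10)}
CURSOR = "job/cursor"
store[CURSOR] = ""  # progress lives in the same transactional store as the data

def run_one_batch(batch_size=4):
    # One short "transaction": read cursor, process a few keys, advance cursor.
    after = store[CURSOR]
    keys = sorted(k for k in store if k.startswith("item-") and k > after)
    batch = keys[:batch_size]
    for k in batch:
        store[k] += 1          # the "work" — must be idempotent per key
    if batch:
        store[CURSOR] = batch[-1]
    return len(batch)

# Drive it to completion; a crash between calls just resumes from the cursor.
while run_one_batch() > 0:
    pass
assert all(store[k] == 1 for k in store if k.startswith("item-"))
```

Because the cursor commits atomically with the batch's writes, a crashed or conflicted batch leaves no half-applied state — which is exactly the idempotent, restartable discipline the cap enforces.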

Where FDB does not fit

FoundationDB is not a good choice if your workload is one giant analytical scan over terabytes — Spark on object storage will be cheaper and faster. It is not a good choice if you want stored procedures or triggers — the kernel does not run user code. And it is not a good choice if you cannot tolerate the conceptual overhead of writing your data model as KV — for many teams, "just use Postgres" is correct, and FDB only pays off when the operational scale or the multi-model requirement forces the discussion.

But when the fit is right — Snowflake's metadata, CloudKit's records, Tigris's multi-model API — FoundationDB does what no other system does as well: a small, transactional, sorted-KV substrate that other systems can bet their whole product on without owning the storage layer themselves.

What you've internalised

You now understand the FoundationDB bet: keep the kernel small, transactional, sorted, and bulletproof; let every higher-level data model live as a client-side library. You can sketch the commit path through Sequencer, Proxy, Resolver, Log, and StorageServer; you can explain why the 5-second cap is a feature; you can pack tuples into sortable bytes; and you can describe why Snowflake and Apple chose this kernel for workloads that absolutely cannot fail. In chapter 117, you will assemble the systems you have built — Spanner, CockroachDB, TiDB, FoundationDB, Cassandra, DynamoDB — into the real CAP/PACELC map of production, and see where each one trades latency for consistency.

References