In short
CockroachDB is the open-source answer to a single, audacious question: can you get Spanner without Google? Spanner's headline guarantee — global, externally consistent, serialisable SQL — depends on TrueTime, which depends on GPS receivers and rubidium atomic clocks bolted into every datacentre rack. If you do not work at Google, you do not have those. CockroachDB's bet, made by ex-Googlers in 2014, is that you can replace TrueTime's hardware with a software-only Hybrid Logical Clock (HLC), accept a slightly weaker guarantee (serialisable rather than strictly externally consistent), and run on three rented Linux boxes.
The architecture stacks cleanly into six layers. SQL at the top speaks the Postgres wire protocol — your psql client and your Django ORM connect without knowing CockroachDB exists. DistSQL plans the query into a flow of operators and pushes them out to the nodes that own the data, so a WHERE clause filters at the leaf instead of streaming a million rows back to the gateway. The Transaction layer turns writes into write intents — tentative MVCC values tagged with a transaction ID — and serialises everything by default. The Distribution layer maintains a cache mapping (key range → leader node) and routes each request to the right place. Replication is Multi-Raft — every 64 MB range is its own Raft group with its own leader and log. Storage is Pebble, a Go reimplementation of RocksDB that backs every node's local LSM tree.
The transaction lifecycle is Spanner-shaped but HLC-timed. A transaction picks a start timestamp from the gateway node's HLC. As it writes, it leaves intents on every key it touches. The transaction record — a tiny KV stored on a designated primary range — holds the status: PENDING / COMMITTED / ABORTED. The instant the client succeeds in flipping that record to COMMITTED, the transaction has committed; the intents are still scattered across the cluster as tentative values, but every reader that touches one chases the transaction record, sees COMMITTED, and treats the intent as real. Intent resolution — replacing each intent with a real MVCC value — happens asynchronously after commit.
HLC in place of TrueTime costs you the unconditional external-consistency guarantee. CockroachDB recovers most of it through uncertainty intervals: a read from timestamp T on node N treats any value with a timestamp in (T, T + maxOffset] as uncertain and either restarts the transaction or pushes it forward. Combined with an enforced cluster-wide maximum clock skew (default 500 ms; a node that drifts past it kills itself), this gives you serialisable behaviour in practice without paying for atomic clocks.
CockroachDB powers small and mid-sized Indian fintech today — payment platforms that need cross-region durability, lenders who need transactional ledgers across three AZs in Mumbai, edtech billing systems. Its closest architectural cousin is TiDB (born at PingCAP in Beijing), which makes nearly the same trade-offs but layers on top of TiKV using Percolator-style optimistic 2PC instead of CockroachDB's intent-and-record protocol. The two systems are convergent evolution of the same idea.
By chapter 115 you have built every piece. Multi-Raft sharded the keyspace into 64 MB Raft groups. Percolator gave you client-driven optimistic 2PC over a sharded KV. Spanner showed how TrueTime makes external consistency cheap if you have GPS in the rack. This chapter assembles the pieces into CockroachDB — walking down the six layers from SELECT to disk, tracing a single distributed transaction end-to-end, and explaining HLC and intents in the depth a working engineer needs.
The big picture
CockroachDB is engineered as a tower of six layers, each with one job, each communicating with the layer below through a narrow interface.
The cleanliness of the stack is not aesthetic — it is the reason CockroachDB development can move at all. Postgres compatibility is one team. The vectorised executor (the "vec" engine that processes columnar batches instead of one row at a time) is another. DistSQL flow planning is a third. The transaction protocol is a fourth. Each team can iterate on its layer with confidence that the layer below honours its contract.
Why these particular layer cuts: DistSQL as a separate layer (rather than folded into SQL) means the planner does not need to know about ranges — it produces a logical plan, and DistSQL maps it onto the cluster's current topology, which can change underneath without re-planning. The Transaction layer above Distribution (rather than below) means transaction state is independent of where the data physically lives — a transaction touching keys on five ranges sees one coherent intent-based view, not five sharded views. The Storage layer being Pebble (Go) rather than RocksDB (C++) was a 2019–2020 migration motivated by cgo overhead — every call from Go into C++ paid a ~200 ns context switch, which on a hot LSM scan was a 30% throughput tax. Pebble removes the call boundary entirely.
The layers, top to bottom
SQL
Your client connects on port 26257 using the Postgres v3 wire protocol. CockroachDB's parser is a Go fork of Postgres's grammar, the optimiser is cost-based with per-table statistics, and the executor is vectorised for analytics — it processes columnar batches of ~1024 values at a time inside tight CPU loops, rather than one row at a time through an iterator chain. The node you connect to becomes the gateway for your session; it does not need to own any of the data your queries touch.
DistSQL
After the optimiser produces a logical plan, DistSQL turns it into a physical flow of operators and decides which node should execute each operator. The principle is push compute to data: a SELECT SUM(amount) FROM payments WHERE date = '2026-04-25' should not stream every payment row back to the gateway and sum at the top — it should run a TableReader → Filter → Aggregator flow on each node that owns a range of payments, with each node producing a partial sum, and a final aggregator on the gateway adding the partials. Joins distribute the same way: hash-partition both sides on the join key so all rows with the same key end up on the same node, join locally, stream results back. The DistSQL flow framework, documented in a 2016 design RFC, manages routing, back-pressure, and failure handling.
Why bother with DistSQL when you could centralise execution: a 100 GB analytical join with results streamed back to one gateway saturates one node's network and CPU while the rest of the cluster sits idle. With DistSQL the join executes in parallel on N nodes, each handling 100/N GB of input, and only the (much smaller) result rows cross the network. On TPC-H this turns hour-long queries into minute-long ones.
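The push-compute-to-data idea can be sketched in a few lines. This is a toy model of the partial-aggregation flow, not the real DistSQL operators — the per-node data layout and helper names here are invented for illustration:

```python
# Toy model of a DistSQL-style distributed SUM: each "node" filters and
# aggregates the ranges it owns locally, and the gateway adds the partials.
# Data layout and function names are illustrative, not CockroachDB's.

def local_flow(rows, date):
    # TableReader -> Filter -> Aggregator, all on the node that owns the data.
    return sum(amount for (d, amount) in rows if d == date)

def distributed_sum(nodes, date):
    # Only one small partial sum per node crosses the network,
    # instead of every matching row streaming back to the gateway.
    partials = [local_flow(rows, date) for rows in nodes.values()]
    return sum(partials)

# Three nodes, each owning a range of the payments table.
nodes = {
    "n1": [("2026-04-25", 100), ("2026-04-24", 50)],
    "n2": [("2026-04-25", 200)],
    "n3": [("2026-04-24", 75), ("2026-04-25", 300)],
}
assert distributed_sum(nodes, "2026-04-25") == 600
```

The same shape generalises to any decomposable aggregate (COUNT, MIN, MAX); AVG distributes as a (sum, count) pair combined at the gateway.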
Transaction
This is the layer that makes CockroachDB a database rather than a sharded KV. Every write the SQL layer emits flows through the transaction coordinator running on the gateway node. The coordinator does four things:
- Picks a provisional commit timestamp from the gateway's HLC at BEGIN.
- For each write, ships a Put (or ConditionalPut, etc.) to the leaseholder of the relevant range, which writes a write intent rather than a real value.
- Tracks every range it has written to in an in-memory record of written spans so it can resolve the intents at commit.
- At COMMIT, atomically flips the transaction record on the primary range to COMMITTED (or ABORTED).
The default isolation is serialisable (the strict version, equivalent to SERIALIZABLE in Postgres terms — no write skew, no anomalies). CockroachDB also supports READ COMMITTED since v23.2 for compatibility with workloads that explicitly need it.
Distribution
Every key lives in exactly one range (a contiguous span of sorted keyspace, default 64 MB). Every range has three replicas; one of them holds the leaseholder, a long-lived right to serve consistent reads and propose writes. The Distribution layer maintains a range cache on every node — a copy of the cluster's two-level meta table (meta1 and meta2, both stored as ordinary ranges with their own Raft groups). When the cache is stale, a request fails with NotLeaseholderError, the cache refreshes, and the request retries. The allocator lives here too: a continuous background process that watches load (CPU, QPS, replica count, disk usage) and reacts by rebalancing replicas, splitting hot ranges, and transferring leaseholds.
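The stale-cache-then-retry loop is simple enough to sketch. The cache structure, error type, and meta lookup below are simplified stand-ins for CockroachDB's internals, assumed for illustration:

```python
# Sketch of the Distribution layer's retry loop: route via the cached
# leaseholder, and on a stale-cache error, evict the entry and refresh
# from the authoritative (meta2-style) lookup. Names are illustrative.

class NotLeaseholderError(Exception):
    pass

class RangeCache:
    """Per-node cache of key -> leaseholder, refreshed on miss or error."""
    def __init__(self, meta_lookup):
        self.meta_lookup = meta_lookup   # authoritative meta-range lookup
        self.cache = {}

    def send(self, key, request, max_retries=3):
        for _ in range(max_retries):
            node = self.cache.get(key) or self.meta_lookup(key)
            self.cache[key] = node
            try:
                return node.handle(request)
            except NotLeaseholderError:
                del self.cache[key]      # evict the stale entry and retry
        raise RuntimeError("leaseholder retries exhausted")

class FakeNode:
    def __init__(self, is_leaseholder):
        self.is_leaseholder = is_leaseholder
    def handle(self, request):
        if not self.is_leaseholder:
            raise NotLeaseholderError()
        return "ok"

good, bad = FakeNode(True), FakeNode(False)
rc = RangeCache(meta_lookup=lambda key: good)
rc.cache["acct/A"] = bad                 # stale entry pointing at a follower
assert rc.send("acct/A", "Get") == "ok"  # one retry via the meta lookup
assert rc.cache["acct/A"] is good
```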
Replication (Raft)
You have already built this layer. Each 64 MB range is its own Raft group with three replicas. The leaseholder is usually the Raft leader (CockroachDB co-locates the two), but they are conceptually separate — leaseholding is about consistent reads, leadership is about log replication. Writes go through Raft: the leaseholder appends a RaftCommand to its log, replicates to followers, waits for majority, applies to its local Pebble. Reads served by followers (opt-in) use closed timestamps: the leaseholder periodically tells followers "I will not commit anything with timestamp ≤ T", so a follower can safely serve any read at timestamp ≤ T without contacting the leaseholder.
Storage (Pebble)
At the bottom of every node sits Pebble, a Go LSM tree that is API- and on-disk-format-compatible with RocksDB but implemented in pure Go. Each range's data is encoded as MVCC keys: (key, timestamp) → value, where the timestamp is an HLC value. Pebble batches writes from every range on the node into a single fsync stream — this is the disk parallelism that makes Multi-Raft work even on one disk.
Hybrid Logical Clocks: TrueTime without the hardware
Spanner gets external consistency by waiting out the TrueTime epsilon at commit. CockroachDB, with no atomic clocks, replaces TrueTime with the Hybrid Logical Clock (HLC), an idea formalised by Sandeep Kulkarni et al. in 2014. The HLC is a 64-bit value built from two parts:
- A physical time component — the local node's wall-clock time in microseconds.
- A logical counter — an integer that breaks ties when multiple events happen at the same physical microsecond, and that advances even when the wall clock does not whenever the node sees a higher peer timestamp.
The update rule on receiving a peer timestamp (p_peer, l_peer):
def hlc_update(local_clock, peer_ts):
    # Merge a received peer timestamp into the local HLC (Kulkarni et al., 2014).
    p_peer, l_peer = peer_ts
    p_local = wall_time_now()
    p_new = max(local_clock.p, p_peer, p_local)
    if p_new == local_clock.p == p_peer:
        # Three-way tie on the physical part: advance past both counters.
        l_new = max(local_clock.l, l_peer) + 1
    elif p_new == local_clock.p:
        l_new = local_clock.l + 1
    elif p_new == p_peer:
        l_new = l_peer + 1
    else:
        # Wall clock moved ahead of both: reset the logical counter.
        l_new = 0
    local_clock.p, local_clock.l = p_new, l_new
    return (p_new, l_new)
Why this works as a causal clock: if event A happens-before event B (in the Lamport sense — A's node sent a message that B's node received before B), then hlc(A) < hlc(B). The physical component keeps things close to wall-clock so timestamps are interpretable; the logical counter handles ties and ensures monotonicity even if the wall clock tries to go backwards. Unlike TrueTime, HLC tells you nothing about the uncertainty in absolute time — it only tells you about the causal order. That gap is what the uncertainty interval mechanism patches.
The trade-off is honest: HLC alone gives you serialisability with respect to causally related transactions but not strict external consistency for concurrent ones whose causality is invisible to the system. CockroachDB closes most of the gap with the uncertainty interval: every read at timestamp T from node N treats any value with timestamp in (T, T + maxOffset] as uncertain. If such a value exists, the transaction is restarted at a higher timestamp that includes the uncertain value in its read snapshot. The maxOffset (default 500 ms) is a strict cluster-wide bound — nodes that drift past it kill themselves rather than violate the invariant.
For the systems engineer, the practical consequence is: CockroachDB's serialisability holds as long as your NTP keeps clocks within 500 ms, which is trivial in any modern datacentre. You do not get Spanner's "any transaction commits before any subsequent transaction starts, in real time" — you get "any transaction commits before any subsequent transaction that can prove a happened-before relationship starts" — and in 99% of workloads that is indistinguishable.
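The uncertainty check itself is a one-line comparison. A minimal sketch, assuming timestamps are (wall_micros, logical) tuples; the names here are illustrative, not CockroachDB's internal API:

```python
# Sketch of the uncertainty-interval rule: a read at timestamp T must treat
# any value in (T, T + maxOffset] as possibly earlier in real time (the
# writer's clock may simply be ahead of ours) and restart above it.

MAX_OFFSET_MICROS = 500_000   # default maxOffset: 500 ms

class UncertaintyRestart(Exception):
    def __init__(self, new_ts):
        self.new_ts = new_ts  # the transaction retries at or above this

def check_uncertainty(read_ts, value_ts):
    wall, logical = read_ts
    if read_ts < value_ts <= (wall + MAX_OFFSET_MICROS, logical):
        # The value could have been written before our read began in real
        # time, so returning "not found" would violate serialisability.
        raise UncertaintyRestart(value_ts)

check_uncertainty((1_000, 0), (900, 0))        # older value: no conflict
try:
    check_uncertainty((1_000, 0), (1_200, 0))  # within 500 ms above: uncertain
except UncertaintyRestart as e:
    assert e.new_ts == (1_200, 0)
```

Values more than maxOffset above the read timestamp are provably in the future on any clock in the cluster, so they are simply invisible — no restart needed.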
Write intents and the transaction record
CockroachDB's transaction protocol is conceptually similar to Percolator but engineered differently. Both leave tentative writes that readers must inspect; both have a single atomic commit point. The differences are: CockroachDB's tentative writes are write intents (a single-value MVCC tagged with a transaction ID), and the commit point is the transaction record, not the primary's lock.
A write intent is an MVCC key-value of the form (key, txn_provisional_ts) → (txn_id, value). It lives at the same Pebble layer as ordinary MVCC values, distinguished only by the presence of the txn_id tag. When any reader scans the key and finds an intent, it cannot just return the value — it does not know whether the tentative write committed.
The reader's job: look up the intent's txn_id, locate the transaction record for that txn, and read its status:
- COMMITTED → treat the intent as a real MVCC value, return it (and asynchronously resolve the intent, replacing the intent encoding with a plain MVCC value).
- ABORTED → ignore the intent; return the previous version. Asynchronously delete the intent.
- PENDING → either wait briefly for the txn to finish, or push the txn (raise its commit timestamp so this read can proceed at the lower timestamp), or restart this read at a higher timestamp.
The transaction record itself is just a small KV stored at a deterministic location: the start of the primary range (the range containing the first key the transaction wrote). It holds the txn ID, current status, current commit timestamp, and a list of all the spans the transaction has written to (so the resolver knows where to find intents). Like every other KV in CockroachDB, it lives in a Raft group and is replicated three times.

Why colocate the txn record on the primary range rather than a dedicated transactions service: every distributed-systems engineer who has tried to build a separate "transaction manager" service has discovered that it becomes a bottleneck and a single point of failure. By stashing the record on a normal range, the txn record gets the same Raft replication, the same load distribution, and the same failure handling as any other piece of data. Transactions whose primary range crashes recover when that range's Raft group elects a new leader.
Real Python pseudocode
Here is the transaction lifecycle, with intent encoding and the read path that handles intents. This is deeply simplified — real CockroachDB is many millions of lines of Go — but it captures the protocol.
import secrets, dataclasses

@dataclasses.dataclass
class Intent:
    txn_id: str
    value: bytes
    provisional_ts: tuple  # HLC (wall_micros, logical)

@dataclasses.dataclass
class TxnRecord:
    txn_id: str
    status: str            # "PENDING" | "COMMITTED" | "ABORTED"
    commit_ts: tuple | None
    written_spans: list

class CockroachClient:
    def __init__(self, gateway):
        self.gw = gateway
        self.txn = None
        self.primary_range = None

    def begin(self):
        txn_id = secrets.token_hex(8)
        start_ts = self.gw.hlc.now()  # bounds the txn's reads (elided below)
        # Transaction record is created lazily on first write,
        # on the range containing the first written key.
        self.txn = TxnRecord(txn_id, "PENDING", None, [])
        return self.txn

    def put(self, key, value):
        leader = self.gw.distribution.find_leaseholder(key)
        leader.write_intent(key, Intent(self.txn.txn_id, value,
                                        self.gw.hlc.now()))
        self.txn.written_spans.append(key)
        if len(self.txn.written_spans) == 1:
            # Primary range is the range containing the first key.
            self.primary_range = leader

    def commit(self):
        # Pick the final commit timestamp via HLC.
        commit_ts = self.gw.hlc.now()
        self.txn.commit_ts = commit_ts
        # Atomic flip on the primary range — this IS the commit point.
        self.primary_range.update_txn_record(self.txn.txn_id,
                                             status="COMMITTED",
                                             commit_ts=commit_ts)
        # Asynchronously resolve intents into real MVCC values.
        for span in self.txn.written_spans:
            leader = self.gw.distribution.find_leaseholder(span)
            leader.async_resolve_intent(span, self.txn.txn_id, commit_ts)

class WaitOrPushError(Exception):
    def __init__(self, record):
        self.record = record

def read_with_intent_handling(leader, key, read_ts):
    # lookup_txn_record (elided) fetches the TxnRecord from its primary range.
    val = leader.scan_mvcc(key, at=read_ts)
    if isinstance(val, Intent):
        record = lookup_txn_record(val.txn_id)
        if record.status == "COMMITTED" and record.commit_ts <= read_ts:
            return val.value  # treat intent as real
        elif record.status == "ABORTED":
            # Ignore the dead intent; read the previous version.
            return leader.scan_mvcc(key, at=read_ts, skip_intent=True)
        else:                 # PENDING, or committed above read_ts
            raise WaitOrPushError(record)
    return val
The crucial property: there is exactly one atomic write that decides the transaction's fate — the update_txn_record call. Everything before it is tentative; everything after it is just bookkeeping.
Worked example
A bank transfer crossing two ranges
You are running CockroachDB on a 3-node cluster in Mumbai for an Indian neobank. Account A (₹1500) lives on key acct/A, which lives in range R3. Account B (₹400) lives on key acct/B, which lives in range R7. R3's leaseholder is on node 2; R7's leaseholder is on node 3. The client connects to node 1.
The application issues BEGIN; UPDATE accounts SET bal=bal-500 WHERE id='A'; UPDATE accounts SET bal=bal+500 WHERE id='B'; COMMIT;.
Step 1. Client sends BEGIN to node 1 (the gateway). Node 1's HLC reads (1714062000.000123, 7) — physical microseconds, logical counter. It assigns this as start_ts. The transaction is in memory only at this point.
Step 2. The first UPDATE arrives. Node 1 parses it, computes bal=1000, and emits a Put(acct/A, 1000). The Distribution layer's range cache says: key acct/A lives in R3, leaseholder is node 2. Node 1 sends the Put (tagged with txn_id and provisional timestamp) to node 2. Node 2's R3 Raft group leader proposes a Raft entry that writes a write intent at (acct/A, 1714062000.000123, 7) → (txn_id=abc123, value=1000). The entry is replicated to node 1 and node 3 (R3's followers), committed via Raft, applied to Pebble. Node 2 returns success. Because this is the first write, R3 is now the primary range for txn abc123, and a transaction record (abc123, status=PENDING, commit_ts=None, spans=[acct/A]) is created on R3.
Step 3. The second UPDATE arrives. Node 1 emits Put(acct/B, 900). Range cache: R7, leaseholder node 3. The intent is written on node 3 via R7's Raft group. Node 1 updates the txn's tracked spans to [acct/A, acct/B].
Step 4. Client sends COMMIT. Node 1's HLC reads (1714062000.001891, 0); this becomes commit_ts. Node 1 sends an EndTxn(commit, txn_id=abc123, commit_ts=1714062000.001891) to R3 (the primary range). R3's Raft group atomically flips the transaction record to status=COMMITTED, commit_ts=1714062000.001891. At this instant, the transaction has committed. Node 1 acknowledges to the client.
Step 5 (asynchronous). Node 1 sends ResolveIntent requests to R3 (resolve acct/A) and R7 (resolve acct/B). Each leaseholder atomically replaces the intent encoding with a plain MVCC value: (acct/A, 1714062000.001891) → 1000 and (acct/B, 1714062000.001891) → 900. The intent records are deleted.
What if a reader hits an intent before resolution? Suppose between step 4 and step 5, another transaction reads acct/A at timestamp 1714062000.005000. It scans R3, finds the intent at acct/A, sees txn_id=abc123, looks up the txn record on R3, sees COMMITTED, commit_ts=1714062000.001891. Since commit_ts ≤ read_ts, the reader treats the intent as a real value and returns 1000. As a janitorial side-effect, it triggers intent resolution.
What if node 1 dies between step 3 and step 4? The transaction record on R3 is still PENDING. The intents on acct/A and acct/B sit there. After the configured timeout (default 5 seconds), any reader that touches one of those intents will push the txn — try to flip the txn record to ABORTED. If successful, the intents are cleaned up and the transaction is rolled back. The atomicity guarantee holds because no one ever observed any of the writes as committed.
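The push in that failure path can be sketched as a compare-and-set on the transaction record. A minimal model, assuming a plain dict as the record store and the 5 s expiry from the text; in the real system the flip is a Raft-replicated write on the primary range:

```python
# Sketch of abandoned-transaction recovery: a reader that trips over a
# stale PENDING intent tries to flip the record to ABORTED so the intents
# can be cleaned up. Structures and names are illustrative.

TXN_EXPIRY_SECONDS = 5.0

def push_txn(records, txn_id, now, last_heartbeat):
    rec = records[txn_id]
    if rec["status"] != "PENDING":
        return rec["status"]          # already decided; nothing to push
    if now - last_heartbeat < TXN_EXPIRY_SECONDS:
        return "PENDING"              # coordinator may still be alive: wait
    # Expired: atomically abort. No write was ever observable as committed,
    # so atomicity holds.
    rec["status"] = "ABORTED"
    return "ABORTED"

records = {"abc123": {"status": "PENDING"}}
assert push_txn(records, "abc123", now=3.0, last_heartbeat=0.0) == "PENDING"
assert push_txn(records, "abc123", now=6.0, last_heartbeat=0.0) == "ABORTED"
```

The race between a slow coordinator committing and a pusher aborting is decided by whichever write to the record wins in the primary range's Raft log — there is still exactly one decision point.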
CockroachDB in the wild
In India, CockroachDB shows up most often in fintech ledgers — payment platforms that need cross-AZ durability for regulatory reasons (the RBI's data localisation rules effectively require multi-zone deployments inside India), lending platforms whose accounting must be transactionally correct across services, and a few of the larger UPI processing layers. The deployment pattern is typically three nodes per region across three Indian AZs (Mumbai 1a / 1b / 1c, often), giving you survivability under a single-AZ failure with single-digit-millisecond cross-AZ latency.
The closest architectural cousin is TiDB, born at PingCAP in Beijing and widely deployed in Chinese fintech. TiDB layers a stateless SQL frontend over TiKV, a Multi-Raft Rust KV store; transactions use Percolator-style optimistic 2PC over TiKV, with a centralised Placement Driver (PD) playing the role of TSO and metadata service. Where CockroachDB embeds the transaction record on a normal range and uses HLC, TiDB explicitly hands out timestamps from PD and stores lock state in TiKV's column families. The two systems converge on similar guarantees from different starting points.
Going deeper
The above is the architecture every CockroachDB user should be able to draw on a whiteboard. Below are the corners that matter once you operate the system.
Closed timestamps and follower reads
The leaseholder is the natural place to serve consistent reads — it knows what is committed. But for analytical queries that are read-heavy and latency-sensitive, fan-out to followers helps. CockroachDB's solution is closed timestamps: the leaseholder periodically broadcasts to its followers "I will not commit anything with timestamp ≤ T_closed". A follower can then safely serve any read at timestamp ≤ T_closed without contacting the leaseholder. The closed-timestamp interval is configurable; default is around 3 seconds behind real time, which is fine for dashboards and reporting and unsuitable for read-your-writes.
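The follower-side rule is a single comparison against the last broadcast. A minimal sketch with invented names; the real mechanism piggybacks closed-timestamp updates on Raft traffic:

```python
# Sketch of the closed-timestamp rule: a follower may serve a read iff the
# read timestamp is at or below the leaseholder's last "nothing will commit
# <= T" broadcast; otherwise the request must go to the leaseholder.

class Follower:
    def __init__(self):
        self.closed_ts = 0.0   # last closed timestamp heard from leaseholder

    def on_closed_ts(self, t):
        self.closed_ts = max(self.closed_ts, t)   # only ever moves forward

    def can_serve(self, read_ts):
        return read_ts <= self.closed_ts

f = Follower()
f.on_closed_ts(100.0)
assert f.can_serve(95.0)       # safely in the closed past: serve locally
assert not f.can_serve(101.0)  # could still change: redirect to leaseholder
```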
Range splits and merges
A range exceeds 64 MB → the range's Raft group commits a SplitTrigger entry, which atomically creates two new Raft groups (one for the left half, one for the right) and updates the meta range. Each new range inherits the parent's three replicas. Merges are the inverse, triggered when adjacent ranges are both small and on the same set of nodes. Both operations are designed to be online — splits and merges happen continuously in production without operator intervention.
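The split decision reduces to "find a key near the size midpoint". A toy version of that bookkeeping, assuming per-key size accounting; the real split goes through a SplitTrigger committed in the range's own Raft log:

```python
# Sketch of the split check: when a range's data crosses the threshold,
# pick a key near the midpoint by accumulated size and produce two child
# spans [start, split_key) and [split_key, end). Illustrative only.

RANGE_MAX_BYTES = 64 * 1024 * 1024

def maybe_split(span, sizes):
    # sizes: ordered {key: bytes_used} for the keys in this span.
    total = sum(sizes.values())
    if total <= RANGE_MAX_BYTES:
        return [span]               # under threshold: leave the range alone
    running, split_key = 0, None
    for k, sz in sizes.items():
        running += sz
        if running >= total // 2:   # first key at or past the size midpoint
            split_key = k
            break
    return [(span[0], split_key), (split_key, span[1])]

sizes = {b"a": 30 * 2**20, b"m": 30 * 2**20, b"t": 10 * 2**20}  # 70 MB
assert maybe_split((b"a", b"z"), sizes) == [(b"a", b"m"), (b"m", b"z")]
assert maybe_split((b"a", b"z"), {b"a": 10 * 2**20}) == [(b"a", b"z")]
```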
Parallel commits
A 2019 optimisation: instead of doing the EndTxn flip after all intents are durably written, CockroachDB now does it in parallel. The transaction record records STAGING instead of PENDING and lists exactly which intents are pending. A reader that finds a STAGING record can independently verify that all listed intents exist; if so, the txn is implicitly committed even though the record never explicitly flipped. This shaves one round of consensus off the commit path — roughly halving commit latency for transactions that span ranges, which matters most on multi-region deployments.
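The recovery rule a reader applies to a STAGING record can be stated directly. A minimal sketch with illustrative structures; the real check also compares the intents' timestamps against the staged commit timestamp:

```python
# Sketch of the parallel-commits status rule: STAGING plus every listed
# in-flight write durably in place means the transaction is implicitly
# committed, even though the record was never explicitly flipped.

def txn_status(record, intent_exists):
    if record["status"] in ("COMMITTED", "ABORTED"):
        return record["status"]
    if record["status"] == "STAGING":
        # Every in-flight write landed: the commit is already decided.
        if all(intent_exists(k) for k in record["in_flight_writes"]):
            return "COMMITTED"
    return "PENDING"   # undecided: wait for, or push, the coordinator

rec = {"status": "STAGING", "in_flight_writes": ["acct/A", "acct/B"]}
assert txn_status(rec, {"acct/A", "acct/B"}.__contains__) == "COMMITTED"
assert txn_status(rec, {"acct/A"}.__contains__) == "PENDING"
```

Note the asymmetry: a missing intent does not prove the transaction failed — the writer may still be in flight — so the reader falls back to the ordinary wait-or-push path rather than declaring ABORTED.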
References
- CockroachDB architecture overview — the official layered diagram and per-layer documentation.
- Living without atomic clocks — the original 2016 Cockroach Labs blog post explaining why HLC + uncertainty intervals replace TrueTime.
- How CockroachDB does distributed, atomic transactions — the canonical write-up of intents, transaction records, and the commit path.
- CockroachDB: The Resilient Geo-Distributed SQL Database (Taft et al., SIGMOD 2020) — the academic paper covering the full design.
- Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases (Kulkarni et al., 2014) — the original HLC paper that CockroachDB implements.
- CockroachDB DistSQL design RFC — the gRPC-based distributed SQL execution flow design.