Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Calvin: deterministic concurrency

In 2012 a paper out of Yale, written by Daniel Abadi's group, made a claim that sounded almost reckless: distributed transactions do not need two-phase commit. They do not need any commit protocol at all. The trick — and the part that confused readers for a year — is that if every replica receives the exact same ordered list of transactions, and if every replica executes that list with a deterministic scheduler, then the replicas cannot diverge. There is nothing to commit because there is nothing to disagree about. The system, called Calvin, became the reference design for a small but fierce school of distributed databases — FaunaDB is the most prominent commercial descendant — and the conceptual frame that explains why some systems can do cross-shard transactions at single-shard cost. This chapter is about what determinism buys you, what it costs, and why most production systems still pick Percolator instead.

Calvin replaces two-phase commit with deterministic execution over a globally-agreed transaction sequence. A central sequencer batches incoming transactions and uses Paxos to replicate the batch order, not the data. Every replica then executes the same batch with the same scheduler, producing identical state by construction. The cost: the transaction's read/write set must be known before execution begins, which forces a reconnaissance phase for transactions that depend on data they will read.

What Calvin is solving — the cost of agreement-after-execution

Classical distributed transactions execute first and agree later. A 2PC transaction sends prepare to every shard, each shard locks its rows and votes, the coordinator collects votes and decides. The decision phase is on the critical path: the rows stay locked until the coordinator's commit lands, which means cross-shard latency directly translates into lock-hold time. If the coordinator crashes between prepare and commit, every participating shard is blocked holding locks until a recovery protocol runs. Percolator softens this by pushing the coordinator state into the primary row and making recovery passive, but it still pays two RPCs per row and still blocks readers behind in-flight transactions.

Calvin asks: what if you flipped the order? Agree on what is going to happen before anything happens. Then execution is just replay — every replica is a deterministic state machine consuming an ordered log, and there is no possibility of disagreement. There is no coordinator log, no two-phase prepare, no abort caused by a participant timing out. Locks are still held during execution, but only on the local node, and only for the duration of a single transaction's deterministic run — never across the network.

The trade is severe. To agree on what is going to happen, you must know the transaction's read and write sets up front. If a transaction reads row R and uses R's value to decide whether to update row S, the sequencer cannot pre-order it correctly without first running it speculatively. This is the dependent-transaction problem, and Calvin solves it with a protocol called OLLP (Optimistic Lock-Location Prediction) that runs a reconnaissance pass, captures the read set, then resubmits — and re-runs if the data shifted in between.

Figure: Calvin's three-layer architecture (illustrative). Clients submit transactions T1–T4 to a sequencer; the sequencer batches them into a 10 ms epoch, assigns a sequence number, and Paxos-replicates the batch order to every replica. Each replica's scheduler hands the identical batch to local storage workers, which execute it deterministically, producing identical state on R1, R2, and R3. Paxos agrees only on the batch order, not the data — no commit phase is needed because replicas cannot diverge.
Calvin's three layers: sequencer batches and orders, scheduler hands the ordered batch to storage workers, storage executes deterministically. The Paxos round runs once per batch on the sequence, not per transaction on the data.

The protocol — sequencing, scheduling, executing

A Calvin transaction T moves through three layers in order. The layers are physically separate processes; in Yale's reference implementation, each is its own thread pool sitting behind a small RPC interface.

Layer 1 — Sequencer. Clients submit transactions to any sequencer node. The sequencer accumulates submissions for a fixed epoch (the original paper used 10 ms; FaunaDB tunes this around 5–20 ms based on workload). At epoch end, the sequencer takes the entire batch, assigns it a globally-monotonic sequence number, and runs a Paxos round across all replicas to durably agree on the batch contents and its order. Once Paxos returns, the batch is committed to the global log — every replica will see this batch before it sees any later one.

Why batching is essential: Paxos pays a fixed round-trip cost (one accept-phase round) per agreement. Running Paxos per transaction would add a full RTT to every transaction's latency — at India-to-Singapore distances that is roughly 80 ms, which is uncompetitive. Batching amortises the cost: a 10 ms epoch holding 5000 transactions pays one Paxos RTT for the entire batch, so the per-transaction overhead is 80 ms ÷ 5000 = 16 µs. The trade is a fixed wait of up to 10 ms at submission. For OLTP workloads whose median latency is dominated by RTT anyway, this is a win.
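The amortisation arithmetic, spelled out (numbers from the text, purely illustrative):

```python
# Per-transaction consensus overhead under batching (illustrative numbers).
rtt_ms = 80.0           # e.g. an India-to-Singapore round trip
txns_per_epoch = 5000   # transactions accumulated in one 10 ms epoch

# One Paxos RTT is paid per batch, so each transaction's share is tiny.
per_txn_overhead_us = rtt_ms * 1000 / txns_per_epoch
print(f"{per_txn_overhead_us:.1f} µs of consensus cost per transaction")  # 16.0 µs

# Per-transaction Paxos, by contrast, would charge every transaction
# the full RTT: 80 ms each, a ~5000x difference in consensus overhead.
```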

Layer 2 — Scheduler. Once a replica receives a sequenced batch, the local scheduler reads the read/write sets declared in each transaction header, computes a serial order that respects all conflicting accesses (typically the same order as the sequence number), and dispatches each transaction to the local storage worker that owns the relevant rows. The scheduler holds local-only locks for the duration of the transaction, but those locks are never reached across the network — they protect intra-replica concurrency only.
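The conflict test the scheduler needs is small — a sketch, assuming read/write sets are declared as row-name lists (transaction IDs and rows are hypothetical):

```python
# Two transactions conflict iff one's write set intersects the other's
# read or write set; conflicting pairs must execute in sequence order,
# while non-conflicting transactions may run in parallel.

def conflicts(a, b):
    wa, wb = set(a['writes']), set(b['writes'])
    ra, rb = set(a['reads']), set(b['reads'])
    return bool(wa & (rb | wb) or wb & ra)

batch = [
    {'id': 'T1', 'reads': ['alice'], 'writes': ['alice', 'bob']},
    {'id': 'T2', 'reads': ['bob'],   'writes': ['bob']},
    {'id': 'T3', 'reads': ['carol'], 'writes': ['carol']},
]
pairs = [(a['id'], b['id'])
         for i, a in enumerate(batch) for b in batch[i + 1:] if conflicts(a, b)]
print(pairs)   # [('T1', 'T2')] -- T3 touches disjoint rows, so it can run in parallel
```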

Layer 3 — Storage. Each storage worker executes its assigned transaction deterministically. "Deterministic" here means no random choices, no wall-clock decisions, no thread-scheduling-dependent outcomes — the execution must be a pure function of (input batch, prior state). When the transaction touches rows on remote nodes, the local worker sends a remote read (just a value fetch, not a lock request), waits for the value, and continues. There is no remote write phase: every node knows from the read/write set which rows it owns, applies its writes locally, and the global state is the union.
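A minimal sketch of this execution style, assuming two nodes and a hypothetical transfer transaction: every participant runs the same deterministic logic over the same set of reads (the "remote read" exchange) and applies only the writes it owns.

```python
# Partitioned deterministic execution sketch (partitioning and txn are hypothetical).
# No remote write phase: each node applies its own slice of the write set.

class Node:
    def __init__(self, rows):
        self.store = dict(rows)
    def owns(self, k):
        return k in self.store
    def local_reads(self, read_set):
        # serve only the reads this node owns; peers fetch these over the wire
        return {k: self.store[k] for k in read_set if self.owns(k)}
    def execute(self, txn, all_reads):
        # same deterministic logic everywhere: transfer if funds allow
        if all_reads[txn['src']] >= txn['amount']:
            if self.owns(txn['src']): self.store[txn['src']] -= txn['amount']
            if self.owns(txn['dst']): self.store[txn['dst']] += txn['amount']

n1, n2 = Node({'alice': 1500}), Node({'bob': 200})
txn = {'src': 'alice', 'dst': 'bob', 'amount': 500}
# union of each node's local reads = the full read set, visible to both
reads = {**n1.local_reads(['alice', 'bob']), **n2.local_reads(['alice', 'bob'])}
for n in (n1, n2):
    n.execute(txn, reads)
print(n1.store, n2.store)   # {'alice': 1000} {'bob': 700}
```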

Why no commit phase is needed: every replica receives the same batch in the same order. Every replica's scheduler produces the same serial schedule. Every storage worker is deterministic. Therefore every replica's post-batch state is identical — there is nothing to vote on, nothing to abort, nothing to recover. Crash recovery is replaying the global log from a snapshot. A node that crashed during batch #4719 just re-applies #4719 from scratch when it comes back up; the result is identical to the surviving replicas because the function is deterministic.
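Recovery-by-replay can be sketched directly (toy log of deterministic increments; helper names are illustrative):

```python
# A crashed replica rebuilds identical state by re-applying the global
# log from a snapshot -- no recovery protocol, no coordinator to consult.
from collections import defaultdict

def apply_log(state, log):
    for batch in log:
        for key, delta in batch:      # deterministic ops only
            state[key] += delta
    return state

global_log = [[('alice', -500), ('bob', 500)], [('bob', 200)]]
snapshot = {'alice': 1500, 'bob': 200}

survivor  = apply_log(defaultdict(int, snapshot), global_log)
recovered = apply_log(defaultdict(int, snapshot), global_log)  # replay after crash
print(dict(survivor) == dict(recovered))   # True -- determinism is the recovery protocol
```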

Figure: OLLP — optimistic lock-location prediction for dependent transactions (illustrative). Step 1: the client submits T with no read/write set declared. Step 2: the sequencer runs a reconnaissance pass, executing T against a snapshot of state. Step 3: the client resubmits T with the captured read/write sets. Step 4: T enters a real batch and is Paxos-sequenced. Step 5: replicas execute T deterministically. If the data shifted between reconnaissance and execution, the deterministic execution detects the read-set mismatch and T restarts from step 1; the restart rate is the headline cost of OLLP under contention.
Dependent transactions need OLLP: a recon pass discovers the read-set, the client resubmits with the set declared, the real batch executes deterministically. If state shifted, the transaction restarts.

A working Calvin simulator

This is a reduced Calvin written as a single-process simulator. It models a sequencer, a single-replica scheduler, and a deterministic storage layer. Three sequencers contributing to the same global log would add Paxos on the batch boundaries; here we collapse Paxos to an in-memory queue to keep the artefact short and the determinism property visible. PaySetu's transactional ledger uses a Calvin-style design for cross-account transfers — the simulation here mirrors the shape of that design.

# Calvin simulator — sequencer batches, scheduler orders, storage executes deterministically.
# Demonstrates: replicas with same batch produce identical state. No commit phase.
from collections import defaultdict

class Calvin:
    def __init__(self, name):
        self.name = name
        self.state = defaultdict(int)               # row -> value
        self.log = []                               # batches we have applied

    def apply_batch(self, seq, txns):
        # Deterministic: txns are applied in their declared order, no concurrency.
        for txn in txns:
            reads = {r: self.state[r] for r in txn['reads']}  # read set; real Calvin feeds this to txn logic (unused in this sketch)
            for k, op, v in txn['writes']:
                if op == 'set':         self.state[k] = v
                elif op == 'add':       self.state[k] += v
                elif op == 'sub_if_ge':                 # deterministic conditional subtract
                    if self.state[k] >= v: self.state[k] -= v
        self.log.append(seq)

class Sequencer:
    def __init__(self, replicas, epoch_size=4):
        self.replicas = replicas
        self.epoch_size = epoch_size
        self.buffer, self.next_seq = [], 0

    def submit(self, txn):
        self.buffer.append(txn)
        if len(self.buffer) >= self.epoch_size: self._flush()

    def _flush(self):
        # In real Calvin: Paxos round here. We collapse to an ordered broadcast.
        batch = list(self.buffer); self.buffer.clear()
        seq = self.next_seq; self.next_seq += 1
        for r in self.replicas: r.apply_batch(seq, batch)

# DEMO — Riya transfers ₹500 alice→bob; Rahul deposits ₹200 to bob; Asha withdraws ₹300 from carol.
R1, R2, R3 = Calvin('R1'), Calvin('R2'), Calvin('R3')
for r in (R1, R2, R3): r.state.update({'alice': 1500, 'bob': 200, 'carol': 1000})
seq = Sequencer([R1, R2, R3])
seq.submit({'reads': ['alice', 'bob'], 'writes': [('alice', 'sub_if_ge', 500), ('bob', 'add', 500)]})
seq.submit({'reads': ['bob'],          'writes': [('bob', 'add', 200)]})
seq.submit({'reads': ['carol'],        'writes': [('carol', 'sub_if_ge', 300)]})
seq.submit({'reads': ['alice', 'bob'], 'writes': [('alice', 'add', 100), ('bob', 'sub_if_ge', 100)]})
print(f"R1 state: {dict(R1.state)}")
print(f"R2 state: {dict(R2.state)}")
print(f"R3 state: {dict(R3.state)}")
print(f"identical: {R1.state == R2.state == R3.state}")

Output (deterministic — every run prints exactly this):

R1 state: {'alice': 1100, 'bob': 800, 'carol': 700}
R2 state: {'alice': 1100, 'bob': 800, 'carol': 700}
R3 state: {'alice': 1100, 'bob': 800, 'carol': 700}
identical: True

A line-by-line walkthrough:

  • Calvin.apply_batch is the determinism kernel — it iterates the transactions in the exact order given, never spawns threads, never consults a clock, and uses only the operations set, add, and sub_if_ge (a deterministic conditional). Two replicas given the same batch produce the same state because the function is pure.
  • Sequencer._flush is where Paxos would live in production — a real implementation calls into Multi-Paxos or Raft to make the batch durable across replicas before broadcasting; here that collapses to an in-memory loop.
  • Sequencer.submit appends to the buffer and triggers a flush at epoch_size transactions; in production the trigger is a timer (the 10 ms epoch).
  • sub_if_ge is the deterministic version of "subtract if the balance is high enough" — note that it never aborts; it silently no-ops if the balance is too low. In real Calvin the storage worker can return a flag telling the client which transactions executed versus no-opped, but the state transition itself stays deterministic.
  • The DEMO submits four transactions; all three replicas land in identical post-state. The final check, R1.state == R2.state == R3.state, is the property the entire architecture exists to deliver.

Failure modes — what breaks deterministic execution

Calvin's elegance rests on three assumptions, and production teams who deploy it spend most of their tuning time on what happens when each assumption is violated.

Assumption 1 — every replica sees the same batch order. Sequencer replication uses Paxos or Raft; if the consensus leader is slow or partitioned, a new leader takes over, and the next batch may carry a higher sequence number while the old leader's batch is still being re-proposed. Calvin handles this with strict sequence-number ordering in the scheduler — a replica refuses to apply batch n+1 until it has applied batch n. If batch n is delayed, the replica stalls. The whole replica stalls, not just one transaction. This is why Calvin needs a fast, reliable consensus group on the sequencer path.
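The stall behaviour falls out of a few lines of scheduler logic — a sketch of strict in-order application (class name hypothetical):

```python
# A replica buffers out-of-order batches and applies strictly by sequence
# number; a missing batch stalls everything behind it.

class OrderedApplier:
    def __init__(self):
        self.next_seq, self.pending, self.applied = 0, {}, []
    def receive(self, seq, batch):
        self.pending[seq] = batch
        while self.next_seq in self.pending:     # drain in strict order
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

r = OrderedApplier()
r.receive(1, 'B1')   # arrives early: buffered, the whole replica stalls
print(r.applied)     # []
r.receive(0, 'B0')   # the gap fills: both batches apply, in order
print(r.applied)     # ['B0', 'B1']
```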

Assumption 2 — transactions are deterministic. Any non-determinism — random.random(), wall-clock-based decisions, hash-iteration order, floating-point fast-math — corrupts replica state. CalvinDB's stored procedures run in a sandboxed mini-language that bans non-deterministic primitives. FaunaDB takes a similar approach with FQL. For ad-hoc SQL, the sequencer must capture seeds and clocks at sequence-time and pass them as inputs to every replica.
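That last mitigation can be sketched as follows — the sequencer freezes the non-deterministic inputs once and ships them as part of the transaction, so every replica sees identical values (function names are illustrative):

```python
# Capture seeds and clocks at sequence time; replicas consume them as inputs.
import random, time

def sequence(txn):
    # the sequencer freezes the wobbly inputs into the transaction header
    txn['inputs'] = {'seed': random.getrandbits(64), 'now': time.time()}
    return txn

def execute(txn):
    # replica-local RNG seeded from the header: same seed, same draws
    rng = random.Random(txn['inputs']['seed'])
    return (rng.randint(0, 9), txn['inputs']['now'])

t = sequence({'op': 'lottery_pick'})
r1, r2 = execute(dict(t)), execute(dict(t))   # two replicas, same header
print(r1 == r2)   # True -- both replicas draw the same "random" number
```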

Assumption 3 — read/write sets are knowable. Declaring the sets up front is easy for static queries (UPDATE accounts SET balance = balance - 500 WHERE id = 'alice'). It fails for queries whose targets depend on data (UPDATE the_top_10_buyers ...). OLLP's reconnaissance pass solves this, but reconnaissance is not free — it is essentially executing the transaction once read-only, then re-executing it under the deterministic schedule. Under high contention, the re-execution may find the data has shifted and abort, restarting the whole cycle. The original Calvin paper measured a restart rate under 2% on TPC-C — but TPC-C is benign. Under heavier contention (a hot account row), restart rates climb quickly.

Why restart rate is the dominant cost knob: every OLLP restart pays a full reconnaissance pass plus a full deterministic execution — roughly 2× the work of a non-dependent transaction. At 1% restart rate the overhead is invisible; at 10% the throughput floor drops by ~10% because every tenth transaction is retrying. At 30% restart rate the system is in collapse, because each restart adds load to the very rows that caused the original conflict. Production teams running Calvin-style systems instrument restart rate as a top-line metric, not p99 latency, because restart rate is the leading indicator of meltdown.
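A toy retry model makes those numbers concrete (assuming each attempt restarts independently with probability p — real contention is correlated, so this is optimistic):

```python
# Expected executions per committed transaction under geometric retries.
def expected_attempts(p):
    # attempt succeeds with probability (1 - p); mean of a geometric distribution
    return 1.0 / (1.0 - p)

for p in (0.01, 0.10, 0.30):
    print(f"restart rate {p:.0%}: {expected_attempts(p):.2f}x work per commit")
# 1%: ~1.01x (invisible)   10%: ~1.11x   30%: ~1.43x -- and in practice worse,
# because each restart re-touches the contended rows and pushes p higher.
```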

Common confusions

  • "Calvin is just Raft / Paxos with extra steps." No. Raft and Paxos are agreement machinery, and Calvin uses Paxos as a building block — the difference is what gets agreed on, and when. A conventional replicated store agrees entry by entry as operations commit; Calvin agrees once per epoch on a pre-execution schedule of whole transactions, then executes deterministically with no further coordination. The replication boundary is in a different place.
  • "Deterministic execution means single-threaded." Not necessarily. Within a batch, transactions whose read/write sets do not overlap can run in parallel — the scheduler computes a conflict graph from the declared sets and parallelises non-conflicting work. The determinism is in the commit order of conflicting operations, not in the execution timeline.
  • "Calvin avoids 2PC and avoids locks entirely." It avoids 2PC, not locks. A storage worker still acquires local row locks during execution to serialise concurrent batches' workers — but those locks are local, short-lived, and never reach across the network. The "no 2PC" claim is about the absence of a cross-shard prepare/commit handshake.
  • "Calvin needs a single sequencer, which is a bottleneck." No — Calvin scales sequencers horizontally. Each sequencer node accepts client submissions independently; the global log is the deterministic merge of all sequencer streams, ordered by epoch. The Paxos round runs across sequencer nodes per epoch, not per submission.
  • "FaunaDB is Calvin." FaunaDB is inspired by Calvin and shares the determinism-over-replicated-log idea, but the implementation diverges substantially — FaunaDB uses temporal queries, an interval-tree-based optimistic concurrency layer, and a JIT-compiled FQL interpreter. The core insight is shared; the engineering is different.

Going deeper

The Yale paper and its quiet impact

Thomson, Diamond, Weng, Ren, Shao, and Abadi published Calvin: Fast Distributed Transactions for Partitioned Database Systems at SIGMOD 2012. The paper's headline number was 500,000 TPC-C NewOrder transactions/sec on 100 nodes — at a time when Spanner had not yet been published and most distributed databases struggled past 10,000 cross-shard transactions/sec. The deeper contribution was conceptual: the paper drew a clean line between agreement and execution, and showed that you could spend your consensus budget exclusively on the former. Deterministic execution has earlier roots in H-Store (the research ancestor of VoltDB), but most deterministic-database work since — FaunaDB, the more recent SLOG paper from the same group — traces directly back to Calvin's architecture diagram.

OLLP and the dependent-transaction trap

OLLP — Optimistic Lock-Location Prediction — is the protocol that handles transactions whose read/write sets depend on data. The reconnaissance pass executes the transaction against a database snapshot to discover the actual sets. Then the transaction is resubmitted with those sets declared, sequenced, and executed deterministically. If, between reconnaissance and execution, the data has shifted such that the declared sets are wrong, the deterministic execution detects the mismatch — typically by checking that the values read during execution match what reconnaissance saw — and the transaction is aborted and restarted. The cost is the restart rate, which is workload-dependent. CricStream's paywall-validation transactions on a Calvin-derived store run at a 0.4% restart rate during steady traffic but climbed to 8% during the 2024 IPL final's coupon-redemption spike, where 25M users were hammering 200 promo-code rows simultaneously. The fix was application-level: rather than read-modify-write the promo-code row, a separate counter row was incremented atomically (a deterministic add operation needs no read), and the redemption-status row was written separately. Restart rate dropped to 0.1%.
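The OLLP loop can be sketched in a few lines (hypothetical promo-code row; a real implementation validates the full read set, typically by version, not value):

```python
# OLLP sketch: recon discovers the reads, execution re-validates them,
# a mismatch forces a restart from the reconnaissance step.
db = {'promo_42': 100}                        # hypothetical hot promo-code row

def recon(db):
    # steps 1-2: read-only pass against a snapshot discovers the read set
    return {'promo_42': db['promo_42']}

def execute(db, declared_reads):
    # step 5: deterministic execution re-checks the declared reads
    if any(db[k] != v for k, v in declared_reads.items()):
        return 'RESTART'                      # data shifted since recon
    db['promo_42'] = declared_reads['promo_42'] - 1
    return 'OK'

reads = recon(db)
db['promo_42'] = 99                           # a concurrent redemption lands first
first = execute(db, reads)                    # 'RESTART' -- read-set mismatch
reads = recon(db)                             # back to step 1 with fresh reads
second = execute(db, reads)                   # 'OK' -- promo_42 is now 98
print(first, second, db['promo_42'])
```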

Why most production systems still pick Percolator over Calvin

Calvin is provably faster for cross-shard transactions and provably simpler at the protocol level — yet most production systems pick Percolator or Spanner-style 2PC. The reasons are mundane and operational. (1) Determinism constrains the application: no now(), no random, no UDFs that touch the filesystem. Every team that adopts Calvin first audits their stored procedures and discovers ten violations they did not know about. (2) The dependent-transaction restart rate is workload-sensitive, and unpredictable restarts make latency SLOs harder to hit. (3) The sequencer is on the critical path: a 10 ms epoch adds 5 ms of median latency that is hard to explain to a team migrating from MySQL, where the median is 200 µs. (4) Operational tooling — monitoring, backup formats, query analysers — has decades of MySQL/Postgres lineage and almost zero Calvin lineage. PaySetu evaluated Calvin in 2022 for their core ledger; the technical fit was strong, but the migration cost (rewriting 1200 stored procedures into the deterministic dialect) was estimated at 18 person-years, and they shipped on TiKV/Percolator instead.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
# (no external deps — the simulator uses only the standard library)
python3 calvin_sim.py

The simulator above is self-contained; copy it into calvin_sim.py and run it. Try changing the operation order in seq.submit(...) calls and observe that all three replicas remain in lockstep regardless of the schedule, because the schedule is the same on every replica.

Where this leads next

Calvin is one of three structurally distinct answers to the cross-shard transaction problem: 2PC over a coordinator log, Percolator's primary-pointer recovery, and Calvin's deterministic-replay-over-Paxos-batches. Each has a regime where it dominates: 2PC is the textbook default and the right choice for traditional RDBMS; Percolator wins when transactions are short, the read/write sets are unpredictable, and the storage layer is already a sharded KV; Calvin wins when transactions are stored procedures (read/write sets are known ahead of time) and global throughput matters more than per-transaction latency.

The next chapters cover /wiki/truetime-spanner-and-physical-logical-hybrids — Spanner's TrueTime-augmented variant of 2PC — and /wiki/sagas-forward-and-compensating — the alternative when even 2PC is too synchronous.

References