In short
Two-phase commit is expensive because the participants discover each other's decisions at runtime. Each shard runs its work, votes, waits for the verdict, holds locks across the round-trip, and risks blocking if the coordinator dies. The expense is fundamentally about non-determinism — different replicas of the same shard can reach different decisions if they receive operations in different orders, so something has to coordinate them online.
Calvin's thesis is to delete that non-determinism. Decide the global order of every transaction before anyone touches the data; then let every replica of every shard execute that ordered sequence independently. If the input is the same and the execution is deterministic, the output is the same — no votes needed.
The architecture is two layers stacked on top of each other. A sequencer layer receives client transactions, batches them into 10-100ms windows called epochs, and uses Paxos to globally agree on the contents and order of each epoch. The output is a totally-ordered sequence of transaction batches that every replica will see identically. A storage layer sits underneath: each replica reads the global sequence, looks up which shards each transaction touches, applies the read/write set, and commits. No PREPARE, no votes, no in-doubt state.
The trick is the read/write set: every transaction must declare its full set of keys up front. For "transfer ₹500 from A to B", the set is {A, B} and you know it before you start. For "transfer ₹500 from A to whichever account the user picks based on this query result", you don't — and Calvin's answer is OLLP (Optimistic Lock Location Prediction): do a reconnaissance read first, predict the set, submit the transaction, and re-run if the prediction was wrong.
The cost is upfront declaration. The win is throughput: the original Calvin paper reports 500K+ transactions per second on commodity hardware, because there is no synchronous cross-shard coordination on the hot path. Spanner uses 2PC over Paxos and pays the latency for it. Calvin avoids 2PC entirely and pays in expressivity. FaunaDB is the production system built on this idea, with Daniel Abadi, one of Calvin's authors, advising its design.
You read chapter 109 and learned 2PC. You read chapter 110 and saw why it blocks. You read chapter 112 and saw how Spanner makes 2PC production-safe by replicating the coordinator with Paxos and synchronising clocks with TrueTime. Now this chapter shows you the other branch of the family tree — the systems that decided 2PC was the wrong protocol to make non-blocking and instead designed around it entirely.
Calvin came out of Daniel Abadi's lab at Yale in 2012. The paper, Calvin: Fast Distributed Transactions for Partitioned Database Systems, is a short, sharp piece of database research that made one assertion and built around it: non-determinism is the source of distributed-commit pain, and you can engineer it out of the system if you accept a constraint on what transactions look like. The constraint is the upfront read/write set; the win is that 2PC disappears.
Today you build the architecture, see why determinism removes the need for runtime voting, and write a tiny sequencer + worker that demonstrates the property end-to-end on three shards.
Why 2PC exists in the first place
Step back to the protocol you wrote in chapter 109. The bank transfer of ₹500 from A on shard 1 to B on shard 2 needed PREPARE, votes, COMMIT — a synchronous round-trip — because each shard could independently fail to commit. Shard 1 might find A's row locked by another transaction. Shard 2 might fail a constraint check. The two shards have to agree on whether the transaction commits, and the only way to discover whether they all agreed is to ask them.
Now ask: why are the shards making independent decisions at all? Because they received the operations as they arrived, in some order that depended on network timing, process scheduling, and which clients connected to which proxies. If two replicas of shard 1 receive transactions T1 and T2 in different orders, they can reach different states. The coordination protocol exists to mask the non-determinism — to make sure every shard agrees on whether T1 committed before T2 or after.
Why this framing matters: 2PC is solving the symptom. The disease is that replicas can disagree about state because their inputs arrive in different orders. If the inputs arrive in the same order on every replica, and the execution is deterministic, the replicas are bit-for-bit identical and there is nothing to vote about. The work of 2PC moves to the work of agreeing on the input order. That work, you can do once per batch instead of once per transaction.
This is the Calvin insight. Push the agreement up to the input layer; do it for batches of hundreds of transactions at once; and let the storage layer execute deterministically with zero runtime coordination.
The two-layer architecture
The system splits cleanly into two layers that share nothing except the batch sequence between them.
The sequencer layer is a small, replicated service whose entire purpose is to produce a totally-ordered log of transaction batches. It receives transaction submissions from clients, buffers them in memory for a configurable epoch (typically 10ms in the original Calvin paper, sometimes longer in production deployments), and at the end of each epoch runs a Paxos round to agree with its peers on what was in the batch and what its sequence number is. Every sequencer node ends each epoch with the same batch contents and the same sequence number for that batch.
The output is a stream that conceptually looks like:
Batch 0: [T1, T2, T3, ..., T87]
Batch 1: [T88, T89, ..., T160]
Batch 2: [T161, ..., T243]
...
Every replica everywhere sees this exact same stream in this exact same order. That is the load-bearing property.
The storage layer is the actual database — partitioned across shards as usual, with each shard replicated for fault tolerance. Each replica has one job: tail the batch stream, and for each transaction in each batch, look at the read/write set, see which keys belong to this shard, perform the work, and move on. There is no PREPARE, no vote, no acknowledgement to a coordinator. The replica trusts that every other replica is doing the exact same thing in the exact same order, and so by induction they are all in the same state.
Why this works: in the sequenced model, a transaction's "decision" is made when it enters a batch. Once it's in the batch, every replica of every relevant shard will execute it. There's no concept of a shard "rejecting" the transaction at execution time — the batch is the contract. If a transaction would have failed a check (insufficient balance, missing row), that's encoded in the deterministic execution: every replica observes the same failed check, every replica records the same "failed" outcome. Failure is part of the deterministic computation, not a coordination signal.
Deterministic execution — the heart of it
The storage layer's design discipline is determinism. Anything that could cause two replicas to diverge given the same input has to be eliminated:
- No wall-clock time inside transaction logic. If the transaction needs a timestamp, it gets one assigned by the sequencer (the same value every replica sees).
- No randomness without a seed from the batch. If a transaction uses a random number, the sequencer or the batch supplies the seed.
- No reading from outside services without recording the response in the batch. External calls have to be done at sequencing time (or via the OLLP reconnaissance step) so every replica gets the same answer.
- Deterministic ordering of operations within a transaction. No iteration over hash maps with non-deterministic order; sorted iteration only.
These are the same disciplines you'd impose on any state-machine-replication system. The Calvin paper is explicit that this constraint is the entry tax for the throughput win.
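A toy illustration of the discipline, with hypothetical names: the RNG seed and timestamp ride in the transaction record as assigned by the sequencer, and iteration over state is sorted, so two replicas that start from the same state converge exactly:

```python
import random

def apply_interest(state, txn):
    """Deterministic transaction body: no wall clock, seeded RNG,
    sorted iteration over the state dict."""
    rng = random.Random(txn["seed"])     # seeded => same draws on every replica
    as_of = txn["sequencer_ts"]          # sequencer-assigned, same value everywhere
    audit = []
    for key in sorted(state):            # sorted, never raw dict order
        state[key] = int(state[key] * 1.01)  # same arithmetic on every replica
        if rng.random() < 0.1:           # same sample on every replica
            audit.append((key, as_of))
    return audit

# Two "replicas" whose dicts were built in different insertion orders
# still converge bit-for-bit, because only the sorted order matters.
s1 = {"B": 100, "A": 200}
s2 = {"A": 200, "B": 100}
t = {"seed": 42, "sequencer_ts": 1700000000}
a1, a2 = apply_interest(s1, t), apply_interest(s2, t)
assert s1 == s2 and a1 == a2
```

The same pattern generalises: any input a transaction needs that is not already in replicated state must travel inside the batch.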
For multi-shard transactions — say T7 reads a row on shard A and writes a row on shard B — the deterministic execution is more interesting. Both shards know T7 is in the batch. Shard A reads the row and sends the value to shard B (a single one-way message, no ack needed). Shard B waits for the read result, performs the write, commits. There is no abort path: if T7 declared its read/write set correctly and the data is there, it executes; if a check inside T7 fails (insufficient balance), every replica computes the same failure deterministically and records the same "did not commit due to check" outcome. Either way, no voting.
Why this is fundamentally different from 2PC: in 2PC, shard B has to wait for shard A's vote before it can know whether to commit, because shard A might independently abort. In Calvin, shard A cannot independently abort — its participation is decided by the batch. Shard B knows that if it received the read value from A, A is going to apply T7's writes locally with the same result. The one-way message replaces the round-trip vote.
Why 2PC becomes unnecessary
Walk through what 2PC was protecting against, one concern at a time, and ask whether Calvin still has it.
"What if shard B votes NO because B's row is locked?" In Calvin, locks aren't a coordination mechanism — they're a within-shard concurrency-control detail. The sequencer ordered T7 before T8; shard B knows T7 must be applied first; T8's locks have no effect on T7. There is no "vote NO due to conflict".
"What if shard B finds the balance would go negative?" That's a deterministic failure of the transaction's body. Every replica observes the same failed check. The transaction is recorded as "executed but did not modify state because the check failed" — uniformly across every replica. No voting needed; the failure is a deterministic computation, not a coordination signal.
"What if the coordinator crashes mid-protocol?" There is no coordinator in Calvin. The sequencer layer is replicated by Paxos, so the batch log itself is durable and ordered without a single point of failure. A storage replica that crashes just resumes from its last checkpoint and replays the batch log forward.
"What about isolation — concurrent transactions stepping on each other?" Inside a shard, the storage layer applies transactions strictly in batch order, so they're serialised by construction. Across shards, the global batch order is the serial order: every replica will agree that T7 came before T8 because that's what the batch says.
The cost is paid once per epoch instead of once per transaction. The Paxos round to commit a batch is the same Paxos round that 2PC's consensus-replicated coordinator would run, but it's amortised across hundreds of transactions in the batch instead of being run for each one. That is where the 500K txns/sec number in the original paper comes from.
The catch — read/write sets must be known up front
The sleight of hand is that the sequencer can only put a transaction into a batch if it knows which shards the transaction will touch. For transfers between two known accounts, this is trivial: RW = {A, B}. For "find the user with email X and credit their account ₹100", you don't know up front which shard the user is on.
Calvin's answer is OLLP — Optimistic Lock Location Prediction. Before submitting the transaction for sequencing, the client (or a helper layer) does a reconnaissance read: it executes the read parts of the transaction without any locks or guarantees, predicts the read/write set from the result, and submits the transaction with that predicted set. The sequencer schedules the transaction. When the storage layer executes it, the first thing each shard does is verify that the read still gives the same result; if not, the transaction is treated as a "miss", aborted deterministically (every replica agrees), and the client is told to re-submit. On the second try, the prediction is usually right because it was just freshly read.
The OLLP discipline turns a fundamental "we don't know what to lock" problem into a probabilistic re-try loop. Most production workloads — banking, e-commerce, gaming — turn out to have predictable read/write sets, so OLLP misses are rare. But if your workload is dominated by transactions whose access patterns depend heavily on what they read, Calvin is a poor fit and you want something like Spanner.
Why this restriction exists: the sequencer needs to know which shards a transaction touches in order to deliver the batch to the right places. If shard C doesn't see T17 in its incoming batch, shard C can't apply T17 — and if T17 turns out to need a key on shard C, the system is broken. The read/write set tells the sequencer where to deliver. OLLP is the bridge for transactions whose set is data-dependent.
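The delivery logic the declared set enables can be sketched directly. This assumes a toy partition function (`shard_of`, hypothetical) and transactions that declare a flat list of keys (the sketch later in this chapter keys the set by shard instead; either form works):

```python
def shard_of(key):
    # Toy partitioning: the first character of the key names the shard.
    return key[0]

def route(batch):
    """Split one global batch into per-shard delivery lists using the
    declared read/write sets. Order within each list is preserved, so
    every shard sees its transactions in global batch order."""
    per_shard = {}
    for txn in batch["txns"]:
        shards = {shard_of(k) for k in txn["rw_set"]}
        for s in shards:
            per_shard.setdefault(s, []).append(txn)
    return per_shard

batch = {"seq": 0, "txns": [
    {"id": 1, "rw_set": ["A1", "A2"]},   # single-shard
    {"id": 2, "rw_set": ["A1", "B1"]},   # two-shard
]}
routed = route(batch)
assert [t["id"] for t in routed["A"]] == [1, 2]
assert [t["id"] for t in routed["B"]] == [2]
```

If `rw_set` under-declares, `route` never delivers the transaction to the missing shard, which is exactly the failure mode OLLP exists to repair.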
Real Python — sequencer and per-shard worker
Here is a minimal sketch. A Sequencer collects incoming transactions for an epoch, then publishes them via a stub Paxos. Each Worker represents one shard's replica: it pulls the batch sequence and executes the operations that touch its shard.
```python
# sequencer.py
import time, threading

class PaxosLog:
    """Stand-in for a real Paxos-replicated log. In production this is
    Multi-Paxos or Raft; here it's a thread-safe append-only list shared
    by the sequencer and all workers."""
    def __init__(self):
        self._entries = []
        self._cond = threading.Condition()

    def append(self, batch):
        with self._cond:
            self._entries.append(batch)
            self._cond.notify_all()

    def read_from(self, index):
        with self._cond:
            while len(self._entries) <= index:
                self._cond.wait()
            return self._entries[index]

    def length(self):
        with self._cond:
            return len(self._entries)

class Sequencer:
    def __init__(self, log, epoch_ms=20):
        self.log = log
        self.epoch_ms = epoch_ms
        self.buffer = []
        self._lock = threading.Lock()
        self._stop = False

    def submit(self, txn):
        # txn = {"id": ..., "rw_set": {"A": [...], "B": [...]}, "ops": [...]}
        with self._lock:
            self.buffer.append(txn)

    def run(self):
        next_seq = 0
        while not self._stop:
            time.sleep(self.epoch_ms / 1000.0)
            with self._lock:
                batch_txns, self.buffer = self.buffer, []
            if not batch_txns:
                continue
            # Order within the batch is deterministic — sorted by txn id.
            # In real Calvin, multiple sequencer nodes interleave their
            # buffers via the Paxos round itself.
            batch_txns.sort(key=lambda t: t["id"])
            batch = {"seq": next_seq, "txns": batch_txns}
            self.log.append(batch)  # <-- Paxos append (atomic, durable)
            next_seq += 1
```
The worker:
```python
# worker.py
import json, os

class Worker:
    def __init__(self, shard_name, log, state, checkpoint_path):
        self.shard = shard_name
        self.log = log
        self.state = state  # {key: value} for keys this shard owns
        self.checkpoint_path = checkpoint_path
        self.next_index = self._load_checkpoint()

    def _load_checkpoint(self):
        if not os.path.exists(self.checkpoint_path):
            return 0
        with open(self.checkpoint_path) as f:
            cp = json.load(f)
        self.state = cp["state"]
        return cp["next_index"]

    def _save_checkpoint(self):
        with open(self.checkpoint_path, "w") as f:
            json.dump({"next_index": self.next_index, "state": self.state}, f)
            f.flush()
            os.fsync(f.fileno())  # durable before we rely on it for recovery

    def apply_one(self, txn):
        # Only execute ops that touch keys in this shard
        ops_for_us = [op for op in txn["ops"] if op["key"] in self.state]
        if not ops_for_us:
            return
        # Deterministic check: balance-non-negative for transfers
        for op in ops_for_us:
            if op["type"] == "add":
                projected = self.state[op["key"]] + op["delta"]
                if projected < 0:
                    return  # deterministic "no-op due to failed check"
        for op in ops_for_us:
            if op["type"] == "add":
                self.state[op["key"]] += op["delta"]

    def run(self):
        while True:
            batch = self.log.read_from(self.next_index)
            for txn in batch["txns"]:
                self.apply_one(txn)
            self.next_index += 1
            if self.next_index % 10 == 0:  # checkpoint every 10 batches
                self._save_checkpoint()
```
A few things to note about this code.
Where the determinism lives. apply_one reads only self.state and txn. There is no clock, no random, no external service. Two workers handed the same batch sequence and starting from the same checkpoint produce bit-identical state.
Where the durability lives. The Paxos log is the source of truth for ordering and for what happened. The worker checkpoints periodically (every 10 batches in this sketch) so that on crash it can resume by reloading the checkpoint and replaying the log forward from next_index. There is no PREPARE record because there is nothing to prepare — the batch's presence in the Paxos log is the commit.
What is not shown. Multi-shard transactions need workers to exchange read values (for transactions that read on shard A, write on shard B). The Calvin paper specifies a one-way send: each worker computes its locally-owned reads and sends them to the workers that need them as writes. The receiving worker waits, then applies. No ACKs, no votes — the sender knows the recipient will receive and use the value because the batch contract guarantees execution.
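A minimal sketch of that one-way exchange, using an in-process queue as the stand-in for the network channel (the `channel` helper and the field names are illustrative, not from the paper):

```python
import queue, threading

# One-way channels keyed by (txn_id, key): sender puts, receiver gets. No ACK.
channels = {}
def channel(txn_id, key):
    return channels.setdefault((txn_id, key), queue.Queue(maxsize=1))

def run_shard_a(state, txn):
    # Shard A owns the read: compute it locally and forward one-way.
    value = state[txn["read_key"]]
    channel(txn["id"], txn["read_key"]).put(value)  # fire and forget

def run_shard_b(state, txn):
    # Shard B owns the write: wait for A's value, then apply deterministically.
    value = channel(txn["id"], txn["read_key"]).get()  # blocks; no vote, no ack
    state[txn["write_key"]] = value * 2

txn = {"id": 7, "read_key": "A1", "write_key": "B1"}
a_state, b_state = {"A1": 21}, {"B1": 0}
tb = threading.Thread(target=run_shard_b, args=(b_state, txn))
tb.start()
run_shard_a(a_state, txn)
tb.join()
assert b_state["B1"] == 42
```

The asymmetry is the point: A never learns whether B applied the write, and doesn't need to, because the batch contract guarantees B will.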
Worked example — 100 transactions across three shards, with a crash
The setup. Shards A, B, C each own one third of the keyspace. We'll seed A with accounts A1, A2, A3 (each ₹1000), B with B1, B2, B3 (each ₹1000), C with C1, C2, C3 (each ₹1000). Two replicas per shard. One sequencer with a 20ms epoch.
Over 50ms, 100 client transactions arrive. They're a mix:
- 60 single-shard transfers (e.g. A1 → A2 ₹50)
- 30 two-shard transfers (e.g. A1 → B1 ₹100)
- 10 three-shard transfers (e.g. A1 → B1 → C1 ₹50 round-robin)
Each transaction has a known read/write set (we're not using OLLP for this example). Submission order is roughly chronological but with some jitter from network latency.
```python
import threading, time

log = PaxosLog()
seq = Sequencer(log, epoch_ms=20)
threading.Thread(target=seq.run, daemon=True).start()

# Spin up replicas
shard_A_state = {"A1": 1000, "A2": 1000, "A3": 1000}
shard_B_state = {"B1": 1000, "B2": 1000, "B3": 1000}
shard_C_state = {"C1": 1000, "C2": 1000, "C3": 1000}

A1 = Worker("A", log, dict(shard_A_state), "ckpt_A1.json")
A2 = Worker("A", log, dict(shard_A_state), "ckpt_A2.json")
B1 = Worker("B", log, dict(shard_B_state), "ckpt_B1.json")
B2 = Worker("B", log, dict(shard_B_state), "ckpt_B2.json")
C1 = Worker("C", log, dict(shard_C_state), "ckpt_C1.json")
C2 = Worker("C", log, dict(shard_C_state), "ckpt_C2.json")
for w in [A1, A2, B1, B2, C1, C2]:
    threading.Thread(target=w.run, daemon=True).start()

# 100 client submissions over 50ms
for i in range(100):
    txn = generate_random_txn(i)  # builds rw_set + ops (helper not shown)
    seq.submit(txn)
    time.sleep(0.0005)  # 0.5ms between submissions
```
What happens, traced over 50ms of wall time:
t = 0 to 20ms. Roughly 40 client transactions arrive. The sequencer buffers them. At t=20ms the epoch ends. Sequencer sorts buffered transactions deterministically (by id), wraps them as Batch 0 = {seq: 0, txns: [T1, T2, ..., T40]}, and appends to the Paxos log.
t = 20ms (just after). All six workers are blocked in log.read_from(0). The append wakes them. Each worker reads Batch 0, iterates the 40 transactions in order. For each, it applies only the ops touching its own shard. Worker A1 and A2 both compute the same final state for shard A; B1 and B2 the same for B; C1 and C2 the same for C. No messages between workers (in this single-shard subset of the example).
t = 40ms. Another epoch closes with the next 40 transactions. Same dance. Batch 1 is appended to the log. Workers read it.
t = 50ms. The last 20 transactions arrive late and end up in Batch 2.
t = 60ms. All three batches have been applied by all six workers. Replicas of the same shard are bit-identical. The system has executed 100 distributed transactions with zero PREPARE messages, zero votes, zero in-doubt locks.
Now the failure case. Suppose worker B1 crashes at t=42ms — partway through processing Batch 1, after applying half the txns it cares about but before the others. What happens?
- B1's last durable checkpoint holds the post-Batch 0 state. (For this trace, assume a checkpoint landed at t≈21ms after Batch 0; with the sketch's every-10-batches cadence there would be no checkpoint yet and B1 would simply replay from Batch 0 instead — equally correct, just slower.)
- B1 restarts. _load_checkpoint reads the file: next_index = 1, state is the post-Batch 0 state.
- B1's run loop calls log.read_from(1). The Paxos log still has Batch 1 (and now Batch 2). B1 reads Batch 1, applies it deterministically — same code, same input, same starting state — and arrives at the same post-Batch-1 state that B2 already reached. B1 then reads Batch 2, applies it, and is fully caught up.
- During B1's downtime, nothing was blocked. B2 served reads for shard B; the sequencer kept appending; A1, A2, C1, C2 kept executing. B1 just had to catch up on its own time.
Compare this to 2PC. In 2PC, if a participant crashes after voting YES, every other participant in every transaction it was prepared for is stuck holding locks waiting for the verdict. In Calvin, B1's crash affected nothing outside B1, because B1's role was purely to execute a known input — its absence delayed nothing else.
Verifying the determinism. After all batches are applied, you can dump the state of A1 and A2 and check A1.state == A2.state and B1.state == B2.state and C1.state == C2.state. They will be byte-identical, because they computed the same function on the same input. This is the property the entire system was designed around.
Where Calvin sits in the family tree
Calvin and Spanner are the two production-relevant answers to "make distributed transactions work at scale", and they answered the question in opposite directions.
Spanner kept 2PC and made it survivable: replicate the coordinator with Paxos so it can never block forever, use TrueTime to assign globally monotonic commit timestamps so external consistency holds, pay the wide-area latency for every cross-region transaction. The strength is that transactions can be expressive — read what you read as you read it, no upfront declaration. The cost is latency: every cross-region commit pays one or more round-trips to TrueTime quorums and to the coordinator's Paxos group.
Calvin deleted 2PC by making execution deterministic at the cost of upfront read/write set declaration. The strength is throughput and zero in-doubt state. The cost is expressivity: data-dependent transactions need OLLP, which is a probabilistic optimisation, not a clean abstraction.
FaunaDB, which counted Daniel Abadi (the lead author of the Calvin paper) as an adviser, is the production system built on Calvin's ideas. FaunaDB ships a query language (FQL) whose semantics make read/write sets statically determinable for most queries, so the OLLP fallback is rare. FaunaDB also extends Calvin with multi-region replication and the ability to do cross-region reads at a snapshot timestamp without a round-trip — they get linearisable reads from the same epoch ordering that drives writes.
Other systems borrowing from Calvin. Follow-up systems from the same research group (BOHM, PVW) and a number of later designs reuse the deterministic-ordering pattern. It shows up wherever the workload is throughput-dominated and the read/write set is knowable.
The two-line takeaway: Spanner pays per-transaction coordination latency to keep the programming model unconstrained. Calvin pays a one-time programming-model constraint to delete per-transaction coordination latency. Both are correct designs; they sit at different points on the throughput/expressivity tradeoff curve.
Common confusions
"Calvin doesn't use Paxos." It absolutely does — the sequencer layer is Paxos-replicated. The difference from Spanner is that Calvin runs Paxos once per batch of hundreds of transactions rather than once per transaction. The throughput win is amortisation, not an absence of consensus.
"Determinism just means same code." Determinism means same code and same inputs and no side effects from non-deterministic sources (clocks, randomness, concurrency, hash iteration order, network). Building a deterministic execution layer is a real engineering discipline; you can't just write normal code and assume it.
"OLLP is a workaround." It's a designed feature. The Calvin authors knew the read/write-set restriction was the major application-facing cost and they engineered OLLP as the bridge. In practice, on most workloads, OLLP miss rates are very low.
"Calvin has no locks." Each shard still uses internal locks for serialising the within-shard application of transactions. The locks are short-lived (the duration of one transaction's local execution) and never held across network round-trips. There is no global lock manager.
"You can replace 2PC with Calvin in any system." Only if you can afford to declare read/write sets up front and accept epoch-batching latency on the write path. Workloads that need < 5ms commit latency or have highly data-dependent access patterns are not Calvin's sweet spot.
Going deeper
Why epoch batching is not a latency bug
Calvin's commit latency floor is the epoch length — typically 10-50ms. To a system designer used to 2PC's "as fast as the network allows" model, this looks like a regression. It's not, when you frame it correctly.
In 2PC, the per-transaction wire latency dominates the floor: each cross-shard transaction pays at least one network round-trip on PREPARE and one on COMMIT, plus two fsyncs. On a wide-area deployment that is comfortably 30-100ms regardless of how clever you are. Calvin's 10-50ms epoch is roughly the same magnitude, and inside that epoch the system ingests thousands of transactions — so the per-transaction marginal latency at high load is much lower than 2PC's per-transaction round-trip.
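A back-of-envelope version of that amortisation argument, with assumed (purely illustrative) numbers:

```python
# Illustrative figures only: a wide-area round-trip and a plausible epoch fill.
rtt_ms = 50                   # assumed cross-region round-trip
txns_per_epoch = 4000         # e.g. 200K txns/sec ingested over a 20ms epoch

# 2PC: each cross-shard transaction pays its own PREPARE + COMMIT round-trips.
two_pc_coord_ms_per_txn = 2 * rtt_ms

# Calvin: one consensus round per epoch, shared by every txn in the batch.
calvin_coord_ms_per_txn = rtt_ms / txns_per_epoch

assert two_pc_coord_ms_per_txn == 100
assert abs(calvin_coord_ms_per_txn - 0.0125) < 1e-9
```

The absolute commit latency is similar in both designs; what collapses is the coordination cost per transaction, which is why the gap widens as load rises.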
OLLP in detail — the reconnaissance step
A Calvin transaction with data-dependent reads runs in two phases. Phase 1 (reconnaissance): the client executes the read parts of the transaction against the current state, with no locks and no transactional guarantees. From the read results, it constructs a predicted read/write set. Phase 2 (submission): the client submits the transaction to the sequencer with the predicted set. The sequencer schedules it. At execution time, each shard re-runs the reads. If the values match the prediction, execution proceeds. If not, the transaction is recorded as a deterministic abort (every replica agrees it failed prediction validation), and the client is told to retry. The retry usually succeeds because the prediction was just refreshed.
The clever bit: the validation is itself deterministic. Every replica observes the same "prediction matched" or "prediction mismatched" outcome, so the abort is consistent across replicas. The client doesn't see an "in-doubt" state — it sees either "committed" or "aborted, please retry".
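The two-phase loop can be sketched as follows (the email-index schema and helper names are hypothetical; a real deployment validates inside the storage layer, not in client code):

```python
def recon_read(db, email):
    # Phase 1: unlocked reconnaissance read to predict the write target.
    return db["email_index"][email]

def execute(db, txn):
    """Deterministic execution with OLLP validation: every replica re-runs
    the lookup and observes the same match/mismatch outcome."""
    actual = db["email_index"][txn["email"]]
    if actual != txn["predicted_key"]:
        return "abort-retry"            # deterministic abort, same everywhere
    db["accounts"][actual] += txn["amount"]
    return "committed"

def submit_with_ollp(db, email, amount, max_tries=3):
    for _ in range(max_tries):
        predicted = recon_read(db, email)
        txn = {"email": email, "predicted_key": predicted, "amount": amount}
        # In a live system the row could move between recon and execution;
        # this toy db is stable, so the first try succeeds.
        if execute(db, txn) == "committed":
            return predicted
    raise RuntimeError("OLLP retries exhausted")

db = {"email_index": {"x@y.z": "acct-1"}, "accounts": {"acct-1": 0, "acct-2": 0}}
assert submit_with_ollp(db, "x@y.z", 100) == "acct-1"
assert db["accounts"]["acct-1"] == 100
```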
Cross-shard reads at snapshot timestamps
Calvin's batch sequence numbers double as snapshot timestamps. A read that wants a consistent snapshot across shards just specifies "read at batch N". Every shard replica returns the state it had after applying batch N. No coordination is needed because the snapshot is defined by the sequence number, which every replica agrees on.
This is how FaunaDB delivers cross-region linearisable reads without paying the wide-area round-trip on each one. The reader picks a recent batch number and reads from the local replica; the local replica's state at that batch is identical to every other replica's state at that batch.
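One way to sketch the mechanism, keeping a frozen copy of state per batch (a real system would use MVCC-style versioning rather than full copies; the class and field names here are illustrative):

```python
class SnapshotStore:
    """Toy replica that records state after each batch so a reader can
    ask for "state as of batch N" with no cross-shard coordination."""
    def __init__(self, state):
        self.state = dict(state)
        self.snapshots = {}          # batch seq -> frozen copy of state

    def apply_batch(self, batch, apply_txn):
        for txn in batch["txns"]:
            apply_txn(self.state, txn)
        self.snapshots[batch["seq"]] = dict(self.state)

    def read_at(self, seq, key):
        # Defined purely by the agreed batch number: every replica of this
        # shard answers identically, so no round-trip is needed.
        return self.snapshots[seq][key]

def apply_txn(state, txn):
    state[txn["key"]] += txn["delta"]

s = SnapshotStore({"A1": 1000})
s.apply_batch({"seq": 0, "txns": [{"key": "A1", "delta": -500}]}, apply_txn)
s.apply_batch({"seq": 1, "txns": [{"key": "A1", "delta": +200}]}, apply_txn)
assert s.read_at(0, "A1") == 500
assert s.read_at(1, "A1") == 700
```

A multi-shard snapshot read is just the same `read_at(N, ...)` issued against each shard's local replica: batch N names the same global cut everywhere.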
Why the storage layer doesn't need a WAL in the 2PC sense
2PC needed a participant WAL because the participant had to durably promise to commit before knowing the outcome. Calvin's storage layer has no such promise to make — the batch log is the WAL. Every state-changing operation is implicit in the batch log, and on crash recovery a worker rebuilds its state by replaying the batch log from its last checkpoint. The local store still uses an internal WAL for crash-consistent state updates within a single batch's application, but it carries no inter-node coordination semantics.
Where this leads next
Calvin closes one branch of the distributed-transactions story: the branch that asks "what if we engineered around 2PC entirely?" Chapter 114 picks up the saga pattern — long-running business transactions that span minutes, hours, or days, where neither 2PC nor Calvin's batch-execution model fits and you instead build orchestrated workflows of compensating actions. Chapter 115 then walks CockroachDB's architecture in detail, where you'll see another concrete production answer that combines Spanner's 2PC-over-consensus with some pragmatic shortcuts.
The arc of Build 14 has been: 2PC is the textbook answer, it blocks, the industry split into "fix 2PC with consensus" (Percolator, Spanner, CockroachDB) and "delete 2PC with determinism" (Calvin, FaunaDB). Both branches are correct. Which one fits your workload depends on whether you can afford the upfront read/write set in exchange for the throughput, or whether you'd rather pay the per-transaction coordination cost in exchange for an unconstrained programming model.
References
- Thomson, Diamond, Weng, Ren, Shao, and Abadi, Calvin: Fast Distributed Transactions for Partitioned Database Systems, SIGMOD 2012 — the original Calvin paper. Sections 3 (architecture), 4 (deterministic locking), and 5 (OLLP) cover the design points this chapter walks; section 6 reports the 500K txns/sec throughput on a TPC-C-like benchmark.
- Daniel Abadi, It's Time to Move on from Two Phase Commit, DBMS Musings blog, 2018 — the polemical case for Calvin's design philosophy by its lead author, with a clear comparison to Spanner's 2PC-over-Paxos approach and an argument for why deterministic execution scales better.
- Fauna, Inc., FaunaDB Technical White Paper and Consistency Without Clocks: The FaunaDB Distributed Transaction Protocol — FaunaDB's productionisation of Calvin, including how the upfront read/write set is recovered from FQL queries and how multi-region replication is layered onto the sequencer model.
- Harding, Van Aken, Pavlo, and Stonebraker, An Evaluation of Distributed Concurrency Control, VLDB 2017 — a head-to-head benchmark comparison of 2PC, optimistic concurrency control, MVCC, and Calvin-style deterministic execution on the same hardware, showing exactly where each design wins and loses.
- Yale Database Group, Calvin Project Overview — the project page with links to the original implementation, follow-up papers (BOHM, PVW), and the dissertations from Thomson and Ren that extend the deterministic-execution model.
- Ren, Thomson, and Abadi, An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems, VLDB 2014 — a follow-up paper from the Calvin team that catalogues the design trade-offs of deterministic execution against traditional non-deterministic systems on identical workloads. Required reading if you're considering Calvin for a real deployment.