Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Linearizability

It is 19:42 at KapitalKite — a fictional discount stockbroker — on a Wednesday when the index is moving 1.4% in the last hour, and Saanvi the on-call engineer is staring at a complaint from a power-trader who claims his account showed ₹4,12,000 of buying power, then he placed a ₹3,90,000 order, then a refresh showed ₹4,12,000 again — and the order also went through. The trader says "your system is broken" and he is right, but the bug is not where the team thinks it is. The buying-power service is replicated across three nodes, the writes are synchronous to a quorum of two, and the reads also go to a quorum of two — by the textbook this should be safe. What broke was a single-replica read path that someone added six months ago for "performance", and what it broke was linearizability — the contract that makes a replicated object behave like a single, atomic object. This article is the precise definition of that contract, why it is so hard to keep, and how to test whether you actually have it.

Linearizability says every operation on a replicated object appears to take effect at one instant somewhere between when it was called and when it returned, and that single instant is consistent with real-time wall-clock order across all clients. It is the strongest contract you can place on a single object — strictly stronger than serialisable transactions over one key, stronger than sequential consistency, and the only model where a replicated register behaves like a non-replicated one. The cost is that every write and every read must consult enough replicas to intersect every other operation's quorum, which is at minimum one round-trip to a majority. Most production systems weaken to causal or session-scoped guarantees because they cannot afford that round-trip on every operation; the ones that do not weaken (etcd, ZooKeeper, Spanner) pay it deliberately and bound their topology to keep the cost survivable.

What the definition actually says — Herlihy & Wing 1990

Linearizability has a precise definition from Herlihy and Wing's 1990 paper Linearizability: A Correctness Condition for Concurrent Objects, and the precision matters because almost every real bug in this area comes from someone reasoning with a vague version of it.

The definition has three parts. First, every operation has an invocation event (the moment a client calls the method) and a response event (the moment the client receives the result). The interval [invocation, response] is the operation's uncertainty window — the system is allowed to do whatever it wants inside that window, but the externally observable behaviour must obey two rules. Second, there must exist a linearization point — a single instant inside [invocation, response] — at which the operation appears to take effect atomically and instantaneously. Third, if the linearization points of two operations are ordered as op_A → op_B, then op_A's effect must be visible to op_B and to every operation linearized after op_B. There is one final clause that distinguishes linearizability from sequential consistency: if op_A returned before op_B was invoked in real time (response_A < invocation_B on a hypothetical universal clock), then op_A must be linearized before op_B.

That last clause is what makes linearizability so strong. It is what forbids a system from "playing back" operations in any order it likes — the order must respect real-time. Sequential consistency drops this clause, and that is where most of the weaker models start.

Why the linearization point must lie inside the uncertainty window: the linearization point is a fiction the implementer constructs to justify the observed behaviour, not a thing the client can directly observe. If the linearization point lay outside [invocation, response], the client would have to "see the future" or "remember the past": at the moment of invocation the operation has not happened yet (so it cannot have been linearized earlier), and at the moment of response the client has the result (so it must have been linearized by now). The window [invocation, response] is exactly the set of timestamps at which the linearization point could lie without contradicting any client's observation.

Linearization points inside operation uncertainty windowsA horizontal time axis with three concurrent operations. Operation A from client 1 is a write of x equals 5, with its uncertainty window from time 1 to time 4. Operation B from client 2 is a read of x, with uncertainty window from time 2 to time 6. Operation C from client 3 is a read of x, with uncertainty window from time 5 to time 7. Linearization points are placed at time 3 inside operation A, time 4 inside operation B which returns the new value 5, and time 6 inside operation C which also returns 5. The linearization order A then B then C is consistent with real-time because operation A starts before operation B and operation A's response time 4 happens before operation C's invocation time 5. The diagram is illustrative. Linearization points must lie inside each operation's uncertainty window t=0 t=2 t=4 t=6 t=8 t=10 real time → client 1 write(x, 5) inv resp LA at t=3 client 2 read(x) → 5 inv resp LB at t=4 client 3 read(x) → 5 inv resp LC at t=6 Order LA → LB → LC respects real-time: A's response (t=4) precedes C's invocation (t=5), so A < C is mandatory.
Three concurrent operations, each with an uncertainty window. The system gets to choose a linearization point inside each window — but once chosen, the order must respect real-time: if operation A returned before operation C was invoked, A must be linearized before C, no matter what the implementation did internally.

Why a quorum on every operation is the floor — and what less buys you

Linearizability is not free, and the unavoidable cost is per-operation cross-replica coordination. The formal lower bound is from Attiya, Bar-Noy, and Dolev's 1995 paper Sharing Memory Robustly in Message-Passing Systems: in an asynchronous system with f crash failures among N nodes, any linearizable read or write must contact at least (N + f) / 2 + 1 replicas — a majority quorum. The intuition is the intersection argument: if a write contacted set W of replicas and a subsequent read contacted set R, the read can only see the write if W ∩ R ≠ ∅. With |W| = |R| = ⌈(N+1)/2⌉, every write quorum overlaps every read quorum in at least one replica — and that replica carries the latest value.

Why majority quorum is necessary and not just sufficient: suppose you tried to read from R = ⌈N/2⌉ - 1 replicas (one less than majority). Then the writer's W = ⌈N/2⌉ + 1 replicas leave N - |W| = ⌈N/2⌉ - 1 replicas that did not see the write — and your R could land entirely on those. The read returns the stale value, and there is now a real-time order violation: the write returned (so it must be linearized) but a subsequent read does not see it. The only way to close this gap is |W| + |R| > N, which forces both to be majority-sized in the symmetric case.

This is the floor. You cannot go below it without giving something up. The three things you can give up — each named, each used in production — are:

  • Single-replica reads with leader stickiness. Pin every read for a key to the leader replica that committed its writes. Reads do not pay quorum because the leader has the latest committed write. You give up read availability when the leader is partitioned away — your reads either block or silently fail over to a replica that may have stale data. Used by ZooKeeper's sync() model and by Raft when reads go through the leader.
  • Lease-based reads. The leader holds a time-bounded lease on the read key; while the lease is valid the leader can serve reads from local state without contacting peers. The lease must be shorter than the failure-detection timeout, and clock skew across nodes must be bounded. Used by Spanner's TrueTime and by etcd's lease-based linearizable reads.
  • Read-index protocol. Before serving a read, the leader contacts a heartbeat-quorum of followers to confirm it is still leader, then serves the read from local state. Saves the full Raft-log append for reads, but still pays one round-trip for the heartbeat. Used by etcd and TiKV.

Each of these still satisfies linearizability — they just shift where the round-trip happens. None gets you below one round-trip to a quorum's worth of nodes per operation that crosses the read/write boundary.

Why write and read quorums must intersectA diagram of five replicas labelled R1 through R5 arranged in a horizontal row. A write quorum highlighted in accent colour covers replicas R1, R2, and R3. A read quorum also highlighted covers R3, R4, and R5. The intersection at R3 is marked in a darker shade with a label saying "intersection — carries latest value". An arrow shows the write reaching R3 first. A note below the diagram explains that with N equals 5, both quorums of size three guarantee at least one replica in common, which is what makes the read see the write. Why W + R > N forces quorum intersection (here N=5, W=R=3) R1 in W R2 in W R3 in W ∩ R R4 in R R5 in R write quorum W = {R1, R2, R3} read quorum R = {R3, R4, R5} R3 = W ∩ R — carries latest committed value |W| + |R| = 6 > N = 5 ⇒ |W ∩ R| ≥ 1 always Cut R to 2 and you can pick {R4, R5} — the intersection vanishes, staleness leaks in. Illustrative — one of many valid quorum partitions.
The intersection argument: any write quorum and any read quorum must share at least one replica, which carries the latest committed value. Linearizability is exactly the property this intersection guarantees, lifted to operations on a register.

Testing for linearizability — the Wing & Gong / Jepsen approach

You cannot prove a system is linearizable by inspection — every implementation has subtle ordering bugs that only show up under load. The standard tool is to record a history (a log of every operation's invocation and response with timestamps) and then check whether a valid linearization exists for that history. The check is NP-complete in the worst case (Gibbons & Korach 1997), but in practice the histories produced by a few clients running for a few seconds are tractable.

The simulation below builds a tiny linearizability checker. It models a register replicated across three nodes with a 50 ms async-replication delay, then runs concurrent reads and writes against it under two read paths — quorum-read (linearizable) and single-replica-read (not). The checker exhaustively searches for a sequential ordering of the operations that respects real-time and is consistent with each operation's observed result. If no such ordering exists, the history is non-linearizable and the checker reports the violating operations.

# linearizability_check.py — minimal Wing-Gong style checker for a register
from itertools import permutations
from dataclasses import dataclass

@dataclass
class Op:
    client: str
    kind: str          # 'write' or 'read'
    arg: int | None    # value written, or None for read
    result: int | None # value returned, or None for write
    inv: float         # invocation real-time
    resp: float        # response real-time

    def __repr__(self):
        if self.kind == 'write':
            return f"{self.client}: write({self.arg}) @ [{self.inv:.0f},{self.resp:.0f}]"
        return f"{self.client}: read()={self.result} @ [{self.inv:.0f},{self.resp:.0f}]"

def is_linearizable(history, initial=0):
    """Search for a linearization respecting real-time order and register semantics."""
    n = len(history)
    for order in permutations(range(n)):
        # check real-time: if i precedes j in real-time, i must come before j in `order`
        rt_ok = True
        for a in range(n):
            for b in range(n):
                if history[a].resp < history[b].inv and order.index(a) > order.index(b):
                    rt_ok = False
                    break
            if not rt_ok: break
        if not rt_ok: continue
        # replay the linearization through register semantics
        state = initial
        valid = True
        for idx in order:
            op = history[idx]
            if op.kind == 'write':
                state = op.arg
            else:  # read
                if op.result != state:
                    valid = False
                    break
        if valid:
            return True, order
    return False, None

# Histories from the same workload, different read paths
quorum_history = [
    Op('c1', 'write', 5, None, 0,  20),   # write commits at quorum t=10
    Op('c2', 'read',  None, 5,    25, 35), # quorum read sees 5
    Op('c3', 'read',  None, 5,    40, 50),
]
stale_history = [
    Op('c1', 'write', 5, None, 0,  20),
    Op('c2', 'read',  None, 0,    25, 35), # single-replica hit a stale node
    Op('c3', 'read',  None, 5,    40, 50),
]

for label, h in [('quorum', quorum_history), ('single-replica', stale_history)]:
    ok, order = is_linearizable(h)
    print(f"{label}: linearizable={ok}")
    if ok:
        print("  valid order:", [h[i] for i in order])
    else:
        print("  no linearization respects both real-time and register semantics")

Sample output:

quorum: linearizable=True
  valid order: [c1: write(5) @ [0,20], c2: read()=5 @ [25,35], c3: read()=5 @ [40,50]]
single-replica: linearizable=False
  no linearization respects both real-time and register semantics

Why the checker enumerates permutations rather than just sorting by timestamp: real-time order is only a partial order on operations — operations whose intervals overlap (inv_a < resp_b AND inv_b < resp_a) are concurrent and the checker may pick any relative ordering between them. A sort by midpoint or by invocation time would arbitrarily linearize concurrent operations and miss valid histories where a different ordering of those concurrent operations is the one consistent with register semantics. Permutation enumeration is the price of correctness; smarter checkers (Knossos, Porcupine) prune the search via memoised state transitions but the underlying algorithm is still "search the partial order's linear extensions".

The quorum history linearizes because the only valid ordering is write(5) → read()=5 → read()=5, which respects real-time (the write returned at t=20 before the first read was invoked at t=25) and respects register semantics (every read after the write returns 5). The single-replica history does not linearize because c2's read returned 0 after c1's write of 5 had already returned — the only ordering respecting real-time would put write(5) → read()=0, but a register that just had 5 written to it cannot return 0. The checker is exhaustive over permutationsO(n!) in the operation count — which is fine for n ≤ 8 but explodes beyond that. The Jepsen library's Knossos checker uses memoised search to push the limit to roughly n = 25 per partition, and Porcupine (Anish Athalye's faster implementation) reaches n = 200+ for register histories.

This is why testing tools like Jepsen run real distributed systems under partitions and clock skew, capture histories with thousands of operations split across short overlapping windows, and feed each window to the checker independently. A bug only has to appear in one window for the whole run to be flagged.

A war story: KapitalKite's "buying-power double-spend" bug

Saanvi's investigation at KapitalKite — the fictional discount stockbroker from the opening — eventually narrowed the cause to a code path nobody had touched in ten months. The buying-power service ran on three nodes (Mumbai-A, Mumbai-B, Hyderabad-C) using a Raft cluster with the leader in Mumbai-A. All writes went through Raft and were quorum-replicated before returning. Reads, by default, also went through the leader using the read-index protocol. But one engineer had added a "fast path" for the post-trade buying-power refresh: after submitting an order, the order-service polled the buying-power service every 200 ms until the new value appeared, and the polling client had been pointed at the closest replica, not the leader, "to reduce leader load on hot-trading days".

What happened to the trader at 19:42 was this: he placed a ₹3,90,000 order, the order service called the leader and the leader's commit took 18 ms (normal for Mumbai's Raft cluster). The order service then started polling for buying-power refresh, hitting Hyderabad-C — but Hyderabad-C's Raft log had not yet caught up. For 220 ms, Hyderabad-C still showed the old buying power of ₹4,12,000. The order service displayed that to the trader. The trader hit the Place Order button again. The order service, seeing ₹4,12,000 of "buying power" on its local cache (populated from the stale read), submitted a second order without re-checking the leader. The trader had now placed two orders against funds that only existed once.

The bug was a textbook linearizability violation: a read returned a value (₹4,12,000) that was committed before a write the order-submission flow had already observed completing. There was no possible linearization order that respected real-time AND the register semantics of the buying-power object. Saanvi's fix had three parts: (1) every read in any flow that gates a subsequent write must use the read-index protocol — no fast-path reads off non-leader replicas; (2) the order service's local cache was deleted entirely (caches across a linearizability boundary are almost always wrong); (3) Jepsen-style nightly chaos tests were added that run a million-operation workload across induced 30%-loss partitions and feed every history window to Porcupine. The first night they ran, the test caught two more single-replica read paths nobody had remembered.

Vikrant's postmortem said: "We had nine months of green production metrics on a system that was not linearizable. The bug was always there; nobody had been unlucky enough to trade fast enough to hit the 220 ms window. 'It works' is not the same as 'it is correct'. Test the property, not the symptoms."

Common confusions

  • "Linearizability is the same as serializability." Serializability is a property of multi-object transactions: there must exist some serial ordering of the whole transactions such that the observed result is equivalent. Linearizability is a single-object property: each operation has an instantaneous linearization point, and the order respects real-time. Serializability does not require real-time order; linearizability does not say anything about transactions over multiple objects. Strict serializability is the conjunction — both single-object real-time order AND multi-object transactional atomicity. Spanner targets strict serializability; etcd targets linearizability.
  • "Sequential consistency is just linearizability with weaker ordering." Both require a single global order on operations. Linearizability requires that order to respect real-time across clients; sequential consistency requires it only to respect real-time within a single client. So a sequentially-consistent system can have client A finish a write at t=1, client B start a read at t=2, and B reads the old value — as long as some valid global order exists that explains it. Linearizability rejects this; sequential consistency permits it.
  • "Quorum reads automatically give linearizability." They do for a register if the write-quorum and read-quorum intersect, AND if the implementation handles the case where the read sees a value that was written but not yet committed (the read-repair problem). Cassandra's QUORUM reads are not linearizable by default because of this: a read can observe a value mid-write and a subsequent read can observe the prior value if read-repair has not propagated. To get linearizability you need LWT (Cassandra's Paxos-backed light-weight transactions), not QUORUM.
  • "Linearizability is only for storage systems." It is a property of any concurrent object — a counter, a queue, a lock, a log. Distributed locks (Chubby, etcd's locks) advertise linearizability so that "I hold the lock at time t" has a global, real-time meaning. Logs (Kafka with acks=all and a single leader per partition) advertise per-partition linearizability for the same reason — readers see writes in commit order across all readers.
  • "You can compose linearizable objects and stay linearizable." This is the one thing linearizability does well that sequential consistency does not: linearizability is locally checkable, meaning a system composed of linearizable objects is itself linearizable for operations that touch one object at a time. The moment you have a transaction spanning two linearizable objects, you need an additional protocol (2PC, Paxos commit) to keep the composition strict-serializable. This composability is why Herlihy & Wing argued so hard for linearizability over sequential consistency in 1990.

Going deeper

Herlihy & Wing's original paper — read sections 3 and 4

The 1990 paper Linearizability: A Correctness Condition for Concurrent Objects (Herlihy & Wing, ACM TOPLAS) is one of the rare pieces of distributed-systems theory that is actively pleasant to read. Section 3 contains the precise definition with the labelled-sequence formalism. Section 4 proves locality — the property that "any composition of linearizable objects is linearizable" — which is the main reason linearizability won over the alternatives. The paper also has the proof that linearizability is non-blocking: a process trying to perform an operation cannot be blocked waiting for some other process to finish, which is what distinguishes it from strict serializability for transactional multi-object systems.

How etcd actually serves linearizable reads — read-index protocol

The etcd Raft implementation uses Diego Ongaro's read-index trick from his 2014 thesis. When the leader receives a read request, it (1) records the current commit index as read_index, (2) sends a heartbeat to a quorum of followers and waits for acks, (3) waits until its applied index reaches read_index, then (4) serves the read from local state. The heartbeat-quorum step is what guarantees the leader is still leader (a partitioned old leader would not collect the heartbeat-acks); the apply-wait is what guarantees the read sees every commit that completed before the read started. Cost: one round-trip to a follower-quorum, no Raft-log entry — roughly half the cost of a write while still linearizable.

Spanner's TrueTime — linearizability with bounded clock uncertainty

Google's Spanner pushes linearizability across continents using TrueTime, an API that returns (earliest, latest) rather than a single timestamp, with a hardware-bounded uncertainty interval ε of typically 1–7 ms. To commit, Spanner picks a timestamp s and waits out the uncertainty — it sleeps until now.earliest > s before returning the commit. This wait — typically 5–15 ms — is what guarantees that any subsequent transaction with a higher TrueTime stamp is genuinely later in real-time than this commit. The catch is that TrueTime depends on GPS receivers and atomic clocks at every datacentre to keep ε small; without that hardware, Spanner's commit-wait would balloon to seconds. (Corbett et al., OSDI 2012.)

The CAP theorem says you cannot have linearizability under partition

Gilbert and Lynch's 2002 formalization of CAP proves: in an asynchronous network with possible message loss, no system can simultaneously guarantee linearizability AND availability AND partition tolerance. The instant a partition isolates a minority of nodes, that minority cannot serve writes (they have no quorum) — so they either return errors (CP system) or accept writes that violate linearizability when the partition heals (AP system). There is no third option. Production systems either choose CP (etcd, ZooKeeper, Spanner — refuse the minority's writes) or AP (Cassandra, DynamoDB default — accept and reconcile later). PACELC adds the question of what they do outside a partition, but inside one the trade is binary.

Why "linearizable enough" is a real engineering choice

Practical systems often violate strict linearizability in narrow, well-understood ways because the staleness window is far smaller than any human or downstream system can observe. DynamoDB's default reads are eventually consistent with a typical 100 ms staleness window; the ConsistentRead=true flag pays an extra ~10 ms for a linearizable read by routing through the partition leader. The choice is per-call. The principle is: do not assume your storage layer is linearizable across the whole API surface — assume per-operation, and look at the implementation, not the marketing page. The KapitalKite story is what happens when the assumption goes wrong.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 linearizability_check.py

The exhaustive checker is O(n!) in operation count — fine for n ≤ 8. To explore real workloads, install Anish Athalye's porcupine (go get github.com/anishathalye/porcupine) and replay JSON-encoded histories. Jepsen (jepsen.io) is the production-grade tool — Kyle Kingsbury's reports are the best published case studies of real systems failing the test.

Where this leads next

This chapter formalised the strongest single-object consistency model. The next chapters in Part 12 walk down the lattice introduced in the wall chapter: sequential consistency drops real-time but keeps a global order; causal consistency drops the global order but keeps happens-before; session guarantees (read-your-writes, monotonic-read, monotonic-write, writes-follow-reads) keep only per-client invariants; eventual consistency keeps only convergence.

The chapter after the lattice is the CAP theorem in Gilbert-Lynch form, then PACELC in Abadi's form. CAP says you cannot have linearizability + availability under partition; PACELC says even outside a partition, you trade latency for linearizability on every operation. Both are tools for picking which consistency model the application can afford, given the real-time order requirements and the latency budget at hand.

Beyond Part 12, CRDTs (Part 13) are how eventual consistency is made safe by construction. Distributed transactions (Part 14) are how multi-object linearizability — strict serializability — is implemented when the application cannot work around it. Linearizability is the contract everything else weakens from; without a precise definition of it, you cannot say what you have given up.

References

  • Herlihy, M., Wing, J. — "Linearizability: A Correctness Condition for Concurrent Objects" (ACM TOPLAS 1990). The defining paper.
  • Attiya, H., Bar-Noy, A., Dolev, D. — "Sharing Memory Robustly in Message-Passing Systems" (J. ACM 1995). The lower-bound proof for read/write quorum sizes.
  • Gilbert, S., Lynch, N. — "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (SIGACT 2002). CAP formalised.
  • Ongaro, D. — "Consensus: Bridging Theory and Practice" (Stanford PhD thesis, 2014). The read-index protocol used by etcd.
  • Corbett, J. C. et al. — "Spanner: Google's Globally-Distributed Database" (OSDI 2012). TrueTime and the commit-wait.
  • Gibbons, P. B., Korach, E. — "Testing Shared Memories" (SIAM J. Comput. 1997). NP-completeness of linearizability checking.
  • Athalye, A. — "Porcupine: A Linearizability Checker for Testing Distributed Systems" (anishathalye.com, 2017). Practical fast checker.
  • Kingsbury, K. — Jepsen reports (jepsen.io). The body of work that has uncovered linearizability bugs in most production databases.
  • Wall: consistency at scale needs new models — the previous chapter, framing why this contract is so often weakened.
  • Raft in detail — the consensus protocol that backs most production linearizable systems.