Sloppy Quorums and Hinted Handoff

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A strict quorum says: to commit a write, W of the N replicas in this key's preference list must acknowledge. That rule has a failure mode. If enough preference-list replicas are unreachable — partitioned, crashed, rebooting — that you cannot gather W acks from them, the write fails. For Riverone's shopping cart, and for every Dynamo-style system that inherited its goal, "write fails because some replicas are down" is exactly the outcome the architecture was built to avoid.

Sloppy quorum is the relaxation. If the preference list cannot yield W acks, the coordinator writes to any healthy nodes in the cluster — even ones outside the preference list — and counts those acks toward W. Each off-list ack carries a hint: a small record saying "this write was really meant for node X; I am holding it temporarily on X's behalf." The hint lives durably on the substitute node. When gossip tells the substitute that X is reachable again, it replays the write to X and deletes the hint. That is hinted handoff.

The system buys continuous write availability at a price. Between the write at time T and the hint delivery at T+N, a read that queries only the preference list may miss the write — the fresh copies are on off-list nodes the reader does not consult. Read repair and anti-entropy (the next two chapters) cover this gap. The inconsistency window is bounded by hint-delivery latency — typically seconds, sometimes minutes during long partitions, pathologically hours if a node stays down. For shopping carts, sessions, and telemetry, that is an acceptable tax. For bank ledgers, it is not, and you should not be using sloppy quorum.

The technique is Dynamo's answer to the hardest operational question in distributed storage: what should a write do when the right replicas are unavailable? Strict says no. Sloppy says yes, with a promise to make it right.

Picture a Dynamo cluster spread across three racks in an Indian datacentre. Key telemetry:PSLV-C58:10023 hashes to preference list [A1, A2, B1]. N=3, W=2, R=2.

At 03:14 a labourer puts a backhoe through rack A's primary power feed. UPS carries rack A for 180 seconds, then A1, A2, A3 go dark. A new telemetry sample arrives; the coordinator tries to fan out to [A1, A2, B1]. A1 down. A2 down. B1 acks. One ack, target W=2 — strict quorum fails.

A strict system drops the write. A sloppy system picks a healthy node — say C1 — and writes the sample there with a hint attached: "this value belongs to A2; give it to A2 when A2 returns." C1 stores the write durably. Ack count is now 2 (B1 + C1); the write succeeds.

Forty minutes later rack A's power comes back. C1's periodic delivery loop notices A2 is alive, replays the telemetry to A2, and deletes the hint. The cluster is consistent again. The 40-minute partition cost zero writes.

That is the whole idea. The rest is engineering detail.

Strict quorum in Dynamo — the partition pain point

You saw strict quorum in Tunable Consistency R + W > N. Given a key, the placement layer produces a deterministic preference list of N replicas. A write is accepted by the coordinator only if W of those specific N replicas ack. A read contacts R of those specific N replicas and merges. The R + W > N inequality guarantees overlap between any read quorum and any recent write quorum, and that overlap is what delivers strong consistency.

The word specific is the problem. Strict quorum does not care how many healthy nodes you have cluster-wide; it cares only about the N nodes responsible for this key. Suppose N=3, W=2 and your cluster has 100 nodes. A key's preference list picks three of them. If two of those three are down — say both happen to be in the same rack and the rack lost power — the write fails. The other 97 healthy nodes are useless to you; they are not on this key's preference list.

Why strict quorum is brittle to localised failure: the preference list is chosen by consistent hashing, which places a key on replicas that happen to be next on the ring. Replicas are supposed to be spread across failure domains — racks, AZs — but the spreading is probabilistic. With 3 racks and N=3 it is common to have exactly one replica per rack. One rack down then kills 1/N of the write availability across the cluster. Two racks down (a bad day) kills all writes for every key whose preference list sits in those two racks.

The traditional response is to lower W. With N=3, W=1, one ack is enough; you survive any two-replica failure as long as one is up. But you give up R + W > N consistency — R=2, W=1, N=3 does not satisfy R + W > N. You are now eventually consistent, and critically, you can still lose data: if the single replica that acked crashes before gossiping, the write evaporates. Lowering W trades availability against durability, not against partition tolerance.

The real goal is something different. You want writes to survive preference-list failure without lowering W and without losing durability. The ack count should stay at 2 for a W=2 write, but the 2 acks do not need to be from the specific 2 out of 3 preference-list replicas. They can be from any 2 healthy nodes, provided the system promises to eventually make the preference list correct.

That is sloppy quorum.

The sloppy-quorum rule

The formal statement: a sloppy-quorum write succeeds when W acks have been received from any healthy nodes in the cluster, with the coordinator preferring preference-list replicas first and falling back to substitutes when preference-list replicas are unreachable.

Break that into steps:

Compute the preference list P = [p1, p2, ..., pN] for the key.
Send the write to every live node in P. Count acks.
If acks reach W, return success (strict-quorum-equivalent).
Otherwise, walk the rest of the cluster clockwise on the ring, skipping nodes already in P, until you find a live node s.
Write to s with a hint recording the preference-list target that s is substituting for.
If acks now reach W, return success with a flag indicating sloppy-quorum.
If you exhaust the cluster without reaching W, fail the write.

Critically, the substitute is chosen deterministically — clockwise from the preference list on the ring — so multiple coordinators writing the same key during the same partition will tend to pick the same substitute, consolidating hints rather than scattering them.

Cassandra and Riak make this behaviour configurable. Cassandra's CONSISTENCY QUORUM with hinted_handoff_enabled: true is sloppy by default. Riak exposes the choice per request: pw (primary-writes) is the strict-quorum count, w is the total-including-substitutes count. w=2, pw=0 is fully sloppy; w=2, pw=2 is strict.

The hinted-handoff mechanism

A hint is not a write. It is metadata: "node s is temporarily holding value v for key k, which belongs on preference-list slot p." A hint has the following structure:

@dataclass
class Hint:
    target_node: str       # the preference-list replica this hint is for
    key: str
    value: bytes
    vector_clock: dict
    created_at: float      # monotonic timestamp
    ttl: float             # discard after this

Hints are stored durably on the substitute node, typically in a dedicated hint log (Cassandra uses a directory of hints-<host>-<timestamp>.db files; Riak used a LevelDB-backed hints bucket). The substitute also keeps the hinted value in its regular keyspace — a read at the substitute will still return it — but the hint log is what tracks "this value needs to be handed off."

The delivery loop on the substitute node runs periodically. In pseudo-code:

def deliver_hints(self):
    while self.running:
        for hint in self.hint_store.iter():
            if hint.ttl_expired():
                self.hint_store.delete(hint)
                continue
            target = self.cluster.node(hint.target_node)
            if not target.alive:
                continue
            try:
                target.store(hint.key, hint.value, hint.vector_clock)
                self.hint_store.delete(hint)
            except NetworkError:
                pass  # retry next round
        time.sleep(self.hint_interval)

The loop walks every pending hint, checks if the target is reachable, and delivers. The delivery write uses the same vector clock that was attached to the original sloppy write, so the target's merge logic treats it as a normal concurrent update and reconciles correctly. Once the target acks, the hint is deleted.

A subtle point: the hint-holding substitute does not need to delete its local copy of the value after delivery. The cluster now has the value on {B1, C1, A2} — one more than N. This is usually fine; over time, either the substitute ages out the value (if the key is not in its preference list for the new ring state), or anti-entropy notices the surplus and rebalances. Cassandra keeps the substitute's copy for some windows and relies on anti-entropy; Riak historically cleaned up more aggressively.

Sloppy-quorum timeline. At time T, the client's write is coordinated while A2 is down. B1 accepts the write directly; D accepts the write as a substitute with a hint naming A2 as the rightful destination. W=2 acks (B1 + D) satisfy the quorum, so the client sees success. At T+40 min A2 rejoins; D's hint-delivery loop replays the value to A2 and removes the hint. The cluster is now fully consistent — three replicas holding the value as the preference list intends.

The availability guarantee

State the property that sloppy quorum delivers:

As long as any W nodes in the cluster are reachable from the coordinator, writes succeed.

Compare to strict quorum:

Writes succeed only if W out of the N preference-list replicas for this particular key are reachable from the coordinator.

The sloppy guarantee is far stronger for clusters with many more nodes than N. In a 100-node cluster with N=3, W=2, strict quorum fails any write whose key's preference list has 2 down replicas — this is a common case during AZ outages. Sloppy quorum fails only when 99 of 100 nodes are down, which is a catastrophic scenario (your cluster is mostly dead anyway).

The price is a bounded inconsistency window. Between the write at T and the hint-delivery at T+N, a client who reads from the preference list — missing the substitute — may not see the write. Concretely: the preference list is [A2, B1, C], the write went to [D, B1, C] with a hint on D for A2. A reader querying [A2, B1] with R=2 gets the value from B1 (fresh) and whatever A2 had before (stale). Merge picks the fresh one and the reader is fine. But a reader querying [A2] with R=1 gets the stale value.

So sloppy quorum does not break R + W > N consistency during normal operation. It breaks it during partitions, bounded by hint-delivery latency. Read repair closes some of the gap opportunistically; anti-entropy closes the rest on a slower schedule.

Why the inconsistency is bounded, not unbounded: the hint is durable on the substitute. The only ways for the hint to be lost are (1) the substitute dies permanently without delivering — which is catastrophic replica failure that affects any durability scheme — or (2) the hint TTL expires, at which point anti-entropy is responsible for propagating the write. Neither is "silently forever"; both are detectable and have recovery paths. Contrast with write-once-forget systems where the window is infinite.

Python implementation — the sloppy-quorum write path

The skeleton of a coordinator's put method handling both the strict and sloppy paths. Under 40 lines, focused on the control flow:

# sloppy.py
from dataclasses import dataclass

@dataclass
class Ack:
    node_id: str
    hint_for: str = None  # None = strict ack; otherwise names target

class Coordinator:
    def __init__(self, ring, N=3, W=2):
        self.ring = ring
        self.N, self.W = N, W

    def put(self, key, value, vclock):
        pref = self.ring.preference_list(key, self.N)
        acks = []
        # Phase 1: try preference list strictly.
        for node in pref:
            if node.alive and node.try_write(key, value, vclock):
                acks.append(Ack(node.id))
                if len(acks) >= self.W:
                    return ("ok", acks)
        # Phase 2: need substitutes. Walk the ring clockwise,
        # skipping preference-list members.
        missing_targets = [p.id for p in pref
                           if not any(a.node_id == p.id for a in acks)]
        for sub in self.ring.clockwise_from(pref[-1]):
            if sub in pref or not sub.alive:
                continue
            target = missing_targets.pop(0)
            if sub.try_write_with_hint(key, value, vclock, target):
                acks.append(Ack(sub.id, hint_for=target))
                if len(acks) >= self.W:
                    return ("ok-sloppy", acks)
            if not missing_targets:
                break
        raise QuorumFailed(f"only {len(acks)}/{self.W} acks")

What this does: the coordinator first tries every live preference-list replica. If the ack count is short, it walks the ring clockwise from the last preference-list position, finding live substitutes and asking each to accept the write with a hint naming one of the down preference-list members. The return value distinguishes strict success from sloppy success — a useful signal if the caller wants to log or trigger additional repair.

What this elides: actual network code, retries on transient errors, the try_write_with_hint API on the substitute side (which persists the hint to disk), the gossip integration (how node.alive is determined), and concurrent put calls fighting over the same substitute. Production code handles all of them; the skeleton shows the control flow.

The hint delivery loop

On the substitute node, a background thread walks the hint store and delivers. The loop:

# substitute_node.py
class SubstituteNode:
    def hint_delivery_loop(self):
        while self.running:
            for hint in self.hint_store.list():
                if time.time() - hint.created_at > hint.ttl:
                    self.hint_store.delete(hint)
                    continue
                target = self.cluster.get_node(hint.target_node)
                if not target.alive:
                    continue
                try:
                    target.receive_hinted_write(
                        hint.key, hint.value, hint.vector_clock)
                    self.hint_store.delete(hint)
                except (NetworkError, Timeout):
                    pass
            time.sleep(self.interval)

Typical interval settings. Cassandra defaults to every 10 seconds between delivery rounds, with throttling to prevent a returning node from being overwhelmed by weeks of backlogged hints all at once (hinted_handoff_throttle_in_kb, default 1024 KB/sec per node). Riak used a similar design with a configurable riak_core.hinted_handoff timer.

A failed delivery — target still unreachable, or network error — leaves the hint in place for the next round. There is no retry limit; the only termination conditions are successful delivery or TTL expiry.

Hint store overflow — when the partition is long

The failure mode that matters in production is a replica that stays down for hours or days. Think rack power failure with a next-day replacement, or a disk that failed Friday night and waits for Monday morning. During the outage, every write whose preference list included that replica lands a hint on some substitute.

Hints accumulate fast. Each hint is a full value plus metadata. With 1 KB average values and 10,000 writes/sec on affected keys, that is 10 MB/sec, 600 MB/minute, 36 GB/hour of hint accumulation.

Three mitigations:

Max hint window. Cassandra's max_hint_window_in_ms (default 3 hours): stop generating hints for a target down longer than this. Beyond the window, anti-entropy (Merkle-tree-based, ch.81) takes over as the repair mechanism. New writes still land on substitutes, but unhinted — the substitutes are "real" replicas for those keys until anti-entropy rebalances.

Per-target storage caps. Some implementations cap the total size of hints buffered for any single target. Once the cap is hit, older hints are dropped or new hints are refused.

Throttled delivery on return. When the target returns, the substitute must not flood it. Cassandra throttles delivery to 1 MB/sec by default so the target serves live traffic while catching up. A 100 GB backlog at 1 MB/sec is 28 hours — usually fine because read-repair on live traffic restores consistency in parallel.

The hard case is permanent node loss. If A2 dies and never comes back, its hints are undeliverable. Operators must detect this and run a decommission-then-repair cycle — remove A2 from the ring, re-run anti-entropy across the new preference-list members, and delete the stale hints.

The temporary inconsistency window

Between the write at T and the hint delivery at T+N:

The write lives on {substitute, other preference-list replicas that acked} — call this set F (fresh).
The preference list minus F is stale (or empty if this is a new key).
A reader that queries only preference-list replicas misses F's substitute-held copy; whether the read is stale depends on whether the quorum overlaps with F's preference-list members.

Concretely, preference list [A2, B1, C], write landed on {D, B1, C} with a hint on D for A2. Readers:

R=3 from {A2, B1, C}: A2 stale, B1 fresh, C fresh. Merge picks fresh. Correct.
R=2 from {A2, B1}: A2 stale, B1 fresh. Merge picks fresh. Correct.
R=2 from {A2, C}: A2 stale, C fresh. Merge picks fresh. Correct.
R=1 from {A2}: A2 stale. Merge returns stale. Inconsistent read.

The R=1 case is the only one that fails during sloppy quorum, and it fails only if the reader happens to land on a preference-list replica that was down during the write. With typical R=2 settings, the read is self-correcting because at least one of the two responses is fresh (assuming at most one preference-list failure per key).

Read repair (ch.80) closes the remaining gap. When the coordinator sees that one of the queried replicas returned a stale value, it writes the merged result back to the stale replica asynchronously. Every read that touches the stale replica heals it. Anti-entropy (ch.81) catches the cold keys that nobody reads.

Common confusions

"Sloppy quorum is the same as eventual consistency." Close but not identical. Eventual consistency is the property that all replicas eventually converge given no further updates. Sloppy quorum is stronger: it is strong-when-up, eventual-during-partition. When no preference-list replicas are down, sloppy-quorum writes behave identically to strict-quorum writes — the R + W > N guarantee holds. Only during preference-list failure does the system degrade to eventual consistency, and only for keys whose preference lists are affected. Other keys on the same cluster keep their strong-consistency guarantees.
"Hints are in-memory caches." They must not be. A hint-holder that crashes before delivering permanently loses the write. Hints are written to durable storage with the same integrity as primary data — Cassandra uses the commit log plus hints-*.db files; Riak used LevelDB.
"All reads are sloppy too." No. Dynamo and its descendants typically make only writes sloppy. Reads still go to the preference list — if enough preference-list replicas are down to prevent meeting R, the read fails. The asymmetry is intentional: writes are the availability-critical path (you do not want to lose a shopping-cart update), reads are more tolerant of transient failure (the user can refresh). A sloppy read would also need to know where to find substitutes, which is harder to do deterministically. Cassandra has some extensions (read repair across all replicas, speculative reads) that blur the line, but the core design is sloppy-writes, strict-reads.
"Sloppy quorum is equivalent to Paxos during partitions." No. Paxos and Raft are total-order protocols: every replica sees the same sequence of commits. Sloppy quorum is partition-tolerant and AP — during a partition it keeps accepting writes on both sides, creating concurrent versions that must be merged later. Paxos blocks writes during partitions that break the majority. Different points on the CAP spectrum, different workloads.

Rack-loss Diwali traffic spike

Sanjay runs Snapdeal's cart service on a Cassandra cluster in Mumbai. Config: N=3, W=2, R=2, three racks in a single AZ, hinted_handoff_enabled: true, max_hint_window_in_ms: 3h.

It is 20:30 on Dhanteras. Peak cart traffic. Rack A goes dark — a UPS cascade failure after a transformer trip. Nine nodes out: A1 through A9.

20:30:00 — Last normal traffic. Gossip shows all 27 nodes healthy.
20:30:08 — Rack A nodes stop responding. After 8 seconds (phi_convict_threshold), peers mark A1-A9 as suspect.
20:30:11 — Gossip converges on "A1-A9 DOWN." Coordinators start routing around them. Every write whose preference list includes any A-rack node goes sloppy — about 1/3 of cart writes.
20:30:11 onward — Writes continue to return success. Latency rises (the median ack is now substitute-based) but no writes fail. Hints accumulate on B-rack and C-rack substitutes: at 2000 writes/sec with 1/3 affected, ~670 hints/sec across 18 substitutes, sustained for 3 hours — roughly 4 GB of hints per substitute. Within config limits.
23:25:00 — Rack A power restored. A1-A9 boot, rejoin gossip.
23:25:30 — Hint delivery begins, throttled to 1 MB/sec per target.
00:10:00 — Hint stores drained. Cluster fully consistent. Anti-entropy (nightly at 04:00) sweeps cold keys as a belt-and-suspenders check.

User impact. Zero cart writes failed during the 175-minute outage. R=2 reads returned correct values because each query overlapped at least one healthy replica. Rare R=1 reads against A-rack slots returned stale data; cart reads are typically QUORUM in practice.

Under strict quorum. Roughly 1/3 of cart writes — those with an A-rack replica — would have failed. At 2000 writes/sec, that is 670/sec failures, or 7 million failed writes over the outage. A customer-facing catastrophe on Dhanteras.

Why this design — the shopping cart again

Riverone's 2007 paper stated the requirement bluntly: the cart service must accept writes every minute of every year. Strict quorum cannot deliver that property against realistic failure assumptions. Rack outages, AZ outages, and network partitions happen on the order of tens of minutes per quarter in any large cloud deployment; strict quorum would refuse writes during those windows.

Sloppy quorum transforms rack outages from "refuse writes" to "accept writes, reconcile later." For a cart, the reconciliation is trivial — set union of items. For telemetry, append to a timestamped log and deduplicate. For session state, last-write-wins with reasonable semantics. For any workload where the conflict function is well-defined and cheap, sloppy quorum is a near-pure win.

The bounded inconsistency window — seconds to minutes under normal conditions — is the acceptable tax. A shopping-cart customer who sees a stale cart for 30 seconds after a rack failure is a much better outcome than a customer who cannot add to their cart at all.

Where this might not fit

Workloads that do not tolerate temporary inconsistency should not use sloppy quorum:

Banking ledgers. A sloppy-quorum system can show a stale balance, letting a user attempt an overdraft on money the system "committed" a minute ago. Banks need leader-based strict quorum or consensus protocols (Raft/Paxos).
Inventory with hard limits. "Last one in stock" fails under sloppy quorum — two buyers on different partition sides both succeed. Use a reservation service with strict consistency.
Distributed locks. Sloppy quorum can let two coordinators both believe they hold a lock during a partition. Use Chubby, ZooKeeper, or etcd.
Time-critical signals. Alerting pipelines, ad bidding, safety-critical control loops — anywhere "seconds of stale data" is unacceptable.

Rule of thumb: if you can write down a cheap merge function for your data, sloppy quorum is probably fine. If you cannot, you need strict semantics and should look elsewhere.

Going deeper

Cassandra's hinted handoff in depth

Cassandra's implementation evolved substantially over versions. The modern design (Cassandra 3.0+) uses dedicated hints-*.db files in $data/hints/, with a commit log for durability. Key config knobs:

hinted_handoff_enabled — master switch, defaults true.
max_hint_window_in_ms — 3 hours default. Stop generating hints for targets down longer than this.
hinted_handoff_throttle_in_kb — 1024 KB/sec per-target rate limit on delivery.
max_hints_delivery_threads — 2 default, concurrent delivery streams.
hints_flush_period_in_ms — 10 seconds, how often to flush in-memory hints to disk.

The commit log integration ensures hints survive crash. Cassandra's observability — nodetool listendpointspendinghints and similar commands — lets operators see outstanding hint volumes per target and triage long outages.

Riak and DynamoDB

Riak separated the hint-transfer path from normal writes. A riak_core_handoff_manager process orchestrated whole-partition migrations, used for both hinted handoff and rebalancing. Each vnode handed off its pending-writes list as a single stream, which handled large backlogs efficiently.

Riverone's managed DynamoDB service uses an internal mechanism that is hinted-handoff-shaped but tuned by the service team — customers do not configure hint windows or throttle rates. The availability SLA (99.999% for global tables) is achieved in part through aggressive sloppy-quorum substitution with rapid hint delivery.

Where this leads next

Chapter 80 covers read repair — the opportunistic repair path that fires on every read and catches divergence the moment a client touches a stale replica. Chapter 81 covers anti-entropy and Merkle trees — the systematic background repair that sweeps cold keys and catches hints that were lost. Sloppy quorum, hinted handoff, read repair, and anti-entropy form the four-layer consistency-repair stack that makes Dynamo-style leaderless replication production-worthy.

References

DeCandia et al., Dynamo: Riverone's Highly Available Key-value Store, SOSP 2007 — the original paper. Section 4.6 covers hinted handoff directly, with the shopping-cart motivation and the availability argument.
Apache Cassandra Project, Hinted Handoff — Cassandra Documentation — the modern reference for Cassandra's implementation, including all configuration knobs, operational monitoring, and the commit-log-backed durability model introduced in Cassandra 3.0.
Basho Technologies, Riak Handoff Documentation Archive — Riak's hinted-handoff and partition-migration machinery, now archived. The most detailed description of a non-Cassandra Dynamo-style handoff design, including the distinct riak_core_handoff_manager process model.
Kleppmann, Designing Data-Intensive Applications, Chapter 5 — Replication, O'Reilly 2017 — section 5.4 covers sloppy quorums and hinted handoff with worked numerical examples and careful discussion of the inconsistency window.
Amazon Web Services, Amazon DynamoDB Developer Guide — How It Works — the managed service's documentation, which describes the customer-facing availability SLA and the internal replication machinery at a high level.
Kingsbury, Jepsen: Cassandra (2013, with later updates) — systematic investigation of Cassandra's sloppy-quorum behaviour under partition, with reproducible tests measuring the actual inconsistency window under varied failure scenarios.