In short
Consistent hashing gives each key a deterministic preference list — the first N nodes clockwise from the key on the hash ring are responsible for its N replicas. That is enough for load balancing and rebalancing. It is not enough for fault tolerance. If the three nodes happen to sit on the same top-of-rack switch, on the same power bus, in the same aisle of the same datacentre, a single hardware event — a fire, a failed PDU, a switch reboot — takes out all three copies at once. You had N=3 replicas; you had effectively N=1 worth of redundancy.
The fix. Refine the naive next-N-clockwise walk. Walk clockwise as before, but skip any candidate whose rack, availability zone, or region is already represented in your partial preference list. Take the next node that introduces a new failure domain. This is a greedy graph-colouring walk with the failure-domain set as the colour palette — O(ring size) in the worst case, O(N) in practice.
Typical rules. At most one replica per rack. Spread across at least two AZs. For a multi-region keyspace, at least one replica per declared region. Cassandra implements this via NetworkTopologyStrategy; HDFS via BlockPlacementPolicyDefault; CockroachDB via zone configs.
What this chapter covers. The topology model — rack, AZ, region. Why naive placement fails. The greedy ring-walk algorithm in real Python, building on the ConsistentHashRing from Chapter 76. Cassandra's NetworkTopologyStrategy. A case study of ByteDance's 10,000-node deployment. What placement cannot defend against. Copysets as an alternative at extreme scale.
A production outage, first
On 3 November 2023, a switchgear fire in a Chennai colocation facility cut power to two adjacent racks in the same aisle. The racks held 48 servers between them. For one tenant — a social app running Cassandra at RF=3 — every key hashing onto those 48 nodes was down for eleven hours. Four percent of keys had zero read availability until power returned.
The postmortem found the cluster had migrated from three racks to two six months earlier to save floor space. The replication strategy was not updated. With only two racks, at least two of the three replicas for every key shared a rack. When both racks went down, many keys lost all three replicas.
The fix was not more replicas. The fix was telling the placement engine about the topology and demanding at most one replica per rack. Floor space is cheaper than eleven-hour outages. The rebuild added a third rack.
Why the default "next N clockwise" is not enough: the ring gives a total order on nodes, nothing more. Two consecutive nodes can be physically adjacent, share a switch, share a power bus. The ring does not know. Without a separate topology model, placement treats physical and logical adjacency as independent — which they are not.
The topology — what a failure domain means
Before placing replicas across failure domains, you need a model of what a failure domain is. The industry convention is a three-level tree.
Rack. Servers sharing the top-of-rack switch, power distribution unit, and physical space — typically 20-50 machines. Everything in a rack fails together when the switch reboots, the PDU trips, or an operator turns off the wrong breaker. The smallest unit of correlated failure above a single machine.
Availability zone (AZ). A set of racks in a separate building with independent power and network. AWS's ap-south-1a, -1b, -1c are distinct AZs, kilometres apart, linked by dedicated fibre. A well-designed AZ shares no power, cooling, or WAN uplinks with its neighbours.
Region. A set of AZs in the same metropolitan area, close enough that intra-region latency is 1-2 ms and synchronous inter-AZ replication is practical. Inter-region links are WAN-scale (30-100 ms Mumbai-Singapore, 150+ ms Mumbai-Virginia), so cross-region replication is almost always asynchronous.
Each level is a failure boundary: a single event kills everything inside but does not propagate outward. A rack PDU failure kills the rack, not the AZ. An AZ-wide power event (rare — AWS us-east-1 December 2021) kills the AZ, not the region.
```python
# topology/model.py
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeLocation:
    node_id: str
    rack: str
    az: str
    region: str

class Topology:
    def __init__(self):
        self._nodes: dict[str, NodeLocation] = {}

    def register(self, node_id: str, rack: str, az: str, region: str) -> None:
        self._nodes[node_id] = NodeLocation(node_id, rack, az, region)

    def rack_of(self, node_id: str) -> str:
        return self._nodes[node_id].rack

    def az_of(self, node_id: str) -> str:
        return self._nodes[node_id].az

    def region_of(self, node_id: str) -> str:
        return self._nodes[node_id].region
```
A single class holding a dict from node-id to (rack, az, region) is the whole data model. Real systems decorate this — Cassandra's GossipingPropertyFileSnitch lets each node self-report its rack and DC — but the logical picture is the tree above.
Why failure domains are nested: a rack is inside an AZ, which is inside a region. A single event at a higher level subsumes lower levels — an AZ outage kills every rack in the AZ. If two replicas are already in different AZs, they are automatically on different racks. Checking the highest level of divergence is sufficient.
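The sufficiency claim is easy to check in a few lines. NodeLocation mirrors the model above; divergence_level and the three sample nodes are illustrative helpers, not part of the chapter's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeLocation:
    node_id: str
    rack: str
    az: str
    region: str

# Illustrative sample topology: three nodes in one region.
NODES = {
    "n1": NodeLocation("n1", "rack-1", "az-a", "mumbai"),
    "n2": NodeLocation("n2", "rack-2", "az-a", "mumbai"),
    "n3": NodeLocation("n3", "rack-7", "az-b", "mumbai"),
}

def divergence_level(a, b):
    """Highest level at which two nodes sit in different failure domains."""
    la, lb = NODES[a], NODES[b]
    if la.region != lb.region:
        return "region"
    if la.az != lb.az:
        return "az"
    if la.rack != lb.rack:
        return "rack"
    return None  # co-located: no separation beyond the machine itself

# n1 and n3 diverge at the AZ level, which implies distinct racks too.
assert divergence_level("n1", "n3") == "az"
assert divergence_level("n1", "n2") == "rack"
```

Because the tree nests, reporting only the highest divergence level loses no information: "different AZ" already guarantees "different rack".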
The naive next-N-clockwise problem
Chapter 76's default policy: hash the key, take the next N physical nodes clockwise.
Consider eight nodes A-H, with A-D on rack-1 and E-H on rack-2. A key hashes between D and E. The naive preference list is [E, F, G] — all on rack-2. Zero protection against a rack-2 outage.
You might hope the hash function randomises this away. With virtual nodes, consecutive ring positions do come from different physical nodes with high probability. But "different physical node" is not "different rack" — two vnode tokens on different nodes can still both be on rack-1. The hash function is blind to rack membership.
Empirically, on 100 nodes across 5 racks of 20 with 256 vnodes per node, the probability that a random 3-replica preference list has all three on the same rack is 5 · C(20, 3) / C(100, 3) ≈ 0.035. Three and a half percent of keys have zero rack redundancy — on a hundred-million-key workload, 3.5 million keys, of which roughly 700,000 lose every replica when any single rack fails.
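The arithmetic is easy to reproduce with math.comb, treating the naive walk as a uniform choice of three distinct physical nodes — a reasonable model once each node contributes many vnodes:

```python
from math import comb

nodes, racks, per_rack, N = 100, 5, 20, 3

# Probability that a random choice of N distinct nodes lands
# entirely inside one of the `racks` racks.
p_same_rack = racks * comb(per_rack, N) / comb(nodes, N)
print(round(p_same_rack, 4))  # ≈ 0.035
```

At 3.5%, rack-blind placement quietly strips a nontrivial slice of the keyspace of its redundancy.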
The rack-aware walk
The fix is small. Walk clockwise as before, but skip any candidate whose rack is already represented in your partial list. Continue until you have N replicas from N distinct racks, or until you exhaust the racks and fall back with a warning.
```python
import logging

log = logging.getLogger(__name__)

def preference_list_rack_aware(key, ring, N, topology):
    """
    Walk the ring clockwise from hash(key). Pick the next N physical
    nodes such that no two share a rack. If the cluster has fewer than
    N distinct racks, fall back and allow duplicates with a warning.
    """
    owners = []
    used_racks = set()
    # clockwise_from yields virtual-node tokens; map each back to its
    # physical node before reasoning about racks.
    candidates = [ring.physical_of(t) for t in ring.clockwise_from(key)]
    for candidate in candidates:
        if candidate in owners:
            continue  # same physical node via another vnode; skip dup
        rack = topology.rack_of(candidate)
        if rack in used_racks:
            continue
        owners.append(candidate)
        used_racks.add(rack)
        if len(owners) == N:
            return owners
    # Fallback: fewer racks than N. Allow rack reuse but still
    # prefer distinct physical nodes.
    for candidate in candidates:
        if candidate in owners:
            continue
        owners.append(candidate)
        if len(owners) == N:
            break
    if len(used_racks) < N:
        log.warning("topology has %d racks, RF=%d; fallback",
                    len(used_racks), N)
    return owners
```
Twenty-something lines, a single pass over the ring in the common case. Why a second pass for the fallback: if the cluster has fewer racks than N, the strict pass exits with a partial list. Rather than looping forever or silently giving up, the algorithm falls back to ring order for the remaining slots and logs a warning — better to degrade loudly than silently.
The clockwise iterator yields virtual-node tokens; the algorithm deduplicates down to physical nodes. For a typical cluster with dozens of racks and N=3, the walk finishes in about N iterations.
For the earlier A-H ring with two racks and N=3: the walk adds E (rack-2), skips F, G, H, adds A (rack-1), then exhausts distinct racks. The fallback adds F as the third replica and fires a warning. The operator adds a rack.
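The walk on this toy ring can be traced in a few lines. rack_aware below is a condensed stand-in for preference_list_rack_aware, with the clockwise order from the key's position hard-coded:

```python
# Eight nodes: A-D on rack-1, E-H on rack-2. The key hashes between
# D and E, so the clockwise order starts at E.
RACK = {n: "rack-1" for n in "ABCD"} | {n: "rack-2" for n in "EFGH"}
ORDER = list("EFGHABCD")

def rack_aware(order, rack_of, N):
    owners, used = [], set()
    for cand in order:                      # strict pass: new racks only
        if cand not in owners and rack_of[cand] not in used:
            owners.append(cand)
            used.add(rack_of[cand])
            if len(owners) == N:
                return owners
    for cand in order:                      # fallback: allow rack reuse
        if cand not in owners:
            owners.append(cand)
            if len(owners) == N:
                break
    return owners

print(rack_aware(ORDER, RACK, 3))  # ['E', 'A', 'F']
```

The strict pass takes E and A, exhausting the two racks; the fallback supplies F — exactly the sequence described above.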
The AZ and region rules
A production cluster does not stop at rack awareness. AZs and regions are failure domains too, and a placement policy spreads across as many levels as the cluster has. The extension tracks multiple failure-domain sets and prefers the highest-divergence candidate at each step:
```python
def preference_list_full(key, ring, N, topology, policy):
    """
    policy is a list of failure-domain functions, most-important first:
        policy = [topology.region_of, topology.az_of, topology.rack_of]
    Among candidates, prefer one that introduces a new region;
    if none, a new AZ; if none, a new rack; if none, fall back.
    """
    owners = []
    used = [set() for _ in policy]          # one set per policy level
    remaining = [ring.physical_of(t) for t in ring.clockwise_from(key)]
    while len(owners) < N and remaining:
        # Find the best candidate: fresh at the most important level
        # possible. policy is most-important first, so a smaller
        # index is better.
        best = None
        best_level = len(policy)
        for cand in remaining:
            if cand in owners:
                continue
            for level, fn in enumerate(policy):
                if fn(cand) not in used[level]:
                    if level < best_level:
                        best, best_level = cand, level
                    break
            else:
                # cand duplicates on every level; still a candidate
                # of last resort
                if best is None:
                    best = cand
        if best is None:
            break
        owners.append(best)
        remaining.remove(best)
        for level, fn in enumerate(policy):
            used[level].add(fn(best))
    return owners
```
This is a greedy graph-colouring walk with a priority: pick the candidate whose newest failure domain is as high in the tree as possible. Why prefer the highest level: a fresh region protects against rack, AZ, and region failures simultaneously — it strictly dominates a fresh AZ within an already-used region. The greedy choice is not globally optimal, but it is within one replica of optimal for the common cases of N=3 across 3 racks and N=3 across 3 AZs.
A sensible production policy for multi-AZ single-region N=3: at most one replica per rack, at least two distinct AZs across the three. On AWS ap-south-1 with three AZs, this produces one replica per AZ — a single AZ outage takes one replica, and the remaining two satisfy R=2, W=2.
Cassandra's NetworkTopologyStrategy
Cassandra's production default for every keyspace is NetworkTopologyStrategy. Configuration:
```sql
CREATE KEYSPACE carts
WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'DC_mumbai': 3,
  'DC_chennai': 2
};
```
Three replicas in DC_mumbai, two in DC_chennai. Within each datacentre, the strategy applies the greedy ring-walk above per-DC. Cross-DC replication is asynchronous by default; per-query consistency levels (LOCAL_QUORUM, EACH_QUORUM) let the application choose how strict.
Cassandra gets rack/DC labels from a snitch: GossipingPropertyFileSnitch for on-prem (each node reads cassandra-rackdc.properties); Ec2Snitch / GoogleCloudSnitch derive DC and rack from cloud metadata (AZ becomes "rack", region becomes "DC"); Ec2MultiRegionSnitch for cross-region AWS clusters.
The invariant Cassandra maintains: within a DC, replicas fall on as many distinct racks as possible up to the per-DC replication factor. If a DC has 5 racks and RF=3, expect one replica per rack across three distinct racks. If only 2 racks, expect 2+1 (Cassandra logs this but does not fail the keyspace creation).
For a single-DC cluster, NetworkTopologyStrategy with {DC: N} degenerates to rack-aware placement within that DC — which is exactly what you want. The multi-DC configuration is the value-add: one knob per DC, trivially extensible to five, ten, or twenty DCs if your workload is global.
ByteDance's 10,000-node case study
ByteDance published at VLDB 2021 a description of the replica-placement engine driving its storage system at TikTok scale: roughly 10,000 nodes across 100+ racks, 8 AZs, 3 regions, with N=3 primaries plus one cold replica for DR.
The constraints they encoded:
- Exactly one primary replica per rack. Not "at most one" — exactly one. Unsatisfiable placements are rejected with a topology warning.
- At least two distinct AZs among the three primaries. A full AZ outage takes at most one replica per key, preserving R=2, W=2 automatically.
- Cold replica in a different region. Async, not counted toward the quorum, served only during region failover. 1.33× storage cost on the cold side.
- Balance constraint. Every rack, AZ, and region within 10% of mean utilisation; imbalanced placements are rejected even if policy-legal.
The greedy walk is augmented with a cost function — cost is not just "does it break policy" but "how much does it push the cluster away from balance" — and the engine picks the lowest-cost legal placement rather than the first.
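A minimal sketch of the cost-augmented choice, assuming a single "rack load" dimension — the function name, data shapes, and weights here are illustrative, not ByteDance's published interface:

```python
def pick_lowest_cost(candidates, owners, used_racks, rack_load):
    """
    Among candidates that satisfy the placement policy (here: at most
    one replica per rack), choose the one that moves the cluster least
    far from balance. candidates is a list of (node, rack) pairs;
    rack_load maps rack -> current replica count. Illustrative sketch:
    the real engine scores many more dimensions (AZ, region, disk).
    """
    rack_of = dict(candidates)
    legal = [node for node, rack in candidates
             if node not in owners and rack not in used_racks]
    if not legal:
        return None  # no policy-legal candidate; caller falls back
    # Cost of a candidate = load already on its rack; lower is better.
    return min(legal, key=lambda node: rack_load.get(rack_of[node], 0))

cands = [("n1", "r1"), ("n2", "r2"), ("n3", "r3")]
load = {"r1": 9, "r2": 2, "r3": 5}
print(pick_lowest_cost(cands, [], set(), load))  # n2: lightest rack
```

The shape of the change is the point: the walk stops taking the first legal candidate and instead ranks all legal candidates by a balance cost.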
The takeaway: placement policy is the single most important reliability lever at scale. More replicas is a linear improvement; correct placement is the difference between "no outages for eighteen months" and "every hardware event is an incident".
What rack-aware placement does not solve
Placement protects against independent failures of individual failure domains. It does nothing about failure modes whose scope is the software or the data itself.
- Correlated failures within a failure domain. A rolling kernel upgrade crashes 10% of nodes across every rack simultaneously. Placement is irrelevant; the failure mode has no spatial structure.
- Bad software deploys. A new Cassandra version ships with a bug that corrupts data on write. All three replicas receive the corrupt write through the normal quorum path and all three replicas are now wrong.
- Long partitions. A 30-minute network partition isolates one AZ. Writes in that AZ accumulate; reads from other AZs see stale data. This is a CAP-choice question about what the system does during partitions (see sloppy quorums, ch.80).
- Data corruption that replicates. A disk-level bit flip on the coordinator before replication can send corrupted bytes to all N replicas.
- Operator error spanning domains. DROP KEYSPACE run in production instead of staging has taken down placement-perfect clusters.
The defence against these is different: staged rollouts, Jepsen-style fault injection, backups with point-in-time recovery, MVCC. Placement is necessary for hardware fault tolerance; it is not sufficient for software or operational fault tolerance.
Why placement gets the press and these other defences do not: placement is a static configuration problem with a crisp algorithmic answer. It fits on a slide. The other defences are operational practice — hard to write about, harder to automate, easier to skimp on. Teams that understand only placement tend to believe they have "solved" durability; then they get hit by a bad deploy and discover they have not.
A full placement engine in Python
A RackAwareRing composes the Chapter 76 ring with a topology, a policy, and the greedy multi-level walk:
```python
# dynamo/rack_aware_ring.py
from dynamo.ring import ConsistentHashRing

class RackAwareRing(ConsistentHashRing):
    """
    A consistent-hash ring that knows about a topology and
    computes preference lists honouring a failure-domain policy.
    """

    def __init__(self, topology, policy_levels):
        """
        policy_levels: list of (name, get_domain_fn), most
        important first. Example:
            [('region', topology.region_of),
             ('az', topology.az_of),
             ('rack', topology.rack_of)]
        """
        super().__init__()
        self.topology = topology
        self.policy = policy_levels

    def preference_list(self, key, N):
        owners = []
        seen_phys = set()
        used_domains = [set() for _ in self.policy]
        for vnode_token in self.clockwise_from(key):
            phys = self.physical_of(vnode_token)
            if phys in seen_phys:
                continue
            # Highest level at which this candidate is fresh.
            fresh_level = None
            for i, (_, fn) in enumerate(self.policy):
                if fn(phys) not in used_domains[i]:
                    fresh_level = i
                    break
            if fresh_level is None and len(used_domains[-1]) < N:
                # Strict mode: skip until we find one fresh at
                # some level, or until we exhaust the ring.
                continue
            owners.append(phys)
            seen_phys.add(phys)
            for i, (_, fn) in enumerate(self.policy):
                used_domains[i].add(fn(phys))
            if len(owners) == N:
                return owners
        # Relaxed fallback: topology is too small for strict policy.
        return self._fallback(key, N, owners, seen_phys)
```
The strict pass demands a fresh domain at some level. The fallback (elided to keep the block under 40 lines) walks the ring again with relaxed constraints, logs a warning, and picks the next distinct physical nodes.
A smoke test: build a 12-node, 3-rack cluster, add 16 vnodes per node, and assert that every key lands on three distinct racks. If the assertion fails, either the ring has a pathology or the topology is under-provisioned — both of which the engine should surface rather than silently ignore.
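That smoke test, sketched against a self-contained toy ring rather than the Chapter 76 classes — the hashing scheme, node names, and vnode count here are illustrative:

```python
import hashlib
from bisect import bisect_left

def h(s):
    """Stable 64-bit hash derived from md5."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

# Toy cluster: 12 nodes, 4 per rack, 3 racks, 16 vnodes per node.
NODES = [f"node-{i}" for i in range(12)]
RACK = {n: f"rack-{i % 3}" for i, n in enumerate(NODES)}
VNODES = 16

ring = sorted((h(f"{n}#{v}"), n) for n in NODES for v in range(VNODES))
tokens = [t for t, _ in ring]

def preference_list(key, N=3):
    """Rack-aware clockwise walk over the toy ring."""
    start = bisect_left(tokens, h(key)) % len(ring)
    owners, used = [], set()
    for i in range(len(ring)):
        phys = ring[(start + i) % len(ring)][1]
        if phys in owners or RACK[phys] in used:
            continue
        owners.append(phys)
        used.add(RACK[phys])
        if len(owners) == N:
            break
    return owners

# Smoke test: every key must land on three distinct racks.
for k in range(10_000):
    owners = preference_list(f"key-{k}")
    assert len({RACK[n] for n in owners}) == 3, (k, owners)
print("10,000 keys, all on 3 distinct racks")
```

With three racks and rack-aware selection, the assertion cannot fail here; wire the same check against the real RackAwareRing and it becomes a genuine regression test for topology misconfiguration.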
Three keys on a 12-node cluster
A cluster with 12 nodes, 4 per rack, 3 racks, single AZ, single region. Replication factor N=3.
| key | naive preference list | rack-aware preference list | racks covered |
|---|---|---|---|
| cart:42 | [A3, A4, B1] | [A3, B1, C1] | A, B, C |
| cart:99 | [B4, C1, C2] | [B4, C1, A1] | B, C, A |
| user:alice | [C3, C4, A1] | [C3, A1, B3] | C, A, B |
Every key gets one replica in each rack. Now simulate rack A losing power. Under naive placement, cart:42 loses two of its three replicas (A3 and A4) — unreadable at R=2. Under rack-aware placement, cart:42 loses only A3; B1 and C1 remain, so R=2 is still satisfied.
Across the keyspace, the rack-aware cluster satisfies every read with R=2 during the rack-A outage. The naive cluster loses around one-third of reads. Same hardware, same replication factor, order-of-magnitude difference in availability.
When rack A returns, nodes A1-A4 rejoin the ring and catch up via read repair and anti-entropy. No re-placement is needed — preference lists are deterministic, and returning nodes rejoin at the slots they already owned.
Common confusions
- "More replicas always mean more durability." No — more replicas help only if placement spreads them across failure domains. Five replicas on the same rack is less durable than three replicas across three racks. Replication factor and placement are orthogonal knobs; get placement right first.
- "Automatic placement is free." The algorithm is cheap, but topology metadata has to be correct. Every node needs correct rack=, az=, region= labels. When you migrate a machine and forget to update the label, placement silently creates correlated replicas. Plenty of outages trace to a rack-label typo, not to the ring.
- "RF=3 means three machines have my data." Being on three machines is not the same as being on three failure domains. If all three are in the same rack, one rack outage loses everything. "RF=3 across three racks" is what matters.
- "Cross-region replicas are always asynchronous." Usually, but not required. Inter-region round-trip is 50-200 ms, so synchronous cross-region writes cost that per write. Cassandra's EACH_QUORUM lets you write synchronously to every DC; most keyspaces use LOCAL_QUORUM instead.
- "Rack-awareness is only for on-prem deployments." No. On AWS/GCP/Azure, the AZ plays the role of the rack. A production multi-AZ deployment with the wrong snitch puts all three replicas in one AZ, defeating the AZ abstraction entirely.
- "Virtual nodes solve placement." They do not. Vnodes smooth load distribution; they do nothing about rack locality. Two vnode tokens on different physical nodes can still be in the same rack. Vnodes and rack-awareness are complementary layers.
Going deeper
Copysets (Cidon et al. 2013)
At very large scale a subtle problem emerges with random placement. On 1000 nodes with N=3, the number of distinct 3-tuples is C(1000, 3) ≈ 166 million; over time, most of those tuples accumulate at least one key. If any 3 nodes fail simultaneously, some key is lost — some key was on exactly those three.
Copysets restrict placement to a small pre-computed set of 3-tuples (say 100 copysets). Now the only 3-node failures that lose data are those 100, not the 166 million possible — a six-orders-of-magnitude reduction, at the cost of slightly skewed load. Facebook's warehouse storage and Google's Colossus both use copyset-like schemes.
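The reduction is easy to quantify — the 100-copyset figure is this section's example, not a universal recommendation:

```python
from math import comb

nodes, N = 1000, 3
random_tuples = comb(nodes, N)    # 3-tuples random placement eventually uses
copysets = 100                    # constrained placement: only these can lose data

print(random_tuples)              # 166167000
print(random_tuples // copysets)  # ~1.7 million times fewer fatal 3-node failures
```

The trade is explicit: far fewer fatal failure combinations, in exchange for less freedom to balance load and to re-replicate after a node loss.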
HDFS BlockPlacementPolicy
HDFS's default for replication factor 3: first replica on the client's local node (or random), second on a node in a different rack, third on a different node in the same rack as the second. Cross-rack diversity on the first-to-second hop; intra-rack locality on the second-to-third hop for fast replication. The same algorithm as Cassandra's strategy, dressed in HDFS vocabulary.
Bin-packing vs random placement
When failure correlations are unknown or changing, random placement subject to "no two in the same rack" is safer than tight bin-packing. Bin-packed placements minimise intra-rack traffic but maximise certain correlated-failure classes. Most production systems pick random-with-rack-awareness and live with the extra traffic.
Where this leads next
Placement decides which nodes hold a key. Quorum consistency decides how many you contact per read or write. Chapter 78 covers tunable consistency — R, W, N. Later: sloppy quorums (ch.80) kick in when the preference list does not have enough healthy replicas; anti-entropy and Merkle trees (ch.79) reconcile drift; gossip propagates placement metadata.
Placement is the geometry of a distributed store. Every availability property reduces to "given the placement, what happens when this set of nodes goes away?"
References
- DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — the original Dynamo paper. Sections 4.1-4.3 cover the preference list and rack-aware placement across AZs.
- Apache Cassandra, NetworkTopologyStrategy and Snitches — the canonical reference for how Cassandra models rack and DC topology, with per-snitch conventions for on-prem and cloud.
- Cidon et al., Copysets: Reducing the Frequency of Data Loss in Cloud Storage, USENIX ATC 2013 — quantifies the tail-availability problem with random placement at scale and proposes constrained placement. Essential reading for clusters above a few hundred nodes.
- Apache Hadoop, HDFS BlockPlacementPolicyDefault source — HDFS's rack-aware placement reference implementation. Same greedy walk applied to blocks with a local-rack optimisation.
- Kleppmann, Designing Data-Intensive Applications, Chapter 5 — Replication, O'Reilly 2017 — the clearest pedagogical treatment of replication topology and its interaction with consistency.
- Cockroach Labs, Replication Zones and Placement Policies — CockroachDB's operational interface to placement policy, with constraint syntax (+region=us-east, -rack=rack-3), lease preferences, and partitioned-table geo-placement.