Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Amazon: cells, shuffle-sharding, isolated fates

It is 02:14 IST and Rohit, the on-call lead at PaySetu, is staring at a graph that has gone the wrong colour. One of PaySetu's three payments shards is reporting a dependency timeout against a vendor identity service, and the timeout has spread across all three shards in under ninety seconds. The vendor's status page says "investigating elevated error rates"; PaySetu's status page is about to say the same thing. Rohit's question, when he writes the post-mortem at sunrise, will not be "what failed?" — the failure was always going to happen, somewhere — but "why did one customer's bad luck become every customer's bad luck?". That single question is the entire reason Amazon's stack looks the way it does. Where the Google stack optimised for global consistency — every layer trusts a single namespace, a single time service, a single fleet — Amazon optimised for blast-radius bounding, where every service is built so that the failure of one customer, one region, one rack, one piece of code, cannot become the failure of all of them. The two stacks read the same papers and made opposite calls on the same trade-off space.

Amazon's distributed stack rests on three composable ideas: cells (a service is sliced into many small replicas of itself, each with bounded capacity), shuffle-sharding (each customer is mapped to a small random subset of cells, so any two customers share few cells), and isolated fates (every layer of the stack has its own independent control plane, deployment pipeline, and failure mode). Together these patterns turn the question "what is our blast radius?" from a hand-waving discussion into a number that can be measured, asserted in tests, and improved by adding cells. This chapter walks through each pattern with a worked example, then shows why Amazon and Google ended up with such different stacks despite reading the same papers.

Cells: every service is many small replicas of itself

The first thing that surprises engineers reading AWS architecture papers — Werner Vogels' Dynamo paper, Marc Brooker's Builder's Library posts, the Route 53 reliability paper — is how small the per-cell numbers are. Route 53 famously runs on four anycast networks of name servers, where each customer's hosted zone is placed on a small subset of name servers across those networks. DynamoDB partitions are capped at a few thousand requests per second per partition. S3's load balancers are sliced into named groups, and a customer's bucket name is hashed to a specific group. The unit at every layer is the cell — a small, fully-functional replica of the service, with bounded capacity and a known failure radius.

The point of a cell is not horizontal scaling — Amazon could trivially run one big cell with the same total throughput. The point is that the failure of one cell only affects the customers whose traffic was routed there, and that fraction is something you can compute, test, and bound. A cell that fails is not the same as a service that fails.

Figure: cellular architecture vs single-cluster architecture, a blast-radius comparison. Left panel: a single cluster of 100 nodes shared by all customers, so one bad request or bad deploy has a blast radius of 100%. Right panel: the same capacity split into ten 10-node cells behind a thin, stateless front-door router that maps each customer to one cell; a cell failure affects only the roughly 10% of customers routed there. Cell count is the design dial: more cells means a smaller blast radius but more operational overhead and more reserved slack.
Illustrative. The cell is the unit of failure. Cell count is a tunable: too few and a cell failure is too large; too many and the operational overhead and reserved-capacity slack become uneconomic. Amazon services typically settle in the 8–100 cells-per-region band.

Why cells beat horizontal scaling alone: a single 100-node cluster has the same throughput as ten 10-node cells, but the cluster has correlated failure modes — a memory leak in the binary takes down all 100 nodes, a config push pollutes all 100 nodes, a poison request can hit any of the 100 nodes. Cells convert these correlated failures into independent ones by giving each cell its own binary version, its own config-push schedule, and a bounded capacity that prevents any one customer's burst from saturating it. "Blast radius" stops being a hand-wavy adjective and becomes a number: with N equal-sized cells, a cell-killing failure affects at most 1/N of customers, and N must be chosen so that losing that 1/N slice at once is something the service and its customers can tolerate while the surviving cells absorb any redistributed load.

The contract that every Amazon service inherits is that a cell never depends on another cell of the same service. Cell 4 cannot call cell 7 to satisfy a request; if a customer's traffic is routed to cell 4, cell 4 must serve it without involving any peer. This is what makes cell-failure independence real rather than nominal — the moment cells gossip among themselves, a poison message in cell 4 can propagate to cell 7 and you have one big cluster pretending to be ten cells. The Route 53 reliability paper is explicit about this: each name server in a Route 53 anycast cell is independently configured, independently deployed, and serves its zones without cross-cell calls.
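To make the routing concrete, here is a minimal sketch of the front door implied above: a stateless mapper that pins each customer to exactly one cell, so a bad request can only ever poison the cell it lands on. All names are hypothetical; this is the shape of the pattern, not any AWS service's code.

# cell_router.py
# Minimal front-door sketch: stable customer -> cell mapping, no cross-cell calls.
import hashlib

N_CELLS = 10

def cell_for_customer(customer_id: str, n_cells: int = N_CELLS) -> int:
    # Hash the customer id so the same customer always lands on the same cell,
    # and the mapping can be recomputed by any stateless front-door instance.
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_cells

def handle_request(customer_id: str, payload: dict) -> dict:
    cell = cell_for_customer(customer_id)
    # The contract: this request is served entirely inside `cell`. Cell i never
    # calls cell j, so a poison payload has a blast radius of at most 1/N_CELLS.
    return {"routed_to": f"cell-{cell}", "payload": payload}

if __name__ == "__main__":
    for cust in ["paysetu", "cricstream", "kapitalkite"]:
        print(cust, "->", handle_request(cust, {})["routed_to"])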

Shuffle-sharding: each customer touches a small random subset

Cells alone solve "one cell's failure doesn't affect 100% of customers", but they leave a worse problem unsolved: what if a single customer is the failure? Suppose a customer of PaySetu's payment gateway is sending malformed requests at 50× normal volume — a buggy mobile-app release, say. If that customer is mapped 1-to-1 to a cell, that cell falls over and 1/N of all customers are affected. Worse: if a "noisy neighbour" happens to share a cell with PaySetu on one of PaySetu's own vendors, PaySetu's traffic suffers from a problem PaySetu didn't cause.

The fix is shuffle-sharding, an Amazon innovation made famous by Route 53 and used everywhere from API Gateway to AWS Lambda. The idea: instead of mapping each customer to a single cell, map each customer to a small random subset of cells — say 2 or 3 cells out of 8. With well-chosen parameters, two customers picked at random share few or no cells. A noisy customer can saturate only their own subset; everyone else keeps being served by the cells in their own shard that the noisy customer doesn't touch.
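A minimal sketch of that mapping, assuming the shard is derived deterministically from the customer id so any front-door instance can recompute it without a lookup table (the names and parameters are illustrative):

# shuffle_shard_assignment.py
# Each customer gets a stable, pseudo-random k-of-N subset of cells.
import hashlib, random

N_CELLS = 8
SHARD_SIZE = 3

def shard_for_customer(customer_id: str, n: int = N_CELLS, k: int = SHARD_SIZE) -> list[int]:
    # Seed a private PRNG from the customer id: the choice looks random across
    # customers but is stable for any one customer across front-door instances.
    seed = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return sorted(random.Random(seed).sample(range(n), k))

if __name__ == "__main__":
    for cust in ["paysetu", "cricstream", "kapitalkite"]:
        print(f"{cust:12s} -> cells {shard_for_customer(cust)}")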

The math is the load-bearing part. With N cells and shard size k, the probability that two random customers share any cell at all is 1 - C(N-k, k) / C(N, k). For N = 100, k = 2 that is about 4%; for N = 100, k = 5 it climbs to about 23% — too high for comfort. The trick is to pick k small enough that pairwise overlap stays low, but large enough that a client can ride out up to k-1 cell failures inside its shard.
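A quick closed-form check of those numbers, using the formula above:

# overlap_probability.py
# P(two random customers share at least one cell) = 1 - C(N-k, k) / C(N, k)
from math import comb

def p_share_any_cell(n: int, k: int) -> float:
    return 1 - comb(n - k, k) / comb(n, k)

for n, k in [(100, 2), (100, 5)]:
    print(f"N={n}, k={k}: P(share >= 1 cell) = {p_share_any_cell(n, k):.1%}")
# Prints roughly 4.0% and 23.0%, matching the figures in the text.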

Figure: shuffle-sharding with 8 cells and shard size k = 3. Customer A is assigned cells {1, 3, 6} and customer B cells {2, 5, 7}: they share no cells, so B's noisy traffic cannot harm A; a third customer C overlaps each of them only partially. Side panels show the distribution of shard overlap between random pairs and a bar chart comparing the blast radius of simple sharding (1/8 of customers per noisy customer) with shuffle-sharding's much smaller expected figure. Each AWS service picks its own (N, k).
Illustrative. Shuffle-sharding shrinks a single noisy customer's blast radius from 1/N to the chance that another customer's shard lies mostly or entirely inside the noisy customer's k saturated cells (roughly 1/C(N, k) for a full knockout when both shards have size k). For Route 53's published parameters, the expected affected fraction is well under 1%.
# shuffle_sharding_blast_radius.py
# Estimate the blast-radius distribution for shuffle-sharding empirically.
# Two customers are chosen, one is "noisy" and saturates all k of its cells.
# How many of the other customer's cells are also saturated?
import random, statistics
from math import comb   # used for the analytic comparison printed at the end

N = 100              # total cells
K = 5                # shard size per customer
NUM_CUSTOMERS = 5000
TRIALS = 20000

def make_shard():
    return frozenset(random.sample(range(N), K))

random.seed(42)
customers = [make_shard() for _ in range(NUM_CUSTOMERS)]

def simulate_one_noisy(noisy_idx):
    noisy = customers[noisy_idx]
    affected = 0
    for i, victim in enumerate(customers):
        if i == noisy_idx:
            continue
        # victim is "down" iff all of victim's cells are in noisy's shard
        if victim.issubset(noisy):
            affected += 1
    return affected

# Sample TRIALS noisy customers and count affected victims
results = []
for _ in range(TRIALS):
    idx = random.randrange(NUM_CUSTOMERS)
    results.append(simulate_one_noisy(idx))

mean_affected = statistics.mean(results)
analytic = (NUM_CUSTOMERS - 1) / comb(N, K)   # expected fully-knocked-out victims per noisy customer
print(f"Mean victims fully knocked out per noisy customer: {mean_affected:.4f}")
print(f"Fraction: {mean_affected / NUM_CUSTOMERS * 100:.5f}% of customer base")
print(f"Analytic expectation: {analytic:.6f} victims per noisy customer "
      f"({analytic / NUM_CUSTOMERS * 100:.8f}% of customer base)")
print(f"Compare: simple sharding (1 cell each) = 1/{N} = {100/N:.2f}% per noisy customer")

Sample output on a CricStream analysis box (at this scale full shard collisions are so rare that the empirical lines almost always read zero; the analytic line carries the comparison):

Mean victims fully knocked out per noisy customer: 0.0000
Fraction: 0.00000% of customer base
Analytic expectation: 0.000066 victims per noisy customer (0.00000133% of customer base)
Compare: simple sharding (1 cell each) = 1/100 = 1.00% per noisy customer

Walkthrough: each customer is assigned a random shard of K=5 cells out of N=100. A customer is "fully knocked out" only when every one of their cells is saturated by the noisy customer — that is, when the victim's shard is a subset of the noisy customer's shard. For two random K=5 shards out of 100 that probability is 1/C(N,K), about 1.3e-8 per pair — so small that the simulation almost never observes a single full knockout, which is why the script also prints the analytic expectation. That expectation sits roughly six orders of magnitude below simple sharding's 1%. Why this scaling matters: simple sharding's blast radius is 1/N — to halve it you must double the cell count, which doubles operational cost. Shuffle-sharding's full-knockout probability falls combinatorially with K, because the victim must lose all K of its cells at once. Route 53 uses a small K and a moderate N to get blast radii measured in parts-per-million per noisy zone, with the operational cost barely larger than simple sharding's. This is the kind of asymmetry that justifies the entire pattern.

The catch is that shuffle-sharding only works if the client can tolerate K-1 cell failures. Route 53 resolvers are designed to query multiple name servers in parallel and accept the first response, which means they tolerate up to K-1 silent failures of name servers in their assigned shard. A client that always queries one name server gets no benefit from shuffle-sharding — it has reverted to simple sharding via choice. The pattern is therefore a co-design between server-side cell topology and client-side retry/parallel-query logic. AWS SDKs are written to do this naturally; client libraries written without this assumption do not get the blast-radius benefit.
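A minimal sketch of the client-side half, with a hypothetical query_cell standing in for the network call; the only point is that the caller works through every cell in its shard before giving up, so up to K-1 cell failures are invisible to it:

# shard_aware_client.py
import random

def query_cell(cell: int, request: str) -> str:
    # Stand-in for a real network call; pretend cells 3 and 6 are down.
    if cell in {3, 6}:
        raise TimeoutError(f"cell-{cell} timed out")
    return f"cell-{cell} served {request!r}"

def shard_aware_call(shard: list[int], request: str) -> str:
    # Try the assigned cells in random order; first success wins. A production
    # client would query several cells in parallel and take the first answer.
    last_error = None
    for cell in random.sample(shard, len(shard)):
        try:
            return query_cell(cell, request)
        except TimeoutError as err:
            last_error = err
    raise RuntimeError(f"all {len(shard)} cells in shard failed") from last_error

if __name__ == "__main__":
    # Cells 3 and 6 are down, but cell 1 still answers, so the caller never notices.
    print(shard_aware_call([1, 3, 6], "GET /zone/paysetu.example"))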

Isolated fates: every layer has its own control plane

The third pillar is the one that is hardest to describe in a paper because it's organisational as much as technical: every AWS service has a control plane that is isolated from the data plane it controls, and isolated from the control planes of services that depend on it. EC2's control plane (the one that creates instances) does not depend on EC2's data plane (the running instances). S3's control plane (the one that creates buckets) is on a separate dependency path from S3's data plane (the one that serves GETs and PUTs). Route 53's control plane is on a different code base, deployed at a different cadence, with a different on-call rotation, from Route 53's data plane.

The principle has a name inside Amazon: isolated fates. The 2017 S3 outage made this concrete: an operator typo took down the S3 index subsystem in us-east-1, and requests that depended on the index failed until it came back, but the underlying data — bucket contents — remained intact throughout. The recovery was painful (the index subsystem needed a full restart) but nothing was lost, because durability in the storage layer did not depend on the subsystem that failed; the control plane and data plane had been deliberately decoupled.

Figure: isolated fates — control plane vs data plane in an AWS-style service. The control plane (API frontend, scheduler, deployment pipeline, config push) sits above an isolation boundary; the data plane (request router, replica fleet, storage, local config cache) sits below it, and neither is a request-time dependency of the other. If the control plane is down, new resources cannot be created but existing ones keep serving; if the data plane is down, existing requests fail while creates and updates still flow. The 2017 S3 us-east-1 outage hit the index subsystem: recovery took hours, data loss was zero.
Illustrative. Control plane and data plane fail independently when the isolation boundary is real. The principle is enforced by separate code bases, separate deployment pipelines, and the rule that data-plane code at request time never calls the control plane API.
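A sketch of what that rule looks like in data-plane code, with hypothetical names and a local file standing in for the cached config store: the request path reads only the last-known-good snapshot, and only a background refresher ever talks to the control plane.

# static_stability_sketch.py
import json, time

CONFIG_CACHE_PATH = "/tmp/dataplane_config.json"   # illustrative location

def fetch_from_control_plane() -> dict:
    # Stand-in so the sketch runs; a real fetch would be an authenticated API call.
    return {"version": 42, "routing_table": {"cell-1": "10.0.0.1"}}

def refresh_config() -> None:
    # Background job. If the control plane is down this fails quietly and the
    # data plane keeps serving from the previous snapshot (static stability).
    try:
        snapshot = {"fetched_at": time.time(), **fetch_from_control_plane()}
        with open(CONFIG_CACHE_PATH, "w") as f:
            json.dump(snapshot, f)
    except Exception:
        pass   # degrade to last-known-good; never block the request path

def serve_request(key: str) -> str:
    # Request path: reads only the local snapshot, never the control plane.
    with open(CONFIG_CACHE_PATH) as f:
        config = json.load(f)
    return f"served {key!r} using routing table v{config['version']}"

if __name__ == "__main__":
    refresh_config()
    print(serve_request("GET /bucket/object"))

If the control plane is unreachable for hours, the worst case is stale routing data, not a serving outage; that asymmetry is what the isolation boundary buys.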

The isolation pattern reaches further than the control-plane / data-plane split. Each AWS service has its own deployment pipeline that does not depend on a "central" pipeline service. This sounds wasteful — why does every team rebuild the deployment wheel? — but a central deployment pipeline outage takes down everyone's deploys at the same time, which means a critical bug fix cannot be shipped during the outage. By having each service own its pipeline, the failure modes are independent. The discipline costs more in nominal engineering time and saves more in incident time: the alternative is a single shared pipeline service that becomes a Tier-0 dependency for every team, and when that service has an incident, every team is blocked from deploying simultaneously, including the team that needs to deploy the pipeline-service fix. Amazon's principle is that no single service can be on every other service's critical path; the cost is duplicated tooling, the benefit is uncorrelated incidents.

The same principle drives the regional separation. AWS regions are deliberately disconnected at the control-plane level. A change made in us-east-1's IAM does not synchronously propagate to ap-south-1; the propagation is asynchronous and survives multi-hour partitions between regions. This is a different choice from Spanner's globally-replicated transactions: Spanner trades increased commit latency for global consistency, while IAM trades global consistency for regional fate isolation. Both are defensible; they encode different priorities.

Common confusions

  • "Cells are just shards" — sharding splits data; cells split failure domains. A cell is a complete replica of the service binary, with its own deployment, config, and capacity, where a shard is just a slice of the data behind a single binary. Two shards on the same binary go down together when the binary panics; two cells on different binaries don't.

  • "Shuffle-sharding is the same as consistent hashing" — consistent hashing maps each key to a single node (or a tight ring neighbourhood), which means each customer touches a small contiguous set of nodes. Shuffle-sharding maps each customer to a random subset, which is what makes the overlap probabilistically tiny. Replacing shuffle-sharding with consistent hashing reintroduces the noisy-neighbour problem because adjacent customers on the ring share most of their nodes.

  • "Cells must be the same size" — they don't have to be, but mixed cell sizes complicate the customer-to-cell mapping. Most Amazon services pick a single cell size (the biggest customer + 30% headroom, say) and add cells horizontally as the customer base grows. Heterogeneous cell sizes are a thing, but they require a smarter mapper.

  • "Isolated fates means microservices" — microservices fragment code; isolated fates fragment failure modes. You can build a monolith with isolated fates (separate control-plane code path, separate deployment, careful runtime isolation) and you can build microservices with shared fates (everyone calls the same auth service, the same config service, the same deployment pipeline — and they all go down together when one of those does). The pattern is about dependency-graph shape, not deployment unit size.

  • "AWS is one stack" — AWS is hundreds of independent services, each with its own cells, its own shuffle-sharding choices, its own control-plane / data-plane split. The unifying contracts are the patterns (cells, shuffle-sharding, isolated fates, regional independence), not a single shared infrastructure layer. This is the inverse of Google's stack, where the unifying contract is shared infrastructure (Chubby, Borg, TrueTime).

  • "Blast radius is a percentage" — blast radius is a distribution, not a single number. The 99th-percentile blast radius (the worst 1% of failure scenarios) often dominates the median. Amazon's services target the 99.9th percentile blast radius, not the median — which is why the cell count and shard size are picked from worst-case math, not average-case.

Going deeper

Route 53 — shuffle-sharding's clearest worked example

The NSDI 2020 paper "Millions of Tiny Databases" and the Builder's Library post "Workload isolation using shuffle-sharding" are the best-documented shuffle-sharding case studies. Route 53's customer base is in the tens of millions of zones; the cell count is in the low thousands across four anycast networks; the shard size per zone is small (the published number is in single digits). The combinatorial space is enormous, the per-pair overlap is parts-per-million, and the operational result is that Route 53 has had multi-year stretches with zero region-wide DNS outages — a reliability outcome that traces directly to the shuffle-sharding math, not to any heroic engineering. Reading Marc Brooker's posts on this is the closest publicly accessible equivalent to reading Amazon's internal design docs.

Why Amazon and Google made opposite choices despite reading the same papers

Both companies started in roughly the same era with roughly the same papers. Google pursued vertical integration: one Chubby, one Borg, one TrueTime, one global namespace, every layer trusts the layer below. Amazon pursued horizontal isolation: many services, each with their own cells, their own deployment pipelines, no shared time service, no shared scheduler. The reason is not that one approach is "better" — they encode different bets. Google's bet is that consistency at scale is the hard problem and is worth paying for in shared infrastructure. Amazon's bet is that isolation at scale is the hard problem and is worth paying for in duplicated infrastructure. Both bets have been validated: Google can offer Spanner's external consistency at planetary scale, and AWS can offer regional isolation that survives multi-hour control-plane outages without data loss.

The downstream effect is that systems built on AWS look different from systems built on Google's stack. A KapitalKite trading system on AWS will lean on regional isolation, multi-AZ deployment, and shuffle-sharded vendor APIs; a CricStream pipeline on GCP will lean on Spanner for cross-region transactions and Borg-Kubernetes for shared-fleet scheduling. The same engineer writing for the same use case will produce a different architecture on each cloud, because the cloud's own contracts shape what's easy and what's hard.

Cellular architecture beyond AWS — Stripe, Cloudflare, Discord

Cells and shuffle-sharding are not Amazon-exclusive ideas, but they are most fully formalised there. Stripe's published blog posts describe a cellular architecture for their payments-API control plane; Cloudflare uses anycast cells with a similar shuffle-sharding flavour for DDoS protection; Discord's voice-chat infrastructure runs on per-region cells that are deliberately small. The pattern is portable, but reproducing it requires the same operational commitment: separate deployment pipelines per cell, no cross-cell calls in the data plane, and customer routing logic that respects shard boundaries even under failure. PaySetu can adopt cellular architecture for its payments gateway today; none of the required technology is proprietary to AWS. The discipline is the part that takes time.

Cell sizing — the worst-case math

Cell sizing follows from two numbers: the largest customer's peak load (P_max) and the headroom factor (H, typically 1.3–1.5). A cell must be sized so that cell_capacity >= H * P_max, otherwise a single big customer can saturate a cell on its own. The cell count is then chosen so that N * cell_capacity >= H * total_peak_load. For a CricStream-scale streaming service with 25M concurrent viewers during a cricket final and a single largest customer (the live-match endpoint) accounting for 5% of total load, taking H = 1.5 gives cell_capacity >= 0.075 * total_load and a floor of N >= 20 cells (note that when cell capacity sits exactly at its floor, the minimum cell count reduces to total_load / P_max, independent of H). In practice services run more cells than the floor to keep blast radius below the SLO, often landing at 30–50 cells per region.
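The same arithmetic in a few lines of Python, using the illustrative numbers from the paragraph above:

# cell_sizing.py
import math

total_peak = 25_000_000          # peak concurrent load (the cricket final)
biggest_customer_share = 0.05    # largest single customer = 5% of total
H = 1.5                          # headroom factor

p_max = biggest_customer_share * total_peak
cell_capacity = H * p_max                                  # cell_capacity >= H * P_max
n_floor = math.ceil(total_peak * H / cell_capacity)        # N * cell_capacity >= H * total

print(f"cell_capacity >= {cell_capacity:,.0f}  ({cell_capacity / total_peak:.3f} x total load)")
print(f"minimum cell count N >= {n_floor}")
# Prints cell_capacity >= 1,875,000 (0.075 x total) and N >= 20; services then
# round N up further until 1/N sits below the blast-radius SLO.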

Reproduce the shuffle-sharding simulation on your laptop

python3 -m venv .venv && source .venv/bin/activate
python3 shuffle_sharding_blast_radius.py
# To extend: vary K from 2 to 10 with fixed N=100 and plot blast radius vs K.
# The full-knockout probability is 1 / C(N, K) per pair, so the curve drops
# combinatorially. This is the curve every shuffle-sharded service trades off
# against client-side tolerance — bigger K means more retry budget and more
# parallel queries.
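A closed-form version of that sweep, since the per-pair full-knockout probability is exactly 1 / C(N, K):

# k_sweep.py
from math import comb

N = 100
for K in range(2, 11):
    print(f"K={K:>2}: per-pair full-knockout probability = {1 / comb(N, K):.3e}")
# K=2 -> 2.0e-04, K=5 -> 1.3e-08, K=10 -> 5.8e-14: each extra cell in the shard
# buys orders of magnitude of isolation, at the price of more client-side fan-out.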

Where this leads next

The pattern from this chapter — composable failure-isolation primitives — sets up the rest of Part 20's case studies:

  • Meta: scaling the social graph — Meta's TAO uses cellular ideas at a different layer (one cell per region, sharded by graph-id), with the read-mostly access pattern letting them push more aggressive caching than AWS data-plane services typically allow.
  • Netflix: resilience culture — Netflix runs on AWS, so it inherits cells and shuffle-sharding for free; the Netflix story is about the operational practices (chaos engineering, regional evacuations) that exercise those primitives in production.
  • Google's stack — the contrast chapter; reading Amazon and Google back-to-back is the single best way to understand that "distributed systems best practice" is a misnomer. There are bets, not best practices.

The thread that ties Amazon's choices together is that failure isolation is always cheaper than failure prevention. You cannot prevent every bug, every poison message, every operator typo. What you can do is choose an architecture where each of those failures has a small, bounded, measurable blast radius — and then improve the radius over time by adding cells, shrinking shards, and tightening the isolation boundaries between layers. The cells, shuffle-sharding, and isolated-fates patterns are the three primitives that make this measurement possible.

References

  1. Marc Brooker, "Workload isolation using shuffle-sharding" (AWS Builder's Library) — the canonical writeup, with worked Route 53 numbers.
  2. Marc Brooker, "Reliability, constant work, and a good cup of coffee" (AWS Builder's Library) — the constant-work pattern that makes cell sizing tractable.
  3. Colm MacCárthaigh, "Static stability using Availability Zones" (AWS Builder's Library) — the regional-isolation half of isolated fates.
  4. Marc Brooker et al., "Millions of Tiny Databases" (NSDI 2020) — Amazon's Physalia work; cellular architecture for control planes.
  5. AWS, "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region" (Feb 28, 2017 post-mortem) — the operator-typo S3 outage that demonstrated control-plane / data-plane separation.
  6. Werner Vogels, "Eventually Consistent" (CACM 2009) — the philosophical underpinning for Amazon's preference for fate isolation over global consistency.
  7. See also the companion chapters: Google's stack; bulkheads; the principles; Netflix; and "every system is unique — study the real ones".