In short

Sharding is the act of splitting a single logical dataset across many physical servers so no one machine holds everything. Every sharded database has to answer one question before it does anything else: given a key, which shard owns it? Three strategies dominate, and essentially every production system is some mixture of them.

Range sharding divides the keyspace into contiguous intervals. Shard 0 holds keys from "a" to "f", shard 1 from "g" to "m", and so on. Range queries ("give me everyone between 'a' and 'c'") land on a single shard. The failure mode is skew: if your keys are timestamps and yesterday's data is hotter than last year's, one shard burns while the others nap. MongoDB's default chunking, HBase, CockroachDB, TiKV, and Spanner all use ranges.

Hash sharding applies a hash function — MD5, xxHash, Murmur3 — to each key, and the hash determines the shard. The distribution is uniform by construction. Point lookups are fast; range scans become scatter-gather across every shard. The variant with consistent hashing (ch.76) minimises data movement on node join and leave. Cassandra, DynamoDB, ScyllaDB, and Redis Cluster all use hash partitioning.

Directory sharding abandons formulas entirely. A central service stores an explicit key → shard map. Every routing decision consults the directory. The advantage is surgical flexibility — you can move one hot tenant to its own shard without reshuffling anything else. The cost is one extra round-trip per query (usually cached) and a registry that must not go down. Vitess's VSchema and Citus's metadata tables are the canonical examples.

Most real systems combine strategies: hash for the long tail, directory for the celebrity tenants, ranges for time-ordered archives. This chapter compares the three, shows where each wins, and ends with the decision tree you apply when you finally outgrow one box.

You have 100 GB of data on one server. The server runs at 80% CPU, the disks read at capacity, every new user adds more bytes. No query rewrite, no index, no cache tier buys back more than a quarter of the headroom you need. The fix is not another optimisation — it is another server. Eventually thirty of them.

That pivot is sharding. Until now the database has been one thing in one place. From this chapter onward, it is one logical thing spread across N physical places, and every read and write starts with a routing decision: which of the N places holds this key? Three families of answers exist, and the rest of Build 12 is about living with the trade-offs they force.

Range sharding — keys split into contiguous chunks

The most intuitive scheme is to carve the keyspace into intervals and assign each interval to a shard. If your keys are usernames, shard 0 gets "a" through "f", shard 1 gets "g" through "m", shard 2 gets "n" through "s", shard 3 gets "t" through "z". Every key belongs to exactly one interval. Every shard knows its own interval and no one else's. A router — which can be a library in the client, a proxy, or a coordinator node — keeps a small table of boundaries and does a binary search into it to pick the shard for a given key.

The table is tiny — thirty shards means a few hundred bytes of metadata. Lookup is O(log N) even for thousands of shards. You cache the table aggressively; it changes only when shards split or merge.
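The boundary table and its binary search fit in a few lines. A minimal sketch (boundary values are illustrative; a real router would also handle keys outside the table and refresh on splits):

```python
import bisect

# Sorted interval starts; shard i owns [starts[i], starts[i+1]).
starts = ["a", "g", "n", "t"]   # shard 0: a-f, 1: g-m, 2: n-s, 3: t-z

def route(key):
    # bisect_right finds the first boundary greater than the key;
    # the owning shard is the one just before it.
    return bisect.bisect_right(starts, key) - 1

print(route("apoorv"))   # shard 0
print(route("zara"))     # shard 3
```

Because the table changes only on split or merge, every client can cache it and re-fetch lazily when a shard answers "not mine any more".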

The big win is range scans. "Give me every user from 'apoorv' to 'bhavesh'" becomes "hit shard 0, scan the interval, return". One shard, one network round-trip, one disk seek sequence. If your workload is dominated by range queries — time-series analysis, paginated listings, prefix searches — range sharding is almost magical. HBase built its reputation on this: rows sorted within regions, scans costing proportionally to their size rather than to the total table.

Why range queries are cheap on range shards: the keys you want live physically next to each other on the same shard. The database reads one continuous run of rows with sequential disk I/O, an order of magnitude faster than random I/O. A hash-sharded system would hit every shard — N times the work for the same scan.

The big loss is hot partitions. If your keys are timestamps and every new event is written with now(), every new write lands on whichever shard owns the "current" range. Shard 3 handles 100% of incoming traffic while shards 0, 1, 2 sit idle. When shard 3 fills, you split it, and the new shard 3b inherits the hot traffic. You never get ahead of the skew because the keys themselves are monotonic.

The same pattern appears with auto-increment IDs, monotonic UUIDs, and any key whose ordering correlates with write time. The fix: pick a different shard key (chapter 92), prefix keys with a random bucket (so 2026-04-24T10:00:00 becomes bucket7|2026-04-24T10:00:00), or abandon range sharding for the hot table.
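The bucket-prefix trick can be sketched as follows (a toy version: the bucket count and separator are arbitrary choices, and the point of the second function is the cost, since any time-window scan now has to fan out across every bucket):

```python
import random

NUM_BUCKETS = 8

def bucketed_key(timestamp):
    # A random bucket prefix spreads monotonic timestamps across shards.
    return f"bucket{random.randrange(NUM_BUCKETS)}|{timestamp}"

def scan_ranges(start_ts, end_ts):
    # The price: one range scan per bucket instead of one total.
    return [(f"bucket{b}|{start_ts}", f"bucket{b}|{end_ts}")
            for b in range(NUM_BUCKETS)]

print(bucketed_key("2026-04-24T10:00:00"))
print(len(scan_ranges("2026-04-24T00:00:00", "2026-04-25T00:00:00")))  # 8
```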

A second complication is unequal data volumes. Shard users by initial letter and shard 0 ("a-f") overflows while shard 4 ("v-z") has 40% free disk. Dynamic splitting helps: when a range grows past a threshold, the database splits it in half. CockroachDB does this at the range level (512 MB default, automatic). MongoDB's balancer moves chunks to equalise. The operator rarely picks boundaries by hand; the database computes them from observed data.

Range-sharded systems in production: HBase, MongoDB in ranged mode, CockroachDB and TiKV (512 MB ranges replicated via Raft), Spanner (directory placement of ranges across Paxos groups), and Bigtable (the original row-range system).

Hash sharding — uniform random assignment

The second strategy is the one you probably already reach for. Apply a hash function to the key, take the result modulo the number of shards, and that is your shard.

def hash_shard(key, num_shards):
    # Python's builtin hash() is salted per process for strings;
    # real routing needs a stable hash (MD5, xxHash, Murmur3)
    return hash(key) % num_shards

Two lines. The math is simple and the guarantees are strong: any decent hash function (MD5, xxHash, Murmur3) produces nearly uniform output. Every shard gets roughly 1/N of the keys regardless of how skewed the key distribution is. The "aaa" user and the "zzz" user are equally likely to end up on any shard.

The chokepoint is resharding cost. Go from 10 shards to 11 and the formula changes from h % 10 to h % 11; they give the same answer for only about 1/11 of keys. Roughly 91% of your data moves. For 100 TB that is catastrophic. Consistent hashing (chapter 76) solves this: place keys and shards on a 64-bit ring, each key belongs to the next shard clockwise, and a single shard add or remove moves only 1/N of the data.
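The ~91% figure is easy to check empirically. A quick sketch, using MD5 as the stable hash as elsewhere in this chapter:

```python
import hashlib

def h(key):
    # Stable hash: same key, same integer, on every machine.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user:{i}" for i in range(100_000)]
# A key stays put only if h % 10 and h % 11 happen to agree.
moved = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
print(f"{moved / len(keys):.1%} of keys change shards")  # ~91%
```

The agreement probability is 1/11 (the two residues coincide only when the hash mod 110 is below 10), so about 10/11 of the data moves.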

Hash sharding with consistent hashing is what Cassandra, DynamoDB, ScyllaDB, Riak, and Redis Cluster (slot variant) use. Virtual nodes are the production default; raw modulo-N is for examples.

The big loss is range queries. Ask "every user from 'a' to 'c'" and the hash scatters those keys across every shard. The query becomes a scatter-gather: fan out to all N shards, wait for all N, collect the union. A range scan that was one shard's sequential read under range sharding becomes N random-access queries.

Why range queries are bad on hash shards: the hash function deliberately destroys the key's natural ordering — that is its whole job. Two keys that were "close" in the original ordering are unrelated in hash space. You cannot do a range scan without touching every shard.

If your workload is point lookups — "get the cart for user 42", "get the profile with ID abc123" — this doesn't matter. Hash the key, hit one shard, done. DynamoDB's API shape (PutItem, GetItem, query-within-a-partition) is designed around this: you work on a single hash-selected partition at a time.

A second trade-off: hash sharding forbids correlated keys from living together. If you want "all items for this order" to be one shard's lookup, they must hash to the same shard. The usual trick is a composite shard key: hash on order_id but within that bucket sort by item_id. Cassandra's partition-key / clustering-key split does exactly this — partition key picks the shard, clustering key orders rows inside. Hash-sharded point lookups at the partition level, range-sharded scans within.
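A minimal sketch of the composite-key idea (order and item names are hypothetical; Cassandra does this natively, but the pattern works in any hash-sharded store):

```python
import hashlib

NUM_SHARDS = 16

def shard_for(order_id):
    # Partition key: the hash picks the shard, so every item of one
    # order co-locates on the same shard.
    h = int(hashlib.md5(order_id.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

def storage_key(order_id, item_id):
    # Clustering key: within the shard, rows sort by (order_id, item_id),
    # so "all items for this order" is one sequential scan.
    return (order_id, item_id)

items = ["item-003", "item-001", "item-002"]
# All three land on one shard, and sort in item order inside it.
assert len({shard_for("ord-42") for _ in items}) == 1
print(sorted(storage_key("ord-42", i) for i in items))
```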

Directory sharding — explicit lookup

The third strategy gives up on formulas. Instead of computing the shard from the key, you look it up in a directory — a table that explicitly says which shard each key (or each group of keys) lives on.

The directory is usually coarse — tens of thousands of buckets or tenants, not billions of keys. A tenant scheme has one entry per tenant; a time-partitioned scheme has one entry per month. The directory holds a few MB of metadata; every router consults it (usually through a cache) before sending a query.

The structural advantage is arbitrary flexibility. Move one tenant to a new shard? Update the directory entry, migrate in background, flip the pointer. Give one big tenant its own dedicated shard while the small ones share? Put that tenant's entry on one shard, everyone else's on another. None of this is possible with pure range or hash sharding, where shard assignment is a function of the key.

The cost is operational complexity. The directory must be highly available — if it goes down, every query stalls. Typical deployments run the directory on a consensus-backed store (etcd, Zookeeper, the database's own metadata tables) with aggressive client-side caching.

Why directory sharding isn't just "a table somewhere": the directory is on the request path for every query. Naive implementation does a round-trip to the directory, then a second to the shard — doubling latency. Production systems cache aggressively; the common case is zero extra round-trips. But the directory itself cannot be a single point of failure, which means consensus, replication, and the infrastructure that entails.

Directory-sharded systems in production: Vitess uses VSchema — a YAML mapping of tables and keyspace-ids to shards, stored in a topology server (etcd or Zookeeper), consulted by vtgate routers. Citus stores shard placement in catalog tables on the coordinator node. Custom per-app routers — Slack's shard directory, Shopify's pod router, Uber's Schemaless — are all directory-based. Whenever sharding logic has business rules ("these tenants are co-located", "this table follows the user's home region"), a formula can't express it and a directory must.

The decision tree

Given a new sharded table, here is the question sequence:

  1. Do you need efficient range queries on the shard key? If yes, you cannot use hash sharding without a workaround. Go to range or directory.
  2. Is your workload evenly distributed across keys? If yes, hash is the simplest and cheapest. No hot shards, no rebalancing gymnastics.
  3. Do you have celebrity tenants or known hotspots? If yes, directory lets you isolate them. Hash would make them tolerable but not solve the underlying problem.
  4. Are transactional writes confined to a single partition? If yes, hash (on the partition key) guarantees the transaction lives on one shard.
  5. Is your system multi-tenant with wildly unequal tenant sizes? If yes, directory is almost always the right answer — hash distributes rows uniformly, but tenants are not uniform.
  6. Is the dataset time-ordered and append-dominant? If yes, range with a bucketed prefix (so writes spread across shards) is good; pure range is hot on the tail.
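The question sequence can be encoded as a toy helper. This is an illustration of the priorities, not a library function; the flags and return labels are invented for the sketch:

```python
def pick_strategy(needs_range_scans, uniform_keys, celebrity_tenants,
                  single_partition_txns, unequal_tenants, time_ordered):
    # Tenant skew and known hotspots dominate: only a directory can
    # isolate them, so check those first.
    if unequal_tenants or celebrity_tenants:
        return "directory"
    if needs_range_scans or time_ordered:
        return "range (bucketed prefix if append-heavy)"
    if uniform_keys or single_partition_txns:
        return "hash (consistent hashing in production)"
    return "hybrid: measure the workload first"

print(pick_strategy(False, True, False, True, False, False))
```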

The real world rarely gives a clean answer. Most large systems end up with a hybrid — directory at the top level, hash inside each shard, range within each partition for clustering. That is fine. The question you ask at each level is "what does this layer need to optimise for?" and you pick the strategy that serves that layer.

Python implementation — each strategy

The three strategies written as they would appear in a routing library. Each one is short; the substance is in what they trade off, not in the code.

# Range sharding — boundaries is a sorted list of (start, end, shard_id)
def range_shard(key, boundaries):
    for start, end, shard_id in boundaries:
        if start <= key < end:
            return shard_id
    raise KeyError(f"no shard for key {key!r}")

# Hash sharding — plain modulo; use consistent hashing in production
import hashlib
def hash_shard(key, num_shards):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

# Directory sharding — directory is a lookup service
import time

class DirectoryRouter:
    def __init__(self, directory_service, cache_ttl=60):
        self.directory = directory_service
        self.cache = {}
        self.cache_ttl = cache_ttl

    def get_shard(self, key):
        entry = self.cache.get(key)
        if entry and entry["expires_at"] > time.time():
            return entry["shard_id"]
        shard_id = self.directory.lookup(key)
        self.cache[key] = {
            "shard_id": shard_id,
            "expires_at": time.time() + self.cache_ttl,
        }
        return shard_id

Thirty-odd lines. Range does a linear scan of a small boundary table — you would use bisect in a real implementation, but with tens to hundreds of shards the constant matters less than readability. Hash applies MD5 and takes modulo; the % num_shards is where consistent hashing would replace the plain modulo in production code. Directory is a two-tier lookup: check the local cache, fall back to the remote directory service, cache the result.

What the sketch omits: ring maintenance, shard split protocols, directory update propagation, TTL invalidation after migrations, and the machinery to keep all of these consistent under node churn. Each is a chapter's worth of engineering on top of the core routing decision.

Splitting and merging shards

Shards outgrow their hosts. Rebalancing differs sharply across strategies.

Range: pick the overloaded shard, find a midpoint (by key count, byte count, or QPS), declare a new boundary. Keys above the midpoint go to a new shard; keys below stay. Only one shard's data moves. CockroachDB runs this in the background: any range past 512 MB splits, coordinated through the Raft log so readers and writers see a consistent view.
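A split can be sketched as finding the midpoint of the overloaded shard's keys and inserting a new boundary (a toy in-memory version; real systems coordinate the split through a consensus log so routers never see a torn table):

```python
def split_shard(boundaries, shard_id, keys_on_shard, new_shard_id):
    # boundaries: sorted list of (start, end, shard_id)
    # keys_on_shard: sorted keys currently stored on shard_id
    for i, (start, end, sid) in enumerate(boundaries):
        if sid == shard_id:
            mid = keys_on_shard[len(keys_on_shard) // 2]
            # Replace one interval with two: lower half stays,
            # upper half goes to the new shard.
            boundaries[i:i + 1] = [(start, mid, shard_id),
                                   (mid, end, new_shard_id)]
            return mid
    raise KeyError(shard_id)

b = [("a", "n", 0), ("n", "{", 1)]   # "{" is just past "z" in ASCII
mid = split_shard(b, 0, ["alice", "bob", "carol", "dave"], 2)
print(mid, b)  # carol; boundaries now a-carol → 0, carol-n → 2, n-{ → 1
```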

Hash: plain modulo-N is catastrophic (91% moves on N+1). Every production hash-sharded system uses consistent hashing with virtual nodes from day one. Adding a node puts 256 new virtual positions on the ring; each steals a small arc; total data moved is 1/N. Merging is symmetric.
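The 1/N movement is visible in a minimal ring sketch (toy virtual-node count and node names; production rings also track replicas and weights):

```python
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets `vnodes` positions on the ring.
        self.points = sorted((h(f"{n}#{v}"), n)
                             for n in nodes for v in range(vnodes))

    def owner(self, key):
        # First ring position clockwise from the key's hash, wrapping.
        i = bisect.bisect(self.points, (h(key),)) % len(self.points)
        return self.points[i][1]

keys = [f"k{i}" for i in range(10_000)]
before = Ring(["n0", "n1", "n2"])
after = Ring(["n0", "n1", "n2", "n3"])
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved / len(keys):.1%} moved on adding a 4th node")  # ~25%
```

Only keys whose clockwise-next point is now one of n3's virtual nodes change owner, so the moved fraction is n3's share of the ring, roughly 1/N.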

Directory: move individual keys. Copy their data to the new shard, update the directory entry, invalidate caches. Only the named keys are affected. This is the strategy's defining superpower: if one tenant is causing pain, you relocate that tenant without touching anyone else.

Why split granularity matters: range splits at the boundary (one shard at a time), hash splits at the virtual-node level (many small arcs at once), directory splits at the key level. Directory is best for surgical interventions; hash for bulk rebalancing; range for predictable growth.

Hot-partition mitigation

Each strategy has a characteristic failure mode and fix.

Range gets hotspots when the shard key correlates with write time. Fix: choose a shard key that distributes more uniformly — a hash of the timestamp, a composite of (bucket, timestamp) where the bucket is random in 0..N, or a reverse-ordered timestamp. All of these break the natural "give me everything from yesterday" scan; you trade range-query locality for even distribution.

Hash does not get partition-level hotspots. What it can still suffer is per-key hotspots: one celebrity profile drawing 10% of all reads, overwhelming its owning shard. Sharding distributes keys across shards; it does not distribute requests to a single key. Fix: caching, replication, or explicit hot-key splitting at the application layer.

Directory handles hotspots surgically — identify hot keys, move them to their own shard, update the directory. Ten million individual entries defeats the "small metadata" advantage, so in practice you combine: hash most keys, use the directory only for hot exceptions. Slack's channel hosting uses exactly this: hash for channel IDs by default, directory for the few channels large enough to need dedicated shards.
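The hash-with-exceptions combination is a few lines of routing logic. A sketch (names and the override map are hypothetical):

```python
import hashlib

class HybridRouter:
    """Directory entries only for the hot exceptions; hash for everyone else."""
    def __init__(self, num_shards, overrides=None):
        self.num_shards = num_shards
        self.overrides = overrides or {}   # key -> dedicated shard_id

    def get_shard(self, key):
        # Consult the (small) directory first, fall back to the hash.
        if key in self.overrides:
            return self.overrides[key]
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % self.num_shards

router = HybridRouter(30, overrides={"channel:general": 99})
print(router.get_shard("channel:general"))       # 99, the dedicated shard
print(router.get_shard("channel:random") < 30)   # True, hashed as usual
```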

Rebalancing strategies

Hash with consistent hashing moves 1/N of keys per node add or remove. Rebalance traffic is spread across all existing nodes thanks to virtual nodes, so no single source saturates. Predictable and linear.

Range is localised: a split moves half of one range; a merge combines two adjacent ranges. Other shards untouched. Data movement per operation is bounded by the split-or-merge size, usually 512 MB to a few GB.

Directory moves exactly what you tell it to. A single key migration is bytes to kilobytes. A tenant migration is whatever that tenant's data totals. You pay for every byte you move and never move a byte you didn't ask to.

Production combinations

Pure single-strategy deployments are rarer than the textbooks suggest. The real architectures are mixtures.

MongoDB uses range sharding by default but supports hash sharding per collection. Zones layer directory semantics on top: tag a zone "ap-south-1" and constrain ranges to shards in that zone. Net effect: directory-over-range.

Cassandra uses hash sharding (consistent hashing, virtual nodes) with the partition-key / clustering-key split — hash routing at the partition level, range ordering within each partition. Net effect: range-within-hash. Point lookups on partitions are hash-fast; scans inside a partition are range-fast; cross-partition scans are expensive scatter-gathers.

Vitess uses directory sharding via VSchema at the top level — the router consults the topology for which MySQL shard owns a keyspace-id. Inside each shard, MySQL uses B-tree indexes. Net effect: range-within-directory.

SaaS multi-tenant systems often use directory for tenants and hash within each tenant for user-level data. You get tenant isolation (move a big tenant without touching small ones) and intra-tenant performance (point lookups on user rows are one hash away).

No production system gets away with exactly one strategy. The question is always "which combination serves the access patterns?" — the answer comes from measuring real workloads, not from theoretical preferences.

Flipkart's product catalog

You are architecting the product catalog for Flipkart. 500 million products, 30 shards, three access patterns: point lookup by product_id, list by category, and free-text search (handled elsewhere). Which sharding strategy fits?

Option 1 — hash by product_id. Uniform distribution across 30 shards. Point lookup: hash the ID, hit the shard, return. The failure: "list all mobile phones" requires scatter-gather across every shard, because mobile-phone product IDs scatter uniformly. For catalog browsing — most of the traffic — every request becomes a 30-way fan-out.

Option 2 — range by category. Electronics on shard 0, books on shard 1, and so on. "List mobile phones" is one shard's range scan. The failure: categories are wildly imbalanced. Electronics has 100M products; books 10M; specialty categories 50K. Shard 0 overflows while shard 29 is 0.5% full.

Option 3 — directory by category. One directory entry per category. Popular categories like Electronics get dedicated shards; tail categories share. Moving a category between shards is a directory update plus a background migration. "List mobile phones" still hits one shard.

What the team actually builds. Directory at the category layer. Hash on product_id within each shard for uniform storage-node distribution. Clustering by (category, subcategory, popularity_score) to make category listings a range scan. Three strategies stacked: directory → hash → range, each optimising for a different question the system has to answer.
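The layering can be sketched end to end. All names here are hypothetical; this illustrates the directory → hash → range stacking, not Flipkart's actual code:

```python
import hashlib

# Layer 1 — directory: each category maps to a group of shards.
CATEGORY_DIRECTORY = {
    "electronics": [0, 1, 2, 3],   # big category, dedicated shards
    "books": [4],
    "default": [5],                # long tail shares one group
}

def route_product(category, product_id):
    shards = CATEGORY_DIRECTORY.get(category, CATEGORY_DIRECTORY["default"])
    # Layer 2 — hash: spread products uniformly within the group.
    h = int(hashlib.md5(product_id.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

def clustering_key(category, subcategory, popularity, product_id):
    # Layer 3 — range: rows sort so a category listing is a sequential
    # scan in descending popularity.
    return (category, subcategory, -popularity, product_id)

print(route_product("electronics", "prod-123") in CATEGORY_DIRECTORY["electronics"])  # True
```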

Figure: three sharding strategies side by side. Six keys are assigned to three shards by sorted intervals (RANGE), by hash mod 3 (HASH), and by an explicit lookup table (DIRECTORY).
The three strategies place keys on shards by different mechanisms. RANGE uses sorted intervals (fast range scans, vulnerable to hot keys). HASH uses a hash-and-modulo (uniform distribution, bad range scans). DIRECTORY uses an explicit lookup table (flexible, needs a highly available registry). Real systems combine all three at different layers.

Going deeper

MongoDB zones

MongoDB shards use range chunks (64–128 MB each). Zones tag shards and constrain chunks to tagged shards — a zone tagged "region:ap-south-1" pulls chunks to Mumbai; "tier:archive" routes cold data to slower shards. Zones are a directory layer on top of ranges.

CockroachDB's range-split algorithm

CockroachDB auto-splits ranges past 512 MB, and load-based splitting splits smaller ranges that see disproportionate QPS. The database detects skew and splits hot ranges into pieces routed to different nodes — no operator intervention.

Vitess VSchema and VReplication

Vitess splits metadata into two layers. VSchema describes the logical scheme — which tables are sharded, which columns are shard keys, which vindex functions map shard keys to keyspace-ids. A vindex can be hash, lookup table, consistent hash, or custom. VReplication streams rows between shards while the system serves traffic, letting Vitess reshard live MySQL from 4 to 8 to 16 shards with no downtime.

Where this leads next

Picking a strategy is step one. Chapter 92 covers shard keys and hot partitions — how to pick the column you shard on, how to detect skew before it becomes an outage. Chapter 93 covers routing — how clients, proxies, and coordinators actually find shards, with a focus on caching and correctness under resharding. Chapter 94 turns to cross-shard queries: scatter-gather execution, and when to denormalise instead.

Together, Build 12 covers the full life cycle: picking a partitioning scheme, picking a shard key, routing requests, executing cross-shard operations, splitting and merging shards. Consistent hashing (ch.76) gave you the machinery; this build gives you the strategies.

References

  1. MongoDB Inc., Sharding Documentation — the canonical reference for MongoDB's sharding model, covering ranged and hashed sharding, chunk balancing, zones, and the config-server/mongos architecture. The section on zone sharding shows exactly how directory semantics layer on range chunks.
  2. Apache Cassandra Project, Data Partitioning and Consistency — describes Cassandra's hash-based partitioning with consistent hashing and virtual nodes, and the partition-key / clustering-key split that combines hash and range sharding at different levels.
  3. Cockroach Labs, Ranges and Replication in CockroachDB — CockroachDB's distribution layer uses 512 MB range units split automatically on size and load. The documentation explains both the mechanics of splitting and the heuristics that drive it.
  4. PlanetScale / CNCF Vitess, Vitess Architecture and VSchema — Vitess's VSchema is the canonical example of directory sharding in open source: a YAML-described mapping of tables to shards, consulted by vtgate routers, backed by a topology server. The VReplication docs show how migrations work under this model.
  5. Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the clearest pedagogical treatment of range, hash, and directory partitioning, with worked examples of hot partitions, secondary indexes, and rebalancing strategies. The chapter this article is most directly in conversation with.
  6. Microsoft / Citus Data, Citus Distributed Tables — Citus (the PostgreSQL extension) uses directory sharding via metadata tables on the coordinator node. The documentation covers distribution columns, co-location, and reference tables, and the shard-rebalancer details the online migration protocol.