In short

The shard key is the column — or combination of columns — whose value decides which physical shard owns a row. Picking it is the single highest-leverage schema decision in a sharded system, because after the first few terabytes you cannot change it without an online migration that costs weeks of engineering. The four properties of a good shard key are high cardinality (billions of distinct values), even distribution (no single value owns most of the traffic), query alignment (your real queries filter on the key you shard by), and append safety (new values do not all land at the top of the key range).

Bad shard keys are the opposite: monotonic timestamps make every new write land on "today's shard"; country codes route most rows to India and leave half the cluster idle; boolean flags split your cluster into two partitions. The symptom is a hot partition — one shard at 100 percent CPU while the others coast at 15, tail-latency spikes on the routes that touch that shard, and the write queue on the hot node backing up until the whole system stalls.

The fix catalogue is small. Composite keys pair the natural key with a bounding dimension ((user_id, day)) so no partition grows without limit. Salting prefixes a round-robin bucket id so one write target becomes a hundred. Directory sharding carves out heavyweight tenants. Read-side caching absorbs traffic for a genuinely hot key — a celebrity profile, a viral product page — that no key design can distribute.

Detection is per-node CPU and IOPS, per-shard throughput counters, and slow-query logs that always point at the same partition.

It is 3 AM and your pager shows one node at 100 percent CPU; the other 49 are at 15. The slow-query log: every entry on the same shard. Someone ran CREATE TABLE events (event_date DATE, ...) PARTITION BY RANGE (event_date) six months ago. Every insert lands on today's partition because the shard key is the calendar day.

You cannot fix this before the morning spike in four hours. You buy time — a read replica, a bigger instance — and plan the real fix: a new shard key, a month of migration work.

The previous chapter told you how rows get placed on shards. This chapter is about which column you feed into that placement, and the failure mode that eats distributed databases: the hot partition.

What makes a shard key "good"

Four properties, in order of how often they are ignored.

High cardinality. The shard key needs at least ten times as many distinct values as you have shards, and in practice you want thousands to millions of times more. Thirty shards and a shard key with a hundred values means each shard is assigned three or four value-groups; one unusually active group turns into a hot partition.

Why cardinality matters more than "uniqueness": the shard key governs where rows go, not which rows there are. A user_id with 100 million values shards well; a user_tier with five values ({free, basic, pro, enterprise, admin}) shards catastrophically even though every row has a tier.

Uniform distribution. Cardinality is necessary but not sufficient. A column can have a billion distinct values and still be lopsided. A parent_user_id on a messaging table where one celebrity user has 10 million messages and everyone else has 50 is high-cardinality and still hot. You want a distribution that looks roughly flat across the values that actually receive writes — no long tail, no single value that owns more than its fair share.

Query alignment. A shard key only helps you if your queries filter on it. If you shard by user_id but your hottest query is "all orders in the last hour", every query fans out to every shard; you paid for sharding and got scatter-gather. Cassandra's data model makes this explicit: you design the partition key from the queries backward, not from the entity forward.

Append safety. New values should not cluster at the extremes of the key's range. A timestamp is unsafe because every new row has now() for its value. A UUIDv4 is safe because each new UUID is random. A Snowflake-style id is unsafe because the high bits are time-ordered.
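Three of the four properties can be checked empirically before you commit to a key. A rough sketch, assuming you can pull a time-ordered sample of candidate key values from the existing table as a list (the function name and thresholds are illustrative, not a standard API); query alignment is the one property no script can measure, since it has to be judged against the actual query log:

```python
from collections import Counter

def assess_shard_key(values, num_shards):
    """Rough health check for a candidate shard key over a sample of values.

    `values` is a time-ordered list of observed key values. Thresholds
    here are illustrative rules of thumb, not hard limits.
    """
    counts = Counter(values)
    total = len(values)
    top_share = counts.most_common(1)[0][1] / total
    # Append-safety proxy (crude): a sample that arrives already sorted
    # suggests time-ordered keys (timestamps, auto-increment, Snowflake).
    is_monotonic = all(a <= b for a, b in zip(values, values[1:]))
    return {
        "cardinality_ok": len(counts) >= 10 * num_shards,
        "top_value_share": top_share,   # want this well below 1 / num_shards
        "append_unsafe": is_monotonic,
    }

# A user_id-like sample: many distinct, unordered values.
print(assess_shard_key([7, 3, 9, 1, 0, 8, 2, 5, 4, 6] * 50, num_shards=1))
```

Run it on a day's worth of sampled key values before the schema ships; it is far cheaper than the migration you avoid.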

What makes a shard key "bad"

The four anti-patterns mirror the four good properties.

Monotonically increasing keys. Timestamps, auto-increment ids, Snowflake ids — anything whose next value is predictable. Every write lands on whichever shard owns the top of the keyspace. On range-sharded systems this is catastrophic; on hash-sharded systems the hash destroys the order and the problem disappears for writes — but range scans by time become impossible.
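The contrast between the two placements fits in a few lines. A toy sketch with five shards, where `range_shard` and `hash_shard` are illustrative stand-ins for the database's real placement logic:

```python
import hashlib

def range_shard(key, boundaries):
    """Range sharding: each shard owns a half-open interval of the keyspace."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # the top shard owns everything above the last split

def hash_shard(key, num_shards):
    """Hash sharding: hashing destroys key order before placement."""
    digest = hashlib.sha256(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

# 1000 monotonically increasing ids, Snowflake-style.
new_ids = range(10_000, 11_000)
boundaries = [2_500, 5_000, 7_500, 10_000]  # 5 shards, split on older data

range_hits = {range_shard(i, boundaries) for i in new_ids}
hash_hits = {hash_shard(i, 5) for i in new_ids}
print(range_hits)  # every new write lands on the top shard
print(hash_hits)   # writes spread across all five shards
```

The same thousand ids hit one shard under range placement and all five under hash placement — which is exactly why the hash variant loses cheap time-range scans.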

Low cardinality. Booleans, status enums, country codes, feature flags. A shard key with k distinct values cannot spread work across more than k shards, no matter how cleverly you hash. A boolean gives you two partitions for a 50-node cluster. Composing several low-cardinality columns does not save you: two booleans and a status enum still yield only a handful of combinations.

Imbalanced distribution. Celebrity user ids. One tenant with 1000x the data of the median. One SKU (the new iPhone) that outsells the next thousand combined. The shard key has billions of values but the distribution is Pareto. Hash sharding spreads keys across shards; it does not spread requests to one key, because every request for that key must go to the shard that owns it.

Unaligned with access pattern. You shard by order_id, your application queries by customer_id. Every query is a fan-out. The most preventable of the four and also the most common, because schemas tend to outlive the query patterns they were designed for.

Detecting hot partitions

A hot partition announces itself in four places.

Per-node CPU, IOPS, memory. Graph the fleet. In a healthy cluster all nodes look identical. When one curve peels away and stays there, it is hardware or a hot partition; the slow-query log tells you which. Cassandra's nodetool tpstats shows pending tasks per node — a deeper queue means disproportionate load.

Per-shard throughput counters. DynamoDB CloudWatch exposes consumed capacity per partition; MongoDB's mongotop plus sh.status() shows hot chunks; Citus's citus_stat_statements breaks down by shard. The pattern is always: one shard an order of magnitude above the median.

Slow-query logs. Aggregate by shard id, sort descending, and the hot partition surfaces immediately. A shard that is both high-QPS and high-latency is saturating.

Client-side latency by shard key value. Wrap your client in a histogram that buckets by key and emits p99 per bucket. Keys whose p99 is 10x the median are hitting a hot partition. This catches per-key hotspots that shard-level metrics miss.

The earliest warning is the per-node CPU curve peeling off. By the time slow-query alerts fire, you have an incident.

Fix 1 — Composite shard keys

The most common fix for a skewed shard key is to pair it with a second column that bounds partition size.

Naive: a messaging app uses PK = (user_id). Each user's messages live on one partition. A heavy user with 10 million messages gives that shard a 10 GB partition; compactions slow down, range scans take seconds, the node thrashes.

Composite: PK = ((user_id, day), message_id). Each (user_id, day) pair is a separate partition. The size is bounded by how many messages one user sends in one day — even a heavy user sends maybe 100,000, a few MB. Compactions are predictable; scans are cheap.

CREATE TABLE messages (
  user_id     UUID,
  day         DATE,
  message_id  TIMEUUID,
  body        TEXT,
  PRIMARY KEY ((user_id, day), message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Why this works: the old partition key was (user_id), so the partition for one user grew unboundedly over time. The new partition key is (user_id, day), so each partition is bounded by one day of one user's activity — a natural cap that no user can exceed. You pay by giving up single-shot "all messages for this user" queries; those now have to enumerate days.

The trade-off is real. "The last 30 messages for user X" used to be a single partition scan; now it enumerates up to 30 partitions (one per day, walking backward until you have enough). You accept the cost because the alternative — unbounded partitions — is a ticking bomb.
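The day-walking read can be sketched as follows, assuming a hypothetical client helper `fetch_partition(user_id, day, n)` that returns up to n messages from one (user_id, day) partition, newest first:

```python
import datetime

def recent_messages(fetch_partition, user_id, limit=30, max_days=90):
    """Walk (user_id, day) partitions backward from today until `limit`
    messages are collected or `max_days` days have been scanned.

    `fetch_partition(user_id, day, n)` is an assumed client helper that
    returns up to n messages for that partition, newest first.
    """
    out = []
    day = datetime.date.today()
    for _ in range(max_days):
        out.extend(fetch_partition(user_id, day, limit - len(out)))
        if len(out) >= limit:
            break
        day -= datetime.timedelta(days=1)
    return out[:limit]
```

For an active user the loop terminates after one or two partitions; the `max_days` cap bounds the worst case for dormant users.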

The composite pattern applies anywhere your natural shard key can grow without limit: per-user timelines, per-device telemetry, per-tenant audit logs. Pair with a time bucket (day, hour, month) chosen to keep the expected partition under a few hundred MB.

Fix 2 — Key-prefix salting

When the shard key is inherently time-ordered — server logs, click streams, IoT telemetry — no pairing fixes it. Every new row has now() as its timestamp.

The fix is to salt the key with a random or round-robin bucket prefix. Instead of PK = (timestamp), write PK = ((bucket_id, timestamp)), where bucket_id cycles 0..99. A hundred parallel write streams instead of one.

import time

NUM_BUCKETS = 100

def write_event(db, event):
    # Pseudo round-robin: the microsecond clock cycles through buckets,
    # so consecutive writes land in different buckets (and shards).
    bucket = int(time.time() * 1_000_000) % NUM_BUCKETS
    db.insert(
        table="events",
        partition_key=(bucket, event.timestamp),  # salted composite key
        clustering_key=event.event_id,
        payload=event.body,
    )

Each insert picks a fresh bucket; consecutive writes land on different shards. Write throughput is bounded by 100 shards' capacity, not one.

The cost is at read time. "All events in the last hour" used to be a single range scan; now it queries 100 buckets in parallel and merges. Fine for analytics; it hurts latency-sensitive reads.
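That fan-out-and-merge read can be sketched directly, assuming a hypothetical `scan_bucket(bucket, start_ts, end_ts)` helper that returns one bucket's events in the window, already sorted by timestamp:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 100

def events_in_window(scan_bucket, start_ts, end_ts):
    """Fan out one range scan per bucket and k-way merge by timestamp.

    `scan_bucket(bucket, start_ts, end_ts)` is an assumed client helper
    returning that bucket's events in the window, sorted by timestamp.
    """
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        per_bucket = pool.map(
            lambda b: scan_bucket(b, start_ts, end_ts), range(NUM_BUCKETS)
        )
        # heapq.merge does a streaming k-way merge of the sorted streams.
        return list(heapq.merge(*per_bucket))
```

The merge is cheap; the cost is the hundred parallel scans, which is why this pattern suits analytics queries better than latency-sensitive point reads.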

Why 100 and not 10 or 1000: bucket count is a direct trade between write spread and read fan-out. Rule of thumb: at least as many buckets as you have write-capable nodes, rounded up to a round number with headroom for growth.

A variant uses a deterministic bucket from another column (user_id mod 100) so reads filtering on that column target one bucket. Pick the variant that matches your access pattern.

Fix 3 — Directory sharding for imbalanced tenants

If the problem is one hot tenant — a SaaS customer with 1000x the data of the median — no symmetric scheme fixes it. Hash sharding routes that tenant's data to one shard, which now runs 1000x the load. Composite keys bound partition size but not tenant size.

The answer is directory sharding (chapter 91) at the tenant level. Instead of hashing tenant_id, you maintain an explicit tenant_id → shard_id map. Small tenants share shards; heavy tenants get dedicated shards, sometimes several. A shard-assignment service allocates them.

The trade-off is operational: a highly available directory (etcd, Zookeeper, or replicated Postgres), a rebalancer that moves tenants as they grow, and monitoring to promote tenants from shared to dedicated before they become hot. Slack, Shopify, and Uber's Schemaless all run a directory layer because their tenant sizes vary by five orders of magnitude. Rule of thumb: when your top-percentile tenant is more than 100x your median, directory sharding pays for itself.
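The lookup logic itself is small; the operational weight is in keeping the map available and moving data when it changes. A minimal sketch, where the pinned map is an in-memory dict standing in for the real etcd/ZooKeeper/Postgres-backed directory:

```python
import hashlib

class TenantDirectory:
    """Directory sharding sketch: heavy tenants get pinned shards;
    everyone else falls back to hash placement on the shared pool."""

    def __init__(self, shared_shards, pinned=None):
        self.shared = shared_shards       # shard ids in the shared pool
        self.pinned = dict(pinned or {})  # tenant_id -> dedicated shard id

    def shard_for(self, tenant_id):
        if tenant_id in self.pinned:
            return self.pinned[tenant_id]
        h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
        return self.shared[h % len(self.shared)]

    def promote(self, tenant_id, dedicated_shard):
        """Rebalancer hook: pin a tenant that has outgrown the shared pool.
        (In production this also triggers the data move.)"""
        self.pinned[tenant_id] = dedicated_shard
```

The hard part this sketch elides is the migration `promote` implies: double-writing the tenant's rows to the new shard, backfilling, and cutting reads over — chapter 94's territory.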

Fix 4 — Hot-key handling via read-side caching

Sometimes the shard key is fine and the distribution is fine and you still have a hot partition. What's hot is one particular key: the celebrity profile, the trending product page, the Diwali sale landing route. One row getting a million requests per second.

Sharding cannot help here. Every request for that key has to go to the shard that owns it. The fix is caching — absorb most of the traffic before it reaches the database.

A read-through cache (Redis or Memcached) sits in front of the hot key. The first reader fetches from the database and populates the cache; the next 100,000 readers hit the cache. Writes go through and invalidate. With a 95 percent hit rate you reduce database load on that key by 20x without touching the shard key.

def get_user(user_id, cache, db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached             # cache hit: the database never sees this read
    row = db.fetch("SELECT * FROM users WHERE id = %s", user_id)
    cache.set(key, row, ttl=60)   # short TTL bounds staleness after writes
    return row

Why caching is sometimes the only answer: hash sharding distributes different keys across shards, but every request for the same key must reach the same shard. A single key can draw as much traffic as you want, and no resharding fixes that. The only lever is to serve that key from somewhere other than the database.

For write-heavy hot keys — an inventory counter, a viral post's like count — caching alone isn't enough; you rethink the data model. Split the counter across N shards (like_count_0..like_count_99) and sum on read. Or use a CRDT. Or accept eventual consistency and batch.
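The sharded-counter idea fits in a few lines. In this sketch the N slices are modelled as a list standing in for the like_count_0..like_count_99 rows; in a real system each slice would live on a different shard:

```python
import random

class ShardedCounter:
    """Split one hot counter into N slices: each increment updates a
    random slice, and reads sum all slices. Write load spreads across
    N rows (and shards) at the cost of an N-row read."""

    def __init__(self, n=100):
        self.slices = [0] * n  # stand-ins for like_count_0..like_count_{n-1}

    def incr(self, amount=1):
        # Random slice choice spreads concurrent increments across rows.
        self.slices[random.randrange(len(self.slices))] += amount

    def value(self):
        return sum(self.slices)
```

Reads become N fetches plus a sum, so in practice the summed value is itself cached with a short TTL — exactness and freshness trade against write throughput.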

Python demonstration of detection

A small wrapper around a database client that tracks per-key latency and flags the outliers. This is the kind of thing you instrument in production clients before you have a hot partition, not after.

import time
from collections import defaultdict

class HotKeyDetector:
    """Flags shard-key values whose average latency is far above the median."""

    def __init__(self, threshold_multiplier=10.0, min_samples=20):
        self.times = defaultdict(list)   # key -> recorded latencies in ms
        self.threshold = threshold_multiplier
        self.min_samples = min_samples

    def record(self, key, latency_ms):
        self.times[key].append(latency_ms)

    def detect(self):
        # Only keys with enough samples get a meaningful average.
        eligible = {
            k: sum(t) / len(t)
            for k, t in self.times.items()
            if len(t) >= self.min_samples
        }
        if len(eligible) < 3:
            return []   # too few keys to compute a useful median
        sorted_avgs = sorted(eligible.values())
        median = sorted_avgs[len(sorted_avgs) // 2]
        return [
            (k, avg) for k, avg in eligible.items()
            if avg > self.threshold * median
        ]

def timed_get(detector, db, key):
    start = time.monotonic()
    try:
        return db.get(key)
    finally:
        detector.record(key, (time.monotonic() - start) * 1000)

Run this in a production sampler — one in every 1000 requests — and emit detect() to your monitoring system every minute. Combine with shard-level metrics: the detector tells you which key is hot, the shard metrics tell you which shard. Together they distinguish "one hot key on a healthy shard" (add a cache) from "one hot shard with many warm keys" (reshard).

The time-series hot-partition problem

The most common cause of hot partitions in production is a time-ordered shard key. Knowing the pattern by name saves months of debugging.

Symptom: inserts all land on one shard. Older shards sit idle holding data nobody reads. When the recent shard fills it splits, and the new shard inherits the heat. You never get ahead because the keys are monotonic by definition.

Cause: a shard key whose high bits contain now(). Raw timestamps. Snowflake ids. ULID. Sorted UUIDs (UUIDv7).

The fix pattern is always two ingredients: a high-cardinality entity column and a time bucket. Compose into a partition key: PK = ((device_id, day), timestamp). Each (device, day) is bounded by one day of one device's telemetry. Across millions of devices, writes spread across millions of partitions; the hot set is millions × 1 day, not one × current-minute.

Size the time bucket so one bucket of one entity is in the sweet spot — large enough that read enumeration is cheap, small enough that no partition overflows. For 1 Hz IoT telemetry, a day (~10 MB). For financial tick data, an hour. For audit logs, a month. The time dimension stops partitions growing forever; the entity dimension stops writes concentrating at any moment. Both properties matter.
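The sizing arithmetic above is simple enough to keep as a one-liner; the 100-byte row used here is an assumption you should replace with your own measured row size:

```python
def partition_mb(rows_per_second, row_bytes, bucket_seconds):
    """Expected size of one (entity, time-bucket) partition in MB."""
    return rows_per_second * row_bytes * bucket_seconds / 1e6

# 1 Hz telemetry, ~100-byte rows, daily bucket: ~8.6 MB -- comfortable.
print(partition_mb(1, 100, 86_400))
# 1000 ticks/s financial data, hourly bucket: ~360 MB -- near the limit,
# which is why tick data gets the smaller bucket.
print(partition_mb(1000, 100, 3_600))
```

Run the numbers for your own write rate before picking the bucket; the answer is rarely the same across tables.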

The Black Friday outage

You are on-call for an e-commerce platform on Black Friday, Indian edition. The orders table normally handles 1000 writes per second; by 8 AM it is at 10,000 — within plan. At 8:15 the alert fires: one shard at 100 percent CPU, p99 order creation climbing through 200 ms, 500 ms, 1.5 seconds. Payment-callback timeouts begin, meaning duplicate charge attempts inside the hour.

The orders table has PRIMARY KEY (order_id). The team uses Snowflake-style ids — 41 bits of millisecond timestamp in the high bits, time-ordered. The cluster uses range sharding because the dashboard query is "all orders in the last 24 hours". Every new order's id sits at the top of the keyspace, owned by the most recent shard. That shard saturates.

You cannot change the schema at 8:15 AM. You cap the write rate, route overflow to a queue, and scale up the hot node. It buys four hours.

The real fix ships next quarter. New key: ((region_id, hour), order_id). region_id spreads writes across India's five delivery regions; hour rotates the hot partition every 60 minutes; order_id orders within. Hundreds of partitions carry the load. The dashboard query fans out to 5 × 24 = 120 partitions in parallel — within 2x the old wall-clock latency. Tail latency on order creation drops from 1.5 seconds back under 50 ms.

[Figure: partition heat distribution, before and after. Two bar charts stacked. Top, BEFORE with PK = (order_id) and Snowflake ids: one tall red bar (shard 0 at 100 percent CPU) and eleven short grey bars. Bottom, AFTER with PK = ((region_id, hour), order_id): twelve evenly sized green bars.]
Before: Snowflake-style order ids are time-ordered, so every new order lands on whichever shard owns the current time range. One shard at 100 percent CPU, eleven idle. After: compositing region and hour into the partition key spreads new orders across all shards simultaneously. Write load is balanced; the dashboard query becomes a bounded 120-way fan-out whose shards answer in parallel.

When to accept some hot-partition risk

Not every hot-partition risk needs fixing. Changing a shard key on a live system is weeks of engineering; sometimes the arithmetic says "buy a bigger node".

Brief and predictable peaks. If the hot window is one hour per day and your biggest instance handles it with headroom, vertical scaling plus queue-based smoothing is cheaper than schema change.

Systems nearing end of life. A legacy service two quarters from replacement does not get a new shard-key migration. Survive, document, do the work in the replacement.

Workloads where resharding exceeds the incident cost. Resharding 50 TB takes weeks of double-writes. If the hot partition merely doubles p99 during peaks and nobody complains, write a postmortem that explicitly accepts the risk.

Never for OLTP critical paths. Payment authorisation, order creation, login. Hot partitions there are bugs, not features.

Going deeper

DynamoDB's adaptive capacity handles hot partitions automatically. When a partition exceeds its provisioned throughput, DynamoDB borrows capacity from cooler partitions on the same table; if the hotspot persists, it physically splits the partition. From the application's perspective nothing changes — the partition map has evolved underneath. Hot partitions become invisible auto-scaling events, at the cost of a closed-source scheduler you cannot reason about the way you can with Cassandra's explicit model.

Vitess's VSchema vindexes give you a toolbox of shard-key mapping functions. A hash vindex does consistent hashing. A lookup vindex does directory sharding — itself a MySQL table mapping logical key to keyspace-id. You can mix vindexes per table — one sharded by user_id via hash, another by tenant_id via lookup — all behind the same vtgate proxy.

MongoDB's zones are directory sharding layered on ranged or hashed chunks. Tag shards "region:ap-south-1" or "tier:premium", then declare zone ranges. The balancer enforces the mapping. Used for regulatory locality, tenant isolation, and tiered storage.

Consistent hashing with virtual nodes (chapter 76) is a partial answer. Virtual nodes divide the keyspace into many small arcs so no arc contains a disproportionate share of keys. They cannot spread repeated requests for the same key — that still needs caching or key splitting on top.

Where this leads next

Chapter 93 covers routing — dispatching the query from a shard key. Client-aware, proxy, and coordinator routing have different latency and failure-mode trade-offs.

Chapter 94 covers resharding without downtime — changing your shard count or shard key on a live system. Double-writes, backfills, cutover, rollback. The work after you realise the key was wrong.

Chapter 95 covers global secondary indexes — querying a sharded table by a non-shard-key column, with its own sharded storage and consistency trade-offs.

Chapters 91 through 95 are the sharded-database life cycle. The shard key decision is the one that, if you get it right, makes the rest of the work easy.

References

  1. Amazon Web Services, DynamoDB Adaptive Capacity and Partition Management — official description of how DynamoDB detects hot partitions, borrows capacity, and physically splits them. Essential for understanding managed hot-partition handling.
  2. MongoDB Inc., Zones (Tag-Aware Sharding) — directory-style tenant carve-outs on top of range-sharded chunks. Covers zone declaration, balancer behaviour, and regional-compliance patterns.
  3. PlanetScale / CNCF Vitess, VSchema and Vindexes Reference — canonical docs on Vitess's vindex system, the most flexible open-source implementation of composite shard-key mapping.
  4. Apache Cassandra Project, nodetool tpstats and Per-Node Metrics — per-node thread-pool statistics, the first place to look for a hot partition in a Cassandra cluster.
  5. Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the clearest textbook treatment of partition skew, hotspots, and mitigation. The salting pattern in particular is canonical here.
  6. Twitter Engineering, Announcing Snowflake — the original Snowflake id design. Worth reading for the discussion of why time-ordered ids were chosen and the knock-on shard-key consequences this chapter analyses.