Partitions and parallelism

Last chapter you wrote a single-partition log. It works, it durably stores millions of records, and on one NVMe disk it tops out somewhere around 50 MB/s of sustained writes. Razorpay's payments.tx_events topic does roughly 1 GB/s at peak. One partition cannot hold that traffic — not because the format is wrong, but because one disk is the wall. Partitions are how you punch through the wall while keeping the only ordering guarantee anyone actually needs.

A topic is N independent partitions. The partition for a record is hash(key) mod N — same key always lands in the same partition, so per-key order is preserved; cross-key order is given up. N partitions can be written by N producers, served by N consumers in a group, and stored on N disks across brokers, so throughput and storage scale linearly. The trade-off you cannot avoid: increasing N later silently changes the partition function for every key, which is why partition count is the most consequential one-line config in any streaming system.

Why one partition is not enough

A single partition is a single append-only file on a single disk, and three independent ceilings apply to it. The disk: a good NVMe sustains ~500 MB/s of sequential writes, EBS gp3 ~125 MB/s, and that ceiling is shared with every other write on the same volume. The consumer: per-partition order means exactly one consumer in a group reads it; you cannot parallelise reads of a single partition without losing order. The recovery: if that one broker dies, the partition is offline until a replica is promoted, which is seconds you don't have at 100k events/sec.

The fix is the only fix that has ever worked for distributed log systems: partition the topic. A topic with 64 partitions is 64 independent logs, each with its own segment files, its own offset sequence, its own consumer assignment. Producers send each record to exactly one partition; consumers in a group share the partitions among themselves. The total throughput of the topic is roughly N times the per-partition throughput, capped only by network and broker count.

The cost of this fix is one specific giveaway: global order across the whole topic is gone. If key=A lands on partition 3 and key=B on partition 7, a consumer reading both partitions sees them in some interleaving, not in the order they were originally produced. What survives is per-partition order, which — by the partition function — gives per-key order. Almost every streaming application turns out to need only the second guarantee. A fraud rule for merchant_42 cares that that merchant's transactions are in order; it does not care whether merchant_42's transaction at 11:02:03 is processed before or after merchant_99's transaction at 11:02:03. This single trade-off is what makes the Kafka model work.

[Figure: a topic is many partitions across brokers. payments.tx_events with 6 partitions across 3 brokers. Producers P1 and P2 run hash(key) mod 6 to choose a partition. Broker A holds partitions 0 and 3 (leaders) on disk 1 (~500 MB/s); broker B holds partitions 1 and 4 on disk 2; broker C holds partitions 2 and 5 on disk 3. Consumer C1 in group fraud-rules-v2 is assigned partitions 0, 2, 4; consumer C2 is assigned 1, 3, 5. Each partition is one append-only log; the topic is the union of all six.]
Two producers fan out to six partitions across three brokers; the partition is chosen by hashing the record key. Two consumers in one group split the six partitions evenly, three each. Total topic throughput is the sum of per-partition throughput; per-key order is preserved because the same key always hashes to the same partition.

The whole architecture is in that figure. The producer side has a partition selector. The broker side has N independent logs (each one is the storage from chapter 48, replicated to F+1 machines — that part is chapter 51). The consumer side has a group coordinator that hands each consumer a slice of partitions. Lose any one of those three pieces and the system stops scaling.

The partition function

The partition function takes a record's key and returns an integer in [0, N). The contract is non-negotiable: same key, same partition, every time. Without that, per-key order is gone — Razorpay's merchant=42 records would land on partition 3 sometimes and partition 7 sometimes, and a downstream consumer reconstructing each merchant's transaction stream would see records out of order.

The function Kafka uses by default is murmur2(key) mod num_partitions. Murmur2 is a 32-bit non-cryptographic hash chosen for speed and good distribution; the mod makes it fit inside the partition count. Why not key.hashCode() mod N (Java's default)? Because hashCode is uneven for short keys — many merchant IDs that differ only in the last digit collide on the same partition, creating hot partitions. Murmur2 spreads them.

If the key is null, Kafka used to round-robin (one record per partition in turn). Modern Kafka uses sticky partitioning: the producer picks one partition, batches records into it for linger.ms worth of time, then picks another. This keeps batches large, which keeps throughput high, at the cost of slight imbalance over short windows.
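
The idea can be sketched in a few lines — a toy model of sticky partitioning, not Kafka's actual producer internals (the class and method names here are illustrative):

```python
import random

class StickyPartitioner:
    """Toy sticky partitioner for null-key records: pin one partition,
    batch into it, re-pick only when the batch is flushed."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.current = random.randrange(num_partitions)

    def partition(self) -> int:
        # Every null-key record in the current batch goes to one partition,
        # so the batch stays large and compresses well.
        return self.current

    def on_batch_flush(self) -> None:
        # Called when linger.ms expires or the batch fills: pick a new home.
        self.current = random.randrange(self.num_partitions)

p = StickyPartitioner(6)
batch_a = [p.partition() for _ in range(100)]   # 100 records, one batch
p.on_batch_flush()
batch_b = [p.partition() for _ in range(100)]   # next batch, possibly elsewhere
```

Every record in `batch_a` lands on a single partition; balance across partitions only emerges over many flushes, which is exactly the short-window imbalance the text mentions.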

# partitioner.py — the contract is "same key → same partition"
import struct

def murmur2(data: bytes, seed: int = 0x9747b28c) -> int:
    """Kafka's default partitioner uses 32-bit Murmur2."""
    length = len(data)
    m = 0x5bd1e995
    r = 24
    h = seed ^ length
    i = 0
    while length >= 4:
        k = struct.unpack_from("<I", data, i)[0]   # Kafka reads words little-endian
        k = (k * m) & 0xFFFFFFFF
        k ^= (k >> r)
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
        length -= 4
    # Tail: the remaining 1–3 bytes, mirroring the Java switch fallthrough
    if length >= 3:
        h ^= data[i + 2] << 16
    if length >= 2:
        h ^= data[i + 1] << 8
    if length >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    h ^= (h >> 13)
    h = (h * m) & 0xFFFFFFFF
    h ^= (h >> 15)
    return h & 0x7FFFFFFF                    # positive int, as Kafka's toPositive does

def partition_for(key: bytes, num_partitions: int) -> int:
    """Same key → same partition, every time, on every producer."""
    return murmur2(key) % num_partitions

# --- Demo: 200,000 merchant IDs onto 64 partitions ---
if __name__ == "__main__":
    counts = [0] * 64
    for i in range(200_000):
        merchant_id = f"merchant_{i:06d}".encode()
        counts[partition_for(merchant_id, 64)] += 1
    avg = 200_000 // 64
    drift = max(counts) - min(counts)
    print(f"avg per partition: {avg}")
    print(f"max: {max(counts)}, min: {min(counts)}, drift: {drift} ({100*drift/avg:.1f}%)")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 64)}")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 64)}  (again)")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 32)}  (N=32, different)")
# Sample run — exact partition numbers depend on the hash output;
# the invariants are what matter:
avg per partition: 3125
max and min within a few percent of avg: no hot partition from hashing alone
merchant_000042 → partition p            (some fixed p in [0, 64))
merchant_000042 → partition p   (again)
merchant_000042 → partition q   (N=32, different)

The lines that matter are the last three prints: the same key returns the same partition on every call, and the moment num_partitions changes from 64 to 32, the answer changes with it — which is the subject of the rest of this chapter.

Hot partitions and key skew

The partition function distributes keys uniformly. It does not distribute traffic uniformly, because traffic is uneven. On Razorpay, the top 100 merchants out of 50 million handle around 40% of payment volume. On Zerodha, Reliance and HDFC trades dwarf the long tail. On Swiggy during dinner, three Bengaluru pin codes generate more order events than the next two hundred combined. Hash-by-merchant-id sends every record for one big merchant to one partition, and that partition becomes the bottleneck for the whole topic.

You see it in metrics: 63 partitions sit at 20% CPU and one sits at 95%, with consumer lag building only on that one. Three honest fixes exist; you pick based on whether you can give up per-key order.

Fix 1 — composite keys. Hash on (merchant_id, hour) instead of merchant_id. The same merchant's records spread across 24 partitions per day instead of 1. Per-merchant order is lost; per-merchant-per-hour order is preserved. Acceptable for fraud rules that batch hourly. Not acceptable for a real-time fraud detector that needs the full per-merchant sequence.
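
A composite key is a one-line change at the producer. The sketch below uses md5 as a stand-in for Murmur2 (the widening of the key is the point, not the hash) and a hypothetical composite_partition helper:

```python
import hashlib
from datetime import datetime, timezone

def _hash_partition(key: bytes, num_partitions: int) -> int:
    # Stand-in for the producer's hash; the real one is Murmur2.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % num_partitions

def composite_partition(merchant_id: str, ts: datetime, num_partitions: int) -> int:
    # Widen the key with an hourly bucket: one hot merchant now spreads
    # across up to 24 partitions per day instead of sitting on one.
    bucket = ts.strftime("%Y-%m-%dT%H")
    return _hash_partition(f"{merchant_id}:{bucket}".encode(), num_partitions)

# Within one hour the partition is fixed, so per-merchant-per-hour order holds.
t1 = datetime(2026, 1, 15, 11, 2, tzinfo=timezone.utc)
t2 = datetime(2026, 1, 15, 11, 58, tzinfo=timezone.utc)
assert composite_partition("merchant_42", t1, 64) == composite_partition("merchant_42", t2, 64)
```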

Fix 2 — explicit partition for hot keys. Write a custom partitioner that recognises the top-K merchant IDs and round-robins them across a dedicated subset of partitions, while everything else uses Murmur2. Per-key order is lost for the hot keys, so the consumer must tolerate out-of-order records for those — usually fine if the downstream is a ledger that's order-independent (idempotent upserts, chapter 27).
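
A minimal sketch of such a partitioner, assuming a hand-maintained hot set and an 8-partition slice reserved at the top of the range (both are illustrative choices, and md5 stands in for Murmur2):

```python
import hashlib
from itertools import count

HOT_KEYS = {b"merchant_000001", b"merchant_000002"}   # the known top-K
HOT_SLICE = list(range(56, 64))                       # partitions reserved for them

class HotKeyPartitioner:
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._rr = count()                            # round-robin counter

    def partition(self, key: bytes) -> int:
        if key in HOT_KEYS:
            # Hot keys: spread across the dedicated slice.
            # Per-key order is deliberately given up here.
            return HOT_SLICE[next(self._rr) % len(HOT_SLICE)]
        # Everyone else: normal hashing into the remaining partitions,
        # per-key order preserved (md5 is a stand-in for Murmur2).
        return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % 56
```

Keeping the hot slice disjoint from the hashed range means a VIP burst cannot collide with a long-tail merchant's partition — but every consumer of the slice must now tolerate out-of-order records for the hot keys.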

Fix 3 — split the topic. Move the top-K merchants onto a separate topic payments.tx_events.vip with its own partition count. Per-key order is preserved everywhere; the consumer reads two topics. This is what Razorpay does in practice — VIP merchants get isolated infrastructure that can be scaled independently of the long tail. The operational cost is one extra topic per tier; the win is that a noisy VIP merchant cannot starve the long tail's consumers, and the long tail's consumer crash cannot hold up VIP processing.

Whichever fix you pick, instrument the answer. Kafka exposes per-partition write rate and consumer lag through JMX; a three-line PromQL alert (max(kafka_partition_bytes_in_per_second) / avg(kafka_partition_bytes_in_per_second) > 5) catches a hot partition before consumers start lagging. You only fix what you can see, and per-partition metrics are the entire visibility surface for skew problems.

The thing you should never do: bump partition count without thinking. It does not fix hot partitions — the hot key still hashes to one partition, just a different one. It also breaks ordering for every other key, as the next section explains.

What changing the partition count costs

hash(key) mod N is the partition function. Change N from 32 to 64 and every key gets remapped: merchant_42 was on partition hash mod 32 = 9, becomes partition hash mod 64 = 41. For a duration after the change, records for the same key are spread across the old partition (where historical data lives) and the new partition (where new writes go). A consumer reading per-key state — say, "current balance for merchant_42" — sees half the history on partition 9 and the other half on partition 41 and computes wrong totals.
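
The scale of the remap is easy to measure. When N doubles, hash mod 64 is either hash mod 32 or hash mod 32 + 32, so about half of all keys move; for an unrelated new N, most do. A quick check with a stand-in hash (md5 here; the producer's real hash is Murmur2, but the modular arithmetic is identical):

```python
import hashlib

def part(key: bytes, n: int) -> int:
    # Stand-in partition function: 32-bit integer from md5, mod n.
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % n

keys = [f"merchant_{i:06d}".encode() for i in range(100_000)]
moved = sum(1 for k in keys if part(k, 32) != part(k, 64))
print(f"32 -> 64: {moved / len(keys):.0%} of keys changed partition")  # roughly half
```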

There is no clean way to repartition a Kafka topic in place. The options:

  1. Forward-only: add partitions, accept that ordering for new records may differ from historical, and that consumers reading historical data must drain old partitions before relying on the new layout. Works only if the consumer is stateless or the change is announced and clients are restarted at a known offset.
  2. Mirror through a new topic: create payments.tx_events.v2 with the new partition count, dual-write from producers, drain consumers from v1, switch reads to v2, retire v1. Multi-week migration; Razorpay has run several of these and each one is a project, not a config change.
  3. Dedicated tools — Confluent's Cluster Linking, MirrorMaker 2, or Kafka Streams' repartition() can do this with care. None of them is "free".

This is why partition count is the single most consequential one-line config in a Kafka deployment. Pick it for peak expected throughput in the next 2–3 years, not for current load. The Confluent rule of thumb: estimate peak MB/s, divide by 50 MB/s/partition (a conservative per-partition write ceiling), and round up to the next power of two for clean math when you do eventually have to repartition. A topic doing 1 GB/s at peak: 1000 / 50 = 20, round up to 32 partitions, then double once more to 64 to leave headroom. Resetting the count after 18 months of production data is far more painful than overprovisioning by 2× on day one.

[Figure: changing N remaps every key. The same five keys under N = 4 and N = 8:

  merchant_42    murmur2 mod 4 = 1  →  mod 8 = 5   (moved)
  merchant_99    murmur2 mod 4 = 3  →  mod 8 = 7   (moved)
  merchant_007   murmur2 mod 4 = 0  →  mod 8 = 0   (same)
  merchant_201   murmur2 mod 4 = 2  →  mod 8 = 6   (moved)
  merchant_555   murmur2 mod 4 = 1  →  mod 8 = 1   (same, lucky)

When N doubles, hash mod 8 is either hash mod 4 or hash mod 4 + 4, so about half of all keys relocate; change N to a value unrelated to the old one and nearly every key moves. History stays on the old partition; new writes land on the new one.]
The keys that look the same to the producer land on different partitions after the count changes. A consumer relying on per-key history sees a split brain until the historical data is drained.

Consumer groups split partitions among consumers

The producer side decides where records go; the consumer side decides who reads what. A consumer group is a name shared by N consumer processes (group_id="fraud-rules-v2" in the previous chapter). Kafka's group coordinator assigns each partition to exactly one consumer in the group. If you have 6 partitions and 2 consumers, each consumer reads 3 partitions. Add a third consumer and rebalance — now each reads 2. Add a tenth consumer and 4 of them sit idle, because 6 partitions cannot be split 10 ways without breaking per-partition order.

This is the parallelism ceiling. A topic with N partitions can be processed by at most N consumers in a group. If you need 100 consumers in parallel, the topic needs at least 100 partitions — chosen on day one, because growing it later breaks ordering. This is why "how many partitions?" is asked at design time and why over-provisioning is the safer mistake. Why one consumer per partition and not many: per-partition order is the contract, and order requires a single reader; two readers on the same partition would race on offset commits and either skip records or process them twice. The only way to parallelise reads of one partition is to give up order — at which point you might as well use a queue (SQS, RabbitMQ), not a log.

Two common rebalance strategies. Range assignment sorts consumers and partitions and gives each consumer a contiguous range — simple, but unbalanced when partition counts are uneven across topics. Sticky assignment tries to keep each consumer's previous partitions during a rebalance, only moving the minimum needed. Sticky is the modern default; cooperative-sticky goes further by keeping consumers reading their unchanged partitions during the rebalance, eliminating the seconds-long "stop-the-world" pause that older Kafka clients had. For a Swiggy fraud detector that has 1.2 GB of in-memory state per consumer, the difference between a 3-second and a 30-second rebalance is the difference between "no one notices" and "an alert page goes off".
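
The difference between the two strategies is visible in a toy model — a sketch of the idea, not the coordinator's real algorithms (function names and the tie-breaking rules here are simplifications):

```python
def range_assign(partitions: list, consumers: list) -> dict:
    """Range: sort consumers, hand each a contiguous chunk."""
    out = {}
    per, extra = divmod(len(partitions), len(consumers))
    i = 0
    for n, c in enumerate(sorted(consumers)):
        take = per + (1 if n < extra else 0)
        out[c] = partitions[i:i + take]
        i += take
    return out

def sticky_assign(partitions: list, consumers: list, previous: dict) -> dict:
    """Sticky: keep prior owners where possible, move only the minimum."""
    target = -(-len(partitions) // len(consumers))          # ceil division
    out = {c: [p for p in previous.get(c, []) if p in partitions][:target]
           for c in consumers}
    owned = {p for ps in out.values() for p in ps}
    for p in (p for p in partitions if p not in owned):     # place orphans
        least = min(out, key=lambda c: len(out[c]))
        out[least].append(p)
    return out

prev = {"c1": [0, 1, 2], "c2": [3, 4, 5]}
after = sticky_assign(list(range(6)), ["c1", "c2", "c3"], prev)
# Only partitions 2 and 5 change owner; c1 and c2 keep reading the rest.
```

With range assignment, adding c3 would move three partitions; sticky moves two, and cooperative-sticky additionally lets c1 and c2 keep processing their unchanged partitions while those two are in flight.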

A practical sanity check before you ship: pick three keys you care about (your busiest merchant, your noisiest pin code, a random one) and run partition_for(key, N) to confirm they don't all collapse onto the same partition. If two of three land together, your hash function is unlucky and you should check whether the SDK is using Murmur2 or hashCode.

Going deeper

What sticky cooperative rebalance actually does

Pre-2.4 Kafka had "stop-the-world" rebalance: every consumer in the group revokes all its partitions, the coordinator computes a new assignment, every consumer picks up its new partitions. During the rebalance — often 5–30 seconds, depending on state-restoration cost — the topic is unprocessed. A Dream11 fraud detector during an India–Pakistan match cannot afford a 30-second pause; the queue would back up and miss the window.

Cooperative sticky (KIP-429, Kafka 2.4) splits this into two phases. Phase 1: the coordinator computes the new assignment but only revokes partitions that changed owner. Consumers whose partition list is unchanged keep processing through the rebalance. Phase 2: revoked partitions get reassigned to their new owners; only those few consumers pause briefly. For a 1.2 GB-state-per-consumer fraud detector, a single-partition reassignment restores in ~2 seconds; full-group reassignment took 60. The Confluent client and kafka-python ≥ 2.0.2 both support cooperative-sticky; switching to it is a one-line change with a one-restart cost.
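
The one-line change, shown as a confluent-kafka (librdkafka) consumer config dict — the broker address and group id are placeholders:

```python
consumer_config = {
    "bootstrap.servers": "broker-a:9092",          # placeholder
    "group.id": "fraud-rules-v2",
    # librdkafka's assignment-strategy knob; default is range,roundrobin
    "partition.assignment.strategy": "cooperative-sticky",
}
```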

Murmur3, FNV, xxHash — when does the hash matter

Kafka's protocol hardcodes Murmur2 for the default partitioner. If you use a non-default partitioner (custom Java class, or kafka-python's DefaultPartitioner), the hash function might differ — kafka-python ≤ 2.0.1 used Java's hashCode(), which produces a different partition assignment than the JVM Kafka client for the same key. This bites teams running mixed-language producers: the Python service writes to partition 9, the Java service writes the same key to partition 17, and per-key order is silently broken. The fix is to pin both clients to Murmur2 explicitly. Always check the partitioner your client uses; never assume hash compatibility across SDKs.

Co-partitioning: when two topics must share a partitioner

Stream-stream joins (chapter 67) require co-partitioning: if you want to join payments.tx_events against fraud.scores on merchant_id, both topics must use the same partition count and the same hash function on the same key. Otherwise merchant_42 is on partition 9 in one topic and partition 41 in the other, and a consumer reading both cannot see them on the same machine without a re-shuffle. Kafka Streams' repartition() and Flink's keyBy() both perform this re-shuffle automatically when their planner detects mismatched partitioning, but it is a network-wide hop you'd rather avoid. Choose partition counts at design time so joinable topics share a count.

Why "round-robin a hot key" is harder than it looks

Suppose Kiran on the platform team decides to round-robin records for the top 10 merchants across all 64 partitions to fix a hot partition. The throughput problem evaporates — but four new ones replace it. First, every consumer in the group now needs the merchant's running state (running balance, fraud counters), because any partition can carry that merchant's records — state that used to live on one consumer now duplicates 64 times. Second, ordering is gone for those merchants, so a fraud rule that depended on "transaction N happened before transaction N+1" now races. Third, a downstream sink writing to a per-merchant Postgres row sees concurrent writes from 64 consumers; you need row-level locking or optimistic concurrency to keep totals consistent. Fourth, when one of those 10 merchants stops trading, you cannot drain "their" partition — their records are everywhere. The lesson: round-robining a key is fine for genuinely keyless data (logs, metrics) and dangerous for anything that has per-key state.

The 2026 partition-budget conversation at Razorpay

Razorpay's payments cluster runs ~12,000 partitions across 18 brokers in ap-south-1. The on-call rotation tracks two metrics weekly: under-replicated partitions (0 is the goal; >50 means a broker is dying or network-saturated) and partition-skew (max partitions per broker / min). When skew exceeds 1.3× the cluster gets reassigned with Cruise Control over a weekend window, moving ~8% of partitions in ~6 hours of throttled traffic. The cost of not doing this rebalance is roughly ₹40 lakh/year in over-provisioned broker hours — a busier broker holds back the whole cluster's autoscaling because it hits CPU first.

Where this leads next

A partition is the unit of parallelism, but it is also the unit of failure. If broker B holding partition 3 dies, partition 3 is offline until a replica is promoted. /wiki/replication-and-isr-how-kafka-stays-up covers the in-sync replica protocol that turns "one disk per partition" into "F+1 disks per partition with automatic failover", which is what makes a Kafka cluster survive a broker dying without losing the partition's contents.

Then /wiki/consumer-groups-and-offset-management digs into the group coordinator: how partitions are assigned, how offsets are committed, and how a 3-second rebalance can blow up a fraud detector's tail latency. /wiki/the-kafka-protocol-on-the-wire is the wire-format chapter that ties producer, broker, and consumer together.

By the end of Build 7 the picture is whole: partitions distribute writes, replication makes each partition durable, and consumer groups parallelise reads — three independent dials, each with its own failure mode. Every Kafka-shaped system you'll meet — Pulsar, Redpanda, Kinesis, Pub/Sub Lite, Azure Event Hubs — implements these three pieces with slightly different names. Once you have the partition picture, switching between them is a config exercise, not a rewrite.

References