Partitions and parallelism
Last chapter you wrote a single-partition log. It works, it durably stores millions of records, and on one NVMe disk it tops out somewhere around 50 MB/s of sustained writes. Razorpay's payments.tx_events topic does roughly 1 GB/s at peak. One partition cannot hold that traffic — not because the format is wrong, but because one disk is the wall. Partitions are how you punch through the wall while keeping the only ordering guarantee anyone actually needs.
A topic is N independent partitions. The partition for a record is hash(key) mod N — same key always lands in the same partition, so per-key order is preserved; cross-key order is given up. N partitions can be written by N producers, served by N consumers in a group, and stored on N disks across brokers, so throughput and storage scale linearly. The trade-off you cannot avoid: increasing N later silently changes the partition function for every key, which is why partition count is the most consequential one-line config in any streaming system.
Why one partition is not enough
A single partition is a single append-only file on a single disk. Three independent caps stack on top of each other. The disk: a good NVMe sustains ~500 MB/s sequential write, EBS gp3 ~125 MB/s, and that ceiling is shared with every other write on the same volume. The consumer: per-partition order means one consumer in a group reads it; you cannot parallelise reads of a single partition without losing order. The recovery: if that one broker dies, the partition is offline until the replica is promoted, which is seconds you don't have at 100k events/sec.
The fix is the only fix that has ever worked for distributed log systems: partition the topic. A topic with 64 partitions is 64 independent logs, each with its own segment files, its own offset sequence, its own consumer assignment. Producers send each record to exactly one partition; consumers in a group share the partitions among themselves. The total throughput of the topic is roughly N times the per-partition throughput, capped only by network and broker count.
The cost of this fix is one specific giveaway: global order across the whole topic is gone. If key=A lands on partition 3 and key=B on partition 7, a consumer reading both partitions sees them in some interleaving, not in the order they were originally produced. What survives is per-partition order, which — by the partition function — gives per-key order. Almost every streaming application turns out to need only the second guarantee. A fraud rule for merchant_42 cares that that merchant's transactions are in order; it does not care whether merchant_42's transaction at 11:02:03 is processed before or after merchant_99's transaction at 11:02:03. This single trade-off is what makes the Kafka model work.
The whole architecture is three pieces. The producer side has a partition selector. The broker side has N independent logs (each one is the storage from chapter 48, replicated to F+1 machines — that part is chapter 51). The consumer side has a group coordinator that hands each consumer a slice of partitions. Lose any one of those three pieces and the system stops scaling.
The partition function
The partition function takes a record's key and returns an integer in [0, N). The contract is non-negotiable: same key, same partition, every time. Without that, per-key order is gone — Razorpay's merchant=42 records would land on partition 3 sometimes and partition 7 sometimes, and a downstream consumer reconstructing each merchant's transaction stream would see records out of order.
The function Kafka uses by default is murmur2(key) mod num_partitions. Murmur2 is a 32-bit non-cryptographic hash chosen for speed and good distribution; the mod makes it fit inside the partition count. Why not key.hashCode() mod N (Java's default)? Because hashCode is uneven for short keys — many merchant IDs that differ only in the last digit collide on the same partition, creating hot partitions. Murmur2 spreads them.
If the key is null, Kafka used to round-robin (one record per partition in turn). Modern Kafka uses sticky partitioning: the producer picks one partition, batches records into it for linger.ms worth of time, then picks another. This keeps batches large, which keeps throughput high, at the cost of slight imbalance over short windows.
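That sticky behaviour is easy to sketch. A minimal version, with `StickyPartitionChooser` and its callback as illustrative names rather than any real client API:

```python
import random

class StickyPartitionChooser:
    """Sketch of sticky partitioning for keyless records: stay on one
    partition until the current batch ships, then pick a new one."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.current = random.randrange(num_partitions)

    def partition(self) -> int:
        # Every keyless record in the current batch lands here,
        # so batches stay large and throughput stays high.
        return self.current

    def on_batch_sent(self) -> None:
        # Called when linger.ms expires or the batch fills: move to a
        # different partition so load evens out over many batches.
        next_p = random.randrange(self.num_partitions)
        while next_p == self.current and self.num_partitions > 1:
            next_p = random.randrange(self.num_partitions)
        self.current = next_p
```

Within one batch every record shares a partition; across batches the distribution evens out, which is exactly the "slight imbalance over short windows" trade mentioned above.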
```python
# partitioner.py — the contract is "same key → same partition"
import struct

def murmur2(data: bytes, seed: int = 0x9747b28c) -> int:
    """Kafka's default partitioner uses 32-bit Murmur2."""
    length = len(data)
    m = 0x5bd1e995
    r = 24
    h = seed ^ length
    i = 0
    while length >= 4:
        # NB: Kafka's Java client assembles this 4-byte word little-endian
        # ("<I"); match that if you need byte-for-byte cross-client parity.
        k = struct.unpack_from(">I", data, i)[0]
        k = (k * m) & 0xFFFFFFFF
        k ^= k >> r
        k = (k * m) & 0xFFFFFFFF
        h = (h * m) & 0xFFFFFFFF
        h ^= k
        i += 4
        length -= 4
    # Tail: the last 1-3 bytes, mirroring the Java switch fallthrough.
    if length >= 3:
        h ^= data[i + 2] << 16
    if length >= 2:
        h ^= data[i + 1] << 8
    if length >= 1:
        h ^= data[i]
        h = (h * m) & 0xFFFFFFFF
    # Final avalanche.
    h ^= h >> 13
    h = (h * m) & 0xFFFFFFFF
    h ^= h >> 15
    return h & 0x7FFFFFFF  # positive int

def partition_for(key: bytes, num_partitions: int) -> int:
    """Same key → same partition, every time, on every producer."""
    return murmur2(key) % num_partitions

# --- Demo: 200,000 merchant IDs onto 64 partitions ---
if __name__ == "__main__":
    counts = [0] * 64
    for i in range(200_000):
        merchant_id = f"merchant_{i:06d}".encode()
        counts[partition_for(merchant_id, 64)] += 1
    avg = 200_000 // 64
    drift = max(counts) - min(counts)
    print(f"avg per partition: {avg}")
    print(f"max: {max(counts)}, min: {min(counts)}, drift: {drift} ({100*drift/avg:.1f}%)")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 64)}")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 64)} (again)")
    print(f"merchant_000042 → partition {partition_for(b'merchant_000042', 32)} (N=32, different)")
```

Sample run:

```
avg per partition: 3125
max: 3243, min: 2998, drift: 245 (7.8%)
merchant_000042 → partition 17
merchant_000042 → partition 17 (again)
merchant_000042 → partition 9 (N=32, different)
```
The lines that matter:

- `return h & 0x7FFFFFFF` — the result is forced positive before the `mod`. Negative-modulo bugs are the single most common partitioner mistake; a Java producer that uses `Math.abs(hash) % N` returns a negative partition for `Integer.MIN_VALUE` and crashes on send. Masking off the sign bit avoids the case entirely.
- `partition_for` returns `murmur2(key) % num_partitions` — a pure function of two inputs, no state. Why a hash and not a counter: a counter would need every producer to coordinate on a shared sequence number, which is exactly the distributed-coordination problem the topic was supposed to solve. Hashing is the only way to let independent producers agree on a partition without talking to each other.
- The `(N=32, different)` line — the same key lands on partition 17 when N=64 and partition 9 when N=32. Change the partition count and every key can remap; the per-key order guarantee is lost across the boundary. This is the trap that bites every team that grows their topic from 32 to 64 partitions without thinking.
- Drift = 7.8% — Murmur2 over 200k random-ish keys spreads uniformly enough that the busiest partition is ~8% above average. Real workloads have much higher drift because keys are not uniform: 0.1% of merchants generate 30% of traffic. The next section deals with that.
Hot partitions and key skew
The partition function distributes keys uniformly. It does not distribute traffic uniformly, because traffic is uneven. On Razorpay, the top 100 merchants out of 50 million handle around 40% of payment volume. On Zerodha, Reliance and HDFC trades dwarf the long tail. On Swiggy during dinner, three Bengaluru pin codes generate more order events than the next two hundred combined. Hash-by-merchant-id sends every record for one big merchant to one partition, and that partition becomes the bottleneck for the whole topic.
You see it in metrics: 63 partitions sit at 20% CPU and one sits at 95%, with consumer lag building only on that one. Three honest fixes exist; you pick based on whether you can give up per-key order.
Fix 1 — composite keys. Hash on (merchant_id, hour) instead of merchant_id. The same merchant's records spread across 24 partitions per day instead of 1. Per-merchant order is lost; per-merchant-per-hour order is preserved. Acceptable for fraud rules that batch hourly. Not acceptable for a real-time fraud detector that needs the full per-merchant sequence.
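Fix 1 is a one-line change to the key you hash. A sketch, with `zlib.crc32` standing in for the chapter's Murmur2 partitioner (the choice of hash does not change the idea):

```python
import zlib
from datetime import datetime, timezone

def partition_for(key: bytes, num_partitions: int) -> int:
    # zlib.crc32 stands in for Murmur2; the technique is hash-agnostic.
    return zlib.crc32(key) % num_partitions

def composite_partition(merchant_id: str, ts: datetime, num_partitions: int) -> int:
    """Hash on (merchant_id, hour): one merchant's records spread over up
    to 24 partitions a day, and order holds only within each hour."""
    hour_bucket = ts.strftime("%Y-%m-%d-%H")
    return partition_for(f"{merchant_id}:{hour_bucket}".encode(), num_partitions)

t_early = datetime(2026, 3, 1, 11, 5, tzinfo=timezone.utc)
t_late = datetime(2026, 3, 1, 11, 55, tzinfo=timezone.utc)
# Same merchant, same hour → same partition; the next hour may land elsewhere.
print(composite_partition("merchant_42", t_early, 64),
      composite_partition("merchant_42", t_late, 64))
```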
Fix 2 — explicit partition for hot keys. Write a custom partitioner that recognises the top-K merchant IDs and round-robins them across a dedicated subset of partitions, while everything else uses Murmur2. Per-key order is lost for the hot keys, so the consumer must tolerate out-of-order records for those — usually fine if the downstream is a ledger that's order-independent (idempotent upserts, chapter 27).
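A sketch of what such a custom partitioner could look like. The class and its API are illustrative, not any real client interface, and `zlib.crc32` stands in for Murmur2:

```python
import itertools
import zlib

class HotKeyPartitioner:
    """Fix 2 sketch: known hot keys round-robin over a reserved slice of
    partitions; cold keys keep the hash contract."""

    def __init__(self, num_partitions: int, hot_keys: set, hot_slice: int):
        self.cold_count = num_partitions - hot_slice
        self.hot_keys = hot_keys
        # Reserve the last hot_slice partition IDs for hot traffic.
        self._hot_cycle = itertools.cycle(range(self.cold_count, num_partitions))

    def partition(self, key: bytes) -> int:
        if key in self.hot_keys:
            return next(self._hot_cycle)  # per-key order is given up here
        # Cold keys hash deterministically into the unreserved range.
        return zlib.crc32(key) % self.cold_count
```

A consumer of the reserved partitions must tolerate out-of-order records for the hot keys, which is why the paragraph above insists the downstream be order-independent.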
Fix 3 — split the topic. Move the top-K merchants onto a separate topic payments.tx_events.vip with its own partition count. Per-key order is preserved everywhere; the consumer reads two topics. This is what Razorpay does in practice — VIP merchants get isolated infrastructure that can be scaled independently of the long tail. The operational cost is one extra topic per tier; the win is that a noisy VIP merchant cannot starve the long tail's consumers, and the long tail's consumer crash cannot hold up VIP processing.
Whichever fix you pick, instrument the answer. Kafka exposes per-partition write rate and consumer lag through JMX; a three-line PromQL alert (max(kafka_partition_bytes_in_per_second) / avg(kafka_partition_bytes_in_per_second) > 5) catches a hot partition before consumers start lagging. You only fix what you can see, and per-partition metrics are the entire visibility surface for skew problems.
The thing you should never do: bump partition count without thinking. It does not fix hot partitions — the hot key still hashes to one partition, just a different one. It also breaks ordering for every other key, as the next section explains.
What changing the partition count costs
`hash(key) mod N` is the partition function. Change N from 32 to 64 and keys remap wholesale: with a doubling, roughly half of all keys move, and you cannot predict which half. merchant_42 was on partition `hash mod 32 = 9` and becomes partition `hash mod 64 = 41`. For a while after the change, records for the same key are split between the old partition (where historical data lives) and the new partition (where new writes go). A consumer rebuilding per-key state — say, "current balance for merchant_42" — sees half the history on partition 9 and the other half on partition 41 and computes wrong totals.
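The blast radius is measurable directly: count how many keys land on a different partition when N goes from 32 to 64. A quick check, with `zlib.crc32` standing in for Murmur2:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # crc32 stands in for Murmur2; the remapping argument holds for any hash.
    return zlib.crc32(key) % num_partitions

total = 100_000
# With a doubling of N, expect roughly half of all keys to relocate:
# a key stays put only when one particular bit of its hash is zero.
moved = sum(
    partition_for(f"merchant_{i:06d}".encode(), 32)
    != partition_for(f"merchant_{i:06d}".encode(), 64)
    for i in range(total)
)
print(f"{moved} of {total} keys changed partition ({100 * moved / total:.1f}%)")
```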
There is no clean way to repartition a Kafka topic in place. The options:
- Forward-only: add partitions, accept that ordering for new records may differ from historical, and that consumers reading historical data must drain old partitions before relying on the new layout. Works only if the consumer is stateless or the change is announced and clients are restarted at a known offset.
- Mirror through a new topic: create `payments.tx_events.v2` with the new partition count, dual-write from producers, drain consumers from v1, switch reads to v2, retire v1. A multi-week migration; Razorpay has run several of these and each one is a project, not a config change.
- Dedicated tools: Confluent's Cluster Linking, MirrorMaker 2, or Kafka Streams' `repartition()` can do this with care. None of them is "free".
This is why partition count is the single most consequential one-line config in a Kafka deployment. Pick it for peak expected throughput over the next 2–3 years, not for current load. The Confluent rule of thumb: estimate peak MB/s, divide by 50 MB/s per partition (a conservative per-partition write ceiling), and round up to the next power of two for clean math when you do eventually have to repartition. A topic doing 1 GB/s at peak: 1000 / 50 = 20, round up to 32 partitions, then double once more to 64 to leave headroom. Resetting the count after 18 months of production data is far more painful than overprovisioning by 2× on day one.
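The rule of thumb reduces to a few lines of arithmetic. The 50 MB/s ceiling and the 2× headroom are the assumptions from the paragraph above, encoded as defaults:

```python
def partition_count(peak_mb_per_s: float,
                    per_partition_mb_per_s: float = 50.0,
                    headroom: float = 2.0) -> int:
    """Peak MB/s divided by the per-partition write ceiling, rounded up
    to the next power of two, then scaled by headroom."""
    needed = peak_mb_per_s / per_partition_mb_per_s
    n = 1
    while n < needed:
        n *= 2  # round up to the next power of two
    return int(n * headroom)

print(partition_count(1000))  # the 1 GB/s worked example above → 64
```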
Consumer groups split partitions among consumers
The producer side decides where records go; the consumer side decides who reads what. A consumer group is a name shared by N consumer processes (`group_id="fraud-rules-v2"` in the previous chapter). Kafka's group coordinator assigns each partition to exactly one consumer in the group. If you have 6 partitions and 2 consumers, each consumer reads 3 partitions. Add a third consumer and a rebalance runs — now each consumer reads 2. Add a tenth consumer and 4 of them sit idle, because 6 partitions cannot be split 10 ways without breaking per-partition order.
This is the parallelism ceiling. A topic with N partitions can be processed by at most N consumers in a group. If you need 100 consumers in parallel, the topic needs at least 100 partitions — chosen on day one, because growing it later breaks ordering. This is why "how many partitions?" is asked at design time and why over-provisioning is the safer mistake. Why one consumer per partition and not many: per-partition order is the contract, and order requires a single reader; two readers on the same partition would race on offset commits and either skip records or process them twice. The only way to parallelise reads of one partition is to give up order — at which point you might as well use a queue (SQS, RabbitMQ), not a log.
Two common rebalance strategies. Range assignment sorts consumers and partitions and gives each consumer a contiguous range — simple, but unbalanced when partition counts are uneven across topics. Sticky assignment tries to keep each consumer's previous partitions during a rebalance, only moving the minimum needed. Sticky is the modern default; cooperative-sticky goes further by keeping consumers reading their unchanged partitions during the rebalance, eliminating the seconds-long "stop-the-world" pause that older Kafka clients had. For a Swiggy fraud detector that has 1.2 GB of in-memory state per consumer, the difference between a 3-second and a 30-second rebalance is the difference between "no one notices" and "an alert page goes off".
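Range assignment is simple enough to write out in full. A sketch of the strategy's shape, not the coordinator's actual code:

```python
def range_assign(partitions, consumers):
    """Range strategy: sort both sides, hand each consumer a contiguous
    slice; the first (partitions % consumers) members get one extra."""
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        take = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + take]
        start += take
    return assignment

print(range_assign(list(range(6)), ["c1", "c2"]))
# 6 partitions, 2 consumers → 3 each; with 10 consumers, 4 get nothing
```

Run it with 10 consumers on 6 partitions and you can see the parallelism ceiling from the previous paragraph: four consumers receive an empty list and sit idle.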
A practical sanity check before you ship: pick three keys you care about (your busiest merchant, your noisiest pin code, a random one) and run partition_for(key, N) to confirm they don't all collapse onto the same partition. If two of three land together, your hash function is unlucky and you should check whether the SDK is using Murmur2 or hashCode.
Common confusions
- "More partitions always means more throughput." Up to a point. Each partition has a fixed cost on the broker — file handles, replication threads, controller metadata. Above ~4,000 partitions per broker, the controller's metadata-replication overhead starts dominating CPU. KRaft mode (which replaces ZooKeeper) raises this to ~200k partitions per cluster, but per-broker the wall is similar. Pick the smallest count that meets peak throughput, then leave 2× headroom.
- "Partition count and replication factor are the same thing." Different axes entirely. Partitions split a topic across brokers for parallelism. Replication factor copies each partition to F+1 brokers for durability. A topic with 64 partitions and replication factor 3 has 192 partition replicas spread across the cluster. One is the parallelism dial; the other is the durability dial.
- "Per-partition order is global order." No. Records inside one partition are strictly ordered by offset; records across partitions have no ordering relationship. If `key=A` goes to partition 3 and `key=B` goes to partition 7, a consumer reading both sees them in some interleaving determined by the broker's local clock and the consumer's poll. The only guarantee is "for any one key, you see its records in the order they were produced" — which is exactly the guarantee most applications actually need.
- "`null` keys round-robin one record per partition." Old Kafka, yes. Modern Kafka does sticky partitioning: the producer picks one partition and batches into it for `linger.ms`, then picks another. The throughput cost of true round-robin (tiny single-record batches) was unacceptable; sticky preserves batch size at the cost of slight short-window imbalance. Production traffic is rarely affected — over a minute, distribution is even.
- "Adding a broker rebalances partitions automatically." It does not. New brokers serve new partitions and replicas; existing partitions stay where they were. To move existing partitions onto the new broker you run a partition reassignment (Kafka's `kafka-reassign-partitions.sh` or Cruise Control). Forgetting to do this is the most common reason a freshly added broker sits idle while the rest of the cluster groans.
- "Two consumers in the same group can read the same partition for redundancy." Forbidden by design. One partition, one consumer per group — anything else would mean records get processed twice. If you want two independent processings of the same data (one for fraud, one for analytics), use two consumer groups. Each group gets its own offset cursor; both read every record.
Going deeper
What sticky cooperative rebalance actually does
Pre-2.4 Kafka had "stop-the-world" rebalance: every consumer in the group revokes all its partitions, the coordinator computes a new assignment, every consumer picks up its new partitions. During the rebalance — often 5–30 seconds, depending on state-restoration cost — the topic is unprocessed. A Dream11 fraud detector during an India–Pakistan match cannot afford a 30-second pause; the queue would back up and miss the window.
Cooperative sticky (KIP-429, Kafka 2.4) splits this into two phases. Phase 1: the coordinator computes the new assignment but only revokes partitions that changed owner. Consumers whose partition list is unchanged keep processing through the rebalance. Phase 2: revoked partitions get reassigned to their new owners; only those few consumers pause briefly. For a 1.2 GB-state-per-consumer fraud detector, a single-partition reassignment restores in ~2 seconds; full-group reassignment took 60. The Confluent client and kafka-python ≥ 2.0.2 both support cooperative-sticky; switching to it is a one-line change with a one-restart cost.
Murmur3, FNV, xxHash — when does the hash matter
Kafka's protocol hardcodes Murmur2 for the default partitioner. If you use a non-default partitioner (custom Java class, or kafka-python's DefaultPartitioner), the hash function might differ — kafka-python ≤ 2.0.1 used Java's hashCode(), which produces a different partition assignment than the JVM Kafka client for the same key. This bites teams running mixed-language producers: the Python service writes to partition 9, the Java service writes the same key to partition 17, and per-key order is silently broken. The fix is to pin both clients to Murmur2 explicitly. Always check the partitioner your client uses; never assume hash compatibility across SDKs.
Co-partitioning: when two topics must share a partitioner
Stream-stream joins (chapter 67) require co-partitioning: if you want to join payments.tx_events against fraud.scores on merchant_id, both topics must use the same partition count and the same hash function on the same key. Otherwise merchant_42 is on partition 9 in one topic and partition 41 in the other, and a consumer reading both cannot see them on the same machine without a re-shuffle. Kafka Streams' repartition() and Flink's keyBy() both perform this re-shuffle automatically when their planner detects mismatched partitioning, but it is a network-wide hop you'd rather avoid. Choose partition counts at design time so joinable topics share a count.
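A pre-deployment check for co-partitioning can be mechanical. A sketch, with a hypothetical `co_partitioned` helper and `zlib.crc32` standing in for Murmur2:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Both topics must run the same hash (crc32 stands in for Murmur2 here).
    return zlib.crc32(key) % num_partitions

def co_partitioned(keys, n_left: int, n_right: int) -> bool:
    """True when every sampled key lands on the same partition number in
    both topics, which requires equal counts and an identical hash."""
    return all(partition_for(k, n_left) == partition_for(k, n_right)
               for k in keys)

keys = [f"merchant_{i}".encode() for i in range(10_000)]
print(co_partitioned(keys, 64, 64))  # joinable without a re-shuffle
print(co_partitioned(keys, 64, 32))  # mismatched counts force a re-shuffle
```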
Why "round-robin a hot key" is harder than it looks
Suppose Kiran on the platform team decides to round-robin records for the top 10 merchants across all 64 partitions to fix a hot partition. The throughput problem evaporates — but four new ones replace it. First, every consumer in the group now needs the merchant's running state (running balance, fraud counters), because any partition can carry that merchant's records — state that used to live on one consumer now duplicates 64 times. Second, ordering is gone for those merchants, so a fraud rule that depended on "transaction N happened before transaction N+1" now races. Third, a downstream sink writing to a per-merchant Postgres row sees concurrent writes from 64 consumers; you need row-level locking or optimistic concurrency to keep totals consistent. Fourth, when one of those 10 merchants stops trading, you cannot drain "their" partition — their records are everywhere. The lesson: round-robining a key is fine for genuinely keyless data (logs, metrics) and dangerous for anything that has per-key state.
The 2026 partition-budget conversation at Razorpay
Razorpay's payments cluster runs ~12,000 partitions across 18 brokers in ap-south-1. The on-call rotation tracks two metrics weekly: under-replicated partitions (0 is the goal; >50 means a broker is dying or network-saturated) and partition-skew (max partitions per broker / min). When skew exceeds 1.3× the cluster gets reassigned with Cruise Control over a weekend window, moving ~8% of partitions in ~6 hours of throttled traffic. The cost of not doing this rebalance is roughly ₹40 lakh/year in over-provisioned broker hours — a busier broker holds back the whole cluster's autoscaling because it hits CPU first.
Where this leads next
A partition is the unit of parallelism, but it is also the unit of failure. If broker B holding partition 3 dies, partition 3 is offline until a replica is promoted. /wiki/replication-and-isr-how-kafka-stays-up covers the in-sync replica protocol that turns "one disk per partition" into "F+1 disks per partition with automatic failover", which is what makes a Kafka cluster survive a broker dying without losing the partition's contents.
Then /wiki/consumer-groups-and-offset-management digs into the group coordinator: how partitions are assigned, how offsets are committed, and how a 3-second rebalance can blow up a fraud detector's tail latency. /wiki/the-kafka-protocol-on-the-wire is the wire-format chapter that ties producer, broker, and consumer together.
By the end of Build 7 the picture is whole: partitions distribute writes, replication makes each partition durable, and consumer groups parallelise reads — three independent dials, each with its own failure mode. Every Kafka-shaped system you'll meet — Pulsar, Redpanda, Kinesis, Pub/Sub Lite, Azure Event Hubs — implements these three pieces with slightly different names. Once you have the partition picture, switching between them is a config exercise, not a rewrite.
References
- Apache Kafka — Producer configuration: `partitioner.class` — the default partitioner contract, sticky-partitioning behaviour, and how to plug in a custom partitioner.
- Confluent — Choosing the number of partitions for a Kafka topic — Jun Rao's foundational sizing guide; the per-partition throughput numbers and the partition-count formula are still the production rules of thumb.
- KIP-429: Kafka Consumer Incremental Rebalance Protocol — the cooperative-sticky design doc; explains why pre-2.4 stop-the-world rebalance was unacceptable for stateful consumers.
- Murmur2 hash function — original spec — Austin Appleby's reference; the choice of Murmur2 over Java `hashCode()` is the single change that prevents hot partitions on uneven keys.
- LinkedIn Engineering — Kafka at LinkedIn: 7 trillion messages/day — partition counts, broker sizing, and rebalance latencies at the original Kafka deployment; useful Indian-scale comparison.
- Cruise Control — partition rebalancer for Kafka — the LinkedIn-built tool every large Kafka deployment uses for partition reassignment.
- /wiki/a-single-partition-log-in-python — the previous chapter; this chapter takes that single log and multiplies it into N independent partitions.
- /wiki/consumer-groups-and-offset-management — how a group coordinator assigns partitions to consumers and tracks each consumer's progress.