In short

A single MongoDB replica set is a beautiful machine — three nodes, one primary, automatic failover, oplog tailing for change streams. It is also a single-machine database. The primary holds the entire dataset and serves every write; secondaries hold full copies. When your collection passes roughly 1–2 TB or your write rate climbs past 10k ops/sec on commodity hardware, you have two choices: buy a bigger box (vertical scaling, capped by what AWS will rent you) or shard horizontally. MongoDB's sharding architecture is built in.

A sharded MongoDB cluster has three kinds of components. mongos processes are stateless query routers — clients connect to mongos, mongos figures out which shards to talk to, and mongos returns the answer. There are usually several mongos behind a load balancer; lose one, lose nothing. Config servers are a 3-node replica set storing the shard catalogue — the mapping from shard-key ranges to shards, the schema for sharded collections, balancer state. If the config servers are unreachable, the cluster cannot route queries, so they are the single most important replica set in the deployment. Shards are the data nodes; each shard is itself a 3-node replica set, and holds a slice of the data.

Inside each shard, data is divided into chunks — contiguous ranges of shard-key values, 128 MB by default since v6.0 (64 MB in earlier versions). As writes flow in, chunks grow, hit the size threshold, and split. As shards become unbalanced (chunk counts drifting past a threshold), the balancer — a background process that runs on the config server primary — picks a chunk on the heaviest shard and migrates it to the lightest shard. Migration is online: writes during the move are tracked and replayed at the destination before the source releases the chunk.

The single most consequential decision in sharded MongoDB is the shard key. Ranged sharding gives you efficient range queries but risks hot spots on monotonic keys (timestamps, ObjectIds). Hashed sharding gives you uniform distribution but turns range queries into scatter-gather across every shard. Zoned sharding lets you pin key ranges to specific shards for geo-locality — Indian users to a Mumbai shard, US users to an Iowa shard. Pick wrong and you live with hot shards and expensive queries; before v5.0 the only fix was a dump-and-reload, and v5.0's online resharding, while a real escape hatch, is slow and expensive. The cluster scales to a documented maximum of 1024 shards; in practice almost nobody runs more than 12.

You finished the previous chapter on change streams. You now know how a single MongoDB replica set works end-to-end: a primary takes writes, the oplog records them, secondaries tail the oplog, and external consumers tail it too via change streams. That is enough to power a startup with millions of users on a single replica set running on three EC2 instances.

It is not enough for the collection that finally outgrows one machine. Some numbers from the field: a single replica set on an r6i.4xlarge (16 vCPU, 128 GB RAM, gp3 EBS storage) on AWS will comfortably hold 1–2 TB of data with the hot working set in cache, serving 10k–30k writes/sec at sub-10ms p99. Past that, the primary's CPU saturates, the working set spills out of RAM, and disk reads start showing up in the hot path. You cannot fix this by adding secondaries — secondaries do not take writes. You cannot fix it by adding RAM — r6i.32xlarge, the biggest instance in the family at 1 TB of RAM, gets you to maybe 4–5 TB. You can fix it by sharding.

This chapter is about the architecture of sharded MongoDB — the three component types, how chunks work, how the balancer keeps things even, how the shard key choice shapes every query you will ever run, and the practical pitfalls that bite real deployments. We have already covered the Multi-Raft pattern used by CockroachDB and TiKV; MongoDB's sharding is older (it predates Raft) and uses a different design — there is no per-chunk consensus group, just a per-shard replica set with the existing oplog-based primary election.

The three components

A sharded MongoDB deployment has three kinds of processes, and each kind exists for a specific reason. Get this picture into your head before anything else.

[Figure: Sharded MongoDB topology — three layers, three roles. Application clients (pymongo, the Java driver, etc.) connect through a load balancer to two stateless mongos routers (no on-disk state). The routers consult a 3-node config server replica set holding the chunk-to-shard map, collection schemas, and balancer state. Below sit three shard replica sets, each a primary plus two secondaries: Shard A with chunks [a..h), Shard B with [h..p), Shard C with [p..z), each ~1.25 TB across ~32 chunks.]
The three layers of a sharded MongoDB cluster. Application clients connect to mongos query routers (stateless, horizontally scaled behind an L4 load balancer). mongos consults the config server replica set on every cache miss to learn which shard owns which chunks of the shard-key range, then forwards the operation to the correct shard primary. Each shard is itself a full MongoDB replica set with its own primary, oplog, and elections — sharding does not replace replication, it composes with it.

mongos: stateless query routers

A mongos process is a stateless gateway. It listens on port 27017 (the same port as mongod, deliberately — drivers do not need to know whether they are talking to a single replica set or a sharded cluster). When a query arrives, mongos:

  1. Parses the operation and identifies the target collection.
  2. Looks up the collection's shard key in its routing cache (a local in-memory copy of the chunk-to-shard map).
  3. If the operation contains the shard key, computes which chunk(s) it falls into and forwards to those shards.
  4. If it does not, broadcasts to all shards (scatter-gather).
  5. Aggregates responses and returns to the client.
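
Steps 2–4 above amount to a range lookup in a sorted chunk map. A minimal sketch of that decision, assuming a toy map of three chunks keyed by customerId (an illustration of the idea, not mongos internals):

```python
import bisect

# Toy routing cache: sorted lower bounds of shard-key ranges and the shard
# owning each chunk. Assumes keys fall within [a..z) for simplicity.
bounds = ["a", "h", "p"]            # chunks: [a..h), [h..p), [p..z)
owners = ["shardA", "shardB", "shardC"]

def route(query: dict):
    """Return the shards an operation must touch."""
    if "customerId" in query:        # shard key present → targeted
        key = query["customerId"]
        i = bisect.bisect_right(bounds, key) - 1
        return [owners[i]]
    return list(owners)              # no shard key → scatter-gather

print(route({"customerId": "karthik"}))  # ['shardB']
print(route({"status": "pending"}))      # ['shardA', 'shardB', 'shardC']
```

The real cache maps ranges per collection and is invalidated by StaleConfig errors, but the targeted-versus-broadcast fork is exactly this shape.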

Why stateless: every mongos is identical. You run several behind an AWS NLB or an HAProxy and shoot one in the head whenever you want — drivers reconnect to a survivor. The only state mongos holds is the routing cache, and that is just a cache of what the config servers know. A cold mongos warms up by querying the config server primary; a stale entry is detected when a shard returns StaleConfig and triggers a refresh. There is no failover dance, no leader election among mongos. Production deployments typically run one mongos per application server (co-located, talked to over localhost) or a small fleet behind a load balancer.

Config servers: the metadata replica set

The config servers are a regular 3-node replica set running mongod with --configsvr. They store the config database, which contains:

  1. The chunk map — which shard-key ranges live on which shard.
  2. Metadata for every sharded collection — its shard key and chunk boundaries.
  3. Balancer state and settings — enabled or not, the active window, migrations in flight.

Every metadata change (creating a sharded collection, splitting a chunk, migrating a chunk) is a write to the config server primary, replicated to its two secondaries via the standard oplog mechanism. The balancer also runs on the config server primary as a background thread.

Why this matters operationally: if all three config servers are down, the cluster cannot route any new query that requires a metadata refresh. mongos can serve out of its routing cache for cached collections, but anything that triggers a StaleConfig refresh — chunk splits in flight, new collections, a recently migrated chunk — will fail. You should monitor config server health as carefully as you monitor your shards. AWS DocumentDB and Atlas hide this from you; you should not hide it from yourself if you self-host.

Shards: where the data lives

Each shard is a 3-node replica set. From inside a shard, MongoDB looks exactly like an unsharded MongoDB — a primary with an oplog, two secondaries tailing it, automatic failover via the standard election protocol. The only difference is that the shard only holds chunks whose shard-key ranges have been assigned to it. A query for a key range it does not own returns no documents (and mongos would never have routed there anyway). A typical production cluster runs 3–12 shards; the documented hard limit is 1024.

Chunks and the balancer

A sharded collection is logically divided into chunks. A chunk is a contiguous range of shard-key values — e.g. customerId: ["a", "f") is one chunk, ["f", "k") is the next. The default chunk size since MongoDB 6.0 is 128 MB (it was 64 MB before 6.0; the larger default cuts down on balancer churn). Every document in a sharded collection lives in exactly one chunk, and that chunk lives on exactly one shard at any given moment.

[Figure: Chunks (64 MB each in this example) split when they grow; the balancer migrates them across shards. Shard A is overloaded with 8 chunks (512 MB), two of them — [m..o) and [o..q) — marked MIGRATING; Shard B has a light load of 4 chunks and will hold 6 after the migration. Writes during migration are tracked and replayed at the destination before commit.]
The balancer compares chunk counts across shards. When the difference exceeds the migration threshold (2 for collections with fewer than 20 chunks, 4 up to 80, 8 beyond), it picks chunks from the heaviest shard and moves them to the lightest. A migration copies the chunk's documents to the destination, tracks any writes that happen during the copy in a delta log, applies the delta, blocks new writes for a short critical section, applies the final delta, updates the config servers, and releases the chunk on the source. Total wall-clock time for a 64 MB chunk migration on a healthy network is typically 5–30 seconds.
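
The balancer's decision loop can be modelled in a few lines. This is a deliberate simplification — since 6.0 the real balancer weighs data size per shard, not just chunk counts — but the greedy heaviest-to-lightest shape is the same:

```python
def rebalance(chunks_per_shard: dict, threshold: int = 2):
    """Greedily schedule one migration at a time from the heaviest shard
    to the lightest until the spread is within the threshold.
    Returns the migrations scheduled and the final chunk counts."""
    counts = dict(chunks_per_shard)
    migrations = []
    while max(counts.values()) - min(counts.values()) > threshold:
        src = max(counts, key=counts.get)   # heaviest shard
        dst = min(counts, key=counts.get)   # lightest shard
        counts[src] -= 1
        counts[dst] += 1
        migrations.append((src, dst))
    return migrations, counts

moves, final = rebalance({"A": 8, "B": 4, "C": 3})
print(moves)   # [('A', 'C'), ('A', 'B')]
print(final)   # {'A': 6, 'B': 5, 'C': 4}
```

Note the loop stops at a spread of 2, not 0 — the threshold exists so the balancer does not thrash chunks back and forth over trivial imbalances.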

Splits

A chunk grows when documents are inserted into its key range. When the chunk crosses the configured chunk size (a soft threshold checked as writes land), the shard requests a split. The split is a metadata-only operation — no data moves. The config server inserts a new boundary into the chunk map, and what was one chunk [a, c) becomes two: [a, b) and [b, c). The boundary key is chosen as the median of the chunk's documents.

Splits are cheap, fast, and happen continuously as the dataset grows. A 5 TB sharded collection with 64 MB chunks has roughly 80,000 chunks. The config servers handle this metadata happily — the chunk map for a healthy collection is a few tens of MB.
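
The chunk-count arithmetic, plus a median split in miniature (toy code, not the server's):

```python
TB = 1024 ** 4
MB = 1024 ** 2

# 5 TB of data in 64 MB chunks → the ~80,000 chunks quoted above.
print((5 * TB) // (64 * MB))   # 81920

def split_point(chunk_keys):
    """Metadata-only split: the median key becomes the new boundary.
    No documents move; one range simply becomes two in the chunk map."""
    return sorted(chunk_keys)[len(chunk_keys) // 2]

keys = ["ant", "bat", "cat", "dog", "eel"]
print(split_point(keys))   # 'cat' → [..., "cat") and ["cat", ...)
```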

Balancer migrations

Splits create chunks but do not move them. After a split, both new chunks live on the same shard as the original. Over time, if writes are concentrated in one shard-key range, that shard accumulates more chunks than its peers. The balancer detects this and starts migrating.

The balancer runs on the config server primary. Every few seconds it:

  1. Counts chunks per shard for each sharded collection.
  2. If (maxChunks − minChunks) > threshold, schedules a migration from max to min.
  3. Tells the source shard to begin the move; the source issues _recvChunkStart to the destination.
  4. Source streams documents in the chunk to destination over MongoDB's wire protocol.
  5. Source tracks any writes to the chunk during the copy in a delta log.
  6. Once initial copy is done, source applies the deltas in batches.
  7. When the delta queue is small enough, source blocks writes to the chunk for a brief critical section, applies the final deltas, and tells the config servers to update the chunk map.
  8. Source deletes its copy of the chunk's documents.
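
The delta-log protocol in steps 4–8 can be sketched as a toy simulation, with plain dicts standing in for the chunk's documents:

```python
# Toy model: copy a chunk while writes continue, tracked in a delta log
# and replayed at the destination before the chunk is handed over.
source = {"u1": {"v": 1}, "u2": {"v": 1}}   # documents in the chunk
delta_log = []

def write(doc_id, doc):
    """A write arriving during migration: applied at source, logged."""
    source[doc_id] = doc
    delta_log.append((doc_id, doc))

# Step 4: initial copy — a snapshot of the chunk streams to the destination.
destination = dict(source)

# Writes land while the copy is in flight (steps 5).
write("u2", {"v": 2})
write("u3", {"v": 1})

# Steps 6–7: replay the delta log; in real MongoDB a brief critical
# section blocks writes while the final entries are applied.
for doc_id, doc in delta_log:
    destination[doc_id] = doc

assert destination == source   # consistent — source can release the chunk
print(sorted(destination))     # ['u1', 'u2', 'u3']
```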

Why migrations are expensive: a 64 MB chunk migration involves a full copy of those documents over the network plus oplog ops on both shards plus a config server metadata write, and during the critical section writes to that key range stall. On a busy cluster, the balancer is one of the largest sources of background I/O. MongoDB lets you set a balancer window — typically you allow it to run only during low-traffic hours (e.g. 02:00–06:00 IST). The trade-off is that during the day, hot shards stay hot. Pick your poison: balanced load or predictable latency.

The shard key: the only decision that matters

If you remember nothing else from this chapter, remember this: the shard key choice is irreversible in practice (online resharding exists since v5.0 but is slow and expensive), affects every query you will ever run, and is the single most consequential design decision in a sharded MongoDB deployment. Get it right at the start.

The shard key is a field (or compound of fields) in every document of the sharded collection. MongoDB hashes it (for hashed sharding) or sorts on it (for ranged sharding) to assign documents to chunks. There are three flavours.

[Figure: Three shard-key strategies, three different trade-offs. Hashed, {customerId: "hashed"}: uniform distribution across shards, but range queries scatter and there is no geo-locality. Ranged, {createdAt: 1}: range queries are targeted, but monotonic keys create a hot tail and the key needs careful design. Zoned, {country: 1, _id: 1}: IN→Mumbai, US→Iowa, EU→Frankfurt, SG→Singapore — geo-locality and data-residency compliance, at the cost of manual zone administration. Rule of thumb: high-cardinality identity field (userId, orderId) with no range scans → hashed; high-cardinality prefix plus the natural sort field → ranged compound, e.g. {tenantId: 1, createdAt: 1}; multi-region deployment with data-residency requirements (DPDP Act, GDPR) → zoned; pure timestamp or auto-incremented ObjectId as shard key → avoid — guaranteed hot tail.]
Three shard-key strategies and what each one buys you. Hashed gives the most uniform write distribution but turns range queries into scatter-gather. Ranged gives efficient range scans but is dangerous on monotonically increasing keys — every write goes to the same chunk, which becomes a hot spot until it splits, and then the new highest chunk takes over. Zoned (formerly "tag-aware") sharding lets you pin key ranges to specific shards, which you use for geo-locality (low latency for nearby users) or regulatory compliance (Indian user data in a Mumbai region under the DPDP Act).

Ranged sharding

sh.shardCollection("orders.placed", {createdAt: 1}) creates a sharded collection where chunks are contiguous ranges of createdAt. A query like db.placed.find({createdAt: {$gte: ISODate("2026-04-01"), $lt: ISODate("2026-05-01")}}) hits only the chunks whose ranges overlap April 2026 — typically one or two shards out of N. That is the upside.

The downside is monotonic keys. createdAt always increases. Every new document goes into the highest-key chunk, which lives on one shard. That shard takes 100% of write traffic. The chunk splits when it hits 64 MB, but the new highest chunk is on the same shard and the problem repeats. This is the "hot tail" anti-pattern. The same issue bites _id (ObjectId is roughly monotonic by timestamp prefix) and any auto-increment counter.

The fix is to compound the shard key with a high-cardinality leading field: {tenantId: 1, createdAt: 1}. Now writes for different tenants land in different chunks even if they happen at the same instant, and range queries scoped to one tenant are still efficient.
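
A toy illustration of the difference, with the chunk target reduced to a label (not MongoDB code): under a monotonic key every write lands in the highest chunk, while a compound key's leading field spreads simultaneous writes across chunks:

```python
from datetime import datetime, timedelta, timezone

t0 = datetime(2026, 4, 1, tzinfo=timezone.utc)
# Eight writes from four tenants, arriving within the same few seconds.
writes = [(f"tenant_{i % 4}", t0 + timedelta(seconds=i)) for i in range(8)]

# Monotonic key {createdAt: 1}: every insert targets the highest chunk.
hot = {"highest-chunk" for _tenant, _ts in writes}

# Compound key {tenantId: 1, createdAt: 1}: the leading field decides
# which chunk a write lands in, so simultaneous writes diverge by tenant.
spread = {tenant for tenant, _ts in writes}

print(len(hot), len(spread))   # 1 4 — one hot chunk vs four distinct targets
```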

Hashed sharding

sh.shardCollection("users.profiles", {userId: "hashed"}) hashes userId (MongoDB uses md5 truncated to 64 bits) and assigns chunks based on hash ranges. The hash distribution is essentially uniform, so writes spread evenly across all shards even if userId is monotonic.

The cost is range queries. find({userId: {$gte: "a", $lt: "b"}}) cannot be served from one shard — the hashed key destroys the natural ordering, so this query becomes a scatter-gather across all shards. For point queries (find({userId: "spike@flipkart.com"})) this is fine; for analytical scans it is awful.
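
You can reproduce the uniformity yourself with the md5-truncated-to-64-bits scheme described above. This is an approximation — real hashed indexes hash the BSON encoding of the value, and chunks are hash ranges rather than a modulo — but the distribution behaviour is the point:

```python
import hashlib

def shard_for(key: str, n_shards: int = 4) -> int:
    """Approximate MongoDB's hashed shard key: md5, keep 64 bits,
    then bucket (modulo here, hash ranges in the real thing)."""
    h64 = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "little")
    return h64 % n_shards

counts = [0] * 4
for i in range(10_000):
    counts[shard_for(f"user_{i}")] += 1   # monotonic ids, uniform hashes

print(counts)   # roughly 2500 per shard despite the monotonic input
```

Perfectly monotonic keys — the worst case for ranged sharding — come out essentially uniform, which is the whole argument for hashing identity fields.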

Zoned sharding

Zoned sharding lets you pin shard-key ranges to named zones and assign shards to those zones. The classic use case is geo-locality: tag your Mumbai shards with region:in and your Iowa shards with region:us, then declare that the key range {country: "IN", ...} belongs to the region:in zone. Writes from Indian users land on Mumbai shards close to them; reads do not cross the Indian Ocean. The DPDP Act 2023 makes this useful in another direction — you can prove to regulators that Indian user data physically lives in Indian data centres.
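
The zone lookup itself is just a two-step mapping — key range to zone, zone to shards. A sketch with hypothetical shard names (the real administration commands are sh.addShardToZone and sh.updateZoneKeyRange; this shows only the routing idea):

```python
# Zone configuration: country prefix of the shard key → zone name,
# and zone name → the shards tagged with that zone.
zone_ranges = {"IN": "region:in", "US": "region:us", "EU": "region:eu"}
zone_shards = {"region:in": ["shard-mumbai"],
               "region:us": ["shard-iowa"],
               "region:eu": ["shard-frankfurt"]}

def shards_for(country: str):
    """Resolve a document's country prefix to the shards in its zone."""
    return zone_shards[zone_ranges[country]]

print(shards_for("IN"))   # ['shard-mumbai'] — Indian data stays in Mumbai
```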

Targeted vs scatter-gather queries

Once your collection is sharded, every query is one of two shapes. Targeted queries include the shard key (or a prefix of it for compound keys); mongos routes them to one shard, the shard executes locally, the result comes back. Latency is the same as an unsharded query — single-digit milliseconds for an indexed lookup.

Scatter-gather queries omit the shard key. mongos has no way to know which shards hold matching documents, so it broadcasts to all of them, waits for every reply, and merges. Latency is bounded by the slowest shard. Cost is N times the work. Some operators ($sort, $group without the shard key as a leading group field, $lookup joining sharded collections) require gathering at mongos for a final merge step.
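
For a sorted scatter-gather, the merge mongos performs is a k-way merge of per-shard sorted streams — sketched here with heapq (the idea, not mongos internals):

```python
import heapq

# Each shard returns its matching documents already sorted by the sort
# key; mongos merges the streams without re-sorting the whole result.
shard_a = [{"amount": 5200}, {"amount": 9100}]
shard_b = [{"amount": 5050}, {"amount": 7700}]
shard_c = [{"amount": 6400}]

merged = list(heapq.merge(shard_a, shard_b, shard_c,
                          key=lambda d: d["amount"]))
print([d["amount"] for d in merged])   # [5050, 5200, 6400, 7700, 9100]
```

This is also why latency is bounded by the slowest shard: the merge cannot finish until every stream has been drained.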

You can see which kind a query is using explain:

from pymongo import MongoClient
client = MongoClient("mongodb://mongos1.prod:27017,mongos2.prod:27017/")
orders = client.ecommerce.orders
# targeted: includes the shard key {customerId: "hashed"}
plan_t = orders.find({"customerId": "user_42938"}).explain()
print(plan_t["queryPlanner"]["winningPlan"]["shards"])  # one entry → one shard
# scatter-gather: no shard key in the filter
plan_s = orders.find({"status": "pending", "amount": {"$gt": 5000}}).explain()
print(plan_s["queryPlanner"]["winningPlan"]["shards"])  # one entry per shard

A healthy sharded MongoDB deployment serves ≥95% of its query traffic as targeted operations. If you find that scatter-gather is a large fraction of your workload, your shard key is wrong for your access pattern.

Worked example

A 5 TB Indian e-commerce orders collection

You run an e-commerce platform serving roughly 40 million Indian customers. Your orders collection has grown to 5 TB across 1.2 billion documents. A single replica set on r6i.4xlarge is choking — the working set no longer fits in 128 GB of RAM, and writes are landing at 18k/sec with the primary's CPU pegged. You decide to shard.

[Figure: 5 TB orders, sharded on {customer_id: "hashed"}, 4 shards. Each shard is a 3-node replica set holding ~1.25 TB (~20k chunks) of one quarter of the hash range and absorbing ~5k writes/sec; Shard 1 sits in Mumbai. Targeted query find({customer_id: "spike@flipkart"}) hashes to Shard 2 only and returns in under 5 ms — 99% of OLTP traffic. Scatter-gather find({state: "KA", month: "2026-03"}) carries no shard key, broadcasts to all 4 shards, and takes ~200 ms — ~1% of traffic, accepted.]
The chosen shard key {customer_id: "hashed"} distributes the 5 TB across four shards uniformly — each shard holds ~1.25 TB and absorbs ~5k writes/sec, well within a single replica set's capacity. Point queries by customer (the dominant OLTP workload — order history page, place new order) are targeted to one shard. Analytical queries that filter on geography or status fan out to all four; you accept the cost because they run nightly into a separate aggregation pipeline, not on every page load.

Step 1: enable sharding.

sh.enableSharding("ecommerce")
sh.shardCollection("ecommerce.orders", {customer_id: "hashed"})

MongoDB pre-splits the hashed range into 2 chunks per shard (so 8 chunks initially) and distributes them evenly, so writes fan out from the first second.

Step 2: choose 4 shards on r6i.4xlarge. Each shard is a 3-node replica set across three Mumbai AZs. Total node count: 4 shards × 3 nodes + 3 config servers + 2 mongos = 17 processes. Monthly cost on AWS, on-demand: about ₹14 lakh; with 1-year reserved instances, around ₹8 lakh.

Step 3: load. Use mongorestore from your existing backup or $out from an aggregation reading from the old replica set. With hashed sharding and a uniform customer_id, the balancer barely has to do anything — the initial load is already even.

Step 4: validate. Run db.orders.getShardDistribution() — you should see ~25% on each shard. Run explain on your top 10 query patterns to confirm targeted vs scatter-gather assumptions hold.

Hot user "spike@flipkart" placing 200 orders/sec during a flash sale? With hashed sharding, spike@flipkart's hash lands deterministically on one shard (say Shard 2). All 200 orders/sec go to Shard 2's primary. Is that a hot spot? Yes, but a bounded one — it's one customer's worth of writes, not the entire write traffic of the cluster. Shard 2 absorbs an extra 200 ops/sec on top of its baseline 5k ops/sec; it does not notice. The hashed scheme prevents the much worse failure mode where all writes pile on one shard.

Analytical query "all orders in Karnataka in March 2026"? No customer_id in the filter. mongos broadcasts to all 4 shards. Each shard scans its slice of March orders (using a secondary index on {state: 1, month: 1}), returns matching documents, mongos merges. Latency: ~200 ms. You accept this. If it became a hot path, you would consider a separate Mongo Atlas search node or duplicating the data into BigQuery — not changing the shard key.

Practical pitfalls

A handful of failure modes show up over and over again in production sharded MongoDB clusters. Knowing the names lets you recognise them in monitoring before they hurt.

Hot chunks under monotonic ranged sharding. Discussed above. The fix is hashed or compound shard keys. If you find this in production, online resharding (v5.0+) is the escape hatch — reshardCollection with a new shard key. It is a multi-hour operation that copies all of your data and should be scheduled during a maintenance window.

Jumbo chunks. A chunk that grows past the chunk size limit but cannot be split because all of its documents share the same shard-key value (e.g. you sharded on country and have 500 GB of country: "IN" documents). Jumbo chunks cannot be migrated by the balancer — they exceed the migration size limit. The fix is to choose a higher-cardinality shard key from the start; on v4.4+, refineCollectionShardKey can append suffix fields to an existing key as a partial escape hatch.

Balancer thrash during business hours. Migrations are expensive, and the default balancer window is "whenever it wants". On busy clusters you should restrict it — against the config database: db.settings.updateOne({_id: "balancer"}, {$set: {activeWindow: {start: "02:00", stop: "06:00"}}}, {upsert: true}). The cluster will be slightly imbalanced during the day; the latency improvement is worth it.

Orphan documents. A migration that completes successfully but whose final cleanup-on-source step crashes leaves orphaned copies of documents on the source shard. mongos filters them out for queries that include the shard key, but scatter-gather queries can return duplicates. The cleanupOrphaned command and the v4.4+ "range deletion" task handle this automatically; on older versions you would run it manually.

Config server outages. If your config servers are down, mongos cannot refresh its routing cache. Existing cached routes still work, but anything triggering a StaleConfig error fails. Treat the config server replica set as the most critical part of the deployment — three nodes across three AZs, with monitoring and on-call paging.

Going deeper

There are some sharp edges in MongoDB sharding that only become visible at production scale, and deserve their own treatment.

How online resharding works (v5.0+)

Before MongoDB 5.0, changing a shard key required mongodump, drop the collection, recreate with a new shard key, mongorestore. Hours of downtime; many shops postponed the fix indefinitely. Version 5.0 introduced reshardCollection, which keeps the cluster online during the change.

The mechanism is, essentially, a bulk copy with change-stream catch-up. MongoDB creates a temporary collection with the new shard key, copies all documents from the old collection in batches, and uses a change stream on the source to track writes that happen during the copy. Once the source's writes have been applied to the destination, MongoDB enters a brief critical section, applies the final delta, atomically swaps the collection name, and drops the old. The total wall-clock time depends on collection size and write rate; for a 1 TB collection at moderate write load, expect 4–8 hours and a brief (sub-second) write stall at the swap.
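
The mechanism can be sketched in miniature — the same delta idea as chunk migration, scaled to a whole collection, with plain dicts standing in for the collections and the change stream:

```python
# Toy model of reshardCollection: bulk copy to a temporary collection
# keyed by the NEW shard key, while a change-stream-like log captures
# writes that land during the hours-long copy.
old_coll = {1: {"region": "IN", "v": 1}, 2: {"region": "US", "v": 1}}
changes = []

def live_write(doc_id, doc):
    """A write arriving mid-reshard: applied to the source, captured."""
    old_coll[doc_id] = doc
    changes.append((doc_id, doc))

# Phase 1: bulk copy (re-bucketed under the new shard key in reality).
temp_coll = {doc_id: dict(doc) for doc_id, doc in old_coll.items()}

# Writes keep arriving during the copy.
live_write(2, {"region": "US", "v": 2})
live_write(3, {"region": "IN", "v": 1})

# Phase 2: change-stream catch-up; then a sub-second critical section
# applies the last delta and atomically swaps the collection names.
for doc_id, doc in changes:
    temp_coll[doc_id] = doc

assert temp_coll == old_coll   # safe to swap and drop the old collection
print(sorted(temp_coll))       # [1, 2, 3]
```

The doubled write traffic mentioned below falls directly out of this shape: every live write is applied once at the source and once again during catch-up at the destination.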

The cost is real: during the copy, you are doubling your write traffic (every write to source also lands at destination via the change-stream replay) and consuming an extra 1 TB of disk. Plan capacity accordingly.

Why MongoDB does not use Multi-Raft

MongoDB sharding predates Raft (sharding shipped in 2010, Raft was published in 2014) and uses replica sets — a primary-only-writes design with elections triggered by heartbeat timeouts. Each shard has its own replica set with its own primary; there is no per-chunk consensus group. This means MongoDB cannot do strongly-consistent cross-shard transactions cheaply — distributed transactions (added in v4.2) use a two-phase commit coordinator and are 2–10x slower than single-shard transactions. CockroachDB and TiKV chose Multi-Raft precisely so that every range has its own Raft group and cross-range transactions can run a distributed commit protocol directly on top of consensus.

Cluster size limits and what people actually run

The documented limit is 1024 shards per sharded cluster. Production deployments that approach this are rare; most heavy users run 3–12 shards. Yelp's well-known case study in 2018 ran 64 shards across hundreds of TB; at the time it was one of the largest. Most shops find that adding shards beyond ~12 hits diminishing returns — the per-chunk metadata overhead, balancer scheduling complexity, and operational burden start to outweigh the per-shard throughput gain. The real production pattern at scale is: more capacity per shard (bigger nodes, NVMe, more RAM), not more shards. If you find yourself wanting 50 shards, consider whether a different database (a Multi-Raft system, or a key-value store layered with an analytical engine) fits your workload better.

What managed services hide

Atlas and AWS DocumentDB hide the config servers from you — you do not choose their instance type, you do not see their CPU graphs, you do not page on their disk. On Atlas you pick a "M-series" tier per shard and Atlas runs the rest. This is fine 99% of the time. The 1% is when balancer behaviour matters (DocumentDB notably has a different sharding implementation that does not actually use chunk migrations; it pre-splits on collection creation) and you need to know what is happening underneath. Read the documentation for whichever managed service you are using before you assume MongoDB-on-EC2 mental models apply.

References

  1. Sharding — MongoDB Manual — the canonical reference for cluster topology, balancer behaviour, and chunk management.
  2. Choosing a Shard Key — MongoDB Manual — official guidance on hashed vs ranged vs zoned, with anti-patterns.
  3. Sharded Cluster Components — MongoDB Architecture — deep dive into mongos, config servers, and shard replica sets.
  4. Reshard a Collection — MongoDB Manual — the v5.0+ online resharding mechanism.
  5. Scaling MongoDB at Yelp — production case study of running large sharded MongoDB clusters at scale.
  6. Amazon DocumentDB Sharding — AWS's MongoDB-compatible service and how its sharding model differs from upstream.