In short
A sharded database routes a query to one shard when the WHERE clause names the shard key. Scatter-gather is the only thing it can do when the predicate names anything else. The coordinator (or the client, depending on architecture) broadcasts the query to all N shards in parallel — that is the scatter. It waits for all N to respond — that is the gather. It combines the partial results client-side and hands one answer back. The pattern is universal: Vitess does it for cross-keyspace selects, Citus does it for distributed SQL, Presto and Trino are scatter-gather engines from the ground up, Spark's shuffle is the same idea, MongoDB's mongos does it whenever a query lacks the shard key.
Three classes of query require scatter-gather: aggregates without the shard key (SELECT SUM(amount) FROM orders), filters on non-PK columns without a GSI (WHERE last_login > ...), and global ORDER BY LIMIT across the whole dataset. Anything without a shard-key predicate is a scatter.
The pros are obvious — it works for any query, no schema changes, no second copy of the data. The cons compound at scale. Latency becomes max-of-N rather than one round trip, so the slowest shard owns the response time and tail latency explodes as the cluster grows. Network traffic multiplies by N. A single slow shard stalls the whole query. Combining results gets non-trivial as soon as you leave SUM/COUNT for AVG, percentiles, or DISTINCT — some cannot be combined cleanly at all.
This chapter is what the previous one's GSI was protecting you from. Knowing the cost is the first step in budgeting for it.
It is Monday morning. The product team has shipped a new analytics dashboard and your pager is going off because the first widget — "orders across all users yesterday" — takes 12 seconds to load and times out half the time. The orders table is sharded by user_id. The dashboard query has no user_id predicate; it asks for a sum across the entire table. The router has nothing to route on. It does the only thing it can — sends the query to every shard in parallel and adds the answers up.
That is scatter-gather. With 50 shards, this single widget does 50 round trips per refresh, and the refresh fires on every open browser. The sum is fast on each shard; the whole thing is slow because one shard is GC pausing and the coordinator waits.
The previous chapter built a global secondary index to avoid scatter-gather on hot OLTP paths like login by email. This chapter is about the path you are forced down when a GSI does not exist, cannot exist, or would not pay for itself.
The algorithm
The mechanics are straightforward enough to fit in a paragraph; the cost model takes longer.
A query Q arrives at the coordinator (or at a client library that plays the same role — Vitess's vtgate, Citus's coordinator node, the application code in front of a sharded MongoDB cluster). The coordinator parses Q, sees no shard-key predicate, and decides the query must scatter. It opens N concurrent connections — one per shard — and sends Q to each. It waits for all N responses to come back, holding the partial results in memory. Once the slowest one arrives, it runs a combine step appropriate to Q's shape — sum the sums, merge-sort the partial sorts, union the row sets — and sends one final response to the caller.
Why parallelism matters here but does not save you: a single shard answering Q takes time Ti. Done sequentially, the total wall clock is sum(Ti). Done in parallel, it is max(Ti). Parallelism turns sum into max — a real win. The catch is that max(Ti) is bounded below by the slowest shard. If even one of N shards is slow — GC pause, bad disk, neighbour noise — the whole query waits for it. Parallelism limits latency to the slowest, not the average.
The coordinator usually applies a timeout to each shard so that one stuck shard cannot indefinitely freeze the query. When the timeout fires you face a choice: abort the whole query (the safe answer is "we do not know") or return a partial result with a flag (the available answer is "here is what came back from 47 of 50 shards"). Different systems pick different defaults; the right answer is workload-dependent.
Query types that require scatter-gather
Four shapes show up over and over.
Aggregates without the shard key. The dashboard widget. SELECT SUM(amount) FROM orders WHERE event_time > NOW() - INTERVAL '1 day'. The shard key is user_id; the filter is on event_time. Every shard owns some of the orders from yesterday, so every shard must contribute. If the system is clever, each shard returns its local sum (a single number) and the coordinator sums those — bandwidth-cheap, easy combine. If the system is dumb, each shard returns every matching row and the coordinator does the SUM itself — bandwidth-expensive, same logical answer.
Filters on non-PK columns without a GSI. SELECT * FROM users WHERE last_login > '2026-01-01'. The shard key is user_id; you have no GSI on last_login (rightly, because it changes too often to index — see the previous chapter). Every shard must scan its local rows for matches. The coordinator unions the row sets and returns them.
Global ORDER BY LIMIT. SELECT * FROM events ORDER BY timestamp DESC LIMIT 100. The top 100 globally could come from any shard; you do not know in advance. Each shard returns its own top 100 by timestamp (because the top 100 globally is some subset of the union of every shard's local top 100), and the coordinator merge-sorts those N×100 rows and takes the top 100. The combine step is non-trivial; the bandwidth scales with the LIMIT and the number of shards rather than with the table size.
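The gather side of this pattern can be sketched with heapq.merge, which merge-sorts the already-sorted per-shard slices lazily instead of concatenating and re-sorting everything. The shard payloads below are invented for illustration:

```python
import heapq

def global_top_k(per_shard_results, key, k, descending=True):
    """Merge per-shard sorted slices into the global top k.

    Each shard is assumed to have pushed the LIMIT down and returned at
    most k rows, already sorted, so the merge sees shards * k rows."""
    merged = heapq.merge(*per_shard_results, key=key, reverse=descending)
    return [row for _, row in zip(range(k), merged)]

# Hypothetical shard outputs, each already sorted by timestamp, newest first.
shard_a = [{"ts": 90}, {"ts": 40}, {"ts": 10}]
shard_b = [{"ts": 95}, {"ts": 60}]
shard_c = [{"ts": 70}, {"ts": 65}, {"ts": 5}]

top3 = global_top_k([shard_a, shard_b, shard_c], key=lambda r: r["ts"], k=3)
# → [{'ts': 95}, {'ts': 90}, {'ts': 70}]
```

The merge never materialises more than shards × k rows, which is the bandwidth bound the text describes.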
Any query without a shard-key predicate. This is the catch-all. Compliance scans ("find all rows containing this PII string"). Ad-hoc developer queries. Reports the analytics team writes once. Backfills. Anything that wants to look at the whole table touches every shard.
A fifth shape — cross-shard joins — is a scatter-gather of one side followed by per-shard lookups against the other, and it is bad enough that most sharded systems either disallow it or push it to a separate analytics engine.
The latency cost
The headline property of scatter-gather is the one that makes it expensive at scale: tail latency dominates total latency.
Consider a single shard with response time distribution X — say, median 2 ms and p99 of 10 ms — a long tail caused by GC pauses, disk hiccups, and cache misses. A single-shard query has p99 = 10 ms. Easy.
A scatter-gather over N shards waits for max(X1, ..., XN). The distribution of the max of N independent samples is shifted dramatically toward the tail. If each shard's p99 is 10 ms — meaning each shard takes longer than 10 ms 1% of the time — then at N=10 shards, the probability that all ten finish within 10 ms is 0.99^10 ≈ 0.904. So the probability that at least one takes longer than 10 ms is ~9.6%. At N=100 shards, the probability that all finish within 10 ms drops to 0.99^100 ≈ 0.366. A "1% slow" event hits two-thirds of all queries.
Why this matters: tail amplification means scatter-gather latency is set by your slowest shard's worst behaviour, not by the typical shard's typical behaviour. Doubling the shard count does not halve the work per shard — work per shard might be roughly constant — but it does increase the chance that some shard hits its tail on any given query. As clusters grow from 10 to 100 to 1000 shards, p99 latency on scatter-gather queries grows even though no individual shard got slower.
The single shard's p99 of 10 ms cascades into a 100-shard scatter-gather p99 closer to 25-40 ms, and the 1000-shard scatter-gather p99 is 60-100 ms or more. None of the shards got slower; the math of taking the max did the damage.
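The amplification is one line of arithmetic: the chance that at least one of N independent shards hits its own tail on a given query. A minimal sketch:

```python
def p_any_shard_slow(n_shards, per_shard_tail=0.01):
    """Probability that at least one of N independent shards exceeds its
    own p99 threshold on a single scatter-gather query."""
    return 1 - (1 - per_shard_tail) ** n_shards

for n in (1, 10, 100, 1000):
    print(f"{n:>4} shards: {p_any_shard_slow(n):.3f}")
# 1 shard: 0.010, 10 shards: 0.096, 100 shards: 0.634, 1000 shards: ~1.000
```

At 1000 shards, essentially every scatter-gather query waits on at least one shard that is having a bad millisecond.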
The bandwidth cost
The second cost is data movement. Every shard returns something to the coordinator. What it returns depends on the query.
For a SUM with no per-row payload, each shard returns one number; total bandwidth is N numbers. Trivially small even for N=1000.
For a filter — SELECT * FROM users WHERE last_login > '2026-01-01' — each shard returns however many rows match. If a fifth of users have logged in since January, every shard returns 20% of its rows. The coordinator now has 20% of the entire table in memory before it sends a single byte to the caller. For a 100 GB table, that is 20 GB of bandwidth and 20 GB of coordinator memory pressure for one query.
The mitigation is the obvious one: filter and limit at each shard, not just at the coordinator. A LIMIT 100 should be pushed down to each shard so each returns at most 100 rows, not "all matches, the coordinator will pick 100". Aggregates should be computed locally so a SUM stays a single number across the wire, not a row stream the coordinator sums afterward. Most production query engines do this push-down automatically; ad-hoc queries against a SQL frontend may not.
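A toy gather that pushes the LIMIT down, assuming each shard's rows arrive pre-sorted; the row counts are made up to show the bandwidth difference:

```python
def gather_with_pushdown(shard_rows, limit):
    """Each shard contributes at most `limit` of its (sorted) rows; the
    coordinator merges and re-applies the limit. Returns the result and
    the number of rows that crossed the network."""
    partials = [rows[:limit] for rows in shard_rows]   # per-shard LIMIT push-down
    rows_moved = sum(len(p) for p in partials)
    merged = sorted(r for p in partials for r in p)
    return merged[:limit], rows_moved

# Three hypothetical shards holding ~333 rows each of a 1000-row table.
shards = [sorted(range(i, 1000, 3)) for i in range(3)]
result, moved = gather_with_pushdown(shards, limit=100)
# result is the global smallest 100 rows; 300 rows crossed the wire, not ~1000
```

The answer is identical to the no-push-down version; only the bytes moved and the coordinator's buffering change.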
A practical anti-pattern: a developer runs SELECT * FROM events WHERE region = 'IN' against a sharded table, meaning "show me the first few hundred", but forgets the LIMIT. Every shard returns every Indian event. The coordinator buffers 50 GB on its way to the developer's psql session. The terminal hangs, the network saturates, the on-call gets a ticket.
Python implementation
A minimal scatter-gather coordinator in Python. Async because every shard call is a network round trip; you want them in flight simultaneously, not one at a time.
import asyncio

async def scatter_gather(query, shards, timeout=2.0):
    """Run query on all shards in parallel; combine partial results."""
    async def call(shard):
        try:
            return await asyncio.wait_for(shard.execute(query), timeout)
        except asyncio.TimeoutError:
            return None  # straggler — caller decides what to do

    tasks = [asyncio.create_task(call(s)) for s in shards]
    partials = await asyncio.gather(*tasks)
    survivors = [p for p in partials if p is not None]
    if len(survivors) < len(shards):
        # log lost shards; some queries can answer from a subset
        pass
    return combine(query, survivors)

def combine(query, partials):
    op = query.aggregate
    if op == "sum":   return sum(p[0] for p in partials)
    if op == "count": return sum(p[0] for p in partials)
    if op == "min":   return min(p[0] for p in partials)
    if op == "max":   return max(p[0] for p in partials)
    if op == "avg":
        total = sum(p["sum"] for p in partials)
        n = sum(p["count"] for p in partials)
        return total / n if n else None
    if op == "order_by_limit":
        merged = []
        for p in partials:
            merged.extend(p)
        merged.sort(key=lambda r: r[query.order_col], reverse=query.desc)
        return merged[:query.limit]
    raise NotImplementedError(op)
Why the AVG case looks different: you cannot average pre-computed averages and get the right answer — (avg(A) + avg(B)) / 2 is not avg(A ∪ B) unless the partitions are equal-sized. Each shard must return its local sum and local count separately, and the coordinator does the division on the totals. This is the simplest example of why "the combine step is non-trivial" once you leave SUM and COUNT.
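A two-shard counterexample makes the AVG rule concrete; the shard contents are invented:

```python
# Shard A holds one 100-unit order; shard B holds nine 10-unit orders.
shard_a = {"sum": 100.0, "count": 1}   # local avg 100
shard_b = {"sum": 90.0, "count": 9}    # local avg 10

# Wrong: averaging the per-shard averages ignores the partition sizes.
avg_of_avgs = (shard_a["sum"] / shard_a["count"]
               + shard_b["sum"] / shard_b["count"]) / 2   # 55.0

# Right: divide the total sum by the total count.
true_avg = ((shard_a["sum"] + shard_b["sum"])
            / (shard_a["count"] + shard_b["count"]))      # 190 / 10 = 19.0
```

The two answers agree only when every shard holds the same number of matching rows, which sharding by hash never guarantees per query.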
Production code adds cross-shard cancellation, transient retries, request hedging (covered later), connection pooling, per-shard latency logging, and gather-step metrics. The 30 lines above are the algorithm; production is the cost-management harness around it.
Combining results — per aggregate function
The combine function is where scatter-gather goes from "send the same query everywhere" to "actually compute the right answer". Each aggregate has its own reduction rule.
SUM. Each shard returns its local sum; the coordinator sums those. Trivially associative. Linear in N for the gather step, constant per shard.
COUNT. Each shard returns its local count; the coordinator sums those. Same shape as SUM.
AVG. Each shard must return two numbers — its local sum and its local count. The coordinator divides total sum by total count. If you only return the per-shard average, you cannot reconstruct the global average without weights.
MAX, MIN. Each shard returns its local max (or min); the coordinator picks the global max (or min). Trivially associative.
ORDER BY LIMIT N. Each shard sorts locally and returns its own top N rows, never fewer, because the global top N might sit entirely on one shard. The coordinator merge-sorts the per-shard slices (each already sorted) and takes the top N, touching shard-count × N rows in total. Bandwidth scales with shard count times LIMIT, not table size.
GROUP BY plus aggregate. Each shard groups locally and returns one row per group with the local aggregate. The coordinator merges rows with the same group key, combining their aggregates with the rule above (SUM of SUMs, sum-and-count for AVG, etc.). Cardinality of the result is bounded by total distinct group keys across the cluster, not by the table size — this is what makes group-by analytics tractable.
DISTINCT. Each shard returns its locally distinct values; the coordinator unions and dedupes. Bandwidth is bounded by the cardinality of the column, not the table size.
Percentile. Hard. You cannot combine two p99s and get a global p99. The classical exact answer requires returning every value (or a sorted run that lets you index into the merged sequence). The practical answer is approximate quantile sketches — t-digest, HDR histograms, KLL — which are mergeable at small fixed cost. Modern engines (Trino, ClickHouse, Spark) ship approximate-percentile functions that scatter-gather correctly; engines without them either return wrong answers or scan the whole dataset to the coordinator. Knowing this distinction is the difference between trusting and not trusting your dashboard's "p99 latency" widget.
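The GROUP BY rule above can be sketched as a dict merge over per-shard partials, carrying sum-and-count state per group so that AVG stays mergeable too. The payload shape here is an assumption for illustration, not any engine's wire format:

```python
from collections import defaultdict

def combine_group_by(partials):
    """Merge per-shard {group_key: {"sum": s, "count": c}} partials and
    derive per-group SUM, COUNT and AVG at the coordinator."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for partial in partials:
        for key, agg in partial.items():
            totals[key]["sum"] += agg["sum"]
            totals[key]["count"] += agg["count"]
    # AVG is computed last, from the merged totals, never from local averages.
    return {k: {**v, "avg": v["sum"] / v["count"]} for k, v in totals.items()}

shard1 = {"2026-02-01": {"sum": 300.0, "count": 3}}
shard2 = {"2026-02-01": {"sum": 100.0, "count": 1},
          "2026-02-02": {"sum": 50.0, "count": 2}}
daily = combine_group_by([shard1, shard2])
# daily["2026-02-01"] == {"sum": 400.0, "count": 4, "avg": 100.0}
```

The coordinator's memory is bounded by the number of distinct group keys, not by the number of rows scanned on the shards.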
When scatter-gather is unavoidable
Some queries genuinely have no other answer. Pretending they do is the engineering mistake.
Analytics on unindexed columns. A new product question — "how many users in city X last week" — comes up Tuesday and is needed Wednesday. Building a GSI on city takes days and adds permanent write load; one scatter-gather costs an afternoon. Scatter is right when the question is rare and exploratory.
Complex joins across sharded tables. Two tables sharded on different keys, joined on a third column, is by definition not single-shard. The shape is: scatter one side, build a per-shard map, scatter the other side carrying lookup probes, gather. Most engines push this to an offline batch system.
Compliance queries. "Find all data belonging to user X across every table that may hold personal information." Every table, every shard. No way to make this single-shard without pre-computing every PII relationship — itself a costly index.
Development and ad-hoc queries. Debugging incidents, exploring schemas, validating backfills. Not hot paths; scatter-gather is fine.
The pattern: read-rare, latency-tolerant, schema-flexible. When all three hold, scatter is the right shape.
When to avoid scatter-gather
Some queries should never scatter, even if they technically can.
Hot-path OLTP. Login, profile fetch, payment authorisation. A scatter-gather adds tail-latency exposure to a code path the user is waiting on. Build a GSI; pay write-side cost; keep reads single-shard.
High-frequency queries. A widget running hundreds of times per second multiplies the scatter cost by that frequency. 50 ms once a minute is fine; the same scatter at 200 QPS is 200 × N round trips per second.
Queries that return lots of data. SELECT * over a sharded table is a memory bomb on the coordinator. Rewrite with a tight LIMIT, build a GSI, or move it offline.
Anything covered by an existing GSI. Engines occasionally pick the wrong plan; review query plans on sharded systems because the plan space includes "scatter the whole cluster".
The first question to ask of any new sharded query is "does this filter on the shard key, on a GSI'd column, or neither?". The answer is a rough measure of the query's expected cost.
Tail-latency mitigation
The math says scatter-gather p99 grows with cluster size; engineering pushes back with tricks that flatten the tail. Three are worth knowing.
Hedged requests. Send the query to two replicas of each shard simultaneously, take the first response, cancel the slower one. Cost: roughly 2× read traffic. Gain: one slow replica out of two is unlikely to also be slow on the second, so the per-shard tail collapses. Google's Tail at Scale paper (Dean and Barroso, 2013) introduced the pattern; it is now standard. The variant deferred hedge waits a few milliseconds before sending the second request — same effect, lower steady-state cost.
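A minimal deferred-hedge sketch in asyncio. The replica coroutine is a stand-in for a real shard-replica call, and the timings are invented:

```python
import asyncio

async def hedged(primary, backup, hedge_delay=0.005):
    """Deferred hedge: fire the primary, wait hedge_delay; if it has not
    answered, fire the backup too, take the first response, cancel the rest.
    primary/backup are zero-arg coroutine factories (two replicas)."""
    tasks = {asyncio.create_task(primary())}
    done, pending = await asyncio.wait(tasks, timeout=hedge_delay)
    if not done:                                   # primary looks slow: hedge
        tasks.add(asyncio.create_task(backup()))
        done, pending = await asyncio.wait(tasks,
                                           return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()                                 # drop the loser
    return done.pop().result()

async def replica(latency_s, answer):              # stand-in for shard.execute
    await asyncio.sleep(latency_s)
    return answer

# Primary stuck at 100 ms; the hedge fires after 5 ms and wins in 1 ms.
answer = asyncio.run(hedged(lambda: replica(0.100, "slow"),
                            lambda: replica(0.001, "fast")))
# → "fast"
```

When the primary answers inside the hedge delay, the backup request is never sent, which is what keeps the steady-state cost of deferred hedging well under 2×.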
Timeout plus partial answer. Set a per-shard timeout shorter than the budget. When it fires, return what you have and flag the query as partial. For dashboards, "SUM across 47 of 50 shards, 2 stragglers" is more useful than a 30-second hang.
Queue prioritisation. Run scatter-gather queries in a low-priority lane on each shard, separate from OLTP traffic. The scatter accepts longer per-shard latency in exchange for not stalling user requests. Postgres has statement timeouts; CockroachDB has admission control; ScyllaDB has workload prioritisation.
One day's order summary across 50 shards
The analytics team's dashboard runs a single query every refresh:
SELECT DATE(event_time), SUM(amount)
FROM orders
WHERE event_time > NOW() - INTERVAL '1 day'
GROUP BY DATE(event_time);
The orders table is sharded by user_id. The shard key has nothing to do with event_time, so the query is a scatter-gather across all 50 shards.
Each shard does the obvious local thing. It scans yesterday's index on event_time, filters its local orders, groups by date, and returns one row per date — at most two rows in this query (yesterday and today, since "the last 24 hours" can straddle midnight). 50 shards × 2 rows = 100 rows of network traffic for the gather step.
The coordinator unions the 100 rows and combines: rows with the same DATE(event_time) have their SUM(amount) values added. The result is two rows.
End-to-end latency: each shard's local query takes around 5 ms (tight time-range, indexed scan, local sum). The coordinator waits for the slowest of 50 shards to finish — typically 10-12 ms because of tail amplification. The combine step is a millisecond. Total: ~13 ms.
The same query without the time predicate — "lifetime SUM by date" — would scan every shard's full table. Each local query becomes seconds. The 50-shard max becomes tens of seconds. Same scatter pattern, completely different cost — because the per-shard work changed, not the cluster topology.
Common confusions
- "Scatter-gather is expensive, avoid it." It is the only path for many legitimate queries — analytics, compliance, cross-shard reports. The right framing is budget it: know which queries scatter, count them, monitor their cost, and decide per query whether to leave them as scatter or pay for a GSI. Refusing to scatter at all forces a GSI-per-question schema that becomes its own write-amplification disaster.
- "Scatter-gather is parallel so it is fast." Parallelism limits latency to max-of-N, not min-of-N or average-of-N. Tail-latency math means max grows much faster with N than the average shard's latency. A 100-shard scatter-gather is not "as fast as a single-shard query" — it is "as slow as the slowest shard, which is rarely the median".
- "The coordinator can simplify any query." It cannot. Some queries are fundamentally not associative: percentiles, "the median row across the cluster", "the second-most-common value". They either require approximation (sketches), full-data shipping (every row to the coordinator), or multi-pass algorithms (a first scatter to compute statistics, a second informed by the first). Read the engine's documentation before assuming PERCENTILE_CONT(0.99) does what you expect.
- "A GSI eliminates scatter-gather." It eliminates scatter-gather for the column it indexes. A GSI on email lets you log in by email without scattering; it does nothing for a query filtering on last_login or country. Each non-shard-key query that needs to be fast wants its own GSI, and at some point the GSI portfolio is more expensive than budgeted scatters.
- "All N shards must respond before the coordinator can answer." That is the default but not the requirement. For aggregate queries that tolerate approximate results — "estimate of total orders, ignoring 2 stragglers" — the coordinator can return early once enough shards report. For exact answers, you wait for everyone or fail the query.
- "Scatter-gather scales linearly with shard count." Per-shard work scales sub-linearly (work is split N ways), but coordinator-side work — connection management, response buffering, combine — scales linearly with N, and tail latency scales worse than linearly. The bottleneck shifts from shards to coordinator as N grows.
Going deeper
Presto and Trino — scatter-gather as architecture
Presto, Trino, and their Spark cousins are scatter-gather engines from the ground up. A query plan in Trino is a tree of operators where the leaves scan source tables across many connectors (Hive, Kafka, S3, Cassandra) and inner nodes are exchange operators that shuffle data between worker stages. The scatter-gather model is so deeply baked in that the coordinator-worker split becomes the unit of horizontal scaling. Data-locality awareness matters: if a worker can pin to the rack hosting the data, network bandwidth drops dramatically; cross-rack scatters at scale eat the network before they eat the CPU. The Trino docs on stage scheduling and the Presto paper (Sethi et al., 2019) are the readings here.
Federated queries across heterogeneous stores
Modern data platforms run scatter-gather across multiple kinds of system at once — Postgres for OLTP, Kafka for events, S3 Parquet for cold data, ClickHouse for analytics. The pattern is the same shape: a coordinator parses the query, decides per-source what to push down, runs the parts in parallel, and combines. Trino, Presto, and Apache Calcite implement this; the design challenge is push-down semantics — what filters and aggregates can each connector handle, and what falls back to the coordinator. The cost of falling back is the entire dataset crossing the network.
Adaptive aggregation — direct vs scatter
A mature router may choose per-query between routing direct (one shard) and scattering (all shards) based on the predicate. A WHERE id IN (1, 2, 3) with three shard-key values targets at most three shards, not all of them — a "multi-shard direct" pattern that is neither single-shard nor full scatter. The router computes the destination set from the predicate and broadcasts only to those shards. Vitess, Citus, and modern MongoDB routers all do this. The trick is making the predicate analyser smart enough to spot subset-targeting cases — IN, equality, BETWEEN over the shard-key column — without misclassifying a full scatter as a partial one.
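A toy predicate analyser for the equality/IN case, assuming hash-mod placement; the shard function here is an identity stand-in for the demo, not a real router's hash:

```python
def target_shards(shard_key_values, n_shards, shard_fn):
    """Map the shard-key values extracted from an IN/equality predicate to
    a destination shard set; no extractable values means full scatter."""
    if not shard_key_values:
        return set(range(n_shards))          # no shard-key predicate: scatter
    return {shard_fn(v) % n_shards for v in shard_key_values}

# WHERE user_id IN (1, 2, 3) on 50 shards: multi-shard direct, 3 targets.
targets = target_shards([1, 2, 3], 50, shard_fn=lambda v: v)
# → {1, 2, 3}

everyone = target_shards([], 50, shard_fn=lambda v: v)   # full scatter, 50 shards
```

The hard part in a real router is not this arithmetic but the predicate analysis that feeds it: proving that a WHERE clause constrains the shard key to a finite value set without misreading a full scatter as a partial one.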
Where this leads next
Chapter 97 is cross-shard transactions — what happens when a write must touch rows on more than one shard atomically. Scatter-gather is the read-side pattern; cross-shard transactions are the write-side analogue, and they are far harder to make work correctly than scatter is to make work at all.
Chapter 98 is the atomic-commit wall — the theoretical and practical limits on what a sharded system can do across shards without paying the latency cost of two-phase commit or Paxos. Scatter-gather hides behind the read-only assumption; the moment you need to mutate, the cost model changes shape.
Build 12 has been about the consequences of breaking up a single logical table across many physical shards. The shard key (chapter 92) decides who owns what, the GSI (chapter 95) gives you a second access path on a non-shard-key column, and scatter-gather (this chapter) is the catch-all for queries the other two cannot answer. Each is a tool whose cost you trade off against the rest. The systems that work at scale are the ones whose engineers know which tool a given query needs and refuse to use the wrong one out of laziness or hope.
References
- PlanetScale / CNCF Vitess, VTGate Query Planning — Vitess's coordinator architecture, including how vtgate decomposes queries into single-shard, multi-shard direct, and scatter modes, and the routing rules that pick between them.
- Citus Data, Distributed Query Planning in Citus — Postgres-on-shards: how Citus pushes filters and aggregates down to worker nodes, gathers partial results at the coordinator, and combines them with Postgres's own executor.
- Sethi et al., Presto: SQL on Everything, ICDE 2019 — the canonical Presto/Trino paper, with the cleanest published explanation of multi-stage scatter-gather query execution and its trade-offs.
- Dean and Barroso, The Tail at Scale, Communications of the ACM 2013 — the foundational paper on tail-latency amplification, hedged requests, and the techniques that make scatter-gather tolerable at large N.
- Melnik et al., Dremel: Interactive Analysis of Web-Scale Datasets, VLDB 2010 — Google's original scatter-gather analytics engine, the architectural ancestor of BigQuery, Trino, Drill, and the modern interactive-analytics category.
- Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the textbook treatment of partitioning, secondary indexes, and the read-side query patterns including the latency consequences of scattering across N partitions.