In short
A hash join answers an equi-join R ⋈_{R.a = S.b} S in two phases. The build phase scans the smaller relation R, hashes each row on the join key, and inserts it into an in-memory hashmap keyed by that value. The probe phase scans the larger relation S, hashes each row the same way, looks it up in the map, and emits a merged tuple for every match. That is |R| inserts plus |S| lookups, each O(1) amortised — total O(|R| + |S|) in CPU and B_R + B_S page reads. For a million-by-million equi-join, hash join is five to six orders of magnitude faster than naive nested-loop. The trap is memory: the hashmap must hold every row of R at once, and a 5 GB R does not fit in a 100 MB buffer pool. External (grace) hash join fixes this by partitioning: hash both relations into k partition files using the same hash function, so matching keys always land in the same partition pair (R_i, S_i). Then run in-memory hash join on each pair independently. I/O is 3(B_R + B_S) — read both, write both partitioned copies, read both again. Hybrid hash join (Postgres, DuckDB, every MPP engine) keeps partition 0 in memory during the build pass and probes it without ever writing it to disk, saving 2(B_R + B_S)/k page operations. Hash join is equi-only — the hash function only respects equality — and the same partition-then-aggregate pattern is reused for GROUP BY, which the next chapter builds out.
The query that breaks nested-loop join
You have two tables. users has 10 million rows, 50 bytes each. orders has 50 million rows, 100 bytes each. Neither has a usable index on the join key. You run:
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id;
At 100 rows per page, users is B_R = 10⁵ pages, orders is B_S = 5·10⁵ pages. Plug into nested-loop join's cost formula: even with block NLJ, you scan the inner once per block of the outer, and with a 1000-page buffer pool the number of outer blocks is ⌈10⁵ / 998⌉ ≈ 100. I/O is 10⁵ + 100 · 5·10⁵ ≈ 5·10⁷ page reads. Twenty-five thousand seconds at 0.5 ms/page — about seven hours. The query does not finish in any useful sense.
Hash join does it in 10⁵ + 5·10⁵ = 6·10⁵ page reads — 300 seconds — if the build side fits in memory. When it does not, external hash join does it in 1.8·10⁶ reads — about fifteen minutes. Still nearly thirty times faster than the NLJ nobody would ever run.
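The arithmetic is easy to check in a few lines; the row counts, 100-rows-per-page figure, and 0.5 ms/page latency are the chapter's own assumptions:

```python
# Back-of-envelope check of the three I/O figures above.
ROWS_PER_PAGE = 100
B_R = 10_000_000 // ROWS_PER_PAGE        # users  -> 1e5 pages
B_S = 50_000_000 // ROWS_PER_PAGE        # orders -> 5e5 pages
MS_PER_PAGE = 0.5

block_nlj = B_R + 100 * B_S              # ~100 outer blocks, inner rescanned per block
in_memory_hash = B_R + B_S               # one pass over each relation
grace_hash = 3 * (B_R + B_S)             # partition pass (read+write) + join pass (read)

for name, pages in [("block NLJ", block_nlj),
                    ("in-memory hash", in_memory_hash),
                    ("grace hash", grace_hash)]:
    print(f"{name:15s} {pages:>12,} pages {pages * MS_PER_PAGE / 1000:>9.0f} s")
```

Block NLJ lands at roughly 25,000 seconds, in-memory hash at 300, grace at 900 — the same numbers as the prose.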
The iterator model already tells you what the executor contract is: open(), next(), close(). A hash join is a blocking operator on the build side — it has to consume all of R inside open() before it can return its first result. That is a structural fact about hashing; you cannot probe a hash table that is not built yet. Keep this in mind — it is why hash join cannot be placed under a LIMIT 1 as cheaply as an index nested-loop join can.
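The blocking behaviour is easy to demonstrate with a generator-based sketch (toy rows, not the chapter's operator classes): the first output row cannot appear until every build row has been consumed.

```python
# Instrument the build side to record how much of it has been consumed.
consumed = []

def build_side():
    for i in range(3):
        consumed.append(i)
        yield (i, f"r{i}")

def hash_join(build, probe, build_key, probe_key):
    table = {}
    for r in build:                       # blocks here: whole build side first
        table.setdefault(build_key(r), []).append(r)
    for s in probe:                       # only then does probing stream
        for r in table.get(probe_key(s), ()):
            yield r + s

it = hash_join(build_side(), iter([(1, "x")]), lambda r: r[0], lambda s: s[0])
first = next(it)                          # ask for just ONE output row...
print(consumed)                           # ...yet every build row was consumed
```

This is why a `LIMIT 1` above a hash join still pays the full build cost, while an index nested-loop join could stop after one probe.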
The in-memory hash join
Here is the algorithm, written against the iterator model. It follows the shape of NestedLoopJoinExec from the previous chapter but with one crucial difference: there is no rewind. The build side is consumed exactly once, in open(), and the probe side is streamed once through next().
```python
# query/exec/hj.py
from dataclasses import dataclass, field
from typing import Iterator, Callable, Any
from collections import defaultdict

Row = tuple

@dataclass
class HashJoinExec:
    build: Any                          # smaller side; consumed once in open()
    probe: Any                          # larger side; streamed in next()
    build_key: Callable[[Row], Any]     # extract join key from a build row
    probe_key: Callable[[Row], Any]     # extract join key from a probe row
    _table: dict = field(default_factory=lambda: defaultdict(list))

    def __iter__(self) -> Iterator[Row]:
        # --- BUILD PHASE: one full pass over the build side, into the hashmap
        self.build.open()
        try:
            for r in self.build:
                self._table[self.build_key(r)].append(r)
        finally:
            self.build.close()
        # --- PROBE PHASE: one pass over the probe side, with O(1) lookups
        self.probe.open()
        try:
            for s in self.probe:
                for r in self._table.get(self.probe_key(s), ()):
                    yield r + s         # every matching pair
        finally:
            self.probe.close()
```
Barely thirty lines. The two phases are the two try blocks. The hashmap is a defaultdict(list) because build keys are not necessarily unique — ten users with the same id? No. Ten orders with the same user_id? Yes. Each bucket is a list of rows, and the probe emits one output row for every (build row, probe row) pair in the matching bucket.
Why the build side is consumed entirely before the probe begins: a hash table with half its keys missing will silently return wrong answers. If you streamed both sides simultaneously and a probe row arrives before its matching build row, you would miss the match forever. The whole correctness argument for hash join rests on the build being complete when the probe starts — which is also what makes hash join's open() blocking in the iterator-model sense from the previous chapter.
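A toy run of the two phases, written inline rather than through the operator class (the tables and rows here are made up for illustration):

```python
from collections import defaultdict

users  = [(1, "ada"), (2, "bob"), (3, "cyd")]               # (id, name)        -- build side
orders = [(101, 1, 30.0), (102, 1, 12.5), (103, 3, 9.9)]    # (oid, uid, total) -- probe side

# BUILD: one complete pass over users, keyed on id.
table = defaultdict(list)
for u in users:
    table[u[0]].append(u)

# PROBE: one pass over orders; duplicate probe keys (user 1 has two
# orders) produce one output row per order.
out = [u + o for o in orders for u in table.get(o[1], ())]
print(out)
```

User 2 has no orders and is absent from the output — this is an inner join. An outer-join variant would additionally track which build rows never matched.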
Why this only works for equi-joins: the hash function h(k) maps values to buckets so that h(k₁) = h(k₂) whenever k₁ = k₂. That is an equality statement, not an inequality one. For R.a < S.b, two matching rows r, s can have arbitrary key values — h(r.a) and h(s.b) can be in any two buckets — and there is no way to organise buckets so that a probe finds all its matches without scanning every bucket, at which point you have reinvented the nested-loop join.
Deriving the cost
Two numbers: CPU comparisons and page reads. Let B_R, B_S be the page counts of build and probe, and P the rows per page.
CPU. Build does |R| hash inserts, each O(1) average. Probe does |S| hash lookups, each O(1) average, returning an average of |R|/|keyspace| matches per probe row:

C = |R| inserts + |S| lookups + |output| emits = O(|R| + |S| + |output|)
Why O(1) average and not O(1) worst-case: a hash bucket with k collisions degrades to O(k). If every build row hashes to the same bucket (adversarial hash function, or a single hot key), every probe lookup is O(|R|) and hash join devolves to nested-loop join. Real engines defend against this with good universal hash functions and by detecting skew at runtime (see the skew discussion below).
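A two-line illustration of the degradation on synthetic data: with a single hot key, every probe of that key must walk a bucket the size of the entire build side.

```python
from collections import defaultdict

# Adversarial build side: every row shares the same join key.
build = [("hot", i) for i in range(1_000)]
table = defaultdict(list)
for r in build:
    table[r[0]].append(r)

# One probe of the hot key must touch every build row in the bucket:
# O(|R|) per lookup -- nested-loop behaviour hiding inside a "hash" join.
comparisons_per_probe = len(table["hot"])
print(comparisons_per_probe)
```

With |S| probes of that key, total work is |R| · |S| — exactly the nested-loop complexity the hash was supposed to avoid.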
I/O. One pass over each relation:

I/O = B_R + B_S page reads
Why this is optimal: you have to read every row of every relation at least once to produce a join result that depends on every row. B_R + B_S is that minimum. No algorithm beats it. In-memory hash join is I/O-optimal for equi-joins whenever the build side fits in memory.
Plug in |R| = 10⁷, |S| = 5·10⁷, B_R = 10⁵, B_S = 5·10⁵ at 0.5 ms per SSD page: I/O is 6·10⁵ reads, about 300 seconds. Comparisons are 6·10⁷ — 60 ms of CPU. You are entirely I/O-bound, which is the right regime.
When the build side does not fit in memory
The in-memory algorithm has a concealed assumption: the whole hashmap lives in RAM. Python's defaultdict pretends this is always true. Real databases cannot. The buffer pool is typically 100 MB to a few GB; the build side is routinely tens of GB. If you blindly call _table[key].append(r) for every row of a 10 GB R, the operating system will start paging the hashmap out to swap, and your hash lookup becomes a random-access 10-millisecond disk read. You have just silently turned an O(|R|+|S|) algorithm into something worse than nested-loop join.
The database cannot rely on the OS to page for it — page replacement policies optimised for files are terrible for hashmaps. So it manages memory itself. The fix is partitioning: before the main build, split the data into chunks that each fit in memory, and hash-join each chunk independently.
Grace hash join — two passes, same hash partition
Named after the GRACE database machine project at the University of Tokyo, where the algorithm first appeared (Kitsuregawa, Tanaka, and Moto-oka, 1983). The trick is to use hashing twice: once to partition, once to join.
Phase 1 — partition. Choose a hash function h_part and a number of partitions k. Scan R and for each row, compute h_part(r.key) mod k and write the row to partition file R_i on disk. Do the same for S, using the same hash function, producing S_0, S_1, ..., S_{k-1}.
Phase 2 — join each partition. For each i ∈ [0, k), read R_i into memory, build a hashmap on it, then stream S_i through and probe. The in-memory hash join from the previous section runs untouched — you just hand it R_i and S_i as its inputs.
The magic is this: because h_part is the same function on both sides, any row with key v ends up in partition i = h_part(v) mod k on both sides. So matching rows always land in the same partition pair. You never miss a match even though you are joining one partition at a time.
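The invariant is worth seeing concretely (toy keys; Python's hash() is stable within one process, which is all a single join needs — an engine would use a fixed hash so spill files survive restarts):

```python
# Using the SAME partition function on both sides sends any shared key
# value to the same partition pair (R_i, S_i).
K = 8

def h_part(v):
    return hash(v) % K

R_keys = ["alice", "bob", "carol", 7, 42, 99]
S_keys = [42, "bob", 7, "dave"]

part_R = {k: h_part(k) for k in R_keys}
part_S = {k: h_part(k) for k in S_keys}

shared = set(R_keys) & set(S_keys)
assert all(part_R[k] == part_S[k] for k in shared)
# No cross-partition pair can contain a match, so joining the
# (R_i, S_i) pairs independently is exhaustive.
```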
Choosing k. You want each R_i to fit in memory. Let M be the available memory in pages; then k ≈ ⌈B_R / M⌉, plus some slack because partitions won't be exactly equal in size — real engines typically pick k = ⌈B_R / (0.8 · M)⌉.
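As a sketch, with the 0.8 fill factor mentioned above (the helper name is mine):

```python
import math

def num_partitions(B_R: int, M: int, fill: float = 0.8) -> int:
    """Pick k so each partition of R (~B_R/k pages) fits in fill*M pages,
    leaving slack for uneven partition sizes."""
    return math.ceil(B_R / (fill * M))

# 100,000-page build side, 8,000 pages of memory:
print(num_partitions(B_R=100_000, M=8_000))   # 16 partitions of ~6,250 pages
```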
Here is the algorithm:
```python
# query/exec/grace.py
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator
import pickle, tempfile, os
from collections import defaultdict

@dataclass
class GraceHashJoinExec:
    build: Any
    probe: Any
    build_key: Callable
    probe_key: Callable
    k: int = 16                                  # number of partitions
    _dir: str = field(default_factory=tempfile.mkdtemp)

    def _partition(self, child, key_fn, prefix):
        files = [open(f"{self._dir}/{prefix}_{i}", "wb") for i in range(self.k)]
        child.open()
        try:
            for row in child:
                i = hash(key_fn(row)) % self.k   # h_part: partition hash
                pickle.dump(row, files[i])
        finally:
            child.close()
            for f in files:
                f.close()

    def __iter__(self) -> Iterator:
        # --- Partition both sides to disk with the SAME hash function.
        self._partition(self.build, self.build_key, "R")
        self._partition(self.probe, self.probe_key, "S")
        # --- For each partition pair, run in-memory hash join.
        for i in range(self.k):
            table = defaultdict(list)
            with open(f"{self._dir}/R_{i}", "rb") as f:
                while True:
                    try:
                        r = pickle.load(f)
                    except EOFError:
                        break
                    table[self.build_key(r)].append(r)
            with open(f"{self._dir}/S_{i}", "rb") as f:
                while True:
                    try:
                        s = pickle.load(f)
                    except EOFError:
                        break
                    for r in table.get(self.probe_key(s), ()):
                        yield r + s
            os.remove(f"{self._dir}/R_{i}")
            os.remove(f"{self._dir}/S_{i}")
```
Two phases, two files per partition. In production the pickle.dump is replaced by a page-packing writer (rows laid out in the same on-disk format as tables) and the pickle.load by buffer-pool page reads; the control flow is identical.
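What "page-packing writer" means in miniature — a hypothetical sketch, with the page size and row format made up: rows accumulate in an in-memory page buffer and flush to the file as one sequential write per page.

```python
import io

class PartitionWriter:
    """Illustrative per-partition page buffer: rows accumulate in memory
    and flush as whole pages, so partition-file writes are sequential."""
    PAGE = 8192

    def __init__(self, f):
        self.f, self.buf = f, bytearray()

    def append(self, row_bytes: bytes):
        if len(self.buf) + len(row_bytes) > self.PAGE:
            self.flush()                 # emit one full page sequentially
        self.buf += row_bytes

    def flush(self):
        if self.buf:
            self.f.write(bytes(self.buf))
            self.buf.clear()

# 100 rows of 100 bytes: flushed as full ~8 KB pages plus one final partial page.
w = PartitionWriter(io.BytesIO())
for _ in range(100):
    w.append(b"x" * 100)
w.flush()
print(len(w.f.getvalue()))
```

One such writer per partition is why k logical streams still produce sequential I/O at the device.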
Deriving the I/O cost
Count page reads and writes separately because writes cost about the same as reads on SSDs but matter for device lifetime:

- Partition pass for R: read R once (B_R reads), write the partitioned copy (B_R writes).
- Partition pass for S: read S once (B_S reads), write the partitioned copy (B_S writes).
- Join pass: read each R_i once (B_R reads total) and each S_i once (B_S reads total). No writes — results stream upward through the iterator tree.
Totalling reads + writes:

I/O = (B_R + B_S) reads + (B_R + B_S) writes + (B_R + B_S) reads = 3(B_R + B_S)
Why exactly three times: every page of input is touched three times — read once in the partition pass, written once as part of a partition file, read once during the join pass. There is no fourth pass because the join phase emits rows directly to the parent operator without spilling. If the result is itself too big to materialise (a join that feeds into a sort), the sort operator spills separately; that cost is not grace hash's problem.
Why grace hash crushes BNL on large-large joins: BNL's I/O is B_R + ⌈B_R/(B-2)⌉ · B_S. For B_R = 10⁵, B_S = 5·10⁵, B = 10³, that is 10⁵ + 100 · 5·10⁵ = 5·10⁷ reads. Grace is 3(10⁵ + 5·10⁵) = 1.8·10⁶. Grace wins by a factor of ~28 even with a comfortable buffer pool. Shrink the buffer pool and BNL degrades linearly while grace stays constant (it just uses more partitions). That is why hash join is the default for large equi-joins.
Hybrid hash join — keeping partition 0 in memory
One subtle inefficiency of grace hash: partition 0's build rows are written to disk during the partition pass, then read back immediately in the join pass. If you had kept partition 0 in memory during partitioning, you could skip that round trip. Hybrid hash join (Shapiro 1986) does exactly this: it uses however much memory is available to hold partition 0's hashmap, and only spills partitions 1..k-1 to disk.
During the partition pass for R:

- For each row, compute i = h_part(r.key) mod k.
- If i == 0, insert the row into the in-memory hashmap directly.
- Otherwise, write it to partition file R_i on disk.

During the partition pass for S:

- For each row, compute i = h_part(s.key) mod k.
- If i == 0, probe the in-memory hashmap and emit matches immediately.
- Otherwise, write the row to partition file S_i on disk.
Then for i ∈ [1, k), run the normal grace phase.
Cost. Partition 0 is roughly 1/k of each relation. It is never written and never re-read, so the saving is 2(B_R/k + B_S/k) page operations:

I/O = 3(B_R + B_S) − 2(B_R + B_S)/k

For k = 10, hybrid saves about 7% of grace's total I/O; for k = 2 (plenty of memory, just slightly too little for pure in-memory), about a third. As memory grows toward holding the whole build side, k → 1 and hybrid smoothly degrades into in-memory hash join with cost B_R + B_S. This is the whole point of hybrid — it is a continuous interpolation between in-memory hash (cheap, memory-hungry) and grace (expensive, memory-light), adapting to whatever memory the executor happens to have.
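The interpolation is easy to see numerically — a sketch using the chapter's formulas (the exact size of partition 0 varies with available memory):

```python
def grace_io(B_R: int, B_S: int) -> int:
    return 3 * (B_R + B_S)

def hybrid_io(B_R: int, B_S: int, k: int) -> int:
    # Partition 0 (~1/k of each side) is never written and never re-read.
    return grace_io(B_R, B_S) - 2 * (B_R + B_S) // k

B_R, B_S = 100_000, 500_000
for k in (10, 4, 2, 1):
    print(k, hybrid_io(B_R, B_S, k))
# k = 1 degenerates to the in-memory cost B_R + B_S.
```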
Every production engine that ships hash join ships hybrid. Postgres's hash join spills overflow batches to disk, with the work_mem knob bounding the in-memory portion; DuckDB, Snowflake, Redshift, BigQuery, Spark's ShuffledHashJoin — all hybrid, all adaptive.
External hashing for aggregation
The partitioning trick is not specific to joins. Any operation where rows with the same key need to end up together can use the same pattern — and the canonical example is aggregation. To compute SELECT user_id, SUM(total) FROM orders GROUP BY user_id when there are too many distinct user_ids to hold all the per-user running totals in memory, you hash-partition orders into k files by user_id, then aggregate each partition independently in memory, then concatenate. The I/O cost is 3·B_S — exactly grace hash join's formula with B_R = 0 because there is no build side. The next chapter builds this out for GROUP BY, DISTINCT, and window functions. You already know the algorithm; that chapter is about which aggregates work under this pattern and which need the sort-based alternative.
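A minimal sketch of the same pattern for GROUP BY SUM — partitions are simulated in memory here, where a real engine would write them as spill files (the function name and toy rows are mine):

```python
from collections import defaultdict

def partitioned_sum(rows, key: int, val: int, k: int = 4) -> dict:
    # Phase 1: partition rows by hash of the group key. Because the hash is
    # deterministic, a group can never span two partitions.
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % k].append(row)
    # Phase 2: aggregate each partition independently, then concatenate.
    out = {}
    for chunk in parts.values():
        totals = defaultdict(float)
        for row in chunk:
            totals[row[key]] += row[val]
        out.update(totals)
    return out

orders = [(1, 10.0), (2, 5.0), (1, 2.5), (3, 1.0), (2, 1.0)]
print(partitioned_sum(orders, key=0, val=1))
```

Only each partition's running totals need to fit in memory at once — the same guarantee grace hash gives the join hashmap.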
Cost, collisions, and skew
The cost formulas above all assume hashing distributes rows uniformly. When it does not, grace hash breaks in one of two ways, and you need defences.
Skewed build side. Suppose orders is the build side of some join on user_id, and one user has ten million orders while the other users average fifty. The hash function will send all ten million of that user's rows to one partition — which no longer fits in memory. When you try to build the in-memory hashmap for that partition, it blows the memory budget.
The fix is recursive partitioning: detect that partition i is too big, and re-partition it with a different hash function h_part'. The second-level partition will split the skew across more buckets (unless every row has the exact same key value, in which case no hash function can help — see below). Recursive partitioning is what production engines like Postgres do when a partition overflows work_mem: see ExecHashIncreaseNumBatches in nodeHash.c for the actual mechanism.
Single hot key. If ten million rows have the identical user_id = 42, no hash function will split them. They all hash to the same value by definition. When this happens, the build side for that one key is itself bigger than memory, and you have to fall back to a nested-loop join for that partition alone — scan the hot build bucket against the hot probe bucket in blocks. Most engines call this a "skew-handling" or "fallback" path.
Bloom filters for selective probes. A related optimisation — useful when most probe rows don't have a match on the build side. Before running the full probe, build a Bloom filter on the build keys. During probing, test each probe row against the filter first; if it misses, skip the hashmap lookup entirely. For a 1% selective join (one probe row in 100 has a build-side match), the Bloom filter eliminates 99% of hashmap lookups. This is implemented as "runtime filters" in Impala, Presto, and Spark; see the Bloom filter wiki for the data structure itself. It is a CPU win, not an I/O win — but on CPU-bound hash joins over cached data, it can double throughput.
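A toy version of the data structure and the probe-side pre-check — the sizes and two-probe hash scheme are illustrative, not any engine's:

```python
class TinyBloom:
    """Illustrative Bloom filter: two hash probes into one bit array.
    Real runtime filters size the array and hash count from |R| and a
    target false-positive rate; this shows only the shape of the idea."""

    def __init__(self, nbits: int = 1 << 16):
        self.nbits = nbits
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        h = hash(key)
        return (h % self.nbits, (h // self.nbits) % self.nbits)

    def add(self, key):
        for i in self._positions(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8))
                   for i in self._positions(key))

bf = TinyBloom()
for k in ["alice", "bob"]:          # build-side keys
    bf.add(k)

# Probe-side pre-filter: a miss skips the hashmap lookup entirely.
print(bf.might_contain("alice"))    # never a false negative
# might_contain("carol") is usually False, but false positives are
# possible -- a positive still requires the real hashmap lookup.
```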
Grace hash join versus nested-loop on 10 M × 50 M rows.
Scenario: two tables to join on an equality predicate with no index.
- R (users): 10 million rows × 100 bytes = 1 GB = B_R = 10⁴ pages at 100 KB/page (a large-ish page size chosen for round arithmetic — in practice 8 KB pages scale the page counts up by about ×12, but the ratios stay the same).
- S (orders): 50 million rows × 100 bytes = 5 GB = B_S = 5·10⁴ pages.
- Buffer pool memory M: 100 MB = 10³ pages.
Number of partitions for grace hash. Each partition of R must fit in M:

k = ⌈B_R / M⌉ = ⌈10⁴ / 10³⌉ = 10
Why 10 and not fewer: if k = 5, each partition of R is 2·10³ pages — twice memory. You cannot build the hashmap. If k = 20, partitions are 5·10² pages (50 MB), comfortably within memory, but you incur more partition-file overhead. Ten is the tight fit.
Grace hash I/O.

I/O = 3(B_R + B_S) = 3(10⁴ + 5·10⁴) = 1.8·10⁵ page operations

At 0.5 ms per page operation, that is 90 seconds — partition-pass writes included, since sequential SSD writes cost roughly the same as reads.
Nested-loop I/O for comparison.

I/O = B_R + |R| · B_S = 10⁴ + 10⁷ · 5·10⁴ ≈ 5·10¹¹ page reads
Why NLJ is absurd here: |R| is the number of rows (10⁷), not pages. Each of the ten million outer rows triggers a full inner scan of 50,000 pages. Block NLJ rescues the I/O somewhat — with B = 1000 buffer pages, ⌈B_R/(B−2)⌉ = 11 outer blocks, so the inner is scanned 11 times: 11 · 5·10⁴ = 5.5·10⁵ pages, only about three times worse than grace hash. But the CPU comparisons for BNL are still 10⁷ · 5·10⁷ = 5·10¹⁴ — roughly ten million times more than grace hash's 6·10⁷.
Ratio.

5·10¹¹ / 1.8·10⁵ ≈ 3·10⁶

Grace hash is about three million times faster than naive NLJ on this workload. This is the kind of number that makes you understand why every analytical engine ever built implements hash join.
Hybrid hash. With k = 10, hybrid saves 2(B_R + B_S)/k = 2 · 6·10⁴ / 10 = 1.2·10⁴ page ops — about 7% of total. Modest here because k is large; for joins that fit in memory except for a small overflow, hybrid is close to free while grace costs 3×.
Common confusions
- "Hash join works for `<`, `>`, `LIKE`, or `BETWEEN`." No — hash join is strictly equi-only. The hash function maps equal keys to equal buckets, which is useless for inequality. For r.timestamp < s.deadline, the optimiser must fall back to nested-loop join or, in some special cases, a sort-merge variant. This is one of the three reasons (with small outers and indexed inners) that NLJ survives in every production planner — it is the only join algorithm that handles arbitrary predicates.
- "Grace hash writes to disk, so it must be slow." Writing sequentially to k partition files is cheap — SSDs sustain gigabytes per second of sequential writes, and even on HDDs the partition writes are effectively sequential because engines buffer per-partition pages and flush full pages at a time. The cost compared to in-memory hash is 2(B_R + B_S) extra page ops, but compared to the |R| · B_S of naive NLJ, the partition writes are noise. Writing once beats re-reading the inner |R| times by orders of magnitude.
- "The build side is always the smaller table." By convention yes, because a smaller build side means a smaller hashmap and fewer partitions in the spilling case. But this is an optimiser choice, not an algorithmic requirement — either side can be the build side. The optimiser uses table statistics (`pg_statistic` in Postgres, `DBA_TABLES` in Oracle) to estimate row counts for each input to the join, including after filters, and picks the smaller one. When statistics are wrong (stale, or the filter selectivity is hard to estimate), the optimiser picks wrong, memory fills, partitions spill recursively, and the query takes 10× longer. This is the #1 reason production DBAs run `ANALYZE`.
- "Why not just use a `dict` and let Python/the OS page it?" Swap paging evicts at page granularity with LRU-like policies, but a hash table is accessed randomly — a probe touches one bucket out of thousands, and that bucket is almost certainly not resident. Every hashmap lookup becomes a 10 ms disk read. Explicit partitioning gives you sequential reads during the join phase (you read R_i straight through), which is orders of magnitude faster than random reads. Database engines always manage spill explicitly for this reason.
- "If both sides are already sorted, do I still want hash join?" Probably not — sort-merge join is O(B_R + B_S) with no build phase, and it exploits the existing sort. Hash join beats sort-merge only when both sides are unsorted and neither can exploit an index scan. The next chapter on sort-merge builds this out.
- "Does hash join respect row order?" No. The probe outputs rows in the order the probe side arrives, interleaved with whichever build-side rows happened to hash into the matching bucket. If your query depends on output order, add an `ORDER BY` — hash join is a streaming reshuffler.
Going deeper
Vectorised hash join in DuckDB and Velox
Classical hash join does one hashmap lookup per probe row — one virtual dispatch, one cache miss. DuckDB's vectorised execution batches 1024–2048 probe rows into a vector and processes the batch through the hashmap in one function call. Modern CPUs with AVX-512 have gather instructions that can fetch 16 hashmap buckets in parallel from memory; the DuckDB and Velox engines use these to run hash-join probes at billions of rows per second per core. The speedup over the row-at-a-time iterator model is 10–30× on CPU-bound analytical queries, which is one of the reasons DuckDB benchmarks famously well against much larger distributed systems for single-node analytics.
The idea generalises: any hashmap operation (aggregation, distinct, set-membership) benefits from vectorised probing. This is one of the two or three core techniques that distinguish 2020s analytical engines from the 1990s iterator-model originals.
Distributed hash join and the shuffle
In an MPP engine (Snowflake, BigQuery, Redshift, Spark), the relations are spread across many workers. Hash join generalises by shuffling: every worker partitions its local slice of R and S using the same h_part, then ships partition i to worker i. After the shuffle, worker i holds all of R_i and all of S_i — it runs a local hybrid hash join and produces its slice of the output. This is exactly the grace hash pattern, just with network replacing disk. Cost becomes 3(B_R + B_S) in I/O plus B_R + B_S in network bytes. Network is usually ~10× slower than local SSD, so the shuffle often dominates. Optimisers like Snowflake's aggressively try to eliminate shuffles — using broadcast joins when one side is small, co-located joins when both sides are already partitioned on the join key.
Where this leads next
- Sort-merge join and external merge sort — hash join's main competitor. Sort both sides on the join key, then do a single linear merge. Slower than hash when both sides need to be sorted from scratch; faster when inputs are already sorted (e.g. from an indexed scan or an upstream `ORDER BY`). Sort-merge also handles some inequality cases that hash cannot, which makes it the second-choice equi-join algorithm and the first choice for banded joins.
- Hash-based and sort-based aggregation — reuses the partitioning idea for GROUP BY. Hash-aggregate partitions by the group key and aggregates each partition in memory. Sort-aggregate sorts by the group key and aggregates in one linear pass. The choice mirrors the hash-vs-merge join choice, and the memory management is identical.
- Join reordering — once you know hash join costs O(B_R + B_S) and has a hefty memory requirement, the optimiser's job is to decide which pair of tables to hash-join first in a multi-way join. Bushy plans with hash joins at multiple levels are common for star-schema analytics; left-deep plans with one hash join and index nested-loops below are common for OLTP.
- Skew detection and adaptive plans — production engines increasingly detect at runtime when a partition is too skewed and switch algorithms mid-query. Spark's Adaptive Query Execution is the most prominent example; it can convert a planned shuffled hash join into a broadcast join after seeing the actual row counts.
The jump from "nested-loop join for everything" to "hash join for equi, nested-loop for the rest" is the single biggest algorithmic win in query execution. Every other improvement — vectorisation, compilation, morsel scheduling — multiplies the constant factor. Picking the right algorithm for the right shape of input multiplies the complexity class, and that is why the chapter you just read is, practically, the most important one in Build 6.
References
- Shapiro, "Join Processing in Database Systems with Large Main Memories", ACM TODS 11(3), 1986 — introduces hybrid hash join and gives the cost analysis reproduced in this chapter; the original source for the 3(B_R + B_S) formula and the adaptive interpolation into in-memory hash.
- Graefe, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys 25(2), 1993 — the reference survey of physical join algorithms, including hash, grace, hybrid, and the recursive-partitioning skew-handling path.
- Kitsuregawa, Tanaka, Moto-oka, "Application of Hash to Data Base Machine and Its Architecture", New Generation Computing 1(1), 1983 — the GRACE database machine paper that gave grace hash join its name; describes the partition-then-join pattern for hardware parallelism.
- DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, "Implementation Techniques for Main Memory Database Systems", SIGMOD 1984 — early engineering analysis of hashing for in-memory joins, including the build-phase cost derivation used here.
- PostgreSQL source tree, nodeHash.c and nodeHashjoin.c — the production hybrid hash join implementation, including the ExecHashIncreaseNumBatches recursive-partitioning path that handles runtime skew.
- Raasveldt and Mühleisen, "DuckDB: an Embeddable Analytical Database", SIGMOD 2019 — description of DuckDB's vectorised hash join on modern hardware, with the SIMD gather-probe optimisation discussed in the Going Deeper section.