In short
A sort-merge join sorts both inputs on the join key, then merges them with two cursors advancing in lockstep — whenever the keys are equal, emit the cross product of the equal-key blocks; whenever they differ, advance the smaller. The merge is linear: O(|R| + |S|) comparisons and B_R + B_S page reads after sorting. The hard part is the sort, because real databases routinely sort inputs bigger than memory. External merge sort solves that: pass 0 reads the input in chunks of M pages, sorts each chunk in memory, and writes it back as a sorted run, producing ⌈N/M⌉ runs; each subsequent pass opens M−1 runs at once and merges them through a heap, consuming one buffer per input run and one for output. After ⌈log_{M−1}(N/M)⌉ merge passes you have a single sorted file. Total I/O is 2·N·(1 + ⌈log_{M−1}(N/M)⌉) — for realistic N and M this is two passes, i.e. 4·N. Sort-merge join's full cost is thus roughly 4(B_R + B_S) — worse than grace hash's 3(B_R + B_S) on pure equi-joins. You pick sort-merge anyway when one side is already sorted (clustered B-tree on the join column), when the query has a downstream ORDER BY or window function that matches the join key, when the predicate is a range (BETWEEN) rather than equality, or when the key distribution is skewed enough that grace hash's recursive partitioning degrades. The sorted output is itself a reusable artefact, which is exactly why query planners keep sort-merge in their toolkit even though hash is faster on paper.
You are running an analytics query against two large tables. The orders table has fifty million rows and a clustered B-tree on customer_id. The customers table is smaller, a few million rows, and the query asks for each customer's name alongside that customer's orders, returned sorted by customer_id:
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
ORDER BY c.id;
The optimiser looks at this and notices three things. The orders side is already sorted on the join key, courtesy of the clustered index. The query demands ORDER BY c.id, so the output needs to come out sorted on the join key anyway. And because this is a plain equi-join on two large tables, both hash join and sort-merge join are candidates.
Hash join would build a hash table on customers, stream orders through, and probe. Fast. But the output comes out in hash-probe order, effectively random with respect to customer_id, so the planner has to stick a Sort operator on top to satisfy the ORDER BY. Sort-merge join, on the other hand, needs customers sorted on id — one external sort pass — and the orders side is already sorted. The merge is linear. The output comes out in customer_id order, so the ORDER BY is a no-op.
For this query, sort-merge wins. It is not faster on the join alone — it is faster on the whole plan, because the sorted output is a by-product the downstream operators need anyway. Sort-merge is rarely the cheapest join in isolation, but its output is valuable, and a good planner always prices the whole tree.
This chapter builds the three pieces: the merge join itself, external merge sort, and the two wired together as a physical operator — with the cost analysis laid out against grace hash.
The merge join — linear once both sides are sorted
Start with the easy half. Assume both inputs are already sorted on the join key. You have two cursors, one on each input. The algorithm is: compare the two current keys; if they are equal, emit the cross product of the equal-key blocks and advance both; if they differ, advance the smaller.
# query/exec/sort_merge_join.py
from dataclasses import dataclass
from typing import Iterator, Callable, Any
Row = tuple
@dataclass
class SortMergeJoinExec:
    outer: Any                        # child operator, sorted on outer_key
    inner: Any                        # child operator, sorted on inner_key
    outer_key: Callable[[Row], Any]
    inner_key: Callable[[Row], Any]

    def __iter__(self) -> Iterator[Row]:
        self.outer.open(); self.inner.open()
        try:
            r, s = next(self.outer, None), next(self.inner, None)
            while r is not None and s is not None:
                kr, ks = self.outer_key(r), self.inner_key(s)
                if kr < ks:
                    r = next(self.outer, None)
                elif kr > ks:
                    s = next(self.inner, None)
                else:
                    yield from self._emit_equal_block(r, s)
                    # _emit_equal_block peeks one row past each equal-key
                    # block; resume both cursors from the peeked rows.
                    r, s = self._outer_peeked, self._inner_peeked
        finally:
            self.outer.close(); self.inner.close()
That is the skeleton. The interesting case is the else branch, where the keys match. A naive implementation that just emits (r, s) and advances both cursors is wrong: if the outer has two rows with key k=7 and the inner has three rows with key k=7, you need to emit all six pairs, not two or three.
The correct behaviour is to collect every inner row with the current matching key into a small buffer, then for each outer row at that key emit the cross product with the buffer. Here is the equal-block handler:
    def _emit_equal_block(self, r, s):
        # Buffer all inner rows that share the current key.
        k = self.inner_key(s)
        inner_block = [s]
        while True:
            nxt = next(self.inner, None)
            if nxt is None or self.inner_key(nxt) != k:
                self._inner_peeked = nxt   # save for later
                break
            inner_block.append(nxt)
        # Now emit r × inner_block, then every subsequent outer row
        # whose key still equals k, crossed with the same buffer.
        while r is not None and self.outer_key(r) == k:
            for s_row in inner_block:
                yield r + s_row
            r = next(self.outer, None)
        self._outer_peeked = r
Why the inner has to be buffered and the outer does not: you need to rewind the inner across the equal-key block for every matching outer row, but you cannot rewind a general iterator cheaply. The fix is to materialise the inner's equal-key rows in a list — which is fine because the equal-key block is usually small (typically a few dozen rows even in large tables; if it isn't, you have a skew problem). The outer flows through once per equal block.
Why this is linear. Each row on each side is visited exactly once, across all equal-key blocks combined. The cross product inside a block is O(|block_R| × |block_S|) but that is the size of the output, not wasted work. If the join is one-to-many or many-to-one, the merge does O(|R| + |S|) comparisons plus O(output_size) emission — optimal.
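To make the equal-block behaviour concrete, here is a self-contained list-based variant (a hypothetical `merge_join` helper for illustration, not the operator above) joining two outer rows against three inner rows at the same key:

```python
def merge_join(r_rows, s_rows, key=lambda row: row[0]):
    """Merge two key-sorted lists of tuples, emitting the full cross
    product of every equal-key block. Linear apart from the output."""
    i = j = 0
    while i < len(r_rows) and j < len(s_rows):
        kr, ks = key(r_rows[i]), key(s_rows[j])
        if kr < ks:
            i += 1
        elif kr > ks:
            j += 1
        else:
            # Find the extent of the equal-key block on each side.
            i2 = i
            while i2 < len(r_rows) and key(r_rows[i2]) == kr:
                i2 += 1
            j2 = j
            while j2 < len(s_rows) and key(s_rows[j2]) == kr:
                j2 += 1
            for r in r_rows[i:i2]:            # cross product of the block
                for s in s_rows[j:j2]:
                    yield r + s
            i, j = i2, j2

# Two outer rows and three inner rows share key 7: six pairs come out,
# plus the single match at key 9.
outer = [(5, "a"), (7, "b"), (7, "c"), (9, "d")]
inner = [(7, "x"), (7, "y"), (7, "z"), (9, "w")]
pairs = list(merge_join(outer, inner))
```

The 2×3 block at key 7 produces exactly six pairs, and the unmatched key 5 row is skipped in one comparison.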
I/O cost. Assuming both sides fit the streaming assumption (you only hold one row at a time plus an equal-block buffer), the merge reads each input page exactly once:

Cost(merge) = B_R + B_S page I/Os.
This is the best any join algorithm can do, because any join must read both inputs at least once. The catch, of course, is that you only get this number if the inputs are already sorted. The rest of the chapter is about paying for that assumption.
External merge sort — sort something bigger than memory
You have a file of N pages to sort. You have M buffer pages available (M ≪ N). In-memory sorting needs all the data resident, which by assumption doesn't fit. External merge sort solves this by doing the sort in phases: run generation followed by one or more merge passes.
Pass 0 — run generation. Read the input M pages at a time. Sort each chunk in memory using any in-memory algorithm (quicksort, timsort — the standard library is fine). Write the sorted chunk back out as a sorted run. After pass 0 you have ⌈N/M⌉ sorted runs, each of length M pages (the last one maybe shorter).
# query/exec/external_sort.py
import heapq, os, tempfile, pickle
from typing import Iterator, Iterable, Callable
def run_generation(pages: Iterator[list], M: int, key: Callable) -> list[str]:
    """Pass 0: read M pages, sort in memory, write a sorted run file.
    Returns the list of run file paths."""
    run_paths = []
    buf: list = []
    for page in pages:
        buf.extend(page)
        if _pages_worth(buf) >= M:
            run_paths.append(_flush_sorted(buf, key))
            buf = []
    if buf:
        run_paths.append(_flush_sorted(buf, key))
    return run_paths

def _pages_worth(rows: list, rows_per_page: int = 256) -> int:
    # Crude stand-in: a real engine tracks bytes, not row counts.
    return len(rows) // rows_per_page

def _flush_sorted(rows, key):
    rows.sort(key=key)
    fd, path = tempfile.mkstemp(prefix="run-")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(rows, f)
    return path
Twenty lines; most of it is file-handle bookkeeping. The sort itself is one call to Python's built-in. In a production engine you would swap that for a tuned external sort kernel (SQL Server's uses a loser tree; see the going-deeper section), but the structural idea is unchanged.
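Stripped of the file handling, pass 0 is just chunk, sort, emit. A toy in-memory version (lists standing in for run files, rows_per_run playing the role of M) shows the run shape:

```python
import random

def generate_runs(rows, rows_per_run):
    """Pass-0 sketch: chunk the input and sort each chunk in memory.
    Lists stand in for run files; rows_per_run plays the role of M."""
    return [sorted(rows[i:i + rows_per_run])
            for i in range(0, len(rows), rows_per_run)]

random.seed(7)
data = [random.randrange(1000) for _ in range(95)]
runs = generate_runs(data, 10)
assert len(runs) == 10                       # ceil(95 / 10) runs
assert all(r == sorted(r) for r in runs)     # each run internally sorted
assert len(runs[-1]) == 5                    # the last run is shorter
```

Ten runs from 95 rows at 10 rows per run, the last one partial, exactly as the ⌈N/M⌉ formula predicts.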
Pass 1 onwards — k-way merge. Now you have ⌈N/M⌉ sorted runs on disk. Merge them. The trick is that you can merge up to M−1 runs at once using a min-heap: allocate one input buffer page per run, one buffer for output, and repeatedly pop the smallest key across all input buffers, emit it to the output buffer, and advance the input page that produced it when its buffer drains.
def k_way_merge(run_paths: list[str], key: Callable, out_path: str) -> None:
    """Merge sorted runs into one sorted file using a min-heap.
    (Sketch assumes a single merge pass: the output's row-at-a-time
    format differs from the run files' one-list-per-file format.)"""
    iters = [_read_run(p) for p in run_paths]
    with open(out_path, "wb") as f:
        # heapq.merge is exactly the k-way merge we want:
        # O(N log k) comparisons.
        for row in heapq.merge(*iters, key=key):
            pickle.dump(row, f)
    for p in run_paths:
        os.remove(p)

def _read_run(path):
    with open(path, "rb") as f:
        for row in pickle.load(f):
            yield row
Python's heapq.merge is the canonical k-way merge: it holds a heap of size k (one entry per input stream), emits the smallest, refills its slot from the stream that produced it. Total comparisons are O(N log k) for N rows across k runs. A C implementation in a real database looks the same, with a hand-rolled heap and page-at-a-time I/O instead of row-at-a-time.
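heapq.merge's behaviour is easy to verify on toy runs — it is lazy, holds one heap entry per input stream, and takes a key= parameter for merging whole rows:

```python
import heapq

runs = [[1, 4, 9], [2, 3, 8], [0, 5, 6, 7]]
merged = list(heapq.merge(*runs))        # one pass, a 3-entry heap throughout

# key= merges whole rows rather than bare values, as the join needs.
rows_a = [("alice", 1), ("carol", 3)]
rows_b = [("bob", 2), ("dave", 4)]
by_id = [r[0] for r in heapq.merge(rows_a, rows_b, key=lambda r: r[1])]
```

`merged` is the integers 0 through 9 in order, and `by_id` comes out alice, bob, carol, dave — sorted on the second field without ever materialising the inputs.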
Pass count. If each merge pass collapses M−1 runs into one, the number of runs divides by M−1 every pass. Starting from ⌈N/M⌉ runs, you need:

⌈log_{M−1}(N/M)⌉ merge passes.
When M is a few thousand and N is even tens of millions of pages, this is 1 — one merge pass suffices. It takes N pretty big before you need a second merge pass. For a third, N has to be astronomical.
I/O cost. Every pass reads N pages and writes N pages:

Total I/O = 2·N·(1 + ⌈log_{M−1}(N/M)⌉).
Why the 1 +: pass 0 (run generation) is one full read-and-write, and each merge pass is another full read-and-write. If you need one merge pass, total passes over the data is two, so total I/O is 4N. If you need two merge passes, total is 6N. The formula just packages this as "1 for run generation plus however many merge passes".
Why the M−1 and not M: during a merge pass, one of the buffer pages has to hold the output being written back to disk. The other M−1 hold one page each from M−1 input runs. You cannot merge M runs with only M buffers because there would be nowhere to stage the output. Every external-sort cost formula in every textbook has that M−1, and it is usually where the off-by-one errors creep in.
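A hypothetical pass-counting helper makes the off-by-one visible: at realistic M the difference between fan-in M and fan-in M−1 never shows, at tiny M it does:

```python
import math

def merge_passes(n_pages, m_pages, fan_in=None):
    """Merge passes after run generation. The correct fan-in is
    M - 1; one buffer page is always reserved for the output."""
    fan_in = fan_in or (m_pages - 1)
    runs, passes = math.ceil(n_pages / m_pages), 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes

# Realistic M: the off-by-one is invisible.
assert merge_passes(10**6, 10**4) == 1
# Tiny M: fan-in 2 (correct) versus fan-in 3 (the off-by-one answer).
assert merge_passes(1000, 3) == 9
assert merge_passes(1000, 3, fan_in=3) == 6
```

At M = 3 the wrong fan-in predicts six passes where the real machine needs nine — a 50 percent error in the cost estimate.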
Worked numbers. You have a 10 GB relation to sort, 8 KB pages, so N = 10·10^9 / 8·10^3 = 1.25·10^6 pages (call it N = 10^6 for tidiness). Buffer pool is 100 MB = M ≈ 10^4 pages. Run generation produces 10^6 / 10^4 = 100 runs. One merge pass can handle up to M−1 ≈ 10^4 runs, so one pass suffices. Total passes over the data is 2. Total I/O is 4 · N = 4 · 10^6 page reads/writes, about 32 GB of disk traffic for a 10 GB sort. On a 1 GB/s SSD, that's roughly 32 seconds. Not bad for something that didn't fit in RAM.
Replacement-selection — building longer runs
Pass 0 as described produces runs of size exactly M. A cleverer pass-0 algorithm, replacement-selection (Knuth TAOCP Vol 3, §5.4.1), produces runs of expected length 2M on random data. Halving the number of runs halves the merge fan-in requirement and sometimes saves a whole merge pass.
The idea: keep a heap of size M filled from the input. Emit the heap's minimum to the current output run. Read one fresh input row; if it is greater than the row you just emitted, it can still extend the current run, so insert it into the heap normally. If it is less than the emitted row, it cannot be placed into the current run (would break sort order), so mark it with a "next generation" tag and insert it anyway — the tag ensures it only pops after the current run finishes.
def replacement_selection(rows, M, key):
    """Pass-0 variant producing runs of expected length 2M on random data."""
    heap, it = [], iter(rows)
    for _ in range(M):                       # prime the heap
        try:
            r = next(it)
            heapq.heappush(heap, (0, key(r), r))
        except StopIteration:
            break
    last_key, last_gen = None, 0
    while heap:
        g, k, r = heapq.heappop(heap)
        if g != last_gen:                    # run boundary
            yield ("RUN_BOUNDARY",)
            last_key, last_gen = None, g
        yield r
        last_key = k
        try:
            nxt = next(it)
        except StopIteration:
            continue
        nxt_k = key(nxt)
        heapq.heappush(heap, (g if nxt_k >= last_key else g + 1, nxt_k, nxt))
The argument for the 2M expected run length: on random input, half of the fresh rows are larger than the row just emitted and can extend the current run; the other half are tagged for the next generation and seed the following run. In steady state the heap therefore produces runs roughly twice M long before the current generation drains out.
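A quick empirical check bears this out. The toy variant below (hypothetical names; it yields whole runs as lists rather than tagged rows, to keep the bookkeeping visible) produces average run lengths close to 2M on seeded random data:

```python
import heapq, random

def replacement_selection_runs(rows, M, key=lambda r: r):
    """Replacement-selection sketch yielding whole runs as lists
    (run files in a real engine). Heap entries: (generation, key, row)."""
    it = iter(rows)
    heap = []
    for r in it:                              # prime the heap with M rows
        heapq.heappush(heap, (0, key(r), r))
        if len(heap) == M:
            break
    gen, run, last_key = 0, [], None
    while heap:
        g, k, r = heapq.heappop(heap)
        if g != gen:                          # current generation drained
            yield run
            gen, run, last_key = g, [], None
        run.append(r)
        last_key = k
        nxt = next(it, None)
        if nxt is not None:
            nk = key(nxt)
            # A fresh row >= the one just emitted can extend this run;
            # otherwise it waits for the next generation.
            heapq.heappush(heap, (gen if nk >= last_key else gen + 1, nk, nxt))
    if run:
        yield run

random.seed(42)
data = [random.randrange(10**6) for _ in range(10_000)]
runs = list(replacement_selection_runs(data, 100))
avg_run = len(data) / len(runs)    # roughly 2 * M = 200 on random input
```

Every run comes out internally sorted, no rows are lost, and the average run length lands near 200 for M = 100 — about twice the plain pass-0 run size.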
Replacement-selection has fallen somewhat out of fashion. Modern CPUs prefer straight-line code over heap operations, so Postgres uses plain quicksort runs, SQL Server uses tournament trees, and DuckDB uses radix-partitioned runs. The trade-off depends on key width and access pattern.
Sort-merge join end to end
Now wire the two pieces together. The physical operator pulls from two children, sorts each if necessary, then merges.
@dataclass
class SortMergeJoinOperator:
    outer: Any
    inner: Any
    outer_key: Callable
    inner_key: Callable
    buffer_pages: int = 1024
    outer_presorted: bool = False
    inner_presorted: bool = False

    def __iter__(self):
        outer_it = (self.outer if self.outer_presorted
                    else external_sort(self.outer, self.buffer_pages, self.outer_key))
        inner_it = (self.inner if self.inner_presorted
                    else external_sort(self.inner, self.buffer_pages, self.inner_key))
        yield from merge_join(outer_it, inner_it, self.outer_key, self.inner_key)
The external_sort helper is the run-generation-plus-merge-pass pipeline from above; merge_join is the cursor-advancement algorithm from earlier. The two flags outer_presorted and inner_presorted are the optimiser's hooks: if it knows a child already produces sorted output (a B-tree index scan, an upstream sort, an upstream merge-join with a compatible key), it sets the flag and the sort is skipped entirely.
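The whole pipeline fits in a toy end-to-end sketch, with sorted runs plus heapq.merge standing in for the external sort and in-memory lists standing in for pages (all names hypothetical):

```python
import heapq

def external_sort(rows, m_rows, key):
    """Toy external sort: pass 0 makes ceil(N/M) sorted runs, then one
    k-way merge combines them (assumes M - 1 >= number of runs)."""
    runs = [sorted(rows[i:i + m_rows], key=key)
            for i in range(0, len(rows), m_rows)]
    return heapq.merge(*runs, key=key)

def merge_join(outer, inner, okey, ikey):
    """Merge join over two key-sorted streams, handling duplicates by
    crossing each outer row with the inner's equal-key block."""
    outer, inner = list(outer), list(inner)
    i = j = 0
    while i < len(outer) and j < len(inner):
        ko, ki = okey(outer[i]), ikey(inner[j])
        if ko < ki:
            i += 1
        elif ko > ki:
            j += 1
        else:
            j2 = j
            while j2 < len(inner) and ikey(inner[j2]) == ko:
                j2 += 1
            while i < len(outer) and okey(outer[i]) == ko:
                for s in inner[j:j2]:
                    yield outer[i] + s
                i += 1
            j = j2

customers = [(3, "carol"), (1, "alice"), (2, "bob")]       # unsorted
orders = [(2, 101), (1, 102), (2, 103), (3, 104)]          # unsorted
joined = list(merge_join(
    external_sort(customers, 2, key=lambda c: c[0]),
    external_sort(orders, 2, key=lambda o: o[0]),
    okey=lambda c: c[0], ikey=lambda o: o[0]))
```

The output arrives already sorted on the join key — customer 1, then the two rows for customer 2, then customer 3 — which is exactly the property the chapter's opening query exploits.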
Total cost. Let B_R, B_S be the page counts of the two inputs. Assume one merge pass suffices for both external sorts (the realistic case):

sort R + sort S + final merge = 4·B_R + 4·B_S + (B_R + B_S) = 5·(B_R + B_S).
Wait — that's 5, not 4. The 4 you see in some textbooks counts the merge phase as part of the second sort pass, by arranging the merge to consume the last merge-pass output directly instead of materialising a sorted file and then re-reading it. A production sort-merge join does exactly that: the final merge pass of the external sort is the input to the merge join, no intermediate materialisation. With that optimisation, the cost drops to 4(B_R + B_S).
Compare grace hash join, whose cost for the same two inputs is 3(B_R + B_S) — one read and one write to partition each side, then one read of the partitions back during build and probe. Hash wins by one pass' worth of I/O.
So on a pure cost-of-the-join-in-isolation basis, hash beats merge. The three places merge wins are the three cases in the next section — and they show up often enough that both algorithms stay in every optimiser's toolkit.
When sort-merge beats hash join
One side already sorted. This is the common case. A clustered B-tree on the join column means the scan already emits rows in key order, so the sort cost for that side collapses from 4·B to 0: the join costs 4·B_R + B_S if the other side still needs sorting, or B_R + B_S if both are presorted. Hash join's 3(B_R + B_S) cannot match the latter and does not benefit from presortedness.
Downstream ORDER BY or window function matches the join key. The query ... JOIN ... ON a.x = b.x ORDER BY a.x can use the sort-merge's sorted output for free. Hash join's output is in hash-probe order, requiring an explicit sort on top. Same story for OVER (PARTITION BY x ORDER BY x) window functions: the planner propagates "sortedness on x" up through the tree as an interesting order (Selinger's term), and picks the plan that satisfies the most interesting orders without extra work.
Range predicates. Hash functions partition by equality — the hash of 5 and the hash of 6 land in unrelated buckets, so a predicate like a.x BETWEEN b.low AND b.high cannot be evaluated by a hash join. Merge join handles it with a sliding-window variant: sort both sides, and for each outer row with value v, maintain the window of inner intervals that contain v — expiring intervals that end before v, opening intervals that start at or before v. The complexity is O(|R| + |S| + output), optimal.
def range_merge_join(r_iter, s_iter, key_r, low_s, high_s):
    """Emit (r, s) where low_s(s) <= key_r(r) <= high_s(s).
    r sorted ascending by key_r; s sorted ascending by low_s."""
    active = []                # currently-open intervals from s
    s = next(s_iter, None)
    for r in r_iter:
        kr = key_r(r)
        # Expire intervals that end before kr.
        active = [x for x in active if high_s(x) >= kr]
        # Open any new intervals whose low is <= kr.
        while s is not None and low_s(s) <= kr:
            if high_s(s) >= kr:
                active.append(s)
            s = next(s_iter, None)
        for x in active:
            yield r + x
Interval and temporal databases rely on this daily. It is one of the few algorithms that cleanly beats both hash and nested-loop for a useful workload.
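A usage sketch over toy point and interval rows — the function is repeated here so the snippet runs standalone:

```python
def range_merge_join(r_iter, s_iter, key_r, low_s, high_s):
    """Emit r + s where low_s(s) <= key_r(r) <= high_s(s).
    r sorted ascending by key_r; s sorted ascending by low_s."""
    active = []
    s = next(s_iter, None)
    for r in r_iter:
        kr = key_r(r)
        active = [x for x in active if high_s(x) >= kr]   # expire
        while s is not None and low_s(s) <= kr:           # open
            if high_s(s) >= kr:
                active.append(s)
            s = next(s_iter, None)
        for x in active:
            yield r + x

points = [(1,), (3,), (5,), (7,)]          # sorted by value
intervals = [(0, 2), (2, 6), (5, 9)]       # (low, high), sorted by low
hits = list(range_merge_join(iter(points), iter(intervals),
                             key_r=lambda r: r[0],
                             low_s=lambda s: s[0],
                             high_s=lambda s: s[1]))
```

Point 5 falls inside two overlapping intervals and produces two output rows; interval (0, 2) is expired once the cursor passes 2 and is never revisited.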
Skewed key distributions. Grace hash join partitions by hash; if one key value covers twenty percent of the input, that partition is twenty percent of the data, and if it doesn't fit in memory you have to recursively partition it — a second hash pass on a sub-partition. Recursive partitioning is the standard fix, but each recursion is another read-and-write of the offending partition. Sort-merge degrades more gracefully under skew: a hot key just means a bigger equal-key block in the merge, and the block either fits or is spilled to disk as needed, with no recursion.
A full end-to-end example
Sorting a 10 GB relation with 100 MB of buffer
You are running a report that groups orders by customer and sorts by total spend. The orders table is 10 GB, pages are 8 KB, so N = 1.25 × 10^6 pages — call it N = 10^6 for clean arithmetic. The buffer pool allocated to this operator is 100 MB, which at 8 KB per page is M ≈ 1.28 × 10^4 pages — call it M = 10^4.
Pass 0 — run generation. Read pages in chunks of M = 10^4. The 10^6-page input yields 10^6 / 10^4 = 100 sorted runs. I/O: one full read of the input plus one full write of the sorted runs = 2N = 2 × 10^6 pages.
Pass 1 — merge. You have 100 runs. One merge pass can handle up to M − 1 ≈ 10^4 − 1 runs simultaneously — well over 100. So one merge pass suffices. I/O: one full read of the 100 runs plus one full write of the merged output = 2N = 2 × 10^6 pages.
Total sort I/O: 4N = 4 × 10^6 pages = 32 GB of disk traffic.
On a 1 GB/s NVMe SSD, raw I/O time is 32 seconds. Add CPU time for sorting (roughly N log N comparisons spread across pass 0, and N log k for the merge heap), and the whole operation completes in under a minute for a 10 GB sort on commodity hardware.
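The same arithmetic packages into a few lines (a hypothetical helper, reusing the 2·N·(1 + merge passes) formula):

```python
import math

def external_sort_io(n_pages, m_pages):
    """Total page I/Os: 2N per pass; one run-generation pass plus
    however many merge passes a fan-in of M - 1 requires."""
    runs, merge_passes = math.ceil(n_pages / m_pages), 0
    while runs > 1:
        runs = math.ceil(runs / (m_pages - 1))
        merge_passes += 1
    return 2 * n_pages * (1 + merge_passes)

io = external_sort_io(10**6, 10**4)     # the worked numbers above
traffic_gb = io * 8 * 1024 / 10**9      # 8 KB pages
assert io == 4 * 10**6
assert 32 <= traffic_gb < 33            # ~32 GB of disk traffic
```

Doubling N to 2 × 10^6 pages still needs only one merge pass (200 runs against a fan-in of nearly 10^4), so I/O stays at 4N; the pass count, not N, is what the planner watches.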
Common confusions
- "Merge join only works if both inputs happen to be sorted." False. Merge join always assumes sorted inputs, but the plan inserts explicit Sort operators above each child when needed. The sort is part of the plan and the optimiser costs it in. The case where no Sort is inserted — because the child is already sorted — is when sort-merge shines, but it is not a prerequisite.
- "Duplicates kill merge join." False. Duplicates are handled by the equal-block buffering: for each run of matching keys you buffer the inner's equal-key rows and emit the cross product against every outer row sharing that key. Performance degrades only when the equal block is too large to fit in memory, in which case it spills to disk and becomes a mini nested-loop.
- "External sort is always slower than external hash." False. On presorted inputs external sort degenerates to a single linear pass, which is cheaper than grace hash's two-pass partitioning. On already-sorted data the comparison is not even close: B_R + B_S against 3(B_R + B_S) — merge wins three-to-one.
- "M is the buffer pool size in bytes." No. Throughout external-sort analyses, M is the number of pages the operator is allowed to use. If your buffer pool is 100 MB and pages are 8 KB, then M ≈ 12,800. This chapter uses M for pages consistently.
- "Sort-merge join output is always sorted." Only on the join key. If your query joins on customer_id but orders by order_date, sort-merge gives you output sorted on customer_id, which doesn't help the ORDER BY. The planner's interesting-orders analysis has to match the join key against what downstream operators want.
- "The number of runs always decreases by a factor of M per pass." By a factor of M − 1, not M. One buffer is always reserved for output. For large M this is a tiny correction; for small M it is the difference between terminating in three passes and three hundred.
Going deeper
Two subtler topics worth seeing if you want to go from textbook external sort to what a production engine actually does: loser trees (which make the merge cache-friendlier), and how column stores rewrite the sort algorithm for vectorised execution.
Tournament trees and loser trees
The k-way merge's inner loop needs, at each step, the smallest key across k input buffers. A binary heap gives O(log k) per operation, but a sift-down can cost two comparisons per level. SQL Server's external sort uses a loser tree instead (Knuth TAOCP Vol 3, §5.4.1) — a tournament-tree variant that needs only about log₂ k comparisons per pop, with better cache behaviour because tree nodes sit contiguously and each update touches only the log k nodes on one predictable leaf-to-root path.
The idea: build a binary tree whose leaves are the k input streams and whose internal nodes hold the loser of the comparison between their two subtrees. The root holds the overall winner and knows which leaf it came from. To pop the winner, refill that leaf and walk up the tree updating losers on the path — one comparison per level.
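A compact sketch of the tournament-tree idea. Note this is the winner-tree formulation, which stores winners at internal nodes; Knuth's loser tree stores the losers so each replay compares only against the stored loser, but the leaf-to-root replay below is the same shape. Names are illustrative and keys are assumed numeric:

```python
import math

INF = float("inf")

class WinnerTree:
    """Tournament-tree sketch over k sorted streams: pop the minimum
    with one comparison per tree level along a fixed leaf-to-root path."""
    def __init__(self, streams):
        k = max(1, len(streams))
        self.k = 1 << max(1, math.ceil(math.log2(k)))   # pad to a power of 2
        self.streams = list(streams) + [iter(())] * (self.k - len(streams))
        self.nodes = [(INF, -1)] * (2 * self.k)         # (key, leaf index)
        for i in range(self.k):
            self._refill(i)

    def _refill(self, i):
        nxt = next(self.streams[i], None)
        self.nodes[self.k + i] = (INF, i) if nxt is None else (nxt, i)
        j = (self.k + i) // 2
        while j >= 1:                                   # replay the path up
            left, right = self.nodes[2 * j], self.nodes[2 * j + 1]
            self.nodes[j] = left if left[0] <= right[0] else right
            j //= 2

    def __iter__(self):
        while True:
            key, i = self.nodes[1]                      # overall winner
            if key == INF:                              # all streams drained
                return
            yield key
            self._refill(i)

runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]]
merged = list(WinnerTree([iter(r) for r in runs]))
```

Each pop touches only the ⌈log₂ k⌉ ancestors of the refilled leaf, always the same addresses for a given leaf — the cache-predictability Graefe's survey measures.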
Graefe's Implementing Sorting in Database Systems (ACM Computing Surveys 2006) is the survey to read. He shows empirically that loser trees win by 10–30 percent on the merge phase at realistic fan-in values.
Column stores: radix sort on fixed-width keys
Row-store sorts compare variable-width tuples; a column store like DuckDB or Vertica sorts on a fixed-width key column laid out contiguously — a very different problem. The modern answer for integer or bounded-string keys is radix sort, which is O(N · w) where w is the key width in bytes — often faster than comparison sort because radix passes are cache-friendly, branch-free, and SIMD-able.
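A minimal LSD radix sort over 32-bit keys shows the shape of the comparison-free approach — a sketch of the general technique, not DuckDB's implementation:

```python
import random

def lsd_radix_sort(keys, width=4):
    """LSD radix sort over fixed-width non-negative ints: one stable
    bucketing pass per byte, O(N * w), zero key comparisons."""
    for shift in range(0, 8 * width, 8):
        buckets = [[] for _ in range(256)]
        for key in keys:
            buckets[(key >> shift) & 0xFF].append(key)
        # Stable concatenation preserves the order of earlier passes.
        keys = [key for bucket in buckets for key in bucket]
    return keys

random.seed(1)
data = [random.getrandbits(32) for _ in range(5000)]
sorted_data = lsd_radix_sort(data)
```

Four passes of 256-way bucketing sort any 32-bit keys; the inner loop is a shift, a mask, and an append — no branches on key comparisons, which is what makes the columnar variants SIMD-friendly.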
DuckDB combines radix partitioning in pass 0 (producing many small L2-sized partitions) with an in-partition quicksort. The external variant spills partitions when they collectively exceed memory. The sort is effectively a hash-partition-on-the-sort-key, bringing it closer to grace hash join than to classical external merge sort. Velox (Meta's vectorised library) takes the same approach. Once you commit to fixed-width columnar layouts, the best sort algorithm stops being merge sort.
Where this leads next
Sort-merge join is one corner of a triangle whose other corners are nested-loop and hash join. Each has one or two natural use cases; the optimiser's job is to pick per join node.
- Aggregation: hash-based and sort-based — GROUP BY has the exact same hash-versus-sort trade-off. Hash-aggregate is faster on pure aggregation, but sort-aggregate reuses the sort for downstream ordered output, and the algorithm underneath is the same external merge sort you just built.
- Hash join and external hashing — the other big join algorithm. Grace hash join's partitioning mirrors external sort's run generation; both are "divide into manageable chunks, process each in memory, glue results together". The 3B vs 4B cost comparison is the headline difference.
- Rule-based rewrites: predicate pushdown and join pushdown — the optimiser layer above physical operator choice. Given three tables and a star of predicates, which pair joins first? The cost functions derived here (4(B_R + B_S) for sort-merge with sorts, B_R + B_S for presorted sort-merge) are inputs to that layer.
- Interesting orders — Selinger's generalisation of "sortedness" to any property a physical operator preserves or produces. Merge join preserves sortedness on the join key; hash join does not. The planner tracks these bottom-up and uses them to prune dominated plans. Sort-merge is the canonical source of an interesting order.
You now have three production-grade join operators: nested-loop (all three variants), sort-merge (with external merge sort underneath), and — next chapter — hash join. Every query plan in an EXPLAIN output picks one of these three at each join node.
References
- Knuth, The Art of Computer Programming, Vol 3: Sorting and Searching (2nd ed., 1998), §5.4 "External Sorting" — the canonical treatment of sorted runs, k-way merge, replacement-selection, and tournament trees. Every modern external-sort paper builds on this.
- Graefe, Implementing Sorting in Database Systems, ACM Computing Surveys 38(3), 2006 — a sixty-page survey covering every variant of external sort used in production engines, including loser trees, forecasting, and adaptive merging.
- Blasgen and Eswaran, Storage and Access in Relational Databases, IBM Systems Journal 1977 — the System R paper that first costed merge join versus nested-loop and demonstrated merge's value for presorted inputs.
- Chaudhuri, An Overview of Query Optimization in Relational Systems, PODS 1998 — surveys how cost-based optimisers choose among NLJ, hash, and merge, including the role of interesting orders.
- Ramakrishnan and Gehrke, Database Management Systems (3rd ed., 2003), Ch. 13–14 — textbook derivation of external merge sort's 2N(1 + ⌈log_{M−1}(N/M)⌉) formula and sort-merge join's full cost, with the same notation this chapter uses.
- PostgreSQL source, nodeMergejoin.c and tuplesort.c — the production implementations of merge join and external sort. tuplesort.c is a long read but the cleanest end-to-end external sort code in any open-source engine.