In short

A nested-loop join is the most obvious join algorithm imaginable: for every row r in the outer relation R, scan every row s in the inner relation S, and emit (r, s) whenever the join predicate holds. Its cost in tuple comparisons is |R| × |S| — quadratic, and for a million-row join against a million-row table that is 10¹² comparisons, which is hopeless. But the real cost on disk is usually worse: the inner table has to be read from disk once for every outer row, so the I/O cost is B_R + |R| · B_S page reads, where B_R, B_S are the page counts. A hundred-row outer joined against a thousand-page inner is already a hundred thousand inner page reads. The block nested-loop patch fixes this by reading the outer in blocks of B − 2 pages (using almost all of the buffer pool), and for each outer block, scanning the inner once. This cuts the inner scans from |R| down to ⌈B_R / (B−2)⌉, a reduction of roughly five orders of magnitude when pages hold a hundred rows and B is a thousand. And if the inner table has an index on the join column, you can skip the inner scan altogether: for each outer row, do a B-tree lookup on the inner's index, which is O(log |S|) instead of O(|S|). That is the index nested-loop join, and it is hard to beat when the outer is small and the inner is large and indexed. Hash join and sort-merge join dominate for large joins, but NLJ (in all three variants) wins for small outers, highly selective predicates, and left-deep index-nested-loop plans — which is why every production optimiser still picks it on a large fraction of real queries.

You have a table of users and a table of orders, and you want to know every user's order history. In SQL you write SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id. The optimiser lowers this into a Join node in the algebra tree. The iterator model says that at execution time some physical operator — NestedLoopJoinExec, HashJoinExec, SortMergeJoinExec — needs to implement open(), next(), close() and pull rows from its two children.

This chapter is about the simplest of those three. It is the one you would invent yourself in five minutes if you had never seen a database textbook.

for u in users:
    for o in orders:
        if u.id == o.user_id:
            yield (u.name, o.total)

Four lines. For each row of the outer table, scan every row of the inner. That is the whole algorithm. It is so obvious that it feels like cheating to call it an algorithm at all.

And yet every real database ships it. Postgres, MySQL, SQLite, Oracle, SQL Server — all have a nested-loop join in their physical operator set, all pick it regularly in production plans, and all spend engineering effort tuning it. The reason is that in three specific situations it is strictly the right choice, and in one of those situations (a small outer driving into a large indexed inner) nothing else comes close.

So the goal of this chapter is not to convince you that NLJ is good. It is to show you exactly when it wins and when it loses, by deriving its cost formula, patching its worst-case behaviour with the block variant, and then upgrading to the index variant that most production plans actually use.

The naive nested-loop join

Here is the algorithm as a Python generator, written against the iterator model from the previous chapter.

# query/exec/nlj.py
from dataclasses import dataclass
from typing import Iterator, Callable, Any

Row = tuple

@dataclass
class NestedLoopJoinExec:
    outer: Any                   # child operator: has open/next/close
    inner: Any                   # child operator: has open/next/close
    predicate: Callable[[Row, Row], bool]   # returns True if rows match

    def __iter__(self) -> Iterator[Row]:
        self.outer.open()
        try:
            for r in self.outer:                 # outer loop: once
                self.inner.open()                # REWIND the inner every time
                try:
                    for s in self.inner:         # inner loop: once per outer row
                        if self.predicate(r, s):
                            yield r + s          # concatenate tuples
                finally:
                    self.inner.close()
        finally:
            self.outer.close()

Twenty lines, most of them open/close bookkeeping. The join itself is the two for loops.

Why you have to rewind the inner: iterators in the Volcano model are single-pass. Once next() returns None, the iterator is exhausted. To scan the inner again for the next outer row, you have to call close() and then open() again to reset it to the first row. For a Seq-scan leaf, open() means rewinding the file pointer to page 0; for a more complex subtree, it might mean re-executing a filter chain from scratch. This rewinding is what makes the algorithm quadratic in I/O, not just in comparisons.

Why the predicate is a callable, not a SQL expression: at this layer, the operator doesn't care what shape the predicate takes. The binder has already lowered it into something you can evaluate on a pair of rows. For u.id = o.user_id, the callable is lambda r, s: r[0] == s[1] (index positions of the join columns, resolved by the planner). The same operator works for any predicate, including inequality joins like u.signup_date < o.placed_date, which is one of NLJ's few unique strengths.
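To see the rewind semantics and the predicate flexibility run end to end, here is a self-contained sketch. ListScan is a hypothetical in-memory leaf whose open() rewinds to the first row; the join body is the same two loops as above, and the identical operator handles both an equality and an inequality predicate.

```python
from typing import Callable, Iterator

Row = tuple

class ListScan:
    """Minimal Volcano-style leaf: open() rewinds to the first row."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = None
    def open(self):
        self.pos = 0                       # rewind to "page 0"
    def close(self):
        self.pos = None
    def __iter__(self):
        while self.pos is not None and self.pos < len(self.rows):
            row = self.rows[self.pos]
            self.pos += 1
            yield row

def nested_loop_join(outer, inner,
                     predicate: Callable[[Row, Row], bool]) -> Iterator[Row]:
    outer.open()
    try:
        for r in outer:
            inner.open()                   # rewind the inner for every outer row
            try:
                for s in inner:
                    if predicate(r, s):
                        yield r + s
            finally:
                inner.close()
    finally:
        outer.close()

users  = ListScan([(1, "asha"), (2, "ravi")])
orders = ListScan([(101, 1, 500), (102, 1, 250), (103, 2, 900)])

# Equality join on u.id = o.user_id (column positions resolved by hand here).
eq = list(nested_loop_join(users, orders, lambda r, s: r[0] == s[1]))
# The same operator handles an inequality predicate unchanged.
lt = list(nested_loop_join(users, orders, lambda r, s: r[0] < s[1]))
print(eq)
print(lt)
```

Because each call to open() resets pos, the second join can reuse the same two ListScan objects; that is exactly the close-then-open rewind contract described above.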

The cost — why it is quadratic

Two numbers matter for any join algorithm: tuple comparisons (CPU work) and page reads (disk I/O). For most databases on real hardware the page reads dominate, so let's count both.

Let |R| and |S| be the row counts and B_R, B_S the page counts of outer and inner. Assume P is the number of rows per page (so B_R = ⌈|R| / P⌉).

Tuple comparisons. The inner if runs once for every (r, s) pair:

\text{comparisons} = |R| \cdot |S|

A million-row outer and million-row inner = 10¹² comparisons. On a CPU doing a billion comparisons per second, that is about seventeen minutes of pure CPU work before you read a single byte from disk. Already uncomfortable.

Page reads. The outer is read once:

\text{outer reads} = B_R

The inner is read once per outer row, because every time you rewind you have to re-fetch every inner page from disk (buffer pool caching helps but only if the inner fits in memory — more on this in a moment):

\text{inner reads} = |R| \cdot B_S

Total:

\boxed{\;\text{I/O}_{\text{NLJ}} = B_R + |R| \cdot B_S\;}

Plug in: |R| = 10⁵ rows, B_S = 10⁴ pages. That is 10⁹ inner page reads. At 0.1 ms per random SSD page read, that is 10⁵ seconds, more than a day of pure I/O; at 10 ms per seek on a spinning disk, it is 10⁷ seconds, over a hundred days. Either way, the query does not finish.

The buffer pool does save you if the inner table fits in memory — the first scan loads it into the cache, and subsequent scans are cache hits, turning |R| · B_S disk reads into B_S disk reads plus (|R|-1) · B_S memory reads. That is the only reason naive NLJ is not catastrophic in practice. But it only works if B_S ≤ buffer pool size.
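The formula is easy to sanity-check in code. A toy calculator, assuming (for this example only) 100 rows per page to derive B_R from |R|:

```python
from math import ceil

def naive_nlj_io(rows_outer: int, pages_outer: int, pages_inner: int) -> int:
    """I/O_NLJ = B_R + |R| * B_S  (every outer row rescans the inner)."""
    return pages_outer + rows_outer * pages_inner

P = 100                        # rows per page (assumed for the example)
R_rows = 10**5
B_R = ceil(R_rows / P)         # 1,000 outer pages
B_S = 10**4

io = naive_nlj_io(R_rows, B_R, B_S)
print(io)                      # dominated by the 10^9 inner reads
seconds = io * 0.010           # assuming 10 ms per random read (spinning disk)
print(seconds / 86_400)        # ≈ 116 days
```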

[Figure: Cartesian comparison grid of nested-loop join. Rows of R run down the left side, rows of S across the top; every cell is one predicate evaluation, shaded cells are matches. The outer loop sweeps top to bottom, the inner loop left to right for each outer row. Seven outer rows × eight inner rows = 56 comparisons, only seven matches; most of the work is wasted.]
Nested-loop join as a grid. Every cell of the `|R| × |S|` rectangle is evaluated; shaded cells are matches. The horizontal sweep happens once per outer row, and unless the inner fits in the buffer pool, each sweep is a full re-read from disk.

Block nested-loop — use the whole buffer pool

The naive version wastes the buffer pool. It only ever holds one outer row at a time, plus the inner page currently being scanned. If you have a buffer pool of B pages, you are using maybe three of them. The block nested-loop (BNL) trick is to grab the whole outer block at once, scan the inner once against that block, and match every outer row against every inner row during the single inner scan.

Here is the algorithm. B is the total buffer pages; we reserve one page for the current inner page and one for output, so the outer block is B − 2 pages.

# query/exec/bnl.py
from dataclasses import dataclass
from typing import Iterator, Callable, Any

@dataclass
class BlockNestedLoopJoinExec:
    outer: Any                    # child operator (must expose pages, not just rows)
    inner: Any                    # child operator
    predicate: Callable
    buffer_pages: int = 128       # B; we'll use B-2 for the outer block

    def __iter__(self) -> Iterator:
        self.outer.open()
        try:
            block_size = self.buffer_pages - 2
            while True:
                # 1. Load up to B-2 pages of outer rows into the outer block.
                block = []
                pages_loaded = 0
                for page in self.outer.next_pages(block_size):
                    block.extend(page.rows)
                    pages_loaded += 1
                if pages_loaded == 0:
                    break            # outer exhausted
                # 2. Scan the inner ONCE for this whole block.
                self.inner.open()
                try:
                    for s in self.inner:
                        for r in block:                # compare against every
                            if self.predicate(r, s):   # outer row in the block
                                yield r + s
                finally:
                    self.inner.close()
        finally:
            self.outer.close()

Thirty lines. The two-loop structure has become a three-loop structure — outer block loop, then inner scan, then inner nested loop over the block — but the inner loop over the block is in memory, so it is cheap. What matters is that the inner table is now read once per block instead of once per row.

Cost analysis. The outer is still read once: B_R pages. The inner is read once per outer block, and there are ⌈B_R / (B-2)⌉ blocks:

\boxed{\;\text{I/O}_{\text{BNL}} = B_R + \left\lceil \frac{B_R}{B - 2} \right\rceil \cdot B_S\;}

Why this is a massive win: the naive formula had |R| (rows) in the multiplier; the block formula has ⌈B_R / (B-2)⌉ (blocks of pages). If P = 100 rows per page and B = 1000 buffer pages, then a million-row outer is B_R = 10⁴ pages, which is ⌈10⁴ / 998⌉ = 11 blocks. The inner is scanned 11 times, not a million times. That is a factor of ~100,000 improvement in inner-side I/O.

If the entire outer fits in the buffer pool (B_R ≤ B − 2), you get exactly one inner scan: the join I/O is B_R + B_S. That is linear in the table sizes — the best any join algorithm can do, because you have to read both tables at least once. For a small outer joined against a large inner, BNL is optimal, which is exactly when the optimiser picks it.
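The BNL formula with the chapter's numbers, as a sketch you can rerun with your own buffer size:

```python
from math import ceil

def bnl_io(pages_outer: int, pages_inner: int, buffer_pages: int) -> int:
    """I/O_BNL = B_R + ceil(B_R / (B - 2)) * B_S."""
    blocks = ceil(pages_outer / (buffer_pages - 2))
    return pages_outer + blocks * pages_inner

# Chapter numbers: million-row outer at 100 rows/page -> B_R = 10,000 pages.
B_R, B_S, B = 10_000, 10_000, 1_000
print(ceil(B_R / (B - 2)))        # 11 inner scans instead of 1,000,000
print(bnl_io(B_R, B_S, B))        # 10,000 + 11 * 10,000 = 120,000 reads

# Small-outer special case: the whole outer fits in B - 2, one inner scan.
print(bnl_io(500, B_S, 16_000))   # 500 + 10,000 = 10,500
```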

Tuple comparisons are unchanged. BNL still does |R| · |S| predicate evaluations — the grid is the same size, you are just visiting it in a different order. Some engines further optimise the in-memory block lookup with an ad-hoc hash table over the block, turning the inner loop over the block into an O(1) probe per inner row. That is sometimes called hashed block nested-loop, and it is the bridge to the full hash join algorithm in the next chapter.
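The hashed-block idea fits in a few lines. hashed_block_probe and its key-extractor arguments are illustrative names, and the trick only applies to equality predicates, since the hash table partitions by key equality:

```python
from collections import defaultdict
from typing import Callable, Iterator

Row = tuple

def hashed_block_probe(block: list, inner_rows, outer_key: Callable,
                       inner_key: Callable) -> Iterator[Row]:
    """One BNL block, with a dict over the block instead of a linear sweep.

    The O(len(block)) inner loop becomes an O(1) average-case probe
    per inner row. Equality predicates only.
    """
    table = defaultdict(list)
    for r in block:                             # build: hash every outer row
        table[outer_key(r)].append(r)
    for s in inner_rows:                        # probe: one lookup per inner row
        for r in table.get(inner_key(s), ()):
            yield r + s

block = [(1, "asha"), (2, "ravi"), (3, "meena")]
inner = [(101, 1, 500), (102, 3, 250), (103, 9, 900)]
out = list(hashed_block_probe(block, inner,
                              outer_key=lambda r: r[0],
                              inner_key=lambda s: s[1]))
print(out)
```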

Index nested-loop — skip the inner scan entirely

Now imagine the inner table has an index on the join column — say, orders.user_id has a B-tree on it. Then for a given outer row r, you do not need to scan the inner at all. You do a B-tree lookup on r.id and pull only the matching inner rows directly.

# query/exec/inl.py
from dataclasses import dataclass
from typing import Iterator, Any

@dataclass
class IndexNestedLoopJoinExec:
    outer: Any
    inner_index: Any              # B-tree or hash index on inner's join column
    inner_heap: Any               # the actual table, for fetching full rows
    outer_join_col: int           # position of the join column in outer row

    def __iter__(self) -> Iterator:
        self.outer.open()
        try:
            for r in self.outer:
                key = r[self.outer_join_col]
                # B-tree lookup returns tuple IDs matching the key
                for tid in self.inner_index.lookup(key):
                    s = self.inner_heap.fetch(tid)
                    yield r + s
        finally:
            self.outer.close()

The inner loop is now a B-tree lookup per outer row. A B-tree lookup is O(log_f |S|) page reads, where f is the fan-out (typically 100–500). For a million-row inner, that is three or four page reads per outer row. Then a few more to fetch the matching tuples from the heap.

Cost. Let m be the average number of heap page reads needed to fetch one outer row's matches (at worst one per matching row, fewer if the matches cluster on shared pages), and assume the B-tree is h levels tall (usually 3–4):

\boxed{\;\text{I/O}_{\text{INLJ}} = B_R + |R| \cdot (h + m)\;}

Why this can be the cheapest join of all: if |R| is small (say 100 outer rows) and the inner has an index, you do 100 lookups of 4–6 page reads each, totalling maybe 600 reads. Compare with BNL, which still has to read the entire inner at least once — if the inner is a million pages, BNL is a million reads. INLJ turns a |S|-sized scan into a |R| · log |S|-sized set of lookups, which wins whenever |R| · log |S| ≪ |S|, i.e. whenever |R| ≪ |S| / log |S|.
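The formula and the break-even condition, as a small sketch (the helper names are illustrative):

```python
def inlj_io(pages_outer: int, rows_outer: int, height: int, matches: int) -> int:
    """I/O_INLJ = B_R + |R| * (h + m)."""
    return pages_outer + rows_outer * (height + matches)

# Chapter numbers: 100 outer rows, tree height 4, ~2 heap pages of matches.
print(inlj_io(1, 100, 4, 2))       # ~600 reads vs a million-page inner scan

def inlj_beats_single_inner_scan(rows_outer, height, matches, pages_inner):
    """INLJ wins over even one full inner scan when |R| * (h + m) < B_S."""
    return rows_outer * (height + matches) < pages_inner

print(inlj_beats_single_inner_scan(100, 4, 2, 1_000_000))       # small outer
print(inlj_beats_single_inner_scan(500_000, 4, 2, 1_000_000))   # big outer
```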

This is why "small outer driving into large indexed inner" is the classic INLJ win pattern. If your query is SELECT * FROM recent_signups r JOIN orders o ON r.user_id = o.user_id WHERE r.signup_date > '2026-04-01' — and recent_signups has 100 rows, and orders.user_id is indexed — the optimiser will pick INLJ and the query finishes in milliseconds. If you drop the index, the same query falls back to BNL or hash join and takes seconds.

A realistic join on Indian-scale data

You are building the backend for a JEE coaching platform. You have two tables:

  • students — 2 million rows, 20,000 pages (about 100 rows per page). One row per enrolled student.
  • test_attempts — 80 million rows, 400,000 pages (about 200 of its narrow rows per page), indexed on student_id. One row per student per test attempt.

You run: SELECT s.name, COUNT(a.id) FROM students s JOIN test_attempts a ON s.id = a.student_id WHERE s.city = 'Kota' GROUP BY s.name.

The filter s.city = 'Kota' is very selective — maybe 50,000 rows, about 500 pages' worth. The optimiser's job is to pick a join algorithm.

Option A: Naive NLJ. Outer = filtered students (50,000 rows). Inner = test_attempts (400,000 pages, roughly 3 GB). I/O = 20,000 + 50,000 × 400,000 = 2 × 10¹⁰ page reads. The inner does not fit in a 128 MB (~16,000-page) buffer pool, so most of those reads miss the cache. Dead in the water.

Option B: Block NLJ with B = 16,000. The 500 pages of filtered outer rows fit in a single block, so the inner is scanned once. I/O = 20,000 + 400,000 = 420,000 page reads, almost all of them sequential. At SSD scan speeds, tens of seconds.

Option C: Index NLJ using test_attempts(student_id). For each of 50,000 filtered outer rows, a B-tree lookup (say 4 pages) plus fetches of the ~40 matching attempt rows (say 5 heap pages, since the heap is not clustered on student_id). I/O = 20,000 + 50,000 × 9 = 470,000 page reads, and they are random reads rather than one sequential scan. Slower than BNL in this case, because 50,000 outer rows is not that small.

Option D: Hash join (next chapter). Build a hash table on filtered students (50,000 rows fits easily in memory), stream test_attempts and probe. I/O = 20,000 + 400,000 = 420,000 reads. Same as BNL but no quadratic predicate evaluations.

For this query, BNL and hash join tie on I/O, and hash join wins on comparisons. The optimiser picks hash.

If the filter had returned only 100 Kota students instead of 50,000, INLJ would drop to 20,000 + 100 × 9 ≈ 21,000 page reads, a twentieth of BNL's 420,000. Then the optimiser picks INLJ.

That is the game. Same query, same tables, three different plans depending on filter selectivity and available indexes. NLJ in its index variant is the winner whenever the outer is truly small.
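The selectivity crossover can be caricatured as a toy cost comparison. Every number here is an illustrative round figure in the spirit of the worked example, not a measurement, and the helper names are made up:

```python
from math import ceil

def bnl_io(b_r, b_s, buf):
    """Block NLJ: B_R + ceil(B_R / (B - 2)) * B_S."""
    return b_r + ceil(b_r / (buf - 2)) * b_s

def inlj_io(b_r, rows, h, m):
    """Index NLJ: B_R + |R| * (h + m)."""
    return b_r + rows * (h + m)

def cheapest_plan(outer_rows, outer_pages, inner_pages, buf=16_000, h=4, m=5):
    """Toy optimiser: compare BNL vs index NLJ on estimated page reads."""
    costs = {
        "block NLJ": bnl_io(outer_pages, inner_pages, buf),
        "index NLJ": inlj_io(outer_pages, outer_rows, h, m),
    }
    return min(costs, key=costs.get), costs

# Wide filter: 50,000 outer rows -> one inner scan is cheaper.
print(cheapest_plan(50_000, 500, 20_000))
# Narrow filter: 100 outer rows -> per-row index lookups win easily.
print(cheapest_plan(100, 1, 20_000))
```

Same inner table, same index; only the outer row count changes, and the winning plan flips.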

Common confusions

  • "Nested-loop join is always quadratic." Only the naive variant is quadratic in I/O. BNL's page reads are linear once the outer fits in the buffer pool, and INLJ does O(log |S|) work per outer row.
  • "Outer and inner are interchangeable." The cost formula B_R + ⌈B_R / (B−2)⌉ · B_S is asymmetric: the smaller relation should drive the loop, and the planner chooses the order deliberately.
  • "The buffer pool makes rescanning free." Only if the inner fits. One page too large and each rescan evicts what the previous one loaded.

Going deeper

If you wanted to know what the four-line Python algorithm actually becomes in production, this section is for you. Block-size tuning, Oracle's prefetching, MySQL's batch-key-access, and where the comparisons go when the CPU cache gets involved are all worth a look.

Block-size tuning: why bigger isn't always better

The formula says "bigger blocks, fewer inner scans, always better." In practice there are three reasons to cap the block size below the full buffer pool.

First, concurrency: other queries running at the same time need buffer pages too. If one query's BNL grabs every page in the pool, every other query's scans miss the cache. Real databases limit the outer block to a fraction of B (Postgres work_mem is per-operation, typically tens of MB).

Second, startup latency: the outer block must be fully loaded before the inner scan starts. If you're streaming results to a user who only wants the first 10 rows (LIMIT 10), a huge block means you do all the inner I/O up front for rows you'll never emit. The optimiser accounts for this with a "startup vs total cost" pair of estimates, and a LIMIT query often drops BNL block size way down, sometimes back to the one-row naive form.

Third, CPU cache friendliness: when the inner loop sweeps through the block comparing against every outer row, each outer row needs to be in the CPU cache. If the block is 10 MB, that's larger than L2, and every cycle of the sweep blows the cache. Some engines reorder the inner-block nested loop to be cache-blocked at L1 or L2 granularity, giving a second-level blocking on top of BNL's page-level blocking.

Oracle's table prefetching for NLJ

Oracle's nested-loop join has a neat trick for the "small outer, indexed inner" case. When the outer rows are processed one at a time, each inner lookup is a random read into the inner's heap. Random reads on rotating disks are slow; even on SSDs they have worse latency than sequential. Oracle issues prefetches: as soon as outer row r_i is fetched, the index lookup for r_{i+1} is issued asynchronously, so by the time the inner loop gets there, the heap page is already in the buffer. This is described in Oracle's Cost-Based Optimizer reference. The net effect is that INLJ I/O is mostly bounded by sequential read throughput, not random-access latency.

MySQL's block nested-loop and BKA

MySQL has long used a join buffer to run block nested-loop for joins with no usable index on the inner side, controlled by the block_nested_loop optimiser switch; the block is join_buffer_size bytes (default 256 KB, often tuned to 16 MB). MySQL 5.6 added Batch Key Access (BKA): when the inner side has an index but the heap pages are far apart, MySQL gathers a batch of index-lookup results, sorts them by row ID, and issues them as a single batch to the storage engine — converting random heap accesses into near-sequential ones. (From 8.0.18 onward, a hash join replaces the block nested-loop path for the non-indexed case.)

BKA is BNL + INLJ + sorted-prefetch. It is the most sophisticated NLJ variant in any mainstream engine and wins on joins where the outer is a few thousand rows and the inner is indexed but not clustered on the join key. See the MySQL reference manual, 8.2.1.6 for the details.
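The gather / sort / fetch shape of BKA can be sketched in a few lines. index_lookup and heap_fetch are hypothetical stand-ins for the storage-engine interface, not MySQL's actual handler API:

```python
def bka_fetch(outer_keys, index_lookup, heap_fetch):
    """Batch Key Access sketch: batch the index probes, then visit the
    heap in row-ID order so fetches are near-sequential.

    index_lookup(key) -> list of row IDs; heap_fetch(tid) -> row.
    """
    tids = []
    for key in outer_keys:                # 1. gather all index results
        tids.extend(index_lookup(key))
    tids.sort()                           # 2. sort by physical row ID
    return [heap_fetch(t) for t in tids]  # 3. one ordered sweep of the heap

heap = {0: "row-a", 3: "row-b", 7: "row-c"}
index = {"x": [7, 0], "y": [3]}
print(bka_fetch(["x", "y"], index.__getitem__, heap.__getitem__))
```

The probes arrive in key order (7, 0, 3) but the heap is visited in physical order (0, 3, 7); that reordering is the whole trick.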

Postgres's Nested Loop and enable_nestloop

Postgres's default planner behaviour is to consider nested-loop, hash, and merge joins for every binary join, and cost them using its row-count and page-count estimates. You can disable NLJ entirely with SET enable_nestloop = off;, which is a useful diagnostic when a query plan chooses NLJ and you suspect the estimates were wrong. The Postgres docs explicitly say: "Nested-loop join is especially useful for small outer relations because there are relatively few scans of the inner relation" — this is the |R| · log |S| ≪ |S| case from above.

Postgres's cost model also accounts for cache effects: if the inner is small enough to fit in effective_cache_size, the cost of rescanning is lowered, which nudges the planner toward NLJ even for moderate outer sizes. This is why toy cost analysis ("naive NLJ costs |R| · B_S") is always wrong in practice — the buffer pool changes everything.

Inequality joins: the one place NLJ is indispensable

Hash join cannot handle r.a < s.b. The hash function depends on equality; an inequality predicate does not partition the keyspace into buckets the same way. Sort-merge join can handle inequality in some cases (if you sort both sides), but the merge phase becomes a cross product of the overlapping ranges. For R ⋈_{r.start < s.end AND r.end > s.start} S — the interval-overlap join — there is no clever partitioning scheme, and the planner falls back to NLJ.

Modern spatial and temporal databases solve this with specialised algorithms (partition-based interval joins, R-tree joins), but for general-purpose SQL, inequality joins still mean NLJ.
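The interval-overlap join falls out of the same two loops; a minimal sketch with made-up intervals:

```python
def interval_overlap_join(R, S):
    """NLJ with the overlap predicate r.start < s.end AND r.end > s.start.

    No hash or sort-merge trick applies to this predicate, so the
    nested loop is the general-purpose fallback.
    """
    for r in R:
        for s in S:
            if r[0] < s[1] and r[1] > s[0]:
                yield r + s

shifts   = [(9, 12), (13, 17)]           # (start_hour, end_hour)
meetings = [(11, 13), (14, 15), (18, 19)]
print(list(interval_overlap_join(shifts, meetings)))
```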

NLJ in distributed query engines

In distributed engines (Spark, Presto, Snowflake) the calculus changes again. Shipping a huge inner across a network for every outer partition is catastrophic, so "nested-loop join" in the distributed sense means broadcasting the smaller side to every worker, then each worker runs a local BNL or hash join against its slice of the outer. Broadcast joins are the distributed cousin of INLJ: good when the broadcast side is small, bad otherwise. Spark's BroadcastNestedLoopJoin is exactly this — it ships the inner to every executor and runs a local BNL per partition, and it's the default choice for small inner + inequality predicate.

Where this leads next

You now have three NLJ variants and a clear sense of when each wins. The remaining two big joins in any production engine are hash join and sort-merge join, and they are the default for large-large joins.

Nested-loop join is the first physical operator you implement because it is the simplest to explain. It is also the one that survives longest in the wild, because in its index variant it is optimal for a huge class of real queries. The next chapter shows what happens when neither side is small enough for BNL and neither side has a usable index — the case where hash join is the only reasonable choice.

References

  1. Graefe, Query Evaluation Techniques for Large Databases, ACM Computing Surveys 1993 — the definitive survey of physical join algorithms, including naive, block, and index nested-loop with full cost analyses.
  2. Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — the System R paper that first documented the cost-based choice among NLJ, merge, and early hash variants.
  3. MySQL Reference Manual, 8.2.1.6 Nested-Loop Join Algorithms — documents naive NLJ, block NLJ, and Batch Key Access (BKA) with worked examples on real schemas.
  4. PostgreSQL Documentation, Planner/Optimizer — Join Strategies — how Postgres costs and picks among NLJ, hash, and merge joins.
  5. Ramakrishnan and Gehrke, Database Management Systems (3rd ed., 2003), Ch. 14 — textbook derivation of the |R| · |S|, BNL, and INLJ cost formulas used in this chapter.
  6. Mishra and Eich, Join Processing in Relational Databases, ACM Computing Surveys 1992 — historical survey covering the full family of join algorithms, including the inequality-join case where NLJ is the only choice.