In short

After parsing, binding, and optimisation you are holding a physical plan — a tree of concrete algorithms (SeqScan, Filter, Project, NestedLoopJoin). The tree does not run itself. Something has to walk it and produce rows. Graefe's Volcano (1994) answered with the iterator model: every operator implements three methods — open() prepares state, next() returns the next tuple or a sentinel meaning "done", close() releases resources. Operators store their children as fields. When a parent's next() is called, it calls next() on its children, does per-tuple work (filter, transform, join), and returns a tuple up to its caller. Execution is pull-based: the consumer drives the producer by demanding the next row. Every operator holds at most O(1) tuples, which makes execution streaming — a million-row query uses the same memory as a ten-row query. The model has one famous weakness: calling next() billions of times through virtual dispatch is slow, which vectorised engines (DuckDB) and compiled engines (HyPer) fix in different ways. This chapter builds the classic iterator model in Python — Scan, Filter, Project, NestedLoopJoin — and traces a real query through it tuple by tuple.

The optimiser has finished. Predicates have been pushed down, joins reordered, and each logical operator replaced with a physical algorithm. You are holding a tree like this:

Project(exprs=[u.name, o.total],
  child=NestedLoopJoin(predicate=(u.id == o.user_id),
    left=SeqScan('users'),
    right=Filter(predicate=(o.total > 1000), child=SeqScan('orders'))))

Nothing has executed. No pages read, no rows compared, no results produced. Something has to walk the tree and actually do it. That something is the executor, and the design of the executor is the subject of this chapter.

The question is: what interface should every operator expose so the executor can walk the tree generically, without special cases per operator type? Graefe's 1994 answer — in "Volcano — An Extensible and Parallel Query Evaluation System" — was three methods: open, next, close. Thirty years later Postgres, SQLite, MySQL, SQL Server, Oracle, and virtually every row-based database use some version of it. Volcano is the grandparent of every query executor you will read.

The three-method protocol

Every operator, no matter what it does internally, exposes the same three methods:

[Figure: The three-method protocol — three boxes in sequence. open(), called once: allocate state, open children, position cursors, build hash tables. next(), called many times: return the next tuple, or the EOF sentinel when the stream is exhausted. close(), called once: free state, close children, unpin pages, release memory. Arrows show the call sequence: one open(), many next() calls each returning a tuple until EOF, one close().]
The Volcano three-method protocol. Every physical operator — Scan, Filter, Join, Aggregate — implements these three methods. Execution of a query is: open() the root, next() repeatedly until EOF, close() the root.

In Python, the protocol is an abstract base class with three methods:

# execute/operator.py
from abc import ABC, abstractmethod

# A Tuple is just a dict from column name to value, for this teaching engine.
# Real systems use a packed byte layout for cache efficiency.
Tuple = dict[str, object]

# The EOF sentinel. When next() has nothing left, it returns this.
EOF = None

class Operator(ABC):
    @abstractmethod
    def open(self) -> None: ...

    @abstractmethod
    def next(self) -> Tuple | None: ...

    @abstractmethod
    def close(self) -> None: ...

Why three methods and not two (say, just next() with lazy setup)? Because some operators need a bulk prepare step that cannot happen inside next(). A hash join builds its hash table from the entire right input before returning the first result; an aggregate reads its entire input before emitting the first group. open() is where that bulk work lives. Conversely, close() is where you release the buffer-pool page pins and file handles; without a dedicated close, operators would leak those on partial consumption (LIMIT 10 stops calling next() after ten tuples — the tree still needs to be torn down).
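To make the partial-consumption point concrete, here is a sketch of a Limit operator in the same protocol. Limit is not one of this chapter's four operators — the class below is hypothetical, following the same open/next/close shape:

```python
# A sketch of a Limit operator (hypothetical; in the engine it would
# subclass Operator and be placed at the root of the plan by the planner).

class Limit:
    """Return at most n rows from the child, then report EOF."""

    def __init__(self, child, n: int):
        self.child = child
        self.n = n
        self._emitted = 0

    def open(self) -> None:
        self._emitted = 0
        self.child.open()

    def next(self):
        if self._emitted >= self.n:
            return None                  # stop pulling; child may have rows left
        row = self.child.next()
        if row is None:
            return None                  # child ran out before the limit
        self._emitted += 1
        return row

    def close(self) -> None:
        self.child.close()               # tear the child down even if unexhausted
```

After n rows, Limit never calls child.next() again — but close() still cascades, and that cascade is what releases the child's pins and handles.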

Why return None rather than raise StopIteration, as Python iterators do? Exceptions for control flow are slow, and the executor will call next() hundreds of millions of times in a real query. A sentinel return value is one pointer comparison; an exception is a full stack unwind. Every production engine uses a sentinel (Postgres returns a null TupleTableSlot, SQLite returns SQLITE_DONE), not an exception.
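If you do want for-loop ergonomics on top of the sentinel protocol, a thin adapter (an illustration, not part of the engine) bridges the two, paying the StopIteration cost once at the end rather than per tuple:

```python
# A thin adapter (an illustration, not part of the engine): wrap any
# sentinel-style operator so a Python for-loop can consume it.

class OperatorIterator:
    """Bridge open/next/close to Python's iterator protocol."""

    def __init__(self, op):
        self.op = op

    def __iter__(self):
        self.op.open()
        try:
            while (row := self.op.next()) is not None:
                yield row
        finally:
            self.op.close()              # runs when the stream ends or the
                                         # generator is closed
```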

Each algebra operator as an iterator

Each logical operator from the previous chapter has at least one physical iterator implementation. Some have many — a logical Join might become a NestedLoopJoin, a HashJoin, or a MergeJoin depending on the optimiser's cost model. For this chapter, build the simplest form of each and use them to run real queries.

Scan

The leaf of every tree. A Scan reads rows from a base table, one at a time. In a real system it goes through the buffer pool (built in Build 5) and reads page-at-a-time; for this teaching iterator, assume rows are already in a Python list and return them one by one.

# execute/scan.py
from .operator import Operator, Tuple

class SeqScan(Operator):
    """Sequential scan: read every row of a table in storage order."""

    def __init__(self, table_name: str, rows: list[Tuple]):
        self.table_name = table_name
        self._rows = rows                # in a real system, iterate pages
        self._cursor = 0                 # current position

    def open(self) -> None:
        self._cursor = 0                 # rewind (supports re-open for NLJoin)

    def next(self) -> Tuple | None:
        if self._cursor >= len(self._rows):
            return None                  # EOF
        row = self._rows[self._cursor]
        self._cursor += 1
        return row

    def close(self) -> None:
        self._cursor = 0                 # nothing else to release in this toy

Fifteen lines. open() sets the cursor. next() returns the next row or None. close() is trivial here but in a real system would unpin the current buffer-pool page. The _rows list is a stand-in for the real scan cursor; in chapter 47 of Build 5, this is where your HeapFile iterator plugs in.

Filter

Keep rows where a predicate holds, drop the rest. The filter holds a child operator and a predicate function.

# execute/filter.py
from .operator import Operator, Tuple

class Filter(Operator):
    """σ — keep rows where predicate(row) is true."""

    def __init__(self, child: Operator, predicate):
        self.child = child
        self.predicate = predicate       # a callable: Tuple -> bool

    def open(self) -> None:
        self.child.open()

    def next(self) -> Tuple | None:
        while True:
            row = self.child.next()
            if row is None:
                return None              # child exhausted → we are too
            if self.predicate(row):
                return row               # pass this one up
            # otherwise loop: fetch the next candidate from child

    def close(self) -> None:
        self.child.close()

Why the while loop inside next(): a filter has to keep asking its child for rows until it finds one that passes the predicate. If the child returns ten rows of which only the third matches, the filter calls child.next() three times inside a single next() call of its own. Only when the child returns EOF does the filter return EOF too. This is the "pull" part of "pull-based execution" — the filter pulls rows from its child until it has something to return.

Project

Keep only the named columns (or compute expressions). Like Filter, it holds a child and forwards open/close.

# execute/project.py
from .operator import Operator, Tuple

class Project(Operator):
    """π — compute each output expression from the input row."""

    def __init__(self, child: Operator, exprs: list, output_names: list[str]):
        self.child = child
        self.exprs = exprs               # each callable: Tuple -> value
        self.output_names = output_names

    def open(self) -> None:
        self.child.open()

    def next(self) -> Tuple | None:
        row = self.child.next()
        if row is None:
            return None
        return {name: e(row) for name, e in zip(self.output_names, self.exprs)}

    def close(self) -> None:
        self.child.close()

Unlike Filter, Project never loops: every input row produces exactly one output row. The loop would be wasted work.

NestedLoopJoin

The simplest join algorithm and the one that works no matter what. For every row of the outer (left) input, rewind the inner (right) input and scan it from the start, emitting pairs that satisfy the join predicate. The chapter on join algorithms will build better ones; this is the baseline every database has as a fallback.

# execute/nljoin.py
from .operator import Operator, Tuple

class NestedLoopJoin(Operator):
    """⋈ — for each outer row, scan inner and emit matching pairs."""

    def __init__(self, left: Operator, right: Operator, predicate):
        self.left = left
        self.right = right
        self.predicate = predicate       # Tuple x Tuple -> bool
        self._outer: Tuple | None = None # current outer row

    def open(self) -> None:
        self.left.open()
        self.right.open()
        self._outer = self.left.next()   # load the first outer row

    def next(self) -> Tuple | None:
        while self._outer is not None:
            inner = self.right.next()
            if inner is None:            # right exhausted for this outer row
                self._outer = self.left.next()
                if self._outer is None:
                    return None          # both exhausted
                self.right.open()        # rewind inner for the new outer
                continue
            if self.predicate(self._outer, inner):
                return {**self._outer, **inner}   # merged row
        return None

    def close(self) -> None:
        self.left.close()
        self.right.close()

Why the outer is loaded in open() rather than on the first next(): symmetry with inner cursor handling, and next()'s loop assumes self._outer already holds a candidate. An alternative design holds a flag _first_call and does the outer fetch in next(); both work. The Volcano paper uses the open() approach for clarity.

Why self.right.open() is called again inside the loop: each time the inner stream is exhausted for the current outer row, it must be rewound so the next outer row can pair with every inner row from the start. This is why open() on a SeqScan resets the cursor — so a NestedLoopJoin can legitimately re-open its inner child. Operators that cannot be cheaply re-opened (a hash join's build phase) need a different join algorithm, which is chapter 46's subject.
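One standard escape hatch for an inner that is expensive to re-run is to buffer it. The sketch below — a plain class following the same protocol; Postgres plans the equivalent as a Materialize node over nested-loop inners — drains its child once on the first open() and afterwards just rewinds an in-memory list, trading the model's O(1) memory for an O(n) buffer:

```python
# A sketch of a Materialize operator (illustrative): buffer the child's
# output on the first open() so later re-opens are a cheap rewind.

class Materialize:
    """Buffer the child's rows; every open() (re-)positions a cursor."""

    def __init__(self, child):
        self.child = child
        self._buffer = None              # filled on the first open()
        self._cursor = 0

    def open(self) -> None:
        if self._buffer is None:         # first open: drain the child once
            self.child.open()
            self._buffer = []
            while (row := self.child.next()) is not None:
                self._buffer.append(row)
            self.child.close()
        self._cursor = 0                 # re-opens only rewind the buffer

    def next(self):
        if self._cursor >= len(self._buffer):
            return None
        row = self._buffer[self._cursor]
        self._cursor += 1
        return row

    def close(self) -> None:
        self._buffer = None              # release the buffered rows
```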

That is it. Four operators — SeqScan, Filter, Project, NestedLoopJoin — about fifteen lines each, and you can already run any SELECT with equi- or theta-joins. The brevity is not an accident; it is the whole point of the iterator model.

Pull-based execution — next() cascades upward

The execution model is called pull-based because the consumer drives the producer. The user's code (or the wire-protocol layer sending rows to the client) calls next() on the root of the tree. The root calls next() on its children, which call next() on their children, all the way down to the leaves. When the leaves return a row, it propagates up, transformed at each level by whatever the operator does.

[Figure: Tuples flowing up through an operator tree — Project (u.name, o.total) at the root over NestedLoopJoin (u.id == o.user_id), whose children are SeqScan(users) and Filter (o.total > 1000) over SeqScan(orders). Downward arrows labelled "next() call" run from parent to child along every edge; upward arrows labelled "tuple returned" run from child to parent. A note beside Filter: it may call next() on its child several times before returning one tuple.]
Tuples flow up, next() calls flow down. The consumer at the top drives the producer at the bottom. Filter is the only operator that may call next() on its child several times before returning one tuple of its own — everyone else is one-in, one-out.

A query executes like this:

# execute/run.py

def run(root: Operator) -> list[Tuple]:
    root.open()
    results = []
    while True:
        row = root.next()
        if row is None:
            break
        results.append(row)
    root.close()
    return results

That is the entire executor. Eight lines of top-level code, and every operator's three methods. The whole complexity of running SQL lives inside the operator implementations — the driver is trivial.
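Python generators make the pull cascade especially visible. The sketch below rewrites the same pipeline shape as nested generators — intuition only, since generators give you next()-style pulling for free but no separate open()/close() for bulk setup and teardown:

```python
# The pull cascade written with Python generators (illustration only:
# each `for` pulls one row at a time, exactly like next()).

def seq_scan(rows):
    yield from rows                      # the leaf: emit stored rows in order

def filter_(child, predicate):
    for row in child:
        if predicate(row):
            yield row                    # keep pulling until something passes

def project(child, exprs, names):
    for row in child:
        yield {n: e(row) for n, e in zip(names, exprs)}

rows = [{'total': 500}, {'total': 2500}, {'total': 1200}]
plan = project(filter_(seq_scan(rows), lambda r: r['total'] > 1000),
               [lambda r: r['total'] * 2], ['double'])
print(list(plan))                        # → [{'double': 5000}, {'double': 2400}]
```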

Worked query, traced through

Set up two tiny tables and run a real query end-to-end. Watch what each operator does.

Running SELECT u.name, o.total FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000

Two tiny tables and the plan the optimiser produced (with the filter already pushed below the join):

users = [{'id':1,'name':'Dipti'}, {'id':2,'name':'Raghav'}, {'id':3,'name':'Priya'}]
orders = [{'order_id':101,'user_id':1,'total': 500},
          {'order_id':102,'user_id':1,'total':2500},
          {'order_id':103,'user_id':2,'total':1200},
          {'order_id':104,'user_id':3,'total': 900}]

plan = Project(
  child=NestedLoopJoin(
    left=SeqScan('users', users),
    right=Filter(SeqScan('orders', orders), lambda r: r['total'] > 1000),
    predicate=lambda l, r: l['id'] == r['user_id']),
  exprs=[lambda r: r['name'], lambda r: r['total']],
  output_names=['name','total'])

print(run(plan))

Trace the first next() at the root:

  1. Project.next() → NestedLoopJoin.next(). The NLJ already has _outer = {id:1, name:'Dipti'} from open(). Call Filter.next().
  2. Filter.next() → SeqScan(orders).next() → {order_id:101, user_id:1, total:500}. Predicate false. Loop. SeqScan.next() → {order_id:102, user_id:1, total:2500}. Predicate true. Return.
  3. Back in NLJ: outer id=1, inner user_id=1. Join predicate true. Return merged {id:1, name:'Dipti', order_id:102, user_id:1, total:2500}.
  4. Project keeps name, total and returns {name:'Dipti', total:2500}.

The driver appends the row and calls next() again. Order 103 (total 1200) passes the filter, but the join predicate 1 == 2 fails, so the NLJ loops. Order 104 (total 900) fails the filter; the filter returns None, and the NLJ advances the outer to Raghav, rewinding the inner via right.open(). Order 101 fails the filter; order 102 passes it but fails the join; order 103 passes both, and {name:'Raghav', total:1200} is returned. Raghav has no further qualifying orders, so the outer advances to Priya: her only order (total 900) is filtered out, and orders 102 and 103 pass the filter but fail the join. The inner exhausts, the outer exhausts, and the root returns None.

Final result:

[{'name': 'Dipti',  'total': 2500},
 {'name': 'Raghav', 'total': 1200}]

At no point did the engine hold more than one outer row, one inner candidate, and one output tuple. Even if the tables held a million rows each, that footprint does not grow. Iterator-model execution is streaming, and this is its single biggest practical virtue.
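The only place memory grows is the results list inside run(), which exists for convenience. A streaming driver (names illustrative) keeps the whole query at O(1) by yielding rows instead:

```python
# A sketch of a streaming driver (illustrative names): yield result rows
# one at a time instead of accumulating them, so memory stays O(1).

def run_streaming(root):
    """Yield result rows; never hold more than the current one."""
    root.open()
    try:
        while (row := root.next()) is not None:
            yield row                    # hand the row to the consumer now
    finally:
        root.close()                     # runs even if the consumer stops early
```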

Pros and cons

The iterator model has stuck around for thirty years because its pros run deep. Its cons are real too, and they are what the last decade of query-engine research has been fixing.

Pros:

  1. Streaming, O(1) memory per operator: a million-row query needs no more working memory than a ten-row one.
  2. A uniform interface: the driver and every parent operator are fully generic, so new operators plug in without special cases.
  3. Early termination for free: LIMIT simply stops calling next(), and close() tears the tree down.
  4. Simplicity: each operator is a few dozen lines, and the driver is eight.

Cons:

  1. One virtual call per tuple per operator — a query touching a hundred million rows through five operators makes half a billion next() calls.
  2. Tuples are spilled to the stack at every operator boundary instead of staying in CPU registers.
  3. Tuple-at-a-time inner loops defeat SIMD and most compiler optimisation; each next() does too little work to amortise its own overhead.

For short OLTP queries on small data, the pros dominate — the query is I/O bound, not next()-bound. This is why every OLTP database on earth uses the iterator model. The cons only bite on analytical queries over hundreds of millions of rows, and that is exactly where vectorised and compiled engines have taken over.

Common confusions

open() is not the constructor. The constructor builds the plan tree at plan time; open() acquires execution-time state, and a SeqScan constructed once may be re-opened many times by a NestedLoopJoin. The EOF sentinel overloads a legal value: because next() returns None for "done", an operator in this toy engine can never emit None as a row — real engines use a dedicated sentinel or status code instead. And pull-based does not mean every operator is lazy: a hash join's open() drains its entire build side before the first next() returns anything.

Going deeper

You have an executor that runs any query the optimiser produces. What follows is the architectural fork in the road: the iterator model's two modern successors — push-based with code generation, and vectorised — both of which arose because the iterator model does not keep up with modern CPUs on analytical workloads.

Push-based execution with code generation (HyPer, Umbra)

In the pull model, every operator implements next() and is called by its parent. In the push model, every operator implements consume(row) and calls its parent. The producer drives; the consumer reacts.

The trick is that push interacts beautifully with code generation: when a scan produces a row, the calls to its parent's consume, its grandparent's consume, and so on all the way up to the root, can be inlined into a single compiled function. No virtual dispatch because the whole tree is one straight-line function. This is the idea behind HyPer (Neumann 2011): emit LLVM IR at query time, compile to native code, run. On analytical workloads that gives 5–10× over an interpreted iterator model, because data stays in CPU registers across operator boundaries instead of being spilled to the stack on every next().
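A minimal Python sketch of the push shape (illustrative names — HyPer generates native code, not Python objects): each operator holds its parent and calls consume() upward, and the scan at the bottom drives the pipeline:

```python
# A sketch of the push model (illustrative names): operators call their
# parent's consume(); the scan at the bottom is the driver.

class Output:
    """Collect rows at the root of the pipeline."""
    def __init__(self):
        self.rows = []
    def consume(self, row):
        self.rows.append(row)

class PushFilter:
    def __init__(self, parent, predicate):
        self.parent, self.predicate = parent, predicate
    def consume(self, row):
        if self.predicate(row):
            self.parent.consume(row)     # push upward only when it passes

class PushScan:
    def __init__(self, parent, rows):
        self.parent, self.rows = parent, rows
    def run(self):
        for row in self.rows:            # the producer drives the pipeline
            self.parent.consume(row)

out = Output()
scan = PushScan(PushFilter(out, lambda r: r['total'] > 1000),
                [{'total': 500}, {'total': 2500}])
scan.run()
print(out.rows)                          # → [{'total': 2500}]
```

In a compiled engine, the chain of consume() calls is inlined into one straight-line function at query time — there is no dispatch left to pay for.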

Vectorised execution — batch next() (MonetDB X100, DuckDB)

The vectorised approach, first in MonetDB X100 (Boncz et al. 2005) and now the bedrock of DuckDB, keeps the iterator model but changes the unit: next() returns a vector — a fixed-size batch (usually 1024 or 2048 values) laid out columnwise. Each next() does N rows of work, amortising the virtual call. The inner loop is a tight scan over N columnar values, which the compiler can auto-vectorise to SIMD — on an AVX-512 CPU, a predicate over 1024 floats takes ~128 cycles instead of ~5,000 scalar. Build 15 of this curriculum implements this on top of the same algebra.
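A sketch of the vectorised interface in plain Python, with lists standing in for real columnar vectors (batch size and class names are illustrative):

```python
# A sketch of vectorised next(): each call moves a whole batch, laid out
# columnwise as one list per column. Batch size is illustrative.

BATCH = 4                                # real engines use 1024 or 2048

class VecScan:
    """next() returns a dict of column -> list of values, or None at EOF."""
    def __init__(self, totals):
        self.totals, self.pos = totals, 0
    def next(self):
        if self.pos >= len(self.totals):
            return None
        chunk = self.totals[self.pos:self.pos + BATCH]
        self.pos += BATCH
        return {'total': chunk}

class VecFilter:
    """One next() call filters a whole batch in a tight inner loop."""
    def __init__(self, child, threshold):
        self.child, self.threshold = child, threshold
    def next(self):
        while (batch := self.child.next()) is not None:
            kept = [t for t in batch['total'] if t > self.threshold]
            if kept:                     # skip batches that filtered to nothing
                return {'total': kept}
        return None

f = VecFilter(VecScan([500, 2500, 1200, 900, 3000]), 1000)
out = []
while (b := f.next()) is not None:
    out.extend(b['total'])
print(out)                               # → [2500, 1200, 3000]
```

The virtual call is paid once per batch instead of once per row, and the list comprehension is the kind of tight loop a native-code engine turns into SIMD.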

Morsel-driven parallelism

A third evolution — morsel-driven execution (Leis et al. 2014) — replaces Volcano's exchange operators with a thread pool that grabs small morsels (~10K rows) and runs them through the plan. Excellent load balancing, no partitioning overhead. Umbra and DuckDB both use variants.
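A toy sketch of the scheduling idea (illustrative; real morsels are ~10K rows and workers share NUMA-aware operator state): split the input into morsels, let a pool of workers each run a morsel through the pipeline, and combine:

```python
# A toy sketch of morsel-driven scheduling: workers grab small chunks of
# input and run each through the (here: filter-only) pipeline.
from concurrent.futures import ThreadPoolExecutor

MORSEL = 3                               # toy size; real morsels are ~10K rows

def process_morsel(rows):
    """Run one morsel through the pipeline."""
    return [r for r in rows if r['total'] > 1000]

rows = [{'total': t} for t in (500, 2500, 1200, 900, 3000, 100, 4000)]
morsels = [rows[i:i + MORSEL] for i in range(0, len(rows), MORSEL)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves morsel order, so results come back deterministic
    results = [r for chunk in pool.map(process_morsel, morsels) for r in chunk]

print([r['total'] for r in results])     # → [2500, 1200, 3000, 4000]
```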

Production iterator models

The iterator model is not dead. It is the default that every new engine has to argue against, and even engines that have argued against it keep a fallback iterator path for operators that don't fit the vectorised or compiled mould.

Where this leads next

You now have a working executor. Four operators, three methods each, and the whole of SQL's physical execution laid out as a recursive next() cascade. The next chapters extend the model with serious join algorithms.

Every operator you add plugs into this three-method protocol. The protocol is the stable interface; the operators behind it are where the ideas change. Volcano's elegance is exactly this: the surface is the same for everyone, from a one-day-old SeqScan to a fully tuned ParallelGraceHashJoin. The tree composes. The executor stays eight lines.

References

  1. Graefe, Volcano — An Extensible and Parallel Query Evaluation System, IEEE TKDE 6(1), 1994 — the founding paper. Introduces the open/next/close protocol and the exchange operator for parallel execution. Still worth reading cover-to-cover.
  2. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware, VLDB 2011 — the HyPer paper, which introduced the push-model-plus-code-generation approach. If you ever wondered why modern engines emit LLVM IR, start here.
  3. Boncz, Zukowski, Nes, MonetDB/X100: Hyper-Pipelining Query Execution, CIDR 2005 — the original vectorised-execution paper. The ideas here became DuckDB's entire execution layer twenty years later.
  4. Leis, Boncz, Kemper, Neumann, Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age, SIGMOD 2014 — how modern parallel executors schedule work. The successor to Volcano's exchange operator.
  5. PostgreSQL source tree, execProcnode.c — the dispatch routine ExecProcNode that calls each operator's next() equivalent. Reading this plus one or two operator files (e.g. nodeSeqscan.c, nodeNestloop.c) is the quickest tour of a real iterator model in production.
  6. Kemper, Neumann, HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots, ICDE 2011 — the broader HyPer architecture paper, which sets the code-generation executor in its engine context. Pairs with reference 2.