Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
The Volcano executor calls next() once per row per operator — at 100M rows that is seconds of pure virtual-call overhead before any real work, and the CPU's SIMD lanes sit idle because nothing is contiguous. Vectorised execution flips the unit: each operator processes a batch of 1024 columnar values at a time, amortising dispatch to picoseconds per row and unlocking AVX-512 inner loops that compare 8 doubles per cycle. The result is the 20–100× speedup every modern analytical engine is built on.
You spent the previous chapter learning that a column store reads only the bytes the query asked for — the layout wins by total_columns / referenced_columns, often 25× to 50× before any other trick. That is a storage win: it shrinks the bytes that arrive at the CPU.
But once those bytes have arrived in RAM, the executor still has to do work on them. And here the classical Volcano executor — the one every textbook teaches — turns out to be the wrong shape for analytical workloads. It was designed for the pipeline where SCAN feeds FILTER feeds AGG one tuple at a time, and it composes operators beautifully. It is also, when you measure it on a hundred-million-row table, slower than the disk it is reading from. The CPU becomes the bottleneck. Modern columnar engines fix this by processing batches.
This chapter derives why, builds a tiny vectorised executor in Python so you can feel the difference yourself, and ends with the worked example that the rest of Build 15 will lean on: a WHERE region = 'south' GROUP BY region SUM(price) query running 20× faster on the vectorised path than on the tuple path.
The Volcano model, recap
In the iterator model — named after the Volcano paper by Goetz Graefe in 1994 — every operator is a class with one method:
class Operator:
def next(self) -> Tuple | None:
"""Return the next tuple, or None when exhausted."""
...
A query like SELECT SUM(price) FROM sales WHERE region = 'south' becomes a tree of operators chained by next() calls:
Aggregate (SUM)
|
Filter (region == 'south')
|
Scan (sales)
Aggregate.next() calls Filter.next() in a loop until it returns None. Each call to Filter.next() calls Scan.next() and either returns the row or pulls another. Each call to Scan.next() reads one row from the heap.
This is beautiful software engineering — every operator implements the same one-method interface, the planner can wire them up in any order, you can unit-test each operator in isolation. And it is exactly right for OLTP, where a query touches twenty rows and the dispatch cost rounds to zero.
Where tuple-at-a-time falls apart
Run that same model on a hundred-million-row scan and four problems compound:
1. Function-call overhead. Each next() is a virtual call — the compiler doesn't know which subclass op.next() will dispatch to, so it emits an indirect jump through the vtable. On modern CPUs that's roughly 5 to 10 nanoseconds per call (one indirect branch plus the function prologue). For a three-deep operator tree, that is 15–30 ns per row. Multiply by 100M rows and you have spent 1.5 to 3 seconds on dispatch before doing any work. Why this matters now and not in 1994: in 1994 the actual per-row work — disk seek, decompression, predicate evaluation — was so much more expensive than the dispatch that nobody noticed. In 2026, with NVMe SSDs delivering 7 GB/s and predicates being one CPU instruction, dispatch is a measurable fraction of total time.
2. Branch misprediction. Inside Filter.next() there is an if predicate(tuple): return tuple else: continue. The CPU's branch predictor learns patterns — "this branch is usually taken" — but a WHERE region = 'south' predicate on randomly-ordered data is essentially a coin flip. Every misprediction stalls the pipeline by 15–20 cycles while the CPU throws away speculative work. On a typical analytical filter you lose 5–15% of total CPU time to mispredictions alone.
3. No SIMD. Modern x86 CPUs have AVX-512 registers that hold 8 doubles or 16 floats and can run one operation across all of them in a single cycle. ARM has NEON (4 doubles per register) and SVE. These are the silicon equivalent of going from a single carpenter to a row of eight working in lockstep. You can only use them on contiguous arrays of the same type. A tuple-at-a-time executor sees one value, then moves to the next operator; the SIMD units are completely idle. The CPU is running at maybe 10% of its arithmetic throughput.
4. Cache behaviour. Each tuple in Volcano is typically a struct or a dict — {order_id: 1234, region: "south", price: 99.50, ...} — with pointers to variable-length strings. Walking the tree pulls fragments from many different cache lines, evicting useful data. The CPU's L1 cache (32 KB) and L2 cache (256–1024 KB) end up thrashing, and you stall on memory loads instead of running instructions.
Add these together and a row-by-row Volcano executor on 100M rows runs at maybe 30–100 million rows per second on a modern core. A vectorised executor on the same hardware runs at 1–3 billion rows per second — a 20× to 100× speedup, and that is the gap Photon and DuckDB report on real TPC-H queries.
The vectorised fix
The insight, due to Boncz, Zukowski and Nes in 2005, is the obvious one once you've seen the problem: change the unit of work from one tuple to one batch. Each operator now takes a batch of (say) 1024 values in and returns a batch out.
class VectorOperator:
def next_batch(self) -> ColumnBatch | None:
"""Return the next batch of up to BATCH rows, or None when exhausted."""
...
The same query becomes:
VectorAggregate (SUM, batch=1024)
|
VectorFilter (region == 'south', batch=1024)
|
VectorScan (sales, batch=1024)
The change looks small — next() becomes next_batch() — but the consequences cascade through every layer:
- Dispatch amortised. One virtual call per 1024 rows means dispatch cost drops from ~15 ns/row to ~15 ps/row. Three orders of magnitude.
- Tight inner loops. Inside
VectorFilterthe body isfor i in range(n): out[i] = a[i] > 100over a primitivedouble[]. The C++ compiler recognises this pattern and emits AVX-512 instructions that compare 8 doubles per cycle. Why autovectorisation works here and not in Volcano: SIMD requires the compiler to prove that the loop body has no aliasing, no exceptions, no virtual calls — exactly the conditions a tight array loop satisfies and a tuple-at-a-timenext()does not. - Branch predictor wins. Inside the batch loop there are no data-dependent branches at all — the comparison
a[i] > 100becomes a SIMD compare that produces 0 or all-1s in each lane, which is then ANDed into a bitmap. No mispredictions because there are no branches. - Cache-friendly. The whole batch (1024 doubles = 8 KB) fits comfortably in L1 cache. The next operator reads it back from L1 with no memory traffic. Compare to Volcano, which can blow L1 chasing a single tuple's pointers.
This is what every modern analytical engine does. ClickHouse calls them "blocks" (typically 65 536 rows). Apache Arrow Compute calls them "RecordBatches" (typically 1024 to 64K rows). DuckDB uses 2048-row "DataChunks". Photon and Velox use similar batch sizes. The choice of 1024 is not arbitrary: it's large enough that dispatch cost vanishes and SIMD has room to ramp up, and small enough that the working set fits in L1 cache (1024 × 8 B = 8 KB, half of typical L1).
SIMD: the inner-loop multiplier
Inside the batch, the real win comes from single-instruction-multiple-data (SIMD) execution. A modern Intel core has 32 × 512-bit AVX-512 registers; a single instruction like vmulpd zmm0, zmm1, zmm2 multiplies eight 64-bit doubles in lockstep in one cycle.
The SIMD multiplier (typically 4× for floats on NEON, 8× for doubles on AVX-512, 16× for floats on AVX-512) stacks on top of the dispatch saving. The Volcano executor pays full dispatch cost AND runs scalar; the vectorised executor pays 1/1024 of the dispatch cost AND runs SIMD. The product is the 20–100× speedup that real engines see in practice. Why SIMD is invisible in tuple-at-a-time: SIMD instructions operate on registers that hold multiple values; loading 8 doubles into a register requires 8 contiguous values in memory, which a per-tuple loop cannot offer.
A vectorised filter, in code
The filter pattern in a vectorised engine has two flavours.
The first materialises a new array of surviving values:
def filter_materialise(values: list[float], threshold: float) -> list[float]:
return [v for v in values if v > threshold] # allocates a new list
That works but pays an allocation cost on every batch. The second flavour, used by every production engine, is the selection vector — return a small array of indices that survived the predicate. Downstream operators iterate over these indices into the original (unfiltered) column.
def filter_selection_vector(values, threshold):
# one tight loop, no allocation per surviving row
sel = []
for i, v in enumerate(values):
if v > threshold:
sel.append(i)
return sel # int[] of indices into `values`
The downstream Aggregate then does:
def sum_with_selection(values, sel):
total = 0.0
for i in sel: # iterate sel, gather from values
total += values[i]
return total
In a real C++ engine these loops compile to SIMD-friendly inner kernels: the > comparison emits vcmpgtpd producing an 8-bit mask per 8 lanes, and the gather is done via vpcompressq or by simply iterating over the dense surviving prefix. Why selection vectors win over materialisation: most analytical filters have selectivities of 1–30%; allocating a fresh contiguous output array on every batch costs more than the filter itself. A selection vector is small (int32[matching_count]) and lets downstream operators reuse the original column array, keeping it hot in cache.
A worked example
A 100M-row filter + group-by, end to end
You are the data engineer at a Bengaluru retail platform. The fact table:
CREATE TABLE sales (
order_id BIGINT,
region VARCHAR(16), -- 4 distinct values: north/south/east/west
category VARCHAR(32), -- 200 distinct values
price DECIMAL(10,2)
);
-- 100,000,000 rows, stored columnar (Parquet)
The product team's daily query:
SELECT region, SUM(price)
FROM sales
WHERE category = 'Electronics'
GROUP BY region;
Both engines below scan the same 100M-row column store — so disk I/O cost is identical. The only variable is the executor.
Tuple-at-a-time path (Volcano):
def volcano_query(scan_iter):
agg = {} # region -> running sum
for row in scan_iter: # 100M next() calls
if row["category"] == "Electronics":
agg[row["region"]] = agg.get(row["region"], 0.0) + row["price"]
return agg
Cost analysis on a 3 GHz core:
- 100M
next()calls × 15 ns dispatch = 1.5 s pure overhead - 100M dict lookups for
category+region+price= ~6 s Python interpreter time (or ~2 s in C++ with hash-map lookups) - 100M conditional branches with ~10% misprediction rate = ~0.3 s stall
- Total wall clock on this query in a Volcano C++ engine: ~10 s
Vectorised path (column batches of 1024):
import numpy as np
BATCH = 1024
ELECTRONICS_CODE = 17 # dictionary code for 'Electronics'
def vector_query(category_chunks, region_chunks, price_chunks):
# Per-region running sums (4 buckets — region cardinality is 4)
sums = np.zeros(4, dtype=np.float64)
for cat, reg, pri in zip(category_chunks, region_chunks, price_chunks):
# SIMD compare across 1024 lanes — one AVX-512 instruction per 8 lanes
mask = (cat == ELECTRONICS_CODE) # bool[1024]
# SIMD masked add into 4 region buckets
survived_reg = reg[mask] # int8[k], k ≈ 5% of 1024
survived_pri = pri[mask] # float64[k]
# 4-bucket histogram-sum — one tight loop per bucket
for r in range(4):
sums[r] += survived_pri[survived_reg == r].sum()
return sums
Cost analysis:
- 100M / 1024 = 97 656 batches
- 97 656 batches × 15 ns dispatch = 1.5 ms (1000× less than Volcano)
- The
cat == 17SIMD compare runs at ~8 lanes/cycle → 100M comparisons in ~12 ms on one core - The masked sum runs at similar throughput → another ~10 ms
- L1 cache stays hot because each 8 KB batch fits in 32 KB L1
- Total wall clock on this query in a vectorised C++ engine: ~0.5 s
Result: 10 s → 0.5 s. A 20× speedup, on the same data, on the same hardware, just by changing how the operators iterate.
You can prove the same effect in Python today by running the equivalent NumPy version of a WHERE x > 100 filter against a Python for loop — NumPy will be 50–100× faster, for exactly the reasons above (vectorised C inner loop versus interpreted per-element work). The ratio in a real query engine is smaller (10–30×) because the C++ Volcano baseline is already much faster than Python, but the underlying physics is identical.
What you give up
Vectorised execution wins almost everywhere on analytical workloads, but it has costs.
Latency-sensitive point queries get worse. A query like SELECT * FROM sales WHERE order_id = 12345 touches one row. A vectorised engine still allocates a 1024-row batch, runs the filter across it, and discards 1023 results. For OLTP point queries, Volcano's per-row pipelining is genuinely faster. Why this is a non-problem in practice: vectorised engines target OLAP, where queries scan millions of rows; OLTP engines stay on Volcano. Hybrid HTAP systems route point queries to a row-store path and analytical queries to the vectorised column path.
Code complexity. Each operator has to handle batch boundaries, partial last batches, selection vectors, null bitmaps, and dictionary-encoded inputs. The Volcano next() interface is one method; the vectorised interface in Velox or Apache Arrow Compute is dozens of kernels per data type. Engineering cost per operator goes up roughly 3–5×.
Memory pressure. Each batch needs its own scratch buffers — input, mask, selection vector, output. For a deep operator pipeline running at 1024 rows × N operators × multiple columns, you can easily allocate megabytes per query thread. Real engines pool these buffers to avoid heap thrashing.
Pipelining vs blocking. Some operators (sort, hash-join build) are blocking — they have to consume all input before producing output. Vectorisation doesn't help these much; the real win is on the streaming operators (scan, filter, project, hash-aggregate, hash-join probe) which dominate analytical query time.
Real systems
Almost every analytical engine built since 2015 is vectorised:
- MonetDB/X100 → Vectorwise → Actian Vector — the original. Boncz et al. introduced the model in 2005; the lineage continues commercially.
- ClickHouse — block-at-a-time execution with 65 536 rows per block, hand-written SIMD kernels for the hot operators. Open source, Polara/now-independent. The reason ClickHouse benchmarks at billions of rows per second per core.
- DuckDB — 2048-row DataChunks, autovectorised C++. Embedded analytical engine; the SQLite of OLAP.
- Apache Arrow Compute — vectorised compute kernels over Arrow's columnar in-memory format. Used by Polars, Spark (via Arrow integration), Pandas 2.x.
- Databricks Photon — C++ vectorised engine that replaces Spark's row-based JVM executor for SQL workloads. SIGMOD 2022.
- Velox — Meta's vectorised execution library, the substrate under Presto, Spark and PyTorch's data loader.
- Snowflake — vectorised columnar execution over micropartitions; the architecture every cloud warehouse copies.
- BigQuery (Capacitor / Dremel) — Querion's columnar engine with vectorised execution and tree-of-aggregators MPP.
Volcano-style row-at-a-time engines persist exactly where they were designed for: OLTP. Postgres, MySQL, Oracle, SQL Server, CockroachDB. The split is clean: transactions stay row-by-row, analytics goes vectorised.
Common confusions
-
"Vectorised execution is the same as SIMD." They are related but distinct. Vectorisation is the engine design — process a batch of 1024 rows per operator call instead of one. SIMD is a CPU feature — execute one arithmetic instruction across 8 lanes in parallel. Vectorisation makes SIMD possible (because the inner loop now runs over a contiguous array) but a vectorised engine still wins even on hardware without SIMD, because the per-row dispatch cost still drops 1000×. ClickHouse on a Raspberry Pi is still ~20× faster than Volcano on the same Pi, even though both engines fall back to scalar code.
-
"Bigger batches are always better." No — once a batch exceeds L1 cache, you start paying L2 latency on every read, and the gains flatten. The 1024–4096 sweet spot exists because
1024 × 8 B = 8 KBfits inside the 32 KB L1 with room for two operand arrays plus a result. ClickHouse's 65 536-row blocks work because ClickHouse fuses operators inside a block so the data stays in registers; a naive non-fused pipeline at 65 536 rows would actually be slower than 1024. -
"NumPy is already a vectorised engine, so I don't need ClickHouse." NumPy gives you vectorised compute kernels but no query planner, no operator pipeline, no spill-to-disk, no parallelism across cores, no hash-aggregate, no joins beyond
np.searchsorted. Polars and DuckDB are what you get when you wrap a NumPy-style kernel layer in an actual SQL engine. The mechanics inside the inner loop are the same; the engineering above it is what makes a database. -
"Vectorised engines don't use indexes." They do, just sparingly and at a different granularity. Instead of B-tree-per-row indexes (Postgres style), analytical engines use min/max statistics per row group ("zone maps") — for each block of 100 K rows, store the min and max of each column. A
WHERE date > '2026-01-01'query then skips entire blocks whosemax(date) < 2026-01-01without reading them. This pruning happens before vectorised execution starts; the two techniques compose. -
"Volcano is dead — every engine should be vectorised." Volcano is alive and correct for OLTP. A
SELECT * FROM users WHERE id = 12345query on Postgres reads exactly one B-tree leaf, returns one row, and rounds to microseconds. The same query on a vectorised engine still allocates a 1024-row batch buffer and pays cache-line touch overhead — Postgres is genuinely faster on point queries. Vectorisation pays back only when the query touches at least a few thousand rows. This is why hybrid systems (TiDB, SingleStore, Postgres + Citus columnar) keep both engines and route at plan time. -
"Selection vectors are always better than materialisation." Only when selectivity is low (say, <50%). At very high selectivity (
WHERE x > 0on positive-only data), a selection vector that lists 1023 of 1024 indices is wasteful and the gather costs more than just copying the values. Real engines (DuckDB, Velox) measure selectivity per batch and switch strategies — selection vector for sparse output, materialised dense array for dense output.
Compilation, fusion, and the next frontier
Vectorised execution is not the last word. Two extensions keep pushing the envelope:
Compiled queries (HyPer, Umbra, DataFusion)
Instead of having a generic VectorFilter operator that works for any predicate, compile each query into custom machine code at plan time. The Hyper/Umbra system from TU Munich uses LLVM to generate a single tight loop fused across SCAN + FILTER + AGG, with the predicate inlined. This eliminates all operator-boundary overhead, not just per-row dispatch. The downside is compile time — typically 10–100 ms — which only pays back on queries that scan more than a few million rows. Modern engines (DuckDB, DataFusion, Velox) often combine vectorisation with light JIT compilation: vectorised by default, compiled when the query is heavy enough.
Operator fusion and pipelining
Even within vectorisation, you can do better than "filter writes a new array then aggregate reads it." Fusion combines adjacent operators into one kernel that does the filter and the aggregation in the same pass over the input — keeping data in SIMD registers across operator boundaries. ClickHouse and Photon both fuse common patterns like WHERE x > c GROUP BY y. The performance gain is typically 1.5–2× on top of vectorisation alone, because fused kernels touch each value exactly once instead of twice.
Adaptive batch sizing
The 1024 batch size is a heuristic that balances dispatch amortisation against L1 cache footprint. Adaptive engines measure cache-miss rates at runtime and shrink batches when they overflow L1, or grow them when miss rates are low. Photon and Velox both ship adaptive sizing; the gain is 5–15% on cache-bound workloads.
Going beyond CPU SIMD
The next frontier is GPU vectorised execution — engines like HeavyDB, BlazingSQL, and Snowflake's experimental GPU lanes process batches of millions of rows on GPU cores, exploiting the same vectorisation principle at 1000× the parallel width. The bottleneck shifts from CPU dispatch to PCIe bandwidth between disk and GPU, which is a different engineering problem entirely.
Summary
Three things to carry forward:
- The unit of work matters as much as the layout. Column storage gets the right bytes off disk. Vectorised execution gets useful work out of those bytes once they arrive in CPU. Both are needed for the 50–100× analytical speedup; either alone gives only a fraction.
- Volcano was right for 1994 hardware and is wrong for 2026 hardware. The cost ratio between dispatch and useful work has flipped — what used to round to zero now dominates. Modern CPUs reward tight inner loops over contiguous arrays and punish virtual-call-per-row pipelines.
- The batch interface is the entire trick. Change
next() -> tupletonext_batch() -> column[1024]and almost every other optimisation falls out automatically: dispatch amortises, branches vanish, autovectorisation kicks in, cache footprint shrinks. Every modern OLAP engine — ClickHouse, DuckDB, Photon, Velox, Snowflake, BigQuery — is built around this one idea.
The next chapter takes the vectorised executor and pairs it with dictionary and run-length encoding so the data being vectorised is also compressed in memory — the third multiplicative win after columnar layout and batched execution.
References
- Boncz, Zukowski, Nes, "MonetDB/X100: Hyper-Pipelining Query Execution", CIDR 2005
- Apache Arrow Compute — Vectorised Kernel Documentation
- Behm et al., "Photon: A Fast Query Engine for Lakehouse Systems", SIGMOD 2022
- ClickHouse — Architectural Overview (Engineering Blog)
- Raasveldt, Mühleisen, "DuckDB: an Embeddable Analytical Database", CIDR 2020
- Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016