Vectorized Execution: 1000 Rows at a Time

In short

The classical query executor — the Volcano model that has shipped in Postgres, MySQL, Oracle and SQL Server for thirty years — implements every operator as a class with a next() method that returns one tuple at a time. SCAN's next() reads a row from disk, hands it to FILTER's next(), which checks the predicate and either passes it up or pulls another. The composition is gorgeous: any operator plugs into any other through the same one-tuple interface, and the planner can rearrange the tree freely. For OLTP, where a query touches dozens of rows, this is exactly right.

For OLAP, where a query touches a hundred million rows, this is a disaster. Each next() call is a virtual function dispatch — 5 to 10 nanoseconds of pure overhead. Multiply by 100M rows and four operators in the pipeline and you have spent two full seconds on dispatch alone, before adding a single price or comparing a single region string. Worse, every row goes through if predicate branches that the CPU's branch predictor cannot learn (data is essentially random), so the pipeline stalls 5–15% of the time. And worst of all, the CPU's vector units — AVX-512 on Intel, NEON on ARM, the silicon that can add 8 doubles in one cycle — sit completely idle because the executor is looking at one value at a time.

The fix, introduced by Boncz, Zukowski and Nes in MonetDB/X100 at CIDR 2005 and now the default in every modern analytical engine, is vectorised execution: each operator processes a batch of 1024 (or 4096) rows at a time as columnar arrays, not one tuple at a time. Filter takes a double[1024] of prices in, returns either a filtered double[] or a small selection vector of matching indices. Aggregate sums an array. The virtual call cost is paid once per 1024 rows — 5 ns / 1024 = 5 picoseconds per row — and the inner loops are tight enough that the C++ compiler autovectorises them into AVX-512 SIMD. The same SIMD instruction does eight > comparisons or eight additions in one cycle, giving an additional 8× on top of the dispatch saving. Combined with the columnar layout from the previous chapter, vectorised execution is what turns a 10-second analytical scan into a 0.5-second one — the 20× to 100× speedup that the DuckDB engine paper and Databricks Photon report on real TPC-H workloads.

You already know this intuition from data science: a four-line NumPy expression (arr > 100).sum() runs 50–100× faster than the equivalent Python for loop, for exactly the same reason — the heavy lifting moves into a tight C inner loop that processes the whole array in one call. Vectorised execution is that idea, generalised to a SQL engine.

You spent the previous chapter learning that a column store reads only the bytes the query asked for — the layout wins by total_columns / referenced_columns, often 25× to 50× before any other trick. That is a storage win: it shrinks the bytes that arrive at the CPU.

But once those bytes have arrived in RAM, the executor still has to do work on them. And here the classical Volcano executor — the one every textbook teaches — turns out to be the wrong shape for analytical workloads. It was designed for the pipeline where SCAN feeds FILTER feeds AGG one tuple at a time, and it composes operators beautifully. It is also, when you measure it on a hundred-million-row table, slower than the disk it is reading from. The CPU becomes the bottleneck. Modern columnar engines fix this by processing batches.

This chapter derives why, builds a tiny vectorised executor in Python so you can feel the difference yourself, and ends with the worked example that the rest of Build 15 will lean on: a WHERE region = 'south' GROUP BY region SUM(price) query running 20× faster on the vectorised path than on the tuple path.

The Volcano model, recap

In the iterator model — named after the Volcano paper by Goetz Graefe in 1994 — every operator is a class with one method:

class Operator:
    def next(self) -> Tuple | None:
        """Return the next tuple, or None when exhausted."""
        ...

A query like SELECT SUM(price) FROM sales WHERE region = 'south' becomes a tree of operators chained by next() calls:

        Aggregate (SUM)
            |
        Filter (region == 'south')
            |
        Scan (sales)

Aggregate.next() calls Filter.next() in a loop until it returns None. Each call to Filter.next() calls Scan.next() and either returns the row or pulls another. Each call to Scan.next() reads one row from the heap.

This is beautiful software engineering — every operator implements the same one-method interface, the planner can wire them up in any order, you can unit-test each operator in isolation. And it is exactly right for OLTP, where a query touches twenty rows and the dispatch cost rounds to zero.

Where tuple-at-a-time falls apart

Run that same model on a hundred-million-row scan and four problems compound:

1. Function-call overhead. Each next() is a virtual call — the compiler doesn't know which subclass op.next() will dispatch to, so it emits an indirect jump through the vtable. On modern CPUs that's roughly 5 to 10 nanoseconds per call (one indirect branch plus the function prologue). For a three-deep operator tree, that is 15–30 ns per row. Multiply by 100M rows and you have spent 1.5 to 3 seconds on dispatch before doing any work. Why this matters now and not in 1994: in 1994 the actual per-row work — disk seek, decompression, predicate evaluation — was so much more expensive than the dispatch that nobody noticed. In 2026, with NVMe SSDs delivering 7 GB/s and predicates being one CPU instruction, dispatch is a measurable fraction of total time.

2. Branch misprediction. Inside Filter.next() there is an if predicate(tuple): return tuple else: continue. The CPU's branch predictor learns patterns — "this branch is usually taken" — but a WHERE region = 'south' predicate on randomly-ordered data is essentially a coin flip. Every misprediction stalls the pipeline by 15–20 cycles while the CPU throws away speculative work. On a typical analytical filter you lose 5–15% of total CPU time to mispredictions alone.

3. No SIMD. Modern x86 CPUs have AVX-512 registers that hold 8 doubles or 16 floats and can run one operation across all of them in a single cycle. ARM has NEON (4 doubles per register) and SVE. These are the silicon equivalent of going from a single carpenter to a row of eight working in lockstep. You can only use them on contiguous arrays of the same type. A tuple-at-a-time executor sees one value, then moves to the next operator; the SIMD units are completely idle. The CPU is running at maybe 10% of its arithmetic throughput.

4. Cache behaviour. Each tuple in Volcano is typically a struct or a dict — {order_id: 1234, region: "south", price: 99.50, ...} — with pointers to variable-length strings. Walking the tree pulls fragments from many different cache lines, evicting useful data. The CPU's L1 cache (32 KB) and L2 cache (256–1024 KB) end up thrashing, and you stall on memory loads instead of running instructions.

Add these together and a row-by-row Volcano executor on 100M rows runs at maybe 30–100 million rows per second on a modern core. A vectorised executor on the same hardware runs at 1–3 billion rows per second — a 20× to 100× speedup, and that is the gap Photon and DuckDB report on real TPC-H queries.

The vectorised fix

The insight, due to Boncz, Zukowski and Nes in 2005, is the obvious one once you've seen the problem: change the unit of work from one tuple to one batch. Each operator now takes a batch of (say) 1024 values in and returns a batch out.

class VectorOperator:
    def next_batch(self) -> ColumnBatch | None:
        """Return the next batch of up to BATCH rows, or None when exhausted."""
        ...

The same query becomes:

        VectorAggregate (SUM, batch=1024)
            |
        VectorFilter (region == 'south', batch=1024)
            |
        VectorScan (sales, batch=1024)

The change looks small — next() becomes next_batch() — but the consequences cascade through every layer:

Dispatch amortised. One virtual call per 1024 rows means dispatch cost drops from ~15 ns/row to ~15 ps/row. Three orders of magnitude.
Tight inner loops. Inside VectorFilter the body is for i in range(n): out[i] = a[i] > 100 over a primitive double[]. The C++ compiler recognises this pattern and emits AVX-512 instructions that compare 8 doubles per cycle. Why autovectorisation works here and not in Volcano: SIMD requires the compiler to prove that the loop body has no aliasing, no exceptions, no virtual calls — exactly the conditions a tight array loop satisfies and a tuple-at-a-time next() does not.
Branch predictor wins. Inside the batch loop there are no data-dependent branches at all — the comparison a[i] > 100 becomes a SIMD compare that produces 0 or all-1s in each lane, which is then ANDed into a bitmap. No mispredictions because there are no branches.
Cache-friendly. The whole batch (1024 doubles = 8 KB) fits comfortably in L1 cache. The next operator reads it back from L1 with no memory traffic. Compare to Volcano, which can blow L1 chasing a single tuple's pointers.

This is what every modern analytical engine does. ClickHouse calls them "blocks" (typically 65 536 rows). Apache Arrow Compute calls them "RecordBatches" (typically 1024 to 64K rows). DuckDB uses 2048-row "DataChunks". Photon and Velox use similar batch sizes. The choice of 1024 is not arbitrary: it's large enough that dispatch cost vanishes and SIMD has room to ramp up, and small enough that the working set fits in L1 cache (1024 × 8 B = 8 KB, half of typical L1).

SIMD: the inner-loop multiplier

Inside the batch, the real win comes from single-instruction-multiple-data (SIMD) execution. A modern Intel core has 32 × 512-bit AVX-512 registers; a single instruction like vmulpd zmm0, zmm1, zmm2 multiplies eight 64-bit doubles in lockstep in one cycle.

The SIMD multiplier (typically 4× for floats on NEON, 8× for doubles on AVX-512, 16× for floats on AVX-512) stacks on top of the dispatch saving. The Volcano executor pays full dispatch cost AND runs scalar; the vectorised executor pays 1/1024 of the dispatch cost AND runs SIMD. The product is the 20–100× speedup that real engines see in practice. Why SIMD is invisible in tuple-at-a-time: SIMD instructions operate on registers that hold multiple values; loading 8 doubles into a register requires 8 contiguous values in memory, which a per-tuple loop cannot offer.

A vectorised filter, in code

The filter pattern in a vectorised engine has two flavours.

The first materialises a new array of surviving values:

def filter_materialise(values: list[float], threshold: float) -> list[float]:
    return [v for v in values if v > threshold]   # allocates a new list

That works but pays an allocation cost on every batch. The second flavour, used by every production engine, is the selection vector — return a small array of indices that survived the predicate. Downstream operators iterate over these indices into the original (unfiltered) column.

def filter_selection_vector(values, threshold):
    # one tight loop, no allocation per surviving row
    sel = []
    for i, v in enumerate(values):
        if v > threshold:
            sel.append(i)
    return sel    # int[] of indices into `values`

The downstream Aggregate then does:

def sum_with_selection(values, sel):
    total = 0.0
    for i in sel:                # iterate sel, gather from values
        total += values[i]
    return total

In a real C++ engine these loops compile to SIMD-friendly inner kernels: the > comparison emits vcmpgtpd producing an 8-bit mask per 8 lanes, and the gather is done via vpcompressq or by simply iterating over the dense surviving prefix. Why selection vectors win over materialisation: most analytical filters have selectivities of 1–30%; allocating a fresh contiguous output array on every batch costs more than the filter itself. A selection vector is small (int32[matching_count]) and lets downstream operators reuse the original column array, keeping it hot in cache.

A worked example

A 100M-row filter + group-by, end to end

You are the data engineer at a Bengaluru retail platform. The fact table:

CREATE TABLE sales (
  order_id     BIGINT,
  region       VARCHAR(16),     -- 4 distinct values: north/south/east/west
  category     VARCHAR(32),     -- 200 distinct values
  price        DECIMAL(10,2)
);
-- 100,000,000 rows, stored columnar (Parquet)

The product team's daily query:

SELECT region, SUM(price)
FROM sales
WHERE category = 'Electronics'
GROUP BY region;

Both engines below scan the same 100M-row column store — so disk I/O cost is identical. The only variable is the executor.

Tuple-at-a-time path (Volcano):

def volcano_query(scan_iter):
    agg = {}                             # region -> running sum
    for row in scan_iter:                # 100M next() calls
        if row["category"] == "Electronics":
            agg[row["region"]] = agg.get(row["region"], 0.0) + row["price"]
    return agg

Cost analysis on a 3 GHz core:

100M next() calls × 15 ns dispatch = 1.5 s pure overhead
100M dict lookups for category + region + price = ~6 s Python interpreter time (or ~2 s in C++ with hash-map lookups)
100M conditional branches with ~10% misprediction rate = ~0.3 s stall
Total wall clock on this query in a Volcano C++ engine: ~10 s

Vectorised path (column batches of 1024):

import numpy as np

BATCH = 1024
ELECTRONICS_CODE = 17        # dictionary code for 'Electronics'

def vector_query(category_chunks, region_chunks, price_chunks):
    # Per-region running sums (4 buckets — region cardinality is 4)
    sums = np.zeros(4, dtype=np.float64)
    for cat, reg, pri in zip(category_chunks, region_chunks, price_chunks):
        # SIMD compare across 1024 lanes — one AVX-512 instruction per 8 lanes
        mask = (cat == ELECTRONICS_CODE)              # bool[1024]
        # SIMD masked add into 4 region buckets
        survived_reg = reg[mask]                      # int8[k], k ≈ 5% of 1024
        survived_pri = pri[mask]                      # float64[k]
        # 4-bucket histogram-sum — one tight loop per bucket
        for r in range(4):
            sums[r] += survived_pri[survived_reg == r].sum()
    return sums

Cost analysis:

100M / 1024 = 97 656 batches
97 656 batches × 15 ns dispatch = 1.5 ms (1000× less than Volcano)
The cat == 17 SIMD compare runs at ~8 lanes/cycle → 100M comparisons in ~12 ms on one core
The masked sum runs at similar throughput → another ~10 ms
L1 cache stays hot because each 8 KB batch fits in 32 KB L1
Total wall clock on this query in a vectorised C++ engine: ~0.5 s

Result: 10 s → 0.5 s. A 20× speedup, on the same data, on the same hardware, just by changing how the operators iterate.

You can prove the same effect in Python today by running the equivalent NumPy version of a WHERE x > 100 filter against a Python for loop — NumPy will be 50–100× faster, for exactly the reasons above (vectorised C inner loop versus interpreted per-element work). The ratio in a real query engine is smaller (10–30×) because the C++ Volcano baseline is already much faster than Python, but the underlying physics is identical.

What you give up

Vectorised execution wins almost everywhere on analytical workloads, but it has costs.

Latency-sensitive point queries get worse. A query like SELECT * FROM sales WHERE order_id = 12345 touches one row. A vectorised engine still allocates a 1024-row batch, runs the filter across it, and discards 1023 results. For OLTP point queries, Volcano's per-row pipelining is genuinely faster. Why this is a non-problem in practice: vectorised engines target OLAP, where queries scan millions of rows; OLTP engines stay on Volcano. Hybrid HTAP systems route point queries to a row-store path and analytical queries to the vectorised column path.

Code complexity. Each operator has to handle batch boundaries, partial last batches, selection vectors, null bitmaps, and dictionary-encoded inputs. The Volcano next() interface is one method; the vectorised interface in Velox or Apache Arrow Compute is dozens of kernels per data type. Engineering cost per operator goes up roughly 3–5×.

Memory pressure. Each batch needs its own scratch buffers — input, mask, selection vector, output. For a deep operator pipeline running at 1024 rows × N operators × multiple columns, you can easily allocate megabytes per query thread. Real engines pool these buffers to avoid heap thrashing.

Pipelining vs blocking. Some operators (sort, hash-join build) are blocking — they have to consume all input before producing output. Vectorisation doesn't help these much; the real win is on the streaming operators (scan, filter, project, hash-aggregate, hash-join probe) which dominate analytical query time.

Real systems

Almost every analytical engine built since 2015 is vectorised:

MonetDB/X100 → Vectorwise → Actian Vector — the original. Boncz et al. introduced the model in 2005; the lineage continues commercially.
ClickHouse — block-at-a-time execution with 65 536 rows per block, hand-written SIMD kernels for the hot operators. Open source, Yandex/now-independent. The reason ClickHouse benchmarks at billions of rows per second per core.
DuckDB — 2048-row DataChunks, autovectorised C++. Embedded analytical engine; the SQLite of OLAP.
Apache Arrow Compute — vectorised compute kernels over Arrow's columnar in-memory format. Used by Polars, Spark (via Arrow integration), Pandas 2.x.
Databricks Photon — C++ vectorised engine that replaces Spark's row-based JVM executor for SQL workloads. SIGMOD 2022.
Velox — Meta's vectorised execution library, the substrate under Presto, Spark and PyTorch's data loader.
Snowflake — vectorised columnar execution over micropartitions; the architecture every cloud warehouse copies.
BigQuery (Capacitor / Dremel) — Google's columnar engine with vectorised execution and tree-of-aggregators MPP.

Volcano-style row-at-a-time engines persist exactly where they were designed for: OLTP. Postgres, MySQL, Oracle, SQL Server, CockroachDB. The split is clean: transactions stay row-by-row, analytics goes vectorised.

Compilation, fusion, and the next frontier

Vectorised execution is not the last word. Two extensions keep pushing the envelope:

Compiled queries (HyPer, Umbra, DataFusion)

Instead of having a generic VectorFilter operator that works for any predicate, compile each query into custom machine code at plan time. The Hyper/Umbra system from TU Munich uses LLVM to generate a single tight loop fused across SCAN + FILTER + AGG, with the predicate inlined. This eliminates all operator-boundary overhead, not just per-row dispatch. The downside is compile time — typically 10–100 ms — which only pays back on queries that scan more than a few million rows. Modern engines (DuckDB, DataFusion, Velox) often combine vectorisation with light JIT compilation: vectorised by default, compiled when the query is heavy enough.

Operator fusion and pipelining

Even within vectorisation, you can do better than "filter writes a new array then aggregate reads it." Fusion combines adjacent operators into one kernel that does the filter and the aggregation in the same pass over the input — keeping data in SIMD registers across operator boundaries. ClickHouse and Photon both fuse common patterns like WHERE x > c GROUP BY y. The performance gain is typically 1.5–2× on top of vectorisation alone, because fused kernels touch each value exactly once instead of twice.

Adaptive batch sizing

The 1024 batch size is a heuristic that balances dispatch amortisation against L1 cache footprint. Adaptive engines measure cache-miss rates at runtime and shrink batches when they overflow L1, or grow them when miss rates are low. Photon and Velox both ship adaptive sizing; the gain is 5–15% on cache-bound workloads.

Going beyond CPU SIMD

The next frontier is GPU vectorised execution — engines like HeavyDB, BlazingSQL, and Snowflake's experimental GPU lanes process batches of millions of rows on GPU cores, exploiting the same vectorisation principle at 1000× the parallel width. The bottleneck shifts from CPU dispatch to PCIe bandwidth between disk and GPU, which is a different engineering problem entirely.

Summary

Three things to carry forward:

The unit of work matters as much as the layout. Column storage gets the right bytes off disk. Vectorised execution gets useful work out of those bytes once they arrive in CPU. Both are needed for the 50–100× analytical speedup; either alone gives only a fraction.
Volcano was right for 1994 hardware and is wrong for 2026 hardware. The cost ratio between dispatch and useful work has flipped — what used to round to zero now dominates. Modern CPUs reward tight inner loops over contiguous arrays and punish virtual-call-per-row pipelines.
The batch interface is the entire trick. Change next() -> tuple to next_batch() -> column[1024] and almost every other optimisation falls out automatically: dispatch amortises, branches vanish, autovectorisation kicks in, cache footprint shrinks. Every modern OLAP engine — ClickHouse, DuckDB, Photon, Velox, Snowflake, BigQuery — is built around this one idea.

The next chapter takes the vectorised executor and pairs it with dictionary and run-length encoding so the data being vectorised is also compressed in memory — the third multiplicative win after columnar layout and batched execution.