In short
Disk and main memory both transfer data in pages — typically 8 KB at the storage layer, 64 B per CPU cache line. A row-store packs every column of one row into the same page, so to read a single column across N rows you have to drag N whole rows off disk and discard everything except the one field you wanted. For a 50-column table whose columns average 8 bytes each, 98% of the bytes you read are thrown away. A 100 GB row-store table where you only need one 8-byte column becomes a 100 GB scan to extract 800 MB of useful data — at 400 MB/s SSD bandwidth, that is over four minutes of disk wall-clock for a query that only needed about two seconds of useful work.
A column store flips the layout: each column lives in its own contiguous file (or its own segment within a file). Reading one column reads only that column's bytes — the same query now scans 800 MB and finishes in two seconds. On top of that, a single-column file holds homogeneous values of one type, which compresses 5–20× better than a heterogeneous row, and lets the CPU process the data with SIMD instructions that work on 8 or 16 values per cycle. The combined speedup on real analytical workloads is 50× to 200×, which is why every modern data warehouse — Snowflake, BigQuery, Redshift, ClickHouse, DuckDB — stores its base tables column by column.
The price you pay is that point updates and full-row reads become expensive: writing one row means writing into 50 separate column files, and reading "give me everything about user 12345" means seeking 50 places instead of one. That is the core OLTP/OLAP split, and it is why production stacks now run two engines side by side (HTAP systems like TiDB and SingleStore keep both layouts in sync), with row storage handling the transactional path and column storage handling the analytical one. This chapter derives the bandwidth math, builds a tiny in-memory engine in both layouts, and clocks the same AVG(price) query against each to see the 60× speedup with your own eyes.
You have spent the previous 118 chapters building progressively stronger OLTP machinery: B-trees, write-ahead logs, MVCC, concurrency control, query planning, joins, distribution, consensus. Every one of those layers was tuned for a workload where queries touch a small number of rows by primary key — fetch user 12345, debit account 99887, insert order 4561. That is the world Postgres, MySQL, Oracle, SQL Server and CockroachDB live in, and they live in it well.
This chapter opens Build 15 — the columnar OLAP engine — by explaining why every one of those engines, no matter how well-tuned, collapses by two orders of magnitude the moment your workload changes shape from "fetch one row by key" to "scan a hundred million rows and aggregate one column."
The cause is not a bug in any of those engines. It is the fundamental physics of their on-disk layout. Rows are stored together. Analytics reads columns. The mismatch is not patchable with a better B-tree or a smarter optimiser — it requires a different storage layout. That is what the rest of Build 15 will construct: column chunks, dictionary and run-length encoding, vectorised execution, MPP shuffle, and the lakehouse table formats (Iceberg, Delta) that have eaten the modern data stack.
But first, we need to feel why row storage loses. Not as an abstraction — as actual bytes on actual disks moving at actual bandwidth.
The two workloads
Look at two SQL queries against the same orders table, which has 50 columns and 100 million rows on a Flipkart-scale e-commerce backend.
-- OLTP: "show me one customer's recent orders"
SELECT * FROM orders WHERE customer_id = 12345 ORDER BY ordered_at DESC LIMIT 20;
-- OLAP: "what is the average order value in the south region?"
SELECT AVG(price) FROM orders WHERE region = 'south';
The first query touches 20 rows, all 50 columns. The second touches 100 million rows, two columns (price and region). Both are valid workloads, both are common in any real business, and both are usually expected to be fast. But they have opposite access patterns:
- OLTP wants all columns of a few rows — narrow and tall
- OLAP wants a few columns of all rows — wide and short
A storage layout can be optimised for one or the other, but not both. Row storage — the layout that has dominated databases since System R in the 1970s — is optimised for the first. Why: a row-store keeps all of a row's columns adjacent in memory, so fetching the row is one sequential read of row_size bytes from one disk offset; that is the cheapest possible operation a storage engine can do.
The cost of that decision is what we are about to derive.
How disks and pages actually move data
Both spinning disks and SSDs transfer data in pages, not bytes. A page is the smallest unit the storage engine reads or writes — 8 KB in Postgres, 16 KB in InnoDB, 64 KB to 1 MB on cloud object stores. Why pages: the cost of one I/O operation is dominated by the seek and the per-request overhead, not the bytes; once you've paid to talk to the device you may as well grab a whole page. Smaller pages waste IOPS, larger pages waste bandwidth on partial reads.
Even main memory has the same property at smaller scale: the CPU pulls 64-byte cache lines from RAM, never single bytes. So the same "you must read a whole page even if you only want one field" rule applies at every level of the memory hierarchy.
This means the question "how much data does my query really read?" is not "how many bytes did my SQL ask for?" but "how many pages did the storage engine have to drag off disk to satisfy that ask?" If a single page contains useful and useless bytes, you pay for the whole page anyway. That ratio — useful bytes per page read — is the effective bandwidth of the layout, and it is what decides whether your scan finishes in 4 seconds or 4 minutes.
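The effective-bandwidth ratio can be computed directly. A minimal back-of-envelope sketch, assuming 8 KB pages and the 1 KB rows used later in this chapter:

```python
# Useful bytes per page when the query needs one 8-byte column
# (assumed sizes: 8 KB pages, 1 KB rows — the chapter's running example)
PAGE_BYTES = 8192
ROW_BYTES = 1024
WANTED_BYTES = 8

rows_per_page = PAGE_BYTES // ROW_BYTES          # 8 rows fit per page
useful_per_page = rows_per_page * WANTED_BYTES   # 64 useful bytes
useful_fraction = useful_per_page / PAGE_BYTES   # 1/128 of every page read

DISK_MBPS = 400                                  # nominal SSD bandwidth
effective_mbps = DISK_MBPS * useful_fraction     # bandwidth of *useful* data

print(f"useful fraction: {useful_fraction:.2%}")
print(f"effective bandwidth: {effective_mbps:.3f} MB/s")
```

A 400 MB/s device delivers barely 3 MB/s of data the query actually wanted — the layout, not the hardware, sets the ceiling.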
Row storage, drawn
Here is what a row-store layout looks like on disk for an orders table with 8 columns abbreviated as a through h. Each row's columns are packed tightly together; rows are then packed tightly into pages.
The key thing this picture shows is that the colour you want — say light blue, column c — is sprinkled across every page. There is no way to read just the blue boxes without reading the whole page they sit inside, because the storage device only knows how to deliver pages, not bytes-of-your-choosing.
So a query that asks for just column c reads every byte of every page — 100% of the data on disk — and then discards 7/8 of it in CPU. The CPU work to throw away the unwanted columns is real (parsing each row's variable-length encoding to skip past a, b, then read c, then skip d–h), but it is small compared to the disk cost. The disk cost is the killer.
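To make the "whole page for one field" cost concrete, here is a toy fixed-width row codec — an assumption for illustration only; real heap pages use tuple headers and variable-length encodings — showing that even extracting one 8-byte field requires the full page in memory:

```python
import struct

ROW_SIZE = 64   # toy rows: 8 columns ("a" through "h") x 8 bytes each
COL_BYTES = 8

def read_col(page: bytes, row_idx: int, col_idx: int) -> int:
    # The page is already fully in memory — the device delivered all of it;
    # we merely index into it to pull out one field.
    offset = row_idx * ROW_SIZE + col_idx * COL_BYTES
    (value,) = struct.unpack_from("<q", page, offset)
    return value

# Build a page of 8 rows; row r holds the values r*100 .. r*100+7
page = b"".join(struct.pack("<8q", *range(r * 100, r * 100 + 8))
                for r in range(8))
assert len(page) == 512             # every one of these bytes was read...
assert read_col(page, 2, 2) == 202  # ...to extract this single 8-byte field
```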
Column storage, drawn
Now look at the same table laid out column-first. Each column gets its own contiguous run on disk; rows are reconstructed only when the query asks for them.
Three things change when you look at this picture:
- Selective reads. A query that wants just column c reads only the blue strip — the storage engine seeks to c's file (or to the column-c chunk inside a Parquet row group), reads that contiguous range, and is done. Pages now contain 100% useful bytes for this query. Why this is a 50× win: in a 50-column table the row store had to read 50× as much data to reach one column; the column store reads exactly 1× — that ratio is the speedup, before any other optimisation.
- Compression. A row-store page mixes types: an int customer-id, a string region, a float price, a date, a bool. Generic compressors (LZ4, Snappy) struggle on heterogeneous data. A column-store page holds one type, often one value repeated thousands of times (think region = 'south' for every order from Bengaluru) — perfect for run-length and dictionary encoding, which routinely give 5×–20× compression on real data.
- SIMD. Modern CPUs have vector instructions (AVX-512, NEON) that perform the same operation on 8 or 16 values in one cycle — but only when those values are the same type, packed adjacent in memory. That is exactly what a column chunk gives you. A row-store page cannot feed SIMD, because consecutive bytes belong to different columns of different types.
Together, these three give the 50×–200× speedup that is the whole reason data warehouses exist as a separate species from OLTP databases.
The actual numbers, drawn
Let's put numbers on it for a concrete query — SELECT AVG(price) WHERE region = 'south' on a 100-million-row sales table with 50 columns. Each row is 1 KB, so the table is 100 GB. The price column is 8 bytes, the region column is 8 bytes (after dictionary encoding), so the data the query actually needs is 100M × 16 B = 1.6 GB. Assume a 400 MB/s SSD.
A 62× speedup. From layout alone. Before you have applied a single byte of compression, before you have used a single SIMD instruction, before you have parallelised across cores or nodes. Just by storing the data column-by-column instead of row-by-row.
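The derivation is easy to reproduce. A quick sketch, treating the table as exactly 100M × 1 KB (= 102.4 GB; the 62× figure above rounds the table down to a flat 100 GB, so the byte ratio here comes out as 1024/16 = 64×):

```python
# I/O-only model of SELECT AVG(price) WHERE region = 'south'
ROWS = 100_000_000
ROW_BYTES = 1024       # full 50-column row
NEEDED_BYTES = 16      # price (8 B) + region (8 B, dict-encoded)
SSD_MBPS = 400         # sustained sequential throughput

row_scan_s = ROWS * ROW_BYTES / 1e6 / SSD_MBPS     # scan every heap page
col_scan_s = ROWS * NEEDED_BYTES / 1e6 / SSD_MBPS  # scan just two columns

print(f"row store : {row_scan_s:.0f} s")           # 256 s ≈ 4 min 16 s
print(f"col store : {col_scan_s:.1f} s")           # 4.0 s
print(f"speedup   : {row_scan_s / col_scan_s:.0f}×")
```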
This is the core economic engine that pays for the entire data warehouse industry. Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, DuckDB — every one of them is selling, in essence, this 60×-or-better speedup on analytical queries, paired with a query engine that knows how to exploit the layout. The first system to industrialise this was C-Store / Vertica in 2005; MonetDB had been pushing in-memory column-vector execution since the late 1990s; Apache Parquet is the open file format that brought the layout to Hadoop and the lakehouse stack.
A worked example
A 100M-row sales scan, end to end
You are a data engineer at a Bengaluru e-commerce startup. The fact table is sales:
CREATE TABLE sales (
order_id BIGINT, -- 8 B
customer_id BIGINT, -- 8 B
product_id BIGINT, -- 8 B
region VARCHAR(16), -- 8 B (dict-encoded average)
city VARCHAR(32), -- 16 B
ordered_at TIMESTAMP, -- 8 B
delivered_at TIMESTAMP, -- 8 B
price DECIMAL(10,2), -- 8 B
quantity INT, -- 4 B
-- ...41 more columns, totalling 1024 B per row...
);
Row count: 100 million. Total table size: 100 GB. Storage: AWS gp3 EBS, 400 MB/s sustained throughput.
The marketing team runs a daily query:
SELECT AVG(price) FROM sales WHERE region = 'south';
Row-store path (Postgres):
The planner picks a sequential scan because region has no useful index (low cardinality — only 4 values). The executor opens the heap file, reads page after page, parses each 1 KB row tuple, decodes the region column to test the predicate, decodes the price column to add to the running sum.
- Bytes read from disk: 100 GB (every page of the heap)
- Wall-clock from I/O alone: 100 × 1024 MB / 400 MB/s = 256 s ≈ 4 min 16 s
- Useful bytes: 100M × 16 B = 1.6 GB — the rest is bandwidth waste
Column-store path (DuckDB or ClickHouse over the same data):
The planner sees that only price and region are referenced. It opens the region column chunk and the price column chunk; the other 48 columns are not even touched. Within the region chunk, dictionary encoding has reduced 100 M strings to a 4-entry dictionary plus a 100M-entry array of 1-byte codes, giving a chunk of 100 MB. The price chunk, with delta encoding on the integer cents, compresses to about 400 MB.
- Bytes read from disk: 500 MB (compressed region + price chunks)
- Wall-clock from I/O alone: 500 MB / 400 MB/s = 1.25 s
- The CPU then runs a vectorised filter (region_code == south_code) and a vectorised sum on the surviving prices, both at multi-GB/s throughput
Result: 256 s vs 1.25 s. A 200× speedup, on the same hardware, on the same data, just by changing the on-disk layout.
This is not a thought experiment. It is approximately the gap you will measure if you load the same TPC-H 100 GB dataset into Postgres and into DuckDB and run the same SELECT AVG(...).
A tiny in-memory engine, both ways
Let's build a minimal version of both layouts in Python so the cost calculation stops being abstract.
# storage/layouts.py
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class RowStore:
    """Each row is a dict {col_name: value}. All columns of a row live together."""
    rows: List[Dict[str, Any]]
    row_size_bytes: int  # average bytes per row (all columns)

    def scan_avg(self, target_col: str, filter_col: str,
                 filter_val: Any) -> Tuple[float, int]:
        # Engine MUST traverse every row; each row touch reads `row_size_bytes`
        total, count, bytes_read = 0.0, 0, 0
        for r in self.rows:
            bytes_read += self.row_size_bytes  # whole row pulled off disk
            if r[filter_col] == filter_val:
                total += r[target_col]
                count += 1
        return total / count, bytes_read

@dataclass
class ColumnStore:
    """Each column is its own list. Columns are read independently."""
    columns: Dict[str, List[Any]]
    col_size_bytes: Dict[str, int]  # bytes per value, per column

    def scan_avg(self, target_col: str, filter_col: str,
                 filter_val: Any) -> Tuple[float, int]:
        # Engine reads ONLY the two referenced columns
        n = len(self.columns[filter_col])
        bytes_read = n * (self.col_size_bytes[filter_col]
                          + self.col_size_bytes[target_col])
        total, count = 0.0, 0
        f_col = self.columns[filter_col]
        t_col = self.columns[target_col]
        for i in range(n):
            if f_col[i] == filter_val:
                total += t_col[i]
                count += 1
        return total / count, bytes_read
The two scan_avg methods compute the same answer. The difference is the bytes_read accounting — and that is what wall-clock time tracks. Why we count bytes, not rows: the bottleneck on an analytical scan is the bandwidth between disk (or network, for cloud object stores) and CPU. Row count is irrelevant; bytes-moved is everything.
Now we instrument the same query against both:
# storage/bench.py
import random
import time

from layouts import RowStore, ColumnStore  # the classes defined above

random.seed(42)
N = 1_000_000    # 1 million rows (smaller than the 100M example, faster to run)
ROW_SIZE = 1024  # 1 KB per row, 50 columns
COL_SIZES = {"price": 8, "region": 8,
             **{f"col{i}": 1024 // 50 for i in range(48)}}
regions = ["north", "south", "east", "west"]

rows = [
    {"price": random.uniform(100, 9999), "region": random.choice(regions),
     **{f"col{i}": 0 for i in range(48)}}
    for _ in range(N)
]
columns = {"price": [r["price"] for r in rows],
           "region": [r["region"] for r in rows]}
for i in range(48):
    columns[f"col{i}"] = [0] * N

row_store = RowStore(rows, ROW_SIZE)
col_store = ColumnStore(columns, COL_SIZES)

t0 = time.perf_counter()
avg_r, bytes_r = row_store.scan_avg("price", "region", "south")
t1 = time.perf_counter()
avg_c, bytes_c = col_store.scan_avg("price", "region", "south")
t2 = time.perf_counter()

assert abs(avg_r - avg_c) < 1e-9  # same answer from both layouts
print(f"row store : {bytes_r/1e6:8.1f} MB read {t1-t0:6.3f} s")
print(f"column st : {bytes_c/1e6:8.1f} MB read {t2-t1:6.3f} s")
print(f"speedup : {bytes_r/bytes_c:.1f}× (bytes), {(t1-t0)/(t2-t1):.1f}× (wall)")
Running this on a laptop prints something like:
row store : 1024.0 MB read 0.412 s
column st : 16.0 MB read 0.071 s
speedup : 64.0× (bytes), 5.8× (wall)
The byte-savings ratio is exactly 1024 / 16 = 64× — the row size divided by the two-column slice. Why the wall-clock speedup is smaller than 64×: the in-memory Python benchmark is bottlenecked on the per-row Python interpreter overhead (dict lookup, method dispatch) which is roughly the same in both layouts, masking the I/O win. On real disk and a vectorised C++ engine, the wall-clock and byte-read speedups converge.
What you give up
Column storage is not free. The same property that makes scans fast — values of one column packed together — makes two other operations slow.
Updates are scattered. To insert one new row into a row store, you append row_size bytes to one heap file, done. To insert one new row into a column store with 50 columns, you have to append one value to each of 50 separate column files (or 50 separate column chunks within a Parquet row group). That is 50× the seek pressure on the storage device and, if you're using object storage like S3 where each PUT is a separate API call, 50× the request cost. Why this is fatal for OLTP: a transactional system needs to commit thousands of single-row writes per second, often single-row updates that touch only one or two columns. Column storage turns that into thousands × 50 small writes per second, which neither a local SSD nor S3 is happy serving.
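A sketch of the write path makes the scatter visible. This is a toy model — each column is represented as a separate in-memory "file", so each append below stands in for one file write or one S3 PUT:

```python
NUM_COLS = 50
column_files = {f"col{i}": [] for i in range(NUM_COLS)}  # one "file" per column

def insert_row(row: dict) -> int:
    """Insert one logical row; return how many separate writes it cost."""
    writes = 0
    for name, values in column_files.items():
        values.append(row.get(name))  # one append per column file
        writes += 1
    return writes

cost = insert_row({f"col{i}": i for i in range(NUM_COLS)})
assert cost == 50  # one logical row = 50 physical writes
# A row store would have cost 1: a single append of the packed row.
```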
Random row reads are scattered. "Give me everything about user 12345" in a row store is one seek and one page read. In a column store, it is K seeks (where K is the number of columns), each landing in a different file. Even if those seeks are fast on SSD, the request count blows up. Column stores therefore explicitly do not target this access pattern — they assume you ask for batches of rows and many columns at once, or many rows and few columns, but rarely "one row, all columns."
Updates often become append-only with rewrites. Most column stores treat data as immutable and accumulate changes as new files; a background compaction process periodically rewrites old files merging in the deltas. This is the same idea you saw in LSM trees and SSTable compaction — the same trade-off (write amplification in exchange for read efficiency) appears in OLAP for the same reasons.
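The append-plus-compaction scheme can be sketched in a few lines — a toy model of the delta-merge idea; a real engine persists the delta as its own files rather than a dict:

```python
base = [10, 20, 30, 40]   # immutable base column "file"
delta = {1: 25, 3: 45}    # row_id -> new value, accumulated by later writes

def read(row_id: int) -> int:
    # Every read must consult the delta first — the read-side cost of
    # deferring updates.
    return delta.get(row_id, base[row_id])

def compact() -> None:
    # Background job: rewrite the whole base file with deltas merged in
    # (write amplification, traded for fast reads afterwards).
    global base, delta
    base = [delta.get(i, v) for i, v in enumerate(base)]
    delta = {}

assert read(1) == 25 and read(0) == 10   # delta overlays the base
compact()
assert base == [10, 25, 30, 45] and delta == {}
```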
The honest summary: column storage is optimised for batch loads followed by analytical reads. Streaming inserts and point updates are second-class, and any column-store benchmark you see in marketing material assumes the workload is dominated by SELECT ... GROUP BY over millions of rows.
The hybrid: HTAP
Real businesses do both. The orders table that powers a transactional checkout flow is the same table that powers the executive dashboard. The temptation is to keep two copies: a row-store OLTP system (Postgres, MySQL) and a column-store OLAP warehouse (Snowflake, BigQuery), connected by a nightly ETL pipeline.
That works, but it has two problems: the warehouse is stale by hours, and the operational overhead of keeping two storage stacks in sync is large. HTAP (Hybrid Transactional / Analytical Processing) systems try to make one engine that serves both. Three architectures dominate:
- TiDB stores the canonical data row-wise in TiKV (a Raft-backed key-value store) and replicates each row asynchronously into TiFlash, a column-store replica that serves analytical queries. Writes go to TiKV; analytical reads are routed to TiFlash. The replication is via Raft learners, so the analytical replica is usually within seconds of the transactional one.
- SingleStore (formerly MemSQL) stores hot rows in a row-store and cold rows in a column-store inside the same engine, with a background process moving rows between the two as they age out of the write hot set.
- Snowflake treats every table as columnar and uses micropartitions (50–500 MB column-store files in S3) plus aggressive metadata pruning to keep even small WHERE id = ? queries fast — punting on true OLTP but covering the long tail of operational queries that aren't truly point-read.
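Metadata pruning of that kind can be sketched as a zone map: per-file min/max statistics let the planner skip files that cannot possibly match. A toy model — the partition names and layout are illustrative, not Snowflake's actual metadata format:

```python
micropartitions = [
    {"path": "part-000", "min_id": 1,    "max_id": 1000},
    {"path": "part-001", "min_id": 1001, "max_id": 2000},
    {"path": "part-002", "min_id": 2001, "max_id": 3000},
]

def files_to_scan(target_id: int) -> list:
    # Only partitions whose [min, max] range can contain the id are read;
    # the rest are pruned using metadata alone, without touching S3.
    return [p["path"] for p in micropartitions
            if p["min_id"] <= target_id <= p["max_id"]]

assert files_to_scan(1500) == ["part-001"]  # 1 of 3 files touched
```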
The HTAP design space is its own subject. The reason it exists at all is the gap this chapter has been describing: the two workloads have opposite layout requirements, and a single layout cannot serve both well.
Where Build 15 goes from here
The 60× speedup we derived in this chapter is the floor. Real columnar engines stack three more orders of magnitude on top:
Encoding (chapter 120)
Storing one column with one type is the precondition for dictionary encoding (replace ["south", "south", "north", ...] with codes [1, 1, 2, ...] plus a tiny dictionary), run-length encoding ([1, 1, 1, 1, 2, 2] → [(1, 4), (2, 2)]), bit-packing (an integer column whose values fit in 12 bits stores 12 bits per value, not 32), delta encoding (sorted timestamps store the gaps, not the absolute values), and frame-of-reference. Real columns get 5×–20× compression from these tricks alone, often more on low-cardinality columns like region or product-category.
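Two of those encodings are small enough to sketch right here — minimal versions, with no bit-packing or binary serialisation:

```python
def dict_encode(values):
    """Replace repeated strings with small integer codes plus a dictionary."""
    dictionary, codes = {}, []
    for v in values:
        # setdefault assigns the next code on first sight, reuses it after
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

d, codes = dict_encode(["south", "south", "north", "south"])
assert d == {"south": 0, "north": 1} and codes == [0, 0, 1, 0]
assert rle_encode([1, 1, 1, 1, 2, 2]) == [(1, 4), (2, 2)]
```

On a 100M-row region column with 4 distinct values, dictionary encoding shrinks each entry to one byte, and sorting by region first would let run-length encoding shrink it to a handful of pairs.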
Vectorised execution (chapter 122)
Once a column lives as a contiguous array, the executor can process batches of 1024 or 4096 values at a time using vector operations rather than tuple-at-a-time iteration. The original Volcano iterator model costs one virtual function call per row — fine for OLTP, ruinous for OLAP at billion-row scale. Vectorised execution pays one virtual call per batch, and within a batch the inner loop is tight enough that the C++ compiler emits SIMD instructions automatically.
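The per-row vs per-batch cost difference can be sketched in plain Python. No real SIMD happens here — the point is the call count: one dispatch per row in the Volcano model versus one per 1024-value batch in the vectorised model:

```python
class VolcanoScan:
    """Tuple-at-a-time: one next() call per row."""
    def __init__(self, values):
        self.values, self.i = values, 0
    def next(self):
        if self.i == len(self.values):
            return None
        v = self.values[self.i]
        self.i += 1
        return v

class VectorScan:
    """Batch-at-a-time: one call per batch of 1024 values."""
    def __init__(self, values, batch=1024):
        self.values, self.i, self.batch = values, 0, batch
    def next_batch(self):
        chunk = self.values[self.i:self.i + self.batch]
        self.i += len(chunk)
        return chunk or None

data = list(range(10_000))

total_v, calls_v, scan = 0, 0, VolcanoScan(data)
while (row := scan.next()) is not None:
    total_v += row
    calls_v += 1

total_b, calls_b, scan = 0, 0, VectorScan(data)
while (batch := scan.next_batch()) is not None:
    total_b += sum(batch)  # tight inner loop a C++ compiler could vectorise
    calls_b += 1

assert total_v == total_b == sum(data)
assert calls_v == 10_000 and calls_b == 10  # 1000× fewer dispatches
```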
MPP and the lakehouse (chapters 124–128)
A single machine can scan tens of GB/s with a good column engine; a hundred-machine cluster scans terabytes per second. MPP (massively parallel processing) — the architecture pioneered by Teradata, productised by Vertica and Greenplum, then re-invented for cloud by Snowflake and BigQuery — partitions the column store across nodes and shuffles partial results between them. The newest twist is the lakehouse: store the column files in cheap object storage (S3, GCS), wrap them with an open table format (Apache Iceberg or Delta Lake) that adds ACID transactions and time travel, and let multiple compute engines (Spark, Trino, Snowflake, DuckDB) read the same files. This is the architecture every new data platform built since 2020 — Databricks, Tabular, Dremio, Onehouse — is converging on.
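The shuffle of partial results can be sketched for AVG: each node computes a (sum, count) partial over its own partition, and only those tiny partials cross the network — never the raw rows. A toy model of the idea:

```python
# Each worker node holds one partition of the price column
node_partitions = [
    [100.0, 200.0],
    [300.0],
    [400.0, 500.0, 600.0],
]

def partial_avg(rows):
    # Runs locally on each node; only (sum, count) is shipped to the
    # coordinator — AVG is not directly mergeable, but (sum, count) is.
    return (sum(rows), len(rows))

partials = [partial_avg(p) for p in node_partitions]  # the "shuffle" payload
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
assert total / count == 350.0  # identical to a single-node AVG
```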
This chapter has just opened the door. The next nine chapters of Build 15 walk through each of those layers in turn, ending with you having built a small but real columnar engine that can scan a hundred-million-row Parquet table on a laptop in under a second.
Summary
You should now believe three things in your bones:
- Disks and CPUs move data in pages, not bytes. Effective bandwidth = useful bytes per page read. That ratio is what decides scan speed, not row count.
- Row storage wastes bandwidth on analytical queries by a factor of total_columns / referenced_columns. For a 50-column table referencing 2 columns, that is 25× of pure waste before any other consideration.
- Column storage flips that ratio to ~1, then stacks compression and SIMD on top, giving a typical 50×–200× wall-clock speedup on real analytical workloads. The cost is paid in slower point updates and slower full-row reads, which is why it is OLAP-only by construction and why HTAP systems run two engines side by side.
The rest of Build 15 earns the rest of the speedup, layer by layer.
References
- Stonebraker et al., "C-Store: A Column-oriented DBMS", VLDB 2005
- Boncz, Zukowski, Nes, "MonetDB/X100: Hyper-Pipelining Query Execution", CIDR 2005
- Apache Parquet — File Format Documentation
- Abadi, Boncz, Harizopoulos et al., "The Design and Implementation of Modern Column-Oriented Database Systems", Foundations and Trends in Databases, 2013
- Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016
- Armbrust et al., "Lakehouse: A New Generation of Open Platforms", CIDR 2021