In short
Disk and main memory both transfer data in pages — typically 8 KB at the storage layer, 64 B per CPU cache line. A row-store packs every column of one row into the same page, so to read a single column across N rows you have to drag N whole rows off disk and discard everything except the one field you wanted. For a 50-column table whose columns average 8 bytes each, 98% of the bytes you read are thrown away. A 100 GB row-store table where you only need one 8-byte column becomes a 100 GB scan to extract 800 MB of useful data — at 400 MB/s SSD bandwidth, that is over four minutes of disk wall-clock for a query that only needed about two seconds of useful work.
A column store flips the layout: each column lives in its own contiguous file (or its own segment within a file). Reading one column reads only that column's bytes — the same query now scans 800 MB and finishes in two seconds. On top of that, a single-column file holds homogeneous values of one type, which compresses 5–20× better than a heterogeneous row, and lets the CPU process the data with SIMD instructions that work on 8 or 16 values per cycle. The combined speedup on real analytical workloads is 50× to 200×, which is why every modern data warehouse — Snowflake, BigQuery, Redshift, ClickHouse, DuckDB — stores its base tables column by column.
The price you pay is that point updates and full-row reads become expensive: writing one row means writing into 50 separate column files, and reading "give me everything about user 12345" means seeking 50 places instead of one. That is the core OLTP/OLAP split, and it is why production stacks now run two engines side by side (HTAP systems like TiDB and SingleStore keep both layouts in sync), with row storage handling the transactional path and column storage handling the analytical one. This chapter derives the bandwidth math, builds a tiny in-memory engine in both layouts, and clocks the same AVG(price) query against each to see the 60× speedup with your own eyes.
You have spent the previous 118 chapters building progressively stronger OLTP machinery: B-trees, write-ahead logs, MVCC, concurrency control, query planning, joins, distribution, consensus. Every one of those layers was tuned for a workload where queries touch a small number of rows by primary key — fetch user 12345, debit account 99887, insert order 4561. That is the world Postgres, MySQL, Oracle, SQL Server and CockroachDB live in, and they live in it well.
This chapter opens Build 15 — the columnar OLAP engine — by explaining why every one of those engines, no matter how well-tuned, collapses by two orders of magnitude the moment your workload changes shape from "fetch one row by key" to "scan a hundred million rows and aggregate one column."
The cause is not a bug in any of those engines. It is the fundamental physics of their on-disk layout. Rows are stored together. Analytics reads columns. The mismatch is not patchable with a better B-tree or a smarter optimiser — it requires a different storage layout. That is what the rest of Build 15 will construct: column chunks, dictionary and run-length encoding, vectorised execution, MPP shuffle, and the lakehouse table formats (Iceberg, Delta) that have eaten the modern data stack.
But first, we need to feel why row storage loses. Not as an abstraction — as actual bytes on actual disks moving at actual bandwidth.
The two workloads
Look at two SQL queries against the same orders table, which has 50 columns and 100 million rows on a Flipkart-scale e-commerce backend.
-- OLTP: "show me one customer's recent orders"
SELECT * FROM orders WHERE customer_id = 12345 ORDER BY ordered_at DESC LIMIT 20;
-- OLAP: "what is the average order value in the south region?"
SELECT AVG(price) FROM orders WHERE region = 'south';
The first query touches 20 rows, all 50 columns. The second touches 100 million rows, two columns (price and region). Both are valid workloads, both are common in any real business, and both are usually expected to be fast. But they have opposite access patterns:
- OLTP wants all columns of a few rows — narrow and tall
- OLAP wants a few columns of all rows — wide and short
A storage layout can be optimised for one or the other, but not both. Row storage — the layout that has dominated databases since System R in the 1970s — is optimised for the first. Why: a row-store keeps all of a row's columns adjacent in memory, so fetching the row is one sequential read of row_size bytes from one disk offset; that is the cheapest possible operation a storage engine can do.
The cost of that decision is what we are about to derive.
How disks and pages actually move data
Both spinning disks and SSDs transfer data in pages, not bytes. A page is the smallest unit the storage engine reads or writes — 8 KB in Postgres, 16 KB in InnoDB, 64 KB to 1 MB on cloud object stores. Why pages: the cost of one I/O operation is dominated by the seek and the per-request overhead, not the bytes; once you've paid to talk to the device you may as well grab a whole page. Smaller pages waste IOPS, larger pages waste bandwidth on partial reads.
Even main memory has the same property at smaller scale: the CPU pulls 64-byte cache lines from RAM, never single bytes. So the same "you must read a whole page even if you only want one field" rule applies at every level of the memory hierarchy.
This means the question "how much data does my query really read?" is not "how many bytes did my SQL ask for?" but "how many pages did the storage engine have to drag off disk to satisfy that ask?" If a single page contains useful and useless bytes, you pay for the whole page anyway. That ratio — useful bytes per page read — is the effective bandwidth of the layout, and it is what decides whether your scan finishes in 4 seconds or 4 minutes.
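The effective-bandwidth ratio can be computed directly. A minimal back-of-envelope sketch, assuming 8 KB pages and the 1 KB rows used later in this chapter:

```python
# Useful bytes per page when the query needs one 8-byte column
# (assumed sizes: 8 KB pages, 1 KB rows — the chapter's running example)
PAGE_BYTES = 8192
ROW_BYTES = 1024
WANTED_BYTES = 8

rows_per_page = PAGE_BYTES // ROW_BYTES          # 8 rows fit per page
useful_per_page = rows_per_page * WANTED_BYTES   # 64 useful bytes
useful_fraction = useful_per_page / PAGE_BYTES   # 1/128 of every page read

DISK_MBPS = 400                                  # nominal SSD bandwidth
effective_mbps = DISK_MBPS * useful_fraction     # bandwidth of *useful* data

print(f"useful fraction: {useful_fraction:.2%}")
print(f"effective bandwidth: {effective_mbps:.3f} MB/s")
```

A 400 MB/s device delivers barely 3 MB/s of data the query actually wanted — the layout, not the hardware, sets the ceiling.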
Row storage, drawn
Here is what a row-store layout looks like on disk for an orders table with 8 columns abbreviated as a through h. Each row's columns are packed tightly together; rows are then packed tightly into pages.
The key thing this picture shows is that the colour you want — say light blue, column c — is sprinkled across every page. There is no way to read just the blue boxes without reading the whole page they sit inside, because the storage device only knows how to deliver pages, not bytes-of-your-choosing.
So a query that asks for just column c reads every byte of every page — 100% of the data on disk — and then discards 7/8 of it in CPU. The CPU work to throw away the unwanted columns is real (parsing each row's variable-length encoding to skip past a, b, then read c, then skip d–h), but it is small compared to the disk cost. The disk cost is the killer.
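To make the "whole page for one field" cost concrete, here is a toy fixed-width row codec — an assumption for illustration only; real heap pages use tuple headers and variable-length encodings — showing that even extracting one 8-byte field requires the full page in memory:

```python
import struct

ROW_SIZE = 64   # toy rows: 8 columns ("a" through "h") x 8 bytes each
COL_BYTES = 8

def read_col(page: bytes, row_idx: int, col_idx: int) -> int:
    # The page is already fully in memory — the device delivered all of it;
    # we merely index into it to pull out one field.
    offset = row_idx * ROW_SIZE + col_idx * COL_BYTES
    (value,) = struct.unpack_from("<q", page, offset)
    return value

# Build a page of 8 rows; row r holds the values r*100 .. r*100+7
page = b"".join(struct.pack("<8q", *range(r * 100, r * 100 + 8))
                for r in range(8))
assert len(page) == 512             # every one of these bytes was read...
assert read_col(page, 2, 2) == 202  # ...to extract this single 8-byte field
```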
Column storage, drawn
Now look at the same table laid out column-first. Each column gets its own contiguous run on disk; rows are reconstructed only when the query asks for them.
Three things change when you look at this picture:
- Selective reads. A query that wants just column c reads only the blue strip — the storage engine seeks to c's file (or to the column-c chunk inside a Parquet row group), reads that contiguous range, and is done. Pages now contain 100% useful bytes for this query. Why this is a 50× win: in a 50-column table the row store had to read 50× as much data to reach one column; the column store reads exactly 1× — that ratio is the speedup, before any other optimisation.
- Compression. A row-store page mixes types: an int customer-id, a string region, a float price, a date, a bool. Generic compressors (LZ4, Snappy) struggle on heterogeneous data. A column-store page holds one type, often one value repeated thousands of times (think region = 'south' for every order from Bengaluru) — perfect for run-length and dictionary encoding, which routinely give 5×–20× compression on real data.
- SIMD. Modern CPUs have vector instructions (AVX-512, NEON) that perform the same operation on 8 or 16 values in one cycle — but only when those values are the same type, packed adjacent in memory. That is exactly what a column chunk gives you. A row-store page cannot feed SIMD, because consecutive bytes belong to different columns of different types.
Together, these three give the 50×–200× speedup that is the whole reason data warehouses exist as a separate species from OLTP databases.
The actual numbers, drawn
Let's put numbers on it for a concrete query — SELECT AVG(price) WHERE region = 'south' on a 100-million-row sales table with 50 columns. Each row is 1 KB, so the table is 100 GB. The price column is 8 bytes, the region column is 8 bytes (after dictionary encoding), so the data the query actually needs is 100M × 16 B = 1.6 GB. Assume a 400 MB/s SSD.
A 62× speedup. From layout alone. Before you have applied a single byte of compression, before you have used a single SIMD instruction, before you have parallelised across cores or nodes. Just by storing the data column-by-column instead of row-by-row.
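The derivation is easy to reproduce. A quick sketch, treating the table as exactly 100M × 1 KB (= 102.4 GB; the 62× figure above rounds the table down to a flat 100 GB, so the byte ratio here comes out as 1024/16 = 64×):

```python
# I/O-only model of SELECT AVG(price) WHERE region = 'south'
ROWS = 100_000_000
ROW_BYTES = 1024       # full 50-column row
NEEDED_BYTES = 16      # price (8 B) + region (8 B, dict-encoded)
SSD_MBPS = 400         # sustained sequential throughput

row_scan_s = ROWS * ROW_BYTES / 1e6 / SSD_MBPS     # scan every heap page
col_scan_s = ROWS * NEEDED_BYTES / 1e6 / SSD_MBPS  # scan just two columns

print(f"row store : {row_scan_s:.0f} s")           # 256 s ≈ 4 min 16 s
print(f"col store : {col_scan_s:.1f} s")           # 4.0 s
print(f"speedup   : {row_scan_s / col_scan_s:.0f}×")
```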
This is the core economic engine that pays for the entire data warehouse industry. Snowflake, BigQuery, Redshift, Databricks SQL, ClickHouse, DuckDB — every one of them is selling, in essence, this 60×-or-better speedup on analytical queries, paired with a query engine that knows how to exploit the layout. The first system to industrialise this was C-Store / Vertica in 2005; MonetDB had been pushing in-memory column-vector execution since the late 1990s; Apache Parquet is the open file format that brought the layout to Hadoop and the lakehouse stack.
A worked example
A 100M-row sales scan, end to end
You are a data engineer at a Bengaluru e-commerce startup. The fact table is sales:
CREATE TABLE sales (
order_id BIGINT, -- 8 B
customer_id BIGINT, -- 8 B
product_id BIGINT, -- 8 B
region VARCHAR(16), -- 8 B (dict-encoded average)
city VARCHAR(32), -- 16 B
ordered_at TIMESTAMP, -- 8 B
delivered_at TIMESTAMP, -- 8 B
price DECIMAL(10,2), -- 8 B
quantity INT, -- 4 B
-- ...41 more columns, totalling 1024 B per row...
);
Row count: 100 million. Total table size: 100 GB. Storage: AWS gp3 EBS, 400 MB/s sustained throughput.
The marketing team runs a daily query:
SELECT AVG(price) FROM sales WHERE region = 'south';
Row-store path (Postgres):
The planner picks a sequential scan because region has no useful index (low cardinality — only 4 values). The executor opens the heap file, reads page after page, parses each 1 KB row tuple, decodes the region column to test the predicate, decodes the price column to add to the running sum.
- Bytes read from disk: 100 GB (every page of the heap)
- Wall-clock from I/O alone: 100 × 1024 MB / 400 MB/s = 256 s ≈ 4 min 16 s
- Useful bytes: 100M × 16 B = 1.6 GB — the rest is bandwidth waste
Column-store path (DuckDB or ClickHouse over the same data):
The planner sees that only price and region are referenced. It opens the region column chunk and the price column chunk; the other 48 columns are not even touched. Within the region chunk, dictionary encoding has reduced 100 M strings to a 4-entry dictionary plus a 100M-entry array of 1-byte codes, giving a chunk of 100 MB. The price chunk, with delta encoding on the integer cents, compresses to about 400 MB.
- Bytes read from disk: 500 MB (compressed region + price chunks)
- Wall-clock from I/O alone: 500 MB / 400 MB/s = 1.25 s
- The CPU then runs a vectorised filter (region_code == south_code) and a vectorised sum on the surviving prices, both at multi-GB/s throughput
Result: 256 s vs 1.25 s. A 200× speedup, on the same hardware, on the same data, just by changing the on-disk layout.
This is not a thought experiment. It is approximately the gap you will measure if you load the same TPC-H 100 GB dataset into Postgres and into DuckDB and run the same SELECT AVG(...).
A tiny in-memory engine, both ways
Let's build a minimal version of both layouts in Python so the cost calculation stops being abstract.
# storage/layouts.py
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

@dataclass
class RowStore:
    """Each row is a dict {col_name: value}. All columns of a row live together."""
    rows: List[Dict[str, Any]]
    row_size_bytes: int  # average bytes per row (all columns)

    def scan_avg(self, target_col: str, filter_col: str,
                 filter_val: Any) -> Tuple[float, int]:
        # Engine MUST traverse every row; each row touch reads `row_size_bytes`
        total, count, bytes_read = 0.0, 0, 0
        for r in self.rows:
            bytes_read += self.row_size_bytes  # whole row pulled off disk
            if r[filter_col] == filter_val:
                total += r[target_col]
                count += 1
        return total / count, bytes_read

@dataclass
class ColumnStore:
    """Each column is its own list. Columns are read independently."""
    columns: Dict[str, List[Any]]
    col_size_bytes: Dict[str, int]  # bytes per value, per column

    def scan_avg(self, target_col: str, filter_col: str,
                 filter_val: Any) -> Tuple[float, int]:
        # Engine reads ONLY the two referenced columns
        n = len(self.columns[filter_col])
        bytes_read = n * (self.col_size_bytes[filter_col]
                          + self.col_size_bytes[target_col])
        total, count = 0.0, 0
        f_col = self.columns[filter_col]
        t_col = self.columns[target_col]
        for i in range(n):
            if f_col[i] == filter_val:
                total += t_col[i]
                count += 1
        return total / count, bytes_read
The two scan_avg methods compute the same answer. The difference is the bytes_read accounting — and that is what wall-clock time tracks. Why we count bytes, not rows: the bottleneck on an analytical scan is the bandwidth between disk (or network, for cloud object stores) and CPU. Row count is irrelevant; bytes-moved is everything.
Now we instrument the same query against both:
# storage/bench.py
import random
import time

from layouts import RowStore, ColumnStore  # the classes defined above

random.seed(42)
N = 1_000_000    # 1 million rows (smaller than the 100M example, faster to run)
ROW_SIZE = 1024  # 1 KB per row, 50 columns
COL_SIZES = {"price": 8, "region": 8,
             **{f"col{i}": 1024 // 50 for i in range(48)}}
regions = ["north", "south", "east", "west"]

rows = [
    {"price": random.uniform(100, 9999), "region": random.choice(regions),
     **{f"col{i}": 0 for i in range(48)}}
    for _ in range(N)
]
columns = {"price": [r["price"] for r in rows],
           "region": [r["region"] for r in rows]}
for i in range(48):
    columns[f"col{i}"] = [0] * N

row_store = RowStore(rows, ROW_SIZE)
col_store = ColumnStore(columns, COL_SIZES)

t0 = time.perf_counter()
avg_r, bytes_r = row_store.scan_avg("price", "region", "south")
t1 = time.perf_counter()
avg_c, bytes_c = col_store.scan_avg("price", "region", "south")
t2 = time.perf_counter()

assert abs(avg_r - avg_c) < 1e-9  # same answer from both layouts
print(f"row store : {bytes_r/1e6:8.1f} MB read {t1-t0:6.3f} s")
print(f"column st : {bytes_c/1e6:8.1f} MB read {t2-t1:6.3f} s")
print(f"speedup : {bytes_r/bytes_c:.1f}× (bytes), {(t1-t0)/(t2-t1):.1f}× (wall)")
Running this on a laptop prints something like:
row store : 1024.0 MB read 0.412 s
column st : 16.0 MB read 0.071 s
speedup : 64.0× (bytes), 5.8× (wall)
The byte-savings ratio is exactly 1024 / 16 = 64× — the row size divided by the two-column slice. Why the wall-clock speedup is smaller than 64×: the in-memory Python benchmark is bottlenecked on the per-row Python interpreter overhead (dict lookup, method dispatch) which is roughly the same in both layouts, masking the I/O win. On real disk and a vectorised C++ engine, the wall-clock and byte-read speedups converge.
What you give up
Column storage is not free. The same property that makes scans fast — values of one column packed together — makes two other operations slow.
Updates are scattered. To insert one new row into a row store, you append row_size bytes to one heap file, done. To insert one new row into a column store with 50 columns, you have to append one value to each of 50 separate column files (or 50 separate column chunks within a Parquet row group). That is 50× the seek pressure on the storage device and, if you're using object storage like S3 where each PUT is a separate API call, 50× the request cost. Why this is fatal for OLTP: a transactional system needs to commit thousands of single-row writes per second, often single-row updates that touch only one or two columns. Column storage turns that into thousands × 50 small writes per second, which neither a local SSD nor S3 is happy serving.
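A sketch of the write path makes the scatter visible. This is a toy model — each column is represented as a separate in-memory "file", so each append below stands in for one file write or one S3 PUT:

```python
NUM_COLS = 50
column_files = {f"col{i}": [] for i in range(NUM_COLS)}  # one "file" per column

def insert_row(row: dict) -> int:
    """Insert one logical row; return how many separate writes it cost."""
    writes = 0
    for name, values in column_files.items():
        values.append(row.get(name))  # one append per column file
        writes += 1
    return writes

cost = insert_row({f"col{i}": i for i in range(NUM_COLS)})
assert cost == 50  # one logical row = 50 physical writes
# A row store would have cost 1: a single append of the packed row.
```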
Random row reads are scattered. "Give me everything about user 12345" in a row store is one seek and one page read. In a column store, it is K seeks (where K is the number of columns), each landing in a different file. Even if those seeks are fast on SSD, the request count blows up. Column stores therefore explicitly do not target this access pattern — they assume you ask for batches of rows and many columns at once, or many rows and few columns, but rarely "one row, all columns."
Updates often become append-only with rewrites. Most column stores treat data as immutable and accumulate changes as new files; a background compaction process periodically rewrites old files merging in the deltas. This is the same idea you saw in LSM trees and SSTable compaction — the same trade-off (write amplification in exchange for read efficiency) appears in OLAP for the same reasons.
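The append-plus-compaction scheme can be sketched in a few lines — a toy model of the delta-merge idea; a real engine persists the delta as its own files rather than a dict:

```python
base = [10, 20, 30, 40]   # immutable base column "file"
delta = {1: 25, 3: 45}    # row_id -> new value, accumulated by later writes

def read(row_id: int) -> int:
    # Every read must consult the delta first — the read-side cost of
    # deferring updates.
    return delta.get(row_id, base[row_id])

def compact() -> None:
    # Background job: rewrite the whole base file with deltas merged in
    # (write amplification, traded for fast reads afterwards).
    global base, delta
    base = [delta.get(i, v) for i, v in enumerate(base)]
    delta = {}

assert read(1) == 25 and read(0) == 10   # delta overlays the base
compact()
assert base == [10, 25, 30, 45] and delta == {}
```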
The honest summary: column storage is optimised for batch loads followed by analytical reads. Streaming inserts and point updates are second-class, and any column-store benchmark you see in marketing material assumes the workload is dominated by SELECT ... GROUP BY over millions of rows.
The hybrid: HTAP
Real businesses do both. The orders table that powers a transactional checkout flow is the same table that powers the executive dashboard. The temptation is to keep two copies: a row-store OLTP system (Postgres, MySQL) and a column-store OLAP warehouse (Snowflake, BigQuery), connected by a nightly ETL pipeline.
That works, but it has two problems: the warehouse is stale by hours, and the operational overhead of keeping two storage stacks in sync is large. HTAP (Hybrid Transactional / Analytical Processing) systems try to make one engine that serves both. Three architectures dominate:
- TiDB stores the canonical data row-wise in TiKV (a Raft-backed key-value store) and replicates each row asynchronously into TiFlash, a column-store replica that serves analytical queries. Writes go to TiKV; analytical reads are routed to TiFlash. The replication is via Raft learners, so the analytical replica is usually within seconds of the transactional one.
- SingleStore (formerly MemSQL) stores hot rows in a row-store and cold rows in a column-store inside the same engine, with a background process moving rows between the two as they age out of the write hot set.
- Snowflake treats every table as columnar and uses micropartitions (50–500 MB column-store files in S3) plus aggressive metadata pruning to keep even small WHERE id = ? queries fast — punting on true OLTP but covering the long tail of operational queries that aren't truly point-read.
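Metadata pruning of that kind can be sketched as a zone map: per-file min/max statistics let the planner skip files that cannot possibly match. A toy model — the partition names and layout are illustrative, not Snowflake's actual metadata format:

```python
micropartitions = [
    {"path": "part-000", "min_id": 1,    "max_id": 1000},
    {"path": "part-001", "min_id": 1001, "max_id": 2000},
    {"path": "part-002", "min_id": 2001, "max_id": 3000},
]

def files_to_scan(target_id: int) -> list:
    # Only partitions whose [min, max] range can contain the id are read;
    # the rest are pruned using metadata alone, without touching S3.
    return [p["path"] for p in micropartitions
            if p["min_id"] <= target_id <= p["max_id"]]

assert files_to_scan(1500) == ["part-001"]  # 1 of 3 files touched
```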
The HTAP design space is its own subject. The reason it exists at all is the gap this chapter has been describing: the two workloads have opposite layout requirements, and a single layout cannot serve both well.
Where Build 15 goes from here
The 60× speedup we derived in this chapter is the floor. Real columnar engines stack three more orders of magnitude on top:
Encoding (chapter 120)
Storing one column with one type is the precondition for dictionary encoding (replace ["south", "south", "north", ...] with codes [1, 1, 2, ...] plus a tiny dictionary), run-length encoding ([1, 1, 1, 1, 2, 2] → [(1, 4), (2, 2)]), bit-packing (an integer column whose values fit in 12 bits stores 12 bits per value, not 32), delta encoding (sorted timestamps store the gaps, not the absolute values), and frame-of-reference. Real columns get 5×–20× compression from these tricks alone, often more on low-cardinality columns like region or product-category.
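Two of those encodings are small enough to sketch right here — minimal versions, with no bit-packing or binary serialisation:

```python
def dict_encode(values):
    """Replace repeated strings with small integer codes plus a dictionary."""
    dictionary, codes = {}, []
    for v in values:
        # setdefault assigns the next code on first sight, reuses it after
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

d, codes = dict_encode(["south", "south", "north", "south"])
assert d == {"south": 0, "north": 1} and codes == [0, 0, 1, 0]
assert rle_encode([1, 1, 1, 1, 2, 2]) == [(1, 4), (2, 2)]
```

On a 100M-row region column with 4 distinct values, dictionary encoding shrinks each entry to one byte, and sorting by region first would let run-length encoding shrink it to a handful of pairs.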
Vectorised execution (chapter 122)
Once a column lives as a contiguous array, the executor can process batches of 1024 or 4096 values at a time using vector operations rather than tuple-at-a-time iteration. The original Volcano iterator model costs one virtual function call per row — fine for OLTP, ruinous for OLAP at billion-row scale. Vectorised execution pays one virtual call per batch, and within a batch the inner loop is tight enough that the C++ compiler emits SIMD instructions automatically.
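The per-row vs per-batch cost difference can be sketched in plain Python. No real SIMD happens here — the point is the call count: one dispatch per row in the Volcano model versus one per 1024-value batch in the vectorised model:

```python
class VolcanoScan:
    """Tuple-at-a-time: one next() call per row."""
    def __init__(self, values):
        self.values, self.i = values, 0
    def next(self):
        if self.i == len(self.values):
            return None
        v = self.values[self.i]
        self.i += 1
        return v

class VectorScan:
    """Batch-at-a-time: one call per batch of 1024 values."""
    def __init__(self, values, batch=1024):
        self.values, self.i, self.batch = values, 0, batch
    def next_batch(self):
        chunk = self.values[self.i:self.i + self.batch]
        self.i += len(chunk)
        return chunk or None

data = list(range(10_000))

total_v, calls_v, scan = 0, 0, VolcanoScan(data)
while (row := scan.next()) is not None:
    total_v += row
    calls_v += 1

total_b, calls_b, scan = 0, 0, VectorScan(data)
while (batch := scan.next_batch()) is not None:
    total_b += sum(batch)  # tight inner loop a C++ compiler could vectorise
    calls_b += 1

assert total_v == total_b == sum(data)
assert calls_v == 10_000 and calls_b == 10  # 1000× fewer dispatches
```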
MPP and the lakehouse (chapters 124–128)
A single machine can scan tens of GB/s with a good column engine; a hundred-machine cluster scans terabytes per second. MPP (massively parallel processing) — the architecture pioneered by Teradata, productised by Vertica and Greenplum, then re-invented for cloud by Snowflake and BigQuery — partitions the column store across nodes and shuffles partial results between them. The newest twist is the lakehouse: store the column files in cheap object storage (S3, GCS), wrap them with an open table format (Apache Iceberg or Delta Lake) that adds ACID transactions and time travel, and let multiple compute engines (Spark, Trino, Snowflake, DuckDB) read the same files. This is the architecture every new data platform built since 2020 — Databricks, Tabular, Dremio, Onehouse — is converging on.
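The shuffle of partial results can be sketched for AVG: each node computes a (sum, count) partial over its own partition, and only those tiny partials cross the network — never the raw rows. A toy model of the idea:

```python
# Each worker node holds one partition of the price column
node_partitions = [
    [100.0, 200.0],
    [300.0],
    [400.0, 500.0, 600.0],
]

def partial_avg(rows):
    # Runs locally on each node; only (sum, count) is shipped to the
    # coordinator — AVG is not directly mergeable, but (sum, count) is.
    return (sum(rows), len(rows))

partials = [partial_avg(p) for p in node_partitions]  # the "shuffle" payload
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
assert total / count == 350.0  # identical to a single-node AVG
```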
This chapter has just opened the door. The next nine chapters of Build 15 walk through each of those layers in turn, ending with you having built a small but real columnar engine that can scan a hundred-million-row Parquet table on a laptop in under a second.
Summary
You should now believe three things in your bones:
- Disks and CPUs move data in pages, not bytes. Effective bandwidth = useful bytes per page read. That ratio is what decides scan speed, not row count.
- Row storage wastes bandwidth on analytical queries by a factor of total_columns / referenced_columns. For a 50-column table referencing 2 columns, that is 25× of pure waste before any other consideration.
- Column storage flips that ratio to ~1, then stacks compression and SIMD on top, giving a typical 50×–200× wall-clock speedup on real analytical workloads. The cost is paid in slower point updates and slower full-row reads, which is why it is OLAP-only by construction and why HTAP systems run two engines side by side.
The rest of Build 15 earns the rest of the speedup, layer by layer.
References
- Stonebraker et al., "C-Store: A Column-oriented DBMS", VLDB 2005
- Boncz, Zukowski, Nes, "MonetDB/X100: Hyper-Pipelining Query Execution", CIDR 2005
- Apache Parquet — File Format Documentation
- Abadi, Boncz, Harizopoulos et al., "The Design and Implementation of Modern Column-Oriented Database Systems", Foundations and Trends in Databases, 2013
- Dageville et al., "The Snowflake Elastic Data Warehouse", SIGMOD 2016
- Armbrust et al., "Lakehouse: A New Generation of Open Platforms", CIDR 2021