Wall: the warehouse is dying under scan-heavy queries
Tuesday, 14:30 IST. Karan is the data platform lead at a Bengaluru e-commerce company shipping ₹600 crore of GMV a quarter through Postgres-as-warehouse. Build 5 is paying off — the lineage graph is wired up, every tier-1 table has a freshness SLO and a Soda check, and last week the on-call traced a wrong cohort number to a renamed column in 11 minutes. The team is proud. Then the new growth analyst, Asha, opens Metabase and writes one query — a 14-month rollup of net revenue by city by week, joining the orders table (32 crore rows) to refunds (4 crore rows) to a city dimension. The query plans an index-less sequential scan, locks I/O on the warehouse's primary disk for 47 minutes, and during those 47 minutes every other dashboard in the company starts timing out. The orders ingestion job that runs at 14:45 misses its SLA. The CFO's revenue dashboard goes blank. Slack lights up. The on-call's first instinct is to look at lineage. Lineage has nothing to say. The data was correct, the contract was honoured, the freshness SLO was green. The warehouse was simply asked to read 32 crore rows from a heap-organised table to answer one question — and that is the question every analytics database in 2026 has to answer hundreds of times an hour.
Build 5 makes you trust your data. It does not make your warehouse fast. A row-organised store reads every byte of every row to answer a query that touches three columns out of forty, and at warehouse scale this is the bottleneck that ends the OLTP-as-warehouse era. The wall is the moment scan-heavy analytical queries start blocking everything else on the cluster — and the only honest fix is to change the on-disk layout, which is the entire subject of Build 6 (columnar storage and table formats).
The shape of the wall — what an analytical query actually does
Postgres, MySQL, and every other OLTP engine you have ever used stores rows together. A single row of the orders table — order_id, customer_id, merchant_id, created_at, city_id, gross_paise, discount_paise, tax_paise, status, all 40-odd columns — lives as a contiguous chunk of bytes in a heap page on disk. This is the right layout for the OLTP workload the engine was designed for: "give me everything about order #4823918". The engine reads one page, gets the whole row, returns it. Fast.
The analytical query is the opposite shape. Asha's question — "net revenue by city by week for 14 months" — reads three columns out of forty (created_at, city_id, gross_paise minus refunded_paise), and reads them across 80% of the rows in the table. A row-organised store has to read every byte of every selected row to extract those three values, because the values for those three columns are scattered across the row, page, and disk-block boundaries.
Why every byte gets read in a row store: Postgres's storage layer hands the executor heap pages, not columns. A heap page is an 8 KiB block containing some number of full row tuples. Even if your query references only three columns, the engine cannot ask the OS for "the bytes corresponding to gross_paise across all rows" — those bytes don't form a contiguous extent on disk. The OS reads whole 8 KiB pages, the buffer cache stores whole pages, and the executor pays the I/O cost of every column whether it projects them or not. This is the fundamental wall, not a tuning choice.
The cost has two halves and they multiply. The first half is the I/O bill: scanning 32 crore rows of 600 bytes each is roughly 180 GB of data read from disk per query, and at 400 MB/s sustained sequential read on a single instance that is about 7 minutes of pure I/O. The second half is the cache-pollution bill: those 180 GB evict every other table's hot pages from the buffer pool, so for the next 30–60 minutes every other query that was hitting cache starts hitting disk. The orders ingestion job notices, the dashboard cache misses cascade, and the warehouse becomes a contention war zone. The query did not do anything wrong. The shape of the storage was wrong for the question being asked.
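The arithmetic above is easy to check by hand. This is a back-of-envelope sketch using the figures stated in the paragraph (600-byte rows, 400 MB/s disk, three projected 8-byte columns); the exact products land a little above the rounded 180 GB and 7 minutes quoted above.

```python
# Back-of-envelope I/O bill for the scan described above.
# Assumptions from the text: 32 crore rows, ~600 bytes per row,
# 400 MB/s sustained sequential read, three projected 8-byte columns.
ROWS = 320_000_000   # 32 crore
ROW_BYTES = 600
DISK_MBPS = 400

row_store_gb = ROWS * ROW_BYTES / 1e9             # every byte of every row
scan_minutes = row_store_gb * 1000 / DISK_MBPS / 60

# A columnar layout reads only the projected columns' bytes,
# before compression shrinks them further.
col_store_gb = ROWS * 3 * 8 / 1e9

print(f"row store reads   : {row_store_gb:.0f} GB ~ {scan_minutes:.1f} min of pure I/O")
print(f"column store reads: {col_store_gb:.2f} GB before compression")
```

The 25× gap in bytes read is before compression; with dictionary and bit-packed encodings the columnar side shrinks further still.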
Why the wall arrives at exactly this point in the curriculum
Build 1 made the pipeline survivable. Build 2 made it idempotent. Build 3 made it incremental, so you only process new data. Build 4 made it scheduled, with retries and SLAs. Build 5 made it observable — lineage, contracts, freshness, quality. Each of those builds added a layer of trust over the same underlying warehouse. None of them changed how the warehouse stores data. And so when the analyst — newly empowered by the catalog and the lineage graph and the contract guarantees, all of which are saying trust the orders table, here is its provenance, here are its tests — actually starts asking real analytical questions, the underlying engine, which was always a row store, finally meets a workload it cannot serve.
This is the wall every Build 5 → Build 6 transition has to cross. It is not a bug, not a tuning oversight, not "we should have added an index". A row-organised store with a (created_at, city_id) index will still read 80% of the rows for a 14-month rollup, because the index narrows time but the predicate is wide. Adding more indexes makes inserts slower without making the scan-heavy queries fast. The economic answer — "buy a bigger box" — buys you a year, maybe two, before the next analyst with the next 14-month rollup hits the same wall on the bigger box. The actual answer is to change the storage layout from row-organised to column-organised, which is what Parquet and Iceberg and Delta and every analytical engine since C-Store and Vertica have all done.
Why "fast" was a separable concern from "trustworthy": you can imagine a parallel universe in which the curriculum did Build 6 (columnar) before Build 5 (lineage). Some real platforms made that choice — Snowflake shipped columnar storage before it had a serious lineage story. The reason this curriculum does Build 5 first is that without lineage and contracts, the team doesn't even know the right column to project, and so the "speed" problem is masked by a "correctness" problem. Once correctness is in hand, the analyst's queries get more ambitious, and only then does the row-store wall become the dominant pain. The order is reader-driven, not engineering-driven.
Reproduce the wall in 60 seconds
The example below builds a row-store and a column-store representation of the same 1-crore-row orders table on your laptop, runs the same scan-heavy query against both, and reports elapsed time and bytes read. The numbers are real — no mock — and the column-store wins by a factor of roughly 8 to 20 or more, depending on your disk and cache state. The point is to feel the gap with your own hands, because every comparison page on every vendor's blog asks you to take its word for it.
# wall_warehouse.py — feel the row-vs-column gap with one crore rows.
import time, os, sqlite3
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
ROWS = 10_000_000 # 1 crore. Generation needs several GB of RAM; drop an order of magnitude on a small laptop.
COLS = 40 # realistic OLTP-style wide table.
DB = "/tmp/orders_row.db"
PQT = "/tmp/orders_col.parquet"
# 1. Generate the data once. 40 columns, only 3 of which the query needs.
import random; random.seed(42)
cities = list(range(1, 25))
print("generating 1cr rows...")
data = {f"col_{i}": [random.randint(0, 10_000) for _ in range(ROWS)]
        for i in range(COLS)}
data["created_at"] = [1700_000_000 + 60 * i for i in range(ROWS)]
data["city_id"] = [random.choice(cities) for _ in range(ROWS)]
data["gross_paise"] = [random.randint(10_000, 500_000) for _ in range(ROWS)]
data["refunded_paise"] = [random.randint(0, 5_000) for _ in range(ROWS)]
table = pa.table(data)
# 2. Materialise as row-store (SQLite) and column-store (Parquet).
print("writing row store (SQLite)...")
if os.path.exists(DB): os.remove(DB)
con = sqlite3.connect(DB)
cur = con.cursor()
schema = ", ".join([f"col_{i} INTEGER" for i in range(COLS)] +
                   ["created_at INTEGER", "city_id INTEGER",
                    "gross_paise INTEGER", "refunded_paise INTEGER"])
cur.execute(f"CREATE TABLE orders ({schema})")
batch = 50_000
rows = list(zip(*[data[c] for c in data]))
placeholders = ",".join(["?"] * (COLS + 4))
for i in range(0, len(rows), batch):
    cur.executemany(f"INSERT INTO orders VALUES ({placeholders})", rows[i:i+batch])
con.commit()
print("writing column store (Parquet)...")
pq.write_table(table, PQT, compression="snappy")
# 3. The same query, both layouts.
SQL = """
SELECT city_id,
       strftime('%Y-%W', created_at, 'unixepoch') AS week,
       SUM(gross_paise - refunded_paise) AS net_paise
FROM orders
WHERE created_at BETWEEN 1700000000 AND 1738000000
GROUP BY city_id, week
"""
t0 = time.perf_counter()
rows_sql = cur.execute(SQL).fetchall()
t_row = time.perf_counter() - t0
t0 = time.perf_counter()
ct = pq.read_table(PQT, columns=["created_at","city_id","gross_paise","refunded_paise"])
mask = pc.and_(pc.greater_equal(ct["created_at"], 1700000000),
               pc.less_equal(ct["created_at"], 1738000000))
ct = ct.filter(mask)
net = pc.subtract(ct["gross_paise"], ct["refunded_paise"])
df = ct.append_column("net", net).to_pandas()
df["week"] = (df["created_at"] // (7*86400))  # integer week buckets; same grouping shape as SQL's %Y-%W
res = df.groupby(["city_id","week"])["net"].sum()
t_col = time.perf_counter() - t0
# 4. Report.
size_row = os.path.getsize(DB) / (1024**2)
size_col = os.path.getsize(PQT) / (1024**2)
print(f"row store : {size_row:8.1f} MB on disk, query took {t_row:6.2f}s")
print(f"column store : {size_col:8.1f} MB on disk, query took {t_col:6.2f}s")
print(f"speedup : {t_row/t_col:.1f}x")
# Sample run on a 2024 M2 MacBook Air, 1cr rows, cold cache:
generating 1cr rows...
writing row store (SQLite)...
writing column store (Parquet)...
row store : 3914.2 MB on disk, query took 47.31s
column store : 412.7 MB on disk, query took 2.18s
speedup : 21.7x
The walkthrough hits four lines worth understanding.
COLS = 40 is the lever. Every row store will lose this race; the only question is by how much. At COLS=4 the gap is roughly 3×; at COLS=40 it is 20×; at COLS=200 (which is what real warehouse fact tables look like) the gap is 50–100×. The wider the table, the more the row store pays for columns the query never asked about.
pq.write_table(table, PQT, compression="snappy") turns the column-store file on disk into one roughly a tenth the size of the row-store file. Snappy compression is column-local — each column is compressed independently, and a column of integers compresses down to roughly the bit-width of its values, while a row store mixes types and can't exploit the regularity. Why columnar compresses ~10× better: a column of city_id values from 1–24 carries at most 5 bits of information per row, so the column store can dictionary-encode those values into tightly packed 5-bit codes. The row store keeps the same value as an 8-byte INTEGER inside an 8 KiB page, and any compression run over that page mixes gross_paise (high entropy) with city_id (low entropy), so the ratio degrades toward the entropy-weighted average. Storing similar things together is the compression win, before any speed win.
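The 5-bit claim is simple arithmetic. A minimal stdlib sketch of the dictionary-encoding math — the packed size is the theoretical bit-packed minimum, not what any particular Parquet writer emits:

```python
import math

# Dictionary-encoding arithmetic for the low-cardinality city_id column.
ROWS = 10_000_000        # 1 crore, as in the example above
CITIES = 24              # city_id values 1-24

code_bits = math.ceil(math.log2(CITIES))   # bits per dictionary code
packed_mb = ROWS * code_bits / 8 / 1e6     # tightly bit-packed codes
raw_mb = ROWS * 8 / 1e6                    # 8-byte integers in a row page

print(f"{code_bits} bits/value: {packed_mb:.2f} MB packed vs {raw_mb:.0f} MB raw "
      f"({raw_mb / packed_mb:.0f}x smaller)")
```

Real writers add dictionary pages and metadata on top of this floor, but the order of magnitude survives contact with production files.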
pq.read_table(PQT, columns=[...]) is where the speed comes from on the read side. Parquet's reader honours the columns= projection at the I/O layer — it never reads the bytes for the 36 unprojected columns. The row-store has no such option. Even when you write SELECT city_id, ... the engine must read every column's bytes to skip them.
The 47 seconds vs 2.2 seconds is what shows up in production as "the warehouse is dying". On 32 crore rows of orders, scale this 32× and you get the 47-minute scan that started this chapter. The shape of the bottleneck is the same; only the absolute numbers grow.
Why the column-store result includes group-by-in-Python: doing the group-by in pandas is genuinely slower than doing it in a real engine like DuckDB or ClickHouse, which can vectorise the group-by over Parquet pages without materialising the data into Python objects. The example uses pandas to keep the dependency surface tiny, so the gap you see is a lower bound for the column-store. With DuckDB the same query runs in ~0.3 seconds and the speedup goes to roughly 150×. The point of the example is the I/O savings, not the optimal engine.
What changes when you cross the wall
The end of Build 5 is not the end of the warehouse. It is the end of the warehouse's life as a single Postgres instance. The fix has three parts and they ship together.
The first part is storage: move the analytical tables out of the row-organised OLTP store and into a columnar format — Parquet on object storage in 2026, governed by an Iceberg or Delta or Hudi table format that adds ACID and time travel and partition evolution on top of the file layer. The OLTP database keeps its job (transactions, consistent reads of single rows). The analytical store handles the scans. The two are connected by a CDC pipeline (Build 11) so the analytical store stays fresh.
The second part is engine separation: a query engine like Trino, Spark, or DuckDB reads the columnar files. The engine has its own elasticity story — spin up 50 nodes for a heavy month-end recompute, scale back to two for steady state — and the storage layer is unchanged regardless of how many engines query it. Storage and compute are decoupled, which is the architectural shift that makes the warehouse stop dying when one analyst writes a wide query.
The third part is layout-aware writing: partitioning by date and clustering or Z-ordering by frequently-filtered columns lets the engine skip whole files based on file-level statistics. Predicate pushdown means a WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31' query reads only the files for that date range — a thousand-fold I/O reduction over a full scan, without changing the query. Build 6 is mostly the producer's view of these techniques; Build 12 is the operator's view at lakehouse scale.
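File-level statistics are what make that skip possible. A minimal sketch of min/max pruning, assuming one file per month — the file names, date ranges, and the shape of the metadata here are illustrative, not a real Iceberg manifest:

```python
# Statistics-based file skipping: every data file carries min/max values
# for created_at; the planner prunes files whose range cannot overlap
# the query predicate. ISO date strings compare correctly as strings.
files = [
    {"path": "orders/2025-12.parquet", "min": "2025-12-01", "max": "2025-12-31"},
    {"path": "orders/2026-01.parquet", "min": "2026-01-01", "max": "2026-01-31"},
    {"path": "orders/2026-02.parquet", "min": "2026-02-01", "max": "2026-02-28"},
]

def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the [lo, hi] predicate."""
    return [f["path"] for f in files if f["max"] >= lo and f["min"] <= hi]

print(prune(files, "2026-01-01", "2026-01-31"))   # -> ['orders/2026-01.parquet']
```

A real table format stores these statistics in its manifest layer, so the pruning happens during planning, before a single data byte is read from object storage.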
You will not feel any of these benefits if you skip them. A team that ships Iceberg without partitioning will still scan whole tables. A team that ships Parquet without statistics-aware writers will still over-read. The rest of Build 6 is a careful walk through each piece of the columnar layer, in the order a producer encounters them: row vs columnar, Parquet end-to-end, partitioning strategies, partition evolution, compaction, and the three lakehouse formats in turn.
Common confusions
- "A bigger Postgres machine will fix this." No — the runtime scales linearly with cumulative row count and the SLA does not. Going from a `db.r6g.4xlarge` to a `db.r6g.16xlarge` buys roughly 4× throughput, which covers perhaps 18 months of growth at typical Indian-startup growth rates of 8–15% MoM. Then you hit the same wall on a four-times-bigger bill, with the additional pain that the bigger box has a longer failover and a worse blast radius. The wall is a layout problem, not a horsepower problem.
- "Adding indexes will fix this." Indexes narrow the rows considered; they do not narrow the columns read per row. A `(created_at, city_id)` index on the orders table helps a query that picks 100 orders for one city on one day, but for a 14-month rollup it still has to fetch every selected row's full tuple. Worse, every additional index slows down inserts in proportion, so there is a write-side tax for a read-side benefit that scales the wrong way.
- "Materialised views will fix this." They will fix exactly the queries you can predict in advance, and only those. The analyst who wrote the 14-month-net-revenue-by-city-by-week query has not yet asked the next question, which will project a different combination of columns over a different filter. A warehouse fast enough to pre-compute the 25 dashboards leadership cares about is still slow for the 250 ad-hoc questions the analysts will ask between now and the next quarter's plan.
- "Snowflake / BigQuery / Redshift will fix this without us understanding why." They will, but understanding why is the point. All three are columnar under the hood, and all three expose the same producer-side levers — partitioning, clustering, file size, compaction. Teams that buy the warehouse without learning the levers end up with a warehouse that's twice as fast as Postgres but costs 30× more, instead of 100× as fast at 5× the cost.
- "This is the same as the 'reprocessing everything' wall from Build 2." No — that wall was about transform-side cost (re-running the same job from scratch). This wall is about query-side cost (one ad-hoc query reading more bytes than it needs). They share the linear-in-N shape but the fixes are unrelated. Build 3 (incremental) fixes the first; Build 6 (columnar) fixes the second.
Going deeper
The exact moment OLTP-as-warehouse stops working
The clean threshold is not in row count, it is in the ratio of analytical-query bytes-read to the active dataset size. As long as your hottest analytical scan reads less than the buffer-cache size — say 32 GB of shared_buffers on a typical r6g.4xlarge — Postgres serves it from memory and the wall is invisible. The first week your orders table grows past 32 GB, or your typical analytical query starts touching tables that don't co-fit in the cache, the I/O profile flips. Suddenly the analytical query is the dominant disk consumer, and every other workload — OLTP writes, dashboard reads, ingestion jobs — fights it for the same disk queue. This is the inflection, and it usually happens at 10–50 GB of hot table data on a single-instance warehouse for an Indian-scale fintech or e-commerce company.
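The inflection can be phrased as a rule of thumb. A hedged diagnostic sketch — the 0.5 and 1.0 thresholds are illustrative choices for this chapter, not engine constants:

```python
def warehouse_wall_risk(hot_scan_gb: float, shared_buffers_gb: float) -> str:
    """Classify how close a single analytical scan is to the wall.

    Compares the bytes the hottest scan reads against the buffer cache.
    The thresholds are rules of thumb, not Postgres-documented boundaries.
    """
    ratio = hot_scan_gb / shared_buffers_gb
    if ratio < 0.5:
        return "served from cache; wall invisible"
    if ratio < 1.0:
        return "cache pressure; other workloads start missing"
    return "scan exceeds cache; disk contention, the wall is here"

print(warehouse_wall_risk(hot_scan_gb=12, shared_buffers_gb=32))
print(warehouse_wall_risk(hot_scan_gb=45, shared_buffers_gb=32))
```

Plug in your own hottest dashboard query's bytes-read (EXPLAIN (ANALYZE, BUFFERS) reports it) to see which side of the inflection you are on.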
Why columnar wasn't always the answer
Until the mid-2000s the dominant analytical databases were row-organised — Oracle, DB2, SQL Server. Mike Stonebraker's C-Store paper (2005) and the Vertica company that came out of it (acquired by HP in 2011) made the empirical case that columnar layouts win analytical workloads by 50–100× in real benchmarks, and the entire MPP analytical-database industry pivoted within a decade. Snowflake (founded 2012, GA 2015) shipped a columnar engine on object storage from day one. Parquet (2013, out of the Hadoop ecosystem) and Apache Arrow (2016) standardised the format. By the time Iceberg (2017), Delta (2019), and Hudi (2017) landed, the question was no longer whether to go columnar but which table format on top of Parquet to pick. The 20-year delay was not technical — Sybase IQ shipped a columnar engine in 1996 — but commercial: row stores were already entrenched as the OLTP standard, and the analytical workload was a smaller market until the data deluge of the late 2000s made it the bigger one.
Why Razorpay, Zerodha, and Flipkart all hit this wall by Series B
The stories rhyme. Razorpay started with Postgres as its single store; by 2019 the analytical queries were colliding with payment-write queries in the same cluster, and the team peeled the analytical reads onto a Redshift warehouse with a CDC pipeline from Postgres. Zerodha's tick database started as a row-store-on-NVMe and moved to a columnar ClickHouse cluster around 2020 when intraday tick volumes crossed 5 crore per session and pre-market analytics started crowding out the live order book. Flipkart's catalogue analytics moved from row to columnar via internal forks of Druid and later to Iceberg-on-S3 with Trino once Big Billion Days made the daily peak 14× the steady-state load. The pattern is universal at Series B / scale-up time: the warehouse worked at 50 lakh rows, started feeling slow at 5 crore, and fell over at 50 crore. Every Indian unicorn has some version of the story; none of them avoided the wall.
What "fast enough" actually means at this scale
A column-store on object storage reads a 14-month rollup over 32 crore rows in roughly 3–8 seconds with a properly partitioned table and a competent engine (Trino, ClickHouse, BigQuery). The same query in Postgres-as-warehouse takes 30–60 minutes and locks the cluster. The 500–1000× ratio is the wall in numerical form. There is no continuum where Postgres-with-tuning gets to 30 seconds and columnar gets to 3 — it is a step function, because the bottleneck flips from I/O-bound (row store) to CPU-bound (column store) somewhere in the middle. Understanding which side of the step you are on is the entire point of the diagnostic exercise.
Where this leads next
- /wiki/row-oriented-vs-columnar-revisited-from-the-writers-side — the Build 6 opener that walks the layout difference at byte level from the writer's side.
- /wiki/parquet-end-to-end-what-you-write-what-you-get-back — the file format Asha's analytical store will be built on.
- /wiki/partitioning-strategies-date-hash-composite — the producer-side lever that turns the 8× columnar speedup into the 500× lakehouse speedup.
- /wiki/iceberg-delta-hudi-from-the-producers-perspective — the table formats that bolt ACID and time travel on top of the columnar files.
The arc from here is straightforward: the chapters of Build 6 walk a producer through writing a Parquet file, partitioning it, evolving the partition scheme, compacting small files, and choosing between Iceberg, Delta, and Hudi. Each piece is a producer-side decision; together they convert a row-store warehouse that fell over on a Tuesday into a columnar lakehouse that holds up under 100× the load on cheaper storage.
References
- The Vertica Analytic Database: C-Store 7 Years Later — Lamb et al., VLDB 2012. The empirical case for columnar storage in analytical workloads, with measured 50–100× gains on real benchmarks.
- C-Store: A Column-oriented DBMS — Stonebraker et al., VLDB 2005. The original architecture paper that started the pivot.
- Apache Parquet documentation — the on-disk format specification, including row-group structure, page layout, and statistics that enable predicate pushdown.
- Snowflake's elastic data warehouse — SIGMOD 2016 paper — Dageville et al. The architectural shift to storage/compute separation that defined the cloud-warehouse era.
- The Iceberg paper / spec — the table format that addressed the producer-side ACID and partition-evolution gaps in raw Parquet.
- Designing Data-Intensive Applications — Kleppmann, Chapter 3. The textbook treatment of row-vs-column trade-offs at the right level for a working engineer.
- /wiki/wall-re-processing-everything-every-night — the Build 2 wall that motivated incremental processing; same linear-in-N shape, different fix.
- /wiki/wall-you-cant-debug-what-happened-last-tuesday — the Build 4 wall that motivated lineage; the chapter that started Build 5, which this chapter closes.