Row-oriented vs columnar, revisited from the writer's side
Aditi is a backend engineer at a Pune-based logistics startup. Yesterday she pushed INSERT INTO shipments VALUES (...) into Postgres and the write returned in 4 ms. This morning her data lead asked her to write the same shipment record into a Parquet file that the warehouse team will read tonight, and the writer she imported wants her to call parquet_writer.write_batch(table) with a thousand rows at a time, not one. Her instinct says one record, one write. The format says no — give me a batch, I will rotate the bytes ninety degrees and lay them out by column. This chapter is about why the format insists on that rotation, what the bytes look like before and after, and what the producer pays for the rotation in latency, memory, and code shape.
A row store writes one record by appending its fields back-to-back; a column store writes one batch by appending each field's values into its own contiguous run. The rotation is the entire difference, and it is paid for at write time — in batch buffering, encoding cost, and a flush boundary the producer must own. Once you can see the bytes change shape under your fingers, every other Build 6 decision (partitioning, compaction, statistics) becomes a layout question, not a tuning question.
Watching one record become bytes — twice
Take a single shipment record from Aditi's table. It has six fields you will see across most logistics-style schemas:
shipment_id = 9_482_711
customer_id = 41_337
city_id = 7 # Pune
status = "DELIVERED"
weight_grams = 1_240
created_at = 1_745_582_400 # 2025-04-25 17:30 IST as unix seconds
The row store writes this as one tuple. The column store writes it — eventually, after a thousand of its siblings show up — as one cell in each of six columns. The difference is visible at byte level.
Why the same record cannot be both layouts at once on disk: a record is logical; the layout is physical. Either one record's fields are contiguous (row), or all records' first field is contiguous followed by all records' second field (column). You cannot have both contiguous on a single linear address space without storing the data twice. Some systems (Apache Kudu, hybrid stores) keep two copies for this exact reason and pay double the storage to avoid the trade-off — but the ones that don't pay double have to pick a layout, and the pick is dictated by the dominant query shape.
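The either/or is visible with nothing but the standard library. A minimal sketch with two integer fields and three records, so the bytes stay readable (the field choice is illustrative):

```python
import struct

# Three (shipment_id, weight_grams) records, both fields as little-endian int64.
records = [(1, 10), (2, 20), (3, 30)]

# Row layout: each record's fields sit back-to-back.
row_bytes = b"".join(struct.pack("<qq", sid, w) for sid, w in records)

# Column layout: every shipment_id first, then every weight.
col_bytes = (b"".join(struct.pack("<q", sid) for sid, _ in records) +
             b"".join(struct.pack("<q", w) for _, w in records))

# Same 48 bytes, different order on the one linear address space.
assert len(row_bytes) == len(col_bytes) == 48
assert row_bytes != col_bytes
assert col_bytes[:24] == struct.pack("<qqq", 1, 2, 3)  # the ids are now contiguous
```

The assertions are the point: identical content, physically different bytes, which is why a reader built for one layout cannot parse the other.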
The row layout's win is locality of one record. If your read pattern is "give me everything about shipment 9482711", the row store does one I/O — read the page, return the tuple — and that is unbeatable. Postgres, MySQL, SQLite, every OLTP engine you have ever used, optimises for this pattern because OLTP workloads ask for one record at a time.
The column layout's win is locality of one field. If your read pattern is "give me the average weight of every delivered shipment last quarter", the column store reads only weight_grams and status and created_at — three of six columns — and it reads each as one tight, type-uniform run that compresses 5–10× better than the same values would compress inside a row. The difference scales with the width of the table: a 6-column schema gives a 2× win, a 40-column OLTP-style schema gives a 14× win, a 200-column warehouse fact table gives a 50× win.
Why the writer must batch — the buffer that makes columnar possible
A row writer can take one record and append it: the field widths are known, the tuple is self-contained. A column writer cannot, because writing one record into a column store would mean opening six files, appending one value to each, and closing them — which is six syscalls per record, no compression possible (you have one value per column), and a fragmented mess on disk. The column writer must hold many records in memory before it can write any of them, because a column run is only useful at length.
This is the writer-side trade-off that Aditi felt this morning. Her instinct said INSERT one row. The format said write_batch(table) with at least 1000 rows. The format is right — and the size of "at least" is one of the most important knobs the producer turns.
Why this matters operationally: the writer's batch is your durability boundary. A row store's durability boundary is the WAL fsync after every commit — submillisecond. A column store's durability boundary is the row-group flush — seconds to minutes. If your producer dies between flushes, every record in the in-memory batch is gone, regardless of how many write calls you made. This is why production columnar pipelines (Spark, Flink, Kafka Connect's S3 sink) all wrap the writer in a checkpoint scheme that controls when flushes happen — the writer alone is unsafe, the writer-plus-checkpoint is the unit of correctness. We come back to this in Build 8 and Build 9.
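The writer-plus-checkpoint unit those frameworks provide can be sketched in a few lines. Everything here is hypothetical shape, not any real API: `flush` stands in for a row-group write, `save_offset` for a Kafka offset commit. The invariant is that the offset is persisted only after the batch it covers is durable, so a crash replays the lost buffer instead of silently dropping it:

```python
class CheckpointedColumnWriter:
    """Hypothetical sketch: batch buffering plus an offset checkpoint."""

    def __init__(self, flush, save_offset, batch_size=1000):
        self.flush = flush              # makes one batch durable (e.g. a row group)
        self.save_offset = save_offset  # persists the source offset (e.g. Kafka commit)
        self.batch_size = batch_size
        self.buf = []
        self.pending_offset = None

    def write(self, record, offset):
        self.buf.append(record)
        self.pending_offset = offset
        if len(self.buf) >= self.batch_size:
            self.flush(self.buf)                   # 1. the batch hits disk first
            self.save_offset(self.pending_offset)  # 2. only then commit the offset
            self.buf = []

flushed, committed = [], []
w = CheckpointedColumnWriter(flushed.append, committed.append, batch_size=3)
for off, rec in enumerate(f"rec{i}" for i in range(7)):
    w.write(rec, off)
# Two full batches are durable; "rec6" is buffer-only and offset 6 is
# uncommitted, so a crash here replays exactly that one record.
```

Reversing steps 1 and 2 turns replay-on-crash into data loss, which is the failure mode the paragraph above describes.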
Build the rotation in 80 lines of Python
The example below reproduces the writer-side rotation by hand. It accepts a stream of dict records, buffers them into a row-batch, then writes the same records two ways — row-oriented to a JSON-lines file (the simplest row store) and column-oriented to a Parquet file. The point is to watch the rotation happen in code so the abstraction is concrete.
# rotate_writer.py — see one stream become two layouts.
import json, os, time
from collections import defaultdict

import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP = 1000  # flush threshold — the writer's contract with the format.

SCHEMA = pa.schema([
    ("shipment_id", pa.int64()),
    ("customer_id", pa.int32()),
    ("city_id", pa.int8()),
    ("status", pa.dictionary(pa.int8(), pa.string())),
    ("weight_grams", pa.int32()),
    ("created_at", pa.int64()),
])

def stream_records(n):  # synthesise n shipments.
    import random; random.seed(0)
    statuses = ["BOOKED", "PICKED", "IN_TRANSIT", "DELIVERED", "RTO"]
    base_ts = 1_745_500_000
    for i in range(n):
        yield {
            "shipment_id": 9_000_000 + i,
            "customer_id": random.randint(1000, 200_000),
            "city_id": random.randint(1, 24),
            "status": random.choices(statuses, weights=[1, 2, 3, 8, 1])[0],
            "weight_grams": random.randint(50, 25_000),
            "created_at": base_ts + i,
        }

# --- ROW WRITER: one record, one append. Tiny, no buffer. ---
def write_row(records, path):
    t0 = time.perf_counter()
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")  # one line == one tuple.
    return time.perf_counter() - t0

# --- COLUMN WRITER: buffer into six arrays, flush at ROW_GROUP. ---
def write_column(records, path):
    t0 = time.perf_counter()
    buf = defaultdict(list)  # six parallel column buffers.
    writer = pq.ParquetWriter(path, SCHEMA, compression="snappy")
    for r in records:
        for k, v in r.items():
            buf[k].append(v)
        if len(buf["shipment_id"]) >= ROW_GROUP:
            writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA))
            buf.clear()
    if buf["shipment_id"]:  # final partial batch.
        writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA))
    writer.close()
    return time.perf_counter() - t0

if __name__ == "__main__":
    N = 200_000
    rows = list(stream_records(N))
    t_row = write_row(rows, "/tmp/ship_row.jsonl")
    t_col = write_column(rows, "/tmp/ship_col.parquet")
    s_row = os.path.getsize("/tmp/ship_row.jsonl") / 1024
    s_col = os.path.getsize("/tmp/ship_col.parquet") / 1024
    print(f"row : {t_row:5.2f}s size {s_row:7.1f} KB "
          f"({s_row*1024/N:.1f} bytes/record)")
    print(f"column : {t_col:5.2f}s size {s_col:7.1f} KB "
          f"({s_col*1024/N:.1f} bytes/record)")
    print(f"size ratio (row / col): {s_row/s_col:.1f}x")
# Sample run on a 2024 M2 MacBook Air, 2 lakh records:
row : 0.71s size 18432.6 KB (94.4 bytes/record)
column : 0.49s size 1872.4 KB (9.6 bytes/record)
size ratio (row / col): 9.8x
Four lines of this code do the real work; the rest is plumbing.
for r in records: ... f.write(json.dumps(r) + "\n") is the row writer in full — one record, one append, no buffer, no flush boundary. Every record is durable when the OS writes the line. This is what Aditi's instinct wanted to do, and for one-record-at-a-time writes it is genuinely the simplest correct thing.
for k, v in r.items(): buf[k].append(v) is the rotation. Six columns, six lists, one record's six values added to six different lists in lockstep. The record goes in as a row; it lives in memory as six pieces of six columns. Why this rotation is the point: every column-store optimisation downstream — dictionary encoding the status column, delta encoding the timestamps, bit-packing the small-range city_ids — only works once values of the same type and similar magnitude are next to each other. That contiguity is what the rotation creates. There is no way to dictionary-encode status if its values are scattered between customer_id and weight_grams in the byte stream.
if len(buf["shipment_id"]) >= ROW_GROUP: writer.write_table(...) is the flush. This is the line that determines durability latency. With ROW_GROUP = 1000 and records arriving at 100/sec, every record waits up to 10 seconds to be on disk. Smaller row groups give you faster durability and worse compression; larger row groups give you slower durability and better compression. The producer owns this trade-off; the format does not pick for you.
writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA)) is where Parquet does its work. Inside write_table, each column is encoded with the scheme that fits its type (dictionary for status, plain for shipment_id, delta for created_at), the encoded bytes are compressed with Snappy, and the whole row group is appended to the file as a self-contained unit with min/max/null statistics in its footer. The 9.8× size win in the output comes from this step, not from the rotation alone — the rotation is necessary but not sufficient.
Why the row writer is faster on this small dataset and slower at scale: at 2 lakh records the row writer wins on wall-clock because it does no encoding, no compression, no schema validation. At 2 crore records the column writer wins because the encoding work is amortised over the row group and the output file is 10× smaller, which means 10× less to fsync, 10× less to upload to S3, 10× less to read back. The crossover is around 50k–100k records on a typical laptop. Below it, JSONL wins; above it, Parquet wins. Production data engineering lives well above the crossover.
Where the rotation cost shows up in production
The rotation has three costs the producer pays. Each one shows up as a real engineering decision in real Indian-scale pipelines.
Memory. A 1000-row buffer for a 40-column wide table holds roughly 1000 × 40 × 8 bytes = 320 KB if every column is a 64-bit integer; closer to 2 MB once strings and lists are included. A 1-million-row buffer (typical for a Spark Parquet writer) holds 2 GB. Streaming pipelines at PhonePe or Razorpay, where one writer process serves hundreds of partitions, have to size their JVM heap for partitions × row_group_bytes — and if they get it wrong they OOM mid-flush, lose the buffer, and replay from the last checkpoint. The buffer size is not a vanity knob.
Latency. Records sit in the buffer until the flush. For a batch pipeline reading from Postgres once an hour, this is invisible — the whole batch is one row group, written once. For a streaming pipeline reading from Kafka, this is visible as the gap between "Kafka offset committed" and "data queryable from the warehouse". A team that wants 30-second freshness on a streaming columnar sink has to flush every 30 seconds, which means small row groups, which means worse compression. There is no free freshness.
Schema rigidity. The row writer accepts any dict; the column writer must know the schema upfront because each column's encoder is type-specific. If a producer sees a new field — delivery_otp shows up in record number 547 — the row writer happily writes it; the column writer either reshapes the buffer (expensive, and you must rewrite the file footer) or rejects the record. This is the seed of the entire schema-evolution problem in Build 6 and Build 11.
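The asymmetry fits in a dozen lines of stdlib Python. The helper names and the rejection behaviour are a sketch of the idea, not any real writer's API:

```python
import json

SCHEMA_FIELDS = ("shipment_id", "customer_id", "city_id",
                 "status", "weight_grams", "created_at")

def row_write(record, lines):
    lines.append(json.dumps(record))  # any dict is welcome

def column_buffer(record, buf):
    unknown = set(record) - set(SCHEMA_FIELDS)
    if unknown:  # each column's encoder is type-specific: the schema is a contract
        raise ValueError(f"fields not in schema: {unknown}")
    for k in SCHEMA_FIELDS:
        buf.setdefault(k, []).append(record[k])

good = {"shipment_id": 1, "customer_id": 2, "city_id": 3,
        "status": "BOOKED", "weight_grams": 4, "created_at": 5}
surprise = dict(good, delivery_otp="4471")  # the new field shows up mid-stream

lines, buf = [], {}
row_write(surprise, lines)        # row writer: one more line, no fuss
try:
    column_buffer(surprise, buf)  # column writer: contract violated
except ValueError as e:
    rejection = str(e)
```

A real writer has a third option the sketch omits, reshaping the buffer and rewriting the footer, which is exactly the expensive path the paragraph mentions.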
These costs are why the columnar writer is rarely written by hand in production. It lives inside Spark, Flink, Kafka Connect, dbt, or a managed service — software whose job is to own the buffer, the flush, the schema, and the checkpoint, so the producer's code can stay record-shaped while the storage stays column-shaped.
Common confusions
- "Columnar is just row data with the columns swapped on read." No — the bytes on disk are physically different. Columnar files store each field as its own contiguous run with its own encoding scheme. You cannot read a Parquet file with a row-store reader; you cannot read a row-store file with a column-store reader. The format is the layout.
- "A row store is always slower than a column store." False on point lookups by primary key — the row store does one I/O, the column store does six (one per column), and the column store loses by 3–6×. Row stores are not "old"; they are right for OLTP workloads. Column stores are right for analytical workloads. Both are still the right answer, in their own zones.
- "Compression is what makes columnar small." Compression amplifies the win, but the rotation alone gives you 2–4× already. A column of
int8city_ids stored uncompressed already takes 1 byte per row; the same column in a row store takes 8 bytes per row (PostgresINTEGERis 4 bytes plus tuple overhead). Dictionary and bit-packing schemes then push the column-store down to 5–6 bits per row. The rotation is the foundation; compression is the multiplier. - "You can write Parquet one record at a time." The API will let you call
write_tablewith a single-row table, and the file will be valid — but every "row group" will contain one row, the encoders never see enough values to amortise their overhead, and the file will be larger than the equivalent JSONL. The format technically tolerates it; the format absolutely does not reward it. Always batch. - "Hybrid stores like Apache Kudu give you both for free." They don't — they pay 2× storage to keep both layouts, and they pay write amplification on every update. Hybrid stores are useful when your workload is genuinely 50/50 OLTP and analytical; they are wasteful when it is 95/5 either way, which is most of the time.
Going deeper
Why the row group, not the file, is the unit of atomicity
Parquet writers can append many row groups to a single file before closing it. Each row group has its own column chunks, its own encoding, its own statistics — and importantly, each row group is independently readable. A reader that wants the third row group can seek directly to its offset (recorded in the file's footer index) and decode just it, ignoring the other 99. This is what makes a 1 GB Parquet file with a thousand row groups behave like a thousand small files for query purposes, without the small-file metadata explosion. The trade-off — bigger row groups compress better but waste read effort on selective queries — is the central tuning question of Build 6, and it is the same trade-off Hive partitioning solves at a coarser scale.
Dictionary encoding and the ladder of cardinality
A column of statuses with five distinct values (BOOKED, PICKED, IN_TRANSIT, DELIVERED, RTO) gets dictionary-encoded into a 3-bit code per row plus a 5-entry dictionary at the start of the column chunk. A column of customer_ids with 50 lakh distinct values cannot be dictionary-encoded — the dictionary would be larger than the data — and so falls back to plain encoding. The crossover is around cardinality × log(cardinality) < row_count, which is why Parquet writers track per-chunk cardinality and switch encodings adaptively. For an Indian e-commerce schema, country_code, payment_method, status, gender, and state_code all dictionary-encode beautifully; customer_id, email, and transaction_id do not. Knowing which is which lets you predict your file sizes within 20%.
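The crossover can be sanity-checked with a back-of-envelope cost model. This is a rough sketch of the economics, not Parquet's actual switching heuristic:

```python
import math

def dictionary_wins(n_rows, n_distinct, plain_bits=64):
    """Dictionary encoding pays for n_distinct entries up front, then
    spends ceil(log2(n_distinct)) bits per row instead of plain_bits."""
    code_bits = max(1, math.ceil(math.log2(n_distinct)))
    encoded = n_distinct * plain_bits + n_rows * code_bits
    return encoded < n_rows * plain_bits

# Five statuses over a lakh of rows: 3-bit codes, an easy win.
assert dictionary_wins(100_000, 5)
# 50 lakh distinct customer_ids over 50 lakh rows: the dictionary alone
# is as big as the plain data, so plain encoding wins.
assert not dictionary_wins(5_000_000, 5_000_000)
```

The model captures the direction of the trade-off: the win grows with rows per distinct value, and vanishes as cardinality approaches the row count.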
Why CDC pipelines emit row events but warehouses store columns
Debezium reads the Postgres write-ahead log and emits one Kafka message per row change — BEFORE/AFTER images of one tuple. The warehouse it eventually feeds (Iceberg, Delta) stores the same data column-wise. The shape transformation happens in the sink — Kafka Connect's Iceberg sink, Flink's IcebergSink, or a hand-rolled equivalent — which buffers row events into row groups and flushes them as columns. This is the gap Build 11 (CDC) closes; for now, recognise that the format mismatch between OLTP and analytical storage is the entire reason that pipeline exists.
How Zerodha's tick warehouse made the rotation work at 5 crore ticks/day
Zerodha's tick database started as PostgreSQL with daily-partitioned tables. By 2020 the analytical workload — pre-market backtests, post-trade reports, regulatory filings — was creating sequential scans that blocked the live order-acknowledgement pipeline running on the same cluster. The migration moved tick data to ClickHouse (a columnar engine) with the writer holding a 64-row block before any column work. ClickHouse's MergeTree engine writes column files per partition per part, and small parts get merged into bigger parts in a background compaction loop — same buffer-and-flush pattern as Parquet, but continuous rather than file-bounded. After the migration, analytical queries that took 20 minutes on Postgres ran in 3 seconds on ClickHouse, and the live order pipeline stopped sharing I/O with anyone. The rotation was the architectural unlock; the cluster topology, replication, and compaction were operational details on top of it.
Where this leads next
The rest of Build 6 walks the producer's path through the columnar layer one piece at a time. The next chapter, /wiki/parquet-end-to-end-what-you-write-what-you-get-back, opens up the Parquet file you just wrote and decodes the footer, the row group, the column chunk, and the encoded pages — the format's full byte-level grammar. After that, /wiki/partitioning-strategies-date-hash-composite covers the next-level-up layout decision (which directory does this row group live in?), and /wiki/small-file-problem-and-the-compaction-job covers what to do when a streaming writer leaves you with millions of tiny row groups instead of a few thousand healthy ones.
The one-line summary of the build is: every Parquet/Iceberg/Delta decision the producer makes is a layout decision, and every layout decision can be traced back to the rotation in this chapter. Once you internalise that, the rest of Build 6 is reading the documentation with the right lens.
References
- Apache Parquet specification — the on-disk format definition: row groups, column chunks, pages, the footer index, and the encoding catalogue.
- Dremel: Interactive Analysis of Web-Scale Datasets — Melnik et al., VLDB 2010. The paper that introduced the column-storage encoding scheme Parquet adopted, including repetition and definition levels for nested data.
- Apache Arrow columnar format — the in-memory columnar representation the Parquet reader/writer uses. Reading this clarifies how the rotation lives in process memory before it lands on disk.
- Designing Data-Intensive Applications, Chapter 3 — Kleppmann. The textbook treatment of row-vs-column storage at the right depth for a working engineer.
- ClickHouse: a fast open-source OLAP database — Yandex/ClickHouse Inc. The architecture overview makes the buffer-and-merge pattern explicit, which is exactly the shape this chapter described.
- The Snowflake Elastic Data Warehouse — SIGMOD 2016 — Dageville et al. The cloud-warehouse paper that locked storage/compute separation in as the modern architecture.
- /wiki/wall-the-warehouse-is-dying-under-scan-heavy-queries — the Build 5 wall this chapter is the first answer to.
- /wiki/parquet-end-to-end-what-you-write-what-you-get-back — the next chapter, opening the file you just wrote.