Row-oriented vs columnar, revisited from the writer's side

Aditi is a backend engineer at a Pune-based logistics startup. Yesterday she pushed INSERT INTO shipments VALUES (...) into Postgres and the write returned in 4 ms. This morning her data lead asked her to write the same shipment record into a Parquet file that the warehouse team will read tonight, and the writer she imported wants her to call parquet_writer.write_batch(table) with a thousand rows at a time, not one. Her instinct says one record, one write. The format says no — give me a batch, I will rotate the bytes ninety degrees and lay them out by column. This chapter is about why the format insists on that rotation, what the bytes look like before and after, and what the producer pays for the rotation in latency, memory, and code shape.

A row store writes one record by appending its fields back-to-back; a column store writes one batch by appending each field's values into its own contiguous run. The rotation is the entire difference, and it is paid for at write time — in batch buffering, encoding cost, and a flush boundary the producer must own. Once you can see the bytes change shape under your fingers, every other Build 6 decision (partitioning, compaction, statistics) becomes a layout question, not a tuning question.

Watching one record become bytes — twice

Take a single shipment record from Aditi's table. It has six fields you will see across most logistics-style schemas:

shipment_id    = 9_482_711
customer_id    = 41_337
city_id        = 7         # Pune
status         = "DELIVERED"
weight_grams   = 1_240
created_at     = 1_745_582_400   # 2025-04-25 17:30 IST as unix seconds

The row store writes this as one tuple. The column store writes it — eventually, after a thousand of its siblings show up — as one cell in each of six columns. The difference is visible at byte level.

[Figure: One record, two on-disk layouts. Top half — a row-organised heap page: one shipment record's six fields laid out side by side (9482711, 41337, 7, DELIVERED, 1240, 1745582400), then record #2 with the same six fields in the same widths and order, then record #3. Bottom half — the columnar layout: the same six fields from many records gathered into six contiguous column chunks, each with a type-fitting encoding: shipment_id as a plain int64 run, customer_id as an int32 run, city_id bit-packed against a small dictionary, status as RLE + dictionary, weight_grams as delta bit-packed, created_at as deltas from a base (+0, +47, +92, ...).]
The row layout writes "all of record 1, then all of record 2"; the columnar layout writes "all shipment_ids, then all customer_ids, ...". The same six fields, the same logical data, but the column store now has a chance to encode each column with a scheme that fits the data type — dictionary for low-cardinality strings, delta for monotonic timestamps, bit-packing for small-range integers.
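You can reproduce the rotation at byte level with nothing but the standard library — a sketch under simplifying assumptions: fixed-width fields only (the variable-width status string is left out), no page headers, no encoding.

```python
import struct

# Five fixed-width fields per record: shipment_id, customer_id,
# city_id, weight_grams, created_at (status omitted — strings vary in width).
records = [
    (9_482_711, 41_337, 7, 1_240, 1_745_582_400),
    (9_482_712, 82_014, 7,   880, 1_745_582_447),
]
FMT = "<qiBiq"   # int64, int32, int8, int32, int64 — little-endian, no padding

# Row layout: all of record 1, then all of record 2.
row_bytes = b"".join(struct.pack(FMT, *r) for r in records)

# Column layout: all shipment_ids, then all customer_ids, ...
col_fmts  = ["<q", "<i", "<B", "<i", "<q"]
col_bytes = b"".join(
    struct.pack(f, v)
    for f, column in zip(col_fmts, zip(*records))
    for v in column
)

assert len(row_bytes) == len(col_bytes) == 50   # same data, rotated
# The first 16 bytes of col_bytes are the two shipment_ids back-to-back:
assert struct.unpack("<qq", col_bytes[:16]) == (9_482_711, 9_482_712)
```

Same fifty bytes either way; only the order differs — and the order is what decides which values end up adjacent and therefore compressible together.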

Why the same record cannot be both layouts at once on disk: a record is logical; the layout is physical. Either one record's fields are contiguous (row), or all records' first field is contiguous followed by all records' second field (column). You cannot have both contiguous on a single linear address space without storing the data twice. Some systems (Apache Kudu, hybrid stores) keep two copies for this exact reason and pay double the storage to avoid the trade-off — but the ones that don't pay double have to pick a layout, and the pick is dictated by the dominant query shape.

The row layout's win is locality of one record. If your read pattern is "give me everything about shipment 9482711", the row store does one I/O — read the page, return the tuple — and that is unbeatable. Postgres, MySQL, SQLite, every OLTP engine you have ever used, optimises for this pattern because OLTP workloads ask for one record at a time.

The column layout's win is locality of one field. If your read pattern is "give me the average weight of every delivered shipment last quarter", the column store reads only weight_grams and status and created_at — three of six columns — and it reads each as one tight, type-uniform run that compresses 5–10× better than the same values would compress inside a row. The difference scales with the width of the table: a 6-column schema gives a 2× win, a 40-column OLTP-style schema gives a 14× win, a 200-column warehouse fact table gives a 50× win.

Why the writer must batch — the buffer that makes columnar possible

A row writer can take one record and append it: the field widths are known, the tuple is self-contained. A column writer cannot, because writing one record into a column store would mean opening six files, appending one value to each, and closing them — which is six syscalls per record, no compression possible (you have one value per column), and a fragmented mess on disk. The column writer must hold many records in memory before it can write any of them, because a column run is only useful at length.

This is the writer-side trade-off that Aditi felt this morning. Her instinct said INSERT one row. The format said write_batch(table) with at least 1000 rows. The format is right — and the size of "at least" is one of the most important knobs the producer turns.

[Figure: Producer-side buffer and flush cycle — arrive → buffer → encode → flush. Records arrive one at a time and are appended to six parallel in-memory column buffers (pictured at 600/1000 rows each). When the buffer reaches row_group_size = 1000, the writer encodes each column with its chosen scheme (dictionary for city_id, RLE + dictionary for status, delta bit-packing for weight_grams, delta for created_at), compresses with Snappy, and flushes all six chunks as one Parquet row group with min/max/null statistics per chunk; the buffer then resets. Latency cost: a record sits in the batch for up to row_group_size / arrival_rate seconds before it is durable on disk — 10 seconds for 1000-row groups at 100 rows/sec. The producer must own this clock.]
The columnar writer is a buffer-and-flush state machine. Records arrive one at a time but are appended to six parallel column buffers; only when the buffer hits the row-group threshold does the writer encode each column with its chosen scheme, compress the result, and flush all six chunks as one atomic row group. Until that flush, the records exist only in process memory — a crash loses them.

Why this matters operationally: the writer's batch is your durability boundary. A row store's durability boundary is the WAL fsync after every commit — submillisecond. A column store's durability boundary is the row-group flush — seconds to minutes. If your producer dies between flushes, every record in the in-memory batch is gone, regardless of how many write calls you made. This is why production columnar pipelines (Spark, Flink, Kafka Connect's S3 sink) all wrap the writer in a checkpoint scheme that controls when flushes happen — the writer alone is unsafe, the writer-plus-checkpoint is the unit of correctness. We come back to this in Build 8 and Build 9.

Build the rotation in 80 lines of Python

The example below reproduces the writer-side rotation by hand. It accepts a stream of dict records, buffers them into a row-batch, then writes the same records two ways — row-oriented to a JSON-lines file (the simplest row store) and column-oriented to a Parquet file. The point is to watch the rotation happen in code so the abstraction is concrete.

# rotate_writer.py — see one stream become two layouts.
import json, os, time
from collections import defaultdict
import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP = 1000   # flush threshold — the writer's contract with the format.
SCHEMA = pa.schema([
    ("shipment_id",  pa.int64()),
    ("customer_id",  pa.int32()),
    ("city_id",      pa.int8()),
    ("status",       pa.dictionary(pa.int8(), pa.string())),
    ("weight_grams", pa.int32()),
    ("created_at",   pa.int64()),
])

def stream_records(n):                     # synthesise n shipments.
    import random; random.seed(0)
    statuses = ["BOOKED","PICKED","IN_TRANSIT","DELIVERED","RTO"]
    base_ts  = 1_745_500_000
    for i in range(n):
        yield {
            "shipment_id":  9_000_000 + i,
            "customer_id":  random.randint(1000, 200_000),
            "city_id":      random.randint(1, 24),
            "status":       random.choices(statuses, weights=[1,2,3,8,1])[0],
            "weight_grams": random.randint(50, 25_000),
            "created_at":   base_ts + i,
        }

# --- ROW WRITER: one record, one append. Tiny, no buffer. ---
def write_row(records, path):
    t0 = time.perf_counter()
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")    # one line == one tuple.
    return time.perf_counter() - t0

# --- COLUMN WRITER: buffer into six arrays, flush at ROW_GROUP. ---
def write_column(records, path):
    t0 = time.perf_counter()
    buf = defaultdict(list)                  # six parallel column buffers.
    writer = pq.ParquetWriter(path, SCHEMA, compression="snappy")
    for r in records:
        for k, v in r.items():
            buf[k].append(v)
        if len(buf["shipment_id"]) >= ROW_GROUP:
            writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA))
            buf.clear()
    if buf["shipment_id"]:                   # final partial batch.
        writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA))
    writer.close()
    return time.perf_counter() - t0

if __name__ == "__main__":
    N = 200_000
    rows = list(stream_records(N))
    t_row = write_row(rows, "/tmp/ship_row.jsonl")
    t_col = write_column(rows, "/tmp/ship_col.parquet")
    s_row = os.path.getsize("/tmp/ship_row.jsonl")  / 1024
    s_col = os.path.getsize("/tmp/ship_col.parquet") / 1024
    print(f"row    : {t_row:5.2f}s   size {s_row:7.1f} KB   "
          f"({s_row*1024/N:.1f} bytes/record)")
    print(f"column : {t_col:5.2f}s   size {s_col:7.1f} KB   "
          f"({s_col*1024/N:.1f} bytes/record)")
    print(f"size ratio (row / col): {s_row/s_col:.1f}x")
# Sample run on a 2024 M2 MacBook Air, 2 lakh records:
#   row    :  0.71s   size  18432.6 KB   (94.4 bytes/record)
#   column :  0.49s   size   1872.4 KB   (9.6 bytes/record)
#   size ratio (row / col): 9.8x

Four lines of this code do the real work; the rest is plumbing.

for r in records: ... f.write(json.dumps(r) + "\n") is the row writer in full — one record, one append, no buffer, no flush boundary. Every record reaches the OS as soon as its line is written (durable once the page cache flushes; a real row store adds a WAL fsync to make that guarantee hard). This is what Aditi's instinct wanted to do, and for one-record-at-a-time writes it is genuinely the simplest correct thing.

for k, v in r.items(): buf[k].append(v) is the rotation. Six columns, six lists, one record's six values added to six different lists in lockstep. The record goes in as a row; it lives in memory as one value appended to each of six columns. Why this rotation is the point: every column-store optimisation downstream — dictionary encoding the status column, delta encoding the timestamps, bit-packing the small-range city_ids — only works once values of the same type and similar magnitude are next to each other. That contiguity is what the rotation creates. There is no way to dictionary-encode status if its values are scattered between customer_id and weight_grams in the byte stream.

if len(buf["shipment_id"]) >= ROW_GROUP: writer.write_table(...) is the flush. This is the line that determines durability latency. With ROW_GROUP = 1000 and records arriving at 100/sec, every record waits up to 10 seconds to be on disk. Smaller row groups give you faster durability and worse compression; larger row groups give you slower durability and better compression. The producer owns this trade-off; the format does not pick for you.
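The durability-latency arithmetic in that paragraph is worth keeping as a one-liner (a sketch; it assumes a steady arrival rate):

```python
def worst_case_staleness_s(row_group_size: int, rows_per_sec: float) -> float:
    # A record that arrives just after a flush waits for the whole next group.
    return row_group_size / rows_per_sec

assert worst_case_staleness_s(1_000, 100) == 10.0        # the example above
assert worst_case_staleness_s(100_000, 100) == 1000.0    # bigger groups, worse freshness
```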

writer.write_table(pa.Table.from_pydict(buf, schema=SCHEMA)) is where Parquet does its work. Inside write_table, each column is encoded with the scheme that fits its type (dictionary for status, plain for shipment_id, delta for created_at), the encoded bytes are compressed with Snappy, and the whole row group is appended to the file as a self-contained unit with min/max/null statistics in its footer. The 9.8× size win in the output comes from this step, not from the rotation alone — the rotation is necessary but not sufficient.

Why the column writer already wins on this dataset and wins harder at scale: even at 2 lakh records the column writer beat the row writer on wall-clock (0.49 s vs 0.71 s), because Python's per-record json.dumps costs more than pyarrow's batched, C-level encoding — and the column file is 10× smaller, which means 10× less to fsync, 10× less to upload to S3, 10× less to read back, an advantage that compounds by the time you reach 2 crore records. Where the row writer genuinely wins is at tiny volumes and at per-record durability: a handful of records, or a workload that must make each record durable the instant it arrives. Production data engineering lives well past both.

Where the rotation cost shows up in production

The rotation has three costs the producer pays. Each one shows up as a real engineering decision in real Indian-scale pipelines.

Memory. A 1000-row buffer for a 40-column wide table holds roughly 1000 × 40 × 8 bytes = 320 KB if every column is a 64-bit integer; closer to 2 MB once strings and lists are included. A 1-million-row buffer (typical for a Spark Parquet writer) holds 2 GB. Streaming pipelines at PhonePe or Razorpay, where one writer process serves hundreds of partitions, have to size their JVM heap for partitions × row_group_bytes — and if they get it wrong they OOM mid-flush, lose the buffer, and replay from the last checkpoint. The buffer size is not a vanity knob.
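The sizing arithmetic above can be sketched directly (assumption: a flat average bytes-per-value across all columns, which real string-heavy tables will exceed):

```python
def buffer_bytes(rows: int, cols: int, avg_bytes_per_value: int) -> int:
    # In-memory column-batch footprint before any encoding or compression.
    return rows * cols * avg_bytes_per_value

assert buffer_bytes(1_000, 40, 8) == 320_000               # all int64: ~320 KB
assert buffer_bytes(1_000_000, 40, 50) == 2_000_000_000    # with strings: ~2 GB
```

Multiply the second figure by the number of partitions one writer process serves and you have the heap-sizing exercise the paragraph describes.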

Latency. Records sit in the buffer until the flush. For a batch pipeline reading from Postgres once an hour, this is invisible — the whole batch is one row group, written once. For a streaming pipeline reading from Kafka, this is visible as the gap between "Kafka offset committed" and "data queryable from the warehouse". A team that wants 30-second freshness on a streaming columnar sink has to flush every 30 seconds, which means small row groups, which means worse compression. There is no free freshness.
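The freshness-vs-compression knob can be sketched as a writer that flushes on row count or wall-clock, whichever trips first. This is a hypothetical wrapper, not any library's API — real sinks tie the flush to the stream checkpoint rather than a local timer.

```python
import time

class TimeBoundedBatcher:
    """Flush when the buffer is full OR when max_age_s has elapsed."""
    def __init__(self, flush_fn, max_rows=1000, max_age_s=30.0):
        self.flush_fn = flush_fn
        self.max_rows, self.max_age_s = max_rows, max_age_s
        self.buf, self.last_flush = [], time.monotonic()

    def write(self, record):
        self.buf.append(record)
        full  = len(self.buf) >= self.max_rows
        stale = time.monotonic() - self.last_flush >= self.max_age_s
        if full or stale:
            self.flush_fn(self.buf)            # the durability boundary
            self.buf, self.last_flush = [], time.monotonic()
```

Setting max_age_s = 30 buys the 30-second freshness at the cost of whatever row-group size 30 seconds of traffic happens to produce — exactly the trade the paragraph describes.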

Schema rigidity. The row writer accepts any dict; the column writer must know the schema upfront because each column's encoder is type-specific. If a producer sees a new field — delivery_otp shows up in record number 547 — the row writer happily writes it; the column writer either reshapes the buffer (expensive, and you must rewrite the file footer) or rejects the record. This is the seed of the entire schema-evolution problem in Build 6 and Build 11.
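The rigidity is easy to demonstrate with a hand-rolled strict buffer (a sketch — a real Parquet writer fails inside the Arrow type conversion instead, but the effect is the same):

```python
SCHEMA_FIELDS = ("shipment_id", "customer_id", "city_id",
                 "status", "weight_grams", "created_at")

def append_strict(buf, record):
    # Column buffers must stay in lockstep, so an unknown field is fatal.
    extra = set(record) - set(SCHEMA_FIELDS)
    if extra:
        raise ValueError(f"fields not in schema: {sorted(extra)}")
    for k in SCHEMA_FIELDS:
        buf.setdefault(k, []).append(record[k])

buf = {}
append_strict(buf, {"shipment_id": 1, "customer_id": 2, "city_id": 3,
                    "status": "BOOKED", "weight_grams": 4, "created_at": 5})
try:
    append_strict(buf, {"shipment_id": 1, "customer_id": 2, "city_id": 3,
                        "status": "BOOKED", "weight_grams": 4, "created_at": 5,
                        "delivery_otp": "4821"})
except ValueError:
    pass    # the row writer would have accepted this record unchanged
```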

These costs are why the columnar writer is rarely written by hand in production. It lives inside Spark, Flink, Kafka Connect, dbt, or a managed service — software whose job is to own the buffer, the flush, the schema, and the checkpoint, so the producer's code can stay record-shaped while the storage stays column-shaped.

Going deeper

Why the row group, not the file, is the unit of atomicity

Parquet writers can append many row groups to a single file before closing it. Each row group has its own column chunks, its own encoding, its own statistics — and importantly, each row group is independently readable. A reader that wants the third row group can seek directly to its offset (recorded in the file's footer index) and decode just it, ignoring the other 99. This is what makes a 1 GB Parquet file with a thousand row groups behave like a thousand small files for query purposes, without the small-file metadata explosion. The trade-off — bigger row groups compress better but waste read effort on selective queries — is the central tuning question of Build 6, and it is the same trade-off Hive partitioning solves at a coarser scale.

Dictionary encoding and the ladder of cardinality

A column of statuses with five distinct values (BOOKED, PICKED, IN_TRANSIT, DELIVERED, RTO) gets dictionary-encoded into a 3-bit code per row plus a 5-entry dictionary at the start of the column chunk. A column of customer_ids with 50 lakh distinct values cannot be dictionary-encoded — the dictionary would be larger than the data — and so falls back to plain encoding. The crossover is around cardinality × log(cardinality) < row_count, which is why Parquet writers track per-chunk cardinality and switch encodings adaptively. For an Indian e-commerce schema, country_code, payment_method, status, gender, and state_code all dictionary-encode beautifully; customer_id, email, and transaction_id do not. Knowing which is which lets you predict your file sizes within 20%.
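The crossover can be sketched numerically — assumptions: 16 bytes per plain-encoded value, and dictionary cost modelled as bit-packed codes plus the dictionary itself, ignoring the page overheads a real Parquet writer adds.

```python
import math

def dict_encoding_wins(row_count, cardinality, plain_bytes_per_value=16):
    code_bits  = max(1, math.ceil(math.log2(max(cardinality, 2))))
    dict_bytes = cardinality * plain_bytes_per_value
    encoded    = row_count * code_bits / 8 + dict_bytes    # codes + dictionary
    plain      = row_count * plain_bytes_per_value
    return encoded < plain

assert dict_encoding_wins(1_000_000, 5)              # status: 3-bit codes, tiny dict
assert not dict_encoding_wins(1_000_000, 5_000_000)  # customer_id: dict > data
```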

Why CDC pipelines emit row events but warehouses store columns

Debezium reads the Postgres write-ahead log and emits one Kafka message per row change — BEFORE/AFTER images of one tuple. The warehouse it eventually feeds (Iceberg, Delta) stores the same data column-wise. The shape transformation happens in the sink — Kafka Connect's Iceberg sink, Flink's IcebergSink, or a hand-rolled equivalent — which buffers row events into row groups and flushes them as columns. This is the gap Build 11 (CDC) closes; for now, recognise that the format mismatch between OLTP and analytical storage is the entire reason that pipeline exists.

How Zerodha's tick warehouse made the rotation work at 5 crore ticks/day

Zerodha's tick database started as PostgreSQL with daily-partitioned tables. By 2020 the analytical workload — pre-market backtests, post-trade reports, regulatory filings — was creating sequential scans that blocked the live order-acknowledgement pipeline running on the same cluster. The migration moved tick data to ClickHouse (a columnar engine) with the writer holding a 64-row block before any column work. ClickHouse's MergeTree engine writes column files per partition per part, and small parts get merged into bigger parts in a background compaction loop — same buffer-and-flush pattern as Parquet, but continuous rather than file-bounded. After the migration, analytical queries that took 20 minutes on Postgres ran in 3 seconds on ClickHouse, and the live order pipeline stopped sharing I/O with anyone. The rotation was the architectural unlock; the cluster topology, replication, and compaction were operational details on top of it.

Where this leads next

The rest of Build 6 walks the producer's path through the columnar layer one piece at a time. The next chapter, /wiki/parquet-end-to-end-what-you-write-what-you-get-back, opens up the Parquet file you just wrote and decodes the footer, the row group, the column chunk, and the encoded pages — the format's full byte-level grammar. After that, /wiki/partitioning-strategies-date-hash-composite covers the next-level-up layout decision (which directory does this row group live in?), and /wiki/small-file-problem-and-the-compaction-job covers what to do when a streaming writer leaves you with millions of tiny row groups instead of a few thousand healthy ones.

The one-line summary of the build is: every Parquet/Iceberg/Delta decision the producer makes is a layout decision, and every layout decision can be traced back to the rotation in this chapter. Once you internalise that, the rest of Build 6 is reading the documentation with the right lens.
