Object storage as a primary store (S3 as a database)

Build 12 opens by promoting an unlikely substrate to "primary table store": Amazon S3, a system originally pitched in 2006 as a place to dump files. Twenty years later, every serious analytical platform — Snowflake, Databricks, BigQuery's external tables, every open lakehouse — sits on object storage, and a 5 PB Iceberg warehouse at a Bengaluru fintech now has the same disk substrate as the team's photo backups. Understanding why this works, and where it secretly doesn't, is the first chapter you need before any of the lakehouse mechanics make sense.

Object storage gives you four things a database engine usually has to build itself: eleven-nines durability, strong read-after-write consistency, unbounded parallel reads, and very cheap bytes. It refuses to give you four things a database engine relies on: in-place mutation, fast small writes, cross-object atomicity, and millisecond access. The lakehouse is the architecture that takes the four it gives and rebuilds the four it refuses on top, in the table-format layer.

Why object storage isn't a filesystem (and the difference matters)

A filesystem on a laptop lets you open a file, seek to byte 4,096, write 8 bytes, and close it. The bytes change in place. The cost of that 8-byte write is roughly the cost of a single disk sector update. POSIX semantics are designed around this: open, seek, write, fsync. Every database engine you've ever used (Postgres, MySQL, SQLite) was built assuming this primitive.

Object storage refuses to offer it. When you "update" an object in S3, you upload a complete new copy of the object, with a new ETag, and the old version either disappears or hangs around as a numbered version depending on bucket settings. There is no seek-and-write. The unit of mutation is the entire object.
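A toy in-memory model makes the semantics concrete. Everything here is invented for illustration (the class, the key, the payloads); real S3 returns the ETag in the PUT response and computes it server-side, but the shape of the trade is the same:

```python
import hashlib

class ToyObjectStore:
    """Minimal sketch of S3 PUT/GET semantics: objects are immutable,
    so "updating" a key means replacing the entire body under a new ETag."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def put(self, key: str, body: bytes) -> str:
        etag = hashlib.md5(body).hexdigest()  # simple PUTs use MD5-of-body
        self._objects[key] = (etag, body)     # whole-object replace, no seek
        return etag

    def get(self, key: str) -> bytes:
        return self._objects[key][1]

store = ToyObjectStore()
v1 = store.put("payments.parquet", b"\x00" * 16_000_000)  # 16 MB upload

# "Change 8 bytes": there is no seek-and-write, so download, patch, re-upload.
body = bytearray(store.get("payments.parquet"))
body[4096:4104] = b"refunded"
v2 = store.put("payments.parquet", bytes(body))  # another full 16 MB PUT

print(v1 != v2)  # True — new ETag; the old object is gone (or versioned)
```

The 8-byte logical change cost a 16 MB round trip, which is exactly the asymmetry the rest of this chapter is built around.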

[Figure: Filesystem vs object storage write semantics. Left panel: a 4 KB filesystem block modified in place, only 8 bytes change; cost ≈ 1 sector write, <1 ms; primitives seek + write + fsync. This is the substrate of Postgres heaps, MySQL InnoDB, SQLite, every classical DB. Right panel: an S3 object replaced wholesale; a 16 MB Parquet file (payments_2026_04_25.parquet) is re-uploaded as a new object with a new ETag while the old one becomes a previous version; cost ≈ a 16 MB PUT, ~200 ms; primitives PUT + GET + LIST. This is the substrate Iceberg, Delta, and Hudi build a DB on top of.]
The same 8-byte logical update costs vastly different things. On a filesystem it is a single sector write. On S3 it is a 16 MB re-upload. Every lakehouse design choice flows from this asymmetry.

Why this asymmetry exists at all: object storage is a distributed key-value store across thousands of disks in three or more availability zones. Each object is replicated and erasure-coded across that fleet. Letting clients seek-and-write would require coordinating partial writes across replicas — a problem that distributed-systems people spent a decade trying to solve before AWS settled on "objects are immutable; if you want to change one, write a new one". The trade is: you give up in-place mutation, you get eleven-nines durability across multiple availability zones for free.

The other primitive object storage refuses is directories. S3 has buckets and keys; the slash in s3://payments-warehouse/year=2026/month=04/file.parquet is just a character in the key. There is no mkdir, no rename of a directory, no atomic-move-of-folder. A LIST with a prefix scans an index of keys and is paginated — at 100 million keys per prefix, listing alone takes seconds. This is the second major asymmetry the lakehouse has to design around.
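The flat namespace can be sketched in a few lines. The keys below are invented for illustration; on real S3, a LIST response pages at 1,000 keys, which is why listing a 100-million-key prefix takes seconds:

```python
# S3 has no directories: the slash is just a byte in a flat key namespace.
keys = [
    "year=2026/month=04/part-000.parquet",
    "year=2026/month=04/part-001.parquet",
    "year=2026/month=05/part-000.parquet",
]

# LIST with a prefix is a filtered scan of the key index, not a readdir:
april = [k for k in keys if k.startswith("year=2026/month=04/")]
print(len(april))  # 2

# "Renaming the directory" is not one operation and not atomic:
# it is a copy + delete for every matching object, key by key.
renames = [(k, k.replace("month=04", "month=april", 1)) for k in april]
```

A half-finished "directory rename" is therefore a real failure mode: some objects under the old prefix, some under the new one, and nothing in S3 itself to tell you which state you are in.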

The four guarantees S3 gives the lakehouse for free

Strip the marketing away and the durability story is concrete. S3 Standard advertises 99.999999999% (eleven nines) annual object durability, i.e. an expected loss of one object per 100 billion stored per year. In AWS's own framing: store 10 million objects and expect to lose a single one roughly once every 10,000 years. A 5 PB warehouse at a Bengaluru fintech with ~5 million Parquet files would, statistically, lose one file to storage failure every ~20,000 years, and even that loss is mitigated by versioning and cross-region replication.

| Guarantee | What S3 actually offers | What the lakehouse uses it for |
| --- | --- | --- |
| Durability | 11 nines annual object durability across 3+ AZs | Persistent table storage; no separate backup tier needed |
| Read-after-write consistency | Strong since Dec 2020 — a successful PUT is immediately visible to all GETs | Manifest writes commit instantly; readers see them on the next list |
| Parallel reads | An object can be GET-ranged in parallel by N clients with no coordination | Trino/Spark workers shard a Parquet file by row group and fan out reads |
| Cost | ~$0.023/GB-month for S3 Standard, ~$0.004 for Glacier IR | A 5 PB warehouse on local SSDs would cost ~₹45 lakh/month; on S3, ~₹9 lakh/month |
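The parallel-reads row is the one query engines lean on hardest. Here is a sketch of the fan-out pattern with an in-memory byte string standing in for the object; real engines issue HTTP ranged GETs (a `Range: bytes=lo-hi` header) against the same key:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a ~1 MB Parquet object; real workers would ranged-GET one S3 key.
obj = bytes(range(256)) * 4096

def get_range(lo: int, hi: int) -> bytes:
    # Each "request" is independent: no locks, no leases, no coordination.
    return obj[lo:hi]

# Four workers each fetch one quarter — how Trino shards a file by row group.
step = len(obj) // 4
ranges = [(i, i + step) for i in range(0, len(obj), step)]
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(lambda r: get_range(*r), ranges))

print(b"".join(parts) == obj)  # True — independent ranges reassemble losslessly
```

Because no reader coordinates with any other, read throughput scales with the number of workers, not with anything the storage layer has to arbitrate.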

The pre-2020 caveat is worth understanding because it shaped a generation of lakehouse code: until December 2020, S3 only offered eventual consistency for overwrite PUTs and DELETEs. A worker that wrote manifest_v2.json and immediately listed the prefix might or might not see it for 10 seconds. Iceberg, Delta, and Hudi all carry the scar tissue from that era — atomic-rename-via-Hadoop-FileOutputCommitter, manifest pointers stored in DynamoDB or Hive Metastore, retry-with-exponential-backoff baked deep into the read path. Today's S3 is strongly consistent, but the table-format code still has those guards because Azure Blob, GCS, and MinIO have their own consistency stories, and the formats are designed to work on all of them.

Why "strongly consistent" doesn't mean "transactional": S3 guarantees that a successful single-object PUT is immediately visible. It does not guarantee that a write to object A and a write to object B happen atomically. There is no S3 primitive for "commit these two PUTs together or roll both back". A table is many objects (data files + manifest files + table-metadata file). Achieving table-level atomicity requires the lakehouse to design a commit protocol on top of single-object consistency — and that is exactly what manifest files solve in the next chapter.

The four things S3 refuses, and the cost of rebuilding them

Now flip the ledger. The section above listed the four things S3 gives; here are the four it does not, and what each one costs the lakehouse to rebuild.

| Refusal | What you have to build | Where in the lakehouse |
| --- | --- | --- |
| In-place mutation of bytes within an object | Either rewrite the whole object (copy-on-write) or write a delete file and merge at read time (merge-on-read) | Iceberg's COW vs MOR, Hudi's CoW vs MoR, Delta's deletion vectors |
| Fast small writes (<1 KB or <100 ms) | Buffer in memory and flush in batches, or write to a low-latency staging tier first | Streaming sinks; Iceberg's streaming-write path with rolled-up commits |
| Cross-object atomicity / transactions | Atomic commit of a single metadata pointer (the table's "current version") with optimistic-concurrency retry | Manifest files + the next chapter's commit protocol |
| Millisecond GET latency | Cache hot files locally on workers; lay out files so a query reads the minimum number of objects | Trino's cache; data skipping via min/max stats; Z-ordering |

Each refusal has a price. The lakehouse pays it consciously rather than secretly. To make this concrete, here's a short Python script that measures one of the prices: how much it costs to update a single row in a 16 MB Parquet file when in-place mutation isn't an option.

# rewrite_cost.py — measure the cost of "updating one row" on object storage.
# This is what the lakehouse pays in copy-on-write mode for every UPDATE/DELETE.

import time, hashlib, io
import pyarrow as pa
import pyarrow.parquet as pq

# Simulate a Razorpay payments shard: 200,000 rows, 6 columns,
# realistic Indian-fintech sizes. Roughly 16 MB on disk.
N = 200_000
CITIES = ["Bengaluru", "Mumbai", "Pune", "Delhi", "Chennai"]
data = {
    "txn_id":       [f"pay_{i:09d}" for i in range(N)],
    "merchant_id":  [f"mer_{i % 8000:06d}" for i in range(N)],
    "amount_paise": [(i * 137) % 1_000_000 for i in range(N)],
    "ts_unix_ms":   [1714000000000 + i for i in range(N)],
    "status":       ["captured" if i % 50 else "failed" for i in range(N)],
    "city":         [CITIES[i % 5] for i in range(N)],
}

table = pa.table(data)
buf   = io.BytesIO()
pq.write_table(table, buf, compression="snappy")
size_mb = len(buf.getvalue()) / (1024*1024)
print(f"Parquet size: {size_mb:.2f} MB")

# "Update one row": flip status of txn_id pay_000050000 from captured to refunded.
def update_one_row(buf_bytes):
    t0 = time.time()
    t  = pq.read_table(io.BytesIO(buf_bytes))
    cols = t.to_pydict()
    idx  = cols["txn_id"].index("pay_000050000")
    cols["status"][idx] = "refunded"
    new_buf = io.BytesIO()
    pq.write_table(pa.table(cols), new_buf, compression="snappy")
    return new_buf.getvalue(), time.time() - t0

new_bytes, secs = update_one_row(buf.getvalue())
print(f"Update one row (COW): rewrote {len(new_bytes)/(1024*1024):.2f} MB "
      f"in {secs*1000:.0f} ms")

# "Update one row" the merge-on-read way: write a tiny delete file pointing at row idx.
delete_file = {"file": "pay_2026_04_25.parquet", "deleted_pos": [50_000],
               "new_row": {"txn_id": "pay_000050000", "status": "refunded"}}
deletes_serialized = repr(delete_file).encode()
print(f"Update one row (MOR): wrote a {len(deletes_serialized)} byte delete file")

# Sample run on an M2 MacBook Air:
# Parquet size: 14.71 MB
# Update one row (COW): rewrote 14.69 MB in 412 ms
# Update one row (MOR): wrote a 124 byte delete file

Walk through the numbers, because this is the trade Build 12 keeps coming back to. The logical change is 8 bytes; the copy-on-write path rewrote 14.69 MB to make it, a write amplification of roughly 1.9 million times, plus 412 ms of decode and re-encode. The merge-on-read path wrote 124 bytes, amplification near 1×, but deferred the cost: every subsequent reader must fetch the delete file and apply it while scanning.

The script is not testing S3 (it runs locally), but the relative costs hold once you add network I/O, because the bottleneck on object storage is data volume per mutation, not latency.

Why this trade is worth making at lakehouse scale

The case for moving the warehouse off local SSDs and onto S3 is not "S3 is cleaner". It's an arithmetic argument that becomes overwhelming above a few terabytes.

Take a Razorpay-scale analytical warehouse: 5 PB of payment, merchant, ledger, and fraud-signal data, growing at 200 GB per day, with 80 daily-active analysts running an average of 2 TB of scans per analyst per day.

| Cost line | Local-SSD warehouse | S3-backed lakehouse |
| --- | --- | --- |
| Storage (5 PB) | i3.16xlarge fleet, ~₹45 lakh/month | S3 Standard, ~₹9 lakh/month |
| Compute (always-on) | Fixed cluster, ~₹38 lakh/month | Trino on EC2 spot, scale-to-zero overnight, ~₹14 lakh/month |
| Backup | Separate snapshot pipeline, ~₹6 lakh/month | Free — S3 versioning + cross-region replication |
| DR / multi-region | Manual replication, ~₹20 lakh/month | S3 Cross-Region Replication, ~₹5 lakh/month |
| Total monthly | ~₹1.09 crore | ~₹28 lakh |
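A quick sanity check of the totals, with the line items copied from the table (₹ lakh per month):

```python
# Line items from the cost table, in ₹ lakh per month.
coupled   = {"storage": 45, "compute": 38, "backup": 6, "dr": 20}
lakehouse = {"storage": 9,  "compute": 14, "backup": 0, "dr": 5}

total_coupled  = sum(coupled.values())
total_lakehouse = sum(lakehouse.values())
print(total_coupled, total_lakehouse)              # 109 28  (₹1.09 crore vs ₹28 lakh)
print(round(total_coupled / total_lakehouse, 1))   # 3.9
```

Roughly a 3.9× difference before a single query-tuning decision has been made.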

The order-of-magnitude saving is the headline number. The structural saving is more important: storage and compute scale independently. The warehouse can grow to 50 PB without buying a single extra compute box; the compute fleet can scale up 10× during month-end batch jobs without buying a single extra disk. Decoupling the two is what makes the lakehouse architecture interesting; the fact that S3 happens to be the cheapest substrate that allows the decoupling is a happy accident.

[Figure: Storage and compute decoupling on the lakehouse. Left: a coupled warehouse (i3 fleet) where compute scales with disks; 5 PB → ~200 boxes, ~₹1.09 crore/month. Right: a decoupled lakehouse; an S3 Iceberg storage layer at the bottom (5 PB, scales independently of compute) with three independently scaling engines on top: Trino for analyst SQL, Spark for batch ETL, DuckDB for laptop analysts; ~₹28 lakh/month, scale-to-zero when idle.]
The lakehouse story: one storage substrate, many query engines. Each engine scales by its own workload; storage scales by retained data. Razorpay can run a 50-machine Trino cluster during month-end and 4 machines at 3 a.m., with the table footprint untouched.

Edge cases that bite at production scale

The marketing pitch ends here. Production has five more things you need to know before you commit a 5 PB warehouse to S3:

The Werner Vogels 2006 design choices that still echo

S3's original API — PUT, GET, DELETE, LIST, plus ACLs — was deliberately minimalist. Werner Vogels' 2006 announcement said the team "chose simplicity and durability over richness". Twenty years later, every "we should add a rename API" or "let's offer in-place edits" proposal has been rejected because the operational cost of those features across the global S3 fleet outweighs the per-customer benefit. The lakehouse architecture exists because S3 said no to those features — if S3 had been a POSIX-compliant networked filesystem, Iceberg would never have been designed.

What changed in December 2020

Before December 2020, S3's overwrite semantics were "eventually consistent" — a PUT of an existing key, followed by a GET, might return the old object for up to ~10 seconds. List-after-write had similar lag. Every Hadoop-on-S3 project carried workarounds: S3Guard (DynamoDB-backed metadata), the S3A committer's two-phase staging directory, custom retries. The flip to strong consistency made the workarounds obsolete but the code is still there because it was correct under the old semantics and remains correct under the new ones, just over-engineered. Iceberg v2 quietly relaxed several of these guards; Delta Lake 3.x did the same.

How Zerodha's market-data lakehouse uses S3 differently from a fintech's transaction lakehouse

Zerodha records every NSE/BSE tick — about 6 billion ticks on a normal day, 18 billion on a budget-day spike. A fintech writes maybe 100 million rows per day and updates them frequently. Zerodha writes 100× more rows but never updates them — market ticks are append-only by nature. Their S3 layout is therefore radically different: hourly partitions, large files (1 GB target), no MOR layer, no manifest churn beyond the daily partition rotation. The fintech needs Iceberg's full update/delete machinery; Zerodha needs barely any of it. Same substrate, different table-format choices, both correct.

The "S3 Express One Zone" question

In 2024 AWS launched S3 Express One Zone, a single-AZ tier with single-millisecond latency at ~10× the storage cost. It looks like the answer to "S3 is too slow for hot data". For lakehouse workloads, the answer is mostly "no, this isn't what you wanted" — Express trades durability (single-AZ) for latency, and the bottleneck for analytical queries is parallel bandwidth, not single-request latency. Where Express does help is feature-store online-serving and small-state streaming checkpoints — workloads that lakehouses outsource to other systems anyway. The 5 PB warehouse stays on S3 Standard.

Open Compute alternatives — why MinIO and Ceph haven't displaced S3 inside cloud-resident shops

MinIO, Ceph, and SeaweedFS all implement the S3 API on commodity hardware. For on-prem deployments (a defense contractor, a regulated bank that can't use AWS), they are the right choice. For cloud-resident shops, they don't displace S3 because the operational cost — running Ceph monitors, replacing failed disks, managing erasure-coding policies — eats the storage savings. The 11-nines durability claim is not just about disks; it's about AWS spending years tuning the replication and recovery policies. Reproducing that on commodity hardware costs an SRE team's headcount, which usually exceeds the AWS bill being avoided.

Where this leads next

Build 12 spends the next 13 chapters teaching you how to turn S3 from "key-value store of immutable bytes" into "transactional analytical database". The mechanism is the table format — a layer of metadata files on S3 that gives you commit, schema evolution, time travel, and concurrent writers. The next chapter steps directly into that layer.

The closing chapter of Build 12, /wiki/cdc-iceberg-the-real-world-pattern, connects this to Build 11: how do you actually land Razorpay's CDC stream into Iceberg-on-S3 with bounded latency and clean schema evolution?
