Object storage as a primary store (S3 as a database)
Build 12 opens by promoting an unlikely substrate to "primary table store": Amazon S3, a system originally pitched in 2006 as a place to dump files. Twenty years later, every serious analytical platform — Snowflake, Databricks, BigQuery's external tables, every open lakehouse — sits on object storage, and a 5 PB Iceberg warehouse at a Bengaluru fintech now has the same disk substrate as the team's photo backups. Understanding why this works, and where it secretly doesn't, is the first chapter you need before any of the lakehouse mechanics make sense.
Object storage gives you four things a database engine usually has to build itself: eleven-nines durability, strong read-after-write consistency, unlimited parallel reads, and a very low cost per byte. It refuses to give you four things a database engine relies on: in-place mutation, fast small writes, cross-object atomicity, and millisecond access. The lakehouse is the architecture that takes the four it gives and rebuilds the four it refuses on top, in the table-format layer.
Why object storage isn't a filesystem (and the difference matters)
A filesystem on a laptop lets you open a file, seek to byte 4,096, write 8 bytes, and close it. The bytes change in place. The cost of that 8-byte write is roughly the cost of a single disk sector update. POSIX semantics are designed around this: open, seek, write, fsync. Every database engine you've ever used (Postgres, MySQL, SQLite) was built assuming this primitive.
Object storage refuses to offer it. When you "update" an object in S3, you upload a complete new copy of the object, with a new ETag, and the old version either disappears or hangs around as a numbered version depending on bucket settings. There is no seek-and-write. The unit of mutation is the entire object.
Why this asymmetry exists at all: object storage is a distributed key-value store across thousands of disks in three or more availability zones. Each object is replicated and erasure-coded across that fleet. Letting clients seek-and-write would require coordinating partial writes across replicas — a problem that distributed-systems people spent a decade trying to solve before AWS settled on "objects are immutable; if you want to change one, write a new one". The trade is: you give up in-place mutation, you get eleven-nines durability across geographic regions for free.
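To make "the unit of mutation is the entire object" concrete, here is a toy sketch with a Python dict standing in for a bucket. The `put_object`/`get_object` helpers are illustrative stand-ins, not a real client: changing 8 bytes still means re-uploading all 16 MB.

```python
# Toy stand-in for a bucket: a key-value map whose only write primitive
# is "replace the whole object". No seek, no partial write. Names are
# illustrative; this is not a real S3 client.
bucket: dict[str, bytes] = {}

def put_object(key: str, body: bytes) -> None:
    bucket[key] = body          # always a full-object replacement

def get_object(key: str) -> bytes:
    return bucket[key]

# Store a 16 MB object.
put_object("data/file.bin", bytes(16 * 1024 * 1024))

# "Update bytes 4096..4104": on a POSIX filesystem this is one 8-byte
# write. Under object-store semantics it is get + patch + full re-upload.
obj = bytearray(get_object("data/file.bin"))
obj[4096:4104] = b"NEWBYTES"
rewrite = bytes(obj)
put_object("data/file.bin", rewrite)

print(f"re-uploaded {len(rewrite) / (1024*1024):.0f} MB to change 8 bytes")
# prints: re-uploaded 16 MB to change 8 bytes
```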
The other primitive object storage refuses is directories. S3 has buckets and keys; the slash in s3://payments-warehouse/year=2026/month=04/file.parquet is just a character in the key. There is no mkdir, no rename of a directory, no atomic-move-of-folder. A LIST with a prefix scans an index of keys and is paginated at 1,000 keys per response — at 100 million keys under a prefix, a sequential listing is 100,000 paginated requests, minutes of wall-clock work before a query even starts. This is the second major asymmetry the lakehouse has to design around.
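A back-of-the-envelope sketch of that pagination cost, assuming the 1,000-keys-per-page limit of `ListObjectsV2` and a hypothetical 25 ms per request:

```python
import math

KEYS_PER_PAGE = 1000   # ListObjectsV2 max page size
REQUEST_MS = 25        # assumed per-request latency (hypothetical)

def sequential_list_seconds(num_keys: int) -> float:
    """Wall-clock time to walk every page of a prefix, one request at a time."""
    pages = math.ceil(num_keys / KEYS_PER_PAGE)
    return pages * REQUEST_MS / 1000

for keys in (100_000, 10_000_000, 100_000_000):
    print(f"{keys:>11,} keys -> {sequential_list_seconds(keys):>7.1f} s sequential LIST")
# 100M keys is 100,000 requests — over 40 minutes if walked sequentially,
# which is why engines shard listing across prefixes and cache metadata.
```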
The four guarantees S3 gives the lakehouse for free
Strip the marketing away and the durability story is concrete. S3 standard storage advertises 99.999999999% (eleven nines) annual object durability. In production terms: across the entire fleet, AWS reports losing under one object per 10 billion stored per year. A 5 PB warehouse at a Bengaluru fintech with ~5 million Parquet files would, statistically, expect to lose one file every ~2,000 years to storage failure — and even that loss is mitigated by versioning and cross-region replication.
| Guarantee | What S3 actually offers | What the lakehouse uses it for |
|---|---|---|
| Durability | 11 nines annual object durability across 3+ AZs | Persistent table storage; no separate backup tier needed |
| Read-after-write consistency | Strong since Dec 2020 — a successful PUT is immediately visible to all GETs | Manifest writes commit instantly; readers see them on the next list |
| Parallel reads | Object can be GET-ranged in parallel by N clients with no coordination | Trino/Spark workers shard a Parquet file by row group, fan-out reads |
| Cost | ~$0.023/GB-month for S3 Standard, ~$0.004 for Glacier IR | A 5 PB warehouse on local SSDs would cost ~₹45 lakh/month; on S3 ~₹9 lakh/month |
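The parallel-reads row is the one query engines lean on hardest. A local sketch of range-sharded reads, using a thread pool and an in-memory byte buffer as a stand-in for S3 range GETs (no network involved; names and sizes are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

blob = bytes(range(256)) * 4096   # a 1 MB stand-in for a Parquet object
CHUNK = 128 * 1024                # each worker fetches one 128 KB range

def range_get(start: int, end: int) -> bytes:
    # Stand-in for GET with "Range: bytes=start-end". Each call is
    # independent, so N workers can issue them with no coordination —
    # the property Trino/Spark exploit when sharding a file by row group.
    return blob[start:end]

ranges = [(off, min(off + CHUNK, len(blob)))
          for off in range(0, len(blob), CHUNK)]
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda r: range_get(*r), ranges))

assert b"".join(parts) == blob    # ranges reassemble to the full object
print(f"fetched {len(blob)} bytes as {len(parts)} parallel range reads")
```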
The pre-2020 caveat is worth understanding because it shaped a generation of lakehouse code: until December 2020, S3 only offered eventual consistency for overwrite PUTs and DELETEs. A worker that wrote manifest_v2.json and immediately listed the prefix might or might not see it for 10 seconds. Iceberg, Delta, and Hudi all carry the scar tissue from that era — atomic-rename-via-Hadoop-FileOutputCommitter, manifest pointers stored in DynamoDB or Hive Metastore, retry-with-exponential-backoff baked deep into the read path. Today's S3 is strongly consistent, but the table-format code still has those guards because Azure Blob, GCS, and MinIO have their own consistency stories, and the formats are designed to work on all of them.
Why "strongly consistent" doesn't mean "transactional": S3 guarantees that a successful single-object PUT is immediately visible. It does not guarantee that a write to object A and a write to object B happen atomically. There is no S3 primitive for "commit these two PUTs together or roll both back". A table is many objects (data files + manifest files + table-metadata file). Achieving table-level atomicity requires the lakehouse to design a commit protocol on top of single-object consistency — and that is exactly what manifest files solve in the next chapter.
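That commit protocol can be sketched in miniature: the table's current version lives in one pointer object, and a writer commits by swapping that pointer with a compare-and-swap, retrying on conflict. This is an illustrative in-memory model (the `store` dict, key names, and `commit` helper are hypothetical, not Iceberg's actual code); on real S3 the swap maps to a conditional PUT (S3 added `If-None-Match` conditional writes in 2024) or to a catalog that provides the CAS.

```python
import threading

# Toy store: the table's "current version" is one pointer object.
store = {"tables/payments/pointer": "v0"}
lock = threading.Lock()  # stands in for the store's internal atomicity

def compare_and_swap(key: str, expected: str, new: str) -> bool:
    """Atomic single-object swap: succeeds only if the pointer still
    holds the value this writer read. Stand-in for a conditional PUT."""
    with lock:
        if store[key] == expected:
            store[key] = new
            return True
        return False

def commit(writer: str, key="tables/payments/pointer", retries=3) -> str:
    for _ in range(retries):
        seen = store[key]                    # 1. read the current version
        new = f"{seen}+{writer}"             # 2. write data + manifests (elided)
        if compare_and_swap(key, seen, new): # 3. swap the single pointer
            return new
        # Lost the race: re-read, rebase the commit, retry.
    raise RuntimeError("too much contention")

commit("A")
commit("B")
print(store["tables/payments/pointer"])  # prints: v0+A+B
```

Both commits land because the loser of any race re-reads and retries; table-level atomicity reduces to single-object atomicity on the pointer.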
The four things S3 refuses, and the cost of rebuilding them
Now flip the table. The previous section listed the four guarantees S3 gives; here are the four it refuses, and what each one costs the lakehouse to rebuild.
| Refusal | What you have to build | Where in the lakehouse |
|---|---|---|
| In-place mutation of bytes within an object | Either rewrite the whole object (copy-on-write) or write a delete file and merge at read time (merge-on-read) | Iceberg's COW vs MOR, Hudi's CoW vs MoR, Delta's deletion vectors |
| Fast small writes (<1 KB or <100 ms) | Buffer in memory, flush in batches; or write to a low-latency staging tier first | Streaming sinks; Iceberg's streaming-write path with rolled-up commits |
| Cross-object atomicity / transactions | Atomic commit of a single metadata pointer (the table's "current version") with optimistic-concurrency retry | Manifest files + the next-chapter commit protocol |
| Millisecond GET latency | Cache hot files locally on workers; lay out files so a query reads the minimum number of objects | Trino's cache; data skipping via min/max stats; Z-ordering |
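The second refusal row — buffer small writes in memory, flush in batches — is the pattern every streaming sink implements. A minimal sketch (the class name and thresholds are illustrative, not any particular library's API):

```python
import time

class MicroBatcher:
    """Buffer tiny records in memory; flush one large object when the
    buffer crosses a size or age threshold. Illustrative sketch only."""

    def __init__(self, flush_bytes=8 * 1024 * 1024, flush_secs=30.0):
        self.flush_bytes, self.flush_secs = flush_bytes, flush_secs
        self.buf, self.opened = [], time.monotonic()
        self.flushed = []  # stand-in for PUTs of large objects

    def write(self, record: bytes) -> None:
        self.buf.append(record)
        age = time.monotonic() - self.opened
        if sum(map(len, self.buf)) >= self.flush_bytes or age >= self.flush_secs:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            # One big PUT instead of N tiny ones.
            self.flushed.append(b"".join(self.buf))
            self.buf, self.opened = [], time.monotonic()

sink = MicroBatcher(flush_bytes=1024)   # tiny threshold for the demo
for i in range(100):
    sink.write(f"txn_{i:04d},captured\n".encode())
sink.flush()
print(f"{len(sink.flushed)} PUTs instead of 100")  # prints: 2 PUTs instead of 100
```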
Each refusal has a price. The lakehouse pays it consciously rather than secretly. To make this concrete, here's a short Python script that measures one of the prices: how much it costs to update a single row in a 16 MB Parquet file when in-place mutation isn't an option.
```python
# rewrite_cost.py — measure the cost of "updating one row" on object storage.
# This is what the lakehouse pays in copy-on-write mode for every UPDATE/DELETE.
import io
import time

import pyarrow as pa
import pyarrow.parquet as pq

# Simulate a Razorpay payments shard: 200,000 rows, 6 columns,
# realistic Indian-fintech sizes. Roughly 16 MB on disk.
N = 200_000
CITIES = ["Bengaluru", "Mumbai", "Pune", "Delhi", "Chennai"]
data = {
    "txn_id": [f"pay_{i:09d}" for i in range(N)],
    "merchant_id": [f"mer_{i % 8000:06d}" for i in range(N)],
    "amount_paise": [(i * 137) % 1_000_000 for i in range(N)],
    "ts_unix_ms": [1714000000000 + i for i in range(N)],
    "status": ["captured" if i % 50 else "failed" for i in range(N)],
    "city": [CITIES[i % 5] for i in range(N)],
}
table = pa.table(data)
buf = io.BytesIO()
pq.write_table(table, buf, compression="snappy")
size_mb = len(buf.getvalue()) / (1024 * 1024)
print(f"Parquet size: {size_mb:.2f} MB")

# "Update one row": flip status of txn_id pay_000050000 from captured to refunded.
def update_one_row(buf_bytes):
    t0 = time.time()
    t = pq.read_table(io.BytesIO(buf_bytes))
    cols = t.to_pydict()
    idx = cols["txn_id"].index("pay_000050000")
    cols["status"][idx] = "refunded"
    new_buf = io.BytesIO()
    pq.write_table(pa.table(cols), new_buf, compression="snappy")
    return new_buf.getvalue(), time.time() - t0

new_bytes, secs = update_one_row(buf.getvalue())
print(f"Update one row (COW): rewrote {len(new_bytes)/(1024*1024):.2f} MB "
      f"in {secs*1000:.0f} ms")

# "Update one row" the merge-on-read way: write a tiny delete file pointing at row idx.
delete_file = {"file": "pay_2026_04_25.parquet", "deleted_pos": [50_000],
               "new_row": {"txn_id": "pay_000050000", "status": "refunded"}}
deletes_serialized = repr(delete_file).encode()
print(f"Update one row (MOR): wrote a {len(deletes_serialized)} byte delete file")

# Sample run on an M2 MacBook Air:
# Parquet size: 14.71 MB
# Update one row (COW): rewrote 14.69 MB in 412 ms
# Update one row (MOR): wrote a 124 byte delete file
```
Walk through the numbers, because this is the trade Build 12 keeps coming back to:
- `Parquet size: 14.71 MB` — 200k rows of realistic payment data compress to ~15 MB with Snappy. Iceberg's default target file size is 128 MB; this is a smaller-than-default shard.
- `Update one row (COW): rewrote 14.69 MB in 412 ms` — copy-on-write, the Iceberg/Delta default: read the whole file, swap one cell, write the whole file back. Why this is the right default for analytical workloads: read-side performance is a single pass over a contiguous file, which is exactly what columnar scans were optimised for. The write cost is paid at commit time; readers pay nothing extra.
- `Update one row (MOR): wrote a 124 byte delete file` — merge-on-read. The data file is untouched; a tiny "row 50,000 of file X is now this" record is written. Writes move roughly 100,000× fewer bytes. Reads now have to cross-reference every data file with its delete files at query time, a 5–25% read penalty. Why this trade flips for streaming CDC ingest: a CDC pipeline writes thousands of small per-row changes per minute, and rewriting 16 MB per row would generate enough write amplification to bankrupt the pipeline. MOR makes high-frequency updates affordable; COW does not.
- The 412 ms vs sub-millisecond gap — on a Postgres heap, that one-row UPDATE would have been ~40 µs. Object storage forces the lakehouse to choose between paying that 10,000× cost on every update (COW) or shifting it to query time (MOR). There is no third option that retains S3 as the substrate.
The script is not testing S3 — it's running locally — but the relative costs hold once you add network I/O, because the bottleneck on object storage is the data-volume-per-mutation, not the latency.
Why this trade is worth making at lakehouse scale
The case for moving the warehouse off local SSDs and onto S3 is not "S3 is cleaner". It's an arithmetic argument that becomes overwhelming above a few terabytes.
Take a Razorpay-scale analytical warehouse: 5 PB of payment, merchant, ledger, and fraud-signal data, growing at 200 GB per day, with 80 daily-active analysts running an average of 2 TB of scans per analyst per day.
| Cost line | Local-SSD warehouse | S3-backed lakehouse |
|---|---|---|
| Storage (5 PB) | i3.16xlarge fleet, ~₹45 lakh/month | S3 Standard, ~₹9 lakh/month |
| Compute (always-on) | Fixed cluster, ~₹38 lakh/month | Trino on EC2 spot, scale-to-zero overnight, ~₹14 lakh/month |
| Backup | Separate snapshot pipeline, ~₹6 lakh/month | Free — S3 versioning + cross-region replication |
| DR / multi-region | Manual replication, ₹20 lakh/month | S3 Cross-Region Replication, ~₹5 lakh/month |
| Total monthly | ~₹1.09 crore | ~₹28 lakh |
The roughly 4× monthly saving is the headline number. The structural saving is more important: storage and compute scale independently. The warehouse can grow to 50 PB without buying a single extra compute box; the compute fleet can scale up 10× during month-end batch jobs without buying a single extra disk. Decoupling the two is what makes the lakehouse architecture interesting; the fact that S3 happens to be the cheapest substrate that allows the decoupling is a happy accident.
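The totals are plain arithmetic; a quick recomputation from the table's own line items (₹ lakh per month):

```python
# Line items from the cost table above (₹ lakh/month).
ssd = {"storage": 45, "compute": 38, "backup": 6, "dr": 20}
s3 = {"storage": 9, "compute": 14, "backup": 0, "dr": 5}

ssd_total, s3_total = sum(ssd.values()), sum(s3.values())
print(f"local-SSD:    ₹{ssd_total} lakh/month (~₹{ssd_total / 100:.2f} crore)")
print(f"S3 lakehouse: ₹{s3_total} lakh/month ({ssd_total / s3_total:.1f}x cheaper)")
# prints: local-SSD:    ₹109 lakh/month (~₹1.09 crore)
#         S3 lakehouse: ₹28 lakh/month (3.9x cheaper)
```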
Edge cases that bite at production scale
The marketing pitch ends here. Production has five more things you need to know before you commit a 5 PB warehouse to S3.
- The small-files problem. S3 charges per-object overhead: every PUT is a billable operation (~$5 per million), every GET is one (~$0.40 per million), every LIST is one. A streaming pipeline that writes 200k tiny files per day pays more in PUT cost than in storage cost — and a query that scans 200k tiny files spends 90% of its time in object-open overhead. Compaction (Build 6, Build 12) exists to keep file counts bounded.
- List-prefix performance depends on key layout. S3 partitions its index by key prefix. A bucket with `s3://warehouse/payments/year=2026/month=04/day=25/file.parquet` has different list latency than one with `s3://warehouse/file_a8b3.parquet` — the former benefits from prefix-sharding, the latter creates a hot index partition. The flat-bucket myth ("S3 doesn't have folders so layout doesn't matter") is wrong; layout matters for list throughput.
- Object-overwrite versioning silently grows the bill. If versioning is on (and it should be — it's the cheapest "undo" you'll ever buy), every `PUT` of an existing key creates a new version and retains the old one. A pipeline that rewrites the same manifest file 10,000 times in a day silently grows the bucket by 10,000 manifest copies. Lifecycle rules to expire non-current versions are mandatory.
- Cross-region replication has minutes of lag. Cross-Region Replication is an asynchronous process; lag is typically under 15 minutes but can spike to hours during regional outages. A multi-region read-replica strategy that assumes sub-second propagation will silently serve stale data. Iceberg's snapshot ID and atomic-commit semantics give you a way to reason about this — readers see a consistent snapshot, just an older one.
- Egress, not storage, is the surprise on the AWS bill. S3 storage is cheap. Egress (data leaving the region) is ~$0.09/GB. A query engine in us-east-1 reading from a bucket in ap-south-1 costs more in egress than the storage itself. Co-locating compute with storage is not optional; it's load-bearing for the cost model.
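To put numbers on the small-files problem above, a sketch comparing request cost to storage cost for that 200k-files-per-day pipeline, assuming 1 KB files and the list prices quoted in this chapter:

```python
# Price the small-files problem: 200k tiny files/day, assumed 1 KB each.
# List prices: ~$5 per million PUTs, ~$0.023 per GB-month (S3 Standard).
FILES_PER_DAY = 200_000
FILE_KB = 1                  # assumed tiny-file size
PUT_PER_MILLION = 5.00       # $ per million PUT requests
STORAGE_GB_MONTH = 0.023     # $ per GB-month

puts_month = FILES_PER_DAY * 30
put_cost = puts_month / 1_000_000 * PUT_PER_MILLION
new_gb = puts_month * FILE_KB / (1024 * 1024)
storage_cost = new_gb * STORAGE_GB_MONTH

print(f"PUT cost:     ${put_cost:.2f}/month")
print(f"storage cost: ${storage_cost:.2f}/month for {new_gb:.1f} GB of new data")
# The request bill (~$30) dwarfs the storage bill (~$0.13); compaction
# exists precisely to collapse this ratio.
```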
Common confusions
- "S3 is just a backup target." Pre-2020 it was. Post-strong-consistency S3 (Dec 2020) plus a table format on top (Iceberg/Delta/Hudi) plus a query engine that respects the format is a complete analytical database stack. The "primary store" claim is what Build 12 spends 13 chapters justifying.
- "S3 is the same as HDFS." They have similar APIs and very different consistency, durability, and economic models. HDFS gives you a real filesystem with `rename`, weaker durability (3-way replication, no cross-AZ guarantee), and dedicated machines you have to operate. S3 gives you stronger durability, no `rename`, and operations you don't have to do. The fact that Hadoop tools speak both is a happy compatibility accident, not a sign they're equivalent.
- "S3 is slower than a local SSD, so it's bad for analytics." Per-byte, yes — a single-threaded sequential read from a local SSD is faster than a single-threaded GET from S3. But analytical queries are massively parallel: Trino opens 200 GET requests concurrently, each fetching a different row group, and the aggregate throughput dwarfs what a single-machine SSD can deliver. The right mental model is "S3 is a parallel-bandwidth substrate, not a low-latency one".
- "You can run an OLTP database on S3 directly." No. OLTP needs sub-millisecond writes with strong cross-row atomicity; S3 gives you 50–200 ms PUTs with single-object atomicity only. There are systems that layer OLTP on top of S3 (Aurora's storage layer is loosely S3-like; Neon does it explicitly) but they all add an in-memory + WAL tier on top — they don't run "on S3" in the way the lakehouse does.
- "Strong consistency means transactions." Strong consistency means a single-object PUT is immediately readable. It does not mean two PUTs are atomic with respect to each other. The lakehouse builds transactions on top using a single-object commit pointer (the table-metadata file) and optimistic concurrency. This is the next chapter.
Going deeper
The Werner Vogels 2006 design choices that still echo
S3's original API — PUT, GET, DELETE, LIST, plus ACLs — was deliberately minimalist. Werner Vogels' 2006 announcement said the team "chose simplicity and durability over richness". Twenty years later, every "we should add a rename API" or "let's offer in-place edits" proposal has been rejected because the operational cost of those features across the global S3 fleet outweighs the per-customer benefit. The lakehouse architecture exists because S3 said no to those features — if S3 had been a POSIX-compliant networked filesystem, Iceberg would never have been designed.
What changed in December 2020
Before December 2020, S3's overwrite semantics were "eventually consistent" — a PUT of an existing key, followed by a GET, might return the old object for up to ~10 seconds. List-after-write had similar lag. Every Hadoop-on-S3 project carried workarounds: S3Guard (DynamoDB-backed metadata), the S3A committer's two-phase staging directory, custom retries. The flip to strong consistency made the workarounds obsolete but the code is still there because it was correct under the old semantics and remains correct under the new ones, just over-engineered. Iceberg v2 quietly relaxed several of these guards; Delta Lake 3.x did the same.
How Zerodha's market-data lakehouse uses S3 differently from a fintech's transaction lakehouse
Zerodha records every NSE/BSE tick — about 6 billion ticks on a normal day, 18 billion on a budget-day spike. A fintech writes maybe 100 million rows per day and updates them frequently. Zerodha writes 100× more rows but never updates them — market ticks are append-only by nature. Their S3 layout is therefore radically different: hourly partitions, large files (1 GB target), no MOR layer, no manifest churn beyond the daily partition rotation. The fintech needs Iceberg's full update/delete machinery; Zerodha needs barely any of it. Same substrate, different table-format choices, both correct.
The "S3 Express One Zone" question
In 2024 AWS launched S3 Express One Zone, a single-AZ tier with single-millisecond latency at ~10× the storage cost. It looks like the answer to "S3 is too slow for hot data". For lakehouse workloads, the answer is mostly "no, this isn't what you wanted" — Express trades durability (single-AZ) for latency, and the bottleneck for analytical queries is parallel bandwidth, not single-request latency. Where Express does help is feature-store online-serving and small-state streaming checkpoints — workloads that lakehouses outsource to other systems anyway. The 5 PB warehouse stays on S3 Standard.
Open-source alternatives — why MinIO and Ceph haven't displaced S3 inside cloud-resident shops
MinIO, Ceph, and SeaweedFS all implement the S3 API on commodity hardware. For on-prem deployments (a defense contractor, a regulated bank that can't use AWS), they are the right choice. For cloud-resident shops, they don't displace S3 because the operational cost — running Ceph monitors, replacing failed disks, managing erasure-coding policies — eats the storage savings. The 11-nines durability claim is not just about disks; it's about AWS spending years tuning the replication and recovery policies. Reproducing that on commodity hardware costs an SRE team's headcount, which usually exceeds the AWS bill being avoided.
Where this leads next
Build 12 spends the next 13 chapters teaching you how to turn S3 from "key-value store of immutable bytes" into "transactional analytical database". The mechanism is the table format — a layer of metadata files on S3 that gives you commit, schema evolution, time travel, and concurrent writers. The next chapter steps directly into that layer:
- /wiki/manifest-files-and-the-commit-protocol — how a directory of immutable Parquet files becomes a transactional table.
- /wiki/concurrent-writers-optimistic-concurrency-serializability — how five writers commit to the same table without a coordinator.
- /wiki/copy-on-write-vs-merge-on-read-iceberg-vs-hudi — the central trade you saw the 412 ms vs 124 byte numbers for.
The closing chapter of Build 12, /wiki/cdc-iceberg-the-real-world-pattern, connects this to Build 11: how do you actually land Razorpay's CDC stream into Iceberg-on-S3 with bounded latency and clean schema evolution?
References
- Amazon S3 — "Amazon S3 now delivers strong read-after-write consistency" (Dec 2020) — the AWS announcement of the consistency-model flip that enabled the modern lakehouse.
- Werner Vogels — "Choosing consistency" (2007) — the original design rationale for S3's eventual-consistency-by-default era. Reads differently now that the default has flipped.
- Michael Armbrust et al. — "Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics" (CIDR 2021) — the Databricks paper that named the architecture this build is about.
- Apache Iceberg — Format Specification v2 — the binding document for what a manifest, snapshot, and commit actually are. Skim §3 (table format) and §4 (commit protocol) before the next chapter.
- Ryan Blue — "Iceberg: A Modern Table Format for Big Data" (Netflix engineering, 2018) — the original motivating talk; explains why Hive's directory-based table format broke at Netflix's scale and what Iceberg replaced.
- Trino documentation — "Object storage connectors" — the query-engine side of the picture; how Trino reads Parquet from S3 in parallel.
- /wiki/wall-consumers-want-the-data-shaped-for-their-use-case — the consumer-side wall this chapter starts breaking.
- /wiki/columnar-storage-parquet-and-orc — Build 6's columnar-format chapter; Iceberg builds on Parquet, not on raw bytes.