Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Object storage as a primary store (S3 as a database)

Build 12 opens by promoting an unlikely substrate to "primary table store": Amazon S3, a system originally pitched in 2006 as a place to dump files. Twenty years later, every serious analytical platform — Snowflake, Databricks, BigQuery's external tables, every open lakehouse — sits on object storage, and a 5 PB Iceberg warehouse at a Bengaluru fintech now has the same disk substrate as the team's photo backups. Understanding why this works, and where it secretly doesn't, is the first chapter you need before any of the lakehouse mechanics make sense.

Object storage gives you four things a database engine usually has to build itself: durable, cheap, parallelable, and eleven-nines retentive bytes. It refuses to give you four things a database engine relies on: in-place mutation, fast small writes, cross-object atomicity, and millisecond access. The lakehouse is the architecture that takes the four it gives and rebuilds the four it refuses on top, in the table-format layer.

Why object storage isn't a filesystem (and the difference matters)

A filesystem on a laptop lets you open a file, seek to byte 4,096, write 8 bytes, and close it. The bytes change in place. The cost of that 8-byte write is roughly the cost of a single disk sector update. POSIX semantics are designed around this: open, seek, write, fsync. Every database engine you've ever used (Postgres, MySQL, SQLite) was built assuming this primitive.

Object storage refuses to offer it. When you "update" an object in S3, you upload a complete new copy of the object, with a new ETag, and the old version either disappears or hangs around as a numbered version depending on bucket settings. There is no seek-and-write. The unit of mutation is the entire object.

Filesystem vs object storage write semanticsTwo side-by-side panels. Left panel shows a 4 KB filesystem block being modified in place — only 8 bytes change, the rest stays untouched. Right panel shows an S3 object being replaced wholesale: a 16 MB Parquet file is uploaded as a new object, the old one becomes a previous version. Mutation unit: 8 bytes vs 16 MB Filesystem (POSIX) 4 KB block: 8 bytes change in place cost ≈ 1 sector write, <1 ms primitives: seek + write + fsync in-place → Postgres heap, MySQL InnoDB, SQLite — every classical DB Object storage (S3) v1: payments_2026_04_25.parquet (16 MB) ETag: 7e3f... immutable once written v2: payments_2026_04_25.parquet (16 MB) ETag: a91c... complete re-upload, new object cost ≈ 16 MB PUT, ~200 ms primitives: PUT + GET + LIST replace whole object → Iceberg, Delta, Hudi build a DB on top
The same 8-byte logical update costs vastly different things. On a filesystem it is a single sector write. On S3 it is a 16 MB re-upload. Every lakehouse design choice flows from this asymmetry.

Why this asymmetry exists at all: object storage is a distributed key-value store across thousands of disks in three or more availability zones. Each object is replicated and erasure-coded across that fleet. Letting clients seek-and-write would require coordinating partial writes across replicas — a problem that distributed-systems people spent a decade trying to solve before AWS settled on "objects are immutable; if you want to change one, write a new one". The trade is: you give up in-place mutation, you get eleven-nines durability across geographic regions for free.

The other primitive object storage refuses is directories. S3 has buckets and keys; the slash in s3://payments-warehouse/year=2026/month=04/file.parquet is just a character in the key. There is no mkdir, no rename of a directory, no atomic-move-of-folder. A LIST with a prefix scans an index of keys and is paginated — at 100 million keys per prefix, listing alone takes seconds. This is the second major asymmetry the lakehouse has to design around.

The four guarantees S3 gives the lakehouse for free

Strip the marketing away and the durability story is concrete. S3 standard storage advertises 99.999999999% (eleven nines) annual object durability. In production terms: across the entire fleet, AWS reports losing under one object per 10 billion stored per year. A 5 PB warehouse at a Bengaluru fintech with ~5 million Parquet files would, statistically, lose less than half a file every five years to disk failure — and even that loss is mitigated by versioning and cross-region replication.

Guarantee What S3 actually offers What the lakehouse uses it for
Durability 11 nines annual object durability across 3+ AZs DataPersist table storage; no separate backup tier needed
Read-after-write consistency Strong since Dec 2020 — a successful PUT is immediately visible to all GETs Manifest writes commit instantly; readers see them on the next list
Parallel reads Object can be GET-ranged in parallel by N clients with no coordination Trino/Spark workers shard a Parquet file by row group, fan-out reads
Cost ~0.023 / GB-month for S3 Standard, ~0.004 for Glacier IR A 5 PB warehouse on local SSDs would cost ~₹45 lakh/month; on S3 ~₹9 lakh/month

The pre-2020 caveat is worth understanding because it shaped a generation of lakehouse code: until December 2020, S3 only offered eventual consistency for overwrite PUTs and DELETEs. A worker that wrote manifest_v2.json and immediately listed the prefix might or might not see it for 10 seconds. Iceberg, Delta, and Hudi all carry the scar tissue from that era — atomic-rename-via-Hadoop-FileOutputCommitter, manifest pointers stored in DynamoDB or Hive Metastore, retry-with-exponential-backoff baked deep into the read path. Today's S3 is strongly consistent, but the table-format code still has those guards because Azure Blob, GCS, and MinIO have their own consistency stories, and the formats are designed to work on all of them.

Why "strongly consistent" doesn't mean "transactional": S3 guarantees that a successful single-object PUT is immediately visible. It does not guarantee that a write to object A and a write to object B happen atomically. There is no S3 primitive for "commit these two PUTs together or roll both back". A table is many objects (data files + manifest files + table-metadata file). Achieving table-level atomicity requires the lakehouse to design a commit protocol on top of single-object consistency — and that is exactly what manifest files solve in the next chapter.

The four things S3 refuses, and the cost of rebuilding them

Now flip the table. The four things S3 gives, listed cleanly. Here are the four it does not, and what each one costs the lakehouse to rebuild.

Refusal What you have to build Where in the lakehouse
In-place mutation of bytes within an object Either rewrite the whole object (copy-on-write) or write a delete file and merge at read time (merge-on-read) Iceberg's COW vs MOR, Hudi's CoW vs MoR, Delta's deletion vectors
Fast small writes (<1 KB or <100 ms) Buffer in memory, flush in batches; or write to a low-latency staging tier first Streaming sinks; Iceberg's streaming-write path with rolled-up commits
Cross-object atomicity / transactions Atomic commit of a single metadata pointer (the table's "current version") with optimistic-concurrency retry Manifest files + the next-chapter commit protocol
Millisecond GET latency Cache hot files locally on workers; lay out files so a query reads the minimum number of objects Trino's cache; data skipping via min/max stats; Z-ordering

Each refusal has a price. The lakehouse pays it consciously rather than secretly. To make this concrete, here's a short Python script that measures one of the prices: how much it costs to update a single row in a 16 MB Parquet file when in-place mutation isn't an option.

# rewrite_cost.py — measure the cost of "updating one row" on object storage.
# This is what the lakehouse pays in copy-on-write mode for every UPDATE/DELETE.

import time, hashlib, io
import pyarrow as pa
import pyarrow.parquet as pq

# Simulate a PaisaBridge payments shard: 200,000 rows, 6 columns,
# realistic Indian-fintech sizes. Roughly 16 MB on disk.
N = 200_000
CITIES = ["Bengaluru", "Mumbai", "Pune", "Delhi", "Chennai"]
data = {
    "txn_id":       [f"pay_{i:09d}" for i in range(N)],
    "merchant_id":  [f"mer_{i % 8000:06d}" for i in range(N)],
    "amount_paise": [(i * 137) % 1_000_000 for i in range(N)],
    "ts_unix_ms":   [1714000000000 + i for i in range(N)],
    "status":       ["captured" if i % 50 else "failed" for i in range(N)],
    "city":         [CITIES[i % 5] for i in range(N)],
}

table = pa.table(data)
buf   = io.BytesIO()
pq.write_table(table, buf, compression="snappy")
size_mb = len(buf.getvalue()) / (1024*1024)
print(f"Parquet size: {size_mb:.2f} MB")

# "Update one row": flip status of txn_id pay_000050000 from captured to refunded.
def update_one_row(buf_bytes):
    t0 = time.time()
    t  = pq.read_table(io.BytesIO(buf_bytes))
    cols = t.to_pydict()
    idx  = cols["txn_id"].index("pay_000050000")
    cols["status"][idx] = "refunded"
    new_buf = io.BytesIO()
    pq.write_table(pa.table(cols), new_buf, compression="snappy")
    return new_buf.getvalue(), time.time() - t0

new_bytes, secs = update_one_row(buf.getvalue())
print(f"Update one row (COW): rewrote {len(new_bytes)/(1024*1024):.2f} MB "
      f"in {secs*1000:.0f} ms")

# "Update one row" the merge-on-read way: write a tiny delete file pointing at row idx.
delete_file = {"file": "pay_2026_04_25.parquet", "deleted_pos": [50_000],
               "new_row": {"txn_id": "pay_000050000", "status": "refunded"}}
deletes_serialized = repr(delete_file).encode()
print(f"Update one row (MOR): wrote a {len(deletes_serialized)} byte delete file")

# Sample run on an M2 MacBook Air:
# Parquet size: 14.71 MB
# Update one row (COW): rewrote 14.69 MB in 412 ms
# Update one row (MOR): wrote a 124 byte delete file

Walk through the numbers, because this is the trade Build 12 keeps coming back to:

  • Parquet size: 14.71 MB — 200k rows of realistic payment data compress to ~15 MB with Snappy. Iceberg's default target file size is 128 MB; this is a smaller-than-default shard.
  • Update one row (COW): rewrote 14.69 MB in 412 ms — copy-on-write, the Iceberg/Delta default. Reading the whole file, swapping one cell, writing the whole file back. Why this is the right default for analytical workloads: read-side performance is a single pass over a contiguous file, which is exactly what columnar scans were optimised for. The write cost is paid at commit time; readers pay nothing extra.
  • Update one row (MOR): wrote a 124 byte delete file — merge-on-read. The data file is untouched; a tiny "row 50,000 of file X is now this" record is written. Writes are 100× cheaper. Reads now have to cross-reference every data file with its delete files at query time, a 5–25% read penalty. Why this trade flips for streaming CDC ingest: a CDC pipeline writes thousands of small per-row changes per minute, and rewriting 16 MB per row would generate enough write amplification to bankrupt the pipeline. MOR makes high-frequency updates affordable; COW does not.
  • The 412 ms vs sub-millisecond gap — on a Postgres heap, that one-row UPDATE would have been 40 µs. Object storage forces the lakehouse to choose between paying that 10,000× cost on every update (COW) or shifting it to query time (MOR). There is no third option that retains S3 as the substrate.

The script is not testing S3 — it's running locally — but the relative costs hold once you add network I/O, because the bottleneck on object storage is the data-volume-per-mutation, not the latency.

Why this trade is worth making at lakehouse scale

The case for moving the warehouse off local SSDs and onto S3 is not "S3 is cleaner". It's an arithmetic argument that becomes overwhelming above a few terabytes.

Take a PaisaBridge-scale analytical warehouse: 5 PB of payment, merchant, ledger, and fraud-signal data, growing at 200 GB per day, with 80 daily-active analysts running an average of 2 TB of scans per analyst per day.

Cost line Local-SSD warehouse S3-backed lakehouse
Storage (5 PB) i3.16xlarge fleet, ~₹45 lakh/month S3 Standard, ~₹9 lakh/month
Compute (always-on) Fixed cluster, ~₹38 lakh/month Trino on EC2 spot, scale-to-zero overnight, ~₹14 lakh/month
Backup Separate snapshot pipeline, ~₹6 lakh/month Free — S3 versioning + cross-region replication
DR / multi-region Manual replication, ₹20 lakh/month S3 Cross-Region Replication, ~₹5 lakh/month
Total monthly ~₹1.09 crore ~₹28 lakh

The order-of-magnitude saving is the headline number. The structural saving is more important: storage and compute scale independently. The warehouse can grow to 50 PB without buying a single extra compute box; the compute fleet can scale up 10× during month-end batch jobs without buying a single extra disk. Decoupling the two is what makes the lakehouse architecture interesting; the fact that S3 happens to be the cheapest substrate that allows the decoupling is a happy accident.

Storage and compute decoupling on the lakehouseDiagram contrasting a coupled warehouse (one tightly-bound storage+compute box) with a decoupled lakehouse (S3 storage layer at the bottom, three independently-scaling query engines on top: Trino, Spark, DuckDB). Coupled vs decoupled compute–storage Coupled (i3 fleet) compute scales with disks local SSDs scales with compute 5 PB → 200 boxes ~₹1.09 crore / month Decoupled (lakehouse) Trino analyst SQL Spark batch ETL DuckDB laptop analysts S3 — Iceberg tables 5 PB, scales independent of compute read-once, scan-many, no cluster pinning ~₹28 lakh / month, scale-to-zero idle
The lakehouse story: one storage substrate, many query engines. Each engine scales by its own workload; storage scales by retained data. PaisaBridge can run a 50-machine Trino cluster during month-end and 4 machines at 3 a.m., with the table footprint untouched.

Edge cases that bite at production scale

The marketing pitch ends here. Production has five more things you need to know before you commit a 5 PB warehouse to S3.

  • The small-files problem. S3 charges per-object overhead: every PUT is a billable operation (~5 per million), every GET is one (~0.40 per million), every LIST is one. A streaming pipeline that writes 200k tiny files per day pays more in PUT cost than in storage cost — and a query that scans 200k tiny files spends 90% of its time in object-open overhead. Compaction (Build 6, Build 12) exists to keep file counts bounded.
  • List-prefix performance degrades sub-linearly. S3 partitions index by key prefix. A bucket with s3://warehouse/payments/year=2026/month=04/day=25/file.parquet has different list latency than one with s3://warehouse/file_a8b3.parquet — the former benefits from prefix-sharding, the latter creates a hot index partition. The flat-bucket myth ("S3 doesn't have folders so layout doesn't matter") is wrong; layout matters for list throughput.
  • Object-overwrite versioning silently grows the bill. If versioning is on (and it should be — it's the cheapest "undo" you'll ever buy), every PUT of an existing key creates a new version and retains the old one. A pipeline that rewrites the same manifest file 10,000 times in a day silently grows the bucket by 10,000 manifest copies. Lifecycle rules to expire non-current versions are mandatory.
  • Cross-region replication has minutes of lag. Cross-Region Replication is an asynchronous process; lag is typically 15 minutes but can spike to hours during regional outages. A multi-region read-replica strategy that assumes sub-second propagation will silently serve stale data. Iceberg's snapshot id and atomic-commit semantics give you a way to reason about this — readers see a consistent snapshot, just an older one.
  • Egress, not storage, is the surprise on the AWS bill. S3 storage is cheap. Egress (data leaving the region) is ~$0.09/GB. A query engine in us-east-1 reading from a bucket in ap-south-1 costs more in egress than the storage itself. Co-locating compute with storage is not optional; it's load-bearing for the cost model.

Common confusions

  • "S3 is just a backup target." Pre-2020 it was. Post-strong-consistency S3 (Dec 2020) plus a table format on top (Iceberg/Delta/Hudi) plus a query engine that respects the format is a complete analytical database stack. The "primary store" claim is what Build 12 spends 13 chapters justifying.
  • "S3 is the same as HDFS." They have similar APIs and very different consistency, durability, and economic models. HDFS gives you a real filesystem with rename, weaker durability (3-way replication, no cross-AZ guarantee), and dedicated machines you have to operate. S3 gives you stronger durability, no rename, and operations you don't have to do. The fact that Hadoop tools speak both is a happy compatibility accident, not a sign they're equivalent.
  • "S3 is slower than a local SSD, so it's bad for analytics." Per-byte, yes — a single-threaded sequential read from a local SSD is faster than a single-threaded GET from S3. But analytical queries are massively parallel: Trino opens 200 GET requests concurrently, each fetching a different row group, and the aggregate throughput dwarfs what a single-machine SSD can deliver. The right mental model is "S3 is a parallel-bandwidth substrate, not a low-latency one".
  • "You can run an OLTP database on S3 directly." No. OLTP needs sub-millisecond writes with strong cross-row atomicity; S3 gives you 50–200 ms PUTs with single-object atomicity only. There are systems that layer OLTP on top of S3 (Aurora's storage layer is loosely S3-like; Neon does it explicitly) but they all add an in-memory + WAL tier on top — they don't run "on S3" in the way the lakehouse does.
  • "Strong consistency means transactions." Strong consistency means a single-object PUT is immediately readable. It does not mean two PUTs are atomic with respect to each other. The lakehouse builds transactions on top using a single-object commit pointer (the table-metadata file) and optimistic concurrency. This is the next chapter.

Going deeper

The Werner Vogels 2006 design choices that still echo

S3's original API — PUT, GET, DELETE, LIST, plus ACLs — was deliberately minimalist. Werner Vogels' 2006 announcement said the team "chose simplicity and durability over richness". Twenty years later, every "we should add a rename API" or "let's offer in-place edits" proposal has been rejected because the operational cost of those features across the global S3 fleet outweighs the per-customer benefit. The lakehouse architecture exists because S3 said no to those features — if S3 had been a posix-compliant networked filesystem, Iceberg would never have been designed.

What changed in December 2020

Before December 2020, S3's overwrite semantics were "eventually consistent" — a PUT of an existing key, followed by a GET, might return the old object for up to ~10 seconds. List-after-write had similar lag. Every Hadoop-on-S3 project carried workarounds: S3Guard (DynamoDB-backed metadata), the S3A committer's two-phase staging directory, custom retries. The flip to strong consistency made the workarounds obsolete but the code is still there because it was correct under the old semantics and remains correct under the new ones, just over-engineered. Iceberg v2 quietly relaxed several of these guards; Delta Lake 3.x did the same.

How ParakhTrade's market-data lakehouse uses S3 differently from a fintech's transaction lakehouse

ParakhTrade records every NSE/BSE tick — about 6 billion ticks on a normal day, 18 billion on a budget-day spike. A fintech writes maybe 100 million rows per day and updates them frequently. ParakhTrade writes 100× more rows but never updates them — market ticks are append-only by nature. Their S3 layout is therefore radically different: hourly partitions, large files (1 GB target), no MOR layer, no manifest churn beyond the daily partition rotation. The fintech needs Iceberg's full update/delete machinery; ParakhTrade needs barely any of it. Same substrate, different table-format choices, both correct.

The "S3 Express One Zone" question

In 2024 AWS launched S3 Express One Zone, a single-AZ tier with single-millisecond latency at ~10× the storage cost. It looks like the answer to "S3 is too slow for hot data". For lakehouse workloads, the answer is mostly "no, this isn't what you wanted" — Express trades durability (single-AZ) for latency, and the bottleneck for analytical queries is parallel bandwidth, not single-request latency. Where Express does help is feature-store online-serving and small-state streaming checkpoints — workloads that lakehouses outsource to other systems anyway. The 5 PB warehouse stays on S3 Standard.

Open Compute alternatives — why MinIO and Ceph haven't displaced S3 inside cloud-resident shops

MinIO, Ceph, and SeaweedFS all implement the S3 API on commodity hardware. For on-prem deployments (a defense contractor, a regulated bank that can't use AWS), they are the right choice. For cloud-resident shops, they don't displace S3 because the operational cost — running Ceph monitors, replacing failed disks, managing erasure-coding policies — eats the storage savings. The 11-nines durability claim is not just about disks; it's about AWS spending years tuning the replication and recovery policies. Reproducing that on commodity hardware costs an SRE team's headcount, which usually exceeds the AWS bill being avoided.

Where this leads next

Build 12 spends the next 13 chapters teaching you how to turn S3 from "key-value store of immutable bytes" into "transactional analytical database". The mechanism is the table format — a layer of metadata files on S3 that gives you commit, schema evolution, time travel, and concurrent writers. The next chapter steps directly into that layer:

The closing chapter of Build 12, /wiki/cdc-iceberg-the-real-world-pattern, connects this to Build 11: how do you actually land PaisaBridge's CDC stream into Iceberg-on-S3 with bounded latency and clean schema evolution?

References