In short
A data lake is a folder. More precisely, it is a bucket on a cloud object store — s3://flipkart-lake/, gs://phonepe-events/, abfs://acko-claims/ — into which you dump raw data files (logs, CDC streams, event blobs, CSV exports) and the processed Parquet that downstream pipelines produce. There is no database server, no central catalog you have to talk to before writing, no schema you have to declare up front. Any engine that can read the file format you wrote — Spark, Trino, Athena, BigQuery External Tables, DuckDB on a laptop, a Python script with pyarrow — can query that data directly from the bucket.
This pattern displaced the classical centralised data warehouse for raw and intermediate data after about 2014 because three numbers shifted at once: S3 hit eleven nines of durability and effectively unbounded throughput, storage cost dropped to ~$23 per TB-month (Standard) or ~$4 per TB-month (Glacier IA), and 10–40 Gbit/s networks made pulling Parquet from S3 to a Spark worker faster than reading from a single spinning disk. The economics of "bring data to the warehouse" inverted into "bring compute to the lake".
The cost of that flexibility is operational discipline. The convention that emerged is the medallion architecture: organise the bucket into three layers — bronze (raw, immutable, append-only — exactly as ingested), silver (cleansed, typed, deduplicated — useful but not yet business-ready), gold (aggregated, joined with dimensions, served to BI). Each layer is just a directory prefix; jobs read from one and write to the next. This chapter walks the architecture, the file format zoo (Parquet, Avro, JSON, binaries), the four classic failure modes of raw S3 lakes (no ACID, no enforced schema, slow listings, no data skipping across files), and a worked example of a 10 TB/day Indian fintech writing transaction logs through bronze → silver → gold. The next chapter shows how Iceberg, Delta and Hudi solve the four failure modes by adding a transactional metadata layer on top of the same files.
You finished the previous chapter with Snowflake's bet — that S3 had quietly invalidated the technical reason for tying compute to storage in a warehouse. This chapter takes the same observation one step further. If S3 is durable and cheap and fast enough to hold a warehouse's data, it is also durable and cheap and fast enough to hold all of your data — including the messy, schemaless, unaggregated, half-typed source feeds that no warehouse would ever accept without an ETL pipeline cleaning them first.
That second observation is the data lake. The warehouse said clean it before you store it. The lake says store it first, clean it later, and let many engines clean it differently for many purposes. For the right workloads — and there are a lot of them — this is a strictly better deal.
The thesis: a bucket is a database (sort of)
A data lake, stripped to its essentials, is three things glued together:
- A cloud object store — almost always Amazon S3, Google Cloud Storage, or Azure Blob Storage (sometimes MinIO or Ceph for on-premise). The store gives you eleven nines of durability, infinite (in practice) capacity, a flat key-value namespace where keys can contain / to look like directories, and a simple HTTP API: PUT, GET, LIST, DELETE.
- Files in open formats — Parquet for analytical tables (chapter 121), Avro for streaming/CDC payloads, JSON or CSV for raw drops, plus arbitrary binaries (images, PDFs, model checkpoints) when the workload demands it.
- Many compute engines that read those files directly, with no central server mediating the read. Spark, Trino, Presto, Athena, BigQuery External Tables, Snowflake External Tables, DuckDB, ClickHouse, even a pyarrow.parquet.read_table('s3://...') call from a Jupyter notebook.
That is it. There is no data lake server. There is no data lake daemon. There is a bucket, there are files in it, and there are engines that know how to read those files. Why this is architecturally radical: every previous analytical platform — Teradata, Vertica, Hadoop, Redshift, even Snowflake's classical mode — required you to load your data through its loader into its storage layer before its engine could query it. The lake collapses that pipeline. You write a Parquet file to S3 with whatever tool you like, and ten minutes later a Trino cluster you have never spoken to can run SELECT count(*) against it. The file is the contract.
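To make "the file is the contract" concrete, here is a minimal sketch of the entire write path — one pyarrow call, no load pipeline. The path and columns are illustrative, and the call assumes AWS credentials are available in the environment:

import pyarrow as pa
import pyarrow.parquet as pq

# The entire "load" step of a data lake: build a table, PUT it to the bucket.
table = pa.table({
    "order_id": [1001, 1002],
    "country": ["IN", "IN"],
    "amount_inr": [420.0, 999.0],
})
# One PUT. No loader, no catalog registration, no server handshake.
pq.write_table(table, "s3://flipkart-lake/silver/orders/dt=2024-04-25/part-0.parquet")
# Any Parquet-reading engine — Trino, Athena, DuckDB, another pyarrow script —
# can now query this file directly from the bucket.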
The shift from data warehouse to data lake
The data warehouse pattern of the 1990s and 2000s — Teradata, Vertica, Greenplum, early Redshift — was an elegant, opinionated, expensive answer to a real question: how do we make analytical queries fast? The answer was: bring all the data into one centralised, schemafied, columnar database, run a heroic ETL pipeline to clean and conform it, give analysts SQL on top.
For the queries it was designed for, this worked beautifully. A clean, conformed, dimensionally modelled warehouse on Vertica could answer a 30-table star-schema BI query in under a second. The trouble was everything around the edges:
- Onboarding new data was slow. Adding a new source — say, a new mobile app's clickstream — required schema design, ETL development, capacity planning, and a release window. Three to nine months was typical at a Fortune 500.
- Storage was expensive. A Teradata appliance in 2012 cost roughly $20,000–$50,000 per TB of usable storage — five hundred to a thousand times the per-TB cost of S3 today.
- The schema was rigid. If you wanted to keep a column you might need later but did not currently know how to model, you could not. Warehouse data was schemafied at write time; ambiguous data either got dropped or stuffed into a JSON blob in a details column and never queried.
- Multi-modal data was unwelcome. Images, audio, model checkpoints, raw HTTP responses, free-text fields longer than VARCHAR(4000) — none of these belonged in a columnar warehouse, so they lived elsewhere and the warehouse never saw them.
The data lake answered the inverse design question: how do we store everything cheaply and decide what to do with it later? The pattern, sometimes called schema-on-read, says: write the raw bytes to S3 in whatever format the source produced them, with a directory structure that records when and from where they came, and let downstream consumers parse and type the data when they need to. The cost of querying becomes the cost of the engine doing the parsing, not the cost of a heroic up-front ETL.
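A minimal sketch of schema-on-read in pyarrow (file name and fields are hypothetical): the same raw JSON drop can be read with inferred types by one consumer and with an explicitly declared schema by another — the types live in the reader, not in the file:

import pyarrow as pa
import pyarrow.json as pj

# One consumer trusts type inference; the engine guesses types from the text.
inferred = pj.read_json("bronze_drop.json")

# Another consumer imposes its own schema at read time: same bytes, its types.
strict = pa.schema([
    ("txn_id", pa.string()),
    ("amount", pa.float64()),
    ("status", pa.string()),
])
typed = pj.read_json(
    "bronze_drop.json",
    parse_options=pj.ParseOptions(explicit_schema=strict),
)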
The trade is real and worth naming. Schema-on-read pushes work to query time and to the consumer; that means slower ad-hoc queries (the engine has to parse the JSON every time, instead of reading typed columns), and it means garbage in, garbage out — without enforcement, anyone can write a malformed file that breaks every consumer downstream. The data warehouse forced quality up front; the lake defers it. Most production data platforms today land somewhere in between: raw lake for ingestion and cheap retention, curated warehouse-shaped layer (often inside the same lake) for serving.
The file format zoo
The bucket holds many different kinds of files, and each format earns its place by being good at something specific:
- Parquet — columnar, typed, compressed, with per-row-group min/max statistics in the footer (chapter 121); the default for anything that will be queried analytically.
- Avro — row-oriented with the schema embedded in the file; the natural container for streaming and CDC payloads that arrive record by record.
- JSON and CSV — whatever the source happened to produce; cheap to write, expensive to read, because every query re-parses the text.
- Arbitrary binaries — images, PDFs, model checkpoints — stored as opaque objects alongside the tables that reference them.
The rule of thumb that emerged in production: bronze is whatever the source produced (often JSON or Avro), silver and gold are always Parquet. The conversion happens once, in the bronze→silver step, and every downstream consumer benefits from columnar reads forever after. You will see this pattern in every mature lake architecture; the cost of keeping bronze in JSON is paid once per row at silver-write time, not over and over at read time.
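A sketch of that one-time conversion, assuming a hypothetical line-delimited bronze drop events.json with txn_id and amount fields:

import pyarrow.json as pj
import pyarrow.parquet as pq

# Parse the text exactly once, at silver-write time.
table = pj.read_json("events.json")                  # the only JSON parse
pq.write_table(table, "events.parquet", compression="zstd")

# Every later read is a typed columnar scan — no re-parsing, and column
# projection means untouched columns are never even fetched.
orders = pq.read_table("events.parquet", columns=["txn_id", "amount"])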
The medallion architecture
A bucket without convention is a swamp. Files accumulate, partitions overlap, schemas drift, nobody knows which directory holds the canonical version of orders, and within eighteen months the lake has the same problem the warehouse had — except now without enforced schemas. The convention that emerged to prevent this, popularised by Databricks' medallion architecture blog, organises every lake into three explicit layers, each a top-level prefix in the bucket.
Bronze is raw and immutable. When a Kafka consumer drains a topic, when a Debezium CDC job ships RDBMS row changes, when a vendor drops a daily CSV, the file lands in bronze/ exactly as received, partitioned by the ingestion date so you can find it later. Bronze is never modified — corrections happen in silver, never by overwriting bronze. Why bronze must be immutable: the value of a raw layer is that you can replay any downstream pipeline against it deterministically. The moment you start modifying bronze, replays produce different answers depending on when you ran them, and the lake's audit story collapses. Treat bronze as if it were on a write-once filesystem.
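S3 can make that discipline mechanical rather than cultural. A sketch using Object Lock, assuming a bucket created with Object Lock (and therefore versioning) enabled — bucket and key are illustrative:

import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
# WORM enforcement: S3 itself refuses to delete or overwrite this object
# version until the retention date passes, regardless of whose credentials ask.
s3.put_object(
    Bucket="example-lake",
    Key="bronze/raw/transactions/yyyy=2024/mm=04/dd=25/hh=14/part-0.json.gz",
    Body=open("part-0.json.gz", "rb"),
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
)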
Silver is cleansed. A Spark or dbt job reads bronze partitions, parses the JSON or Avro into typed columns, applies basic data-quality checks, deduplicates by primary key, redacts or hashes PII, and writes Parquet. Silver is overwrite-per-partition: rerunning yesterday's job for dt=2024-04-25 produces the same partition (idempotent). Silver tables are still relatively normalised — they look like the source schema, just cleaned up.
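The overwrite-per-partition semantics can be sketched with pyarrow's dataset writer (bucket and columns illustrative; Spark's dynamic partition overwrite behaves analogously):

import pyarrow as pa
import pyarrow.dataset as ds

cleaned = pa.table({
    "order_id": [1, 2],
    "amount_inr": [420.0, 99.0],
    "dt": ["2024-04-25", "2024-04-25"],
})
ds.write_dataset(
    cleaned,
    "s3://flipkart-lake/silver/orders/",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("dt", pa.string())]), flavor="hive"),
    # Rerun for the same dt and only that partition's files are replaced;
    # every other partition is left untouched — the job is idempotent.
    existing_data_behavior="delete_matching",
)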
Gold is served. Another job (often dbt) joins silver tables with dimension tables, aggregates, denormalises for BI latency, and writes the small, focused, partitioned Parquet files that Tableau and Looker actually query. Gold tables look like the warehouse layer of an old-school data warehouse — star-schema-shaped, dimensionally modelled, query-optimised.
The discipline of moving data through the layers is what keeps the lake from becoming the swamp. Each layer has a different SLA (bronze: minutes; silver: hours; gold: daily), a different audience (raw consumers; data scientists; BI users), and a different failure mode (lost ingestion; bad parsing; wrong aggregation). When something breaks, you know which layer to look in.
The four classic problems with raw S3 lakes
A lake built directly on S3 with no metadata layer has four problems that show up in every production deployment past about 100 TB. They are the reason the table formats covered in chapter 133 — Apache Iceberg, Delta Lake, Apache Hudi — exist.
1. No ACID. A Spark job writing 200 Parquet files to silver/orders/dt=2024-04-25/ is not a single atomic write — it is 200 independent S3 PUTs. If the job crashes after writing 137 of them, the directory contains a half-written partition. Any reader that runs against that partition sees 137 files of new data interleaved with the old (or worse, mixed schemas if the job was rewriting). There is no commit, no rollback, no BEGIN/COMMIT. The conventional workaround is to write to a staging prefix and atomically rename — but S3 has no atomic rename across many keys; "rename" is copy-then-delete and is not transactional either.
2. No enforced schema. S3 does not know that all the files in silver/orders/ are supposed to share a schema. If a buggy Spark job writes a file with price as STRING instead of DECIMAL(18,2), the next reader either crashes mid-scan or — worse — silently casts and produces wrong answers. Schema drift across files in the same logical table is a perennial source of 2 AM pages on raw S3 lakes.
3. Slow listings. To answer "what files belong to this table?", a query engine has to call S3's LIST API on the prefix. LIST returns 1000 keys per call; a table with one million Parquet files needs a thousand sequential paginated calls. Add S3's request-rate limits and listing a large table can take minutes before the engine has even decided which files to read (the sketch after this list makes the pagination cost concrete).
4. No data skipping across files. Parquet has predicate pushdown within a file (chapter 121) — the footer's per-row-group min/max stats let the reader skip row groups. But across files, the engine has no global statistics. To run WHERE country = 'IN' it must open every file's footer just to find out which ones contain Indian data. On a million-file table that is a million S3 GETs before any actual reading begins.
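The listing problem (number 3) is easy to feel in code. A sketch of what every query engine must do before it can plan anything, bucket and prefix illustrative:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

keys, calls = [], 0
# S3 returns at most 1000 keys per LIST call; a million-file table therefore
# costs ~1000 sequential round trips before query planning even starts.
for page in paginator.paginate(Bucket="flipkart-lake", Prefix="silver/orders/"):
    calls += 1
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

print(f"{len(keys)} data files discovered in {calls} LIST calls")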
Notice that all four problems share a root cause: there is no metadata layer above the files. S3 knows about objects; the engine knows about queries; nothing in between knows about tables. The table formats — Iceberg, Delta, Hudi — fix this by writing a small JSON or Avro manifest alongside the Parquet that describes the table state: which files belong to which snapshot, what schema each file has, what min/max stats each file carries. The engine reads the manifest first (one small file), gets a complete picture of the table without listing S3, and pushes predicates against per-file stats without opening footers. Chapter 133 walks the three formats in detail.
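To see the shape of the fix without committing to any one spec, here is a purely schematic manifest — an illustration of the idea, not the actual Iceberg, Delta, or Hudi encoding:

# Schematic only — not the real Iceberg/Delta/Hudi layout.
manifest = {
    "snapshot_id": 42,
    "schema": [("order_id", "bigint"), ("country", "string"),
               ("amount_inr", "decimal(18,2)")],
    "files": [
        {"path": "silver/orders/dt=2024-04-25/part-0001.parquet",
         "rows": 1_200_000,
         "stats": {"country": {"min": "AE", "max": "IN"}}},
        {"path": "silver/orders/dt=2024-04-25/part-0002.parquet",
         "rows": 1_150_000,
         "stats": {"country": {"min": "SG", "max": "US"}}},
    ],
}
# One small read gives the engine the full file list (no LIST), the schema
# (no drift surprises), and per-file stats: for WHERE country = 'IN', the
# second file is skipped without its footer ever being opened.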
A small Python example: reading partitioned Parquet from S3
To make the architecture concrete, here is the smallest possible end-to-end read from a partitioned silver Parquet table:
import pyarrow.dataset as ds
import pyarrow.compute as pc
# Point at the silver folder; pyarrow discovers Hive-style partitioning automatically
dataset = ds.dataset(
"s3://flipkart-lake/silver/orders/",
format="parquet",
partitioning="hive", # reads dt=2024-04-25 → column "dt"
)
# Push down a predicate: only read partitions for last 7 days, only Indian orders
filter_expr = (pc.field("dt") >= "2024-04-19") & (pc.field("country") == "IN")
table = dataset.to_table(
columns=["order_id", "user_id", "amount_inr", "dt"],
filter=filter_expr,
)
# Aggregate locally
result = table.group_by("dt").aggregate([("amount_inr", "sum")])
print(result.to_pandas())
That is the entire data-lake read path. Why this works without any database server: pyarrow uses the partition path to skip whole directories that cannot match the filter (partition pruning), opens only the Parquet footers in surviving partitions (predicate pushdown), and reads only the four requested columns. The same query against a 50-column, 10 TB silver table touches maybe 20 GB of S3 — about 0.2% of the data — and runs in seconds from a laptop. No coordinator, no catalog, no warehouse cluster.
A worked example: an Indian fintech's transaction lake
Setting. A Bengaluru-based payments company — call them PayBharat — processes 80 million UPI transactions per day. Each transaction is logged as a JSON event from the payment-switch fleet to a Kafka cluster, drained continuously to S3 by a Kafka Connect S3 sink. Total volume: roughly 10 TB of compressed JSON per day. The data has to power three workloads: regulatory reporting for the RBI (T+1 SLA), risk and fraud models (T+1 hour SLA), and a partner-facing BI dashboard (T+24 hour SLA).
A classical Snowflake-only architecture would cost about ₹3.5 crore per year just in storage and ingest credits. The team builds a medallion lake on S3 instead.
Bucket layout.
s3://paybharat-lake/
bronze/raw/transactions/yyyy=2024/mm=04/dd=25/hh=14/*.json.gz
↑ partitioned by ingest hour, ~400 GB per hour on average
bronze/raw/cdc/users/yyyy=2024/mm=04/dd=25/*.parquet
↑ Debezium CDC from the user-profile MySQL
silver/cleansed/transactions/dt=2024-04-25/*.parquet
↑ typed, deduped, PII-hashed, ~3 TB/day Parquet
silver/cleansed/users/*.parquet
↑ users dimension, full snapshot, ~80 GB
gold/fact_txn_daily/dt=2024-04-25/*.parquet
↑ joined with users + merchants, partitioned by date
gold/agg_merchant_revenue/dt=2024-04-25/*.parquet
↑ pre-aggregated for BI dashboards
Bronze. The Kafka Connect sink writes one gzipped JSON file per topic-partition every five minutes, into the hourly partition. A typical bronze file at hour 14 looks like:
{"txn_id":"7f3e...","payer":"+919876543210","payee":"merchant.bigbasket@oksbi","amount":420.00,"currency":"INR","ts":"2024-04-25T14:03:17.219+05:30","status":"SUCCESS","switch":"NPCI-7","raw":{...60 more fields...}}
Bronze retention: 365 days, then move to Glacier IA. Cost: ~₹4 lakh/year.
Silver. A Spark job runs every hour, reads the previous hour's bronze partition, parses the JSON into a typed schema (50 columns), hashes the PII columns (payer, payee) with a per-tenant salt, deduplicates by txn_id, and writes Parquet to silver/cleansed/transactions/dt=2024-04-25/. The job is idempotent — rerunning it overwrites the partition. After conversion: 3 TB Parquet vs 10 TB JSON, a 3.3× reduction. Risk and fraud models read silver directly.
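A sketch of what that job might look like in PySpark, shown at daily granularity for brevity (the production job runs hourly); the 50-column schema is elided and the salt handling simplified to a constant:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn-bronze-to-silver").getOrCreate()
SALT = "per-tenant-salt"  # hypothetical: fetched from a secrets manager in reality

bronze = spark.read.json(
    "s3://paybharat-lake/bronze/raw/transactions/yyyy=2024/mm=04/dd=25/"
)
silver = (
    bronze
    # hash PII in place — downstream consumers never see raw identifiers
    .withColumn("payer", F.sha2(F.concat(F.col("payer"), F.lit(SALT)), 256))
    .withColumn("payee", F.sha2(F.concat(F.col("payee"), F.lit(SALT)), 256))
    # drop Kafka redelivery duplicates by primary key
    .dropDuplicates(["txn_id"])
)
# writing to the explicit dt= path makes reruns overwrite exactly this partition
silver.write.mode("overwrite").parquet(
    "s3://paybharat-lake/silver/cleansed/transactions/dt=2024-04-25/"
)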
Gold. A nightly dbt job joins silver transactions with silver users (slowly-changing dimension type 2) and a merchants dimension, aggregates by date and merchant category, and writes gold/agg_merchant_revenue/dt=2024-04-25/. The gold partition is small — about 200 MB — and the BI dashboard queries it with sub-second latency from Athena.
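The serving path is equally thin. A sketch of issuing that dashboard query through Athena's API — database name and output location are illustrative:

import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="""
        SELECT merchant_category, SUM(revenue_inr) AS revenue
        FROM agg_merchant_revenue
        WHERE dt = '2024-04-25'
        GROUP BY merchant_category
    """,
    QueryExecutionContext={"Database": "paybharat_gold"},
    ResultConfiguration={"OutputLocation": "s3://paybharat-lake/athena-results/"},
)
# Athena scans only the ~200 MB gold partition — it bills per TB scanned,
# which is exactly why the pre-aggregated gold layer keeps BI cheap and fast.
print(resp["QueryExecutionId"])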
Cost outcome. Storage: ~₹40 lakh/year (S3 Standard for 90 days hot, Glacier IA after). Compute: Spark on EMR, ~₹60 lakh/year. Athena for BI: ~₹15 lakh/year. Total: ~₹1.15 crore/year — a 3× reduction vs the Snowflake-only path, with the bonus that the same lake serves ML training (PyTorch reading Parquet directly via pyarrow) and ad-hoc Trino queries from a separate compute cluster the data science team owns.
The trade is operational: a four-engineer data platform team owns the medallion pipelines, the Spark jobs, the dbt models, the IAM policies, and the table-format migration to Iceberg that they are planning for next quarter. Snowflake would have outsourced most of that work in exchange for the ₹2.4 crore difference.
When a lake is the right answer (and when it is not)
Two readings of the same architecture: the lake is better than the warehouse for raw and intermediate data because it is cheap, flexible, and multi-engine; the lake is worse than the warehouse for serving because it lacks transactions, schema enforcement, and millisecond query latency. The mature production answer is to use both: a lake for bronze and silver and most of gold, a warehouse (or a lakehouse — chapter 133's table formats turn the lake itself into one) for the small slice of gold that powers latency-sensitive serving.
Concretely, pick a lake-first architecture when: your data is large (>10 TB), your sources are diverse (logs, events, CDC, vendor feeds, binaries), your consumers are heterogeneous (BI plus ML plus ad-hoc), and your team has the engineering bandwidth to own the medallion discipline. Pick a warehouse-first architecture when: your data is small (<1 TB), your queries are predominantly BI, your team is small, and you would rather pay Snowflake or BigQuery a premium to handle storage, transactions, and query optimisation for you.
The next two chapters close the loop. Chapter 133 introduces the table formats — Iceberg, Delta Lake, Hudi — that add a transactional metadata layer on top of the raw lake and fix the four classic problems. Chapter 134 walks the operational features that fall out of that layer: time travel, zero-copy clones, branching. By the end you will see why the industry is converging on the term lakehouse — a lake with a warehouse-grade metadata layer, the best of both architectures, the pattern that has now displaced both pure lakes and pure warehouses for new builds at almost every Indian and global tech company since about 2022.
References
- Databricks. What is a medallion architecture? — the canonical bronze/silver/gold definition.
- Apache Software Foundation. Apache Iceberg overview — the table-format spec that fixes the four classic lake problems.
- Armbrust, Das, Paranjpye, Xin et al. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (VLDB 2020) — the Delta whitepaper.
- AWS. Lake Formation developer guide — the AWS-native lake-management service.
- Snowflake. External tables documentation — querying lake data from a warehouse engine.
- Armbrust, Ghodsi, Xin, Zaharia. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics (CIDR 2021) — the lakehouse paradigm paper.