Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A data lake is a bucket on a cloud object store (S3, GCS, Azure Blob) holding raw and processed files in open formats, queryable by any engine that can read them — no central database server, no up-front schema. The bargain that made this work after 2014 is bring compute to the data: durable, cheap object storage plus columnar Parquet inverts the warehouse's load-then-query economics. The cost of that flexibility is operational discipline — the medallion convention (bronze raw, silver cleansed, gold served) — and four classic gaps (no ACID, no schema enforcement, slow listings, weak cross-file pruning) that the next chapter's table formats fix.

You finished the previous chapter with Snowflake's bet — that S3 had quietly invalidated the technical reason for tying compute to storage in a warehouse. This chapter takes the same observation one step further. If S3 is durable and cheap and fast enough to hold a warehouse's data, it is also durable and cheap and fast enough to hold all of your data — including the messy, schemaless, unaggregated, half-typed source feeds that no warehouse would ever accept without an ETL pipeline cleaning them first.

That second observation is the data lake. The warehouse said clean it before you store it. The lake says store it first, clean it later, and let many engines clean it differently for many purposes. For the right workloads — and there are a lot of them — this is a strictly better deal.

The thesis: a bucket is a database (sort of)

A data lake, stripped to its essentials, is three things glued together:

  1. A cloud object store — almost always Amazon S3, Google Cloud Storage, or Azure Blob Storage (sometimes MinIO or Ceph for on-premise). The store gives you eleven nines of durability, infinite (in practice) capacity, a flat key-value namespace where keys can contain / to look like directories, and a simple HTTP API: PUT, GET, LIST, DELETE.
  2. Files in open formats — Parquet for analytical tables (chapter 121), Avro for streaming/CDC payloads, JSON or CSV for raw drops, plus arbitrary binaries (images, PDFs, model checkpoints) when the workload demands it.
  3. Many compute engines that read those files directly, with no central server mediating the read. Spark, Trino, Presto, Athena, BigQuery External Tables, Snowflake External Tables, DuckDB, ClickHouse, even a pyarrow.parquet.read_table('s3://...') call from a Jupyter notebook.

That is it. There is no data lake server. There is no data lake daemon. There is a bucket, there are files in it, and there are engines that know how to read those files. Why this is architecturally radical: every previous analytical platform — Teradata, Vertica, Hadoop, Redshift, even Snowflake's classical mode — required you to load your data through its loader into its storage layer before its engine could query it. The lake collapses that pipeline. You write a Parquet file to S3 with whatever tool you like, and ten minutes later a Trino cluster you have never spoken to can run SELECT count(*) against it. The file is the contract.

Data lake layout: shared bucket, many enginesAn S3 or GCS bucket at the centre with a directory hierarchy: top-level folders bronze, silver, gold. Bronze contains raw subdirectories partitioned by year and month. Silver and gold contain cleansed and aggregated Parquet. Around the bucket sit four compute engines — Spark, Trino, Athena, DuckDB — each with an arrow pointing into the bucket, illustrating that all of them read the same files concurrently with no central database server.A data lake: one shared bucket, many independent enginess3://flipkart-lake/11 nines durability · pay per GB-month · no serverbronze/ raw/orders/2024/04/25/*.json.gz raw/clicks/2024/04/25/*.avro raw/cdc/users/2024/04/*.parquetsilver/ cleansed/orders/dt=2024-04-25/*.parquet cleansed/sessions/dt=2024-04-25/*.parquet cleansed/users/*.parquetgold/ fact_revenue_daily/*.parquet dim_product/*.parquet agg_user_cohorts/*.parquetartifacts/ models/recommender_v17.pklSparkheavy ETL, MLTrinointeractive SQLAthenaserverless SQLDuckDBlaptop / notebookBigQueryexternal tablesSnowflakeexternal tablesEvery engine reads the same files. No central database mediates the read.Add a new engine by giving it IAM access to the bucket. No re-shard, no re-load, no migration.
The data lake architecture. The bucket is the single source of truth; engines come and go around it.

The shift from data warehouse to data lake

The data warehouse pattern of the 1990s and 2000s — Teradata, Vertica, Greenplum, early Redshift — was an elegant, opinionated, expensive answer to a real question: how do we make analytical queries fast? The answer was: bring all the data into one centralised, schemafied, columnar database, run a heroic ETL pipeline to clean and conform it, give analysts SQL on top.

For the queries it was designed for, this worked beautifully. A clean, conformed, dimensionally modelled warehouse on Vertica could answer a 30-table star-schema BI query in under a second. The trouble was everything around the edges:

  • Onboarding new data was slow. Adding a new source — say, a new mobile app's clickstream — required schema design, ETL development, capacity planning, and a release window. Three to nine months was typical at a Fortune 500.
  • Storage was expensive. A Teradata appliance in 2012 cost roughly 20,000-50,000 per TB of usable storage — five hundred to a thousand times the per-TB cost of S3 today.
  • The schema was rigid. If you wanted to keep a column you might need later but did not currently know how to model, you could not. Warehouse data was schemafied at write time; ambiguous data either got dropped or stuffed into a JSON blob in a details column and never queried.
  • Multi-modal data was unwelcome. Images, audio, model checkpoints, raw HTTP responses, free-text fields longer than VARCHAR(4000) — none of these belonged in a row-oriented columnar warehouse, so they lived elsewhere and the warehouse never saw them.

The data lake answered the inverse design question: how do we store everything cheaply and decide what to do with it later? The pattern, sometimes called schema-on-read, says: write the raw bytes to S3 in whatever format the source produced them, with a directory structure that records when and from where they came, and let downstream consumers parse and type the data when they need to. The cost of querying becomes the cost of the engine doing the parsing, not the cost of a heroic up-front ETL.

The trade is real and worth naming. Schema-on-read pushes work to query time and to the consumer; that means slower ad-hoc queries (the engine has to parse the JSON every time, instead of reading typed columns), and it means garbage in, garbage out — without enforcement, anyone can write a malformed file that breaks every consumer downstream. The data warehouse forced quality up front; the lake defers it. Most production data platforms today land somewhere in between: raw lake for ingestion and cheap retention, curated warehouse-shaped layer (often inside the same lake) for serving.

The file format zoo

The bucket holds many different kinds of files. Each format earns its place by being good at something specific.

File format zoo in the lakeA 2x2 grid of file formats. Top-left Parquet: columnar analytical, dictionary encoded, predicate pushdown. Top-right Avro: row-oriented JSON-like binary, schema evolution friendly, Kafka and CDC. Bottom-left JSON and CSV: legacy and raw drops, human-readable, slow. Bottom-right images PDFs binaries: semi-structured, model artifacts, blobs.The lake holds a zoo of formats — each earns its placeParquet (columnar, analytical)column-by-column on disk · dict + RLE encodedsnappy/zstd compressed · footer with min/max statspredicate pushdown · 50-200x faster on scanssilver/, gold/ — anything you query repeatedlyread by Spark, Trino, Athena, DuckDB natively"the analytical default since 2015"Avro (row, schema-evolution friendly)JSON-like binary · schema embedded in fileadd/rename/drop fields without breaking readersrow-oriented — slow to scan, fast to appendbronze/ for streaming · Kafka payloads · CDCConfluent Schema Registry pairs with Avro"the streaming and CDC default"JSON / CSV (legacy, raw drops)human-readable · text · no schema · no compression10-30x larger than Parquet for the same dataslow scans · re-parse every read · type ambiguitybronze/raw/ — exports, third-party feeds, vendor APIsconvert to Parquet during bronze→silver step"legacy and inevitable"Images / PDFs / binariesPNG, JPEG, MP4, PDF, model checkpoints, archivesopaque to SQL engines · indexed by metadata sidecarqueried via path + ML jobs (Spark, Ray, PyTorch)artifacts/ — claims photos, ID scans, model weightsa warehouse cannot hold these — the lake can"the multi-modal advantage"
The file format zoo. Parquet for analytics, Avro for streaming, JSON/CSV for raw drops, binaries for everything else. The lake holds them all in the same bucket.

The rule of thumb that emerged in production: bronze is whatever the source produced (often JSON or Avro), silver and gold are always Parquet. The conversion happens once, in the bronze→silver step, and every downstream consumer benefits from columnar reads forever after. You will see this pattern in every mature lake architecture; the cost of keeping bronze in JSON is paid once per row at silver-write time, not over and over at read time.

The medallion architecture

A bucket without convention is a swamp. Files accumulate, partitions overlap, schemas drift, nobody knows which directory holds the canonical version of orders, and within eighteen months the lake has the same problem the warehouse had — except now without enforced schemas. The convention that emerged to prevent this, popularised by Databricks' medallion architecture blog, organises every lake into three explicit layers, each a top-level prefix in the bucket.

Bronze, silver, gold medallion architectureThree rectangles arranged left to right with arrows between them. Left rectangle bronze: raw, immutable, append-only — exactly as ingested from sources. Middle rectangle silver: cleansed, typed, deduplicated, joined references resolved — useful but not business-ready. Right rectangle gold: aggregated, joined with dimensions, served to BI dashboards and reverse-ETL. Above each rectangle a label of the source/sink: bronze receives from Kafka, CDC, S3 drops, vendor APIs; silver is consumed by data scientists; gold is consumed by BI dashboards and ML feature stores.Medallion architecture: bronze → silver → goldBRONZEraw, immutable, append-onlyexactly as ingestedJSON / Avro / CSV / Parquetpartitioned by ingestion datenever modified, only appended"the source of truth, replayable"retention: 90d - foreverSILVERcleansed, typed, deduplicatedalways ParquetPII redacted / hashedschema enforcedoverwrite per-partition (idempotent)"useful but not business-ready"consumed by data scienceGOLDaggregated, joined, servedstar-schema-shaped Parquetfacts joined with dimensionsdenormalised for BI latencysmall files, partitioned by date"served to dashboards"consumed by DashView / LookerSpark/dbtSpark/dbtSource feeds:Kafka topics, RDBMS CDC,S3 vendor drops, REST APIsConsumers:data scientists, ML training,ad-hoc Trino notebooksConsumers:DashView, Looker, Metabase,reverse-ETL to CRMone
The medallion. Each layer is a directory prefix; jobs read from one and write to the next. The same Parquet file format throughout silver and gold.

Bronze is raw and immutable. When a Kafka consumer drains a topic, when a Debezium CDC job ships RDBMS row changes, when a vendor drops a daily CSV, the file lands in bronze/ exactly as received, partitioned by the ingestion date so you can find it later. Bronze is never modified — corrections happen in silver, never by overwriting bronze. Why bronze must be immutable: the value of a raw layer is that you can replay any downstream pipeline against it deterministically. The moment you start modifying bronze, replays produce different answers depending on when you ran them, and the lake's audit story collapses. Treat bronze as if it were on a write-once filesystem.

Silver is cleansed. A Spark or dbt job reads bronze partitions, parses the JSON or Avro into typed columns, applies basic data-quality checks, deduplicates by primary key, redacts or hashes PII, and writes Parquet. Silver is overwrite-per-partition: rerunning yesterday's job for dt=2024-04-25 produces the same partition (idempotent). Silver tables are still relatively normalised — they look like the source schema, just cleaned up.

Gold is served. Another job (often dbt) joins silver tables with dimension tables, aggregates, denormalises for BI latency, and writes the small, focused, partitioned Parquet files that DashView and Looker actually query. Gold tables look like the warehouse layer of an old-school data warehouse — star-schema-shaped, dimensionally modelled, query-optimised.

The discipline of moving data through the layers is what keeps the lake from becoming the swamp. Each layer has a different SLA (bronze: minutes; silver: hours; gold: daily), a different audience (raw consumers; data scientists; BI users), and a different failure mode (lost ingestion; bad parsing; wrong aggregation). When something breaks, you know which layer to look in.

The four classic problems with raw S3 lakes

A lake built directly on S3 with no metadata layer has four problems that show up in every production deployment past about 100 TB. They are the reason the table formats covered in chapter 133 — Apache Iceberg, Delta Lake, Apache Hudi — exist.

1. No ACID. A Spark job writing 200 Parquet files to silver/orders/dt=2024-04-25/ is not a single atomic write — it is 200 independent S3 PUTs. If the job crashes after writing 137 of them, the directory contains a half-written partition. Any reader that runs against that partition sees 137 files of new data interleaved with the old (or worse, mixed schemas if the job was rewriting). There is no commit, no rollback, no BEGIN/COMMIT. The conventional workaround is to write to a staging prefix and atomically rename — but S3 has no atomic rename across many keys; "rename" is copy-then-delete and is not transactional either.

2. No enforced schema. S3 does not know that all the files in silver/orders/ are supposed to share a schema. If a buggy Spark job writes a file with price as STRING instead of DECIMAL(18,2), the next reader either crashes mid-scan or — worse — silently casts and produces wrong answers. Schema drift across files in the same logical table is a perennial source of 2 AM pages on raw S3 lakes.

3. Slow listings. To answer "what files belong to this table?", a query engine has to call S3's LIST API on the prefix. LIST returns 1000 keys per call; a table with one million Parquet files needs a thousand sequential paginated calls. Add S3's request-rate limits and listing a large table can take minutes before the engine has even decided which files to read.

4. No data skipping across files. Parquet has predicate pushdown within a file (chapter 121) — the footer's per-row-group min/max stats let the reader skip row groups. But across files, the engine has no global statistics. To run WHERE country = 'IN' it must open every file's footer just to find out which ones contain Indian data. On a million-file table that is a million S3 GETs before any actual reading begins.

Notice that all four problems share a root cause: there is no metadata layer above the files. S3 knows about objects; the engine knows about queries; nothing in between knows about tables. The table formats — Iceberg, Delta, Hudi — fix this by writing a small JSON or Avro manifest alongside the Parquet that describes the table state: which files belong to which snapshot, what schema each file has, what min/max stats each file carries. The engine reads the manifest first (one small file), gets a complete picture of the table without listing S3, and pushes predicates against per-file stats without opening footers. Chapter 133 walks the three formats in detail.

A small Python example: reading partitioned Parquet from S3

To make the architecture concrete, here is the smallest possible end-to-end read from a partitioned silver Parquet table:

import pyarrow.dataset as ds
import pyarrow.compute as pc
# Point at the silver folder; pyarrow discovers Hive-style partitioning automatically
dataset = ds.dataset(
    "s3://flipkart-lake/silver/orders/",
    format="parquet",
    partitioning="hive",   # reads dt=2024-04-25 → column "dt"
)
# Push down a predicate: only read partitions for last 7 days, only Indian orders
filter_expr = (pc.field("dt") >= "2024-04-19") & (pc.field("country") == "IN")
table = dataset.to_table(
    columns=["order_id", "user_id", "amount_inr", "dt"],
    filter=filter_expr,
)
# Aggregate locally
result = table.group_by("dt").aggregate([("amount_inr", "sum")])
print(result.to_pandas())

That is the entire data-lake read path. Why this works without any database server: pyarrow uses the partition path to skip whole directories that cannot match the filter (partition pruning), opens only the Parquet footers in surviving partitions (predicate pushdown), and reads only the four requested columns. The same query against a 50-column, 10 TB silver table touches maybe 20 GB of S3 — about 0.2% of the data — and runs in seconds from a laptop. No coordinator, no catalog, no warehouse cluster.

A worked example: an Indian fintech's transaction lake

Setting. A Bengaluru-based payments company — call them PayBharat — processes 80 million UPI transactions per day. Each transaction is logged as a JSON event from the payment-switch fleet to a Kafka cluster, drained continuously to S3 by a Kafka Connect S3 sink. Total volume: roughly 10 TB of compressed JSON per day. The data has to power three workloads: regulatory reporting for the RBI (T+1 SLA), risk and fraud models (T+1 hour SLA), and a partner-facing BI dashboard (T+24 hour SLA).

A classical Snowflake-only architecture would cost about ₹3.5 crore per year just in storage and ingest credits. The team builds a medallion lake on S3 instead.

Bucket layout.

s3://paybharat-lake/
  bronze/raw/transactions/yyyy=2024/mm=04/dd=25/hh=14/*.json.gz
                          ↑ partitioned by ingest hour, ~150 GB per hour
  bronze/raw/cdc/users/yyyy=2024/mm=04/dd=25/*.parquet
                          ↑ Debezium CDC from the user-profile MySQL
  silver/cleansed/transactions/dt=2024-04-25/*.parquet
                          ↑ typed, deduped, PII-hashed, ~3 TB/day Parquet
  silver/cleansed/users/*.parquet
                          ↑ users dimension, full snapshot, ~80 GB
  gold/fact_txn_daily/dt=2024-04-25/*.parquet
                          ↑ joined with users + merchants, partitioned by date
  gold/agg_merchant_revenue/*.parquet
                          ↑ pre-aggregated for BI dashboards

Bronze. The Kafka Connect sink writes one gzipped JSON file per topic-partition every five minutes, into the hourly partition. A typical bronze file at hour 14 looks like:

{"txn_id":"7f3e...","payer":"+919876543210","payee":"merchant.bigbasket@oksbi","amount":420.00,"currency":"INR","ts":"2024-04-25T14:03:17.219+05:30","status":"SUCCESS","switch":"NPCI-7","raw":{...60 more fields...}}

Bronze retention: 365 days, then move to Glacier IA. Cost: ~₹4 lakh/year.

Silver. A Spark job runs every hour, reads the previous hour's bronze partition, parses the JSON into a typed schema (50 columns), hashes the PII columns (payer, payee) with a per-tenant salt, deduplicates by txn_id, and writes Parquet to silver/cleansed/transactions/dt=2024-04-25/. The job is idempotent — rerunning it overwrites the partition. After conversion: 3 TB Parquet vs 10 TB JSON, a 3.3× reduction. Risk and fraud models read silver directly.

Gold. A nightly dbt job joins silver transactions with silver users (slowly-changing dimension type 2) and a merchants dimension, aggregates by date and merchant category, and writes gold/agg_merchant_revenue/dt=2024-04-25/. The gold partition is small — about 200 MB — and the BI dashboard queries it with sub-second latency from Athena.

Cost outcome. Storage: ~₹40 lakh/year (S3 Standard for 90 days hot, Glacier IA after). Compute: Spark on EMR, ~₹60 lakh/year. Athena for BI: ~₹15 lakh/year. Total: ~₹1.15 crore/year — a 3× reduction vs the Snowflake-only path, with the bonus that the same lake serves ML training (PyTorch reading Parquet directly via pyarrow) and ad-hoc Trino queries from a separate compute cluster the data science team owns.

The trade is operational: a four-engineer data platform team owns the medallion pipelines, the Spark jobs, the dbt models, the IAM policies, and the table-format migration to Iceberg that they are planning for next quarter. Snowflake would have outsourced most of that work in exchange for the ₹2.4 crore difference.

When a lake is the right answer (and when it is not)

Two readings of the same architecture: the lake is better than the warehouse for raw and intermediate data because it is cheap, flexible, and multi-engine; the lake is worse than the warehouse for serving because it lacks transactions, schema enforcement, and millisecond query latency. The mature production answer is to use both: a lake for bronze and silver and most of gold, a warehouse (or a lakehouse — chapter 133's table formats turn the lake itself into one) for the small slice of gold that powers latency-sensitive serving.

Concretely, pick a lake-first architecture when: your data is large (>10 TB), your sources are diverse (logs, events, CDC, vendor feeds, binaries), your consumers are heterogeneous (BI plus ML plus ad-hoc), and your team has the engineering bandwidth to own the medallion discipline. Pick a warehouse-first architecture when: your data is small (<1 TB), your queries are predominantly BI, your team is small, and you would rather pay Snowflake or BigQuery a premium to handle storage, transactions, and query optimisation for you.

The next two chapters close the loop. Chapter 133 introduces the table formats — Iceberg, Delta Lake, Hudi — that add a transactional metadata layer on top of the raw lake and fix the four classic problems. Chapter 134 walks the operational features that fall out of that layer: time travel, zero-copy clones, branching. By the end you will see why the industry is converging on the term lakehouse — a lake with a warehouse-grade metadata layer, the best of both architectures, the pattern that has now displaced both pure lakes and pure warehouses for new builds at almost every Indian and global tech company since about 2022.

Common confusions

  • "A data lake is just unstructured S3 — anything goes." It is not, or it should not be. A lake without the medallion discipline (or an equivalent convention) becomes a swamp within a year: orphaned partitions, schema drift across files of the "same" table, no canonical version of any entity. The bucket is the substrate; the medallion layout, partition contract, and file-format rule (silver and gold are always Parquet) are what make it a database. Drop the discipline and you have re-invented the 2009 Hadoop swamp that the lake pattern was supposed to replace.

  • "Schema-on-read means the lake has no schema." Schema-on-read means the schema is applied at read time by the consumer, not enforced at write time by a server. Every silver and gold table in a healthy lake has a schema — defined in dbt models, Spark types, or a Glue/Hive Metastore entry — and the bronze→silver job's job is to fail loudly when an incoming file violates it. The "no schema" misconception is what produces silent type-coercion bugs: a price column written as STRING in one file and DECIMAL in another, with downstream queries returning subtly wrong totals for months before anyone notices.

  • "S3 PUT is atomic, so a Spark write is atomic too." S3 single-object PUT is atomic for that one object. A Spark job writes hundreds or thousands of objects to the same partition; that multi-object write has no atomicity guarantee. Crash mid-write and the partition is in an inconsistent state — half-new, half-old, no rollback. This is the first of the four classic problems above and the single biggest reason production lakes adopt Iceberg/Delta/Hudi as soon as they cross ~10 TB.

  • "Listing S3 is fast because S3 is fast." S3 GET on a single object is fast (single-digit milliseconds). S3 LIST returns 1000 keys per call and is rate-limited per prefix. A million-file table takes 1000 sequential paginated calls — minutes — before any data is read. The cost of LIST is invisible on small tables and brutal on large ones, which is why naive s3a://path/* globs work fine in development and time out in production.

  • "Parquet's predicate pushdown means I do not need partitioning." Parquet pushes predicates within a file's row groups, but it cannot skip entire files without opening their footers. To run WHERE country = 'IN' against a million-file untyped lake table the engine still has to fetch every footer. Directory partitioning (country=IN/) is what lets the engine skip whole directories without any GET. Partition first, then rely on Parquet's intra-file pushdown.

  • "A data lake replaces the data warehouse." It does not — at least not for the slice of workload the warehouse was actually designed for. Lakes are better for raw, intermediate, large, multi-modal, multi-engine data. Warehouses (or lakehouses, which are lakes with a warehouse-grade metadata layer) are better for low-latency BI serving and for teams that want SQL transactions with no operational burden. Mature production architectures use both, deliberately, with the medallion's gold layer being the bridge.

Going deeper

The S3 consistency story, finally fixed (2020). Until December 2020, S3 was famously eventually consistent for overwrite PUTs and DELETEs — write a new version of key, then immediately read it, and there was no guarantee you saw the new version. This was a perpetual pain in lake operations: a Spark job that wrote a manifest file and immediately read it back could see the old contents. The original Hadoop S3A committers (S3A magic committer, Streamora's S3mper, Apache Hadoop's directory committer) existed entirely to paper over this. AWS shipped strong read-after-write consistency across all S3 operations in December 2020, retroactively for every bucket, with no API change. This single migration eliminated an entire class of lake bugs and is the reason articles about S3 lakes written before 2021 read like a horror story (and why pre-2021 advice about staging directories and rename committers can mostly be ignored on modern S3).

Why the Hadoop lake of 2010 failed where the S3 lake of 2018 succeeded. The data lake idea is not new. Hadoop and HDFS were marketed as a "data lake" platform from roughly 2009. By 2015 most of those Hadoop lakes were considered failed: Cloudera and Hortonworks had to merge, the term "data swamp" was coined, and the pendulum swung back toward the warehouse. The post-2018 S3 lake succeeded for three reasons the Hadoop lake did not have. First, durability without operational burden. HDFS three-way replication required a team to keep DataNodes healthy; S3 gives you eleven nines for ₹1.6 per GB-year with zero ops. Second, separation of storage and compute. On HDFS, scaling compute meant adding DataNodes (which dragged storage with them); on S3 you scale Spark and storage independently and pay only for what you use. Third, columnar formats matured. Parquet and ORC reached production stability around 2015; before that, lakes were full of slow, bloated text or sequence files. The lesson: a data lake is a viable architecture only when storage is durable-by-default, compute is decoupled, and the on-disk format is columnar. All three preconditions arrived simultaneously around 2017–2018, which is why the lake pattern suddenly worked then and not five years earlier.

Listing performance and the partition-design pathology. Listing a million-file table is not the worst case — the worst case is over-partitioning. A team partitions transactions by (country, dt, hour, switch_id) because more partitions sound like more pruning, but ends up with 40 million directories, each containing 5–20 small files. Engines now spend more time on LIST than on read. The rule of thumb that hardened in production: target ~100–500 MB Parquet files, partition only on columns that appear in the WHERE clause of your hottest queries, and never partition on a high-cardinality column (user_id, txn_id). When in doubt, partition by date only and let Parquet's intra-file pushdown do the rest. Iceberg's hidden partitioning (chapter 133) was invented specifically to let the table design forget about partition columns and let the format decide based on per-file stats.

The PII layer and the right-to-be-forgotten. A lake stores raw customer data in bronze and immutable retention is the whole point. India's DPDP Act (2023) and the EU's GDPR (2018) both require deletion of a customer's personal data on request, and immutable bronze conflicts with that requirement. The pattern that emerged: hash all PII at ingest with a per-tenant salt held in AWS KMS or HashiCorp Vault, store the unhashed mapping in a separate "PII vault" RDBMS or DynamoDB table that is not in the lake, and on a deletion request, delete the mapping. The hashed value remains in bronze forever (immutability preserved), but is no longer reversible to the deleted customer (legal compliance). Engineering teams call this tokenisation; it is the only known way to reconcile bronze immutability with personal-data-deletion law, and every Indian fintech building on a lake has had to implement some version of it.

Open file formats matter politically, not just technically. A subtle reason the lake displaced the warehouse for raw data is that Parquet, Avro, JSON, and ORC are open formats — readable by any engine, any vendor, any open-source tool, with no licence fee and no vendor lock-in. If Snowflake doubles its prices tomorrow, a warehouse-only company has to migrate petabytes of data through ETL pipelines to escape. A lake-first company can switch query engines (Trino → Athena → Databricks → DuckDB) by changing IAM credentials. This vendor optionality is now an explicit architectural goal at most CTO offices: the data is the asset, the engine is a commodity. Iceberg's REST catalogue and Apache Polaris (Snowflake's open Iceberg catalogue, 2024) are the political payoff of that thesis — even Snowflake now treats Iceberg-on-S3 as a first-class storage layer, because their customers demanded the right to leave.

The lakehouse as the synthesis. The chapter has carefully separated "lake" (bucket of files) from "warehouse" (centralised analytical database). The two are converging. The lakehouse term, coined in the Armbrust 2021 paper cited below, names the synthesis: a lake with a transactional metadata layer (Iceberg, Delta, Hudi) on top, providing ACID, schema enforcement, time travel, and statistics-driven query optimisation — without giving up the lake's openness, multi-engine reads, or per-GB pricing. By 2026, virtually every greenfield analytical platform at large Indian and global tech companies is a lakehouse, not a pure lake or a pure warehouse. The next chapter, on the table formats, is the mechanism by which a lake becomes a lakehouse.

Where this leads next

References

  1. Databricks. What is a medallion architecture? — the canonical bronze/silver/gold definition.
  2. Apache Software Foundation. Apache Iceberg overview — the table-format spec that fixes the four classic lake problems.
  3. Armbrust, Das, Paranjpye, Xin et al. Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (VLDB 2020) — the Delta whitepaper.
  4. AWS. Lake Formation developer guide — the AWS-native lake-management service.
  5. Snowflake. External tables documentation — querying lake data from a warehouse engine.
  6. Armbrust, Ghodsi, Xin, Zaharia. Lakehouse: A New Generation of Open Platforms (CIDR 2021) — the lakehouse paradigm paper.