In short
A table format is a thin metadata layer that sits on top of a folder of Parquet files in object storage and turns that folder into a real table — one with ACID transactions, snapshot isolation, schema evolution, time travel, and efficient updates and deletes. Without it, a folder of Parquet on S3 is read-only-for-analytics at best: there is no atomic way to replace files, no way to know which files are "current" if a writer crashed mid-job, and no way to evolve the schema without rewriting history.
Apache Iceberg (Netflix, 2017) tracks every commit as an immutable snapshot, with a hierarchy of snapshot → manifest list → manifests → data files. Readers fetch one snapshot pointer from a catalog, follow its manifest list, and know exactly which Parquet files are alive in that table version. Iceberg is vendor-neutral; Spark, Trino, Snowflake, BigQuery, and DuckDB all read it.
Delta Lake (Databricks, 2019) stores its metadata as an append-only transaction log in _delta_log/ — a directory of JSON commits plus periodic Parquet checkpoints. Readers replay the log to compute the current set of files. Delta is dominant in Databricks shops and increasingly read elsewhere.
Apache Hudi (Uber, 2017) is built around streaming upserts: ingest CDC or Kafka events directly into a queryable table, with two table flavours — COPY_ON_WRITE (rewrite files on update, fast reads) and MERGE_ON_READ (append delta logs, async compaction, fast writes). PhonePe and Flipkart run heavy Hudi workloads; Razorpay's lake is on Iceberg.
The market is converging: Databricks acquired Tabular (the company founded by Iceberg's creators) in 2024, signalling that Delta and Iceberg will eventually share a single physical format under the hood. Hudi will likely remain the streaming-first specialist.
In the previous chapter you saw how a data lake is, physically, just a bucket of Parquet files on S3 plus a Hive Metastore or Glue catalog telling query engines where the partitions live. That model — call it the "Hive table" — was state of the art around 2015. It works beautifully for read-only analytics on append-only data: drop a Parquet file in s3://lake/orders/dt=2024-04-25/, run MSCK REPAIR TABLE, and Athena starts answering queries.
It also breaks the moment you need to do anything more interesting than append. Two writers committing to the same partition can clobber each other. A failed Spark job leaves orphaned Parquet files that look just like committed ones. A DELETE FROM orders WHERE order_id = 42 requires rewriting an entire partition. Adding a column means walking every existing file and either rewriting it or hoping the readers handle the schema mismatch correctly. Time travel — "what did the table look like yesterday at 3pm?" — is impossible unless you snapshot the whole bucket.
Iceberg, Delta, and Hudi exist to fix all of this without giving up the cheap, vendor-neutral, engine-agnostic appeal of "Parquet on S3."
What is missing from raw Parquet on S3
Before we look at how the three formats work, it's worth being precise about what a Hive-table layout cannot do.
- Atomic writes. A Spark job writing 200 Parquet files into a partition writes them one at a time. If the job crashes after writing 137 files, those 137 files are visible to readers — the table is in a half-committed state. There is no BEGIN/COMMIT.
- Snapshot isolation for readers. If a writer is mid-flight, readers see whatever subset of files happen to be on S3 at the moment they list the bucket. Two readers issued one second apart can see different data.
- Schema evolution. Adding a column to a Parquet table means either rewriting every file or hoping the reader can handle files where the column is missing. Renaming a column is even worse — Parquet stores column names, and old readers will choke on the rename.
- Efficient updates and deletes. Parquet files are immutable. Updating one row in a 1 GB file means rewriting the whole 1 GB. There is no row-level delete.
- Hidden partitioning. In Hive, if you partition by date and a query says WHERE order_ts BETWEEN '2024-04-25 09:00' AND '...', you have to either also pass dt = '2024-04-25' or scan everything — the engine cannot derive the partition key from the timestamp.
- Fast file pruning. With thousands of partitions and millions of files, just listing the relevant files via LIST s3://... becomes the bottleneck. Athena queries can spend their first 30 seconds doing S3 listings before any data is read.
- Time travel. "Read the table as of last Tuesday" requires either holding a snapshot of the bucket or running a separate audit pipeline.
All three table formats solve all seven of these problems with the same architectural move: keep the data files as Parquet, but write an explicit metadata layer alongside them that lists exactly which files belong to which version of the table.
The trick is that the metadata layer is itself just files in object storage — Iceberg writes Avro and JSON, Delta writes JSON and Parquet, Hudi writes Avro. There is no separate metadata server you have to run, no Postgres-as-catalog dependency. The metadata files live next to the data files in the same bucket. Why metadata-as-files matters: it preserves the lakehouse promise of "your data is yours, in your bucket, in an open format." Any engine that can read the metadata files can query the table — there is no vendor-locked metadata service in the way.
Apache Iceberg: snapshots and manifest hierarchies
Iceberg was born at Netflix in 2017 because their Hive-table petabyte-scale data lake had become unmanageable. The team — led by Ryan Blue and Daniel Weeks — designed Iceberg around one central idea: every commit produces a new immutable snapshot of the table, and a reader who wants to query the table at any point in time just has to find the snapshot pointer and follow the metadata down to the data files.
Iceberg's metadata hierarchy has four levels.
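Concretely, the chain from catalog to data looks like this (file names are illustrative, matching the paths used below):

catalog entry for "orders"  ->  s3://lake/orders/metadata/v123.metadata.json
v123.metadata.json          ->  schema, partition spec, list of snapshots (current: s7)
snapshot s7                 ->  one manifest list (Avro) for that snapshot
manifest list               ->  manifests (Avro), each covering a slice of the table
manifest                    ->  data files (Parquet) plus per-file partition values and stats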
The flow when you query an Iceberg table:
- The engine asks the catalog (Glue, REST catalog, Nessie, Hive Metastore) for the current snapshot pointer for orders. That returns a metadata file location like s3://lake/orders/metadata/v123.metadata.json.
- That metadata file contains the current snapshot id s7 and a pointer to the manifest list for s7, which is an Avro file listing every manifest that is alive in this snapshot.
- The manifest list also stores per-manifest partition range summaries — so for WHERE dt = '2024-04-25', the engine immediately discards manifests whose partition range does not overlap.
- For surviving manifests, the engine reads them — each manifest is an Avro file listing data files with per-file partition values, row counts, and column min/max statistics.
- The engine prunes data files by their stats and finally issues Range: requests to S3 for the surviving Parquet files.
Why this hierarchy is fast: pruning happens at three levels (manifest list, manifest, data file) before any data is read. A 100,000-file table with 1,000 manifests and 1 manifest list can be pruned to ~50 candidate files in three small Avro reads — versus listing the whole bucket in the Hive model.
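A minimal sketch of the overlap test behind that pruning — purely illustrative, not Iceberg's actual code:

# Each manifest-list entry carries a partition value range; manifests whose
# range cannot contain dt = '2024-04-25' are discarded without being opened.
manifest_entries = [
    {"manifest": "manifest-a.avro", "dt_min": "2024-04-20", "dt_max": "2024-04-22"},
    {"manifest": "manifest-b.avro", "dt_min": "2024-04-23", "dt_max": "2024-04-26"},
]

def overlaps(entry, lo, hi):
    return entry["dt_max"] >= lo and entry["dt_min"] <= hi

survivors = [e for e in manifest_entries if overlaps(e, "2024-04-25", "2024-04-25")]
print(survivors)   # only manifest-b needs to be read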
Iceberg's other defining features:
- Snapshot isolation. Every write produces a new snapshot; the catalog's atomic swap of the snapshot pointer is the commit. Readers always see one consistent snapshot.
- Hidden partitioning. You declare PARTITIONED BY days(order_ts). Queries that filter on order_ts automatically prune by the day-bucketed partition without you ever writing a dt column — queries do not need to know the partition spec (see the sketch after this list).
- Schema evolution by id. Iceberg assigns every column an integer id; the schema records {id: 1, name: "user_id"}. Renaming a column changes the name but keeps the id; old data files keep working. Adding, dropping, reordering, and even type-promoting columns are all metadata-only operations.
- Row-level deletes (v2 spec). Iceberg can write positional delete files ({file: data_001.parquet, position: 1234}) or equality delete files (WHERE user_id = 42). Readers apply deletes when reading, until a compaction merges them away.
- Vendor neutrality. Apache top-level project, governed broadly. Spark, Trino, Flink, Snowflake, BigQuery, DuckDB, ClickHouse, Dremio, and StarRocks all read it natively.
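A short Spark SQL sketch of what hidden partitioning and id-based evolution look like in practice (catalog and table names are assumed, not prescribed):

# Assumes a SparkSession with an Iceberg catalog registered as "lake".
spark.sql("""
    CREATE TABLE lake.ecommerce.orders (
        order_id BIGINT, user_id BIGINT, amount_inr BIGINT, order_ts TIMESTAMP)
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Filters on order_ts prune to the matching day partitions automatically;
# no dt column and no partition predicate are needed.
spark.sql("""
    SELECT count(*) FROM lake.ecommerce.orders
    WHERE order_ts BETWEEN '2024-04-25 09:00' AND '2024-04-25 18:00'
""").show()

# Rename is metadata-only: the column keeps its id, old data files keep working.
spark.sql("ALTER TABLE lake.ecommerce.orders RENAME COLUMN user_id TO customer_id")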
Delta Lake: an append-only transaction log
Delta was born at Databricks in 2019 and open-sourced shortly after. Where Iceberg's mental model is "snapshots and manifests," Delta's is "a transaction log": the table's history is a sequence of JSON commit files in _delta_log/, each describing what files were added or removed at that version.
s3://lake/orders/
_delta_log/
00000000000000000000.json <- version 0: first commit
00000000000000000001.json <- version 1
...
00000000000000000010.checkpoint.parquet <- compacted state
00000000000000000011.json
_last_checkpoint <- pointer to most recent checkpoint
part-00000-uuid.snappy.parquet
part-00001-uuid.snappy.parquet
...
Each .json commit is a list of add and remove actions:
{"add": {"path": "part-00042-...parquet", "partitionValues": {"dt": "2024-04-25"}, "size": 2400000, "stats": "{\"numRecords\": 10000, \"minValues\": {...}, \"maxValues\": {...}}"}}
{"remove": {"path": "part-00007-...parquet", "deletionTimestamp": 1714000000}}
A reader at version N opens the most recent checkpoint Parquet (which holds the consolidated state up to some earlier version), then replays every JSON commit from there to N to compute the current set of files. Why a log instead of snapshots: appending one small JSON file is the cheapest possible commit — just a PUT of a few KB to S3. Periodic checkpoints (every 10 commits by default) prevent the log replay from growing without bound.
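A toy version of that replay in plain Python — not the Delta implementation, just the idea (checkpoint handling and the other action types are omitted; the path is hypothetical):

import json, pathlib

def active_files(delta_log_dir):
    # Replay JSON commits in version order: "add" puts a file in the live set,
    # "remove" takes it out. Real readers start from the latest checkpoint
    # rather than version 0, and also track metadata and protocol actions.
    live = set()
    for commit in sorted(pathlib.Path(delta_log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(active_files("/data/orders/_delta_log"))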
Delta's commit protocol uses optimistic concurrency control: a writer reads the current version, prepares its commit as version + 1, and atomically PUTs the new JSON file with If-None-Match: * semantics. If another writer beat it to that version number, the writer detects the conflict, re-reads the new state, and retries. Why OCC works on S3: S3 now supports conditional puts, so the atomic-swap primitive needed for serialisable commits is one HTTP call. Before that primitive existed (pre-2024), Delta needed an external coordination service (a DynamoDB-backed LogStore) for multi-writer scenarios on S3.
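A sketch of the commit primitive itself using boto3's conditional put (bucket and key layout assumed; real writers wrap this in conflict detection and retry logic):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def try_commit(bucket, table_prefix, version, commit_json):
    # Attempt to claim version N+1. IfNoneMatch="*" makes the PUT fail if
    # another writer has already created an object at this key.
    key = f"{table_prefix}/_delta_log/{version:020d}.json"
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=commit_json, IfNoneMatch="*")
        return True            # we won the race: this commit is version N+1
    except ClientError as err:
        if err.response["Error"]["Code"] in ("PreconditionFailed", "ConditionalRequestConflict"):
            return False       # someone else committed first: re-read and retry
        raise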
Delta's strengths:
- Best-in-class on Databricks. The Photon engine, Delta caching, Z-ordering, OPTIMIZE compaction, AUTO OPTIMIZE — all of these get the best performance on Delta tables on Databricks.
- Schema enforcement and evolution. Schema mismatches are rejected on write by default; opt in to mergeSchema=true for evolution.
- Change Data Feed (CDF). Read the row-level changes between two versions of the table — useful for incremental ETL.
- Liquid Clustering (newer). Replaces partitioning with a clustering scheme that adapts as data grows.
- Growing reader support. Spark/Databricks first-class; Trino, Athena, Snowflake, BigQuery all read Delta now.
Delta's historical weakness was vendor concentration — it was best on Databricks and second-best everywhere else. The Linux Foundation governance and the Delta Universal Format (UniForm) work narrow that gap.
Apache Hudi: a streaming-first table format
Hudi was built at Uber in 2016–2017 to solve a specific problem: ingesting CDC streams from operational databases (Postgres, MySQL) into a queryable lake within minutes, not hours. Uber had thousands of upstream tables changing constantly and needed MERGE-style updates at scale. Iceberg and Delta emerged later from analytics-first contexts; Hudi was streaming-first from day one.
Hudi has two table types — and the choice is the central design decision when you adopt Hudi.
COPY_ON_WRITE (CoW): when a record is updated, the entire Parquet file containing that record is rewritten with the change applied. Reads are pure Parquet reads — fast. Writes are expensive when many files are touched. Equivalent in spirit to the default copy-on-write behaviour of both Iceberg and Delta.
MERGE_ON_READ (MoR): updates are appended to per-file delta log files (Avro). Readers merge the base Parquet with the deltas at read time. A background compaction periodically rewrites base + deltas into a fresh Parquet. Writes are cheap; reads are slower until compaction catches up. Why MoR matters: with a high-throughput stream of small upserts (e.g. 10,000 record updates per second from a Kafka topic of CDC events), CoW is unaffordable — every batch would rewrite gigabytes of Parquet to apply a few thousand changed rows. MoR amortises the rewrite over async compactions.
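A toy illustration of the read-time merge in plain Python (not Hudi's implementation): base-file rows are overridden by delta-log rows for the same key, and the newest version wins:

# Base Parquet rows and delta-log upserts, keyed by order_id.
base = {
    1: {"order_id": 1, "status": "CREATED", "ts": 100},
    2: {"order_id": 2, "status": "CREATED", "ts": 101},
}
delta_log = [
    {"order_id": 1, "status": "SHIPPED", "ts": 205},   # update to an existing key
    {"order_id": 3, "status": "CREATED", "ts": 210},   # brand-new key
]

merged = dict(base)
for row in delta_log:
    current = merged.get(row["order_id"])
    if current is None or row["ts"] >= current["ts"]:  # newest version wins
        merged[row["order_id"]] = row

print(sorted(merged.values(), key=lambda r: r["order_id"]))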
Hudi also has first-class concepts that Iceberg and Delta added later or differently:
- Record keys and primary keys. Every Hudi record has a logical primary key; updates and deletes are addressed by key.
- Pre-combine field. When two updates for the same key arrive in the same batch, the record with the higher pre-combine value (typically a timestamp or version column) wins. This is exactly the upsert semantics CDC pipelines need.
- Timeline. Hudi's analogue of the Iceberg snapshot list — every commit, compaction, cleanup, and rollback gets a timestamped entry on the timeline.
- Indexes. Bloom filters, hash indexes, and HFile indexes let writers efficiently locate which file holds a given record key, so an upsert does not require a full table scan.
In Indian industry, Hudi is heavily used at PhonePe (UPI transaction lake), Flipkart (order-event ingestion), and Walmart Labs / Sam's Club India for streaming ingestion. Razorpay's data platform is Iceberg-centred. Most Databricks customers in India — Swiggy, MakeMyTrip, Mahindra Group — are Delta shops.
A concrete walkthrough: pyiceberg
Let's actually create an Iceberg table, append data to it, query it, and time-travel to a past snapshot, all in Python. This uses pyiceberg, the official Python implementation.
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType, TimestampType
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import DayTransform
import datetime, time
# A SQLite-backed catalog pointing at a local "warehouse" directory.
# In production this would be Glue, REST catalog, Nessie, etc.
catalog = SqlCatalog(
"local",
**{
"uri": "sqlite:////tmp/iceberg_catalog.db",
"warehouse": "file:///tmp/iceberg_warehouse",
},
)
catalog.create_namespace_if_not_exists("ecommerce")
# Define the schema and partitioning.
schema = Schema(
NestedField(1, "order_id", IntegerType(), required=True),
NestedField(2, "user_id", IntegerType(), required=True),
NestedField(3, "amount_inr", IntegerType(), required=True),
NestedField(4, "order_ts", TimestampType(), required=True),
)
partition_spec = PartitionSpec(
PartitionField(source_id=4, field_id=1000, transform=DayTransform(), name="order_day")
)
table = catalog.create_table_if_not_exists(
"ecommerce.orders",
schema=schema,
partition_spec=partition_spec,
)
# First batch: 3 orders on 2024-04-24.
batch1 = pa.Table.from_pylist([
{"order_id": 1, "user_id": 100, "amount_inr": 1499,
"order_ts": datetime.datetime(2024, 4, 24, 10, 0)},
{"order_id": 2, "user_id": 101, "amount_inr": 2999,
"order_ts": datetime.datetime(2024, 4, 24, 11, 30)},
{"order_id": 3, "user_id": 102, "amount_inr": 799,
"order_ts": datetime.datetime(2024, 4, 24, 14, 15)},
], schema=table.schema().as_arrow())
table.append(batch1)
snapshot_v1 = table.current_snapshot().snapshot_id
print(f"After batch 1, snapshot = {snapshot_v1}")
time.sleep(1) # ensure distinct timestamps
# Second batch: 2 orders on 2024-04-25.
batch2 = pa.Table.from_pylist([
{"order_id": 4, "user_id": 100, "amount_inr": 4499,
"order_ts": datetime.datetime(2024, 4, 25, 9, 45)},
{"order_id": 5, "user_id": 103, "amount_inr": 349,
"order_ts": datetime.datetime(2024, 4, 25, 16, 20)},
], schema=table.schema().as_arrow())
table.append(batch2)
snapshot_v2 = table.current_snapshot().snapshot_id
print(f"After batch 2, snapshot = {snapshot_v2}")
# Read the table now (snapshot v2): all 5 rows.
print("Current state:")
print(table.scan().to_arrow().to_pandas())
# Time-travel to snapshot v1: only the first 3 rows.
print("As of snapshot v1:")
print(table.scan(snapshot_id=snapshot_v1).to_arrow().to_pandas())
Run that and you will see five rows in the current state and three rows in the v1 read. The scan(snapshot_id=...) call is the time-travel primitive — Iceberg's catalog still holds snapshot v1's metadata pointer, so the reader can resolve the manifest list for that older version and read exactly the files that were alive at that commit. Nothing about the Parquet files themselves had to change. Why time travel is essentially free: Iceberg never deletes old snapshots until you explicitly run an expire_snapshots maintenance call. The data files for v1 are still on S3 (since v2 only added files, never removed any), and the v1 metadata still points at them.
Look at /tmp/iceberg_warehouse/ecommerce.db/orders/ after running this and you will find the data Parquet files plus a metadata/ directory with the snapshot manifests, manifest lists, and *.metadata.json files that drive the whole machine.
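You can also inspect that metadata without leaving Python — recent pyiceberg releases expose it through table.inspect (the exact API surface varies a little by version):

# Each call returns a pyarrow Table describing one level of the hierarchy.
print(table.inspect.snapshots().to_pandas())   # every snapshot: id, parent, timestamp
print(table.inspect.manifests().to_pandas())   # manifests reachable from the current snapshot
print(table.inspect.files().to_pandas())       # live data files with per-file stats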
A real Indian e-commerce data lake architecture
Imagine you are the data platform engineer at a mid-size Indian e-commerce company — say, an electronics marketplace doing 200,000 orders a day, with the order service running on Postgres and a Kafka stream of every order event flowing into your data lake on S3 in Mumbai (ap-south-1).
Your bronze layer needs to ingest the Kafka stream with millisecond latency. Your silver layer needs to dedupe, enrich, and conform. Your gold layer powers Tableau dashboards and reverse ETL into Salesforce. Three layers, three workloads, and the table format choice matters at each level.
Bronze (raw, streaming): You pick Apache Hudi MERGE_ON_READ. Order events arrive on Kafka with order_id as the record key and event_ts as the pre-combine field. A streaming writer (Hudi supports both Flink and Spark Structured Streaming; the sketch below uses Spark) continuously appends to per-file delta logs; updates to the same order_id (status changes, cancellations) overwrite earlier versions on read. Compaction runs every 30 minutes, keeping read latency reasonable. End-to-end ingestion lag from Kafka to a queryable Hudi table is under 60 seconds — good enough for an internal "live order monitor" dashboard.
# Spark Structured Streaming writing to Hudi
hudi_options = {
"hoodie.table.name": "orders_bronze",
"hoodie.datasource.write.recordkey.field": "order_id",
"hoodie.datasource.write.precombine.field": "event_ts",
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.partitionpath.field": "order_date",
"hoodie.compact.inline": "false",
"hoodie.compact.schedule.inline": "true",
}
(kafka_df
.writeStream
.format("hudi")
.options(**hudi_options)
.outputMode("append")
.option("checkpointLocation", "s3://lake/checkpoints/orders_bronze/")
.start("s3://lake/bronze/orders/"))
Silver (cleaned, batch): Every hour, a Spark job reads the latest Hudi snapshot, deduplicates, joins against your customer dimension, and writes to an Apache Iceberg silver table partitioned by days(order_ts). Iceberg's hidden partitioning means downstream queries do not need to know about the partition column — WHERE order_ts BETWEEN '2024-04-25 00:00' AND '2024-04-26 00:00' prunes correctly automatically.
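A sketch of that hourly job in PySpark (catalog and table names — glue, dims.customers, silver.orders — are assumed, not prescribed):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Read the latest snapshot of the Hudi bronze table.
bronze = spark.read.format("hudi").load("s3://lake/bronze/orders/")

# Deduplicate: keep the newest event per order_id.
latest = (bronze
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("order_id").orderBy(F.col("event_ts").desc())))
    .filter("rn = 1")
    .drop("rn"))

# Enrich with the customer dimension, then append to the Iceberg silver table.
customers = spark.read.table("glue.dims.customers")
(latest.join(customers, "user_id", "left")
    .writeTo("glue.silver.orders")
    .append())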
Gold (BI, Tableau, dbt): Daily aggregates land in a second Iceberg table, orders_daily_gold. This is the table Tableau and your data analysts hit. When marketing asks "what were yesterday's GMV numbers?" three weeks after the fact, Iceberg's snapshot history lets you reproduce the exact number that was reported on the day, even if the silver layer has since been backfilled. Time travel is built in: SELECT * FROM orders_daily_gold FOR TIMESTAMP AS OF '2024-04-25 09:00:00 IST'.
In 2024, the product team adds a new gst_state_code column to support state-wise GST reporting. With Iceberg, the change is ALTER TABLE orders_daily_gold ADD COLUMN gst_state_code STRING — a single metadata-only commit. Old Parquet files keep working; readers see NULL for rows written before the schema change. No backfill needed unless you want one. Why this is hard with raw Parquet: in the Hive-table model, adding a column either requires rewriting every existing Parquet file or crossing your fingers and hoping every reader correctly handles missing columns. Iceberg's column-id-based schema makes this a metadata change.
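The equivalent change through pyiceberg, continuing the walkthrough table from earlier, is a couple of lines (a sketch; the new column is added as optional, so rows written before the change read back as NULL):

from pyiceberg.types import StringType

# Metadata-only commit: no existing Parquet file is rewritten.
with table.update_schema() as update:
    update.add_column("gst_state_code", StringType())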
By 2026, Trino-on-Iceberg has become your single source of truth for analytics, while Hudi remains the ingestion-side workhorse. Querying the gold tables from Snowflake (which natively reads Iceberg via the Polaris catalog) means your BI seat costs go down — you no longer need to copy data into Snowflake just to query it.
Comparing the three formats
A simple decision rubric:
- If your stack is Databricks-centric and you want the best per-rupee analytics performance there → Delta Lake.
- If you need multi-engine reads (Spark for ETL, Trino for ad-hoc, Snowflake for BI, DuckDB on a laptop) and want the most vendor-neutral choice → Iceberg.
- If your primary workload is streaming ingestion with frequent upserts (CDC from operational DBs, Kafka to lake) → Hudi, especially MERGE_ON_READ.
For most new Indian SaaS and consumer-internet companies starting a data lake in 2026, the default has shifted to Iceberg, because Snowflake, BigQuery, and Databricks now all read it. The question used to be "which lock-in do you accept?"; with Iceberg it is closer to "no lock-in."
The market dynamics and the convergence
The format wars from 2020 to 2023 were fierce. Iceberg, Delta, and Hudi had overlapping feature sets and incompatible on-disk metadata; choosing one meant a multi-year commitment. Three things shifted the landscape decisively in 2024.
First, Snowflake committed to Iceberg as a first-class table format — not just a read path, but a write path through Snowflake-managed Iceberg tables stored in customer S3 buckets. This signalled to the market that Iceberg was the "neutral" format vendors could agree on.
Second, Databricks acquired Tabular — the company founded by Iceberg's creators Ryan Blue and Daniel Weeks — for around USD 1 billion, ahead of an expected Snowflake bid. Databricks immediately announced commitments to merge Delta and Iceberg metadata under a single underlying format. The Delta UniForm work, which writes Iceberg-compatible metadata alongside the Delta log, accelerated.
Third, the catalog layer became the new battleground. Snowflake released Polaris (an open Iceberg REST catalog), Databricks released Unity Catalog with Iceberg interop, and the Apache Nessie project gained traction for git-style branching of table metadata. The data files and the table format are increasingly commodity; the catalog is where the next round of competition will play out.
Hudi has stayed slightly outside this convergence — its model is too streaming-first to fold neatly into the Iceberg or Delta metadata shape. The likely outcome is a future where Iceberg and Delta share an underlying physical format, while Hudi remains the specialist for high-throughput streaming ingestion (with bridges that expose Hudi tables as Iceberg-readable for downstream BI).
Going deeper
The metadata layer is where modern lakehouses earn their keep, and the design choices have deep implications for performance, cost, and operability. The next sections look at three details that matter in production.
Why metadata size matters
A petabyte-scale Iceberg table can have 10 million Parquet files. The naive metadata for that — one entry per file in a single manifest — would be enormous. Iceberg solves this by sharding into many manifests, and by writing a manifest list at each snapshot that summarises per-manifest partition ranges.
A well-tuned 1 PB Iceberg table at a large company might have:
- 10 million data files (~100 MB each)
- 5,000 manifests (~2,000 data files per manifest)
- 1 manifest list per snapshot (~5,000 entries)
- 100,000 snapshots in total history (with expire_snapshots run regularly, perhaps only the last 1,000 are kept)
A query reads one snapshot's manifest list (~500 KB), prunes to perhaps 10 manifests, reads those (~10 MB total), and prunes to perhaps 200 data files. Why pruning matters at this scale: without it, listing the 10 million data files via S3 LIST would take several minutes. With Iceberg, total metadata I/O is under 20 MB and completes in seconds.
Delta's transaction log handles the same scale via periodic checkpoints — every 10 commits, Delta writes a single Parquet checkpoint that consolidates the active file set. Readers start from the latest checkpoint plus subsequent JSON commits, never from version 0.
Compaction and the small-file problem
Streaming writes (and frequent batch writes) produce many small files. Twenty-thousand 5 MB Parquet files are dramatically slower to query than 100 1 GB files, even with the same total bytes — because every file has its own footer to read, and the per-file fixed cost dominates.
All three formats provide compaction:
- Iceberg: rewrite_data_files action, runnable via Spark, Trino, or PyIceberg. Combines small files in a partition into target-sized files.
- Delta: OPTIMIZE table (in Spark/Databricks). On Databricks, AUTO OPTIMIZE runs compaction inline with writes.
- Hudi: built-in async compaction service for MERGE_ON_READ tables, plus clustering for COPY_ON_WRITE.
A common operational mistake is to never run compaction. After three months of streaming ingestion, a Hudi or Iceberg table with a million tiny files becomes 5–10× slower to query than it should be. Schedule compaction in your data platform from day one.
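For the Iceberg tables in the architecture above, a scheduled maintenance job can be as small as two Spark SQL procedure calls (catalog name glue assumed):

# Nightly Iceberg maintenance: compact small files, then expire old snapshots.
spark.sql("""
    CALL glue.system.rewrite_data_files(
        table => 'silver.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
spark.sql("""
    CALL glue.system.expire_snapshots(
        table => 'silver.orders',
        older_than => TIMESTAMP '2024-04-01 00:00:00'
    )
""")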
The Photon engine and the lakehouse vision
Databricks' Photon paper makes the case that a lakehouse — open formats, separated storage and compute, lake-style economics — can match a data warehouse on performance with the right vectorised execution engine. Photon is Databricks' C++ rewrite of Spark's execution layer, optimised heavily for Delta/Parquet on S3.
The implication: with table formats supplying ACID, schema, and time travel, and engines like Photon, Trino, and DuckDB supplying fast execution, the historical reasons to copy data from a lake into a data warehouse (correctness, governance, performance) have largely disappeared. Snowflake's pivot to read Iceberg natively is an acknowledgement of this shift.
The lakehouse is not just a marketing term — it is the architecture that table formats made possible.
References
- Apache Iceberg — official documentation, spec v2 and v3, table behaviour and snapshot model.
- Armbrust et al., Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores (VLDB 2020) — the original Delta whitepaper from Databricks.
- Apache Hudi documentation and Uber engineering blog — table types, timeline service, indexing strategies.
- Behm et al., Photon: A Fast Query Engine for Lakehouse Systems (SIGMOD 2022) — Databricks' vectorised C++ engine and the lakehouse performance argument.
- Onehouse blog, Apache Hudi vs Delta Lake vs Apache Iceberg — feature and architecture comparison.