Partition evolution and the rename problem
Aditi's payments lakehouse from the previous chapter has been running for two years. The original table was partitioned by dt alone — a sane choice when the warehouse held 5 GB a day. Today it holds 200 GB a day, the daily partitions are 12× the size the team can compact in one window, and engineering has decided to add a bucket column so that dt × bucket=hash(merchant_id) % 64 becomes the new layout. Aditi opens a ticket. Six weeks later it is still open. Not because the new layout is wrong, but because all ~2 PB already on S3 are married to the old layout, and the table format she picked decides whether "married" means "we can change our mind cheaply", "we can change our mind for new data only", or "we are stuck until we re-write the entire history".
Partitioning is a write-time decision baked into every file path. Hive freezes the partition spec at table creation — changing it means a full re-write. Iceberg's partition evolution lets the spec change going forward while old data keeps its old layout, with the planner reading both layouts at query time. The "rename problem" is the special case where a partition column name changes; no format handles it cleanly without a rewrite.
Why partitioning is so hard to change
The reason this is a hard problem is that the partition value is encoded into the file path, not into the file content. When you wrote s3://lake/payments/dt=2026-04-24/part-001.parquet, the engine put the row's dt value into the directory name. The Parquet file itself does not contain a dt column inside it — that would be wasted bytes, since every row in the file shares the same dt. The path is the column.
This is brilliant for pruning. The planner reads dt=2026-04-24 from the path string and knows, without opening the file, what the value is. But it is a disaster for evolution. If you decide tomorrow that the partition column should be event_date instead of dt, the path on every existing file is wrong. You cannot rename the column without re-writing the path, and you cannot re-write the path without re-writing the file (S3 does not have a true rename; copy-and-delete is the only move).
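A few lines of Python make "the path is the column" concrete. The extract_partition and survives helpers below are illustrative, not any engine's actual API:

```python
# The partition value lives in the path, not in the Parquet file.
# A planner prunes by parsing the directory name -- no file I/O needed.
def extract_partition(path: str) -> dict:
    """Pull k=v partition segments out of an object-store key."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            k, v = segment.split("=", 1)
            parts[k] = v
    return parts

path = "s3://lake/payments/dt=2026-04-24/part-001.parquet"
print(extract_partition(path))               # {'dt': '2026-04-24'}

# Pruning is a string comparison against the path; the file is never opened.
def survives(path: str, predicate: dict) -> bool:
    parts = extract_partition(path)
    return all(parts.get(k) == v for k, v in predicate.items())

print(survives(path, {"dt": "2026-04-24"}))  # True
print(survives(path, {"dt": "2026-04-25"}))  # False
```

The same property that makes pruning free makes renaming expensive: the only way to change what extract_partition returns is to change the path itself.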
Why the writer skips the partition column inside the file: storing dt='2026-04-24' 50 million times in a 50 million-row Parquet file would waste ~14 bytes per row, which Parquet's dictionary encoding then compresses back to almost nothing — but Hive and Iceberg made the choice that "almost nothing" is still more than zero, and that the path representation is canonical. Some engines will even reject a file that contains the partition column as a redundant field.
This is also why "partition evolution" is not the same as "schema evolution". Adding a new regular column to a table is cheap: write the new column going forward, and old rows read back as NULL. Adding a new partition column is hard: the engine has to know how to assign old data to the new partition layout — and the old data does not contain the value to assign.
What Hive lets you change (almost nothing)
In a Hive-managed table, the partition spec is fixed at CREATE TABLE time. Once the table exists with PARTITIONED BY (dt STRING), the only mutations Hive supports are:
- ALTER TABLE ... ADD PARTITION (dt='2026-04-25') — register a new partition value (a folder you wrote elsewhere) into the metastore. Same partition column.
- ALTER TABLE ... DROP PARTITION (dt='2026-04-21') — un-register a partition. The folder may or may not be deleted depending on EXTERNAL vs managed.
- MSCK REPAIR TABLE — re-scan the table root and pick up any partitions someone wrote outside the metastore's knowledge.
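What MSCK REPAIR TABLE does can be sketched in a few lines of Python. The metastore here is just a set standing in for the real PARTITIONS table, and all names are illustrative:

```python
import os

# The metastore tracks a set of registered partition values. MSCK REPAIR
# re-scans the table root and registers any folder it did not know about.
table_root = "/tmp/msck_demo/payments"
for dt in ("2026-04-24", "2026-04-25", "2026-04-26"):
    os.makedirs(f"{table_root}/dt={dt}", exist_ok=True)

metastore = {"dt=2026-04-24"}            # only one partition registered so far

def msck_repair(root: str, registered: set) -> list:
    """Register every dt=... folder under root that the metastore lacks."""
    found = {d for d in os.listdir(root) if d.startswith("dt=")}
    missing = found - registered
    registered |= missing
    return sorted(missing)

print(msck_repair(table_root, metastore))  # ['dt=2026-04-25', 'dt=2026-04-26']
```

Note what is absent: nothing here can change the shape of the partition key itself. Every operation assumes the dt= layout is fixed.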
What Hive does not let you do:
- Add a second partition column to an existing table. The metastore's PARTITION_KEYS table is keyed on (table_id, partition_position); there is no idempotent way to add a key without invalidating every existing partition entry.
- Change the type of a partition column (e.g. from STRING to DATE).
- Rename a partition column.
- Drop a partition column.
- Change the bucketing.
The official "evolution" path in pure Hive is CREATE TABLE AS SELECT into a new table with the new layout, then swap. This is a full re-write of the historical data. On Aditi's 2 PB warehouse, with sustained Spark throughput of about 200 MB/s per worker and a 200-worker cluster, this is around a 14-hour job — and that is for a single table. The rewrite cost is also ~₹40,000 in S3 and EMR (or equivalent EC2 spot-instance) charges, which is the cheap number; the engineering cost of validating the rewrite, dual-running queries against both, and cutting over is weeks.
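The 14-hour figure is straight throughput arithmetic. A back-of-envelope sketch under the stated assumptions (200 MB/s sustained per worker, 200 workers), not a benchmark:

```python
# Back-of-envelope: how long does a full ~2 PB rewrite take?
table_bytes    = 2 * 1024**5       # ~2 PB of historical data
per_worker_bps = 200 * 1024**2     # ~200 MB/s sustained per Spark worker
workers        = 200

cluster_bps = per_worker_bps * workers   # ~40 GB/s aggregate
seconds     = table_bytes / cluster_bps
print(f"{seconds / 3600:.1f} hours")     # 14.9 hours -- one clean pass,
                                         # with no stragglers or retries
```

And that is the optimistic number: a real CTAS pays for the read, the shuffle into the new bucketing, and the write, so the wall-clock is usually worse.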
This is the legacy pain that the modern table formats were designed to fix.
What Iceberg's partition evolution actually does
Iceberg added a feature called partition evolution in v1: you can change the partition spec, and the table format keeps a versioned history of all the specs it has ever had. New data is written with the new spec; old data stays in its old spec. The query planner reads both at query time, applies the appropriate predicate against each spec independently, and the reader sees a single table.
Mechanically, an Iceberg table's metadata file (the metadata.json at the catalog pointer) holds an array called partition-specs, indexed by spec ID:
{
"partition-specs": [
{
"spec-id": 0,
"fields": [
{"name": "dt", "transform": "day", "source-id": 5, "field-id": 1000}
]
},
{
"spec-id": 1,
"fields": [
{"name": "dt", "transform": "day", "source-id": 5, "field-id": 1000},
{"name": "bucket_merchant", "transform": "bucket[64]",
"source-id": 7, "field-id": 1001}
]
}
],
"default-spec-id": 1,
...
}
Every data file in the Iceberg manifest tree has a spec-id field telling the planner which spec the file was written under. Why every file carries a spec-id: at query time, the planner has to evaluate the user's WHERE clause against each spec independently. If the spec-1 files are partitioned by dt × bucket_merchant and the spec-0 files are partitioned by dt alone, a query of WHERE dt='2026-04-24' AND merchant_id='razorpay-007' prunes spec-1 files to one (day, bucket) combination and prunes spec-0 files to one day — and the planner needs to know which rule to apply to which file.
The transformation matters as much as the column. Iceberg's bucket[64] is not the same as Hive's bucketing; it is a deterministic transform that the engine applies to the source column at write time and stores in the manifest as a partition value. So the spec is (source_column, transform, partition_field_id). Evolving the spec is adding a new tuple to that array, with all the ID-based plumbing in place to keep referring to the same logical column.
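A sketch of that (source_column, transform, partition_field_id) pipeline. Real Iceberg uses a Murmur3-based hash for bucket[N]; the md5 stand-in below only illustrates the determinism that matters:

```python
import hashlib

# The spec is a list of (source_column, transform, partition_field_id)
# tuples. The transform runs at WRITE time; its output, not the raw column,
# becomes the partition value stored in the manifest.
def bucket_transform(n: int):
    # Illustrative stand-in: real Iceberg hashes with Murmur3, not md5.
    def apply(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % n
    return apply

spec_v0 = [("dt", lambda v: v, 1000)]                        # identity on dt
spec_v1 = spec_v0 + [("merchant_id", bucket_transform(64), 1001)]

row = {"dt": "2026-04-24", "merchant_id": "razorpay-007"}
partition_value = tuple(t(row[col]) for col, t, _ in spec_v1)
print(partition_value)   # e.g. ('2026-04-24', <some bucket in 0..63>)

# Determinism is the point: the same merchant_id always lands in the same
# bucket, so a WHERE merchant_id=... predicate compiles to exactly one bucket.
assert bucket_transform(64)("razorpay-007") == bucket_transform(64)("razorpay-007")
```

Evolving the spec is appending a new tuple list to the versioned array; none of the old tuples, and none of the old files, are touched.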
What Iceberg lets you do today:
- Add a partition field (a new (column, transform) entry).
- Drop a partition field (forward only — old data keeps its old layout).
- Replace a transform (e.g. switch bucket[16] to bucket[64] — old data stays in 16, new data goes into 64).
- Promote a source-column type (with limits — int → long is fine, string → int is not).
What Iceberg does not let you do:
- Rename a partition column (this is the rename problem — coming up below).
- Re-bucket old data without rewriting it.
- Change the partition spec retroactively — there is no "go back and re-shuffle the 2018 partitions to the new layout" command.
The savings are substantial. On Aditi's table, switching from dt to dt × bucket_merchant for new data costs zero rewrite — the change is a metadata commit on the catalog, taking milliseconds. New data starts going into the new layout immediately. Old data stays where it is, queries still work, and a one-time backfill rewrite of historical data can happen on a slow weekend if it is even necessary.
A working comparison: Hive evolution vs Iceberg evolution
The script below builds the same dataset twice — once as a Hive-style partitioned directory tree, once as an Iceberg-style manifest with a versioned partition spec — and shows the cost of evolving the partition layout in each.
# evolve_compare.py — what does it cost to change the partition layout?
import os, json, hashlib, time, shutil
from datetime import date, timedelta
import pyarrow as pa
import pyarrow.parquet as pq
OUT = "/tmp/evolve_demo"
shutil.rmtree(OUT, ignore_errors=True); os.makedirs(OUT, exist_ok=True)
def synth_day(d, n=10000):
return pa.table({
"txn_id": pa.array(range(n), type=pa.int64()),
"merchant": pa.array([f"m-{i % 200}" for i in range(n)]),
"amount": pa.array([100 + (i * 7) % 50000 for i in range(n)],
type=pa.int32()),
"dt": pa.array([d.isoformat()] * n),
})
# ---- Hive-style: partition by dt only.
hive_root = f"{OUT}/hive_table"
for i in range(7):
d = date(2026, 4, 1) + timedelta(days=i)
pq.write_to_dataset(synth_day(d), root_path=hive_root,
partition_cols=["dt"])
print("Hive table layout, before evolution:")
for r, _, fs in os.walk(hive_root):
for f in fs:
if f.endswith(".parquet"):
print(" ", r.replace(OUT, "").lstrip("/"))
# Hive "evolution": add bucket column. The only safe path is to create a new
# table and re-write everything. Time and cost the rewrite.
def bucket(m): return int(hashlib.md5(m.encode()).hexdigest(), 16) % 8
t0 = time.time()
hive_v2 = f"{OUT}/hive_table_v2"
for i in range(7):
d = date(2026, 4, 1) + timedelta(days=i)
t = synth_day(d)
buckets = pa.array([bucket(m) for m in t["merchant"].to_pylist()])
t2 = t.append_column("bucket", buckets)
pq.write_to_dataset(t2, root_path=hive_v2,
partition_cols=["dt", "bucket"])
print(f"\nHive evolution = full rewrite. Wall: {time.time()-t0:.2f}s, "
f"files copied: {sum(len(fs) for _, _, fs in os.walk(hive_v2) if fs)}")
# ---- Iceberg-style: keep one root, simulate two specs in a metadata file.
ice_root = f"{OUT}/ice_table"
os.makedirs(ice_root, exist_ok=True)
# Spec 0: by dt. Write 4 days under spec 0.
for i in range(4):
d = date(2026, 4, 1) + timedelta(days=i)
pq.write_to_dataset(synth_day(d),
root_path=f"{ice_root}/data",
partition_cols=["dt"])
# "Evolve the spec" — write metadata.json. Cost: a single JSON file.
metadata = {
"format-version": 2,
"partition-specs": [
{"spec-id": 0, "fields": [{"name": "dt", "transform": "identity"}]},
{"spec-id": 1, "fields": [
{"name": "dt", "transform": "identity"},
{"name": "bucket", "transform": "bucket[8]"}
]},
],
"default-spec-id": 1,
}
t1 = time.time()
with open(f"{ice_root}/metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
print(f"\nIceberg evolution = metadata commit. Wall: {time.time()-t1:.4f}s")
# Spec 1: by dt + bucket. Write 3 more days under spec 1.
for i in range(4, 7):
d = date(2026, 4, 1) + timedelta(days=i)
t = synth_day(d)
buckets = pa.array([bucket(m) for m in t["merchant"].to_pylist()])
t2 = t.append_column("bucket", buckets)
pq.write_to_dataset(t2, root_path=f"{ice_root}/data",
partition_cols=["dt", "bucket"])
print("\nIceberg layout after evolution (mixed specs):")
for r, _, fs in os.walk(f"{ice_root}/data"):
for f in fs:
if f.endswith(".parquet"):
print(" ", r.replace(OUT, "").lstrip("/"))
# Sample run on a laptop:
Hive table layout, before evolution:
hive_table/dt=2026-04-01/...parquet
hive_table/dt=2026-04-02/...parquet
... (7 files)
Hive evolution = full rewrite. Wall: 1.83s, files copied: 56
Iceberg evolution = metadata commit. Wall: 0.0007s
Iceberg layout after evolution (mixed specs):
ice_table/data/dt=2026-04-01/...parquet <- spec 0
ice_table/data/dt=2026-04-02/...parquet <- spec 0
ice_table/data/dt=2026-04-03/...parquet <- spec 0
ice_table/data/dt=2026-04-04/...parquet <- spec 0
ice_table/data/dt=2026-04-05/bucket=2/...parquet <- spec 1
ice_table/data/dt=2026-04-05/bucket=5/...parquet <- spec 1
ice_table/data/dt=2026-04-06/bucket=0/...parquet <- spec 1
...
The numbers are the punchline. The Hive "evolution" is a full rewrite — every row, every file, every path — and on a real 2 PB warehouse extrapolates to roughly 14 hours of compute. The Iceberg evolution is a single JSON commit of about 700 microseconds, after which writes immediately use the new layout while old files keep theirs.
The lines that matter: partition_cols=["dt"] for the spec-0 writer, partition_cols=["dt", "bucket"] for the spec-1 writer — same code, different argument. The metadata file's partition-specs array holds the version history. default-spec-id: 1 tells future writers which spec to use; old files unaffected. Why this works without breaking queries: at read time, the planner walks the manifest tree, sees each file tagged with its spec-id, and applies the user's WHERE clause through the spec's transform to decide whether to read the file. A query of WHERE dt='2026-04-03' prunes spec-0 files by exact match on the path, prunes spec-1 files by the dt component of their compound path, and reads exactly the matching ones from each — without the user knowing two specs exist.
The rename problem: one thing no format handles cleanly
Even with Iceberg's evolution, there is one mutation that nothing handles cheaply: renaming a partition column. If you decide that dt should now be called event_date, every existing file's path contains the literal string dt=.... The Parquet files do not need to change — but their paths do, and an S3 "rename" is COPY + DELETE.
Iceberg gets close. It refers to columns internally by field-id, not by name. So a RENAME COLUMN dt TO event_date updates the column name in the table schema and the partition spec, and queries that filter on event_date work because the field-id is unchanged. But a tool that walks the table root and filters by parsing the path strings (some external systems still do this — Hive Metastore-backed Spark jobs, cloud-vendor crawlers, ad-hoc aws s3 ls scripts) sees the old dt= prefix and may break.
The full clean rename, in any format, requires re-writing the affected files so the path matches the new name. There is no metadata-only rename for partition columns. This is the single biggest sharp edge in lakehouse evolution: rename is the most expensive change you can make, even though it sounds like the cheapest.
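The failure mode is easy to demonstrate with a naive path-parsing tool. The filter_by_path helper is hypothetical, but plenty of crawlers and ad-hoc scripts work exactly like it:

```python
# A naive external tool that prunes by parsing path strings.
files = [
    "s3://lake/payments/dt=2026-04-24/part-001.parquet",
    "s3://lake/payments/dt=2026-04-25/part-001.parquet",
]

def filter_by_path(files: list, column: str, value: str) -> list:
    return [f for f in files if f"{column}={value}" in f]

# Before the rename: works.
print(filter_by_path(files, "dt", "2026-04-24"))          # one file

# After RENAME COLUMN dt TO event_date, Iceberg's own planner still works
# because it resolves by field-id -- but the paths still say dt=..., so the
# naive tool silently returns nothing.
print(filter_by_path(files, "event_date", "2026-04-24"))  # []
```

The worst part is the silence: an empty result looks like "no data for that day", not like a broken filter, so the breakage can go unnoticed for weeks.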
Common confusions
- "Iceberg partition evolution rewrites my old data." No — old data stays exactly where it is, in the spec it was written under. The evolution is a metadata commit. Only future writes use the new spec. Querying mixed-spec data is the planner's job.
- "ALTER TABLE ... ADD PARTITION is partition evolution." No, that's adding a new value under the existing spec. The spec (the column list and transforms) stays the same. Real partition evolution adds or drops a partition field.
- "Hive lets me change the partitioning by dropping and re-adding the table." Technically yes, practically no — you've just done a full rewrite, which is exactly what you wanted to avoid. "DROP and CREATE" is the rewrite path with extra downtime.
- "Renaming a partition column in Iceberg is free because Iceberg uses field-ids." It's free for the schema and the spec. It is not free for the file paths, which still have the old name. Tools that read path strings see the old name. The full rename requires a rewrite.
- "I can change a transform without rewriting." You can change it for new data. Old data stays in the old transform. If you need uniform layout across all data, you have to rewrite — there is no incremental conversion.
- "Delta and Hudi do partition evolution the same way Iceberg does." Not quite. Delta only added partition evolution in late 2024 and limits it to add/drop. Hudi has always supported a different style (record-key + partition-path), and partition-path evolution requires a rewrite (table rebuild). Iceberg has the most mature evolution semantics today.
Going deeper
How the Iceberg planner reads two specs at once
When you run SELECT * FROM payments WHERE dt = '2026-04-05' on a table that has both spec-0 and spec-1 data, the planner does this: (1) load the latest manifest list; (2) for each manifest, read its spec-id field; (3) compile the user's predicate into the form the spec understands — for spec-0, dt = '2026-04-05' is a direct path filter; for spec-1, dt = '2026-04-05' matches all 64 buckets and the predicate compiles to "any bucket"; (4) apply each compiled predicate to the manifest entries; (5) emit the surviving file list. The cost is one manifest-list read (cheap, <100 ms), one manifest read per surviving spec (also cheap), then the file reads. The user sees a single logical table.
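Those five steps can be sketched as follows; the dict shapes are illustrative, not Iceberg's actual manifest encoding:

```python
# Planning WHERE dt='2026-04-05' over manifests that mix two specs.
manifest_entries = [
    {"path": "data/dt=2026-04-04/f1.parquet", "spec_id": 0,
     "partition": {"dt": "2026-04-04"}},
    {"path": "data/dt=2026-04-05/f2.parquet", "spec_id": 0,
     "partition": {"dt": "2026-04-05"}},
    {"path": "data/dt=2026-04-05/bucket=7/f3.parquet", "spec_id": 1,
     "partition": {"dt": "2026-04-05", "bucket": 7}},
    {"path": "data/dt=2026-04-06/bucket=3/f4.parquet", "spec_id": 1,
     "partition": {"dt": "2026-04-06", "bucket": 3}},
]

# Compile the user predicate once per spec. For spec 0 it is an exact dt
# match; for spec 1 the dt component matches and the bucket component is
# "any bucket", because the query has no merchant predicate to hash.
compiled = {
    0: lambda p: p["dt"] == "2026-04-05",
    1: lambda p: p["dt"] == "2026-04-05",   # bucket unconstrained
}

survivors = [e["path"] for e in manifest_entries
             if compiled[e["spec_id"]](e["partition"])]
print(survivors)   # f2 (spec 0) and f3 (spec 1) survive; f1 and f4 pruned
```

Add a merchant predicate and only the spec-1 lambda changes, gaining a bucket check; the spec-0 lambda cannot use it, which is exactly the pruning asymmetry discussed below.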
Snowflake's micro-partitions: a different design point
Snowflake skipped partitioning-by-folder entirely. Tables are stored as immutable, ~16 MB micro-partitions, each one auto-clustered by ingest order. There are no user-declared partition columns. The system tracks min/max statistics per micro-partition and prunes with those. This eliminates the evolution problem (no spec exists to evolve) but pushes the cost into the auto-clustering background process, which has to constantly re-organise hot data. The trade-off: zero schema-evolution pain at the price of a constantly-running maintenance job and an opaque cost model. For a payments shop running its own infrastructure, Iceberg's explicit spec is usually preferable; for a managed-warehouse shop, Snowflake's invisibility is.
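Min/max pruning in miniature — a generic sketch of the statistics-based approach, not Snowflake's internals:

```python
# Each micro-partition carries min/max stats per column; the pruner keeps a
# partition only if the predicate's range could intersect the stored range.
micro_partitions = [
    {"id": "mp-001", "amount_min": 100,   "amount_max": 4_999},
    {"id": "mp-002", "amount_min": 5_000, "amount_max": 19_999},
    {"id": "mp-003", "amount_min": 150,   "amount_max": 52_000},  # wide range
]

def prune(parts: list, lo: int, hi: int) -> list:
    """Keep partitions whose [min, max] range intersects [lo, hi]."""
    return [p["id"] for p in parts
            if p["amount_max"] >= lo and p["amount_min"] <= hi]

print(prune(micro_partitions, 10_000, 12_000))   # ['mp-002', 'mp-003']
```

Note mp-003: its wide min/max range survives the prune even if it holds few matching rows. That is why the auto-clustering job matters — poorly clustered micro-partitions defeat statistics-based pruning the way an unpartitioned folder defeats path-based pruning.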
The migration playbook for a real production table
Aditi's actual playbook on the 2 PB Razorpay-style table: (1) on a Friday evening, commit the partition-spec change to Iceberg; this is metadata-only and instant. (2) over the weekend, monitor that new data is landing in the new layout and that queries against the last 7 days (which now span both specs) return correct results; if anything looks wrong, the rollback is another metadata commit. (3) over the next 2-3 weeks, run a background RewriteDataFiles job that copies old-spec data into new-spec layout, partition by partition, in low-traffic hours. The full conversion takes a fortnight, costs the equivalent of a one-time ₹40,000 in EMR, and is interruptible. The Hive equivalent is a 14-hour atomic job with no rollback.
Why bucketing transforms hurt evolution most
If your spec uses bucket[N] and you want to change N (say from 16 to 64), every old file is in a bucket-of-16 and every new file in a bucket-of-64. The planner can still answer queries (it computes the predicate against both transforms), but the file-pruning effectiveness on old data is now 4× worse than on new data — a query for one merchant prunes new files to 1 of 64 and old files to 1 of 16. The result is asymmetric query latency depending on date range. Rewriting is the only fix, and Iceberg's RewriteDataFiles action is built for exactly this.
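The asymmetry in numbers, using the same back-of-envelope style as the rest of the chapter:

```python
# One-merchant query: fraction of a day's files read, old spec vs new spec.
old_buckets, new_buckets = 16, 64

# A WHERE merchant_id=... predicate prunes to exactly one bucket per day.
old_fraction = 1 / old_buckets   # fraction of old-spec files read
new_fraction = 1 / new_buckets   # fraction of new-spec files read
print(f"old spec: {old_fraction:.2%} of files, new spec: {new_fraction:.2%}")
print(f"old-data scans are {old_fraction / new_fraction:.0f}x wider")   # 4x
```

A query whose date range spans the spec boundary pays the old-data cost for its historical days and the new-data cost for its recent days, which is why the latency difference shows up as date-range-dependent.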
Where this leads next
Iceberg's partition evolution makes the layout question reversible — but it does not make small-files hell go away. When you switch from dt to dt × bucket=64, today's 200 GB partition becomes 64 partitions of about 3 GB each. That is fine. When tomorrow's traffic is half, those 64 partitions become 64 × 1.5 GB. When tomorrow's traffic is a tenth, they become 64 × 300 MB — under the 1 GB threshold most engines want for parallel reads. The next chapter, /wiki/compaction-small-files-hell-and-how-to-avoid-it, is about the maintenance job that fixes that. After that, /wiki/iceberg-delta-hudi-from-the-producers-perspective compares the three modern formats on every dimension this chapter touched, including their evolution semantics.
References
- Apache Iceberg: Partition evolution — canonical reference for spec versioning, transform changes, and the metadata commit semantics.
- Iceberg Spec v2: partition specs — the on-disk metadata format, the partition-specs array, and the field-id system.
- Netflix: From Hive to Iceberg — the original production migration, including why Hive's partition spec was the blocker that motivated Iceberg.
- Delta Lake: partition evolution (added 2024) — Delta's later, more limited take on the same feature.
- Snowflake micro-partitions — the alternative design point that avoids partition evolution by removing user-declared partitions altogether.
- PhonePe Engineering: Iceberg migration at UPI scale — Indian-context production write-up on Hive-to-Iceberg cutover for high-volume payments tables.
- /wiki/partitioning-strategies-date-hash-composite — the chapter you came from; sets up which partition column to pick before you have to worry about changing it.
- /wiki/compaction-small-files-hell-and-how-to-avoid-it — the maintenance job that fixes the small-files cost composite layouts create.