Eventual consistency on S3 and what it breaks
The previous chapter promoted S3 to "primary table store" by leaning on a guarantee — strong read-after-write consistency — that S3 only acquired in December 2020. For the 14 years before that, S3 was eventually consistent on overwrites and on LIST, and a generation of distributed-systems engineers built Hadoop, Hive, Spark, Iceberg v1, and Delta v0 on top of a substrate that occasionally lied to them. The bug class it produced has not gone away. The code that defends against it is still in production, and on Azure Blob, GCS, and MinIO some of those defenses are still load-bearing.
Eventual consistency means a successful write is not immediately visible to all readers. On S3 pre-2020, an overwrite PUT could return the old object for up to ten seconds, and a LIST could omit a just-written key for longer. That window broke commit protocols, triggered duplicate task launches, and silently dropped query results — failures that look like "the pipeline produced wrong numbers" rather than "the pipeline crashed". The lakehouse code we still run is a museum of the workarounds.
What "eventually" actually means
The word "eventually" sounds polite. The semantics are not. A consistency model is a contract between a storage system and the code that uses it; eventual consistency is the contract that says: every write will become visible to every reader at some unspecified time in the future, but until that time, readers may see any previous state. There is no upper bound enforced by the API. AWS publishes empirical bounds — most overwrites visible within milliseconds, a long tail at one to ten seconds, occasional outliers under load — but those are observations, not guarantees.
Three operations on pre-2020 S3 had three different consistency stories, and confusing them was the source of most production bugs.

| Operation | Pre-2020 consistency | What broke when code assumed more |
|---|---|---|
| PUT of a new key | Read-after-write consistent (unless a GET/HEAD on the key preceded the PUT) | Negative-cache poisoning |
| Overwrite PUT | Eventually consistent; readers could see the old bytes for seconds | Stale reads mid-query |
| LIST | Eventually consistent; could omit recently written keys | Commit protocols that discover output via LIST |
The third row is the one that hurt. Hadoop's FileOutputCommitter v1 worked by writing each task's output to a _temporary/<taskid>/ prefix and, on commit, listing that prefix to find the files and renaming them into the final output directory. On HDFS that loop was atomic: list returned exactly the files the task had written, and rename was a metadata operation. On S3 it became a slow, racy disaster: list might omit a file the task had already PUT, rename-by-copy could pick up an old version of an overwritten file, and the cleanup of _temporary/ raced both. A Spark job at a Bengaluru analytics shop in 2018 silently dropped 0.1% of its output rows because of this, with no exception thrown anywhere; the only signal was that the daily revenue total was ₹4 lakh short and nobody knew why.
Why "no exception thrown" is the hallmark of an eventual-consistency bug: the storage system returned 200 OK on every operation. The PUT succeeded. The subsequent LIST returned a result. The rename-by-copy succeeded. Every individual call was correct. The composition of "PUT then LIST then rename" was wrong, because LIST didn't see what PUT had just done. There is no error code in the protocol that signals "you raced your own write". You discover the bug by reconciling counts at the bottom of the pipeline, days later.
The four classic bug shapes
Every eventual-consistency bug on object storage you will read about — in old Hadoop JIRAs, in Iceberg v1 design docs, in Netflix engineering posts — fits one of four shapes. Naming the shapes is most of the diagnosis work.
Shape 1: Read-your-own-write fails inside the same job
Worker writes manifest_v2.json, then immediately reads it back to validate. The read returns the old manifest (or a 404, if the key was new and the GET hit a frontend that had not yet seen the PUT). The worker either acts on stale data or crashes. Pattern: any code that does put(); get() on the same key.
Shape 2: List-after-write omits the just-written key
Driver writes a part-file s3://bucket/output/part-00042.parquet. After all writers finish, driver does list(s3://bucket/output/) to find every part. The list omits part-00042.parquet. The driver believes there are 41 part-files and ships the result. Bug: the output is 1/42 short. Pattern: any commit protocol that uses LIST to discover what was written.
Shape 3: Overwrite returns stale bytes to a concurrent reader
Compaction job writes a new version of payments_2026_04_25.parquet (overwriting the old one). A query that started 2 seconds earlier reads the old version partway through, the new version on a retried range request — splicing two incompatible row groups together. Pattern: any pipeline where overwrite and read run concurrently on the same key.
Shape 4: Negative cache poisoning
Worker tries to GET manifest_v3.json, gets 404, caches "not found" locally. A second worker, racing, writes manifest_v3.json successfully. The first worker, on retry, still gets 404 from a different S3 frontend that hasn't seen the write yet, and hits its negative cache. Pattern: any client that caches negative GETs (DNS-style) without TTL.
```python
# ec_bug_shapes.py — simulate the four classic eventual-consistency bug shapes
# against a toy object store that returns "old" reads for a configurable window.
# This is what production S3 looked like to a Hadoop committer pre-2020.
import threading
import time
import random


class EventuallyConsistentStore:
    def __init__(self, window_ms=2000):
        self.objects = {}    # key -> latest bytes
        self.history = {}    # key -> list of (timestamp, bytes)
        self.window_ms = window_ms
        self.lock = threading.Lock()

    def put(self, key, value):
        now = time.time()
        with self.lock:
            self.objects[key] = value
            self.history.setdefault(key, []).append((now, value))
        return "200 OK"

    def get(self, key):
        # A reader may, with some probability, see the version-before-the-latest
        # if the latest was written within the inconsistency window.
        now = time.time()
        with self.lock:
            versions = self.history.get(key, [])
            if not versions:
                return None
            latest_ts, latest_val = versions[-1]
            age_ms = (now - latest_ts) * 1000
            if age_ms < self.window_ms and len(versions) >= 2 and random.random() < 0.4:
                return versions[-2][1]  # return the previous version
            return latest_val

    def list_prefix(self, prefix):
        # LIST may omit a recently-PUT key with the same probability.
        now = time.time()
        with self.lock:
            keys = []
            for k, vs in self.history.items():
                if not k.startswith(prefix):
                    continue
                latest_ts = vs[-1][0]
                age_ms = (now - latest_ts) * 1000
                if age_ms < self.window_ms and random.random() < 0.4:
                    continue  # the just-written key is invisible to this LIST
                keys.append(k)
            return sorted(keys)


# --- Shape 1: read-your-own-write ---
s = EventuallyConsistentStore(window_ms=2000)
s.put("manifest_v1.json", b"snapshot=1")
s.put("manifest_v1.json", b"snapshot=2")  # overwrite
print("Shape 1 (RYW):", s.get("manifest_v1.json"))  # may print snapshot=1

# --- Shape 2: list-after-write ---
for i in range(42):
    s.put(f"output/part-{i:05d}.parquet", b"...")
seen = s.list_prefix("output/")
print(f"Shape 2 (LAW): wrote 42 parts, LIST returned {len(seen)}")

# --- Shape 3: overwrite during concurrent read ---
def reader(out):
    for _ in range(5):
        out.append(s.get("hot_file.parquet"))
        time.sleep(0.1)

s.put("hot_file.parquet", b"v1")
seen = []
t = threading.Thread(target=reader, args=(seen,))
t.start()
time.sleep(0.05)
s.put("hot_file.parquet", b"v2")  # compaction overwrites
t.join()
print(f"Shape 3 (overwrite race): reader saw versions {set(seen)}")

# Sample run:
# Shape 1 (RYW): b'snapshot=1'
# Shape 2 (LAW): wrote 42 parts, LIST returned 26
# Shape 3 (overwrite race): reader saw versions {b'v1', b'v2'}
```
The script is a toy (real S3 had subtler timing), but the shapes are exact. `EventuallyConsistentStore` keeps a per-key version history and serves reads of a recently overwritten or recently PUT key inconsistently for `window_ms`; models of pre-2020 S3 in the literature all belong to this family.

- Shape 1's output, `b'snapshot=1'`: the second PUT succeeded, but the GET saw the first. That is a read-your-own-write violation.
- `Shape 2 (LAW): wrote 42 parts, LIST returned 26`: the driver believes the job produced 26 parts. This is the most dangerous of the four because there is no error. The driver continues, ships partial results downstream, and the bug surfaces when an analyst notices a number is off. Catching it requires a separate count-of-counts check that the application has to write itself; the storage system gives no signal.
- `Shape 3 (overwrite race): reader saw versions {b'v1', b'v2'}`: within one logical scan of one file, the reader saw two versions of the bytes. If those bytes were two halves of a Parquet row group, the result would be a corrupt deserialisation, not a clean "old vs new" split. This is why pre-2020 lakehouse code never overwrote a hot file in place: every "compact" wrote a new file under a new key and swapped a manifest pointer. The reason wasn't elegance; it was that overwriting could corrupt concurrent readers.
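The count-of-counts defense against the list-after-write shape is cheap to build: the writer records how many parts it produced, out-of-band of LIST, and the committer refuses to ship unless LIST agrees. A minimal sketch against a toy in-memory store; the `_SUCCESS.count` marker name is illustrative, not any framework's convention.

```python
# Count-of-counts defense against the list-after-write shape: the writer
# records the expected part count in a marker object, and the committer
# fails loudly if LIST disagrees, instead of silently shipping partial output.
class Store:
    """Tiny in-memory stand-in for an object store (consistent, for the demo)."""
    def __init__(self):
        self.objects = {}
    def put(self, key, value):
        self.objects[key] = value
    def list_prefix(self, prefix):
        return sorted(k for k in self.objects if k.startswith(prefix))

def write_job(store, n_parts):
    for i in range(n_parts):
        store.put(f"output/part-{i:05d}.parquet", b"...")
    # Record the expected count where the committer can find it.
    store.put("output/_SUCCESS.count", str(n_parts).encode())

def commit(store):
    expected = int(store.objects["output/_SUCCESS.count"])
    parts = [k for k in store.list_prefix("output/") if k.endswith(".parquet")]
    if len(parts) != expected:
        # The defensive signal the storage system never gives you.
        raise RuntimeError(f"LIST saw {len(parts)} parts, writer produced {expected}")
    return parts

store = Store()
write_job(store, 42)
print(len(commit(store)))  # 42 only when LIST agrees with the recorded count
```

The point is the mismatch path: on an eventually consistent LIST, the commit turns a silent data loss into a retryable exception.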
The defenses the lakehouse built — and which are still load-bearing
Faced with that bug class, the table-format community built three families of defenses. Each has a footprint in the code we read today.
| Defense | What it does | Cost | Still needed in 2026? |
|---|---|---|---|
| External metastore (S3Guard, DynamoDB pointer) | Authoritative list of files lives in DynamoDB or HMS, not S3 LIST | Extra system, extra failure mode | No on AWS S3; yes on GCS/Azure for some operations |
| Atomic-rename-via-copy (FileOutputCommitter v2, S3A committer) | Tasks write to a unique staging key; commit copies to final key (PUT-new-key is the only strongly-consistent path) | Extra copy = 2× write bandwidth | Largely obsolete on S3; still used on weaker substrates |
| Manifest-pointer commit (Iceberg, Delta, Hudi) | A single atomic operation (put-if-absent on a metadata key) decides what the table "is" | Requires a CAS primitive | Yes — this is the lakehouse commit protocol, regardless of substrate consistency |
The third row is the one that survived the consistency flip and became the durable design. The first two were workarounds for a problem that mostly went away.
S3Guard is the most instructive example of a workaround that did its job and then was retired. Hortonworks shipped it in 2017: a Hadoop S3A filesystem extension that recorded every PUT and DELETE in a DynamoDB table, then served LIST and HEAD from DynamoDB instead of S3. List-after-write became immediate. Overwrite-then-read became consistent. The cost was an extra DynamoDB table to provision and pay for, an extra failure mode (what if DynamoDB falls behind S3?), and a real bill: at one Indian e-commerce platform that processed ~50 TB of clickstream daily, S3Guard's DynamoDB cost ran ₹3.5 lakh per month for a 1 KB-per-key metadata footprint. After December 2020, the Hadoop project deprecated S3Guard, customers ripped it out, and the code was eventually removed from the S3A codebase entirely.
The write-to-unique-key committer is more deeply embedded. Hadoop's FileOutputCommitter v1 was the broken one (list-then-rename on S3). The S3A committer family (Magic, Directory, Partitioned) uses a different trick: each task writes to a unique location that no other attempt shares, and commit materialises each final object in a single step, completing a pending multipart upload (Magic) or uploading/copying the staged data to the final key (Directory, Partitioned). Because PUT of a new key was strongly consistent even pre-2020, and each final key is written exactly once, the commit is safe. The cost is handling the bytes an extra time in staging; the gain is correctness.
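The unique-key discipline can be sketched in a few lines: tasks report their staging keys to the driver directly, so LIST is never consulted, and each final key is written exactly once. This is a toy in-memory sketch of the pattern, not the S3A committers' actual machinery; the staging layout and method names are illustrative.

```python
# Sketch of the write-to-unique-key committer pattern: every task attempt
# writes under a staging prefix unique to that attempt, and commit
# materialises the final key in one copy. Because each final key is written
# exactly once (a new-key PUT), neither LIST nor overwrite consistency is
# ever needed. Toy in-memory store; the key layout is illustrative.
import uuid

class Store:
    def __init__(self):
        self.objects = {}
    def put(self, key, value):
        self.objects[key] = value
    def copy(self, src, dst):          # stand-in for a server-side copy
        self.objects[dst] = self.objects[src]
    def delete(self, key):
        self.objects.pop(key, None)

def task_write(store, data):
    attempt_id = uuid.uuid4().hex       # unique per attempt: no collisions
    staging_key = f"_staging/{attempt_id}/part.parquet"
    store.put(staging_key, data)
    return staging_key                  # the driver learns the key directly,
                                        # never via LIST

def driver_commit(store, staging_keys):
    for i, src in enumerate(staging_keys):
        store.copy(src, f"output/part-{i:05d}.parquet")  # PUT of a new key
        store.delete(src)

store = Store()
keys = [task_write(store, b"rows") for _ in range(3)]
driver_commit(store, keys)
print(sorted(store.objects))
```

The load-bearing detail is the return value of `task_write`: commit protocols that pass written keys up the control plane never have to trust an eventually consistent LIST.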
The Iceberg/Delta design doesn't appear in this picture as "the third workaround". It is the only one of the three that does not rely on overwrite or LIST consistency. Every Iceberg commit is a write of a new metadata file (PUT-new-key, always strongly consistent) and an atomic compare-and-swap of the table pointer. Whether the substrate is strongly consistent S3, eventually consistent Azure Blob, or MinIO with weird quirks, the protocol works. Why this is more than a historical observation: the design that survived was the one that demanded the least from the storage system. It assumed nothing about LIST timing, nothing about overwrite visibility, nothing about negative caching. It assumed only that PUT-new-key returns 200 OK when the bytes are durable, and that a single object can be atomically replaced (which Iceberg achieves via the metastore catalog, not via S3 itself). That minimality is why the same Iceberg code runs on every object store on the planet.
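The protocol described above reduces to two steps, and a toy sketch makes the division of labour visible: the object store only ever receives PUTs of brand-new keys, and the catalog supplies the single compare-and-swap. Class and method names here are illustrative stand-ins, not Iceberg's API.

```python
# Manifest-pointer commit: write all new state to brand-new keys, then make
# the table "become" the new state with one atomic compare-and-swap on a
# single pointer. Only the catalog needs CAS; the object store needs nothing
# beyond a consistent PUT-of-a-new-key. Toy classes for illustration.
import threading

class Catalog:
    """Stand-in for a metastore offering an atomic compare-and-swap."""
    def __init__(self):
        self._pointer = None
        self._lock = threading.Lock()
    def current(self):
        return self._pointer
    def cas(self, expected, new):
        with self._lock:
            if self._pointer != expected:
                return False            # someone else committed first
            self._pointer = new
            return True

def commit(store, catalog, snapshot_id, data):
    base = catalog.current()                     # the snapshot we build on
    meta_key = f"metadata/v{snapshot_id}.json"   # new key, never overwritten
    store[meta_key] = data                       # strongly consistent PUT
    if not catalog.cas(base, meta_key):
        raise RuntimeError("concurrent commit won the CAS; retry from new base")
    return meta_key

store = {}
catalog = Catalog()
commit(store, catalog, 1, b"snapshot=1")
commit(store, catalog, 2, b"snapshot=2")
print(catalog.current())  # metadata/v2.json
```

A losing writer raises instead of corrupting anything, which is exactly the optimistic-concurrency behaviour the later chapter on concurrent writers builds on.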
What the flip to strong consistency changed — and what it did not
December 2020's announcement was that all S3 reads — for new keys, overwrites, and LIST — became strongly consistent. The visible effect was that S3Guard could be turned off, the Magic Committer's "we copy because LIST is racy" rationale weakened, and a class of tickets in the Spark/Hadoop trackers were closed as "no longer reproducible".
Three things did not change.
First, S3 still does not offer cross-object atomicity. A write of two objects, A and B, is not atomic. A reader can observe A-new and B-old. This was true before and after. Every transactional table format builds on top of single-object atomicity for that reason.
Second, overwrite is still the wrong design, even though it is now consistent. A compaction that overwrites a hot file in place would still race readers — not because of stale bytes, but because a reader holding a 16 MB byte-range request mid-flight will get the new ETag on its retried range and detect mismatch. Iceberg compaction therefore writes new files and updates manifests; it never overwrites. The pattern outlived the bug it was designed for.
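The ETag-mismatch detection can be sketched as conditional range reads: the reader pins the ETag returned by its first range request and presents it on every subsequent range, so a concurrent overwrite surfaces as a clean 412-style failure instead of spliced bytes. The store below is a toy; a real client would set the HTTP `If-Match` header on its ranged GETs.

```python
# ETag-pinned range reads: the reader records the object's ETag on the first
# range request and refuses (412 Precondition Failed) any later range served
# from a different version. A concurrent overwrite becomes a detectable error
# rather than a silent splice of two file versions. Toy store for illustration.
import hashlib

class Store:
    def __init__(self):
        self.objects = {}
    def put(self, key, value):
        self.objects[key] = value
    def get_range(self, key, start, end, if_match=None):
        value = self.objects[key]
        etag = hashlib.md5(value).hexdigest()   # S3-style ETag for simple PUTs
        if if_match is not None and if_match != etag:
            raise RuntimeError("412 Precondition Failed: object changed mid-read")
        return value[start:end], etag

store = Store()
store.put("hot.parquet", b"A" * 32)

chunk1, etag = store.get_range("hot.parquet", 0, 16)       # pin the ETag
store.put("hot.parquet", b"B" * 32)                        # compaction overwrites
try:
    store.get_range("hot.parquet", 16, 32, if_match=etag)  # detected, not spliced
except RuntimeError as e:
    print(e)
```

Detection is still only a retryable failure, not correctness; writing new files and swapping a manifest pointer avoids the race entirely, which is why the pattern outlived the bug.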
Third, other object stores are not S3. Azure Blob storage offers strong consistency for single-object operations but eventual consistency for some LIST scenarios. Google Cloud Storage is strongly consistent for single objects but reorders LIST results in ways that surprise. MinIO — the open-source S3-compatible store many Indian companies run on-prem — has had eventual-consistency regressions in specific multi-node configurations as recently as 2023. Iceberg, Delta, and Hudi continue to ship with the consistency-paranoid commit path because the binding contract is "table format works on every object store", not "table format works on AWS S3".
Common confusions
- "Eventual consistency means the data is wrong." No — the data is durable and complete. Eventual consistency means readers may briefly see a stale version. Bytes do not disappear; visibility is delayed. The bug is in code that assumes "I just wrote it, therefore I will read it back".
- "S3 is strongly consistent now, so old workarounds are dead code." Some are (S3Guard). Many aren't, because the same code runs on Azure Blob, GCS, and MinIO where the consistency story is different. The Iceberg manifest-pointer protocol is more than a workaround — it is a commit protocol that gives serializable writes from a single atomic CAS, which neither old nor new S3 ships natively.
- "The Hadoop FileOutputCommitter is the same as the S3A committer." Different code, different correctness story. v1 lists `_temporary/` and renames by copy, which was broken on pre-2020 S3. The S3A Magic Committer writes to unique keys and commits each object with a single write of its final key, which works because PUT of a new key has always been strongly consistent.
- "DynamoDB is strongly consistent, so S3Guard never had bugs." S3Guard had a separate failure mode: DynamoDB and S3 could disagree if a write hit S3 but failed to record in DynamoDB, or vice versa. Reconciling that drift required eventual-repair scans. Replacing one consistency problem with another is not solving consistency; it is relocating it.
- "Read-after-write consistency means you can do compare-and-swap on an S3 key." No. Read-after-write means a successful PUT is immediately visible. CAS ("PUT this object only if its current ETag is X") is an `If-Match` conditional PUT, a separate API. AWS shipped conditional writes for S3 only in November 2024; before that, every "atomic update" on S3 went through DynamoDB or a metastore.
Going deeper
Why "10 seconds" was the real number, not a worst case
Observed overwrite visibility on pre-2020 S3 was usually under 1 second, with a long tail to 10 seconds and occasional outliers under regional load. AWS never published a bound because the behaviour varied by partition, region, and time of day. Werner Vogels' 2007 essay was deliberately silent on numbers: the contract was eventual consistency, full stop. Engineers who built test suites against "should be visible within 5 seconds" were calibrating against observed behaviour, not against any guarantee, and several Hadoop tests broke during a 2018 us-east-1 incident in which the tail stretched to 90 seconds. The lesson the format designers internalised: never depend on a timing bound the storage contract doesn't promise.
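Those numbers came from probes of roughly this shape: overwrite a key, poll until the new value is visible, record the lag. A self-contained sketch with a toy store standing in for pre-2020 S3; the staleness window here is random and small purely for the demo.

```python
# Visibility-lag probe: overwrite a key, then poll until the read returns the
# new value, recording how long staleness lasted. Loops of this shape produced
# the "usually <1 s, tail to 10 s" observations; nothing in the contract
# bounded the answer. FlakyStore is a toy stand-in for pre-2020 S3.
import random
import time

class FlakyStore:
    """Serves the old value for a random window after an overwrite."""
    def __init__(self):
        self.value, self.stale, self.deadline = None, None, 0.0
    def put(self, value):
        self.stale, self.value = self.value, value
        self.deadline = time.monotonic() + random.uniform(0.0, 0.2)  # staleness window
    def get(self):
        if self.stale is not None and time.monotonic() < self.deadline:
            return self.stale
        return self.value

def probe(store):
    store.put(b"old")
    time.sleep(0.25)                      # let the first write settle
    start = time.monotonic()
    store.put(b"new")
    while store.get() != b"new":          # poll until the overwrite is visible
        time.sleep(0.005)
    return time.monotonic() - start

lags = [probe(FlakyStore()) for _ in range(5)]
print([f"{lag * 1000:.0f} ms" for lag in lags])  # observations, never a guarantee
```

Any test that asserts a fixed lҺreshold against such a loop is asserting on the demo's random parameter, which is precisely the calibration mistake the paragraph describes.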
What the Iceberg V1 spec said about consistency that V2 quietly dropped
Iceberg's original specification (2018, Netflix) explicitly required that the catalog support an atomic "swap pointer" operation, and noted that "the data files and manifest files only need to be readable after they are written" — phrased that way because S3's eventual consistency meant "readable after written" was itself a non-trivial requirement. The V2 spec (2021) drops that phrasing because strong consistency on S3 had landed by then. Reading the diff between V1 and V2 is a tour through which assumptions changed and which did not. The metastore-CAS requirement did not change, because that was never an S3 problem; it was an S3-doesn't-offer-CAS problem, which is still true except for the new conditional-PUT API.
The 2018 Spark-on-S3 incident at a Pune analytics shop
A Pune-based analytics company running daily Spark jobs on a 200-node EMR cluster against S3 had a 0.8% row-loss rate on a 2 TB daily ingest — about ₹4 lakh of mis-attributed revenue per quarter. The team spent three months tracking it before finding the FileOutputCommitter v1 LIST race. The fix was to switch to the S3A Directory Committer; the row-loss disappeared overnight. The incident is not in any post-mortem repository because the company never published it, but the pattern repeated at dozens of Indian shops between 2015 and 2020. The cost of the bug class, summed across the industry, is genuinely uncountable.
Conditional writes on S3 (November 2024) and what they enable
AWS finally shipped If-Match and If-None-Match conditional writes for S3 in November 2024. This is compare-and-swap on an S3 object, the missing primitive that forced every lakehouse to rely on a metastore for the table-pointer CAS. The implication for Iceberg and Delta is real but slow-rolling: the projects can now in principle store the table pointer on S3 itself, removing the metastore from the critical path. Adoption will take years because the catalog ecosystem (Hive Metastore, AWS Glue, Unity Catalog, Polaris) does much more than just CAS — schema management, access control, lineage. But the door is open for a "metastore-less Iceberg" variant on AWS, and Snowflake and Databricks are both rumoured to be working on it.
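What the new primitive enables can be sketched as a retry loop over a conditional PUT: read the pointer and its ETag, write the new metadata under a fresh key, then swap the pointer only if the ETag is unchanged, retrying from a fresh read on a precondition failure. The store below is a toy stand-in for S3's conditional-write behaviour, not the real API surface.

```python
# CAS-on-S3 via conditional PUT: update the table pointer only if its ETag is
# still the one we read, retrying on a 412-style failure. This is the shape of
# commit a "metastore-less" table format could run directly against S3 now
# that conditional writes exist. Toy store; names are illustrative.
import itertools

class ConditionalStore:
    def __init__(self):
        self.objects = {}                         # key -> (etag, value)
        self._etags = itertools.count(1)
    def put(self, key, value, if_match=None):
        current = self.objects.get(key)
        if if_match is not None and (current is None or current[0] != if_match):
            raise RuntimeError("412 Precondition Failed")
        etag = str(next(self._etags))
        self.objects[key] = (etag, value)
        return etag
    def get(self, key):
        return self.objects[key]                  # (etag, value)

def commit(store, new_snapshot, max_retries=5):
    for _ in range(max_retries):
        etag, _ = store.get("table/pointer")      # read pointer and its ETag
        meta_key = f"metadata/{new_snapshot}.json"
        store.put(meta_key, b"...")               # new key: always safe
        try:                                      # swap only if nobody raced us
            store.put("table/pointer", meta_key.encode(), if_match=etag)
            return meta_key
        except RuntimeError:
            continue                              # lost the race: re-read, retry
    raise RuntimeError("too many concurrent commits")

store = ConditionalStore()
store.put("table/pointer", b"metadata/v0.json")
commit(store, "v1")
print(store.get("table/pointer")[1])  # b'metadata/v1.json'
```

Structurally this is the same loop as the catalog CAS; the only change is who enforces the precondition, which is exactly why adoption is a catalog-ecosystem question rather than a protocol question.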
Where this leads next
The next chapter — /wiki/manifest-files-and-the-commit-protocol — picks up exactly here, with the metadata-pointer design that survived the consistency flip and remains the lakehouse's transactional spine. After that, /wiki/concurrent-writers-without-stepping-on-each-other shows how multiple writers commit to the same Iceberg table without a coordinator — using exactly the optimistic-concurrency-on-a-single-CAS-pointer pattern that this chapter set up.
For readers coming from the OLTP side, /wiki/object-storage-as-a-primary-store-s3-as-a-database is the chapter just before this one — it covers what S3 gives and refuses, of which the consistency model is one row.
References
- Amazon S3 — "Strong read-after-write consistency for all applications" (Dec 2020) — the announcement of the flip; one paragraph that retired ~50,000 lines of workaround code across Hadoop, Spark, Iceberg, and Delta.
- Werner Vogels — "Eventually consistent" (2007) — the original design rationale. Worth reading twice now that the world has flipped.
- Hadoop S3A Committers documentation — definitive description of the Magic / Directory / Partitioned committers and exactly which consistency assumptions each makes.
- Apache Iceberg Specification v2 — §4 Commit Protocol — the surviving design; read alongside V1 to see which clauses about consistency were dropped.
- Ryan Blue — "Iceberg's Approach to Atomic Updates" — Netflix's framing of why a metadata-pointer with CAS is the only commit protocol that survived eventual consistency.
- AWS — "S3 Conditional Writes" (Nov 2024) — the long-awaited CAS primitive; what catalogs in 2027 may lean on.
- /wiki/object-storage-as-a-primary-store-s3-as-a-database — the previous chapter; the four-give-four-refuse table this chapter zooms into the consistency cell of.
- /wiki/concurrent-writers-without-stepping-on-each-other — the lakehouse-side answer to "how do you commit transactionally on a non-transactional substrate".