Eventual consistency on S3 and what it breaks

The previous chapter promoted S3 to "primary table store" by leaning on a guarantee — strong read-after-write consistency — that S3 only acquired in December 2020. For the 14 years before that, S3 was eventually consistent on overwrites and on LIST, and a generation of distributed-systems engineers built Hadoop, Hive, Spark, Iceberg v1, and Delta v0 on top of a substrate that occasionally lied to them. The bug class that substrate produced has not gone away. The code that defends against it is still in production, and on Azure Blob, GCS, and MinIO some of those defenses are still load-bearing.

Eventual consistency means a successful write is not immediately visible to all readers. On S3 pre-2020, an overwrite PUT could return the old object for up to ten seconds, and a LIST could omit a just-written key for longer. That window broke commit protocols, triggered duplicate task launches, and silently dropped query results — failures that look like "the pipeline produced wrong numbers" rather than "the pipeline crashed". The lakehouse code we still run is a museum of the workarounds.

What "eventually" actually means

The word "eventually" sounds polite. The semantics are not. A consistency model is a contract between a storage system and the code that uses it; eventual consistency is the contract that says: every write will become visible to every reader at some unspecified time in the future, but until that time, readers may see any previous state. There is no upper bound enforced by the API. AWS publishes empirical bounds — most overwrites visible within milliseconds, a long tail at one to ten seconds, occasional outliers under load — but those are observations, not guarantees.

Three operations on pre-2020 S3 had three different consistency stories, and confusing them was the source of most production bugs.

Three operations, three consistency stories (S3 pre-December 2020):

| Operation | Consistency | What readers saw |
| --- | --- | --- |
| PUT of a new key | strong read-after-write | visible immediately, every region, every reader — always safe |
| PUT overwriting an existing key | eventual | a new ETag is returned, but readers may see the old object — typically <1 s, tail to 10 s |
| LIST with prefix after a PUT | eventual | the PUT returns 200 OK, but the key may be missing from LIST — minutes in pathological cases |

Bug class: code that LISTs to discover what it just PUT will sometimes find nothing. This is exactly what the Hadoop FileOutputCommitter does — and why it broke on S3 before S3Guard existed.
The three operations had three different windows. PUT-new-key was always safe. PUT-overwrite would briefly serve the old bytes. LIST after PUT could omit the just-written key for the longest. Most production failures lived in the third row.

The third row is the one that hurt. Hadoop's FileOutputCommitter v1 worked by writing each task's output to a _temporary/<taskid>/ prefix and, on commit, listing that prefix to find the files and renaming them into the final output directory. On HDFS that loop was atomic: list returned exactly the files the task had written, and rename was a metadata operation. On S3 it became a slow, racy disaster — list might omit a file the task had already PUT, rename-by-copy could pick up an old version of an overwritten file, and the cleanup of _temporary/ was racy too. A Spark job at a Bengaluru analytics shop in 2018 would silently drop 0.1% of its output rows because of this, with no exception thrown anywhere; the only signal was that the daily revenue total was ₹4 lakh short and nobody knew why.

Why "no exception thrown" is the hallmark of an eventual-consistency bug: the storage system returned 200 OK on every operation. The PUT succeeded. The subsequent LIST returned a result. The rename-by-copy succeeded. Every individual call was correct. The composition of "PUT then LIST then rename" was wrong, because LIST didn't see what PUT had just done. There is no error code in the protocol that signals "you raced your own write". You discover the bug by reconciling counts at the bottom of the pipeline, days later.

The four classic bug shapes

Every eventual-consistency bug on object storage you will read about — in old Hadoop JIRAs, in Iceberg v1 design docs, in Netflix engineering posts — fits one of four shapes. Naming the shapes is most of the diagnosis work.

Shape 1: Read-your-own-write fails inside the same job

Worker writes manifest_v2.json, then immediately reads it to validate. The read returns the old manifest (or a 404, if a GET had been issued before the key first existed, poisoning S3's negative cache). The worker either acts on stale data or crashes. Pattern: any code that does put(); get() on the same key.

Shape 2: List-after-write omits the just-written key

Driver writes a part-file s3://bucket/output/part-00042.parquet. After all writers finish, driver does list(s3://bucket/output/) to find every part. The list omits part-00042.parquet. The driver believes there are 41 part-files and ships the result. Bug: the output is 1/42 short. Pattern: any commit protocol that uses LIST to discover what was written.

Shape 3: Overwrite returns stale bytes to a concurrent reader

Compaction job writes a new version of payments_2026_04_25.parquet (overwriting the old one). A query that started 2 seconds earlier reads the old version partway through, the new version on a retried range request — splicing two incompatible row groups together. Pattern: any pipeline where overwrite and read run concurrently on the same key.

Shape 4: Negative cache poisoning

Worker tries to GET manifest_v3.json, gets 404, caches "not found" locally. A second worker, racing, writes manifest_v3.json successfully. The first worker, on retry, still gets 404 from a different S3 frontend that hasn't seen the write yet, and hits its negative cache. Pattern: any client that caches negative GETs (DNS-style) without TTL.

# ec_bug_shapes.py — simulate the classic eventual-consistency bug shapes
# against a toy object store that returns "old" reads for a configurable window.
# Shapes 1-3 are shown here; Shape 4 lives in the client, not the store.
# This is what production S3 looked like to a Hadoop committer pre-2020.

import threading, time, random

class EventuallyConsistentStore:
    def __init__(self, window_ms=2000):
        self.history = {}            # key -> list of (timestamp, bytes)
        self.window_ms = window_ms
        self.lock = threading.Lock()

    def put(self, key, value):
        now = time.time()
        with self.lock:
            self.history.setdefault(key, []).append((now, value))
        return "200 OK"

    def get(self, key):
        # A reader may, with probability decaying over time, see the
        # version-before-the-latest if the latest was written recently.
        now = time.time()
        with self.lock:
            versions = self.history.get(key, [])
        if not versions:
            return None
        latest_ts, latest_val = versions[-1]
        age_ms = (now - latest_ts) * 1000
        if age_ms < self.window_ms and len(versions) >= 2 and random.random() < 0.4:
            return versions[-2][1]   # return the previous version
        return latest_val

    def list_prefix(self, prefix):
        # LIST may omit a recently-PUT key with the same probability decay.
        now = time.time()
        with self.lock:
            keys = []
            for k, vs in self.history.items():
                if not k.startswith(prefix):
                    continue
                latest_ts = vs[-1][0]
                age_ms = (now - latest_ts) * 1000
                if age_ms < self.window_ms and random.random() < 0.4:
                    continue
                keys.append(k)
        return sorted(keys)

# --- Shape 1: read-your-own-write ---
s = EventuallyConsistentStore(window_ms=2000)
s.put("manifest_v1.json", b"snapshot=1")
s.put("manifest_v1.json", b"snapshot=2")     # overwrite
print("Shape 1 (RYW):", s.get("manifest_v1.json"))   # may print snapshot=1

# --- Shape 2: list-after-write ---
for i in range(42):
    s.put(f"output/part-{i:05d}.parquet", b"...")
seen = s.list_prefix("output/")
print(f"Shape 2 (LAW): wrote 42 parts, LIST returned {len(seen)}")

# --- Shape 3: overwrite during concurrent read ---
def reader(out):
    for _ in range(5):
        out.append(s.get("hot_file.parquet"))
        time.sleep(0.1)

s.put("hot_file.parquet", b"v1")
seen = []
t = threading.Thread(target=reader, args=(seen,))
t.start()
time.sleep(0.05)
s.put("hot_file.parquet", b"v2")               # compaction overwrites
t.join()
print(f"Shape 3 (overwrite race): reader saw versions {set(seen)}")

# Sample run:
# Shape 1 (RYW): b'snapshot=1'
# Shape 2 (LAW): wrote 42 parts, LIST returned 26
# Shape 3 (overwrite race): reader saw versions {b'v1', b'v2'}

The script is a toy — real S3 had subtler timing — but the shapes are exact.
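Shape 4 is absent from ec_bug_shapes.py because it lives in the client, not the store. A standalone sketch (class names here are hypothetical, not from any real library); the fix is a TTL on the negative cache:

```python
class FlakyStore:
    """Toy store: a key's first few reads after creation return None (404),
    simulating a frontend that hasn't seen the write yet."""
    def __init__(self):
        self.data = {}
        self.lag = {}

    def put(self, key, value, visible_after_misses=1):
        self.data[key] = value
        self.lag[key] = visible_after_misses

    def get(self, key):
        if key not in self.data:
            return None
        if self.lag[key] > 0:
            self.lag[key] -= 1
            return None                      # stale frontend: 404
        return self.data[key]

class NegativeCachingClient:
    """The bug: caches 'not found' with no TTL and never asks again."""
    def __init__(self, store):
        self.store = store
        self.not_found = set()

    def get(self, key):
        if key in self.not_found:
            return None                      # poisoned: the store is never consulted
        value = self.store.get(key)
        if value is None:
            self.not_found.add(key)
        return value

store = FlakyStore()
client = NegativeCachingClient(store)
client.get("manifest_v3.json")               # 404, cached as missing
store.put("manifest_v3.json", b"snapshot=3") # a racing worker writes it
print(client.get("manifest_v3.json"))        # None: poisoned, forever
print(store.get("manifest_v3.json"))         # None: frontend lag, one miss left
print(store.get("manifest_v3.json"))         # b'snapshot=3': the store converged
```

The store eventually converges; the client never does. That asymmetry is why Shape 4 bugs outlive the consistency window itself.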

The defenses the lakehouse built — and which are still load-bearing

Faced with that bug class, the table-format community built three families of defenses. Each has a footprint in the code we read today.

| Defense | What it does | Cost | Still needed in 2026? |
| --- | --- | --- | --- |
| External metastore (S3Guard, DynamoDB pointer) | Authoritative list of files lives in DynamoDB or HMS, not in S3 LIST | Extra system, extra failure mode | No on AWS S3; yes on GCS/Azure for some operations |
| Atomic-rename-via-copy (FileOutputCommitter v2, S3A committers) | Tasks write to a unique staging key; commit copies to the final key (PUT-new-key is the only strongly consistent path) | Extra copy = 2× write bandwidth | Largely obsolete on S3; still used on weaker substrates |
| Manifest-pointer commit (Iceberg, Delta, Hudi) | A single atomic operation (put-if-absent on a metadata key) decides what the table "is" | Requires a CAS primitive | Yes — this is the lakehouse commit protocol, regardless of substrate consistency |

The third row is the one that survived the consistency flip and became the durable design. The first two were workarounds for a problem that mostly went away.

S3Guard is the most instructive example of a workaround that did its job and was then retired. Hortonworks engineers shipped it in 2017: a Hadoop S3A filesystem extension that wrote every PUT and DELETE to a DynamoDB table, then served LIST and HEAD from DynamoDB instead of S3. List-after-write became immediate. Overwrite-then-read became consistent. The cost was an extra DynamoDB table to operate, an extra failure mode (what if DynamoDB falls behind S3?), and a real bill — at one Indian e-commerce platform that processed ~50 TB of clickstream daily, S3Guard's DynamoDB cost ran ₹3.5 lakh per month for a 1 KB-per-key metadata footprint. After December 2020 the Hadoop project deprecated S3Guard and customers ripped it out; the S3A code today keeps only a stub that rejects S3Guard configuration.

The atomic-rename-via-copy committer is more deeply embedded. Hadoop's FileOutputCommitter v1 was the broken one (list-then-rename on S3). The S3A committer family — Magic, Directory, Partitioned — uses a different trick: nothing in the commit path ever LISTs or renames. Each task writes to a unique location (s3://bucket/_magic/<taskid>/part.parquet for the Magic committer), and the final objects are created by completing multipart uploads addressed to their final keys, a single atomic call per file at job commit. Because creating a new key was strongly consistent even pre-2020, that step is safe. (FileOutputCommitter v2, the interim fix, instead renamed each file via CopyObject: correct, because the destination is a new key, but at the cost of writing every byte twice.)

Three generations of S3 commit protocol, in three lanes:

1. FileOutputCommitter v1 (Hadoop; broken on S3): tasks PUT to _temporary/; the driver LISTs _temporary/ (which may omit a recent PUT), renames each file into place (rename-by-copy, itself racy), then cleans up _temporary/. Failure mode: silently dropped files.
2. S3A Magic Committer (workaround, ~2017): each task writes only brand-new keys (PUT-new = strong); one atomic call per file promotes the output to its final key; staging state is deleted. Correct, at the price of extra write work.
3. Iceberg / Delta manifest (durable design, 2019+): tasks PUT data files to unique keys; the committer PUTs manifest_vN.json listing them; a CAS on the table pointer (a single atomic op) makes the snapshot live. Survives the consistency flip.

The first lane was broken by S3's pre-2020 semantics. The second was a workaround that traded extra write work for correctness. The third did not depend on overwrite-or-list consistency at all, only on the existence of a single atomic compare-and-swap on one metadata key — which is why it remains the design even now that S3 is strongly consistent.
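The middle lane's discipline (tasks invent unique keys and report them; the driver promotes only what it was told about and never LISTs) can be sketched against a toy store. All names are illustrative, and promotion is modeled as a server-side copy for simplicity:

```python
import uuid

class Store:
    def __init__(self):
        self.objects = {}

    def put_new(self, key, data):
        # Every write on this commit path targets a brand-new key: the one
        # operation that was strongly consistent even on pre-2020 S3.
        assert key not in self.objects
        self.objects[key] = data

    def copy(self, src, dst):
        self.put_new(dst, self.objects[src])   # stands in for CopyObject

def task_write(store, data):
    staging = f"_staging/{uuid.uuid4()}/part.parquet"
    store.put_new(staging, data)               # unique key: no overwrite
    return staging                             # the task REPORTS its key

def job_commit(store, staged_keys, final_dir):
    # The driver promotes exactly the keys tasks reported; it never LISTs,
    # so list-after-write lag cannot silently drop output.
    for i, key in enumerate(staged_keys):
        store.copy(key, f"{final_dir}/part-{i:05d}.parquet")

store = Store()
staged = [task_write(store, b"rows") for _ in range(3)]
job_commit(store, staged, "output")
print(sorted(k for k in store.objects if k.startswith("output/")))
# ['output/part-00000.parquet', 'output/part-00001.parquet', 'output/part-00002.parquet']
```

The load-bearing design choice is that the set of files is communicated in-band (task to driver) instead of being rediscovered through LIST.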

The Iceberg/Delta design doesn't appear in this picture as "the third workaround". It is the only one of the three that does not rely on overwrite or LIST consistency. Every Iceberg commit is a write of a new metadata file (PUT-new-key, always strongly consistent) and an atomic compare-and-swap of the table pointer. Whether the substrate is strongly consistent S3, Azure Blob with its LIST quirks, or a MinIO deployment with oddities of its own, the protocol works. Why this is more than a historical observation: the design that survived was the one that demanded the least from the storage system. It assumed nothing about LIST timing, nothing about overwrite visibility, nothing about negative caching. It assumed only that PUT-new-key returns 200 OK when the bytes are durable, and that a single object can be atomically replaced (which Iceberg achieves via the metastore catalog, not via S3 itself). That minimality is why the same Iceberg code runs on every object store on the planet.
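Stripped to its core, the protocol needs exactly one primitive. A sketch, assuming a toy Catalog class standing in for the metastore and a plain dict for the object store (all names hypothetical, not Iceberg's API):

```python
import threading

class Catalog:
    """Stands in for the metastore: one pointer per table, updated by CAS."""
    def __init__(self):
        self._ptr = {}
        self._lock = threading.Lock()

    def current(self, table):
        with self._lock:
            return self._ptr.get(table)

    def cas(self, table, expected, new):
        # The single atomic operation the whole protocol depends on.
        with self._lock:
            if self._ptr.get(table) != expected:
                return False             # another writer committed first
            self._ptr[table] = new
            return True

def commit(catalog, store, table, data_files, max_retries=3):
    """Optimistic commit: PUT new metadata (a brand-new key, always safe),
    then CAS the pointer; on a lost race, re-read and retry."""
    for _ in range(max_retries):
        base = catalog.current(table)    # e.g. "metadata/v3.json" or None
        version = 0 if base is None else int(base.split("/v")[1].split(".")[0])
        new_key = f"metadata/v{version + 1}.json"
        store[new_key] = {"base": base, "adds": data_files}   # PUT-new-key
        if catalog.cas(table, base, new_key):
            return new_key
    raise RuntimeError("commit contention: retries exhausted")

store, catalog = {}, Catalog()
print(commit(catalog, store, "payments", ["part-00001.parquet"]))  # metadata/v1.json
print(commit(catalog, store, "payments", ["part-00002.parquet"]))  # metadata/v2.json
print(catalog.cas("payments", None, "metadata/v9.json"))           # False: stale base
```

A lost CAS corrupts nothing: the orphaned metadata file is a new key nobody points at, which is exactly why the protocol needs nothing from LIST or overwrite consistency.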

What the flip to strong consistency changed — and what it did not

December 2020's announcement was that all S3 reads — for new keys, overwrites, and LIST — became strongly consistent. The visible effect was that S3Guard could be turned off, the Magic Committer's "we copy because LIST is racy" rationale weakened, and a class of tickets in the Spark/Hadoop trackers were closed as "no longer reproducible".

Three things did not change.

First, S3 still does not offer cross-object atomicity. A write of two objects, A and B, is not atomic. A reader can observe A-new and B-old. This was true before and after. Every transactional table format builds on top of single-object atomicity for that reason.

Second, overwrite is still the wrong design, even though it is now consistent. A compaction that overwrites a hot file in place would still race readers — not because of stale bytes, but because a reader holding a 16 MB byte-range request mid-flight will get the new ETag on its retried range and detect mismatch. Iceberg compaction therefore writes new files and updates manifests; it never overwrites. The pattern outlived the bug it was designed for.
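The reader-side failure can be made concrete: pin the ETag of the first range request and refuse any later range served under a different ETag. A toy sketch (names illustrative); on real S3 the reader sends If-Match and receives HTTP 412 on mismatch:

```python
import hashlib

class Store:
    """Toy store where each object version carries an ETag (content hash)."""
    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = (hashlib.md5(data).hexdigest(), data)

    def get_range(self, key, start, end, if_match=None):
        etag, data = self.objects[key]
        if if_match is not None and if_match != etag:
            # Real S3 answers 412 Precondition Failed here.
            raise RuntimeError("412: object changed between range requests")
        return etag, data[start:end]

s = Store()
s.put("payments.parquet", b"A" * 32)
etag, first_half = s.get_range("payments.parquet", 0, 16)   # pin the ETag
s.put("payments.parquet", b"B" * 32)                        # compaction overwrites in place
try:
    s.get_range("payments.parquet", 16, 32, if_match=etag)  # next range, same read
except RuntimeError as exc:
    print(exc)  # the reader fails loudly instead of splicing two versions
```

Failing loudly is the best an ETag check can do; Iceberg avoids even that failure by never overwriting, so a pinned file stays readable for as long as the snapshot that references it is retained.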

Third, other object stores are not S3. Azure Blob storage offers strong consistency for single-object operations but eventual consistency for some LIST scenarios. Google Cloud Storage is strongly consistent for single objects but reorders LIST results in ways that surprise. MinIO — the open-source S3-compatible store many Indian companies run on-prem — has had eventual-consistency regressions in specific multi-node configurations as recently as 2023. Iceberg, Delta, and Hudi continue to ship with the consistency-paranoid commit path because the binding contract is "table format works on every object store", not "table format works on AWS S3".

Going deeper

Why "10 seconds" was the real number, not a worst case

The empirical bound on pre-2020 S3 overwrite visibility was usually under 1 second, with a long tail to 10 seconds and occasional outliers under regional load. AWS never published the bound because it varied by partition, region, and time of day. Werner Vogels' 2007 essay was deliberately silent on numbers — the contract was eventual consistency, full stop. Engineers who built test suites against "should be visible within 5 seconds" were calibrating against observed behaviour, not against any guarantee, and several Hadoop tests broke during a 2018 us-east-1 incident where the tail stretched to 90 seconds. The lesson the format-designers internalised: never depend on a timing bound that the storage contract doesn't promise.

What the Iceberg V1 spec said about consistency that V2 quietly dropped

Iceberg's original specification (2018, Netflix) explicitly required that the catalog support an atomic "swap pointer" operation, and noted that "the data files and manifest files only need to be readable after they are written" — phrased that way because S3's eventual consistency meant "readable after written" was itself a non-trivial requirement. The V2 spec (2021) drops that phrasing because strong consistency on S3 had landed by then. Reading the diff between V1 and V2 is a tour through which assumptions changed and which did not. The metastore-CAS requirement did not change, because that was never an S3 problem; it was an S3-doesn't-offer-CAS problem, which is still true except for the new conditional-PUT API.

The 2018 Spark-on-S3 incident at a Pune analytics shop

A Pune-based analytics company running daily Spark jobs on a 200-node EMR cluster against S3 had a 0.8% row-loss rate on a 2 TB daily ingest — about ₹4 lakh of mis-attributed revenue per quarter. The team spent three months tracking it before finding the FileOutputCommitter v1 LIST race. The fix was to switch to the S3A Directory Committer; the row-loss disappeared overnight. The incident is not in any post-mortem repository because the company never published it, but the pattern repeated at dozens of Indian shops between 2015 and 2020. The cost of the bug class, summed across the industry, is genuinely uncountable.

Conditional writes on S3 (November 2024) and what they enable

AWS finally shipped If-Match and If-None-Match conditional writes for S3 in November 2024. This is compare-and-swap on an S3 object, the missing primitive that forced every lakehouse to rely on a metastore for the table-pointer CAS. The implication for Iceberg and Delta is real but slow-rolling: the projects can now in principle store the table pointer on S3 itself, removing the metastore from the critical path. Adoption will take years because the catalog ecosystem (Hive Metastore, AWS Glue, Unity Catalog, Polaris) does much more than just CAS — schema management, access control, lineage. But the door is open for a "metastore-less Iceberg" variant on AWS, and Snowflake and Databricks are both rumoured to be working on it.
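A sketch of what the new primitive enables, using a toy store whose put accepts an if_none_match flag; on real S3 this corresponds to a PUT with the If-None-Match: * header (the retry here is simplified: a real committer would re-read the winning metadata and rebase before trying the next version):

```python
class ConditionalStore:
    """Toy store supporting put-if-absent, the CAS flavour S3 gained in 2024."""
    def __init__(self):
        self.objects = {}

    def put(self, key, value, if_none_match=False):
        if if_none_match and key in self.objects:
            raise KeyError("412 Precondition Failed")  # another writer won
        self.objects[key] = value

def commit(store, version_guess, payload):
    # First writer of metadata/v{N}.json wins; a loser bumps the version.
    # (Simplified: a real committer re-reads the winner and rebases first.)
    v = version_guess
    while True:
        try:
            store.put(f"metadata/v{v}.json", payload, if_none_match=True)
            return v
        except KeyError:
            v += 1

store = ConditionalStore()
print(commit(store, 1, b"snapshot A"))   # 1
print(commit(store, 1, b"snapshot B"))   # 2: lost the race for v1, won v2
```

No metastore appears in the critical path of this sketch; the table pointer is implicit in the highest-numbered metadata file that exists.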

Where this leads next

The next chapter — /wiki/manifest-files-and-the-commit-protocol — picks up exactly here, with the metadata-pointer design that survived the consistency flip and remains the lakehouse's transactional spine. After that, /wiki/concurrent-writers-without-stepping-on-each-other shows how multiple writers commit to the same Iceberg table without a coordinator — using exactly the optimistic-concurrency-on-a-single-CAS-pointer pattern that this chapter set up.

For readers coming from the OLTP side, /wiki/object-storage-as-a-primary-store-s3-as-a-database is the chapter just before this one — it covers what S3 gives and refuses, of which the consistency model is one row.

References