State files and checkpoints: the poor man's job queue
At 03:42 a Flipkart catalogue-sync job that pulls 14 lakh SKU updates per night from a vendor SFTP into a Postgres staging table dies on file 8,217 of 14,200 with a ConnectionResetError. The on-call engineer Aditi has two choices when she logs in at 04:10. She can rerun the whole job — which means re-pulling 8,216 files she already processed, hammering the vendor, and finishing two hours past the 06:00 SLA when the buyer-app's "what's new" carousel needs to be ready. Or she can resume from file 8,218 — but only if the job wrote down somewhere that 8,217 was where it died. The difference between option A and option B is one line of JSON on disk that nobody bothered to write. That one line is what this chapter is about.
A checkpoint file is the smallest possible piece of state that lets a long-running job recover from a crash without re-doing committed work. It is a flat text file. It is updated atomically. It is read on startup. It is the first thing every batch pipeline grows once it survives its first 03:00 outage, and it is the conceptual seed of every scheduler, every Kafka offset, every Spark _SUCCESS marker, every Airflow XCom, and every Flink savepoint that the next 127 chapters will build on.
A checkpoint is a single piece of durable state — usually a file with one line of JSON — recording the last unit of work the pipeline successfully committed. On restart, the pipeline reads the checkpoint, skips everything before it, and resumes. Done correctly (atomic write, written after the data commits, keyed to the data not the wall clock), it converts a job that must complete in one shot into one that can crash, restart, and converge. Done wrong (written before the commit, written non-atomically, keyed to wall time) it ships duplicates or skips data.
The checkpoint pattern is the conceptual core of every job queue, every scheduler, every streaming framework that follows it.
What a checkpoint is, and the four ways it can break
A checkpoint records "I finished everything up to here." The "here" is some position in the input — a file index, a row number, a Kafka offset, a database id cursor, a wall-clock window boundary. The "I finished" is a claim that the pipeline has already committed the corresponding output. Both halves matter.
There are exactly four ways to get this wrong, and the production-incident folder of any data team will eventually contain examples of all four.
Order inversion. Write the checkpoint before committing the data. The job crashes between checkpoint and commit; the checkpoint says "8217 is done" but the data isn't there. Restart skips file 8217 forever. This is a silent data-loss bug — nothing alerts because the next run sees a clean checkpoint and proceeds happily.
Non-atomic write. Update the checkpoint in place with a f.write() that gets half-flushed before a kill -9. The next read returns garbage or a truncated value, the parser raises, and now the job won't even start. Recovery requires a human to inspect and hand-edit the file.
Wall-clock keying. Store last_run_at = "2026-04-25T03:42:11+05:30" and resume by querying source data after that timestamp. Clock skew between machines, daylight-saving transitions, sources whose timestamps are themselves stale — all turn this into a leaking sieve. The right key is the input position itself (file index, offset, primary-key cursor), not when the pipeline last ran.
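The contrast is easiest to see as code. Below is a minimal sketch of a position-keyed cursor; sqlite3 stands in for the real source, and the table name source_rows is illustrative, not from the chapter's pipeline:

```python
# Resume by source position (a primary-key cursor), not by wall-clock time.
# Deterministic regardless of clock skew or stale source timestamps.
import sqlite3

def fetch_after_cursor(conn: sqlite3.Connection, cursor: int, limit: int = 100):
    """Return the next batch after `cursor`, plus the new cursor to checkpoint."""
    rows = conn.execute(
        "SELECT id, payload FROM source_rows WHERE id > ? ORDER BY id LIMIT ?",
        (cursor, limit),
    ).fetchall()
    return rows, (rows[-1][0] if rows else cursor)  # new cursor = last id seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_rows (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO source_rows VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 8)])

# Checkpoint says "everything through id 3 is committed": fetch resumes at 4.
rows, new_cursor = fetch_after_cursor(conn, cursor=3, limit=2)
```

The cursor the checkpoint stores is the last id actually returned, so a crash-and-restart replays at most one in-flight batch, never a clock-skew-sized window.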
Single global checkpoint for parallel work. One file recording "I finished up to here" implicitly serialises the job. The moment two workers process two partitions in parallel, neither can write the global checkpoint correctly because "up to here" is ambiguous. The fix is one checkpoint per partition, not one global checkpoint with a lock.
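The per-partition fix is small. A sketch, with illustrative file names (one JSON file per partition, each written with the same atomic-rename discipline the next section covers):

```python
# One checkpoint file per partition: each worker advances its own offset
# independently, so there is no global lock and no ambiguous "up to here".
import json, os, tempfile

def ckpt_path(state_dir: str, partition: int) -> str:
    return os.path.join(state_dir, f"ckpt-p{partition:05d}.json")

def write_partition_ckpt(state_dir: str, partition: int, offset: int) -> None:
    fd, tmp = tempfile.mkstemp(dir=state_dir, prefix=".ckpt-")
    with os.fdopen(fd, "w") as f:
        json.dump({"partition": partition, "offset": offset}, f)
        f.flush(); os.fsync(f.fileno())
    os.replace(tmp, ckpt_path(state_dir, partition))  # atomic per-partition swap

def read_partition_ckpt(state_dir: str, partition: int) -> int:
    path = ckpt_path(state_dir, partition)
    if not os.path.exists(path):
        return -1  # partition never checkpointed: start from the beginning
    with open(path) as f:
        return json.load(f)["offset"]
```

Two workers on partitions 0 and 1 can now checkpoint concurrently without coordinating; only partition reassignment needs any protocol at all.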
Why the order matters so much: a pipeline that writes the data first and the checkpoint second can crash between those two steps, leaving the data committed but the checkpoint stale. The next run re-processes that batch — a duplicate, which the dedup-key pattern from chapter 6 handles. A pipeline that writes the checkpoint first can crash leaving the checkpoint advanced but the data missing. The next run skips the batch entirely — silent data loss, which no pattern recovers because there is no signal that anything went wrong. Duplicate-and-dedupe is recoverable; skip-and-pretend is not. Always commit the data first.
Atomic write: the tempfile + os.replace pattern
The checkpoint file is small (a few hundred bytes), but the write must be atomic — the file on disk is either the old version or the new version, never a half-written hybrid. Filesystems give you exactly one primitive that guarantees this: rename() (or its Python wrapper os.replace). Every other strategy is a subtle race condition.
The wrong way:
with open("ckpt.json", "w") as f:
    json.dump({"last_file_idx": idx}, f)  # crash here = corrupt file
If the process is killed between open (which truncates) and the next flush, the file on disk is empty or truncated. The next read fails to parse and the job won't restart.
The right way:
import json, os, tempfile

def write_checkpoint(state: dict, path: str = "ckpt.json") -> None:
    dirpath = os.path.dirname(os.path.abspath(path)) or "."
    fd, tmp = tempfile.mkstemp(prefix=".ckpt-", dir=dirpath, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, sort_keys=True, separators=(",", ":"))
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk, not just OS cache
        os.replace(tmp, path)     # atomic rename — single inode swap
    except Exception:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
Three steps, in order: write to a temp file in the same directory, fsync to disk, atomic rename over the target. The rename is what POSIX guarantees as atomic on the same filesystem — at any instant, a reader either sees the old ckpt.json or the new one. The temp file lives in the same directory because cross-filesystem renames are not atomic (they fall back to copy-then-delete).
The os.fsync is the load-bearing line that engineers most often skip. Without it, os.replace is atomic with respect to other processes on the same machine, but the bytes might still be in the OS page cache when the kernel panics or the disk's write cache is dropped on a power loss. The fsync forces the bytes through to the platter (or SSD's persistent media) before the rename happens. On AWS EBS-backed EC2 instances this matters less because EBS itself has durability guarantees; on local NVMe, on bare-metal Kubernetes, on-prem servers, it matters enormously. The cost is one or two milliseconds per checkpoint — utterly invisible in a job that processes a file every few seconds.
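The LWN reference at the end of the chapter notes one further subtlety: on some filesystems, the rename itself is not durable until the containing directory is also fsynced. A hedged sketch of that extra step (POSIX-only, since directories cannot be opened this way on Windows; the function name is ours):

```python
import os

def fsync_dir(dirpath: str) -> None:
    """Flush the directory entry after an os.replace, so the rename itself
    survives a power loss. POSIX-only: works on Linux/macOS, not Windows."""
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

Calling fsync_dir(os.path.dirname(path)) right after the os.replace closes the last durability gap; whether you need it depends on the filesystem's journalling guarantees.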
Why tempfile.mkstemp and not open(path + ".tmp", "w"): two pipeline replicas (perhaps a stuck process and its replacement after a Kubernetes pod restart) would both write to ckpt.json.tmp; one would win the rename, but the loser might still be holding the file open when the winner overwrites it, corrupting the loser's own future write. mkstemp generates a unique temp filename for every caller, eliminating that collision class entirely.
A complete, runnable resumable pipeline
The pattern is small enough to fit in 50 lines and big enough to actually use. The example below pulls SKU updates from a directory, writes them to Postgres, and checkpoints after each file. Crash at any point and the next run resumes; double-run from clean state and it is idempotent.
# resumable_pipeline.py — checkpoint-driven file-by-file ingestion.
import json, os, tempfile, glob, hashlib
from typing import Iterator

import psycopg2

CKPT = "state/skus.ckpt.json"
INPUT_DIR = "/data/vendor/skus"
CONN = "host=db.flipkart.internal dbname=catalogue user=etl"

def read_checkpoint() -> dict:
    if not os.path.exists(CKPT):
        return {"last_file": "", "last_file_idx": -1}
    with open(CKPT) as f:
        return json.load(f)

def write_checkpoint(state: dict) -> None:
    os.makedirs(os.path.dirname(CKPT) or ".", exist_ok=True)
    fd, tmp = tempfile.mkstemp(prefix=".ckpt-", dir=os.path.dirname(CKPT) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, sort_keys=True, separators=(",", ":"))
        f.flush(); os.fsync(f.fileno())
    os.replace(tmp, CKPT)

def list_files() -> list[str]:
    return sorted(glob.glob(os.path.join(INPUT_DIR, "skus_*.json")))

def files_to_process(all_files: list[str], ckpt: dict) -> Iterator[tuple[int, str]]:
    for idx, path in enumerate(all_files):
        if idx <= ckpt["last_file_idx"]:
            continue  # already done in a previous run
        yield idx, path

def write_skus(rows: list[dict], conn) -> int:
    sql = """INSERT INTO skus_staging(sku_id, dedup_key, payload, run_date)
             VALUES (%(sku_id)s, %(dedup_key)s, %(payload)s, CURRENT_DATE)
             ON CONFLICT (dedup_key) DO NOTHING"""
    inserted = 0
    with conn.cursor() as cur:
        for r in rows:
            r["dedup_key"] = hashlib.sha256(
                json.dumps(r, sort_keys=True).encode()).hexdigest()
            r["payload"] = json.dumps(r)
            cur.execute(sql, r)
            inserted += cur.rowcount  # accumulate: rowcount is per-statement
    return inserted

def run() -> None:
    ckpt = read_checkpoint()
    all_files = list_files()
    print(f"resume from idx={ckpt['last_file_idx'] + 1}, total={len(all_files)}")
    with psycopg2.connect(CONN) as conn:
        for idx, path in files_to_process(all_files, ckpt):
            with open(path) as f:
                rows = json.load(f)
            n = write_skus(rows, conn)
            conn.commit()                                  # commit data FIRST
            write_checkpoint({"last_file": os.path.basename(path),
                              "last_file_idx": idx})       # checkpoint SECOND
            print(f"  [{idx + 1}/{len(all_files)}] "
                  f"{os.path.basename(path)}: {n} new")

if __name__ == "__main__":
    run()
Sample run, with a deliberate kill -9 between files 3 and 4, then a restart:
$ python resumable_pipeline.py
resume from idx=0, total=14200
  [1/14200] skus_00000.json: 1247 new
  [2/14200] skus_00001.json: 1198 new
  [3/14200] skus_00002.json: 1301 new
(process killed with kill -9 here)
$ python resumable_pipeline.py
resume from idx=3, total=14200
  [4/14200] skus_00003.json: 1284 new
  [5/14200] skus_00004.json: 1192 new
  ...
The four load-bearing lines are worth dwelling on.
if idx <= ckpt["last_file_idx"]: continue is the resume logic. Single comparison; no fancy state machine. The checkpoint stores the last completed index, so the next file to process is last_file_idx + 1. Off-by-one bugs here are common — using < instead of <= would re-process the last completed file and ship duplicates (the dedup-key pattern catches them, but observability still shows 0 new for that file, confusing on-call).
conn.commit() then write_checkpoint(...), in that order, is the order-inversion safety from §1. If the process dies between these two lines, the data is in Postgres but the checkpoint hasn't advanced; the next run re-processes the file and the dedup keys cause ON CONFLICT DO NOTHING to silently skip the duplicates. The pipeline converges correctly. If the order were reversed, a crash between the two lines would advance the checkpoint without committing the data, and that file's rows would be lost forever — no signal, no alarm.
write_checkpoint(...) uses os.replace internally — the atomic-rename pattern from the previous section. Even if the process is killed during the checkpoint write, the on-disk file is either the old checkpoint or the new one, never corrupt.
write_skus(...) is the chapter-6 idempotent write, with ON CONFLICT (dedup_key) DO NOTHING. The two patterns compose: the checkpoint cuts re-work, and the dedup key handles the unavoidable boundary case where a crash happened between commit and checkpoint write.
Why the dedup-key safety net is non-negotiable even with a perfectly-ordered checkpoint: between conn.commit() and write_checkpoint(...) there is a non-zero window — usually one or two milliseconds — during which the data is durably in Postgres but the checkpoint has not advanced. A kill -9 or a power loss inside that window is rare but not impossible, and over a year of nightly runs it will eventually happen. Without ON CONFLICT, the next restart double-inserts that one batch. With ON CONFLICT, the duplicates land on existing keys and are silently rejected. The checkpoint cuts 99.99% of redundant work; the dedup key handles the 0.01% the checkpoint cannot.
When checkpoints are the wrong primitive
The single-file checkpoint scales to about a billion rows of sequential input or a few hundred parallel partitions, beyond which it becomes a bottleneck or a coordination problem. Knowing when to outgrow it is as important as knowing how to use it.
A single file works for any sequential ingestion that finishes in one machine's working day. Most of Build 1–3's pipelines fit here. Once the input is partitioned and you want workers running in parallel, the single file becomes a bottleneck (every worker contending for the same lock) or a correctness hazard (workers overwriting each other's progress). The fix is one checkpoint file per partition, written by the worker that owns that partition. Spark Structured Streaming's _spark_metadata directory and Flink's per-operator checkpoint files are this pattern at production scale.
Beyond a few hundred partitions, even per-partition files become a coordination problem: who decides which partitions exist, who reads which checkpoints, what happens when a partition is reassigned mid-run. At that point the checkpoint moves into a broker-managed store — Kafka's __consumer_offsets topic, Kinesis's per-shard checkpointer in DynamoDB, Pulsar's per-subscription cursor. The mechanics are the same (record the last committed position, read on startup, write atomically) but the storage is now a dedicated highly-available service rather than a flat file. Build 7 and Build 8 are entirely about this regime.
The smell that says "outgrow the single-file checkpoint" is one of three: (1) the job runs longer than one machine's reasonable lifetime (24h+), (2) parallelism is non-trivial (>4 workers wanting to advance in parallel), (3) the input is unbounded (a stream, not a file directory). Until you hit one of those, a single JSON file beats every framework on simplicity, debuggability, and recoverability.
Common confusions
- "A checkpoint is the same as a database transaction." A transaction is atomic over a single commit; a checkpoint is durable across many transactions. The data write is the transaction; the checkpoint is the bookmark recording which transaction was last successful. They compose: the transaction makes "did this batch commit?" definite, the checkpoint makes "where was I before the crash?" definite. Either alone is insufficient.
- "A _SUCCESS marker file is the same as a checkpoint." Almost — a _SUCCESS file is a checkpoint with one bit of state ("the whole job finished"). It works for batch jobs that either complete or restart from scratch. It does not work for resumable jobs that need to know where to restart, only whether to restart. The Spark _SUCCESS marker is a degenerate checkpoint; a real resumable pipeline needs the position too.
- "I can use the database's MAX(updated_at) as the checkpoint, no separate file needed." Tempting, and dangerous. MAX(updated_at) reads the destination state to figure out source progress, which means a partial commit (some rows written, some not) advances the apparent checkpoint past where the source actually stopped. The checkpoint must record the source position, not a derived view of the destination.
- "Atomic rename works on every filesystem." No. POSIX guarantees rename atomicity within a single filesystem. Renaming across filesystems (NFS to local, or two different mount points) silently degrades to copy-then-delete, which is not atomic. The temp file must live in the same directory as the target file, on the same filesystem. NFS itself has additional caveats — NFSv3 rename is not atomic across clients without server-side support; NFSv4 is.
- "The checkpoint should record everything I might want to know." No — the checkpoint should be the minimum state needed to resume. Everything else goes in logs or metrics. A bloated checkpoint is slow to write, slow to read, and a magnet for schema-drift bugs (the next version of the code can't parse the older checkpoint format). Keep it to the resume cursor and a timestamp; let observability live elsewhere.
- "My checkpoint write is atomic because I use with open(path, 'w'): ...." False. The with block guarantees the file handle is closed; it does not guarantee the bytes reached disk. A kill -9 mid-write or a kernel panic between close() and the OS flushing the page cache leaves a half-written file. The tempfile + fsync + os.replace sequence is the only correct pattern.
Going deeper
What the checkpoint should and should not contain
The minimum viable checkpoint has exactly one field: the resume cursor. Everything else is observability, and observability belongs in logs, not in the durable state. The Razorpay payments-ledger pipeline, which runs hourly and processes about 4 crore transactions per day across 96 batches, has a checkpoint of exactly two fields: {"last_offset": <int>, "schema_version": "v3"}. The schema-version field is there for one reason — when the checkpoint format itself changes, the new code must refuse to load an old checkpoint rather than misinterpret it.
A common bloat pattern is to store run metadata — the source URL, the destination table, the worker hostname, recent error messages. None of this belongs in the checkpoint. The source URL is configuration. The destination is configuration. The hostname changes every restart. Errors belong in the log file. If the checkpoint contains anything that varies between runs of the same logical job, you have crossed the line from state to logging.
A subtler bloat pattern is to store intermediate state for "performance" — a bloom filter of recently-seen keys, a counter of duplicates skipped, a summary of the last batch. This couples checkpoint correctness to that intermediate state's correctness, and in a partial failure where the intermediate state is wrong but the resume cursor is right, the pipeline silently malfunctions. Either keep the checkpoint to the resume cursor only, or accept that you are now maintaining two pieces of durable state and need an explicit reconciliation protocol between them.
Where to store the checkpoint: local disk, object storage, or a database
The checkpoint must live somewhere that survives the worker dying. A local-disk file works only when the worker's filesystem itself survives crashes — true on a long-lived EC2 with EBS, false on an ephemeral Kubernetes pod with a memory-backed emptyDir. The first time a Kubernetes-deployed pipeline loses its checkpoint to pod replacement, the team learns this distinction.
Three production-grade locations dominate. Object storage (S3, GCS, ADLS) is the default for batch pipelines — the same os.replace pattern works against s3.put_object because S3 PUT is atomic per object, and object-storage durability (11 nines on S3) exceeds anything a single disk offers. The downside is per-write latency (~30–80ms versus ~1ms for a local file) and per-write cost; pipelines that checkpoint per row will run up a meaningful S3 bill. Pipelines that checkpoint per batch are fine.
A small relational database (SQLite for single-machine, Postgres for multi-machine) gives you transactional checkpoint updates that are guaranteed atomic with the data write. The pipeline's data INSERT and the UPDATE pipeline_state SET last_offset = ? happen in the same transaction, eliminating the order-inversion class entirely. Airflow's metadata DB is exactly this pattern at scale; Dagster and Prefect both default to a similar setup. The downside is that the database becomes a critical dependency; if it's down, the pipeline can't checkpoint, and the question becomes whether to block (and miss SLA) or skip checkpointing (and risk data loss on the next crash).
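A minimal sketch of the same-transaction variant, using sqlite3 so it is self-contained (table and job names are illustrative; the Postgres version is structurally identical, with the INSERT and the UPDATE inside one BEGIN/COMMIT):

```python
# Data write and checkpoint advance in ONE transaction: a crash at any point
# leaves both applied or neither, so the order-inversion class cannot occur.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skus_staging (sku_id TEXT PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE pipeline_state (job TEXT PRIMARY KEY, last_offset INTEGER)")
conn.execute("INSERT INTO pipeline_state VALUES ('sku_sync', -1)")
conn.commit()

def commit_batch(conn: sqlite3.Connection, offset: int, rows) -> None:
    with conn:  # sqlite3 context manager: one transaction, rollback on error
        conn.executemany(
            "INSERT OR IGNORE INTO skus_staging VALUES (?, ?)", rows)
        conn.execute(
            "UPDATE pipeline_state SET last_offset = ? WHERE job = 'sku_sync'",
            (offset,))

commit_batch(conn, 0, [("SKU1", "{}"), ("SKU2", "{}")])
commit_batch(conn, 1, [("SKU3", "{}")])
# last_offset and the staged rows can never disagree about which batch landed
```

The INSERT OR IGNORE plays the role of Postgres's ON CONFLICT DO NOTHING; with the checkpoint in the same transaction it is strictly a belt-and-braces measure rather than a correctness requirement.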
A dedicated checkpoint service (Zookeeper, etcd, a Kafka topic, DynamoDB) is what frameworks at scale use. Kafka's consumer offsets live in the __consumer_offsets topic — a special compacted topic that the cluster manages with HA replication. Spark Structured Streaming with a Kafka source can use either a checkpoint directory (object storage) or store offsets directly in Kafka. Flink's RocksDB-backed state with periodic snapshots to S3 is a hybrid — local durability for read-heavy state plus periodic upload to durable object storage for crash recovery. Build 8 walks the full Flink mechanism.
The right choice maps directly to the pipeline's tier: a hobby ETL on one machine uses local disk, a production batch pipeline uses S3 or a Postgres state table, a streaming pipeline at Flipkart-or-PhonePe scale uses broker-managed offsets backed by replicated state.
How long to keep old checkpoints (and why "never delete" is wrong)
A subtle long-running bug is to never garbage-collect old checkpoints. The pipeline keeps every checkpoint it ever wrote, the directory grows to millions of files, the next read_checkpoint() accidentally reads the wrong one, and now restarts resume from an offset months in the past — a backfill the team didn't ask for.
Three retention policies are common. (1) Keep the latest only — the os.replace pattern enforces this naturally, since the rename overwrites. Simplest, no garbage collection needed, but no history if you need to debug. (2) Keep the latest N (e.g. the last 24 hourly checkpoints, the most recent always being ckpt.json, the older ones renamed to ckpt-2026-04-25T03.json). Useful for debugging "when did the offset start drifting?" without unbounded growth. (3) Time-bounded retention (delete checkpoints older than 7 days). Aligns with retention of the underlying data — if the input source itself only retains 7 days of files, a checkpoint older than that is unusable anyway.
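The keep-the-latest-N policy fits in one function. A sketch, with the timestamp passed in explicitly and all file names following the ckpt-&lt;timestamp&gt;.json convention from the text (N and paths are illustrative):

```python
# Keep-the-latest-N retention: every write updates the live ckpt.json AND
# leaves a timestamped archive copy, then garbage-collects beyond the newest N.
import glob, json, os, tempfile

def _atomic_write(state_dir: str, final_name: str, state: dict) -> None:
    fd, tmp = tempfile.mkstemp(dir=state_dir, prefix=".ckpt-")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush(); os.fsync(f.fileno())
    os.replace(tmp, os.path.join(state_dir, final_name))

def write_rotated_ckpt(state_dir: str, state: dict, stamp: str, keep: int = 24) -> None:
    _atomic_write(state_dir, f"ckpt-{stamp}.json", state)  # archive copy
    _atomic_write(state_dir, "ckpt.json", state)           # live pointer
    # ISO timestamps sort lexicographically, so sorted() == chronological
    archives = sorted(glob.glob(os.path.join(state_dir, "ckpt-*.json")))
    for old in archives[:-keep]:                           # GC beyond newest N
        os.unlink(old)
```

Reads still touch only ckpt.json; the archives exist purely for the "when did the offset start drifting?" investigation.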
Zerodha's order-book ingestion, which checkpoints every 10 seconds during market hours, uses pattern 2 with N=4320 (12 hours of history at 10s cadence). The team has fielded exactly two incidents in three years where the historical checkpoints were the difference between a clean recovery and a partial reconstruction; both incidents involved a stuck consumer whose current checkpoint had been overwritten with a corrupted value, and the previous checkpoint was clean.
The two-phase checkpoint for distributed coordination
Single-machine checkpoints have one ordering decision: data first, checkpoint second. Distributed pipelines have a harder problem — multiple workers that must all checkpoint at the same logical point for a downstream consumer to know the state is consistent. The textbook solution is the Chandy-Lamport snapshot algorithm, and Apache Flink's checkpointing protocol is a production implementation.
The mechanism is two-phase. Phase one: a coordinator broadcasts a "checkpoint barrier" with a unique id N to every worker. Workers receiving the barrier flush their in-flight state to durable storage, write a per-worker checkpoint marker tagged with N, and acknowledge the coordinator. Phase two: when all workers have acknowledged, the coordinator writes a single global "checkpoint N complete" marker. Recovery rolls back to the last globally-complete checkpoint — never a partial one where some workers checkpointed N and some didn't.
The same pattern, simplified, appears in Spark Structured Streaming's commit logs (the commits/ directory under the checkpoint root) and in Kafka's transactional commits. The data-engineering reader's takeaway is that distributed checkpointing requires a global agreement protocol on top of per-worker checkpoint files — it cannot be reduced to "every worker writes its own and we'll figure it out later". Build 8 walks the Flink algorithm with the actual barrier-injection mechanics.
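The two-phase handshake is small enough to model in a few lines. A toy simulation of the control flow only (no network, no durable storage; all class and variable names are ours, not Flink's):

```python
# Toy model of two-phase checkpointing: workers ack a barrier id, and the
# coordinator marks the checkpoint globally complete only when ALL have acked.
class Worker:
    def __init__(self, name: str):
        self.name = name
        self.local_ckpt = None  # last barrier id this worker durably wrote

    def on_barrier(self, barrier_id: int) -> int:
        # phase one: flush in-flight state, write per-worker marker, ack
        self.local_ckpt = barrier_id
        return barrier_id

class Coordinator:
    def __init__(self, workers):
        self.workers = workers
        self.global_complete = None  # last barrier EVERY worker acked

    def run_checkpoint(self, barrier_id: int, alive=None) -> bool:
        alive = self.workers if alive is None else alive
        acks = [w.on_barrier(barrier_id) for w in alive]
        # phase two: global marker only if the ack set is complete
        if len(acks) == len(self.workers):
            self.global_complete = barrier_id
            return True
        return False  # partial checkpoint: recovery must ignore barrier_id

workers = [Worker("w0"), Worker("w1"), Worker("w2")]
coord = Coordinator(workers)
coord.run_checkpoint(1)                     # all three ack: globally complete
coord.run_checkpoint(2, alive=workers[:2])  # w2 missed barrier 2: incomplete
# recovery rolls back to barrier 1, even though w0 and w1 wrote barrier 2
```

The point the toy makes concrete: per-worker markers for barrier 2 exist on two of three workers, yet recovery must disregard them because the global marker was never written.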
Schema evolution of the checkpoint format itself
A failure mode that takes years to surface: the checkpoint format changes between versions of the pipeline code, and an older checkpoint from a stalled worker resumes against newer code that misinterprets it. The Razorpay incident in 2024 where a stuck consumer rolled back to an offset two days old — because a v2 checkpoint was loaded by v3 code that looked for a different field name and silently fell back to "fresh start" — cost a half-day of duplicate-detection work.
The discipline is to version the checkpoint explicitly. Every checkpoint file includes {"schema_version": "v3", ...}. The loading code refuses to start if it doesn't recognise the version. Migrations are explicit code that reads vN and writes vN+1, run once before the new pipeline version starts. This is more ceremony than feels worth it on day one and is exactly what makes incidents like the above impossible.
A reasonable rule: increment the schema version any time you add, remove, or rename a checkpoint field. Never reuse a field name with new semantics. Treat the checkpoint format as a public API between past-you and future-you, because that is what it is.
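The version-check-and-migrate discipline can be sketched in a dozen lines. The field names below (a v2 "file_idx" renamed to v3's "last_file_idx") are illustrative, not any real production format:

```python
# Versioned checkpoint loading: migrate known old formats explicitly,
# refuse unknown ones loudly -- never silently fall back to "fresh start".
CURRENT_VERSION = "v3"

def migrate_v2_to_v3(ckpt: dict) -> dict:
    # explicit migration: rename the cursor field, bump the version
    return {"schema_version": "v3", "last_file_idx": ckpt["file_idx"]}

MIGRATIONS = {"v2": migrate_v2_to_v3}  # chainable: v1->v2->v3 if needed

def load_checkpoint(ckpt: dict) -> dict:
    version = ckpt.get("schema_version")
    while version in MIGRATIONS:
        ckpt = MIGRATIONS[version](ckpt)
        version = ckpt["schema_version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unknown checkpoint schema {version!r}; refusing to start")
    return ckpt
```

The refusal path is the whole point: a v2 checkpoint loaded by code that only knows v3 crashes the startup with a clear message, instead of resuming from a misread offset.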
Where this leads next
Chapter 8 generalises the dedup-key pattern from chapter 6 into a streaming context — hash-based deduplication where the dedup state itself must be checkpointed, intersecting both this chapter and the previous one. Chapter 9 introduces UPSERT and MERGE, the destination-side primitive that turns "checkpoint and dedup" into a transactional whole. Chapter 11 brings retry policy together with checkpoints and shows how exponential backoff interacts with checkpoint advancement.
- Hash-based deduplication on load — chapter 8
- Upserts and the MERGE pattern — chapter 9
- Partial failures and the at-least-once contract — chapter 10
- What "idempotent" actually means for data (and why it's hard) — chapter 6, the prerequisite
- Crashing mid-run: which state are you in? — chapter 4, the failure-mode taxonomy that makes this chapter necessary
Build 4 returns to checkpoints from the scheduler's perspective — when a DAG executor needs to remember which tasks have completed across a multi-hour run with hundreds of tasks. The mechanism is the same (durable state recording committed progress), but the unit of work shifts from "row" or "file" to "task in a DAG", and the checkpoint format grows from one offset to a graph of task states. Build 8 returns again, this time as Flink's barrier-based snapshotting where the state is distributed and the protocol is non-trivial. Each return refines the same idea — record what's done, atomically, before claiming success — at progressively larger scales.
References
- POSIX rename(2) atomicity — the underlying primitive that makes os.replace-based checkpoint writing safe, and the precise scope of its atomicity guarantee (within a single filesystem).
- Apache Flink: state and fault tolerance — production reference for distributed checkpointing using the Chandy-Lamport-derived barrier protocol; the long-form version of §4 above.
- Spark Structured Streaming checkpointing — the per-batch checkpoint directory layout (offsets/, commits/, _spark_metadata) that productionises the single-file pattern at scale.
- Kafka KIP-447: Producer scalability for exactly-once — how Kafka's consumer-offset checkpoints interact with the transactional producer to give end-to-end exactly-once.
- "Atomic file updates on Linux" — LWN, 2017 — practical guide to the tempfile + fsync + rename pattern, including the subtleties of fsyncing the directory afterwards on some filesystems.
- Designing Data-Intensive Applications, Chapter 7: Transactions — Martin Kleppmann on what "atomic" actually means at each layer of the storage stack, foundational for understanding why the rename-based pattern works.
- What "idempotent" actually means for data (and why it's hard) — the chapter-6 dedup-key pattern that composes with this chapter's checkpoint pattern.
- Airflow architecture: scheduler and metadata DB — production example of the "checkpoint as transactional database row" pattern, scaled to thousands of DAGs.
A practical exercise to internalise the pattern: take the resumable_pipeline.py script above and inject a deliberate failure between conn.commit() and write_checkpoint() (a raise RuntimeError("crash")). Run it once on a small input. Observe that some rows are in Postgres but the checkpoint hasn't advanced. Restart the script. The next run re-reads the file the crash hit, the rows already in Postgres are silently skipped via ON CONFLICT, and the new rows complete. Total state: identical to the no-crash case. That is the property — duplicate-and-dedupe converges; skip-and-pretend does not — that makes this combination of checkpoint + idempotent write the foundation of every batch pipeline you will write.
A second exercise, more revealing: invert the order. Put write_checkpoint(...) before conn.commit(), induce the same crash, and observe what happens on restart. The checkpoint says the file is done; the rows are absent from Postgres; the restart skips the file; the data is lost forever, with no signal. Run a SELECT COUNT(*) and the missing rows are simply gone. This is the failure mode the order-inversion rule prevents, and feeling it once on a 100-row test makes the rule unforgettable.
A third exercise, for senior engineers calibrating their own systems: walk a checkpoint-based pipeline you currently own and answer four questions about each checkpoint write. Is the order data-then-checkpoint? Is the write atomic (fsync + rename, not in-place truncate)? Is the resume key derived from the source position rather than wall-clock time? Is the checkpoint per-partition rather than global? A "no" to any of these is a quiet bug waiting for the next on-call to discover. The most common "no" in the field is the second one — engineers writing the checkpoint via with open(path, "w"): json.dump(...) and never realising the gap. That fix is one screenful of code and converts a class of incidents into impossibilities.