Crashing mid-run: which state are you in?
At 06:08 IST on a Tuesday, Riya's laptop runs out of memory while the merchants dictionary is loading. The Python process gets SIGKILL-ed by the OOM killer. There is no traceback, no log line, no exit code — just a missing file in /data/out/, a 217 MB orphan in /data/_staging/, and an ops Slack channel asking "where's the refunds report?". She has twelve minutes before the 06:20 escalation. The decision she makes in the next two minutes — re-run, resume, or roll back — is the same decision a senior engineer at Razorpay makes against a Kafka topic at 02:14, except the blast radius is bigger.
This chapter takes the fifty-line pipeline from chapter 3 and crashes it at every primitive boundary. You will walk the disk after each crash, name the state the process left behind, and pick the right recovery action. The goal is not memorisation — it is to build the muscle of asking which state am I in before typing the next command.
A crashed pipeline is in one of four states: nothing started (call it state A), partial Extract (B), partial Transform (C), or partial Load (D). Recovery is safe only when you can name the state, and the staging-rename pattern from chapter 3 is what makes "name the state" easy — anything in /data/_staging/ is throwaway, anything in /data/out/ is a committed past run. The fifty-line shape was designed precisely so this question has a quick answer.
The four states a crash leaves behind
Every crash mid-run lands the pipeline in exactly one of four states. The state is determined by what is on disk, not by what the process intended to do. A pipeline that knows its disk layout can answer the recovery question in seconds; a pipeline whose state is spread across half-written shared variables and uncommitted DB transactions cannot.
The reason this picture has only one "messy" state — D — is the staging-rename pattern from chapter 3. Without it, every state would have a half-written file in /data/out/ that consumers might be reading, and you would have no idea whether the file represents a complete past run or a half-finished current one. With it, the contract is one sentence: /data/out/ only ever holds files that were complete at the moment of os.replace.
Why states A, B, C all collapse to "re-run": the in-memory data structures (refunds, merchants, out) live in the Python process and die with it. Nothing survived to disk that the new run would conflict with. Re-running starts from scratch and produces the correct output, exactly as the original run would have if it had not crashed. This is the cheapest possible recovery and is only available because Transform was a pure function with no side effects.
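The collapse is easiest to feel in miniature. The sketch below uses illustrative field names and sample rows, not the real chapter-3 schema; the only point is that a pure Transform gives the same answer every time it runs:

```python
def transform(refunds, merchants):
    """Pure function: no I/O, no mutation of its inputs, output depends
    only on the arguments. (Field names here are illustrative.)"""
    return [
        {**r, "merchant_name": merchants.get(r["merchant_id"], "UNKNOWN")}
        for r in refunds
        if r["status"] == "settled"
    ]

refunds = [{"id": "r1", "merchant_id": "m1", "status": "settled"},
           {"id": "r2", "merchant_id": "m2", "status": "pending"}]
merchants = {"m1": "Chai Point"}

# Identical output on every run: this is what collapses states A, B, C
# into a single "just re-run" recovery.
assert transform(refunds, merchants) == transform(refunds, merchants)
```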
Walking the disk after each crash
The detective work to identify which state you are in is two ls commands and one inequality. Build the muscle now; the same two ls commands answer the same question against S3 prefixes in Build 12.
# crash_diagnosis.py — answer "what state am I in?" before running anything else.
import sys, os, glob, time
from pathlib import Path

OUT_DIR = Path("/data/out")
STAGING_DIR = Path("/data/_staging")

def diagnose(run_date: str):
    final = OUT_DIR / f"refunds_settled_{run_date}.csv"
    staging_pattern = str(STAGING_DIR / f"{run_date}.*.csv")
    orphans = glob.glob(staging_pattern)
    final_exists = final.exists()
    final_age_s = (time.time() - final.stat().st_mtime) if final_exists else None
    if final_exists and final_age_s < 6 * 3600 and not orphans:
        return "DONE", "today's run already published; do nothing"
    if not final_exists and not orphans:
        return "STATE_A_B_C", "no run in flight; safe to re-run"
    if orphans and not (final_exists and final_age_s < 6 * 3600):
        return "STATE_D", f"crash during Load; {len(orphans)} orphan(s) to clean"
    if final_exists and orphans:
        return "STATE_D_RACE", "completed run plus orphan; clean orphan, no re-run"
    return "UNKNOWN", "manual investigation required"

if __name__ == "__main__":
    state, action = diagnose(sys.argv[1])
    print(f"state={state} action={action}")
    if state == "STATE_D":
        for o in glob.glob(str(STAGING_DIR / f"{sys.argv[1]}.*.csv")):
            print(f"  orphan: {o} ({os.path.getsize(o):,} bytes)")
Sample run after Riya's OOM crash:
$ python crash_diagnosis.py 2026-04-24
state=STATE_D action=crash during Load; 1 orphan(s) to clean
orphan: /data/_staging/2026-04-24.91247.csv (217,418,832 bytes)
A few load-bearing lines.
final_age_s = ... if final_exists else None distinguishes "today's run already finished" from "yesterday's file is still sitting there because today never ran". Without the age check, a missing run for a whole day would look indistinguishable from a successful one. Six hours is the chosen freshness window because the pipeline runs at 06:00 and the next scheduled invocation is 06:00 the following day — a file older than six hours cannot belong to today's run.
glob.glob(staging_pattern) uses the os.getpid() suffix in the staging filename from chapter 3 to enumerate orphans. There may be more than one if the pipeline was retried by a wrapper script and crashed twice; the diagnostic catches both.
return "STATE_D_RACE", ... is the rare state where someone re-ran the pipeline by hand while the original run was still mid-Load. The original os.replace succeeded (final exists, fresh), but the manual run crashed mid-write and left an orphan. This is the state where "just delete everything in staging" is wrong if you're not careful — the orphan is from a different run, but the final is fine. The action is to clean the orphan only, not re-run.
Why the four-state taxonomy uses the disk and not the application logs: logs lie. A process that segfaults inside csv.DictReader may not flush its log buffer; a process killed by kill -9 definitely does not. Disk state is the only ground truth a crashed pipeline can leave behind. The diagnostic is "what files exist, when were they last modified" — the same primitives ls and stat that survive every Linux crash since 1991.
The recovery actions, ranked by risk
For each state, there is exactly one safe recovery action. Memorise this table; it is the contract you maintain forever.
| State | What's on disk | Safe action | Risky action | Reason |
|---|---|---|---|---|
| A | yesterday's final, no staging | re-run | nothing risky here | re-run is idempotent because Load uses atomic rename |
| B | yesterday's final, no staging | re-run | nothing risky here | same as A; can't tell B apart from A from disk, and don't need to |
| C | yesterday's final, no staging | re-run | nothing risky here | same as A and B; the "level" of progress was in-memory and gone |
| D | yesterday's final + orphan in staging | delete orphan, then re-run | re-running without cleaning | orphan accumulates with every retry; eventually fills the disk |
| D-race | today's final + orphan in staging | delete orphan, do not re-run | re-running | re-run would overwrite a correct, freshly-published file |
Notice that A, B, and C are operationally identical — you cannot tell them apart from the disk, and you do not need to. The pure-function Transform is what collapses three logical states into one operational state. This is the deepest payoff of the chapter 3 architecture.
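The table fits in a handful of lines of code. The state names below mirror the diagnostic script's output; the (clean_orphans, re_run) pairs are an illustrative encoding of the safe actions, not part of the chapter-3 pipeline itself:

```python
# Recovery table as a dispatcher: state -> (clean_orphans, re_run).
RECOVERY = {
    "DONE":         (False, False),  # today's run published; touch nothing
    "STATE_A_B_C":  (False, True),   # nothing survived to disk; just re-run
    "STATE_D":      (True,  True),   # clean the orphan, then re-run
    "STATE_D_RACE": (True,  False),  # clean the orphan; final is fresh, no re-run
}

def recovery_plan(state: str):
    """Return (clean_orphans, re_run) for a diagnosed state.

    Raises KeyError for UNKNOWN on purpose: an undiagnosed state should
    stop the operator, not silently pick an action.
    """
    return RECOVERY[state]
```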
The unreliable-narrator problem: why disk beats memory beats logs
When a pipeline crashes, four sources of "what happened" are available, and they disagree. Memorising their reliability ranking is the difference between debugging in five minutes and chasing a phantom for an hour.
Disk state is ground truth. A file either exists or doesn't. Its mtime is what the kernel last recorded. If the rename completed, the kernel committed it; if not, it didn't. The disk does not lie about its current contents — only about why it is in that state.
Process exit status is mostly truth, when it exists. A clean exit code 0 is reliable: Python ran to completion. A non-zero exit code is reliable for what failed (Python re-raised an exception). But a process killed by SIGKILL (signal 9), OOM, or a power loss leaves no exit code at all — the orchestrator just observes "the process is gone".
Application logs are partial truth. They are correct for events the process actually flushed. They are silent for events that happened in the buffer between the last flush() and the crash. A pipeline that logs "extract.done rows=42318" and then crashes during Transform may not have flushed the next log line; absence of "transform.done" does not prove transform didn't run, only that it didn't log completion.
Operator memory is unreliable. "I think I re-ran it around 2 a.m." has betrayed every senior engineer at least once. Always cross-check operator memory with disk state and orchestrator history.
The discipline: in a crash investigation, start from disk, then process status, then logs, and treat operator recollection as a hint to verify, not a fact. The fifty-line pipeline is designed so disk state alone answers the recovery question, which is why this chapter starts there.
Why "resume from where we left off" is the wrong instinct
A natural first instinct, especially for engineers coming from long-running jobs, is "checkpoint progress so we can resume mid-flight". A Spark job at PhonePe processing two days of UPI history would absolutely checkpoint — restarting from row 0 after a crash six hours in is unacceptable. The fifty-line pipeline does not, and intentionally so.
The reason: resuming is harder than re-running, and the threshold where it becomes worth the complexity is much higher than people expect. A resume requires three pieces of state to be on disk: the input cursor (which row you'd processed up to), the partial output (what you'd already written), and the dedup state (what keys you'd already emitted). Each of those has to be written atomically, has to be replayed correctly under all crash combinations, and has to compose with retries. Building it for a 0.5-second pipeline is theatre.
The break-even point for resume-vs-rerun is roughly when a single run takes longer than the longest acceptable recovery window. If the pipeline takes 30 seconds and you have an hour, just re-run. If it takes 6 hours and ops needs results in 4, now you need resume. Build 4's scheduler will introduce checkpoints for the cases where they pay off; for everything else, the doctrine is make re-run safe and cheap, then re-run.
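The break-even rule is a single inequality, written out here with the chapter's two examples as checks:

```python
def resume_pays_off(run_seconds: float, recovery_window_seconds: float) -> bool:
    """Rule of thumb from the text: build resume machinery only when a
    full re-run cannot fit inside the acceptable recovery window."""
    return run_seconds > recovery_window_seconds

assert resume_pays_off(30, 3600) is False           # 30 s run, 1 h window: re-run
assert resume_pays_off(6 * 3600, 4 * 3600) is True  # 6 h run, 4 h window: resume
```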
Why this doctrine survives at scale: every system you respect — Postgres replication, Kafka consumer groups, S3 multipart uploads, Iceberg snapshots — has the same answer at its core. Each layer either makes the unit of work small enough that re-run is cheap, or makes the commit point single-atomic so partial state is throwaway. The fifty-line pipeline did the first; Iceberg does the second; Build 9 will do both at once.
Riya's twelve minutes: walking through the actual recovery
Back to the original scene. It is 06:08, the OOM has fired, and the diagnostic has identified State D with one orphan. Riya has eleven minutes and forty seconds. Here is exactly what she types, and why each step is in the order it is.
$ ls -la /data/_staging/
-rw-r--r-- 1 riya riya 217418832 Apr 24 06:07 2026-04-24.91247.csv
$ ls -la /data/out/refunds_settled_*.csv | tail -3
-rw-r--r-- 1 riya riya 4218914 Apr 23 06:00 refunds_settled_2026-04-23.csv
-rw-r--r-- 1 riya riya 4112877 Apr 22 06:00 refunds_settled_2026-04-22.csv
-rw-r--r-- 1 riya riya 4087261 Apr 21 06:00 refunds_settled_2026-04-21.csv
The first ls confirms the orphan. The second ls confirms there is no refunds_settled_2026-04-24.csv — so this is genuine State D, not State D-race. The chronological gap (last good run is 06:00 yesterday, no file for today) is the second piece of evidence. Now the action sequence.
Step 1: clean the orphan. rm /data/_staging/2026-04-24.91247.csv. This is safe because the staging filename is namespaced by PID, so no concurrent run is using it. If two operators were diagnosing simultaneously, they would not collide on this filename.
Step 2: free memory. The OOM was deterministic — running the same pipeline on the same laptop will OOM again. Riya checks free -m, kills the Slack desktop client and a Chrome window with sixty tabs, recovers 1.4 GB, and tries again. In production, this step is "request a worker pod with more memory" or "switch to streaming the join instead of loading the merchants dict in full". The point is: re-running before fixing the cause produces State D again with a different PID.
Step 3: re-run, pinned to the same run_date. python pipeline.py 2026-04-24. Because run_id is computed from run_date and not from the wall clock, the output is bit-identical to what the 06:00 run would have produced. The _run_id column reads 2026-04-24T06:00:00+05:30, not 2026-04-24T06:09:43+05:30. Downstream consumers cannot tell the run was retried.
Step 4: confirm and notify. ls /data/out/refunds_settled_2026-04-24.csv shows the new file with current mtime; wc -l shows 28,915 rows; she posts in Slack: "refunds report published at 06:11; root cause: OOM during merchants load; fix shipping today." The recovery, end to end, took 3 minutes 40 seconds.
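The determinism that step 3 relies on can be sketched in a few lines. The function name and the fixed 06:00 IST schedule are assumptions for illustration; the real run_id logic lives in chapter 3's pipeline.py:

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))

def run_id_for(run_date: str) -> str:
    """Derive the run_id from the run_date and the fixed 06:00 IST
    schedule, never from the wall clock. A retry at 06:09 therefore
    stamps the same _run_id as the original 06:00 invocation would have."""
    d = datetime.strptime(run_date, "%Y-%m-%d")
    return d.replace(hour=6, tzinfo=IST).isoformat()
```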
The entire procedure is recoverable in this time only because chapter 3's pipeline made the recovery question answerable. A pipeline that wrote partial results to /data/out/, or that used wall-clock run_id, or that opened a long-running database transaction would each turn this 3-minute recovery into a 30-minute incident. The architecture is the SLA.
Why "fix the root cause before re-running" is non-negotiable: a deterministic crash that re-runs without a fix is the textbook way to fill an on-call rotation. Each retry consumes minutes, fills staging disk, and burns operator attention. Fix-first is also why Build 4's scheduler defaults to bounded retries (3 attempts, exponential backoff) rather than unbounded — the assumption is that if 3 attempts fail, the cause is structural, not transient.
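A minimal sketch of bounded retries with exponential backoff, assuming the Build-4 defaults described above; the function and parameter names are illustrative, not a real scheduler API:

```python
import time

def run_with_bounded_retries(step, attempts: int = 3, base_delay_s: float = 1.0):
    """Bounded retries: try `step` (any zero-argument callable) at most
    `attempts` times, backing off 1x, 2x, 4x ... between tries, then
    re-raise loudly -- the assumption being that three failures signal a
    structural cause, not a transient one."""
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt == attempts - 1:
                raise  # structural failure: stop retrying, page a human
            time.sleep(base_delay_s * (2 ** attempt))
```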
What changes when the destination isn't a file
The four-state diagnosis and the recovery flow above assume Load is os.replace to a CSV. The doctrine survives almost unchanged when Load is a Postgres MERGE, a Kafka idempotent producer, or an Iceberg snapshot commit — but the mechanism of "atomic publication" changes per destination.
For Postgres, Load wraps the writes in a transaction; partial writes are auto-rolled-back on crash because the WAL hasn't been committed. The "orphan in staging" equivalent is uncommitted intermediate data in temp tables, which Postgres cleans on next connection. For Kafka, Load uses the idempotent producer (KIP-98) so a re-sent message with the same producer ID and sequence number is deduped at the broker; the orphan equivalent is "messages sent before the crash but never committed", which the broker discards. For Iceberg, Load writes data files to S3 and then atomically rewrites the manifest pointer; un-pointed data files become orphans that a periodic compaction sweeps. Build 6, Build 9, and Build 12 walk these in detail; the shape is identical to chapter 3's staging-rename, just with different primitives.
The unifying property is single-atomic-commit-at-end. Every destination worth using gives you a way to say "publish this work, all of it or none of it, in one atomic step at the end". Pipelines that ignore this property and instead stream partial writes into the destination are the pipelines that fill on-call rotations.
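The all-or-nothing property is easiest to feel with a database you can crash on purpose. The sketch below uses sqlite3 from the standard library as a stand-in for the Postgres shape described above; the table schema and the simulated crash are illustrative:

```python
import sqlite3

def load_all_or_nothing(conn, rows, crash_after=None):
    """Single-atomic-commit-at-end against a SQL destination.
    `crash_after` simulates a crash mid-Load, before COMMIT is reached."""
    with conn:  # transaction: commits only if the whole block completes
        for i, row in enumerate(rows):
            if crash_after is not None and i == crash_after:
                raise RuntimeError("simulated crash mid-Load")
            conn.execute("INSERT INTO refunds VALUES (?, ?)", row)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refunds (id TEXT, amount INTEGER)")

# Crash after one row: the transaction rolls back; nothing is published.
try:
    load_all_or_nothing(conn, [("r1", 100), ("r2", 200)], crash_after=1)
except RuntimeError:
    pass
assert conn.execute("SELECT COUNT(*) FROM refunds").fetchone()[0] == 0

# Re-run after the fix: both rows appear in one atomic step at the end.
load_all_or_nothing(conn, [("r1", 100), ("r2", 200)])
```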
A subtle implication: not every destination is "worth using" by this measure. A pipeline at a Pune logistics company once wrote results directly to a Google Sheet via the gSheets API, one row per append, with no atomic-commit primitive. When the pipeline crashed mid-run, the sheet had a partial day of orders mixed with yesterday's closing snapshot, and the operations team spent six hours reconciling by hand. The fix was not "make the pipeline more reliable" — it was "land in S3 first, then push to the sheet from a separate, idempotent step". Choosing destinations that support atomic publication is part of pipeline design, not an afterthought.
Common confusions
- "Re-running a pipeline that crashed mid-Load is dangerous because the staging file is corrupted." The staging file is corrupted, yes — but it is in /data/_staging/, not /data/out/, and the new run uses a different staging filename (different PID). The corrupt file does not affect the new run at all. Cleaning it is a hygiene step (the disk fills otherwise), not a correctness step. The "danger" people fear is leaking corrupt data to consumers, which the staging-rename pattern prevents by construction.
- "State C and State D are the same — both mean we crashed mid-pipeline." They look similar from the application's perspective but are operationally different on disk. State C left no files behind; State D left an orphan. The diagnostic command runs differently, and so does the recovery (re-run vs clean+re-run). Conflating them is how you end up with a /data/_staging/ directory full of multi-gigabyte orphans nobody dares delete because "what if some of them are real?"
- "Idempotent re-runs mean the pipeline is fault-tolerant." Idempotency is necessary but not sufficient. Fault tolerance also requires that the pipeline detects it has crashed (some scheduler is watching), that re-runs happen on a bounded schedule (not five times in a tight loop), and that a poison-pill input doesn't crash the pipeline forever. Build 4 covers detection and bounded retries; Build 5 covers contracts that catch poison-pill inputs early.
- "Atomic rename solves all crash problems." It solves publication atomicity — a reader of /data/out/ always sees a consistent file. It does not solve input atomicity (your input file can still be half-written by an upstream producer that doesn't use staging-rename), and it does not handle multi-output pipelines where Load writes to two destinations and crashes between them. Multi-sink atomicity needs Build 9's two-phase commit pattern; for now, single-sink Load is enough.
- "If I just retry with try/except, I'll be fine." A try/except around load(...) catches Python exceptions, not OOM kills, segfaults, or disk-full errors that arrive as OSError after the partial write has hit disk. Recovery has to assume the process can disappear without notice, which is what the on-disk state machine in this chapter is designed for. try/except is layered on top, not a substitute.
- "The orphan in /data/_staging/ is corrupted data that I should examine before deleting." It is intermediate output that the pipeline never committed, by design. Examining it tells you what would have been written if the run had completed — useful for debugging the why of the crash, but never as a source of recovered data. A junior engineer once tried to "salvage" an orphan by renaming it into /data/out/; the file was missing the last 1,200 rows because the crash happened mid-write. Orphans are forensic, never functional.
Going deeper
The five failure modes that produce State D
State D — orphan staging file — happens for a recognisable, finite set of reasons. Each maps to a defence that goes into the production version of the pipeline.
OOM kill. Riya's case. The Linux OOM killer sends SIGKILL, which the process cannot trap. Defence: cap the merchants dictionary at the size that actually fits, or stream the join with a hash-built lookup against a smaller in-memory subset. Build 6 introduces Parquet-with-statistics so you can join without loading the full reference table.
Disk full during write. csv.DictWriter raises OSError. Defence: a pre-flight check that estimates output size and reserves the space; a df check before starting. Production pipelines at Zerodha pre-allocate the output region; smaller pipelines simply alert when the disk is over 85% full so an operator clears space before the run.
SIGTERM from the orchestrator. Kubernetes evicts the pod; a CI runner times out. Defence: handle SIGTERM and clean up /data/_staging/<run_date>.<pid>.csv before exiting. The fifty-line pipeline doesn't, but a one-line signal.signal(signal.SIGTERM, cleanup) would.
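A hedged sketch of that defence, assuming a POSIX system; the staging path here is a temp file so the example is self-contained, where the pipeline would use /data/_staging/ with its run_date and PID:

```python
import os, signal, sys, tempfile
from pathlib import Path

# Temp-dir stand-in for the pipeline's /data/_staging/<run_date>.<pid>.csv.
STAGING = Path(tempfile.mkdtemp()) / f"2026-04-24.{os.getpid()}.csv"

def cleanup(signum, frame):
    """On SIGTERM, remove our own orphan before dying, so an orchestrator
    eviction never produces State D."""
    STAGING.unlink(missing_ok=True)
    sys.exit(128 + signum)  # conventional "terminated by signal" exit code

signal.signal(signal.SIGTERM, cleanup)
```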
Network partition during input read. S3 reads time out mid-Extract. With the chapter 3 layout this is State B (no staging file written yet), but a longer-running pipeline that streams Extract → Transform → Load may have written staging mid-stream. Defence: use a single output writer opened just before the first row, closed after the last; do not fan out staging files across the run.
Power loss during fsync. The most insidious. The os.replace succeeded according to the kernel, but the filesystem journal hadn't been flushed. On reboot, the rename may have rolled back. Defence: fsync on the parent directory after os.replace for the cases where a power-loss-tolerant guarantee is required (banking, identity, compliance). For 95% of pipelines, the consumer can detect "stale or missing" and the operator re-runs.
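A sketch of the hardened publish, under the assumption that the filesystem honours fsync on a directory file descriptor (Linux and common POSIX filesystems do); publish_durable is an illustrative name, not a chapter-3 function:

```python
import os

def publish_durable(tmp_path: str, final_path: str) -> None:
    """Atomic publish hardened against power loss: fsync the file's data,
    rename, then fsync the parent directory so the rename itself is
    journalled before anyone reports success."""
    fd = os.open(tmp_path, os.O_RDONLY)
    try:
        os.fsync(fd)                  # flush the file's data blocks
    finally:
        os.close(fd)
    os.replace(tmp_path, final_path)  # atomic publication
    dfd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)                 # flush the directory entry (the rename)
    finally:
        os.close(dfd)
```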
The order in which these five modes appear in production also follows a pattern. OOM and disk-full cluster around growth events — onboarding a new merchant, adding a new tenant, the first Big Billion Days after a year's data accumulation. SIGTERM clusters around platform changes — a Kubernetes node pool resize, an Airflow worker rotation, a CI image upgrade. Network partitions cluster around AWS regional events. Power loss is the rare one, and almost always traces to a single-AZ deployment without battery-backed disks. A team that classifies its post-mortems by which of these five caused the crash builds a clearer mental model of what to harden next than one that classifies by symptom.
What "re-runnable" actually requires of upstream
The fifty-line pipeline is re-runnable because the input is re-readable. Riya can run python pipeline.py 2026-04-24 again and csv.DictReader will iterate the same file. Many production pipelines do not have this luxury — the input is a Kafka stream, an HTTP webhook, or a CDC stream of changes that don't replay.
The discipline: whenever the pipeline reads from a non-replayable source, the first thing Extract does is land the input as an immutable file. Kafka consumer reads a window of messages and writes them to S3 as /raw/<topic>/<run_id>.json before any processing happens. From that file onward, the rest of the pipeline is re-runnable as if it were reading a CSV from disk. This is sometimes called "raw landing zone" or "bronze layer" in lakehouse terminology, and it is the same idea Riya's pipeline already has in /data/in/ — every CSV there is immutable, by upstream contract.
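A minimal sketch of the landing step, with a plain list standing in for the non-replayable Kafka window and illustrative paths and names throughout:

```python
import json, os, tempfile
from pathlib import Path

def land_raw(messages, raw_dir: Path, run_id: str) -> Path:
    """First thing Extract does with a non-replayable source: write the
    window to an immutable file via the same staging-rename pattern, so
    every later stage re-reads a replayable input."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    final = raw_dir / f"{run_id}.json"
    tmp = raw_dir / f"_{run_id}.{os.getpid()}.json.tmp"
    tmp.write_text("\n".join(json.dumps(m) for m in messages))
    os.replace(tmp, final)  # atomic: the raw file is complete or absent
    return final

# A list stands in for a consumed Kafka window; from `raw` onward, a crash
# at any later stage costs a re-run, not data loss.
raw = land_raw([{"txn": "t1"}, {"txn": "t2"}],
               Path(tempfile.mkdtemp()), "2026-04-24")
```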
If your pipeline starts from a non-replayable source and doesn't land it first, you have built a system where every crash is data loss. Build 11 (CDC) goes deep on this exact discipline; the takeaway here is that "re-runnable" is not a property you grant to a pipeline by writing it carefully, it's a property the upstream contract permits or doesn't.
What an Indian-scale OOM looks like
The OOM killer fires more often in production than juniors expect. At a payments company in Bengaluru, the merchants dictionary grew from 8 lakh rows in 2023 to 47 lakh rows in early 2026 (a side effect of GSTN onboarding more small businesses). The fifty-line pipeline's dict representation grew from ~120 MB to ~700 MB in resident memory, which exceeded the 512 MB cap on the Airflow worker pod. The pipeline crashed every morning for nine days before someone noticed — because the orchestrator's retry policy kept "succeeding" by retrying on a worker with more memory, while filling the staging disk with orphans.
Two fixes shipped: a memory-budget check at Extract that raises before the OOM killer fires (if estimate_dict_bytes(merchants) > 0.7 * available_memory: raise MemoryError), and a daily orphan-cleaner cron that deletes /data/_staging/*.csv files older than 24 hours. The first fix turned silent OOMs into loud, debuggable errors with a stack trace; the second prevented disk-fill cascades. Total engineering cost: 28 lines of Python and one cron entry. This is the production-hardening shape — most "complex distributed system" code at Indian scale is small, boring fixes layered on top of a clean primitive.
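The first fix can be sketched with standard-library pieces. estimate_dict_bytes here is a crude sys.getsizeof sum, not the production estimator, and the 0.7 fraction matches the budget described above:

```python
import sys

def estimate_dict_bytes(d: dict) -> int:
    """Crude resident-size estimate for a flat dict of strings -- the shape
    of the merchants lookup. Deliberately rough: the goal is a loud, early
    MemoryError with a stack trace, not an exact number."""
    return sys.getsizeof(d) + sum(
        sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items()
    )

def check_memory_budget(d: dict, available_bytes: int, fraction: float = 0.7):
    """Raise before the OOM killer fires, turning a silent SIGKILL into a
    debuggable exception."""
    need = estimate_dict_bytes(d)
    budget = fraction * available_bytes
    if need > budget:
        raise MemoryError(f"dict needs ~{need:,} B, budget is {budget:,.0f} B")
```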
The crash-during-rename window: the part nobody talks about
There is a sub-state inside State D that the four-state diagram glosses over: the crash happens during the rename syscall itself. POSIX guarantees rename(2) is atomic with respect to other observers — concurrent readers see either the old file or the new file, never an intermediate. It does not guarantee atomicity with respect to the caller that gets killed mid-call. The kernel will either complete the rename or not, and the caller cannot reliably know which.
In practice, this matters for power loss more than for SIGKILL — the rename completes inside one syscall, which on most filesystems is a few microseconds, and the OOM killer runs only at scheduler boundaries. But for a pipeline with a strict freshness contract (e.g., the regulator requires that yesterday's settlement file exists by 06:30 IST), the diagnostic must handle the case where the final file might exist with a slightly old mtime because the rename completed but the caller died before logging success.
The defensive pattern: after Load completes, write a sibling success marker — os.replace(tmp, final); Path(final).with_suffix(".csv.success").touch(). The orchestrator gates downstream consumers on the existence of .success, not on .csv. If a power loss occurs between the os.replace and the .touch(), the .csv exists but .success does not, and the downstream knows to treat the file as not-yet-published. This is one of the oldest patterns in batch processing — Hadoop's _SUCCESS files do exactly this — and it composes with everything else in the chapter without changing the four-state taxonomy.
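A sketch of the marker pattern with illustrative helper names; note that with_suffix(".csv.success") swaps the .csv suffix for .csv.success, yielding a sibling file next to the published one:

```python
import os, tempfile
from pathlib import Path

def publish_with_marker(tmp: Path, final: Path) -> None:
    """Publish, then drop the sibling success marker. Consumers gate on
    the marker, so a crash between the two steps leaves the .csv visible
    but treated as not-yet-published."""
    os.replace(tmp, final)
    final.with_suffix(".csv.success").touch()

def is_published(final: Path) -> bool:
    return final.exists() and final.with_suffix(".csv.success").exists()

d = Path(tempfile.mkdtemp())
tmp, final = d / "staging.csv", d / "refunds_settled_2026-04-24.csv"
tmp.write_text("id,amount\n")
publish_with_marker(tmp, final)
```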
Why the diagnostic doesn't run inside the pipeline
A natural extension is to bake crash_diagnosis.py into pipeline.py itself — at startup, check the disk and decide what to do. The fifty-line shape deliberately separates them, for two reasons.
First, the diagnostic has to run before the pipeline is allowed to start. If the pipeline starts and then discovers an orphan from a previous run, it has already taken its lock on /data/_staging/<run_date>.<pid>.csv and is competing with itself. The orchestrator (cron, Airflow, or a wrapper script) is the right place to gate the run.
Second, baking diagnostics into the pipeline grows the pipeline. This is the "framework creep" pattern — the pipeline starts as 50 lines of clear logic, then accumulates 200 lines of self-protection logic that obscure what it actually does. Build 4 keeps the pipeline simple and pushes diagnostics up to the orchestrator, where they belong with retry logic, sensors, and SLAs. The fifty-line pipeline stays fifty lines forever; the orchestrator is where complexity lives.
A field guide: the questions to answer before typing anything
When a pipeline crashes at 02:14 and a Slack channel is waiting, the muscle that separates "fixed in 5 minutes" from "fixed in 90 minutes" is the order of questions. The list below is the field guide adopted by data-platform on-call rotations at three Indian companies — refined after enough 2 a.m. mistakes to be worth typing in full.
1. Is this a crash, or is the pipeline still running? Check the process table (ps aux | grep pipeline.py), the orchestrator UI, the worker logs. A pipeline that hasn't crashed is one you do not touch.
2. Which run_date is affected? Today's run, yesterday's late run, a backfill from last month? Each has different blast radius.
3. What is on disk right now? Run the diagnostic. Do not guess from memory of how the pipeline "usually" looks.
4. Is the input still readable? If Extract reads from S3 and the issue is an upstream producer, re-running won't help — fix upstream first, then re-run.
5. What does the output's downstream consumer actually need? Sometimes the answer is "yesterday's file is good enough for the morning meeting; we'll re-run by lunch" rather than "drop everything and re-run now". Communicating this upward is part of recovery.
6. What is the root cause, and is the next run going to crash again? If yes, fix first. If unclear, re-run once with extra logging — but only once.
This list is ordered, not optional. Skipping question 1 to "save time" is how a re-run gets fired against a still-running pipeline and a database lock cascade takes the warehouse offline. Skipping question 5 is how a 5-minute incident becomes a 4-hour incident because the team chose the wrong fix for the urgency level.
Where this leads next
Chapter 5 names the three structural walls every Build-1 pipeline runs into — no idempotency at the destination, no scheduled execution, no observability of what happened — and uses them to motivate Builds 2 through 5. The crash-recovery doctrine in this chapter handles "what to do after a crash"; chapter 5 confronts "how to know a crash happened, and what to do during one".
- The three walls: no idempotency, no observability, no schedule
- Idempotency by construction — Build 2, where the dedup-hash and atomic-rename patterns generalise to every destination type.
- Your first pipeline: CSV → CSV in 50 lines — the chapter this one crashes.
- The three primitives: extract, transform, load — why the four-state taxonomy collapses to one decision per state.
Build 9 (chapters 102–110) returns to the recovery question with stream destinations, where "atomic publication" becomes "exactly-once delivery" and the disk-state diagnostic becomes a transactional-coordinator state diagnostic. The shape is the same; the words and primitives change.
A practical exercise to fix this chapter in muscle memory: take Riya's pipeline.py from chapter 3, run it on a small input, and kill -9 the process at three different points (just after Extract starts, mid-Transform, mid-Load). After each kill, run crash_diagnosis.py and recover by hand. The first time you do this it takes ten minutes; the third time it takes thirty seconds. That speed is what on-call rotations call "competence", and it is built one deliberate crash at a time. If you cannot induce the failure on a calm Sunday afternoon, you cannot fix it on a panicked Tuesday morning.
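The kill-at-a-chosen-moment drill can be scripted. A sketch, assuming a POSIX system; the sleeping child process below stands in for pipeline.py, and varying the delay lands the kill in Extract, Transform, or Load:

```python
import signal, subprocess, sys, time

def crash_at(cmd, after_seconds: float) -> int:
    """Launch a pipeline process and SIGKILL it after a delay -- the
    deliberate-crash drill. Returns the exit status the orchestrator
    would observe (a negative signal number, per subprocess convention)."""
    proc = subprocess.Popen(cmd)
    time.sleep(after_seconds)
    if proc.poll() is None:  # still running: kill it mid-flight
        proc.kill()          # SIGKILL, untrappable, like the OOM killer
    return proc.wait()

# After each induced crash, run crash_diagnosis.py and recover by hand.
status = crash_at([sys.executable, "-c", "import time; time.sleep(30)"], 0.2)
```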
Chapter 5 then steps back from the recovery lens to the architectural one — taking the same fifty-line pipeline and naming what it still cannot do, even with perfect crash recovery. The three walls in that chapter (no destination idempotency, no schedule, no observability) are each their own multi-chapter build, and this chapter's recovery doctrine is one of the foundations they all rest on. Before moving forward, it is worth re-reading chapter 3 with this chapter's framing: every architectural choice in those fifty lines exists to answer a recovery question. The staging directory exists to make orphans visible. The os.replace exists to make publication atomic. The pure Transform exists to make in-memory state safely throwaway. The run_id parameter exists to make retries deterministic. None of those choices are "code aesthetics" — each one is the difference between a 3-minute recovery and a 3-hour incident.
References
- The Linux OOM killer — Mel Gorman's chapter on what happens when the kernel decides your process is the one to kill. Worth reading once to lose the illusion that OOMs come with a stack trace.
- Two Phase Commit and Beyond — Princeton lecture notes on commit protocols. The fifty-line pipeline is a degenerate case of 1PC; multi-sink Load needs at least 2PC.
- Designing Data-Intensive Applications, Chapter 8: The Trouble with Distributed Systems — Martin Kleppmann on partial failure as the dominant problem. The single-machine version of the same problem is what this chapter walks.
- POSIX rename(2) — the formal atomicity guarantee that makes the staging-rename pattern work. Read alongside the Linux man page from chapter 3.
- Apache Iceberg snapshot semantics — the lakehouse-scale version of the same atomic-publish primitive. Build 12 returns here.
- Razorpay engineering: how we recover failed payment pipelines — the production version of this chapter's recovery doctrine, applied to UPI refund pipelines at 100M tx/day.
- The append-only log: simplest store — cross-domain reference. The append-only log is what makes State A, B, C all equivalent: if writes are appends, partial writes are tail-truncated, never structurally corrupt.
- Kafka KIP-98: Idempotent Producer — the Kafka analogue of os.replace: a producer-side mechanism that makes re-sends safe at the broker.
A final test before moving to chapter 5: pull up the diagnostic script from the listing above and dry-run it in your head against three scenarios — (i) the pipeline succeeded yesterday but never ran today, (ii) the pipeline crashed mid-Load five minutes ago and you re-ran it manually two minutes ago, (iii) two operators ran the pipeline simultaneously by mistake. For each, write down which state the script returns and what action you take. If any of the three takes more than a minute to answer, re-read the recovery table.