The three walls: no idempotency, no observability, no schedule

The fifty-line pipeline from chapter 3, hardened with chapter 4's recovery doctrine, runs perfectly when Riya is at her desk and types python pipeline.py 2026-04-24 herself. On the day she takes leave for Diwali, the pipeline does not run. The day the upstream payments team replays a Kafka topic for an unrelated bug fix, the same refunds get loaded twice and her CFO sees a phantom ₹42 lakh that does not exist. The day a downstream BI dashboard shows yesterday's number, nobody knows whether the pipeline ran late, ran wrong, or didn't run at all — because the pipeline tells nobody anything. Three walls separate "a script that survives crashes" from "a pipeline a team can rely on": no schedule, no observability, no destination idempotency. Each wall is the thesis of a Build that follows.

A crash-safe script is not a pipeline. To become one, it must run on a schedule it does not own, emit signals an operator can read without grepping logs, and absorb upstream re-deliveries without double-counting at the destination. These three walls — schedule, observability, idempotency — motivate Builds 2 through 5 and recur at every scale of the curriculum.

Why a single script is not yet a pipeline

Chapter 3 built a working extract-transform-load. Chapter 4 made it crash-safe. Both are necessary, neither is sufficient. The script is still a thing Riya invokes by hand, that runs once and falls silent, and that trusts upstream never to re-send the same row. Production removes all three of those assumptions, and the day each assumption breaks is a different kind of incident.

[Figure: The three walls between a script and a production pipeline. A horizontal layout: on the left, a box labelled SCRIPT (crash-safe, staging-rename, pure transform, deterministic run_id — where ch. 3 + ch. 4 leave us); on the right, a box labelled PIPELINE (runs on time, tells you when, survives replays — where ch. 30 leaves us). Three vertical walls separate them. WALL 1: no schedule — runs only when a human types it → Build 4, DAG executor. WALL 2: no observability — silent success, silent failure → Build 5: runs, lineage, contracts. WALL 3: no destination idempotency — replays double-count rows → Build 2: dedup keys, MERGE/UPSERT. The three walls are not "missing features" — they are the structural problems that motivate the next 25 chapters.]
The three walls between a crash-safe script (left) and a production-grade pipeline (right). Each wall is the thesis of a multi-chapter Build, and each must be crossed before the next has any value. A scheduler without observability tells you nothing when it fails; observability without idempotency lets you watch the warehouse double-count in real time.

The order on the diagram — schedule, observability, idempotency — is the order this chapter discusses them, but not the order Builds tackle them. Build 2 starts with idempotency because it is the deepest wall: a pipeline without idempotency is unsafe to schedule and dishonest to monitor. The discussion order below tracks how Riya hits each wall in the field, which is also the order most engineers meet them.

Wall 1: no schedule — the pipeline runs only when a human types

Riya's python pipeline.py 2026-04-24 runs the day she remembers. The morning of 2026-11-12, she is on a train from Bengaluru to her parents' home for Diwali, and the refunds report does not get generated. At 06:30 the regulator's compliance feed times out waiting for the file. At 07:00 her CFO opens the dashboard and sees "data unavailable". At 09:14 she gets a frantic call. The pipeline did not crash — it simply was not invoked.

The naive fix is crontab -e and 0 6 * * * /usr/bin/python /opt/jobs/pipeline.py. This works for exactly the cases where nothing ever goes wrong. The day cron's host is rebooted for a kernel patch at 05:55, the job misses its window. The day the upstream S3 file lands at 06:07 instead of 05:58, the pipeline starts at 06:00, finds no input, and crashes — and cron does not retry. The day Riya wants to backfill 30 days because of a bug fix, cron has no concept of "give me a parallel run for each of the last 30 dates". Cron is a scheduler in the same way a bare while true; do …; done loop is a worker pool — it runs something, but meets none of the production needs.

A real scheduler — Airflow, Dagster, Prefect, or the tiny one Build 4 builds from scratch — answers four questions cron cannot:

  1. Did the upstream input actually arrive? A "sensor" task that polls S3 and only triggers downstream when the file shows up.
  2. What happens if the run fails? Bounded retries with exponential backoff, dead-lettering after N attempts, alert routing.
  3. What does the dependency graph look like? Pipeline A produces a file that pipeline B reads; B does not start until A succeeds and lands a _SUCCESS marker.
  4. How do I backfill? Re-run the same DAG against historical dates without modifying code, with bounded parallelism so the warehouse doesn't keel over.

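Build 4 builds these answers into a tiny DAG executor; questions 1 and 2 can be sketched in a few lines today. The helper names below (wait_for_input, run_with_retries) are illustrative, not Build 4's API:

```python
import time
from pathlib import Path

def wait_for_input(path: Path, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Sensor: poll until the upstream file exists or the window expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists():
            return True
        time.sleep(poll_s)
    return False

def run_with_retries(fn, max_attempts: int = 3, base_delay_s: float = 1.0):
    """Bounded retries with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # dead-lettering and alert routing happen above this layer
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```

Questions 3 and 4 (the dependency graph, backfills) are exactly what this loop cannot express, which is why Build 4 exists.
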
Why cron survives in the field anyway: it is the cheapest possible scheduler and works fine for jobs whose inputs are always-on (a database query, an internal API) and whose outputs nobody depends on hourly. The danger is the gradient — a cron job that "always works" gets relied on by a downstream team, then a regulator, then a board metric, and one day cron drops a run silently and three layers of consumers don't know why their numbers are stale. A real scheduler is the moment you stop treating "did the job run?" as a question you ask ps aux.

The Bengaluru fintech in chapter 4 illustrates the cron-graduation point exactly. The refunds pipeline started as a crontab line in 2022. It moved to Airflow in 2023 because the compliance team needed bounded retries and a runs-history UI to prove to the regulator that yesterday's file was, in fact, generated at 06:08 IST and signed by an engineer they could name. The schedule wall is rarely "we need fancier scheduling" — it is "we need to be able to answer questions about runs without grepping logs".

Wall 2: no observability — silent success, silent failure

The pipeline ran. Did it produce the right output? Did it produce any output? When did it finish? How does today's row count compare to yesterday's? If the answer to any of these is "let me ssh into the box and tail the log", the pipeline has no observability.

Observability has three layers, each invisible to a script and visible to a pipeline:

Run observability. The orchestrator records, per run: start time, end time, exit status, log output, retry count, who triggered it. When ops asks "did the pipeline run today?", the answer is a row in a runs table, not "yes I think so".

Data observability. The pipeline records, per run: input rowcount, output rowcount, distribution checks (was the average refund amount within 3σ of the trailing 30-day mean?), null-count per column. A pipeline that produces 28k rows on Monday, 28k on Tuesday, 28k on Wednesday and then 4k on Thursday with no error is a pipeline whose data observability is missing — the ETL succeeded; the data is wrong; nobody knows yet.

Lineage observability. When a downstream dashboard shows a wrong number, the question is "which upstream feed corrupted it?". A pipeline that knows its inputs by URL/table-name and its outputs by URL/table-name lets you walk the graph from "this dashboard" backwards to "this S3 prefix from this Kafka topic from this Postgres table". Without lineage, the answer is "open every pipeline's code and read it".

# pipeline_with_observability.py — the fifty-line pipeline, instrumented.
# extract(), transform(), load_merchants() and load() are chapter 3's primitives.
import json, hashlib, sys
from pathlib import Path
from datetime import datetime, timezone, timedelta

IST = timezone(timedelta(hours=5, minutes=30))

def run(run_date: str):
    run_id = f"{run_date}T06:00:00+05:30"
    started_at = datetime.now(IST).isoformat()
    metrics = {"run_id": run_id, "started_at": started_at, "stage": "init"}

    try:
        metrics["stage"] = "extract"
        refunds = extract(run_date)
        metrics["input_rows"] = len(refunds)
        metrics["input_sha"] = sha_of(refunds)

        metrics["stage"] = "transform"
        out = transform(refunds, load_merchants(run_date))
        metrics["output_rows"] = len(out)
        metrics["output_amount_inr"] = sum(int(r["amount_paise"]) for r in out) / 100

        metrics["stage"] = "load"
        load(out, run_date)
        metrics["finished_at"] = datetime.now(IST).isoformat()
        metrics["status"] = "success"
    except Exception as e:
        metrics["status"] = "failed"
        metrics["error"] = f"{type(e).__name__}: {e}"
        metrics["failed_at_stage"] = metrics["stage"]
        raise
    finally:
        runs_dir = Path("/data/_runs")
        runs_dir.mkdir(parents=True, exist_ok=True)
        (runs_dir / f"{run_id}.json").write_text(json.dumps(metrics, indent=2))

def sha_of(rows):
    h = hashlib.sha256()
    for r in rows:
        h.update(json.dumps(r, sort_keys=True).encode())
    return h.hexdigest()[:12]

if __name__ == "__main__":
    run(sys.argv[1])

Sample run output, written to /data/_runs/2026-04-24T06:00:00+05:30.json:

{
  "run_id": "2026-04-24T06:00:00+05:30",
  "started_at": "2026-04-24T06:00:01.412+05:30",
  "stage": "load",
  "input_rows": 31827,
  "input_sha": "a1f3b88c0ed4",
  "output_rows": 28915,
  "output_amount_inr": 4218914.5,
  "finished_at": "2026-04-24T06:00:09.788+05:30",
  "status": "success"
}

A few load-bearing lines.

metrics["stage"] = "extract" (etc.) is the cheapest possible failure-localisation primitive. If the pipeline crashes, the JSON file written in the finally block records where it died — extract, transform, or load. The chapter 4 disk-state diagnostic still works, but now there is also an explicit, machine-readable answer to "which primitive failed?".

metrics["input_sha"] is a hash of the input rows. If the upstream replays the same file, the hash is identical and downstream knows "this run is a re-run of a previously-seen input". If the file changes between runs, the hash changes and downstream can detect drift. This single field becomes the foundation of Build 2's idempotency story.

metrics["output_amount_inr"] is a sentinel statistic. A run that produces 28,915 rows but a total amount of ₹0.05 has clearly transformed something wrong. A simple alert — "yesterday's amount was ₹42 lakh; today's is ₹50; alert" — catches a class of bugs that no exception-based logging can. This is the seed of data-quality testing in Build 5.

The finally: block writes the metrics JSON whether the pipeline succeeded or failed. This is the single most-skipped line in junior-built observability. A pipeline that only writes metrics on success is invisible exactly when it most needs to be visible.

Why a JSON file on disk is enough at this stage: the /data/_runs/ directory becomes the runs-history table. A 30-second cron job (or, after Build 4, an Airflow plugin) reads the directory, parses each JSON file, and writes to Postgres or pushes to a metrics endpoint. The pipeline itself does not depend on the metrics backend being available — if the metrics backend is down, the pipeline still produces correct output and the metrics catch up later. This separation is the same shape as "land raw input first, transform later" — observability is a downstream consumer of the pipeline, not a dependency of it.
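
A sketch of that 30-second sync job, using sqlite3 as a stand-in for the Postgres runs table (the function name and schema are illustrative):

```python
import json
import sqlite3
from pathlib import Path

def sync_runs(runs_dir: str, db_path: str) -> int:
    """Read every per-run JSON file and upsert it into a runs table.
    Safe to re-run on a timer: run_id is the primary key."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS runs (
        run_id TEXT PRIMARY KEY, status TEXT,
        input_rows INTEGER, output_rows INTEGER, raw_json TEXT)""")
    n = 0
    for f in sorted(Path(runs_dir).glob("*.json")):
        m = json.loads(f.read_text())
        conn.execute(
            "INSERT INTO runs VALUES (?, ?, ?, ?, ?) "
            "ON CONFLICT(run_id) DO UPDATE SET status = excluded.status, "
            "raw_json = excluded.raw_json",
            (m["run_id"], m.get("status"), m.get("input_rows"),
             m.get("output_rows"), json.dumps(m)))
        n += 1
    conn.commit()
    conn.close()
    return n
```

If the sync job is down for a day, nothing breaks; the next run catches up on the backlog of files. That lag tolerance is the whole point of the file-on-disk decoupling.
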

Wall 3: no destination idempotency — replays double-count

Wall 3 is the deepest, the most counterintuitive, and the wall that breaks the most CFO dashboards.

The fifty-line pipeline writes its output to /data/out/refunds_settled_2026-04-24.csv via os.replace. If the same run is invoked twice, the second invocation overwrites the first with the same content, byte-for-byte. The output is file-idempotent: re-running produces the same file. Good.

The downstream consumer, however, is not a file system — it is a Postgres refunds_ledger table that the BI team queries. The downstream loader reads the CSV and INSERTs the rows. If the pipeline runs twice, the table now has every row twice, and the daily refund total reads ₹84 lakh instead of ₹42 lakh. The CFO's dashboard now shows a phantom ₹42 lakh that does not exist. This is the day a senior analyst spends nine hours reconciling, and the day the engineering team learns the difference between "pipeline-idempotent" and "destination-idempotent".

The chain is:

Extract  →  Transform  →  Load (file)  →  Load (warehouse)
  pure       pure          file-idempotent    NOT idempotent
                           (os.replace)       (INSERT appends)

The wall is at the rightmost arrow. Three patterns get past it, each the subject of multiple chapters in Build 2:

Pattern A: dedup hash + UPSERT. The pipeline computes a stable, deterministic key for each row (e.g., sha256(refund_id + run_date)) and stores it in a column. The warehouse loader uses INSERT ... ON CONFLICT (dedup_key) DO NOTHING (Postgres) or MERGE INTO ... USING ... WHEN MATCHED THEN ... WHEN NOT MATCHED THEN INSERT (Snowflake/BigQuery/Redshift). A second invocation finds every row's dedup_key already present and inserts nothing.
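
A minimal, runnable sketch of Pattern A, using sqlite3 as a stand-in for Postgres (SQLite shares the INSERT ... ON CONFLICT syntax; the table name and key choice are illustrative):

```python
import hashlib
import sqlite3

def load_idempotent(conn, rows, run_date):
    """Pattern A: stable dedup key per row + INSERT ... ON CONFLICT DO NOTHING."""
    conn.execute("""CREATE TABLE IF NOT EXISTS refunds_ledger (
        dedup_key TEXT PRIMARY KEY, refund_id TEXT,
        run_date TEXT, amount_paise INTEGER)""")
    for r in rows:
        # Deterministic key: the same row on the same date always hashes the same.
        key = hashlib.sha256(f"{r['refund_id']}{run_date}".encode()).hexdigest()
        conn.execute(
            "INSERT INTO refunds_ledger VALUES (?, ?, ?, ?) "
            "ON CONFLICT(dedup_key) DO NOTHING",
            (key, r["refund_id"], run_date, r["amount_paise"]))
    conn.commit()

conn = sqlite3.connect(":memory:")
rows = [{"refund_id": "rf_001", "amount_paise": 49900},
        {"refund_id": "rf_002", "amount_paise": 120000}]
load_idempotent(conn, rows, "2026-04-24")
load_idempotent(conn, rows, "2026-04-24")  # replay: every key already present
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount_paise) FROM refunds_ledger").fetchone()
# count == 2, total == 169900 — identical after one run or two
```
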

Pattern B: snapshot replace. The loader writes to a partitioned table where the partition key is run_date. Each run truncates today's partition and reloads it. A second invocation truncates and re-writes; the table state is identical. This is the warehouse analogue of the staging-rename pattern from chapter 3.
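
The same shape for Pattern B, again sketched against sqlite3 (the "partition" here is just a run_date column; real warehouses swap whole physical partitions):

```python
import sqlite3

def load_snapshot(conn, rows, run_date):
    """Pattern B: truncate today's partition, reload it; replays converge on one state."""
    conn.execute("""CREATE TABLE IF NOT EXISTS refunds_ledger (
        refund_id TEXT, run_date TEXT, amount_paise INTEGER)""")
    with conn:  # one transaction, so readers never see the half-loaded partition
        conn.execute("DELETE FROM refunds_ledger WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO refunds_ledger VALUES (?, ?, ?)",
            [(r["refund_id"], run_date, r["amount_paise"]) for r in rows])

conn = sqlite3.connect(":memory:")
rows = [{"refund_id": f"rf_{i:03}", "amount_paise": 10_000} for i in range(5)]
load_snapshot(conn, rows, "2026-04-24")
load_snapshot(conn, rows, "2026-04-24")  # replay: delete 5, insert 5, same state
```
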

Pattern C: append-only with _run_id filter. The loader appends every row, including a _run_id column. Downstream queries always do SELECT ... WHERE _run_id = (SELECT MAX(_run_id) FROM runs WHERE run_date = '2026-04-24') to pick the canonical run for that date. A re-run produces a new _run_id that wins; the old rows stay in the table for forensics. This pattern wastes storage but trivialises auditing — every re-run is recoverable by quoting an old _run_id.
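
Pattern C in miniature, with a lexical MAX(_run_id) standing in for a proper monotonic run counter:

```python
import sqlite3

def load_append(conn, rows, run_date, run_id):
    """Pattern C: append everything; queries pick the newest run_id per date."""
    conn.execute("""CREATE TABLE IF NOT EXISTS refunds_ledger (
        refund_id TEXT, run_date TEXT, _run_id TEXT, amount_paise INTEGER)""")
    conn.executemany(
        "INSERT INTO refunds_ledger VALUES (?, ?, ?, ?)",
        [(r["refund_id"], run_date, run_id, r["amount_paise"]) for r in rows])
    conn.commit()

# The canonical-run filter every downstream query must apply.
CANONICAL = """SELECT COUNT(*) FROM refunds_ledger
               WHERE run_date = ? AND _run_id =
                 (SELECT MAX(_run_id) FROM refunds_ledger WHERE run_date = ?)"""

conn = sqlite3.connect(":memory:")
rows = [{"refund_id": f"rf_{i}", "amount_paise": 100} for i in range(3)]
load_append(conn, rows, "2026-04-24", "run_a")
load_append(conn, rows, "2026-04-24", "run_b")  # replay under a newer run_id
# rows stored: 6; rows the canonical query sees: 3 (run_b wins, run_a kept for forensics)
```
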

[Figure: Three idempotency patterns at the destination. Three columns showing how a re-run is absorbed and what the warehouse looks like afterwards. A — dedup hash + UPSERT: first run INSERTs 28,915 rows with a unique dedup_key per row; second run (replay) UPSERTs and finds every key already present — total rows: 28,915 ✓. Cheap; needs a unique key on every row. B — snapshot replace: first run DELETEs WHERE date='today' then INSERTs 28,915 rows; second run (replay) deletes 28,915 and inserts 28,915 in a partition-level swap — total rows: 28,915 ✓. Cheap; needs a partition column on the table. C — append + run_id filter: first run INSERTs 28,915 rows with _run_id = run_a; second run (replay) INSERTs 28,915 more with _run_id = run_b (newer) — queries see 28,915 ✓. Storage-heavy; trivialises audit and forensic queries.]

Three patterns for destination idempotency. Pattern A is the default for OLTP destinations (Postgres, MySQL); Pattern B is the default for partitioned warehouse tables (BigQuery, Snowflake, Iceberg); Pattern C is the default for append-only stores like Kafka or S3-as-blob-store. Build 2 walks each in detail and shows the failure modes of each.

Why "destination idempotency" is named that way: the property is not about the pipeline (Riya's pipeline is already deterministic) but about how the destination absorbs a re-delivered output. The same correct pipeline can be safe against one destination and unsafe against another, depending on whether the destination's commit primitive supports INSERT ... ON CONFLICT or only blind INSERT. This is why the wall is destination-shaped, not pipeline-shaped.

A subtle property of the three patterns: only Pattern C is retroactively idempotent — meaning, if the destination already has rows from a non-idempotent earlier life, Pattern C still works because it filters by _run_id, while Patterns A and B require a one-time cleanup before they become safe. Teams adopting destination idempotency on an existing warehouse usually start with Pattern C for the migration period, then move to Pattern A or B once the historical data is reconciled. The migration is itself a multi-day task, which is why "we'll add idempotency later" is one of the most expensive sentences in data engineering.

Why a UNIQUE index alone is not enough — you also need the loader to use ON CONFLICT DO NOTHING (or MERGE). A UNIQUE index without ON CONFLICT causes a re-run to fail on the first duplicate row instead of silently absorbing it. The pipeline's exit code will be non-zero, the orchestrator will retry, the retry will fail again, and the team will spend the morning debugging why "their idempotency is broken". The idempotency is not broken; the loader is missing the half of the contract that says "expected duplicates are not errors".
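
The failure mode is easy to reproduce; sketched here with sqlite3, which raises the same class of error a Postgres UNIQUE violation does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refunds_ledger "
             "(run_date TEXT, refund_id TEXT, amount_paise INTEGER)")
conn.execute("CREATE UNIQUE INDEX refunds_dedup_idx "
             "ON refunds_ledger (run_date, refund_id)")
row = ("2026-04-24", "rf_001", 49900)
conn.execute("INSERT INTO refunds_ledger VALUES (?, ?, ?)", row)

# Half the contract: the blind INSERT hits the index and the whole run fails.
try:
    conn.execute("INSERT INTO refunds_ledger VALUES (?, ?, ?)", row)
    absorbed = True
except sqlite3.IntegrityError:
    absorbed = False  # this is the retry-then-fail-again path

# The full contract: expected duplicates are not errors.
conn.execute(
    "INSERT INTO refunds_ledger VALUES (?, ?, ?) ON CONFLICT DO NOTHING", row)
```

An orchestrator retrying the first form fails identically every attempt; the ON CONFLICT form is what makes its retries safe.
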

How the three walls compose

Each wall on its own is a problem; together, they are the architecture of a production pipeline. The composition matters: solving them in the wrong order produces something that looks like progress but isn't.

A scheduler with no observability is a black box that runs on a timer and tells you nothing. You know the pipeline ran (cron's stderr), you know it didn't crash (exit code 0), but you have no idea whether it produced the right data. The team at the Bengaluru fintech ran exactly this configuration for three months in early 2024 and discovered, only when an external auditor matched their refund totals against bank statements, that 4.7% of refunds had been silently dropped due to a schema-drift bug. The scheduler had been "working" the whole time.

A scheduler with observability but no idempotency is honest about its dishonesty — you can see the pipeline ran twice, but the warehouse has double-counted anyway. This is the most common production state because it is the easy intermediate point: orchestrator first (visible win), data-quality later (invisible loss). The cure is destination idempotency before scheduler retries; that is why Build 2 precedes Build 4.

Idempotency without scheduling means the pipeline is safe to re-run, but nobody re-runs it on time. This is where a 12-hour outage becomes a 48-hour outage because the on-call engineer correctly says "we need to wait for the upstream to publish first" and then forgets to come back at 04:00 to invoke the pipeline by hand. Idempotency makes scheduling possible; scheduling makes idempotency useful.

The deepest lesson: the three walls are not independent. Solving any one in isolation creates a more dangerous system, not a less dangerous one. The Build sequence (2 → 4 → 5) is calibrated so that each new capability is paired with the safety net the previous one demanded.


Going deeper

What "right number of walls" looks like — why three and not five

A natural question is whether three walls is the right abstraction or whether it should be five (add: schema evolution, security/PII), or seven (add: cost attribution, multi-region replication). The answer is that the three named here — schedule, observability, destination idempotency — are the walls that block a single pipeline from running unattended in production. Schema evolution becomes a wall when you have more than one pipeline sharing types; PII becomes a wall when you have more than one tenant sharing infrastructure. Cost becomes a wall when you have more than one team sharing a warehouse.

The narrowing-down to three is deliberate. A class-12 student building their first pipeline at an internship needs exactly these three; the others come later as the pipeline graduates from "an internal report" to "a billed product". Writing the curriculum to a five-wall or seven-wall framework would be more "complete" but would also make the early build experience feel like wading through scaffolding. Build 5 introduces schema evolution and contracts; Build 12 introduces multi-tenancy and isolation; Build 16 closes the loop with cost and governance. Each wall enters the curriculum at the build where it first becomes the dominant problem.

Why Build 2 (idempotency) precedes Build 4 (scheduler) in the curriculum

The chapter ordering looks counterintuitive — Build 4 (the scheduler) feels like the "first" production primitive, since you can't run a pipeline reliably without one. But the curriculum places Build 2 (idempotency) first. The reason is the composition rule from §"How the three walls compose": a scheduler that retries a non-idempotent pipeline causes more harm than no scheduler at all. A failed pipeline that an engineer manually re-runs after fixing the root cause is at most one incident; a failed pipeline that an Airflow retry-3-times-with-exponential-backoff loop attacks is three incidents at scale.

The Razorpay engineering team learned this in 2021 when a refund pipeline with default-Airflow-retries-on caused a brief duplication incident during a transient S3 timeout. The fix shipped in two parts: dedup-key the destination (Build 2 thinking), and only then turn on retries (Build 4 thinking). The curriculum follows the same order because it matches the order in which production teams must adopt the patterns. Build 2's content is the precondition for Build 4 to be safe.

The lineage wall hidden inside the observability wall

§Wall 2 lists three layers of observability: run, data, lineage. The chapter treats all three as a single wall, but in production they fail at different timescales. Run observability is what you need at 06:30 when ops asks "did it run". Data observability is what you need at 09:00 when an analyst asks "is the number right". Lineage observability is what you need at 14:00 three weeks later when a downstream dashboard is wrong and nobody knows which of the seventeen feeding pipelines corrupted it.

The reason lineage hides inside the observability wall in this chapter (rather than being its own fourth wall) is that for a single pipeline, lineage is trivial — there is one input, one output, one line of metrics["input_uri"] = .... Lineage becomes its own structural problem when the pipeline graph has dozens of nodes, which is Build 5's territory. At chapter 5's level of pipeline complexity, lineage is an extension of run observability; at Build 5's level (and beyond), lineage is its own dimension of the problem.

This is also why "data lineage" tools (OpenLineage, Marquez, DataHub, Atlan) become useful exactly at the company size where the pipeline graph stops fitting in one engineer's head. A two-pipeline shop in Coimbatore does not need Marquez; a forty-pipeline shop at PhonePe does. The lineage wall is the same wall — it just becomes load-bearing later.

What the three walls look like at Razorpay scale

Each wall has a well-known production manifestation at large Indian companies. The shapes are instructive because they show how the same wall appears at every scale, just with bigger consequences.

The schedule wall at Razorpay-scale is not "did the cron run" but "did the multi-region orchestrator pick the right region after the Mumbai AZ went degraded at 03:14". Their pipeline orchestrator is Apache Airflow, configured with active-passive failover across regions. The wall isn't the concept of scheduling; it is the concept of scheduling-under-failure. Build 4 covers the local version; Build 17 returns to the multi-region version.

The observability wall at Flipkart-scale during Big Billion Days is not "is the pipeline metrics file written" but "out of 2,400 production pipelines, which 14 are running 8 minutes slower than baseline today, and is the cause our load on warehouse compute or upstream API throttling". The wall is the same wall — it is just being asked of a much larger fleet. The instrumentation primitives are the same JSON-per-run files; the difference is a centralised metrics store, an alerting tier, and a runbook that answers "given this anomaly, what do I do".

The destination-idempotency wall at PhonePe-scale is not "did we INSERT a refund twice" but "did our exactly-once-semantics across Kafka transactions and Postgres MERGE actually hold during the partition-rebalance event at 14:22 last Thursday". The pattern is the same dedup-hash-plus-MERGE; the destination is a more complex coordinated commit. Builds 8 and 9 are this exact wall at stream-processing scale.

The continuity matters: a class-12 student implementing the three walls on a laptop pipeline is doing the same engineering, in spirit, that the Razorpay platform team is doing at 100M tx/day. The mechanisms scale; the walls don't change.

The cheapest possible breach of each wall — what "afternoon defence" looks like

Each wall has a deluxe answer (Airflow + DataHub + Snowflake MERGE) and a cheap-but-honest answer that fits in an afternoon and is sufficient for a pipeline serving fewer than three downstream consumers. The cheap answers are worth memorising because they let you move past a wall today without committing to a platform decision the team isn't ready to make.

For the schedule wall, the afternoon answer is cron with flock for mutual exclusion: 0 6 * * * /usr/bin/flock -n /tmp/refunds.lock /usr/bin/python /opt/jobs/pipeline.py $(date -d "today" -I). The flock prevents the cron job from running twice if a previous invocation is still going; the $(date) produces the deterministic run_date the pipeline expects. This is not a real scheduler — it has no retries, no sensors, no DAG — but it solves the "pipeline doesn't run when Riya is on leave" problem for ₹0 of platform cost and 11 minutes of engineer time.
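
The same mutual-exclusion shape is available in-process via fcntl.flock, useful when the entry point is Python rather than a crontab line (run_exclusive is an illustrative name, not a library API):

```python
import fcntl

def run_exclusive(lock_path: str, fn):
    """Hold an exclusive flock for the duration of fn; skip if already running."""
    with open(lock_path, "w") as lock:
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped"  # a previous invocation still holds the lock
        return fn()  # lock released when the file closes, even on exception
```
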

For the observability wall, the afternoon answer is exactly the JSON-per-run file the listing in §Wall 2 writes. A second 20-line script reads /data/_runs/*.json and posts to a Slack webhook on any status: failed or any output_rows outside ±20% of the trailing 7-day median. Total moving parts: two files, one webhook URL, no SaaS. A team at a Pune logistics company ran exactly this for 14 months before they outgrew it; for the volumes most pipelines see in their first year, this is sufficient.
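
A sketch of that second script, minus the webhook call (check_runs is an illustrative name; posting each returned alert string to Slack is one HTTP call more):

```python
import json
import statistics
from pathlib import Path

def check_runs(runs_dir: str) -> list[str]:
    """Alert on any failed run, or any output_rows outside ±20% of the trailing median."""
    runs = [json.loads(f.read_text()) for f in sorted(Path(runs_dir).glob("*.json"))]
    alerts = []
    if not runs:
        return alerts
    latest, history = runs[-1], runs[-8:-1]  # today vs the trailing 7 runs
    if latest.get("status") == "failed":
        alerts.append(f"{latest['run_id']}: FAILED at stage "
                      f"{latest.get('failed_at_stage')}")
    elif history:
        median = statistics.median(r["output_rows"] for r in history)
        rows = latest.get("output_rows", 0)
        if not 0.8 * median <= rows <= 1.2 * median:
            alerts.append(f"{latest['run_id']}: output_rows {rows} "
                          f"vs trailing median {median}")
    return alerts
```
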

For the destination-idempotency wall, the afternoon answer is CREATE UNIQUE INDEX refunds_dedup_idx ON refunds_ledger (run_date, refund_id) plus changing the loader from INSERT to INSERT ... ON CONFLICT DO NOTHING. Two SQL statements. No new tools, no schema redesign, no new engineering vocabulary. Pattern A from §Wall 3 applied at its cheapest. This single defence absorbs every "Kafka replayed the topic, the same row arrived twice" event the team will see in its first year of operation.

The reason to highlight the afternoon answers is that Build 2–5 will introduce the deluxe answers, and it is easy to read those builds and conclude "we can't have a real pipeline until we have Airflow plus DataHub plus Iceberg". That is wrong. The walls exist regardless of platform; the deluxe answers are one way to breach them, but the cheap-and-honest answers breach the same walls and are what most production pipelines in India actually run on for their first 18 months. Engineering judgement is knowing which answer your team is at the size for.

When a wall is not a wall — pipelines where one of the three doesn't apply

A useful sanity check is: is there a pipeline where one of the three walls genuinely doesn't apply? Yes, and the exceptions illuminate the rule.

A pipeline that is invoked synchronously from a user request (e.g., a recommendation engine called from a Flipkart product page) does not need a scheduler — its scheduler is the HTTP request. The other two walls still apply: it must be observable (request-level traces) and destination-idempotent (a retry of the same request must not double-charge a user). The schedule wall dissolves when the trigger is the user.

A pipeline whose only output is an in-memory cache that is rebuilt from scratch every run (e.g., a daily-recompute of a leaderboard) does not need destination idempotency — there is no destination state to corrupt because the destination is wholly replaced each time. The scheduler and observability walls still apply. The destination-idempotency wall dissolves when the destination is fully overwriting.

A pipeline that is part of an interactive notebook session — a data scientist running cells one at a time — has none of the three walls in its purest form; the human is the scheduler, the human reads the output, and the destination is a notebook variable. It is also not a pipeline in the sense this curriculum cares about. The three walls are precisely what define the boundary between "a script" and "a pipeline".

The discipline: when a pipeline seems to "not need" one of the three walls, ask which boundary it crosses that absolves it. The answer is always a structural property of the pipeline (synchronous trigger, fully-overwrite output, human in the loop). When that structural property goes away — and it usually does, six months in — the wall reappears.

Where this leads next

The next chapter (chapter 6) opens Build 2 by walking the simplest case where destination idempotency matters: re-running the same pipeline twice and counting rows. Builds 4 and 5 then layer scheduling and observability on top. By chapter 30 — the end of the foundational sequence — Riya's fifty-line pipeline has crossed all three walls and is something a team can rely on without typing.

A useful exercise before moving to chapter 6: take the instrumented pipeline above, pipe its metrics JSON to a new file per run, and write a 20-line script that reads the last 30 days of runs and prints an alert if today's output_rows is more than 3σ below the trailing mean. That script is, in miniature, the data-observability layer Build 5 builds at scale.

References

  1. Hadoop _SUCCESS marker convention — the original "destination idempotency" primitive, predating modern lakehouses by a decade.
  2. Airflow scheduling concepts — the production reference for what a real scheduler offers on top of cron. Build 4's vocabulary is largely drawn from here.
  3. Postgres ON CONFLICT documentation — the destination side of Pattern A. Required reading before writing your first idempotent INSERT.
  4. Designing Data-Intensive Applications, Chapter 11: Stream Processing — Martin Kleppmann's framing of "exactly-once" as "effectively-once via idempotency", which is the same wall this chapter names.
  5. OpenLineage specification — the data-lineage layer Build 5 builds on. Worth skimming once to see where Wall 2's third layer eventually lives.
  6. Razorpay engineering: pipeline reliability at UPI scale — production examples of all three walls at 100M tx/day, including the 2021 destination-idempotency incident referenced in §Going deeper.
  7. Zerodha Pulse: data infrastructure at an Indian broker — the orderbook-tick equivalent of refunds-pipeline reliability, with the same three walls in different costumes.
  8. Crashing mid-run: which state are you in? — internal cross-reference. The crash-recovery doctrine is the layer below this chapter's three walls; everything in this chapter assumes the previous chapter's recovery shape.

A final test before moving to chapter 6: take Riya's pipeline and write down, for each of the three walls, the cheapest possible defence — not the elegant one, the one you would ship in an afternoon. For Wall 1, "cron with flock for mutual exclusion". For Wall 2, "the JSON-per-run file from this chapter's listing". For Wall 3, "a UNIQUE index on (run_date, refund_id) plus INSERT ... ON CONFLICT DO NOTHING". If those three afternoon-defences are in place, the pipeline is past the walls in form, and Builds 2–5 are the elaboration. If any of them is missing, that is where production will hit you first.