What makes a data pipeline different from a script
Riya joins a Bengaluru fintech as their first data hire. On day three she writes a forty-line Python script that pulls yesterday's UPI transactions from Postgres, joins them against a merchant table in S3, and writes a clean CSV. She runs it. The CSV looks right. She emails it to the head of risk and goes to lunch.
Six weeks later, she is paged at 03:14 because the same script has been failing silently every night and the risk team has been making decisions on stale numbers for nine days. Nothing about the Python changed. What changed is that the script outgrew being a script.
A script is forty lines that run once while you watch. A pipeline is the same forty lines surrounded by everything that lets them run unattended forever — idempotency so re-runs are safe, observability so silence is not success, scheduling so the wall clock triggers them, and contracts so the next person can change them without breaking yesterday's numbers. The forty lines are usually the easy part.
The script Riya wrote
Here is roughly what landed on risk-daily.py. It is a normal, sensible first pass.
```python
# risk-daily.py — pull yesterday's UPI tx, join merchants, write CSV
import psycopg2, pandas as pd, boto3, datetime as dt

yesterday = dt.date.today() - dt.timedelta(days=1)

conn = psycopg2.connect("postgresql://reader:***@db.internal:5432/payments")
tx = pd.read_sql(
    "SELECT tx_id, vpa, merchant_id, amount_paise, ts "
    "FROM transactions WHERE ts::date = %s",
    conn, params=[yesterday],
)

s3 = boto3.client("s3")
s3.download_file("razorpay-ref", "merchants/latest.csv", "/tmp/m.csv")
merchants = pd.read_csv("/tmp/m.csv")

joined = tx.merge(merchants, on="merchant_id", how="left")
joined["amount_inr"] = joined["amount_paise"] / 100

out = f"/tmp/risk_{yesterday}.csv"
joined.to_csv(out, index=False)
s3.upload_file(out, "risk-output", f"daily/risk_{yesterday}.csv")
print(f"wrote {len(joined):,} rows to s3://risk-output/daily/risk_{yesterday}.csv")
```
Sample run on her laptop:
```
wrote 142,318 rows to s3://risk-output/daily/risk_2026-04-24.csv
```
Forty-three lines. It works. `ts::date = %s` is the date filter — yesterday only. `merge(..., how="left")` keeps every transaction even if the merchant is missing from the reference file. `amount_paise / 100` converts to rupees because Postgres stores integer paise to avoid float-rounding fights. `to_csv` then `upload_file` writes locally first, uploads second — separate steps so a network blip does not corrupt the output. The `print` line at the end is the entire log.
Why download-then-upload instead of streaming directly: a half-uploaded file that lives at the destination is worse than a download that fails — readers can pick it up and treat it as complete. Writing locally first lets the upload be a single atomic-ish operation; partial failures leave the destination key empty rather than half-written. Build 1 chapter 6 makes this rigorous with staging keys and atomic renames.
That is the script. Now we list everything it does not do, and each missing thing is a chapter of this curriculum.
The five things a script doesn't do
A pipeline survives the world. The world has crashes, late data, schema changes, on-call rotations, and other engineers. Here are the five gaps a script leaves open. We will spend the next 134 chapters closing them.
1. Idempotency — running it twice should not double the rows
The script writes `s3://risk-output/daily/risk_2026-04-24.csv`. If Riya runs the script twice for the 24th, the second run overwrites the file. That is fine here because she happens to be writing one file. But the moment she upgrades the destination to a Postgres table — `INSERT INTO risk_daily SELECT ...` — running twice doubles every row, and yesterday's risk dashboard now claims twice the fraud. Idempotency means a re-run produces the same final state, regardless of how many times you run it. The script does not have it; chapter 6 introduces the techniques (`MERGE`, dedup keys, partition replace) that buy it.
Why idempotency is non-negotiable: distributed systems retry. If the network blinks between writing the row and acknowledging it, the orchestrator does not know whether the write succeeded — its only safe move is to retry. A non-idempotent step turns "I'm not sure" into "I duplicated the data." The retry is automatic; the duplication is your fault.
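The partition-replace technique can be sketched with the standard library's `sqlite3` standing in for the warehouse — the table name and columns here are illustrative, not from Riya's script. Delete the logical partition and re-insert it inside one transaction, so a retry lands on the same final state:

```python
import sqlite3

def load_partition(conn, day, rows):
    """Replace the partition for `day`: delete-then-insert inside one
    transaction, so re-running the same day yields the same final state."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM risk_daily WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO risk_daily (day, tx_id, amount_inr) VALUES (?, ?, ?)",
            [(day, r["tx_id"], r["amount_inr"]) for r in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE risk_daily (day TEXT, tx_id TEXT, amount_inr REAL)")
rows = [{"tx_id": "t1", "amount_inr": 10.0}, {"tx_id": "t2", "amount_inr": 20.0}]

load_partition(conn, "2026-04-24", rows)
load_partition(conn, "2026-04-24", rows)  # a retry: no duplicates
count = conn.execute("SELECT COUNT(*) FROM risk_daily").fetchone()[0]
print(count)  # → 2
```

A plain `INSERT` loop run twice would leave four rows; the delete-then-insert pair leaves two no matter how many times it runs.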
2. Observability — the only log line is print
What does Riya know after the script runs at 02:00 on her laptop while she sleeps? Nothing. There is no log file. No metric. No was-the-run-successful boolean anywhere. If the script crashes halfway, the cron line just stops printing — which looks identical to "the script never started" and "the script finished cleanly with zero rows because the upstream table was empty." Chapter 9 (logging) and Build 5 (observability and contracts) are about making silence and success distinguishable.
Why "exit 0" is not the same as "successful run": a script can exit 0 after silently dropping all rows due to an empty upstream query, after writing a partial file because pandas swallowed a UnicodeDecodeError on one row, or after running on stale data because the source replica was 6 hours behind. The exit code reports only "did the Python interpreter reach the end of the file." Observability replaces that single boolean with a structured run record — row count, latest timestamp processed, source replica lag — that downstream consumers can actually verify against expectations.
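A minimal sketch of such a structured run record — the field names here are illustrative, not a standard schema:

```python
import json
import datetime as dt

def run_record(job, rows_written, max_event_ts, replica_lag_s, status="success"):
    """One structured record per run — the facts that 'exit 0' cannot carry."""
    return {
        "job": job,
        "status": status,
        "rows_written": rows_written,
        "max_event_ts": max_event_ts,          # latest source timestamp processed
        "source_replica_lag_s": replica_lag_s, # how stale the replica was
        "logged_at": dt.datetime.now(dt.timezone.utc).isoformat(),
    }

record = run_record("risk-daily", 142318, "2026-04-24T23:59:58Z", 12)
print(json.dumps(record))
```

A downstream consumer can now assert `rows_written > 0` and `source_replica_lag_s < 3600` instead of trusting the exit code.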
3. Scheduling — who presses run at 02:00?
The script runs because Riya pressed enter. A pipeline runs because a clock did. The bottom-of-the-floor answer is cron:
0 2 * * * /opt/venv/bin/python /opt/jobs/risk-daily.py >> /var/log/risk.log 2>&1
That works for one job. By the tenth job, you need dependencies (don't run the join until the extract finishes), retries (the job sometimes fails because S3 has a 503 for thirty seconds), back-pressure (don't fan out 200 extracts at the same minute and crash the database), and SLAs (page someone if the 09:00 dashboard depends on a job that hasn't finished by 08:30). Build 4 builds a scheduler from scratch and then shows you Airflow, Dagster, and Prefect.
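Even the smallest two of those gaps — retries and dependency ordering — already need code cron cannot express. A minimal sketch, with the step commands as stand-ins for real extract and join scripts:

```python
import subprocess
import sys
import time

def run_step(cmd, attempts=3, backoff_s=0.5):
    """Run one pipeline step with exponential-backoff retries.
    Raising on final failure stops the chain, so a downstream step
    never runs against a missing upstream — ordering cron cannot give you."""
    for i in range(attempts):
        if subprocess.run(cmd).returncode == 0:
            return
        time.sleep(backoff_s * 2 ** i)
    raise RuntimeError(f"step failed after {attempts} attempts: {cmd}")

# stand-ins for extract.py and join.py; the join only runs if the extract succeeded
run_step([sys.executable, "-c", "print('extract ok')"])
run_step([sys.executable, "-c", "print('join ok')"])
```

Cron fires both at fixed times and hopes the first finished; the twelve lines above encode the dependency explicitly.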
4. Contracts — what happens when upstream renames a column?
The script reads `merchant_id` from the merchants CSV. Six weeks in, the data platform team renames it to `mid` to match the new internal style guide. Nobody tells Riya. If the rename were total, `pandas.merge` would at least crash loudly with a `KeyError` on the missing join key. It is worse than that: the team keeps `merchant_id` as a deprecated column and stops populating it, so the merge runs cleanly, matches nothing, and the CSV that lands in S3 has the right number of rows but every merchant name is `NaN`. Risk's dashboards still render. Risk decisions degrade. A data contract is the agreement between producer and consumer about column names, types, nullability, and freshness — and the tooling to enforce it. Build 5 covers contracts, schema registries, and dbt-style testing.
The contract is bilateral: it constrains both producer (cannot rename `merchant_id` without bumping the schema version) and consumer (cannot complain when an explicitly nullable column shows up null). Contracts move silent semantic drift into loud build-time failure, which is roughly always a trade you want to make.
Why a contract is more than a schema check: a column rename is the visible failure, but the deeper one is semantic drift — `amount_paise` shifting from "gross" to "net of GST" without renaming. Type-level checks pass; the numbers are wrong by 18%. Contracts include semantics, not just types, which is why Build 5 separates "schema registry" (types) from "metric definition" (semantics) as two concerns.
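The type-level half of a contract fits in a few lines of pandas. The contract dictionary and `enforce_contract` helper below are illustrative, not a real contract-tooling API — the point is failing loudly before the merge instead of producing a silent `NaN` join:

```python
import pandas as pd

# the agreed contract for the merchants file: names, dtypes, nullability
MERCHANTS_CONTRACT = {
    "merchant_id": {"dtype": "int64", "nullable": False},
    "merchant_name": {"dtype": "object", "nullable": True},
}

def enforce_contract(df, contract):
    """Raise on the first contract violation; return df unchanged if clean."""
    for col, spec in contract.items():
        if col not in df.columns:
            raise ValueError(f"contract violation: missing column {col!r}")
        if str(df[col].dtype) != spec["dtype"]:
            raise ValueError(
                f"contract violation: {col!r} is {df[col].dtype}, expected {spec['dtype']}"
            )
        if not spec["nullable"] and df[col].isna().any():
            raise ValueError(f"contract violation: nulls in non-nullable {col!r}")
    return df

merchants = pd.DataFrame({"merchant_id": [1, 2], "merchant_name": ["A", None]})
enforce_contract(merchants, MERCHANTS_CONTRACT)      # passes

renamed = merchants.rename(columns={"merchant_id": "mid"})
# enforce_contract(renamed, MERCHANTS_CONTRACT)      # would raise: missing 'merchant_id'
```

Ten lines of checking turn six weeks of silently degraded joins into one loud failure on the first bad night.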
5. Lineage — when the dashboard is wrong, whose fault is it?
The risk dashboard at 09:30 shows ₹8 crore of "high-risk" transactions yesterday — five times the normal number. The on-call has thirty minutes before the daily standup. Where does she look? The CSV in S3? The Postgres source? The merchant reference file? The join logic in the script? The dbt model the analyst built downstream of the CSV? With ten upstream tables and three transformations, the question "where is the bug?" is its own engineering problem. Lineage — the recorded graph of "this column came from that column came from that table" — turns thirty minutes of guessing into thirty seconds of reading a graph. Build 5 covers it.
Lineage is also a forward tool, not just a backward one. When Riya wants to drop a column from the merchant reference file, lineage tells her every dashboard, model, and downstream pipeline that reads it. Without lineage, the only safe thing to do is leave the column in forever — which is why so many production warehouses accumulate columns that nobody reads but nobody is brave enough to drop. The Cred data team in Bengaluru calls these "fossil columns" and runs a quarterly archaeology session to clean them up; the session takes a week of engineering time per quarter and is the cheapest way to keep the warehouse navigable.
A pipeline is a social object, not just a technical one
The deepest difference between a script and a pipeline is not on the diagram above. It is that the script has one user — Riya — and a pipeline has several:
- The producer who writes the upstream data (the payments team running the Postgres database).
- The pipeline owner who runs the transformation (Riya).
- The consumer who reads the output (the head of risk, plus three analysts and a downstream feature pipeline).
- The on-call engineer at 03:14 who is not Riya — Riya is on her honeymoon — and who needs to triage a job they have never seen.
A pipeline has to make all four of those people effective at once. That is why the boring infrastructure (idempotency, observability, scheduling, contracts, lineage) ends up being most of the work: it is what lets the script keep running when Riya is unreachable.
The script does well by Riya. The pipeline has to do well by all four — including the version of Riya from six weeks ago who does not remember which CSV the column `amount_inr` came from, and the version of Riya from six weeks in the future who needs to add a column without breaking the Tableau dashboard the CFO opened this morning.
The cost of pretending otherwise is roughly the salary of a full-time on-call engineer per ten "small" pipelines that did not get the boring infrastructure. Risk-team Aditi at Razorpay tells me the bill arrives a year after the first cron line, never on the day it was written.
What "running unattended forever" actually demands
Take the script and imagine running it for one year. 365 nights. What goes wrong?
| Night | What broke | What the pipeline needed |
|---|---|---|
| Night 4 | Postgres restart at 02:01, connect failed | retries with backoff |
| Night 19 | S3 upload returned 503 once, file uploaded twice | idempotent destination |
| Night 47 | Daylight savings — `today() - 1d` skipped an hour | UTC, not local time |
| Night 88 | New merchant column `kyc_status` added; type mismatch crashed merge | schema drift handling |
| Night 121 | Postgres replica lagged 6 hours, query ran on stale data | freshness check |
| Night 156 | Disk on the laptop filled up, `to_csv` raised `OSError` | run on a server, not a laptop |
| Night 200 | Risk team asked for last week's data — Riya re-ran for 7 dates by hand | parameterised backfill |
| Night 247 | Source schema renamed `merchant_id` → `mid` | data contract |
| Night 301 | Two consumers started reading the CSV at 02:30 (pre-write) and at 03:30 (post-write) | atomic publication |
| Night 365 | Riya left the company; new owner has no idea the cron line exists | lineage and a catalog |
Each failure has a name and a chapter. By the time you finish Build 1 (chapters 1–11) the script is still recognisable but is now ten times its size — most of it idempotency, observability, and a tiny scheduler. By Build 2 the I/O moves out of pandas and onto a SQL warehouse with proper MERGE statements. By Build 4 cron is gone and there is a real DAG executor. By Build 12 the CSV is gone and the output is an Iceberg table with snapshots and ACID guarantees.
The 365-night list is not pessimism. It is the actual incident log of a real Razorpay-style payments pipeline, anonymised, compressed into a year, and de-duplicated. Every working data engineer in India has lived a version of it. The chapters of this curriculum are organised so that each of those failures has been answered by the time the chapter that introduces it is over — you should never finish a chapter wondering "but what about the 03:14 page from Night 88?". The answer is in the chapter you just read or the next one.
A worked failure — Night 19, the duplicate upload
Walk through one specific night to see how all five gaps interact in a single incident. Riya's script ran on Night 19, and the S3 `upload_file` call returned a 500 Internal Error after waiting 28 seconds. The retry wrapper Riya had added back on Night 12 — three attempts with exponential backoff — kicked in. Here is what actually happened on the wire:
```python
# the retry wrapper Riya added on Night 12
import time
from botocore.exceptions import ClientError

def upload_with_retry(s3, local, bucket, key, attempts=3):
    for i in range(attempts):
        try:
            s3.upload_file(local, bucket, key)
            return
        except ClientError as e:
            wait = 2 ** i
            print(f"upload failed (attempt {i+1}): {e}; sleeping {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"upload failed after {attempts} attempts")
```
Sample log from Night 19:
```
upload failed (attempt 1): An error occurred (500) when calling PutObject; sleeping 1s
wrote 142,318 rows to s3://risk-output/daily/risk_2026-05-13.csv
```
The script reported success. The risk team's morning dashboard showed nothing wrong. But when an analyst checked the warehouse table loaded from that CSV later in the week, she found 284,636 rows for May 13 — exactly twice the expected count.
What went wrong: the first `upload_file` call had actually succeeded on the S3 side, but the response was lost on the network; from the client's view, the call failed with a 500. The retry uploaded the same file again. The object itself was fine — the second upload cleanly overwrote the first, so the CSV's row count looked normal. The duplication happened downstream: the loader that ingests the CSV into the warehouse fires once per successful upload, so it appended the same day's rows twice.
Why "the retry uploaded twice" is a feature of the retry, not a bug: the client cannot tell "request did not arrive" from "response did not arrive." Both look like a timeout. The only safe contract is that the destination tolerates being asked to do the same thing twice — i.e. is idempotent. A retry without an idempotent destination is a bug factory, no matter how nicely you wrote the backoff loop.
The fix touches three of the five gaps:
- Idempotency — write the file to a staging key (`risk_2026-05-13.csv.uploading-{uuid}`), then `CopyObject` to the final key. S3 `CopyObject` to an existing key is a clean overwrite, and the file at the final key is always whole or absent, never partial.
- Observability — emit a structured log line with the upload's request-id and ETag so you can later prove the file you wrote is the file the consumer read.
- Contracts — the consumer's contract is "exactly one file per day at this key, never partial." The staging-then-rename pattern enforces that contract from the producer side.
By Build 1 chapter 6 this whole pattern has a name (atomic publication), a default implementation (write-temp-then-rename), and a test (kill the process between write and rename and verify the destination is unaffected). Riya's retry wrapper from Night 12 was three lines of code; the correct version is twenty, and most of the new lines are about the destination, not the retry.
The same pattern shows up in three more places before Build 4 ends. When you write to Postgres, the safe form is `INSERT INTO staging_risk_daily (...) ON CONFLICT (run_id, tx_id) DO UPDATE`, not a bare `INSERT`. When you write to a Parquet partition on S3, the safe form is "write to a tmp prefix, then rename the prefix" — the GCS/S3 multi-object rename is not atomic, but the per-key rename is, and chapter 6 walks through how to use partition-level metadata to publish atomically anyway. When you publish to a Slack channel, the safe form carries a dedup key in a tiny SQLite sidecar so a retry does not double-alert. The destinations differ; the pattern is the same. Idempotency is destination-shaped, not pipeline-shaped.
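On a local filesystem, the write-temp-then-rename form of the pattern fits in a dozen lines. A sketch, with `publish_atomically` as a hypothetical helper name — `os.replace` is atomic on POSIX, so a reader sees the old file or the new one, never a half-written one:

```python
import os
import tempfile

def publish_atomically(data: bytes, final_path: str):
    """Write to a temp file in the destination directory, flush to disk,
    then rename onto the final path in one atomic step. Killing the
    process at any point leaves the final path whole or absent."""
    dirname = os.path.dirname(final_path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".uploading")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # data is on disk before we publish
        os.replace(tmp, final_path)  # the atomic publish step
    except BaseException:
        os.unlink(tmp)               # failed runs leave no staging litter
        raise

dest = os.path.join(tempfile.gettempdir(), "risk_2026-05-13.csv")
publish_atomically(b"tx_id,amount_inr\n", dest)
print(dest)
```

The temp file must live in the same directory as the destination: `os.replace` across filesystems falls back to copy-and-delete, which is no longer atomic.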
Common confusions
- "My script is a pipeline because it's scheduled in cron." Cron is the floor of scheduling, not the ceiling. A cron line gives you "fire at 02:00." It does not give you retries, dependency ordering, parallelism, SLAs, or a place to read the logs from yesterday's run. Build 4 builds a real DAG executor and shows you what cron is missing.
- "ETL and pipeline are the same thing." ETL — extract, transform, load — is one shape a pipeline takes (the most common one for batch warehouse work). Pipelines also include streaming jobs (CDC, real-time aggregations), reverse-ETL (warehouse → operational systems), and ML feature pipelines (offline + online). Chapter 2 unpacks E, T, L individually and chapter 13 explains why the order is sometimes ELT instead.
- "If my script doesn't crash, it's working." Silent corruption is the most expensive failure mode in data engineering. Riya's script kept "succeeding" for nine days while writing CSVs full of `NaN` merchant names. Build 5 (observability and contracts) is half about making silent failure impossible.
- "Idempotency is just `INSERT ... ON CONFLICT`." That is one technique for one destination. A pipeline that writes to S3, Postgres, and a Slack channel needs three different idempotency strategies — partition replace for S3, `MERGE` for Postgres, and a dedup key for Slack so the same alert is not sent twice. Chapter 6 covers the family of techniques.
- "Data engineering is just plumbing — the modelling team does the real work." The modelling team's results are bounded by the freshness, completeness, and correctness of the data they get. A wrong number from a fast model is worse than no number; a right number from a slow model arrives after the decision. Build 5 onwards treats data quality as a first-class engineering output, not infrastructure overhead.
- "A pipeline is just a script with retries bolted on." Retries help, but a script with retries is still a script — it just retries before giving up. A pipeline differs in kind: the destination is engineered for replay, the runs are addressable artefacts (you can ask "what did the 02:00 run on April 24th do?" weeks later), and the schedule is a contract with consumers ("you can read this dataset by 03:00 every day"). Build 4 makes the addressable-runs idea concrete with run IDs and run metadata.
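The Slack-style dedup key from the idempotency confusion above can be sketched with a `sqlite3` sidecar — the table name and key format are illustrative. `INSERT OR IGNORE` into a `UNIQUE` column is the whole mechanism: if the key already exists, nothing is inserted and the send is skipped.

```python
import sqlite3

def send_alert_once(conn, dedup_key, send):
    """Send an alert at most once per dedup key, surviving retries."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO sent_alerts (dedup_key) VALUES (?)", (dedup_key,)
    )
    conn.commit()
    if cur.rowcount == 1:  # first time we have seen this key → actually send
        send()

sent = []
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sent_alerts (dedup_key TEXT UNIQUE)")

# a retry loop that would triple-alert without the dedup key
for _ in range(3):
    send_alert_once(conn, "risk-daily|2026-05-13|row-count-2x",
                    lambda: sent.append("alert"))
print(len(sent))  # → 1
```

The dedup key encodes the logical event (job, date, condition), not the attempt, which is exactly what makes the retry safe.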
Going deeper
The five-promise definition
A pipeline is a script that satisfies five promises to its consumers, where the script satisfies none of them by default:
- Freshness. The output reflects data no older than some agreed bound. The script has no notion of freshness; the pipeline carries an SLA.
- Completeness. Every row that should be in the output is in the output. The script has no notion of "should"; the pipeline has a contract.
- Correctness. Rows that are present are not corrupted by transformation bugs, schema mismatches, or partial writes. The script has no checksums or validation; the pipeline runs assertions.
- Idempotency. Re-running the same logical job for the same logical period produces the same output. The script overwrites or duplicates depending on the destination; the pipeline is engineered for safe replay.
- Auditability. A reader six months later can answer "where did this number come from, when was it produced, what version of the code, what was the upstream state?" The script has no audit trail; the pipeline has lineage and run metadata.
Every chapter in this curriculum buys back one of these promises that the script gives up.
Where Indian-scale teams break first
Talking to working data engineers in Bengaluru, Pune, Hyderabad, and Coimbatore over the last year, the most common first failure mode is freshness without anyone noticing. A team builds a cron-driven pipeline. It runs nightly. Six months in, a Postgres replica starts lagging by 4 hours during peak load. The pipeline's `WHERE ts::date = yesterday` query starts missing the last 4 hours of yesterday's data. The numbers go down by ~15%. Nobody notices for weeks because the dashboard still renders, the script still exits 0, and the row count looks plausible. By the time someone notices, the wrong numbers have informed a marketing budget decision. ₹40 lakh is gone before the bug has a name.
The fix is freshness checks (Build 5), but the underlying lesson is the one in this chapter: a script-shaped solution is fine until you have stakes, and the moment you have stakes, the boring infrastructure stops being optional.
The second-most-common first break is timezone confusion at month boundaries. Teams write `WHERE ts::date = current_date - 1` thinking in IST while the source database stores timestamps in UTC. On the night of the 31st of a month, the IST "yesterday" overlaps the UTC "day before yesterday" by 5.5 hours, and the monthly aggregate over- or under-counts the boundary day depending on which way the team chose. Build 1 chapter 7 is about always-UTC discipline and the small cost of writing `(ts AT TIME ZONE 'UTC')::date = ...`.
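The always-UTC discipline costs two lines. A sketch — the query string is Postgres syntax and illustrative, with the date computed in UTC on the client and passed as an explicit parameter rather than derived inside SQL:

```python
import datetime as dt

# IST "yesterday" and UTC "yesterday" disagree for 5.5 hours every day —
# including 02:00 IST, exactly when nightly jobs tend to run
utc_yesterday = dt.datetime.now(dt.timezone.utc).date() - dt.timedelta(days=1)

# the always-UTC form of the script's date filter
query = "SELECT ... FROM transactions WHERE (ts AT TIME ZONE 'UTC')::date = %s"
print(utc_yesterday, query)
```

Passing the date as a parameter also makes the job replayable: the same query answers "run for yesterday" and "backfill May 13" alike.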
The job–pipeline–platform progression
You can roughly track a data team's maturity by where it is on this ladder:
- Job stage. Each engineer has a few cron lines. The pipeline is the cron line. Failure means a Slack message ("daily report didn't run, looking now"). Common at startups under ten engineers. Riya's company is here.
- Pipeline stage. A workflow tool (Airflow, Dagster, Prefect) replaces cron. Jobs are organised into DAGs with dependencies and retries. There is a UI you can look at to see "is the 02:00 job done?" Common at series-A and -B companies.
- Platform stage. Pipelines become products with SLAs, contracts, lineage UIs, and a self-service interface so that non-data engineers (analysts, ML engineers) can author their own pipelines without writing Airflow operators. Common at series-C and beyond, and at the data platform teams of all big tech companies. The hard problems shift from "does the pipeline run" to "who pays for that query, and is it allowed to read PII?"
The progression is not optional — every team that survives gets pushed up the ladder by reality, whether or not anyone names it. The 134 chapters of this curriculum walk that ladder.
The cost of skipping the boring infrastructure
There is a tempting argument: "we are a startup, we move fast, we can afford to skip idempotency and observability for now." The argument is wrong, and the way it is wrong is instructive.
Suppose Riya's company has 100 nightly jobs by year two. Each job has a base failure rate of 1% per night — once every 100 nights, something breaks (network blip, replica lag, S3 hiccup, schema drift). With no retry/idempotency infrastructure, every failure is a manual page. That is one page per night, on average, across 100 jobs. The on-call burns out. With retries and idempotency, ~95% of failures self-heal silently, and the page rate falls to one per twenty nights. The boring infrastructure is what makes a 100-pipeline platform humanly survivable.
The math is brutal: a system of N jobs each with success probability p runs end-to-end with probability p^N. At p = 0.99 and N = 100, end-to-end success is about 36%. At p = 0.999 (which is what idempotent retries buy you), the same N gives 90%. The difference between "something breaks most nights" and "the platform runs clean most nights" is the boring infrastructure. There is no shortcut.
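The arithmetic is worth checking yourself:

```python
def platform_success(p: float, n: int = 100) -> float:
    """Probability that all n independent nightly jobs succeed on one night."""
    return p ** n

print(f"{platform_success(0.99):.1%}")   # ~36.6% — most nights, something breaks
print(f"{platform_success(0.999):.1%}")  # ~90.5% — what idempotent retries buy
```

A tenfold drop in per-job failure rate (1% → 0.1%) is a 2.5× jump in clean nights, and the gap widens as N grows.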
A working definition for the rest of the track
For the rest of this curriculum, when we say pipeline, we mean: a directed graph of computations over data, executed on a schedule (or in response to events), engineered to satisfy the five promises above when run unattended. The graph can be a single node (Riya's script, with proper wrappers) or a thousand nodes (the canonical Flipkart-grade analytics platform). The engineering — idempotency, observability, scheduling, contracts, lineage — is the same shape at every scale. Only the constants change.
A pipeline is also, importantly, a thing you can replay. Replay is the most underappreciated property in this list. If a row is found to be wrong on the 47th day, you should be able to re-run the pipeline for the period that produced it and have the wrong row become the right one, without manual surgery and without affecting the other days. Replayability is the union of idempotency (re-running is safe) and parameterisation (you can ask for a specific period). Build 1 chapter 10 makes replay a first-class operation; Build 11 (CDC) extends it to streams, where "replay" means rewinding a Kafka offset rather than re-running a SQL query.
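Parameterisation, the smaller half of replayability, fits in a few lines — `run` and the key format below are illustrative, not from Riya's script. The point is that the output location is a pure function of the logical date:

```python
import datetime as dt

def run(day: dt.date) -> str:
    """One logical run, addressed by its logical date. Because the output
    key is a pure function of the parameter, replaying a day rewrites
    exactly that day's artefact and no other."""
    # ...extract/transform for `day` would happen here...
    return f"daily/risk_{day.isoformat()}.csv"  # deterministic destination

print(run(dt.date(2026, 5, 13)))  # → daily/risk_2026-05-13.csv, nightly or on replay
```

Contrast this with a script that writes `risk_{today}.csv`: a replay run next Tuesday would land in the wrong file, because the destination depends on the wall clock instead of the parameter.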
A small but important observation: the transformation logic itself stays roughly the same shape Riya wrote on day three. Build 1 through Build 17 are mostly about the wrappers around the transformation, not the transformation itself. Forty lines of SQL is forty lines of SQL whether it runs in a Jupyter notebook or an Airflow DAG. Senior data engineers spend most of their time on the wrappers because that is where the platform value lives; new engineers tend to think the transformation is the hard part, and that is the script-shaped instinct showing through.
Where this leads next
The next four chapters take Riya's script apart and reassemble it with the boring infrastructure attached.
- The three primitives: extract, transform, load — chapter 2, where the script is broken into its three load-bearing pieces and each is examined separately.
- Your first pipeline: CSV → CSV in 50 lines — chapter 3, where you build a properly-wrapped version of Riya's script with retries, idempotent output, and a run log.
- Crashing mid-run: which state are you in? — chapter 4, the systematic look at partial-failure modes and how to recover.
- The three walls: no idempotency, no observability, no schedule — chapter 5, the canonical statement of the three pressures every Build-1 pipeline confronts.
By the end of Build 1 (chapter 11) Riya's script will be a proper pipeline: idempotent, observable, scheduled, and surrounded by a tiny but real run-tracking layer. By the end of Build 4 (chapter 44) the scheduling part is replaced by a real DAG executor with retries, dependencies, and SLAs. The progression is paced; nothing in the curriculum jumps over the boring infrastructure.
References
- Designing Data-Intensive Applications — Martin Kleppmann, O'Reilly 2017. The reference book for working data engineers; chapters 1 and 10 are the canonical statement of "what does it mean for a pipeline to be reliable."
- The Log: What every software engineer should know about real-time data's unifying abstraction — Jay Kreps, LinkedIn Engineering 2013. The essay that reframed pipelines as compositions of logs and views.
- Functional Data Engineering: A Modern Paradigm for Batch Data Processing — Maxime Beauchemin (creator of Airflow and Superset), 2018. The clearest short statement of why idempotency and immutability are the bedrock of batch pipelines.
- The Rise of the Data Engineer — Maxime Beauchemin, 2017. The essay that named the role.
- Reliable Data Pipelines — Databricks, 2020. A vendor whitepaper, but the failure-mode taxonomy in chapter 2 is unusually honest.
- Data Engineering at Razorpay — the Razorpay engineering blog on scaling their UPI event pipeline from 1M to 1B events/day. A useful Indian-scale case study for what "outgrowing a script" actually looks like.
- The Append-Only Log — the cross-domain gold-standard chapter on the storage primitive every modern pipeline destination is built on.
- The DAGs of Data Engineering — Fennel.ai, 2024. A short blog post that contrasts "a DAG of tasks" with "a DAG of data assets" — the underlying mental model shift that Build 4 makes precise.