The three primitives: extract, transform, load

Riya's script from chapter 1 reads from Postgres, joins against an S3 CSV, and writes a different CSV back to S3. It does not feel like three things. It feels like one thing — "the daily risk job." But every working data engineer eventually develops the reflex of seeing the three pieces separately, because the three pieces fail differently, scale differently, and belong to different teams.

You will read 134 chapters of this curriculum. About 130 of them are about exactly one of the three primitives at a time, with a different concern attached. So we should be precise about what each one is.

Extract is the boundary between a system you do not control and your pipeline. Transform is pure computation over data already in your hands. Load is the boundary between your pipeline and the system the consumer reads. Most production bugs sit at one of the two boundaries; almost none sit in the middle.

The three primitives, named carefully

A pipeline turns input data into output data. The transformation lives in the middle. On either side is a boundary with a system you do not own — the source database, the destination warehouse, the consumer's dashboard. The three primitives name each piece:

[Figure: the three primitives of a data pipeline. A horizontal flow of three labelled boxes, Extract, Transform, Load, connected by arrows. A source Postgres (not yours) feeds Extract on the left; Load writes into a destination warehouse (not yours) on the right; Transform in the middle is pure computation in your hands. Below each box, its failure modes: Extract fails on timeouts, auth, and schema drift; Transform on logic bugs and type mismatches; Load on duplicates, partial writes, and races.]
Extract and Load are boundaries with systems you do not own; Transform is pure computation in your hands. The failure modes are different at each.

Extract is the operation that gets data out of a source system and into your process memory (or your staging area). The source is usually owned by another team, sometimes by another company, occasionally by yourself two months ago. Extract is where you cope with credentials, rate limits, schema drift, replica lag, pagination, and the bad luck of the source database restarting at 02:01.

Transform is the operation that turns the extracted data into the shape the consumer needs. It is pure computation: a dataframe in, a dataframe out; a SQL query over staging tables, a result. No network, no clock, no credentials — only logic. This is where your idempotent dedup hash lives, your join, your amount_paise / 100, your business rules.

Load is the operation that writes the transformed data to the destination — a CSV in S3, a row in Postgres, a partition in Snowflake, a Kafka topic. The destination is, again, usually owned by another team. Load is where you cope with duplicates from retries, partial writes, transactional commit, and consumers who started reading before you finished writing.

The reason this naming matters: the three pieces fail differently, and the engineering response to each is different. Conflating them is the most common reason a "small refactor" in one piece breaks something elsewhere.

Why the boundary distinction matters: code in your hands can be unit-tested, replayed deterministically, and changed without coordination. Code that touches a system you do not own has to handle that system's failure modes — and those failure modes leak into the transformation step only if you let them. The discipline of keeping Extract and Load thin, with the messy stuff isolated to those two layers, is what makes Transform reliable.

Each primitive, sized properly

The intuition that gets you furthest in this curriculum is that each primitive has a different shape, both in code and in operations. Knowing the shape lets you predict what will go wrong before the page arrives at 03:14.

Extract: thin, defensive, retryable

A good Extract is small. It does one thing — pull a slice of data from a source — and it does it in a way that survives the source being slow, unavailable, or schema-drifted. It does not transform; it does not enrich; it does not validate beyond the bare minimum to fail loudly when the schema is wrong. Anything more belongs in Transform.

A typical Extract for a Razorpay-style merchant table refresh looks like this:

# extract_merchants.py — pull merchant reference table from source Postgres
import psycopg2, pandas as pd, time, hashlib

SOURCE_DSN = "postgresql://reader:***@payments-replica.ap-south-1:5432/payments"
EXPECTED_COLS = {"merchant_id", "merchant_name", "kyc_status",
                 "registered_state", "created_at"}

def extract_merchants(run_id: str) -> pd.DataFrame:
    for i in range(3):
        try:
            conn = psycopg2.connect(SOURCE_DSN, connect_timeout=10)
            df = pd.read_sql("SELECT * FROM merchants", conn)
            conn.close()
            break
        except psycopg2.OperationalError as e:
            wait = 2 ** i
            print(f"[extract] attempt {i+1} failed: {e}; sleep {wait}s")
            time.sleep(wait)
    else:
        raise RuntimeError("source unreachable after 3 attempts")
    missing = EXPECTED_COLS - set(df.columns)
    if missing:
        raise RuntimeError(f"schema drift: missing columns {missing}")
    df["_extract_run_id"] = run_id
    df["_extract_row_hash"] = df.apply(
        lambda r: hashlib.sha1(
            f"{r.merchant_id}|{r.merchant_name}|{r.kyc_status}".encode()
        ).hexdigest()[:12], axis=1)
    return df

if __name__ == "__main__":
    df = extract_merchants(run_id="2026-04-25T02:00:00")
    print(f"extracted {len(df):,} merchants; cols={list(df.columns)}")

Sample run:

extracted 842,167 merchants; cols=['merchant_id', 'merchant_name', 'kyc_status', 'registered_state', 'created_at', '_extract_run_id', '_extract_row_hash']

Walk through the load-bearing lines. connect_timeout=10 is the most important line in the file — without it, a slow source hangs the entire pipeline forever, because psycopg2 defaults to no timeout. for i in range(3) with exponential backoff is the standard retry shape for boundary calls; chapter 9 covers retry budgets in detail. EXPECTED_COLS - set(df.columns) is the schema drift check — if upstream renames merchant_id to mid, the extract fails loudly here rather than silently passing a bad column name to the transform. _extract_run_id stamps every row with the run that produced it, so downstream you can tell which extract generated which row even years later. _extract_row_hash is a stable content hash that lets the transform layer detect "did this merchant actually change since last run?" without comparing all columns.

Why the row hash belongs in extract, not transform: if you compute it in transform, every transform run pays the cost of hashing rows that did not change. If you compute it in extract once, the hash is part of the staged data and any number of downstream transforms can use it for free. The principle: do work once, as close to the source as possible, and let downstream stages reuse it.

The other thing this Extract does — and most early-career Extracts do not — is keep the surface area tiny. There is no business logic, no joining, no amount_paise / 100. The only "transformation" is two metadata columns. Everything else waits for the Transform layer.

Transform: deterministic, replayable, free of clocks

A good Transform is a function: same inputs in, same outputs out, no side effects. It does not call the network. It does not read the wall clock (the run timestamp is passed in as a parameter, not pulled from datetime.now()). It does not have credentials. The whole point is that you can run it on yesterday's data, today's data, and last quarter's data, and get the same answer for each.

Functional transforms — same in, same out — are the easiest part of a pipeline to test, and the hardest to break. Most of the buggy pipelines you will inherit have transforms that pulled the clock or the network into themselves; the fix is almost always to push those calls back to Extract or forward to Load.
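The discipline is easier to see in miniature. A minimal sketch, with hypothetical table and column names (this is not Riya's script), of a transform whose run date arrives as a parameter rather than from datetime.now():

```python
from datetime import date

# Flag merchants whose KYC verification is more than 90 days old as of
# the run date. Because run_date is an argument, a replay of last
# Tuesday's run reproduces last Tuesday's answer exactly.
def flag_stale_kyc(merchants: list[dict], run_date: date) -> list[dict]:
    return [
        {**m, "kyc_stale": (run_date - m["kyc_verified_on"]).days > 90}
        for m in merchants
    ]

merchants = [
    {"merchant_id": "m1", "kyc_verified_on": date(2026, 1, 2)},
    {"merchant_id": "m2", "kyc_verified_on": date(2026, 4, 1)},
]
out = flag_stale_kyc(merchants, run_date=date(2026, 4, 25))
# m1 is 113 days old (stale); m2 is 24 days old (fresh) — and the same
# call with the same arguments returns the same answer next year.
```

The same function with run_date replaced by date.today() would give a different answer every day it runs, which is exactly the backfill trap described above.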

Load: atomic, idempotent, observable

A good Load writes the transformed output to the destination in a way that is safe to retry, and that lets the consumer tell whether they are reading the latest version. The two patterns that solve almost everything here are staging-then-rename (write to a tmp key, rename to final) and transactional MERGE (insert into a staging table, then MERGE into the final table inside one transaction). Build 1 chapter 6 covers atomic publication in depth; for now, the rule is: if a retry of your Load can leave the destination in a worse state than no Load at all, the Load is wrong.
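The staging-then-rename idea works anywhere the destination offers an atomic rename or copy. A minimal local-filesystem sketch of the same pattern (the function name is made up for illustration):

```python
import os
import tempfile

# Write to a temp path in the same directory, then atomically rename
# over the final path. os.replace is atomic on POSIX, so a concurrent
# reader sees either the old file or the new one, never a partial
# write, and a retried run simply overwrites cleanly.
def publish_atomically(data: bytes, final_path: str) -> None:
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, final_path)  # the all-or-nothing publish step
    except BaseException:
        os.unlink(tmp)  # failed before publish: final path untouched
        raise

# usage: publish twice to the same final path; the rerun is a clean overwrite
outdir = tempfile.mkdtemp()
final = os.path.join(outdir, "data.csv")
publish_atomically(b"run-1 rows\n", final)
publish_atomically(b"run-2 rows\n", final)
```

If the process dies before os.replace, the final path still holds the previous good version, which is precisely the "retry never worse than no Load" rule.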

A worked refactor — Riya's script split into three

Take Riya's chapter-1 script and split it. The original was forty-three lines, all mixed together. Here is what each primitive looks like once they are separated:

# transform_risk.py — pure compute over already-extracted dataframes
import pandas as pd, hashlib

def transform_risk(tx: pd.DataFrame, merchants: pd.DataFrame,
                   run_id: str) -> pd.DataFrame:
    """Join transactions with merchants, convert paise to rupees, dedup."""
    joined = tx.merge(merchants, on="merchant_id", how="left",
                      validate="many_to_one")
    joined["amount_inr"] = joined["amount_paise"] / 100
    joined["_dedup_key"] = joined.apply(
        lambda r: hashlib.sha1(
            f"{r.tx_id}|{r.merchant_id}|{r.amount_paise}".encode()
        ).hexdigest()[:16], axis=1)
    joined = joined.drop_duplicates(subset=["_dedup_key"], keep="first")
    joined["_run_id"] = run_id
    risky = joined[joined["amount_inr"] > 200000].copy()
    return risky[[
        "tx_id", "merchant_id", "merchant_name", "kyc_status",
        "registered_state", "amount_inr", "ts", "_dedup_key", "_run_id"
    ]]

Sample run, given 1,42,318 input transactions and 8,42,167 merchants:

>>> risky = transform_risk(tx, merchants, run_id="2026-04-25T02:00:00")
>>> len(risky)
4127
>>> risky.amount_inr.sum()
1824730000.0

The transform takes two dataframes and a run id, and returns a dataframe. There is no psycopg2, no boto3, no time, no datetime. validate="many_to_one" is a guardrail — pandas will raise if the merchant table contains duplicate merchant_id, which is exactly the kind of silent corruption Riya's chapter-1 script tolerated. _dedup_key is the row's logical identity — if the same (tx_id, merchant_id, amount_paise) shows up twice (because Extract retried), Transform collapses it. risky = joined[joined["amount_inr"] > 200000] is the actual business rule: transactions over ₹2 lakh go to the risk team. The whole function is replayable — pass yesterday's dataframes back in next month and you get the same answer.

Why dedup belongs in transform and not extract: extract operates row-at-a-time and may not see all duplicates within a window. Transform operates on the full window of a run's data, so dedup at this layer can use the entire window's content. The general pattern: row-local checks (schema, types) belong in extract; window-level checks (dedup, joins, aggregations) belong in transform.

The Load piece is small but careful:

# load_risk.py — atomic publication to S3 with run-id partition
import boto3, pandas as pd, uuid

def load_risk(df: pd.DataFrame, run_id: str, bucket: str = "risk-output"):
    s3 = boto3.client("s3")
    final_key = f"daily/risk/{run_id[:10]}/data.csv"
    staging_key = f"_staging/{uuid.uuid4()}.csv"
    local = "/tmp/risk_load.csv"
    df.to_csv(local, index=False)
    s3.upload_file(local, bucket, staging_key)
    s3.copy_object(Bucket=bucket, CopySource=f"{bucket}/{staging_key}",
                   Key=final_key)
    s3.delete_object(Bucket=bucket, Key=staging_key)
    print(f"[load] wrote {len(df):,} rows to s3://{bucket}/{final_key}")

The staging-then-copy pattern means a retry never leaves a partial file at final_key. If the upload to staging fails, the final key is unchanged. If the copy fails, the final key is still unchanged. If the delete of the staging file fails, the final key is correct — just a small janitorial leak in _staging/ that a separate cleanup job sweeps. Build 1 chapter 6 makes this rigorous.

ETL vs ELT — a question of when, not whether

Up until about 2015, "ETL" was the dominant shape: Extract, Transform, then Load the already-transformed output into a slow analytical database. The transform happened on a separate compute layer (Informatica, Pentaho, Spark on a Hadoop cluster) because the destination database could not handle the compute.

Since about 2018, "ELT" — Extract, Load the raw data into a powerful warehouse, then Transform inside the warehouse — has taken over for most use cases. The reason is economic: warehouses like Snowflake, BigQuery, Redshift, ClickHouse, and Databricks SQL are now powerful enough that running transformations inside them is cheaper and faster than maintaining a separate compute cluster. dbt (covered in Build 13) is the tool that made ELT-in-the-warehouse the default shape.

[Figure: ETL vs ELT. Two horizontal flows stacked vertically. Top, ETL: Source, Extract, Transform on a separate compute cluster (Spark/Hadoop), Load, then a read-only Warehouse. Bottom, ELT: Source, Extract, Load (raw), then a Warehouse where the Transform now happens inside, in SQL (dbt, Snowflake, BigQuery).]
The shift from ETL to ELT moved Transform from a separate compute cluster into the warehouse itself. The three primitives stayed; only their location changed.

The thing to notice: the three primitives did not change. Only the location of the Transform step moved. ELT is still extract, load, transform; the names were simply reordered to reflect the new order of operations. A working data engineer in 2026 mostly writes ELT, but pure ETL and hybrids of the two are still everywhere in production:

Flipkart's catalogue refresh is canonical ELT — raw product rows from the merchant SQL replicas land in a Snowflake landing schema, and dbt models in Snowflake compute the searchable catalogue. Razorpay's UPI fraud-scoring is hybrid — raw events land in S3 (E + L), then PySpark scores them on EMR (T), then the scored output is loaded into ClickHouse (L again). Pure ETL, pure ELT, and hybrids all coexist; the three primitives keep their identity throughout.

Why ELT won when storage got cheap: in 2010, storing the raw data was expensive enough that you transformed before storing. By 2020, S3 cost about ₹2/GB-month for cold storage; storing the raw data forever was suddenly cheaper than the engineering cost of re-fetching it for every new use case. Once raw data was kept anyway, doing the transform on demand inside the warehouse was the obvious move.

Common confusions

  1. "ELT is a different set of primitives." It is not. It is the same three, with Transform relocated into the warehouse; the boundary-versus-computation distinction is unchanged.
  2. "Extract should clean the data." Cleaning is computation, and computation belongs in Transform. Extract validates only enough to fail loudly on schema drift.
  3. "Load is just saving the file." A Load that is not safe to retry will eventually publish duplicates or partial output. Atomicity and idempotency are part of the primitive, not optional extras.

Going deeper

What modern Extract looks like in production

The chapter-1 script uses pd.read_sql. That works for a one-off, but at production scale Extract typically uses one of three patterns:

  1. CDC (change data capture, Build 11). The extract is not a query — it is a subscription to the source database's write-ahead log. Postgres logical decoding, MySQL binlog, MongoDB oplog. Debezium is the common open-source implementation. The advantage is that you see changes, not snapshots, and you see them in commit order.

  2. Bulk export + incremental. Once a week, dump the whole source table; daily, query the rows whose updated_at > last_run_high_water_mark. The combination handles full-state-at-rest plus changes since. This is what most warehouse extracts do today.

  3. Source API call. When the source is a SaaS product (Salesforce, Stripe, Shopify), the extract is a paginated API call with rate limit handling. Tools like Fivetran and Airbyte are mostly elaborate connectors that turn an API into a snapshot-or-incremental extract.

The chapter-1 SELECT * is not wrong, but it is the simplest of the three. As the source grows past a few million rows, full-table reads become expensive, and pattern 2 or pattern 1 takes over.
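Pattern 2's high-water-mark query is small enough to sketch. Here sqlite stands in for the source database purely for illustration; the table and column names are hypothetical:

```python
import sqlite3

# Incremental extract: read only rows whose updated_at is past the
# high-water mark, then advance the mark to the newest timestamp seen.
# ISO-8601 timestamps sort correctly as strings, which is why the
# comparison works lexically here.
def extract_incremental(conn, high_water_mark: str):
    rows = conn.execute(
        "SELECT merchant_id, merchant_name, updated_at FROM merchants "
        "WHERE updated_at > ? ORDER BY updated_at",
        (high_water_mark,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else high_water_mark
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchants (merchant_id, merchant_name, updated_at)")
conn.executemany("INSERT INTO merchants VALUES (?, ?, ?)", [
    ("m1", "Chai Point", "2026-04-24T09:00:00"),
    ("m2", "Boat Audio", "2026-04-25T01:30:00"),
])
rows, mark = extract_incremental(conn, "2026-04-24T12:00:00")
# only m2 is newer than the mark; the next run starts from m2's timestamp
```

The mark itself must be persisted somewhere durable between runs (a state table, an orchestrator variable), which is where chapter 4's crash-recovery questions begin.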

Why pure-function Transform is non-negotiable for serious pipelines

A pure function (input, run_id) -> output has three properties that matter operationally: it is replayable (the same inputs always produce the same output, so a backfill of last Tuesday reproduces last Tuesday), it is testable (there is no clock or network to mock), and it is cacheable (the output can be keyed on a hash of the inputs).

A Transform that calls datetime.now() inside it loses all three properties at once. Replays produce different output (today's clock, not last Tuesday's). Tests need to mock the clock. The cache key has nothing to bind to. Most "this pipeline is impossible to backfill" stories trace back to a clock or network call hidden inside the transform.

The four destinations and their idempotency primitives

Load looks different at every destination. The four common ones in Indian-stack data engineering, with their canonical idempotency move:

  1. S3 (object store). Write each run to a deterministic, run-scoped key; a retried PUT of the same key overwrites rather than duplicates, and staging-then-copy keeps the final key atomic.
  2. Postgres (OLTP table). Upsert, i.e. INSERT ... ON CONFLICT (key) DO UPDATE, inside one transaction, keyed on the row's logical identity.
  3. Snowflake (warehouse). COPY INTO, which records which files it has already loaded and skips them on retry, or a staged MERGE keyed on the logical identity.
  4. Kafka (topic). Keyed messages with the idempotent producer enabled, so the broker deduplicates a retried produce of the same batch.

The pattern across all four: the destination must accept a "logical write" identifier (run_id, key, partition) such that re-issuing the same logical write does not change the final state.
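The logical-write idea fits in a few lines. A sketch with sqlite standing in for Postgres (the risk table and its columns are invented for illustration): tx_id is the logical identity, and ON CONFLICT turns a retried load of the same rows into an update instead of duplicates.

```python
import sqlite3

# Idempotent load: re-issuing the same logical write (same tx_ids)
# leaves the table in the same final state instead of growing it.
def load_risk_rows(conn, rows, run_id: str) -> None:
    conn.executemany(
        "INSERT INTO risk (tx_id, amount_inr, run_id) VALUES (?, ?, ?) "
        "ON CONFLICT(tx_id) DO UPDATE SET "
        "amount_inr = excluded.amount_inr, run_id = excluded.run_id",
        [(tx_id, amount, run_id) for tx_id, amount in rows],
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE risk (tx_id TEXT PRIMARY KEY, amount_inr REAL, run_id TEXT)"
)
rows = [("t1", 250000.0), ("t2", 310000.0)]
load_risk_rows(conn, rows, run_id="2026-04-25T02:00:00")
load_risk_rows(conn, rows, run_id="2026-04-25T02:00:00")  # retry: same state
# COUNT(*) is still 2, not 4
```

The PRIMARY KEY on tx_id is what makes the conflict clause possible; without a declared logical identity at the destination, there is nothing for a retry to converge on.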

Where the three primitives compose into bigger shapes

Once you can name the three primitives, you start to see them composing into larger structures. A medallion architecture (bronze/silver/gold) is just three E-T-L cycles stacked: source → bronze (E + L), bronze → silver (T + L), silver → gold (T + L). A lambda architecture (batch + speed layer) is two parallel E-T-L pipelines feeding a unified serving layer. A feature store (Build 15) is two E-T-L pipelines — one offline (for training) and one online (for serving) — that have to produce identical Transform output to avoid train-serve skew.

Understanding the three primitives at this level is what lets you read a 100-node Airflow DAG and see the structure rather than the noise. Every node is doing one of three things; the question is which and to what.

A diagnostic for "where does this code belong?"

A small rubric you can apply to any line of pipeline code, in order. The first answer that fires is the right home for that line.

  1. Does it call the network or read external state? → Extract.
  2. Does it write to a destination that other processes read? → Load.
  3. Does it depend only on its arguments? → Transform.
  4. Does it read a clock or generate randomness? → Belongs at the boundary of a primitive, not inside one. The clock is part of run_id, which is passed in. Randomness is seeded by run_id. Both are extracted to the orchestrator layer (Build 4), not buried inside a step.
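Rubric item 4 in miniature. A sketch (the helper name is made up) of deriving a deterministic seed from run_id, so any sampling inside a step stays a pure function of its inputs:

```python
import hashlib
import random

# Seed a private RNG from the run id: a replay of the same run draws
# the same "random" sample, so the step remains replayable.
def rng_for_run(run_id: str) -> random.Random:
    seed = int(hashlib.sha1(run_id.encode()).hexdigest()[:8], 16)
    return random.Random(seed)

sample_a = rng_for_run("2026-04-25T02:00:00").sample(range(1000), 5)
sample_b = rng_for_run("2026-04-25T02:00:00").sample(range(1000), 5)
# sample_a == sample_b: the audit sample is reproducible across replays
```

Using random.Random rather than the module-level functions also keeps the seed private to the step, so two steps in the same process cannot perturb each other's draws.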

When you inherit a pipeline and a single function does all four, the refactor is to peel each concern out into its rightful primitive. Most of the bugs that the chapter-1 night-by-night list ascribed to "the script" are bugs of this peeling not having happened yet — a clock inside a transform, a network call inside a load.

Where this leads next

The next three chapters take this primitive-level view and apply it. Chapter 3 builds your first proper pipeline (CSV → CSV in 50 lines) where Extract, Transform, and Load are visibly separated. Chapter 4 explores what happens when the process crashes mid-run — which primitive's state is corrupt, and how to find out. Chapter 5 introduces the three "walls" that every Build-1 pipeline runs into: missing idempotency, missing observability, missing schedule.

Build 6 (chapters 67–77) returns to the Load step in depth, with Parquet, Iceberg, and atomic publication. Build 11 (chapters 122–132) returns to the Extract step in depth with CDC. The three primitives are the spine that the next 132 chapters hang on.

References

  1. Designing Data-Intensive Applications, Chapter 10: Batch Processing — Martin Kleppmann, O'Reilly 2017. The clearest treatment of why "input → pure function → output" is the foundational batch shape.
  2. The Modern Data Stack: Past, Present, Future — Tristan Handy (dbt Labs), 2021. The essay that named the ELT shift and walked through why warehouse compute changed the economics.
  3. Functional Data Engineering — Maxime Beauchemin, 2018. The argument for pure-function transforms; the source of much of the Transform discipline in Build 1.
  4. Apache Airflow: Architecture and concepts — Airflow docs. Useful as a reference for how a real scheduler maps the three primitives to operators (PythonOperator, S3CopyObjectOperator, etc.).
  5. Razorpay engineering: Scaling payments data pipelines from 1M to 1B events/day — concrete Indian-scale case study showing the three primitives at scale.
  6. Snowflake's COPY INTO command — the canonical bulk-Load primitive in modern ELT; reading the docs gives a feel for what an idempotent warehouse Load looks like.
  7. What is Reverse ETL? — Hightouch, 2022. Useful for understanding how the three primitives reuse themselves in the warehouse-to-operational direction.
  8. The Append-Only Log — the cross-domain gold-standard chapter on the storage primitive that every Load operation eventually relies on.