The wall: re-processing everything every night

For the first eighteen months at a small fintech, Aditi's nightly job was a thing of beauty. At 01:00 a single Python script truncated the analytics tables, re-read every transaction the company had ever processed, recomputed every aggregate, and wrote everything back. By 02:30 the dashboards were green and the morning standup had clean numbers. The pipeline was idempotent — every run produced the same final state — and it was simple enough that a new joiner could read the whole thing in an afternoon. Then the company crossed 50 lakh transactions, then 5 crore, then 50 crore, and one Tuesday in March the job didn't finish by 09:00. By Thursday it didn't finish by 13:00. By the following Monday Aditi was being asked, in the polite-but-tight voice the CFO uses, why yesterday's revenue number wasn't on the slide. The pipeline hadn't broken. It had just stopped fitting inside a night.

This is the wall this chapter is about. Build 2 has spent five chapters teaching you to make every stage idempotent so that the cheapest, simplest disaster-recovery move — re-run the whole thing — is always available. That works beautifully at small scale and is the right starting point for every new pipeline. It also has a hard limit, and the pipelines that hit the limit at 3 a.m. are the ones whose owners hadn't seen the wall coming.

A full-refresh pipeline re-reads and re-writes everything on every run. It is the simplest, most idempotent design and the right starting point — until input growth pushes the runtime past the SLA. The wall is not a bug; it is the cost model catching up. The fix is to stop re-processing what you already processed, which is the entire subject of Build 3 (incremental processing). This chapter shows you exactly how the wall arrives, what the warning signs are, and why the answer is not "buy a bigger box".

The shape of a full-refresh job

A full-refresh job is the simplest pipeline that can correctly handle any kind of upstream chaos: late-arriving rows, schema corrections, retroactive fixes, deletions. Every run is a complete rebuild from the source of truth. There is no dependency on yesterday's run, no checkpoint to corrupt, no high-water mark to drift. If the previous run was wrong, this run will fix it. If the source has been retroactively corrected, this run will reflect the correction.

[Figure: Full-refresh nightly pipeline. Left: the source OLTP transactions table (all rows ever, growing monotonically). Middle: the transform stage, fed by SELECT *, reading N rows, aggregating and joining at cost O(N). Right: the warehouse destination, truncated and reloaded in full every night. Along the bottom, runtime grows with input: N=1 lakh, 4 min; N=50 lakh, 28 min; N=5 crore, 4h 40m; N=50 crore, 38h (over SLA). Annotation: linear in input size — every doubling of N doubles the runtime.]
The full-refresh shape. Input grows; the destination is rebuilt in full each night. Runtime scales linearly with input size, which is fine until input size scales linearly with calendar time. Then the wall arrives on a Tuesday.

Almost every analytics pipeline starts here. There are excellent reasons. A full-refresh job has exactly one source of truth (the source itself), one shape of failure (it didn't finish), and one recovery action (run it again).

Schema changes are absorbed by the next run because the next run reads the new schema. Late-arriving rows are absorbed because they're just rows in the source on the night they're seen. Retroactive corrections from finance — "those 1.2 lakh transactions in February were mis-categorised, please re-classify" — propagate the next night without anyone writing a special script. The simplicity is not a bug; it is the design.

The cost is also simple: the runtime is linear in the cumulative input size, not in the day's new rows. A pipeline that took 4 minutes to rebuild 1 lakh rows takes 40 minutes to rebuild 10 lakh rows on the same hardware, and 6.6 hours to rebuild 1 crore rows.

The size of the daily input — a few thousand new transactions a day, say — has nothing to do with the runtime; the runtime is set by how much history the company has accumulated. Which means a young company has a tiny full-refresh job and a five-year-old company has a job that takes a working day to finish. The shape of the work doesn't change as the company grows; only the bill does.

Why "linear in cumulative size, not daily delta": the SQL the job runs is shaped like SELECT ... FROM transactions GROUP BY merchant_id, day — there is no WHERE created_at > ? clause filtering to recent rows. The transform reads every row that has ever existed, every night, because the aggregation must include every row. Adding a date filter would skip yesterday's late-arriving rows, which is exactly the chaos the full-refresh shape exists to absorb. The design pays for that simplicity with linear-in-history runtime, and the bill grows with the company.

When the wall arrives

The wall is not a single point on the runtime curve; it is the moment the curve crosses the SLA line. The SLA is set by humans, not engines: someone in finance needs the previous day's number on the 09:00 standup, the marketing team's cohort dashboard has to be fresh by 10:00, the regulator's daily filing is due by 14:00.

The pipeline runtime grows monotonically with company growth. The SLA is fixed. Their intersection is the wall.

A second-order effect makes the wall arrive faster than the runtime curve alone suggests. As the company grows, the pipeline accumulates dependent stages: the raw extract is the input to a deduplication job, which is the input to a dimension-upsert job, which is the input to a fact-aggregation job, which is the input to a metric-rollup job.

Each stage is full-refresh because each stage was the simplest thing that worked at the time. The total nightly runtime is the sum of all stage runtimes — and each stage's input is the previous stage's output, which is also growing. So a doubling of source-row count can produce a 3× or 4× growth in total nightly runtime once cascading effects propagate.

[Figure: Runtime curve crossing the SLA. Pipeline runtime (0h to 16h) plotted against months since launch (m0 to m60), with a horizontal SLA line at 8h (must finish by 09:00). The curve passes from a comfortable zone through a "feels slow" zone and crosses the SLA line around month 30: THE WALL, the first night runtime exceeds the SLA.]
The wall is the intersection of two lines: a runtime curve that rises with the cumulative dataset, and an SLA that is fixed by the humans who consume the output. Long before the wall, the team complains the pipeline "feels slow"; that is the warning sign worth listening to.

The single most useful operational metric for spotting the wall a quarter early is the ratio of last night's runtime to the SLA budget. A pipeline running at 25% of budget is in steady state. A pipeline at 60% will hit the wall within a year at typical Indian-startup growth rates of 8–15% month-on-month input growth. A pipeline at 80% has six months. A pipeline at 95% is one bad upstream day from breaching, and the on-call is going to find out the hard way. Plot this ratio on a dashboard and the wall stops being a surprise.

Why monthly ratios are more useful than nightly runtime: a single night's runtime is noisy — a slow shared cluster, a long-running query in the same instance, a network blip on the source. The 7-day or 30-day moving average of runtime divided by SLA flattens the noise and makes the trend legible. The trend is what tells you when to start the migration to incremental, not the noise.
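
The metric can be sketched in a few lines. The runtime series below is simulated (a 2-hour base growing ~12% MoM, with alternating ±10% jitter standing in for cluster noise), purely to show the shape of the trend; every name and number in it is illustrative, not from a real pipeline.

```python
# moving_ratio.py — 30-night moving average of runtime divided by SLA budget.
# The runtime series is simulated; every name here is illustrative.

SLA_S = 8 * 3600  # the 8-hour budget: the job must finish by 09:00

def sla_ratio_trend(runtimes_s, window=30):
    """Moving average of nightly runtime / SLA over the trailing window."""
    out = []
    for i in range(window - 1, len(runtimes_s)):
        chunk = runtimes_s[i - window + 1 : i + 1]
        out.append(sum(chunk) / len(chunk) / SLA_S)
    return out

# A year of nightly runtimes: 2h base, ~12% MoM growth, +/-10% nightly jitter.
nights = [2 * 3600 * (1.12 ** (d / 30)) * (1.1 if d % 2 else 0.9)
          for d in range(365)]

trend = sla_ratio_trend(nights)
# Alarm on the *trend* crossing 60% of budget, not on a single noisy night.
first_warn = next(i for i, r in enumerate(trend) if r > 0.60)
print(f"trend crosses 60% of SLA around night {first_warn + 30}")
```

The single night at index `d` jitters by ±10%, but the 30-night window averages the jitter away, so the alarm fires on the structural trend rather than on one slow upstream.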

A full-refresh you can reproduce

The example below is a full-refresh job that materialises a daily merchant-revenue table from a transactions source. It is intentionally simple — no incremental cleverness, no high-water mark, no upsert. Every run truncates the destination and rebuilds it from the source. The example also instruments the runtime so you can watch the wall approach as you grow the input.

# full_refresh.py — the simplest correct nightly job.
import os, time, datetime as dt
import psycopg2

SQL_TRUNCATE = "TRUNCATE TABLE merchant_daily_revenue"
SQL_REBUILD  = """
INSERT INTO merchant_daily_revenue
    (merchant_id, day, txn_count, gross_paise, refunded_paise, net_paise)
SELECT
    merchant_id,
    DATE_TRUNC('day', created_at AT TIME ZONE 'Asia/Kolkata') AS day,
    COUNT(*)                                        AS txn_count,
    SUM(amount_paise)                               AS gross_paise,
    SUM(CASE WHEN status='refunded' THEN amount_paise ELSE 0 END) AS refunded_paise,
    SUM(CASE WHEN status='captured' THEN amount_paise ELSE 0 END) AS net_paise
FROM transactions
WHERE status IN ('captured', 'refunded')
GROUP BY merchant_id, DATE_TRUNC('day', created_at AT TIME ZONE 'Asia/Kolkata')
"""

def main() -> None:
    started = time.monotonic()
    with psycopg2.connect(os.environ["DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM transactions")
            (input_rows,) = cur.fetchone()
            cur.execute(SQL_TRUNCATE)
            cur.execute(SQL_REBUILD)
            output_rows = cur.rowcount
        conn.commit()
    elapsed = time.monotonic() - started
    rate = input_rows / elapsed if elapsed > 0 else 0
    print(f"{dt.datetime.now(dt.timezone.utc):%Y-%m-%dT%H:%M:%S}Z "
          f"input={input_rows:,} output={output_rows:,} "
          f"elapsed={elapsed:.1f}s rate={rate:,.0f} rows/s")

if __name__ == "__main__":
    main()

Four runs at increasing scales, on a single 4-vCPU Postgres instance with a (merchant_id, created_at) btree index:

2026-04-25T01:00:14Z input=1,02,431       output=8,432    elapsed=3.8s   rate=27,000 rows/s
2026-04-25T01:00:31Z input=51,68,022      output=21,891   elapsed=82.4s  rate=62,700 rows/s
2026-04-25T01:08:12Z input=4,87,29,103    output=42,118   elapsed=487.0s rate=1,00,061 rows/s
2026-04-25T01:38:48Z input=21,18,40,219   output=58,224   elapsed=2284s  rate=92,747 rows/s

Three lines deserve attention.

TRUNCATE TABLE merchant_daily_revenue. The destination is wiped before every rebuild. This is the cleanest possible idempotency: there is nothing to dedup, nothing to merge, no chance of stale rows from a previous bad run lingering in the output. The next stage downstream reads a complete, freshly-built table. The cost is that every row in the output is recomputed every night, even the ones that haven't changed in three years.

GROUP BY merchant_id, DATE_TRUNC('day', ...). The transform is a single GROUP BY that fans every transaction in history out into per-merchant-per-day buckets. There is no WHERE created_at > '2026-04-24' clause to limit to recent rows. Adding one would speed the job up massively — but it would also miss rows whose created_at is yesterday but which arrived in the source today (late-binding webhooks, store-and-forward retries from offline POS terminals, manually-corrected entries from finance). The full-refresh shape pays for safety with throughput.

output_rows = cur.rowcount. The instrumentation captures both input and output cardinalities, which is what lets you spot the wall coming. The row-count ratio is roughly stable (one output row per merchant per day, regardless of input volume), but the runtime grows linearly with input. When you plot elapsed against input over a year, the slope is the cost of being stateless and the y-intercept is the fixed cost of starting Postgres connections and parsing the query. The slope is the wall.

Why the rate stabilises around 90,000–1,00,000 rows/s on this hardware and not higher: the GROUP BY is bound by sequential read throughput from the table's heap pages plus the cost of hashing the group-by keys. On a tuned warehouse-class instance with column storage you would see 5–10× higher row rates; on a small Postgres instance you would see 3–4× lower. The exact number doesn't matter. What matters is that the rate is roughly constant across run-sizes, which means runtime is linear in input. That is the equation that hits the wall.

The five symptoms (in the order they appear)

A pipeline doesn't hit the wall on a Tuesday. It walks toward the wall for months, and the warning signs arrive in a predictable order. If you've seen any of these, the wall is closer than the runtime metric alone is telling you.

The first symptom is the dashboards arriving inconsistently — sometimes by 09:00, sometimes by 09:45, sometimes by 11:00. The mean runtime is still well under the SLA, but the variance is widening.

The cause is shared infrastructure: the warehouse cluster runs analyst queries during the night, the source database has its own backup window, the network has noisy neighbours. A pipeline running comfortably under SLA is one that absorbs this variance silently. A pipeline running close to SLA is one that breaches whenever variance spikes. Variance is the canary; mean runtime is the death certificate.

The second symptom is the appearance of "ghost" optimisations: someone adds an index, someone partitions a table, someone rewrites a join, and the runtime drops by 30%. This buys six months. The team feels good.

The wall hasn't moved; the pipeline has just slid back along the curve. Each ghost-optimisation buys less than the previous one, because the easy wins are taken first. The Zomato analytics team in 2023 ran this cycle four times in two years before deciding the underlying shape of the pipeline had to change.

The third symptom is the splitting of the nightly job into two windows: an "early" run that does the urgent stages, and a "late" run that does the rest. This is the warning shot. You're admitting the SLA can't be hit anymore, and you're triaging which numbers can be wrong until lunch.

It works for one quarter. It accumulates two new failure modes: the boundary between early and late drifts, and the splits create new dependency edges that aren't documented anywhere except in the head of the engineer who designed them.

The fourth symptom is the "weekend full-refresh": the team admits the nightly job can't do a full rebuild every night, and arranges a weekend window where the full rebuild runs and weeknight runs are partial. This is when the pipeline stops being idempotent without anyone deciding to make that call.

Weeknight runs are now appending to a destination that was rebuilt on Sunday, which means a bad weeknight run leaves the destination in a state no Sunday rebuild can deterministically produce. The weekend full-refresh is also the largest single operational risk in the system — if Sunday's run fails, the entire week is built on stale foundations.

The fifth symptom is the wall itself: the night the job doesn't finish before 09:00, the standup happens with stale numbers, the CFO asks why, and the answer is "we ran out of night". This is the moment the pipeline shape has to change.

Throwing more cores at it will buy a few months at most, and Indian-startup growth rates of 10% MoM compound to a 3× annual size increase. You cannot scale a linear-in-history pipeline against linear-in-time growth without eventually changing the design. The discipline is to start the migration at symptom three, not symptom five.

[Figure: Five symptoms in order of arrival, on a left-to-right timeline: (1) variance, runtime jitter grows; (2) ghost fixes, indexes, rewrites, tuning; (3) two windows, urgent vs non-urgent; (4) weekend full-refresh, no longer idempotent; (5) THE WALL, SLA missed. Annotation: start the migration at symptom 3, not symptom 5.]
The wall is the fifth symptom; by then you have months of compounding before you can change the pipeline shape. The earlier symptoms are quieter but more useful — variance widening is the cheapest signal to act on.

A worked scenario: the night Aditi's pipeline didn't finish

It helps to walk through what actually happened on the Tuesday morning that opens this chapter, because the failure mode is more textured than "ran out of time". The 01:00 cron fired on schedule. The first stage — the extract from the OLTP replica — finished by 02:15, slower than usual but inside the historical envelope. The second stage, the merchant-revenue rebuild from the example above, started at 02:16 and was supposed to finish by 03:00. By 04:00 it was at 60% and still going. By 06:00 it was at 88%. The on-call engineer woke up to a Slack ping at 06:30, looked at the runtime metric, and made the only reasonable call: let it finish.

It finished at 09:47, by which point the morning standup had already happened and the numbers shown there were 30 hours stale instead of 6. The dashboards refreshed silently at 09:48 and the marketing analyst who had quoted the wrong number in her standup found out from her manager an hour later. The CFO was emailed by lunchtime. By the end of the week, "the pipeline" was on the agenda for the engineering all-hands.

What is striking, looking at the runtime curve in retrospect, is that the trend was visible for three months before the breach. The 30-day moving average had crossed 50% of SLA in January, 70% in February, and 90% in early March. The team had seen the numbers and concluded — accurately, but incompletely — that the pipeline was "still under SLA". The piece they missed was that a pipeline at 90% of SLA with a noisy upstream breaches the first time the upstream has a slow night, and upstreams have slow nights all the time. The breach was not a surprise event; it was the convergence of a known trend and a known noise floor that nobody had drawn on the same chart.

Why the moving-average ratio matters more than the raw runtime: a single night's runtime is dominated by per-night noise — query plan jitter, shared-instance contention, network blips. The 30-day moving average filters the noise out and reveals the structural trend. When the structural trend is at 90% of SLA, the noise distribution puts you over the SLA on roughly one night in five. That is the breach rate, and it is computable from the numbers you already have.

The honest takeaway from Aditi's Tuesday is that the wall is forecastable — you can compute, from your runtime trend and your noise distribution, the calendar date on which the first breach is more likely than not. The discipline that separates teams who see the wall from teams who hit the wall is plotting that forecast on the same dashboard that shows the runtime, and triggering the migration conversation when the forecast date crosses inside the next two quarters. Two quarters is roughly the time it takes to plan, build, and stabilise an incremental pipeline. Less than two quarters and you are racing. More than two quarters and the team finds something else to do.
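
The forecast itself is one formula. Assuming runtime stays linear in input and input compounds monthly, the months remaining before the trend crosses the SLA is log(1/ratio)/log(1+g). A sketch (the function name is ours, not from any tool):

```python
import math

def months_to_wall(current_ratio, monthly_growth):
    """Months until the runtime trend crosses the SLA, assuming runtime stays
    linear in input and input compounds at monthly_growth per month."""
    if current_ratio >= 1.0:
        return 0.0  # already at or past the wall
    return math.log(1.0 / current_ratio) / math.log(1.0 + monthly_growth)

# At 12% MoM growth: ~6 months of runway at 50% of budget, under a month at 90%.
print(f"at 50% of SLA: {months_to_wall(0.50, 0.12):.1f} months to the wall")
print(f"at 90% of SLA: {months_to_wall(0.90, 0.12):.1f} months to the wall")
```

Plot the output date next to the runtime chart and the two-quarter trigger becomes a line on a dashboard rather than a judgment call.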

What "buy a bigger box" buys you (and what it doesn't)

The first reaction to the wall is almost always vertical scaling: bigger Postgres instance, bigger warehouse, bigger Spark cluster. It sometimes works for a quarter or two. It fundamentally does not solve the problem, and it often makes the underlying problem harder to fix later by removing the urgency.

The arithmetic is unforgiving. A pipeline that takes 8 hours on the current instance and is growing at 12% per month doubles in size every six months.

Doubling the hardware halves the runtime — to 4 hours — which is where the pipeline was six months ago. You have bought back exactly the time growth has already consumed. After the next six months of growth, runtime is back to 8 hours, and you double the hardware again. The hardware bill now doubles every six months, in lockstep with the dataset: the cost curve is exponential, and all it buys is standing still.
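
The arithmetic generalises: one hardware doubling buys log(2)/log(1+g) months at g MoM input growth, which is worth computing for your own growth rate before signing the purchase order.

```python
import math

# One hardware doubling halves the runtime; input growth then re-doubles it.
# Months bought by one doubling = log(2) / log(1 + g) at g MoM input growth.
def months_bought_by_doubling(g):
    return math.log(2.0) / math.log(1.0 + g)

for g in (0.08, 0.12, 0.15):
    print(f"{g:.0%} MoM growth: one doubling buys "
          f"{months_bought_by_doubling(g):.1f} months")
```

At the chapter's 12% MoM, each doubling buys roughly six months; at 15%, under five.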

There is also a shape of cost that is invisible in the first quarter. Bigger instances mean a bigger blast radius when something goes wrong: a long-running rogue query wastes 5% of the cluster's time either way, but on a giant instance that 5% is a far larger absolute bill, and the work it delays belongs to far more teams.

The instance becomes a single point of failure for half the company's analytics. The DR plan grows from "run the pipeline on the spare instance" to "we don't have a spare instance because the spare instance also costs ₹40 lakh a year". The vertical-scaling path leads to a place where the system is too big to fail and too expensive to redundantly back up — both consequences of the original decision to scale the wrong way.

The honest answer is that vertical scaling is a holding action while the underlying pipeline shape changes. Treat it as buying calendar time to do the migration to incremental, not as a substitute for the migration.

The Cred analytics team in 2024 made exactly this trade: they tripled the warehouse cluster cost for one quarter explicitly to free up the engineering bandwidth to convert nine full-refresh pipelines to incremental, and downsized the cluster the following quarter when the new pipelines stabilised. Bigger box bought the time; the time bought the migration; the migration bought the long-term cost curve. None of those steps work without the others.

Why incremental processing is the only real fix: the runtime of a full-refresh job is O(N) where N is cumulative size. The runtime of an incremental job is O(Δ) where Δ is daily delta — typically Δ is a tiny fraction of N (a 5-year-old payments company does a few million transactions a day against a few billion of history). For most analytics pipelines, the daily delta grows at roughly the same rate as the cumulative size, but it is one to three orders of magnitude smaller in absolute terms. Trading O(N) for O(Δ) is the entire reason Build 3 exists.
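
A back-of-envelope comparison at roughly the 1 lakh rows/s rate the instrumented runs showed; the history and delta sizes below are illustrative, not from a real system.

```python
# Back-of-envelope O(N) vs O(delta) at ~1 lakh rows/s.
# History and delta sizes are illustrative assumptions.
RATE = 100_000              # rows/s, roughly the rate from the logged runs
history = 2_000_000_000     # cumulative rows after ~5 years
delta = 3_000_000           # new rows per day

full_refresh_s = history / RATE  # every night, and growing with history
incremental_s = delta / RATE     # every night, bounded by the daily delta
print(f"full refresh: {full_refresh_s / 3600:.1f} h/night; "
      f"incremental: {incremental_s:.0f} s/night")
```

Three orders of magnitude between the two nightly runtimes, at the same processing rate, on the same hardware: that gap is the whole argument for Build 3.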


Going deeper

The wall is also a data-quality wall

A second-order effect of the wall is that the failure mode of a missed SLA is rarely "no data" — it is "stale data presented as fresh". The dashboards show yesterday's number labelled "today" because the rebuild didn't finish. If the consumers don't know the run is late, they make decisions on the wrong number.

The Indian fintech this chapter opened with had a particularly painful version of this in 2024: the marketing team's reactivation campaign sent ₹5 lakh of credits based on customer segments computed on three-day-old data, and the customers who had already churned got the credits anyway because the segmentation hadn't refreshed.

The defence is freshness-as-a-metric: every dashboard renders the timestamp of the underlying data, not just "loaded at 09:00". Better, every dashboard refuses to render at all if the data is older than its declared SLA.

This is a one-line change in the dashboard layer that prevents the most expensive wall-related failure mode. It is also exactly the kind of safety the team won't add until after they've been burned by it once.
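
A sketch of what that guard looks like in the dashboard layer; the function name and the 10-hour freshness SLA are illustrative, not from any particular dashboard tool.

```python
import datetime as dt

# Freshness-as-a-metric: render the data's own timestamp, and refuse to
# render at all past the declared freshness SLA. Names/values are illustrative.
FRESHNESS_SLA = dt.timedelta(hours=10)

def render_guard(data_as_of, now):
    age = now - data_as_of
    if age > FRESHNESS_SLA:
        return f"STALE: data is {age} old (SLA {FRESHNESS_SLA}) — not rendering"
    return f"data as of {data_as_of:%Y-%m-%d %H:%M} ({age} old)"

now = dt.datetime(2026, 4, 25, 9, 0)
print(render_guard(dt.datetime(2026, 4, 25, 1, 30), now))  # fresh: renders
print(render_guard(dt.datetime(2026, 4, 22, 1, 30), now))  # stale: refuses
```

The stale branch is the one that would have stopped the ₹5 lakh campaign: a refused render forces the conversation before the decision, not after.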

The "two-pizza" growth law

The empirical observation across Indian data teams is that a pipeline's input grows roughly in proportion to the company's headcount, which grows roughly in proportion to revenue, which grows at a typical 8–15% MoM in the early years.

A pipeline born when the company had 10 engineers is at the wall when the company has 200 engineers — typically 24 to 36 months. This is "the two-pizza growth law" because the original team designed the pipeline at a size where two pizzas fed everyone who'd ever seen the code, and the wall arrives when the team is too big to fit in one room.

The law has a useful corollary: the timing of the wall is predictable from the team's growth, not just from the pipeline's runtime. When engineering headcount doubles, the pipelines designed for the smaller team are 12–18 months from the wall.

Treat hiring as a leading indicator of pipeline scale and start the conversation early. The Razorpay platform team in 2022 published an internal post-mortem of the wall they hit on the merchant-onboarding analytics pipeline; the runtime curve crossed SLA exactly 26 months after the original commit, almost to the day.

Incremental's own walls

Incremental processing doesn't make the wall disappear; it makes a different wall, further away. An incremental pipeline that processes only yesterday's data is O(Δ) per night, but the cumulative metric tables grow at O(N), and at sufficient scale even reading the destination state to MERGE into is expensive.

Build 6 (columnar storage) and Build 12 (lakehouse) exist to push that second wall further out by attacking the destination side of the equation: column pruning, partition skipping, manifest-based file selection, Z-ordering. Each of those buys 1–2 orders of magnitude of further headroom.

The honest framing is that data engineering is a sequence of walls. Full-refresh hits a wall at a certain scale. Incremental hits a wall at a much larger scale. Lakehouse-with-partitioning hits a wall at a much-much larger scale. Streaming with watermarks hits a wall at near-real-time SLAs.

Each new layer of the stack pushes the wall out by an order of magnitude and adds operational complexity in proportion. The senior engineer's craft is knowing which wall they're standing closest to and which migration path moves it furthest with the least risk.

Why the warehouse vendor's "auto-incremental" feature is rarely enough

Modern warehouses (Snowflake, BigQuery, Databricks) advertise materialised views that "automatically refresh incrementally". They are real and they are useful for a constrained class of pipelines: pipelines whose SQL is a simple aggregation over an append-only base table with no late-arriving rows and no upstream deletes.

For Build-2 readers who are still on a single transactional source with cleanly-ordered events, that constraint is sometimes satisfied and the auto-incremental view is a free lunch worth taking.

For most production pipelines the constraint is violated. Late-arriving rows happen (offline POS terminals syncing the next day, mobile-app events buffered behind a flaky 4G connection, manual finance corrections). Upstream deletes happen (GDPR, RTO, customer-data corrections). Joins across multiple source tables happen.

As soon as any of these are in scope, the warehouse's auto-incremental feature fails to materialise correctly and silently degrades to either a full refresh or, worse, a stale view that doesn't reflect the corrections. The migration to "real" incremental — the kind Build 3 will teach — is unavoidable for the pipelines that matter, and the warehouse feature is at best a head start, not a finish line.

The capacity-planning conversation finance never asks for

There is a conversation a finance team almost never asks for and every data engineer should rehearse: "what is the cost trajectory of the current pipeline over the next 24 months at our growth rate?" The full-refresh answer is straightforward — multiply current cost by (1 + monthly growth rate)^24.

At 12% MoM growth, that is a 15× cost over two years. The incremental answer at the same growth rate is roughly 1.5–2× over two years (the destination grows but the runtime is bounded by daily delta).
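
The projection is one line of arithmetic, worth keeping next to the runtime dashboard:

```python
# 24-month cost multiple of a full-refresh pipeline at g MoM input growth.
growth = 0.12
multiple = (1 + growth) ** 24
print(f"at {growth:.0%} MoM: {multiple:.0f}x the current cost in 24 months")
```

Run it with your own growth rate before the budget meeting; the output is the number that wins the meeting.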

Putting that 15× number in front of a finance team unlocks the engineering time to do the migration. Without it, the conversation is "the pipeline is slow, can we have more engineers?" — which never wins against the next product launch.

The cost projection is the lever that turns "boring infra work" into "we're saving the company ₹4 crore a year by doing this now". Frame the wall as a finance problem and the budget appears.

Where this leads next

Build 3 is the response to this chapter. The next twelve chapters walk through the shape of incremental processing: high-water marks (chapter 12), late-arriving rows and the bitemporal model (chapters 13–14), backfill semantics (chapters 15–16), schema drift through migrations (chapters 17–19), the discipline of versioned source-of-truth state (chapters 20–22), and the operational pattern that emerges when all of these compose (chapter 23, the build's capstone).

Each chapter is a tool for moving the wall further out. The wall does not disappear; it is replaced by a more distant wall whose contour is set by the destination's growth rather than the source's, and whose first symptoms look quite different from the ones in this chapter.

Build 6 returns to the destination side of the wall — column storage, partitioning, and file layouts that make incremental destinations cheap. Build 12 returns to it again at lakehouse scale, where the destination is petabyte-scale object storage and the migration cost is one engineering-quarter rather than one engineering-year. The wall this chapter named will be redrawn at every level of the stack, and the discipline of seeing it coming early is the same at every level.

References

  1. The Architecture of a Modern Data Warehouse — Maxime Beauchemin — the original "rise of the data engineer" post that named the full-refresh-to-incremental transition as the central craft of the field.
  2. dbt: incremental models documentation — the canonical reference for how a modern transformation tool handles the migration from full-refresh to incremental, with the operational considerations made explicit.
  3. BigQuery: incremental materialised views — the warehouse vendor's view of automatic incremental refresh, including the constraints under which it actually works.
  4. Designing Data-Intensive Applications, Chapter 11 — Stream Processing — Kleppmann's framing of the batch-vs-stream choice, which is downstream of the full-refresh-vs-incremental choice this chapter makes.
  5. Snowflake: cost optimisation case studies — published case studies on the cost trajectory of full-refresh pipelines that have been migrated to incremental, with real numbers.
  6. Razorpay engineering: scaling our analytics pipeline — Razorpay's published account of their own wall and migration, including the cost projection they used to win engineering time for the work.
  7. Partial failures and the at-least-once contract — the Build-2 chapter immediately preceding this one, on the idempotency contract that full-refresh exploits to be simple.
  8. The Lambda Architecture (Marz & Warren) — the original framing of the speed-layer + batch-layer split that emerges when a single full-refresh pipeline can no longer hit the latency required of it.

A practical exercise: take the full_refresh.py example above and run it three times with different input sizes — 1 lakh, 10 lakh, 1 crore. Plot elapsed time against input size on a log-log scale; the slope of the line is the runtime exponent. For a well-tuned full-refresh job you will see a slope very close to 1.0, confirming the linear relationship. Now imagine the company growth curve overlaid: a 12% MoM growth crosses any fixed SLA in finite time, and the time is a deterministic function of where you start on the curve. The exercise that makes the wall real is computing, for your own pipeline, the calendar date on which it will breach. The answer is rarely more than 24 months away, and it is almost always closer than the team's gut feel.
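
A sketch of the slope fit, using the four logged runs from this chapter. Note that the fixed per-run overhead visible at small N pulls the fitted exponent somewhat below 1.0; the large-N runs are where the linear regime shows, which is why the exercise says "very close to", not "exactly".

```python
import math

# Least-squares slope of log(elapsed) vs log(input) over the four logged runs.
runs = [(102_431, 3.8), (5_168_022, 82.4),
        (48_729_103, 487.0), (211_840_219, 2284.0)]
xs = [math.log(n) for n, _ in runs]
ys = [math.log(t) for _, t in runs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"runtime exponent ~ {slope:.2f}")
```

Substitute your own (input, elapsed) pairs from a year of nightly logs; the closer the exponent sits to 1.0, the more exactly your runtime forecast reduces to the growth-rate arithmetic above.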