Wall: running this in production is its own discipline
It is a Sunday night at Razorpay. The data platform has been quiet for six weeks. Pipelines are green, dashboards refresh on schedule, the on-call rotation has been forgettably boring. At 02:14 a.m. the merchant-payouts pipeline drifts five minutes past SLA. At 02:31 a partner-bank API starts returning 503 on 8% of calls and the resulting retry storm back-pressures the upstream Kafka topic. At 02:44 the freshness alert on the GST-filings table fires; the consumer is the finance reconciliation job that has to clear before the merchant reports go out at 09:00. By 03:10 the on-call engineer has to decide: pause the upstream and let the partner bank recover, or push through with stale-but-present data and let reconciliation explain a 0.4% mismatch. There is no unit test for that decision. There is no library that helps. There is only the runbook the team has, the practice they've had with backfills, and the operational instinct nobody on LinkedIn lists as a skill.
The first sixteen builds taught the mechanisms — how to write a pipeline, schedule it, ship contracts, store columnar, stream stateful, govern access, attribute cost. They did not teach what happens when those mechanisms have to run for years against drift, regulators, ₹-denominated SLAs, and a 3 a.m. page. Build 17 is that curriculum, and it does not look like the previous sixteen.
Why this is a wall, not a chapter
The previous sixteen builds were about getting the bytes right. A SCRIPT IS A PIPELINE — get the rows from source to sink. IDEMPOTENCY AND RETRIES — get them right after a crash. INCREMENTAL PROCESSING — get them right efficiently. A SCHEDULER FROM SCRATCH — orchestrate it. Contracts, lineage, columnar storage, message logs, stateful streams, exactly-once, unified batch-stream, CDC, lakehouse, semantic layer, real-time analytics, feature stores, multi-tenancy, governance, cost — every one of those is a problem with a verifiable mechanism. You can write a unit test for "did the dedup key collapse the duplicate row?". You can benchmark a Z-ordered Iceberg query against a non-Z-ordered one. You can prove your offset commit is exactly-once with a partition tail and a transactional sink.
Production is the first build in this curriculum where the answers do not look like unit tests. Did your backfill of 18 months of GST data complete correctly? Define "correctly". Define what the consumer dashboards looked like during the backfill window. Define how you communicated that the WAU number on the founders' deck was based on partial data for 4 hours on Tuesday. Define how you decided whether to take the risk of the backfill running in parallel with the daily incremental, or pause the daily and run the backfill alone, or fan out across compute clusters and pay 6× the spend for 1/4 the wallclock. The answer is not in the code. It is in the calendar, the stakeholder map, the budget, and the honest negotiation between a data team and the rest of the company.
This chapter is the wall. It does not teach you a tool. It teaches you why the next eight chapters look different — why they are about runbooks, calendars, stakeholders, ₹ amounts, and judgement instead of MERGE INTO and EXACTLY_ONCE.
Why mechanism vs discipline matters as a framing: the SDE-2 who passed the system-design interview at Razorpay can land a Kafka topic with the right partition count, the right replication factor, and an idempotent producer — and still take down merchant payouts on a Friday because nobody on the team had the runbook for "partner-bank API is degraded, do we shed load or wait?". The mechanism was perfect; the discipline was missing. Production failures in mature data platforms almost never look like "a bug in the code"; they look like "a decision nobody had practised making".
What "production" actually adds
Five things change when a pipeline crosses from "it works on my laptop" to "it serves the company's reconciliation":
Time horizon stretches from minutes to years. A pipeline that ran cleanly for 90 days still has to handle the schema change next quarter, the GST regime change in 2027, the merger that doubles the upstream tables in 2028. Decisions made today against 1 lakh rows have to survive at 50 crore rows. The mechanism doesn't break; the assumptions inside it do.
Failure modes change shape. During development, "the pipeline failed" means "the code threw an exception". In production, the more common failure modes are partial: 0.3% of rows quietly went to the dead-letter queue, the upstream API started returning 200s with empty bodies, an upstream system silently changed the timezone of one column, the watermark advanced past data that hadn't arrived yet because a Kafka broker swapped leadership at 02:00. The on-call's job is not to read a stack trace; it is to notice the dashboard is wrong and reconstruct what changed.
Stakeholders multiply. A working pipeline serves the analyst who originally requested it. A production pipeline serves the analyst, the analyst's three downstream dashboards, the data-science team who joined a feature against it, the finance team who reconciles against it monthly, the auditor who samples it quarterly, and the regulator who reads about it in the post-incident report. Every change is a negotiation across all of them.
₹ enters the conversation. A development pipeline costs nothing; a production pipeline that scans 200 TB of S3 daily costs ₹4-8 lakh per month at retail Snowflake/BigQuery rates. The senior data engineer's job acquires a budget axis: a query that runs in 12 seconds for ₹600 versus the same query in 45 seconds for ₹80, and the right answer depends on whether the consumer is the founders' deck (12 seconds) or the audit warehouse (45 seconds is fine).
Reversibility shrinks. In dev you can drop the table, re-run, and try again. In prod, dropping the table breaks 14 downstream dashboards, kicks off three pages, and may violate a DPDP 2023 record-retention obligation. Every decision becomes harder to undo. The job becomes thinking about reversibility before you take the action.
These five axes are not independent — they compound. A schema change in year 3 of a pipeline (long time horizon) introduces a quiet drift (new failure shape) that affects six dashboards (stakeholder count) that each represent a different ₹ flow (cost) and cannot be cleanly rolled back without re-running 9 months of incremental loads (low reversibility). The whole shape of the problem is qualitatively different from "I wrote a bug yesterday and need to fix it". Why this compounding matters for hiring: the engineer who has only ever written greenfield pipelines has never seen the compounded shape. Their instinct is to optimise the mechanism (write better code) when the right move is to optimise the operational design (write a smaller change, ship behind a flag, run for a week, then promote). This is learned in production, not in code review.
A small piece of evidence: what an SLA actually costs
Here is a concrete way to feel the shift. Below is a tiny simulator that takes a pipeline's measured per-run latency distribution and asks: "what SLA can I actually promise without burning out the on-call rotation?" The mechanism (latency measurement) is trivial; the decision (which percentile to promise) is the discipline.
# sla_budget.py — what can you actually promise?
import random

# 1) Past 90 days of pipeline run latencies (in minutes), as observed
random.seed(2026)
latencies = []
for _ in range(83):            # normal days
    latencies.append(random.gauss(28, 4))
for _ in range(5):             # mildly bad days (upstream slow)
    latencies.append(random.gauss(52, 8))
for _ in range(2):             # one full incident week
    latencies.append(random.gauss(180, 30))
latencies = [max(8, x) for x in latencies]
latencies.sort()

# 2) Compute candidate SLA promises
def percentile(xs, p):
    k = int(round((p / 100) * (len(xs) - 1)))
    return xs[k]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
worst = latencies[-1]

# 3) For each candidate SLA, count how many days it would have breached
print(f"{'SLA promise':<18}{'breach days':>14}{'on-call pages':>16}")
for sla_minutes in [30, 45, 60, 90, 120, 240]:
    breaches = sum(1 for x in latencies if x > sla_minutes)
    pages = breaches           # one page per breach (assume always paged)
    print(f" <= {sla_minutes:>3} min {breaches:>10}/90{pages:>14}")
print(f"\nObserved p50={p50:.1f} p95={p95:.1f} p99={p99:.1f} worst={worst:.1f} (all minutes)")

# 4) The decision: pick the tightest SLA that breaches <= 4 times/quarter
#    (industry rule of thumb: more than 4 quarterly breaches = SLA is wrong)
budget = 4
chosen = None
for sla in [30, 45, 60, 90, 120, 240]:
    if sum(1 for x in latencies if x > sla) <= budget:
        chosen = sla
        break
print(f"\nRecommended SLA: <= {chosen} min")
print(f"Reason: tightest commitment with <= {budget} breaches in 90 days.")

# Output:
SLA promise          breach days   on-call pages
 <=  30 min         42/90            42
 <=  45 min         13/90            13
 <=  60 min          7/90             7
 <=  90 min          2/90             2
 <= 120 min          2/90             2
 <= 240 min          0/90             0

Observed p50=27.8 p95=58.6 p99=181.2 worst=212.0 (all minutes)

Recommended SLA: <= 90 min
Reason: tightest commitment with <= 4 breaches in 90 days.
Walk through what is happening. Step 1 fabricates a realistic latency distribution — 83 normal days clustered around 28 minutes, 5 mildly bad days near 52 minutes, and 2 incident-week days that land near 180 minutes. This is roughly what a year-old, well-run merchant-reporting pipeline at a place like Razorpay actually looks like; the mean is fine, the tail is brutal. Why fabrication is fair here: real production latency is fat-tailed in a way Gaussian noise alone won't capture. Mixing three regimes (normal, slow, incident) reproduces the bimodal-with-tail shape that wrecks naïve p95-based SLA promises. Step 2 computes the percentiles and gives a hard look at the distribution: p50 is 28 minutes, p95 is 59, p99 is 181 — the gap between p95 and p99 is the whole game. Step 3 asks the question the team actually has to answer: for each candidate SLA promise, how many days in the last 90 would we have breached it? Why count breaches, not percentiles: a percentile is a property of the distribution; a breach is a property of the day it happened on. Stakeholders care about "how often did the dashboard show stale numbers?", which is a count of days, not a point on a curve. Step 4 picks the tightest SLA that fits the breach budget — the rule "no more than 4 quarterly breaches" is the operational cost the team can carry without burning out. The output recommends 90 minutes, not the more aggressive 60. The 60-minute SLA looks great on a slide ("we promise 1-hour freshness") but would have produced 7 pages a quarter — a number that destroys an on-call rotation over 18 months.
The simulator's mechanism is about 40 lines of Python. The discipline is the rule "more than 4 quarterly breaches means the SLA is wrong" — a rule no library knows; it is learned by carrying the pager for two years and watching what number burns the team out.
There is a second-order effect the simulator hints at but doesn't model: every breach also costs trust. The merchant-payouts team that promised 60-minute freshness and missed it 7 times in a quarter loses negotiating power for the next ₹-budget conversation. The team that promised 90-minute freshness and missed it twice gets believed when it asks for compute headroom. This is not an engineering metric and won't show up in any dashboard, but it is the load-bearing variable for whether the team gets to invest in the platform or just defends it. Build 17 is partly about understanding which numbers are stakeholders' numbers (they care about breaches per quarter) versus engineers' numbers (we care about p99 latency); the bridge between them is the only conversation that actually moves the platform forward.
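That translation — engineers' percentiles on one side, stakeholders' breach counts on the other — is mechanical enough to sketch. A minimal illustration; the data, the 90-minute SLA, and the function names here are made up for the example, not taken from the simulator:

```python
# Translate an engineer's latency view (percentiles) into the
# stakeholder's view (breach days per quarter). Illustrative only.
def percentile(xs, p):
    xs = sorted(xs)
    return xs[int(round((p / 100) * (len(xs) - 1)))]

def quarterly_view(daily_latencies_min, sla_min):
    breaches = sum(1 for x in daily_latencies_min if x > sla_min)
    return {
        "engineer": {                      # what we put on our dashboards
            "p50": percentile(daily_latencies_min, 50),
            "p99": percentile(daily_latencies_min, 99),
        },
        "stakeholder": {                   # what finance actually asks about
            "sla_min": sla_min,
            "breach_days_per_quarter": breaches,
        },
    }

# 88 quiet days around 25 minutes, 2 incident days at 3+ hours
days = [25.0] * 88 + [190.0, 205.0]
view = quarterly_view(days, sla_min=90)
print(view["engineer"]["p99"])                          # 190.0 — the engineers' number
print(view["stakeholder"]["breach_days_per_quarter"])   # 2 — the stakeholders' number
```

The same 90 days produce both numbers; the bridge is just counting which days crossed the promise instead of summarising the whole curve.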
What changes about the team
Discipline work reshapes the team. The first sixteen builds reward T-shaped engineers who can ship feature work — design a pipeline, code it, deploy it, move on. Build 17 rewards a different shape: engineers who stay with the system long enough to know what it does on Tuesdays. In practice, a team's incident rate correlates more strongly with average tenure than with headcount: a team of 8 engineers with 18-month average tenure can reliably handle something like 3× the pipelines of a team of 12 with 6-month tenure — because the operational mental model takes a year to grow and walks out the door when people leave.
This is why senior data-platform leaders at Indian fintechs (PhonePe, Razorpay, Cred) talk so much about runbook hygiene and post-incident review culture. The runbooks are not a substitute for engineering skill; they are how engineering skill gets compounded across a rotation that turns over. The post-incidents are not blame; they are how the next on-call gets a 6-month head start on the same failure mode.
The wiki cannot teach you to stay. It can teach you what to stay for — what the operational shape of the next eight chapters looks like, so when you carry the pager you are not surprised by which problems are easy and which are politically wedged.
A second team-shape effect: the senior engineers who survive Build 17 develop a particular allergy. They become unwilling to ship a mechanism without its operational scaffolding — runbook, alert, dashboard, escalation path, ownership tag. To a junior eye this looks like over-engineering ("we just want to land the table; why do we need a runbook?"). To the senior, the runbook is the engineering — a table without a runbook is a table that will page someone at 02:00 and waste their hour. The seniors don't argue this; they just refuse to merge the PR. Over 18 months the standard becomes invisible — every new pipeline ships with the operational scaffolding built-in, and the team's incident rate drops by half. The mechanism stayed the same; what changed was what counted as "done".
Common confusions
- "Production discipline is the same as SRE." SRE in the Google sense is a specific operational philosophy (error budgets, toil quantification, blameless post-mortems). Data-engineering production discipline overlaps but adds two large axes SRE under-emphasises: data correctness over time (does the warehouse still mean what it meant six months ago?) and stakeholder politics (who decides whether to back-fill?). Borrow SRE's mental tools, but don't expect the SRE handbook to cover the data half.
- "Build 17 is just 'ops' — it's beneath real engineering." This is a junior-engineer view. The reason senior data-platform engineers are paid 1.6–2× their feature-shipping peers in Indian fintechs is that ops is the engineering — almost any team can ship a pipeline; very few can keep it correct for 5 years across schema drift, stakeholder churn, and ₹-budget pressure. The compensation gap reflects the difficulty.
- "You can automate your way out of operational discipline." Automation is a force multiplier for discipline, not a substitute. A team without runbooks that buys PagerDuty just gets paged more efficiently. A team with mature runbooks that adds automation reduces toil. Order matters: practice first, then automate the practice you've already proved is right.
- "The mechanism builds and the discipline build are independent." They are not. A pipeline written without idempotency (Build 2) is impossible to operate reliably; a warehouse without column-level lineage (Build 5) cannot be governed at scale; a stream without checkpointing (Build 8) cannot be backfilled. The first sixteen builds are prerequisites for the seventeenth — Build 17 is what those mechanisms unlock when you compose them in production.
- "Production runbooks are a Confluence problem." They start there but don't end there. A useful runbook is executable: a Markdown file in the same repo as the pipeline code, version-controlled, referenced by the on-call alert, updated in the same PR that fixes the underlying bug. A runbook that lives in Confluence and gets touched once a year is theatre.
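Borrowing SRE's error-budget arithmetic for the data half is worth doing explicitly. A minimal sketch, assuming a freshness SLO expressed as "percentage of days the table lands by its deadline"; the 95% and 99% figures are illustrative, not recommendations:

```python
# SRE-style error budget applied to data freshness.
# slo_pct = percentage of days the table must be fresh by its deadline.
def freshness_error_budget(slo_pct, period_days=90):
    """Stale days per period the team may spend before the SLO is blown."""
    return period_days * (1 - slo_pct / 100)

def budget_remaining(slo_pct, period_days, breach_days_so_far):
    """How much budget is left mid-quarter; negative means SLO already blown."""
    return freshness_error_budget(slo_pct, period_days) - breach_days_so_far

print(freshness_error_budget(99.0))     # ~0.9 stale days/quarter — brutally tight
print(freshness_error_budget(95.0))     # ~4.5 — close to the "4 breaches a quarter" rule
print(budget_remaining(95.0, 90, 3))    # ~1.5 stale days of headroom left
```

The number that matters is not the SLO percentage on the slide but the budget it implies: 99% sounds modest and leaves the rotation less than one bad day a quarter.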
Going deeper
What "operating a system" looked like in 1972 versus 2026
Operations is older than data engineering. The IBM mainframe shops of the 1970s ran payroll for the entire ITC group on overnight batches; the operator on-shift watched a console for ABEND codes and knew how to mount a backup tape inside 8 minutes. The discipline was unbroken from then to now, but the texture changed twice. First, in the 1990s, when distributed systems made "what state is the system in?" a non-trivial question — Lamport's 1978 paper on logical clocks is in the engineering canon precisely because the answer stopped being obvious. Second, in the 2010s, when cloud storage made "where is my data?" stop having a fixed answer — your data is in S3, which means your data is in 6 racks across 3 availability zones, and your operator has to reason about that without ever touching a tape. The discipline survived both shifts; the runbooks got rewritten each time. The 2026 reader inherits both layers of practice.
Why "data engineering" is not "software engineering on data"
A common confusion in hiring is to interview data-platform candidates as if they were back-end SDEs who happen to deal with data. The interview tests system design, latency, RPC, scaling — and misses entirely that the candidate's actual job involves negotiating with finance about a ₹2 crore infra spend, explaining to the CFO why the WAU number changed retroactively, and running a 14-month backfill around stakeholder calendars. The skill set overlaps with SDE work by maybe 60%, but the differentiating 40% is exactly Build 17. Companies that interview only on the overlap end up with teams that can ship pipelines but not operate them.
The 5-year operational arc at an Indian fintech
The pattern across Razorpay, PhonePe, and Cred (from public engineering posts and conference talks): year 1 is "build it"; year 2 is "the first incident teaches you the mechanism wasn't enough"; year 3 is "the team writes the first runbook"; year 4 is "the runbook teaches the on-call rotation"; year 5 is "the team can absorb a 50% turnover without a quality regression". Companies that quit at year 2 — usually because feature pressure overwhelms ops investment — never make the year-5 transition and stay in firefighting mode forever. The next eight chapters are designed to skip the team forward in this arc.
The DPDP 2023 effect on Indian operational practice
The Digital Personal Data Protection Act 2023, in force from 2024, reshaped what "production-grade" means at Indian fintechs. A pipeline that processes payment data must now satisfy: deletion-on-request inside 30 days (across raw + materialised + cached), per-user purpose limitation (no using payments data for marketing without consent), a queryable access audit trail (who read this PII column when), and a data-breach notification to the Data Protection Board inside 72 hours. None of these are mechanism problems — they are operational discipline problems. A team that built the perfect Kafka cluster and the perfect Iceberg lake without DPDP-compliant runbooks is a team that gets fined ₹250 crore the first time something goes wrong. The next eight chapters take this seriously.
The operational pyramid: alerts, runbooks, post-mortems, retros
Run the production layer as a pyramid. Bottom: alerts that fire on real symptoms (freshness, lag, error rate), not on metrics that drift naturally. Middle: runbooks that map alert → mitigation step, written by the engineer who closed the last incident and reviewed by the next on-call. Upper-middle: post-incident reviews that record what happened, what was hard, and what the runbook missed. Top: quarterly retros that look at incident clusters and ask "what design choice is generating this incident shape?". Most teams have alerts and post-mortems, miss runbooks, and never do retros. The teams that survive 5 years run the whole pyramid.
Why the pager rotation is the curriculum
You learn discipline by carrying it. The week-long on-call rotation at Razorpay or Zerodha or Cred is not a chore the senior engineers offload to the juniors; it is the actual classroom of Build 17. A junior engineer in their first rotation gets a freshness alert at 03:14, opens the runbook the previous on-call wrote, follows steps 1–4, finds step 4 didn't work, escalates, watches a senior engineer reason through the actual root cause, then writes step 5 into the runbook before going back to bed. Six months later they are on the senior side of that conversation. Why this works as pedagogy: production failure modes are too rare and too varied to teach in any controlled environment — by the time you've seen 30 incidents, your hand-rolled mental model is more reliable than any course material. The rotation is the only mechanism that exposes engineers to enough failure variance fast enough. A team that protects juniors from on-call also blocks them from the only training environment that produces a Build 17-ready engineer; a team that throws juniors on-call without runbooks burns them out. The middle path — junior shadows senior for two rotations, then carries solo with the runbook as scaffolding — is the practice that produces the next senior cohort.
Where this leads next
The next eight chapters are the discipline curriculum. Read them in order; they reference each other.
- /wiki/slas-on-data-what-you-can-actually-promise — what numbers you should put on a contract, and the breach budget.
- /wiki/backfills-without-breaking-downstream — the playbook for re-running history without paging 14 dashboards.
- /wiki/reprocessing-a-year-of-data-the-real-runbook — the war story version: 18 months of GST data on a 6-week deadline.
- /wiki/on-call-for-data-alerts-that-matter — what to alert on, what to suppress, how to design pages a tired human can read.
- /wiki/cost-on-the-cloud-the-s3-egress-compute-trinity — where ₹ actually goes and what to do about it.
- /wiki/disaster-recovery-your-warehouse-just-got-deleted — the recovery drill that tests the runbook.
Past those: migrations, the 30-year arc, and the parts they don't teach. By the end of Build 17 you will not be surprised by the operational shape of the job — you'll be ready to carry the pager and pick up the runbook.
If you stop at Build 16, you have a working data platform and the start of an operational scar. If you stay through Build 17, you have a working data platform, a runbook for it, an on-call rotation that doesn't burn out, and a budget conversation you can defend. The difference is what separates a team that ships features from a team that owns infrastructure. Pick which one you want to be on, because every chapter past this wall is for the second kind.
References
- Saltzer, Reed, Clark — End-to-End Arguments in System Design (1984) — the discipline origin: correctness is a system-level property, not a layer property; foundational for any operational thinking.
- Google SRE Book — Embracing Risk and Service Level Objectives — the canonical write-up of error budgets and SLO discipline, half of which transfers directly to data engineering.
- Lamport — Time, Clocks, and the Ordering of Events (1978) — why "what state is the system in?" is a hard question even before data engineering.
- DPDP Act 2023 (Government of India) — the regulatory floor that turned operational discipline from optional into mandatory for Indian data platforms.
- Razorpay engineering — Building a reliable payments data platform (2024) — public write-up of the Razorpay arc from "we have a pipeline" to "we have a platform".
- Etsy — Blameless PostMortems and a Just Culture (2012) — the practice that lets a rotation share operational knowledge instead of hoarding it.
- PagerDuty — On-Call Handbook — the most concrete public guide to running a sustainable rotation; reads as a discipline document, not a tool manual.
- Charity Majors — Operational Excellence in April (2018) — the canonical write-up of operational maturity scoring for engineering teams.
- Vicki Boykis — What is data engineering really? (2023) — public reflection on the dev/prod gap that this wall chapter formalises.
- /wiki/slas-on-data-what-you-can-actually-promise — the first chapter of the discipline curriculum.
- /wiki/usage-tracking-and-lineage-for-access-decisions — the closing mechanism chapter of Build 16, the prerequisite for governance-aware operations.