Wall: running this in production is its own discipline

It is a Sunday night at Razorpay. The data platform has been quiet for six weeks. Pipelines are green, dashboards refresh on schedule, the on-call rotation has been forgettably boring. At 02:14 a.m. the merchant-payouts pipeline drifts five minutes past SLA. At 02:31 a partner-bank API starts returning 503s on 8% of calls, and the resulting retry storm back-pressures the upstream Kafka topic. At 02:44 the freshness alert on the GST-filings table fires; the consumer is the finance reconciliation job that has to clear before the merchant reports go out at 09:00. By 03:10 the on-call engineer has to decide: pause the upstream and let the partner bank recover, or push through with stale-but-present data and let reconciliation explain a 0.4% mismatch. There is no unit test for that decision. There is no library that helps. There is only the runbook the team has, the practice they've had with backfills, and the operational instinct nobody on LinkedIn lists as a skill.

The first sixteen builds taught the mechanisms — how to write a pipeline, schedule it, ship contracts, store columnar, stream stateful, govern access, attribute cost. They did not teach what happens when those mechanisms have to run for years against drift, regulators, ₹-denominated SLAs, and a 3 a.m. page. Build 17 is that curriculum, and it does not look like the previous sixteen.

Why this is a wall, not a chapter

The previous sixteen builds were about getting the bytes right. A SCRIPT IS A PIPELINE — get the rows from source to sink. IDEMPOTENCY AND RETRIES — get them right after a crash. INCREMENTAL PROCESSING — get them right efficiently. A SCHEDULER FROM SCRATCH — orchestrate it. Contracts, lineage, columnar storage, message logs, stateful streams, exactly-once, unified batch-stream, CDC, lakehouse, semantic layer, real-time analytics, feature stores, multi-tenancy, governance, cost — every one of those is a problem with a verifiable mechanism. You can write a unit test for "did the dedup key collapse the duplicate row?". You can benchmark a Z-ordered Iceberg query against a non-Z-ordered one. You can prove your offset commit is exactly-once with a partition tail and a transactional sink.

Production is the first build in this curriculum where the answers do not look like unit tests. Did your backfill of 18 months of GST data complete correctly? Define "correctly". Define what the consumer dashboards looked like during the backfill window. Define how you communicated that the WAU number on the founders' deck was based on partial data for 4 hours on Tuesday. Define how you decided whether to take the risk of the backfill running in parallel with the daily incremental, or pause the daily and run the backfill alone, or fan out across compute clusters and pay 6× the spend for 1/4 the wallclock. The answer is not in the code. It is in the calendar, the stakeholder map, the budget, and the honest negotiation between a data team and the rest of the company.

This chapter is the wall. It does not teach you a tool. It teaches you why the next eight chapters look different — why they are about runbooks, calendars, stakeholders, ₹ amounts, and judgement instead of MERGE INTO and EXACTLY_ONCE.

[Figure: The shift from mechanism to discipline. A two-column comparison separated by a vertical rule labelled "the wall": Builds 1–16 on the left as mechanisms a unit test can decide ("build the system"), Build 17 on the right as judgements a runbook and a stakeholder decide ("keep it alive"). Panel title: "Two kinds of work, one team."]
Mechanism work and operational discipline are both engineering, but they reward different practice. Build 17 is the discipline curriculum the rest of the wiki cannot avoid.

Why mechanism vs discipline matters as a framing: the SDE-2 who passed the system-design interview at Razorpay can land a Kafka topic with the right partition count, the right replication factor, and an idempotent producer — and still take down merchant payouts on a Friday because nobody on the team had the runbook for "partner-bank API is degraded, do we shed load or wait?". The mechanism was perfect; the discipline was missing. Production failures in mature data platforms almost never look like "a bug in the code"; they look like "a decision nobody had practised making".

What "production" actually adds

Five things change when a pipeline crosses from "it works on my laptop" to "it serves the company's reconciliation":

Time horizon stretches from minutes to years. A pipeline that ran cleanly for 90 days still has to handle the schema change next quarter, the GST regime change in 2027, the merger that doubles the upstream tables in 2028. Decisions made today against 1 lakh rows have to survive at 50 crore rows. The mechanism doesn't break; the assumptions inside it do.

Failure modes change shape. During development, "the pipeline failed" means "the code threw an exception". In production, the more common failure modes are partial: 0.3% of rows quietly went to the dead-letter queue, the upstream API started returning 200s with empty bodies, an upstream system silently changed the timezone of one column, the watermark advanced past data that hadn't arrived yet because a Kafka broker swapped leadership at 02:00. The on-call's job is not to read a stack trace; it is to notice the dashboard is wrong and reconstruct what changed.

Stakeholders multiply. A working pipeline serves the analyst who originally requested it. A production pipeline serves the analyst, the analyst's three downstream dashboards, the data-science team who joined a feature against it, the finance team who reconciles against it monthly, the auditor who samples it quarterly, and the regulator who reads about it in the post-incident report. Every change is a negotiation across all of them.

₹ enters the conversation. A development pipeline costs nothing; a production pipeline that scans 200 TB of S3 daily costs ₹4-8 lakh per month at retail Snowflake/BigQuery rates. The senior data engineer's job acquires a budget axis: a query that runs in 12 seconds for ₹600 versus the same query in 45 seconds for ₹80, and the right answer depends on whether the consumer is the founders' deck (12 seconds) or the audit warehouse (45 seconds is fine).

Reversibility shrinks. In dev you can drop the table, re-run, and try again. In prod, dropping the table breaks 14 downstream dashboards, kicks off three pages, and may violate a DPDP 2023 record-retention obligation. Every decision becomes harder to undo. The job becomes thinking about reversibility before you take the action.

These five axes are not independent — they compound. A schema change in year 3 of a pipeline (long time horizon) introduces a quiet drift (new failure shape) that affects six dashboards (stakeholder count) that each represent a different ₹ flow (cost) and cannot be cleanly rolled back without re-running 9 months of incremental loads (low reversibility). The whole shape of the problem is qualitatively different from "I wrote a bug yesterday and need to fix it". Why this compounding matters for hiring: the engineer who has only ever written greenfield pipelines has never seen the compounded shape. Their instinct is to optimise the mechanism (write better code) when the right move is to optimise the operational design (write a smaller change, ship behind a flag, run for a week, then promote). This is learned in production, not in code review.
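
The "quiet drift" failure shape is worth making concrete. Below is a minimal sketch (the thresholds, the metric window, and the example numbers are all illustrative, not from any particular platform) of the kind of check that catches a partial failure before a human notices the dashboard is wrong: compare today's row count and null rate against a trailing baseline and flag the days that drift, even though no exception was ever thrown.

# drift_check.py: a minimal sketch of a "quiet drift" detector.
# Assumptions: daily (row_count, null_rate) metrics are already collected
# per table; thresholds and the example numbers below are illustrative only.
from statistics import mean, stdev

def drift_flags(daily_metrics, window=14, z_threshold=3.0, null_rate_jump=0.02):
    """Flag days whose volume or null rate drifts from the trailing window.

    daily_metrics: list of (day, row_count, null_rate), oldest first.
    Returns a list of (day, reason): no stack trace, just "this looks wrong".
    """
    flags = []
    for i in range(window, len(daily_metrics)):
        day, rows, null_rate = daily_metrics[i]
        base = daily_metrics[i - window:i]
        base_rows = [r for _, r, _ in base]
        base_nulls = [n for _, _, n in base]

        mu, sigma = mean(base_rows), stdev(base_rows)
        if sigma > 0 and abs(rows - mu) > z_threshold * sigma:
            flags.append((day, f"row count {rows} vs baseline {mu:.0f}+/-{sigma:.0f}"))

        if null_rate - mean(base_nulls) > null_rate_jump:
            flags.append((day, f"null rate {null_rate:.1%} vs baseline {mean(base_nulls):.1%}"))
    return flags

# Illustrative: 20 quiet days, then an upstream change silently nulls a column.
history = [(f"day-{i:02d}", 1_000_000 + i * 500, 0.001) for i in range(20)]
history.append(("day-20", 1_004_800, 0.045))   # row count looks fine, nulls do not
for day, reason in drift_flags(history):
    print(day, "->", reason)

The check flags nothing on the quiet days and one line on day-20, where the volume is perfectly healthy but the null rate has jumped: exactly the kind of failure no exception ever announces.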

[Figure: Five axes that change in production. A radar diagram with a small development pentagon inside a large production pentagon. Axes: time horizon (minutes → years), failure shape (crash → quiet drift), stakeholders (1 → 14), ₹ cost (free → ₹4–8L/mo), reversibility (drop+retry → none).]
The mechanism doesn't change between dev and prod. The five axes around it do, and Build 17 is about working under the larger pentagon.
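
The ₹ axis can be made just as concrete as the freshness axis is in the next section. Here is a minimal sketch (every configuration, runtime, rate, and consumer in it is hypothetical) of the budget decision the senior engineer now owns: pick the cheapest configuration that still meets each consumer's latency need, rather than the fastest configuration for everyone.

# cost_choice.py: pick the cheapest query configuration that still satisfies
# the consumer. All configurations, runtimes and rupee figures are hypothetical.

# (config name, runtime in seconds, cost per run in rupees)
CONFIGS = [
    ("XL warehouse", 12, 600),
    ("M warehouse", 45, 80),
    ("S warehouse", 240, 25),
]

# (consumer, max acceptable runtime in seconds, runs per day)
CONSUMERS = [
    ("founders' deck dashboard", 15, 50),
    ("finance reconciliation", 600, 4),
    ("quarterly audit extract", 3600, 1),
]

for consumer, max_seconds, runs_per_day in CONSUMERS:
    # cheapest configuration whose runtime fits this consumer's tolerance
    fits = [c for c in CONFIGS if c[1] <= max_seconds]
    name, runtime, cost = min(fits, key=lambda c: c[2])
    monthly = cost * runs_per_day * 30
    print(f"{consumer:<28} -> {name:<14} "
          f"({runtime}s/run, Rs {cost}/run, ~Rs {monthly:,}/month)")

The point is not the numbers; it is that "which warehouse size" is not answerable until you know who is waiting for the result and what they can tolerate.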

A small piece of evidence: what an SLA actually costs

Here is a concrete way to feel the shift. Below is a tiny simulator that takes a pipeline's measured per-run latency distribution and asks: "what SLA can I actually promise without burning out the on-call rotation?" The mechanism (latency measurement) is trivial; the decision (which percentile to promise) is the discipline.

# sla_budget.py — what can you actually promise?
import random

# 1) Past 90 days of pipeline run latencies (in minutes), as observed
random.seed(2026)
latencies = []
for _ in range(83):    # normal days
    latencies.append(random.gauss(28, 4))
for _ in range(5):     # mildly bad days (upstream slow)
    latencies.append(random.gauss(52, 8))
for _ in range(2):     # one full incident week
    latencies.append(random.gauss(180, 30))
latencies = [max(8, x) for x in latencies]
latencies.sort()

# 2) Compute candidate SLA promises
def percentile(xs, p):
    k = int(round((p/100) * (len(xs)-1)))
    return xs[k]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
worst = latencies[-1]

# 3) For each candidate SLA, count how many days it would have breached
print(f"{'SLA promise':<18}{'breach days':>14}{'on-call pages':>16}")
for sla_minutes in [30, 45, 60, 90, 120, 240]:
    breaches = sum(1 for x in latencies if x > sla_minutes)
    pages = breaches  # one page per breach (assume always paged)
    print(f"  <= {sla_minutes:>3} min      {breaches:>10}/90{pages:>14}")

print(f"\nObserved p50={p50:.1f} p95={p95:.1f} p99={p99:.1f} worst={worst:.1f} (all minutes)")

# 4) The decision: pick the tightest SLA that breaches <= 4 times/quarter
# (industry rule of thumb: more than 4 quarterly breaches = SLA is wrong)
budget = 4
chosen = None
for sla in [30, 45, 60, 90, 120, 240]:
    if sum(1 for x in latencies if x > sla) <= budget:
        chosen = sla; break
print(f"\nRecommended SLA: <= {chosen} min")
print(f"Reason: tightest commitment with <= {budget} breaches in 90 days.")
# Output:
SLA promise         breach days   on-call pages
  <=  30 min              42/90            42
  <=  45 min              13/90            13
  <=  60 min               7/90             7
  <=  90 min               2/90             2
  <= 120 min               2/90             2
  <= 240 min               0/90             0

Observed p50=27.8 p95=58.6 p99=181.2 worst=212.0 (all minutes)

Recommended SLA: <= 90 min
Reason: tightest commitment with <= 4 breaches in 90 days.

Walk through what is happening. Step 1 fabricates a realistic latency distribution — 83 normal days clustered around 28 minutes, 5 mildly bad days near 52 minutes, and 2 incident-week days that drift past 180 minutes. This is roughly what a year-old, well-run merchant-reporting pipeline at a place like Razorpay actually looks like: the mean is fine, the tail is brutal. Why fabrication is fair here: real production latency is fat-tailed in a way Gaussian noise alone won't capture. Mixing three regimes (normal, slow, incident) reproduces the bimodal-with-tail shape that wrecks naïve p95-based SLA promises.

Step 2 computes the percentiles and gives a hard look at the distribution: p50 is 28 minutes, p95 is 59, p99 is 181 — the gap between p95 and p99 is the whole game. Step 3 asks the question the team actually has to answer: for each candidate SLA promise, how many days in the last 90 would we have breached it? Why count breaches, not percentiles: a percentile is a property of the distribution; a breach is a property of the day it happened on. Stakeholders care about "how often did the dashboard show stale numbers?", which is a count of days, not a point on a curve.

Step 4 picks the tightest SLA that fits the breach budget — the rule "no more than 4 quarterly breaches" is the operational cost the team can carry without burning out. The output recommends 90 minutes, not the more aggressive 60. The 60-minute SLA looks great on a slide ("we promise 1-hour freshness") and produces 7 pages a quarter — a number that destroys an on-call rotation over 18 months.

The simulator's mechanism is 40 lines of Python. The discipline is the rule "more than 4 quarterly breaches means the SLA is wrong" — a rule no library knows, that is learned by carrying the pager for two years and watching what number burns the team out.

There is a second-order effect the simulator hints at but doesn't model: every breach also costs trust. The merchant-payouts team that promised 60-minute freshness and missed it 7 times in a quarter loses negotiating power for the next ₹-budget conversation. The team that promised 90-minute freshness and missed it twice gets believed when it asks for compute headroom. This is not an engineering metric and won't show up in any dashboard, but it is the load-bearing variable for whether the team gets to invest in the platform or just defends it. Build 17 is partly about understanding which numbers are stakeholders' numbers (they care about breaches per quarter) versus engineers' numbers (we care about p99 latency); the bridge between them is the only conversation that actually moves the platform forward.
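
One way to build that bridge is plain arithmetic. If the team can carry a budget of b breach days out of n observed days, the tightest latency it can honestly promise is roughly the value at the (1 − b/n) quantile of the observed window; the engineers' percentile and the stakeholders' breach count are two views of the same number. A minimal sketch of that translation, under the same sorted-window assumption the simulator above uses (the toy sample here is only for illustration):

# breach_budget.py: translate a stakeholder breach budget into the percentile
# an engineer can promise. Reuses the idea from sla_budget.py; in practice
# you would pass the real sorted 90-day latency sample instead of the toy one.

def promisable_sla(sorted_latencies, breach_budget):
    """Smallest historical latency value that at most `breach_budget`
    observed days exceeded, i.e. roughly the (1 - b/n) quantile."""
    n = len(sorted_latencies)
    # the value with exactly `breach_budget` observed days above it
    return sorted_latencies[n - 1 - breach_budget]

# Toy sorted sample of daily latencies in minutes (illustrative only).
sample = sorted([22, 25, 26, 27, 28, 29, 30, 31, 33, 55, 61, 185])
for budget in (0, 1, 2):
    q = 1 - budget / len(sample)
    print(f"budget={budget} breaches -> promise <= {promisable_sla(sample, budget)} min "
          f"(~p{q * 100:.0f} of the observed window)")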

What changes about the team

Discipline work reshapes the team. The first sixteen builds reward T-shaped engineers who can ship feature work — design a pipeline, code it, deploy it, move on. Build 17 rewards a different shape: engineers who stay with the system long enough to know what it does on Tuesdays. Incident rates correlate more strongly with average tenure than with headcount. A team of 8 engineers with 18-month average tenure handles 3× more pipelines reliably than a team of 12 with 6-month tenure — because the operational mental model takes a year to grow and walks out the door when people leave.

This is why senior data-platform leaders at Indian fintechs (PhonePe, Razorpay, Cred) talk so much about runbook hygiene and post-incident review culture. The runbooks are not a substitute for engineering skill; they are how engineering skill gets compounded across a rotation that turns over. The post-incidents are not blame; they are how the next on-call gets a 6-month head start on the same failure mode.

The wiki cannot teach you to stay. It can teach you what to stay for — what the operational shape of the next eight chapters looks like, so when you carry the pager you are not surprised by which problems are easy and which are politically wedged.

A second team-shape effect: the senior engineers who survive Build 17 develop a particular allergy. They become unwilling to ship a mechanism without its operational scaffolding — runbook, alert, dashboard, escalation path, ownership tag. To a junior eye this looks like over-engineering ("we just want to land the table; why do we need a runbook?"). To the senior, the runbook is the engineering — a table without a runbook is a table that will page someone at 02:00 and waste their hour. The seniors don't argue this; they just refuse to merge the PR. Over 18 months the standard becomes invisible — every new pipeline ships with the operational scaffolding built-in, and the team's incident rate drops by half. The mechanism stayed the same; what changed was what counted as "done".

Common confusions

Going deeper

What "operating a system" looked like in 1972 versus 2026

Operations is older than data engineering. The IBM mainframe shops of the 1970s ran payroll for the entire ITC group on overnight batches; the operator on-shift watched a console for ABEND codes and knew how to mount a backup tape inside 8 minutes. The discipline was unbroken from then to now, but the texture changed twice. First, in the 1990s, when distributed systems made "what state is the system in?" a non-trivial question — Lamport's 1978 paper on logical clocks is in the engineering canon precisely because the answer stopped being obvious. Second, in the 2010s, when cloud storage made "where is my data?" stop having a fixed answer — your data is in S3, which means your data is in 6 racks across 3 availability zones, and your operator has to reason about that without ever touching a tape. The discipline survived both shifts; the runbooks got rewritten each time. The 2026 reader inherits both layers of practice.

Why "data engineering" is not "software engineering on data"

A common confusion in hiring is to interview data-platform candidates as if they were back-end SDEs who happen to deal with data. The interview tests system design, latency, RPC, scaling — and misses entirely that the candidate's actual job involves negotiating with finance about a ₹2 crore infra spend, explaining to the CFO why the WAU number changed retroactively, running a 14-month backfill with stakeholder calendars. The skill set overlaps with SDE work by maybe 60%, but the differentiating 40% is exactly Build 17. Companies that interview only on the overlap end up with teams that can ship pipelines but not operate them.

The 5-year operational arc at an Indian fintech

The pattern across Razorpay, PhonePe, and Cred (from public engineering posts and conference talks): year 1 is "build it"; year 2 is "the first incident teaches you the mechanism wasn't enough"; year 3 is "the team writes the first runbook"; year 4 is "the runbook teaches the on-call rotation"; year 5 is "the team can absorb a 50% turnover without a quality regression". Companies that quit at year 2 — usually because feature pressure overwhelms ops investment — never make the year-5 transition and stay in firefighting mode forever. The next eight chapters are designed to skip the team forward in this arc.

The DPDP 2023 effect on Indian operational practice

The Digital Personal Data Protection Act 2023, in force from 2024, reshaped what "production-grade" means at Indian fintechs. A pipeline that processes payment data must now satisfy: deletion-on-request inside 30 days (across raw + materialised + cached), per-user purpose limitation (no using payments data for marketing without consent), a queryable access audit trail (who read this PII column when), and a data-breach notification to the Data Protection Board inside 72 hours. None of these are mechanism problems — they are operational discipline problems. A team that built the perfect Kafka cluster and the perfect Iceberg lake without DPDP-compliant runbooks is a team that gets fined ₹250 crore the first time something goes wrong. The next eight chapters take this seriously.
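
None of those obligations is a library feature, but each can be turned into a check the on-call can actually run. Here is a minimal sketch of a deletion-on-request verifier that walks every layer where a data principal's records might still live; the layer names, lookup stubs, and the 30-day clock handling are hypothetical placeholders, not a compliance tool.

# dpdp_deletion_check.py: verify a deletion request has propagated to every
# layer before the 30-day clock runs out. The layers and per-layer lookup
# functions are hypothetical stubs, not a real compliance integration.
from datetime import date, timedelta

# Each layer maps to a callable: does this principal's data still exist there?
LAYERS = {
    "raw (S3 landing zone)":        lambda pid: pid in {"u-101"},   # stub
    "lakehouse (Iceberg tables)":   lambda pid: False,              # stub
    "warehouse materialisations":   lambda pid: False,              # stub
    "feature store / caches":       lambda pid: pid in {"u-101"},   # stub
}

def deletion_status(principal_id, requested_on, deadline_days=30):
    remaining = [layer for layer, still_present in LAYERS.items()
                 if still_present(principal_id)]
    days_left = (requested_on + timedelta(days=deadline_days) - date.today()).days
    return remaining, days_left

remaining, days_left = deletion_status("u-101", requested_on=date(2026, 1, 5))
if remaining:
    print(f"NOT compliant yet ({days_left} days left on the clock):")
    for layer in remaining:
        print("  still present in:", layer)
else:
    print("deletion fully propagated; record the completion timestamp for audit")

The value of a check like this is not cleverness; it is that "did the deletion propagate?" becomes a runnable artifact instead of a question someone answers from memory during an audit.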

The operational pyramid: alerts, runbooks, post-mortems, retros

Picture the production layer as a pyramid. Bottom: alerts that fire on real symptoms (freshness, lag, error rate), not on metrics that drift naturally. Middle: runbooks that map alert → mitigation step, written by the engineer who closed the last incident and reviewed by the next on-call. Upper-middle: post-incident reviews that record what happened, what was hard, and what the runbook missed. Top: quarterly retros that look at incident clusters and ask "what design choice is generating this incident shape?". Most teams have alerts and post-mortems, miss runbooks, and never do retros. The teams that survive 5 years run the whole pyramid.
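
The bottom two layers of the pyramid can live next to the code instead of in a wiki nobody opens at 03:00. Here is a minimal sketch (the alert names, thresholds, runbook URLs, and escalation targets are all placeholders) of an alert definition that carries its runbook and escalation path with it, so the operational scaffolding is enforceable in review.

# alerts.py: symptom-based alerts that carry their runbook and escalation
# path. Every name, threshold and URL below is a placeholder for illustration.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    symptom: str          # what a tired human reads at 03:00
    threshold: str        # fire on real symptoms, not naturally drifting metrics
    runbook_url: str      # alert -> mitigation steps, written by the last closer
    escalate_to: str      # who gets woken up when step 4 doesn't work
    owner_team: str

ALERTS = [
    Alert(
        name="gst_filings_freshness",
        symptom="GST filings table is stale; finance reconciliation at risk",
        threshold="max(event_time) older than 90 minutes",
        runbook_url="https://wiki.internal/runbooks/gst-freshness",   # placeholder
        escalate_to="data-platform-secondary-oncall",
        owner_team="payments-data",
    ),
    Alert(
        name="payouts_dlq_rate",
        symptom="payout events are landing in the dead-letter queue",
        threshold="DLQ rate above 0.1% over 15 minutes",
        runbook_url="https://wiki.internal/runbooks/payouts-dlq",     # placeholder
        escalate_to="payouts-oncall",
        owner_team="payouts-platform",
    ),
]

# A pipeline PR that adds a table without an entry here is not "done".
for a in ALERTS:
    assert a.runbook_url and a.escalate_to, f"{a.name} is missing its scaffolding"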

Why the pager rotation is the curriculum

You learn discipline by carrying it. The week-long on-call rotation at Razorpay or Zerodha or Cred is not a chore the senior engineers offload to the juniors; it is the actual classroom of Build 17. A junior engineer in their first rotation gets a freshness alert at 03:14, opens the runbook the previous on-call wrote, follows steps 1–4, finds step 4 didn't work, escalates, watches a senior engineer reason through the actual root cause, then writes step 5 into the runbook before going back to bed. Six months later they are on the senior side of that conversation. Why this works as pedagogy: production failure modes are too rare and too varied to teach in any controlled environment — by the time you've seen 30 incidents, your hand-rolled mental model is more reliable than any course material. The rotation is the only mechanism that exposes engineers to enough failure variance fast enough. A team that protects juniors from on-call also blocks them from the only training environment that produces a Build 17-ready engineer; a team that throws juniors on-call without runbooks burns them out. The middle path — junior shadows senior for two rotations, then carries solo with the runbook as scaffolding — is the practice that produces the next senior cohort.

Where this leads next

The next eight chapters are the discipline curriculum. Read them in order; they reference each other.

Past those: migrations, the 30-year arc, and the parts they don't teach. By the end of Build 17 you will not be surprised by the operational shape of the job — you'll be ready to carry the pager and pick up the runbook.

If you stop at Build 16, you have a working data platform and the start of an operational scar. If you stay through Build 17, you have a working data platform, a runbook for it, an on-call rotation that doesn't burn out, and a budget conversation you can defend. The difference is what separates a team that ships features from a team that owns infrastructure. Pick which one you want to be on, because every chapter past this wall is for the second kind.
