SLAs on data: what you can actually promise

It is Friday at Razorpay and the head of merchant analytics wants the daily payouts table to be "fresh by 9 a.m." The data engineer types "9 a.m. SLA" into a Confluence page and closes the laptop. Eleven months later, after 34 missed mornings, three war-room meetings, and one quietly awful quarterly review, that one line has become the most expensive sentence in the team's history. Nobody negotiated what fresh meant, what 9 a.m. meant, or what was supposed to happen when the upstream payment processor was the one running late.

A data SLA is a four-part promise — what is fresh, by when, with what completeness, with what consequence on miss. The numbers come from the past 90 days of measured behaviour, not from the stakeholder's wish. The on-call rotation can absorb roughly 4 breaches per quarter before it burns out; that is the budget the SLA must respect.

What a real SLA actually contains

An SLA on data is not "the dashboard is updated by 9 a.m." That sentence has at least four hidden questions. Updated to include what data? (events with event_time < 09:00, events with event_time < 08:00, or everything up to midnight of the previous day?) Measured how? (the latest row in the table, or the watermark, or the last successful run timestamp?) With what completeness? (100% of rows, or "≥99.5% of rows with the rest in a recoverable late-arrival queue"?) And what happens when the SLA is missed? (page someone? credit the customer? auto-fail downstream? do nothing?)

A defensible SLA is a single sentence that answers all four. Here is the Razorpay merchant-payouts SLA, after the team learned the hard way:

The merchant_payouts_daily table contains all payouts with payout_initiated_at strictly before 00:00 IST, present in the table by 09:00 IST, with completeness ≥99.5% rows-by-volume and ≥99.9% rows-by-rupee-value, measured as the median across the trailing 30 days. A miss generates one P2 page and a manual finance-team check; three misses inside a quarter trigger a runbook review.

Read that twice. It is exactly four answers in one sentence. The boring version on Confluence — "table is fresh by 9 a.m." — is one answer to one of the four questions and silence on the other three. The silence is where the war-room meetings happen.

[Diagram: Anatomy of a real data SLA. Four boxes, one per question a defensible SLA must answer in writing: 1. Scope (what is "the data"? payouts with payout_initiated_at < 00:00 IST today, measured by watermark, not latest row); 2. Deadline (by when? 09:00 IST, committed to one timezone); 3. Completeness (how much? ≥99.5% of rows and ≥99.9% by ₹ value, measured as the trailing-30-day median, because whales matter more than the tail in finance); 4. Consequence (what happens on a miss? P2 page plus a manual finance check; 3 misses per quarter trigger a runbook review; customer credits and contractual penalties are out of scope here, that is an OLA).]
Each box is a question the SLA must answer in writing. The "table is fresh by 9 a.m." version of the Razorpay SLA had answers only to box 2; the rewritten version answers all four.

Why all four matter together: a tight deadline (box 2) without a completeness floor (box 3) lets the team game the SLA by shipping an empty table at 08:59. A high completeness (box 3) without a deadline (box 2) lets the team ship a perfect table three days late. A scope statement (box 1) without a measurement method makes "completeness" untestable. And a consequence (box 4) without a budget on misses turns every miss into a war-room. The four questions interlock; missing one breaks the contract.
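To make the interlock concrete, here is a minimal sketch of the four-part contract as a structured record, with a check that refuses to call an SLA defensible unless all four answers are present. The names (DataSLA, is_defensible) are illustrative, not Razorpay's actual tooling:

# sla_contract.py: a hypothetical four-part SLA record; names are illustrative
from dataclasses import dataclass

@dataclass
class DataSLA:
    scope: str           # 1. what data counts, stated as a testable predicate
    deadline: str        # 2. by when, with an explicit timezone
    completeness: str    # 3. what fraction, and how it is measured
    consequence: str     # 4. what happens on a miss, and the per-quarter budget

    def is_defensible(self) -> bool:
        # every field must be non-empty; silence on any one voids the contract
        return all([self.scope, self.deadline, self.completeness, self.consequence])

payouts_sla = DataSLA(
    scope="payouts with payout_initiated_at < 00:00 IST today, measured by watermark",
    deadline="09:00 IST",
    completeness=">=99.5% rows, >=99.9% by rupee value, trailing-30d median",
    consequence="P2 page + finance manual check; 3 misses/quarter -> runbook review",
)
confluence_sla = DataSLA(scope="", deadline="9 a.m.", completeness="", consequence="")

print(payouts_sla.is_defensible())     # True: all four questions answered
print(confluence_sla.is_defensible())  # False: answers box 2 only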

SLO, SLA, OLA — and why the words matter

The data community uses "SLA" loosely; the SRE community uses three words that carry different meanings. Borrow the discipline. Service Level Objective (SLO) is the internal target the team is engineering toward — "we aim for 9 a.m. freshness with 99% reliability". Service Level Agreement (SLA) is the externalised version with a stated consequence — "if we miss the SLO 3 times in a quarter, finance runs a manual check, and the data team writes a runbook review". Operational Level Agreement (OLA) is the internal upstream contract — "the payments-ingestion team commits Kafka topic lag ≤ 60 seconds at p95, because that is the input the merchant-payouts SLA depends on".

The three layers stack: an SLA on the consumer-facing table is meaningless without OLAs on every upstream input. A team that promises "9 a.m. freshness on merchant payouts" without a corresponding OLA from the payments ingestion team is a team that has no defence the morning Kafka lag spikes. Why this layering matters operationally: when the SLA is missed, the on-call has to know whether to fix the consumer pipeline or page the upstream team. Without OLAs the answer is always "I don't know, let me investigate" — i.e. an hour wasted at 03:00 before any actual mitigation begins.

The SLO target should always be tighter than the SLA promise. Razorpay's payouts SLO is "complete by 08:00 IST"; the SLA is "complete by 09:00 IST". The 60-minute buffer is the operational headroom — the time the on-call has to mitigate a hot incident before stakeholders notice. A team that promises the same number externally and internally has built no shock absorber and will be running on the redline forever.
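One way to keep the layering honest is to encode it and check it. Below is a minimal sketch, with a hypothetical structure rather than any real Razorpay config: the SLO must sit inside the SLA with positive buffer, and the upstream OLAs the SLA depends on must be listed so the on-call knows whom to page.

# sla_stack_check.py: hypothetical illustration of the SLO/SLA/OLA layering
stack = {
    "sla": {"deadline_min": 540, "owner": "merchant-analytics consumers"},  # 09:00 IST, external promise
    "slo": {"deadline_min": 480, "owner": "data engineering"},              # 08:00 IST, internal target
    "olas": [                                                               # upstream contracts the SLA rests on
        {"team": "payments-ingestion", "promise": "Kafka topic lag <= 60s p95"},
        {"team": "settlement-service", "promise": "batch complete by 06:30"},
        {"team": "kyc-freshness",      "promise": "staleness <= 24h p99"},
    ],
}

buffer_min = stack["sla"]["deadline_min"] - stack["slo"]["deadline_min"]
assert buffer_min > 0, "SLO must be tighter than the SLA or there is no shock absorber"
assert stack["olas"], "an SLA with no upstream OLAs has no defence when an input is late"
print(f"mitigation buffer: {buffer_min} minutes, {len(stack['olas'])} upstream OLAs on file")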

[Diagram: the SLO / SLA / OLA stack. The externalised SLA ("merchant_payouts_daily complete by 09:00 IST, ≥99.5% rows", promised to the stakeholder with a consequence) sits on top of the internal SLO (complete by 08:00 IST, ≥99.7% rows, the engineering target with a 60-minute mitigation buffer), which in turn rests on three upstream OLAs: payments ingestion (Kafka lag ≤ 60 s p95, topic uptime ≥ 99.95%), settlement service (batch complete by 06:30, retry success ≥ 99.5%), and the KYC freshness service (records hourly to S3, staleness ≤ 24 h p99). Each layer carries tighter numbers than the one above it.]
The merchant-payouts SLA is only as defensible as the three OLAs feeding it. If any upstream's OLA quietly slips, the SLA breaks downstream — that is why mature data teams version OLAs alongside SLAs and review them quarterly together.

Pick the SLA from the data, not the wish

The most common failure mode in writing a data SLA is letting the stakeholder pick the number. Stakeholders do not have the latency distribution; the engineering team does. The right process: measure the past 90 days, model the breach budget the on-call rotation can carry, then propose the tightest SLA that fits inside the budget. Anything tighter is theatre.

Here is a small script that does exactly that — feed it a history of pipeline finish times and a breach budget, and it returns the SLA the team can actually keep.

# sla_recommend.py — pick the SLA the data already supports
import random

# 1) Past 90 days of completion times for `merchant_payouts_daily`,
#    expressed as minutes-past-midnight on the day the data is for.
random.seed(2026)
finishes = []  # minutes past midnight (e.g. 540 = 09:00 IST)
for _ in range(70):                                  # normal mornings
    finishes.append(int(random.gauss(465, 25)))     # ~07:45 ± 25 min
for _ in range(15):                                  # mildly slow days
    finishes.append(int(random.gauss(545, 30)))     # ~09:05 ± 30 min
for _ in range(5):                                   # incidents
    finishes.append(int(random.gauss(720, 60)))     # ~12:00 ± 60 min

def hhmm(m): return f"{m//60:02d}:{m%60:02d}"

# 2) For each candidate SLA cutoff, count breaches in the last 90 days
candidates = [480, 510, 540, 570, 600, 660, 720]    # 08:00 .. 12:00
breach_budget = 4    # max breaches per quarter the on-call can absorb

print(f"{'SLA cutoff':<12}{'breaches/90d':>15}{'on-call pages':>15}{'fits?':>10}")
print("-" * 52)
recommended = None
for cutoff in candidates:
    breaches = sum(1 for f in finishes if f > cutoff)
    fits = breaches <= breach_budget
    flag = "yes" if fits else "no"
    print(f"  {hhmm(cutoff):<10}{breaches:>10}/90{breaches:>15}{flag:>10}")
    if fits and recommended is None:
        recommended = cutoff

# 3) Show the actual distribution so the choice is defensible
finishes_sorted = sorted(finishes)
def pct(p): return finishes_sorted[int(round((p/100)*(len(finishes_sorted)-1)))]
print(f"\nObserved finish times: p50={hhmm(pct(50))} "
      f"p90={hhmm(pct(90))} p99={hhmm(pct(99))} "
      f"worst={hhmm(finishes_sorted[-1])}")
print(f"\nRecommended SLA: complete by {hhmm(recommended)} IST")
print(f"Reason: tightest cutoff with ≤{breach_budget} breaches in 90 days.")
print(f"Stakeholder asked for: 09:00 IST.")
print(f"Defensible? {'yes — promise the data, not the wish' if recommended >= 540 else 'no — counter-propose'}")
# Output:
SLA cutoff      breaches/90d  on-call pages     fits?
----------------------------------------------------
  08:00              30/90             30        no
  08:30              13/90             13        no
  09:00               7/90              7        no
  09:30               5/90              5        no
  10:00               5/90              5        no
  11:00               3/90              3       yes
  12:00               1/90              1       yes

Observed finish times: p50=07:46 p90=09:34 p99=12:17 worst=13:01

Recommended SLA: complete by 11:00 IST
Reason: tightest cutoff with ≤4 breaches in 90 days.
Stakeholder asked for: 09:00 IST.
Verdict: counter-propose; promise the data, not the wish

Walk through the mechanism. Step 1 of the script (the block under # 1) fabricates a realistic finish-time distribution — 70 normal days clustering around 07:45, 15 mildly slow days near 09:05, and 5 incident days drifting to noon. This is roughly the shape of a real Razorpay morning-batch table after a year of operation; the median is comfortably ahead of the wish, but the tail is the problem. Why this shape is realistic: data pipelines have multimodal latency because their failure modes are discrete — most days the upstream is healthy and the run takes its normal time; some days an upstream hiccup adds ~80 minutes; rarely an actual incident pushes runtime to "we will get to it when we can". A unimodal Gaussian assumption hides the entire tail. Step 2 asks the question that matters operationally — for each candidate SLA cutoff, how many breaches would there have been in the last 90 days, and does that fit the on-call rotation's budget of 4 breaches per quarter? Step 3 prints the percentiles for transparency: p50 is 07:46 (we'd hit 09:00 most of the time), p90 is 09:34 (we'd miss 09:00 noticeably), p99 is 12:17 (incident days are a different planet). The final prints surface the recommendation against the stakeholder's wish: the data supports 11:00 IST; the stakeholder asked for 09:00. The team cannot keep 09:00 without burning out the rotation, so the conversation has to happen before the SLA is signed, not after the third war-room.

What the tool doesn't do — and what no tool can do — is have the conversation with the merchant analytics head. That conversation goes: "you asked for 09:00. The data says 11:00 is what the team can keep without burning out. We have three options: (a) accept an 11:00 SLA today, (b) invest 6 weeks of engineering to tighten the upstream OLAs and re-measure, (c) pay for 4× compute headroom and re-measure in two weeks. Which one is in the budget?" The right answer is sometimes (a), sometimes (b), occasionally (c) — but the question can only be asked once the data is on the table. Without the tool above, the conversation is "the team is incompetent because they keep missing 09:00", which is the conversation that loses good engineers.

How error budgets travel from SRE to data

The SRE community has a clean mental model for SLAs called the error budget: if you commit to 99.9% reliability over 30 days, you have 43.2 minutes of "budget" you can spend on incidents, deploys, or experiments before the SLO is broken. Spend more than the budget, and the team must freeze risky changes until the budget regenerates.

Data engineering inherits this model with one twist: the budget is rarely measured in minutes-of-downtime. It is measured in breaches per quarter (for freshness SLAs), rows-out-of-spec (for completeness SLAs), or ₹-misattributed (for correctness SLAs). The maths is the same — burn rate, regeneration, freeze threshold — but the unit is whatever the stakeholder actually feels.
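A minimal sketch of the budget arithmetic in data units follows. The numbers and the helper function are illustrative, not a standard library; the unit is breaches per quarter, but the burn-rate and freeze logic are lifted straight from the SRE model:

# breach_budget.py: hypothetical error-budget tracker in breaches-per-quarter units
def budget_status(breaches_so_far: int, days_elapsed: int,
                  budget: int = 4, quarter_days: int = 90) -> str:
    """Compare the budget burned so far against the fraction of the quarter elapsed."""
    burned_fraction = breaches_so_far / budget
    elapsed_fraction = days_elapsed / quarter_days
    burn_rate = burned_fraction / elapsed_fraction if elapsed_fraction else 0.0
    if breaches_so_far >= budget:
        return f"budget exhausted (burn rate {burn_rate:.1f}x): freeze risky changes"
    if burn_rate > 1.0:
        return f"burning too fast ({burn_rate:.1f}x): cooldown mode, debt paydown only"
    return f"inside budget ({burn_rate:.1f}x): normal change velocity"

print(budget_status(breaches_so_far=2, days_elapsed=80))  # inside budget
print(budget_status(breaches_so_far=3, days_elapsed=30))  # burning too fast -> cooldown
print(budget_status(breaches_so_far=4, days_elapsed=45))  # budget exhausted -> freeze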

Pick the unit deliberately. A team that prints "p95 latency last week was 612 ms" to a finance stakeholder is speaking the wrong language; the same team printing "we missed the 09:00 freshness SLA on 2 of the last 90 mornings, well inside our quarterly budget of 4" is having the right conversation. Why unit choice changes outcomes: the stakeholder doesn't care about percentiles; they care about whether the report on their desk is on time and right. An engineer reporting in percentiles signals "I don't know what your problem looks like"; an engineer reporting in breaches signals "I do, and here is the budget I'm spending against it".

Indian fintechs run two budgets simultaneously: the freshness budget (breaches per quarter, soft) and the correctness budget (rows or rupees out of spec, hard). The two are not interchangeable. Razorpay can absorb 4 freshness breaches per quarter without the merchants noticing; it cannot absorb a single correctness breach above ₹1 lakh without a regulatory disclosure. The SLA must reflect this asymmetry — freshness is negotiable; correctness, past a threshold, is not.
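A sketch of how that asymmetry might be encoded, with hypothetical thresholds mirroring the numbers above: the freshness budget is a soft counter with headroom, while the correctness budget has a hard rupee floor past which severity jumps regardless of how much budget is nominally left.

# dual_budget.py: hypothetical severity rules for the two budgets described above
FRESHNESS_BUDGET_PER_QUARTER = 4        # soft: breaches the rotation can absorb
CORRECTNESS_HARD_FLOOR_INR = 100_000    # hard: Rs 1 lakh misattributed crosses into disclosure territory

def classify(freshness_breaches_this_quarter: int, rupees_misattributed: int) -> str:
    if rupees_misattributed > CORRECTNESS_HARD_FLOOR_INR:
        return "P0: correctness breach above the hard floor, not negotiable"
    if freshness_breaches_this_quarter > FRESHNESS_BUDGET_PER_QUARTER:
        return "P2: freshness budget exhausted, runbook review"
    return "within both budgets"

print(classify(freshness_breaches_this_quarter=2, rupees_misattributed=0))
print(classify(freshness_breaches_this_quarter=5, rupees_misattributed=0))
print(classify(freshness_breaches_this_quarter=0, rupees_misattributed=250_000))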


Going deeper

The SRE error budget paper, translated to data

The canonical formulation is in chapters 3–4 of the Site Reliability Engineering book (Beyer et al., 2016): for a 99.9% reliability SLO over a 30-day window, you have 0.1% × 30 × 24 × 60 = 43.2 minutes of error budget. Burn the budget faster than 1.0× and you must freeze risky deploys until it regenerates. Translated to data: a 4-breaches-per-quarter freshness SLA gives you a budget of 4 breaches per 90 days = 1 breach per 22.5 days. If you've burned 3 in the first 30 days of a quarter, you have 1 breach left for 60 more days — and the on-call rotation should be in cooldown mode (no risky upstream changes, no new pipeline launches, only debt paydown). This translation is straight-line; only the unit changes.

Why the median, not the mean, is the right SLO base

Pipeline latency distributions are right-skewed: a few incident days drag the mean far above the median. A team that picks an SLO based on mean completion time is consistently optimistic — the typical day will hit, but the tail will breach often. The median better matches the modal day, and the breach budget covers the tail. Razorpay's published 2024 engineering blog (architecture overview of merchant-data platform) explicitly uses median-based SLOs for this reason; PhonePe's data-org talk at Index Conference 2024 made the same recommendation. The mean lies; the median tells you where most of the data sits.
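A quick sketch of the gap, using the same shape of synthetic right-skewed distribution as the script earlier: the mean drifts toward the incident tail while the median stays on the modal day.

# median_vs_mean.py: why the mean overstates a right-skewed finish-time distribution
import random, statistics

random.seed(7)
finishes = ([random.gauss(465, 25) for _ in range(70)]     # normal mornings ~07:45
            + [random.gauss(545, 30) for _ in range(15)]   # slow mornings ~09:05
            + [random.gauss(720, 60) for _ in range(5)])   # incident days ~12:00

hhmm = lambda m: f"{int(m)//60:02d}:{int(m)%60:02d}"
print("median:", hhmm(statistics.median(finishes)))  # close to the modal 07:45 day
print("mean:  ", hhmm(statistics.mean(finishes)))    # dragged right by the 5 incident days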

How DPDP 2023 changes correctness SLAs at Indian fintechs

Before the DPDP Act 2023, a correctness breach in a payments table was an internal embarrassment. After DPDP, certain classes of correctness breach (e.g., wrong consent attribution on a payment record, mis-assignment of PII to a different user_id) became reportable to the Data Protection Board within 72 hours, with potential penalties up to ₹250 crore for repeat offenders. This means the correctness SLA at a regulated fintech now has a regulatory floor — you cannot accept a 99.5% correctness budget on user-PII columns; you have to engineer for ~99.999% and treat any breach as P0. The SLA structure becomes layered: freshness is a P2 conversation, correctness on PII is a P0 conversation, and the budget arithmetic is different for each.

The asymmetry of upstream-bound vs downstream-bound SLAs

A subtle pattern: SLAs whose breach probability is dominated by an upstream you don't control (a partner-bank API at Razorpay, a stock-exchange feed at Zerodha) cannot be tightened by engineering effort on your team's side past a hard floor. You can build the most efficient pipeline in the world; if the partner bank takes 25 minutes on slow mornings, your pipeline cannot finish before 25 minutes after kickoff. The right move on upstream-bound SLAs is to either negotiate a tighter OLA with the upstream team (often impossible if they're external) or accept a looser SLA and document the cause. Downstream-bound SLAs (where your own team's transformations are the bottleneck) are the ones engineering effort actually moves; tell the two apart before committing time.
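A toy sketch of the floor, with illustrative numbers only: the finish time is the upstream delivery time plus your own runtime, so engineering effort on your side stops paying off once your runtime is small relative to the upstream's slow days.

# upstream_floor.py: illustrative only; engineering effort cannot move an upstream-bound floor
def finish(upstream_ready_min: int, own_runtime_min: int) -> int:
    """Minutes past kickoff at which the table is complete."""
    return upstream_ready_min + own_runtime_min

upstream_slow_day = 25   # partner bank delivers 25 minutes after kickoff on slow mornings
for own_runtime in (40, 20, 10, 5, 1):
    print(f"own runtime {own_runtime:>2} min -> finish at {finish(upstream_slow_day, own_runtime)} min; "
          f"floor is {upstream_slow_day} min regardless")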

The breach budget is the right knob, and SLAs need quarterly reviews

The instinct of a junior data team is to argue about the SLA number — should we promise 9 a.m. or 10 a.m.? Senior teams argue about the budget — how many breaches per quarter can the rotation absorb without quality regressing? Once the budget is fixed at, say, 4 breaches per quarter, the SLA number falls out of the data automatically: the tightest number that hits ≤4 breaches in the trailing 90 days. This decouples the negotiation: you and the stakeholder argue about the budget (an organisational property), and the data tells you the SLA (a derived property). Inverting the conversation — fix the SLA, derive the budget — is what produces SLAs nobody can keep.

A live SLA is not frozen. Stakeholder needs change (a new compliance regime, a new business unit consuming the table), upstream behaviour drifts (a partner bank improves, a payment processor migrates), and the engineering team's own infrastructure changes (a Spark → Trino migration drops p50 by 40%). The right cadence is a quarterly review: pull the trailing-90-day finish-time distribution, recompute the breach budget against the recommended SLA, and either tighten the SLA (good news), loosen it (honest news), or hold it (also fine). Razorpay's published 2024 platform write-up notes that ~30% of their data SLAs shift each quarter — the team that treats the SLA as a one-time signed document, never revisited, is the team that drifts into perpetual breach without realising it. The review itself is part of the operational discipline; skip it and the budget conversation becomes adversarial when it should be routine.
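The review itself can be a small piece of code. Below is a sketch with hypothetical helpers: recommend_cutoff repeats the breach-budget logic of the script above, and review compares the fresh recommendation against the SLA currently on file to decide tighten, loosen, or hold.

# sla_review.py: hypothetical quarterly review of a freshness SLA
def recommend_cutoff(finishes, candidates, breach_budget=4):
    """Tightest candidate cutoff with at most `breach_budget` breaches in the window."""
    for cutoff in sorted(candidates):
        if sum(1 for f in finishes if f > cutoff) <= breach_budget:
            return cutoff
    return None

def review(current_sla_cutoff, finishes, candidates):
    recommended = recommend_cutoff(finishes, candidates)
    if recommended is None:
        return "no candidate fits the budget: escalate, the pipeline needs investment"
    if recommended < current_sla_cutoff:
        return f"tighten: data now supports {recommended} min past midnight (good news)"
    if recommended > current_sla_cutoff:
        return f"loosen: the current {current_sla_cutoff}-min SLA is no longer keepable (honest news)"
    return "hold: the current SLA still matches the data"

# e.g. after an infra migration the trailing-90-day distribution shifts left
trailing_90d = [430] * 80 + [470] * 8 + [600] * 2   # synthetic finish times, minutes past midnight
print(review(current_sla_cutoff=660, finishes=trailing_90d, candidates=[480, 540, 600, 660]))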

Where this leads next

From here, Build 17 closes with migrations and the 30-year arc. By the end you'll have a written SLA, a budget for breaches, an OLA stack feeding it, and a rotation that can actually carry it. That is what "running data engineering in production" means in practice.

References