SLAs and the meaning of "late"

The finance team at Razorpay opens the daily settlement dashboard at 9 a.m. sharp every weekday. On the morning of 14 March 2024 the dashboard was empty. The pipeline had finished — but at 9:07. Seven minutes late. Nobody was paged, no alert fired, and the engineer who owned the pipeline didn't know anything was wrong until the CFO's office called at 9:11. The pipeline's job in the scheduler said success. The actual contract — be done by 8:55 so the 9:00 dashboard reads fresh data — had no row in any table. That gap, between "the job ran" and "the data was on time", is what an SLA is for.

An SLA on a pipeline is a deadline plus a clock plus an owner: by what time, measured against which moment, does which person get paged when missed. The mechanism is three numbers — the start of the data interval, the deadline relative to it, and the actual completion time — compared by the scheduler at the deadline. Without those three, "late" is a feeling; with them, "late" is a row in a table.

What an SLA actually measures

A scheduled pipeline has at least four clocks the on-call engineer cares about, and the SLA picks one. Intuition tends to conflate them, and the bug usually lives in which one the contract is written against.

[Figure: the four clocks of a pipeline run, and which one the SLA measures against. A timeline marks data interval start (2026-04-24 00:00 IST), data interval end (2026-04-25 00:00 IST), scheduled start (interval_end + 02:00, when cron fires), and actual completion (interval_end + 05:30, last task ok). The SLA deadline is anchored to one of these, most commonly interval_end (here interval_end + 9h); a run is late iff completion > deadline.]
The four clocks: when the data window opens, when it closes, when the scheduler runs, when the last task finishes. The SLA is a deadline anchored to one of these — usually data-interval-end — and "late" is the comparison the scheduler makes at the deadline.

The four clocks are: data interval start (the beginning of the window the run is processing — for a daily run that processes 24 April's orders, this is 24 April 00:00 IST), data interval end (the close of the window — 25 April 00:00 IST), scheduled start (the moment the scheduler queues the run — typically interval_end + small_offset, e.g. 25 April 02:00 IST to give upstream sources time to settle), and actual completion (when the last task in the run finishes successfully).

The SLA is a deadline anchored to one of these and a stopwatch that compares against actual completion. The most common anchor is data_interval_end — "the dashboard for 24 April's data must be ready by 25 April 09:00 IST" maps to deadline = interval_end + 9 hours. Anchoring against scheduled_start instead is a subtle bug: if the scheduler queues the run late (because the previous run is still holding a slot, or the worker pool is full, or the cluster is rebooting), the SLA window slides with it and the dashboard misses 9 a.m. while the SLA still reads "on time". Airflow's sla parameter on a task historically anchored to scheduled-start; teams using it for finance dashboards learned to switch to sla_miss_callback against interval-end the hard way.
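The difference between the two anchors is just arithmetic, and worth seeing once. A sketch with illustrative dates (naive datetimes for brevity; production code should be tz-aware):

```python
from datetime import datetime, timedelta

interval_end    = datetime(2026, 4, 25, 0, 0)    # data window closes (midnight IST)
scheduled_start = interval_end + timedelta(hours=2)  # cron normally fires at 02:00
completion      = datetime(2026, 4, 25, 9, 20)   # run actually finished at 09:20

# anchor on the data interval: the deadline is fixed regardless of queueing delays
deadline_interval = interval_end + timedelta(hours=9)    # 09:00, always

# anchor on scheduled start: the deadline slides when the scheduler queues late
late_queue     = scheduled_start + timedelta(minutes=45) # queued at 02:45 instead
deadline_sched = late_queue + timedelta(hours=7)         # 09:45 -- slid with the delay

print(completion > deadline_interval)  # True  -- late against the real contract
print(completion > deadline_sched)     # False -- "on time" against the wrong clock
```

The same 45-minute queueing delay moved the scheduled-start deadline but not the interval-end deadline; only the latter matches the dashboard's 09:00 contract.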

There is a fifth clock that the contract sometimes needs to acknowledge but rarely does: data freshness — the timestamp of the latest source row the pipeline read, relative to wall-clock now. A pipeline can finish on time and still produce stale data if its source ingestion is lagging. A run that completes at 08:50 IST but only ingested events up to 24 April 22:00 IST has finished early and is two hours stale at the same time. Banks and brokers operating on tick data are particularly bitten by this — Zerodha's risk pipeline cares less about "the run completed by 09:00" and more about "the latest tick processed is no older than 5 minutes". For those teams, the SLA is written against freshness, and actual_completion is irrelevant.

SLO, SLI, error budget — and the difference from SLA

Three terms the SRE world has settled on, all related and routinely confused. An SLI (service-level indicator) is the measurement — was_on_time = (completion <= deadline), computed for every run, yielding a stream of booleans. An SLO (service-level objective) is the target — "99 % of runs over the last 30 days are on time". An SLA (service-level agreement) is the externally-promised version — "if SLO drops below 99 % in any month, the data team owes the finance team a postmortem within 24 hours". The SLO is what the on-call wakes up to; the SLA is what shows up in the contract with the consuming team.

The reason this distinction matters is the error budget. If the SLO is 99 %, the budget is 1 % — for a daily pipeline, that is roughly 3 missed days per year. The team can spend the budget on risky migrations, infrastructure changes, or aggressive parallelism — knowing that one missed day costs 1/3 of the year's budget. Without the explicit budget, every SLA miss becomes a panic, and the team optimises for "never miss" instead of "miss the right things at the right times". Why a budget rather than a hard target: a 100 % target makes every change feel risky and pushes the team toward never deploying, never refactoring, never trying anything new — which is its own kind of failure because the pipeline ossifies and decays. A budget gives the team explicit permission to use a small fraction of the year's reliability for forward progress, and the conversion of "reliability we didn't spend" into "changes we shipped" is the entire point of the SLO/error-budget framework.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

@dataclass
class RunRecord:
    run_id: str
    interval_end: datetime
    completion: datetime | None        # None = still running or failed
    deadline_offset_hours: float = 9.0 # SLA anchored on interval_end

    @property
    def deadline(self) -> datetime:
        return self.interval_end + timedelta(hours=self.deadline_offset_hours)

    @property
    def is_late(self) -> bool:
        # not-yet-finished past the deadline counts as late
        if self.completion is None:
            return datetime.utcnow() > self.deadline
        return self.completion > self.deadline

def slo_window(runs: Iterable[RunRecord], days: int = 30, target: float = 0.99):
    """Compute SLO attainment over the last `days` of runs and how much
    error budget remains."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    recent = [r for r in runs if r.interval_end >= cutoff]
    if not recent:
        return {"target": target, "attained": None, "budget_used_pct": None}
    on_time = sum(1 for r in recent if not r.is_late)
    attained = on_time / len(recent)
    allowed_misses = (1 - target) * len(recent)
    actual_misses  = len(recent) - on_time
    budget_used_pct = 0.0 if allowed_misses == 0 else (
        100.0 * actual_misses / allowed_misses)
    return {
        "target": target,
        "attained": round(attained, 4),
        "runs": len(recent),
        "misses": actual_misses,
        "budget_used_pct": round(budget_used_pct, 1),
    }

A typical reading from a Swiggy pricing pipeline at month-end against a 30-day window:

$ python slo.py
{'target': 0.99, 'attained': 0.9667, 'runs': 30,
 'misses': 1, 'budget_used_pct': 333.3}

The output reads: target is 99 %, attainment over 30 days is 96.67 %, one run was late, and the team has spent 333 % of the monthly error budget — three times more than the budget allowed. The walkthrough is short but load-bearing. is_late treats a still-running run past deadline as already-late — a pipeline stuck for 3 hours past 09:00 is not "pending judgment", it is missing the SLA right now. completion is None is the trick: pages should fire at the deadline, not after the run eventually completes. budget_used_pct above 100 means the SLO was violated; once it crosses ~70 %, production teams freeze risky changes. Why measure budget-used as a percentage rather than absolute misses: the absolute number depends on window length (30 vs 90 days) and frequency (daily vs hourly runs), which makes cross-team comparison meaningless. Percentage of budget normalises both, which is why Google's SRE handbook leads with budget-burn-rate alerting and not absolute-miss alerting.

How the scheduler turns SLA into pages

A scheduler that knows about SLAs runs a separate watcher that wakes at each pending deadline, asks "is this run done?", and fires a callback if not. This is conceptually a tiny inverted index from deadline_timestamp → run_id, scanned every minute against now().

[Figure: the SLA watcher loop. A heartbeat fires every 60 seconds (cron or a k8s job). The watcher scans the runs table (WHERE deadline < now() AND completion IS NULL), inserts a row into sla_miss (UNIQUE(run_id) gives at-most-once paging, notify_count = 0), and pages the owner via PagerDuty/OpsGenie on the first attempt only. When a late run eventually completes, the watcher closes the sla_miss row and records duration and severity; closed misses aggregate into SLO attainment over the window and the error-budget burn rate.]
The watcher's job is simple but load-bearing — heartbeat, scan, miss-row insert, page. Closing the miss when the run eventually completes is what feeds the SLO aggregator and the error-budget dashboard.

The minimum viable watcher fits in 50 lines:

import sqlite3, time, requests
from datetime import datetime

def tick(conn, pager_url: str):
    """One pass of the watcher. Find newly-late runs, insert miss rows,
    page exactly once per miss, then close miss when run completes."""
    now = datetime.utcnow().isoformat() + "Z"

    # 1. find runs that are now past deadline and not yet completed
    cur = conn.execute(
        "SELECT r.run_id, r.dag_id, r.deadline, r.owner "
        "FROM runs r LEFT JOIN sla_miss m ON r.run_id = m.run_id "
        "WHERE r.deadline < ? AND r.completion IS NULL "
        "  AND m.run_id IS NULL",
        (now,),
    )
    new_misses = cur.fetchall()

    # 2. record each one (UNIQUE constraint enforces at-most-once page)
    for run_id, dag_id, deadline, owner in new_misses:
        try:
            conn.execute(
                "INSERT INTO sla_miss(run_id, opened_at, owner) VALUES (?,?,?)",
                (run_id, now, owner),
            )
            conn.commit()
        except sqlite3.IntegrityError:
            continue   # another watcher beat us; skip
        # 3. page outside the transaction so DB lock isn't held during HTTP
        requests.post(pager_url, json={
            "service": dag_id, "run_id": run_id, "deadline": deadline,
            "owner": owner, "severity": "P2",
        }, timeout=5)

    # 4. close miss rows whose run has now completed
    conn.execute(
        "UPDATE sla_miss SET closed_at = ?, "
        "  duration_s = (julianday(?) - julianday(opened_at)) * 86400 "
        "WHERE closed_at IS NULL AND run_id IN ("
        "  SELECT run_id FROM runs WHERE completion IS NOT NULL)",
        (now, now),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("scheduler.db")
    while True:
        tick(conn, "https://events.pagerduty.com/v2/enqueue")
        time.sleep(60)

The walkthrough is short. The LEFT JOIN sla_miss ... WHERE m.run_id IS NULL is what makes the page fire exactly once: a run can only become "newly missed" if it isn't already in the miss table. The UNIQUE(run_id) constraint on sla_miss is the safety net for two watchers racing — exactly one INSERT wins. Paging outside the transaction matters because PagerDuty's API can block for 4–5 seconds and you do not want every other reader of sla_miss waiting on an HTTP call. The UPDATE sla_miss SET closed_at is the bookkeeping that lets you compute miss-duration for the postmortem; without it, the SLO dashboard knows about misses but not about how bad each one was.

A real watcher adds three things this minimal version skips: deadline-anchor configurability (interval_end vs scheduled_start vs latest-source-row freshness), severity escalation (page primary on-call at deadline, secondary at deadline+30min, manager at deadline+2h), and a "pre-deadline warning" channel that pings Slack 30 minutes before deadline so the on-call can intervene before the page. Airflow's sla_miss_callback, Dagster's freshness policies, and Prefect's flow run timeout each implement variants of this loop with slightly different deadline semantics — but the core "scan, miss-row, page, close" pattern is identical.

The watcher itself needs a heartbeat metric — a row written to a watcher_heartbeat table on every successful tick. A separate alert ("watcher heartbeat older than 5 minutes") fires on the secondary on-call channel if the watcher itself goes silent. Without this dead-man's-switch, a watcher that crashes silently leaves every SLA miss un-paged and the on-call discovers the problem only when a consumer complains hours later. Production teams treat the watcher's own uptime as the highest-tier SLA, because every other SLA depends on it.

SLAs in the real world: pipelines, dashboards, contracts

The contract gets interesting when the consumer of the data is not the engineer running the pipeline. Three concrete shapes show up over and over in Indian fintech and consumer-internet:

The first shape is the morning-finance dashboard. The pipeline runs overnight; the dashboard is opened at 09:00 IST by a human; the SLA is interval_end + 9h against actual completion. The consequences of a miss are operational — a manager waits, a meeting starts late, a number on a slide is stale. Razorpay's daily settlement-summary pipeline ships under this contract; the SLA is interval_end + 8h30m (so 08:30 IST), with a 30-minute buffer before the dashboard's 09:00 open. The buffer is what the runbook calls the "intervention window" — half an hour for the on-call to notice, manually re-run, and still hit 09:00.

The second shape is the regulatory filing. The pipeline produces a file that has to be uploaded to a regulator's portal by a fixed external deadline — GST returns by the 11th of every month at 23:59 IST, RBI Form FC-GPR within 30 days of inward remittance, FATCA filings annually by 31 May. The SLA here is external and inflexible. The miss costs are concrete and paid in rupees: a GST return filed one day late triggers a ₹100/day late-filing fee per return on top of interest on outstanding tax — for a Flipkart-scale company filing tens of thousands of GSTINs, this scales fast. The pipeline SLA in this case is set days before the regulatory deadline (GST return pipeline at GSTN-filer companies usually targets the 8th, three days of buffer for re-runs) and the on-call rota is staffed to match.

The third shape is the operational stream. A real-time fraud model at a payments company needs the latest transaction features within 60 seconds of the transaction occurring; a Dream11 leaderboard needs to update within 2 seconds of a six being hit; a Swiggy delivery-ETA model needs current courier locations within 30 seconds. The SLA here is on freshness, not completion. The watcher reads the latest event-time the consumer's view-of-the-world includes and compares against wall-clock now. Confluent's "lag" metric on a Kafka consumer group, and Materialize's mz_freshness view, both expose this as a first-class signal.

A fourth shape worth naming is the internal-consumer pipeline — a feature-engineering job that feeds an ML training run, a metrics-rollup that powers the next pipeline downstream, an aggregation that another team depends on without anyone documenting it. These pipelines tend to have no formal SLA at all because the consumer is "us" and the producer is also "us". They are fine right up until the moment the consumer team wakes up and discovers their own pipeline has been silently drifting because the upstream nobody-owned-the-SLA-on has been late three days a week for two months. The fix is structural: every pipeline that any other pipeline depends on gets an SLA, even an internal one, even if the deadline is generous. The discipline of writing it down forces the question "who owns this?" to get answered before the incident, not after.

from datetime import datetime

def freshness_check(conn, view_name: str, max_age_seconds: int):
    """SLA on freshness rather than completion. Used for streaming consumers
    and live dashboards where 'on time' is measured against latest data,
    not against the pipeline's own clock."""
    # view_name is interpolated, not bound as a parameter -- it must come
    # from a trusted allow-list, never from user input
    row = conn.execute(
        f"SELECT max(event_time) AS latest FROM {view_name}").fetchone()
    latest = row[0] if row else None
    if latest is None:
        return {"status": "no_data", "lag_s": None}
    if isinstance(latest, str):   # most drivers hand back ISO-8601 strings
        latest = datetime.fromisoformat(latest.rstrip("Z"))
    lag_s = (datetime.utcnow() - latest).total_seconds()
    return {
        "status": "ok" if lag_s <= max_age_seconds else "stale",
        "lag_s": round(lag_s, 1),
        "max_age_s": max_age_seconds,
    }

Why freshness and completion deserve separate SLAs: a batch dashboard cares about completion (the run finished by 09:00), a streaming dashboard cares about freshness (the latest tick is no older than 60 seconds), and conflating the two means you either over-page (a freshness alert when the run is finishing on time) or under-page (a completion success while the data is hours stale because the source ingestion is lagging). Real production systems carry both signals on the same pipeline and let the consumer pick which one their use-case is about.

SLA hierarchies and the upstream-blame problem

A pipeline's SLA is rarely about its own code — it is about its slowest upstream. A daily_revenue_summary table that has an SLA of 09:00 depends on orders_fact which depends on orders_raw which depends on the source ingestion from the OLTP database. If the OLTP replication lag spikes at 04:00, every downstream SLA cascades into a miss, and every consumer's pager fires for what is fundamentally one root-cause incident.

Production teams handle this with two complementary patterns. The first is SLA propagation — every table in the lineage graph carries a derived SLA computed from its consumers' SLAs minus the table's typical run duration. If daily_revenue_summary needs to ship by 09:00 and typically takes 45 minutes, its SLA on input arrival is 08:15. If its upstream orders_fact typically takes 90 minutes to build, orders_fact must ship by 06:45. Walk the chain back to the source ingestion and you get a per-stage deadline that the watcher can enforce at every hop, paging the team responsible for the actual delay rather than the unlucky consumer at the end of the chain.
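SLA propagation is a walk backwards through the lineage graph. A sketch using the text's example tables and durations (the LINEAGE mapping itself is illustrative):

```python
from datetime import datetime, timedelta

# lineage: table -> (upstream table or None, typical build duration)
# durations are the illustrative ones from the text
LINEAGE = {
    "daily_revenue_summary": ("orders_fact", timedelta(minutes=45)),
    "orders_fact":           ("orders_raw",  timedelta(minutes=90)),
    "orders_raw":            (None,          timedelta(minutes=60)),
}

def propagate(table: str, ship_deadline: datetime) -> dict:
    """Walk the lineage back from the consumer-facing table. Each upstream's
    ship deadline is the downstream deadline minus the downstream's typical
    build time -- the per-stage deadline the watcher enforces at every hop."""
    deadlines = {table: ship_deadline}
    while True:
        upstream, duration = LINEAGE[table]
        if upstream is None:
            break
        ship_deadline -= duration        # upstream must land before we start
        deadlines[upstream] = ship_deadline
        table = upstream
    return deadlines

for t, d in propagate("daily_revenue_summary",
                      datetime(2026, 4, 25, 9, 0)).items():
    print(t, d.strftime("%H:%M"))
# daily_revenue_summary 09:00
# orders_fact 08:15
# orders_raw 06:45
```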

The second pattern is alert deduplication — the on-call paging system collapses cascading misses into one incident. PagerDuty's dedup_key field, set to the lineage-root rather than the immediate run, ensures one page when the OLTP lag spikes, not seventeen. Dipti, an on-call SRE at Razorpay, told the team in a 2024 retrospective that the worst night of her year was 11 cascading SLA pages between 02:00 and 05:00, all from one root-cause: a Postgres autovacuum that had blocked the logical replication slot. Adding lineage-root deduplication cut average pages-per-incident from 4.2 to 1.3 and let the on-call sleep through symptomatic alerts that the root-cause owner was already handling.
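Lineage-root deduplication is a small amount of code on top of the same lineage idea. A sketch; the LINEAGE_PARENT mapping and payload shape are illustrative, with dedup_key following PagerDuty's field name:

```python
# table -> its immediate upstream (None at the source); illustrative mapping
LINEAGE_PARENT = {
    "daily_revenue_summary": "orders_fact",
    "orders_fact": "orders_raw",
    "orders_raw": None,
}

def lineage_root(table: str) -> str:
    """Follow the lineage up until there is no parent: the root cause lives
    at or near the source, so that is what the incident is keyed on."""
    while LINEAGE_PARENT.get(table):
        table = LINEAGE_PARENT[table]
    return table

def page_payload(table: str, run_date: str) -> dict:
    # every cascading downstream miss shares one dedup_key, so the pager
    # collapses them into a single incident instead of seventeen pages
    return {"summary": f"SLA miss: {table} ({run_date})",
            "dedup_key": f"sla-{lineage_root(table)}-{run_date}"}

print(page_payload("daily_revenue_summary", "2026-04-25")["dedup_key"])
# sla-orders_raw-2026-04-25
print(page_payload("orders_fact", "2026-04-25")["dedup_key"])
# sla-orders_raw-2026-04-25  -- same incident
```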

What goes wrong

SLA anchored on the wrong clock. A pipeline's SLA written against scheduled_start + 4h instead of interval_end + 9h slides whenever the scheduler queues the run late. A 30-minute scheduler delay that pushes the run from 02:00 to 02:30 still reads "on time" against the 06:30 SLA, even though the dashboard at 09:00 is now 30 minutes from miss territory. The fix is to anchor against the data-interval, not the run's own clock — which is what made sla_miss_callback against data_interval_end the production-correct pattern in modern Airflow.

SLA defined but no owner. A pipeline whose SLA fires a page to "the data team" without naming a specific rota owner is a page nobody answers at 03:00 IST. The fix is one row per SLA in a sla_definitions table with owner_team, escalation_chain, and runbook_url, and a CI check that fails any new pipeline shipping without it. Swiggy's data platform team shipped this check in 2024 after a 3-hour P1 incident where the dashboard was down and three teams each thought another was on call.
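The CI check is small enough to sketch. Assuming a sla_definitions table with the columns the text names, plus a hypothetical pipeline_id key:

```python
import sqlite3

REQUIRED = ("owner_team", "escalation_chain", "runbook_url")

def check_sla_definitions(conn, pipeline_ids):
    """CI gate: every pipeline must have one fully-populated sla_definitions
    row. Returns human-readable failures; an empty list means the check passes."""
    failures = []
    for pid in pipeline_ids:
        row = conn.execute(
            "SELECT owner_team, escalation_chain, runbook_url "
            "FROM sla_definitions WHERE pipeline_id = ?", (pid,)).fetchone()
        if row is None:
            failures.append(f"{pid}: no sla_definitions row")
            continue
        for name, value in zip(REQUIRED, row):
            if not value:
                failures.append(f"{pid}: {name} is empty")
    return failures

# demo against an in-memory table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sla_definitions("
             "pipeline_id TEXT PRIMARY KEY, owner_team TEXT, "
             "escalation_chain TEXT, runbook_url TEXT)")
conn.execute("INSERT INTO sla_definitions VALUES "
             "('settlement_daily', 'payments-data', 'primary>secondary>em', "
             "'https://wiki/runbooks/settlement')")
print(check_sla_definitions(conn, ["settlement_daily", "orphan_pipeline"]))
# ['orphan_pipeline: no sla_definitions row']
```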

Page fatigue from too-tight SLA. An SLA of interval_end + 4h for a pipeline that takes 3h45min on a good day will miss every week the warehouse has a slow night. The team starts ignoring the pages — and then misses the real incident. The fix is to set the SLA based on the p95 of historical run durations plus a buffer (Google's SRE book recommends p95 + 30%), and to revisit the number every quarter. A common anti-pattern is to set SLAs once and never touch them, while the pipeline's actual runtime drifts up by 10 minutes a year as data volume grows.
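Sizing the deadline from history is a percentile plus a multiplier. A sketch using nearest-rank p95 and the p95 + 30 % rule of thumb the text cites; the duration list is illustrative:

```python
def suggested_sla_minutes(durations_min, buffer: float = 0.30) -> float:
    """Derive the SLA offset from the p95 of historical run durations plus
    a buffer, rather than from optimism. Nearest-rank p95: robust for the
    small samples a quarterly review actually has."""
    durations = sorted(durations_min)
    idx = max(0, int(round(0.95 * len(durations))) - 1)
    return durations[idx] * (1 + buffer)

# last 20 daily runs, in minutes (illustrative)
history = [205, 212, 198, 230, 245, 210, 221, 260, 215, 207,
           199, 225, 218, 240, 203, 232, 209, 251, 214, 222]
print(round(suggested_sla_minutes(history)))   # 326 -- the offset to configure
```

Re-running this every quarter against fresh history is what catches the 10-minutes-a-year runtime drift before the pages start.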

Watcher lag during scheduler outage. The SLA watcher is a separate process that depends on the scheduler's database. If the scheduler is down, the watcher either crashes (and pages stop firing entirely) or keeps polling stale data (and pages fire 30 minutes late, after the scheduler restarts and writes the actual completion time). The fix is to make the watcher independent — read the runs table from a replica, run on a separate cluster, alert if its own heartbeat misses for more than 5 minutes. This is "who watches the watcher", and the answer is a second-tier dead-man's-switch.

Daylight saving and timezone drift. A pipeline whose SLA is "complete by 09:00 IST" computed in UTC will silently shift if the deadline conversion is wrong, or if a server's TZ database is stale. Indian timezones don't observe DST, but pipelines whose source is a US-based system (Stripe, AWS billing exports, Salesforce) cross DST boundaries twice a year — and the upstream-arrival time shifts by an hour, breaking the downstream SLA twice yearly. The fix is to store all deadlines in UTC and convert at display time only, with a CI check that all SLA arithmetic uses tz-aware datetimes.
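A sketch of the tz-aware discipline, using Python's zoneinfo; deadline_utc is an illustrative helper, not a library API:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

IST = ZoneInfo("Asia/Kolkata")

def deadline_utc(interval_end: datetime, offset_hours: float) -> datetime:
    """Store and compare deadlines in UTC; IST (or any zone) is display-only.
    Rejecting naive datetimes is the CI check the text describes."""
    if interval_end.tzinfo is None:
        raise ValueError("naive datetime: all SLA arithmetic must be tz-aware")
    return (interval_end + timedelta(hours=offset_hours)).astimezone(timezone.utc)

d = deadline_utc(datetime(2026, 4, 25, 0, 0, tzinfo=IST), 9)
print(d.isoformat())   # 2026-04-25T03:30:00+00:00 -- 09:00 IST is 03:30 UTC
```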

SLA as a number without consequences. A 99 % SLO that is missed every month with no follow-up postmortem is not an SLO, it is a wish. The fix is to wire missing the SLO to a concrete consequence — error-budget burn freezes risky deploys, a postmortem doc is owed within 48h, the team's planning cycle reserves capacity for SLA-improvement work. Without consequences, the SLO is a dashboard nobody reads.

Page fires after the run completes. A subtle race: the watcher polls every minute, the run completes 30 seconds before the deadline, the watcher's next tick is 30 seconds after the deadline — the watcher reads "completion is set" and correctly suppresses the page. But if the watcher reads completion IS NULL first and then the run completes before the page is sent, the page fires for a run that is already done. The fix is to re-check completion IS NULL inside the same transaction that inserts the miss row, and to abort the page if the row insert fails the check. Cheap database engineering; saves the on-call from being woken up by a page that was already irrelevant when it arrived.
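The fix can be sketched as a single INSERT ... SELECT that re-checks completion inside the transaction, with table names following the minimal watcher earlier in the chapter (an illustration, not any particular scheduler's implementation):

```python
import sqlite3
from datetime import datetime

def record_miss_if_still_running(conn: sqlite3.Connection, run_id: str) -> bool:
    """Insert the miss row only if the run is STILL incomplete, checked
    atomically inside the same transaction as the insert. Returns True iff
    the caller should actually fire the page."""
    now = datetime.utcnow().isoformat() + "Z"
    try:
        with conn:  # one transaction: the re-check and the insert are atomic
            cur = conn.execute(
                "INSERT INTO sla_miss(run_id, opened_at) "
                "SELECT run_id, ? FROM runs "
                "WHERE run_id = ? AND completion IS NULL",
                (now, run_id),
            )
            return cur.rowcount == 1   # 0 rows = run completed in the gap
    except sqlite3.IntegrityError:
        return False                   # another watcher already recorded it
```

If the run completes between the watcher's scan and this call, the SELECT matches no rows, nothing is inserted, and no page fires.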

Going deeper

Burn-rate alerting vs threshold alerting

The naive SLO alert fires when attainment drops below 99 %. The problem is that this fires after the budget is gone — by the time you get paged, the month is already lost. Burn-rate alerting (from Google's SRE workbook chapter 5) fires when the rate of budget consumption indicates the budget will run out before the window ends. A 1-hour burn rate of 14.4× means the budget will be exhausted in 2 days at current pace, even though only a small fraction has been spent so far. The mathematical formulation: alert when (misses_in_window / total_runs_in_window) > burn_threshold * (1 - SLO_target). A two-tier system pages on a fast burn (1h window, 14.4× threshold = imminent budget exhaustion) and tickets on a slow burn (6h window, 6× threshold = trending bad). This avoids both the "too late" failure of threshold alerting and the "too noisy" failure of paging on every individual miss.
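The formulation translates directly into code. A sketch; the 14.4× and 6× thresholds and the 1h/6h windows are the ones the text cites from the SRE workbook:

```python
def burn_rate(misses: int, total: int, slo_target: float = 0.99) -> float:
    """Budget consumption as a multiple of the sustainable rate: 1.0 means
    the budget lasts exactly the window, 14.4 means it is gone in 1/14.4
    of the window."""
    if total == 0:
        return 0.0
    return (misses / total) / (1 - slo_target)

def alert_tier(misses_1h, total_1h, misses_6h, total_6h, slo=0.99):
    """Two-tier policy: page on a fast burn, ticket on a slow burn,
    stay silent otherwise."""
    if burn_rate(misses_1h, total_1h, slo) > 14.4:
        return "page"      # budget exhausted in ~2 days at this pace
    if burn_rate(misses_6h, total_6h, slo) > 6.0:
        return "ticket"    # trending bad, not yet urgent
    return None

print(round(burn_rate(3, 20), 1))   # 15.0 -- above the fast-burn threshold
print(alert_tier(3, 20, 3, 120))    # page
print(alert_tier(0, 20, 8, 120))    # ticket
```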

How Airflow, Dagster, and Prefect each model "late"

Airflow's sla parameter on a task historically anchored to task_instance.start_date, which is scheduler-relative — a quirk that misled teams for years. Modern Airflow (2.7+) recommends sla_miss_callback against dag_run.data_interval_end, which is data-relative and what most contracts actually want. Dagster takes a different approach: FreshnessPolicy on an asset, declared as "this asset must be no more than 30 minutes stale relative to wall-clock", lets the framework derive freshness deadlines automatically without a hand-written watcher. Prefect's flow run timeout is a per-run thing, not a recurring SLA — for recurring SLAs Prefect users typically wire their own watcher against the runs API, similar to the 50-line watcher above. The lesson is that "SLA" is not a settled abstraction across tools; the underlying mechanism (scan runs, compare deadline to completion, fire callback) is the same, but the syntactic surface is different in every tool.

The coupling between SLA and partition layout

A pipeline's SLA constrains its partition design. If the SLA is interval_end + 1h, the pipeline cannot partition daily — there is no full-day partition to write because interval_end is at midnight and the SLA fires at 01:00. Hourly partitions are required, and the pipeline writes 24 partitions per day rather than 1. This in turn affects compaction (chapter 78) — 24× more files means 24× more small-file pressure on the lakehouse, which means a separate compaction job to merge hourly partitions into daily partitions after the SLA window. The decision to tighten an SLA from "next morning" to "within an hour" is therefore not just an alerting change; it cascades into storage layout, compaction strategy, and warehouse cost. A 2024 internal post-mortem at PhonePe documented exactly this — a fraud-team request to tighten a feature pipeline's SLA from 6h to 30m forced a re-partitioning from daily to 5-minute buckets, which produced 288× more files, which broke the compaction job, which caused queries to slow by 4×, which surfaced two weeks later as an unrelated-looking dashboard regression.

Multi-region SLAs and the meaning of "complete"

A pipeline that writes to S3 in ap-south-1 (Mumbai) is "complete" the moment the S3 PUT returns a success. But a downstream consumer in us-east-1 (N. Virginia) reading the same data via cross-region replication might not see it for another 30 seconds. If the consumer's SLA is interval_end + 9h measured from when it can read the data, the pipeline's effective SLA is 9h - replication_lag. Multi-region applications make this explicit by writing to the local region, then awaiting cross-region replication confirmation, then marking the run complete only after the consumer's region has the data. AWS's S3 cross-region replication metrics (PendingReplicationCount) are the signal; production teams expose them as part of the pipeline's "complete" definition rather than treating them as infra concerns. The cost is latency added to the SLA path; the benefit is that the SLA actually means what consumers think it means.
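A sketch of the "complete means replicated" gate, with hypothetical hooks: pending_replications stands in for reading a metric like S3's PendingReplicationCount, and mark_complete for the scheduler's own state write.

```python
import time

def mark_complete_after_replication(run_id: str,
                                    pending_replications,  # callable -> int
                                    mark_complete,         # callable(run_id)
                                    timeout_s: int = 300,
                                    poll_s: int = 5) -> bool:
    """Only mark the run complete once cross-region replication has caught
    up, so 'complete' means 'readable by the consumer's region'. On timeout
    the run stays incomplete and the normal SLA watcher pages for it."""
    waited = 0
    while pending_replications() > 0:
        if waited >= timeout_s:
            return False           # page instead of silently declaring done
        time.sleep(poll_s)
        waited += poll_s
    mark_complete(run_id)          # consumers in the remote region can read
    return True
```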

When "SLA on a pipeline" is the wrong abstraction entirely

Some workloads don't have well-defined runs — a streaming job runs forever, a CDC consumer keeps up with a binlog, a feature serving layer answers queries continuously. For these, "on time" is a misnomer; the right contract is on lag (consumer is no more than N seconds behind producer), throughput (sustained 50k events/sec ingest), or availability (99.9 % of queries succeed within 50ms). The SLA framework still applies — measure the indicator, set an objective, page on burn — but the indicator is no longer "did the run complete by deadline". Confusing batch-style "did it finish?" SLAs with stream-style "is it keeping up?" SLAs is one of the most common architectural mistakes when teams retrofit batch governance onto streaming systems. Build 8 covers the streaming-side mechanism in detail; this chapter is about the batch-side primitive.

The runbook every SLA needs

An SLA without a runbook is half a contract. When the page fires at 03:00 IST, the on-call has perhaps 10 minutes to triage before the consequences start to stack — a missed regulatory window, a stale dashboard, a hungry trading desk. The runbook is the document that turns those 10 minutes from a panicked guess into a sequenced procedure. Aditi, who runs the data-platform on-call rota at a Bengaluru fintech, keeps a one-page runbook per SLA in a shared wiki: the first three commands to run (a last_known_good query, a scheduler-state check, and a source-ingestion-lag check), the decision tree (if upstream is the cause, page upstream owner; if it's a slow run, watch and update incident; if it's a code bug, trigger the rollback), and the comms template for the affected consumer team. The runbook is updated after every incident — not as a chore but as the actual deliverable of the postmortem. A team that ships SLAs without runbooks is a team where every miss is a fresh incident; a team that maintains runbooks turns the third incident of the same kind into a 5-minute resolution.

Where this leads next

The SLA primitive is what turns a scheduler from a "thing that runs jobs" into a "thing that runs jobs correctly enough that humans can rely on it". Build 4 closes here — the executor knows its DAG, retries safely, supports backfills, and now has a contract it must honour. Build 5 begins with lineage and observability, which exist precisely because the moment an SLA is missed, the on-call needs to know which upstream caused it within 60 seconds, not 60 minutes.

The deeper claim of this build is that "production-ready scheduling" is not a feature checklist but a closed loop: a DAG that knows its dependencies, an executor that retries idempotently, a backfill story that replays history correctly, and an SLA contract that turns the loop's behaviour into a measurable promise. Take any one of those four out and the loop opens — pages either fire too late, or never, or for the wrong reason — and the team falls back into the firefighting mode that Build 1's "three walls" was supposed to escape from.

References

  1. Google SRE Workbook — chapter 5: Alerting on SLOs — the canonical treatment of burn-rate alerting and the multi-window/multi-burn-rate pattern.
  2. Airflow SLA documentation: sla_miss_callback, anchor semantics, and the historic execution-date confusion.
  3. Dagster Freshness Policies — the asset-freshness model that replaces hand-written watchers.
  4. Prefect: Detecting late and slow flow runs — the flow-timeout primitive and watcher patterns.
  5. Niall Murphy et al., "Site Reliability Engineering" — chapter 4: Service Level Objectives — SLI/SLO/SLA definitions and the error budget concept.
  6. Confluent: Monitoring consumer lag — freshness SLAs in stream-processing terms.
  7. Backfills: re-running history correctly — chapter 24, the operation that runs after most SLA misses.
  8. What idempotent actually means for data and why it's hard — chapter 12, the foundation that any retry-on-SLA-miss assumes.
  9. Charity Majors, "Observability Engineering" — the case for indicator-based contracts and why threshold alerting is not enough.