Freshness SLOs: the data-eng analog of uptime
It is 9:47 a.m. on the first of the month at Razorpay's Bengaluru office. The CFO's dashboard shows yesterday's settlement total — ₹2,847 crore — and that number is what the finance team will reconcile against the bank statements before lunch. The number looks fine. The dashboard is up. The Looker query returns in 800 ms. Nobody knows that the upstream payments_settled table has not received a new row since 4:12 a.m. — the Airflow DAG that loads it failed at 4:08, retried twice, and parked silently in the skipped state because a dependency hit its 4-hour SLA window. The dashboard is showing yesterday's number because today's number does not exist, and the dashboard has no way to display "this data is six hours stale" because nobody told it what stale means. By 11 a.m., when the operations head pings finance about the day's volume looking off, the table has been frozen for almost seven hours and the reconciliation window — a hard regulatory deadline at 6 p.m. — is now 35% gone. The SRE team would have caught a server outage in 90 seconds; the data team caught a stale dataset in 80 minutes, after the business noticed.
A freshness SLO is a hard, measurable promise about the maximum age of a dataset, defined per table or per pipeline and monitored with the same rigour an SRE applies to uptime. It turns "is the pipeline working?" into "is the most recent row younger than 30 minutes?" — a question a machine can answer and an alert can fire on. Without freshness SLOs, you discover stale data when a human looks at a dashboard and frowns, which is the worst possible monitoring strategy.
What "fresh" means and why it has to be measured
The word "fresh" in production means exactly one thing: the difference between the current wall-clock time and the timestamp of the most recent useful row in the dataset. Everything else — "the pipeline ran", "the DAG succeeded", "the file landed in S3" — is a proxy for freshness, and proxies fail in surprising ways. The DAG can succeed on an empty input. The file can land but be corrupt. The job can run but write yesterday's data to today's partition. The only thing that does not lie is the dataset itself, queried directly: SELECT MAX(event_time) FROM payments_settled returns the actual cutting edge of the data, and NOW() - that is the actual lag.
The first thing that goes wrong when teams adopt freshness monitoring is conflating two different lag types. Pipeline lag — the time between the pipeline's scheduled run and its actual completion — measures the pipeline's punctuality. Data lag — the time between an event happening in the real world and that event being queryable in the warehouse — measures what the consumer experiences. A pipeline can be perfectly punctual (DAG runs every hour, on the hour, in 4 minutes) and still have terrible data lag (the source database itself is delayed by 6 hours because of a CDC backlog). Why the consumer-facing measurement wins: the user of the data — the CFO, the fraud analyst, the recommender system — does not care that the DAG ran on time. They care about the gap between reality and what the table says about reality. Measuring freshness at the source-of-truth table aligns the metric with what the human will be angry about when it breaks.
The second thing that goes wrong is picking the wrong field. event_time (when the event happened in the real world), processing_time (when the warehouse received it), and ingested_at (when the row was written) are three different timestamps and they each tell you about a different kind of lag. The right one to measure for a freshness SLO is almost always event_time, because that is the question business users actually have: "is this dataset showing me what is happening right now in the world?" Measuring ingested_at instead can give you a freshness number of 30 seconds while the underlying data is hours behind reality — the warehouse was punctually loaded with stale data.
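The gap between the two measurements is easy to demonstrate. A minimal sketch with invented timestamps — a CDC backlog delivers six-hour-old events into the warehouse seconds before the check runs:

```python
# Hypothetical rows illustrating the event_time / ingested_at gap: the
# warehouse was loaded 30 seconds ago, but the events themselves are old.
import datetime

UTC = datetime.timezone.utc
now = datetime.datetime(2025, 1, 1, 10, 0, 0, tzinfo=UTC)

rows = [
    # (event_time, ingested_at) — old events delivered just now
    (datetime.datetime(2025, 1, 1, 4, 0, 0, tzinfo=UTC),
     datetime.datetime(2025, 1, 1, 9, 59, 30, tzinfo=UTC)),
]

event_lag = (now - max(r[0] for r in rows)).total_seconds()
ingest_lag = (now - max(r[1] for r in rows)).total_seconds()

print(f"ingested_at lag: {ingest_lag:.0f}s")  # 30s — looks perfectly fresh
print(f"event_time lag:  {event_lag:.0f}s")   # 21600s — six hours behind reality
```

Both numbers are true; only the second one answers the business user's question.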
The third thing is forgetting that freshness is a distribution, not a number. The dataset's median lag may be 4 minutes; the p95 lag may be 22 minutes; the p99 lag at month-end may be 3 hours. A single-number SLO of "fresh within 30 minutes" hides all three. The serious freshness contracts that production teams ship at scale specify "p95 freshness ≤ 30 min, p99 ≤ 60 min, max ≤ 4 hours" — three numbers that together describe the shape of the distribution rather than the median alone.
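A sketch of why one number hides the shape, using an invented lag sample with a mostly-fast body and a heavy month-end tail (the percentile helper is a simple nearest-rank cut, not a production estimator):

```python
# Invented lag samples in seconds: 950 fast, 45 slow, 5 very slow (month-end).
import random

random.seed(7)
samples = ([random.uniform(120, 360) for _ in range(950)] +
           [random.uniform(600, 1500) for _ in range(45)] +
           [random.uniform(3600, 10800) for _ in range(5)])

def pct(data, q):
    """Nearest-rank percentile — crude but enough to show the spread."""
    s = sorted(data)
    return s[min(len(s) - 1, int(q * len(s)))]

print(f"p50 = {pct(samples, 0.50) / 60:.1f} min")  # the comfortable median
print(f"p95 = {pct(samples, 0.95) / 60:.1f} min")  # an order of magnitude worse
print(f"p99 = {pct(samples, 0.99) / 60:.1f} min")
print(f"max = {max(samples) / 60:.1f} min")        # the month-end tail
```

A single "fresh within 30 minutes" number would describe only one point on this curve, which is why the contract below carries three.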
Defining the SLO in a contract
A freshness SLO lives inside a data contract (chapter 31), expressed as a structured block the producer commits to and the consumer can rely on. The shape that has stabilised across teams looks like this:
# data_contracts/payments_settled.yaml
dataset: warehouse.fact.payments_settled
owner: payments-platform-team
governance: tier-1-financial  # tier-1 = blocks reconciliation; tier-2 = analytics; tier-3 = exploration
freshness:
  measurement: "MAX(event_time)"    # the column expression to compute lag from
  evaluation_clock: wall_clock_ist  # which clock to subtract against
  slo:
    p50: 5m   # 50% of measurements lag <= 5 min
    p95: 15m  # 95% of measurements lag <= 15 min
    p99: 30m  # 99% of measurements lag <= 30 min
    max: 60m  # never more than 60 min, ever
  business_hours: "Mon-Fri 06:00-23:00 IST"  # SLO only applies in this window
  measurement_interval: 60s  # how often the monitor takes a sample
alerting:
  p95_breach_window: 30m    # alert if p95 has been violated for 30 min
  max_breach: immediate     # alert the moment max is exceeded
  page_oncall_when: tier-1  # only tier-1 datasets page; tier-2 ticket; tier-3 email
error_budget:
  monthly_minutes_unfresh: 43.2  # 99.9% freshness target = 43.2 min/month allowed unfresh
The shape repays a careful read. measurement is the SQL expression the monitor runs to compute "how fresh is the data" — usually MAX(event_time), sometimes MAX(GREATEST(event_time, updated_at)) for tables that allow late updates. evaluation_clock specifies which wall clock to subtract against — IST for a domestic-only dataset, UTC for a global one, and the choice matters because mixing clocks across timezone boundaries is how teams miss DST and start their year four hours behind. slo is the per-percentile promise; business_hours scopes it (the freshness SLO for an analytics dataset rarely makes sense at 3 a.m. when nobody is querying, and applying it 24×7 wastes the error budget on hours nobody cares about). error_budget is the SRE-style quota: how many minutes the dataset is allowed to spend unfresh in a month before the team has to stop shipping new features and fix the freshness regression. Why the error-budget framing matters: without it, every freshness incident becomes a fire drill of equal weight, and the team works through them in the order they happen rather than the order that depletes the budget fastest. With an error budget, a 90-minute breach in week one of the month is a panic; an identical 90-minute breach with the budget already exhausted in week three triggers the freeze that gets the team off the feature treadmill and onto the reliability work.
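The 43.2 figure is plain arithmetic, worth seeing once so the error budget is never treated as a magic number:

```python
# 99.9% freshness over a 30-day month leaves 0.1% of its minutes
# as the allowed-unfresh budget.
minutes_in_month = 30 * 24 * 60  # 43,200 minutes
target = 0.999
budget_minutes = minutes_in_month * (1 - target)
print(round(budget_minutes, 1))  # 43.2
```

Every breach the monitor records subtracts its duration from this budget; the freeze triggers when it hits zero.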
The tier field decides escalation. Tier-1 datasets (regulatory reconciliation, fraud scoring, risk pricing) page the on-call engineer at 2 a.m. when they go stale because the cost of waiting is higher than the cost of waking somebody up. Tier-2 (executive dashboards, weekly KPI tracking) opens a ticket that is triaged in business hours. Tier-3 (data exploration, ad-hoc analytics) sends an email summary at 9 a.m. weekdays. Mixing the tiers — paging on every dataset — is how teams burn out on-call rotations in three months; under-tiering everything is how the next reconciliation incident becomes a regulatory filing. The discipline is to pick the tier deliberately and review it quarterly.
The contract's freshness block is enforced in two places: the monitor that takes a sample every measurement_interval and records the lag, and the deploy gate that refuses to merge a producer change which would predictably violate the SLO (e.g. switching from real-time CDC to a 4-hour batch refresh on a dataset with a 30-minute p95 SLO). The first piece catches operational regressions; the second catches architectural ones. Most teams ship the monitor first because it is easier; the deploy gate is what stops the slow drift where every quarter the SLO becomes one notch slacker because nobody negotiated the tradeoff explicitly.
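A hypothetical CI-side gate along these lines — the function name and thresholds are illustrative, not a real tool's API — rejects a cadence change that predictably violates the contract:

```python
def gate_refresh_change(proposed_interval_min: int, slo_p95_min: int) -> bool:
    """Return True if the producer change may merge under the p95 SLO.

    A periodic batch's worst-case lag is roughly its interval, so an
    interval wider than the p95 target is a predictable violation —
    no runtime monitoring needed to know it will breach.
    """
    return proposed_interval_min <= slo_p95_min

assert gate_refresh_change(10, 15) is True    # 10-min batch fits a 15-min p95
assert gate_refresh_change(240, 15) is False  # 4-hour batch cannot meet it
```

The real negotiation happens when the gate fires: either the producer keeps the cadence, or the consumer signs off on a slacker SLO — explicitly, in the contract diff.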
Building a freshness monitor
The monitor is a small piece of code with an outsized importance — it is the thing that turns "I think the pipeline is healthy" into a number on a graph. The version below is small enough to read in one go and realistic enough to ship to staging.
# freshness_monitor.py — sample lag for tier-1 datasets and emit a metric.
# Run as a cronjob every 60 seconds; emits to Prometheus pushgateway.
import os, datetime
import yaml, psycopg2
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

CONTRACTS_DIR = "/etc/data-contracts"
PG_DSN = os.environ["WAREHOUSE_DSN"]
PUSHGATEWAY = os.environ["PUSHGATEWAY_URL"]

def measure_freshness(conn, contract):
    """Run the contract's measurement query and return lag in seconds."""
    sql = f"SELECT {contract['freshness']['measurement']} FROM {contract['dataset']}"
    with conn.cursor() as cur:
        cur.execute(sql)
        max_event_time = cur.fetchone()[0]
    if max_event_time is None:
        return None  # empty table — caller decides what this means
    now = datetime.datetime.now(datetime.timezone.utc)
    if max_event_time.tzinfo is None:
        max_event_time = max_event_time.replace(tzinfo=datetime.timezone.utc)
    lag_seconds = (now - max_event_time).total_seconds()
    return max(0, lag_seconds)  # negative lag = clock skew, clamp to 0

def in_business_hours(window_str):
    """Parse 'Mon-Fri 06:00-23:00 IST' and check if now is inside it."""
    # Production code uses pytz; abbreviated here for clarity.
    ist = datetime.timezone(datetime.timedelta(hours=5, minutes=30))
    now_ist = datetime.datetime.now(ist)
    if now_ist.weekday() >= 5:
        return False  # Sat/Sun
    return 6 <= now_ist.hour < 23

def evaluate(contract, lag_seconds):
    """Return (status, breach_severity) given lag and SLO."""
    slo = contract["freshness"]["slo"]
    parse = lambda s: int(s[:-1]) * (60 if s.endswith("m") else 1)
    if lag_seconds > parse(slo["max"]):
        return ("BREACH", "max")
    if lag_seconds > parse(slo["p99"]):
        return ("WARN", "p99")
    if lag_seconds > parse(slo["p95"]):
        return ("DRIFT", "p95")
    return ("OK", None)

def main():
    conn = psycopg2.connect(PG_DSN)
    registry = CollectorRegistry()
    lag_gauge = Gauge("dataset_freshness_lag_seconds",
                      "Lag from MAX(event_time) to NOW()",
                      ["dataset", "tier"], registry=registry)
    for filename in os.listdir(CONTRACTS_DIR):
        if not filename.endswith((".yaml", ".yml")):
            continue
        with open(os.path.join(CONTRACTS_DIR, filename)) as f:
            contract = yaml.safe_load(f)
        if not in_business_hours(contract["freshness"]["business_hours"]):
            continue
        lag = measure_freshness(conn, contract)
        if lag is None:
            print(f"[WARN] {contract['dataset']}: empty table")
            continue
        status, severity = evaluate(contract, lag)
        lag_gauge.labels(contract["dataset"], contract["governance"]).set(lag)
        print(f"[{status}] {contract['dataset']}: {lag:.1f}s ({severity or 'within SLO'})")
    push_to_gateway(PUSHGATEWAY, job="freshness_monitor", registry=registry)

if __name__ == "__main__":
    main()
# Sample run on a healthy and a breaching dataset:
$ python freshness_monitor.py
[OK] warehouse.fact.payments_settled: 184.2s (within SLO)
[OK] warehouse.fact.fraud_scores: 47.6s (within SLO)
[DRIFT] warehouse.fact.merchant_dashboard: 1024.8s (p95)
[BREACH] warehouse.fact.recon_ledger: 4621.3s (max)
[OK] warehouse.dim.merchants: 218.7s (within SLO)
$ echo $?
0
The piece worth dwelling on is measure_freshness. It runs SELECT MAX(event_time) FROM <dataset> directly against the warehouse, which sounds trivial but is the entire correctness story — measuring at the consumer's table means the lag includes every upstream hop, and a regression anywhere in the chain shows up here. The implementation handles two edge cases that production teams stub their toes on: a None result (empty table, which is sometimes a freshness violation and sometimes the expected state of a brand-new dataset, so the caller decides) and a missing timezone (Postgres TIMESTAMP WITHOUT TIME ZONE columns return a naive datetime, which subtracted from a UTC-aware now() raises an error). Why the timezone handling is not optional: the most common production bug in freshness monitoring is "the lag jumped by exactly 5h30m on the IST/UTC boundary because somebody assumed the column was in IST and somebody else assumed UTC". The monitor must be explicit about what timezone every timestamp is in and convert on the way in, not when the alert fires.
evaluate maps a lag number to a status. The status tiers — OK, DRIFT (p95 breached), WARN (p99 breached), BREACH (max exceeded) — give the alerting system room to escalate without paging on every percentile blip. Production setups feed DRIFT to a Slack channel for awareness, WARN to a 15-minute aggregation alert, and BREACH to an immediate page. Why three tiers and not one: a single threshold ("p95 = alert") fires on transient blips that resolve in 90 seconds and trains the on-call to ignore the alert. A graduated scheme captures the shape of the breach — a p95 drift that lasts for two hours is genuinely bad even if max is fine, while a max exceedance that lasted 60 seconds and resolved is forgivable. Mapping the response to severity prevents alert fatigue from killing the signal.
The push-to-Prometheus pattern is operationally important. The monitor is a stateless cronjob; it publishes the gauge to a pushgateway and Prometheus scrapes the gateway. This means the alerting rules — "p95 has been > 15m for 30 min" — live in Prometheus, where they share infrastructure with every other SLO the SRE team runs. Production deployments at Razorpay and PhonePe co-locate freshness alerts with API-uptime alerts in the same dashboard, because the on-call who fields the page should not have to context-switch between two monitoring systems. The investment in shared infra pays back the second time the team has to debug a cross-cutting incident.
The monitor itself has an SLO. If it is down, freshness goes silently un-monitored, and the next incident is undetected. The pattern is to run the monitor in two regions, alert on either of them being down for more than 5 minutes, and have a separate watchdog that pings the monitor's heartbeat — a freshness SLO on the freshness monitor, recursively. That sounds excessive until the day the monitor's pod gets evicted by Kubernetes during a node upgrade and nobody notices for six hours.
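A sketch of that watchdog, assuming the "last sample" timestamp is read from something like the pushgateway's push_time_seconds metric (the timestamp source is an assumption; the comparison is the point):

```python
def monitor_is_alive(last_sample_ts, now, max_silence_s=300):
    """True if the freshness monitor has reported within max_silence_s.

    last_sample_ts is the epoch time of the monitor's most recent push —
    in a real setup, scraped from the pushgateway's push_time_seconds.
    """
    if last_sample_ts is None:
        return False  # monitor has never reported — treat as down
    return (now - last_sample_ts) <= max_silence_s

assert monitor_is_alive(last_sample_ts=1_000, now=1_120) is True   # 2 min ago: healthy
assert monitor_is_alive(last_sample_ts=1_000, now=1_400) is False  # 6.7 min of silence: page
```

The watchdog runs on infrastructure the monitor does not share — a different region, a different cluster — or it fails with the thing it watches.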
Where freshness SLOs break in practice
The clean mental model — measure lag, alert on breach, fix the pipeline — collides with three messy realities. The first is batch periodicity. A pipeline that runs every 4 hours has a lag that looks like a sawtooth: it climbs from 0 minutes after the run completes to 240 minutes just before the next run, then drops back. The "average" lag is 120 minutes; the p99 lag is close to 240 minutes; the SLO has to be set against the worst case of the sawtooth, not the average. Most teams pick "p99 < 250m" rather than "average < 120m" precisely because the p99 is what the consumer experiences when they query at the wrong time.
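The sawtooth arithmetic in miniature — an idealised batch that completes instantly every 240 minutes, sampled once a minute across one cycle:

```python
period_min = 240  # minutes between batch runs

# lag at minute t after a run is simply t, then drops back to 0 at the next run
lags = list(range(period_min))

avg_lag = sum(lags) / len(lags)  # ~120 min — the misleading "average"
worst_lag = max(lags)            # ~240 min — what an unlucky query sees
print(f"average ≈ {avg_lag:.0f} min, worst-case ≈ {worst_lag} min")
# the SLO must clear the worst case — hence "p99 < 250m", not "average < 120m"
```

Real pipelines add run duration and jitter on top of the period, which is why the 250-minute target leaves a margin over the 240-minute sawtooth peak.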
The second mess is late-arriving data. A user's payment event timestamped at 09:11:42 might land in the warehouse at 09:23:18 — eleven minutes late because the source system itself buffered it. The naive MAX(event_time) measurement says the data is fresh because the row exists; the consumer querying for "all events in the 09:00-10:00 hour" gets a different answer at 09:30 (missing the late event) than at 10:00 (with the late event included). Freshness SLOs on streaming systems usually pair with completeness SLOs — "95% of events for hour H are landed by H+15m, 99% by H+30m" — measured from the event time, not the ingestion time. A dataset can be fresh and incomplete simultaneously, which is the operational definition of "the data looked right but the dashboard's number is wrong".
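A toy completeness check to sit alongside the freshness one — the arrival offsets are invented, and a real implementation counts rows in the warehouse rather than in a Python list:

```python
# For a closed hour H, what fraction of its events (by event_time) had
# landed in the warehouse by H+15m and H+30m? Offsets are invented.
import datetime

H = datetime.datetime(2025, 1, 1, 9, 0)
arrival_offsets_min = [2, 4, 5, 7, 9, 11, 12, 14, 22, 41]  # minutes after H
arrivals = [H + datetime.timedelta(minutes=m) for m in arrival_offsets_min]

def completeness(by_minutes):
    """Fraction of hour-H events landed within by_minutes of H."""
    cutoff = H + datetime.timedelta(minutes=by_minutes)
    return sum(t <= cutoff for t in arrivals) / len(arrivals)

print(f"by H+15m: {completeness(15):.0%}")  # 80% — breaches a "95% by H+15m" SLO
print(f"by H+30m: {completeness(30):.0%}")  # 90% — breaches a "99% by H+30m" SLO
```

The dataset in this example is perfectly fresh the whole time — rows keep arriving — while both completeness targets are blown, which is exactly the "looked right but the number is wrong" failure.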
The third mess is producers without a heartbeat. If the source system goes silent — a payment gateway has a literal lull because it's 4 a.m. and nobody is paying — the freshness monitor sees MAX(event_time) advancing slowly and starts to alert because lag is climbing. But the lag is climbing because reality is quiet, not because the pipeline is broken. The fix is a heartbeat row — the producer emits a synthetic row every 60 seconds whose only purpose is to prove the pipeline is alive even when business traffic is zero. The monitor can then distinguish "pipeline broken" (no heartbeats arriving) from "business quiet" (heartbeats arriving, business rows not). Razorpay's payment-events pipeline has a heartbeat row from every region every 30 seconds, tagged with event_type='_heartbeat', filtered out of consumer queries by default and surfaced only in the freshness monitor. The 0.0001% extra storage pays for itself the first time a quiet midnight is correctly identified as healthy rather than broken.
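A sketch of the two-signal diagnosis that heartbeat rows enable; the row shape and thresholds are illustrative assumptions, not Razorpay's actual implementation:

```python
def diagnose(rows, now, heartbeat_timeout_s=120, quiet_threshold_s=3600):
    """rows: list of (event_type, event_time_epoch). Return a status string.

    Heartbeats prove the pipeline is alive; business rows prove traffic
    exists. Only the combination distinguishes "broken" from "quiet".
    """
    hb = [t for etype, t in rows if etype == "_heartbeat"]
    biz = [t for etype, t in rows if etype != "_heartbeat"]
    if not hb or now - max(hb) > heartbeat_timeout_s:
        return "PIPELINE_BROKEN"  # no proof of life from the producer
    if not biz or now - max(biz) > quiet_threshold_s:
        return "BUSINESS_QUIET"   # pipeline alive, just no traffic
    return "HEALTHY"

now = 10_000
assert diagnose([("_heartbeat", 9_950)], now) == "BUSINESS_QUIET"
assert diagnose([("payment", 9_990), ("_heartbeat", 9_950)], now) == "HEALTHY"
assert diagnose([("payment", 5_000)], now) == "PIPELINE_BROKEN"
```

Without the heartbeat rows, the first and third cases are indistinguishable — both show a climbing MAX(event_time) lag and nothing else.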
The fourth, more political mess: whose freshness SLO is it? A dashboard's freshness depends on the source's freshness, the warehouse load's freshness, and the BI tool's cache. If the dashboard is stale, finance complains to the BI team, who complains to the data warehouse team, who complains to the source platform team. The discipline that scales is to publish a per-hop SLO and a chained end-to-end SLO: each team owns their hop, and the end-to-end target is budgeted across the hops (with headroom, because tail lags compound rather than adding cleanly). When the end-to-end SLO breaches, the team whose hop's SLO is also breaching is the team that owns the incident. This requires every hop to instrument its own freshness monitor, which is what makes freshness an organisational property rather than a single team's concern.
The fifth mess is the SLO-is-aspirational trap. A team writes "p99 < 30m" into a contract aspirationally, the actual p99 is 90m, and nobody fixes either the SLO or the pipeline. The contract becomes a lie that everyone has politely agreed to ignore. The discipline is to set the SLO at the current p99 + 20% buffer initially, then tighten it by 10% every quarter as the team improves the pipeline. Negotiating an aspirational target down rather than padding a real target up keeps the SLO honest, and the error budget genuinely tracks reliability investment.
Common confusions
- "Freshness SLO is the same as data quality." Adjacent, not equivalent. Freshness is one dimension of data quality (others are completeness, accuracy, consistency, validity). A row that is fresh can still be wrong; a row that is correct can still be stale. The freshness SLO covers exactly the "is this number from this morning or last Wednesday" question, nothing more.
- "If the pipeline succeeds on time, the data is fresh." No. Pipeline punctuality (the DAG ran at 04:00) and data freshness (the rows in the table reflect 04:00 reality) are different things. A successful pipeline run can copy yesterday's data into today's partition, look successful in Airflow, and produce stale output. Measure the data, not the pipeline.
- "A 99.9% freshness SLO means the data is rarely stale." It means the data is allowed to be stale for 0.1% of the time, which is 43.2 minutes per month. That is a long window if it concentrates at the end of every month into one outage — same uptime number, very different operational experience. Pair the percentile SLO with a max-breach-window cap.
- "Streaming pipelines don't need freshness SLOs." They need them more, not less. Streaming pipelines have lower median lag but their lag distribution has a long tail when checkpoints stall, watermarks freeze, or backpressure kicks in. The SLO is the contract that turns "we use Flink" into "the dashboard sees data within 90 seconds, p99".
- "Freshness only matters for dashboards." Freshness matters anywhere a downstream system makes a decision based on data — fraud-scoring models, recommender systems, dynamic pricing, regulatory reports. A fraud model trained on yesterday's data scores today's transactions less accurately; the freshness SLO of the feature store is a model-quality input.
- "Set one freshness SLO per pipeline." Set one per dataset and one per consumer's expectation. The same pipeline produces multiple datasets that downstream consumers care about differently — payments_settled is tier-1, payments_audit_log is tier-3. Bundling them into one pipeline-level SLO either over-pages on the audit log or under-protects the settled table.
Going deeper
How freshness composes across multi-hop pipelines
When data flows through three hops — source DB → CDC stream → bronze table → silver table — the end-to-end freshness lag is the sum of the per-hop lags, but the distribution of the end-to-end lag is not the sum of the per-hop distributions. Tail events compound: a p99 spike at hop 1 plus a p99 spike at hop 2 produces an end-to-end p99-of-p99 that is much worse than naive addition suggests, because both spikes happen during the same incident more often than independent percentile arithmetic predicts. The right way to measure end-to-end freshness is directly at the consumer's table — MAX(event_time) minus NOW() — rather than summing per-hop measurements. The composite measurement captures the actual user experience without requiring the team to model the dependence structure between hops. The per-hop measurements still earn their keep for attribution (which hop is the long pole?) but not for target-setting (which is always end-to-end).
Streaming watermarks and how they relate to freshness
A watermark in a streaming system (chapter 80) is the system's belief about the slowest event timestamp it has seen — formally, all events with timestamp ≤ watermark have been observed. The watermark is the streaming-native equivalent of a per-row freshness measurement: as the watermark advances, the system is asserting "I am fresh up to here". A freshness SLO on a streaming dataset can be implemented as wall_clock_now - current_watermark < SLO, and the alerting rule is identical to the batch case. The interesting wrinkle is that watermarks can stall — a single slow partition stops the global watermark advancing — so a freshness alert on a streaming pipeline often correctly diagnoses a partition imbalance. The same mechanism that gives you exactly-once correctness gives you a freshness signal for free.
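The stall mode is visible in a few lines: the global watermark is the minimum of the per-partition watermarks, so one stuck partition drags the whole dataset's freshness with it (values invented):

```python
def global_watermark(partition_watermarks):
    """The system is only as fresh as its slowest partition."""
    return min(partition_watermarks)

def watermark_lag(now_s, partition_watermarks):
    """Freshness lag of a streaming dataset, in seconds."""
    return now_s - global_watermark(partition_watermarks)

now_s = 1_000_000
healthy = [now_s - 20, now_s - 35, now_s - 15]
stalled = [now_s - 20, now_s - 7_200, now_s - 15]  # one partition stuck for 2h

print(watermark_lag(now_s, healthy))  # 35 — comfortably inside a 90s p99
print(watermark_lag(now_s, stalled))  # 7200 — a breach that points at one partition
```

The alerting rule wall_clock_now - current_watermark < SLO is the same shape as the batch rule, which is what lets both feed the same Prometheus setup.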
The economics of a tighter SLO
Going from a 60-minute SLO to a 5-minute SLO is rarely a 12× cost increase — it is often a 50× cost increase, because the architecture has to change. A 60-minute SLO can be served by a daily batch with hourly micro-batches; a 5-minute SLO requires a streaming pipeline with always-on infrastructure. The Razorpay merchant dashboard moved from 4-hour batch to 5-minute streaming in 2024 and the underlying compute bill went from ₹3.2 lakh/month to ₹14.8 lakh/month — a 4.6× cost increase that the business signed off on because it changed merchant behaviour from "checking the dashboard tomorrow" to "watching it during a flash sale". The lesson is that freshness has a cost curve with phase transitions; pick the SLO that matches the business value, not the SLO that sounds impressive.
Freshness for batch jobs that don't run frequently
Some datasets — monthly close, quarterly board reporting, annual tax filings — have freshness measured in days, not minutes. The SLO for a monthly close is "the November numbers are queryable by the 5th business day of December". Encoding that in the same freshness block as a streaming dataset is a category error; the right primitive is a deadline SLO that compares the partition's expected close date against wall clock. The monitor for these is structurally different — it asks "should the November partition exist by now? and if so, does it have rows?" rather than "what is the lag on the latest row?". Some teams use the same contract YAML schema with a different field (deadline: instead of freshness:); others split into a separate contract type. Either works as long as the on-call understands which alert means what.
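A sketch of a deadline monitor under those assumptions — the "5th business day" rule comes from the example above, and the partition check is reduced to a boolean for illustration:

```python
import datetime

def nth_business_day(year, month, n):
    """The n-th weekday (Mon-Fri) of a month; ignores holidays for brevity."""
    d = datetime.date(year, month, 1)
    count = 0
    while True:
        if d.weekday() < 5:
            count += 1
            if count == n:
                return d
        d += datetime.timedelta(days=1)

def deadline_status(period, today, partition_has_rows):
    """period: (year, month) being closed; deadline = 5th business day of next month."""
    year, month = period
    ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
    deadline = nth_business_day(ny, nm, 5)
    if partition_has_rows:
        return "OK"
    return "BREACH" if today > deadline else "PENDING"

# November 2024 close: deadline is Dec 6 (the 5th business day of December).
assert deadline_status((2024, 11), datetime.date(2024, 12, 4), False) == "PENDING"
assert deadline_status((2024, 12), datetime.date(2025, 1, 9), False) == "BREACH"
```

The structural difference from the streaming monitor is that this one alerts on an absent partition, not a lagging row — which is why forcing both through one freshness: block produces nonsense alerts on the monthly dataset.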
How freshness drives capacity planning
A freshness SLO is also a capacity floor. If the SLO is "p99 < 30m" and the worst-case backfill scenario is "ingest 6 hours of source data after a 4-hour outage", the pipeline needs to process 6 hours of data in 30 minutes — a 12× burst capacity that the steady-state pipeline does not provide. Capacity planning under freshness constraints means provisioning for the burst, not the average; teams that size for the average end up missing the SLO every time there is a recovery scenario. The PhonePe data platform team sizes its CDC infrastructure for "1 hour of outage recovery within p99 SLO", which costs roughly 8× the steady-state requirement and is funded explicitly as a reliability investment rather than a feature investment. Without the freshness SLO making the requirement legible, the business would not have approved the headroom.
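The burst multiple is the backlog divided by the SLO window — a one-line calculation worth writing down before the capacity review:

```python
backlog_hours = 6          # source data accumulated during the outage
recovery_window_min = 30   # the p99 SLO the recovery must fit inside

burst_multiple = (backlog_hours * 60) / recovery_window_min
print(f"required burst capacity: {burst_multiple:.0f}x steady state")  # 12x
```

The multiple scales linearly with the outage you choose to plan for, which is why the planning input is a named scenario ("1 hour of outage recovery within p99") rather than an average throughput.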
Where this leads next
The next chapter (34) covers data quality testing — the row-level checks that catch problems freshness alone cannot detect, like nulls in required fields or referential integrity breaks. Chapter 35 covers anomaly detection, which is the ML-flavoured cousin of SLO monitoring and useful for the patterns that hard thresholds miss. Build 5 closes with chapter 36 on incident response: when a tier-1 SLO breaches at 3 a.m., what does the runbook say?
- Data contracts: the producer/consumer boundary — chapter 31, the document where freshness SLOs live.
- Data catalogs and the "what does this column mean" problem — chapter 30, the discovery layer that surfaces freshness status alongside schema.
- Schema registries and the evolution problem — chapter 32, the structural layer the freshness SLO sits on top of.
- Watermarks and event time — chapter 80, where streaming-pipeline freshness becomes a first-class system property.
Freshness SLOs are the smallest unit of operational discipline that distinguishes a pipeline-shop from a data-platform team. The pipeline-shop ships a DAG and considers the work done when the DAG turns green; the data-platform team ships a DAG plus a freshness SLO and considers the work done when the consumer can rely on the dataset's lag without checking. The cultural shift — from "the pipeline runs" to "the dataset is fresh" — is what lets the data team be on-call the same way the SRE team is, with the same kind of error budget, the same kind of post-incident reviews, and the same kind of credibility when negotiating priorities with product.
A practical bar: pick a random tier-1 dataset in your warehouse and ask anyone "what is its freshness SLO". If the answer is a number with three percentiles, the discipline is in place. If the answer is "we run it every hour" — that is a pipeline schedule, not a freshness SLO, and the next outage will discover the difference.
References
- Google SRE Book, Chapter 4 — Service Level Objectives — the foundational text on SLO design, error budgets, and the discipline of reliability targets. Every freshness SLO inherits its shape from this chapter.
- Barr Moses et al., "What is Data Observability?" (Monte Carlo blog, 2020) — the practitioner essay that named freshness as one of the five pillars of data observability.
- Chad Sanderson, "Data Contracts: A Comprehensive Guide" (2023) — practitioner-level treatment of data contracts including freshness SLO patterns at production scale.
- Apache Airflow SLA documentation — Airflow's native SLA mechanism, which measures pipeline punctuality (not freshness) but is often the first SLO mechanism teams adopt.
- dbt freshness tests — the dbt-native freshness check, the most common starting point for teams adding freshness monitoring to a warehouse stack.
- Niels Claeys, "Pillars of Data Observability" (Datafold, 2022) — practical breakdown of how freshness composes with completeness, schema, lineage, and distribution monitoring.
- Data contracts: the producer/consumer boundary — chapter 31, the document where freshness SLOs are written down and version-controlled.
- Schema registries and the evolution problem — chapter 32, the substrate the freshness SLO depends on for stable structural reads.