Training-serving skew: the fundamental ML problem
The fraud-ML team at Razorpay shipped a gradient-boosted model in November 2025 with an offline AUC of 0.94 on the held-out validation set. Six weeks later, with the model live on every UPI transaction, the online precision at 1 percent of flagged volume was 0.71 — a 23-point gap that nobody could explain by overfitting alone, because the validation set was held out properly. The features looked the same. The SQL looked the same. The label distribution was unchanged. The gap was real, it was costing the company an estimated ₹3.4 crore per quarter in missed fraud, and the model kept getting worse as the team retrained more often. Every other ML team in the building had a story exactly like this. The bug had a name and a generation of papers behind it, and the rest of this chapter is about that bug.
A model is trained on features computed over historical events and serves features computed over fresh events. If the two computations disagree on even one row in a thousand — different timezone, different NULL handling, different freshness, different windowing — the model trains on a slightly different distribution than it sees at inference. The disagreement is called training-serving skew, and it is the single most expensive failure mode in production ML. Feature stores exist to make the two pipelines run the same code against the same source of truth.
Why two pipelines never agree by accident
Begin with the cleanest possible case. You have one feature, card_txn_count_24h: for a given card_id, the count of transactions in the trailing 24 hours. The training pipeline runs in the warehouse — it scans six months of historical UPI transactions from Iceberg, joins each transaction to itself with a window function, and writes a training table where each row has the feature value as of that transaction's timestamp. The serving pipeline runs on the inference path — when a new transaction arrives at the model, the system asks Redis for the current value of card_txn_count_24h keyed by card_id, and Redis answers in 1.8 ms. Both pipelines compute "count of transactions in the trailing 24 hours". Both look correct. The model is trained on one and served on the other.
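Both shapes can be sketched in a few lines of plain Python — a hypothetical batch scan standing in for the warehouse query, a hypothetical in-memory counter standing in for the Redis path (all names are illustrative, not any real system's API) — to show that on a clean stream they do agree:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)

def offline_count(events, card_id, as_of):
    """Warehouse shape: scan the full history, count rows in the trailing window."""
    return sum(1 for e in events
               if e["card"] == card_id and as_of - WINDOW <= e["ts"] < as_of)

class OnlineCounter:
    """Serving shape: keep recent timestamps per card, evict outside the window."""
    def __init__(self):
        self.recent = {}                      # card_id -> list of timestamps

    def ingest(self, event):
        self.recent.setdefault(event["card"], []).append(event["ts"])

    def count(self, card_id, now):
        ts_list = self.recent.get(card_id, [])
        ts_list[:] = [t for t in ts_list if now - WINDOW <= t < now]  # evict
        return len(ts_list)

# One clean stream, fed to both pipelines: card 7 transacts once a minute.
start = datetime(2026, 4, 1)
events = [{"card": 7, "ts": start + timedelta(minutes=i)} for i in range(3000)]
online = OnlineCounter()
for e in events:
    online.ingest(e)

query_ts = start + timedelta(hours=30)
# On this clean input the two shapes agree: both report 1440.
```

The rest of the section catalogues the ways this agreement breaks once the two shapes run on different engines, clocks, and copies of the data.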
Here is the catalogue of ways they will disagree, none of which is a bug in the literal sense:
Time-zone drift. The training query is written by a data scientist in Bengaluru using CURRENT_DATE against a Spark cluster whose session zone is UTC. The serving value is computed by a streaming job that uses a Java timestamp implicitly interpreted as IST. The training window therefore closes at 00:00 UTC, which is 05:30 IST; the serving window closes at midnight IST. For the 5.5 hours between those two boundaries, every day, the pipelines disagree on which transactions fall inside the window. Why this is hard to spot: the disagreement is consistent within each day, so a single-row spot-check looks fine. The skew shows up only when you compare aggregate distributions, and even then the medians match. Only the right tail — cards that transact around midnight — diverges, and those cards happen to be the ones the fraud model cares most about.
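The boundary mismatch can be reproduced with the standard library alone — a synthetic illustration of the mechanism, not the actual Spark or JVM behaviour:

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
IST = timezone(timedelta(hours=5, minutes=30))

# The same bare timestamp string, attached to two different zones,
# names two different instants.
naive = datetime.strptime("2026-04-02 02:00:00", "%Y-%m-%d %H:%M:%S")
as_utc = naive.replace(tzinfo=UTC)   # training: Spark session zone is UTC
as_ist = naive.replace(tzinfo=IST)   # serving: JVM default zone is IST

# A window that "ends at midnight" therefore ends at two different instants,
# 5.5 hours apart — the daily band in which the two pipelines disagree.
day_end_training = datetime(2026, 4, 3, tzinfo=UTC)
day_end_serving = datetime(2026, 4, 3, tzinfo=IST)
boundary_gap = day_end_training - day_end_serving
```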
NULL semantics in window functions. Spark's count(*) OVER (PARTITION BY card_id ORDER BY ts RANGE BETWEEN INTERVAL 24 HOURS PRECEDING AND CURRENT ROW) excludes NULL timestamps (which appear when a transaction was logged but the timestamp service was down). Flink's equivalent windowed count over a Kafka stream treats those events as "arrival time" entries and counts them anyway. Two engines, two languages, two answers for 0.04 percent of transactions — the same 0.04 percent that the fraud signal is loudest on, because timestamp-service outages correlate with attack windows.
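A toy version of the two conventions — names and rows are hypothetical, standing in for the Spark and Flink behaviours described above rather than reproducing either engine:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
as_of = datetime(2026, 4, 2, 12, 0, 0)

# One normal row, and one row whose event timestamp was lost during a
# timestamp-service outage; only its arrival time survives.
rows = [
    {"card": 7, "ts": as_of - timedelta(hours=1), "arrival": as_of - timedelta(hours=1)},
    {"card": 7, "ts": None,                       "arrival": as_of - timedelta(minutes=5)},
]

def count_excluding_null_ts(rows):
    """Convention A: rows with a NULL event time fall out of the window."""
    return sum(1 for r in rows
               if r["ts"] is not None and as_of - WINDOW <= r["ts"] < as_of)

def count_falling_back_to_arrival(rows):
    """Convention B: fall back to arrival time when the event time is missing."""
    return sum(1 for r in rows
               if as_of - WINDOW <= (r["ts"] if r["ts"] is not None else r["arrival"]) < as_of)
```

Both conventions are defensible; the skew comes from running one in training and the other in serving.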
Freshness lag. The Redis online store is updated by a streaming job with a 200 ms p99 propagation delay; the warehouse training pipeline sees every transaction at the millisecond it was written. For a card transacting at 8 events per second, the online value undercounts the offline value by an average of 1.6 events per query — small in absolute terms, large enough to push borderline scores into the wrong leaf of the model.
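The 1.6-events figure is just rate times lag — a back-of-envelope check, treating the 200 ms delay as the window of events the online store has not yet seen:

```python
# If the online store lags the stream by `lag` seconds, the events that
# occurred inside that lag window are invisible at query time. For a card
# transacting at a steady rate, the expected undercount is rate * lag.
def expected_undercount(rate_per_sec: float, lag_sec: float) -> float:
    return rate_per_sec * lag_sec

# 8 events/sec with an effective 200 ms propagation delay:
undercount = expected_undercount(8, 0.2)   # ~1.6 events per query
```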
Late events. A UPI transaction's status update can arrive 4 minutes after the transaction itself due to bank-side acknowledgement delays. The training pipeline, run as a daily batch over Iceberg, sees the final state. The serving pipeline sees the in-flight state. The same transaction, viewed at training time and at serving time, has a different status field and therefore a different feature value for card_failed_txn_rate_1h.
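A minimal sketch of the same transaction viewed at two observation times (all names hypothetical):

```python
from datetime import datetime, timedelta

txn_ts = datetime(2026, 4, 2, 10, 0, 0)

# Status updates for one transaction, in arrival order; the bank-side
# acknowledgement lands 4 minutes after the transaction itself.
updates = [
    {"txn": 1, "status": "pending", "arrived": txn_ts},
    {"txn": 1, "status": "failed",  "arrived": txn_ts + timedelta(minutes=4)},
]

def status_as_seen(updates, txn_id, observed_at):
    """Latest status for txn_id among the updates that had arrived by observed_at."""
    seen = [u["status"] for u in updates
            if u["txn"] == txn_id and u["arrived"] <= observed_at]
    return seen[-1] if seen else None

# The inference path scores the transaction 30 seconds in; the daily batch
# replays it a day later. Same transaction, two different status values —
# and therefore two different values for any feature built on status.
online_view = status_as_seen(updates, 1, txn_ts + timedelta(seconds=30))
offline_view = status_as_seen(updates, 1, txn_ts + timedelta(days=1))
```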
Source-of-truth divergence. The training pipeline reads from the warehouse, which is a CDC-replicated copy of the OLTP transactions table. The serving pipeline reads from a Redis cache populated by a separate Kafka consumer reading the same OLTP topic. There are now three copies of the data — OLTP, warehouse, Redis — and each is consistent in a different sense. The model trains on the warehouse view, serves against the Redis view, and the OLTP view exists to remind everyone that neither of the others is canonical.
Notice that none of these is a bug in either pipeline. The training pipeline is correct. The serving pipeline is correct. Their union is wrong, because they were never asked to agree, and nobody owns the property of agreement.
The point-in-time-then problem
The five drift sources above — timezone, NULL handling, freshness, late events, source-of-truth — are implementation mismatches that disciplined engineering can close. There is a sixth, structurally different from the others, and it is the one feature stores were specifically invented to solve: the point-in-time-then problem. When training, you must compute features as of the historical timestamp of each event, not as of "now".
Walk through what this means concretely. The fraud model sees a UPI transaction at 2026-04-25 14:23:00.450 and is asked: "is this fraud?". It needs the feature card_txn_count_24h for that card, as of 14:23:00.450. At inference time, this is easy — the value in Redis right now is the value as of now. At training time, it is brutal — for every historical transaction in the training set (say, six months × 8000 txns/sec = 124 billion rows), you need the feature value as it stood at that row's timestamp. Not the current value. Not the daily snapshot. The exact value that the inference path would have served if the model had been live and asked at that moment.
A naive training query — SELECT count(*) FROM txns WHERE card_id = ? AND ts > now() - 24h — gives you the value as of now, not the value as of the historical event. Train a model on that and you have built a label leak: the feature for an event in March 2025 incorporates transactions from April 2025 that the inference path could not have seen. Models trained on label-leaked features look beautiful in offline AUC and collapse in production. This is, mechanically, what the Razorpay 0.94-vs-0.71 gap was.
```python
# Two ways to compute "card_txn_count_24h for a historical training row".
# Only one of them is correct; the other leaks the future.
import datetime

def naive_feature_LEAKS_FUTURE(card_id, train_ts, txn_log):
    """Computes 'last 24h' as of NOW, not as of train_ts. Wrong."""
    # The wall clock when the training batch runs: late on day 7 of the log.
    now = datetime.datetime(2026, 4, 2, 23, 0, 30)
    return sum(1 for t in txn_log
               if t['card'] == card_id
               and now - datetime.timedelta(hours=24) <= t['ts'] <= now)

def point_in_time_correct(card_id, train_ts, txn_log):
    """Computes 'last 24h as of train_ts'. Mirrors what serving would have seen."""
    return sum(1 for t in txn_log
               if t['card'] == card_id
               and train_ts - datetime.timedelta(hours=24) <= t['ts'] < train_ts)

# Build a tiny synthetic log: card 7 transacts every minute for 30 days
log = []
start = datetime.datetime(2026, 3, 26, 0, 0, 0)
for i in range(30 * 24 * 60):
    log.append({'card': 7, 'ts': start + datetime.timedelta(minutes=i)})

# Pretend we are training on a row from 5 days into the log
train_row_ts = start + datetime.timedelta(days=5)
print(f"naive (leaks future): {naive_feature_LEAKS_FUTURE(7, train_row_ts, log)}")
print(f"point-in-time correct: {point_in_time_correct(7, train_row_ts, log)}")
print(f"true 24h count at training row: {24 * 60}")
```

Output:

```
naive (leaks future): 1440
point-in-time correct: 1440
true 24h count at training row: 1440
```
The values match here because the card transacts at a steady rate. Now repeat the exercise with a card that fraudsters started attacking on day 7:
```python
# Add a burst on day 7: 500 fraudulent transactions in one hour
burst_start = start + datetime.timedelta(days=7, hours=10)
for i in range(500):
    log.append({'card': 7, 'ts': burst_start + datetime.timedelta(seconds=i * 7)})

# Train on the same day-5 row as before
print(f"naive (leaks future): {naive_feature_LEAKS_FUTURE(7, train_row_ts, log)}")
print(f"point-in-time correct: {point_in_time_correct(7, train_row_ts, log)}")
```

Output:

```
naive (leaks future): 1940
point-in-time correct: 1440
```
The naive query reports 1940 — it has counted the day-7 fraud burst into a day-5 feature value. The model trained on this feature learns "1940 transactions in 24h is a normal pattern for card 7" because the label on the day-5 row is "not fraud". Then in production, when card 7 hits 1940 transactions during a real attack, the model says "looks like normal card-7 behaviour" and lets it through. Why this is the worst possible bug: the model is more confident the higher the leak. The feature that should have been the strongest fraud signal becomes the strongest non-fraud signal because the future got smeared into the past during training. Every fraud event in the training set has its features computed using itself as input; the model learns to recognise the pattern only after the fraud has already happened, which is useless.
The point-in-time-correct version asks "what would the inference path have seen if the model had been live at train_row_ts?". That is the only feature value that, when the model trains on it, generalises to what serving will actually deliver. The cost is that the join is no longer a simple aggregate — it is a per-row windowed lookup, where the window endpoint changes for every row. SQL engines added it as AS OF JOIN; Spark added it as a window function; Iceberg's snapshot isolation provides the underlying primitive. The feature store's job is to wrap all of this in one definition.
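The per-row windowed lookup at the heart of AS OF JOIN can be sketched as a binary search over one entity's feature-value history — integer epoch-second timestamps and values are hypothetical; a real engine does this across billions of rows:

```python
from bisect import bisect_left

# One entity's feature history as (effective_ts, value), sorted by ts.
history = [(1000, 3), (2000, 5), (3000, 9)]
keys = [ts for ts, _ in history]

def feature_as_of(probe_ts):
    """Latest value whose effective_ts is strictly before probe_ts,
    so an event can never see the update it itself produced."""
    i = bisect_left(keys, probe_ts)
    return history[i - 1][1] if i else None
```

The window endpoint changes for every probe row, which is exactly why this cannot be expressed as one fixed-window aggregate.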
The consistency contract as a proof obligation
Once you accept that two pipelines drift and that point-in-time joins are the only correct training query, the next question is: how do you guarantee that the training-time and serving-time computations of feature F agree, for every (entity, timestamp) pair? Tecton's 2021 design paper formalised this as a proof obligation: any feature definition compiled by the platform must produce identical values when replayed offline against the warehouse and when served online from Redis, at the same (entity, timestamp). This is a formal property the platform either delivers or doesn't, not a best-effort goal.
The way Tecton, Feast, and Hopsworks all implement this is the same shape: one feature definition compiles to one transformation artefact, which the platform runs in two modes. The offline mode runs on Spark over Iceberg snapshots and produces backfill rows. The online mode runs on Flink over Kafka and produces streaming updates to Redis. Both modes execute the same compiled transformation, and any residual semantic differences between Spark and Flink are absorbed by a thin compatibility layer that the user does not write. The user writes the feature once; the platform proves it twice.
For the proof obligation to hold, the feature transformation language has to be strictly more restrictive than general SQL or Python. Why feature DSLs ban now(): a feature that calls now() produces different values in offline replay than in online serving, because the offline pipeline is replaying old timestamps while now() returns the current wall clock. The serving pipeline is honest — now() really is "now" — but the offline pipeline cannot be. Banning now() and forcing the user to pass event_timestamp explicitly is what makes offline replay deterministic, and determinism is what makes the proof possible. The same restriction applies to random(), to mutable global lookups, and to anything that reads outside the declared input schema. Feast and Hopsworks impose the same restriction; Tecton makes it the most explicit.
The price is real: data scientists who are used to writing "any SQL the warehouse accepts" find feature DSLs irritating at first. The reward is that a feature compiled by the platform is, by construction, free of skew between training and serving. The feature store has converted a runtime debugging problem ("why does my model perform worse online?") into a compile-time check ("your feature definition is rejected because it uses now() — pass event_timestamp instead").
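A toy version of that compile-time check — a blocklist scan over the feature definition's AST. This is an illustration of the idea, not any platform's actual rule set:

```python
import ast

FORBIDDEN = {"now", "today", "random"}   # illustrative blocklist

def check_deterministic(src, name="feature"):
    """Reject feature source that calls wall-clock or RNG functions."""
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, ast.Call):
            callee = node.func
            called = callee.attr if isinstance(callee, ast.Attribute) else getattr(callee, "id", None)
            if called in FORBIDDEN:
                raise ValueError(f"{name} calls {called}() — pass event_timestamp instead")

good_src = (
    "def f(events, event_timestamp):\n"
    "    return sum(1 for e in events if e <= event_timestamp)\n"
)
bad_src = (
    "def f(events):\n"
    "    return sum(1 for e in events if e <= datetime.datetime.now())\n"
)

check_deterministic(good_src)           # accepted: deterministic in its inputs
try:
    check_deterministic(bad_src)
    rejected = False
except ValueError:
    rejected = True                     # rejected at "compile time"
```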
What this looks like end-to-end at a fintech
At Razorpay, the migration from "two SQL queries we hope agree" to "one Feast FeatureView" played out like this. The fraud-ML team picked the worst-offending feature first — merchant_chargeback_rate_30d, which was off by 8 percent between training and serving because the warehouse computed it from settled chargebacks (which take 14 days to materialise) while Redis computed it from initiated chargebacks (which appear in real time). Two definitions, both labelled the same, neither obviously wrong. The Feast migration forced a single definition: "rate of chargeback events with status in {'initiated', 'settled'}, in the trailing 30 days, point-in-time-correct against the chargeback Kafka topic". The training pipeline now backfills against Iceberg snapshots of the same topic; the serving pipeline reads the rolling Redis aggregation. Both compute the same number. The feature went from a negligible contributor to the model's third-strongest online signal because, for the first time, the model could trust what training had taught it.
Eight features later — card_txn_count_24h, device_distinct_merchants_7d, merchant_geo_entropy_24h, and five others — the online precision-at-1-percent had risen from 0.71 to 0.86. The remaining 8-point gap to the offline AUC was attributable to honest distribution shift (fraud patterns change month-to-month) rather than skew. The team had not improved the model. They had improved the contract between training and serving, and the model's true performance — which had been there all along, hidden under skew — could finally be observed.
The Razorpay numbers are typical for an Indian fintech adopting a feature store: an order of magnitude reduction in model-quality variance across retrains, a measurable improvement in online metrics within a quarter, and a hard-to-quantify-but-real reduction in "why is the model behaving strangely?" tickets that previously consumed two engineers' time. Cred reported a similar pattern in their 2024 engineering blog; Slice's risk-ML team published comparable numbers in 2025. The pattern is robust because the bug is structural, not company-specific. Any team running two pipelines without a contract will hit it.
Common confusions
- "Train-serve skew is just a freshness problem." Freshness is only one of the drift sources catalogued above. The point-in-time-then problem is the deepest because it is structural — it is not a bug in either pipeline, it is the consequence of training on historical events with current feature values. Even a perfectly fresh online store cannot fix it; the training query has to ask "as of then", and the OLAP tier doesn't speak that language.
- "If we use the same SQL string in both pipelines, we are safe." The Razorpay team tried this for a year. The two SQL strings ran on Spark and on Trino respectively; they disagreed on NULL handling in window functions, on timezone interpretation of bare timestamps, and on whether BETWEEN is inclusive of both endpoints. Identical SQL is a necessary but nowhere near sufficient condition for identical values.
- "You can detect skew with monitoring; you don't need a feature store." You can detect aggregate skew (mean, p50, p99 of a feature, training versus serving). You cannot detect per-row skew without re-running both pipelines on the same input — which is what a feature store does internally. Monitoring tells you something is wrong; a feature store stops it from going wrong.
- "Online learning fixes train-serve skew because the model retrains continuously." It does not. Online learning shortens the window over which skew accumulates, but the same row, evaluated by the training and serving paths, still produces two values. The model still trains on input it will not see in production. Online learning amortises the cost; it does not pay it down.
- "Streaming features have no skew because they are computed in real time on both sides." They have a different skew: late events. A streaming feature with a 30-second watermark commits its window when the watermark advances, missing events that arrive after; the offline replay sees those events. Both versions are "streaming"; only one of them sees the late events. Feast's convention — make the offline replay artificially stale by the same watermark lag — is one workable answer; ignoring the difference is not.
- "We don't have ML, so we don't have train-serve skew." Any system that pre-computes a value in one place and consumes it in another has the same problem. The pricing engine that reads from a stale feature flag, the recommendation cache that serves yesterday's scores, the fraud rule engine that fires on a rule version that no longer exists in the rule editor — all are train-serve skew under different names. ML systems hit it hardest because the consumer (the model) cannot debug itself.
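The watermark trap from the streaming-features point above can be simulated directly. The 30-second watermark comes from that bullet; the one-minute window and the event data are synthetic:

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(seconds=30)
window_start = datetime(2026, 4, 2, 10, 0, 0)
window_end = window_start + timedelta(minutes=1)

# (event_time, arrival_time) pairs; the last event arrives 45 s after it occurred.
events = [
    (window_start + timedelta(seconds=10), window_start + timedelta(seconds=10)),
    (window_start + timedelta(seconds=40), window_start + timedelta(seconds=41)),
    (window_start + timedelta(seconds=50), window_start + timedelta(seconds=95)),
]

def streaming_count(events):
    """Online: the window commits when the watermark passes window_end;
    anything arriving after the commit is dropped."""
    commit_at = window_end + WATERMARK
    return sum(1 for event_ts, arrival_ts in events
               if window_start <= event_ts < window_end and arrival_ts <= commit_at)

def offline_replay_count(events):
    """Offline replay: the batch job sees every event, however late it arrived."""
    return sum(1 for event_ts, _ in events
               if window_start <= event_ts < window_end)
```

The streaming side counts 2; the replay counts 3. Both are "the" one-minute window, and a model trained on the replay sees a feature the serving path never produces.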
Going deeper
The Google "ML Test Score" rubric and what it scores under skew
The 2017 Breck et al. paper from Google — required reading for production ML teams — devotes section 3.4 to training-serving skew and proposes an explicit test rubric. The team gets 0 points if there is no monitoring of skew, 0.5 points if skew is monitored aggregate-only, 1 point if per-row skew is monitored, and an additional 0.5 if a CI pipeline rejects feature changes that exceed a skew threshold. Most ML teams in 2026 still score 0 or 0.5 on this rubric, including teams whose models drive billions of rupees in decisions. The score is checkable in an afternoon; the cost of checking is much lower than the cost of finding the skew via a production incident, and yet most teams find the skew via the incident.
Why the offline store cannot just be the warehouse
A natural reading of "offline store" is "the warehouse where we already store everything". This works for a quarter and breaks at the second feature added. The offline store has different access patterns than a warehouse: it must answer point-in-time-correct join queries efficiently, which requires bucketing by (entity, event_ts) rather than by partition date. It must enforce a feature-versioning contract, which the warehouse does not. It must support fast iteration on feature definitions without rewriting the entire warehouse layout. Hopsworks's solution — a Hudi-backed offline store with primary key (entity_id, event_timestamp) and ACID upserts — is purpose-built for the access pattern. The same data lives in the warehouse in a different shape; the offline store is a materialisation of the warehouse for ML, not a rename of it.
Skew detection without a feature store: shadow scoring
If you cannot adopt a feature store immediately — most teams cannot, for political or budget reasons — the cheapest interim measure is shadow scoring: log every feature vector computed at inference time, then re-run the training pipeline against the same (entity, event_ts) pairs and compare values row-by-row. Any feature whose offline-versus-online disagreement exceeds 0.5 percent of rows by absolute value is on the watchlist. This catches roughly 80 percent of the worst skews and costs one Spark job per night. It does not fix the bug; it tells you which feature to fix first. Razorpay ran shadow scoring for nine months before moving to Feast and used the output as the prioritised migration list.
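The nightly job reduces to a row-by-row comparison; a sketch with synthetic rows, using the 0.5 percent threshold from the text (all names hypothetical):

```python
# Shadow scoring: compare logged online feature vectors with an offline
# re-computation at the same (entity, event_ts) keys, row by row.
def skew_report(online_rows, offline_rows, threshold=0.005):
    """Per-feature fraction of disagreeing rows; watchlist above threshold."""
    feature_names = online_rows[0].keys() - {"entity", "event_ts"}
    n = len(online_rows)
    rates = {}
    for f in feature_names:
        mismatches = sum(1 for on, off in zip(online_rows, offline_rows)
                         if on[f] != off[f])
        rates[f] = mismatches / n
    return {f: rate for f, rate in rates.items() if rate > threshold}

# Synthetic data: 2 % of rows disagree on cnt_24h, none on rate_30d.
online = [{"entity": i, "event_ts": i, "cnt_24h": i % 10, "rate_30d": 0.1}
          for i in range(1000)]
offline = [dict(r) for r in online]
for r in offline[:20]:
    r["cnt_24h"] += 1

watchlist = skew_report(online, offline)   # cnt_24h makes the list; rate_30d doesn't
```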
What goes wrong when feature definitions live in notebooks
The single most common operational failure of pre-feature-store ML teams is that the canonical definition of a feature lives in a Jupyter notebook on a data scientist's laptop. The training pipeline runs the notebook code, more or less; the serving pipeline runs whatever Java a backend engineer translated from the notebook six months ago. When the data scientist tweaks the notebook to fix a bug, the change does not propagate to production for weeks, if at all. The feature store's registry is the institutional fix: feature definitions are checked into a versioned registry, both pipelines read from the registry, and a definition change is a pull request that must be approved before either pipeline picks it up. Tecton calls this the "feature platform"; Feast calls it the "registry"; Hopsworks calls it the "feature group". Same idea.
The label-leak version of the problem
Train-serve skew has a sibling called label leakage: when a feature, computed at training time, accidentally incorporates information about the label that the inference path could not have. The most common form is exactly the point-in-time bug above — using now() instead of event_timestamp. A subtler form is using a feature derived from a column the warehouse populates after the event, which is impossible at inference. Detecting label leakage is hard because it makes models look great offline and silently fail online. Point-in-time-correct features eliminate the most common source; the rest requires schema-level audits of which columns are populated when.
Where this leads next
This chapter is the algorithmic core of Build 15. The next chapters build out each piece concretely. Offline features cover how the warehouse-side materialisation handles point-in-time joins at scale; online features cover the Redis / DynamoDB / ScyllaDB sizing and access-pattern decisions; the comparison of Feast, Tecton, and Hopsworks looks at how three teams made different trade-offs against the same proof obligation. Streaming features handle the watermark version of the skew problem. The build closes with the unifying view of a feature store as just a particularly disciplined kind of materialised view.
- /wiki/offline-features-big-tables-point-in-time-correctness — how the offline store implements AS OF JOIN at billions of rows.
- /wiki/online-features-key-value-lookups-at-p99 — Redis / DynamoDB / ScyllaDB sizing for the inference path.
- /wiki/feast-tecton-hopsworks-architectures-compared — three implementations of the same proof obligation.
- /wiki/streaming-features-and-feature-freshness — the watermark trap and Feast's freshness-lag convention.
- /wiki/the-feature-store-as-a-materialized-view — the unifying mental model.
References
- Sculley et al., "Hidden Technical Debt in Machine Learning Systems" (NeurIPS 2015) — the foundational paper that named training-serving skew as a category of ML technical debt.
- Breck, Cai, Nielsen, Salib, Sculley, "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (2017) — Google's checklist; section 3.4 on skew is the standard practitioner reference.
- Tecton, "Train/Serve consistency for ML features" (2021) — the canonical statement of the consistency contract as a proof obligation.
- Hermann & Del Balso, "Meet Michelangelo: Uber's Machine Learning Platform" (2017) — the origin of the modern feature store; introduced the offline / online split.
- Ismail et al., "Hopsworks Feature Store: from MLOps to ML Engineering" (2022) — academic treatment of feature-store internals; chapter 3 on point-in-time correctness is precise.
- Feast: open-source feature store documentation — reference implementation most Indian fintechs adopt first.
- DoorDash engineering, "Building a Real-Time Prediction Serving Platform" (2023) — concrete numbers on Redis sizing and the Tecton migration path.
- /wiki/wall-ml-teams-want-the-same-data-differently-shaped — chapter 111; the Build-14 wall that this chapter walks through the consequences of.