Wall: ML teams want the same data, differently shaped
It is 14:20 on a Wednesday at Razorpay's Bengaluru office. Dipti runs the data platform; her team finished Build 14 last quarter and the StarRocks tier is purring — merchant dashboards answer in 80 ms p99, the live transaction map updates every two seconds, the BI team has stopped filing tickets. Then Aditi, who leads the fraud-ML group, drops a calendar invite titled "we need to talk about features". In the meeting Aditi explains: her gradient-boosted model that scores every UPI transaction in 12 ms has 84 input features. Forty-one of those features are aggregations the dashboard already computes — txn_count_24h_for_card, merchant_chargeback_rate_30d, device_distinct_merchants_7d. The dashboard answers in 80 ms. The model has a 12 ms budget end-to-end. And every time Aditi's team retrains the model, the offline training set computed from the warehouse disagrees with the online features served at inference time, in subtle ways nobody can fully explain. The model's offline AUC is 0.94. Its online precision-at-1% is 0.71. Aditi's team has been chasing the gap for six months. The gap has a name, and the name is the wall that ends Build 14 and starts Build 15.
A real-time OLAP tier is built for human-readable aggregations served at 50 ms p99 to dashboards. An ML model wants the same aggregations served as a point-in-time feature vector at 5 ms p99, joined to entities the OLAP tier doesn't index, and — critically — computed as of the timestamp of the historical event being trained on, not "now". Reshaping the OLAP answer one query at a time is what produces train-serve skew. The honest fix is a feature store: a separate system that owns the offline-online consistency contract.
Three differences that look small and aren't
The dashboard query and the model query look almost identical when you write them down. SELECT count(*) FROM txns WHERE card_id = ? AND ts > now() - interval '24 hours' is the dashboard's query. The fraud model wants the same number. There are three differences, and each one is a reason the feature store exists.
The first difference is shape. The dashboard returns one row of one column for a card the analyst is investigating. The model wants 84 features for the card_id, the device_id, the merchant_id, the bank_id, and the geo-bin — five entities, eighty-four columns, in one round trip per scoring request. A StarRocks query that joins five entity tables and projects 84 columns is a different physical plan from a dashboard's tile-shaped scan. The OLAP engine can answer it, but at 800 ms p99, not 5 ms.
The second difference is time. The dashboard's "last 24 hours" means literally ts > now() - 24h. The model's "last 24 hours at the time of the historical transaction being scored" means ts > train_ts - 24h AND ts < train_ts. When you train on six months of transactions, every row needs its features computed as of its own timestamp — not as of "now". An OLAP query is naturally a point-in-time-now query. A point-in-time-then query is what the literature calls a point-in-time join, and it is the core hard problem in feature engineering for ML.
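The point-in-time-now versus point-in-time-then distinction is easy to make concrete. A minimal sketch with hypothetical toy data (not the production query): the same feature function, evaluated once with the window ending "now" and once with the window ending at each historical row's own timestamp.

```python
import datetime as dt

def feature_as_of(card_id, as_of_ts, txns):
    """txn_count_24h computed *as of* a given timestamp: only events in the
    24h window ending at as_of_ts count, and nothing after as_of_ts leaks in."""
    window_start = as_of_ts - dt.timedelta(hours=24)
    return sum(1 for t in txns
               if t['card'] == card_id and window_start <= t['ts'] <= as_of_ts)

base = dt.datetime(2026, 1, 10, 12, 0, 0)
# One toy card with a transaction every 6 hours over 3 days.
txns = [{'card': 'c1', 'ts': base + dt.timedelta(hours=h)} for h in range(0, 72, 6)]

# Point-in-time-NOW (the dashboard's question): window ends at the present.
print(feature_as_of('c1', base + dt.timedelta(hours=72), txns))  # -> 4

# Point-in-time-THEN (the training question): each historical row gets a
# window ending at its *own* timestamp, so earlier rows see smaller counts.
for row in txns[:3]:
    print(row['ts'], feature_as_of('c1', row['ts'], txns))  # counts 1, 2, 3
```

The second loop is the point-in-time join in miniature: one feature evaluation per training row, each anchored to that row's timestamp rather than to the clock.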
The third difference is traffic shape. The dashboard serves a few thousand queries per minute, mostly the same handful of merchant IDs, perfectly cacheable. The fraud model serves every UPI transaction Razorpay processes — 8,000 transactions per second at peak, each one a unique card-id never before scored, blasting through any sensible cache. An OLAP cluster sized for the dashboard cannot absorb the model's traffic without cratering both. Why the cache hit ratio is the giveaway: the dashboard's effective cache hit ratio at the merchant-id level is around 78 percent — most queries hit the same hot tiles. The fraud model's cache hit ratio at the card-id level is around 0.4 percent — every transaction is a new card from the model's point of view because the keying is per-card-per-timestamp. Caches built for the first workload are useless for the second.
Why train-serve skew is the actual bug
Of the three differences above, the time-semantics one is the only one that affects model correctness. The other two affect performance and cost; this one affects whether the model is right at all.
Aditi's team trains on six months of historical UPI transactions. For each transaction, they need to know — as of the moment that transaction happened — how many transactions that card had in the prior 24 hours, what the average ticket size was, how many distinct merchants the card had touched, whether the card had been flagged in the last 90 days. They write a SQL query against the Iceberg lakehouse that computes these features by joining the transaction fact table to itself and aggregating over windows ending at each transaction's timestamp. The query takes 4 hours on a 200-node Spark cluster but the answer is correct.
At inference time, the model receives a new transaction and asks the OLAP tier the same questions. The OLAP tier answers as of now, which for a transaction happening now is correct — but only if the OLAP tier's data is up to the millisecond. It is not. The StarRocks refresh lag is 200–600 ms; the materialised view that backs txn_count_24h has a 30-second refresh window. The transaction the model is scoring is itself going into that window. So the offline-trained feature counts the in-flight transaction; the online-served feature does not. The training set has values one or two units higher than the serving set sees, on average — not for every row, but consistently at the head of the distribution where the model decides "high-volume card, probably benign".
# Demonstration of how train-serve skew arises from seconds of refresh lag
import datetime

def offline_feature(card_id, ts, txn_log):
    """Computed in batch; sees every txn up to and including ts."""
    lo = ts - datetime.timedelta(hours=24)
    return sum(1 for t in txn_log
               if t['card'] == card_id and lo <= t['ts'] <= ts)

def online_feature(card_id, ts, mv_lag_seconds, txn_log):
    """Computed by the online MV; sees txns only up to (ts - mv_lag)."""
    lo = ts - datetime.timedelta(hours=24)
    cutoff = ts - datetime.timedelta(seconds=mv_lag_seconds)
    return sum(1 for t in txn_log
               if t['card'] == card_id and lo <= t['ts'] <= cutoff)

# Simulate a handful of cards, each with 8 txns/sec for 24h, then ask: at the
# 24h mark, what does each path say? (Three cards suffice -- every card follows
# the same pattern, and a 1000-card log at this rate would not fit in memory.)
n_cards = 3
log = []
now = datetime.datetime(2026, 4, 25, 12, 0, 0)
for c in range(n_cards):
    for i in range(60 * 60 * 24 * 8):  # 8 txns/sec over 24h, one every 1/8 s
        log.append({'card': c, 'ts': now - datetime.timedelta(seconds=i / 8)})

inference_ts = now
mv_lag = 30  # seconds, a typical OLAP MV refresh lag under load
diffs = []
for c in range(n_cards):
    off = offline_feature(c, inference_ts, log)
    on = online_feature(c, inference_ts, mv_lag, log)
    diffs.append(off - on)
print(f"avg skew: {sum(diffs)/len(diffs):.1f} txns")
print(f"max skew: {max(diffs)} txns, min skew: {min(diffs)} txns")
print(f"% of cards with skew >= 100: {sum(1 for d in diffs if d >= 100)/len(diffs)*100:.0f}%")

# Output
avg skew: 240.0 txns
max skew: 240 txns, min skew: 240 txns
% of cards with skew >= 100: 100%
Walk through what this says. offline_feature sees every transaction up to and including the inference timestamp — that's how the warehouse training query was written. online_feature sees transactions only up to ts - 30 seconds because the materialised view is 30 seconds behind real time. Why the gap is exactly 30 × 8 = 240: each card produces 8 transactions per second; the MV is 30 seconds stale; therefore the online feature undercounts by 240 transactions for every active card. The training set always sees these last 240 transactions; the inference path never does. The gap is not noise — it is a deterministic offset that depends entirely on MV lag. The model trained on the offline path learns "a card with 691,200 transactions in 24 hours is the high-volume bucket" but the online path delivers 690,960. The model treats it as a different feature value, applies a different leaf-node decision, and outputs a different score.
This is what train-serve skew looks like in practice — not a bug in either pipeline, but a contract violation between two pipelines that nobody owns. The dashboard team is correct; their numbers match what they promised. The ML team is correct; their training set matches what the warehouse said. The mismatch is in the assumption that the same SQL run against two systems gives the same answer. It does not, because the two systems have different freshness semantics. Why nobody owns the contract: the dashboard team defines txn_count_24h as "the value the StarRocks MV currently holds". The ML team defines it as "the value of count(*) against the source-of-truth Iceberg table at the inference timestamp". Both are reasonable; both are different. There is no third party whose job it is to make them agree, until you build one — and that third party is the feature store.
The Saturday-night fraud pattern that exposed the skew
Aditi's team knows the model has been miscalibrated since launch but they never had a clean reproduction. The first one came during the IPL final on a Saturday night in 2025, when Razorpay's UPI volume spiked from 4,200 transactions per second to 11,800 transactions per second over 90 minutes. The fraud rate during that window roughly doubled, as expected — fraudsters know the signal is loudest when the noise is loudest. What surprised the team was that the model's recall on the fraud subset fell from 0.81 to 0.62 during the spike, then recovered to 0.79 over the next two hours.
A traffic-spike-correlated recall drop is the fingerprint of a feature-freshness skew that grows with throughput. The team traced it: during the spike, the StarRocks materialised view that backed txn_count_5min_for_card slipped from a 200 ms lag to a 9 second lag because the view's refresh job was sharing CPU with dashboard queries (the merchant ops team was actively monitoring the spike too). The training set was built on Iceberg with millisecond accuracy. The inference path was reading values 9 seconds stale during the exact window the model needed them sharpest. A one-second skew on a 5-minute window is small; a 9-second skew on a 5-minute window is a 3 percent shift in the feature value, enough to push borderline transactions into the wrong leaf of the model. The bug was not in the model, the warehouse, or the OLAP tier — it was in the absence of any system that owned the property "training and serving features must agree". Why the recall recovered: as the spike subsided the MV refresh caught up, the lag fell back to 200 ms, and the model's online features moved back into the regime it was trained on. The recall didn't recover because anyone fixed the bug; it recovered because the underlying contention disappeared. A bug that only fires under load is the worst kind of bug because the test environment is permanently green.
What the OLAP tier cannot give you
It is worth being precise about what the existing Build 14 system actually fails at, because the temptation is always "just push harder on what we have". The OLAP tier answers analytical queries beautifully. It fails at four specific things the model needs.
Point-in-time-then joins. A training query asks "for each historical transaction, what was the card's 24h count as of that transaction's timestamp?". The OLAP engine has no concept of "as of"; it returns the value the table holds now. To answer correctly, the engine would need to maintain a complete time-versioned view of every aggregation — which Iceberg does for partition-level snapshots and which the OLAP tier does not, because it would defeat the columnar layout. Iceberg time-travel is the right primitive for offline training; ClickHouse is the right primitive for online dashboards; neither one is the primitive the ML pipeline wants. The feature store sits between them and translates.
Sub-millisecond key-value reads at unbounded cardinality. The model needs the feature vector for this card-id in 2 ms. ClickHouse can answer in 50 ms for a hot key, 800 ms for a cold key. A KV store like Redis or DynamoDB answers in 1.5 ms regardless of key, because it is built for the access pattern. The cost is denormalisation: the feature must be pre-computed and pushed into Redis. The OLAP tier is happy to compute the feature; it cannot serve it at the right shape.
Vector composition across entities. The model wants 84 features keyed by 5 different entity types in one network round-trip. ClickHouse's distributed query coordinator has to plan, dispatch, and merge across 5 partition layouts; that's a 30 ms minimum even on a warm path. A KV store with the feature vector pre-assembled answers in 2 ms. Why composition matters: an inference request that fans out to 5 separate OLAP queries pays the latency tax 5 times, and each query independently risks hitting the long tail. With p99-each at 50 ms, the p99-of-max-of-5 is around 110 ms — the model has missed its 12 ms SLA before any inference happened. The feature store collapses the fan-out into one read against one entity table.
Backfill with replay semantics. When the ML team retrains the model with three new features, they need to backfill those features for every historical transaction in the training set. The feature store offers this as a first-class operation: a Spark job that scans Iceberg, computes the new features as of each historical event, and writes them to the offline store. The OLAP tier has nothing analogous; it would need 200 nodes and 8 hours and would still get the time-semantics wrong unless every base table has full history retained.
What the feature store actually owns
Build 15 introduces an architectural pattern that the OLAP tier alone cannot deliver: a system whose explicit job is the offline-online consistency contract. The feature store does not replace the warehouse, the OLAP tier, or the lakehouse. It sits beside them and owns one specific property: the same feature definition produces the same value whether you read it for training or for serving.
Mechanically, the feature store does three things. It defines features in code (FeatureViews in Feast, Tecton specs in Tecton, Hopsworks feature groups) so there is exactly one definition that both pipelines reference. It materialises features twice — once into an offline store (Iceberg/Parquet/BigQuery) for training, once into an online store (Redis/DynamoDB/ScyllaDB) for serving — using the same transformation code run against the source of truth. And it implements point-in-time-correct joins as a first-class operation, so the training query "as of this historical timestamp" is not something the ML engineer has to write by hand.
For Razorpay, the migration looks like this. Aditi's team writes 84 feature definitions in Feast. The Feast materialisation engine runs once a day to refresh the offline store (Iceberg) from the historical Kafka topic, and continuously to push streaming features into Redis. The training pipeline reads from the offline store with point-in-time joins; the inference path reads from Redis with single-key lookups. The dashboard team's StarRocks cluster is unchanged — it still serves the merchant dashboards, with no new load. The ML team's online precision-at-1% climbs from 0.71 to 0.89 in the next retrain because the features they trained on now actually match what they serve.
The cost is real: a Redis cluster sized for 8,000 reads/sec at p99 5 ms with 100M card-id keys runs about ₹2.4 lakh/month at AWS Mumbai prices. The materialisation Spark cluster is another ₹1.8 lakh/month. But the alternative — six months of A/B testing chasing a skew that was always going to be there — costs more in engineering time and missed fraud capture than the feature store costs in compute. A single missed fraud event on a high-value B2B settlement can cost ₹40 lakh by itself; the recall improvement from closing the skew pays for the platform within the first quarter for any fintech operating at Razorpay's scale. Build 15 exists because the maths works out.
The bigger shift is organisational. Once a feature store exists, the question "who owns this number?" has a clean answer. The ML platform team owns the feature definition and the materialisation pipeline; the consuming team owns the consumption. The fraud team can no longer blame the warehouse for a skew, and the warehouse team can no longer be on the hook for ML-specific freshness. The OLAP tier serves dashboards; the feature store serves models; the lakehouse remains the source of truth. Three systems, three clear contracts, one consistent picture of what the data means.
Common confusions
- "A feature store is just a fancy cache on top of the warehouse." No. A cache stores the answer to a query; a feature store stores the materialised result of a feature definition across two physically different stores, with the contract that both sides are consistent. A cache is invalidated by time; a feature store is refreshed by a materialisation pipeline whose code is identical for both sides. The cache cannot solve point-in-time joins because it has no notion of "as of when".
- "OLAP and feature stores are the same problem; pick one." They serve different access patterns. OLAP is built for human-readable wide scans (10 rows × 5 cols of GMV by region); feature stores are built for narrow vector reads at sub-millisecond latency (1 row × 84 cols by single key). A team that uses the OLAP tier as a feature store will find the p99 above the model's SLA; a team that uses the feature store as an analytical tool will find it can't answer ad-hoc queries.
- "Train-serve skew is a model bug, not a data bug." The model is doing the right thing with the inputs it was given. The bug is upstream: the training-time and serving-time computations of the same feature give different answers. Fixing it in the model (by adding more features, retraining more often, augmenting with online learning) is treating the symptom. The fix is at the data layer.
- "You can avoid skew by using only online features." The online store has 24 hours of history at most. A model trained only on online-store data has no access to "card's behaviour 60 days ago", which is exactly the signal a fraud model needs for high-value-low-frequency cards. Offline features exist because online stores cannot economically retain a year of every card's transaction stream.
- "If our features are simple SQL, we don't need a feature store; just run the SQL twice." The ML team at Aditi's previous company tried exactly this. The two SQL runs disagreed on 0.6 percent of features, mostly because of timezone handling, NULL semantics in window functions, and a
where ts < now()that meant "last second of yesterday" in one engine and "current millisecond" in another. The whole point of the feature store is one definition, one execution code path, two store outputs. - "Streaming features are just batch features run on a smaller window." They are not. A 5-minute streaming feature has to be correct in the presence of late-arriving events, watermarks, and exactly-once delivery — the entire content of Build 8 and Build 9. Building it on an OLAP MV without those primitives produces silently wrong values during failover. The feature store either reuses Flink/Beam for streaming features or accepts the consequences.
Going deeper
Tecton and the train-serve consistency proof obligation
Tecton's design — popularised in their 2021 post and adopted by Stripe, DoorDash, and Uber's Michelangelo for new pipelines — formalises the consistency contract as a proof obligation: any feature definition compiled by the platform must produce identical values whether replayed offline or served online, for every (entity, timestamp) pair. The platform achieves this by compiling the user's feature transformation once, then running the same compiled artefact on both Spark (for backfill) and Flink (for streaming materialisation). The compiler refuses any transformation that uses a non-deterministic primitive — now(), random(), mutable lookups — because such primitives violate the obligation. This is why Tecton's user-facing language is intentionally restricted compared to "general SQL"; the restriction is what buys the consistency guarantee.
Point-in-time joins, the AS OF JOIN primitive, and what Iceberg gives you
The core algorithm of a point-in-time join is, for each row in the left table with timestamp T, return the most recent matching row in the right table with timestamp ≤ T. SQL's standard doesn't have this as a primitive; engines added it as AS OF JOIN (kdb+, QuestDB) or MATCH_RECOGNIZE (Flink, Snowflake). Spark added it as a window function in 2.4. Iceberg's snapshot isolation gives you the building block: you can read a fact table at any historical snapshot, so you can iterate over events and read the state of the right-side table as of each event's timestamp. The cost is one snapshot read per left-row, which is why the offline materialisation pipeline tends to use a chunked approach: bin left-rows by snapshot interval, do one batched join per bin, then concatenate. Feast's offline store implementation does exactly this on top of Iceberg.
Why feature stores at Indian fintechs hit Redis cluster sizing problems first
The dominant operational pain at Razorpay, Cred, and Slice when adopting a feature store is not the conceptual layer; it is sizing the online store. UPI cardinality grows fast — Razorpay added 47M new cards in FY2025 alone — and a Redis cluster sized for "current active cards" gets blown out by the long tail of dormant-then-active cards that the fraud model still needs to score. The pattern that has emerged: Redis for the top-2-percent hot cards (3-day rolling window of activity), DynamoDB-with-TTL for the warm tier (90-day window), Iceberg cold for everything else with a 200 ms p99 promotion path when a cold card suddenly transacts. This three-tier online store is custom in 2026 — Feast, Tecton, and Hopsworks all expose only one online store as a primitive. Engineering teams build the tiering themselves on top.
The streaming-feature watermark trap
Streaming features look like a free lunch — "just compute it as events arrive" — until late events show up. A card.txn_count_5min feature computed on Flink with a 30-second watermark is not identical to the same feature replayed offline against the full event log: the streaming version commits its 5-minute window at watermark advance, missing events that arrived after; the offline version sees everything. Feast's solution is a "feature freshness lag" parameter that the user explicitly sets; the offline materialisation respects the same lag, so both paths agree. The cost is that the offline path is artificially "stale" by the same 30 seconds — a price worth paying for consistency, but one the ML team must understand. Build 15's chapter on streaming features goes through this in detail.
What Uber's Michelangelo learned about feature reuse
The 2017 Michelangelo paper is the origin of the modern feature store. Uber's lesson — repeated by every team that builds one — is that a feature store's value compounds with adoption: the first feature is expensive to define; the hundredth feature is cheap because it composes existing pieces. The corollary is that feature stores fail when teams treat them as a solo project. Razorpay's fraud-ML team adopting Feast in isolation gets some skew reduction; Razorpay's payments-ML, KYC-ML, and risk-scoring teams all adopting Feast against a shared registry get exponential reuse, because card_txn_count_24h is defined once and consumed forty places. Build 15 ends with this organisational pattern; the technical architecture is the precondition.
Where this leads next
The next chapter — train-vs-serve-skew-the-fundamental-problem — opens Build 15 by digging into the algorithmic core of the consistency contract: how exactly do you guarantee that the training pipeline and the serving pipeline produce identical numbers, given that one runs on Spark over Iceberg snapshots and the other runs on Flink over Kafka? The answer involves shared transformation code, deterministic operators, and watermark-respecting aggregations. After that the build moves into the offline / online / streaming feature triad, the point-in-time join algorithm in detail, and the production patterns that emerged from Feast, Tecton, and Hopsworks.
- /wiki/train-vs-serve-skew-the-fundamental-problem — the algorithmic core of the consistency contract.
- /wiki/offline-features-big-tables-point-in-time-correctness — how the offline store implements AS OF JOIN at scale.
- /wiki/online-features-key-value-lookups-at-p99 — Redis / DynamoDB / ScyllaDB sizing for the inference path.
- /wiki/streaming-features-and-feature-freshness — Flink-backed features and the watermark trap.
- /wiki/the-feature-store-as-a-materialized-view — the unifying mental model that connects Build 10 and Build 15.
Build 14 ends here. The OLAP tier is doing its job; the ML team's needs simply do not fit on it. Build 15 is the parallel pipeline that fits.
References
- Tecton: Train/Serve consistency for ML features (2021) — the canonical statement of the consistency contract as a proof obligation.
- Meet Michelangelo: Uber's Machine Learning Platform (Hermann & Del Balso, 2017) — the origin of the modern feature store; introduced the offline/online split.
- Feast: open-source feature store documentation — the reference implementation most Indian fintechs adopt first; covers materialisation, point-in-time joins, and the online store abstraction.
- Hopsworks Feature Store: from MLOps to ML Engineering (Ismail et al., 2022) — academic treatment of feature-store internals; chapter 3 on point-in-time correctness is precise.
- The ML Test Score: A Rubric for ML Production Readiness (Breck et al., 2017) — Google's checklist; section 3.4 on training-serving skew is the standard practitioner reference.
- DoorDash's feature store journey (engineering blog, 2023) — concrete numbers on Redis sizing, Spark backfill cost, and the Tecton migration path; closely mirrors the Indian fintech pattern.
- /wiki/serving-p99-latency-under-ingest-pressure — chapter 110; the OLAP-tier latency story that this wall picks up where it leaves off.
- /wiki/pre-aggregation-materialized-views-and-their-costs — chapter 109; the MV refresh-lag story that explains why train-serve skew arises in the simulation above.