Online features: key-value lookups at p99
A Razorpay payment-gateway request hits the fraud model at 14:23:00.450 IST. The model needs 84 features for one card_id. The whole authorisation must finish in 200 ms — the merchant times out at 220 ms — and 90 ms of that is reserved for the network round-trip to RBI's switch. That leaves the model 30 ms for inference and 8 ms for the feature lookup. Eight milliseconds, p99, for 84 features keyed on one number, every transaction, 6,200 transactions a second, 24 hours a day, holding the line during Diwali week when the load triples. The offline store from the previous chapter — bucketed Iceberg, AS OF JOIN, 47-minute training scans — cannot answer this query. Not "would be slow", cannot. The shape is wrong, the storage is wrong, the access pattern is wrong. The online store is a different system entirely, sharing only the feature definition with the offline store. The rest of this chapter is what that system actually looks like.
The online store is a low-latency key-value cache holding the current value of every feature, keyed by entity_id. A serving request does one network round-trip — GET fraud:card:7 — and gets back a packed row of 84 feature values in under 8 ms p99. The hard part is keeping this store consistent with the offline store so that the same feature value the model trains on is the one it sees at inference. That consistency, not the lookup itself, is what burns engineering quarters.
Why the offline store cannot serve inference
Start with the request budget, because every storage decision falls out of it. A card-payment authorisation at Razorpay has a 200 ms wall-clock budget from merchant request to merchant response. The fraud check sits inside that, and the model owns 30 ms of inference plus 8 ms of feature retrieval. The 8 ms covers DNS, TCP, the actual key lookup, deserialisation, and the response trip back to the model server.
The offline store needs 47 minutes to scan 124 billion rows. That is fine for training; for serving, it is six orders of magnitude wrong. Even a Trino query against a single bucket of the offline store — the right bucket, the right feature group, the right card row — takes 600 to 1,200 ms, because the engine still parses SQL, plans the query, opens the Iceberg manifest, fetches the Parquet file from S3, and decodes the columns. None of that is removable without changing the storage layer entirely.
So the online store solves a strictly easier problem: return the current feature values for one entity, keyed by entity_id, in single-digit milliseconds. Three properties make this a different system:
One key per request. The offline store joins millions of label rows to billions of feature rows; the online store does one lookup per request. The data structure that wins is a hash table, not a sorted file.
No history. The offline store needs every historical interval to support AS OF JOIN. The online store needs only the current value — what the feature is right now. History costs storage and lookup cycles and serves no purpose at inference.
Bounded latency, not bounded throughput. A training job is throughput-bound — finish 8 TB in 47 minutes. An online lookup is latency-bound — every individual request must finish in 8 ms. The two regimes optimise for different metrics, and the storage layer reflects that. Why this is not a knob: a system tuned for throughput batches I/O, parallelises across cores, and amortises overheads — the per-request median is high but the cluster-wide rate is enormous. A latency-bound system does the opposite — every request gets its own short, predictable, RAM-resident path with no batching at all. You cannot serve 8 ms p99 reads on a system whose architecture amortises across requests; the queueing alone exceeds the budget.
What the lookup path actually looks like
Here is the full server-side path for one feature request, from the model service to the online store and back. The example uses Redis because it is the most common online-store backend at Indian companies; DynamoDB and Cassandra differ in detail but not in shape.
# Online feature lookup path — model service to Redis to model service.
# This is the entire production read path, with realistic byte budgets.
import struct
import time

import redis  # redis-py 5.x

POOL = redis.ConnectionPool(
    host="features.online.razorpay.internal",
    port=6379,
    max_connections=128,       # one per request thread + headroom
    socket_timeout=0.020,      # 20 ms hard cap; 8 ms is the target
    socket_keepalive=True,
    health_check_interval=15,
)

# 84 features packed into one value:
#   12 INT32 counts          (txn_count_24h, distinct_merchants_24h, ...)
#   60 FLOAT32 ratios/scores (failed_txn_rate_24h, geo_entropy_24h, ...)
#   12 INT8 flags            (is_new_card, ...)
# Total: 12*4 + 60*4 + 12 = 300 bytes. Single Redis GET round-trip.
PACK_FMT = "<" + "i" * 12 + "f" * 60 + "b" * 12
PACK_LEN = struct.calcsize(PACK_FMT)  # 300

FEATURE_NAMES = [
    "txn_count_24h", "distinct_merchants_24h", "failed_txn_count_24h",
    # ... (84 total names)
]

# Safe fallback for cache misses and schema drift: every feature defaults to zero.
DEFAULT_FEATURES = {name: 0 for name in FEATURE_NAMES}


def record_schema_mismatch(card_id: int, actual_len: int) -> None:
    """Alert hook for writer/reader schema drift; wired to metrics in production."""


def get_card_features(card_id: int) -> dict:
    """Return the 84-feature dict for one card_id. p99 target: 8 ms."""
    r = redis.Redis(connection_pool=POOL)
    key = f"fraud:card:{card_id}"
    raw = r.get(key)  # ONE round trip, no MULTI, no Lua
    if raw is None:
        # Cold-start miss — fall through to default values.
        # We DO NOT lazy-load from the offline store at request time.
        return DEFAULT_FEATURES
    if len(raw) != PACK_LEN:
        # Schema drift; alert and fall through.
        record_schema_mismatch(card_id, len(raw))
        return DEFAULT_FEATURES
    values = struct.unpack(PACK_FMT, raw)
    return dict(zip(FEATURE_NAMES, values))


# Example call, timed at the caller, which is where the model service's SLO is measured.
t0 = time.perf_counter_ns()
features = get_card_features(7)
t1 = time.perf_counter_ns()
print(f"latency: {(t1 - t0) / 1e6:.2f} ms")
print(f"txn_count_24h = {features['txn_count_24h']}")

# Output (one representative production request):
#   latency: 2.41 ms
#   txn_count_24h = 41
Walk through the five lines that matter.
- PACK_FMT — packing 84 features into one 300-byte binary blob is the single biggest performance lever. A naive design that stores each feature as a separate Redis hash field requires 84 round-trips, or a HGETALL that pulls every field by name; the packed blob is one GET with one network round-trip and one struct-unpack on the client. Why packing wins at p99: each round-trip pays the network RTT (~0.6 ms in-AZ on AWS Mumbai) plus connection-pool acquisition (~0.05 ms). At 84 fields, separate lookups would mean 84 × 0.65 ms ≈ 55 ms even with pipelining tricks; the packed blob is 1 × 0.65 ms. The 60× factor is the difference between "online feature store" and "service that times out".
- max_connections=128 — at 6,200 RPS with an 8 ms latency budget, roughly 50 connections are in flight at any moment (6,200 × 0.008 ≈ 50); 128 gives 2.5× headroom for traffic bursts and slow tails.
- socket_timeout=0.020 — the 20 ms hard cap is higher than the 8 ms target because the worst case on a healthy system is still bounded: a request that takes 18 ms is unusual but not catastrophic, while 100 ms means the connection is dead and you want to fail fast.
- if raw is None: return DEFAULT_FEATURES — the cold-start case for a never-seen card. You do not lazy-load from the offline store at request time. Doing so would inject a 600 ms tail latency into 0.3% of requests (the new-card rate), which would push p99 from 8 ms to 600+ ms. Defaults are the only safe answer; the new card gets logged and the next scheduled batch lifts it into the online store.
- record_schema_mismatch — in production, the writer and reader can drift on schema during deploys. The defensive unpack-length check catches this in milliseconds; without it, a schema-drift bug becomes a silent corruption bug shipped to a fraud model.
The 300-byte packing is the engineering heart of the chapter. Every other choice (Redis vs DynamoDB, in-AZ vs cross-AZ, replica routing) trades a 0.5–2 ms factor; the packing trades a 60× factor.
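For contrast, here is what the per-field layout looks like next to the packed one. This is a minimal sketch: the fraud:card:{id}:fields hash key and the string-encoded field values are illustrative, not the production schema. Even with all 84 HGETs collapsed into one pipeline, the server still parses 84 commands, performs 84 field lookups, and serialises 84 replies; the packed layout does all of that once.

# Sketch: per-field hash layout vs packed-blob layout (illustrative key names).
import struct

import redis

r = redis.Redis(host="features.online.razorpay.internal")
PACK_FMT = "<" + "i" * 12 + "f" * 60 + "b" * 12


def lookup_per_field(card_id: int, field_names: list[str]) -> dict:
    # One pipeline, so one round-trip on the wire, but still 84 commands parsed,
    # 84 field lookups, and 84 replies serialised on the server side.
    pipe = r.pipeline(transaction=False)
    for name in field_names:
        pipe.hget(f"fraud:card:{card_id}:fields", name)
    raw_values = pipe.execute()
    return {name: (float(v) if v is not None else 0.0)
            for name, v in zip(field_names, raw_values)}


def lookup_packed(card_id: int, field_names: list[str]) -> dict:
    # One GET, one 300-byte reply, one struct.unpack on the client.
    raw = r.get(f"fraud:card:{card_id}")
    return dict(zip(field_names, struct.unpack(PACK_FMT, raw)))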
Storage choices: Redis, DynamoDB, Cassandra
The three production-grade backends used by Indian feature-store deployments are Redis (PhonePe, Razorpay), DynamoDB (Flipkart, Swiggy), and Cassandra (Zomato, Meesho). They are not interchangeable. The choice falls out of the request mix and the failure model.
Redis (in-memory, single-AZ, sub-millisecond). Every value lives in RAM on a single shard's primary; replicas exist for failover. p50 is 0.4 ms, p99 is 1.5 ms in-AZ. It is the fastest answer for read-heavy point lookups. The cost is that a Redis cluster's working set must fit in RAM — at Razorpay, 50 million cards × 300 bytes = 15 GB, which fits comfortably on a cache.r7g.xlarge (32 GB) instance. The risk is that a single-AZ outage takes the cluster down; the fix is multi-AZ replicas with automated failover, which Redis Cluster handles natively but adds a ~3 ms p99 hit during the failover window.
DynamoDB (managed, multi-region, single-digit ms). Reads are 4–6 ms p50, 12 ms p99 (DAX-cached: 1–3 ms). Storage is unbounded — you pay per item, no RAM ceiling. The trade-off is that it is managed, which means you get its latency profile and cannot tune it down. Single-region failover happens automatically; multi-region (active-active) takes a config toggle. Flipkart's recommendations service uses DynamoDB with a 50 GB working set across 11 regions because the latency budget there is 50 ms, not 8.
Cassandra (multi-AZ, tunable consistency, 5–15 ms p99). Reads are 4–8 ms p50, 15 ms p99 with LOCAL_QUORUM. Linear scalability — add nodes, get more capacity, no global state. The cost is operational: you run it. Zomato's Hyperpure feature store uses a 24-node Cassandra cluster across three AZs with replication_factor=3 and LOCAL_ONE reads (sacrificing strong consistency for the 5 ms latency target). The trade-off they accepted: under network partition, two nodes can return slightly stale values for a few seconds. For their use case (restaurant-availability features, refreshed every minute), staleness of seconds is invisible; for fraud detection, it would be a problem.
Keeping the online store in sync with the offline store
The online store has the current feature value. The offline store has the historical feature value. They must agree on what "current" means at every materialisation boundary, or the model trains on data the serving layer can never reproduce — the training-serving skew the previous chapter was about.
The materialisation pattern is a Spark or Flink job that reads the same upstream events, computes the same feature definition, and writes to both stores. Concretely: at Razorpay, a single Flink job consumes card.txn.events.v3, runs txn_count_24h over a 24-hour rolling window, and emits two outputs — one to the offline-store Iceberg table (appended as a new interval row) and one to the online-store Redis cluster (overwriting the current value for that card_id). Why one job and not two: if the offline and online stores are written by separate jobs from the same Kafka topic, the two jobs will diverge on watermark, on rounding, on serialisation. They will agree 99% of the time. The 1% is precisely the skew that ruins the model. One job, two sinks, identical feature computation — that is the only structurally safe pattern. Tecton calls it "shared materialisation"; Feast calls it "online + offline writers".
The serialisation trick that makes this work in production is idempotent writes to Redis using SET key value EX ttl. Each write overwrites the previous value with the new one. The TTL (typically 7 days) ensures that a feature for a card that has gone silent eventually expires from the cache rather than serving stale data forever.
# The shared-materialisation sink — one Flink (PyFlink) job, two outputs.
# Stripped to the essential write logic: PACK_FMT and FEATURE_NAMES are the
# definitions from the lookup-path snippet above, and iceberg_writer stands in
# for the job's Iceberg sink.
import struct

import redis

OFFLINE_TABLE = "razorpay.fraud.card_24h_features"
ONLINE_REDIS = redis.Redis(host="features.online.razorpay.internal")


def emit_features(card_id: int, event_ts, feature_values: tuple) -> None:
    # Offline: append a new interval row to Iceberg.
    # The previous open row's valid_to is closed by the writer.
    iceberg_writer.append({
        "card_id": card_id,
        "event_ts": event_ts,
        "valid_from": event_ts,
        "valid_to": None,
        **dict(zip(FEATURE_NAMES, feature_values)),
    })
    # Online: overwrite the current value with an idempotent SET + TTL.
    packed = struct.pack(PACK_FMT, *feature_values)
    ONLINE_REDIS.set(
        f"fraud:card:{card_id}",
        packed,
        ex=7 * 24 * 3600,  # 7-day TTL
    )


# In the Flink topology:
# events.key_by(card_id).window(Sliding(24h, 5min)).aggregate(...).emit(emit_features)
The TTL choice is a real trade-off. Too short (1 hour) and a card that transacts every six hours will repeatedly miss the cache and fall back to defaults — the model will see distorted features. Too long (90 days) and inactive cards bloat the cache, costing RAM. The 7-day default at Razorpay was set after profiling: 99.4% of cards that transact at all in a year transact at least once every 7 days, so the cache hit rate stays above 99% with the smallest tolerable RAM footprint.
What goes wrong at p99
The lookup path is conceptually one Redis GET. In production at 6,200 RPS, the 99th percentile sees things the median never sees.
TCP retransmits during failover. When the Redis primary's AZ goes degraded, replicas promote within 5–15 seconds. During the gap, in-flight requests time out at 20 ms (the socket_timeout) and reconnect to the new primary. Without circuit breakers in the model service, this retry storm can DDoS the new primary into a slower failover. The fix at Razorpay was a 50 ms client-side circuit breaker around the feature lookup, falling back to defaults when tripped.
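A minimal sketch of such a breaker, wrapping get_card_features and DEFAULT_FEATURES from the lookup-path snippet above. The trip threshold, the 50 ms open window, and the half-open probe are illustrative choices, not Razorpay's actual implementation.

# Client-side circuit breaker around the feature lookup (illustrative thresholds).
import time

import redis


class FeatureLookupBreaker:
    def __init__(self, failure_threshold: int = 20, open_seconds: float = 0.050):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.open_seconds = open_seconds            # stay open for 50 ms, then probe again
        self.failures = 0
        self.opened_at = 0.0

    def call(self, card_id: int) -> dict:
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.open_seconds:
                return DEFAULT_FEATURES             # breaker open: serve defaults, skip Redis
            self.failures = 0                       # half-open: let this request probe
        try:
            features = get_card_features(card_id)
            self.failures = 0
            return features
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            self.failures += 1
            if self.failures == self.failure_threshold:
                self.opened_at = time.monotonic()
            return DEFAULT_FEATURES                 # a failed lookup never blocks the auth path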
Hot-key contention. A single Redis shard hosting the top-100 most-active cards becomes a hot spot. The fix is to route reads to replicas: in redis-py's cluster client, constructing the client with read_from_replicas=True spreads GETs across a shard's replicas while writes still go to the primary. This trades a small amount of replication lag (typically well under 1 ms in-AZ) for an order-of-magnitude throughput uplift on hot keys.
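What that configuration looks like, as a sketch that assumes the feature store runs as a Redis Cluster; the endpoint is the same illustrative host name used earlier.

# Replica-routed reads on a Redis Cluster (hot-key mitigation).
from redis.cluster import RedisCluster

r = RedisCluster(
    host="features.online.razorpay.internal",
    port=6379,
    read_from_replicas=True,  # spread read commands across each shard's replicas
    socket_timeout=0.020,
)

# Reads may land on a replica; writes are still routed to the shard primary.
raw = r.get("fraud:card:7")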
Garbage-collection stalls in the JVM model service. A Java/Scala model server hitting a major GC pause adds 50–200 ms to its tail. The lookup itself is fine; the request is queued behind the GC. The mitigation is to size GC for sub-30-ms pauses (G1GC with -XX:MaxGCPauseMillis=20) and to measure feature-lookup latency at the model server, not at the cache. The two are different metrics; ops teams that only watch the cache miss this.
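One way to measure at the model server, sketched with the prometheus_client library; the metric name and bucket boundaries are illustrative, and get_card_features is the function defined earlier.

# Record feature-lookup latency where the model service sees it, not where the cache reports it.
import time

from prometheus_client import Histogram

FEATURE_LOOKUP_MS = Histogram(
    "feature_lookup_latency_ms",
    "Feature lookup latency observed at the model service",
    buckets=(1, 2, 4, 8, 12, 20, 50, 100),  # milliseconds; the 8 ms SLO sits mid-range
)


def timed_lookup(card_id: int) -> dict:
    t0 = time.perf_counter()
    try:
        return get_card_features(card_id)
    finally:
        FEATURE_LOOKUP_MS.observe((time.perf_counter() - t0) * 1000.0)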
Schema drift between writer and reader during deploys. A Flink job rolls out v4 of the feature schema (87 features instead of 84) before the model service is updated. A reader that unpacks only the first 300 bytes of the 312-byte blob ignores the extra 12, but the field ordering may also have changed, silently shifting feature values. The defensive len(raw) != PACK_LEN check catches the size case; the ordering case requires a schema-version byte at the head of every packed value, checked on read.
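A sketch of that version-prefixed layout: one extra byte at the head of the value, checked before unpacking. The version constants and the v4 format string are illustrative, assuming the 87-feature schema described above.

# Version-prefixed packing: the first byte names the schema, the rest is the payload.
import struct

SCHEMA_V3 = 3  # 84 features, 300-byte payload
SCHEMA_V4 = 4  # 87 features, 312-byte payload (illustrative)

FORMATS = {
    SCHEMA_V3: "<" + "i" * 12 + "f" * 60 + "b" * 12,
    SCHEMA_V4: "<" + "i" * 12 + "f" * 63 + "b" * 12,
}


def pack_versioned(version: int, values: tuple) -> bytes:
    return struct.pack("<B", version) + struct.pack(FORMATS[version], *values)


def unpack_versioned(raw: bytes) -> tuple:
    version = raw[0]
    fmt = FORMATS.get(version)
    if fmt is None or len(raw) != 1 + struct.calcsize(fmt):
        raise ValueError(f"unknown or malformed feature schema: version={version}, len={len(raw)}")
    return version, struct.unpack(fmt, raw[1:])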
Common confusions
- "The online store is just a cache of the offline store." Logically yes; operationally no. They are written by the same materialisation job (shared computation) but have different storage layouts (KV vs bucketed sorted file), different consistency guarantees (eventual TTL vs snapshot-isolated), and different failure modes. Treating the online store as a derived cache leads to the lazy-load-from-offline-on-miss anti-pattern that adds 600 ms tail latencies.
- "Redis is faster than DynamoDB so always use Redis." Redis is faster when the working set fits in RAM. A working set of 800 GB (Flipkart's recommendation features) on Redis would need 25×
r7g.16xlargeinstances at ₹4 lakh per month each; the same workload on DynamoDB-with-DAX runs at ₹15 lakh per month total and is operationally hands-off. Redis only wins when the working set is small enough that the operational simplicity is real. - "You can lazy-load from the offline store on cache miss." You cannot, not at p99. The offline-store query takes 600+ ms; injecting that into 0.3% of requests pushes p99 from 8 ms to 600 ms. The only safe answer to a cache miss is default values plus an async backfill task that warms the cache before the next request.
- "Storing each feature as a separate Redis key is cleaner." It is, and it costs you 60× in latency. Eighty-four
GETcalls — even pipelined — still pay 84 wire serialisations and 84 hash lookups server-side. Pack the features into one binary blob; the cleanliness is purely cosmetic and the latency is real. - "TTL doesn't matter because we keep writing." It does. A card that goes inactive for three months (returned, lost, frozen) keeps occupying RAM until the TTL expires it. With 50 million cards and a 5% inactive rate, no-TTL configuration costs 750 MB of permanent waste. Beyond that, TTL is your last-line defence against ghost rows from a buggy materialisation job — the data fixes itself eventually.
- "Strong consistency is required between offline and online stores." No. Eventual consistency on a single-digit-second timescale is fine. The model trains on offline data hours before serving; the online store is allowed to lag the offline store by seconds. What is required is that the same feature definition runs both sides — that is shared computation, not strong consistency.
Going deeper
The DAX caching layer for DynamoDB
DynamoDB's headline 12 ms p99 is workable for many use cases but tight for inference. AWS's DynamoDB Accelerator (DAX) is a write-through, in-memory cache layered in front of DynamoDB that holds hot keys in RAM. Reads that hit DAX return in 1–3 ms; reads that miss go to DynamoDB and populate DAX. The catch is that DAX is eventually consistent — a write to DynamoDB is visible to DAX within tens of milliseconds, not synchronously. For online feature stores, the trade-off works because the materialisation job writes both stores anyway, so DAX-staleness against the online-store-of-record is bounded. Flipkart's recommendations service runs DAX with a 5-node cluster in ap-south-1 and reports p99 of 2.8 ms across 18,000 RPS during sale events.
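A sketch of the underlying DynamoDB read path using boto3's low-level client; the table name, key schema, and binary attribute are illustrative. AWS's Python DAX client (the amazondax package) exposes the same get_item interface, so in the documented integration model fronting the table with DAX is a client swap rather than a query rewrite.

# DynamoDB point read of one card's packed feature blob (illustrative table and attribute names).
import struct

import boto3

PACK_FMT = "<" + "i" * 12 + "f" * 60 + "b" * 12
dynamodb = boto3.client("dynamodb", region_name="ap-south-1")


def get_card_features_ddb(card_id: int) -> tuple:
    resp = dynamodb.get_item(
        TableName="fraud_card_features",
        Key={"card_id": {"N": str(card_id)}},
        ConsistentRead=False,  # eventually consistent read: half the cost, lower latency
    )
    item = resp.get("Item")
    if item is None:
        return ()  # cold-start miss; same default-values policy as the Redis path
    packed = item["features"]["B"]  # binary attribute holding the 300-byte blob
    return struct.unpack(PACK_FMT, packed)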
Why Cassandra LOCAL_ONE reads can be wrong (and when that's fine)
Cassandra's read consistency is tunable. LOCAL_ONE reads from any one replica in the local datacentre — fastest, but it can return stale data if the chosen replica missed the latest write. LOCAL_QUORUM reads from a majority of replicas in the local datacentre — slower (~1–2 ms more) but guaranteed to see any write that was itself acknowledged at LOCAL_QUORUM, because the read and write quorums overlap. For features where staleness of seconds is invisible (restaurant availability, product popularity), LOCAL_ONE is the right choice and saves real latency. For fraud features, where a freshly compromised card needs to flip from is_compromised=False to True immediately, LOCAL_QUORUM is required. The tunable knob exists precisely because no single consistency level is right for every feature — the platform should let each feature group declare its own.
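What per-feature-group consistency looks like with the DataStax Python driver, as a sketch; the contact point, keyspace, and table names are illustrative, while the mechanism (a consistency level attached to each statement) is the driver's standard API.

# Per-query consistency levels with the Cassandra Python driver (illustrative schema).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["cassandra.online.internal"]).connect("features")

# Availability-style features: staleness of seconds is invisible, take the fast read.
availability_read = SimpleStatement(
    "SELECT * FROM restaurant_features WHERE restaurant_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)

# Fraud-style features: a freshly flipped is_compromised flag must be visible immediately.
fraud_read = SimpleStatement(
    "SELECT * FROM card_features WHERE card_id = %s",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)

row = session.execute(fraud_read, (7,)).one()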
Hopsworks's RonDB — when Redis is too coarse
Hopsworks (the Swedish feature store, built on top of HopsFS) ships its own online store called RonDB, derived from MySQL Cluster's NDB engine. RonDB targets sub-millisecond reads at scales where Redis becomes operationally painful (200+ shards). The key engineering insight is that RonDB stores feature vectors, not opaque blobs, with native support for projection — the model service asks for 12 of 84 features and pays only for those 12 over the wire. This trims the response size from 300 bytes to ~50 bytes for typical inference paths, saving real microseconds. RonDB is open source; it sees use at Hopsworks customers running 1M-RPS feature-serving workloads (Klarna, NetEase). The lesson is general: when the working set or the read fan-out outgrows Redis, the next step is a purpose-built KV that understands feature semantics, not a bigger Redis cluster.
The vector_search extension — when features are embeddings
Many modern features are dense embeddings (a 256-float vector of "user preference" learned by a recommendation model) rather than scalar values. Looking up an embedding is a key-value lookup; similarity search over embeddings is not. The online store needs to handle both: a GET of embedding:user:7 returns 1,024 bytes (256 × 4), and a KNN query over the same vectors returns the 20 most-similar users. Redis 8 ships vector search natively (the query engine, formerly the RediSearch module); Postgres has pgvector; specialist stores like Pinecone and Milvus exist for billion-vector workloads. The latency budget for similarity search is looser (50 ms p99 is normal) because it is recall-bound, not latency-bound. The integration question — should the feature store own the vector index or delegate? — is unsolved across vendors. Feast 0.40+ supports both with a pluggable backend.
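To make the two access patterns concrete, a sketch using redis-py against a Redis deployment with the query engine enabled (Redis Stack or Redis 8) and a pre-built HNSW vector index; the index name, field name, and key pattern are all illustrative.

# Point lookup vs KNN similarity search over user embeddings (illustrative index and key names).
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="features.online.razorpay.internal")


def similar_users(user_id: int, k: int = 20):
    # Point lookup: an ordinary GET returning 1,024 bytes (256 float32 values).
    query_vec = r.get(f"embedding:user:{user_id}")
    if query_vec is None:
        return []
    # Similarity search: parameterised KNN query against the pre-built vector index.
    q = (
        Query(f"*=>[KNN {k} @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("score")
        .dialect(2)
    )
    return r.ft("user_embedding_idx").search(q, query_params={"vec": query_vec}).docs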
Cost-per-request as a design constraint
A Razorpay payment authorisation that returns "not fraud" earns the gateway maybe 30 paise of revenue. The fraud check has to fit inside that — every component, including the feature lookup. At 6,200 RPS, the online store costs roughly ₹18 lakh per month on Redis Cluster (3-AZ, RF=2) plus ₹4 lakh per month for the materialisation Flink job. Per-request cost: ₹0.011, or 1.1 paise. That is 3.5% of the gross revenue per authorisation. If the online-store cost crept to ₹70 lakh per month — which is what an under-engineered DynamoDB deployment looks like — the fraud check would consume 12% of the auth revenue and the unit economics break. The latency budget and the cost budget are linked: the architectures that are too slow are also the architectures that are too expensive.
Where this leads next
- /wiki/offline-features-big-tables-point-in-time-correctness — the chapter this one mirrors. Same feature, different store, different access pattern.
- /wiki/streaming-features-and-feature-freshness — how the online store gets refreshed in seconds rather than batches, and what changes in the materialisation job.
- /wiki/feast-tecton-hopsworks-architectures-compared — the three vendors viewed through the lens of online-store design.
- /wiki/training-serving-skew-the-fundamental-ml-problem — the proof obligation that the online store delivers half of.
References
- Tecton, "Online Feature Store Architecture" — the practitioner reference for shared-materialisation patterns.
- Hopsworks, "RonDB: A Feature Store Online Store" — design rationale for purpose-built feature KVs.
- Redis Labs, "Best Practices for Caching with Redis" — TTL, eviction, replica routing.
- AWS, "Working with DynamoDB Accelerator (DAX)" — the integration model that Flipkart uses for sub-3-ms reads.
- Lakshman & Malik, "Cassandra: A Decentralized Structured Storage System" (2010) — the foundational paper. The LOCAL_QUORUM trade-off discussed above is from §5.
- Uber, "Michelangelo: Uber's ML Platform" — the canonical industry post that introduced the offline+online split.
- Apache Flink, "DataStream API Sinks" — the multi-sink pattern that makes shared materialisation possible.
- /wiki/offline-features-big-tables-point-in-time-correctness — chapter 113; the offline counterpart that this chapter completes.