Feast, Tecton, Hopsworks: architectures compared
A platform engineer at Meesho is comparing three feature-store products on a Tuesday afternoon, with a deadline of Friday to recommend one to the head of ML. Feast is open-source and free; Tecton is a managed SaaS that costs ₹1.6 crore per year for their tier; Hopsworks ships an on-prem appliance that the security team likes because the data never leaves the VPC. All three claim to solve "feature store" — point-in-time correctness, online + offline parity, training-serving skew. So why are they priced an order of magnitude apart, and why does picking the wrong one cost 18 months of platform-team time? Because the three vendors drew the line between platform and customer code in three different places. That line — what the product owns versus what your team owns — is the entire architectural difference.
Feast is a thin metadata-and-SDK layer on top of storage you bring; the customer owns materialisation, infra, and SLAs. Tecton owns the entire feature lifecycle — definition language, materialisation engine, online + offline stores — as a managed service. Hopsworks ships a vertically-integrated platform you run yourself, with its own filesystem, online store (RonDB), and training pipeline. The right pick is the one where the boundary matches your team's capacity.
What "feature store" actually means as a product
A feature store is not one thing; it is six things stitched together. Feature definitions (the SQL or Python that computes a feature). A registry (which features exist, what they depend on, who owns them). An offline store (point-in-time-correct historical data for training). An online store (low-latency current data for serving). A materialisation pipeline (the job that keeps both stores in sync). An SDK (the library data scientists use to fetch features for training and inference). The architectural question is which of these six the vendor owns and which your team owns.
Feast owns the registry and the SDK. Everything else — the offline store (you bring BigQuery / Snowflake / Iceberg), the online store (you bring Redis / DynamoDB), the materialisation job (you write Spark / Flink) — is your team's problem. The product is essentially a YAML schema and a Python client. You get cheap and flexible; you give up turnkey.
Tecton owns all six. Feature definitions are written in Tecton's Python DSL, registered with the Tecton control plane, materialised by Tecton-operated Spark/Flink clusters, written to a Tecton-managed online store (DynamoDB or their own KV), queried via the Tecton SDK. You write feature definitions and SLA targets; Tecton runs the rest. You get turnkey; you give up control over the storage layer and pay SaaS prices.
Hopsworks also owns all six but is self-hosted. Their control plane, their HopsFS distributed filesystem (the offline store), their RonDB online store, their feature engineering jobs run on their Hopsworks cluster. You install it on your own Kubernetes — typically inside a regulated VPC where data sovereignty matters — and you own the operations. You get turnkey-but-on-prem; you give up the elasticity that managed SaaS provides.
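One way to make the boundary concrete is a small ownership matrix. This is an illustrative summary of the three paragraphs above, not vendor documentation — the component names and owner labels are this chapter's shorthand:

```python
# Who owns each feature-store component, per the three architectures above.
# "customer" = your team builds and operates it.
COMPONENTS = ["registry", "offline_store", "online_store", "materialisation", "sdk"]

OWNERSHIP = {
    "feast": {
        "registry": "vendor",            # Feast's registry
        "offline_store": "customer",     # BigQuery / Snowflake / Iceberg you bring
        "online_store": "customer",      # Redis / DynamoDB you bring
        "materialisation": "customer",   # Spark / Flink jobs you write
        "sdk": "vendor",                 # Feast's Python client
    },
    "tecton": {c: "vendor (managed SaaS)" for c in COMPONENTS},
    "hopsworks": {c: "vendor software, customer-operated" for c in COMPONENTS},
}

def customer_owned(platform: str) -> list[str]:
    """Components your team must build and operate outright."""
    return [c for c, owner in OWNERSHIP[platform].items() if owner == "customer"]
```

Running `customer_owned("feast")` returns the offline store, online store, and materialisation pipeline — exactly the three pieces whose headcount cost dominates the Feast column in the pricing discussion later in this chapter.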
How a feature definition flows in each system
The cleanest way to see the architectural difference is to follow one feature — failed_txn_rate_24h — from definition to inference in each platform. Pick a Razorpay-style fraud feature: for each card_id, the rate of failed transactions in the last 24 hours.
In Feast, you define the feature in a Python file:
# feature_repo/features.py — Feast feature definition
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

card = Entity(name="card", join_keys=["card_id"])

card_txn_source = FileSource(
    path="s3://razorpay-features/card_txn_24h.parquet",
    timestamp_field="event_ts",
    created_timestamp_column="materialized_at",
)

failed_rate_view = FeatureView(
    name="card_failed_rate_24h",
    entities=[card],
    ttl=timedelta(days=7),
    schema=[
        Field(name="failed_txn_rate_24h", dtype=Float32),
        Field(name="failed_txn_count_24h", dtype=Int64),
    ],
    source=card_txn_source,
    online=True,
)
# What Feast does when you run `feast apply` and `feast materialize`:
$ feast apply
Created entity card
Created feature view card_failed_rate_24h
Registry updated: features.db (SQLite)
$ feast materialize 2026-04-24T00:00:00 2026-04-25T00:00:00
Materializing feature view card_failed_rate_24h
Reading from FileSource: s3://razorpay-features/card_txn_24h.parquet
Writing to online store: redis://features.online.razorpay.internal:6379
Written 4,832,194 rows in 312s
Walk through what Feast actually does.

FileSource(path="s3://...") — Feast does not own the offline data; it points at a file you produced with your own pipeline. The Parquet file at that path was written by a Spark job you wrote, scheduled by Airflow you operate. Feast is reading, not writing, the offline data.

online=True — this flag toggles whether feast materialize will copy current values from the offline source into Redis. The materialisation step is a Python loop that reads Parquet rows and writes Redis keys — it is your serving-time consistency story, but the job that creates the Parquet in the first place is still entirely your responsibility.

feast materialize — runs as a Python process. For 4.8 million rows it took 312 seconds, single-threaded; at PhonePe scale (2 billion cards) you would shard this across workers manually, because Feast has no built-in distributed materialisation engine.

Why this matters: the bottleneck for Feast at scale is always the materialisation step, because it is single-process Python by default. Teams either write their own Spark/Flink job that mirrors Feast's logic and writes both Parquet and Redis in one pass — bypassing feast materialize entirely — or they outgrow Feast. The materialisation Python script is the friction point that pushes serious users to either Tecton or Hopsworks.
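There is no Feast API for that manual sharding; teams wrap `feast materialize` in their own partitioning logic. A minimal sketch of the key-partitioning step, assuming each worker is then invoked over its own disjoint slice of entities (the worker count and key format are illustrative):

```python
import hashlib

def shard_of(card_id: str, num_workers: int) -> int:
    """Stable hash-partition: the same card always maps to the same worker,
    so repeated materialisation runs touch consistent key ranges."""
    digest = hashlib.md5(card_id.encode()).hexdigest()
    return int(digest, 16) % num_workers

def partition_keys(card_ids, num_workers):
    """Split the full entity keyspace into disjoint per-worker slices.
    Each slice would then be fed to one `feast materialize` invocation."""
    shards = [[] for _ in range(num_workers)]
    for cid in card_ids:
        shards[shard_of(cid, num_workers)].append(cid)
    return shards
```

The hash must be stable across runs (hence md5 over the key, not Python's salted `hash()`), otherwise a card's feature rows would migrate between workers and a partially failed run could leave stale Redis entries behind.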
In Tecton, you define the feature in Tecton's DSL, and Tecton's control plane runs the job:
# fraud/features.py — Tecton feature definition
from datetime import datetime, timedelta

from tecton import batch_feature_view, Entity, FilteredSource
from tecton.types import Field, Float32, Int64

card = Entity(name="card", join_keys=["card_id"])

@batch_feature_view(
    # FilteredSource limits each run to the materialisation window,
    # so the SQL needs no hand-written time filter
    sources=[FilteredSource(card_txn_source)],
    entities=[card],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2024, 1, 1),
    batch_schedule=timedelta(hours=1),
    ttl=timedelta(days=7),
    schema=[Field("failed_txn_rate_24h", Float32),
            Field("failed_txn_count_24h", Int64)],
)
def card_failed_rate_24h(card_txn_source):
    return f"""
        SELECT card_id,
               SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed_txn_count_24h,
               SUM(CASE WHEN status = 'FAILED' THEN 1.0 ELSE 0.0 END)
                   / COUNT(*) AS failed_txn_rate_24h,
               MAX(event_ts) AS event_ts
        FROM {card_txn_source}
        GROUP BY card_id
    """
You run tecton apply against the Tecton control plane and it does everything else — provisions a Databricks Spark cluster (or its own Spark-on-Kubernetes), runs the SQL on the schedule, writes both an offline store (Iceberg or Delta) and an online store (DynamoDB), keeps the registry, exposes a Python SDK that returns features for training (point-in-time-correct join) or serving (sub-10-ms KV lookup). When the cluster fails, Tecton retries. When the schema drifts, Tecton catches it. When materialisation is slow, Tecton scales the cluster.
In Hopsworks, the shape is similar to Tecton but the runtime lives on your Kubernetes:
# fraud_features.py — Hopsworks feature definition
import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

card_fg = fs.create_feature_group(
    name="card_failed_rate_24h",
    version=1,
    primary_key=["card_id"],
    event_time="event_ts",
    online_enabled=True,      # writes to RonDB online store
    statistics_config=True,   # auto-collect statistics for drift detection
)

# Compute the feature in PySpark (running on the Hopsworks cluster)
df = spark.sql("""
    SELECT card_id,
           SUM(CASE WHEN status = 'FAILED' THEN 1 ELSE 0 END) AS failed_txn_count_24h,
           SUM(CASE WHEN status = 'FAILED' THEN 1.0 ELSE 0.0 END)
               / COUNT(*) AS failed_txn_rate_24h,
           MAX(event_ts) AS event_ts
    FROM card_txn_events
    WHERE event_ts >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY card_id
""")

card_fg.insert(df, write_options={"start_offline_materialization": True})
The card_fg.insert(df) call writes to both HopsFS (offline) and RonDB (online) atomically, with the offline write committed via Hudi or Iceberg copy-on-write, the online write via RonDB's NDB protocol. Why one call writes to both: Hopsworks's FeatureGroup.insert() opens a single transaction that the platform fans out to two sinks. Failure on either rolls back. This is the platform doing what a hand-rolled Flink "shared materialisation" job (covered in the previous chapter) does, but as a built-in. The cost: you have to be running the Hopsworks cluster, with its own scheduler, executor, and storage. The benefit: you don't write that Flink job yourself. Compare this with Feast's feast materialize (Python loop, single-machine) and Tecton's managed Spark cluster (vendor's infra). The same logical operation has three very different implementations.
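The fan-out-with-rollback pattern described above can be sketched in a few lines. This is a toy model of the guarantee's shape, not Hopsworks's implementation — the sink interface and the delete-based compensation are illustrative assumptions:

```python
class DualWriteError(Exception):
    """Raised when the online write fails and the offline write was rolled back."""

def insert_atomic(rows, offline_sink, online_sink):
    """Write rows to two sinks; if the second write fails, compensate the
    first so neither store ever exposes a partial insert."""
    written_keys = [offline_sink.write(row) for row in rows]
    try:
        for row in rows:
            online_sink.write(row)
    except Exception as exc:
        for key in written_keys:
            offline_sink.delete(key)   # roll back the offline half
        raise DualWriteError("online write failed; offline write rolled back") from exc
```

The point of the sketch is the asymmetry: the offline (copy-on-write table) side is the one that can be cheaply compensated, so it goes first; a production implementation also has to handle the crash between the two writes, which is where transaction logs and idempotent retries come in.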
What you give up to get each one
The pricing differences (₹0 / ₹70 lakh / ₹1.6 crore per year for a Meesho-scale workload) are not arbitrary. They reflect what the vendor is taking off your plate and how operationally invested they are.
Feast (₹0 SaaS, but ~₹40-80 lakh/year of platform-team effort). You get a feature registry, a clean SDK, point-in-time correctness for training (via get_historical_features). You give up: distributed materialisation, online-store operations, schema drift detection, monitoring. A typical Feast user at 50M-card scale runs a 3-engineer platform team to keep Feast healthy — building the Spark jobs that actually populate the offline store, the Flink jobs that maintain the online store, the alerts when materialisation drifts, the Helm charts for the Feast server. The total cost is real; it is just paid in headcount instead of SaaS bills. Meesho ran Feast for 18 months before migrating off because the platform-team cost was higher than Tecton's licence at their scale.
Tecton (₹1.6 crore SaaS for ~50M-entity scale, 0 platform engineers). You get end-to-end materialisation, monitoring, alerting, a managed online store, point-in-time-correct training joins, transformation freshness SLAs. You give up: control over which storage primitives are used (Tecton picks DynamoDB or its KV; you can't use your existing Redis), a hard dependency on the Tecton control plane (if their AWS account has an outage, your serving stops), and the SaaS bill. Tecton's pitch is straightforward — at scale, ₹1.6 crore/year is cheaper than three platform engineers (₹2-3 crore loaded) plus the operational risk.
Hopsworks (₹70 lakh-1.2 crore/year for the licence, plus your own ops). You get the integrated platform but on your VPC. You give up: the elasticity that managed SaaS provides — you size your HopsFS cluster, your RonDB cluster, your Spark cluster up-front and re-provision when traffic grows. Hopsworks wins when data sovereignty is a hard requirement: regulated lenders (Bajaj Finserv, IDFC First), insurance (HDFC Life), telecoms (Jio) where sending feature data to a US-based SaaS is non-negotiable. The licence covers vendor-supported software; the operational cost is yours.
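The trade-off in the three paragraphs above reduces to simple arithmetic. A sketch using this chapter's illustrative figures — the ₹0.8 crore loaded cost per platform engineer is an assumption consistent with the "₹2-3 crore for three engineers" range quoted above:

```python
def total_cost_cr(licence_cr: float, platform_engineers: int,
                  cost_per_engineer_cr: float = 0.8) -> float:
    """Total yearly cost in ₹ crore: licence plus loaded platform headcount.
    Operational risk (outages, pager load) is deliberately excluded."""
    return licence_cr + platform_engineers * cost_per_engineer_cr

feast_tco  = total_cost_cr(licence_cr=0.0, platform_engineers=3)  # ~₹2.4 cr
tecton_tco = total_cost_cr(licence_cr=1.6, platform_engineers=0)  # ₹1.6 cr
```

At a 1M-entity scale where one engineer part-time keeps Feast healthy, the same formula flips: `total_cost_cr(0.0, 1)` is ₹0.8 crore against Tecton's ₹1.6 crore licence, which is the "math inverts with scale" point made in the Common confusions section.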
The training-time API: where they diverge most
The single biggest API difference between the three is how a data scientist asks "give me the feature values that were valid at the moment each label was generated, for 50 million labels." This is the point-in-time-correct training join from the previous chapters.
# Feast — get_historical_features
from feast import FeatureStore

store = FeatureStore(repo_path=".")
training_df = store.get_historical_features(
    entity_df=label_df,  # has card_id and event_ts columns
    features=["card_failed_rate_24h:failed_txn_rate_24h",
              "card_failed_rate_24h:failed_txn_count_24h"],
).to_df()
# Under the hood: an AS OF JOIN pushed down to BigQuery / Snowflake / DuckDB.
# At 50M labels × 2 features, a modest BigQuery slot reservation runs this in ~14 minutes.
# Tecton — get_features_for_events
from tecton import get_workspace

ws = get_workspace("razorpay-prod")
training_df = ws.get_feature_view("card_failed_rate_24h") \
    .get_features_for_events(events=label_df).to_pandas()
# Under the hood: AS OF JOIN executed by Tecton's managed Spark cluster.
# At 50M labels, runs in ~7 minutes — Tecton has pre-bucketed the offline store.
# Hopsworks — point-in-time join via a spine (label) group
fg = fs.get_feature_group("card_failed_rate_24h", version=1)
spine = fs.get_or_create_spine_group(   # spine groups drive the AS OF join in hsfs
    name="fraud_labels", version=1,
    primary_key=["card_id"], event_time="event_ts",
    dataframe=label_df,
)
training_df = spine.select_all().join(fg.select_all()).read()
# Under the hood: AS OF JOIN executed by Hopsworks's PySpark on HopsFS.
# At 50M labels, runs in ~9 minutes on a 12-node cluster.
The latency numbers are real and they matter. At 50M labels, Tecton at 7 minutes vs Feast at 14 minutes is the difference between a data scientist running 8 experiments per day vs 4. Over a year of model development, that is a 2× productivity gap that in practice costs more than the Tecton licence. Why Tecton is faster: Tecton pre-buckets the offline store by entity_id and pre-sorts by event_ts, so the AS OF JOIN can skip 90% of partitions. Feast leaves bucketing to whatever you wrote your Parquet job to do — typically nothing. Hopsworks bucketing is configurable but not on by default. The performance gap is not a vendor mystery; it is a consequence of the offline-store engineering investment Tecton has made and Feast has not.
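What all three AS OF JOINs compute, stripped of the engines: for each label row, the most recent feature value whose timestamp does not exceed the label's timestamp. A minimal single-machine reference implementation — naive on purpose, since the per-entity scan over sorted history is exactly what pre-bucketing lets the real engines skip:

```python
from bisect import bisect_right

def as_of_join(labels, features):
    """Point-in-time-correct join.

    labels:   list of (entity_id, label_ts) pairs
    features: {entity_id: [(feature_ts, value), ...]} sorted by feature_ts
    Returns the feature value that was valid at each label's timestamp,
    or None if no feature row existed yet (no future leakage).
    """
    out = []
    for entity_id, label_ts in labels:
        history = features.get(entity_id, [])
        ts_list = [ts for ts, _ in history]
        i = bisect_right(ts_list, label_ts)   # count of rows with feature_ts <= label_ts
        out.append(history[i - 1][1] if i else None)
    return out
```

The `bisect_right` with `<=` semantics is the whole correctness story: taking the first row *after* the label timestamp would leak future information into training, which is the training-serving skew failure mode the previous chapters cover.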
Common confusions
- "Feast is the open-source version of Tecton." Feast was started by Gojek, then donated to the LF AI & Data Foundation; Tecton was founded by some of the same Uber engineers who built Michelangelo. They share intellectual lineage but have diverged architecturally — Feast is a thin metadata layer; Tecton is a full vertical platform. Treating Feast as "free Tecton" leads teams to expect features (managed materialisation, transformation engine) that Feast does not have.
- "Hopsworks is just a self-hosted Tecton." Hopsworks predates Tecton and was built around HopsFS, a research distributed filesystem out of KTH. Its feature store sits on top of that filesystem, while Tecton sits on top of cloud object stores and managed Spark. The on-prem positioning is a consequence, not a feature flag.
- "Pick the cheapest one because they all do the same thing." The total cost is licence + platform-team headcount + operational risk. Feast's ₹0 licence means a 3-engineer platform team; Tecton's ₹1.6 crore means 0. At small scale (1M entities, 5 features) the math favours Feast strongly; at PhonePe-scale it inverts.
- "You can migrate from one to another easily." Feature definitions are not portable. Feast's YAML, Tecton's Python DSL, Hopsworks's Python SDK look superficially similar but bind to different runtime semantics (windowing rules, late-data handling, schema-evolution rules). Meesho's migration from Feast to Tecton in 2025 took 6 months of platform-team time despite both being "Python-feature-store" products.
- "The online store determines the latency, so just pick one with good latency." All three can hit sub-10 ms p99 because they delegate to (or contain) Redis, DynamoDB, RonDB, or similar. The latency difference between vendors is rarely the deciding factor; the materialisation engine and offline-store performance is what changes the engineering cost most.
- "Open-source means I can self-host without a SaaS bill." Feast is open-source but the deployment still requires a Helm chart, a Kubernetes cluster, an offline store you run, an online store you run. Self-hosting is not the same as SaaS-with-no-vendor; it is "you-host-everything-yourself" and the operational load can exceed Tecton's licence.
Going deeper
Tecton's two-mode materialisation: batch and streaming as one DAG
Tecton's killer architectural feature is that a single feature definition can compile to both a batch Spark job (refresh every hour) and a streaming Flink job (refresh every second), with the platform handling backfill from batch and forward-fill from streaming. The customer writes one SQL definition; Tecton emits two physical pipelines that share state — at the feature-group level, the streaming job writes to the same online store as the batch job, and the batch job's offline-store writes serve as the historical record the streaming job's checkpoints can recover from. This is hard to build in Feast (you would write the Spark job and the Flink job separately and manage their consistency yourself) and Hopsworks supports it but with manual configuration. The lesson: when feature freshness varies across features (some 24-hour, some 30-second), the unified materialisation engine pays for itself.
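The shared-state idea above can be sketched as a freshness merge: the online store keeps whichever write — batch backfill or streaming forward-fill — carries the newer event timestamp. This is a toy model of the consistency rule, assuming last-event-timestamp-wins, not Tecton's actual engine:

```python
def merge_writes(store, writes):
    """store: {key: (event_ts, value)}. Apply writes from either the batch
    or the streaming pipeline in any arrival order; a write only lands if
    its event timestamp is newer than what the store already holds."""
    for key, event_ts, value in writes:
        current = store.get(key)
        if current is None or event_ts > current[0]:
            store[key] = (event_ts, value)
    return store
```

Under this rule a late-arriving batch backfill can never clobber a fresher streaming value, and a replayed streaming checkpoint is idempotent — the two properties that make running both pipelines against one online store safe.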
Hopsworks's as_of semantics and the timeline view
Hopsworks's offline store is built on Apache Hudi, which gives it true time-travel: any feature value at any timestamp in the past can be retrieved with fg.as_of(ts).read(). This matters during training-serving skew incidents — when the model performs differently in production than in training, you can replay the exact feature values the inference path would have seen at any historical timestamp, rather than reconstructing from logs. Tecton has a similar capability via the offline store's snapshot isolation; Feast leaves this to whatever your offline store does (BigQuery time travel is 7 days; Snowflake's is configurable; Iceberg's is whatever you set retention to). The "time-travel for training-serving skew investigation" use case is a strong argument for Hopsworks at regulated institutions where audit replay is a compliance requirement.
The real cost driver: who tunes the online store under load
Online-store latency is "good enough" by default in all three; what differs is who tunes it when traffic patterns change. At Razorpay, when Diwali week brought 4× normal traffic, the team using Tecton phoned a Tecton solutions engineer who scaled the DynamoDB tables and DAX nodes from a control panel; the team using Feast had a 3 a.m. on-call where the platform engineer manually re-sharded their Redis cluster. The on-call cost was real — over the year, the Feast team's pages-per-week ran 6× higher than the Tecton team's. Why this gap is structural: Tecton's SLA is contractual — they have a paid pager-rotation that cares about your store's tail latency. Feast's SLA is whatever your platform team can deliver. The pager difference is the visible part of an invisible difference in operational maturity that you only see under load.
Why neither Feast nor Hopsworks won the embedded-vector market
Vector embeddings as features are growing fast (recommendation models, LLM RAG, fraud anomaly scores). Tecton added vector support in 2025; Feast 0.40+ supports it via pluggable backends. Hopsworks supports it via a separate Vector Database product, not its main feature store. The architectural reason is that vector search is not a key-value lookup — it requires similarity indexes (HNSW, IVF) that don't fit naturally into the row-oriented online stores all three platforms started with. Tecton's response was to integrate Pinecone and Weaviate as backend options; Hopsworks built a separate vector index alongside RonDB. This is the next architectural frontier — feature stores that natively understand both scalar and vector features without bolt-ons.
Where this leads next
- /wiki/training-serving-skew-the-fundamental-ml-problem — the proof obligation that all three vendors claim to solve.
- /wiki/online-features-key-value-lookups-at-p99 — the online-store layer all three platforms abstract over.
- /wiki/offline-features-big-tables-point-in-time-correctness — the offline-store layer, where Tecton's pre-bucketing pays off.
- /wiki/streaming-features-and-feature-freshness — the streaming materialisation that Tecton handles natively and the others bolt on.
References
- Tecton, "Feature Store Architecture" — the canonical reference for the managed-vertical-platform model, including the two-mode materialisation engine.
- Feast, "Feast Architecture" — the registry-and-SDK model in the maintainers' own words.
- Hopsworks, "The Hopsworks Architecture" — design rationale for the integrated-on-prem platform, including HopsFS and RonDB.
- Bergman et al., "Hopsworks: A Feature Store for ML" — the VLDB-style paper covering the offline + online + streaming feature integration.
- Uber, "Michelangelo: Uber's ML Platform" — the original industry post that defined the offline + online split all three platforms now implement.
- Gojek, "Why we built Feast" — the origin story for Feast and the Gojek-scale problems that motivated it.
- Meesho engineering, "Migrating from Feast to a managed feature store" — a customer-side case study showing the cost trade-off in practice.
- /wiki/online-features-key-value-lookups-at-p99 — chapter 114; the online-store internals that all three platforms either own or delegate.