In short
The data warehouse and the data lakehouse are not rivals waiting for one to win. They are two stable points on a spectrum, and a serious data team picks a point on that spectrum based on six axes, not based on which vendor has the loudest conference talks.
The classical warehouse — Snowflake, BigQuery, Redshift — owns its storage format, owns its compute, owns its catalog, and ships everything as one vertically integrated product. You SSO in, you point Tableau at it, the dashboards work. Governance, audit, masking, role-based access, query history — all sit in one console. The price is twofold: storage costs roughly 5-10x what the same bytes cost on raw S3, and your data lives in a format only one vendor can read without paying an export fee.
The lakehouse — Databricks Lakehouse, or the open stack of Iceberg + Trino + Spark + Athena reading Parquet on S3 — flips both. Storage is cheap open-format Parquet on S3 (₹2 per GB-month, not ₹15). Many engines plug into the same files: Trino for interactive SQL, Spark for ML, Athena for ad-hoc, DuckDB on a laptop, even Snowflake or BigQuery as external readers. The price is operational: more YAML, more orchestration, weaker out-of-the-box governance, and a Tableau experience that requires more glue.
The decision criteria are concrete: data volume (under 10 TB either works, over 100 TB the lakehouse is dramatically cheaper), workload mix (pure SQL BI favours warehouse, ML + streaming + ad-hoc favours lakehouse), team skills (SQL-only analysts → warehouse, mixed data-engineering team → lakehouse), governance maturity (heavy regulatory → warehouse, internal/iterative → lakehouse), vendor risk tolerance, and cost sensitivity. Most early-stage Indian startups (5-50 person data teams) start on Snowflake or BigQuery for the ergonomics. Past about 100 TB, the cost arithmetic flips and many — Razorpay, PhonePe, Swiggy among them — move workloads onto open lakehouse. The convergence of 2024-2026 (Snowflake's Iceberg support, Apache Polaris, Databricks SQL warehouses) is making the choice less binary every quarter.
You have spent Build 16 designing the warehouse from the inside — star schemas, slowly changing dimensions, separation of storage and compute, the move to data lakes, table formats like Iceberg and Delta, and engines like Trino and Spark that read those open tables. Each chapter ended with the same observation: the lake gives you cheap storage and open formats, the warehouse gives you ergonomics and governance, and the industry is converging from both ends toward a middle called the lakehouse.
This closing chapter pulls back to the architectural decision itself. If you are starting a data platform in 2026, or rebuilding one, you face a real choice with real money attached. This chapter walks the two extremes, the trade-off matrix that sits between them, the concrete decision criteria, and a worked Indian SaaS example that shows how the same company lands in three different places at three different stages of growth.
The two extremes
The first thing to understand is that these are architectural patterns, not products. Snowflake is the most visible classical warehouse, but BigQuery and Redshift sit in roughly the same shape. Databricks is the most visible lakehouse, but a Trino + Iceberg + Spark stack on S3 (with no Databricks involved) implements the same pattern. The question is which shape fits your problem, not which logo goes on the slide.
The classical data warehouse
A classical warehouse — Snowflake, BigQuery, Redshift Provisioned — is a vertically integrated product. The vendor owns every layer: the storage format, the compute fleet, the query optimiser, the metadata catalog, the access control, the audit trail, the web UI, the BI connectors. You upload data, you write SQL, the rest is the vendor's problem.
The storage format is proprietary. Snowflake's micro-partitions are not Parquet (despite being column-oriented and compressed similarly); the on-disk layout, the encryption, the indexing sidecars are all Snowflake-internal. Redshift's columnar files are not Parquet either. BigQuery's Capacitor format is closed and undocumented. Why proprietary formats? Three reasons: the vendor can co-design storage with the engine for maximum throughput (BigQuery's Capacitor + Dremel co-design buys roughly 3x the scan throughput of generic Parquet readers); the vendor can ship features (time travel, zero-copy clones, cluster keys) that depend on storage internals; and — less charitable but real — proprietary formats create switching costs that protect revenue.
The strengths of this shape are real and worth naming honestly. Ergonomics: a Snowflake account can be productive on day one — load a CSV, run SQL, point Tableau at it, the dashboards render. There is no orchestrator to configure, no Iceberg catalog to deploy, no Spark cluster to size. Governance: role-based access control, row-level security, dynamic data masking, query history, audit trails, data classification — all of this ships in the product, not as a separate tool you bolt on. Performance predictability: the vendor co-designs storage and compute, so query latency is consistent and well-characterised; you do not get the long tail of slow queries that comes from a misconfigured Spark job hitting cold S3 partitions.
The costs are also real. Storage is effectively expensive: the headline rate on Snowflake is roughly $23/TB/month on AWS (similar to S3 Standard), but all your data has to be loaded into the proprietary format, so you cannot tier cold partitions onto cheaper storage classes and you cannot share files with other tools without paying egress + format-conversion costs. Compute is metered in vendor units (Snowflake credits, BigQuery slots) that are difficult to compare across vendors and that include the vendor's margin. Lock-in is structural: migrating off Snowflake to a lakehouse is a months-long project because the data format is closed.
The data lakehouse
A lakehouse is the inverse architecture. Storage is open-format Parquet on a cheap object store (S3, GCS, ABFS), wrapped by an open table format (Apache Iceberg, Delta Lake, or Apache Hudi) that adds ACID transactions, schema evolution, and time travel on top of the raw files. Compute is whatever you want: Trino for interactive SQL, Spark for batch ETL and ML, Athena or DuckDB for ad-hoc, Flink for streaming, Databricks Photon if you are buying that experience, or any combination of these reading the same tables.
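To make the shape concrete, here is a minimal sketch of the storage layer: PySpark writing an Iceberg table whose data files are ordinary Parquet on S3. The catalog name, the REST catalog endpoint (a Polaris-style service), the bucket, and the table names are all illustrative assumptions, and the Iceberg Spark runtime jar is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Illustrative config: an Iceberg catalog named "lake", served by a REST
# catalog (e.g. a Polaris deployment) and backed by Parquet files on S3.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.internal/api/catalog")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse")
    .getOrCreate()
)

# The table format (Iceberg) adds ACID commits, schema evolution and time
# travel; the bytes underneath stay plain Parquet that any engine can read.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.silver.events (
        event_id    STRING,
        customer_id STRING,
        event_type  STRING,
        event_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

spark.sql("""
    INSERT INTO lake.silver.events
    SELECT event_id, customer_id, event_type, event_ts
    FROM parquet.`s3://example-lake/raw/events/`
""")
```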
The phrase "lakehouse" was coined and popularised by Databricks in their CIDR 2021 paper Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, but the architecture predates the name — anyone running Hive on S3 + Parquet in 2017 was approximating it. What Iceberg, Delta and Hudi added (chapter 133) was the ACID + metadata layer that turned a data swamp into something a SQL engine could trust.
The strengths of this shape mirror the warehouse's weaknesses. Storage is cheap: raw S3 Standard is roughly ₹2 per GB-month (about $23/TB/month), and you can drop cold partitions to S3 Infrequent Access (₹0.40/GB-month) or Glacier (₹0.10/GB-month) without any vendor's permission. Open formats: your Parquet files are readable by every analytical engine that has shipped in the last decade. If Trino disappoints you, you swap in Spark. If Spark is too heavy for ad-hoc, you point DuckDB at the same files. Workload diversity: ML training reads Parquet directly; streaming jobs (Flink) write to Iceberg in real time; SQL BI runs on Trino against the same tables. No data movement, no copy, one source of truth.
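The swap-the-engine claim is easy to demonstrate. The sketch below reads the same Parquet data files from a laptop with DuckDB — no export, no reload. The path is illustrative; a production read would normally go through the Iceberg catalog (DuckDB's iceberg extension, or Trino) rather than globbing data files directly, which bypasses snapshot isolation.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")               # S3 access from DuckDB
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'ap-south-1';") # credentials come from the environment

# Ad-hoc scan straight over the lake's Parquet files (illustrative path).
rows = con.execute("""
    SELECT event_type, count(*) AS events
    FROM read_parquet('s3://example-lake/warehouse/silver/events/data/*.parquet')
    GROUP BY event_type
    ORDER BY events DESC
""").fetchall()
print(rows)
```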
The weaknesses are equally real and equally worth naming. Operational burden: you run the catalog (Polaris, Unity, Glue), you run the Trino cluster, you manage the Iceberg compactions, you orchestrate the Spark jobs. There is no one console to log into and no one vendor to call when a query is slow. Governance is younger: Unity Catalog (Databricks) and Apache Polaris (open-source, donated by Snowflake in 2024) are good and improving fast, but neither has the audit/RBAC/lineage maturity that Snowflake has spent a decade polishing. Ergonomics for non-technical users: a SQL-only analyst on Snowflake is productive in an hour; the same analyst on a Trino + Iceberg + Polaris stack hits more rough edges (slower autocomplete, less polished UI, occasional schema-evolution surprises).
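Some of that operational burden is visible as code. The sketch below runs two Iceberg table-maintenance procedures — small-file compaction and snapshot expiry — that on a lakehouse you schedule yourself from Airflow or cron, where a warehouse vendor does the equivalent housekeeping invisibly. It assumes the same illustrative "lake" catalog as the earlier sketch.

```python
from pyspark.sql import SparkSession

# Assumes a session already configured with the Iceberg "lake" catalog above.
spark = SparkSession.builder.getOrCreate()

# Compact small files toward ~512 MB targets so scans stay fast.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table   => 'silver.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire old snapshots so metadata and storage do not grow without bound.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table       => 'silver.events',
        older_than  => TIMESTAMP '2026-01-01 00:00:00',
        retain_last => 20
    )
""")
```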
The trade-off matrix
The two shapes above are not better or worse — they are different points on a spectrum, and where you should land depends on six axes. The diagram below maps them out.
A few of these axes deserve unpacking, because the popular discourse usually gets them slightly wrong.
Cost at scale is the headline argument for the lakehouse, and the arithmetic is real but only kicks in past a threshold. Below ~10 TB, both architectures cost roughly the same — Snowflake's per-second compute billing means a small workload pays only for what it runs, and the storage premium on a few terabytes is in the hundreds of dollars per month, not the tens of thousands. Above ~100 TB, the picture inverts dramatically: the same data on Iceberg + S3 + Trino costs roughly 30-40% of the equivalent Snowflake bill, mostly because cold partitions can be moved to S3 IA (5x cheaper) and compute clusters can be sized aggressively to the workload.
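A toy model makes the threshold visible. Every constant below is an illustrative assumption (rounded, list-price style), not a quote — the point is the shape of the curve, which roughly reproduces the parity-below-10-TB and 30-40%-above-100-TB claims above.

```python
def warehouse_monthly_usd(tb: float) -> float:
    storage = tb * 23.0              # proprietary-format storage, ~$23/TB-month
    compute = 2_000 + tb * 180       # metered credits grow with data scanned
    return storage + compute

def lakehouse_monthly_usd(tb: float) -> float:
    hot, cold = 0.3 * tb, 0.7 * tb   # assume 70% of partitions go cold
    storage = hot * 23.0 + cold * 4.0  # S3 Standard vs an infrequent-access tier
    compute = tb * 45                # right-sized Trino/Spark on EC2
    ops = 3_000                      # fixed: catalog, Airflow, monitoring, on-call share
    return storage + compute + ops

for tb in (1, 10, 100, 500):
    w, l = warehouse_monthly_usd(tb), lakehouse_monthly_usd(tb)
    print(f"{tb:>4} TB  warehouse ${w:>9,.0f}  lakehouse ${l:>9,.0f}  ratio {l / w:.2f}")
```

Under these assumptions the ratio sits near 0.9 at 10 TB and near 0.4 at 100 TB; change the constants and the threshold moves, but the crossover shape survives.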
Interoperability is harder to feel until you need it. The day a new use case arrives that is a poor fit for your current engine — say, an ML team wants to train a model directly on the raw events, and your SQL warehouse charges per scan — you discover whether your data is portable. On a lakehouse, the answer is: point Spark at the same Iceberg table, no movement, no copy. On a warehouse, the answer is: export to Parquet (paying egress), reload into S3, build a parallel pipeline. The cost of not having interoperability is invisible until the moment it is enormous.
Ergonomics is where the warehouse's marketing is genuinely earned. A team of five SQL analysts at a Series-A startup can get a Snowflake account, load their Postgres CDC, and have dashboards in Looker by Friday. The same team on a lakehouse spends the first month deploying Polaris, configuring Trino, learning Iceberg semantics, debugging schema-evolution edge cases — and produces nothing the business can see. Why the ergonomics gap is real and persistent: a vertically integrated product can co-design the UX across layers (the catalog UI knows about the query history, which knows about the warehouse sizing, which knows about the masking policy). An open stack assembled from Polaris + Trino + Tableau + Airflow has six teams of maintainers who do not coordinate, so the seams between products are visible. Convergence is happening — Databricks Workspace and Tabular's UI before its acquisition both shrank the gap — but the warehouse's "everything in one place" experience will likely stay 18-24 months ahead for some time yet.
Governance is the axis where the conventional wisdom is most out of date. As of 2026, Apache Polaris (the open catalog Snowflake donated in mid-2024) and Unity Catalog (Databricks, with limited open-source bits) both ship table-level RBAC, column masking, audit logs, and integration with major identity providers. They are not yet at parity with Snowflake's native governance, but they are closer than they were two years ago, and the gap is closing each quarter. The "warehouse wins governance" claim that was unambiguously true in 2022 is now more like "warehouse leads governance, lakehouse acceptable for most".
Workload diversity is the axis that decided the lakehouse's existence. A pure SQL BI workload fits a warehouse beautifully. The moment ML training, streaming ingestion, ad-hoc Python notebooks, vector search, or full-text search enter the workload mix, the warehouse model strains. Either you pay the warehouse vendor for ML add-ons (Snowpark, BigQuery ML) that are slower than native frameworks, or you copy data out to a separate ML platform — and now you have two stores and a sync pipeline. The lakehouse's one-storage-many-engines pattern fits a multi-modal data team natively.
Decision criteria
The matrix above maps the trade-offs; the criteria below help you weight them. There is no single right answer, but working through these questions gets most teams close to one.
Volume. Below ~10 TB, either architecture works fine; pick by team skills and ergonomics. Between 10-100 TB, the cost gap starts to matter; reach for the lakehouse if you have engineering bandwidth, stay on the warehouse if you do not. Above 100 TB, the lakehouse is dramatically cheaper, and its operational cost is amortised across enough storage that it becomes worth paying.
Workload mix. Pure SQL BI on a single mart? Warehouse. Mixed workload — SQL + ML training + streaming ingestion + ad-hoc data science notebooks + vector search for an LLM feature? Lakehouse. The deciding question is: how many distinct compute personalities does your data need to feed? One → warehouse. Three or more → lakehouse.
Team skills. SQL-only analysts and a small team (under 10 people)? Warehouse. The ergonomics premium is worth the cost premium. A team with at least 2-3 senior data engineers who know Spark, Airflow, container orchestration, and infrastructure-as-code? Lakehouse becomes viable. The lakehouse rewards engineering depth and punishes its absence.
Governance maturity. Heavy regulatory regime (PCI for a payments processor, SEBI rules for a brokerage, RBI rules for a bank) where you need iron-clad audit trails, column-level masking, and quarterly compliance attestations? Warehouse — the maturity gap matters here, and Snowflake's audit story is battle-tested. Internal/iterative work where the governance bar is "don't let interns see SSNs"? Lakehouse with Polaris is fine.
Vendor risk tolerance. A regulated industry that worries about a single-vendor outage taking down its analytics for a week, or a procurement team that requires open formats for compliance, or a CFO who has been bitten by a software-vendor price hike? Lakehouse with open formats — your data is portable by construction. A pre-IPO startup that wants to optimise for speed of execution and accepts the lock-in? Warehouse.
Cost sensitivity. A startup on VC runway spending $2K/month on Snowflake should not over-engineer a lakehouse. A growth-stage company spending $50K/month on Snowflake whose data team has the bandwidth to run open-source infrastructure should seriously model the migration. A late-stage company at $500K/month is almost certainly leaving money on the table by staying pure-warehouse.
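To make the weighting explicit, here is the same set of criteria distilled into a toy scoring function — a sketch for structuring the discussion, not a substitute for judgement. The thresholds mirror the rough numbers above and are assumptions, as are the three example profiles.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    data_tb: float              # total managed volume
    compute_personalities: int  # SQL BI, ML, streaming, ad-hoc, vector search...
    senior_data_engineers: int
    heavy_regulation: bool      # PCI / SEBI / RBI-grade audit requirements
    open_format_mandate: bool   # procurement or vendor-risk policy requires open formats
    warehouse_spend_usd_month: float

def recommend(p: Profile) -> str:
    pts = 0
    pts += 2 if p.data_tb > 100 else (1 if p.data_tb > 10 else 0)
    pts += 2 if p.compute_personalities >= 3 else 0
    pts += 1 if p.senior_data_engineers >= 3 else -1
    pts += -2 if p.heavy_regulation else 0
    pts += 2 if p.open_format_mandate else 0
    pts += 1 if p.warehouse_spend_usd_month > 50_000 else 0
    if pts >= 5:
        return "lakehouse (open formats, multiple engines)"
    if pts >= 3:
        return "hybrid (warehouse for BI, lakehouse tiers for ML/streaming)"
    return "warehouse (ergonomics first; revisit past ~100 TB)"

# Three example profiles (they foreshadow the worked example below):
print(recommend(Profile(1, 1, 0, False, False, 2_000)))      # -> warehouse
print(recommend(Profile(50, 3, 5, False, False, 20_000)))    # -> hybrid
print(recommend(Profile(500, 4, 8, False, False, 300_000)))  # -> lakehouse
```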
The convergence
The gap between the two architectures is shrinking from both sides. In late 2024, Snowflake announced general availability of Apache Iceberg tables, letting Snowflake's compute query Iceberg tables stored in your own S3 bucket — your storage, their engine, BYOC in reverse. Databricks acquired Tabular (the company founded by Iceberg's creators) in mid-2024 for $2 billion, and committed to interoperating Delta Lake and Iceberg via the Delta UniForm spec. BigQuery added BigLake tables that read Iceberg natively. Apache Polaris, donated by Snowflake to the Apache Software Foundation in 2024, is now a credible open alternative to Unity Catalog.
The line between warehouse and lakehouse is blurring in 2026 in two directions. Warehouses are opening their storage: you can put your data in open Iceberg format on your own S3 bucket and let Snowflake query it. Lakehouses are getting better ergonomics: Databricks SQL warehouses now feel almost identical to Snowflake virtual warehouses for SQL analysts; the Tableau and Power BI connectors are at parity. The decision in 2027 may be less "warehouse or lakehouse?" and more "which catalog (Polaris, Unity, Glue), which engine for which workload, and how much vendor management do I want to outsource?"
This convergence is good news. It means the trade-off matrix above will get less stark over time, the cost of an early decision will get smaller, and the pattern of starting on a warehouse and adding lakehouse capacity as you grow (or vice versa) will become more practical, not less.
An Indian SaaS company at three stages of growth
Consider Anvaya, a Bengaluru-headquartered B2B SaaS company selling workflow software to mid-market manufacturers. Anvaya's data architecture evolves over five years as the business grows. The story below is a composite drawn from several real companies — Razorpay, Zerodha, Postman, Freshworks — and is broadly representative of the path that most successful Indian SaaS companies take.
Year 1 — Seed-stage (1 TB, 5 analysts). Anvaya has 200 customers, 1 TB of data (events, CRM exports, Postgres CDC), and a five-person business team that needs dashboards. The data team is one person — a senior analyst who knows SQL and dbt but has never touched Spark. They sign up for Snowflake, set up Fivetran to load Postgres + HubSpot + Mixpanel into Snowflake, write dbt models, point Tableau at the warehouse. Total spend: ₹1.7 lakh/month (~$2K), of which ~₹1.3 lakh is Snowflake compute and storage, ~₹40K is Fivetran. Time from signup to first production dashboard: 4 days. Going lakehouse here would be over-engineering — the cost savings would be in the hundreds of dollars per month, and the operational burden would consume the one data person.
Year 3 — Series B (50 TB, 30 analysts + 5 data engineers + ML team). Anvaya has grown to 3,000 customers, 50 TB of data, and now serves three distinct workloads. BI is still on Snowflake (130 dashboards, 30 analysts, no plans to move). A new ML team (5 people) needs to train churn-prediction and anomaly-detection models on raw event streams — running this on Snowflake via Snowpark is technically possible but slow and expensive (roughly ₹15 lakh/month just for the ML compute). A new clickstream pipeline ingests 200 GB/day of mobile-app events that the BI team does not need but the ML team does. The decision: keep Snowflake for BI (it works, the analysts are productive, switching costs are real), add Databricks for ML on the same data — write the bronze and silver Parquet tiers to S3, let Databricks read them with Spark for training, and continue to push gold tables into Snowflake for BI. Total spend: ₹17 lakh/month (~$20K), split roughly 60/40 between Snowflake and Databricks. Hybrid architecture, two engines, one source of truth at the silver layer. The cost is up 10x from year 1 because data and team are 10x larger; the per-row cost is roughly flat.
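A sketch of what that hybrid looks like in code: bronze and silver tiers written as Iceberg-on-S3 (shared with the ML team), with only a gold aggregate pushed into Snowflake for the dashboards. Catalog, bucket, table and credential names are illustrative; the final write assumes the Spark-Snowflake connector is on the classpath, with option names as documented for that connector.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # Iceberg "lake" catalog configured as before

# Bronze: raw clickstream landed as-is (assumes the bronze table already exists).
raw = spark.read.json("s3://example-lake/landing/clickstream/2026-02-01/")
raw.writeTo("lake.bronze.clickstream").append()

# Silver: deduplicated and typed — the shared source of truth for ML and BI.
silver = (
    spark.table("lake.bronze.clickstream")
    .dropDuplicates(["event_id"])
    .withColumn("event_ts", F.to_timestamp("event_ts"))
)
silver.writeTo("lake.silver.clickstream").using("iceberg").createOrReplace()

# Gold: a small mart pushed into Snowflake so the analysts' dashboards keep working.
gold = silver.groupBy("customer_id", F.to_date("event_ts").alias("day")).count()
(
    gold.write.format("net.snowflake.spark.snowflake")
    .option("sfURL", "example.ap-south-1.snowflakecomputing.com")  # illustrative account
    .option("sfUser", "ETL_SVC")
    .option("sfPassword", "***")
    .option("sfDatabase", "ANALYTICS")
    .option("sfSchema", "GOLD")
    .option("sfWarehouse", "LOAD_WH")
    .option("dbtable", "DAILY_EVENTS_BY_CUSTOMER")
    .mode("overwrite")
    .save()
)
```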
Year 5 — Scale-up (500 TB, 100+ users, multiple teams). Anvaya is now a 1,200-person company with 30 customers in the Fortune 500, 500 TB of data, and four major data consumers: BI (Snowflake), ML (Databricks), product analytics (a Trino cluster the product team set up to run cohort queries), and a new agentic-AI feature that needs to run vector search over claim text (for an insurance vertical). Continuing on pure Snowflake for the BI portion would now cost roughly ₹2.5 crore/month (~$300K) at the published list price — roughly half of which is storage premium and Snowpark ML cost. The decision: migrate to a full open lakehouse on Iceberg. All raw data goes to S3 in Parquet wrapped by Iceberg. Trino serves interactive SQL and BI (Tableau migrated to use Trino as its source). Spark on Databricks runs ML and heavy ETL. Athena handles ad-hoc queries from the product team. Snowflake stays for the BI use cases that the analysts have not had bandwidth to migrate yet, but reads Iceberg tables externally rather than holding its own copy.
The breakdown after the migration: ₹68 lakh/month (~$80K) total — ₹15 lakh S3 storage, ₹25 lakh Databricks compute, ₹12 lakh Trino on EC2, ₹8 lakh Snowflake (much-reduced footprint), ₹8 lakh in glue infrastructure (Polaris catalog, Airflow, monitoring, on-call). That is a 10x increase in storage volume against a 4x increase in cost — the lakehouse's economic advantage made concrete. The price was roughly four months of two senior data engineers building the migration, and an ongoing operational burden of one full-time engineer keeping the platform healthy.
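On the serving side, interactive SQL at year 5 goes through Trino against the same Iceberg tables. A minimal sketch with the trino Python client is below; the host, catalog, schema and table names are illustrative.

```python
from trino.dbapi import connect

# Illustrative coordinator endpoint; the "iceberg" catalog maps to the lake's tables.
conn = connect(
    host="trino.data.example.internal",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="gold",
)
cur = conn.cursor()
cur.execute("""
    SELECT day, sum(events) AS events
    FROM daily_events_by_customer
    WHERE day >= DATE '2026-01-01'
    GROUP BY day
    ORDER BY day
""")
for day, events in cur.fetchall():
    print(day, events)
```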
The lesson is not "lakehouse always wins" — it is that the right architecture changed three times as the business grew. Starting on Snowflake was the right call at year 1 (the operational discipline a lakehouse requires would have killed the data function). Hybridising at year 3 was the right call (one architecture could not serve both BI and ML cost-effectively). Migrating to a full lakehouse at year 5 was the right call (the cost gap had grown to where the ROI of migration was overwhelming). A team that picked the year-5 architecture in year 1 would have shipped half the dashboards. A team that stayed on the year-1 architecture at year 5 would be paying ₹2 crore/month more than necessary.
Closing the build
You have walked Build 16 from the inside. Star schemas and slowly changing dimensions taught you the warehouse's modelling discipline. Separation of storage and compute showed you why Snowflake's bet reset cloud-warehouse economics. Data lakes on object storage showed you what happens when you take that bet to the limit and let any engine read any file. Iceberg and Delta closed the gap by adding ACID and schema evolution to the lake. Trino, Spark and Athena demonstrated the one-storage-many-engines pattern in action.
This chapter has been about the meta-decision sitting on top of all of that. The warehouse and the lakehouse are not enemies; they are two stable shapes a data platform can take, and the right one for your team depends on the volume of your data, the diversity of your workloads, the depth of your engineering team, the maturity of your governance requirements, and your tolerance for vendor risk. The economics tilt toward the lakehouse as you scale, the ergonomics tilt toward the warehouse if your team is small and SQL-first, and the convergence happening in 2024-2026 means the cost of choosing wrong is shrinking each year.
The next build (Build 17) leaves the analytical world entirely and starts on document databases — MongoDB, Couchbase, the operational store for an application that needs flexible schemas, JSON-native queries, and millisecond writes. Different problem, different shape, and a fresh set of trade-offs. Bring the matrix-thinking with you.
References
- Armbrust et al., Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics, CIDR 2021 — the Databricks paper that named and defined the lakehouse pattern.
- Dageville et al., The Snowflake Elastic Data Warehouse, SIGMOD 2016 — the canonical Snowflake architecture paper.
- Apache Software Foundation, Apache Iceberg specification — the open table format that underpins most modern lakehouses.
- Snowflake Engineering Blog, Iceberg Tables in Snowflake — the 2024 announcement of Snowflake reading external Iceberg tables, a key convergence milestone.
- Onehouse, Apache Hudi vs Delta Lake vs Apache Iceberg: A Comparison Guide — practical comparison of the three open table formats.
- Fivetran + dbt Labs, The Modern Data Stack — the warehouse-centric architecture that made Snowflake and BigQuery the default for early-stage companies.