In short
A data warehouse stores history and answers analytical questions — "revenue by month by region by product category for the last three years" — and its schema design follows different rules from the OLTP database that runs the cash registers. At the centre sits a fact table: one row per business event (one sale line, one page view, one ad impression), narrow, append-only, billions of rows, with foreign keys to dimensions and a handful of numeric measures (quantity, revenue, cost). Around it sit dimension tables: one row per entity (one product, one customer, one store, one date), wide with many descriptive columns, small (thousands to millions of rows), and denormalised — the product dimension carries the category name, the department name, the brand, the supplier name, all in one row. Drawn as a diagram this looks like a star, hence the name. The snowflake schema is the same idea except the dimensions are themselves normalised — the product dimension links to a category dimension, which links to a department dimension — saving storage but adding joins. Star almost always wins in modern warehouses because storage is cheap, BI tools generate cleaner SQL against star, and query optimisers (Snowflake, BigQuery, Redshift, Databricks) have specific star-join optimisations. The most consequential design decision is the grain of the fact table — what does one row represent? "One row per order line" vs "one row per order header" determines what questions you can answer and how big the table will be. This chapter opens Build 16 by deriving the star schema for an Indian fashion retailer (fact_sales, dim_product, dim_store, dim_customer, dim_date), shows the snowflake variant, walks the trade-off matrix, and previews slowly changing dimensions (full treatment in chapter 130).
You have spent the last fifteen builds learning how OLTP databases work — B-trees, MVCC, two-phase locking, write-ahead logging, replication, sharding. Every one of those mechanisms exists to serve a workload that looks the same: thousands of small transactions per second, each touching a few rows, each finishing in single-digit milliseconds. Insert an order. Update an inventory count. Read a user's profile. The queries are short, the rows are narrow, the schema is normalised to third normal form so that an update touches one place and one place only.
Now consider a different workload. The CFO of an Indian fashion retailer wants to know: "What was our revenue by month, by product category, by tier-1 city, for women's apparel, for the last three financial years?" That query touches every row of a multi-billion-row sales history, joins across five or six tables, aggregates the result, and runs maybe twice a week — once when the CFO checks the dashboard, once when the analyst rebuilds the slide.
Running this query against the OLTP database that powers the checkout flow is a disaster on multiple axes. It takes hours, it locks tables and slows down the live site, and the schema — designed to make INSERT INTO order_lines cheap — makes the analytical join slow and unreadable. The answer is to build a separate database, with a separate schema, optimised for the opposite workload. That database is the data warehouse, and its dominant schema is the star.
This chapter — opening Build 16 — derives the star schema from first principles, walks its snowflake variant, explains the all-important concept of grain, and previews slowly changing dimensions. By the end you should be able to look at a business question and sketch the warehouse schema that answers it on a napkin.
Two workloads, two schemas
Start by naming the two workloads precisely. The OLTP workload — Online Transaction Processing — is what runs the business in real time. Customer places an order; row goes into orders, rows go into order_lines, inventory counts decrement, payment row inserts. Reads are also transactional: fetch this customer's order history, fetch this product's price. Each query is short, each row is narrow, throughput is measured in transactions per second.
The OLAP workload — Online Analytical Processing — is what runs the analysis of the business. Aggregate, group, filter, slice, dice. Revenue by region by month. Cohort retention by signup month. Conversion funnel by traffic source. Each query is long-running (seconds to minutes), each query scans millions or billions of rows, throughput is measured in queries per hour and the metric that matters is latency to insight.
The schema rules that make OLTP fast actively hurt OLAP. Third normal form — every fact stored once, no redundancy — means that to answer "revenue by product category" you have to join order_lines → products → product_categories → product_departments, four tables, with order_lines alone running to billions of rows. Foreign-key indexes don't help because the query is a scan, not a lookup. Worse, the column you want to aggregate (revenue) is computed from quantity × unit_price, and the unit_price lives in yet another table — a price-history table — because it changes over time and your OLTP schema correctly tracks that history.
Why normalisation hurts analytics: 3NF optimises for write correctness — every fact lives in exactly one place, so an update is unambiguous. But analytical queries don't update; they aggregate. Aggregation needs the columns adjacent in one wide row so the engine can scan and reduce in a single pass. The work that 3NF saves on writes (no redundancy) is irrelevant when there are no transactional writes, only bulk loads.
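To make the contrast concrete, here is "revenue by product category" written twice — once against a hypothetical 3NF OLTP schema (the table names order_lines, products, product_categories, price_history are illustrative), and once against the star schema this chapter derives later. A sketch, not a benchmark:

-- 3NF / OLTP: walk the normalised chain, reconstruct the price that applied at order time
SELECT pc.category_name,
       SUM(ol.quantity * ph.unit_price) AS revenue   -- ignoring discounts for brevity
FROM order_lines ol
JOIN products p            ON ol.product_id = p.product_id
JOIN product_categories pc ON p.category_id = pc.category_id
JOIN price_history ph      ON ph.product_id = ol.product_id
                          AND ol.ordered_at >= ph.valid_from
                          AND ol.ordered_at <  ph.valid_to
GROUP BY pc.category_name;

-- Star: one join, the category already sits on the wide dimension row, revenue already on the fact
SELECT p.category,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY p.category;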
The warehouse schema flips the priority. Storage is cheap; analyst time is expensive; query latency drives engagement. Denormalise aggressively. Pre-join. Pre-compute. Make the rows wide and the joins few.
The star: one fact, many dimensions
The star schema, as Ralph Kimball formalised it in The Data Warehouse Toolkit, has exactly two kinds of table.
A fact table stores one row per business event. For a retailer, that event might be "one line on one sales receipt" — a customer bought 2 units of SKU KURTA-RED-M at STORE-MUM-005 on 2026-04-12 for ₹1,798. The fact row has:
- A handful of foreign keys, one per dimension (date_id, product_id, store_id, customer_id)
- A handful of measures — numeric columns you will aggregate (quantity, revenue, cost, discount)
- Nothing else. No descriptive text, no human-readable names, no derived attributes.
Fact tables are huge (often the only table in the warehouse with billions of rows), narrow (a dozen columns), and append-only (you never update a sales fact; you write a correction row).
A dimension table stores one row per entity — one product, one customer, one store, one calendar date. The dimension row has many wide descriptive columns: product name, brand, category, sub-category, department, supplier, colour, size, fabric, country of origin, price-tier, first-stocked-date. Dimension tables are small (thousands to a few million rows), wide (sometimes 50+ columns), and denormalised — the category name appears in every product row, the city name appears in every store row, and that is fine.
Drawn as an entity-relationship diagram, the fact table sits at the centre and the dimension tables surround it. Each dimension has exactly one foreign-key relationship to the fact. The result looks like a star.
The geometry matters. Every analytical query is now a join between the fact table and one or more dimensions — never between two dimensions. The query planner sees a small number of always-the-same-shape joins and can apply a star join optimisation: it filters the dimensions first (small tables, selective predicates), takes the resulting key sets — the matching product IDs, store IDs, date IDs — and uses them to scan only the matching slices of the fact table. Snowflake, BigQuery, Redshift, Databricks, and DuckDB all have specific code paths for this shape.
Why the shape matters to the planner: when the planner knows that one big table is being joined to several small ones via foreign keys, it can flip the join order from "fact ⋈ dim1 ⋈ dim2 ⋈ dim3" (which materialises huge intermediates) to "filter dim1, filter dim2, filter dim3, semi-join all three filters into fact" (which prunes the fact scan to maybe 1% of rows before any join). This is the star schema optimisation, and it doesn't fire on arbitrary normalised joins.
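Written out by hand, the star-join form of "revenue for women's apparel in tier-1 stores" looks like this — a sketch of what the optimiser does internally; you keep writing ordinary joins, and the dimension and column names are the ones built later in the chapter:

-- Filter each small dimension, then semi-join the surviving keys into the fact scan
SELECT SUM(f.revenue) AS revenue
FROM fact_sales f
WHERE f.product_id IN (SELECT product_id FROM dim_product WHERE department = 'Womens')
  AND f.store_id   IN (SELECT store_id   FROM dim_store   WHERE tier = 1);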
The grain decision: what does one row mean?
Before you write a single CREATE TABLE, you have to decide the grain of the fact table. Grain is Kimball's word for "what does one row in the fact table represent in the business?" Get this wrong and the warehouse will answer some questions perfectly and lie about others.
For a retailer there are at least three plausible grains:
- Order-line grain — one row per line on a sales receipt. If a customer buys 3 items in one transaction, that produces 3 fact rows. This is the most detailed grain. You can answer "revenue by product category" because each row has a product_id, and you can answer "average basket size" because rows from the same receipt share an order_id.
- Order-header grain — one row per receipt regardless of how many items. A 3-item purchase is one row with total_quantity = 3 and total_revenue = ₹1,798. You can answer "average basket value" trivially but you cannot answer "revenue by product category" because the row has no product.
- Daily-snapshot grain — one row per (store, date) with totals for that day. Tiny table, fast queries about store performance, but you cannot answer anything per-product or per-customer.
The rule: choose the most detailed grain you can afford to store. Coarser grains can always be derived by aggregation; finer grains can never be recovered after the fact. For a retailer, that almost always means order-line grain. For an ad-tech firm it means one row per impression. For a streaming service it means one row per "play event".
The most common modelling mistake is mixing grains in one fact table — some rows representing line items, others representing daily totals — usually because someone added a "summary" row to make a particular dashboard fast. Now every query has to filter on a row_type column to avoid double-counting, and analysts spend half their lives debugging it. One fact table, one grain. If you need a daily summary, build a separate fact table at daily-snapshot grain, derived from the line-grain table by an aggregation job.
Why mixing grains is so bad: aggregation is associative when every row contributes once. The moment some rows are sums of others, SUM(revenue) no longer means "total revenue" — it means "total revenue plus the daily sums plus the monthly sums plus the annual sums". Every query needs a grain filter, and forgetting the filter silently inflates every number on the dashboard. There is no warning, no error, just wrong answers presented confidently.
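If a dashboard genuinely needs the daily number, derive it as a separate fact table at its own grain rather than mixing grains. A minimal sketch of that aggregation job (the table name fact_sales_daily is illustrative; materialisation and refresh syntax vary by engine):

-- One row per (store, date), rebuilt or incrementally refreshed from the line-grain table
CREATE TABLE fact_sales_daily AS
SELECT date_id,
       store_id,
       SUM(quantity) AS total_quantity,
       SUM(revenue)  AS total_revenue,
       SUM(cost)     AS total_cost
FROM fact_sales
GROUP BY date_id, store_id;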
Snowflake: normalise the dimensions
The snowflake schema starts with the same fact table at the centre but normalises the dimensions. The product dimension no longer carries the category name and the department name in every row; instead it carries a category_id foreign key into a dim_product_category table, which carries a department_id foreign key into dim_product_department. Drawn as a diagram this looks like a snowflake — the central fact still has its star points, but each point fans out into smaller crystals.
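In DDL, the snowflaked product hierarchy looks roughly like this (a sketch; the table and column names are illustrative, and the star-schema version of dim_product appears later in the chapter):

CREATE TABLE dim_product_department (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(40)                 -- 'Womens', 'Mens', 'Kids'
);

CREATE TABLE dim_product_category (
    category_id     INT PRIMARY KEY,
    category_name   VARCHAR(40),                -- 'Kurtas', 'T-Shirts', 'Sarees'
    department_id   INT NOT NULL REFERENCES dim_product_department
);

CREATE TABLE dim_product (
    product_id      INT PRIMARY KEY,
    business_sku    VARCHAR(40) NOT NULL,
    name            VARCHAR(200),
    brand           VARCHAR(80),
    category_id     INT NOT NULL REFERENCES dim_product_category  -- replaces the inline category/department columns
);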
Snowflake saves storage. If you have 50,000 products in 200 categories in 10 departments, the star schema's dim_product repeats each category name 250 times on average and each department name 5,000 times. The snowflake stores each category name once and each department name once. For a deep hierarchy (HR organisations, geographic hierarchies with country → state → district → city → ward), the savings are real.
It also makes some kinds of update easier. If "Women's Apparel" gets renamed to "Women's Fashion", the snowflake updates one row in dim_product_department; the star schema updates 5,000 rows in dim_product.
But the cost is paid on every analytical query. The query "revenue by department for tier-1 cities" was a 3-way join in the star (fact ⋈ dim_product ⋈ dim_store); in the snowflake it is a 5-way join (fact ⋈ dim_product ⋈ dim_product_category ⋈ dim_product_department ⋈ dim_store). Modern engines plan the extra joins efficiently, but the SQL is harder to read, the BI tool's auto-generated joins get more brittle, and the star-join optimisation may or may not fire across the chain.
The trade-off matrix
The decision between star and snowflake comes down to a small set of trade-offs. The comparison below makes them concrete.

- Storage: the star repeats hierarchy values (category names, department names) in every dimension row; the snowflake stores each value once. Snowflake wins, but the absolute savings are small because dimensions are small.
- Joins per analytical query: the star joins fact to dimension and stops; the snowflake adds one join per hierarchy level. Star wins.
- SQL readability and BI tool support: BI tools generate clean SQL against a star; snowflaked hierarchies need explicit modelling and break more easily. Star wins.
- Dimension updates: renaming a category touches one row in the snowflake and thousands of rows in the star. Snowflake wins.
- Optimiser behaviour: the star-join optimisation fires reliably against a star; across a snowflake chain it may not. Star wins.
Why does star win in modern warehouses? Three reasons.
First, storage is cheap. Snowflake (the company) charges for storage at roughly $23 per TB per month; an extra terabyte from denormalised dimensions might add $300 a year. The same terabyte saved by snowflaking saves $300 a year and costs an analyst an hour a week debugging join chains — analyst hours are worth orders of magnitude more.
Second, query optimisers handle star joins specifically well. The star-join optimisation — push selective filters into the dimensions, semi-join the resulting key sets back into the fact, scan only the matching fact slices — is a deeply optimised path in every commercial warehouse. The same optimisation applied across a snowflake chain has more degrees of freedom and may pick a worse plan.
Third, BI tools assume star. Tableau, Power BI, Looker, Metabase — they all generate SQL by joining a fact to a dimension on a single foreign key. Add a snowflake hierarchy and the BI tool either generates the wrong join or asks the modeller to set up the hierarchy explicitly, which is often skipped. Star matches the BI tool's mental model.
Snowflake still has its place. Use it when a dimension is genuinely huge with deep hierarchies (geography), when an attribute changes often and you want one-place updates (organisational hierarchy), or when storage cost dominates because the dimension is an order of magnitude larger than usual.
Slowly changing dimensions: a preview
A dimension is "slowly changing" when its attributes change over time but you care about the history. A customer moves from Bengaluru to Mumbai. A product gets reclassified from "Casual Wear" to "Athleisure". A store gets reformatted from "small format" to "flagship". The fact rows from before the change should still join to the old attributes; the fact rows from after should join to the new.
There are three classical handling patterns, named SCD Type 1 / 2 / 3 (chapter 130, on slowly changing dimensions, covers them in detail).
- SCD Type 1: overwrite. The customer's city becomes Mumbai everywhere. History is lost. Use for corrections, not for genuine changes.
- SCD Type 2: add a new row. The customer gets a new customer_id (or a new surrogate key with the same business_id), with valid_from and valid_to columns. Old fact rows continue to point to the old surrogate key; new fact rows point to the new one. History is preserved exactly.
- SCD Type 3: add a column. The customer table gains a previous_city column. You can answer "current city" and "one previous city", nothing further back. Rare but useful for known two-version splits.
The default in modern warehouses is Type 2. Storage is cheap, auditability matters, and analyst questions like "what was this customer's segment at the time of the order" require Type 2 behaviour.
Designing the schema for a fashion retailer
A mid-sized Indian fashion retailer with 200 stores across 15 cities, 50,000 SKUs, and 5 million customers wants a warehouse that answers questions like "monthly revenue by product category, by city tier, with festival vs non-festival breakdown".
Step 1 — fix the grain. One row per line item on a sales receipt. A 3-item purchase produces 3 fact rows. This is the finest grain that's reasonably tractable; everything else can be aggregated up.
Step 2 — fact table.
CREATE TABLE fact_sales (
sale_line_id BIGINT PRIMARY KEY, -- surrogate
date_id INT NOT NULL, -- FK -> dim_date
product_id INT NOT NULL, -- FK -> dim_product (SCD Type 2)
store_id INT NOT NULL, -- FK -> dim_store
customer_id BIGINT, -- FK -> dim_customer (nullable: walk-ins)
-- measures
quantity INT NOT NULL,
unit_price NUMERIC(10,2) NOT NULL,
revenue NUMERIC(12,2) NOT NULL, -- quantity * unit_price - discount
cost NUMERIC(12,2) NOT NULL,
discount NUMERIC(10,2) NOT NULL DEFAULT 0
)
PARTITION BY RANGE (date_id); -- monthly partitions
Note: unit_price is denormalised into the fact even though it's "really" a property of (product, date) because storing it once per fact row makes price-history queries trivial and the column is only 8 bytes.
Step 3 — dimensions.
CREATE TABLE dim_date (
date_id INT PRIMARY KEY, -- e.g. 20260412
date DATE NOT NULL,
day_of_week SMALLINT,
month SMALLINT,
month_name VARCHAR(12),
quarter SMALLINT,
year SMALLINT,
fiscal_year SMALLINT, -- April-March in India
fiscal_quarter SMALLINT,
is_festival BOOLEAN,
festival_name VARCHAR(40), -- 'Diwali', 'Holi', 'Onam', NULL
is_weekend BOOLEAN
);
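dim_date is normally generated once by a script rather than loaded from a source system. A sketch in PostgreSQL syntax (the date range and the fiscal-quarter arithmetic are assumptions; festival flags would be patched in afterwards from a holiday reference table):

INSERT INTO dim_date (date_id, date, day_of_week, month, month_name, quarter, year,
                      fiscal_year, fiscal_quarter, is_festival, festival_name, is_weekend)
SELECT CAST(to_char(d, 'YYYYMMDD') AS INT),               -- e.g. 20260412
       d,
       EXTRACT(ISODOW FROM d),                            -- 1 = Monday ... 7 = Sunday
       EXTRACT(MONTH FROM d),
       to_char(d, 'FMMonth'),
       EXTRACT(QUARTER FROM d),
       EXTRACT(YEAR FROM d),
       CASE WHEN EXTRACT(MONTH FROM d) >= 4               -- Indian fiscal year runs April-March
            THEN EXTRACT(YEAR FROM d) ELSE EXTRACT(YEAR FROM d) - 1 END,
       ((EXTRACT(MONTH FROM d)::INT + 8) % 12) / 3 + 1,   -- Apr-Jun = Q1 ... Jan-Mar = Q4
       FALSE, NULL,                                       -- festival columns filled by a later update
       EXTRACT(ISODOW FROM d) IN (6, 7)
FROM generate_series(DATE '2020-04-01', DATE '2030-03-31', INTERVAL '1 day') AS d;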
CREATE TABLE dim_product (
product_id INT PRIMARY KEY, -- surrogate (Type 2)
business_sku VARCHAR(40) NOT NULL, -- stable across versions
name VARCHAR(200),
brand VARCHAR(80),
category VARCHAR(40), -- 'Kurtas', 'T-Shirts', 'Sarees'
sub_category VARCHAR(40),
department VARCHAR(40), -- 'Womens', 'Mens', 'Kids'
fabric VARCHAR(40),
colour VARCHAR(40),
size VARCHAR(10),
supplier_name VARCHAR(120),
valid_from DATE NOT NULL,
valid_to DATE NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN
);
CREATE TABLE dim_store (
store_id INT PRIMARY KEY,
name VARCHAR(120),
city VARCHAR(60),
state VARCHAR(60),
tier SMALLINT, -- 1 = Mumbai/Delhi/Bengaluru, 2 = Pune/Jaipur, 3 = rest
region VARCHAR(20), -- 'North', 'South', 'East', 'West'
format VARCHAR(20), -- 'flagship', 'standard', 'kiosk'
opened_date DATE
);
CREATE TABLE dim_customer (
customer_id BIGINT PRIMARY KEY,
name VARCHAR(200),
gender CHAR(1),
age_band VARCHAR(10), -- '18-24', '25-34', etc
city VARCHAR(60),
state VARCHAR(60),
pin_code CHAR(6),
segment VARCHAR(20), -- 'new', 'regular', 'vip'
signup_date DATE
);
Note dim_product has valid_from / valid_to / is_current for SCD Type 2. When a kurta gets reclassified from 'Casual' to 'Ethnic', a new product_id is minted with the new category and valid_from = today; the old row gets valid_to = yesterday and is_current = false. Old fact rows continue to point to the old product_id and report the old category; new fact rows point to the new product_id and report the new category. History is preserved exactly.
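The maintenance step when that reclassification lands is two statements (a sketch; real load jobs usually wrap this in a MERGE, and the new surrogate key would come from a sequence rather than a literal):

-- Close off the current version of the product
UPDATE dim_product
SET valid_to = DATE '2026-04-11', is_current = FALSE
WHERE business_sku = 'KURTA-RED-M' AND is_current = TRUE;

-- Mint a new surrogate key carrying the new category
INSERT INTO dim_product (product_id, business_sku, name, brand, category, sub_category,
                         department, fabric, colour, size, supplier_name,
                         valid_from, valid_to, is_current)
SELECT 500123,                              -- hypothetical new surrogate key
       business_sku, name, brand,
       'Ethnic',                            -- the changed attribute
       sub_category, department, fabric, colour, size, supplier_name,
       DATE '2026-04-12', DATE '9999-12-31', TRUE
FROM dim_product
WHERE business_sku = 'KURTA-RED-M' AND valid_to = DATE '2026-04-11';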
Step 4 — the analytical query. "Revenue by month for women's apparel in tier-1 cities, financial year 2025-26."
SELECT d.fiscal_year,
d.month_name,
SUM(f.revenue) AS revenue,
SUM(f.quantity) AS units,
SUM(f.revenue - f.cost) AS gross_margin
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_store s ON f.store_id = s.store_id
WHERE d.fiscal_year = 2025
AND p.department = 'Womens'
AND s.tier = 1
GROUP BY d.fiscal_year, d.month, d.month_name
ORDER BY d.month;
This is the query that justifies the entire schema. Three joins, all fact-to-dim, all on integer foreign keys; three predicates, all on small dimension columns; one GROUP BY and a handful of SUMs. The query optimiser filters the dimensions first (dim_product has maybe 5,000 women's-apparel rows, dim_store maybe 30 tier-1 rows, dim_date roughly 365 date rows for FY 2025), builds the qualifying key sets, and uses them to scan only the relevant slices of fact_sales. On a partitioned columnar warehouse this returns in seconds even if fact_sales has 5 billion rows.
The same query against the snowflake variant (with dim_product → dim_product_category → dim_product_department) becomes a 5-way join. It still works, the planner still figures it out, but the SQL is one line longer and the BI tool's auto-generated version is one bug closer.
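For comparison, the same question against the snowflaked product hierarchy sketched earlier in the chapter (dim_product_category and dim_product_department are those illustrative tables):

SELECT d.fiscal_year,
       d.month_name,
       SUM(f.revenue) AS revenue,
       SUM(f.quantity) AS units,
       SUM(f.revenue - f.cost) AS gross_margin
FROM fact_sales f
JOIN dim_date d                ON f.date_id = d.date_id
JOIN dim_product p             ON f.product_id = p.product_id
JOIN dim_product_category pc   ON p.category_id = pc.category_id
JOIN dim_product_department pd ON pc.department_id = pd.department_id
JOIN dim_store s               ON f.store_id = s.store_id
WHERE d.fiscal_year = 2025
  AND pd.department_name = 'Womens'
  AND s.tier = 1
GROUP BY d.fiscal_year, d.month, d.month_name
ORDER BY d.month;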
What you carry into the rest of Build 16
Build 16 covers the rest of the warehouse stack: how massively-parallel-processing engines (Snowflake, Redshift, BigQuery, Databricks) execute queries against star schemas at petabyte scale, how separation of storage and compute changed the cost model, what columnar storage formats (Parquet, ORC, Delta) look like under the hood, and how the lakehouse pattern unifies warehouse and data lake.
All of that machinery exists to make star-schema queries fast. The schema you draw on the napkin in the first conversation with the analytics team — the fact table at the centre, the dimensions around it, the grain pinned down explicitly — is what every layer of the warehouse stack is engineered to serve. Get the schema right and the rest of the build is engineering. Get the schema wrong — wrong grain, mixed grains, missing dimension, accidentally normalised — and no amount of compute will save you.
The next chapter, on MPP architecture and the shuffle, shows what actually happens when the warehouse engine receives that 3-way star join. The chapter after that, on slowly changing dimensions, fills in the SCD Type 1/2/3 patterns sketched here.
Going deeper
A few topics that go beyond the basics — useful when you graduate from "I can sketch a star" to "I am running the warehouse for a billion-row business".
Conformed dimensions and the bus matrix
When a business has multiple fact tables — fact_sales, fact_inventory, fact_returns, fact_marketing_spend — the dimensions shared across them must use the same surrogate keys, the same attribute definitions, and the same SCD policy. Kimball calls these conformed dimensions, and he calls the master plan that lays out which dimensions appear in which fact tables the bus matrix. Without conformed dimensions you cannot join fact_sales to fact_returns on product_id, because the two tables mean different things by product_id. The bus matrix is usually a literal grid on a wall in the analytics team's room: rows are fact tables, columns are dimensions, an X marks "this dimension applies to this fact". It is the cheapest, most powerful planning tool in the discipline.
Factless fact tables
Sometimes you want to record an event that has no measure — a student attended class, a customer browsed a product, a coupon was issued (without yet being redeemed). The fact table has only foreign keys, no measures. These are called factless fact tables. Counting rows answers "how many events" and joining to dimensions answers "for which combinations". A factless fact_browse(date_id, customer_id, product_id) lets you compute conversion rate as count(distinct fact_sales) / count(distinct fact_browse) for any product slice.
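A sketch of that calculation, assuming a hypothetical fact_browse at (date, customer, product) grain and defining conversion per browsing customer rather than per event:

-- Browse-to-buy conversion by product category
SELECT p.category,
       1.0 * COUNT(DISTINCT s.customer_id) / COUNT(DISTINCT b.customer_id) AS conversion_rate
FROM fact_browse b
JOIN dim_product p     ON b.product_id = p.product_id
LEFT JOIN fact_sales s ON s.customer_id = b.customer_id
                      AND s.product_id  = b.product_id
GROUP BY p.category;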
Degenerate dimensions
Some attributes belong on the fact row but don't have a corresponding dimension table — order numbers, invoice numbers, transaction IDs. They identify the source transaction but have no descriptive attributes worth their own table. Kimball calls these degenerate dimensions: store the value directly on the fact row, no foreign key, no dim table. The order number lets you reconstruct the receipt without a join.
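On fact_sales that would be one extra column with no dimension table behind it (the column name and receipt-number format here are illustrative):

ALTER TABLE fact_sales ADD COLUMN receipt_number VARCHAR(30);   -- degenerate dimension

-- Reconstruct a receipt with no joins at all
SELECT product_id, quantity, unit_price, discount, revenue
FROM fact_sales
WHERE receipt_number = 'MUM-005-20260412-000142';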
When the Inmon view wins
Bill Inmon, the other founding figure of warehousing, advocated a different architecture: load source data into a normalised enterprise data warehouse (3NF, single source of truth) and build data marts (star schemas) on top for specific analytical use cases. The Inmon approach is heavier upfront but easier to govern in regulated industries (banking, insurance) where data lineage and definitional consistency across departments are auditable requirements. Modern lakehouses (Databricks, Snowflake) blur the distinction by letting you have both a normalised silver layer and denormalised gold-layer marts in the same engine — the Inmon vs Kimball debate is now more about modelling discipline than tooling choice.
Wide tables and the "one big table" pattern
Some modern teams skip dimension joins entirely and pre-compute one wide denormalised table — fact_sales_enriched with every dimension attribute already joined in, hundreds of columns, billions of rows. Columnar storage makes this cheap (you only read the columns the query touches), and the result is even faster than a star because there are zero joins. The trade-off is that any dimension change forces a rebuild of the wide table, and the pre-joined columns are duplicated storage that most queries never touch. dbt teams in particular sometimes ship "one big table" marts as the final layer; they're a star schema collapsed to its limit.
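A sketch of that pattern against this chapter's schema (materialisation syntax varies by engine; in a dbt project this would be a model in the marts layer):

-- One big table: every dimension attribute pre-joined onto the fact
CREATE TABLE fact_sales_enriched AS
SELECT f.*,
       d.date, d.month_name, d.fiscal_year, d.is_festival,
       p.name AS product_name, p.brand, p.category, p.department,
       s.city AS store_city, s.tier, s.region, s.format,
       c.segment, c.age_band
FROM fact_sales f
JOIN dim_date d          ON f.date_id = d.date_id
JOIN dim_product p       ON f.product_id = p.product_id
JOIN dim_store s         ON f.store_id = s.store_id
LEFT JOIN dim_customer c ON f.customer_id = c.customer_id;   -- LEFT JOIN: customer_id is nullable (walk-ins)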
References
- Kimball, R. & Ross, M. (2013). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition. Wiley. The canonical reference for star schema, conformed dimensions, slowly changing dimensions, and the bus matrix.
- Microsoft Learn. Star schema and how it improves Power BI performance. Practical guidance from the Power BI team on why BI tools assume star.
- Microsoft Learn. Understand star schema and the importance for Power BI. Companion piece covering the snowflake variant and when to use it.
- Holistics. Kimball vs Inmon vs Data Vault. A modern survey of warehouse modelling philosophies.
- dbt Labs. How we structure our dbt projects. Modern data-modelling practice including the staging → intermediate → marts pattern that produces star schemas at the gold layer.
- Snowflake. Star schema benchmark and join optimization. Evidence that modern warehouse engines have specific star-join code paths.