In short

Rule-based rewrites prune the plan space using pure relational-algebra equivalences: predicate pushdown, projection pushdown, join associativity and commutativity. They are safe and statistics-free. What they leave behind is a family of equivalent plans — same rows, same columns, different shapes — and the optimiser still has to pick one. Cost-based optimisation does that by assigning each candidate plan a numerical cost (usually a weighted sum of page I/O, CPU comparisons, and network bytes), enumerating plans by dynamic programming over subsets of the input tables (Selinger's 1979 System R algorithm, still the default in PostgreSQL), and keeping the cheapest. The enumeration is tractable; the scoring is hard, because cost depends on how many rows each operator produces, and nobody knows those numbers ahead of time.

That is the job of cardinality estimation, and it is the fragile heart of every relational database ever built. The optimiser relies on three pillars: (a) table and column statistics — row counts, number of distinct values (NDV), min, max, null fraction, histograms, most-common-value lists; (b) selectivity formulas — equality 1/NDV, range via histogram buckets, conjuncts under an independence assumption; (c) join-size estimation — |R ⋈ S| ≈ |R|·|S| / max(NDV_R(k), NDV_S(k)). Every one of these is a rough approximation, and a 10× error at one level compounds to a 100× error at the next. By level five of a join tree, your estimate can be off by 10⁵, and the optimiser picks a plan that runs for six hours instead of six seconds.

Modern engines cope using three techniques: multi-column and extended statistics to capture correlated columns; runtime sampling during planning or execution; and adaptive query execution that re-plans mid-query when the actual row counts diverge from the estimates. Learned cardinality estimators using deep neural networks exist in research (Naru, MSCN, NeuroCard) but are not yet mainstream. The Leis et al. VLDB 2015 benchmark showed that all major optimisers — open-source and commercial — pick plans with >100× cost in roughly 10% of real-world queries — not because the algorithms are bad, but because cardinality is genuinely hard.

The query that exposes the stakes

You have three tables in a retail warehouse — users with 1 million rows, orders with 100 million rows, and line_items with 1 billion rows. The query:

SELECT u.name, l.product_id, l.qty
FROM users u
JOIN orders o ON o.user_id = u.id
JOIN line_items l ON l.order_id = o.id
WHERE u.country = 'IN';

After the rule-based layer has pushed the filter and exposed join commutativity and associativity, at least three physical plans remain. Each joins the three tables in a different order.

Plan A — bad order: users ⋈ line_items ⋈ orders. The optimiser joins users to line_items first. There is no direct join predicate between them (the schema's only link is via orders), so this is effectively a Cartesian product filtered at the next level. Intermediate size: |users| × |line_items| = 10⁶ × 10⁹ = 10¹⁵ rows. A thousand trillion. Your cluster doesn't finish.

Plan B — mediocre order: users ⋈ orders ⋈ line_items. After filtering, σ(users) has ~250,000 Indian users. σ(users) ⋈ orders is ~25 million rows (100 orders per user on average: 10⁸ orders over 10⁶ users). Joining that to line_items (each order has ~10 line items) gives ~250 million rows — the final answer.

Plan C — good order: line_items ⋈ orders ⋈ users. Build hash tables on the small sides at each level; stream the big side through. Every intermediate stays at output size or smaller; no blow-up.

Plans B and C are both "correct" in the rule-based sense; both answer the same SQL. Plan A is also correct. The cost difference between A and B is roughly seven orders of magnitude (a 10¹⁵-row intermediate versus ~2.5 × 10⁷). Pick wrong and your query runs ten million times longer. This is not a contrived example — it is the default shape of most analytical queries, and getting the join order right is almost entirely a function of cardinality estimation.
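The intermediate sizes above are simple arithmetic; a quick sketch, using the example's table sizes and the stated assumptions of ~100 orders per user and ~10 line items per order:

```python
USERS, ORDERS, LINE_ITEMS = 10**6, 10**8, 10**9
SEL_IN = 0.25                                  # fraction of users with country = 'IN'

# Plan A: users x line_items has no join predicate -- a Cartesian product.
plan_a_intermediate = USERS * LINE_ITEMS       # 10**15 rows

# Plan B: filter users, join orders, then line_items.
filtered_users = int(USERS * SEL_IN)           # 250,000
b_level1 = filtered_users * (ORDERS // USERS)  # 25,000,000 rows after the first join
b_level2 = b_level1 * (LINE_ITEMS // ORDERS)   # 250,000,000 rows -- the final answer
```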

What a cost model actually computes

Every physical operator has a cost formula. You already derived them in the previous chapters:

operator          I/O cost                               CPU cost
sequential scan   B_R                                    |R|
index scan        h + |leaf pages touched| + matches     log|R| + matches
nested-loop join  B_R + |R| · B_S                        |R| · |S|
block NLJ         B_R + ⌈B_R/(B−2)⌉ · B_S                |R| · |S|
index NLJ         B_R + |R| · (h + matches_per_probe)    |R| · log|S|
grace hash join   3(B_R + B_S)                           |R| + |S|
sort-merge join   4(B_R + B_S) (both unsorted)           |R|·log|R| + |S|·log|S|

The cost-based layer's job is to combine these into a cost for the whole tree. A plan is a tree of physical operators; a cost is computed bottom-up. Each operator's cost depends on the cost of its children and the number of rows its children produce — which is where cardinality estimation enters.

# query/cost/model.py
from dataclasses import dataclass

@dataclass(frozen=True)
class Cost:
    io: float           # page reads + writes
    cpu: float          # tuple-level comparisons
    net: float = 0.0    # bytes shipped (for distributed plans)

    def total(self, w_io=1.0, w_cpu=0.001, w_net=10.0) -> float:
        return w_io * self.io + w_cpu * self.cpu + w_net * self.net

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(self.io + other.io, self.cpu + other.cpu, self.net + other.net)

Why CPU is weighted 0.001 relative to I/O: a single 8 KB page read from an SSD takes roughly 100 microseconds; a single tuple comparison takes roughly 100 nanoseconds. The ratio is 1000:1. PostgreSQL uses a slightly different baseline — seq_page_cost = 1.0, random_page_cost = 4.0, cpu_tuple_cost = 0.01, cpu_index_tuple_cost = 0.005, cpu_operator_cost = 0.0025 — but the flavour is the same: I/O dominates, CPU is a tie-breaker. Weights are tunable per installation; on a system with NVMe storage, where random reads are almost as cheap as sequential ones, administrators lower random_page_cost (commonly to around 1.1) so the planner stops over-penalising index scans.
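To see how the weights act as tie-breakers, here is a toy comparison using the Cost class from above (restated so the snippet is self-contained; the two plans and their numbers are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cost:
    io: float
    cpu: float
    net: float = 0.0

    def total(self, w_io=1.0, w_cpu=0.001, w_net=10.0) -> float:
        return w_io * self.io + w_cpu * self.cpu + w_net * self.net

cpu_heavy = Cost(io=100, cpu=500_000)   # e.g. a sort-heavy in-memory plan
io_heavy  = Cost(io=400, cpu=1_000)     # e.g. a scan-heavy plan

# At the default 1000:1 ratio, I/O dominates and the I/O-heavy plan wins:
assert io_heavy.total() < cpu_heavy.total()            # 401.0 vs 600.0
# Shrink the CPU weight (a much faster CPU) and the decision flips:
assert cpu_heavy.total(w_cpu=1e-5) < io_heavy.total(w_cpu=1e-5)
```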

The recursive cost function walks the plan tree bottom-up:

# query/cost/estimate.py
def cost_of(plan) -> tuple[Cost, int]:
    """Return (cost, estimated_rows_out) for a plan subtree."""
    if isinstance(plan, SeqScan):
        return Cost(io=plan.pages, cpu=plan.rows), plan.rows
    if isinstance(plan, IndexScan):
        return Cost(io=plan.height + plan.matched_pages,
                    cpu=plan.matched_rows), plan.matched_rows
    if isinstance(plan, Filter):
        child_cost, child_rows = cost_of(plan.child)
        sel = estimate_selectivity(plan.predicate, plan.child.schema)
        out_rows = int(child_rows * sel)
        return child_cost + Cost(io=0, cpu=child_rows), out_rows
    if isinstance(plan, HashJoin):
        lc, lr = cost_of(plan.left)
        rc, rr = cost_of(plan.right)
        build_rows = min(lr, rr)                # hash table goes on the smaller side
        spills = pages(build_rows) > mem_pages  # build side exceeds memory -> grace
        io = 3 * (pages(lr) + pages(rr)) if spills else (pages(lr) + pages(rr))
        out_rows = estimate_join_size(plan.left, plan.right, plan.predicate)
        return lc + rc + Cost(io=io, cpu=lr + rr), out_rows
    # ...NestedLoopJoin, SortMergeJoin, Aggregate, Sort handled similarly

Notice estimate_selectivity and estimate_join_size. These are the cardinality estimation calls. Every cost decision eventually bottoms out in them, and every bad plan eventually traces back to one of them being wrong.

PostgreSQL's real implementation ships constants that you can inspect:

seq_page_cost            = 1.0
random_page_cost         = 4.0
cpu_tuple_cost           = 0.01
cpu_index_tuple_cost     = 0.005
cpu_operator_cost        = 0.0025
parallel_tuple_cost      = 0.1
effective_cache_size     = 4GB    # influences index-vs-seqscan crossover

These are knobs. Changing random_page_cost from 4.0 to 1.1 on an SSD-backed Postgres instance is a classic DBA optimisation — it tells the planner that random I/O is nearly as cheap as sequential, which flips many plans from sequential scan to index scan.
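A sketch of that flip, with simplified cost formulas in the spirit of the operator table above (not PostgreSQL's actual cost functions; the page counts and selectivity are invented):

```python
def seq_scan_cost(pages, rows, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    # Read every page sequentially, touch every row.
    return pages * seq_page_cost + rows * cpu_tuple_cost

def index_scan_cost(matches, random_page_cost, height=3, cpu_tuple_cost=0.01):
    # B-tree descent plus one random page fetch per matching row.
    return (height + matches) * random_page_cost + matches * cpu_tuple_cost

PAGES, ROWS, MATCHES = 10_000, 1_000_000, 5_000   # 0.5% of rows match

# random_page_cost = 4.0 (spinning-disk default): the full scan is cheaper.
assert seq_scan_cost(PAGES, ROWS) < index_scan_cost(MATCHES, 4.0)
# random_page_cost = 1.1 (SSD): the index scan now wins by ~4x.
assert index_scan_cost(MATCHES, 1.1) < seq_scan_cost(PAGES, ROWS)
```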

Statistics — what the optimiser knows about the data

Cardinality estimation needs a summary of each table. You don't want the optimiser to scan the table every time it plans a query (which would defeat the whole point). You also don't want the summary to be exact — that would be the table itself. The compromise every engine makes is to maintain a small, approximate summary that is refreshed periodically via ANALYZE.

Table-level statistics. Row count, total pages, average row width. PostgreSQL stores these in pg_class.reltuples and pg_class.relpages, updated during VACUUM and ANALYZE. They lag between runs — a freshly inserted billion rows will confuse the optimiser until you re-analyse.

Column-level statistics. Per column: number of distinct values (NDV / n_distinct), min, max, null fraction, average width. NDV is the single most important column statistic — it drives sel(x = c) = 1/NDV for equality and |R ⋈ S| ≈ |R|·|S|/NDV for joins.

NDV is also hard to estimate cheaply. Exact NDV requires a full scan with a hash set of size NDV. Production engines use HyperLogLog or distinct-count sketches on a sample — PostgreSQL uses the Haas-Stokes estimator, Snowflake uses HyperLogLog++. Errors of ±2× are typical, worse on skewed columns.
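A toy demonstration of why NDV resists sampling — the naive "count distincts seen in the sample" estimator collapses on a skewed column (synthetic data; this is deliberately not Haas-Stokes, which exists precisely to correct this bias):

```python
import random

random.seed(42)
# 1M rows: 90% drawn from 1,000 hot values, 10% from 100,000 rare values.
population = [random.randint(0, 999) if random.random() < 0.9
              else random.randint(1000, 100_999)
              for _ in range(1_000_000)]

true_ndv = len(set(population))                 # tens of thousands
sample = random.sample(population, 30_000)      # ANALYZE-sized sample
naive_ndv = len(set(sample))                    # distincts seen in the sample

# The sample mostly sees hot values; rare values are almost all missed,
# so the naive estimate undershoots by an order of magnitude.
assert naive_ndv < true_ndv / 2
```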

Histograms. A histogram partitions the column's value range into buckets and stores the row count per bucket. Two common forms: equi-width, where every bucket spans the same value range, and equi-height, where every bucket holds the same number of rows.

Equi-height is the dominant choice because it gives uniform-per-bucket selectivity estimates and adapts bucket boundaries to skew. PostgreSQL's pg_statistic.stavalues stores equi-height histogram boundaries for each analysed column.

[Figure: equi-height histogram of a skewed orders.year column — 100 M rows, 10 buckets of 10 M rows each. The first bucket spans 2010–2019; later buckets cover 2020, 2021, 2022, 2023, 2024, 2025, 2026 individually, with a horizontal line marking the constant per-bucket row count. Reading selectivity off the histogram: year = 2023 falls in one bucket holding 10 M of 100 M rows, so sel = 0.1; year BETWEEN 2020 AND 2023 covers four buckets, sel = 0.4 → 40 M rows estimated.]
An equi-height histogram. Row count per bucket is constant (10 M here); bucket widths adapt to where the data is — narrow where dense, wide where sparse. Selectivity of a range predicate is estimated as (covered-bucket fraction), which is much more accurate on skewed data than an equi-width histogram would be.
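Computing equi-height boundaries is just a quantile pass over the sorted values — a minimal sketch on a small skewed 'year' column (toy data, not the orders table):

```python
def equi_height_bounds(values, k):
    """Boundaries such that each of the k buckets holds len(values)/k rows."""
    s = sorted(values)
    n = len(s)
    # bounds[0] is the minimum; bounds[i] is the top of bucket i.
    return [s[0]] + [s[(i * n) // k - 1] for i in range(1, k + 1)]

# Skewed toward recent years: half the rows are in 2024-2025.
years = [2010] * 100 + [2015] * 150 + [2020] * 250 + [2024] * 250 + [2025] * 250
bounds = equi_height_bounds(years, k=4)

# Uniform row count per bucket, non-uniform widths: wide where the data is
# sparse, narrow where it is dense -- the last bucket spans a single year.
assert bounds == [2010, 2015, 2020, 2024, 2025]
```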

Most-common-value (MCV) list. Histograms treat each bucket uniformly, which is wrong when a single value inside a bucket is hugely over-represented. If country = 'IN' is 25% of all users but the bucket containing it covers 20 distinct countries, the histogram would estimate country = 'IN' as ~1.25% — a 20× underestimate. To fix this, engines store the top-N most-common values and their observed frequencies separately from the histogram. The histogram then describes only the non-MCV portion of the column.

PostgreSQL's default is 100 MCVs per column, adjustable per column via ALTER TABLE ... SET STATISTICS. Oracle calls this a "frequency histogram" when NDV is small enough to fit fully, and "top-frequency histogram" when it isn't.

Functional dependencies and multi-column statistics. The independence assumption (sel(P1 ∧ P2) ≈ sel(P1) · sel(P2)) breaks on correlated columns. Classic example: city = 'Mumbai' and pincode = '400001' are perfectly correlated (every Mumbai pincode starts with 400), yet the optimiser multiplies sel(city='Mumbai') = 0.01 with sel(pincode='400001') = 0.0001 to get 10⁻⁶ instead of the actual 10⁻⁴. A hundred-fold underestimate is exactly the kind of error that picks nested-loop on what should have been a hash join.

PostgreSQL 10+ offers extended statistics via CREATE STATISTICS addr_stats (dependencies, ndistinct, mcv) ON city, pincode FROM addresses;. This maintains functional dependencies (detected automatically), multi-column NDV, and multi-column MCVs. SQL Server's "column group statistics" and Oracle's "extended statistics" serve the same purpose.
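The scale of the independence error is easy to reproduce on synthetic correlated data, where city fully determines the pincode prefix as in the Mumbai example (the cities and prefixes are illustrative):

```python
import random

random.seed(7)
PREFIX = {'Mumbai': '400', 'Delhi': '110', 'Pune': '411', 'Chennai': '600'}
rows = []
for _ in range(100_000):
    city = random.choice(list(PREFIX))
    rows.append((city, PREFIX[city] + f'{random.randint(0, 999):03d}'))

n = len(rows)
sel_city = sum(c == 'Mumbai' for c, _ in rows) / n         # ~0.25
sel_pin = sum(p.startswith('400') for _, p in rows) / n    # ~0.25 -- same rows!
actual = sum(c == 'Mumbai' and p.startswith('400') for c, p in rows) / n

independent = sel_city * sel_pin                           # ~0.0625
# The product under-estimates the true conjunct selectivity ~4x here.
assert actual > 3 * independent
```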

How statistics get collected. ANALYZE samples the table — 300 · default_statistics_target rows in PostgreSQL, which defaults to 30,000 rows regardless of table size. Sampling is done at the page level: whole pages are chosen at random and all their rows inspected, because picking individual rows scattered across the table would cost a random read per row.

Why a fixed sample size is correct asymptotically: the standard error of a sampled proportion is O(1/sqrt(n)), independent of population size. 30,000 rows give ~0.6% error on a proportion — enough for histogram boundaries and null fractions. NDV is the exception: its sample-based estimate degrades with population size, which is why Postgres uses specialised NDV estimators rather than raw counts.
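The 1/sqrt(n) claim in one line — the ~0.6% figure corresponds to two standard errors at the worst-case proportion p = 0.5; note there is no population-size term anywhere in the formula:

```python
import math

def stderr(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n sampled rows."""
    return math.sqrt(p * (1 - p) / n)

# Worst case p = 0.5 with a 30,000-row ANALYZE sample:
two_sigma = 2 * stderr(0.5, 30_000)   # ~0.58% -- the '~0.6% error' above
assert abs(two_sigma - 0.0058) < 1e-4
```

Quadrupling the sample halves the error, regardless of whether the table has a million rows or a billion.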

Cardinality estimation — the fragile heart

Now the formulas. Each one is a selectivity estimate; multiplying by the input row count gives the output row count.

Equality selectivity

For x = c, the baseline estimate is

\text{sel}(x = c) \approx \frac{1}{\text{NDV}}

Why 1/NDV assumes uniformity: each of the NDV distinct values appears with equal probability. If the column is actually skewed (one value dominates), this underestimates the selectivity for the dominant value and overestimates it for the rare ones. MCV lists fix this for the top-N values; the tail is where 1/NDV still applies and is still wrong in proportion to how skewed the tail is.

def sel_equality(col_stats, value) -> float:
    if value in col_stats.mcv:
        return col_stats.mcv[value]                 # direct frequency
    non_mcv_frac = 1 - sum(col_stats.mcv.values())
    tail_ndv = col_stats.ndv - len(col_stats.mcv)
    return non_mcv_frac / max(tail_ndv, 1)

Range selectivity via histogram

For x BETWEEN a AND b:

def sel_range(col_stats, lo, hi) -> float:
    # The histogram describes only the non-MCV portion of the column,
    # so each of the k equi-height buckets carries non_mcv_frac / k of the rows.
    non_mcv_frac = 1 - sum(col_stats.mcv.values())
    k = len(col_stats.buckets)
    sel = 0.0
    for b_lo, b_hi in col_stats.buckets:
        if b_hi <= lo or b_lo >= hi:
            continue                                  # no overlap with [lo, hi]
        overlap = min(b_hi, hi) - max(b_lo, lo)
        sel += (overlap / (b_hi - b_lo)) * (non_mcv_frac / k)
    for v, f in col_stats.mcv.items():                # MCVs are tracked exactly
        if lo <= v <= hi:
            sel += f
    return min(sel, 1.0)

Why linear interpolation inside a bucket is optimistic: an equi-height bucket contains the same number of rows as its neighbours, but the distribution within the bucket is unknown. If rows cluster at the low end, a range that touches only the high end yields zero rows while the estimator predicts overlap/width · (1/k). Tighter bucket resolution (more buckets, via a higher statistics target) helps, but you cannot eliminate the approximation without storing every row.

Conjunct selectivity (and why independence is a lie)

For P1 AND P2 referencing different columns, the optimiser assumes independence:

\text{sel}(P_1 \wedge P_2) \approx \text{sel}(P_1) \cdot \text{sel}(P_2)

This is the single most consequential approximation in query optimisation. It is almost always wrong on real data. Correlated columns (city and pincode, model_year and model_name, order_status and ship_date) make the true conjunct selectivity larger than the product, so the optimiser under-estimates the output — exactly the Mumbai/pincode case above. Anti-correlated columns (rare in practice but possible — age and mortgage_remaining on post-retirement loans) make the true selectivity smaller than the product, so the optimiser over-estimates.

For disjuncts:

\text{sel}(P_1 \vee P_2) = \text{sel}(P_1) + \text{sel}(P_2) - \text{sel}(P_1 \wedge P_2)

which reduces to sel(P1) + sel(P2) - sel(P1)·sel(P2) under independence — also wrong in the same way.

Multi-column statistics (discussed above) fix this for columns the DBA has thought to register. Columns the DBA hasn't registered remain subject to the independence assumption, which is why most production optimisers are still quietly wrong on most queries.

Join size estimation

This is where cardinality estimation pays for its own bad assumptions. For an equi-join R ⋈_{R.k = S.k} S where neither side has been filtered:

|R \bowtie S| \approx \frac{|R| \cdot |S|}{\max(\text{NDV}_R(k), \text{NDV}_S(k))}

The derivation rests on the containment assumption: the smaller side's key set is contained in the larger's. Each of the min(NDV_R, NDV_S) common keys appears |R|/NDV_R times in R and |S|/NDV_S times in S, contributing |R|/NDV_R · |S|/NDV_S rows to the output. Summing over the common keys gives min(NDV_R, NDV_S) · |R|·|S| / (NDV_R · NDV_S) = |R|·|S| / max(NDV_R, NDV_S).

Why max and not min: the join key's domain on either side is whatever values appear in that column. If NDV_R(k) = 10⁶ and NDV_S(k) = 100, R has a million distinct keys and S has a hundred. An equi-join can only match on keys present in both sides. Assuming S's keys are a subset of R's (the foreign-key case), the join sees S's 100 distinct keys, each contributing |R|/10⁶ · |S|/100 rows; summed over the 100 keys, that is |R|·|S|/10⁶ = |R|·|S| / max(NDV_R, NDV_S). The bigger the max, the more the output is diluted. Dividing by min instead would overestimate: it would credit each common key with |R|/min matches on the larger side, as if the larger side's million keys were squeezed into the hundred that match.

For the foreign-key special case — users.id is primary key (NDV = |users|) and orders.user_id is a foreign key pointing into it — the formula simplifies beautifully:

|users \bowtie orders| \approx \frac{|users| \cdot |orders|}{|users|} = |orders|

Every order has exactly one user, so the join size equals the orders table size. This is why FK joins don't blow up: they preserve the size of the many side.

def estimate_join_size(left_rows, right_rows, left_ndv, right_ndv) -> int:
    return int(left_rows * right_rows / max(left_ndv, right_ndv, 1))

One line. And yet this one line carries the weight of every join-order decision the optimiser will ever make.

Join order enumeration — Selinger's dynamic programming

The System R algorithm (Selinger 1979) enumerates plans by dynamic programming over subsets of the input tables. For each non-empty subset S ⊆ {T_1, ..., T_n}, compute and memoise the cheapest plan joining exactly those tables. The plan for S is built by combining the plan for some proper subset with the scan of the remaining table. Outer loop over subsets by size; inner loop over ways to peel off one table.

# query/opt/selinger.py
from itertools import combinations

def selinger_dp(tables, join_preds, base_costs, estimate_cost):
    """Left-deep DP enumeration. Returns best plan for joining all tables."""
    best = {}
    # Base cases: single-table scans
    for t in tables:
        best[frozenset([t])] = (base_costs[t], [t])
    # Build up subsets by size
    n = len(tables)
    for size in range(2, n + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            best_cost, best_order = float('inf'), None
            for t in subset:                        # left-deep: pull one table out
                rest = s - {t}
                if rest not in best: continue
                if not has_edge(rest, t, join_preds):
                    continue                         # no predicate connecting them
                sub_cost, sub_order = best[rest]
                join_cost = estimate_cost(sub_order, t, join_preds)
                total = sub_cost + join_cost
                if total < best_cost:
                    best_cost, best_order = total, sub_order + [t]
            if best_order is not None:
                best[s] = (best_cost, best_order)
    return best[frozenset(tables)]

The has_edge check restricts the search to connected subplans — subsets whose tables are linked by join predicates. Without this restriction, the DP would consider Cartesian products (users × line_items from the opening example), which the optimiser rightly avoids except as a last resort. PostgreSQL calls this the "connected-subgraphs" optimisation.
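has_edge is left undefined in the sketch above; a minimal version, assuming join_preds is a list of (table_a, table_b) pairs with one entry per equi-join predicate:

```python
def has_edge(subset, table, join_preds):
    """True if some join predicate links `table` to a table already in `subset`."""
    return any((a == table and b in subset) or (b == table and a in subset)
               for a, b in join_preds)

# Chain schema users -- orders -- line_items: there is no direct
# users/line_items edge, so the DP never builds Plan A's Cartesian product.
preds = [('users', 'orders'), ('orders', 'line_items')]
assert has_edge({'users'}, 'orders', preds)
assert not has_edge({'users'}, 'line_items', preds)
```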

Complexity. The DP visits 2^n subsets; each does O(n) work. Total: O(2^n · n) time. At n = 10 that's ~10,000 — fast. At n = 20, about twenty million — slow. At n = 30, intractable.

PostgreSQL handles this with the GEQO (Genetic Query Optimiser) fallback. Above geqo_threshold = 12 tables it abandons DP for a genetic algorithm (random initial plans, crossover and mutation) that runs in bounded time but loses optimality. CockroachDB uses a similar greedy "cheapest join first" heuristic.

Left-deep vs bushy. The DP above enumerates left-deep plans only: each step adds one new table to an existing subtree. Bushy plans (both sides of a join can be multi-table subjoins) have Catalan-many more shapes. System R's original optimiser was left-deep; most modern engines — Oracle, SQL Server, Snowflake, and PostgreSQL's standard DP — also enumerate bushy shapes. Bushy wins on star-schema queries where multiple small dimensions hash-join into a fact table independently.

Interesting orders. A sorted output is worth more than its cost suggests — downstream sort-merge joins or ORDER BY can reuse it. Selinger's DP tracks each subset's cheapest plan under each interesting order (ORDER BY columns, join keys, GROUP BY keys) separately. The memo is indexed by (subset, sort_order) instead of just subset. This inflates state by a constant factor but often finds plans pure cost-sum DP would miss.
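A sketch of the enlarged memo — keyed by (subset, sort order) so that a costlier-but-sorted plan survives alongside the cheapest unsorted one (the plan names and costs here are invented):

```python
best = {}  # (frozenset of tables, output sort order) -> (cost, plan)

def consider(subset, order, cost, plan):
    """Keep the cheapest plan per (subset, interesting-order) pair."""
    key = (frozenset(subset), tuple(order))
    if key not in best or cost < best[key][0]:
        best[key] = (cost, plan)

consider({'users'}, (), 100.0, 'seqscan(users)')
consider({'users'}, ('id',), 140.0, 'indexscan(users.id)')  # pricier but sorted

# Both survive: the sorted plan may pay for itself later via a merge join
# or by satisfying an ORDER BY without an explicit sort.
assert len(best) == 2
```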

When estimates go wrong — and how engines cope

The Leis et al. 2015 VLDB paper "How Good Are Query Optimizers, Really?" benchmarked the cardinality estimators in PostgreSQL, MySQL, MonetDB, HyPer, DB2, and SQL Server on the Join Order Benchmark. Headline finding: all optimisers produce plans with >100× actual cost in roughly 10% of queries, and the dominant error source is cardinality estimation — not cost modelling, not enumeration. Errors compound exponentially with join depth: a 10× error at level 1 becomes 10⁵ by level 5. Three classes of defence:

Sampling at plan time. Oracle's "dynamic sampling" and DB2's "runtime statistics" issue small exploratory queries at optimisation time — scan a few thousand rows, compute actual selectivity, plan with that. Tens of milliseconds overhead, but robust on predicates the histogram didn't capture.

Adaptive execution. Spark's Adaptive Query Execution, DuckDB's re-optimisation, SQL Server's Adaptive Join — all measure actual cardinalities during execution and reshape the plan mid-query. SQL Server's Adaptive Join is the cleanest version: it starts as a hash join but switches mid-operator to a nested-loop join if the observed build-side row count is below a threshold. Both plans are compiled up front; the runtime chooses.

Hints. When the optimiser is consistently wrong and you cannot fix the statistics, every production engine has an escape hatch: Oracle's /*+ ORDERED */, Postgres's pg_hint_plan, SQL Server's OPTION (FORCE ORDER). Last-resort workarounds when a specific query must run today.

Learned cardinality estimation

Since 2018, a research line has proposed replacing the histogram-plus-independence stack with neural networks trained on query workloads. MSCN (Kipf et al., CIDR 2019) uses a multi-set CNN over a query representation; Naru (Yang et al., VLDB 2019) learns the full joint distribution of a table's columns via a deep autoregressive model; NeuroCard extends Naru to multi-table joins. On benchmarks, these models beat histogram estimators by an order of magnitude on correlated columns. The catches: training takes hours per table; models are brittle across workload shifts; updates force retraining; inference is still milliseconds, too slow for OLTP compilation. No major production database uses learned estimation by default as of 2026. For now, histograms and ANALYZE are what you ship.

Estimating a two-table join with a filter.

The query:

SELECT u.name, o.total
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.country = 'IN';

Statistics you have:

  • |users| = 10^7, NDV_users(id) = 10^7 (PK, unique).
  • users.country: NDV = 200; the MCV list includes country = 'IN' with frequency 0.25.
  • |orders| = 5 × 10^8, NDV_orders(user_id) = 10^7 (FK into users).
  • Buffer pool: 100 MB (= 10^4 pages at 8 KB).

Step 1 — selectivity of u.country = 'IN'. 'IN' is in the MCV list with frequency 0.25, so:

\text{sel}(u.country = \text{'IN'}) = 0.25

Step 2 — filtered user count.

|\sigma_{country=\text{'IN'}}(users)| = 10^7 \cdot 0.25 = 2.5 \cdot 10^6

2.5 million Indian users.

Step 3 — join size. Apply the formula with the filtered left side:

|\sigma(users) \bowtie orders| \approx \frac{|\sigma(users)| \cdot |orders|}{\max(NDV_L, NDV_R)}

NDV of the join key on the filtered left side is ≤ |σ(users)| = 2.5 · 10^6 (the filter kept 2.5M distinct user ids, since id is unique). NDV_R is 10^7. So max = 10^7:

|\sigma(users) \bowtie orders| \approx \frac{2.5 \cdot 10^6 \cdot 5 \cdot 10^8}{10^7} = 1.25 \cdot 10^8

125 million output rows — matches the intuition that 25% of the orders table corresponds to Indian users.

Step 4 — cost each physical plan.

Assume rows are 200 bytes each. Filtered users is 2.5·10^6 × 200 B = 500 MB = B_L = 6.25 · 10^4 pages. Orders is 5·10^8 × 200 B = 100 GB = B_R = 1.25 · 10^7 pages.

  • Nested-loop join (even with block NLJ at M = 10^4 pages): B_L + ⌈B_L / (M-2)⌉ · B_R = 6.25·10^4 + 7 · 1.25·10^7 ≈ 8.75·10^7 pages. At 0.5 ms per SSD page read, that's ~44,000 seconds = 12 hours. Not useful.

  • Grace hash join: 3(B_L + B_R) = 3 · 1.256·10^7 ≈ 3.77·10^7 pages. Wall-clock ~19,000 seconds = 5 hours. The build side (500 MB filtered users) doesn't fit in 100 MB buffer, so grace partitioning kicks in with ~8 partitions.

  • Sort-merge join: neither side is sorted on the join key in this scenario. Cost 4(B_L + B_R) ≈ 5·10^7 pages, ~7 hours. Slower than grace.

Grace hash wins. Cost estimate handed to Selinger DP: 3.77·10^7 pages. The 125M-row output is passed up to any parent operator as the expected row count.
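The arithmetic in steps 1–4 can be checked mechanically (8,000-byte pages for round numbers, 200-byte rows, 0.5 ms per page read — the assumptions used in the worked example):

```python
ROW_B, PAGE_B, MS_PER_PAGE = 200, 8_000, 0.5
users, orders, sel_in = 10**7, 5 * 10**8, 0.25

filtered = int(users * sel_in)                        # step 2: 2.5M Indian users
join_out = filtered * orders // max(filtered, users)  # step 3: |R||S| / max(NDV)

b_l = filtered * ROW_B // PAGE_B                      # 62,500 pages (~500 MB)
b_r = orders * ROW_B // PAGE_B                        # 12.5M pages (~100 GB)

grace_pages = 3 * (b_l + b_r)                         # grace hash: 3(B_L + B_R)
grace_hours = grace_pages * MS_PER_PAGE / 1000 / 3600

assert filtered == 2_500_000
assert join_out == 125_000_000
assert grace_pages == 37_687_500                      # ~3.77e7 pages, as above
assert 5.0 < grace_hours < 5.5                        # the ~5 hours quoted
```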

Where reality could diverge from the estimate. Suppose country = 'IN' were not in the MCV list — the optimiser would fall back to 1/NDV = 1/200 = 0.005, giving sel(country='IN') = 0.005 and estimated filtered users = 50,000 instead of 2.5 million. The estimated join size would be 50,000 · 5·10^8 / 10^7 = 2.5 · 10^6 rows — a 50× underestimate. Based on that, the optimiser might pick nested-loop join (small outer driving into big indexed inner) thinking 50,000 lookups are fine — but it would actually do 2.5 million lookups, and the real runtime would blow past the hash-join plan.

This is the classic "bad estimate, wrong plan" scenario. The fix: ensure country has enough MCV slots (default 100 is usually plenty for a 200-value column); verify with EXPLAIN ANALYZE that estimated and actual row counts match.

Going deeper

Two topics that matter for anyone building on top of a real optimiser.

The Cascades framework — rule-based and cost-based unified

Selinger's DP works bottom-up over left-deep plans. The Cascades framework (Graefe 1995) generalises this to bushy plans and unifies rule-based and cost-based optimisation under one enumeration mechanism. The core data structure is the memo — a DAG where each node is a logical equivalence class and stores a set of physical implementations. Transformation rules convert one logical expression to another (associativity, commutativity); implementation rules attach a physical operator (hash join, sort-merge) to a logical one. Each new expression is added to the memo; duplicates detected by expression-signature hashing.

Cost-based search is a top-down traversal of the memo with branch-and-bound pruning. For each logical class the optimiser computes a lower bound on the cost of any plan rooted there; subtrees exceeding the current best cost are pruned. This is why Cascades handles larger join graphs than pure Selinger DP. SQL Server, CockroachDB, and Apache Calcite's VolcanoPlanner are all Cascades-derived.

Bernoulli sampling for runtime correction

When the optimiser suspects a bad estimate (filter references many columns, or the column has no MCV), some engines run a Bernoulli sample at plan time — read each page with probability p, evaluate the filter, estimate selectivity directly. Standard error scales as O(1/sqrt(p · N)); for p = 0.01 on a 100 M-row table, 0.1% — tighter than the 30,000-row ANALYZE sample. Cost is 10–50 ms for a 1% sample on a gigabyte table. Oracle and DB2 enable it adaptively; PostgreSQL exposes TABLESAMPLE BERNOULLI for application-level use on problematic queries.
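A sketch of plan-time page-level Bernoulli sampling (toy in-memory pages; a real engine would do this against the buffer pool, and the function name is our own):

```python
import random

def bernoulli_selectivity(pages, predicate, p=0.01, seed=0):
    """Visit each page with probability p; return the observed selectivity."""
    rng = random.Random(seed)
    seen = hits = 0
    for page in pages:
        if rng.random() >= p:
            continue                        # page not sampled
        for row in page:
            seen += 1
            hits += predicate(row)
    return hits / seen if seen else None    # None: the sample hit no pages

# 10,000 pages x 100 rows; exactly 20% of rows satisfy the predicate.
pages = [[(i * 100 + j) % 5 for j in range(100)] for i in range(10_000)]
est = bernoulli_selectivity(pages, lambda r: r == 0, p=0.01)
assert est is not None and abs(est - 0.2) < 1e-9
```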

Where this leads next

Cost-based optimisation has the most surface area and the weakest formal guarantees of any layer in the database. Cardinality estimation is the component most likely to be the reason your query is slow — every production database ships an optimiser that gets 90% of queries right and flails on the 10% where statistics fail it.

The next chapter — why the optimizer is the hardest part of a DB — is the philosophical coda. Every piece of the optimiser is well-understood in isolation; the integrated system is still where every engine regresses on every release, because the optimiser is where the storage engine, execution engine, query language, statistics, and workload meet in a single decision procedure. After that, adaptive query processing picks up where "when estimates go wrong" left off — the direction of travel is optimisers that make worse initial estimates but recover better, because perfect estimates are unreachable and reactive planning is cheaper than being wrong.

References

  1. Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — the System R paper. Introduces cost-based optimisation and the dynamic-programming join enumeration used almost verbatim in PostgreSQL today. Sections 4–5 are the cost model; section 6 is the DP algorithm.
  2. Leis, Gubichev, Mirchev, Boncz, Kemper, Neumann, How Good Are Query Optimizers, Really?, VLDB 2015 — the benchmark paper that measures cardinality-estimation quality in every major DB. Shows that estimation errors dominate optimiser quality and that join-depth amplification makes them worse. The reference paper on why cardinality estimation matters.
  3. Ioannidis, Query Optimization, ACM Computing Surveys 28(1), 1996 — the standard survey of cost-based optimisation. Covers histogram types, selectivity estimation, join ordering, and Cascades at an accessible level.
  4. Graefe, The Cascades Framework for Query Optimization, IEEE Data Engineering Bulletin 18(3), 1995 — the Cascades paper. Unifies rule-based and cost-based optimisation under memo-based branch-and-bound search. Used in SQL Server, CockroachDB, Calcite's VolcanoPlanner.
  5. PostgreSQL documentation, Planner / Optimizer and Row Estimation Examples — the production reference for how a real optimiser does cardinality estimation. Walks through histogram, MCV, and null-fraction arithmetic on example queries.
  6. Kipf, Kipf, Radke, Leis, Boncz, Kemper, Learned Cardinalities: Estimating Correlated Joins with Deep Learning, CIDR 2019 (MSCN) — the representative paper for learned cardinality estimation. Shows that a multi-set CNN outperforms PostgreSQL's estimator on correlated joins by 10–100× on the JOB benchmark, with the caveats about training cost and workload shift discussed in this chapter.