In short
The other subsystems of a database are, in a real sense, solved problems. Pick up a textbook and you can build a buffer pool, a write-ahead log, ARIES recovery, an iterator-model executor, two-phase locking, or a hash join, and the code will be correct for all time. The algorithms have proofs. Cost models are closed-form. Papers from 1990 still describe the state of the art.
The query optimiser is different. Forty-five years after Selinger's System R paper, the state of the art is still "compile a plan from statistics that are usually wrong, execute it, and hope." Five reasons: (1) the plan space grows exponentially in the number of tables — at twenty tables, exhaustive search is infeasible on the fastest hardware that will ever exist; (2) cardinality estimates have unbounded multiplicative error that compounds with plan depth — 10× off at level one becomes 10⁵× at level five; (3) physical operator choices interact non-locally — a hash join here changes which sort order is "interesting" there, which changes the right operator somewhere else; (4) the workload is adversarial — every real query is a corner case the optimiser's model does not cover; (5) the right plan depends on facts you do not have until you run the query.
The takeaway is practical, not academic. When a production query is slow, roughly seventy percent of the time it is a plan problem — not storage, not execution. And it is often unfixable without hand-tuned hints, materialised views, or denormalisation. You are going to spend a lot of your career reading EXPLAIN ANALYZE. This chapter is about why, and what to do about it.
The easy parts of a database
Look at everything else the database does. Each subsystem has a known right answer: a textbook algorithm, a correctness proof, a cost model in closed form.
- Storage. Slotted pages, WAL, ARIES recovery. ARIES is three passes — analysis, redo, undo — provably correct against a crash. The 1992 paper is still what every relational engine implements.
- Execution. The iterator model runs any plan. Each operator has a closed-form cost: nested-loop join B_R + |R|·B_S, hash join 3·(B_R + B_S) when it spills, external merge sort 2·B_R·⌈log_M B_R⌉. Vectorised and compiled execution change the constant factor; the operator catalogue has not changed in thirty years.
- Concurrency control. 2PL for serialisability; MVCC for snapshot isolation. Both have formal proofs: Bernstein, Hadzilacos, and Goodman nailed it down in 1987. Modern implementations vary in constants, not correctness.
- Buffer management, indexing, compression. B-trees from 1972. Bloom filters from 1970. Clock-sweep from the 1960s. Snappy, LZ4, dictionary encoding — all textbook.
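The cost formulas quoted above translate directly into code. A sketch in page I/Os; the parameter names and the in-memory threshold are illustrative, not taken from any engine:

```python
import math

def nested_loop_cost(b_r: int, r_tuples: int, b_s: int) -> int:
    """Tuple-at-a-time nested loop: scan R once, then scan S once per R tuple."""
    return b_r + r_tuples * b_s

def hash_join_cost(b_r: int, b_s: int, mem_pages: int) -> int:
    """One pass if the build side fits in memory; otherwise a Grace hash join
    partitions both inputs to disk and reads them back: 3(B_R + B_S)."""
    return b_r + b_s if b_r <= mem_pages else 3 * (b_r + b_s)

def external_sort_cost(b_r: int, mem_pages: int) -> int:
    """The simplified textbook form used above: 2 * B_R * ceil(log_M B_R)."""
    return 2 * b_r * math.ceil(math.log(b_r, mem_pages))
```

The point is not the exact constants but that each formula closes: given page counts and memory, the cost is a number, not an estimate of an estimate.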
None of this is trivial — a production storage engine takes a team years. But the goal is clear, the algorithms exist, and when you finish, the thing is done. Nobody is publishing papers arguing the B-tree is wrong. The optimiser is the one subsystem where that is still an open question.
Why the optimiser is different
Four structural reasons here (the fifth from the summary, the dependence on runtime facts, gets its own section below). Each alone would be hard. Together they produce a problem nobody has fully cracked.
1. Plan space explodes combinatorially
The optimiser's input is a logical plan: join these tables, with these predicates, produce these columns. Its output fixes the join order, the algorithm for each join, the indexes each scan uses, whether aggregates are hash or sort based, where exchanges sit in a parallel plan. The count is the product of all of these.
The number of left-deep join orders for n tables is n!. At n = 10 that is 3.6 million. At n = 15, 1.3 trillion. At n = 20, 2.4 quintillion, more than any supercomputer will ever enumerate. Allow bushy trees and you multiply by a Catalan factor; multiply again by per-node operator choices, and a medium analytical query has on the order of 10^{20} candidate plans.
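The counts are two lines of arithmetic, reproducing the numbers above:

```python
from math import comb, factorial

def left_deep_orders(n: int) -> int:
    """Left-deep join orders: one permutation of the n tables each."""
    return factorial(n)

def bushy_orders(n: int) -> int:
    """All binary join trees: n! leaf orderings times the Catalan number
    C(n-1), which counts binary tree shapes over n leaves."""
    catalan = comb(2 * (n - 1), n - 1) // n
    return factorial(n) * catalan

print(left_deep_orders(10))  # 3628800
print(left_deep_orders(20))  # 2432902008176640000, the 2.4 quintillion above
```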
No exhaustive search survives this. Production optimisers use either Selinger's DP (O(2^n · n) subsets) or top-down memoised search (Cascades) with aggressive pruning. Both give up completeness. Both produce locally-optimal plans that are sometimes globally catastrophic.
No other subsystem has this shape. A B-tree insert chooses between split-or-not. A lock manager chooses between grant, queue, deadlock-abort. The optimiser picks one plan from 10^{20}, and the wrong choice costs orders of magnitude.
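The bottom-up enumeration Selinger used can be sketched in a few lines. A toy version, assuming cost = sum of intermediate result sizes and pairwise selectivities supplied by the caller; real optimisers add interesting orders, pruning, and a genuine cost model:

```python
from itertools import combinations

def best_left_deep(card: dict, sel: dict):
    """card: {table: rows}. sel: {frozenset({a, b}): join selectivity}.
    Returns (cost, output_rows, join_order) for the best left-deep plan,
    where cost = sum of intermediate result sizes."""
    tables = list(card)
    # best[subset] = (cost, rows, order) for the cheapest plan joining subset
    best = {frozenset([t]): (0.0, float(card[t]), (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            for t in subset:  # t joins last, onto the best plan for the rest
                cost, rows, order = best[s - {t}]
                out = rows * card[t]
                for o in order:  # apply every join predicate crossing the cut
                    out *= sel.get(frozenset((o, t)), 1.0)
                if s not in best or cost + out < best[s][0]:
                    best[s] = (cost + out, out, order + (t,))
    return best[frozenset(tables)]

cost, rows, order = best_left_deep(
    {"A": 1000, "B": 100, "C": 10},
    {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1},
)
print(order)  # starts with the small, selective B-C pair
```

The table of subproblems is what keeps this polynomial in the number of subsets rather than factorial in the number of tables; it is also exactly where garbage cardinalities poison every comparison.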
2. Cardinality estimates are unbounded-error
Every cost model starts with cardinality: how many rows does each subtree produce? Feed wrong numbers in and every plan comparison becomes meaningless. Cardinality estimation is not "sometimes inaccurate" — it has unbounded multiplicative error that compounds with plan depth.
Consider A ⋈ B ⋈ C with selectivities assumed independent. The optimiser estimates |A ⋈ B| ≈ |A| · |B| · s_{AB}, then |(A ⋈ B) ⋈ C| ≈ |A ⋈ B| · |C| · s_{(AB)C}. If each selectivity is 10× off (trivially easy — correlated columns, skewed histograms), two multiplications compound to 100×. For a five-way join, 10⁵.
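The compounding is mechanical enough to demonstrate in a few lines; the base cardinalities and selectivities here are invented:

```python
def estimated_rows(base_rows, selectivities):
    """Join-chain output size under the independence assumption."""
    rows = base_rows[0]
    for r, s in zip(base_rows[1:], selectivities):
        rows *= r * s
    return rows

# the optimiser's numbers: each join selectivity believed to be 1e-6
est = estimated_rows([10**6] * 3, [1e-6, 1e-6])
# reality: correlated columns make each true selectivity 10x higher
act = estimated_rows([10**6] * 3, [1e-5, 1e-5])
print(round(act / est))  # 100: two 10x per-level errors multiply, never average
```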
Nothing else has this property. Buffer pool miss rate is bounded between 0 and 1. Lock queue lengths have computable upper bounds. Cardinality error has no bound — a correlation the histograms miss can produce arbitrary error. The Leis et al. 2015 VLDB paper measured >1000× errors on 10% of real queries across every major engine. That 10% is where all your pager-calls come from.
3. Physical operator choices interact non-locally
You cannot pick the physical operator for each join in isolation. What is right at level 5 depends on what was picked at level 3.
The mechanism is interesting orders (Selinger 1979). Pick a sort-merge join at level 3 and its output comes out sorted on the join key. That ordering is "interesting" at level 5 if level 5 is a join on the same key — it can use a cheap merge join instead of hashing. Pick a hash join at level 3 and level 5 needs a sort.
So the question "hash or sort-merge at level 3?" cannot be answered locally. Selinger's DP carries a set of interesting-order states through each sub-plan, multiplying the DP table size. This is why the System R algorithm is O(3^n) rather than O(2^n), and why Cascades optimisers search a memo (DAG of logical equivalences with physical alternatives) rather than a tree.
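The non-locality shows up in numbers. A toy two-level comparison, with made-up page counts and the simplified external-sort cost from earlier: the hash join wins locally, but the plan containing it loses overall.

```python
import math

def sort_cost(pages: int, mem: int = 100) -> int:
    """Simplified external-sort cost: 2 * pages * ceil(log_mem pages)."""
    return 2 * pages * math.ceil(math.log(pages, mem))

def plan_cost(level3: str, mid: int = 5000, other: int = 2000) -> int:
    """Total I/O for: a level-3 join producing `mid` pages, then a level-5
    merge join of that output with `other` pages on the same key."""
    if level3 == "hash":
        local = 3 * mid            # spilling hash join: locally the cheaper choice
        fixup = sort_cost(mid)     # but level 5 must sort its input first
    else:                          # sort-merge
        local = sort_cost(mid) + mid  # locally dearer: sort, then merge pass
        fixup = 0                  # output already sorted on the join key
    return local + fixup + (mid + other)  # plus the level-5 merge itself

print(plan_cost("hash"), plan_cost("merge"))  # 42000 32000
```

Locally the hash join costs 15,000 against sort-merge's 25,000; globally the sort-merge plan is 10,000 pages cheaper because its ordering is reused. That reuse is what the DP's interesting-order states track.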
Every other subsystem composes. A good hash function is still a good hash function inside a join. The optimiser does not compose — operators influence each other through data properties (orderings, partitionings, distinct-ness) that are not local to one operator's cost.
4. The workload is adversarial
Benchmarks are synthetic. TPC-H, TPC-DS, the Join Order Benchmark — uniform distributions, independent columns, no NULLs, no UDFs. Real workloads do not look like that. Real workloads have:
- Skew. One hot user with ten million orders; millions of users with ten each. A plan assuming uniform distribution runs orders of magnitude slower on the hot key.
- Correlations. country = 'IN' and currency = 'INR' are not independent. The optimiser multiplies their selectivities as if they were and estimates 1000× too low.
- Constant-folding-hostile expressions. WHERE date_trunc('month', created_at) = ... cannot use an index on created_at unless rewritten. Most planners do not rewrite.
- NULL bombs. A column that is 80% NULL breaks histograms: the MCV bucket is just NULL and carries no information about the rest.
- UDFs. No cost model, no selectivity estimate. Postgres assumes one-third of rows pass any UDF predicate.
- Bind-variable peeking. The first parse binds country = 'US' (80% of the data) and the cached plan uses a scan. Later calls with country = 'NZ' (0.1%) run the wrong plan.
No other subsystem is antagonised this hard. The optimiser facing adversarial statistics produces plans that are arbitrarily bad — there is no floor.
Four decades of research, still the same problem
A compressed timeline:
- 1979 — Selinger's System R. Patricia Selinger and four IBM co-authors invent cost-based optimisation: DP over left-deep joins, I/O cost model, histogram selectivities. Every commercial RDBMS — DB2, Oracle, SQL Server, Postgres, MySQL — descends from this.
- 1993–1995 — Graefe's Volcano and Cascades. Top-down memoised enumeration with cost-based pruning. Separates logical from physical plans. SQL Server, CockroachDB, and Apache Calcite ship Cascades-family optimisers. Handles bushy plans natively. The cardinality problem is unchanged.
- 2015 — Leis et al. Measured >100× plan-cost error on ~10% of real queries across every major engine. Verdict: "estimation is the culprit."
- 2018–present — Learned optimisers. Naru (2019) and MSCN learn joint distributions. Bao (2021) learns which hints to apply. Neo (2019) learns whole plans. Better in-distribution, worse out-of-distribution. None is the default anywhere in production.
- 2020–present — Adaptive execution. Spark AQE, SQL Server Adaptive Joins, DuckDB re-optimise-on-spill, Oracle statistics feedback. Re-plan mid-query. Helps, but cannot undo work already done.
Forty-five years, thousands of PhD theses, and the practical state is still "generate a plan, run it, hope." Every new idea improves things 20–50% on one workload and regresses the same on another. The problem is structural, not a lack of cleverness.
Practical symptoms you will see in production
Every one of these is a page you will get as a backend engineer.
"The query was 10 ms yesterday, 10 seconds today." ANALYZE ran overnight. A histogram bucket boundary shifted. The optimiser's cardinality estimate changed by a hair and flipped the join order. The data did not change; the estimate of the data did.
"Adding an index made the query slower." The optimiser sees a new access path and its cost model likes it. But statistics on the column are stale or the predicate is less selective than assumed, and the index scan returns 90% of the rows, at two I/Os each (index page plus heap lookup) instead of one. A bad index is worse than no index.
"Reordering the WHERE clauses didn't change the plan — until it did." Reordering predicates in a conjunction should be a no-op; the planner sees an unordered set. But it sometimes changes the canonical form, which changes which rewrites fire, which cascades into a different join order. You cannot predict when.
"The cached plan from the first execution is wrong for every call after." Bind-variable peeking. First bind: country = 'US' (80% of data), plan uses a scan. Every later call reuses the plan; bound with 'NZ' (0.1%), an index would be 1000× better. Oracle's adaptive cursor sharing and SQL Server's parameter sniffing patches help, with mixed results.
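Peeking is easy to reproduce in miniature. A toy plan cache with an invented 5% scan-versus-index threshold; real engines are subtler, but the failure mode is the same:

```python
def choose_plan(selectivity: float) -> str:
    """Toy rule: scan when the predicate matches many rows, else use the index."""
    return "seq_scan" if selectivity > 0.05 else "index_scan"

class PlanCache:
    """Peek at the first bind value, cache that plan, reuse it forever."""
    def __init__(self):
        self.cached = None

    def plan_for(self, selectivity: float) -> str:
        if self.cached is None:
            self.cached = choose_plan(selectivity)  # first execution peeks
        return self.cached                          # later calls reuse blindly

cache = PlanCache()
print(cache.plan_for(0.80))   # country = 'US': seq_scan, correctly
print(cache.plan_for(0.001))  # country = 'NZ': seq_scan again, wrongly
```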
The production fix is never a planner fix. You cannot wait for the optimiser to get smarter. The workaround is a hint (/*+ USE_HASH(u o) */), a plan guide, a materialised view, or a denormalisation. None of these fixes the optimiser. They sidestep it.
A three-table join the optimiser gets wrong
Three tables in an e-commerce warehouse. users: 10⁷ rows with a country column. products: 10⁵ rows with a category column. orders: 10⁸ rows with foreign keys to both.
SELECT u.name, p.title
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'IN' AND p.category = 'pooja-samagri';
Reality. The filters are tightly correlated — Indian users order pooja-samagri products overwhelmingly. Roughly 2 × 10⁶ orders match both.
What the optimiser sees. Independence assumption: s_{IN} = 0.1, s_{pooja} = 0.005. Combined: 5 × 10⁻⁴, expecting 50,000 matching orders.
Join order it picks. Smallest first. users ⋈ orders is estimated at 10⁷ rows; products ⋈ orders at 5 × 10⁵. The optimiser builds products ⋈ orders first, then joins against users.
What happens. Because of the correlation, products ⋈ orders returns 2 × 10⁶ rows — 4× the estimate. The hash probe against users sees four times the rows expected and spills to disk. Wall-clock: 40 seconds instead of 5.
Right join order. users ⋈ orders first, filtered to Indian users (2 × 10⁷). Then join against products filtered to pooja-samagri (500 rows) — that hash table fits in L2 cache. Runs in 3 seconds.
| quantity | estimate | actual |
|---|---|---|
| Indian users | 10⁶ | 10⁶ |
| pooja-samagri products | 500 | 500 |
| orders matching both | 50,000 | 2,000,000 |
| selectivity error | — | 40× |
The fix: multi-column extended statistics on (orders.user_id, orders.product_id), which Postgres has supported since v10. Almost nobody creates them; there is no automatic mechanism to know which column pairs are correlated. No one debugged storage. No one changed an index. The optimiser simply multiplied two numbers that should not have been multiplied.
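The mis-estimate is worth doing once by hand, or in three lines of Python reproducing the table's numbers:

```python
orders = 10**8
s_in, s_pooja = 0.1, 0.005           # the two filter selectivities

estimated = orders * s_in * s_pooja  # independence assumption
actual = 2 * 10**6                   # the correlated reality from the example
print(round(estimated), round(actual / estimated))  # 50000 40
```

Two correct one-dimensional statistics, one wrong multiplication, a 40× error.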
What makes a plan choice "right"
There is no plan that is "right" in an absolute sense. The optimal plan depends on facts the optimiser cannot know.
- Current data distribution. Not yesterday's. Unless ANALYZE has run since the last change, the optimiser works off stale statistics.
- Current buffer state. A sequential scan of a fully-cached users table costs nothing; the same scan cold costs minutes. The optimiser's I/O cost is uniform; it has no idea what is resident.
- Concurrent workload. A hash join that builds 2 GB is fine alone but evicts everyone else's working set under load. The optimiser plans in isolation.
- Hardware. SSD random read is 500 µs; spinning disk is 10 ms. Cost constants are calibrated at setup and rarely updated. Move to faster hardware and the relative costs change but the plans do not.
It runs on stale statistics, a crude I/O model, and no awareness of concurrent load. It will never be "right" — the best it can hope for is "not disastrously wrong most of the time," a lower bar than every other subsystem clears.
The three places you fix it in practice
If the optimiser cannot be fixed, your toolkit is a short list of workarounds.
Statistics. The cheapest fix. Run ANALYZE more often on hot tables. Create multi-column stats (CREATE STATISTICS in Postgres) for correlated column pairs. Raise default_statistics_target. Collect MCVs for skewed columns. Helps at the margin — does not fix unbounded error, just reduces average error. You cannot store joint stats on every column subset; the cost is combinatorial.
Plan forcing. Hints, plan guides, stored plans. The heavy hand. Oracle has /*+ USE_HASH(a b) */, /*+ ORDERED */. SQL Server has OPTION (HASH JOIN) and plan guides. Postgres refuses, but pg_hint_plan fills the gap. Hints rot: the right plan at 10 GB is wrong at 100 GB, and nobody reviews hints when data grows.
Schema. The heaviest hand, most durable. Denormalise. Add materialised views. Add covering indexes. Pre-compute aggregates with triggers or a CDC pipeline. If the optimiser keeps picking the wrong plan, rewrite the schema so there is only one plan and it is the right one. A materialised view that pre-joins users ⋈ orders ⋈ products turns the 40-second query into a 10-ms lookup. The optimiser no longer has a choice to get wrong. You pay in write amplification and consistency complexity, but latency becomes a property of the schema rather than of the optimiser's current mood.
Adaptive query processing — the most promising frontier
The insight: instead of fixing the optimiser, give up on getting the plan right up front. Start executing. Monitor actual cardinalities. Re-plan the remaining subtree when an estimate is badly wrong.
Spark AQE (since version 3.0, 2020) checks row counts after each shuffle stage and can switch a shuffled hash join to a broadcast join, coalesce small partitions, or skew-handle a hot key. SQL Server's Adaptive Joins (2017) start as nested-loop and switch to hash if input exceeds a compile-time threshold. Oracle's statistics feedback records observed cardinalities and re-compiles on the next call. DuckDB re-optimises when a hash build spills.
Limitations. Re-planning itself costs CPU — you cannot re-plan every row. Mid-query replanning cannot undo work already done; if 5 GB of hash-partition files are on disk, switching to merge join means re-sorting and the partitions are wasted. Adaptive systems re-plan only at barriers: end of a shuffle stage, end of a pipeline. Adaptive execution is where the field is heading, but it is not solved, and production systems that use it still ship hints, materialised views, and schema workarounds as their primary tools.
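The switch itself is simple; the hard part is everything around it. A toy sketch of a threshold-based adaptive join, not any engine's actual implementation; the threshold and data shapes are invented:

```python
def adaptive_join(outer_keys, inner, threshold=100):
    """Buffer the outer input (the pipeline barrier), then pick the algorithm
    from the ACTUAL cardinality rather than an estimate.
    inner: list of (key, payload) pairs."""
    outer = list(outer_keys)
    if len(outer) <= threshold:
        algo = "nested_loop"  # few outer rows: per-row lookups are cheapest
        rows = [(k, p) for k in outer for (ik, p) in inner if ik == k]
    else:
        algo = "hash_join"    # many outer rows: build once, probe per row
        table = {}
        for ik, p in inner:
            table.setdefault(ik, []).append(p)
        rows = [(k, p) for k in outer for p in table.get(k, [])]
    return rows, algo

inner = [(1, "a"), (2, "b")]
print(adaptive_join([1, 2], inner)[1])     # nested_loop
print(adaptive_join([1] * 500, inner)[1])  # hash_join
```

Note what the sketch buys and what it costs: the decision is made from a true row count, but only because the outer input was fully buffered at a barrier first.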
What this means for you as a developer
You will not fix the optimiser. You will work with it.
- Do not write clever SQL the optimiser cannot see through. Correlated subqueries in forms the engine does not unnest; WHERE func(col) = 'x' that hides predicates from index selection; OR clauses the engine cannot split into index unions. If the SQL is hostile, the plan will be too.
- Keep statistics fresh on hot tables. Nightly ANALYZE if your engine does not do it automatically. After a bulk load, the first query is terrible until ANALYZE runs.
- Keep the join count manageable. Each added table roughly doubles plan space and quadruples the chance of a bad choice. A twenty-table query is not going to run reliably anywhere. Break it up.
- Read the plan for every slow query. EXPLAIN ANALYZE shows you the chosen plan and the actual row counts. Eighty percent of the time the bug is visible on first read: an estimate 1000× off, a nested-loop where a hash join should be, a seq scan where an index was available.
A small diagnostic helper — given a plan tree from Postgres EXPLAIN (FORMAT JSON), classify its shape and surface the worst estimate error in the tree:
# diagnose.py — classify a query plan at a glance
def shape(node: dict) -> str:
    """Return 'leaf', 'left-deep', 'right-deep', or 'bushy'."""
    kids = node.get("Plans", [])
    if not kids:
        return "leaf"
    if len(kids) == 1:
        return shape(kids[0])  # unary node (Sort, Aggregate): look through it
    left, right = shape(kids[0]), shape(kids[1])
    if left != "leaf" and right != "leaf":
        return "bushy"  # both inputs contain joins
    if left != "leaf":
        return "left-deep"  # the join chain grows down the left input
    if right != "leaf":
        return "right-deep"
    return "left-deep"  # a single two-table join: left-deep by convention
def max_row_error(node: dict) -> float:
"""Largest estimate-vs-actual row-count ratio in the tree."""
est, act = node.get("Plan Rows", 1), node.get("Actual Rows", 1) or 1
here = max(est, act) / max(min(est, act), 1)
return max([here] + [max_row_error(k) for k in node.get("Plans", [])])
If max_row_error exceeds 100, you have a cardinality-estimation problem and should chase the plan, not the execution.
Going deeper
Two directions in active research.
The robustness-versus-optimality trade-off
The classical formulation: given cost estimates, pick the cheapest plan. A newer formulation: given that estimates are wrong, pick a plan whose worst-case performance is bounded, even if its best case is not the absolute minimum.
If plan A costs 10 when the estimate is right and 10,000 when it is 100× off, and plan B costs 20 always, B is "more robust." Slightly suboptimal on average, dramatically better in the worst case. A production system that runs slightly slower on average but never catastrophically is easier to operate than one that runs faster on average but pages you at 3am. Yin et al. (2022) formalise this as picking plans inside a "robustness region" of the parameter space. Oracle's bind-sensitive cursors and SQL Server's plan-stability features gesture toward this, but the fully-robust optimiser has not shipped anywhere.
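The choice can be stated as a minimax rule. A sketch using the plan-A/plan-B numbers above; the cost at 10× off (300) is an invented interpolation:

```python
# cost of each plan at 1x (estimate correct), 10x, and 100x estimation error
plan_costs = {
    "A": {1: 10, 10: 300, 100: 10_000},  # fast when the estimate holds
    "B": {1: 20, 10: 20, 100: 20},       # flat regardless of error
}

def pick_at_estimate(costs):
    """Classical optimiser: cheapest plan assuming the estimate is right."""
    return min(costs, key=lambda p: costs[p][1])

def pick_robust(costs):
    """Minimax: the plan with the smallest worst-case cost over the error range."""
    return min(costs, key=lambda p: max(costs[p].values()))

print(pick_at_estimate(plan_costs), pick_robust(plan_costs))  # A B
```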
Was Selinger right about how hard this is?
Selinger is said to have called the System R optimiser "the hardest software project I ever worked on." Is that still true with compilers, kernels, and distributed systems to compare against?
Probably yes, for a narrow reason. Compilers deal with large search spaces too; operating systems face adversarial workloads too. What makes the optimiser worse is that it must produce a new decision for every query, and the decision depends on runtime data that shifts faster than offline calibration tracks. A compiler picks a register allocation once per function; the optimiser picks a plan every time the query is parsed. A compiler's input (the program) is stable; the optimiser's input (the data distribution) drifts daily. The closest analogy is real-time ad bidding — exponential action space, incomplete information, shifting distribution. Not a reassuring comparison.
Where this leads next
Build 6 ends here. From the iterator model, through hash join and sort-merge join, through rule-based rewrites and cost-based enumeration, you have a working query engine. You know how plans are chosen, why the choices are often wrong, and what to do when they are.
Build 7 turns to transactions and concurrency control: two-phase locking, MVCC, snapshot isolation, serialisable snapshot isolation. These are firmly in the "solved problem" category. The algorithms have proofs; implementations vary in constants, not in correctness. You will build an MVCC system from scratch and end up with the semantic equivalent of what Postgres ships. After that comes recovery (ARIES) in Build 8 and indexing structures in Build 9. All solved. All textbook. All satisfying. Enjoy them after this chapter.
References
- Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — the founding paper. DP over left-deep join orders, histogram-based selectivity, interesting orders. Every commercial optimiser still descends from this.
- Leis, Gubichev, Mirchev, Boncz, Kemper, Neumann, How Good Are Query Optimizers, Really?, VLDB 2015 — measured 100×-plus estimation error on ~10% of real queries across five engines. The most-cited result in modern optimiser research.
- Graefe, Query Evaluation Techniques for Large Databases, ACM Computing Surveys 25(2), 1993 — the canonical survey of execution and optimisation. Still the first thing a new optimiser engineer reads.
- Graefe and McKenna, The Volcano Optimizer Generator: Extensibility and Efficient Search, ICDE 1993 — the Volcano/Cascades framework behind SQL Server, CockroachDB, and Calcite.
- Marcus, Negi, Mao, Zhang, Alizadeh, Kraska, Papaemmanouil, Tatbul, Neo: A Learned Query Optimizer, VLDB 2019 — the RL-based learned optimiser. Cautiously optimistic in the paper; still not the default anywhere in production five years on, which is itself informative.
- Stonebraker and Hellerstein, What Goes Around Comes Around... And Around..., 2024 — honest retrospective on how little progress has been made on the fundamental estimation problem despite massive investment, and why learned optimisers have not displaced cost-based ones.