In short

The AST the parser and binder produce is a tree of SQL syntax — it still has SELECT, FROM, WHERE, and implicit joins lurking inside comma-separated table lists. No optimiser can reason cleanly about a tree shaped like that, because two queries with identical meaning can have very different syntactic shapes. So the next pass throws the AST away and builds a new tree in a tiny, regular language: relational algebra. Six core operators — select σ (filter rows), project π (pick columns), join ⋈, union ∪, difference −, Cartesian product × — plus two practical extras, aggregate γ and distinct δ, are enough to express every SQL query you are likely to write. Each operator takes one or two relations and returns a relation, so operators compose into trees. The shape of that tree is the logical plan. On top of the logical plan, the optimiser applies rewrite rules — push predicates down past joins, reorder joins, merge consecutive projections — each of which is a pattern match on tree shape. Eventually the logical tree is replaced by a physical plan of the same shape, where each logical operator has been assigned a concrete algorithm (hash join vs nested-loop join, index scan vs full scan). This chapter builds the Python dataclasses for the algebra, shows how to translate a SELECT-FROM-WHERE AST into an algebra tree, and sets up the IR that the next three chapters' optimiser and executor will operate on.

The parser and the binder have done their work. Your database is holding a bound AST: a Select node whose columns are resolved ColumnRef(table_id=users, column_id=name) objects, whose table is a resolved TableRef(table_id=users), whose WHERE is a type-checked expression. Every name has been turned into an id. Every type has been verified. The tree is correct.

And you still cannot optimise it.

The reason is that the AST is a tree of SQL syntax, not a tree of operations on relations. Consider these two queries:

-- Query A
SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000

-- Query B
SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id WHERE o.total > 1000

They mean the same thing. They return the same rows. A smart optimiser should pick the same plan for both. But their ASTs are different shapes — Query A has a comma-list of tables and a two-part AND in the WHERE; Query B has a JOIN node with its own ON clause and a simpler WHERE. If your optimiser's rewrite rules match on AST shape, you are writing one rule for the comma form and a parallel rule for the JOIN form, and you are still missing the dozen other syntactic shapes that mean the same thing. This does not scale.

The fix is a change of representation. Throw the AST away. Build a new tree in a language where both queries produce the identical structure — where the only choices are which operators are used and in what order. That language is relational algebra. Codd introduced it in 1970 as the mathematical basis of the relational model; every real database uses some extended version of it as its intermediate representation.

The six core operators

A relation is a set (or bag) of tuples over a named schema. Relational algebra is a set of operators that each take one or two relations and return a relation. The closure property — operator output is itself a relation — is what makes operators composable into trees.

Six operators form the core. Every other operation you need can be built from them.

[Figure: six boxes, one per core operator — σ select (keep rows where p holds, σ_p(R)); π project (drop columns not in the list, π_{a,b}(R)); ⋈ join (paired rows where the condition holds, R ⋈_p S); ∪ union (rows in R or in S); − difference (rows in R not in S); × product (every pair of rows, R × S). Select, project, and join are highlighted as the three you see in almost every query; union, difference, and product appear in set operations, NOT EXISTS, and the theoretical derivation of join — join itself is defined as σ(R × S), a filter on top of a Cartesian product.]
The six core operators of relational algebra. Every SQL query you will write lowers into a tree built out of these six shapes, plus two practical extras (aggregate and distinct) that extended algebra adds on top.

Two practical additions that every real system bolts on top:

  1. Aggregate γ — group rows by a list of expressions and compute aggregates (COUNT, SUM, AVG, ...) per group; this is GROUP BY.
  2. Distinct δ — remove duplicate rows; this is SELECT DISTINCT.

Eight operators, each with a precise definition, each composable with every other. That is the whole language.
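
Before building any classes, you can feel the operators operationally. A minimal sketch over plain lists of dicts standing in for relations — nothing here is the chapter's real code, just the algebra made executable, including join defined literally as σ over ×:

```python
def select_(rows, pred):            # σ — keep rows where pred holds
    return [r for r in rows if pred(r)]

def project(rows, cols):            # π — keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def cross(r, s):                    # × — every pair of rows, merged
    return [{**a, **b} for a in r for b in s]

def join(r, s, pred):               # ⋈ — literally σ(R × S)
    return select_(cross(r, s), pred)

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 1500}, {"user_id": 2, "total": 10}]

# Each operator returns a relation, so calls nest — the closure property.
result = project(
    select_(join(users, orders, lambda r: r["id"] == r["user_id"]),
            lambda r: r["total"] > 1000),
    ["name"])
assert result == [{"name": "ada"}]
```

The nesting of calls is exactly the tree the rest of the chapter draws: π over σ over ⋈ over two leaf relations.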

Translating a SQL AST to an algebra tree

Here is the translation for SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000. The AST the binder hands over has a list of two tables, a conjunction of two predicates, and a single projected column. The algebra tree is built bottom-up.

[Figure: the statement SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000 at the top, lowering into the tree below — π[u.name] at the root, over σ[o.total > 1000], over ⋈[u.id = o.user_id], whose children are scan(users) and scan(orders). The two FROM tables become leaf scans joined by the equality predicate from WHERE; the non-join predicate lifts into σ; the SELECT list becomes π at the root.]
Lowering a SELECT-FROM-WHERE statement to an algebra tree. Each SQL clause corresponds to a named node; the shape of the tree is what the optimiser rewrites in the next chapter.

The rules for lowering a SELECT-FROM-WHERE AST are mechanical:

  1. FROM clause → one Scan(table) leaf per referenced table.
  2. Multiple FROM tables → a Join with no predicate (equivalent to ×) combining the scans, left-deep by default: Join(Join(Scan(a), Scan(b)), Scan(c)).
  3. WHERE clause → split the predicate on AND. Every equality that mentions columns from two different scans becomes the join predicate and lifts into the nearest Join node. Everything else becomes a Filter (σ) above the joins.
  4. SELECT list → a single Project (π) node at the root, over whatever the joins-and-filters produced.
  5. GROUP BY → a Group (γ) node between the filter and the project.
  6. DISTINCT → a Distinct (δ) node above the project.
  7. ORDER BY, LIMIT → Sort and Limit nodes at the very top.

Note step 3: the optimiser's job begins here. Even the initial lowering pulls join-equality predicates out of the WHERE and attaches them to the Join. That is already a small rewrite — and it is why comma-FROM and JOIN-FROM land on identical trees. Both syntaxes produce the same Join(Scan, Scan, predicate) node. The comma-vs-JOIN distinction is a property of the AST; it does not survive into the algebra.
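
To make the claim concrete, here is a sketch with simplified stand-in dataclasses (not yet the chapter's real classes) building the tree both lowerings produce. Dataclasses compare structurally, so "identical trees" is a one-line assert:

```python
from dataclasses import dataclass

# Simplified stand-ins for the algebra classes built later in the chapter.
@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Project:
    input: object
    columns: tuple

# What lowering the comma-FROM query (Query A) produces:
plan_a = Project(
    Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
           "o.total > 1000"),
    ("u.name",))

# What lowering the JOIN ... ON query (Query B) produces — same
# constructors, same shape, because the join predicate ends up on the
# Join node either way:
plan_b = Project(
    Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
           "o.total > 1000"),
    ("u.name",))

assert plan_a == plan_b   # the syntactic difference did not survive lowering
```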

The Python class hierarchy

Here is the algebra, expressed as Python dataclasses. Each operator is a class with its inputs (the child plans) and its parameters (the predicate, the column list, the aggregate specs).

# query/algebra.py
from dataclasses import dataclass

# Expressions that appear inside operators — predicates, projected columns,
# aggregate arguments. Reuses the bound-AST expression tree from the binder.
from .bound import BoundExpr, ColumnRef, AggregateCall

@dataclass
class Scan:
    """A leaf: read all rows of a base table."""
    table_id: int
    alias: str
    # schema is derivable from the catalog; cached here for convenience.
    columns: list[ColumnRef]

@dataclass
class Filter:
    """σ — keep rows where predicate evaluates to true."""
    input: "Plan"
    predicate: BoundExpr

@dataclass
class Project:
    """π — keep only listed columns (or computed expressions)."""
    input: "Plan"
    exprs: list[BoundExpr]
    output_names: list[str]

@dataclass
class Join:
    """⋈ — combine two relations. predicate=None means Cartesian product."""
    left: "Plan"
    right: "Plan"
    predicate: BoundExpr | None = None
    kind: str = "inner"            # inner | left | right | full | semi | anti

@dataclass
class Union:
    """R ∪ S — schemas must match."""
    left: "Plan"
    right: "Plan"
    all: bool = False              # True for UNION ALL (no duplicate removal)

@dataclass
class Difference:
    """R − S — schemas must match."""
    left: "Plan"
    right: "Plan"

@dataclass
class Product:
    """R × S — Cartesian product, used only internally; the optimiser
    converts this to Join(predicate=...) whenever possible."""
    left: "Plan"
    right: "Plan"

@dataclass
class Group:
    """γ — GROUP BY with aggregates."""
    input: "Plan"
    group_by: list[BoundExpr]
    aggregates: list[AggregateCall]

@dataclass
class Distinct:
    """δ — remove duplicate rows."""
    input: "Plan"

@dataclass
class Sort:
    """ORDER BY."""
    input: "Plan"
    keys: list[tuple[BoundExpr, str]]   # (expr, "asc" | "desc")

@dataclass
class Limit:
    """LIMIT n [OFFSET m]."""
    input: "Plan"
    count: int
    offset: int = 0

# PEP 604 "|" syntax sidesteps the clash between typing.Union and our
# own Union operator class above.
Plan = (Scan | Filter | Project | Join | Union | Difference
        | Product | Group | Distinct | Sort | Limit)

Why every operator has an input (or left/right) field pointing to another Plan: this is the closure property in code. The type Plan is a sum of all the operator classes, and every operator's child is also a Plan. So trees can be arbitrarily nested — Project over Filter over Join over Scan — and the type system allows it.

Why Product exists at all when Join with predicate=None would suffice: it is a hint to the optimiser that a subsequent rule should attempt to find a predicate and convert this to a real join. Some systems collapse the two, some keep them distinct. Postgres's planner has both; SQLite has only one. For teaching, keeping them separate makes the lowering rules and the optimiser rules easier to write down, because you can say "never leave a Product in the final plan".
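
The rule that conversion step can use is tiny. A sketch with simplified stand-ins, where predicate analysis is reduced to a string hack — mentions and aliases here are toy versions of the real bound-expression helpers:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    alias: str

@dataclass
class Product:
    left: object
    right: object

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

def aliases(plan) -> set[str]:
    """Set of table aliases a subtree produces (toy version)."""
    if isinstance(plan, Scan):
        return {plan.alias}
    return aliases(plan.left) | aliases(plan.right)

def mentions(predicate: str) -> set[str]:
    """Aliases a predicate touches — parses 'u.id = o.user_id' style strings."""
    return {tok.split(".")[0]
            for tok in predicate.replace("=", " ").split() if "." in tok}

def product_to_join(node):
    """A Filter directly above a Product whose predicate spans both sides
    becomes a real Join; anything else is left alone."""
    if isinstance(node, Filter) and isinstance(node.input, Product):
        p = node.input
        if (mentions(node.predicate) & aliases(p.left)
                and mentions(node.predicate) & aliases(p.right)):
            return Join(p.left, p.right, node.predicate)
    return node

before = Filter(Product(Scan("u"), Scan("o")), "u.id = o.user_id")
after = product_to_join(before)
assert isinstance(after, Join)
```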

Why Scan carries the alias and a precomputed columns list: later passes (the optimiser, the executor) need to know the output schema of every operator in the tree without having to go back to the catalog. By the time an algebra tree is built, every node can answer "what columns do I produce, and in what order?" in O(1).
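
What "every node can answer its output schema" looks like as code — a sketch over simplified stand-ins; a real system would cache the answer on each node at build time rather than recompute it:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    alias: str
    columns: list          # cached at build time from the catalog

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Project:
    input: object
    output_names: list

def output_columns(plan) -> list:
    """What columns does this node produce, and in what order?"""
    if isinstance(plan, Scan):
        return [f"{plan.alias}.{c}" for c in plan.columns]
    if isinstance(plan, Filter):
        return output_columns(plan.input)      # σ never changes the schema
    if isinstance(plan, Join):
        return output_columns(plan.left) + output_columns(plan.right)
    if isinstance(plan, Project):
        return list(plan.output_names)         # π defines its own schema
    raise TypeError(plan)

tree = Project(
    Filter(Join(Scan("u", ["id", "name"]), Scan("o", ["user_id", "total"]),
                "u.id = o.user_id"),
           "o.total > 1000"),
    ["name"])
assert output_columns(tree) == ["name"]
```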

Example: lowering a real query

Take the running example:

SELECT u.name, COUNT(*) AS order_count
FROM users u, orders o
WHERE u.id = o.user_id AND o.total > 1000
GROUP BY u.name

After the parser and the binder, the AST has:

  1. a FROM list of two bound tables: users (alias u) and orders (alias o);
  2. a WHERE that is an AND of two type-checked predicates: u.id = o.user_id and o.total > 1000;
  3. a GROUP BY list containing u.name;
  4. a SELECT list with two entries: u.name, and COUNT(*) aliased order_count.

The lowering function walks the AST and emits the algebra tree in reverse: start at the leaves, build up.

# query/lower.py
from .algebra import Plan, Scan, Filter, Project, Join, Group
from .bound import split_conjuncts, mentions, AggregateCall

# Two small helpers, defined elsewhere in the module:
#   _aliases(plan)  — the set of table aliases a subtree produces
#   _and_all(preds) — fold a non-empty list of predicates into one AND

def lower(bound_select) -> Plan:
    # 1. Leaves: one Scan per FROM table.
    plans = [Scan(t.table_id, t.alias, t.columns) for t in bound_select.from_tables]

    # 2. Reduce the list of scans to a single Join (left-deep), pulling
    #    join predicates out of WHERE as we go.
    conjuncts = split_conjuncts(bound_select.where) if bound_select.where else []
    join_preds, other_preds = [], []
    for c in conjuncts:
        refs = mentions(c)           # set of table aliases the predicate touches
        if len(refs) == 2:
            join_preds.append(c)
        else:
            other_preds.append(c)    # single-table or constant predicate

    tree = plans[0]
    for right in plans[1:]:
        # pick predicates that now connect both sides; the rest stay in the pool
        applicable = [p for p in join_preds if mentions(p) <= _aliases(tree) | _aliases(right)]
        for p in applicable: join_preds.remove(p)
        tree = Join(left=tree, right=right,
                    predicate=_and_all(applicable) if applicable else None)

    # 3. Remaining single-table predicates become one Filter above the join.
    if other_preds:
        tree = Filter(input=tree, predicate=_and_all(other_preds))

    # 4. GROUP BY becomes a Group node.
    if bound_select.group_by:
        aggs = [a for a in bound_select.select_list if isinstance(a, AggregateCall)]
        tree = Group(input=tree, group_by=bound_select.group_by, aggregates=aggs)

    # 5. The SELECT list becomes the outermost Project.
    tree = Project(input=tree,
                   exprs=[e.expr for e in bound_select.select_list],
                   output_names=[e.alias for e in bound_select.select_list])

    return tree

Run it on the example and you get this tree (shown in a cleaned-up repr form):

Project(
  exprs=[u.name, COUNT(*)],
  output_names=['name', 'order_count'],
  input=Group(
    group_by=[u.name],
    aggregates=[COUNT(*)],
    input=Filter(
      predicate=(o.total > 1000),
      input=Join(
        predicate=(u.id = o.user_id),
        left=Scan('users', alias='u', ...),
        right=Scan('orders', alias='o', ...),
      )
    )
  )
)

Six nodes, exactly one for each SQL clause. The shape is regular. Every SELECT-FROM-WHERE-GROUP BY query will produce a tree of this shape, modulo the number of joins, the specifics of the predicates, and whether Filter/Group are present or absent.

The lowering code is around 40 lines. It is mechanical in a way the parser was not — there are no precedence questions, no syntactic ambiguity, just a systematic walk from the AST fields to the algebra constructors.
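
The nested repr gets hard to read past a few levels. A generic pretty-printer is a dozen more lines — a sketch that walks any dataclass tree whose child fields are named input, left, or right, the convention the algebra classes above follow:

```python
from dataclasses import dataclass, fields, is_dataclass

# Stand-in nodes for the demo; the function itself is fully generic.
@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

CHILD_FIELDS = ("input", "left", "right")

def explain(plan, depth: int = 0) -> str:
    """One line per operator: class name plus its non-child parameters,
    with children indented beneath — the EXPLAIN form databases print."""
    params = {f.name: getattr(plan, f.name) for f in fields(plan)
              if f.name not in CHILD_FIELDS}
    line = "  " * depth + type(plan).__name__ + (f" {params}" if params else "")
    children = [getattr(plan, n) for n in CHILD_FIELDS
                if any(f.name == n for f in fields(plan))]
    return "\n".join([line] + [explain(c, depth + 1)
                               for c in children if is_dataclass(c)])

tree = Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
              "o.total > 1000")
print(explain(tree))
```

On the example tree this prints Filter at the root, Join indented one level, and the two Scans indented below the Join.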

Why algebra as the IR — rewrite rules on tree shape

Here is the payoff. Consider the rule predicate push-down: a filter on columns from a single table can be moved below a join that does not touch those columns, because filtering before joining produces the same answer but reads fewer rows. In algebra, the rule is:

\sigma_p(R \bowtie_q S) \;\equiv\; \sigma_p(R) \bowtie_q S \quad \text{when } p \text{ only mentions columns of } R

As Python code against the algebra dataclasses, that rule is five lines:

def push_down_filter(node: Plan) -> Plan:
    if isinstance(node, Filter) and isinstance(node.input, Join):
        j = node.input
        if _only_mentions(node.predicate, _schema_of(j.left)):
            return Join(left=Filter(j.left, node.predicate),
                        right=j.right, predicate=j.predicate, kind=j.kind)
    return node

That is the whole rule. The optimiser applies it repeatedly, walking the tree, until no more pushes are possible. Now imagine trying to write the equivalent rule against the AST: you would need one pattern for the comma-FROM shape, another for the JOIN-ON shape, logic to split the WHERE conjunction yourself, and you would still miss the dozen other syntactic variants that mean the same thing.

This is why everyone uses algebra as the IR. The operators are uniform, the tree is regular, and the rewrite rules become pattern matches on tree shape. Codd's 1970 algebra and Graefe's Cascades architecture of the 1990s are both built on this observation: a good IR makes transformations short.
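
"Applies it repeatedly, walking the tree" is itself only a few lines against dataclasses. A sketch, again with simplified stand-ins, using dataclasses.replace to rebuild nodes bottom-up and a fixpoint loop on structural equality:

```python
from dataclasses import dataclass, fields, is_dataclass, replace

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str

def rewrite_bottom_up(plan, rule):
    """Rewrite children first, then give the rule a chance at this node."""
    updates = {f.name: rewrite_bottom_up(getattr(plan, f.name), rule)
               for f in fields(plan) if is_dataclass(getattr(plan, f.name))}
    return rule(replace(plan, **updates) if updates else plan)

def to_fixpoint(plan, rule):
    """Apply the rule pass after pass until the tree stops changing."""
    while True:
        new = rewrite_bottom_up(plan, rule)
        if new == plan:          # dataclass equality is structural
            return plan
        plan = new

# A toy rule: merge a Filter sitting directly above another Filter.
def merge_filters(node):
    if isinstance(node, Filter) and isinstance(node.input, Filter):
        inner = node.input
        return Filter(inner.input,
                      f"({inner.predicate}) AND ({node.predicate})")
    return node

tree = Filter(Filter(Scan("orders"), "total > 1000"), "status = 'paid'")
merged = to_fixpoint(tree, merge_filters)
assert isinstance(merged.input, Scan)   # the two filters became one
```

The same driver runs push_down_filter or any other rule; rules stay five lines because the walking is factored out.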

A few other rules that are one-liners on the algebra and nightmares on the AST:

  1. Merge consecutive projections: π_a(π_{a,b}(R)) ≡ π_a(R).
  2. Merge consecutive filters: σ_p(σ_q(R)) ≡ σ_{p AND q}(R).
  3. Join commutativity: R ⋈_p S ≡ S ⋈_p R.
  4. Join associativity: (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T) — the basis of join reordering.

Each of these is the basis of a standard optimiser rewrite, and each is about ten lines of Python against the algebra dataclasses.

Logical vs physical operators

So far every operator has been a logical operator: it says what relation is computed, not how. Join(left, right, predicate=...) does not say whether to use a hash join, a merge join, or a nested-loop join. Scan(table) does not say whether to use the primary index, a secondary index, or a full table scan. Those choices are physical.

The planner runs in two phases. First, logical planning: lower the AST into an algebra tree, apply rewrite rules to produce an equivalent algebra tree with better structure (predicates pushed down, joins reordered, projections merged). Then, physical planning: walk the logical tree and, at each node, pick a concrete algorithm — replacing the logical operator with a physical one of the same shape.

[Figure: two trees side by side, with an arrow between them labelled 'cost-based physical planning'. Left, the logical plan: π[u.name] over σ[o.total > 1000] over ⋈[id=user_id] over users and orders. Right, the physical plan with the same shape: ProjectExec over FilterExec over HashJoinExec over IndexScanExec and SeqScanExec. Each logical node becomes one of several possible physical nodes; the choice is driven by table size, indexes, and memory estimates.]
Logical and physical plans have the same tree shape. Physical planning is the act of assigning a concrete algorithm to each logical operator. The shape is fixed by logical optimisation; the algorithms are chosen by cost.

The two layers let you separate concerns:

  1. Logical planning cares only about equivalence: every rewrite must return the same rows, and no cost model is needed to prove a rule correct.
  2. Physical planning cares only about cost: every candidate algorithm computes the same relation, so the choice can change the speed but never the answer.

Physical operators have the same tree shape as logical operators — there is a HashJoinExec for every Join, a FilterExec for every Filter, an IndexScanExec or SeqScanExec for every Scan. They take the same children. What changes is the algorithm and, sometimes, the schema of what flows through (a HashJoinExec might build a hash table on the smaller side and probe with the larger; the order is a physical choice).

The next chapter in this build (Iterator-model execution) is about physical operators: each one implements open(), next(), close() and pulls rows from its children. The chapter after that (Rule-based rewrites) runs at the logical level: rules that match on Plan tree shape and produce new Plan trees. Both chapters stand on the algebra you just built.

Common confusions

  1. σ is called "select", but it corresponds to SQL's WHERE, not to SQL's SELECT. The SELECT list is π, project. The algebra's names predate SQL.
  2. A logical Join is not an algorithm. It names a relation to compute; hash join, merge join, and nested-loop join are physical choices made later.
  3. The algebra tree is not an execution-order contract. The optimiser is free to reshape it, because every rewrite rule preserves the relation the tree denotes.

Going deeper

If you wanted to know what IR sits between parser and executor, you have it. What follows is the bridge from the toy to real production optimisers — the Cascades framework, NULL semantics, and the distinction between set and bag algebra that trips up every first serious SQL implementer.

Graefe's Cascades framework

Goetz Graefe's Cascades (Graefe 1995) is the optimiser architecture used, in some form, by SQL Server, Greenplum, CockroachDB, Apache Calcite, Orca (Pivotal), and a dozen others. Its core insight is that optimisation should be a search over equivalent trees, with memoisation.

The architecture has three parts:

  1. A memo structure: every algebraically-equivalent subtree is represented by a single group. A group contains all the alternative expressions that compute the same relation. When the optimiser rewrites R ⋈ S to S ⋈ R, it does not replace the tree — it adds the new expression to the same group.

  2. Logical and physical rules: logical rules preserve equivalence and create new logical expressions in the same group (associativity, commutativity, predicate pushdown). Physical rules turn a logical expression into a physical implementation (Join → HashJoin, NestedLoopJoin, or MergeJoin).

  3. A top-down, cost-based search: the optimiser walks the memo, applying rules, and asks "what is the cheapest physical expression for this group?". Costs propagate from children to parents. Pruning cuts branches whose bound exceeds the best known.
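
The memo idea fits in a toy sketch — nothing like Cascades' real machinery, but it shows the one structural trick: a logical rule appends an alternative to the existing group instead of rewriting the tree. All names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScanExpr:
    table: str

@dataclass(frozen=True)
class JoinExpr:
    left: int        # group id of the left input
    right: int       # group id of the right input

class Memo:
    def __init__(self):
        self.groups: list[list] = []   # group id -> alternative expressions

    def add_group(self, expr) -> int:
        self.groups.append([expr])
        return len(self.groups) - 1

    def add_alternative(self, group_id: int, expr) -> None:
        if expr not in self.groups[group_id]:
            self.groups[group_id].append(expr)

def commute_joins(memo: Memo) -> None:
    """Logical rule R ⋈ S ≡ S ⋈ R: adds the flipped expression to the
    SAME group rather than replacing anything."""
    for gid, exprs in enumerate(memo.groups):
        for e in list(exprs):
            if isinstance(e, JoinExpr):
                memo.add_alternative(gid, JoinExpr(e.right, e.left))

memo = Memo()
users = memo.add_group(ScanExpr("users"))
orders = memo.add_group(ScanExpr("orders"))
top = memo.add_group(JoinExpr(users, orders))
commute_joins(memo)
assert len(memo.groups[top]) == 2   # original and commuted, one group
```

The cost-based search then asks each group for its cheapest physical alternative; the groups keep the search space from exploding by sharing subtrees.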

Cascades is much more sophisticated than what this chapter builds — the toy optimiser in the next chapter uses rule-based, bottom-up rewrites with no memoisation and no cost model. That is a deliberate choice: rule-based optimisers are the baseline every undergraduate learns before moving to Cascades. See the references section for Graefe's original paper.

Bag semantics vs set semantics

Codd's 1970 algebra is over sets: no duplicates, R ∪ R = R, σ never produces duplicates that weren't already there. SQL is over bags (multisets): duplicates are kept unless the user says DISTINCT, UNION ALL is the default cheap union, COUNT(*) explicitly counts duplicates.

The difference matters for rewrite rules. Under set semantics:

\pi_a(R \cup S) \;\equiv\; \pi_a(R) \cup \pi_a(S)

Under bag semantics, this is still true if UNION is UNION ALL — but it is not true if the union is a set union (duplicate-eliminating), because projecting first might turn distinct rows into duplicates and change the count.
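
The failure is easy to demonstrate with collections.Counter standing in for a bag (a sketch; rows are tuples, and π here is SQL-style bag projection that keeps duplicates):

```python
from collections import Counter

# Relations as bags of (a, b) tuples.
R = Counter({("x", 1): 1, ("x", 2): 1})   # two rows, distinct only on column b
S = Counter({("y", 3): 1})

def project_a(rel):
    """π_a over a bag: keep column a, keep duplicates."""
    out = Counter()
    for (a, _b), n in rel.items():
        out[(a,)] += n
    return out

def bag_union(r, s):          # UNION ALL
    return r + s

def set_union(r, s):          # duplicate-eliminating UNION
    return Counter({t: 1 for t in (r + s)})

# Valid under UNION ALL: both sides give the same bag.
assert project_a(bag_union(R, S)) == bag_union(project_a(R), project_a(S))

# Invalid under set UNION: projecting first creates duplicate ('x',) rows
# that the subsequent union then collapses, changing the count.
lhs = project_a(set_union(R, S))
rhs = set_union(project_a(R), project_a(S))
assert lhs != rhs
```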

Every real optimiser has a set of rules that is valid under bag semantics and a (usually larger) set valid only under set semantics, and it has to know which it is applying. Garcia-Molina, Ullman, and Widom's Database Systems: The Complete Book works through this in detail (Chapter 5).

NULL and three-valued logic

SQL's NULL is infamous because it turns boolean predicates into three-valued logic: true, false, and unknown. NULL = NULL is unknown, not true. NULL = 5 is unknown. NULL AND false is false, but NULL AND true is unknown. This interacts with algebra rewrites in subtle ways.

Consider the rewrite σ_{p AND q}(R) ≡ σ_p(σ_q(R)). Under two-valued logic this is trivially true — a row satisfies p AND q iff it satisfies both. Under three-valued logic, σ discards rows where the predicate is not true (i.e., false OR unknown). The rewrite still holds, but the proof uses the fact that AND over three-valued logic is monotonic — a worked exercise in any serious database course. The optimiser needs to be aware that some rewrites change the NULL-handling behaviour and cannot be blindly applied.
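
Three-valued logic is small enough to model directly, with None standing in for unknown — a sketch that checks the σ_{p AND q} split really does survive 3VL:

```python
# Kleene three-valued AND: None stands for SQL's unknown.
def and3(p, q):
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def eq3(a, b):
    """SQL equality: any NULL operand makes the comparison unknown."""
    if a is None or b is None:
        return None
    return a == b

def select(rows, pred):
    """σ keeps only rows where the predicate is TRUE — rows evaluating
    to false AND rows evaluating to unknown are both discarded."""
    return [r for r in rows if pred(r) is True]

rows = [{"a": 1, "b": 1}, {"a": 1, "b": None}, {"a": None, "b": 2}]
p = lambda r: eq3(r["a"], 1)
q = lambda r: eq3(r["b"], r["b"])   # unknown when b is NULL — not true!

# σ_{p AND q}(R) == σ_p(σ_q(R)) even under three-valued logic:
combined = select(rows, lambda r: and3(p(r), q(r)))
split = select(select(rows, q), p)
assert combined == split == [{"a": 1, "b": 1}]
```

Note that q drops the row with b NULL even though "b = b" looks tautological — exactly the NULL = NULL trap the text describes.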

Postgres's src/backend/optimizer/ contains dozens of safety checks that amount to "do not apply this rewrite if a column in the predicate could be NULL".

Extended algebra: what real systems track

Production IRs add annotations to every operator that pure algebra does not have:

  1. Output schema — the names, types, and nullability of the columns the node produces.
  2. Estimated cardinality — how many rows the node is expected to emit, derived from table statistics.
  3. Estimated cost — CPU and I/O estimates used to choose among physical alternatives.
  4. Physical properties — sort order, partitioning, and distribution, which let the planner skip redundant sorts and shuffles.

These live as fields on each Plan node in mature systems. The next chapter's iterator model will add cardinality estimates to the physical plan.

Apache Calcite: an algebra IR you can read

For a modern, production-quality, open-source algebra IR, read Apache Calcite's RelNode class hierarchy. Calcite is the SQL planning layer used by Apache Flink, Apache Drill, Apache Hive (via Calcite integration), Dremio, and many others. Its algebra is exactly the extended form described above — Scan, Project, Filter, Join, Aggregate, Union, Minus, Sort, Values, Window, Correlate — with cost and cardinality annotations and a Cascades-style optimiser on top.

Reading Calcite's rule set (RelOptRule subclasses) is the fastest way to see what a hundred-plus rewrite rules look like in one place.

Where this leads next

You now have an IR — a small, regular tree of operators that represents any bound SQL query. The parser and binder feed into it; the optimiser and executor read out of it.

Relational algebra is the hinge between the meaning of a query (what rows to return) and its execution (how to compute them efficiently). By putting a formally-defined IR between the two halves of the engine, you give every later pass a simple job: take a tree, return a tree. That is the entire reason this chapter exists.

References

  1. Codd, A Relational Model of Data for Large Shared Data Banks, CACM 1970 — the founding paper. Introduces the relational model and the algebra (and calculus) that defines it.
  2. Graefe, The Cascades Framework for Query Optimization, IEEE Data Eng. Bulletin 1995 — the optimiser architecture used by SQL Server, CockroachDB, Calcite, and others. Explains memo, groups, logical/physical rules, and top-down search.
  3. Graefe, Volcano — An Extensible and Parallel Query Evaluation System, IEEE TKDE 1994 — introduces the iterator execution model that pairs with this chapter's algebra.
  4. Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — the System R paper that defined cost-based optimisation and join-order enumeration.
  5. Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book (2nd ed., 2008), Ch. 2 and 16 — textbook treatment of the algebra, including bag semantics and NULL handling.
  6. Apache Calcite, RelNode class hierarchy — a modern, production-quality open-source algebra IR with a full Cascades-style optimiser on top. The canonical example of what a real extended algebra looks like in code.