In short

The AST the parser and binder produce is a tree of SQL syntax — it still has SELECT, FROM, WHERE, and implicit joins lurking inside comma-separated table lists. No optimiser can reason cleanly about a tree shaped like that, because two queries with identical meaning can have very different syntactic shapes. So the next pass throws the AST away and builds a new tree in a tiny, regular language: relational algebra. Six core operators — select σ (filter rows), project π (pick columns), join ⋈, union ∪, difference −, Cartesian product × — plus two practical extras, aggregate γ and distinct δ, are enough to express every SQL query you are likely to write. Each operator takes one or two relations and returns a relation, so operators compose into trees. The shape of that tree is the logical plan. On top of the logical plan, the optimiser applies rewrite rules — push predicates down past joins, reorder joins, merge consecutive projections — each of which is a pattern match on tree shape. Eventually the logical tree is replaced by a physical plan of the same shape, where each logical operator has been assigned a concrete algorithm (hash join vs nested-loop join, index scan vs full scan). This chapter builds the Python dataclasses for the algebra, shows how to translate a SELECT-FROM-WHERE AST into an algebra tree, and sets up the IR that the next three chapters' optimiser and executor will operate on.

The parser and the binder have done their work. Your database is holding a bound AST: a Select node whose columns are resolved ColumnRef(table_id=users, column_id=name) objects, whose table is a resolved TableRef(table_id=users), whose WHERE is a type-checked expression. Every name has been turned into an id. Every type has been verified. The tree is correct.

And you still cannot optimise it.

The reason is that the AST is a tree of SQL syntax, not a tree of operations on relations. Consider these two queries:

-- Query A
SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000

-- Query B
SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id WHERE o.total > 1000

They mean the same thing. They return the same rows. A smart optimiser should pick the same plan for both. But their ASTs are different shapes — Query A has a comma-list of tables and a two-part AND in the WHERE; Query B has a JOIN node with its own ON clause and a simpler WHERE. If your optimiser's rewrite rules match on AST shape, you are writing one rule for the comma form and a parallel rule for the JOIN form, and you are still missing the dozen other syntactic shapes that mean the same thing. This does not scale.

The fix is a change of representation. Throw the AST away. Build a new tree in a language where both queries produce the identical structure — where the only choices are which operators are used and in what order. That language is relational algebra. Codd introduced it in 1970 as the mathematical basis of the relational model; every real database uses some extended version of it as its intermediate representation.

The six core operators

A relation is a set (or bag) of tuples over a named schema. Relational algebra is a set of operators that each take one or two relations and return a relation. The closure property — operator output is itself a relation — is what makes operators composable into trees.

Six operators form the core. Every other operation you need can be built from them.

[Figure: six boxes, one per core operator — σ select (keep rows where p holds, σ_p(R)); π project (drop columns not in the list, π_{a,b}(R)); ⋈ join (paired rows where the condition holds, R ⋈_p S); ∪ union (rows in R or in S); − difference (rows in R not in S); × product (every pair of rows, R × S). Select, project, and join are highlighted as the three you see in almost every query; union, difference, and product appear in set operations, NOT EXISTS, and the theoretical derivation of join — join itself is defined as σ(R × S), a filter on top of a Cartesian product.]
The six core operators of relational algebra. Every SQL query you will write lowers into a tree built out of these six shapes, plus two practical extras (aggregate and distinct) that extended algebra adds on top.

Two practical additions that every real system bolts on top:

  1. Aggregate γ — group rows by a list of expressions and compute aggregates (COUNT, SUM, AVG, ...) per group; this is GROUP BY.
  2. Distinct δ — remove duplicate rows; this is SELECT DISTINCT.

Eight operators, each with a precise definition, each composable with every other. That is the whole language.
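
Before building any classes, you can feel the operators operationally. A minimal sketch over plain lists of dicts standing in for relations — nothing here is the chapter's real code, just the algebra made executable, including join defined literally as σ over ×:

```python
def select_(rows, pred):            # σ — keep rows where pred holds
    return [r for r in rows if pred(r)]

def project(rows, cols):            # π — keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

def cross(r, s):                    # × — every pair of rows, merged
    return [{**a, **b} for a in r for b in s]

def join(r, s, pred):               # ⋈ — literally σ(R × S)
    return select_(cross(r, s), pred)

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
orders = [{"user_id": 1, "total": 1500}, {"user_id": 2, "total": 10}]

# Each operator returns a relation, so calls nest — the closure property.
result = project(
    select_(join(users, orders, lambda r: r["id"] == r["user_id"]),
            lambda r: r["total"] > 1000),
    ["name"])
assert result == [{"name": "ada"}]
```

The nesting of calls is exactly the tree the rest of the chapter draws: π over σ over ⋈ over two leaf relations.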

Translating a SQL AST to an algebra tree

Here is the translation for SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000. The AST the binder hands over has a list of two tables, a conjunction of two predicates, and a single projected column. The algebra tree is built bottom-up.

[Figure: the statement SELECT u.name FROM users u, orders o WHERE u.id = o.user_id AND o.total > 1000 at the top, lowering into the tree below — π[u.name] at the root, over σ[o.total > 1000], over ⋈[u.id = o.user_id], whose children are scan(users) and scan(orders). The two FROM tables become leaf scans joined by the equality predicate from WHERE; the non-join predicate lifts into σ; the SELECT list becomes π at the root.]
Lowering a SELECT-FROM-WHERE statement to an algebra tree. Each SQL clause corresponds to a named node; the shape of the tree is what the optimiser rewrites in the next chapter.

The rules for lowering a SELECT-FROM-WHERE AST are mechanical:

  1. FROM clause → one Scan(table) leaf per referenced table.
  2. Multiple FROM tables → a Join with no predicate (equivalent to ×) combining the scans, left-deep by default: Join(Join(Scan(a), Scan(b)), Scan(c)).
  3. WHERE clause → split the predicate on AND. Every equality that mentions columns from two different scans becomes the join predicate and lifts into the nearest Join node. Everything else becomes a Filter (σ) above the joins.
  4. SELECT list → a single Project (π) node at the root, over whatever the joins-and-filters produced.
  5. GROUP BY → a Group (γ) node between the filter and the project.
  6. DISTINCT → a Distinct (δ) node above the project.
  7. ORDER BY, LIMIT → Sort and Limit nodes at the very top.

Note step 3: the optimiser's job begins here. Even the initial lowering pulls join-equality predicates out of the WHERE and attaches them to the Join. That is already a small rewrite — and it is why comma-FROM and JOIN-FROM land on identical trees. Both syntaxes produce the same Join(Scan, Scan, predicate) node. The comma-vs-JOIN distinction is a property of the AST; it does not survive into the algebra.
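
To make the claim concrete, here is a sketch with simplified stand-in dataclasses (not yet the chapter's real classes) building the tree both lowerings produce. Dataclasses compare structurally, so "identical trees" is a one-line assert:

```python
from dataclasses import dataclass

# Simplified stand-ins for the algebra classes built later in the chapter.
@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Project:
    input: object
    columns: tuple

# What lowering the comma-FROM query (Query A) produces:
plan_a = Project(
    Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
           "o.total > 1000"),
    ("u.name",))

# What lowering the JOIN ... ON query (Query B) produces — same
# constructors, same shape, because the join predicate ends up on the
# Join node either way:
plan_b = Project(
    Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
           "o.total > 1000"),
    ("u.name",))

assert plan_a == plan_b   # the syntactic difference did not survive lowering
```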

The Python class hierarchy

Here is the algebra, expressed as Python dataclasses. Each operator is a class with its inputs (the child plans) and its parameters (the predicate, the column list, the aggregate specs).

# query/algebra.py
from dataclasses import dataclass

# Expressions that appear inside operators — predicates, projected columns,
# aggregate arguments. Reuses the bound-AST expression tree from the binder.
from .bound import BoundExpr, ColumnRef, AggregateCall

@dataclass
class Scan:
    """A leaf: read all rows of a base table."""
    table_id: int
    alias: str
    # schema is derivable from the catalog; cached here for convenience.
    columns: list[ColumnRef]

@dataclass
class Filter:
    """σ — keep rows where predicate evaluates to true."""
    input: "Plan"
    predicate: BoundExpr

@dataclass
class Project:
    """π — keep only listed columns (or computed expressions)."""
    input: "Plan"
    exprs: list[BoundExpr]
    output_names: list[str]

@dataclass
class Join:
    """⋈ — combine two relations. predicate=None means Cartesian product."""
    left: "Plan"
    right: "Plan"
    predicate: BoundExpr | None = None
    kind: str = "inner"            # inner | left | right | full | semi | anti

@dataclass
class Union:
    """R ∪ S — schemas must match."""
    left: "Plan"
    right: "Plan"
    all: bool = False              # True for UNION ALL (no duplicate removal)

@dataclass
class Difference:
    """R − S — schemas must match."""
    left: "Plan"
    right: "Plan"

@dataclass
class Product:
    """R × S — Cartesian product, used only internally; the optimiser
    converts this to Join(predicate=...) whenever possible."""
    left: "Plan"
    right: "Plan"

@dataclass
class Group:
    """γ — GROUP BY with aggregates."""
    input: "Plan"
    group_by: list[BoundExpr]
    aggregates: list[AggregateCall]

@dataclass
class Distinct:
    """δ — remove duplicate rows."""
    input: "Plan"

@dataclass
class Sort:
    """ORDER BY."""
    input: "Plan"
    keys: list[tuple[BoundExpr, str]]   # (expr, "asc" | "desc")

@dataclass
class Limit:
    """LIMIT n [OFFSET m]."""
    input: "Plan"
    count: int
    offset: int = 0

# PEP 604 "|" syntax sidesteps the clash between typing.Union and our
# own Union operator class above.
Plan = (Scan | Filter | Project | Join | Union | Difference
        | Product | Group | Distinct | Sort | Limit)

Why every operator has an input (or left/right) field pointing to another Plan: this is the closure property in code. The type Plan is a sum of all the operator classes, and every operator's child is also a Plan. So trees can be arbitrarily nested — Project over Filter over Join over Scan — and the type system allows it.

Why Product exists at all when Join with predicate=None would suffice: it is a hint to the optimiser that a subsequent rule should attempt to find a predicate and convert this to a real join. Some systems collapse the two, some keep them distinct. Postgres's planner has both; SQLite has only one. For teaching, keeping them separate makes the lowering rules and the optimiser rules easier to write down, because you can say "never leave a Product in the final plan".
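
The rule that conversion step can use is tiny. A sketch with simplified stand-ins, where predicate analysis is reduced to a string hack — mentions and aliases here are toy versions of the real bound-expression helpers:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    alias: str

@dataclass
class Product:
    left: object
    right: object

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

def aliases(plan) -> set[str]:
    """Set of table aliases a subtree produces (toy version)."""
    if isinstance(plan, Scan):
        return {plan.alias}
    return aliases(plan.left) | aliases(plan.right)

def mentions(predicate: str) -> set[str]:
    """Aliases a predicate touches — parses 'u.id = o.user_id' style strings."""
    return {tok.split(".")[0]
            for tok in predicate.replace("=", " ").split() if "." in tok}

def product_to_join(node):
    """A Filter directly above a Product whose predicate spans both sides
    becomes a real Join; anything else is left alone."""
    if isinstance(node, Filter) and isinstance(node.input, Product):
        p = node.input
        if (mentions(node.predicate) & aliases(p.left)
                and mentions(node.predicate) & aliases(p.right)):
            return Join(p.left, p.right, node.predicate)
    return node

before = Filter(Product(Scan("u"), Scan("o")), "u.id = o.user_id")
after = product_to_join(before)
assert isinstance(after, Join)
```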

Why Scan carries the alias and a precomputed columns list: later passes (the optimiser, the executor) need to know the output schema of every operator in the tree without having to go back to the catalog. By the time an algebra tree is built, every node can answer "what columns do I produce, and in what order?" in O(1).
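
What "every node can answer its output schema" looks like as code — a sketch over simplified stand-ins; a real system would cache the answer on each node at build time rather than recompute it:

```python
from dataclasses import dataclass

@dataclass
class Scan:
    alias: str
    columns: list          # cached at build time from the catalog

@dataclass
class Filter:
    input: object
    predicate: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Project:
    input: object
    output_names: list

def output_columns(plan) -> list:
    """What columns does this node produce, and in what order?"""
    if isinstance(plan, Scan):
        return [f"{plan.alias}.{c}" for c in plan.columns]
    if isinstance(plan, Filter):
        return output_columns(plan.input)      # σ never changes the schema
    if isinstance(plan, Join):
        return output_columns(plan.left) + output_columns(plan.right)
    if isinstance(plan, Project):
        return list(plan.output_names)         # π defines its own schema
    raise TypeError(plan)

tree = Project(
    Filter(Join(Scan("u", ["id", "name"]), Scan("o", ["user_id", "total"]),
                "u.id = o.user_id"),
           "o.total > 1000"),
    ["name"])
assert output_columns(tree) == ["name"]
```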

Example: lowering a real query

Take the running example:

SELECT u.name, COUNT(*) AS order_count
FROM users u, orders o
WHERE u.id = o.user_id AND o.total > 1000
GROUP BY u.name

After the parser and the binder, the AST has:

  1. a FROM list of two bound tables: users (alias u) and orders (alias o);
  2. a WHERE that is an AND of two type-checked predicates: u.id = o.user_id and o.total > 1000;
  3. a GROUP BY list containing u.name;
  4. a SELECT list with two entries: u.name, and COUNT(*) aliased order_count.

The lowering function walks the AST and emits the algebra tree in reverse: start at the leaves, build up.

# query/lower.py
from .algebra import Plan, Scan, Filter, Project, Join, Group
from .bound import split_conjuncts, mentions, AggregateCall

# Two small helpers, defined elsewhere in the module:
#   _aliases(plan)  — the set of table aliases a subtree produces
#   _and_all(preds) — fold a non-empty list of predicates into one AND

def lower(bound_select) -> Plan:
    # 1. Leaves: one Scan per FROM table.
    plans = [Scan(t.table_id, t.alias, t.columns) for t in bound_select.from_tables]

    # 2. Reduce the list of scans to a single Join (left-deep), pulling
    #    join predicates out of WHERE as we go.
    conjuncts = split_conjuncts(bound_select.where) if bound_select.where else []
    join_preds, other_preds = [], []
    for c in conjuncts:
        refs = mentions(c)           # set of table aliases the predicate touches
        if len(refs) == 2:
            join_preds.append(c)
        else:
            other_preds.append(c)    # single-table or constant predicate

    tree = plans[0]
    for right in plans[1:]:
        # pick predicates that now connect both sides; the rest stay in the pool
        applicable = [p for p in join_preds if mentions(p) <= _aliases(tree) | _aliases(right)]
        for p in applicable: join_preds.remove(p)
        tree = Join(left=tree, right=right,
                    predicate=_and_all(applicable) if applicable else None)

    # 3. Remaining single-table predicates become one Filter above the join.
    if other_preds:
        tree = Filter(input=tree, predicate=_and_all(other_preds))

    # 4. GROUP BY becomes a Group node.
    if bound_select.group_by:
        aggs = [a for a in bound_select.select_list if isinstance(a, AggregateCall)]
        tree = Group(input=tree, group_by=bound_select.group_by, aggregates=aggs)

    # 5. The SELECT list becomes the outermost Project.
    tree = Project(input=tree,
                   exprs=[e.expr for e in bound_select.select_list],
                   output_names=[e.alias for e in bound_select.select_list])

    return tree

Run it on the example and you get this tree (shown in a cleaned-up repr form):

Project(
  exprs=[u.name, COUNT(*)],
  output_names=['name', 'order_count'],
  input=Group(
    group_by=[u.name],
    aggregates=[COUNT(*)],
    input=Filter(
      predicate=(o.total > 1000),
      input=Join(
        predicate=(u.id = o.user_id),
        left=Scan('users', alias='u', ...),
        right=Scan('orders', alias='o', ...),
      )
    )
  )
)

Six nodes, exactly one for each SQL clause. The shape is regular. Every SELECT-FROM-WHERE-GROUP BY query will produce a tree of this shape, modulo the number of joins, the specifics of the predicates, and whether Filter/Group are present or absent.

The lowering code is around 40 lines. It is mechanical in a way the parser was not — there are no precedence questions, no syntactic ambiguity, just a systematic walk from the AST fields to the algebra constructors.
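
The nested repr gets hard to read past a few levels. A generic pretty-printer is a dozen more lines — a sketch that walks any dataclass tree whose child fields are named input, left, or right, the convention the algebra classes above follow:

```python
from dataclasses import dataclass, fields, is_dataclass

# Stand-in nodes for the demo; the function itself is fully generic.
@dataclass
class Scan:
    table: str

@dataclass
class Join:
    left: object
    right: object
    predicate: str

@dataclass
class Filter:
    input: object
    predicate: str

CHILD_FIELDS = ("input", "left", "right")

def explain(plan, depth: int = 0) -> str:
    """One line per operator: class name plus its non-child parameters,
    with children indented beneath — the EXPLAIN form databases print."""
    params = {f.name: getattr(plan, f.name) for f in fields(plan)
              if f.name not in CHILD_FIELDS}
    line = "  " * depth + type(plan).__name__ + (f" {params}" if params else "")
    children = [getattr(plan, n) for n in CHILD_FIELDS
                if any(f.name == n for f in fields(plan))]
    return "\n".join([line] + [explain(c, depth + 1)
                               for c in children if is_dataclass(c)])

tree = Filter(Join(Scan("users"), Scan("orders"), "u.id = o.user_id"),
              "o.total > 1000")
print(explain(tree))
```

On the example tree this prints Filter at the root, Join indented one level, and the two Scans indented below the Join.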

Why algebra as the IR — rewrite rules on tree shape

Here is the payoff. Consider the rule predicate push-down: a filter on columns from a single table can be moved below a join that does not touch those columns, because filtering before joining produces the same answer but reads fewer rows. In algebra, the rule is:

\sigma_p(R \bowtie_q S) \;\equiv\; \sigma_p(R) \bowtie_q S \quad \text{when } p \text{ only mentions columns of } R

As Python code against the algebra dataclasses, that rule is five lines:

def push_down_filter(node: Plan) -> Plan:
    if isinstance(node, Filter) and isinstance(node.input, Join):
        j = node.input
        if _only_mentions(node.predicate, _schema_of(j.left)):
            return Join(left=Filter(j.left, node.predicate),
                        right=j.right, predicate=j.predicate, kind=j.kind)
    return node

That is the whole rule. The optimiser applies it repeatedly, walking the tree, until no more pushes are possible. Now imagine trying to write the equivalent rule against the AST: you would need one pattern for the comma-FROM shape, another for the JOIN-ON shape, logic to split the WHERE conjunction yourself, and you would still miss the dozen other syntactic variants that mean the same thing.

This is why everyone uses algebra as the IR. The operators are uniform, the tree is regular, and the rewrite rules become pattern matches on tree shape. Codd's 1970 algebra and Graefe's Cascades architecture of the 1990s are both built on this observation: a good IR makes transformations short.
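
"Applies it repeatedly, walking the tree" is itself only a few lines against dataclasses. A sketch, again with simplified stand-ins, using dataclasses.replace to rebuild nodes bottom-up and a fixpoint loop on structural equality:

```python
from dataclasses import dataclass, fields, is_dataclass, replace

@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    input: object
    predicate: str

def rewrite_bottom_up(plan, rule):
    """Rewrite children first, then give the rule a chance at this node."""
    updates = {f.name: rewrite_bottom_up(getattr(plan, f.name), rule)
               for f in fields(plan) if is_dataclass(getattr(plan, f.name))}
    return rule(replace(plan, **updates) if updates else plan)

def to_fixpoint(plan, rule):
    """Apply the rule pass after pass until the tree stops changing."""
    while True:
        new = rewrite_bottom_up(plan, rule)
        if new == plan:          # dataclass equality is structural
            return plan
        plan = new

# A toy rule: merge a Filter sitting directly above another Filter.
def merge_filters(node):
    if isinstance(node, Filter) and isinstance(node.input, Filter):
        inner = node.input
        return Filter(inner.input,
                      f"({inner.predicate}) AND ({node.predicate})")
    return node

tree = Filter(Filter(Scan("orders"), "total > 1000"), "status = 'paid'")
merged = to_fixpoint(tree, merge_filters)
assert isinstance(merged.input, Scan)   # the two filters became one
```

The same driver runs push_down_filter or any other rule; rules stay five lines because the walking is factored out.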

A few other rules that are one-liners on the algebra and nightmares on the AST:

  1. Merge consecutive projections: π_a(π_{a,b}(R)) ≡ π_a(R).
  2. Merge consecutive filters: σ_p(σ_q(R)) ≡ σ_{p AND q}(R).
  3. Join commutativity: R ⋈_p S ≡ S ⋈_p R.
  4. Join associativity: (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T) — the basis of join reordering.

Each of these is the basis of a standard optimiser rewrite, and each is about ten lines of Python against the algebra dataclasses.

Logical vs physical operators

So far every operator has been a logical operator: it says what relation is computed, not how. Join(left, right, predicate=...) does not say whether to use a hash join, a merge join, or a nested-loop join. Scan(table) does not say whether to use the primary index, a secondary index, or a full table scan. Those choices are physical.

The planner runs in two phases. First, logical planning: lower the AST into an algebra tree, apply rewrite rules to produce an equivalent algebra tree with better structure (predicates pushed down, joins reordered, projections merged). Then, physical planning: walk the logical tree and, at each node, pick a concrete algorithm — replacing the logical operator with a physical one of the same shape.

[Figure: two trees side by side, with an arrow between them labelled 'cost-based physical planning'. Left, the logical plan: π[u.name] over σ[o.total > 1000] over ⋈[id=user_id] over users and orders. Right, the physical plan with the same shape: ProjectExec over FilterExec over HashJoinExec over IndexScanExec and SeqScanExec. Each logical node becomes one of several possible physical nodes; the choice is driven by table size, indexes, and memory estimates.]
Logical and physical plans have the same tree shape. Physical planning is the act of assigning a concrete algorithm to each logical operator. The shape is fixed by logical optimisation; the algorithms are chosen by cost.

The two layers let you separate concerns:

  1. Logical planning cares only about equivalence: every rewrite must return the same rows, and no cost model is needed to prove a rule correct.
  2. Physical planning cares only about cost: every candidate algorithm computes the same relation, so the choice can change the speed but never the answer.

Physical operators have the same tree shape as logical operators — there is a HashJoinExec for every Join, a FilterExec for every Filter, an IndexScanExec or SeqScanExec for every Scan. They take the same children. What changes is the algorithm and, sometimes, the schema of what flows through (a HashJoinExec might build a hash table on the smaller side and probe with the larger; the order is a physical choice).

The next chapter in this build (Iterator-model execution) is about physical operators: each one implements open(), next(), close() and pulls rows from its children. The chapter after that (Rule-based rewrites) runs at the logical level: rules that match on Plan tree shape and produce new Plan trees. Both chapters stand on the algebra you just built.

Common confusions

  1. σ is called "select", but it corresponds to SQL's WHERE, not to SQL's SELECT. The SELECT list is π, project. The algebra's names predate SQL.
  2. A logical Join is not an algorithm. It names a relation to compute; hash join, merge join, and nested-loop join are physical choices made later.
  3. The algebra tree is not an execution-order contract. The optimiser is free to reshape it, because every rewrite rule preserves the relation the tree denotes.

Going deeper

If you wanted to know what IR sits between parser and executor, you have it. What follows is the bridge from the toy to real production optimisers — the Cascades framework, NULL semantics, and the distinction between set and bag algebra that trips up every first serious SQL implementer.

Graefe's Cascades framework

Goetz Graefe's Cascades (Graefe 1995) is the optimiser architecture used, in some form, by SQL Server, Greenplum, CockroachDB, Apache Calcite, Orca (Pivotal), and a dozen others. Its core insight is that optimisation should be a search over equivalent trees, with memoisation.

The architecture has three parts:

  1. A memo structure: every algebraically-equivalent subtree is represented by a single group. A group contains all the alternative expressions that compute the same relation. When the optimiser rewrites R ⋈ S to S ⋈ R, it does not replace the tree — it adds the new expression to the same group.

  2. Logical and physical rules: logical rules preserve equivalence and create new logical expressions in the same group (associativity, commutativity, predicate pushdown). Physical rules turn a logical expression into a physical implementation (Join → HashJoin, NestedLoopJoin, or MergeJoin).

  3. A top-down, cost-based search: the optimiser walks the memo, applying rules, and asks "what is the cheapest physical expression for this group?". Costs propagate from children to parents. Pruning cuts branches whose bound exceeds the best known.
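
The memo idea fits in a toy sketch — nothing like Cascades' real machinery, but it shows the one structural trick: a logical rule appends an alternative to the existing group instead of rewriting the tree. All names here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScanExpr:
    table: str

@dataclass(frozen=True)
class JoinExpr:
    left: int        # group id of the left input
    right: int       # group id of the right input

class Memo:
    def __init__(self):
        self.groups: list[list] = []   # group id -> alternative expressions

    def add_group(self, expr) -> int:
        self.groups.append([expr])
        return len(self.groups) - 1

    def add_alternative(self, group_id: int, expr) -> None:
        if expr not in self.groups[group_id]:
            self.groups[group_id].append(expr)

def commute_joins(memo: Memo) -> None:
    """Logical rule R ⋈ S ≡ S ⋈ R: adds the flipped expression to the
    SAME group rather than replacing anything."""
    for gid, exprs in enumerate(memo.groups):
        for e in list(exprs):
            if isinstance(e, JoinExpr):
                memo.add_alternative(gid, JoinExpr(e.right, e.left))

memo = Memo()
users = memo.add_group(ScanExpr("users"))
orders = memo.add_group(ScanExpr("orders"))
top = memo.add_group(JoinExpr(users, orders))
commute_joins(memo)
assert len(memo.groups[top]) == 2   # original and commuted, one group
```

The cost-based search then asks each group for its cheapest physical alternative; the groups keep the search space from exploding by sharing subtrees.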

Cascades is much more sophisticated than what this chapter builds — the toy optimiser in the next chapter uses rule-based, bottom-up rewrites with no memoisation and no cost model. That is a deliberate choice: rule-based optimisers are the baseline every undergraduate learns before moving to Cascades. See the references section for Graefe's original paper.

Bag semantics vs set semantics

Codd's 1970 algebra is over sets: no duplicates, R ∪ R = R, σ never produces duplicates that weren't already there. SQL is over bags (multisets): duplicates are kept unless the user says DISTINCT, UNION ALL is the default cheap union, COUNT(*) explicitly counts duplicates.

The difference matters for rewrite rules. Under set semantics:

\pi_a(R \cup S) \;\equiv\; \pi_a(R) \cup \pi_a(S)

Under bag semantics, this is still true if UNION is UNION ALL — but it is not true if the union is a set union (duplicate-eliminating), because projecting first might turn distinct rows into duplicates and change the count.
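
The failure is easy to demonstrate with collections.Counter standing in for a bag (a sketch; rows are tuples, and π here is SQL-style bag projection that keeps duplicates):

```python
from collections import Counter

# Relations as bags of (a, b) tuples.
R = Counter({("x", 1): 1, ("x", 2): 1})   # two rows, distinct only on column b
S = Counter({("y", 3): 1})

def project_a(rel):
    """π_a over a bag: keep column a, keep duplicates."""
    out = Counter()
    for (a, _b), n in rel.items():
        out[(a,)] += n
    return out

def bag_union(r, s):          # UNION ALL
    return r + s

def set_union(r, s):          # duplicate-eliminating UNION
    return Counter({t: 1 for t in (r + s)})

# Valid under UNION ALL: both sides give the same bag.
assert project_a(bag_union(R, S)) == bag_union(project_a(R), project_a(S))

# Invalid under set UNION: projecting first creates duplicate ('x',) rows
# that the subsequent union then collapses, changing the count.
lhs = project_a(set_union(R, S))
rhs = set_union(project_a(R), project_a(S))
assert lhs != rhs
```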

Every real optimiser has a set of rules that is valid under bag semantics and a (usually larger) set valid only under set semantics, and it has to know which it is applying. Garcia-Molina, Ullman, and Widom's Database Systems: The Complete Book works through this in detail (Chapter 5).

NULL and three-valued logic

SQL's NULL is infamous because it turns boolean predicates into three-valued logic: true, false, and unknown. NULL = NULL is unknown, not true. NULL = 5 is unknown. NULL AND false is false, but NULL AND true is unknown. This interacts with algebra rewrites in subtle ways.

Consider the rewrite σ_{p AND q}(R) ≡ σ_p(σ_q(R)). Under two-valued logic this is trivially true — a row satisfies p AND q iff it satisfies both. Under three-valued logic, σ discards rows where the predicate is not true (i.e., false OR unknown). The rewrite still holds, but the proof uses the fact that AND over three-valued logic is monotonic — a worked exercise in any serious database course. The optimiser needs to be aware that some rewrites change the NULL-handling behaviour and cannot be blindly applied.
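
Three-valued logic is small enough to model directly, with None standing in for unknown — a sketch that checks the σ_{p AND q} split really does survive 3VL:

```python
# Kleene three-valued AND: None stands for SQL's unknown.
def and3(p, q):
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def eq3(a, b):
    """SQL equality: any NULL operand makes the comparison unknown."""
    if a is None or b is None:
        return None
    return a == b

def select(rows, pred):
    """σ keeps only rows where the predicate is TRUE — rows evaluating
    to false AND rows evaluating to unknown are both discarded."""
    return [r for r in rows if pred(r) is True]

rows = [{"a": 1, "b": 1}, {"a": 1, "b": None}, {"a": None, "b": 2}]
p = lambda r: eq3(r["a"], 1)
q = lambda r: eq3(r["b"], r["b"])   # unknown when b is NULL — not true!

# σ_{p AND q}(R) == σ_p(σ_q(R)) even under three-valued logic:
combined = select(rows, lambda r: and3(p(r), q(r)))
split = select(select(rows, q), p)
assert combined == split == [{"a": 1, "b": 1}]
```

Note that q drops the row with b NULL even though "b = b" looks tautological — exactly the NULL = NULL trap the text describes.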

Postgres's src/backend/optimizer/ contains dozens of safety checks that amount to "do not apply this rewrite if a column in the predicate could be NULL".

Extended algebra: what real systems track

Production IRs add annotations to every operator that pure algebra does not have:

  1. Output schema — the names, types, and nullability of the columns the node produces.
  2. Estimated cardinality — how many rows the node is expected to emit, derived from table statistics.
  3. Estimated cost — CPU and I/O estimates used to choose among physical alternatives.
  4. Physical properties — sort order, partitioning, and distribution, which let the planner skip redundant sorts and shuffles.

These live as fields on each Plan node in mature systems. The next chapter's iterator model will add cardinality estimates to the physical plan.

Apache Calcite: an algebra IR you can read

For a modern, production-quality, open-source algebra IR, read Apache Calcite's RelNode class hierarchy. Calcite is the SQL planning layer used by Apache Flink, Apache Drill, Apache Hive (via Calcite integration), Dremio, and many others. Its algebra is exactly the extended form described above — Scan, Project, Filter, Join, Aggregate, Union, Minus, Sort, Values, Window, Correlate — with cost and cardinality annotations and a Cascades-style optimiser on top.

Reading Calcite's rule set (RelOptRule subclasses) is the fastest way to see what a hundred-plus rewrite rules look like in one place.

Where this leads next

You now have an IR — a small, regular tree of operators that represents any bound SQL query. The parser and binder feed into it; the optimiser and executor read out of it.

Relational algebra is the hinge between the meaning of a query (what rows to return) and its execution (how to compute them efficiently). By putting a formally-defined IR between the two halves of the engine, you give every later pass a simple job: take a tree, return a tree. That is the entire reason this chapter exists.

References

  1. Codd, A Relational Model of Data for Large Shared Data Banks, CACM 1970 — the founding paper. Introduces the relational model and the algebra (and calculus) that defines it.
  2. Graefe, The Cascades Framework for Query Optimization, IEEE Data Eng. Bulletin 1995 — the optimiser architecture used by SQL Server, CockroachDB, Calcite, and others. Explains memo, groups, logical/physical rules, and top-down search.
  3. Graefe, Volcano — An Extensible and Parallel Query Evaluation System, IEEE TKDE 1994 — introduces the iterator execution model that pairs with this chapter's algebra.
  4. Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — the System R paper that defined cost-based optimisation and join-order enumeration.
  5. Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book (2nd ed., 2008), Ch. 2 and 16 — textbook treatment of the algebra, including bag semantics and NULL handling.
  6. Apache Calcite, RelNode class hierarchy — a modern, production-quality open-source algebra IR with a full Cascades-style optimiser on top. The canonical example of what a real extended algebra looks like in code.