In short
The parser accepts any string that matches the SQL grammar — it happily produced a tree for SELECT nmae FROM users WHERE age > 18, with nmae sitting there as a perfectly valid identifier. The parser never checked whether users is a real table, or whether nmae is a real column, or whether comparing age to the integer 18 makes type sense. Syntax is not semantics. The binder is the pass that does the semantic work. It walks the AST the parser produced, and for every name it encounters — every table reference, every column reference — it consults the catalog (the database's own tables of tables) and either attaches a resolved schema identity or raises a helpful error. After binding, the tree is still the same shape, but every Ident node now carries a table-id, a column-id, and a declared type. The binder also runs semantic checks the grammar cannot: column-count match in UNION, type compatibility in comparisons, ambiguous references when two joined tables both have a column called id, and the scope rules that govern correlated subqueries. Its output is a bound AST — the first representation of the query where everything in it is meaningful, not just well-shaped. The next pass lowers this bound AST into relational algebra.
Last chapter ended with a parser that accepted this string:
SELECT nmae FROM users WHERE age > 18
and produced this tree, without complaint:
Select(
    columns=[Ident(name='nmae')],
    table=Ident(name='users'),
    where=BinaryOp(op='>', left=Ident(name='age'), right=Number(value=18)),
)
Notice the typo: nmae. The parser did not notice. It was not supposed to notice. Its job ended at "is this string grammatical SQL?", and SELECT <identifier> FROM <identifier> WHERE ... is grammatical whether or not the identifiers refer to anything real. If you ran this tree through a planner now, the executor would eventually crash with something like KeyError: 'nmae' at a call site three layers away from where the mistake was made. Not a helpful error. A crash.
The binder exists to turn that crash into a sentence. It runs after the parser and before the planner, and it is the first pass in the pipeline that knows the database has schemas — that users is a real table, that nmae is not one of its columns, that age is an INTEGER and therefore comparable to 18. Its input is the raw AST. Its output is a bound AST: the same tree shape, but where every Ident that was just a string now carries an explicit reference to the catalog entry it resolved to — a table-id, a column-id, a declared type. Every check the grammar is too coarse to express — does this name exist? is this comparison type-safe? do these two arms of UNION have the same column count? — is enforced right here. The golden rule of a query engine is: catch semantic errors as close to the user as you can, and the binder is the last pass that is close to the user.
The catalog — metadata as tables
Before the binder can resolve anything, there has to be somewhere for it to look. That somewhere is the catalog: a set of tables whose rows describe the rest of the database's schema. A table about tables. A table about columns. A table about indexes. A table about types.
This is not a metaphor. Postgres literally stores its schema in regular SQL tables, under the schema pg_catalog:
SELECT relname FROM pg_class WHERE relkind = 'r'; -- all user tables
SELECT attname FROM pg_attribute WHERE attrelid = 16384 AND attnum > 0;
Because the catalog is itself tables — pg_class has a row for every table in the database, including a row for pg_class — every tool you reach for to inspect data (SELECT, JOIN, WHERE) works on the schema too. This is called the catalog being self-describing, and it is how every mature SQL database does it. For the toy binder below, two tables are enough: one listing tables (id + name), one listing columns (carrying table-id, column name, column-id, type). In Postgres these are pg_class and pg_attribute.
pg_class lists tables; pg_attribute lists columns and cross-references back to a table by oid. Name resolution is exactly two lookups against these tables, in that order. The Python version, stripped to the parts the binder needs:
# binder/catalog.py
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ColumnDef:
    table_id: int   # which table this column belongs to
    column_id: int  # 1-based position inside that table
    name: str       # "age"
    type: str       # "INT", "TEXT", "BOOL", ...

@dataclass
class TableDef:
    table_id: int
    name: str
    columns: list[ColumnDef] = field(default_factory=list)

class Catalog:
    """In-memory catalog. A real database stores this on disk as rows
    in pg_class and pg_attribute; the API is the same."""

    def __init__(self):
        self._tables_by_name: dict[str, TableDef] = {}
        self._next_table_id = 16384

    def create_table(self, name: str, columns: list[tuple[str, str]]) -> TableDef:
        if name in self._tables_by_name:
            raise ValueError(f"table {name!r} already exists")
        tid = self._next_table_id
        self._next_table_id += 1
        cols = [ColumnDef(tid, i + 1, cn, ct) for i, (cn, ct) in enumerate(columns)]
        t = TableDef(tid, name, cols)
        self._tables_by_name[name] = t
        return t

    def lookup_table(self, name: str) -> TableDef | None:
        return self._tables_by_name.get(name)

    def lookup_column(self, table: TableDef, col_name: str) -> ColumnDef | None:
        for c in table.columns:
            if c.name == col_name:
                return c
        return None
Why the table-id and column-id are integers and not strings: downstream passes compare and hash these constantly. Integer equality is a single CPU instruction, string equality is a loop. Postgres uses Oid (a 32-bit unsigned int) for the same reason. The name is kept around for diagnostics, but the identity is the integer. Column-id is 1-based to match Postgres's attnum convention (system columns get negative ids; user columns start at 1).
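One concrete payoff of integer identity can be shown in a few lines. A standalone sketch (ColumnDef is re-declared inline to match the dataclass above; the oids 16384 and 16385 are just example values):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnDef:
    table_id: int
    column_id: int
    name: str
    type: str

# Two columns that share a name are still distinct values,
# because equality is driven by the ids, not the string.
users_id  = ColumnDef(16384, 1, "id", "INT")
orders_id = ColumnDef(16385, 1, "id", "INT")

assert users_id != orders_id
assert len({users_id, orders_id}) == 2
# Value identity, not object identity: a re-built ColumnDef compares equal.
assert users_id == ColumnDef(16384, 1, "id", "INT")
```

Because the dataclass is frozen it is hashable, so later passes can use resolved columns as set members or dictionary keys without carrying table context around.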
The binder — an AST walk with a catalog in hand
The binder is a tree walker. It receives a raw AST and a catalog, visits each node in a well-defined order, and produces a new AST where every Ident is replaced with a BoundColumn or BoundTable that carries the resolved catalog identity. The nodes that are not names (literals, operators) pass through untouched, except that each expression node gets annotated with its resolved type — a small but critical piece of metadata the optimiser and executor both need.
Two new dataclasses carry the bound results:
# binder/bound_ast.py
from dataclasses import dataclass
from binder.catalog import TableDef, ColumnDef

@dataclass
class BoundTable:
    table: TableDef  # the catalog entry for this table reference

@dataclass
class BoundColumn:
    column: ColumnDef  # the catalog entry for this column reference
    type: str          # denormalised from column.type for easy access

@dataclass
class BoundBinaryOp:
    op: str
    left: "BoundExpr"
    right: "BoundExpr"
    type: str  # the type of the expression as a whole

@dataclass
class BoundLiteral:
    value: object
    type: str  # "INT", "TEXT", "BOOL"

@dataclass
class BoundSelect:
    output_columns: list[BoundColumn]
    source: BoundTable
    where: "BoundExpr | None"
The binder itself is a single class with one method per AST shape, plus a scope that keeps track of which tables and column-aliases are currently in scope. For a simple SELECT ... FROM one_table, the scope is trivially just that one table. When joins and subqueries come into the grammar, the scope grows — more on that below.
# binder/binder.py
from parser.ast import Select, Ident, Number, String, BinaryOp
from binder.bound_ast import (
    BoundSelect, BoundTable, BoundColumn, BoundBinaryOp, BoundLiteral,
)
from binder.catalog import Catalog, TableDef, ColumnDef

class Scope:
    """Maps unqualified names to the table(s) that could resolve them.
    For a single-table FROM, the scope has one entry."""

    def __init__(self, parent: "Scope | None" = None):
        self.parent = parent
        self.tables: list[TableDef] = []

    def add_table(self, t: TableDef):
        self.tables.append(t)

    def resolve_column(self, name: str) -> ColumnDef:
        matches: list[ColumnDef] = []
        for t in self.tables:
            for c in t.columns:
                if c.name == name:
                    matches.append(c)
        if len(matches) == 0:
            # try the enclosing scope (for correlated subqueries)
            if self.parent is not None:
                return self.parent.resolve_column(name)
            raise BinderError(f"column {name!r} does not exist", suggest_for=name, scope=self)
        if len(matches) > 1:
            qualified = ", ".join(f"{self._table_of(c).name}.{c.name}" for c in matches)
            raise BinderError(
                f"column reference {name!r} is ambiguous; could be {qualified}"
            )
        return matches[0]

    def _table_of(self, c: ColumnDef) -> TableDef:
        for t in self.tables:
            if c in t.columns:
                return t
        raise AssertionError("column escaped scope")

class BinderError(Exception):
    def __init__(self, message: str, suggest_for: str | None = None, scope: Scope | None = None):
        if suggest_for and scope:
            hint = _nearest(suggest_for, [c.name for t in scope.tables for c in t.columns])
            if hint:
                message += f"; did you mean {hint!r}?"
        super().__init__(message)

class Binder:
    def __init__(self, catalog: Catalog):
        self.catalog = catalog

    def bind_select(self, sel: Select, scope: Scope | None = None) -> BoundSelect:
        scope = Scope(parent=scope)
        # 1. resolve FROM first — columns in SELECT and WHERE need the scope to exist
        tbl = self.catalog.lookup_table(sel.table.name)
        if tbl is None:
            raise BinderError(f"table {sel.table.name!r} does not exist")
        scope.add_table(tbl)
        # 2. resolve each output column
        out = [BoundColumn(c, c.type)
               for c in (scope.resolve_column(i.name) for i in sel.columns)]
        # 3. resolve WHERE, if present
        where = self.bind_expr(sel.where, scope) if sel.where else None
        if where is not None and where.type != "BOOL":
            raise BinderError(f"WHERE clause must be BOOL, got {where.type}")
        return BoundSelect(out, BoundTable(tbl), where)

    def bind_expr(self, e, scope: Scope):
        if isinstance(e, Ident):
            c = scope.resolve_column(e.name)
            return BoundColumn(c, c.type)
        if isinstance(e, Number):
            return BoundLiteral(e.value, "INT")
        if isinstance(e, String):
            return BoundLiteral(e.value, "TEXT")
        if isinstance(e, BinaryOp):
            l = self.bind_expr(e.left, scope)
            r = self.bind_expr(e.right, scope)
            t = _check_binop(e.op, l.type, r.type)  # raises on mismatch
            return BoundBinaryOp(e.op, l, r, t)
        raise AssertionError(f"unknown expr {type(e).__name__}")
The binder is about 60 lines. Three helpers do the type-checking and near-match suggestion:
_NUMERIC = {"INT", "FLOAT"}

def _check_binop(op: str, a: str, b: str) -> str:
    if op in ("AND", "OR"):
        if a == b == "BOOL":
            return "BOOL"
        raise BinderError(f"{op} requires BOOL operands, got {a} and {b}")
    if op in ("=", "!=", "<", "<=", ">", ">="):
        if a == b:
            return "BOOL"
        if a in _NUMERIC and b in _NUMERIC:
            return "BOOL"
        raise BinderError(f"cannot compare {a} with {b}")
    raise AssertionError(f"unknown op {op}")

def _nearest(w: str, cands: list[str]) -> str | None:
    best, bestd = None, 10**9
    for c in cands:
        d = _edit(w.lower(), c.lower())
        if d < bestd:
            best, bestd = c, d
    return best if bestd <= 2 else None

def _edit(a: str, b: str) -> int:
    # Levenshtein; fine for column-name-length strings
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(cur[-1] + 1, prev[j] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
Why bind FROM before SELECT and WHERE: the scope has to exist before any column can be resolved. SELECT a FROM t WHERE a > 0 only makes sense once the binder knows t is a real table whose columns are in scope. This is why SQL is conceptually evaluated FROM first, WHERE next, SELECT last — the binder enforces that ordering on the AST walk, even though the surface syntax reads the opposite way.
The bound AST — what it looks like
Run the binder on the original good query. The input is:
Select(
    columns=[Ident('name')],
    table=Ident('users'),
    where=BinaryOp('>', Ident('age'), Number(18)),
)
The output is:
BoundSelect(
    output_columns=[
        BoundColumn(ColumnDef(table_id=16384, column_id=2, name='name', type='TEXT'), 'TEXT')
    ],
    source=BoundTable(TableDef(16384, 'users', [...])),
    where=BoundBinaryOp('>',
        left =BoundColumn(ColumnDef(16384, 3, 'age', 'INT'), 'INT'),
        right=BoundLiteral(18, 'INT'),
        type='BOOL'
    ),
)
Every name is gone. In its place, a catalog entry with a stable integer id and a declared type. The WHERE clause has a type (BOOL, because > of two INTs is a BOOL), and that type is what the binder's final check uses to reject queries like WHERE 'Dipti', whose where-clause is a string instead of a boolean.
Error messages the binder produces
The whole point of this pass is to catch mistakes early with readable messages. Here are the common ones, and the queries that trigger them.
Unknown table.
>>> bind("SELECT name FROM uusers WHERE age > 18")
BinderError: table 'uusers' does not exist
A near-match on table names is the natural extension; for a small catalog it is the same Levenshtein you use for columns.
Unknown column, with suggestion.
>>> bind("SELECT nmae FROM users WHERE age > 18")
BinderError: column 'nmae' does not exist; did you mean 'name'?
The suggestion logic lives in BinderError.__init__: when the column is not found and the scope is known, iterate every in-scope column name, compute edit distance, and if one is within 2 edits, print it as a hint. Postgres does a richer version with a confidence threshold, but this is the shape of the machinery.
Type mismatch.
>>> bind("SELECT name FROM users WHERE age > 'eighteen'")
BinderError: cannot compare INT with TEXT
>>> bind("SELECT name FROM users WHERE name")
BinderError: WHERE clause must be BOOL, got TEXT
The grammar was perfectly happy with both of these strings. Only the binder, which knows age is INT and name is TEXT, can reject them.
Ambiguous column (relevant once joins arrive — the binder is written to handle it now, even though the grammar in the previous chapter does not produce joins yet):
>>> bind("SELECT id FROM users JOIN orders ON ...")
BinderError: column reference 'id' is ambiguous; could be users.id, orders.id
Because both users and orders have an id column, the unqualified name is ambiguous. The fix is for the user to write users.id or orders.id. The binder's resolve_column counts matches across all tables in scope and rejects anything with more than one.
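The counting rule is small enough to demonstrate on its own. A stripped-down sketch over plain (table, columns) pairs, mirroring the match-counting in Scope.resolve_column (the helper names here are hypothetical scaffolding, not the chapter's classes):

```python
def resolve(name, tables):
    """tables: list of (table_name, [column_names]) pairs in scope."""
    matches = [(t, c) for t, cols in tables for c in cols if c == name]
    if not matches:
        raise LookupError(f"column {name!r} does not exist")
    if len(matches) > 1:
        # more than one in-scope table supplies this name: refuse to guess
        qualified = ", ".join(f"{t}.{c}" for t, c in matches)
        raise LookupError(f"column reference {name!r} is ambiguous; could be {qualified}")
    return matches[0]

scope = [("users", ["id", "name", "age"]), ("orders", ["id", "user_id", "total"])]

print(resolve("age", scope))  # ('users', 'age'): exactly one match
try:
    resolve("id", scope)      # both tables have an id column
except LookupError as e:
    print(e)
```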
Scope rules — why there is a tree of scopes
So far the binder has had a single flat list of in-scope tables. That works for the single-table queries the parser in chapter 41 produces. It does not work for correlated subqueries, which is where scope actually earns its name.
Consider this query:
SELECT name FROM users AS u
WHERE age > (SELECT AVG(age) FROM users WHERE city = u.city)
The inner SELECT mentions u.city. That reference cannot be satisfied by the inner FROM — the inner scope contains users, not u. Read on its own, the inner query would reject u.city as out of scope. But because the inner query is syntactically nested inside the outer query, the outer's scope encloses the inner's, and names not found in the inner scope fall through to the outer, where the alias u is in scope.
This is exactly the same rule that governs variable scope in any programming language: inner scopes see outer scopes, outer scopes do not see inner scopes. You have seen it in Python (functions see module globals, modules do not see function locals), in JavaScript (lexical scope with closures), in a hundred other places. SQL is no different. The implementation is: when you enter a subquery, push a new scope with the current one as its parent; when resolve_column misses, walk up the parent chain before failing.
The Scope class above already has this structure. Scope(parent=...) creates a child scope, and resolve_column falls through to self.parent.resolve_column(name) when the local scope does not have the name. A column that is found only in an outer scope is called a correlation — and the planner needs to know about it later, because correlated subqueries are executed once per outer row, unless the optimiser can decorrelate them.
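The fallthrough can be sketched in a few lines, with Scope reduced to a flat name map (simplified from the class above; the ambiguity check is elided here):

```python
class Scope:
    def __init__(self, parent=None):
        self.parent = parent
        self.columns = {}  # name -> (table, column), simplified

    def resolve(self, name):
        if name in self.columns:      # local scope wins over enclosing scopes
            return self.columns[name]
        if self.parent is not None:
            return self.parent.resolve(name)  # the correlation fallthrough
        raise LookupError(f"column {name!r} does not exist")

outer = Scope()
outer.columns = {"name": ("users", "name"), "city": ("users", "city")}
inner = Scope(parent=outer)  # entering the subquery pushes a child scope
inner.columns = {"age": ("users", "age")}

print(inner.resolve("age"))   # found locally
print(inner.resolve("city"))  # not local: falls through to the outer query, a correlation
```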
Table aliases fit into the same scope machinery. When you write SELECT u.name FROM users AS u, the binder puts users into the scope under the key u, not users, so the qualified reference u.name resolves and users.name does not (in that scope). A qualified u.name is two lookups: first look up u in the scope's alias map, then look up name in the resulting table.
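That two-step lookup, sketched with a plain alias map (the dict-based scaffolding here is illustrative, not the chapter's Scope class):

```python
catalog = {"users": ["id", "name", "age", "city"]}  # table -> columns, simplified

# FROM users AS u registers the table under its alias, not its real name
aliases = {"u": "users"}

def resolve_qualified(ref, aliases, catalog):
    alias, col = ref.split(".")
    if alias not in aliases:  # u.name resolves; users.name does not, in this scope
        raise LookupError(f"missing FROM-clause entry for table {alias!r}")
    table = aliases[alias]    # lookup 1: alias -> table
    if col not in catalog[table]:
        raise LookupError(f"column {col!r} of relation {alias!r} does not exist")
    return table, col         # lookup 2: column inside that table

print(resolve_qualified("u.name", aliases, catalog))
try:
    resolve_qualified("users.name", aliases, catalog)  # the alias shadows the real name
except LookupError as e:
    print(e)
```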
UNION and the column-count check
One last semantic check that only the binder can do: when the query contains UNION, the two arms must have the same number of columns and compatible types for each position. The parser does not check this — it will happily parse SELECT name, age FROM users UNION SELECT city FROM users, which is a one-column SELECT union-ed with a two-column SELECT. Only after binding each arm separately does the binder know how many output columns each produces and what their types are.
def bind_union(u: Union) -> BoundUnion:
    left = bind_select(u.left)
    right = bind_select(u.right)
    if len(left.output_columns) != len(right.output_columns):
        raise BinderError(
            f"UNION arms have different column counts: "
            f"{len(left.output_columns)} vs {len(right.output_columns)}"
        )
    out_types = []
    for i, (l, r) in enumerate(zip(left.output_columns, right.output_columns)):
        if l.type != r.type and not (l.type in _NUMERIC and r.type in _NUMERIC):
            raise BinderError(
                f"UNION column {i + 1} has incompatible types: {l.type} vs {r.type}"
            )
        out_types.append(l.type if l.type == r.type else "FLOAT")
    return BoundUnion(left, right, out_types)
Standard SQL additionally requires that UNION deduplicates its result — implicit, unless the user wrote UNION ALL. That is a planner concern, not a binder one; the binder's job is just to verify the shapes are compatible.
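The compatibility rule from bind_union, reduced to bare type lists so it can run without the rest of the pipeline:

```python
_NUMERIC = {"INT", "FLOAT"}

def union_types(left, right):
    """left/right: the output column types of each UNION arm."""
    if len(left) != len(right):
        raise ValueError(f"UNION arms have different column counts: "
                         f"{len(left)} vs {len(right)}")
    out = []
    for i, (l, r) in enumerate(zip(left, right)):
        if l != r and not (l in _NUMERIC and r in _NUMERIC):
            raise ValueError(f"UNION column {i + 1} has incompatible types: {l} vs {r}")
        out.append(l if l == r else "FLOAT")  # mixed INT/FLOAT widens to FLOAT
    return out

print(union_types(["TEXT", "INT"], ["TEXT", "FLOAT"]))  # ['TEXT', 'FLOAT']
try:
    # SELECT name, age FROM users UNION SELECT city FROM users
    union_types(["TEXT", "INT"], ["TEXT"])
except ValueError as e:
    print(e)
```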
Binding three queries, two of which fail
Set up a small catalog and run the binder on three queries — one well-formed, one with a typo, one with a type error. Watch what comes out.
>>> from binder.catalog import Catalog
>>> from binder.binder import Binder
>>> from parser.parser import parse
>>> cat = Catalog()
>>> cat.create_table("users", [("id", "INT"), ("name", "TEXT"),
... ("age", "INT"), ("city", "TEXT")])
>>> b = Binder(cat)
>>> # 1. A clean query — binder succeeds, every name resolved.
>>> tree = b.bind_select(parse("SELECT name FROM users WHERE age > 18"))
>>> tree.where.type
'BOOL'
>>> tree.where.left.column.name, tree.where.left.type
('age', 'INT')
>>> # 2. The typo from the top of the chapter.
>>> b.bind_select(parse("SELECT nmae FROM users WHERE age > 18"))
BinderError: column 'nmae' does not exist; did you mean 'name'?
>>> # 3. A type error the grammar cannot catch.
>>> b.bind_select(parse("SELECT name FROM users WHERE age > 'eighteen'"))
BinderError: cannot compare INT with TEXT
All three of the input strings are grammatical SQL. The parser accepted all three. The binder accepted the first and rejected the other two — each with a sentence that points the user at what was wrong and, for the typo, suggests what was probably meant.
Before the binder, each of these errors would have shown up as a crash deep inside the executor, at a stack frame with no connection to the user's query text. After the binder, they show up as a single line at the top of the stack: column does not exist, cannot compare. That is the difference between a system you can debug and a system you cannot.
Common confusions
- "Why is the binder a separate pass and not folded into the parser?" The parser's job is local — tokens and grammar rules, nothing else. The binder's job is global — it needs the catalog, which for a real database is on disk, possibly behind locks. Mixing them would mean the parser cannot be tested without a catalog, and the binder cannot be written as a clean tree walk.
- "Does the binder need the catalog's data, or just the schema?" Just the schema. The binder never touches a row. Postgres caches its catalog heavily — a cold-cache query can take 10x longer than a warm one — because every connection does thousands of catalog lookups per query.
- "If the parser already produced a tree, why does the binder produce another instead of mutating?" Immutability downstream — the planner can hold references to bound nodes safely — and the raw AST stays useful for error reporting, which often quotes the user's original query text.
- "What if the catalog changes while the query is being bound?" Real databases take an MVCC snapshot of the catalog at the start of query planning. If another transaction concurrently drops a table, the current query still succeeds using the old schema. Postgres's catalog tables participate in MVCC like any other table for exactly this reason.
- "Why does ambiguous column resolution require a scope-walk, not a table-by-table check?" Because SQL's rule is that an unqualified name is ambiguous if it exists in more than one table in the same scope. A subquery with the same column name as its outer query is not ambiguous — the subquery's scope wins. The parent pointer in Scope is what tracks local-vs-enclosing.
Going deeper
If you understand that the binder turns names into catalog ids and checks types, you have the core. What follows is how two real systems differ, and the one extra pass Postgres wedges between binder and planner.
Postgres: parse → analyze → rewrite → plan
Postgres does not call its binder a binder — it calls it parse analysis, and the code lives in src/backend/parser/analyze.c. The output is a Query node, defined in parsenodes.h. The transformation from RawStmt (parser output) to Query (analyzer output) is exactly this chapter's work, at a much larger scale — the analyzer handles views, CTEs, inheritance, LATERAL joins, window functions, and more.
Between analysis and planning, Postgres runs a rewrite phase in rewriteHandler.c: expanding views (replacing a view reference with the query tree it was defined from), applying row-level-security policies, and processing CREATE RULE rules. Rewrite runs after binding because it needs resolved names; before planning because it changes tree shape. The pipeline is raw_parser → parse_analyze → pg_rewrite_query → pg_plan_queries — four passes, each with one clear job. This chapter compresses the middle two.
InnoDB/MySQL: resolution inside the optimizer
MySQL does not have a sharp binder/planner split. Its resolver runs in tight interleaving with the optimizer, inside sql/sql_resolver.cc: name resolution, view expansion, and early optimiser rewrites all happen together on the SELECT_LEX tree. InnoDB is then invoked with fully-resolved table and column identifiers. The practical effect was historically worse diagnostics on typos, because the user's original names were harder to recover from a half-optimised tree; MySQL 8.0+ has improved this considerably, but the Postgres architectural split is still cleaner. For a teaching project, copy Postgres.
Type coercion — the binder's other job
Real SQL does not require both sides of a comparison to have identical types. WHERE age > 18.0 compares INT to FLOAT, and the standard says one side is coerced. The binder inserts the coercion explicitly, rewriting BinaryOp('>', IntCol, FloatLit) into BinaryOp('>', Cast(IntCol, FLOAT), FloatLit) — implicit cast insertion. Postgres's policy is intermediate: numeric types coerce freely up a precision hierarchy (INT2 → INT4 → INT8 → NUMERIC → FLOAT4 → FLOAT8), but cross-family coercions (number to text, text to date) require explicit casts. A table of allowed coercions lives in the binder; anything not in the table is a type error.
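A minimal sketch of such a table, assuming a rank-based numeric hierarchy; the ranks, the function name, and the error wording are illustrative, not Postgres's actual catalog machinery:

```python
# Position in the numeric precision hierarchy; implicit coercions go rightward only.
_NUMERIC_RANK = {"INT2": 0, "INT4": 1, "INT8": 2, "NUMERIC": 3, "FLOAT4": 4, "FLOAT8": 5}

def coerce_pair(a, b):
    """Return (a_type, b_type, common) after implicit coercion,
    or raise if the pair would need an explicit cast."""
    if a == b:
        return a, b, a
    if a in _NUMERIC_RANK and b in _NUMERIC_RANK:
        # the lower-ranked side is where the binder would insert a Cast node
        common = a if _NUMERIC_RANK[a] > _NUMERIC_RANK[b] else b
        return common, common, common
    raise TypeError(f"cannot implicitly coerce {a} to {b}; add an explicit cast")

print(coerce_pair("INT4", "FLOAT8"))  # INT side is cast up to FLOAT8
try:
    coerce_pair("INT4", "TEXT")       # cross-family: number to text is an error
except TypeError as e:
    print(e)
```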
Where this leads next
The bound AST is a query where every name means something and every expression has a type. What it is not yet is a plan. It still says SELECT these columns FROM this table WHERE this predicate — a declarative description of the answer. Before the executor can run anything, this declarative description has to be turned into an explicit tree of operators: scan this table, filter these rows, project these columns. That lowered form is called relational algebra, and it is the next chapter's subject.
- Relational Algebra as Your Intermediate Representation — takes the bound AST this chapter produces and lowers it into a tree of algebraic operators. This is the representation the optimiser rewrites and the executor walks.
- A simple logical optimiser — applies rewrite rules like predicate push-down and projection pruning to the algebra tree.
- Iterator-model (Volcano) execution — finally, the executor that runs the algebra tree against the buffer pool you built in Build 5.
The binder is the first of three passes that make SQL actually run. It is the one that knows names, the planner is the one that knows algebra, and the executor is the one that knows bytes. Three layers, each trusting the previous, each narrower than what came before. By the end of this build, you will have all three, and a SELECT will go from string to answer with nothing hand-coded.
References
- PostgreSQL source tree, analyze.c — the parse-analysis (binder) phase of Postgres. Reading the entry point parse_analyze_fixedparams and following it down is the canonical tour of how a production SQL binder is structured.
- PostgreSQL documentation, System Catalogs — the reference for pg_class, pg_attribute, and the other catalog tables. Short, concrete, and essential if you ever want to inspect a live Postgres schema.
- Codd, A Relational Model of Data for Large Shared Data Banks, CACM 13(6), 1970 — the paper. Readable in an afternoon; the origin of the "logical schema is separate from physical storage" idea that makes binders a distinct pass in the first place.
- Date, An Introduction to Database Systems (8th ed., Addison-Wesley, 2003), Ch. 3, 8 — the classical textbook treatment of the relational model and integrity constraints, with careful attention to scope and name-resolution rules in nested queries.
- Graefe, The Cascades Framework for Query Optimization, IEEE Data Eng. Bull. 18(3), 1995 — the canonical reference for what happens to the bound AST after binding, and why a clean binder output matters so much for the optimiser that consumes it.
- MySQL source tree, sql_resolver.cc — MySQL's equivalent of Postgres's analyze.c, useful as a contrast in architectural style: resolution interleaved with early optimisation rather than run as a clean separate pass.