Writing a SQL Parser (PEG / Recursive Descent)

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A SQL query arrives at your database as a string — a flat sequence of characters that mean nothing until something gives them structure. Two passes turn the string into something the rest of the engine can use. The lexer walks the characters and groups them into tokens: the letters S-E-L-E-C-T become one SELECT token, the digits 1-8 become one number, the word name becomes an identifier. The parser walks the tokens and builds a tree — an abstract syntax tree, or AST — whose shape matches the grammar of SQL. This chapter writes both for a small SELECT/FROM/WHERE subset: a 40-line Python lexer, a PEG grammar you can read out loud, an 80-line recursive-descent parser, and a set of dataclasses that name every node in the tree. By the end, SELECT name FROM users WHERE age > 18 is a Python object with a .columns, a .table, and a .where field, and the next chapter — the binder — can start resolving those names against a catalog.

The query your database just received is this:

SELECT name FROM users WHERE age > 18

On the wire it is 37 bytes. In your process it is a Python str. None of those 37 bytes mean anything yet. The letter S next to the letter E is just two characters sitting next to each other. You, reading this, know that those six letters together are the SQL keyword SELECT — but the database does not know that, because the database is a program and the string is inert. Somebody has to write the code that looks at the string and says the first six characters are a keyword, then a space, then an identifier, then another keyword, then another identifier, then a keyword, then an identifier, then an operator, then a number. That code is the lexer. Then somebody else has to write the code that looks at that list of labelled tokens and says this is a SELECT statement whose column list is [name], whose source table is users, and whose WHERE clause is the expression age > 18. That code is the parser.

Parsing is the least glamorous part of a database. It is also the part that every query has to get through before anything interesting can happen. A missed closing parenthesis in the parser shows up to the user as a cryptic error message two layers later and a weekend of debugging. A grammar that is slightly ambiguous shows up as queries that parse differently in different versions. Postgres's parser is six thousand lines of generated C and another ten thousand of hand-written support code. SQLite's is a 50,000-line bison grammar. Neither of those is where you should start. You should start with a toy that fits on one screen, so that every design decision is visible.

This chapter builds that toy. The SQL it accepts is a small subset — SELECT col1, col2, ... FROM tbl WHERE expr with integer literals, string literals, identifiers, and the six comparison operators. No joins, no GROUP BY, no subqueries. That is deliberate. The point is to see the machinery end-to-end; the later chapters of this build add the features, one at a time, on top of the same skeleton.

The lexer — characters in, tokens out

Before the parser can look at a SELECT statement, the string has to be chopped into tokens. A token is a small tagged record with three pieces: what kind of thing it is (a keyword, an identifier, a number, an operator), the actual text that matched, and a position (so error messages can point at the right column). Tokens are the alphabet the parser reads; the lexer is the function that produces them.

The lexer's job is purely local. It does not care whether SELECT FROM SELECT makes sense — that is the parser's problem. It only cares whether each chunk of characters is shaped like a legal token. This separation is not a law of nature, but it is a convenience: you can debug the tokeniser by feeding it strings and printing the list, without building any more of the system.

The lexer is a pure function: a string goes in, a list of tokens comes out. Every parser in the rest of this book runs on top of a list that looks like this.

Here is the lexer. It is a single function with a loop and a switch on the current character. Nothing fancy — no regex engine, no table-driven state machine. Just read one character, decide what kind of token starts here, and consume until that token ends.

# sqlparser/lexer.py
from dataclasses import dataclass

KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "NOT"}

@dataclass
class Token:
    kind: str       # KEYWORD, IDENT, NUMBER, STRING, OP, COMMA, EOF
    value: str      # the actual text, e.g. "SELECT" or "users" or "18"
    col: int        # 1-based column where this token starts

def tokenize(src: str) -> list[Token]:
    tokens: list[Token] = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
            continue
        start = i
        if c.isalpha() or c == "_":                         # identifier or keyword
            while i < n and (src[i].isalnum() or src[i] == "_"):
                i += 1
            word = src[start:i]
            kind = "KEYWORD" if word.upper() in KEYWORDS else "IDENT"
            tokens.append(Token(kind, word.upper() if kind == "KEYWORD" else word, start + 1))
        elif c.isdigit():                                   # integer literal
            while i < n and src[i].isdigit():
                i += 1
            tokens.append(Token("NUMBER", src[start:i], start + 1))
        elif c == "'":                                      # string literal
            i += 1
            while i < n and src[i] != "'":
                i += 1
            if i >= n:
                raise SyntaxError(f"unterminated string starting at column {start + 1}")
            i += 1                                          # consume closing quote
            tokens.append(Token("STRING", src[start + 1:i - 1], start + 1))
        elif c == ",":
            tokens.append(Token("COMMA", ",", start + 1)); i += 1
        elif c in "<>!=":                                   # two-char ops: <=, >=, !=, ==
            if i + 1 < n and src[i + 1] == "=":
                tokens.append(Token("OP", src[i:i + 2], start + 1)); i += 2
            else:
                tokens.append(Token("OP", c, start + 1)); i += 1
        elif c in "+-*/":
            tokens.append(Token("OP", c, start + 1)); i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at column {start + 1}")
    tokens.append(Token("EOF", "", n + 1))
    return tokens

Why keywords are detected here and not later: SQL is case-insensitive for keywords (select, SELECT, Select all mean the same thing) but usually case-sensitive for identifiers. Doing the case-folding at lex time, on the exact set of reserved words, means the parser can do a simple equality check — tok.value == "SELECT" — and does not have to know which words are reserved. The reserved set is data the lexer owns.

Why column numbers and not line numbers: SQL queries are usually single-line, at least in the kind of REPL or API calls a learner first meets. A single column number is enough to produce a caret pointing at the offending character. For a multi-line query you would add a line field the same way; the extra state is two integers instead of one.

Run the lexer on the example from the top of the chapter:

>>> for t in tokenize("SELECT name FROM users WHERE age > 18"):
...     print(t)
Token(kind='KEYWORD', value='SELECT', col=1)
Token(kind='IDENT',   value='name',   col=8)
Token(kind='KEYWORD', value='FROM',   col=13)
Token(kind='IDENT',   value='users',  col=18)
Token(kind='KEYWORD', value='WHERE',  col=24)
Token(kind='IDENT',   value='age',    col=30)
Token(kind='OP',      value='>',      col=34)
Token(kind='NUMBER',  value='18',     col=36)
Token(kind='EOF',     value='',       col=38)

That list of nine tokens is what the parser will read. Everything downstream — the parser, the binder, the planner, the executor — can ignore raw characters from now on.

The grammar — PEG form

Before writing the parser, write down what it should accept. A grammar is the rule-set for how tokens combine into legal statements. There are several ways to write grammars; the one used here is PEG, short for Parsing Expression Grammar. A PEG rule looks like an equation — the left-hand side is a name, the right-hand side is a pattern built from other rules, token kinds, and a few combinators (/ for ordered choice, * for zero-or-more, ? for optional). PEG rules map one-to-one onto recursive-descent functions, which is why they are so useful for hand-written parsers.

Here is the grammar for the subset this chapter implements:

# A SELECT statement.
select_stmt   = "SELECT" column_list "FROM" ident where_clause? EOF

column_list   = ident ("," ident)*

where_clause  = "WHERE" expression

# Expressions with three precedence levels:
#   OR binds loosest, AND next, then comparisons.
expression    = or_expr
or_expr       = and_expr ("OR"  and_expr)*
and_expr      = cmp_expr ("AND" cmp_expr)*
cmp_expr      = primary (cmp_op primary)?
cmp_op        = "=" / "!=" / "<" / "<=" / ">" / ">="

primary       = NUMBER / STRING / ident
ident         = IDENT

Three things to notice about this grammar.

First, the / in cmp_op is ordered choice. PEGs try alternatives left to right and commit to the first one that matches. This removes a whole class of ambiguity that plagues textbook context-free grammars — there is never a question of which parse "wins". (The cost: some legal languages cannot be expressed as PEGs at all, and ordered choice can hide grammar bugs because later alternatives never fire. For a toy SQL this is fine.)

Second, the precedence of boolean operators is baked into the shape of the rules, not into a separate precedence table. or_expr calls and_expr, which calls cmp_expr, which calls primary. That means a AND b OR c parses as (a AND b) OR c without any extra machinery: by the time the or_expr level is looking for an OR, the AND has already been greedily consumed by the inner and_expr. This is the classic way recursive-descent parsers encode precedence, and it is why the grammar has to be written bottom-up even though the language reads top-down.

Third, the grammar is strictly smaller than real SQL. It has no * for "all columns", no table aliases, no JOIN, no aggregate functions, no parentheses in expressions, no AS. Each of those is a rule you will add in later chapters. The goal right now is to see a complete parser front-to-back; once the skeleton is clear, extensions are cheap.

The grammar as a tree. Each box is a PEG rule and a function in the parser. Reading top to bottom is how a query decomposes; the narrow chain at the bottom right is the precedence ladder for expressions.

The AST — dataclasses, one per node kind

The parser does not hand back a list of tokens. It hands back an abstract syntax tree — a nested Python object whose shape records the structure the parser recognised. Every rule in the grammar gets a dataclass, with fields for exactly the pieces that rule produces.

# sqlparser/ast.py
from dataclasses import dataclass, field
from typing import Union

@dataclass
class Ident:
    name: str

@dataclass
class Number:
    value: int

@dataclass
class String:
    value: str

@dataclass
class BinaryOp:
    op: str                 # "AND", "OR", "=", "!=", "<", "<=", ">", ">="
    left:  "Expr"
    right: "Expr"

Expr = Union[Ident, Number, String, BinaryOp]

@dataclass
class Select:
    columns: list[Ident]
    table:   Ident
    where:   Expr | None

Why dataclasses and not a generic "Node" with a dict of children: each rule has a fixed number of meaningful children with names. Select has a columns list, a table, and a where. Giving them named typed fields means a typo like sel.cloumns is caught immediately instead of returning None silently, and your IDE can autocomplete. It also means the AST prints readably — repr(Select(...)) shows every field — which is the single biggest debugging aid when a parser misbehaves.

Why one BinaryOp node for both logic (AND/OR) and comparisons (=/</> etc): the binder and the evaluator will treat these uniformly — both are "look up the operator, evaluate left, evaluate right, combine". Collapsing them now means the evaluator, written in a later chapter, has one switch statement instead of two. If you later need to distinguish them for type-checking, you can add a BoolOp subclass; for now, one shape is enough.

The whole AST for SELECT name FROM users WHERE age > 18 will be:

Select(
    columns=[Ident(name='name')],
    table=Ident(name='users'),
    where=BinaryOp(op='>', left=Ident(name='age'), right=Number(value=18)),
)

Three dataclass instances, a tiny list, and a nested BinaryOp. That is it. Everything else the database does — name resolution, logical planning, physical planning, execution — works on this tree.

The parser — recursive descent, one function per rule

Now the parser. The trick of recursive-descent is that each grammar rule becomes a method on a parser class, and the methods call each other exactly the way the rules refer to each other. A rule A = B C D becomes a method parse_A that calls parse_B, parse_C, parse_D in sequence. A rule A = B / C becomes a method that tries parse_B, and if that fails, tries parse_C. A rule with * becomes a while loop. A rule with ? becomes an if. That is the entire playbook.

# sqlparser/parser.py
from .lexer import Token, tokenize
from .ast   import Select, Ident, Number, String, BinaryOp, Expr

class Parser:
    def __init__(self, tokens: list[Token]):
        self.toks = tokens
        self.pos = 0

    # --- low-level helpers ----------------------------------------------
    def peek(self) -> Token:
        return self.toks[self.pos]

    def eat(self, kind: str, value: str | None = None) -> Token:
        t = self.peek()
        if t.kind != kind or (value is not None and t.value != value):
            want = f"{kind} {value!r}" if value else kind
            raise SyntaxError(
                f"expected {want} at column {t.col}, got {t.kind} {t.value!r}"
            )
        self.pos += 1
        return t

    def match(self, kind: str, value: str | None = None) -> bool:
        t = self.peek()
        return t.kind == kind and (value is None or t.value == value)

    # --- grammar rules, one per PEG rule --------------------------------
    def parse_select(self) -> Select:
        self.eat("KEYWORD", "SELECT")
        cols = self.parse_column_list()
        self.eat("KEYWORD", "FROM")
        tbl = Ident(self.eat("IDENT").value)
        where = None
        if self.match("KEYWORD", "WHERE"):
            self.eat("KEYWORD", "WHERE")
            where = self.parse_expression()
        self.eat("EOF")
        return Select(columns=cols, table=tbl, where=where)

    def parse_column_list(self) -> list[Ident]:
        cols = [Ident(self.eat("IDENT").value)]
        while self.match("COMMA"):
            self.eat("COMMA")
            cols.append(Ident(self.eat("IDENT").value))
        return cols

    def parse_expression(self) -> Expr:
        return self.parse_or()

    def parse_or(self) -> Expr:
        left = self.parse_and()
        while self.match("KEYWORD", "OR"):
            self.eat("KEYWORD", "OR")
            right = self.parse_and()
            left = BinaryOp("OR", left, right)
        return left

    def parse_and(self) -> Expr:
        left = self.parse_cmp()
        while self.match("KEYWORD", "AND"):
            self.eat("KEYWORD", "AND")
            right = self.parse_cmp()
            left = BinaryOp("AND", left, right)
        return left

    def parse_cmp(self) -> Expr:
        left = self.parse_primary()
        t = self.peek()
        if t.kind == "OP" and t.value in ("=", "!=", "<", "<=", ">", ">="):
            self.pos += 1
            right = self.parse_primary()
            return BinaryOp(t.value, left, right)
        return left

    def parse_primary(self) -> Expr:
        t = self.peek()
        if t.kind == "NUMBER":
            self.pos += 1; return Number(int(t.value))
        if t.kind == "STRING":
            self.pos += 1; return String(t.value)
        if t.kind == "IDENT":
            self.pos += 1; return Ident(t.value)
        raise SyntaxError(
            f"expected a column, number, or string at column {t.col}, "
            f"got {t.kind} {t.value!r}"
        )

def parse(src: str) -> Select:
    return Parser(tokenize(src)).parse_select()

Ten methods. Three helpers (peek, eat, match) that every other method sits on top of. Seven rule methods, each a one-for-one translation of a line of the PEG grammar. Under a hundred lines including imports, comments, and blanks.

The left-recursive build-up in parse_or and parse_and deserves a second look. The naive translation of or_expr = and_expr ("OR" and_expr)* into code is the while loop shown — but the way the AST is constructed matters. Each iteration wraps the accumulated left in a new BinaryOp, so that a OR b OR c becomes BinaryOp("OR", BinaryOp("OR", a, b), c) — left-associative, which is what you want. If you instead assigned left = BinaryOp("OR", left, parse_and()) and then recursed, you would get right-associative trees, which disagree with how SQL semantics group.

Why eat is what carries the error messages: it is the single place where the parser actually consumes a token. If the expected token is not there, the parser cannot proceed, and the column information in the token is exactly what the user needs to see. Centralising the error text in eat means every grammar rule gets column-accurate errors for free — you don't sprinkle error strings throughout the parser.

Error messages — point at the column, suggest the fix

A parser that fails with syntax error and nothing else is useless. A parser that points at the column where it got stuck and says what it was expecting is kind. The difference is almost entirely in the error-constructing code, not in the parsing logic. The eat method above already does most of this work: when it finds a token that is not what it wanted, it says exactly which kind it wanted, which token it got, and where that token started.

Here is what the parser does for three common mistakes:

>>> parse("SELECT name users")
SyntaxError: expected KEYWORD 'FROM' at column 13, got IDENT 'users'

>>> parse("SELECT FROM users")
SyntaxError: expected IDENT at column 8, got KEYWORD 'FROM'

>>> parse("SELECT name FROM users WHERE age >")
SyntaxError: expected a column, number, or string at column 35, got EOF ''

Three things that make these messages readable. First, the column number is accurate because the lexer stamped each token at the moment it was produced. Second, the message says what the parser expected, not just what it rejected — "expected FROM" is actionable, "unexpected users" is not. Third, the EOF token carries a position just past the end of the input, so "unexpected EOF at column 35" correctly blames the missing value after the > instead of blaming something earlier.

A fancier parser adds a suggestion. If eat wanted a FROM and the user typed FORM, the Levenshtein distance between the two is 1 — a single character swap — so the parser can guess the typo and print did you mean FROM?. The hook is at the same place:

def eat(self, kind, value=None):
    t = self.peek()
    if t.kind != kind or (value is not None and t.value != value):
        want = f"{kind} {value!r}" if value else kind
        hint = ""
        if value and t.kind == "IDENT" and _similar(t.value, value):
            hint = f" (did you mean {value!r}?)"
        raise SyntaxError(f"expected {want} at column {t.col}, got {t.kind} {t.value!r}{hint}")
    self.pos += 1
    return t

def _similar(a: str, b: str) -> bool:
    # cheap Levenshtein-distance-1 check for same-length strings
    if abs(len(a) - len(b)) > 1: return False
    a, b = a.upper(), b.upper()
    diff = sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))
    return diff <= 1

Postgres, for comparison, does a much more elaborate version of this, searching its keyword table for near-matches and ranking them by edit distance; it can produce messages like ERROR: syntax error at or near "selct". HINT: did you mean "SELECT"?. The idea is the same — the mechanism is a bigger dictionary and a smarter scorer.

Parsing `SELECT name FROM users WHERE age > 18`

Combine the lexer, the grammar, and the parser. Feed the chapter's opening query through the whole pipeline.

>>> from sqlparser.parser import parse
>>> tree = parse("SELECT name FROM users WHERE age > 18")
>>> tree
Select(columns=[Ident(name='name')],
       table=Ident(name='users'),
       where=BinaryOp(op='>',
                      left=Ident(name='age'),
                      right=Number(value=18)))

Walk it step by step. The lexer produces nine tokens (the eight visible ones plus EOF). parse_select eats SELECT, then calls parse_column_list, which eats the identifier name and sees no comma, so returns [Ident('name')]. parse_select eats FROM, then the identifier users. It peeks: the next token is the keyword WHERE, so it eats that and calls parse_expression.

parse_expression calls parse_or, which calls parse_and, which calls parse_cmp. parse_cmp calls parse_primary, which sees the identifier age and returns Ident('age'). Back in parse_cmp, it peeks: the next token is the operator >. That matches the comparison op list, so parse_cmp consumes it, calls parse_primary again, which sees the number 18 and returns Number(18). parse_cmp builds BinaryOp('>', Ident('age'), Number(18)).

That result bubbles back up: parse_and sees no AND, returns as-is. parse_or sees no OR, returns as-is. parse_expression returns the BinaryOp. parse_select eats the EOF and builds the Select node shown above.

The AST for the example query. The indentation in repr() and the boxes above are two ways to look at the same Python object graph.

Feed it a bad query:

>>> parse("SELECT name FRMO users")
SyntaxError: expected KEYWORD 'FROM' at column 13, got IDENT 'FRMO' (did you mean 'FROM'?)

One mistake, one line, one pointer to the bad column, one guess at the fix. That is the parser done.

Common confusions

"What about JOIN, GROUP BY, ORDER BY, HAVING, subqueries, window functions?" All of those are grammar extensions on top of this skeleton. JOIN adds a new rule between FROM and WHERE; GROUP BY and ORDER BY add optional trailing rules after WHERE; subqueries make primary include "(" select_stmt ")". Each is a 5-to-30 line addition to parser.py and one more dataclass in ast.py. The later chapters of this build add them one at a time. Start with the subset that fits on one screen, or you will spend the first week chasing grammar conflicts instead of seeing the pipeline.
"Isn't full SQL ambiguous?" Yes, notoriously. The string SELECT a FROM b AS c WHERE c.x = 1 has to decide whether AS binds to the table or to some expression; nested subqueries interact with correlation; the grammar of EXPLAIN is different from the grammar of what it explains. Every real SQL parser has a stack of hand-written disambiguation rules layered on top of its generator. PEG hides some of this by committing to the first alternative; LALR parsers expose it as shift-reduce conflicts you have to resolve. Neither is pretty. For a learning parser, the thing to internalise is that the ambiguity is in the language, not in your tools.
"Why not use a regex?" You can lex SQL with regexes — a single re.finditer with a big alternation produces a token stream — and for the lexer that is perfectly fine. You cannot parse SQL with a regex, because SQL is not a regular language: SELECT (SELECT (SELECT ...)) can nest to any depth, and no finite-state machine can count open parens. The moment you want to validate matching parentheses, let alone build a tree, you need at least a context-free grammar and a parser to match. Regexes stop at tokens; parsers take over from there.
"Why write this by hand instead of using a generator like ANTLR or lark?" For a real production database you probably would use a generator, or a hand-tuned LALR. But the generators emit code that is much harder to read than the code above, and their error messages are harder to customise. Writing a small recursive-descent parser yourself is the cheapest way to understand what the generator is doing under the hood. Once the subset works, porting to a generator is mechanical.
"My parser works on the happy path but dies weirdly on bad input." Almost always, the fix is to route the failure through eat instead of letting it propagate as an IndexError or AttributeError. If you index into self.toks[self.pos] and there is no token there, you get an IndexError with no context; if you go through peek() and eat(), which always end with an explicit EOF token, you get a clean SyntaxError with a column. The discipline is: never read self.toks directly, always through the two helpers.
"Recursive descent can blow the stack on deep expressions." True. a AND a AND a AND ... five thousand times deep will overflow Python's 1000-frame default stack. For real SQL, you either raise the stack limit, convert the recursion into iteration (the while loop in parse_or already does this for chains of the same operator), or use a Pratt parser for expressions, which handles precedence iteratively. The parser above is iterative at each level, so the stack depth is bounded by the depth of nesting of different levels, not the length of the chain. That is usually fine.

Going deeper

If you wanted to know how a SQL parser turns a string into a tree, you have it. What follows is the map of the larger parsing landscape — tools, historical techniques, and the exact shape of Postgres's parser — for readers who want to connect this toy to the real world.

Parser generators: ANTLR, lark, bison

Most real databases do not write their parsers by hand. They write a grammar file and feed it to a parser generator, which emits the parser code. The tool matters more than the grammar style in practice.

ANTLR (ANother Tool for Language Recognition) generates parsers in Java, Python, Go, C#, and a dozen other targets. It uses a grammar style close to PEG and produces a parse tree you convert into your own AST. Presto, Trino, and Spark SQL all use ANTLR grammars. The grammars are publicly readable — Trino's SqlBase.g4 is a thousand lines of rules that cover all of Trino SQL, and reading it is the fastest way to see what "real" SQL parsing looks like.
lark is a pure-Python parser generator that supports EARLEY and LALR(1) grammars. It is fast enough for toy databases and documentation tools, and the grammar syntax is close to PEG's. If you wanted to grow the parser in this chapter into something that handles 80% of SQL without leaving Python, lark is the natural next step.
bison / yacc generate LALR(1) parsers in C. This is what Postgres, MySQL, and SQLite all use for their production parsers. LALR parsers are fast and give good error messages, but they have shift-reduce conflicts — grammar ambiguities the generator forces you to resolve by adding precedence declarations or rewriting rules. The gram.y file in Postgres's source tree is about 19,000 lines; it is worth skimming even if you never write one, because it is the definitive answer to "what does SQL's grammar actually look like?"

Pratt parsing and operator-precedence parsing

The recursive-descent approach to precedence (one rule per level: or_expr → and_expr → cmp_expr → primary) works but does not scale well when a language has a dozen operator levels with subtleties (left-associative vs right-associative, unary vs binary, prefix vs postfix). A Pratt parser, also called a top-down operator-precedence parser, puts the precedence in a table keyed by operator, and uses a single recursive function that consults the table.

The structure is:

def parse_expression(min_prec: int = 0) -> Expr:
    left = parse_primary()
    while True:
        op = peek_operator()
        if op is None or precedence_of(op) < min_prec: break
        consume(op)
        right = parse_expression(precedence_of(op) + (0 if is_left_assoc(op) else 1))
        left = BinaryOp(op, left, right)
    return left

This is the technique Postgres uses for SQL expressions, the one TypeScript uses in its tsc parser, and the one every serious hand-written expression parser converges on. For the tiny grammar in this chapter the recursive-descent version is clearer; for anything bigger, Pratt wins.

Full SQL: the SQL:1999 standard and why everyone deviates

The SQL standard as of SQL:1999 runs to several thousand pages. Parts of it (subquery semantics, CTE definitions, OLAP functions) are intricate enough that no database implements them all exactly the same way. That is why SQL queries are famously non-portable: the parser for Postgres accepts things the parser for MySQL rejects, and vice versa, for reasons rooted in how each team interpreted the grammar twenty years ago. When you write a new SQL database, you pick a dialect to conform to — usually Postgres's, as the de facto reference — and accept that some of your users will file bugs when they hit the differences.

The takeaway: grammar is data, and which data is a political choice. The parser in this chapter is small enough to avoid the dialect problem entirely, but as you grow it, keep a note of which syntactic corners you are deciding.

How Postgres's parser is actually structured

If you ever want to read a mature SQL parser, the path through Postgres's source is:

src/backend/parser/scan.l — the flex specification for the lexer. About 1,500 lines.
src/backend/parser/gram.y — the bison grammar. About 19,000 lines. Every SQL statement type is a rule here.
src/include/nodes/parsenodes.h — the Raw Parse Tree node types. Equivalent to ast.py above but in C and about a hundred times larger.
src/backend/parser/analyze.c — the parse analysis phase, which is Postgres's name for what this chapter's next door neighbour, the binder, does.

Reading gram.y top-to-bottom is one of the great advanced-student exercises in databases. Every feature SQL has ever added — CTEs, LATERAL joins, MERGE, JSONB operators — is a diff against that file.

PEG vs LALR vs Earley — which to pick

PEG: hand-writable, unambiguous by construction (ordered choice), good for small languages with clear structure. Cost: infinite lookahead is easy to write and expensive to run, no good tooling for detecting grammar bugs.
LALR(1): fast, efficient, decades of tooling (yacc, bison). Cost: conflicts are notoriously hard to debug, the error messages of generated code are worse than hand-written ones without work.
Earley: can parse any context-free grammar, including ambiguous ones, and reports all valid parses. Cost: O(n^3) in the worst case, which is fine for a 200-character SQL query but painful for a 10,000-line file.

A fair rule: small toy parser, PEG or hand-recursive-descent. Production database, LALR with hand tuning. Natural-language-ish input, Earley.

Where this leads next

The parser now turns a string into a Select dataclass. Every name in that dataclass — users, name, age — is just a string; nothing in the parser ever checked whether users is a real table, or whether age is a column that actually exists, or what the type of 18 is relative to the column age. That is the next chapter.

The Binder: Resolving Names Against the Catalog — takes the AST this chapter produces and walks it alongside the catalog (the database's table of tables), attaching a resolved column id to every Ident, rejecting queries that refer to things that do not exist, and type-checking comparisons like age > 18 against the declared type of age.
Relational algebra as your IR — once names are bound, the AST is lowered into a small algebra of operators (scan, filter, project, join) that the optimiser can reason about.
A simple logical optimiser — rewrites the algebra tree using equivalence rules like predicate push-down.
Executing a plan against a heap table — finally, running the plan against real rows on disk.

Parsing is the most bounded piece of the query engine: the grammar is finite, the rules fit on a page, and every test case can be a single string. The chapters after this one add the first layer of the database's semantic knowledge on top — the catalog, the types, and the shapes that make a query not just syntactically legal but actually executable.

References

Ford, Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, POPL 2004 — the original paper defining PEG, ordered choice, and why they are a good fit for hand-written parsers.
Pratt, Top Down Operator Precedence, POPL 1973 — Vaughan Pratt's original paper, still the clearest explanation of how to handle precedence without a rule per level.
PostgreSQL source tree, gram.y — the bison grammar for Postgres SQL, canonical reference for what "real" SQL grammar looks like.
Parr, The Definitive ANTLR 4 Reference (2nd ed., Pragmatic Bookshelf, 2013) — the standard book on ANTLR, covering grammar design, tree construction, and error recovery.
Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools (2nd ed., 2006), Ch. 4 — the textbook treatment of LL, LR, and LALR parsing, for readers who want the formal theory.
Trino project, SqlBase.g4 — a readable, modern ANTLR grammar for a production SQL dialect, useful as a comparison with the toy grammar in this chapter.