In short
For five builds the interface to your database has been two functions: put(k, v) and get(k). The caller decides the key, the caller decides the access path, and the database obeys. Build 6 flips the contract. The caller writes a sentence — SELECT name FROM users WHERE age > 18 — and the database is responsible for deciding which index to use, what order to scan, what to join, what to skip, when to stop. That flip is only possible because of a specific data model underneath: Codd's relational model, where data lives in relations (unordered sets of tuples with named attributes), the model is purely logical (no row numbers, no file offsets, no access paths baked in), and the separation between the logical schema and the physical storage is enforced by construction. The user talks to the relational layer; the storage engine does whatever it wants underneath, and is free to change its mind — add an index, switch from heap scan to B-tree seek, reorder a join — without the user's query being invalidated. The query language (SQL) is the surface. The data model is the thing that makes the surface trustworthy. This chapter is about the model; the parser starts next chapter.
You type three words into a terminal.
SELECT name FROM users WHERE age > 18;
Hit enter. A few milliseconds later, 14,203 rows stream back. Think carefully about what happened. You did not tell the database which file to open. You did not tell it which index to use, or whether to use an index at all. You did not tell it what order to scan the rows in, or where to stop early, or whether to parallelise across cores. You did not even tell it what the word users physically means — on disk, users might be a B-tree, a heap table, an LSM-tree, a compressed columnar file, a federated view over three remote machines, or a materialised join of four underlying tables. You did not care, and the database did not ask you to care.
For forty years, computing has generally run on the opposite contract. When you open("users.csv"), you are telling the kernel exactly which bytes to read. When you call list.append(x), you are telling Python the exact memory operation. When you redis-cli GET user:42, you are telling Redis the exact key. You describe the how, and the machine executes the how you described. This is the natural way computers work — it is the way your hardware, your syscalls, your networking stack, and your Build 1-through-5 database have all worked. It is called imperative.
The three words above are not imperative. They are declarative: a description of the answer you want, with no commitment to the procedure that produces it. And for a database to honour that contract — for a database to accept "the answer I want is every user over eighteen" and be free to figure out how — something quite specific has to be true about how the data is modelled underneath.
That something was named in 1970 by Edgar F. Codd, an IBM researcher who spent the decade arguing, against enormous institutional resistance, that every production database of his era was making a category error: it was exposing its physical layout as the programming interface. Codd proposed instead a purely logical data model — one where tables are mathematical sets, rows are mathematical tuples, and the user's query talks about the sets and tuples and has no ability to name a pointer, a page, a file, or a scan order. That model — the relational model — is the foundation under every SQL database in the world, under Postgres and SQLite and Oracle and BigQuery and DuckDB, under the Django ORM and the Rails ActiveRecord and the SQLAlchemy you have probably met by now. It is the reason the three-word query above is even a thing you are allowed to type.
This is chapter 40 of the track and chapter 1 of Build 6. Build 6 is where your storage engine becomes a database in the sense a user would recognise the word — where a human types SQL at a prompt, a parser turns it into a tree, a planner turns the tree into relational algebra, an optimiser rewrites the algebra, and an execution engine runs a pipeline of operators across the buffer pool you built in Build 5. Every one of those layers depends on the data model beneath them being relational. So before the parser, before the planner, before a single line of SQL, the data model comes first.
Two interfaces, side by side
Here is the Build 5 KV store you already have, and the Build 6 SQL interface you are about to build, answering the same logical question.
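The comparison the next paragraphs walk through can be sketched as runnable code. The KV class and its put/scan method names are assumptions reconstructed from the prose (your Build 5 API may differ), and SQLite, from Python's standard library, stands in for the Build 6 engine you have not built yet:

```python
import json
import sqlite3

# Left side: the imperative Build 5 contract. The KV class is a stand-in;
# the put/scan names are assumptions based on this chapter's description.
class KV:
    def __init__(self):
        self.data = {}
    def put(self, k, v):
        self.data[k] = v
    def scan(self):
        return iter(self.data.items())

db = KV()
db.put(b"user:1", json.dumps({"name": "Dipti", "age": 17}).encode())
db.put(b"user:2", json.dumps({"name": "Arun", "age": 34}).encode())

adults_kv = []
for k, v in db.scan():              # how: full scan, chosen by the caller
    if k.startswith(b"user:"):      # how: key-prefix convention
        row = json.loads(v)         # how: per-row decode
        if row["age"] > 18:         # how: filter after decode
            adults_kv.append(row["name"])

# Right side: the declarative Build 6 contract. Only the "what" is stated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, "Dipti", 17), (2, "Arun", 34)])
adults_sql = [name for (name,) in conn.execute("SELECT name FROM users WHERE age > 18")]

assert adults_kv == adults_sql == ["Arun"]
```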
Look at the left side carefully. Every line names a how. db.scan() is a total scan; you chose that. k.startswith(b"user:") is a prefix filter; you chose the key convention. json.loads(v) is a per-row decode; you chose that encoding. row["age"] > 18 is your filter; you chose to apply it after decode rather than before. If tomorrow the storage layer acquires a B-tree index on age, your loop does not benefit — it still full-scans, because your code said so.
Now look at the right side. There is no scan in it. There is no for loop in it. There is no key prefix, no JSON decoding step, no order. The SQL statement commits to one thing and exactly one thing: what the caller wants to see. If tomorrow the storage layer grows a B-tree index on age, this same SQL statement gets faster — the query planner notices the index, chooses to seek it, and the user's query text does not change by one character. If tomorrow the storage layer acquires parallel scan, this same SQL statement gets wider. If tomorrow the storage layer replaces its heap table with a columnar format, this same SQL statement keeps working. The query survives the storage layer; the storage layer does not dictate the query.
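You can watch this happen in SQLite, which ships with Python: when an index appears, the plan changes while the query text and the answer do not. (The exact plan strings vary by SQLite version; the structure of the experiment is the point.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("u%d" % i, i % 60) for i in range(1000)])

query = "SELECT name FROM users WHERE age > 18"

def plan(sql):
    # EXPLAIN QUERY PLAN asks SQLite's planner how it would run the query;
    # the fourth column of each row is the human-readable plan step.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

rows_before, plan_before = set(conn.execute(query)), plan(query)

# A physical change only: the query text does not change by one character.
conn.execute("CREATE INDEX idx_users_age ON users(age)")

rows_after, plan_after = set(conn.execute(query)), plan(query)

print(plan_before)   # a full scan, e.g. "SCAN users"
print(plan_after)    # typically an index seek on idx_users_age
assert rows_before == rows_after   # same relation, same answer, new plan
```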
That asymmetry is not a matter of taste. It is structural, and it comes out of a specific data model.
Codd's relational model — the four essentials
In June 1970 Codd published a twelve-page paper in Communications of the ACM titled A Relational Model of Data for Large Shared Data Banks. It is readable in an afternoon. It defines the relational model in four pieces that every SQL database you have ever used is built on.
Relations are sets of tuples. A relation is what you would call a table, but with one critical refinement: it is a set, not a list. The rows are unordered. There is no concept of "the first row" or "the row after that one" at the logical level. If the database happens to return rows in a particular order, that is an accident of the physical storage, not a property of the relation. A query that wants a specific order must say so explicitly (ORDER BY), and the database is then free to produce that order however it likes.
Set semantics is stronger than "unordered." It also means no duplicates at the logical level. If two rows have identical values in every column, they are the same tuple, and appear once. (Real SQL tables often allow duplicates — this is a carefully chosen deviation from the pure model and a long-running source of its own debates; we will come back to it.)
Tuples have named attributes. A tuple is a mapping from attribute names to values, not a positional list. (name: "Dipti", age: 17, city: "Chennai") is a tuple; ("Dipti", 17, "Chennai") is a positional list that has lost the information of which slot means what. In the relational model you address data by name, never by position. tuple.name is well-defined. tuple[0] is not a relational concept at all — it does not appear in the algebra, it does not appear in the calculus, and it does not appear in SQL's row model.
This is not a pedantic distinction. Naming is what makes queries compose: when you join two relations, you are lining up attributes with the same name and same domain (or explicitly renaming them). When you project, you are picking a subset of attributes by name. When you filter, you are predicating on named attributes. Positional addressing would mean any schema change — inserting a column, reordering for alignment — would break every query in the system. Named addressing means you can add columns at will, and queries that do not mention them keep working.
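A quick sqlite3 check of the claim: a query that addresses attributes by name survives a schema change that a positional scheme would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, city TEXT)")
conn.execute("INSERT INTO users VALUES ('Dipti', 17, 'Chennai')")

# The query names its attributes; it never says "column 0" or "column 2".
query = "SELECT name, city FROM users WHERE age < 18"
before = list(conn.execute(query))

# A schema change: add a column the query never mentions.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
after = list(conn.execute(query))

# The query keeps working, unchanged, because it addresses data by name.
assert before == after == [("Dipti", "Chennai")]
```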
Attributes have domains, and relations have constraints. Every attribute is drawn from a domain — a type, in the modern vocabulary. age draws from integers; name draws from strings; signup_date draws from dates. A value that is not in the attribute's domain is not a legal value, and the relation is required to reject it.
Beyond domains, relations carry constraints: rules the database enforces on every insert, update, and delete. Primary keys say "this attribute or set of attributes uniquely identifies a tuple" — no two rows with the same primary key exist. Foreign keys say "this attribute's value must match a primary key in some other relation" — you cannot have a row in orders with user_id = 42 unless there is a row in users with id = 42. NOT NULL, UNIQUE, CHECK constraints all encode application invariants directly in the schema, where the database enforces them uniformly rather than relying on every piece of application code to remember.
Constraints are what turn the relational model from a data-structure proposal into a data-integrity proposal. The reason you cannot accidentally ship a user without a name, or an order without a valid user, or a negative quantity, is that the relational model lets you state those rules as part of the data's definition, and the database refuses any update that would break them. This is part of what C in ACID — Consistency — means.
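All four constraint kinds named above can be exercised in a few lines of sqlite3. (One SQLite quirk is an assumption worth flagging: foreign-key enforcement is off by default and must be switched on with a pragma.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.execute("""CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL)""")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(id),
    quantity INTEGER CHECK (quantity > 0))""")

conn.execute("INSERT INTO users VALUES (42, 'Dipti')")
conn.execute("INSERT INTO orders VALUES (1, 42, 3)")   # fine: user 42 exists

violations = []
for sql in [
    "INSERT INTO users VALUES (42, 'Arun')",       # duplicate primary key
    "INSERT INTO users (id) VALUES (43)",          # NOT NULL on name
    "INSERT INTO orders VALUES (2, 99, 1)",        # no user 99: FK violation
    "INSERT INTO orders VALUES (3, 42, -5)",       # CHECK: negative quantity
]:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError as e:
        violations.append(str(e))

# All four rejected by the schema, once, in one place — not by app code.
assert len(violations) == 4
```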
The model is purely logical. Nothing in the four bullets above says anything about how the relation is stored. No file layout, no sort order, no index, no page format, no B-tree or LSM-tree or hash bucket, no in-memory versus on-disk distinction. A relation is a mathematical object: a set of tuples drawn from a product of domains. What the database does with it — how it lays out the bytes, how it finds a tuple quickly, what order it reads them in — is the storage engine's problem, invisible to the relational model and unconstrained by it.
This is the critical one, because this is what makes the declarative query on the right side of the figure above possible.
Why this beats arbitrary data structures for a declarative interface
You could imagine a world where the "user data model" of a database is whatever the user wants: linked lists, graphs, nested documents, hash tables, arbitrary objects. Indeed, several generations of databases have tried this. The network databases of the 1960s (CODASYL) exposed records-with-pointers directly, so the user had to walk explicit pointer chains to navigate data. Object databases of the 1990s exposed programming-language objects. Document databases of the 2000s-2010s (MongoDB, CouchDB) expose nested JSON.
All of these are legitimate data structures. Why does Codd's relational model dominate as the target of declarative querying? Three reasons, each tied to something you have already built or are about to build.
Reason 1 — the optimiser needs a closed algebra. A query optimiser's job is to take the user's declarative query and rewrite it into an equivalent cheaper form. "Filter after scan" becomes "filter pushed down into the index." "Join then project" becomes "project then join." "Sort then group" becomes "hash aggregate, no sort." For these rewrites to be safe — that is, guaranteed to produce the same answer — the operations the user can express must form a closed algebra with known equivalences. Relational algebra is exactly such a closed algebra: the core operators (selection, projection, Cartesian product, union, difference, rename — with join derivable as a product followed by a selection) take relations in and produce relations out, and the equivalences between compositions of them are provable once and for all.
No such algebra exists for arbitrary pointer-chasing or object graphs. If your data model lets a query say "follow the manager pointer, then the children list, then the third element," there is no general way to rewrite that into a cheaper equivalent form — the operation is imperative, and the optimiser cannot reason about it. The relational model's restriction to set-based, name-addressed data is what makes optimisation possible.
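The closure property and one of its equivalences can be checked in a few lines, modelling relations as frozensets of attribute-value pairs. This is a toy model for illustration, not an engine:

```python
# Relations as sets of named tuples: a frozenset of attribute/value pairs.
def rel(rows):
    return frozenset(frozenset(r.items()) for r in rows)

def select(r, pred):
    # sigma: keep the tuples satisfying the predicate. Relation in, relation out.
    return frozenset(t for t in r if pred(dict(t)))

def project(r, attrs):
    # pi: keep only the named attributes. Duplicates collapse: these are sets.
    return frozenset(frozenset((k, v) for k, v in t if k in attrs) for t in r)

users = rel([
    {"name": "Dipti", "age": 17, "city": "Chennai"},
    {"name": "Arun",  "age": 34, "city": "Mumbai"},
    {"name": "Meera", "age": 34, "city": "Chennai"},
])

# An equivalence the optimiser relies on: when the predicate mentions only
# projected attributes, selection and projection commute.
pred = lambda t: t["age"] > 18
lhs = project(select(users, pred), {"age", "city"})
rhs = select(project(users, {"age", "city"}), pred)
assert lhs == rhs  # provable once and for all, exploited on every query
```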
Reason 2 — the data is addressable from the outside. In the relational model, a tuple is identified by its contents — specifically, by the values in its primary-key attributes. No row carries a pointer to another row; references are by value (a foreign key points to a primary key), not by pointer or offset. This means the logical model has no dependency on physical addresses. The database can reorganise the physical storage arbitrarily — rewrite pages, change indexes, split tables across files, replicate across machines — without breaking any application query, because the queries reference data by its values, not by where it lives.
Contrast a linked-list or graph model, where a "pointer" has to be a physical identifier the application remembers. Move a node, and the pointer breaks. The relational model eliminates this category of dependency by never letting application code hold a pointer in the first place.
Reason 3 — it composes with transactions. Everything you built in Build 5 — atomicity, durability, isolation — was couched in terms of "transactions modify rows." The rows can be arbitrary, but the composition rule for transactions (a transaction commits or aborts, atomically, and the state after is the state before plus the committed modifications) is cleanest when rows are self-contained, identified by value, and unconstrained by pointer relationships. The relational model matches this composition rule exactly. Transactions on a network database have to worry about whether a pointer still points to a valid target; transactions on a relational database need only worry about the set of tuples they touched.
A single-line way to say all three reasons: the relational model is the data model where the database has the most freedom to optimise, precisely because the user committed to the fewest implementation details. Every constraint on the user's vocabulary is a degree of freedom for the optimiser.
Relations are not files — the logical/physical split
This is the deepest idea in the chapter, because it is the idea that makes everything in Build 6 possible without breaking everything in Builds 1-5.
In a file-based system, the interface is the storage. open("users.csv") means: this file, these bytes, this encoding. If you want to use the data in a different order, you copy the file. If you want to add an index, you add a separate index file and the application code has to know to consult it. If you want to change the encoding, every reader and writer of the file changes.
In the relational model, the interface is the relation, and the storage is a completely separate concern. The relation users is a set of tuples with a schema. That relation might be stored as a heap table, as a B-tree on the primary key, as an LSM sorted-string-table, as a columnar segment, as a materialised view over three underlying tables, or as a distributed shard across fifty machines. Any of those storage choices yield the same relation — the same answers to the same queries — because the relational model specifies the behaviour (what tuples, what constraints, what operations), not the implementation.
This split is called physical data independence, and Codd named it explicitly in the 1970 paper as a design goal. In practice, it is the reason you can:
- Add a B-tree index on users(email) without changing any query. Queries that would benefit are automatically replanned to use the new index.
- Partition a huge table across many files without changing any query. The planner inserts a partition pruning step.
- Change a table from row-major heap storage to columnar storage (some systems support this as a single command) without changing any query. Only the plans change.
- Migrate from SQLite to Postgres to CockroachDB without changing the SQL — the queries are the same at the logical level even though the storage engines are wildly different underneath.
The split also explains something about the code you have already written. In Build 5 your storage engine is a buffer pool plus a WAL plus an ARIES-style recovery pipeline. Your query engine does not exist yet — it will be the thing you build in Build 6. A mature database cleanly separates these: the storage engine answers questions of the form "give me the bytes of this row" or "commit these changes durably"; the query engine answers questions of the form "give me every user over eighteen" by translating them into many storage-engine calls. Postgres's postgres/src/backend/storage directory versus postgres/src/backend/optimizer directory is this split in the filesystem. SQLite's pager.c versus select.c is the same split. Your Build 5 code is one side of the wall; Build 6 builds the other side.
The query language comes next — but the model comes first
There is a temptation, when learning SQL, to start with the syntax: SELECT, FROM, WHERE, JOIN, GROUP BY, one keyword at a time. That is how most tutorials go, and for someone who only wants to use a database it is fine. But if you want to build a database — which you have been doing for thirty-nine chapters — the order has to be the other way round.
The query language is a surface. The surface is designed against a contract. The contract is the data model. You cannot design a query language until you know what it is a language over. SQL is a query language over relations — over sets of named tuples with domains and constraints. Every piece of SQL syntax has a relational-algebra meaning underneath: SELECT a, b FROM r is projection (π); WHERE p is selection (σ); JOIN is ⋈; UNION is ∪; EXCEPT is −. The parser's job in the next chapter is not to understand English — it is to map text to relational-algebra trees. It can only do that because the trees are expressions over a well-defined algebra of a well-defined data model.
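To make the SQL-to-algebra mapping concrete, here is the chapter's running query hand-lowered into π(σ(scan)) and checked against a real engine. This is a pure-Python sketch; the generator pipeline is only a preview of the executor shape Build 6 arrives at:

```python
import sqlite3

# The running query, hand-lowered to its algebra shape:
#   SELECT name FROM users WHERE age > 18   =   pi_name(sigma_{age>18}(users))
rows = [{"name": "Dipti", "age": 17}, {"name": "Arun", "age": 34},
        {"name": "Meera", "age": 34}]

def scan(table):                 # leaf: produce tuples one at a time
    return iter(table)
def sigma(child, pred):          # selection: drop tuples failing the predicate
    return (t for t in child if pred(t))
def pi(child, attrs):            # projection: keep only the named attributes
    return ({k: t[k] for k in attrs} for t in child)

plan = pi(sigma(scan(rows), lambda t: t["age"] > 18), ["name"])
algebra_answer = sorted(t["name"] for t in plan)

# The same relation and the same query through a real SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(r["name"], r["age"]) for r in rows])
sql_answer = sorted(r[0] for r in conn.execute(
    "SELECT name FROM users WHERE age > 18"))

assert algebra_answer == sql_answer == ["Arun", "Meera"]
```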
So: the relational model is not a preamble to SQL. It is the thing SQL is for. If you understand relations — sets, tuples, attributes, domains, constraints, the logical/physical split — you can reconstruct SQL from first principles (and people have; the original SQL was called SEQUEL, designed at IBM in the 1970s specifically as a human-friendly syntax for Codd's relational calculus). If you do not understand the model, SQL looks like an arbitrary collection of keywords, and every deviation from the model (NULLs, duplicates, ordering within a query) looks like a typo instead of a deliberate compromise.
The next chapter is going to introduce a SQL parser — a PEG-style recursive-descent parser that turns text like SELECT name FROM users WHERE age > 18 into an abstract syntax tree. A chapter soon after introduces relational algebra as the intermediate representation. By the end of Build 6, you will have a full pipeline: text → AST → bound AST → logical plan → optimised logical plan → physical plan → executor → answer. Every stage of that pipeline is a transformation of a relational-algebra expression. Before you can build any of it, the model has to be clear.
The same relation, three queries, three plans
Take a tiny catalogue:
relation orders(order_id PK, user_id FK→users.id, amount, placed_at)
relation users(id PK, name, age, city)
Three declarative questions, each using nothing but the relational model.
Q1. Total amount per city for users over 18.
SELECT u.city, SUM(o.amount) AS total
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.age > 18
GROUP BY u.city;
A possible physical plan: scan users, filter age > 18, hash-join against orders on id = user_id, hash-aggregate by city, sum amount. An alternative plan for the same query: index-scan on users(age) for rows above 18, nested-loop-join into an index on orders(user_id), streaming-aggregate by city. Either plan produces the same answer. The user's SQL is identical. The optimiser chose.
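The claim that every plan produces the same answer is checkable. Using the tiny catalogue above with a handful of invented rows, SQLite returns the same Q1 result before and after the indexes that would enable the alternative plan exist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER,
                     amount REAL, placed_at TEXT);
INSERT INTO users VALUES (1,'Dipti',17,'Chennai'),(2,'Arun',34,'Mumbai'),
                         (3,'Meera',40,'Chennai');
INSERT INTO orders VALUES (1,2,100.0,'2024-01-01'),(2,3,50.0,'2024-01-02'),
                          (3,3,25.0,'2024-01-03'),(4,1,999.0,'2024-01-04');
""")

q1 = """SELECT u.city, SUM(o.amount) AS total
        FROM users u JOIN orders o ON o.user_id = u.id
        WHERE u.age > 18
        GROUP BY u.city"""

before = sorted(conn.execute(q1))
# Give the planner new physical options; the query text is untouched.
conn.execute("CREATE INDEX idx_users_age ON users(age)")
conn.execute("CREATE INDEX idx_orders_user ON orders(user_id)")
after = sorted(conn.execute(q1))

# Dipti (age 17) is excluded, so her 999.0 order never counts.
assert before == after == [("Chennai", 75.0), ("Mumbai", 100.0)]
```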
Q2. Users in Chennai who have never placed an order.
SELECT u.name
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE u.city = 'Chennai' AND o.order_id IS NULL;
A possible physical plan: index-scan users(city='Chennai'), anti-join against orders on user_id, project name. The relational shape is the set of users minus those with matching orders — an anti-join, one of the standard operators. An anti-join is expressible in the relational algebra. The optimiser can pick an anti-join implementation (hash anti-join, sort-merge anti-join, nested-loop anti-join) without the user caring.
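The anti-join shape can also be checked directly: the LEFT JOIN ... IS NULL spelling and a NOT EXISTS spelling are two surfaces over the same relational operation (sample data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
INSERT INTO users VALUES (1,'Dipti','Chennai'),(2,'Arun','Mumbai'),
                         (3,'Meera','Chennai');
INSERT INTO orders VALUES (1,3);
""")

# Q2 as written in the chapter: outer join, then keep the unmatched side.
left_join_form = sorted(conn.execute("""
    SELECT u.name FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    WHERE u.city = 'Chennai' AND o.order_id IS NULL"""))

# The same anti-join, spelled as a negated existence test.
not_exists_form = sorted(conn.execute("""
    SELECT u.name FROM users u
    WHERE u.city = 'Chennai'
      AND NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)"""))

# Meera has an order; Dipti is the only Chennai user with none.
assert left_join_form == not_exists_form == [("Dipti",)]
```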
Q3. Top 10 spenders by total amount, newest first on ties.
SELECT u.name, SUM(o.amount) AS total
FROM users u JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.name
ORDER BY total DESC, MAX(o.placed_at) DESC
LIMIT 10;
Here the user has asked for ordering (ORDER BY) and a bound (LIMIT 10). The relational result is a set, but the presentation is a sorted list truncated to 10. The optimiser is free to use a top-N heap to avoid materialising the full sort — the user did not ask how, only what. At 100 rows of input it might full-sort; at 100,000,000 rows it will use a heap. Same query text, different plan, same answer.
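The top-N trick is easy to sketch outside a database. Python's heapq.nlargest is documented to be equivalent to a full sort followed by truncation, which is exactly the equivalence the optimiser exploits when it swaps one plan for the other:

```python
import heapq
import random

rows = [{"name": "u%d" % i, "total": random.random()} for i in range(100_000)]

# Plan A: full sort, then truncate. O(n log n), materialises everything.
full_sort = sorted(rows, key=lambda r: r["total"], reverse=True)[:10]

# Plan B: bounded heap of size 10. O(n log 10), streams the input once.
top_heap = heapq.nlargest(10, rows, key=lambda r: r["total"])

assert full_sort == top_heap  # same answer; the "how" was never in the query
```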
Common confusions
- "Isn't NoSQL not relational? And yet people still query it." NoSQL started as a movement against SQL's perceived rigidity, but over time most NoSQL systems have drifted towards relational-ish models. MongoDB has document schemas, indexes, joins (via $lookup), aggregation pipelines that look a lot like relational algebra. Cassandra's CQL is SQL-shaped. DynamoDB has tables, primary keys, secondary indexes. True "schema-less" NoSQL is the exception, not the rule, and even in the exceptions the query language ends up reinventing selection, projection, and aggregation because those are what queries actually need. The relational model is less a specific syntax than a specific theory of what data-querying is, and that theory keeps re-emerging. You cannot escape it by running from SQL; you just end up building a less-principled version of it.
- "Schemas are restrictive. Schema-less means freedom." Schema-less means the schema lives in every piece of application code instead of in one place. A document store with "no schema" still has an implicit schema — the application code has to know that user.age is an integer, that user.name exists, that user.email is a string. When one service writes age as an integer and another writes it as a string, nobody catches it until production. A declared schema is the database's way of enforcing, once, in one place, the invariants that every query depends on. "Schema-less" really means "schema-on-read" — you pay the check on every read instead of once on insert. Sometimes that is the right trade; often it is a way of putting off the problem.
- "Primary keys are just id columns, right?" In practice, many primary keys are surrogate integers (id) auto-generated by the database, but primary keys can be any combination of columns whose values uniquely identify a tuple. (flight_number, date) is a natural primary key for a flights table. (user_id, product_id) is a natural primary key for a cart_items table. Composite keys are fully first-class in the relational model, and the algebra does not treat them specially. The choice between surrogate and natural keys is a design decision with real tradeoffs; it is not a rule of the model.
- "The relational model disallows duplicates, but I see duplicates in my SQL results all the time." Pure relational algebra is set-based: no duplicates. SQL, from SEQUEL onward, chose to allow duplicates (bag semantics — a multiset rather than a set) because real queries often want duplicates (SELECT price FROM orders, where many orders share the same price, should return one row per order, not one row per distinct price). This was a conscious deviation from Codd's original model, and it generated heat for decades — Codd in later papers, and Date and Darwen in the Third Manifesto, argued it was a mistake. In practice, SELECT DISTINCT lets you restore set semantics when you want them. The lesson: SQL is not pure relational theory; it is relational theory with pragmatic compromises baked in for real-world usability.
- "If the logical model is separate from the physical, why does SQL have hints like USE INDEX?" Because the optimiser is not always right, and experienced database engineers need an escape hatch. Index hints, join-order hints, and planner pragmas are deliberate violations of the pure logical/physical separation — they let the user reach through the abstraction and instruct the storage engine. They exist because optimiser cost models are imperfect, especially on unusual workloads. Pure physical data independence is the ideal; hints are the acknowledged reality.
- "NULLs are part of the relational model, right?" This one is genuinely contentious. Codd included NULL in his later revisions (the 1979 Extending the Database Relational Model paper), but Date and Darwen — in the Third Manifesto — argue vehemently that NULLs break the algebra because three-valued logic (true/false/unknown) makes predicate evaluation ambiguous. Real SQL has NULLs; they cause endless subtle bugs. You will meet them in the parser and planner chapters, and you will feel Codd and Date arguing over your shoulder every time you write WHERE x != 5 and forget it does not match rows where x IS NULL.
Going deeper
The four bullets above cover the working vocabulary a Build 6 engineer needs. Below is the history and the formal theory — useful to anyone who wants to read the source code of a real optimiser or the original papers.
Codd's 1970 paper, read today
Edgar F. Codd's A Relational Model of Data for Large Shared Data Banks (CACM, June 1970) is one of those rare papers where the twelve-page version is still the clearest statement of the idea. It defines relations, attributes, domains, primary keys, foreign keys, and the separation of the logical and physical levels. It introduces relational algebra (eight operators at the time — later reduced to a six-operator minimum) and sketches a relational calculus (a first-order-logic notation equivalent in expressive power); the full equivalence proof came in Codd's 1972 follow-up, Relational Completeness of Data Base Sublanguages. The equivalence is important: it is the theoretical justification for the split that runs through every SQL engine — the user writes calculus-like predicates, the engine rewrites them into algebraic trees and optimises the trees.
Codd also names the three kinds of "data dependencies" he wants to eliminate: ordering dependence (queries should not depend on the order of rows), indexing dependence (queries should not depend on which indexes exist), and access-path dependence (queries should not commit to a particular way of reaching the data). All three are captured in the logical/physical split, and none of the three held for the network databases that dominated in 1970. Codd's paper was a direct argument against CODASYL's proposed DBTG specification. He lost the political battle in the short term — IBM spent the 1970s funding his research only reluctantly — and won the long one comprehensively. Every database you have ever used is built on his argument.
Date and Darwen's Third Manifesto
Chris Date and Hugh Darwen, two of Codd's most careful readers, published The Third Manifesto in 1995 (with updates through 2013) as an attempt to define a pure relational language they called D, free of the compromises SQL had accumulated. The manifesto is worth reading for its clear-headed insistence on what the relational model actually is — sets, not bags; tuples, not arrays; no NULLs, use default values or missing-value markers through the type system — and for its careful separation of what is mathematics from what is convenience.
The manifesto is not implemented in any mainstream database, but its analysis of SQL's deviations from pure relational theory is the standard reference. If you ever wonder why a SQL query is doing something surprising — duplicate rows, strange NULL handling, order-sensitive aggregations — Date and Darwen's answer is usually that SQL deviated from the model, and the deviation is now part of the standard. The Third Manifesto maps the deviations precisely. Read their Database Explorations blog archive if you want more.
Relational calculus and why planners use algebra
The user, in their head, thinks in relational calculus — first-order predicate logic over relations. "Every user whose age is over 18 and whose city is Chennai" is a predicate: { u | u ∈ users ∧ u.age > 18 ∧ u.city = 'Chennai' }. SQL's WHERE clause is a thin syntactic layer over this calculus. But calculus is declarative in a way that is useful to humans and useless to optimisers — it describes what the answer contains, not how to compute it.
Relational algebra is operational: a tree of operators (σ, π, ⋈, ∪, −, ρ) each of which has a clear execution method. The algebra is what the planner uses internally, because the planner's job is to rewrite the tree — push selections down, push projections up, reorder joins, pick join algorithms. The rewriting obeys well-known algebraic equivalences (σ commutes with ⋈ under certain conditions, associativity of ⋈, distributivity of σ over ∪, etc.), and these equivalences are the building blocks of every cost-based optimiser.
The deep result, due to Codd, is: calculus and algebra are equivalent in expressive power. Every calculus expression can be converted to an algebraic tree, and vice versa. This is sometimes called Codd's theorem, and it is why SQL (a calculus-shaped language) can always be compiled to a relational-algebra plan. The chapter of this track on relational algebra as your intermediate representation is the practical side of this theorem: it defines the data structure your parser's output gets lowered into.
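The theorem in miniature: a calculus-shaped set comprehension and an algebra-shaped operator pipeline compute the same relation. A toy illustration, not a general conversion procedure:

```python
users = [
    {"name": "Dipti", "age": 17, "city": "Chennai"},
    {"name": "Arun",  "age": 34, "city": "Mumbai"},
    {"name": "Meera", "age": 40, "city": "Chennai"},
]

# Calculus: describe membership.
#   { u.name | u in users AND u.age > 18 AND u.city = 'Chennai' }
calculus = {u["name"] for u in users
            if u["age"] > 18 and u["city"] == "Chennai"}

# Algebra: a tree of operators with an execution order: pi_name(sigma_pred(users)).
def sigma(r, pred): return [t for t in r if pred(t)]
def pi(r, attr):    return {t[attr] for t in r}

algebra = pi(sigma(users, lambda t: t["age"] > 18 and t["city"] == "Chennai"),
             "name")

assert calculus == algebra == {"Meera"}
```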
The extended algebra — what real databases actually use
The "six-operator" core algebra is enough in the mathematical sense — every relational query can be expressed with it — but real databases implement extensions to handle features that are practically important even if theoretically reducible. Aggregation (GROUP BY, SUM, COUNT) adds a γ operator that groups tuples by attributes and computes functions over groups. Outer joins (left, right, full) add operators that preserve unmatched tuples with NULLs in place. Window functions (OVER (...)) add operators that compute functions over frames of rows. Set operations with bag semantics (UNION ALL, EXCEPT ALL) are distinct from their pure-set counterparts.
Each extension corresponds to a physical operator in your future executor: a hash-group-by, a hash-anti-join, a window frame processor. The extended algebra is what the physical plans in Build 6 will be trees over. Everything the user can say in SQL maps to a tree of these operators; everything the planner does is rewrite the tree. The chapter on relational algebra as your intermediate representation is the full list.
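A sketch of the γ operator as a single hash-aggregate pass, the physical shape a GROUP BY with SUM typically lowers to. The function names here are mine for illustration, not any engine's API:

```python
from collections import defaultdict

# gamma_{group_attr; fold(agg_attr)}: group tuples by an attribute and fold
# an aggregate function over each group, in one pass over the input.
def gamma(rows, group_attr, agg_attr, fold, init):
    groups = defaultdict(lambda: init)
    for t in rows:
        groups[t[group_attr]] = fold(groups[t[group_attr]], t[agg_attr])
    return {(k, v) for k, v in groups.items()}

orders = [
    {"city": "Chennai", "amount": 50.0},
    {"city": "Chennai", "amount": 25.0},
    {"city": "Mumbai",  "amount": 100.0},
]

# GROUP BY city, SUM(amount), as a hash aggregate: no sort required.
result = gamma(orders, "city", "amount", lambda acc, x: acc + x, 0.0)
assert result == {("Chennai", 75.0), ("Mumbai", 100.0)}
```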
The IMS / CODASYL prehistory
It is worth reading a little CODASYL to feel what Codd was arguing against. In a network database, an application program navigates data by walking pointer chains: FIND OWNER OF member follows one pointer; FIND NEXT member WITHIN set follows another. Every query is a short imperative program. Every schema change — adding a new index, reorganising a set — requires rewriting application code to use the new access paths. CODASYL's DBTG report (1971) codified this as a standard; at the time it was considered the future of data management.
IMS (IBM's Information Management System, 1968) was even more constraining: a pure hierarchical model where data is a tree, and every query is a root-to-leaf traversal. IMS is still in production at many banks and airlines; it is not dead, it just does not scale beyond one specific workload shape. Codd's argument, which seemed wildly abstract at the time, was that both IMS and CODASYL had confused the data with the access path, and that a proper data model would separate them. Fifty-five years later the argument is textbook — but re-reading the 1970 paper against its historical opposition is how you appreciate what an intellectual leap it was.
Where this leads next
You have a data model. The next move is a language over that data model, and a parser that turns strings of that language into trees.
- Writing a SQL parser (PEG / recursive descent) — chapter 41. You will build a hand-written recursive-descent parser for a subset of SQL big enough to run real queries against your Build 5 store. The parser produces an abstract syntax tree that mirrors SQL's surface syntax directly.
- The binder: resolving names against the catalog — chapter 42. The parser does not know what users means; the binder looks up users in the database's catalog, confirms the columns, and attaches types to every attribute reference in the tree.
- Relational algebra as your intermediate representation — the chapter after that. The bound AST gets lowered into a relational-algebra tree, which is the canonical form the optimiser operates on.
Then the optimiser, the physical planner, the execution engine, the operator pipeline. Every one of those sits on top of the data model this chapter defined. If the model ever becomes unclear in a later chapter — if you find yourself unsure why a planner rewrite is legal, or why a query shape is expressible, or why a storage change does not invalidate a query — come back to this chapter. The answer will be somewhere in the four essentials, or in the logical/physical split.
References
- Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks, CACM 13(6), June 1970 — the paper that started it. Twelve pages, readable in an afternoon, still the clearest short introduction to the model.
- C. J. Date and Hugh Darwen, The Third Manifesto (1995, updated 2013) — the definitive attempt to clean up what SQL broke. Not implemented in production, but the clearest guide to what pure relational theory requires versus what SQL actually does.
- Joseph M. Hellerstein, Michael Stonebraker, James Hamilton, Architecture of a Database System, Foundations and Trends in Databases (2007) — the modern reference for how a database's query engine and storage engine fit together. Reading chapters 4 (query processor) and 5 (storage) alongside this track is the shortest path to a working mental model.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Ch. 2 Data Models and Query Languages — the plain-English tour of relational, document, and graph models, with the tradeoffs laid out clearly. dataintensive.net.
- Edgar F. Codd, Extending the Database Relational Model to Capture More Meaning, ACM TODS 4(4), December 1979 — Codd's own revisions a decade on, including the (contentious) introduction of NULLs. Useful for seeing the model's author argue with himself.
- Michael Stonebraker and Joseph M. Hellerstein (eds.), Readings in Database Systems, 5th ed. ("The Red Book") — the canonical graduate-course anthology. The editorial introductions frame each paper in its historical moment, and several early chapters cover the relational-model debates directly.