The 30-year arc and where databases go next
In 1996 a tour of the field would have shown you Oracle 7, Sybase, and a brand-new project called Postgres95. In 2026, you have just spent 185 chapters building the things that replaced them, and the things that will replace those. This last chapter is the ridge: a place to turn around, look at the road behind, and squint at the road ahead.
Database history rhymes. Every decade declares relational obsolete; every decade ends with the relational ideas pulled back in. Stonebraker called it "what goes around comes around." The single durable trend underneath the noise is that the log is the truth and every database is a materialized view over it — and that idea, more than any product name, is what the next thirty years will keep rediscovering.
The road you just walked
Look at the table of contents of this encyclopedia and you can see the whole field in one glance. Build 1 handed you a file you only add to. Build 2 gave that file an index. Builds 3 and 4 introduced LSM-trees and B-trees — the two index families behind almost every storage engine you will meet. Build 5 brought transactions, Build 6 SQL, Build 7 the optimizer. By Build 12 you were sharding. By Build 16 you were running a Raft cluster. By Build 23 you had pulled the log out into Kafka and rebuilt every database in your stack as a fold of that log.
You did not learn 186 disconnected things. You learned one thing in 186 increments: how a single principle — write changes to a durable ordered log, derive everything else lazily — scales from a thirty-line Python file to a planet-sized streaming system. Postgres, RocksDB, Cassandra, Spanner, Snowflake, DuckDB, Kafka, Flink — they are all the same idea wearing different uniforms.
This chapter is not a recap of those uniforms. It is a map of the thirty years that produced them, and a forecast of the thirty years that will produce the next ones. The point is to leave you able to read database news the way an old engineer does: not as a stream of product launches, but as the same handful of ideas re-arguing with each other.
Thirty years in one picture
Mike Stonebraker and Joseph Hellerstein wrote a paper in 2005 called What Goes Around Comes Around. The title is the thesis. They walked through forty years of data-model history — hierarchical, network, relational, object, XML — and showed that every "post-relational" wave eventually re-discovered, and re-implemented, the relational core. Two decades later the pattern is even cleaner. Here is the arc, drawn at the level of waves rather than products.
Read the diagram top to bottom and you get the plot; read the bar at the bottom and you get the theme. The plot is fashion. The theme is engineering.
Wave 1 — relational consolidation (mid 1990s)
The mid-90s ended a long argument. Codd's relational model from 1970, Stonebraker's Ingres and IBM's System R from the mid-to-late 70s, Sybase and Oracle in the 80s — by 1995 the field had agreed: tables, SQL, ACID, B-trees, write-ahead logging. Postgres95 was released. MySQL appeared. Oracle 7 was the system every Indian bank, telco, and ticket-booking site ran on. The intellectual fight was over; the only fights left were performance and licence cost.
This is the world Build 4 (B-trees), Build 5 (transactions, WAL), Build 6 (SQL), and Build 7 (optimizer) describe. None of those ideas were new in 1995. They were consolidated in 1995.
Wave 2 — internet scale and the NoSQL break (mid 2000s)
Then Google happened. The web crawl was a petabyte. Sharded MySQL ran out of headroom. In 2006 Google published the Bigtable paper — a wide-column store on top of GFS, no SQL, no joins, no secondary indexes. In 2007 Amazon published Dynamo — an eventually consistent, leaderless KV store designed to keep the shopping cart available through a datacentre fire. Within three years, open-source descendants of both — HBase, Cassandra, Riak — plus document stores such as MongoDB were running production workloads at Facebook, Netflix, Twitter.
The slogan was NoSQL, and what it meant in practice was: drop joins, drop strong consistency, drop SQL, drop schemas — buy back horizontal scale and availability. Builds 13, 14, 15, and 16 of this track are the engineering that wave produced: LSM-trees, consistent hashing, gossip, tunable consistency, Paxos and Raft.
The mistake the NoSQL wave is now famous for is that it conflated scale with abandoning relational. Once horizontal scaling was solved, the second half of the wave — the NewSQL wave — re-attached SQL and ACID to the same scaled-out substrate.
Wave 3 — NewSQL and the relational return (early 2010s)
In 2012 Google published Spanner, and the field changed direction overnight. Spanner is a globally distributed database with SQL, secondary indexes, externally consistent transactions, and atomic clocks. It said, plainly, that "scale" and "ACID SQL" were not in conflict. They had only seemed to be because nobody had been willing to spend the engineering.
CockroachDB (2014), TiDB (2015), YugabyteDB (2016) followed. Postgres itself grew logical replication, JSONB, parallel query, and partitioning. By 2020 the boundary between "OLTP relational" and "scale-out" had quietly dissolved. The answer to "do I need SQL or do I need scale?" became "yes." This is what Stonebraker meant by what goes around comes around: the NoSQL wave's permanent contribution was not throwing relational away, but forcing relational to scale.
Wave 4 — cloud and the separation of compute and storage (late 2010s)
The next move was architectural rather than data-model-level. Snowflake (2014), Aurora (2014), BigQuery (2010 onward), Databricks Delta Lake (2017) — each in different ways made the same bet: store data in cheap, durable, shared object storage (S3, GCS); spin compute up and down on demand. The data lives forever in S3. The query engine is ephemeral. You pay for storage in cents per GB-month and for compute by the second.
This unlocked things classical databases could not do: instant warehouse cloning, time travel, zero-copy snapshots, separate scaling of readers and writers, multi-tenant isolation, and elastic batch jobs. It also broke an old assumption — that the database was a single program with its own disks. After Snowflake, "the database" is a protocol spoken between a query layer, a metadata layer, and an object-storage layer that never has to be the same vendor.
Builds 19, 20, and 21 of this track describe the resulting architecture: columnar formats, lakehouses, time-travel and zero-copy clones, metadata services, and the table-format wars (Iceberg, Delta, Hudi).
This wave also produced the open table format — Apache Iceberg in particular — which decoupled storage from query engine in a way the field had never seen before. An Iceberg table on S3 can be queried by Snowflake, Databricks, Trino, DuckDB, Spark, and Flink, simultaneously, with consistent reads and atomic appends. That single sentence would have been science fiction in 2010 and is unremarkable in 2026. Once the storage and the metadata are open, the engine becomes the commodity — and the engineer's job tilts away from "pick a database vendor" toward "pick the right engine for each query class against shared storage." This is the architectural backdrop against which every 2026 vendor pitch should be evaluated.
Wave 5 — streaming, embedded, and AI convergence (2020s)
The current wave is the one you finished a chapter ago. Three threads at once:
- Streaming as a first-class database. Kafka (2011) was the substrate; Materialize (2019), Flink SQL, Pinot, ksqlDB made it queryable. The result is that the database and the stream processor are increasingly the same thing — Kleppmann's "turning the database inside out."
- Embedded analytics. DuckDB (2019) put a vectorized columnar OLAP engine inside a Python process. SQLite already lived inside every phone. Suddenly "the database" did not need to be a server at all.
- AI and vector search. Pinecone (2019), Weaviate, pgvector, and Milvus made nearest-neighbour search a first-class index alongside B-trees. Retrieval-augmented generation pulled vector queries onto the hot path of every chatbot. Postgres grew pgvector; Elasticsearch grew HNSW; SQLite grew vector extensions. The vector index joined B-tree, hash, inverted, and LSM as the fifth standard index family.
None of these waves replaced the previous one. Postgres did not disappear when Cassandra appeared. Cassandra did not disappear when Spanner appeared. Snowflake did not replace Postgres for OLTP. DuckDB did not replace Snowflake for warehouses. The field accumulated — which is the next theme.
Why no wave fully replaces the previous one: each wave was an answer to a specific cost ratio in the underlying hardware and a specific workload mix in the world above it. NoSQL answered "the dataset is bigger than one machine"; NewSQL answered "but we still need joins"; Snowflake answered "but compute and storage scale at different rates"; streaming answered "but we want the answer to update continuously." Every workload that produced a wave is still in production somewhere, so the system that solved it is still in production too. The field accumulates because production never shrinks.
What did not change
Walk back over those five waves and ask: which ideas from 1995 are still load-bearing in 2026?
Why these six ideas survived: each of them is a consequence of physics, not a fashion choice. Disks are sequentially fast and randomly slow → append-only logs are unbeatably cheap to write. Trees and sorted runs are the only known ways to look up a key in O(log n) on those disks. Declarative queries let the optimizer choose the plan, which means the plan can change as hardware changes — without you rewriting your code. Crashes are inevitable on any machine left running for a year, so replication is a survival requirement, not an optimization. None of these will be obsoleted by the next wave; they will be re-implemented in it.
The log in particular is the single most stable structure in computer engineering after the file. It was in System R's recovery manager in 1976. It is in Postgres's WAL, RocksDB's commit log, Cassandra's commitlog, Kafka's segment files, Spanner's Paxos groups, Snowflake's transaction log, and the operation log inside every consensus algorithm. When you watch a new database announce itself in 2030 with a new buzzword on the marquee, look behind the marquee for the log. It will be there.
The runnable form of "the log is everything" is one paragraph of Python. Build 1 had it on day one; thirty years of database engineering have not made it obsolete.
# the kernel of every database, still — 2026
from collections import defaultdict

class LogDatabase:
    """A log, plus a derived view, plus replay-on-startup. That is it."""

    def __init__(self, path):
        self.path = path
        self.view = defaultdict(lambda: None)  # the materialized view
        self._replay()                         # rebuild it from the log

    def _replay(self):
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                for line in f:
                    op, k, v = line.rstrip("\n").split("\t", 2)
                    if op == "PUT":
                        self.view[k] = v
                    elif op == "DEL":
                        self.view.pop(k, None)
        except FileNotFoundError:
            pass

    def put(self, k, v):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"PUT\t{k}\t{v}\n")
            f.flush()
        self.view[k] = v

    def delete(self, k):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"DEL\t{k}\t\n")
            f.flush()
        self.view.pop(k, None)

    def get(self, k):
        return self.view[k]
Run it. Restart the process. The dictionary rebuilds itself from the log. That is Postgres, Snowflake, RocksDB, and Spanner in twenty-five lines — the "view" gets replaced with B-trees, columnar files, distributed Paxos groups, and Iceberg manifests, but the loop is the same. The log persists; the view is rebuildable. Every database is a longer version of this program.
The themes underneath the thirty-year arc
Step back from individual systems and the picture becomes a small number of forces that keep producing new databases. Six of them, in the order they showed up.
1. Hardware moves; databases follow
The 1990s database was tuned for spinning disks: random reads cost 10ms, sequential reads cost 0.1ms, and the entire query engine was built to convert one into the other. SSDs (mid 2000s) collapsed that ratio to roughly 10:1 — and so LSM-trees, which trade random reads for sequential writes, suddenly stopped looking expensive. NVMe (2015) collapsed it further. RAM grew from megabytes to terabytes; in-memory databases (HANA, MemSQL, VoltDB) moved from absurd to plausible. Network bandwidth grew from 100 Mb/s to 100 Gb/s in the same era; separating compute from storage over the network became feasible only because the network had become as fast as the SAS bus had been.
Every database architecture is shaped by the cost ratios of the hardware it was designed for. When the ratios change, the architectures change. This is why "the best database design" has no fixed answer — it tracks a moving target.
Why hardware drives architecture: a B-tree was the right answer in 1985 because random reads on a spinning disk cost 1000× more than sequential reads, and B-trees minimize seeks at the price of writes-in-place. An LSM-tree was the right answer in 2010 because flash collapsed that ratio, making the LSM's "convert random writes into sequential ones" trade newly cheap. The same query workload can prefer different physical structures on different hardware. Pick a database without thinking about the hardware ratios it was tuned for, and you inherit a design optimized for somebody else's machine.
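The ratio collapse described above is worth seeing as arithmetic. A toy sketch — the latency figures are the chapter's order-of-magnitude illustrations, not benchmarks:

```python
# Toy arithmetic only: the random-vs-sequential cost ratio per storage
# generation. Latencies are order-of-magnitude illustrations.
devices = {
    "spinning disk (1995)": {"random_ms": 10.0, "seq_ms": 0.1},
    "SATA SSD (2010)":      {"random_ms": 0.1,  "seq_ms": 0.01},
    "NVMe (2020)":          {"random_ms": 0.02, "seq_ms": 0.01},
}

for name, d in devices.items():
    ratio = d["random_ms"] / d["seq_ms"]
    print(f"{name:22s} 1 random read costs ~{ratio:.0f} sequential reads")
```

At 100:1, a structure that spends extra sequential IO to avoid a seek — an LSM-tree's compactions, a scan instead of an index probe — pays for itself easily; at 2:1 it barely matters. That flip is the whole argument of this section in three numbers.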
2. Declarative beats imperative, every time
In every wave, somebody tries to expose imperative, low-level APIs (the "you know your access patterns better than the optimizer" school). MapReduce in 2004 was imperative. Many NoSQL stores in 2008 were imperative. Each time, within a few years, a SQL layer is bolted on — Hive on Hadoop, CQL on Cassandra, KSQL on Kafka, SparkSQL on Spark.
The reason is one of economics. The optimizer changes when hardware changes; the query does not. A SQL query written for spinning disks runs on NVMe without rewriting. A hand-coded scan plan does not. Declarative queries are forward-compatible with hardware you have not bought yet. This is the lesson Stonebraker hammered home in What Goes Around Comes Around and it has been correct every time.
Why declarative wins economically: a query is a description of what you want; a plan is a recipe for how to get it. The plan space changes every five years (new join algorithms, new index types, new hardware primitives), and the optimizer is the only place in the stack where that knowledge lives. If you write the what and let the database choose the how, you get every plan-space improvement for free. If you wrote the how yourself, you have committed to upgrading every query manually each time the world moves. Multiplied by ten thousand queries in a real codebase, that is a labour cost that no organisation can afford forever.
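A toy cost-based chooser makes the economics concrete. Everything here — the two plans, the cost formulas, the constants — is an illustrative assumption, not any real optimizer's model; the point is only that the query stays fixed while the plan flips with the hardware:

```python
# Illustrative sketch: a two-plan "optimizer". The query (the what) is a
# fixed dict; the plan (the how) changes when the hardware cost
# constants change. Formulas and constants are invented for illustration.

def choose_plan(n_rows, selectivity, random_io_ms, seq_io_ms):
    """Pick the cheaper plan for 'SELECT * FROM t WHERE status = ?'."""
    full_scan = n_rows * seq_io_ms                            # read every row sequentially
    index_lookup = (3 + n_rows * selectivity) * random_io_ms  # ~3 B-tree levels, one seek per match
    return "index_lookup" if index_lookup < full_scan else "full_scan"

query = dict(n_rows=1_000_000, selectivity=0.2)  # the declarative part never changes

print(choose_plan(**query, random_io_ms=10.0, seq_io_ms=0.1))   # spinning disk -> full_scan
print(choose_plan(**query, random_io_ms=0.02, seq_io_ms=0.01))  # NVMe -> index_lookup
```

The same declarative query gets a sequential scan on 1995 hardware and an index lookup on 2026 hardware, with no application code changed — which is the forward-compatibility argument in ten lines.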
3. Replication is non-negotiable
Single-machine databases died as a category in roughly 2010. Not because they could not scale — Postgres on a single box can handle astonishing workloads — but because uptime requirements caught up with durability requirements. A 99.99% SLA does not tolerate a single point of failure. Streaming replication, Raft, Paxos, leaderless quorums — every system that wants to be production-grade in 2026 ships replication built in. Build 16 was unavoidable.
4. Compute and storage want to separate
This is the deepest architectural shift of the last decade. Once object storage (S3) became as durable and as fast (per GB-month) as a local SAN, the equation changed. The 1995 database had its data on disks attached to its compute; if the compute died you lost the disks. The 2025 database keeps data in S3 and treats compute as a fungible resource — start, scale, stop. Snowflake. Aurora. BigQuery. Databricks. Iceberg-on-X. Once compute can be torn down without touching data, every workload that wants a different compute profile can have its own engine pointed at the same data.
This is also why "polyglot persistence" became "polyglot compute over a single data lake" — Build 24's recurring point. The polyglot is not five databases anymore; it is one Iceberg table read by five engines.
5. Polyglot is the production reality
Nobody runs one database in 2026. The smallest serious Indian startup — say a Series A fintech in Bengaluru — runs at minimum: Postgres for transactional state, Redis for caches and rate limits, Elasticsearch for search, Kafka for events, S3 for raw data, ClickHouse or Snowflake for analytics. The largest — Flipkart, PhonePe, Razorpay — run dozens. Every individual database is specialized; the system is the integration. Builds 22 and 23 are the chapters that taught you to think this way.
The corollary: a working data engineer in 2026 is not someone who knows Postgres deeply. It is someone who knows how four databases talk to each other through a Kafka log. The log is the lingua franca; the databases are the dialects.
6. The log is increasingly visible
In 1995 the log was a private implementation detail. In 2026 the log is a product — Kafka, Pulsar, Kinesis, Redpanda. CDC tools (Debezium, Maxwell, Materialize Source) expose the log of databases that did not originally consider their log a product. The trend is toward making the log explicit, durable, and consumable by many systems. This is what Kleppmann's "turning the database inside out" was a forecast of, and the forecast was correct.
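The trend fits in a few lines of Python: one explicit change log, consumed by two independent "databases" that each fold it into a different materialized view. The event shapes are invented for illustration:

```python
# Minimal sketch: one shared change log, two independent consumers,
# two different materialized views. Event shapes are invented.

log = [
    ("PUT", "user:1", "asha"),
    ("PUT", "user:2", "ravi"),
    ("DEL", "user:1", None),
    ("PUT", "user:3", "meena"),
]

def kv_view(events):
    """An OLTP-style view: the latest value per key."""
    view = {}
    for op, key, value in events:
        if op == "PUT":
            view[key] = value
        else:
            view.pop(key, None)
    return view

def stats_view(events):
    """An analytics-style view: running count per operation type."""
    counts = {}
    for op, _, _ in events:
        counts[op] = counts.get(op, 0) + 1
    return counts

print(kv_view(log))     # {'user:2': 'ravi', 'user:3': 'meena'}
print(stats_view(log))  # {'PUT': 3, 'DEL': 1}
```

Swap the Python list for a Kafka topic and the two functions for a Postgres replica and a ClickHouse ingester, and this is the production pattern the CDC tools exist to serve.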
How to read the next thirty years
Forecasts about specific systems are mostly wrong. Forecasts about pressures are mostly right. Here is a small set of pressures that will shape the next decade — not predictions of which products will win, but of which fights will keep happening.
AI is changing the workload, not the substrate
Vector search is now a first-class index family alongside B-tree, hash, inverted, and LSM. RAG pipelines are now a first-class workload alongside OLTP and OLAP. Embedding pipelines, feature stores, and prompt-history logs are now first-class data. But — and this is the important part — none of them have produced a new substrate. They run on B-trees, LSM-trees, and Kafka logs. Pinecone is, underneath, an LSM with HNSW indexes. pgvector is a Postgres index extension. The workload is new; the substrate is the same one Build 1 of this track introduced. Expect new index types, not new databases.
The interesting consequence of this is operational rather than architectural. A 2026 fintech in Pune that adds an LLM-powered support assistant does not buy a new database. It adds a vector(1536) column to an existing Postgres table, builds an HNSW index on it, runs the embedding pipeline as a Kafka consumer, and serves the resulting nearest-neighbour queries from the same Postgres instance that already serves transactional data. The "AI database" is the regular database with one more index family. That is exactly what Stonebraker's framing predicted: the field absorbs the new workload into the established substrate, rather than spinning off a new substrate for it.
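Underneath the HNSW index, the query being served is plain nearest-neighbour search. A brute-force sketch with invented 3-dimensional "embeddings" — real embeddings have hundreds or thousands of dimensions, and a real index answers approximately in sublinear time, but the semantics are the same:

```python
# What the vector index accelerates, shown without the index: exact
# nearest-neighbour search by cosine similarity. Vectors are invented
# toy "embeddings"; this O(n) scan is the baseline HNSW replaces.
import math

docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "login help":    [0.1, 0.9, 0.1],
    "upi limits":    [0.2, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query_vec, k=1):
    """Brute-force scan over every document; the index family exists to avoid this."""
    return sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:k]

print(nearest([0.85, 0.15, 0.05]))  # ['refund policy']
```

Seen this way, "vector search" is just another index family with its own lookup semantics — which is why it slots into Postgres as an extension rather than demanding a new substrate.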
Cheap durable storage will keep eating expensive specialized storage
S3 already replaced HDFS for analytics. Iceberg and Delta are turning S3 into the warehouse storage layer. The next move is OLTP: shared-storage primaries (Aurora, Neon, Tigris), then full S3-backed transactional engines. Every layer of the stack that is currently "data attached to compute" will, over the next decade, fight a battle with "data on object storage, compute is ephemeral." Some layers will resist (low-latency transactional state probably stays close to its compute for a while), but the pressure will be one-directional.
Embedded engines will keep eating servers from below
DuckDB took analytics out of the warehouse for a class of workloads where "ten million rows on your laptop" replaced a metered cloud-warehouse query. SQLite was already inside every phone. The pressure is real: a lot of work that gets sent to a database server is small enough that an in-process engine can serve it without the network. Expect more of this — embedded vector stores, embedded streaming engines, embedded graph processors. Build 24 chapter 4 (DuckDB) is the case study.
Kafka-shaped systems will absorb more of the database
If Kreps was right, then over time the durable shared log moves from being a tool for moving data between databases to being the canonical place where data lives, with traditional databases reduced to materialized views. The progress in 2026 is partial — Materialize is real, Flink SQL is real, but Postgres is still the place most application code points at. The pressure is the same one that took analytics to S3: the log is durable and shared; everything else is rebuildable. Watch this slowly invert.
Query interfaces will diversify
SQL stays. JSON-document APIs stay. But natural-language interfaces (LLMs that compile questions to SQL or to direct query plans) are now a serious third channel. They will not replace SQL — SQL is what they compile to — but they will change who writes queries. The interesting engineering problem is not "let the LLM write SQL," it is "let the database expose a schema and a cost model the LLM can reason about reliably." Expect every major database to ship an LLM bridge that is more than a thin wrapper.
Serverless and per-query pricing will keep spreading
Aurora Serverless v2, Neon, PlanetScale, BigQuery, Snowflake — the model where you pay for the queries you run, not the servers you provision, will spread to more workloads. This pushes back on the "always-on database server" assumption that Build 24 (running a database in production) describes. The work the production engineer does in 2026 is necessary; in 2036 a fraction of it will be the cloud's problem instead.
Common confusions
- "NoSQL replaced relational." It did not. NoSQL forced relational to scale horizontally — that was its lasting contribution. The 2026 OLTP database is both: a Spanner or CockroachDB has SQL, ACID, secondary indexes, joins (relational), and runs across hundreds of machines (NoSQL's original promise). The wave was a course-correction, not a replacement.
- "AI databases are a new category." They are not, as of 2026. Vector search is a new index, not a new substrate. Pinecone, Weaviate, and pgvector all sit on top of LSM-trees or B-trees. The "AI database" marketing label means a database that ships with an HNSW index, an embedding pipeline, and good defaults for RAG. That is real and useful — but it is a workload accommodation, not a Stonebraker-level paradigm shift.
- "The cloud killed the database." It did the opposite. There are more databases in 2026, more diverse, more specialized, more deeply integrated than ever. What the cloud killed was the assumption that one database vendor sells you a product that runs on your hardware. In 2026 the database is a service, the storage is somebody else's bucket, and the integration is the engineer's job.
- "Streaming systems and databases are different things." They are converging. Materialize, Flink SQL, ksqlDB, Pinot, RisingWave — each one says, in different words, the database is a folded log, the streaming engine is the folder, and the line between them was always artificial. By 2030 expect this to be conventional wisdom rather than thesis-paper material.
- "After 30 years there will be one winning database." There will not. Workloads are too varied — OLTP, OLAP, search, vector, graph, time-series, KV, document — and the cost ratios that favour each are too different. What you will see is a small number of substrates (object storage, durable logs, columnar files, B-tree files) shared across many engines. Polyglot is the long-run answer, not a transitional state.
- "Stonebraker said relational always wins, so just use Postgres." Stonebraker said relational always comes back — which is not the same. The waves still happen; they just do not abandon relational permanently. Picking a database in 2026 is still a tradeoff over polyglot persistence, and "Postgres for everything" is sometimes the right answer and sometimes very, very wrong.
Going deeper
Stonebraker's "What Goes Around Comes Around" — and the 2024 sequel
The 2005 Stonebraker–Hellerstein paper is the canonical historical survey. In 2024 Andy Pavlo and Stonebraker published a sequel — What Goes Around Comes Around… And Around — covering the post-2005 era, NoSQL, NewSQL, cloud, and the early streaming wave. Read both back-to-back. The 2005 paper teaches you to read pre-2005 history; the 2024 paper extends the exact same analytical lens to the era this encyclopedia just walked you through. Together they are the closest thing the field has to a textbook on its own evolution.
The thesis of both papers is unkind to vendors and useful to engineers: most "revolutionary" database announcements are reinventions of ideas the field already had, dressed in new vocabulary. Stonebraker's catalogue of false revolutions — object-oriented databases, XML databases, the early MapReduce wave — is in the 2005 paper. The 2024 sequel adds the document-database wave, the graph-database wave, and the early "AI-native database" announcements to the same catalogue. None of those waves were worthless; each contributed something. But none of them was the paradigm shift their press releases claimed. The discipline of reading the next announcement skeptically is something you build by reading these two papers and noticing the pattern.
Jim Gray's transaction-processing legacy
Almost every transactional concept in this encyclopedia — ACID, two-phase commit, write-ahead logging, snapshot isolation, recovery — descends from Jim Gray's work at IBM (System R) and Tandem (NonStop) in the 1970s and 80s. Gray's Transaction Processing: Concepts and Techniques (1992, with Andreas Reuter) is the bible; most production database manuals are a footnote to it. If you go on to engineer transactional systems, the Gray-Reuter book is the reading after this one.
The CAP debate, settled
In 2000 Eric Brewer conjectured that a distributed system cannot simultaneously offer consistency, availability, and partition tolerance. Gilbert and Lynch proved a formal version in 2002. Half a generation of database design was justified by hand-waving at "CAP." By the late 2010s the consensus shifted to Daniel Abadi's PACELC framing, which adds the latency-consistency tradeoff that applies even when there is no partition — the tradeoff CAP ignores. The 2026 reading: CAP is a useful aphorism, not a design tool; PACELC and the more recent work on linearizability vs serializability vs strict serializability is what serious systems engineers actually reason with. Build 18 covers this in depth.
The papers worth re-reading every five years
If you build databases for a living, the following papers reward repeated reading: System R (Astrahan et al., 1976) for the relational substrate; ARIES (Mohan et al., 1992) for recovery; Bigtable (Chang et al., 2006) and Dynamo (DeCandia et al., 2007) for scale-out KV; Spanner (Corbett et al., 2012) for globally consistent SQL; The Log (Kreps, 2013) for the inside-out worldview; The Snowflake Elastic Data Warehouse (Dageville et al., 2016) for compute/storage separation; Differential Dataflow (McSherry et al., 2013) for streaming materialized views. Every one of those is referenced in this encyclopedia. Re-read them periodically; you will see different things at different points in your career.
Where research is moving in 2026
The most active research areas, as of this chapter: learned indexes (using ML models to replace B-tree internal nodes), disaggregated memory (databases that span CXL fabrics), deterministic transactions (Calvin and its descendants), zero-knowledge databases (cryptographic proofs of query correctness for outsourced storage), hardware/software co-design with NVMe-CSDs and SmartNICs, and vector-first index structures beyond HNSW. None of these are mainstream yet; some will be in five years. The way to track them is the SIGMOD, VLDB, and CIDR conference proceedings — which is what How to read a database paper was about.
What you can build now that you have walked the road
The point of an encyclopedia is not the encyclopedia. It is what you can do after reading it. Concretely: you can read a new database's architecture white paper and place it on the diagram in this chapter within ten minutes. You can pick a database for a workload using the polyglot decision tree of chapter 182. You can read a Postgres or RocksDB pull request and follow the change. You can sketch a sharded Raft cluster on a whiteboard during an interview. You can write a thirty-line append-only log, an in-memory hash index, an LSM-tree, a B-tree page splitter, a query parser, a Raft leader election, a CDC consumer, a vectorized scan operator — and you can do them in any language. That is what 186 chapters were for.
Where this leads next
There is no chapter 187. From here the road forks into the rest of your career.
- For the engineering track: keep Gray-Reuter, Kleppmann's Designing Data-Intensive Applications, and the Stonebraker-Pavlo surveys on a shelf you can reach. Read SIGMOD/VLDB/CIDR proceedings once a year.
- For the systems track: pick a database, read its source code end-to-end. SQLite is the gentlest. Postgres is the deepest. RocksDB is the cleanest LSM-tree codebase in the open. Materialize is the cleanest streaming-database codebase. Pick one; spend a year inside it.
- For the practitioner track: re-read running a database in production, polyglot persistence, and where this is all going. Those three are the everyday operating manual.
- For the curious reader: the next time a "revolutionary new database" announces itself on a Tuesday, open its architecture page and look for the log. It will be there. It always is.
You started in chapter 1 with a file you only added to. You finish in chapter 186 with the same file. Everything in between was layers. That is not a disappointment — that is the field. Welcome to it.
References
- Michael Stonebraker and Joseph M. Hellerstein, What Goes Around Comes Around (Readings in Database Systems, 4th ed., 2005) — the canonical survey of pre-2005 data-model history. The single most important essay for putting database fashion in perspective. redbook.io.
- Andy Pavlo and Michael Stonebraker, What Goes Around Comes Around… And Around (SIGMOD Record, 2024) — the sequel covering NoSQL, NewSQL, cloud, and streaming. Mandatory follow-on to the 2005 paper.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the essay that re-centred the log as the universal database primitive. engineering.linkedin.com.
- Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques (Morgan Kaufmann, 1992) — the bible of transactional databases. Still the reference work.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — the working engineer's modern compendium; the most-cited reference in this encyclopedia. dataintensive.net.
- James C. Corbett et al., Spanner: Google's Globally-Distributed Database (OSDI, 2012) — the paper that started the NewSQL wave. research.google.
- Benoit Dageville et al., The Snowflake Elastic Data Warehouse (SIGMOD, 2016) — the canonical compute/storage-separation paper.
- The append-only log — chapter 2 of this track — where the road began.