The 30-year arc: where data engineering is going
In 1996 the closest thing to a data engineer at an Indian bank was a DBA who wrote a Pro*C program to dump Oracle tables to tape every Sunday. In 2026 you have just spent 133 chapters building the things that replaced that program, and the things that will replace those. This last chapter is the ridge: a place to turn around, look at the road behind, and squint at the road ahead.
The history of data engineering rhymes. Every wave declares the previous one obsolete; every wave ends with the previous wave's ideas pulled back in. The substrate that has not moved in thirty years is the durable ordered log; the field above it keeps re-arranging itself around hardware ratios, cost ratios, and freshness ratios. Read every announcement on those three axes and you can predict where the next decade goes.
The road you just walked
Look at the table of contents of this curriculum and you can see the entire field in one glance. Build 1 handed you a Python script that reads CSV and writes CSV. Build 2 made the script survive a crash mid-run. Build 3 made it process only what was new. By Build 4 you had written your own DAG scheduler and could compare it to Airflow. By Build 6 you were reading and writing Parquet, partitioning at scale, and explaining why small files are the silent killer of every lakehouse. By Build 7 you had built a single-broker Kafka clone in 300 lines. By Build 9 you knew exactly what "exactly-once" can and cannot mean. By Build 12 you could place an Iceberg table next to a Delta table and explain which workload favours which. By Build 17 you had run a backfill, written a runbook, sized a multi-region disaster-recovery plan, and read a vendor benchmark with a forensic eye.
You did not learn 133 disconnected things. You learned one thing in 133 increments: how a single principle — write changes to a durable ordered log, derive everything else lazily, and treat freshness as a tunable knob — scales from a fifty-line Python script to a planet-sized real-time analytics system. Spark, Flink, Kafka, Snowflake, Trino, Iceberg, dbt, ClickHouse, Pinot, Feast, Materialize, Beam — they are all the same idea wearing different uniforms.
This chapter is not a recap of the uniforms. It is a map of the thirty years that produced them, and a forecast of the thirty years that will produce the next ones. The point is to leave you able to read data-engineering news the way an old practitioner does: not as a stream of product launches, but as the same handful of pressures arguing with each other.
Thirty years in one picture
The thing that is now called data engineering did not exist as a job title in 1995. The work existed — ETL developers, ODS architects, Informatica consultants — but the discipline was a sub-genre of "the DBA" or "the BI guy". It became its own field in roughly 2010, when Hadoop forced enough companies to staff a team that lived between the OLTP database and the BI tool. Here is the arc, drawn at the level of waves rather than products.
Read the diagram top to bottom and you get the plot. Read the bar at the bottom and you get the theme. The plot is fashion. The theme is engineering.
Wave 1 — ETL and the data warehouse (mid 1990s)
The 1990s gave analytics its first settled shape. Bill Inmon and Ralph Kimball staked out the two sides of the warehouse-modelling fight (third-normal-form vs star schema), Teradata and Oracle settled the storage fight, and Informatica turned the loading job into a GUI. The shape of the work was: extract from operational databases overnight, transform on a dedicated ETL server, load into a separate analytical warehouse, point Cognos or Business Objects at it. SBI's first warehouse, ICICI's MIS reporting stack, Wipro's customer-360 — every Indian enterprise data project of the late-90s and early-2000s ran this pattern.
This is the world Build 1 (a script is a pipeline) and Build 5 (lineage and contracts) describe. None of those ideas were new in 1995. They were consolidated in 1995 — and they ran more than half of corporate data infrastructure for the next fifteen years.
Wave 2 — Hadoop and big data (mid 2000s)
Then Google happened. The web crawl was a petabyte. Yahoo's ad logs were a petabyte. By 2008 every large internet company in the United States had data that no Teradata budget could swallow. Doug Cutting wrote Hadoop. Facebook wrote Hive on top of it. Yahoo wrote Pig. Suddenly the thing called "data engineering" — write Java MapReduce jobs, then SQL on Hive, then Oozie DAGs, then Sqoop imports from Postgres into HDFS — was a recognisably different job from "BI developer" or "DBA". Flipkart, Snapdeal, and Myntra all built first-generation Hadoop stacks between 2012 and 2015; the Indian "big data engineer" job title appeared in those years.
The slogan was big data, and what it meant in practice was: drop joins, drop indexes, drop SQL semantics, and buy back horizontal scale. Builds 6 (columnar storage), 7 (the message log), and 8 (windowing) are the engineering that survived from this wave; the MapReduce part mostly did not.

Why MapReduce died and the storage layer survived: MapReduce was an imperative programming model — you wrote map and reduce functions by hand, in Java. The optimizer's job was your job. The moment Hive added a SQL layer, the optimizer's job became the optimizer's job again, and the imperative MapReduce code became a compilation target rather than a thing humans wrote. The storage layer (HDFS, Parquet, ORC) was orthogonal to the execution model and outlived it; the execution model lost to Spark, then to Trino, then to native columnar engines.
The mistake the Hadoop wave is now famous for is that it conflated scale with abandoning SQL and the warehouse. Once horizontal scale was solved, the second half of the wave — the modern data stack — re-attached SQL, ACID-on-tables, and warehouse-shaped queries to the same scaled-out substrate.
Wave 3 — cloud warehouses and the modern data stack (mid 2010s)
In 2014 Snowflake launched. Within five years it had rewritten the rules. Snowflake's bet — separate compute from storage; store data in S3; spin up SQL warehouses on demand — solved the two things that had killed the Hadoop wave for non-Google-scale companies: it removed the operational burden, and it made warehouse capacity elastic. BigQuery (Google), Redshift (Amazon), and later Databricks SQL adopted variants of the same architecture. dbt (2016) gave SQL transformations the kind of dependency-graph-and-version-control discipline that software engineering had had for decades. Fivetran (2012) and Stitch (2014) commoditised the extract step. The "modern data stack" was the Lego kit: Fivetran/Airbyte for ingestion, Snowflake/BigQuery/Redshift for storage and compute, dbt for transformation, Looker/Mode for BI.
For the average Indian SaaS company in 2018–2022, this stack was the default. A four-person data team at a Series B fintech could ship a customer-360 dashboard in eight weeks, where the same project would have taken six months on Hadoop. Razorpay, Cred, Postman, Freshworks, Zepto — all of them ran some variant of this stack. The discipline this curriculum's Build 5 and Build 13 describe is the operational maturity that grew on top of it.
This is the wave that defined the modern data engineer's job description. Before Snowflake, "data engineer" meant "Hadoop person" or "ETL developer" with a foot in BI. After Snowflake it meant "owns the pipeline from source system to dashboard, in dbt or Spark, against an elastic cloud warehouse." That definition stuck.
Wave 4 — the lakehouse and open table formats (late 2010s)
Then came the realisation that Snowflake's elasticity carried a price: vendor lock-in, expensive scan-heavy workloads, and a forced choice between "data lake on S3 with no transactional semantics" and "data warehouse with transactional semantics but proprietary storage." Databricks (Delta Lake, 2019), Netflix (Iceberg, 2018), and Uber (Hudi, 2016) each built the same thing in three different ways: a transactional, schema-evolution-aware, time-travelling, ACID metadata layer that sits on top of Parquet files in S3. By 2022, Iceberg had pulled ahead in the open-format race; by 2025, Snowflake itself supported Iceberg as a first-class storage format and competed with Databricks and Trino for compute over the same data.
This wave is the architectural shift this curriculum's Build 12 describes in detail. The unbundling is the punchline: a single Iceberg table on S3 in 2026 can be queried by Snowflake, Databricks, Trino, DuckDB, Spark, and Flink simultaneously, with consistent reads and atomic appends. That sentence would have been science fiction in 2014 and is unremarkable in 2026.
The cultural consequence is bigger than the technical one. Once storage became open and the warehouse became one engine of many over shared data, the data engineer's question stopped being "which warehouse vendor do we buy?" and became "which engine fits which query class against the same Iceberg table?" That is a categorically different conversation, and it is the one Indian fintech and consumer-tech teams are actively having in 2026.
Wave 5 — streaming-first and real-time analytics (2020s)
The current wave inverts the pipeline. The 2020 stack assumed batch was the default and streaming was an exception; the 2026 stack assumes the durable log is the source of truth and batch is a projection over it. Kafka (2011) was the substrate; Flink (2015) made stateful streaming queryable; ClickHouse (2016 open-source), Pinot (2018), and Druid (2012) made real-time OLAP a first-class workload; Materialize (2019) and RisingWave (2022) made streaming SQL incrementally maintained.
The Indian use cases are concrete. Dream11's leaderboard for an India-Australia match has 100 million users refreshing every five seconds during the last over; Pinot updates the materialized leaderboard from Kafka with sub-second latency. Zerodha's order-book analytics — used by traders to spot momentum during the 9:15 a.m. opening rush — runs on ClickHouse. PhonePe's fraud-detection layer ingests UPI events, scores them in Flink, and flags suspicious transactions before they settle. None of these workloads can wait for a nightly batch. The pipeline is no longer "extract → transform → load → query"; it is "log → continuously materialize → query."
Builds 8, 9, 10, and 14 describe this wave. The deepest theme is what Jay Kreps called turning the database inside out — the database is no longer the source of truth, the log is, and every database in the system is a materialized view over that log. That is the destination this curriculum has been pointing at since chapter one.
Wave 6 — unified compute and AI-shaped data (2025+)
The wave forming as this chapter is written has three threads.
- Embedded engines. DuckDB (2019) put a vectorized columnar engine inside a Python process. Suddenly "the warehouse" did not need to be a server at all for a class of workloads where ten million rows on a laptop replaces ₹40,000-a-month Snowflake spend.
- Vector data and feature stores merging. Feast, Tecton, and Hopsworks built the offline/online feature-store split for ML. Pinecone, Weaviate, pgvector, and Milvus made nearest-neighbour search a first-class index. By 2026 these two product categories are visibly converging — a feature store is a vector store with point-in-time correctness, and a vector store is a feature store without it (the sketch after this list shows the point-in-time half).
- AI-shaped pipelines. Embedding pipelines, prompt-history logs, and RAG retrieval indices are first-class data assets in 2026. They run on the same Kafka, Iceberg, and DuckDB substrate as everything else — but they have different freshness, retention, and cost profiles, and a 2026 data engineer at a Bengaluru AI startup spends a meaningful slice of their week on them.
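Point-in-time correctness is the subtle half of that convergence, so a sketch helps. The shapes here are invented: a training join must read each feature as of the label's event time, never the latest value, or the model trains on information from the future.

# point-in-time correct feature lookup: the discipline a feature store
# adds on top of a plain key-value (or vector) store
feature_log = [   # (merchant, as_of_ts, value), append-only and time-ordered
    ("m_001", 100, {"txn_7d": 4}),
    ("m_001", 200, {"txn_7d": 9}),
    ("m_001", 300, {"txn_7d": 2}),
]

def as_of(merchant: str, ts: int) -> dict | None:
    """Latest feature value at or before ts; no peeking into the future."""
    rows = [v for m, t, v in feature_log if m == merchant and t <= ts]
    return rows[-1] if rows else None

# a label observed at ts=250 must train on the ts=200 feature, not ts=300
print(as_of("m_001", 250))   # {'txn_7d': 9}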
None of these three threads has produced a new substrate. They run on Kafka, Parquet, S3, and SQL. The workload is new; the substrate is the same one Build 1 of this track introduced.
What did not change
Walk back over the six waves and ask: which ideas from 1995 are still load-bearing in 2026?
Why these ideas survived: each of them is a consequence of physics or economics, not a fashion choice. Disks are sequentially fast and randomly slow → append-only logs are unbeatably cheap to write. Networks fail and processes crash → re-runs must be safe → idempotency is mandatory. Producers and consumers evolve at different rates → the only stable contract is a schema → schemas survive. Optimizers change as hardware changes; queries do not change as fast → declarative survives. Analytical queries scan a few columns of many rows → columnar storage is the right physical layout for that access pattern → columnar survives. None of these will be obsoleted by the next wave; they will be re-implemented in it.
The log in particular is the single most stable structure in this entire field. It was in Inmon's warehouse refresh script in 1995. It is in every Kafka topic, every Iceberg manifest log, every CDC stream, every DAG-execution audit log, every dbt run history, every Materialize source. When you watch a new data platform announce itself in 2030 with a new buzzword on the marquee, look behind the marquee for the log. It will be there.
The runnable form of "the log is everything, the rest is a view" is one paragraph of Python. Build 1 had the seed of it; thirty years of data engineering have not made it obsolete.
# the kernel of every data pipeline, still — 2026
import json, time
from pathlib import Path

class LogPipeline:
    """A durable log, plus a derived view, plus replay-on-startup. That's it."""

    def __init__(self, log_path: str):
        self.log_path = Path(log_path)
        self.view: dict[str, dict] = {}
        self.last_offset = 0
        self._replay()

    def _replay(self):
        """Rebuild the materialized view from the log on startup."""
        if not self.log_path.exists():
            return
        with self.log_path.open() as f:
            for offset, line in enumerate(f, start=1):
                event = json.loads(line)
                self._apply(event)
                self.last_offset = offset

    def _apply(self, event: dict):
        op, key, payload = event["op"], event["key"], event.get("payload")
        if op == "UPSERT":
            self.view[key] = payload
        elif op == "DELETE":
            self.view.pop(key, None)

    def append(self, op: str, key: str, payload: dict | None = None):
        event = {"op": op, "key": key, "payload": payload, "ts": time.time()}
        with self.log_path.open("a") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
        self._apply(event)
        self.last_offset += 1

    def query(self, key: str) -> dict | None:
        return self.view.get(key)
# --- usage at a Razorpay-style merchant ledger -----------------------
p = LogPipeline("/tmp/ledger.log")
p.append("UPSERT", "razorpay-test/m_001", {"name": "Asha Tea Stall", "kyc": "verified"})
p.append("UPSERT", "razorpay-test/m_002", {"name": "Kiran Cycles", "kyc": "pending"})
p.append("UPSERT", "razorpay-test/m_001", {"name": "Asha Tea Stall", "kyc": "verified", "limit": 50000})
p.append("DELETE", "razorpay-test/m_002")
print("offset:", p.last_offset)
print("m_001:", p.query("razorpay-test/m_001"))
print("m_002:", p.query("razorpay-test/m_002"))
# Sample run
offset: 4
m_001: {'name': 'Asha Tea Stall', 'kyc': 'verified', 'limit': 50000}
m_002: None
Run it. Kill the process. Restart it. The view rebuilds itself from the log. _replay is what every database, warehouse, and stream processor does on startup — Postgres replays its WAL, Kafka replays its segments, Iceberg replays its manifest log, Materialize replays its source. append is the only durable operation; everything in view is a derived projection. _apply is the fold that Kreps's "log as the unifying abstraction" essay pointed at — every database is a longer version of this fold.
This is Snowflake, Iceberg, RocksDB, Flink, and Materialize compressed into thirty lines. The view gets replaced with B-trees, columnar files, distributed Paxos groups, RocksDB state stores, and Iceberg manifests; the loop is the same. The log persists; the view is rebuildable. Every data system is a longer version of this program.
The themes underneath the thirty-year arc
Step back from individual systems and the picture becomes a small number of forces that keep producing new data platforms. Six of them, in roughly the order they showed up.
1. Hardware moves; data engineering follows
The 1995 warehouse was tuned for spinning disks: random reads cost 10ms, sequential reads cost 0.1ms, and the entire query engine was built to convert one into the other. SSDs (mid 2000s) collapsed that ratio to roughly 10:1 — and so columnar formats, which trade row-locality for sequential column scans, became cheap. NVMe (2015) made the gap smaller still. RAM grew from megabytes to terabytes; in-memory analytics (HANA, MemSQL, vectorized DuckDB) moved from absurd to plausible. Network bandwidth grew from 100 Mb/s to 100 Gb/s in the same era; separating compute from storage over the network — the entire Snowflake bet — became feasible only because the network had become as fast as the local disk bus had been twenty years earlier.
Object-storage cost is the most consequential ratio of the last decade. S3 at ₹1.7/GB-month vs SSD at ₹15/GB-month is a roughly 10× cost gap, and it is what makes the lakehouse architecture economically inevitable.

Why the S3-vs-SSD ratio drives architecture: when storage costs 10× less to keep cold than to keep hot, the natural design separates the cold tier from the hot tier and only pays for hot capacity proportional to actual query traffic. The lakehouse is that design taken to its limit — all data lives on cheap storage, and compute is rented when needed. Two years ago the gap was 8×; in two years it will be 12×. The architecture follows the ratio.
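The arithmetic is worth making concrete. A minimal sketch, assuming the illustrative per-GB-month prices above (they are not vendor quotes) and treating the hot-data fraction as the only variable:

# hot/cold tiering arithmetic: prices are the rough ₹/GB-month figures
# quoted above, purely illustrative
S3_PRICE, SSD_PRICE = 1.7, 15.0

def monthly_cost(total_gb: float, hot_fraction: float) -> tuple[float, float]:
    """Cost of keeping everything hot vs a lakehouse-style split where
    only the hot fraction lives on SSD and the rest sits on S3."""
    all_hot = total_gb * SSD_PRICE
    tiered = total_gb * (hot_fraction * SSD_PRICE + (1 - hot_fraction) * S3_PRICE)
    return all_hot, tiered

for hot in (0.50, 0.10, 0.02):
    all_hot, tiered = monthly_cost(100_000, hot)   # 100 TB of table data
    print(f"hot={hot:.0%}  all-SSD ₹{all_hot:,.0f}/mo  "
          f"tiered ₹{tiered:,.0f}/mo  saving {1 - tiered / all_hot:.0%}")

The lower the hot fraction, the closer the tiered bill sits to pure S3 cost; that slope is the economic pull behind the lakehouse.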
2. Freshness is a knob, not a binary
The single biggest cultural shift in data engineering between 2015 and 2026 is this: "batch vs streaming" is no longer a category. It is a latency budget. A daily dbt run is the 24-hour budget. A 15-minute incremental refresh is the 15-minute budget. A Materialize incremental view is the one-second budget. Pinot real-time ingestion is the 100-millisecond budget. Each budget has a cost; the engineer's job is to match the budget to the business need. A finance reconciliation does not benefit from second-level freshness; a fraud-detection feature does.
The capstone insight is that the same query can run at any of those budgets depending on which engine evaluates it. A SQL aggregation in dbt at 24 hours is the same SQL aggregation in Materialize at 1 second; the optimizer just rebinds it to incremental view maintenance instead of full table scan. This is what Build 10 was pointing at.
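A minimal sketch of that rebinding, with invented data: the same GROUP BY SUM computed once as a full recompute (the batch budget) and once as an incrementally maintained view (the streaming budget). Both produce the same table; only the latency and cost profile differ.

from collections import defaultdict

events = [
    {"merchant": "m_001", "amount": 120},
    {"merchant": "m_002", "amount": 80},
    {"merchant": "m_001", "amount": 40},
]

# batch budget: recompute SELECT merchant, SUM(amount) ... GROUP BY merchant
# from scratch on every run, the way a daily dbt model does
def full_recompute(log):
    totals = defaultdict(int)
    for e in log:
        totals[e["merchant"]] += e["amount"]
    return dict(totals)

# streaming budget: maintain the same result one event at a time,
# the way an incrementally maintained view does
class IncrementalSum:
    def __init__(self):
        self.totals = defaultdict(int)

    def apply(self, event):
        self.totals[event["merchant"]] += event["amount"]

view = IncrementalSum()
for e in events:
    view.apply(e)

assert full_recompute(events) == dict(view.totals)
print(dict(view.totals))   # {'m_001': 160, 'm_002': 80}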
3. Idempotency is mandatory, not optional
Every wave that ignored idempotency lost. Hadoop's JobTracker fell over more than once a week in production at most companies; the only reason any of those pipelines worked was that re-running them produced the same result. Cron jobs that overwrote yesterday's data when re-run blew up enterprise BI for fifteen years. The modern data stack made idempotency the default — dbt's incremental materialization, Iceberg's MERGE INTO, Spark's checkpointing, Flink's exactly-once sinks. By 2026, a pipeline that is not safely re-runnable is a bug, not a design choice.
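The difference fits in a few lines. A sketch with an invented daily-partition layout: a blind append double-counts on every retry, while a partition overwrite (the shape behind dbt incremental models and Iceberg's MERGE INTO) converges to the same state no matter how many times it runs.

# non-idempotent load: re-running a failed day double-counts it
table: list[dict] = []
def load_append(day_rows: list[dict]):
    table.extend(day_rows)

# idempotent load: each run replaces the whole partition, so retries converge
partitions: dict[str, list[dict]] = {}
def load_overwrite(day: str, day_rows: list[dict]):
    partitions[day] = day_rows          # replace, never append

rows = [{"day": "2026-01-05", "amount": 100}]
for _ in range(3):                      # simulate two retries after failures
    load_append(rows)
    load_overwrite("2026-01-05", rows)

print(len(table))                                  # 3: wrong after retries
print(sum(len(v) for v in partitions.values()))    # 1: same result every run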
4. The log is increasingly visible
In 1995 the change log was a private implementation detail of the source database. In 2026 the log is a product — Kafka, Pulsar, Kinesis, Redpanda. CDC tools (Debezium, Maxwell, Materialize Source) expose the log of databases that did not originally consider their log a product. The trend is toward making the log explicit, durable, retained for weeks rather than seconds, and consumable by many downstream systems. This is what Kreps's "turning the database inside out" was a forecast of, and the forecast was correct.
5. Polyglot is the production reality
Nobody runs one data platform in 2026. The smallest serious Indian data team — say a Series A consumer-tech startup in Bengaluru — runs at minimum: Postgres for source-of-truth state, Kafka for events, S3 with Iceberg or Delta for raw and curated layers, dbt or Spark for transformations, Snowflake or Trino or DuckDB for ad-hoc analytics, ClickHouse or Pinot for real-time dashboards, and a feature store on top for ML. Razorpay, Flipkart, Swiggy, and PhonePe each run dozens of these. Every individual system is specialised; the system of systems is the integration. Builds 16 (governance) and 17 (running production) are the chapters that taught you to think this way.
6. Ownership keeps shifting outward
In 1998, the DBA owned the warehouse. By 2012, the data engineering team owned the pipeline and a separate analytics team owned the dashboards. By 2020, the data mesh idea pushed ownership out further: each business domain owns its own data products, and a central data-platform team provides the substrate (catalog, lineage, contracts, observability) on which those domains build. The 2026 reality is somewhere between the centralised and meshed extremes; most Indian companies of any scale have a central platform team and federated domain ownership. Build 16 covered this.
How to read the next thirty years
Forecasts about specific systems are mostly wrong. Forecasts about pressures are mostly right. Here is a small set of pressures that will shape the next decade — not predictions of which products will win, but of which fights will keep happening.
AI changes the workload, not the substrate
Vector search is now a first-class index alongside B-tree, hash, inverted, and LSM. RAG pipelines are now a first-class workload alongside OLAP and OLTP. Embedding pipelines, feature stores, and prompt-history logs are first-class data. But — and this is the important part — none of them have produced a new substrate. They run on Kafka, Parquet, S3, and SQL. Pinecone is, underneath, an LSM with HNSW indexes. pgvector is a Postgres extension. The "AI database" is a regular database with one more index family. Expect new index types, not new substrates.
The interesting consequence is operational. A 2026 fintech in Pune adding an LLM-powered support assistant does not buy a new database. It adds a vector(1536) column to an Iceberg table, builds an HNSW index on it, runs the embedding pipeline as a Kafka consumer, and serves nearest-neighbour queries from the same warehouse that already serves transactional analytics. The data engineer's job grows by one workload class; the architecture barely changes.
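Stripped of the index, the new workload class is small. A toy sketch with invented four-dimensional embeddings standing in for vector(1536) rows; brute-force cosine similarity here, where an HNSW index makes the same lookup sub-linear rather than different in kind:

import math

# toy embeddings standing in for a vector(1536) column; values invented
docs = {
    "refund policy":   [0.9, 0.1, 0.0, 0.1],
    "KYC steps":       [0.1, 0.8, 0.2, 0.0],
    "API error codes": [0.0, 0.2, 0.9, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query: list[float], k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

print(nearest([0.85, 0.15, 0.05, 0.10]))   # ['refund policy', 'KYC steps']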
Cheap durable storage will keep eating expensive specialised storage
S3 already replaced HDFS for analytics. Iceberg and Delta are turning S3 into the warehouse storage layer. The next move is OLTP-adjacent: shared-storage primaries (Aurora, Neon), then full S3-backed transactional engines, then S3-backed feature stores. Every layer of the stack that is currently "data attached to compute" will, over the next decade, fight a battle with "data on object storage, compute is ephemeral." Some layers will resist for latency reasons; the pressure will be one-directional.
Embedded engines will keep eating servers from below
DuckDB took analytics out of the warehouse for a class of workloads where ten million rows on a laptop replaced ₹40,000 a month of cloud spend. SQLite was already inside every phone. The pressure is real: a lot of work that gets sent to a warehouse server is small enough that an in-process engine can serve it without the network. Expect more — embedded vector stores, embedded streaming engines, embedded graph processors. The warehouse does not disappear; its addressable workload shrinks at the bottom.
Streaming materialized views absorb more of the pipeline
If Kreps was right, then over time the durable shared log moves from being a tool for moving data between systems to being the canonical place where data lives, with traditional warehouses reduced to materialized views over it. The progress in 2026 is partial — Materialize is real, Flink SQL is real, but most production analytics still sit in Snowflake or BigQuery batch tables. The pressure is the same one that took analytics to S3: the log is durable and shared; everything else is rebuildable. Watch this slowly invert through 2030.
Query interfaces will diversify; SQL stays underneath
SQL stays. dbt-shaped transformation DAGs stay. But natural-language interfaces (LLMs that compile questions to SQL or to direct Iceberg query plans) are now a serious third channel. They will not replace SQL — SQL is what they compile to — but they will change who writes queries. The interesting engineering problem is not "let the LLM write SQL"; it is "let the warehouse expose a schema, lineage graph, and cost model that the LLM can reason about reliably without hallucinating column names." The semantic layer (Build 13) is the moat that gets monetised here.
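A crude sketch of the shape of that problem, with an invented catalog and a deliberately hallucinated column. Real semantic layers lean on a proper SQL parser and the lineage graph rather than a regex:

import re

# what the semantic layer exposes: table -> known columns (invented example)
catalog = {"payments": {"merchant_id", "amount", "settled_at"}}

SQL_KEYWORDS = {"select", "sum", "from", "where", "group", "by"}

def hallucinated_columns(sql: str, table: str) -> set[str]:
    """Flag identifiers the LLM used that the catalog does not know about."""
    idents = set(re.findall(r"[a-z_]+", sql.lower()))
    return idents - catalog[table] - SQL_KEYWORDS - {table}

llm_sql = "SELECT merchant_id, SUM(amount_inr) FROM payments GROUP BY merchant_id"
print(hallucinated_columns(llm_sql, "payments"))   # {'amount_inr'}: reject, re-prompt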
Cost attribution becomes the central operational discipline
Build 16 and Build 17 (cost on the cloud) have already flagged this. As the warehouse becomes a per-query billing surface, the question "who pays for that query?" becomes the question that drives platform design. Multi-tenant query attribution, per-team budgets, predicate-level chargeback, and "scan-budget" governance are the engineering problems of the next five years. The data engineer's job extends to FinOps in a way that did not exist in 2018.
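The kernel of that discipline is an aggregation over the warehouse's own query log. A sketch with an assumed price per TB scanned and an invented log shape:

from collections import defaultdict

PRICE_PER_TB = 400.0   # assumed ₹ per TB scanned, illustrative only

# the query log reduced to what chargeback needs (shape invented)
query_log = [
    {"team": "payments", "bytes_scanned": 2_000_000_000_000},
    {"team": "growth",   "bytes_scanned":   500_000_000_000},
    {"team": "payments", "bytes_scanned":   750_000_000_000},
]

def chargeback(log: list[dict]) -> dict[str, float]:
    bill: dict[str, float] = defaultdict(float)
    for q in log:
        bill[q["team"]] += q["bytes_scanned"] / 1e12 * PRICE_PER_TB
    return dict(bill)

print(chargeback(query_log))   # {'payments': 1100.0, 'growth': 200.0}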
Data contracts harden; the source of truth becomes legal
Schema-on-write has been gaining ground over schema-on-read since the lakehouse wave; in 2026, data contracts — formally versioned schemas with consumer-side enforcement, owned by the producer team — are the new discipline. By 2030, expect contracts to be near-universal at companies above 200 engineers, with breakage producing CI failures rather than 3 a.m. pages.
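The CI-failure version of a contract is a compatibility check between the producer's proposed schema and the published one. A minimal sketch; real contract tooling also covers defaults, nullability, and semantic constraints.

# backward-compatibility check: a new producer schema may add fields,
# but must not drop or retype fields that consumers depend on
published = {"merchant_id": "string", "amount": "int", "kyc": "string"}
proposed  = {"merchant_id": "string", "amount": "float", "limit": "int"}

def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    problems = []
    for field, ftype in old.items():
        if field not in new:
            problems.append(f"dropped field: {field}")
        elif new[field] != ftype:
            problems.append(f"retyped field: {field} ({ftype} -> {new[field]})")
    return problems

for issue in breaking_changes(published, proposed):
    print("BREAKING:", issue)   # in CI, any output here fails the build
# BREAKING: retyped field: amount (int -> float)
# BREAKING: dropped field: kyc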
Common confusions
- "Hadoop replaced the warehouse." It did not. Hadoop forced the warehouse to scale horizontally and to absorb non-relational data, but the queryable, ACID, SQL-shaped warehouse came back stronger after Snowflake. The 2026 lakehouse is both: it has SQL, ACID, schema evolution, and joins (warehouse) and runs across hundreds of S3-backed compute nodes (Hadoop's original promise). The wave was a course-correction, not a replacement.
- "Streaming will replace batch." It will not — streaming has a higher cost floor and a smaller tolerance for backfill. Most reconciliation, audit, and regulatory workloads will run as batch in 2030 because daily freshness is sufficient and batch is cheaper. What changes is that the same SQL transformation can run as either, depending on the latency budget; the engineer's job is to match the budget, not to pick a side.
- "AI databases are a new category." As of 2026, they are not. Vector search is a new index, not a new substrate. Pinecone, Weaviate, and pgvector all sit on top of LSM-trees or B-trees. The "AI database" label means a database that ships with HNSW, an embedding pipeline, and good RAG defaults. That is real and useful — but it is a workload accommodation, not a paradigm shift.
- "Data mesh replaces the central data team." It does not, in any large Indian company observed so far. Mesh shifts ownership of data products to domains; it does not eliminate the central platform team that provides the substrate (catalog, lineage, contracts, observability). Treating mesh as "the central team disappears" is the failure mode every consulting deck warned against.
- "Iceberg vs Delta vs Hudi is settled." It is partially settled — Iceberg has the broadest engine support in 2026 — but the table-format war is still active where features differ (merge-on-read latency, copy-on-write cost, schema-evolution semantics). Treat the choice as workload-dependent rather than vendor-dependent.
- "After 30 years there will be one winning data platform." There will not. Workloads are too varied — OLTP-CDC sources, OLAP scans, real-time dashboards, vector retrieval, feature serving, ML training, regulatory archives — and the cost-and-freshness ratios that favour each are too different. What you will see is a small number of substrates (object storage, durable logs, columnar files, vector indexes) shared across many engines. Polyglot is the long-run answer, not a transitional state.
Going deeper
Stonebraker's "What Goes Around Comes Around" — and the data-engineering rhyme
Mike Stonebraker and Joseph Hellerstein's 2005 paper What Goes Around Comes Around is canonical for database history; the 2024 sequel by Pavlo and Stonebraker extends the lens through NoSQL and the cloud era. Both are required reading for the data engineer. The thesis is unkind to vendors and useful to engineers: most "revolutionary" announcements are reinventions of ideas the field already had, dressed in new vocabulary. The Hadoop wave repeated mistakes the warehouse wave had already corrected. The first generation of "AI-native" databases is repeating mistakes the document-database wave already corrected. Reading these two papers carefully is the discipline that lets you see the next reinvention coming.
Jay Kreps's "The Log" — the most-quoted essay in the field
Jay Kreps's 2013 LinkedIn essay The Log: What every software engineer should know about real-time data's unifying abstraction is the closest thing to a manifesto this field has. Read it twice — once for the architectural argument, once for the historical framing. The architectural argument is what built Kafka and what underwrites Materialize. The historical framing — that databases, search indexes, caches, and analytics systems are all materialized views of a shared log — is the destination this curriculum has been pointing at since chapter one. Re-read it once a year; you see different things at different points in your career.
Maxime Beauchemin's "The Rise of the Data Engineer"
Maxime Beauchemin (creator of Airflow and Superset) wrote The Rise of the Data Engineer in 2017 and The Downfall of the Data Engineer in 2018. Both are short, both are correct, and both define the cultural shape of the field in a way no academic paper does. The "rise" essay introduced the role; the "downfall" essay diagnosed the burnout pattern that came with it (on-call for the entire warehouse, accountability without authority over upstream systems, "the data is wrong" pages at 2 a.m.). The cure is most of Build 17 — contracts, observability, blast-radius isolation — but Beauchemin named the disease before the cure existed.
Where research is moving in 2026
The most active research areas, as of this chapter: incremental view maintenance at scale (differential dataflow, RisingWave's design, Materialize's recent work on operator graphs); learned indexes for analytics (replacing zone maps with ML models for data skipping); zero-ETL architectures (Aurora-Redshift integration, Snowflake-Iceberg interop, BigQuery-BigLake) that try to remove the explicit pipeline; vector-native query planning (treating embeddings as a first-class type the optimizer reasons about); cost-based scheduling for streaming (giving Flink and Spark Streaming the same plan-aware scheduling that batch already has); and AI-aided pipeline authoring (LLMs that read schemas and lineage to generate transformations). None of these is fully mainstream yet; most will be by 2030. Track them through the SIGMOD, VLDB, and CIDR conferences.
What you can build now that you have walked the road
The point of a curriculum is not the curriculum. It is what you can do after reading it. Concretely: you can read a new data platform's architecture white paper and place it on the diagram in this chapter within ten minutes. You can pick the right engine for a query class against an Iceberg table using the polyglot decision tree of Build 12. You can write a fifty-line pipeline, harden it against crashes, schedule it on a DAG, attach lineage, attach a contract, deploy it on a lakehouse, watch it on a dashboard, page yourself when it breaks, and run a backfill when it does. You can read a Spark, Flink, or Materialize pull request and follow the change. You can sketch a Razorpay-shaped streaming reconciliation pipeline on a whiteboard during an interview. That is what 133 chapters were for.
Where this leads next
There is no chapter 135 in this curriculum. From here the road forks into the rest of your career.
- For the engineering track: keep Kleppmann's Designing Data-Intensive Applications and the Stonebraker/Pavlo surveys on a shelf you can reach. Read SIGMOD/VLDB/CIDR proceedings once a year. Build a small data platform from scratch — pipeline, lakehouse, streaming view, dashboard — once a year, in whatever the new tooling is.
- For the systems track: pick one of the open-source engines and read it end-to-end. DuckDB is the gentlest; its source code is unusually readable. Materialize is the cleanest streaming-database codebase. Flink is the deepest stateful-streaming codebase. Spark is the largest. Pick one; spend a year inside it.
- For the practitioner track: re-read running data engineering in production, cost on the cloud, migrations, and disaster recovery. Those four are the everyday operating manual.
- For the curious reader: the next time a "revolutionary new data platform" announces itself on a Tuesday, open its architecture page and look for the log. It will be there. It always is.
You started in chapter 1 with a fifty-line CSV-to-CSV script. You finish in chapter 134 with the same script — wrapped in a DAG, serialised over a Kafka topic, materialized into an Iceberg table, exposed through a semantic layer, observed by lineage, paid for by a chargeback model, and retrieved by an LLM. Everything in between was layers. That is not a disappointment — that is the field. Welcome to it.
References
- Michael Stonebraker and Joseph M. Hellerstein, What Goes Around Comes Around (Readings in Database Systems, 4th ed., 2005) — the canonical historical survey. Most important essay for putting data fashion in perspective. redbook.io.
- Andy Pavlo and Michael Stonebraker, What Goes Around Comes Around… And Around (SIGMOD Record, 2024) — the sequel covering NoSQL, NewSQL, the cloud warehouse, and the streaming wave.
- Jay Kreps, The Log: What every software engineer should know about real-time data's unifying abstraction (LinkedIn Engineering, 2013) — the essay that re-centred the log as the universal substrate. engineering.linkedin.com.
- Maxime Beauchemin, The Rise of the Data Engineer (Medium, 2017) and The Downfall of the Data Engineer (Medium, 2018) — the cultural definition and the cultural diagnosis of the role.
- Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017) — the working engineer's modern compendium; the most-cited reference in this curriculum. dataintensive.net.
- Tyler Akidau, Slava Chernyak, and Reuven Lax, Streaming Systems (O'Reilly, 2018) — the canonical book on event-time, watermarks, and stateful streaming.
- Bill Inmon and Ralph Kimball, the foundational warehouse-modelling literature (multiple titles, 1990s) — the third-normal-form vs star-schema debate that still shapes every dbt project.
- Your first pipeline — chapter 1 of this track — where the road began.