In short

For two decades, "doing analytics" meant standing up a server. You had Postgres for transactional work, and when the analytical queries got slow you added a Redshift cluster, a Spark job, a Snowflake account, an EMR pipeline. The mental model was: analytics is heavy, therefore analytics needs a cluster. This was a true statement in 2010, when a single machine had 8 GB of RAM and four cores, and the data already weighed terabytes.

It is not a true statement in 2026. A laptop now ships with 32 to 128 GB of RAM and 8 to 16 fast cores; a desktop workstation has 256 GB and 32 cores for the price of three months of a small Snowflake bill; a single NVMe SSD reads sequential bytes at 5 to 7 GB/s, which is faster than most clusters can shuffle data between nodes. The data, meanwhile, hasn't grown nearly as fast as the laptops. The median analytical query in industry — the one a product manager runs from a notebook on a Tuesday — touches a Parquet file of a few hundred million rows, well under a hundred gigabytes. That workload doesn't need a cluster. It needs a fast columnar engine that runs in the same process as the analyst's Python.

DuckDB, born in 2018-19 at the Centrum Wiskunde & Informatica (CWI) in Amsterdam — the same Dutch lab that birthed MonetDB and X100 vectorised execution — is the engine that finally said this out loud. It is a single shared library, ~10 MB, that you pip install duckdb and import into Python (or load as a .so/.dll/.dylib from any other language). It runs full Postgres-compatible SQL using the vectorised executor of the previous chapter. It reads Parquet, CSV and Arrow files directly off disk, with no ingestion step — SELECT * FROM 'orders.parquet' WHERE region='south' just works. It is, deliberately, "the SQLite of analytics": zero servers, zero deployment, zero coordination, just a library.
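To make "just a library" concrete, here is the minimal end-to-end sketch. Note that orders.parquet and its region column are the running example from this paragraph, not a file shipped with DuckDB.

# pip install duckdb        # the whole engine, roughly 10 MB, nothing to start
import duckdb

# Query the Parquet file in place; the result comes back as a pandas DataFrame.
south = duckdb.query("""
    SELECT *
    FROM 'orders.parquet'
    WHERE region = 'south'
""").to_df()

print(south.head())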

The result has been a quiet revolution. Pandas — for years the default tabular tool in Python — turns out to be 50× slower than DuckDB on group-bys, and OOMs at 30 GB on a 32 GB laptop. Spark, run locally for workloads it is overkill for, spends 30 seconds on JVM startup before the query even begins; DuckDB does the same work in 3 seconds with no setup. An entire generation of "embedded analytics" tools — chDB (ClickHouse-as-a-library), Polars (Rust columnar dataframes), DuckDB-WASM (DuckDB compiled to WebAssembly, running in your browser) — has followed. This chapter walks through the design choices that made DuckDB the centre of that wave.

You spent the previous chapter building a vectorised executor that processes 1024 rows at a time, hammering AVX-512 SIMD lanes and amortising dispatch overhead to picoseconds per row. The chapter before that established columnar storage as the layout that makes such an executor effective in the first place. Together those two ideas — columns on disk, vectors in memory — are the engine of every modern analytical database, from BigQuery to Snowflake to Databricks Photon.

But until 2019 those engines all had one thing in common: they were servers. You stood them up. You configured them. You opened ports. You created users. You loaded data into them through some ingestion pipeline. The minimum useful unit of "running an analytical query" was: provision a cluster, copy the data in, write the query, copy the result out. That is fine when the query is going to be re-run a thousand times by a thousand dashboards. It is comically wrong when the query is going to be run once, by one analyst, against a Parquet file that already exists on their laptop.

DuckDB is the answer to "what does an analytical engine look like if you remove the server entirely?" — and it turns out the answer reshapes a startling fraction of how data work happens.

The thesis: most analytical workloads are small

Pull up the mental image of "big data" you absorbed from the 2012-2018 era — Hadoop, MapReduce, "we process petabytes". Then go look at the actual data sizes companies work with. The Indian e-commerce startup with ten million orders a year has a 5 GB orders table. The fintech with two crore transactions has 15 GB. The mid-sized SaaS with five years of event data has maybe 200 GB. The median analytical query at almost every company that isn't Google or Meta touches less than 100 GB of data.

A laptop today has 32 GB of RAM or more; a desktop workstation has 128 or 256 GB. A single NVMe SSD reads at 5 GB/s sequentially. Most analytical work fits on one machine. The cluster was a workaround for the constraint that one machine was too small; that constraint mostly went away while you weren't looking.

What did not go away is the ergonomic appeal of "just run a SQL query on the file in front of me". For that, you need three things: a SQL engine that's fast enough to be interesting on hundreds of millions of rows; a way to point it at the data that's already on your disk without shuffling it through an ETL; and zero deployment friction. Why these three: speed is what makes you reach for SQL instead of a for loop, file-native I/O is what removes the ETL step, and zero deployment is what makes the whole thing actually used instead of avoided. DuckDB nailed all three.

[Diagram: DuckDB positioning. A library you `import`, not a server you connect to. Server-based (Postgres, Snowflake): your Python script talks over TCP to a database server on port 5432 with managed storage; deploy, configure, auth, wait. In-process (DuckDB, SQLite): a ~10 MB .so loaded into your Python process; duckdb.query(sql) is an in-memory call against orders.parquet on disk; no server, no port, no wait. Cluster-based (Spark, Snowflake, BigQuery): a driver plus worker nodes that shuffle and coordinate, billed per hour; the right answer when data > 1 TB.]

The diagram makes the category distinction concrete. Postgres and Snowflake are servers: they run as separate processes (often on separate machines), and your Python script talks to them over TCP. Spark and BigQuery are clusters: even when "small", they coordinate work across multiple nodes that shuffle data over the network. DuckDB is a library: it loads as a .so file into your Python process, and duckdb.query(sql) is an in-memory function call into the same address space your DataFrame already lives in. Why this matters: every cross-process boundary costs you serialisation, network round-trips and deployment overhead. A library has none of those. The query and the data live in the same RAM.
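A minimal sketch of what "same address space" buys you, assuming only that pandas and duckdb are installed: the DataFrame below is created in-process and queried by name, with no copy out to another system.

import duckdb
import pandas as pd

# A DataFrame that already lives in this process's memory.
df = pd.DataFrame({
    "region": ["south", "north", "south", "east"],
    "price":  [120.0, 80.0, 200.0, 50.0],
})

# DuckDB resolves 'df' to the local variable (a replacement scan) and reads its
# columns via Arrow, so there is no serialisation and no network hop.
out = duckdb.query("""
    SELECT region, SUM(price) AS total
    FROM df
    GROUP BY region
    ORDER BY total DESC
""").to_df()

print(out)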

What DuckDB inherited (and what it added)

DuckDB did not invent a new query engine. It inherited the entire vectorised-columnar tradition from its parent lab. The CWI database group in Amsterdam had been the central node of analytical-database research for three decades: MonetDB through the 1990s and 2000s, then the X100 vectorised executor described in the MonetDB/X100 paper at CIDR 2005. Hannes Mühleisen and Mark Raasveldt, the two CWI researchers who would go on to start DuckDB, were steeped in this work.

What DuckDB added on top was a deliberate set of product choices that the academic predecessors had not made.

[Diagram: DuckDB architecture. Three layers, all in one ~10 MB binary. Layer 1: Postgres-compatible SQL parser, planner and optimiser (SELECT, JOIN, window functions, CTEs, EXCEPT, ARRAY[], LATERAL; cost-based optimisation, predicate and projection pushdown, join reordering). Layer 2: vectorised execution engine (1024 rows per batch, SIMD): scan, filter mask, hash-join probe, per-batch aggregation. Layer 3: storage and I/O that reads data in place with no ingestion step: Parquet (column pruning and pushdown), CSV (auto schema detection), Arrow (zero-copy), pandas/Polars DataFrames (via Arrow), and the native .duckdb columnar file.]

Layer 1 — Postgres-compatible SQL — was the choice that made adoption frictionless. There are millions of engineers and analysts who know Postgres syntax. Inventing a new dialect would have meant a five-year ramp; reusing Postgres meant the first SQL query worked on day one. Window functions, CTEs, lateral joins, array types — all the features that make modern SQL pleasant — were available immediately, because the parser was modelled on Postgres's. Why this isn't trivial: SQLite, the obvious comparison, has its own dialect that lacked window functions until 2018 and uses different type rules. DuckDB's deliberate Postgres-compatibility is part of why analysts could migrate dashboards in an afternoon.
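As a small illustration of the dialect, the query below uses a CTE and a window function exactly as you would write them for Postgres; orders.parquet and its order_date, qty and price_inr columns are assumptions for the sake of the example.

import duckdb

# 7-day moving average of daily revenue, in plain Postgres-style SQL.
trend = duckdb.query("""
    WITH daily AS (
        SELECT order_date, SUM(qty * price_inr) AS revenue
        FROM 'orders.parquet'
        GROUP BY order_date
    )
    SELECT order_date,
           revenue,
           AVG(revenue) OVER (
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS revenue_7d_avg
    FROM daily
    ORDER BY order_date
""").to_df()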

Layer 2 — vectorised execution — is the inheritance from MonetDB/X100 that the previous chapter described. SCAN produces a vector of 1024 column values, FILTER produces a selection mask, JOIN probes a hash table 1024 keys at a time, AGG accumulates 1024 sums. The C++ inner loops autovectorise into AVX-512 on modern x86 and NEON on Apple Silicon. Why this matters here: it is what makes DuckDB faster than pandas on the same laptop. Pandas is a Python interpreter calling NumPy per column; DuckDB is a query planner that fuses operations and runs the whole pipeline in one tight C++ loop with SIMD.
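If you want to feel the difference rather than take it on faith, a rough single-machine comparison looks like the sketch below. The 20-million-row table and the column names are illustrative, and the actual ratio depends on the query shape, the data and the hardware.

import time
import duckdb
import numpy as np
import pandas as pd

# Synthetic table, large enough that per-operation overhead shows up.
n = 20_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "val": np.random.rand(n),
})

t0 = time.time()
pandas_out = df.groupby("key", as_index=False)["val"].sum()   # one NumPy pass per step
t1 = time.time()

duck_out = duckdb.query(
    "SELECT key, SUM(val) AS val FROM df GROUP BY key"        # planned, fused, vectorised
).to_df()
t2 = time.time()

print(f"pandas: {t1 - t0:.2f}s   duckdb: {t2 - t1:.2f}s")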

Layer 3 — read files in place — is the choice that killed ETL for a huge class of work. Other analytical databases require you to ingest data into their internal format first: COPY orders FROM 'orders.parquet' INTO TABLE orders and then wait for the load. DuckDB lets you write SELECT region, AVG(price) FROM 'orders.parquet' GROUP BY region and runs the query directly against the Parquet file on disk, applying predicate pushdown and column pruning so it only reads the bytes it needs. The same trick works for CSV (with auto schema detection), Arrow (zero-copy because Arrow is already columnar), pandas DataFrames (via Arrow), and Polars DataFrames. Why this matters: the Parquet file is already there on your laptop. Loading it into Postgres just to query it would take longer than the query itself.
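A sketch of both sides of that choice, with hypothetical file names (events.csv, analytics.duckdb): query the files where they are, and only create a persistent native table if you actually want one.

import duckdb

# Query a CSV in place; delimiter, header and column types are sniffed automatically.
counts = duckdb.query("SELECT COUNT(*) FROM 'events.csv'").to_df()

# If a persistent native store is useful, one statement converts the Parquet file.
con = duckdb.connect("analytics.duckdb")          # creates or opens a local database file
con.execute("""
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT * FROM 'orders.parquet'
""")
print(con.execute("SELECT COUNT(*) FROM orders").fetchall())
con.close()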

The four design choices that made it a hit

If you compress what made DuckDB break out of academia and into industry, it comes down to four product decisions, each of which removed one specific piece of friction.

Single binary, zero dependencies. pip install duckdb and you have the entire engine. No JVM, no driver, no daemon, no port. The package is around 10 MB. This is the SQLite playbook applied to analytics: be so trivially deployable that "should we use it?" stops being a question. Why this matters at scale: every dependency you add is a thing that can break in CI, a security review someone has to do, a version conflict in production. Zero deps means DuckDB shows up in places — Lambda functions, mobile apps, browsers via WASM — that a server database simply cannot reach.

Read Parquet/CSV/Arrow directly. The Parquet file your data pipeline already produced, the CSV an analyst dumped from a dashboard, the Arrow buffer your Polars DataFrame is sitting in — all of these are queryable as if they were native tables. There is no proprietary on-disk format you have to convert to. This collapses an entire category of work — "ingest the data, then query it" — into one step.

Postgres-compatible SQL dialect. Window functions, CTEs, lateral joins, EXCEPT, INTERSECT, regex, JSON, array types. The full modern SQL surface that analysts expect. If you already write Postgres queries, your DuckDB queries are the same queries.

Vectorised execution out of the box. No tuning, no SET enable_columnar = true, no plan hints. The default execution path is the fast one. A first-time user runs their first query and it is already 50× faster than what they are used to from pandas — no skill or configuration required.

Why it took off when it did

Three pre-existing pains converged in 2019-2021, and DuckDB happened to fit the shape of all three.

Pandas was choking. The default Python tabular tool was — and still is — wonderful for exploration but quietly catastrophic above 5 GB. It loads everything into memory, copies aggressively (df = df[df.region == 'south'] allocates a new DataFrame), and runs each operation as a separate NumPy pass with no fusion. By 30 GB on a 32 GB laptop, you are out of memory. DuckDB does the same group-by roughly 50× faster and handles data larger than RAM because it streams batches and spills to disk, as sketched below.
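A hedged sketch of what "streams" means in practice: cap the memory DuckDB may use and give it a spill directory, and the same group-by keeps running on data that would sink pandas. The limits, paths and column names here are placeholders.

import duckdb

con = duckdb.connect()                                    # data stays on disk; only batches enter RAM
con.execute("SET memory_limit = '8GB'")                   # cap the working set
con.execute("SET temp_directory = '/tmp/duckdb_spill'")   # spill here if the cap is hit

# The Parquet file is read batch by batch; oversized intermediate state
# spills to temp_directory instead of killing the process.
totals = con.execute("""
    SELECT region, SUM(price) AS total
    FROM 'orders.parquet'
    GROUP BY region
""").fetchdf()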

Spark was overkill. The default "big data" answer for Python users was PySpark. But Spark is a JVM cluster framework. Local mode exists, but it has 30 seconds of JVM startup, a 200-line spark-defaults.conf, and an entire vocabulary (executors, partitions, shuffles, serialisers) that an analyst running one notebook query has to learn for nothing. Why this gap was real: a huge fraction of "Spark jobs" in the wild ran on a single machine and processed less than 100 GB. They paid the full cluster tax for none of the cluster benefit.

Postgres wasn't optimised for analytics. People did try to use Postgres as their analytical engine — it has SQL, it's a database, it's already there. But Postgres is a row store with the Volcano executor. A GROUP BY over 100M rows that DuckDB does in 3 seconds takes Postgres 5 minutes, and that is before you account for the indexes Postgres needs you to build first.

DuckDB sat exactly in the gap between these three: SQL-native like Postgres but columnar, fast like Spark but with no setup, embedded like pandas but with a real query optimiser.

The workflows it unlocked

Once an analyst has DuckDB on their laptop, a class of work becomes possible that was previously not. The following diagram shows the dominant pattern.

[Diagram: the unlocked workflow. 100 GB of Parquet, one laptop, one notebook. An analyst at 11pm runs duckdb inside a Jupyter notebook (same Python process) against 100 GB of Parquet on local NVMe, or on S3 via httpfs: duckdb.query("SELECT state, SUM(revenue) FROM 'orders.parquet' GROUP BY state ORDER BY 2 DESC LIMIT 10") returns a DataFrame in 8 seconds; no cluster, no ETL, no data movement. For comparison: pandas OOMs at 32 GB of RAM; local Spark needs ~50 lines of setup plus 30 s of startup; DuckDB needs 3 lines and about 8 s total.]

The flow is: an analyst in Jupyter writes one line of Python that wraps a SQL string, points it at a Parquet file (locally or via the httpfs extension at an S3 URL), and gets back a result as a pandas or Polars DataFrame. The query runs in the same Python process, against the column store on disk, and lands the result back in the same memory space — no network, no serialisation between systems. Why this is qualitatively different from the cluster workflow: the analyst owns the entire stack. There is no DBA to ask, no IAM role to request, no cluster to spin up, no egress to wait for. The iteration loop is "edit query, hit shift-enter, see result" — not "edit query, wait for cluster, check Slack for the data team's response".
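The S3 variant is a few lines more. A sketch assuming the httpfs extension and placeholder credentials, region and bucket names; newer DuckDB releases also offer a CREATE SECRET mechanism for the same purpose.

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")                 # one-time extension download
con.execute("LOAD httpfs")

con.execute("SET s3_region = 'ap-south-1'")   # placeholder credentials
con.execute("SET s3_access_key_id = '...'")
con.execute("SET s3_secret_access_key = '...'")

top_states = con.execute("""
    SELECT state, SUM(qty * price_inr) AS revenue
    FROM 's3://my-bucket/orders_2025.parquet'
    GROUP BY state
    ORDER BY revenue DESC
    LIMIT 10
""").fetchdf()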

This is the workflow that powers most of the data work at modern startups now. It is also why DuckDB ended up embedded inside other people's products: Tableau uses it for parts of its desktop product, Datadog uses it inside some of its own analytical tooling, and a long list of BI vendors ship it as their local query engine. The "embed an analytical engine" story finally had an answer.

Worked example: top products by revenue per state

Here is the canonical scenario, taken from the kind of question that shows up at every Indian e-commerce startup. You have 5 GB of orders in a Parquet file on your laptop. You want the top 10 products by revenue in each state. You have 32 GB of RAM and one hour before the meeting.

Top-N per group on 5 GB of e-commerce orders

The Parquet file looks like this — produced by your data pipeline, sitting in /data/orders_2025.parquet:

order_id  state         product_id  product_name        qty  price_inr
8451293   Maharashtra   P-1241      Cotton kurta        1    1499.00
8451294   Karnataka     P-3421      Bluetooth earbuds   1    2999.00
...
~ 80 million rows, 5.2 GB on disk

The DuckDB version is three lines of Python:

import duckdb

result = duckdb.query("""
    SELECT state, product_name, revenue
    FROM (
        SELECT state, product_name,
               SUM(qty * price_inr) AS revenue,
               ROW_NUMBER() OVER (PARTITION BY state ORDER BY SUM(qty * price_inr) DESC) AS rn
        FROM 'orders_2025.parquet'
        GROUP BY state, product_name
    )
    WHERE rn <= 10
    ORDER BY state, revenue DESC
""").to_df()

print(result.head(20))

That runs in about 8 seconds on a MacBook with 32 GB RAM. DuckDB pushes the column projection down into the Parquet reader (only reads state, product_name, qty, price_inr — skips order_id, addresses, timestamps), uses the vectorised hash aggregator for the GROUP BY, and the vectorised window operator for the ROW_NUMBER per state. Why projection pushdown matters here: the file has 18 columns, the query touches 4. Reading only the 4 needed column chunks shrinks I/O by 4.5× before any computation happens.
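You can see the pushdown rather than trust it: EXPLAIN prints the physical plan, and in recent DuckDB versions the Parquet scan operator lists only the columns it will read. The exact plan text varies between releases, so treat the output format as an assumption.

import duckdb

plan = duckdb.query("""
    EXPLAIN
    SELECT state, product_name, SUM(qty * price_inr) AS revenue
    FROM 'orders_2025.parquet'
    GROUP BY state, product_name
""").fetchall()

for row in plan:
    print(row[-1])        # the plan text; look for the scan node's projected columns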

Compare to the pandas version.

import pandas as pd
df = pd.read_parquet('orders_2025.parquet')
df['revenue'] = df['qty'] * df['price_inr']
agg = df.groupby(['state', 'product_name'], as_index=False)['revenue'].sum()
agg = agg.sort_values(['state', 'revenue'], ascending=[True, False])
top10 = agg.groupby('state').head(10)

This OOMs. read_parquet materialises all 80 million rows × 18 columns into RAM, then df['qty'] * df['price_inr'] allocates another 640 MB column, then groupby materialises an intermediate. Peak memory blows past 32 GB and the kernel dies.

Compare to the local-Spark version.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("topN")
         .config("spark.driver.memory", "16g")
         .config("spark.sql.shuffle.partitions", "16")
         .getOrCreate())

df = spark.read.parquet("orders_2025.parquet")
df = df.withColumn("revenue", F.col("qty") * F.col("price_inr"))
agg = df.groupBy("state", "product_name").agg(F.sum("revenue").alias("revenue"))
w = Window.partitionBy("state").orderBy(F.desc("revenue"))
ranked = agg.withColumn("rn", F.row_number().over(w))
top10 = ranked.filter(F.col("rn") <= 10).orderBy("state", F.desc("revenue"))
top10.toPandas()

This works, takes about 30 seconds end-to-end (including 18 seconds of JVM startup), and required you to remember the Spark API for window functions, configure spark.driver.memory, and trust that toPandas() will not blow past the driver's memory when collecting the result. Why this matters: the difference is 8 seconds versus 30 seconds plus a screenful of configuration. For a one-shot analytical question, the DuckDB version is qualitatively better — the analyst writes SQL, gets a DataFrame, moves on.

The DuckDB version of this code would be the first thing you'd reach for at any Indian e-commerce startup with a five-engineer data team.

DuckDB's relationship with the cloud: MotherDuck

The pure local story has one obvious limit: the data has to be on your laptop. For workloads where the source data lives in S3 and the team wants a shared query layer, MotherDuck — a company founded in 2022 in close partnership with DuckDB's creators — extends the model with a hybrid architecture. You run DuckDB locally on your laptop and it can transparently push parts of a query to a server-side DuckDB instance that has cheaper access to the data lake. Joins between local and remote tables just work.

The interesting design point here is dual execution: you write one SQL query, and the planner figures out which subqueries should run locally (because the data is on your machine) and which should run in the cloud (because the data is in S3). The same Postgres-compatible SQL, the same DuckDB engine, just running in two places. Why this matters as a category: the cloud-warehouse-or-laptop dichotomy was a false one. The hybrid model says you can have the ergonomics of local DuckDB and the data of a cloud warehouse, depending on which is closer to which.
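A sketch of what that looks like from the laptop side, assuming a MotherDuck account, a MOTHERDUCK_TOKEN in the environment, and a cloud-resident regions table; the md: connection string follows MotherDuck's documented pattern, but everything named here is a placeholder.

import duckdb

con = duckdb.connect("md:my_warehouse")       # authenticates via MOTHERDUCK_TOKEN

# One query, two locations: the Parquet file is on this laptop, the regions
# table lives in MotherDuck. The planner decides where each piece runs.
revenue_by_zone = con.execute("""
    SELECT r.zone, SUM(o.qty * o.price_inr) AS revenue
    FROM 'orders_2025.parquet' AS o
    JOIN regions AS r USING (state)
    GROUP BY r.zone
    ORDER BY revenue DESC
""").fetchdf()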

The broader trend: embedded analytics

DuckDB is not alone. The category it broke open — embedded columnar engines — now has several serious players, each making a slightly different bet.

chDB is ClickHouse-as-a-library. ClickHouse the server is an extremely fast columnar database used by Cloudflare and Uber for real-time analytics; chDB embeds the same execution engine into your Python process. The bet: ClickHouse's MergeTree storage and SIMD-aggressive engine, accessible without standing up a cluster.

Polars is a Rust dataframe library, built on Apache Arrow, that consciously rebuilt pandas with vectorisation, multi-threading and lazy evaluation as defaults. It is not a SQL engine — it is a dataframe API — but it competes with DuckDB for the same workloads. The bet: keep the imperative DataFrame interface that Python users know, just make it 50× faster.
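For a feel of the difference in interface, here is the same kind of aggregation written against Polars' lazy API (recent versions; older releases spell group_by as groupby). The file and column names are the running example from the worked section above.

import polars as pl

# Lazy: nothing is read until collect(); Polars plans the query, pushes the
# projection into the Parquet scan and runs the pipeline across threads.
lazy = (
    pl.scan_parquet("orders_2025.parquet")
      .with_columns((pl.col("qty") * pl.col("price_inr")).alias("revenue"))
      .group_by("state")
      .agg(pl.col("revenue").sum())
      .sort("revenue", descending=True)
)

print(lazy.collect())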

DuckDB-WASM compiles DuckDB to WebAssembly and runs it in your browser. Open a webpage, query a Parquet file in your downloads folder, get charts — no server involved at all. The bet: analytical queries should be possible from a static webpage. Several BI tools now use it for client-side dashboards.

The common thread across all three: assume modern hardware, exploit columnar layout, ship as a library not a server, optimise for the analyst's iteration loop. The cluster era assumed your laptop was tiny and your data was huge. The embedded era assumes the opposite is now true for most workloads — and is probably right.

When DuckDB is the wrong answer

The fair counterpoint: DuckDB is not the right answer for every analytical workload. Three places where it isn't.

Concurrent multi-writer workloads. DuckDB is single-process. If forty users need to query the same dataset simultaneously, you want a shared server (Postgres, ClickHouse, Snowflake), not forty laptops loading the same Parquet file. Why this matters: the embedded model wins for the one-analyst case and loses for the many-user case. Pick deliberately.

Truly large data. When your data is genuinely 10 TB or 100 TB and you cannot fit a sample on a laptop, you do need a cluster. DuckDB can stream and spill to disk and handle data larger than RAM, but the iteration loop on a 50 TB Parquet directory in S3 is slower than a properly tuned BigQuery or Snowflake.

Always-on production dashboards. A BI dashboard that 200 employees hit every morning is a service. Services need uptime, monitoring, and shared state. A library running in someone's notebook is not a service.

The lesson is the same one databases keep teaching: pick the system that matches your access pattern. DuckDB fits a vast and previously underserved middle ground — interactive analytics, ETL on Parquet, embedded analytics inside other apps, education, prototyping. For that middle ground, it is the most ergonomic option that has ever existed.

What you should take away

  1. Most analytical workloads are smaller than the cluster era assumed. A 32 GB laptop is enough for a startling fraction of real industry data work. You do not need to default to Spark or Snowflake.

  2. In-process is a category, not just a deployment choice. When the engine lives in your Python process, the friction-reducing implications cascade — no IAM, no port, no serialisation, no waiting for someone else to provision anything. The iteration loop tightens by an order of magnitude.

  3. DuckDB is the SQLite of analytics. Single binary, zero deps, Postgres-compatible SQL, vectorised columnar engine, reads Parquet/CSV/Arrow in place. That combination is what made it the centre of the embedded-analytics renaissance.

  4. Embedded analytics is a renaissance, not a fad. chDB, Polars, DuckDB-WASM and others are pushing the same pattern into ClickHouse, dataframes, and the browser. Build your data systems with the assumption that "fast columnar engine in the same process as the analyst" is now a normal building block.

In the next chapter you'll trade the embedded model for the server model and look at ClickHouse's MergeTree family — the storage engine that powers some of the largest real-time analytical workloads on the planet.

References

  1. Raasveldt, M. & Mühleisen, H. — DuckDB: an Embeddable Analytical Database (SIGMOD 2019 demonstration)
  2. DuckDB blog — Vectorized Execution in DuckDB
  3. Mark Raasveldt — DuckDB: Bringing analytical SQL directly to your Python data
  4. DuckDB blog — DuckDB-Wasm: Efficient Analytical SQL in the Browser
  5. MotherDuck — Welcome to MotherDuck (launch post)
  6. Hannes Mühleisen — DuckDB: An Embeddable Analytical RDBMS (CWI lecture series)