Query Engines: Trino, Spark SQL, Athena

In short

Once your data lives in open columnar files (Parquet) wrapped in an open table format (Iceberg or Delta), the question stops being which database do I use and becomes which SQL engine do I point at it. The three dominant choices in 2026 are Trino, Spark SQL, and AWS Athena, and they are not really competitors — they are specialists.

Trino (the open-source successor to Facebook's Presto, forked by the founders in 2020 after a trademark dispute) is built for interactive analytical SQL with sub-minute latency. A coordinator parses and plans the query, hundreds of stateless workers execute it in parallel, and pluggable connectors let one Trino cluster join data sitting in Hive on S3, Iceberg on GCS, MySQL in your transactional database, Postgres in your data product, Kafka in your streaming layer, and Elasticsearch in your search index — all in one SELECT. It is what Netflix, LinkedIn, Pinterest, Lyft and Airbnb run for ad-hoc analytics.

Spark SQL is the SQL surface of Apache Spark, the same engine that powers your batch ETL pipelines and your ML training jobs. Catalyst optimiser plans the query, Tungsten generates code, and the same DataFrame underneath can be a SQL result, a feature for a model, or a partition of a Delta-Lake write. It is the default for anything that runs on a schedule and takes minutes-to-hours: bronze-to-silver transforms, large joins, window functions over a year of history, ML feature engineering. Databricks, Palantir, Uber, every Fortune 500 data team has Spark in production.

AWS Athena is serverless Trino. AWS re-platformed Athena onto Trino in 2020 (the engine version is called Athena Engine v3), exposed it as a region-wide multi-tenant service, and bills you per terabyte scanned ($5/TB in most regions, much less if you scan compressed Parquet with predicate pushdown). There is no cluster to provision, nothing to patch, nothing to size. You point it at an S3 bucket, write SQL, get an answer. The catch is the lock-in: it only runs in AWS, only against AWS Glue catalog (or self-managed Hive metastore), and you cannot tune the engine.

The decision tree is short. Need ad-hoc interactive queries with sub-minute latency on a lake? Trino if you have ops capacity, Athena if you do not. Need scheduled batch ETL or ML training? Spark SQL. Need to join across heterogeneous sources (lake + Postgres + Kafka) in one query? Trino — it is the only one with the connector ecosystem. Want zero ops at any latency? Athena (or pay for a managed warehouse: Snowflake, BigQuery, Databricks SQL). And critically, because Iceberg and Delta are open table formats, you do not have to choose just one — many production lakehouses run Spark for nightly batch and Trino for daytime interactive against the same tables.

You finished the previous chapters of Build 16 with the lakehouse architecture in place: cheap durable storage in S3, columnar Parquet inside, transactional metadata wrapped around the files by Iceberg or Delta. The picture deliberately leaves the engine slot empty. In the warehouse world that slot was filled by the warehouse itself — Vertica, Teradata, Redshift — and the engine and the storage were welded together. In the lakehouse world the slot is open, and you fill it by picking a SQL engine that can read the table format you chose.

This chapter is about the three engines that matter for that slot today. They have overlapping capabilities and substantially different sweet spots; the goal is to give you the design intuitions to pick the right one (or the right combination — most production teams run more than one) for a given workload.

The thesis: storage is shared, engines are specialised

The sentence to internalise is this: in a 2026 lakehouse, the SQL engine is no longer part of the database. It is a separate, swappable component that reads the same files everyone else reads.

Many engines, one set of files. The contract that lets them coexist is the open table format (Iceberg, Delta) plus open file format (Parquet).

This is a strictly stronger position than the warehouse-era picture, where adding a new query workload meant either contending on the warehouse cluster (slow, political) or extracting the data into a second system (expensive, divergent, eventually-consistent). With the lake-engine split, adding a new workload means giving a new engine IAM read access to the bucket. Why this works without coordination: the table format provides snapshot isolation through immutable file groups and an atomically-swapped pointer to the current snapshot. Any engine that reads the snapshot pointer and then reads the listed files sees a consistent view, no matter how many other engines are reading or writing concurrently. Iceberg and Delta both use this pattern; the engines do not need to know about each other.

The three engines covered here all speak this contract. They differ in how they execute the SQL, what extra capabilities they bring, and what ops cost they impose.

Trino: interactive SQL, federated, MPP-shaped

Trino was born at Facebook in 2012-2013 as a project called Presto, written by Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang to replace Hive for ad-hoc analytical queries on Facebook's 300 PB warehouse. Hive at the time compiled SQL down to MapReduce jobs; a query that should have taken thirty seconds took thirty minutes because every shuffle hit HDFS. Presto's pitch was that you did not need MapReduce for interactive queries — you needed a long-lived MPP-style cluster of stateless workers that streamed data through a pipelined execution engine, the way Vertica and Greenplum did, but talking to HDFS and Hive metastore as the storage layer instead of local disks. The 2013 open-source release was widely adopted, and the original Presto SIGMOD paper (published in 2019, after years of production use) is still a clean read on the architecture.

In late 2018 the original founders left Facebook, citing disputes over project direction. They forked the codebase, called the fork PrestoSQL, and continued development under the Presto Software Foundation. In December 2020 a trademark dispute with Facebook (which had retained "Presto") forced them to rename the project, and they chose Trino. Today Trino is the active, well-maintained line; Facebook's PrestoDB still exists but lags significantly in connector and feature work. The story is told in the foundation's own "Origin of Trino" blog.

The architecture is straightforward MPP, but with one critical twist that is the core of its value proposition.

Trino's architecture. Coordinator + stateless workers like any MPP engine, but the connector layer makes the *data source* a plug-in. One SQL query can join across S3, MySQL, Postgres and Kafka.

The mechanics: a query arrives at the coordinator. The coordinator parses it, builds a logical plan, asks each connector for table metadata (column names, types, partitioning, optionally per-partition statistics), runs the cost-based optimiser, produces a distributed physical plan, and dispatches stage fragments to the workers. Workers stream tuples through the pipeline — scan, filter, hash-join, aggregate — without writing intermediate state to disk; partial results shuffle over the network to downstream stages. The final stage streams results back to the coordinator and out to the client.

Why no checkpointing makes Trino fast and fragile: the pipelined execution engine moves rows between operators in memory and never persists intermediate results. This is the right design for queries that finish in seconds-to-minutes; you avoid the disk-write tax that kills Hive on MapReduce. The price is that if one worker crashes mid-query, the whole query fails — there is no checkpoint to resume from. Trino tries to mitigate this with retries (the fault-tolerant execution mode added in 2022 does spool intermediate state to S3, at a latency cost), but the default is "fast and fragile". For interactive queries this is a sensible trade; for 4-hour ETL jobs it is a bad one.

Three production properties matter:

The connector ecosystem is the moat. Trino ships with over 30 connectors: the lake formats (Hive, Iceberg, Delta), the OLTP databases (MySQL, Postgres, SQL Server, Oracle), the NoSQL stores (MongoDB, Cassandra), the streaming layers (Kafka, Pulsar), the search engines (Elasticsearch, OpenSearch), the in-memory caches (Redis, Memcached), and a generic JDBC connector for anything else. A federated query like SELECT u.email, SUM(o.amount) FROM mysql.public.users u JOIN iceberg.gold.orders o ON o.user_id = u.id WHERE o.dt = current_date GROUP BY u.email joins live OLTP data against historical lake data in one statement. No ETL, no copy, no eventual consistency window — the join happens at query time. Federation is the killer feature; nothing else does it as cleanly at this scale.

Stateless workers scale horizontally and elastically. Workers hold no persistent state; you can add or remove them at any time, and the coordinator will route new queries to the larger fleet immediately. Auto-scaling Trino on Kubernetes is a well-trodden pattern.

Java-heavy operations. Trino is a JVM application (~50 MB JAR per node, multi-GB heap per worker), and operationally it behaves like one: GC tuning matters, OOM kills are the dominant failure mode at scale, JMX metrics are how you observe it. A team running self-hosted Trino at Netflix or LinkedIn scale is a small team of dedicated engineers. This is the main reason the AWS-managed flavour (Athena) is so popular: it skips the ops.

A typical interactive query latency on Trino against Iceberg-on-S3, with reasonable warming, is 5-30 seconds for a 100 GB scan with predicate pushdown. The same query on Hive-on-HDFS with MapReduce in 2013 took 5-30 minutes. The 100x speedup is the entire point.

Spark SQL: the SQL face of a batch + ML engine

Apache Spark is a different beast in motivation and shape. It started at the AMPLab at UC Berkeley in 2009 as Matei Zaharia's PhD project, with a thesis that MapReduce was too slow for iterative algorithms (gradient descent, PageRank, k-means) because every iteration re-read its input from HDFS. Spark introduced the Resilient Distributed Dataset (RDD) abstraction — a partitioned in-memory dataset with lineage tracking — and let you express iterative computations as a chain of RDD transformations that the scheduler could keep in memory between iterations. The 2010 NSDI paper on Spark and the 2012 paper on RDDs were the founding documents.

Spark SQL arrived in 2014 as a layer on top of the RDD engine: a relational query language and DataFrame API, backed by the Catalyst optimiser (a rule-based + cost-based query planner written in Scala) and later the Tungsten execution engine (whole-stage code generation that compiles a query plan down to JVM bytecode at runtime). The Spark SQL SIGMOD 2015 paper is the readable reference; it is one of the cleaner papers in the analytical-database literature.

The shape that matters for this chapter: Spark SQL is not a separate product. It is the SQL surface of the same engine that runs your batch ETL jobs (df.groupBy(...).agg(...).write.parquet(...)), your ML pipelines (MLlib regression, classification, clustering), your graph computations (GraphX), and your streaming jobs (Structured Streaming). A SQL query in Spark goes through Catalyst, becomes a physical plan made of SparkPlan operators, and executes on the same fleet of executors that handle every other Spark workload. The DataFrame is the universal currency: a SQL SELECT produces one, an ML pipeline consumes one, a df.write.format("delta") writes one to your lake.

This makes Spark the right answer for anything where SQL is one step in a longer computation. A typical day-in-the-life:

# bronze → silver: read 1 TB of raw clickstream, type and dedupe
df = spark.read.json("s3://lake/bronze/clicks/dt=2026-04-25/")
clean = df.dropDuplicates(["click_id"]).withColumn("ts", to_timestamp("event_time"))
clean.write.format("delta").mode("overwrite").save("s3://lake/silver/clicks/dt=2026-04-25/")

# silver → gold: SQL aggregation joining with dimension tables
spark.sql("""
  SELECT u.tier, p.category, COUNT(*) AS clicks, SUM(c.revenue) AS rev
  FROM delta.`s3://lake/silver/clicks/dt=2026-04-25/` c
  JOIN delta.`s3://lake/silver/users/` u ON c.user_id = u.id
  JOIN delta.`s3://lake/silver/products/` p ON c.product_id = p.id
  GROUP BY u.tier, p.category
""").write.format("delta").mode("overwrite").save("s3://lake/gold/cohort_revenue/dt=2026-04-25/")

# gold → ML: train a classifier on the same DataFrame the SQL produced
features = assembler.transform(spark.read.format("delta").load("s3://lake/gold/cohort_revenue/"))
model = LogisticRegression(featuresCol="features", labelCol="churn").fit(features)

All three steps run on the same Spark cluster, share the same metastore, and write Delta tables that a Trino cluster across the room can read sub-minute.

The execution model is quite different from Trino's. Spark uses stage-based execution: the Catalyst plan is broken into stages at shuffle boundaries; each stage runs as a set of parallel tasks; results of each stage are materialised (to memory if it fits, to disk if it spills) before the next stage starts. This is slower per-query than Trino's pipelined engine — every shuffle adds disk-write latency — but it makes the engine fault-tolerant out of the box. If a task fails, Spark just re-runs that task, recomputing from the upstream materialised data. For 4-hour batch jobs over 10 TB of input, this is the right trade: you cannot afford to restart the whole query because one worker had a transient OOM. Why this matters for the engine choice: Spark trades per-query latency for per-job reliability. A simple SELECT count(*) FROM gold.orders WHERE dt='2026-04-25' takes 30-60 seconds on Spark and 3-5 seconds on Trino against the same Iceberg table — Spark spends time launching its scheduler, materialising stages, and writing intermediate state. For a one-off interactive query that is brutal; for a 4-hour ETL it is invisible.

Operationally Spark runs on YARN (the legacy Hadoop scheduler), Kubernetes (the modern default), Mesos (rare now), or a dedicated cluster manager. The largest commercial deployment is Databricks, which has a heavily-optimised proprietary fork (Photon, a vectorised C++ execution engine) and a managed runtime; AWS EMR, GCP Dataproc, and Azure HDInsight provide thinner managed wrappers around the open-source release. Self-hosted Spark on Kubernetes is doable but, like self-hosted Trino, demands a real platform team.

AWS Athena: serverless Trino

Athena is the simplest of the three to describe. It is a fully managed, serverless, multi-tenant query service in AWS that exposes Trino (since the Engine v3 release in late 2022; before that it was an older fork of Presto) over an S3 data lake. There is no cluster, no instance type, no capacity planning; you write SQL in the AWS Console (or via JDBC, the Boto3 SDK, or the API), and Athena routes it onto a shared pool of Trino workers running inside AWS infrastructure. You pay 5 per terabyte of data scanned (variable by region, with a10 minimum-per-query in some regions), plus standard S3 read costs. The Athena documentation is the authoritative reference.

The architecture is otherwise just Trino. The same connectors work (Hive on Glue Data Catalog, Iceberg, Delta Lake via the iceberg connector, Hudi, federated query against RDS / Redshift / DocumentDB through Athena Federated Query, which packages your connector as a Lambda function). The same SQL works. The same predicate-pushdown and column-pruning behaviour applies. The same Iceberg partition pruning and Delta data-skipping work.

What you trade for the zero-ops experience:

No tuning knobs. You cannot configure the Trino node count, the heap size, the GC strategy, the cluster autoscaling policy. AWS picks defaults that work for the median workload; if your workload is far from the median, you cannot fix it.
Per-query cost surprise. A SELECT * FROM gold.events against a 50 TB unpartitioned table costs you 250. Always-on. There is no warning, no quota by default, no prompt. Cost engineering on Athena is a real job. The mitigations — partition columns, Iceberg/Delta with column projection, `LIMIT` clauses, workgroup-level data-scanned limits — are all on the user's side. <span class="why">Why this matters more than on self-hosted Trino: a self-hosted Trino cluster costs you a fixed amount per hour regardless of how much SQL you throw at it; bad queries waste capacity but do not increase the bill. Athena's pricing model converts query inefficiency directly into rupees on the AWS invoice. A junior analyst running `SELECT * FROM big_table` once is a250 line item your finance team will see and will email you about.
AWS lock-in. Athena only runs in AWS, only against AWS-managed catalog (Glue or self-managed Hive metastore on EMR), and is billed through the AWS account. Multi-cloud organisations cannot use it as their primary engine.
No persistent compute affinity. Each query runs on a fresh slice of the shared pool, with no warm cache of your tables. A query that would take 5 seconds on a self-hosted Trino cluster with hot Parquet pages takes 15 seconds on Athena. For ad-hoc queries this is fine; for sub-second BI dashboards it is not.

The right way to think about Athena: it is the serverless commodity for occasional SQL on S3. If you run dozens of queries an hour against the same hot tables, a dedicated Trino cluster (self-hosted or managed by Starburst / e6data) is cheaper and faster. If you run a few queries a day, Athena is unbeatable on ops cost, and the per-query bill is trivial.

The decision tree

The three engines are usually not in competition; you will run more than one, and the question is which one runs which workload. The decision tree is short.

The trade-off matrix. The three engines occupy different quadrants of the (latency, ops, fault-tolerance) space, and most production data platforms run at least two of them.

The flow chart in your head should be:

Is this a scheduled batch job that runs for minutes or hours and might fail mid-flight? Spark SQL. Stage-materialised execution gives you the fault tolerance you need; the DataFrame API integrates with whatever Python or Scala glue you have around the SQL; the same engine handles the ML training that follows the ETL.
Is this an ad-hoc analytical query that needs to come back in seconds? Trino if you have ops capacity and a busy cluster; Athena if you do not. Both run the same engine; the difference is purely operational.
Does the query need to join data sitting in heterogeneous sources — lake plus an OLTP database plus a Kafka topic — without copying any of them? Trino. Nothing else does federation as cleanly. Athena Federated Query is a workable second option in AWS-only shops.
Is the team small enough that "no ops" is worth a 2x latency hit and per-query billing? Athena. The ops budget you save buys a lot of analyst time.
Do you need ML training, graph computation, or streaming as part of the same job graph as the SQL? Spark, every time.

A subtle point: because Iceberg and Delta are open table formats, you do not have to choose. The same gold.orders Iceberg table can be written by a nightly Spark job and read all day by a Trino cluster and a roomful of Athena users. The lakehouse is engine-pluralist by design; this is its most underappreciated property.

Worked example — an Indian e-commerce data team running all three

You are the data platform lead at an Indian e-commerce company — say a Flipkart-like marketplace, 50 million daily active users, ₹400 crore daily GMV. The company has standardised on a lakehouse on s3://flipkart-lake/ with Iceberg tables in the gold layer. You need to support three distinct workloads.

Workload 1: nightly ETL (Spark SQL). Every night at 02:00 IST, a Spark job reads the previous day's bronze data — about 1 TB of raw clickstream JSON, payment events from Aurora MySQL CDC, and order events from a Kafka topic. It transforms, deduplicates, joins with dimension tables, and writes 100 GB of Iceberg-format gold tables: gold.orders_daily, gold.user_sessions_daily, gold.product_views_daily, plus a denormalised gold.user_360_daily that joins the three.

The Spark cluster runs on EMR — 50 r5.4xlarge instances (16 vCPU, 128 GB RAM each) — for the 4-hour window. Why Spark and not Trino: the job has long-running joins that spill to disk, must survive worker failures (a single worker dying after 3 hours of work cannot trigger a 3-hour restart), and feeds an ML pipeline that runs the same Spark cluster after the ETL completes. The cost: 50 nodes × 4 hours × ₹85/hour spot price = ₹17,000 per night, or ₹5.1 lakhs per month.

Workload 2: BI dashboards and ad-hoc analytics (Trino). The BI team runs Tableau dashboards that hit the gold tables continuously through the work day. There are about 200 active dashboards, refreshing on 5-minute schedules; query latency target is under 10 seconds for the median dashboard, under 30 seconds for the P95. There are also data analysts running ad-hoc Trino SQL through a Hue-like UI all day.

The Trino cluster runs on EKS — 10 m5.4xlarge worker instances plus 1 coordinator — and stays up 24×7. The cluster reads the same Iceberg tables that the Spark job writes; Iceberg's snapshot isolation ensures that mid-query writes from Spark do not break in-flight Trino queries. Why Trino and not Spark for this: the BI team's dashboards must feel instantaneous (sub-10s); Spark's per-query overhead would push every dashboard above 30s. Why Trino and not Athena: the BI workload is constant (200 dashboards, all day, every weekday), so the per-TB-scanned cost would dominate; a fixed Trino cluster is cheaper at this query volume. The cluster cost: 11 nodes × 24 hours × ₹40/hour spot = ₹10,560 per day, or ₹3.2 lakhs per month — flat regardless of how many dashboards run.

Workload 3: cost-engineering and finance ad-hoc (Athena). The cost-engineering team runs maybe 30 queries a week — investigating cost spikes, building one-off reports for the CFO, comparing this month's spend against last month's. They cannot justify owning a Trino cluster, and they only know SQL, not Spark. Athena is perfect: they open the AWS Console, write SELECT * FROM gold.aws_cost_daily WHERE dt > current_date - 30, and pay for what they scan. With Iceberg partition pruning and column projection, a typical query scans ~5 GB and costs ₹2. Total cost: ~₹50 per week, or ₹200 per month — versus the ₹3 lakh/month a dedicated Trino cluster would cost just for this one team's workload.

The combined picture: all three engines run side-by-side against the same s3://flipkart-lake/gold/ Iceberg tables. None of them know about the others. The only contract is the Iceberg snapshot. Total monthly cost: ₹5.1L (Spark) + ₹3.2L (Trino) + ₹0.002L (Athena) ≈ ₹8.3L/month for a data platform that would have cost ₹50-80L/month on a comparable Snowflake or pre-decoupling Vertica deployment. The engine pluralism is the cost saving.

The three workloads, the three engines, and the one lake.

A note on the convergence

A theme worth surfacing explicitly: the engines are converging on the same storage and the same SQL dialect, while staying differentiated on execution. Trino and Spark SQL both speak Iceberg and Delta natively; both have first-class Parquet support; both run the same predicate-pushdown and column-pruning optimisations; both support the SQL standard's window functions, CTEs, and lateral joins. A SQL query you can write against Trino is, in 90% of cases, a SQL query you can run unchanged on Spark or Athena.

What stays specialised is the shape of the work. Trino's pipelined engine wins at low latency and federation; Spark's stage-materialised engine wins at fault tolerance, scale, and ML integration; Athena's serverless model wins at zero-ops occasional use. The choice is no longer about which engine can answer a SQL query — they all can — but about which engine's execution model matches your workload's shape. That is a substantially better world to live in than the warehouse-vs-Hadoop wars of 2014.

References

"Presto: SQL on Everything" — the Facebook team's ICDE 2019 paper on Presto's architecture; the foundational reference for Trino's design.
Trino documentation — the canonical reference for connectors, deployment, and tuning.
"The Origin of Trino" — the founders' account of the 2020 fork from Presto and the rename.
Spark SQL: Relational Data Processing in Spark — the SIGMOD 2015 paper introducing Catalyst and DataFrames; the cleanest reference for the Spark SQL design.
AWS Athena documentation — the official guide to Athena, including Engine v3 (Trino-based) and Federated Query.
"Trino vs Spark" — Starburst's side-by-side comparison; partisan but technically accurate on the architectural differences.