Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
Once your data lives in open Parquet files wrapped in an open table format like Iceberg or Delta, the SQL engine stops being part of the database and becomes a swappable component you point at the lake. Trino is built for interactive federated queries with sub-minute latency; Spark SQL is the default for scheduled batch ETL and ML feature work; AWS Athena is serverless Trino with zero ops and per-TB pricing. Most production lakehouses run more than one engine against the same tables, because the table format — not the engine — is the contract.
In the warehouse era, the SQL engine and the storage were welded together — buying Redshift or Teradata meant buying both at once, and adding a new query workload meant either contending on the same cluster or copying the data into a second system. The lakehouse breaks that weld: storage is Parquet in S3 wrapped by Iceberg or Delta, and the engine slot above it is open. This chapter is about the three engines that fill that slot in 2026 — Trino, Spark SQL, and Athena — and how to pick between them.
The thesis: storage is shared, engines are specialised
The sentence to internalise is this: in a 2026 lakehouse, the SQL engine is no longer part of the database. It is a separate, swappable component that reads the same files everyone else reads.
This is a strictly stronger position than the warehouse-era picture, where adding a new query workload meant either contending on the warehouse cluster (slow, political) or extracting the data into a second system (expensive, divergent, eventually-consistent). With the lake-engine split, adding a new workload means giving a new engine IAM read access to the bucket. Why this works without coordination: the table format provides snapshot isolation through immutable file groups and an atomically-swapped pointer to the current snapshot. Any engine that reads the snapshot pointer and then reads the listed files sees a consistent view, no matter how many other engines are reading or writing concurrently. Iceberg and Delta both use this pattern; the engines do not need to know about each other.
The three engines covered here all speak this contract. They differ in how they execute the SQL, what extra capabilities they bring, and what ops cost they impose.
Trino: interactive SQL, federated, MPP-shaped
Trino was born at Sociogram in 2012-2013 as a project called Presto, written by Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang to replace Hive for ad-hoc analytical queries on Sociogram's 300 PB warehouse. Hive at the time compiled SQL down to MapReduce jobs; a query that should have taken thirty seconds took thirty minutes because every shuffle hit HDFS. Presto's pitch was that you did not need MapReduce for interactive queries — you needed a long-lived MPP-style cluster of stateless workers that streamed data through a pipelined execution engine, the way Vertica and Greenplum did, but talking to HDFS and Hive metastore as the storage layer instead of local disks. The 2013 open-source release was widely adopted, and the original Presto SIGMOD paper (published in 2019, after years of production use) is still a clean read on the architecture.
In late 2018 the original founders left Sociogram, citing disputes over project direction. They forked the codebase, called the fork PrestoSQL, and continued development under the Presto Software Foundation. In December 2020 a trademark dispute with Sociogram (which had retained "Presto") forced them to rename the project, and they chose Trino. Today Trino is the active, well-maintained line; Sociogram's PrestoDB still exists but lags significantly in connector and feature work. The story is told in the foundation's own "Origin of Trino" blog.
The architecture is straightforward MPP, but with one critical twist that is the core of its value proposition.
The mechanics: a query arrives at the coordinator. The coordinator parses it, builds a logical plan, asks each connector for table metadata (column names, types, partitioning, optionally per-partition statistics), runs the cost-based optimiser, produces a distributed physical plan, and dispatches stage fragments to the workers. Workers stream tuples through the pipeline — scan, filter, hash-join, aggregate — without writing intermediate state to disk; partial results shuffle over the network to downstream stages. The final stage streams results back to the coordinator and out to the client.
Why no checkpointing makes Trino fast and fragile: the pipelined execution engine moves rows between operators in memory and never persists intermediate results. This is the right design for queries that finish in seconds-to-minutes; you avoid the disk-write tax that kills Hive on MapReduce. The price is that if one worker crashes mid-query, the whole query fails — there is no checkpoint to resume from. Trino tries to mitigate this with retries (the fault-tolerant execution mode added in 2022 does spool intermediate state to S3, at a latency cost), but the default is "fast and fragile". For interactive queries this is a sensible trade; for 4-hour ETL jobs it is a bad one.
Three production properties matter:
The connector ecosystem is the moat. Trino ships with over 30 connectors: the lake formats (Hive, Iceberg, Delta), the OLTP databases (MySQL, Postgres, SQL Server, Oracle), the NoSQL stores (MongoDB, Cassandra), the streaming layers (Kafka, Pulsar), the search engines (Elasticsearch, OpenSearch), the in-memory caches (Redis, Memcached), and a generic JDBC connector for anything else. A federated query like SELECT u.email, SUM(o.amount) FROM mysql.public.users u JOIN iceberg.gold.orders o ON o.user_id = u.id WHERE o.dt = current_date GROUP BY u.email joins live OLTP data against historical lake data in one statement. No ETL, no copy, no eventual consistency window — the join happens at query time. Federation is the killer feature; nothing else does it as cleanly at this scale.
Stateless workers scale horizontally and elastically. Workers hold no persistent state; you can add or remove them at any time, and the coordinator will route new queries to the larger fleet immediately. Auto-scaling Trino on Kubernetes is a well-trodden pattern.
Java-heavy operations. Trino is a JVM application (~50 MB JAR per node, multi-GB heap per worker), and operationally it behaves like one: GC tuning matters, OOM kills are the dominant failure mode at scale, JMX metrics are how you observe it. A team running self-hosted Trino at Streamora or ProfNet scale is a small team of dedicated engineers. This is the main reason the AWS-managed flavour (Athena) is so popular: it skips the ops.
A typical interactive query latency on Trino against Iceberg-on-S3, with reasonable warming, is 5-30 seconds for a 100 GB scan with predicate pushdown. The same query on Hive-on-HDFS with MapReduce in 2013 took 5-30 minutes. The 100x speedup is the entire point.
Spark SQL: the SQL face of a batch + ML engine
Apache Spark is a different beast in motivation and shape. It started at the AMPLab at UC Berkeley in 2009 as Matei Zaharia's PhD project, with a thesis that MapReduce was too slow for iterative algorithms (gradient descent, PageRank, k-means) because every iteration re-read its input from HDFS. Spark introduced the Resilient Distributed Dataset (RDD) abstraction — a partitioned in-memory dataset with lineage tracking — and let you express iterative computations as a chain of RDD transformations that the scheduler could keep in memory between iterations. The 2010 NSDI paper on Spark and the 2012 paper on RDDs were the founding documents.
Spark SQL arrived in 2014 as a layer on top of the RDD engine: a relational query language and DataFrame API, backed by the Catalyst optimiser (a rule-based + cost-based query planner written in Scala) and later the Tungsten execution engine (whole-stage code generation that compiles a query plan down to JVM bytecode at runtime). The Spark SQL SIGMOD 2015 paper is the readable reference; it is one of the cleaner papers in the analytical-database literature.
The shape that matters for this chapter: Spark SQL is not a separate product. It is the SQL surface of the same engine that runs your batch ETL jobs (df.groupBy(...).agg(...).write.parquet(...)), your ML pipelines (MLlib regression, classification, clustering), your graph computations (GraphX), and your streaming jobs (Structured Streaming). A SQL query in Spark goes through Catalyst, becomes a physical plan made of SparkPlan operators, and executes on the same fleet of executors that handle every other Spark workload. The DataFrame is the universal currency: a SQL SELECT produces one, an ML pipeline consumes one, a df.write.format("delta") writes one to your lake.
This makes Spark the right answer for anything where SQL is one step in a longer computation. A typical day-in-the-life:
# bronze → silver: read 1 TB of raw clickstream, type and dedupe
df = spark.read.json("s3://lake/bronze/clicks/dt=2026-04-25/")
clean = df.dropDuplicates(["click_id"]).withColumn("ts", to_timestamp("event_time"))
clean.write.format("delta").mode("overwrite").save("s3://lake/silver/clicks/dt=2026-04-25/")
# silver → gold: SQL aggregation joining with dimension tables
spark.sql("""
SELECT u.tier, p.category, COUNT(*) AS clicks, SUM(c.revenue) AS rev
FROM delta.`s3://lake/silver/clicks/dt=2026-04-25/` c
JOIN delta.`s3://lake/silver/users/` u ON c.user_id = u.id
JOIN delta.`s3://lake/silver/products/` p ON c.product_id = p.id
GROUP BY u.tier, p.category
""").write.format("delta").mode("overwrite").save("s3://lake/gold/cohort_revenue/dt=2026-04-25/")
# gold → ML: train a classifier on the same DataFrame the SQL produced
features = assembler.transform(spark.read.format("delta").load("s3://lake/gold/cohort_revenue/"))
model = LogisticRegression(featuresCol="features", labelCol="churn").fit(features)
All three steps run on the same Spark cluster, share the same metastore, and write Delta tables that a Trino cluster across the room can read sub-minute.
The execution model is quite different from Trino's. Spark uses stage-based execution: the Catalyst plan is broken into stages at shuffle boundaries; each stage runs as a set of parallel tasks; results of each stage are materialised (to memory if it fits, to disk if it spills) before the next stage starts. This is slower per-query than Trino's pipelined engine — every shuffle adds disk-write latency — but it makes the engine fault-tolerant out of the box. If a task fails, Spark just re-runs that task, recomputing from the upstream materialised data. For 4-hour batch jobs over 10 TB of input, this is the right trade: you cannot afford to restart the whole query because one worker had a transient OOM. Why this matters for the engine choice: Spark trades per-query latency for per-job reliability. A simple SELECT count(*) FROM gold.orders WHERE dt='2026-04-25' takes 30-60 seconds on Spark and 3-5 seconds on Trino against the same Iceberg table — Spark spends time launching its scheduler, materialising stages, and writing intermediate state. For a one-off interactive query that is brutal; for a 4-hour ETL it is invisible.
Operationally Spark runs on YARN (the legacy Hadoop scheduler), Kubernetes (the modern default), Mesos (rare now), or a dedicated cluster manager. The largest commercial deployment is Databricks, which has a heavily-optimised proprietary fork (Photon, a vectorised C++ execution engine) and a managed runtime; AWS EMR, GCP Dataproc, and Azure HDInsight provide thinner managed wrappers around the open-source release. Self-hosted Spark on Kubernetes is doable but, like self-hosted Trino, demands a real platform team.
AWS Athena: serverless Trino
Athena is the simplest of the three to describe. It is a fully managed, serverless, multi-tenant query service in AWS that exposes Trino (since the Engine v3 release in late 2022; before that it was an older fork of Presto) over an S3 data lake. There is no cluster, no instance type, no capacity planning; you write SQL in the AWS Console (or via JDBC, the Boto3 SDK, or the API), and Athena routes it onto a shared pool of Trino workers running inside AWS infrastructure. You pay 5 per terabyte of data scanned (variable by region, with a10 minimum-per-query in some regions), plus standard S3 read costs. The Athena documentation is the authoritative reference.
The architecture is otherwise just Trino. The same connectors work (Hive on Glue Data Catalog, Iceberg, Delta Lake via the iceberg connector, Hudi, federated query against RDS / Redshift / DocumentDB through Athena Federated Query, which packages your connector as a Lambda function). The same SQL works. The same predicate-pushdown and column-pruning behaviour applies. The same Iceberg partition pruning and Delta data-skipping work.
What you trade for the zero-ops experience:
- No tuning knobs. You cannot configure the Trino node count, the heap size, the GC strategy, the cluster autoscaling policy. AWS picks defaults that work for the median workload; if your workload is far from the median, you cannot fix it.
- Per-query cost surprise. A
SELECT * FROM gold.eventsagainst a 50 TB unpartitioned table costs you 250. Always-on. There is no warning, no quota by default, no prompt. Cost engineering on Athena is a real job. The mitigations — partition columns, Iceberg/Delta with column projection, `LIMIT` clauses, workgroup-level data-scanned limits — are all on the user's side. <span class="why">Why this matters more than on self-hosted Trino: a self-hosted Trino cluster costs you a fixed amount per hour regardless of how much SQL you throw at it; bad queries waste capacity but do not increase the bill. Athena's pricing model converts query inefficiency directly into rupees on the AWS invoice. A junior analyst running `SELECT * FROM big_table` once is a250 line item your finance team will see and will email you about. - AWS lock-in. Athena only runs in AWS, only against AWS-managed catalog (Glue or self-managed Hive metastore on EMR), and is billed through the AWS account. Multi-cloud organisations cannot use it as their primary engine.
- No persistent compute affinity. Each query runs on a fresh slice of the shared pool, with no warm cache of your tables. A query that would take 5 seconds on a self-hosted Trino cluster with hot Parquet pages takes 15 seconds on Athena. For ad-hoc queries this is fine; for sub-second BI dashboards it is not.
The right way to think about Athena: it is the serverless commodity for occasional SQL on S3. If you run dozens of queries an hour against the same hot tables, a dedicated Trino cluster (self-hosted or managed by Starburst / e6data) is cheaper and faster. If you run a few queries a day, Athena is unbeatable on ops cost, and the per-query bill is trivial.
The decision tree
The three engines are usually not in competition; you will run more than one, and the question is which one runs which workload. The decision tree is short.
The flow chart in your head should be:
- Is this a scheduled batch job that runs for minutes or hours and might fail mid-flight? Spark SQL. Stage-materialised execution gives you the fault tolerance you need; the DataFrame API integrates with whatever Python or Scala glue you have around the SQL; the same engine handles the ML training that follows the ETL.
- Is this an ad-hoc analytical query that needs to come back in seconds? Trino if you have ops capacity and a busy cluster; Athena if you do not. Both run the same engine; the difference is purely operational.
- Does the query need to join data sitting in heterogeneous sources — lake plus an OLTP database plus a Kafka topic — without copying any of them? Trino. Nothing else does federation as cleanly. Athena Federated Query is a workable second option in AWS-only shops.
- Is the team small enough that "no ops" is worth a 2x latency hit and per-query billing? Athena. The ops budget you save buys a lot of analyst time.
- Do you need ML training, graph computation, or streaming as part of the same job graph as the SQL? Spark, every time.
A subtle point: because Iceberg and Delta are open table formats, you do not have to choose. The same gold.orders Iceberg table can be written by a nightly Spark job and read all day by a Trino cluster and a roomful of Athena users. The lakehouse is engine-pluralist by design; this is its most underappreciated property.
Worked example — an Indian e-commerce data team running all three
You are the data platform lead at an Indian e-commerce company — say a BharatBazaar-like marketplace, 50 million daily active users, ₹400 crore daily GMV. The company has standardised on a lakehouse on s3://flipkart-lake/ with Iceberg tables in the gold layer. You need to support three distinct workloads.
Workload 1: nightly ETL (Spark SQL). Every night at 02:00 IST, a Spark job reads the previous day's bronze data — about 1 TB of raw clickstream JSON, payment events from Aurora MySQL CDC, and order events from a Kafka topic. It transforms, deduplicates, joins with dimension tables, and writes 100 GB of Iceberg-format gold tables: gold.orders_daily, gold.user_sessions_daily, gold.product_views_daily, plus a denormalised gold.user_360_daily that joins the three.
The Spark cluster runs on EMR — 50 r5.4xlarge instances (16 vCPU, 128 GB RAM each) — for the 4-hour window. Why Spark and not Trino: the job has long-running joins that spill to disk, must survive worker failures (a single worker dying after 3 hours of work cannot trigger a 3-hour restart), and feeds an ML pipeline that runs the same Spark cluster after the ETL completes. The cost: 50 nodes × 4 hours × ₹85/hour spot price = ₹17,000 per night, or ₹5.1 lakhs per month.
Workload 2: BI dashboards and ad-hoc analytics (Trino). The BI team runs DashView dashboards that hit the gold tables continuously through the work day. There are about 200 active dashboards, refreshing on 5-minute schedules; query latency target is under 10 seconds for the median dashboard, under 30 seconds for the P95. There are also data analysts running ad-hoc Trino SQL through a Hue-like UI all day.
The Trino cluster runs on EKS — 10 m5.4xlarge worker instances plus 1 coordinator — and stays up 24×7. The cluster reads the same Iceberg tables that the Spark job writes; Iceberg's snapshot isolation ensures that mid-query writes from Spark do not break in-flight Trino queries. Why Trino and not Spark for this: the BI team's dashboards must feel instantaneous (sub-10s); Spark's per-query overhead would push every dashboard above 30s. Why Trino and not Athena: the BI workload is constant (200 dashboards, all day, every weekday), so the per-TB-scanned cost would dominate; a fixed Trino cluster is cheaper at this query volume. The cluster cost: 11 nodes × 24 hours × ₹40/hour spot = ₹10,560 per day, or ₹3.2 lakhs per month — flat regardless of how many dashboards run.
Workload 3: cost-engineering and finance ad-hoc (Athena). The cost-engineering team runs maybe 30 queries a week — investigating cost spikes, building one-off reports for the CFO, comparing this month's spend against last month's. They cannot justify owning a Trino cluster, and they only know SQL, not Spark. Athena is perfect: they open the AWS Console, write SELECT * FROM gold.aws_cost_daily WHERE dt > current_date - 30, and pay for what they scan. With Iceberg partition pruning and column projection, a typical query scans ~5 GB and costs ₹2. Total cost: ~₹50 per week, or ₹200 per month — versus the ₹3 lakh/month a dedicated Trino cluster would cost just for this one team's workload.
The combined picture: all three engines run side-by-side against the same s3://flipkart-lake/gold/ Iceberg tables. None of them know about the others. The only contract is the Iceberg snapshot. Total monthly cost: ₹5.1L (Spark) + ₹3.2L (Trino) + ₹0.002L (Athena) ≈ ₹8.3L/month for a data platform that would have cost ₹50-80L/month on a comparable Snowflake or pre-decoupling Vertica deployment. The engine pluralism is the cost saving.
A note on the convergence
A theme worth surfacing explicitly: the engines are converging on the same storage and the same SQL dialect, while staying differentiated on execution. Trino and Spark SQL both speak Iceberg and Delta natively; both have first-class Parquet support; both run the same predicate-pushdown and column-pruning optimisations; both support the SQL standard's window functions, CTEs, and lateral joins. A SQL query you can write against Trino is, in 90% of cases, a SQL query you can run unchanged on Spark or Athena.
What stays specialised is the shape of the work. Trino's pipelined engine wins at low latency and federation; Spark's stage-materialised engine wins at fault tolerance, scale, and ML integration; Athena's serverless model wins at zero-ops occasional use. The choice is no longer about which engine can answer a SQL query — they all can — but about which engine's execution model matches your workload's shape. That is a substantially better world to live in than the warehouse-vs-Hadoop wars of 2014.
Common confusions
-
"Trino and Spark SQL are competitors — I have to pick one." They are complements far more often than they are rivals. Trino's pipelined execution wins on interactive sub-minute queries; Spark's stage-materialised execution wins on multi-hour batch jobs that must survive worker death. Most production lakehouses (Streamora, Pinscope, Cabline, the BharatBazaar-style example above) run both engines against the same Iceberg tables, with Spark writing nightly and Trino reading all day. The choice is per-workload, not per-platform.
-
"Athena is a different product from Trino." Since Athena Engine v3 (released by AWS in late 2022), Athena is Trino — a managed, multi-tenant Trino fleet that AWS operates inside its account boundary. The SQL dialect, the connectors, the optimizer, the predicate-pushdown behaviour are all Trino's. The differences are operational: serverless billing, no tuning knobs, AWS-only catalog. If you understand Trino, you understand Athena's engine; what Athena adds is purely on the ops/billing side.
-
"Athena charges per TB scanned, so it is always cheaper than running my own Trino cluster." Only at low query volumes. Athena's $5/TB pricing makes it unbeatable for the cost-engineer running 30 queries a week (~₹200/month in the worked example). At high query volumes — 200 BI dashboards refreshing every 5 minutes against 100 GB tables — Athena's per-TB-scanned bill quickly exceeds the fixed cost of a 10-node Trino cluster. Run the arithmetic on your actual query log before assuming Athena is the cheap option.
-
"Spark SQL is just a SQL parser bolted onto Spark — it does the same thing as Trino." The parser is the easy part; the execution model is the difference. Spark's Catalyst optimiser plans a SQL query into stages separated by shuffle boundaries, materialises each stage's output (to memory or disk), and only then starts the next stage. Trino's pipelined engine streams tuples through operators without materialising. For a
SELECT count(*) FROM gold.orders WHERE dt='2026-04-25', Trino returns in 3-5 seconds while Spark returns in 30-60 seconds against the same Iceberg table. The SQL is identical; the execution is not. -
"Trino's federated queries replace ETL — I can just join Postgres and S3 live." They can, and for ad-hoc investigations they often should. But Trino's federation pulls every relevant row from the OLTP database into the Trino workers at query time, which is fine for a 100-row dimension lookup and catastrophic for a 10 GB fact-table scan. Production teams use federated Trino for small + fresh sources joined against large + cold lake data. If both sides of the join are large, you still want a nightly ETL into the lake — Trino's federation is a tool for the long tail of analytical questions, not a substitute for the gold layer.
-
"Iceberg / Delta / Parquet are formats — surely the engine I pick doesn't matter much." The format is the contract; the engine is the execution. The same Iceberg table will return a query 5x faster on Trino than on Spark for an interactive workload, and 5x slower for a batch ETL job. The format guarantees correctness across engines (snapshot isolation, schema evolution, partition pruning); it does not guarantee performance. Pick the engine for the workload's shape — latency, fault-tolerance, ML adjacency — not for the format the table happens to be in.
Going deeper
The thesis above stops at the architectural sketch. This section pushes into four issues that show up the moment you operate any of these engines in production: the cost-based-optimizer dependency on table statistics, the cache-locality gap that separates Athena from a self-hosted Trino, Trino's fault-tolerant execution mode and why most teams disable it, and the Photon / Velox C++ vectorization wave that is rewriting all three engines under the hood.
Cost-based optimization is a statistics problem, not an algorithm problem
Both Trino and Spark SQL ship sophisticated cost-based optimizers that pick join orders, broadcast-vs-shuffle strategies, and aggregation modes based on estimated cardinalities. Both fall over for the same reason: the estimates are only as good as the table statistics. On Iceberg, the statistics live in the manifest files alongside the data — min/max per column per data file, null counts, row counts, NDV (number of distinct values) sketches. On Hive metastore tables they live in the catalog, populated by a ANALYZE TABLE command that you have to run by hand or on a schedule.
Production failure mode: a Spark job writes a new partition to gold.orders. The partition has 50 million rows. Nobody runs ANALYZE TABLE. The next morning a BI analyst writes SELECT u.tier, COUNT(*) FROM gold.orders o JOIN gold.users u ON o.user_id = u.id WHERE o.dt = '2026-04-25' GROUP BY u.tier. The optimizer has stale statistics for gold.orders showing 1 million rows. It picks a broadcast join (broadcast gold.users, hash-join with the assumed-small gold.orders partition), the broadcast plan blows up at runtime when the actual partition is 50x larger than estimated, and the query OOMs the worker. Why this is the dominant production-problem class: the optimizer's plans are correct for the statistics it sees; the bug is always in the staleness of those statistics. Iceberg's manifest-level statistics auto-update on every commit, which is why most modern lakehouses prefer Iceberg over Hive metastore. Delta's transaction log stores per-file statistics with similar guarantees. Hive metastore's per-partition statistics are the worst of the three, requiring explicit ANALYZE to stay current — and they are still the most common production setup, because the migration to Iceberg or Delta is in progress at most companies.
Cache locality: why Athena is slower than self-hosted Trino on the same query
A self-hosted Trino cluster running 24×7 against the same hot Iceberg tables will, after a few hours of warmup, have a substantial fraction of those tables' Parquet column chunks resident in the workers' OS page cache and in Trino's own OrcRecordReader / ParquetReader decompressed-page cache. A query that scans 10 GB of Parquet on a warm cluster might actually read only 200 MB from S3 — the rest comes from RAM at memory-bandwidth speed.
Athena does not give you this. Each Athena query lands on a fresh slice of the multi-tenant Trino pool, and that slice has no warm cache for your tables. Every byte of the 10 GB scan is fetched from S3, which is two-to-three orders of magnitude slower than memory. This is why the same query that runs in 5 seconds on a warm self-hosted Trino takes 15-25 seconds on Athena — the engine is identical, the storage is identical, but the cache locality differs.
Mitigations on Athena are limited: you can prewarm by running the same query twice (the second run sometimes hits a slightly warmer pool, but with no guarantee), or you can route to a Trino-compatible managed product that does keep persistent compute (Starburst Galaxy, e6data) at higher cost. The fundamental trade-off is structural: serverless multi-tenancy and warm caches are at odds. AWS's product positioning — "Athena for occasional queries, EMR-on-EKS for hot workloads" — reflects this honestly.
Trino's fault-tolerant execution: the option you usually want off
Trino added a fault-tolerant execution (FTE) mode in 2022, in response to the long-standing "your 4-hour query died because one worker OOMed" problem. With FTE enabled, Trino spools intermediate exchange data to S3 (or another object store) at every shuffle boundary. If a worker crashes mid-query, the coordinator re-launches the affected stage from the spooled data instead of failing the whole query. The mode is configured per-cluster or per-query via the retry_policy=TASK setting. The Trino FTE documentation is the canonical reference.
The trade-off is severe. Spooling intermediate state to S3 adds 30-100% latency overhead on most queries — the same overhead Spark's stage-materialised model has, applied to Trino. Once you turn on FTE, Trino's interactive-query advantage over Spark largely disappears. The right way to use FTE is selectively: leave it off for the BI cluster (interactive queries should fail-fast and re-run, not spool to S3); turn it on for the rare long-running analytical query that the data team kicks off ad-hoc and cannot afford to restart from scratch. Most teams never enable it; most who try it turn it off again within a sprint. The Spark engine is the better answer for jobs where fault tolerance is the dominant requirement — which is exactly why the engine pluralism in the BharatBazaar-style example is the right architectural shape.
Photon, Velox, and the C++ vectorization wave
Both Trino and Spark are JVM applications. Their inner loops — predicate evaluation, hash-table probes, aggregation — pay the JVM tax: indirect dispatch on virtual methods, garbage-collected object headers on every row, JIT warmup before the hot path is fast. For 95% of analytical workloads this is fine; for the top 5% — large hash joins, regex matching, JSON parsing on tens of billions of rows — the JVM tax is the dominant cost.
The industry response over the past five years has been to rewrite the inner loops in C++ using SIMD vectorization, then drop the C++ engine in as a replacement for the JVM execution layer while leaving the planner and the catalog alone. Databricks calls their version Photon, a closed-source C++ engine that backs Spark SQL on the Databricks runtime. The ASF's open-source equivalent is Velox, a C++ analytical execution library originally from Meta, now used as the execution backend in Presto, Spark (via Gluten), and several proprietary forks. The Velox VLDB 2022 paper walks through the design.
The practical effect: a CTAS or large hash-join on Photon-backed Spark runs 2-5x faster than on stock Spark; a Velox-backed Trino is 1.5-3x faster on the SIMD-friendly operators. The catch is that not all operators are vectorized yet, and queries that fall back to the JVM path on a few unsupported expressions can be slower than vanilla Spark. The wave is real but uneven. Expect by 2027 or 2028 the default open-source Trino and the default open-source Spark to ship with Velox-equivalent C++ engines on by default; the Java path will still be there for unsupported expressions, but the hot operators will run in vectorized C++. The convergence we noted earlier — same storage, same SQL, different execution — is about to get one more axis: same SQL, different physical execution, Java for the long tail and C++ for the hot path.
Where this leads next
- /wiki/data-lakes-on-object-storage — the storage layer that all three engines read; understand the bucket layout before you reason about engine choice.
- /wiki/table-formats-iceberg-delta-lake-hudi — the contract that lets engine pluralism work; without snapshot isolation in the table format, Trino-and-Spark-on-the-same-table is unsafe.
- /wiki/cost-based-optimization-and-cardinality-estimation — the optimizer machinery whose statistics-staleness problem dominates Trino and Spark production failures.
- /wiki/lakehouse-vs-warehouse-the-real-tradeoffs — the architectural argument for why the engine slot is open at all.
- /wiki/duckdb-and-the-embedded-analytics-renaissance — the fourth-engine option for the laptop / notebook use case, complementing the three covered here.
- /wiki/separation-of-storage-and-compute-snowflakes-bet — the precondition that makes engine pluralism economic; without compute disaggregation, "many engines, one storage" would mean buying the storage many times.
References
- "Presto: SQL on Everything" — the Sociogram team's ICDE 2019 paper on Presto's architecture; the foundational reference for Trino's design.
- Trino documentation — the canonical reference for connectors, deployment, and tuning.
- "The Origin of Trino" — the founders' account of the 2020 fork from Presto and the rename.
- Spark SQL: Relational Data Processing in Spark — the SIGMOD 2015 paper introducing Catalyst and DataFrames; the cleanest reference for the Spark SQL design.
- AWS Athena documentation — the official guide to Athena, including Engine v3 (Trino-based) and Federated Query.
- Velox: Meta's Unified Execution Engine — the VLDB 2022 paper on the C++ vectorized execution library now backing Presto, Spark (via Gluten), and several proprietary engines.
- "Trino vs Spark" — Starburst's side-by-side comparison; partisan but technically accurate on the architectural differences.