Benchmarking honestly and the vendor-benchmark lies

A founder at a Bengaluru fintech is staring at three vendor decks. Snowflake claims a TPC-DS run that beats Redshift by 4.2x. Databricks claims a TPC-DS run that beats Snowflake by 2.7x. Firebolt claims a TPC-DS run that beats both by 8x. All three numbers are technically true. All three are useless for predicting how the fintech's actual workload — payment-rollup queries, merchant dashboards, ledger reconciliation — will run in production. The procurement deadline is Friday. The team has three days to figure out which of the three is lying the most. The answer is "none of them" and "all of them" simultaneously, and the only way out is to stop reading vendor decks and start running your own benchmark.

Vendor benchmarks are advertising. They are run on configurations the vendor controls, queries the vendor selected, data the vendor shaped, and concurrency patterns the vendor chose — every parameter optimised to make their product win. The only honest benchmark is the one you run yourself, on your data, with your queries, at your concurrency, paying your bill. This chapter shows how to read vendor numbers without being conned, and how to run a workload-relevant benchmark in two weeks instead of two quarters.

Why vendor benchmarks are not just biased — they are engineered to win

A vendor benchmark is not a measurement; it is a proof-of-existence for a specific configuration that the vendor's marketing team can stand behind. The vendor team that produces the benchmark has a single objective: produce a number that beats the competition on a published workload, using a configuration that is reproducible-on-paper but inaccessible-in-practice to a normal customer. Every parameter of that benchmark is a knob the vendor turned, and once you understand which knobs they turned, you understand why the published number has almost no predictive power for your workload.

The first knob is query selection. TPC-H has 22 queries; TPC-DS has 99. A vendor benchmark publishes a "geometric mean" or "total runtime" across all queries — but if their engine is 100x faster on 18 of the 99 TPC-DS queries and 1.2x slower on the other 81, the geometric mean still looks great because of how means combine ratios. Why geometric means hide losses: a 100x speedup on one query and a 0.5x slowdown on another produce a geometric mean of 7.07x — the speedup dominates because the metric multiplies ratios. If your workload is dominated by the queries where the engine is slow, the published "7x faster" number is actively misleading for you. The second knob is data shape. TPC-H's data generator produces uniform distributions; production data has skew (one merchant generates 40% of payments, one product SKU generates 25% of orders), nulls (30% of fields are NULL because the upstream system stopped writing them in 2022), and degenerate joins (the customer table has 5 rows with 80% of all foreign-key references). Engines that win on uniform data often lose on skewed data because their query planners assume cardinality estimates that real data violates.
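The geometric-mean arithmetic is worth seeing once. A minimal sketch with hypothetical speedup ratios:

# geomean_demo.py — how a geometric mean hides per-query losses (hypothetical ratios)
import math

def geometric_mean(ratios):
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"{geometric_mean([100.0, 0.5]):.2f}x")     # 7.07x — one big win dominates
ratios = [100.0] * 18 + [1 / 1.2] * 81            # faster on 18 queries, slower on 81
print(f"{geometric_mean(ratios):.2f}x overall")   # ~1.99x "faster"
print(f"slower on {sum(r < 1 for r in ratios)} of {len(ratios)} queries")

The deck says "2x faster"; your workload may live entirely in the 81 queries where the engine is slower.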

The third knob is concurrency. Most published benchmarks run one query at a time on a dedicated warehouse. Production runs 50 queries simultaneously on a shared warehouse with admission control, queueing, and resource contention. An engine that returns a query in 2 seconds with no concurrency may return the same query in 45 seconds at 50-way concurrency because the optimizer's parallel plan now contends with 49 other parallel plans for the same 32 cores. The fourth knob is caching. Vendor benchmarks routinely report "warm cache" numbers — the second or third execution after the data has been pulled into the warehouse's buffer pool — because cold-cache numbers are 5–20x slower and embarrassing. Your production workload includes both: the morning's first run is cold, the afternoon's third dashboard refresh is warm, and the published number reflects only one of these realities. The fifth knob is cost. Two engines may produce the same query latency on the same data; one costs ₹2 to run that query and the other costs ₹40. The published "performance" number says nothing about the cost ratio, which is what your CFO will actually ask about.
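The caching knob is the easiest to control for in your own runs: time each query template cold, then warm, and never blend the two. A minimal sketch, assuming an engine object that exposes execute(sql) like the replay harness later in this chapter:

# warm_cold_probe.py — report cold and warm latencies separately, never blended
import time

def timed_ms(engine, sql):
    t0 = time.perf_counter()
    engine.execute(sql)                    # engine is assumed to expose execute(sql)
    return (time.perf_counter() - t0) * 1000.0

def warm_cold(engine, sql):
    # on a real warehouse, force the cold state first (suspend/resume the
    # warehouse, or disable the result cache) — otherwise "cold" is a lie
    cold = timed_ms(engine, sql)
    warm = min(timed_ms(engine, sql) for _ in range(3))   # best of three warm runs
    return {"cold_ms": round(cold, 1), "warm_ms": round(warm, 1)}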

Figure: The five knobs that make vendor benchmarks unreliable. Five labelled knobs — query selection, data shape, concurrency, caching, and cost — each with a "vendor setting" (turned toward the best published number) and a "production reality" position: a hand-picked subset of queries whose geo-mean hides the slow ones; uniform data vs skew, NULLs, and degenerate joins; one query at a time vs 50 simultaneous; warm cache only, when cold cache is 5–20x slower; latency reported, cost-per-query conveniently absent.
Every published benchmark is a configuration. Reading the configuration reveals which knobs the vendor optimised; running your own benchmark is the only way to discover where your workload actually sits on each knob.

Reading a vendor benchmark like a forensic auditor

When a vendor sends you a benchmark deck, the deck has roughly three pages: a headline number, a methodology page in 8-point type, and a "results" chart with no axis labels. The headline number is for the procurement officer; the methodology page is for the data engineer; the chart is for everyone to skip. The first thing you read is the methodology page, every footnote, every parenthetical. Almost everything you need to discount the headline number is in those footnotes — you just have to know what to look for.

There are eight specific things to check, in order. Workload identity: is this TPC-H, TPC-DS, an internal vendor benchmark, or a customer workload? TPC-H is 22 simple SQL queries on a star schema; TPC-DS is 99 queries with more realistic shapes; vendor-designed or vendor-favoured suites (ClickBench, the Star Schema Benchmark, Wisconsin-style microbenchmarks) are usually chosen to play to the vendor's strengths. Scale factor: TPC-H at SF=1 is 1GB; SF=1000 is 1TB; SF=10000 is 10TB. Numbers reported at SF=100 don't predict SF=10000 because cost models cross over (storage layer, shuffle layer, and optimizer planning time all scale non-linearly). Hardware: a benchmark on a 64-core machine doesn't predict an 8-core machine, and "elastic" cloud benchmarks where the vendor scales out for the run in a way you can't afford in production are common. Concurrency: was this single-user or multi-user? If the methodology says "single user", the published number is irrelevant for any production workload with more than one analyst. Caching state: cold start, warm cache, or buffer-pool-resident? Query modifications: did the vendor rewrite any queries (TPC permits "minor rewrites" — vendors abuse this)? Cost: is the dollar-per-query or rupee-per-query reported, or only the latency? Reproducibility: are the queries, data generator, and configuration scripts published, or is "contact us for details" the answer?

Why the methodology footnotes matter more than the headline: a benchmark with the right footnotes — TPC-DS at SF=10000, 16-way concurrency, cold-cache, all 99 queries, queries unmodified, dollar-per-query reported, configuration published — is rare and informative. A benchmark missing four of these eight things is advertising. The headline number is the same length either way; the footnotes tell you which one you're reading.

The Indian-fintech pattern in 2024–25 is to require vendors to fill out a standard honest-benchmark-disclosure form before any procurement conversation. The form lists those eight items and asks for explicit answers. Vendors who refuse to fill out the form lose the deal at the first round. Razorpay's published procurement playbook from 2024 mandates this form for every warehouse, query engine, or stream-processing vendor evaluation; the form has reportedly cut their procurement cycle time in half because vendors who can't answer the questions self-eliminate before the team spends weeks evaluating them.
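The form itself can be as mechanical as a scoring gate. A sketch — the field names are illustrative, not Razorpay's actual form:

# disclosure_check.py — the eight-item disclosure form as a scoring gate
REQUIRED_DISCLOSURES = [
    "workload_identity",      # TPC-H / TPC-DS / vendor-internal / customer workload
    "scale_factor",           # SF=100 numbers don't predict SF=10000
    "hardware",               # cores, memory, elastic scale-out used?
    "concurrency",            # single-user numbers are irrelevant in production
    "caching_state",          # cold start vs warm cache vs buffer-pool-resident
    "query_modifications",    # exact SQL published, rewrites declared
    "cost_per_query",         # latency without cost is half a number
    "reproducibility",        # scripts and configs published, not "contact us"
]

def score_deck(deck: dict) -> tuple[int, list[str]]:
    """Count answered items; anything missing is a reason to discount the headline."""
    missing = [k for k in REQUIRED_DISCLOSURES if not deck.get(k)]
    return len(REQUIRED_DISCLOSURES) - len(missing), missing

answered, missing = score_deck({"workload_identity": "TPC-DS", "scale_factor": "SF=1000"})
print(f"{answered}/8 disclosed; missing: {missing}")  # 2/8 — advertising, not a benchmark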

A workload-relevant benchmark in two weeks

The honest alternative to vendor benchmarks is to run your own. The intimidating version of "your own benchmark" is a quarter of work: spin up three warehouses, load production data into each, port your dbt project to each dialect, run a representative query mix at production concurrency, measure latency and cost, write a 30-page report. The pragmatic version is two weeks: capture a representative slice of your production workload, replay it against each candidate engine, and compare on the metrics your business actually cares about. The capture-and-replay pattern is what every Indian-fintech data team that has actually evaluated multiple warehouses uses; the 30-page-report version is what consultancies sell.

The capture step is mechanical. Pull the last 30 days of query_history from your current warehouse — Snowflake's INFORMATION_SCHEMA.QUERY_HISTORY, Redshift's STL_QUERY and SVL_STATEMENTTEXT, BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT. Filter to queries that touched the tables you care about; bucket by user, by query template (after stripping literals), by latency, by warehouse size. The output is a list of 100–500 distinct query templates that account for 80% of your warehouse spend — the "P80 workload". The replay step takes that P80 workload, instantiates each template against the candidate engine's schema, and runs them at production concurrency. A sketch of the template-stripping step follows; the replay harness after it is the minimum useful version — in production you add result-correctness checks, multi-user scheduling, and warm/cold cache distinctions.
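The fiddliest part of the capture step is collapsing raw query text into templates. A crude but serviceable sketch — the regexes are illustrative and will need tuning for your dialect:

# capture_templates.py — collapse captured queries into weighted templates
import re
from collections import Counter

def to_template(sql: str) -> str:
    """Strip literals so structurally identical queries collapse into one template."""
    sql = re.sub(r"'[^']*'", "?", sql)           # quoted string literals first
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # numeric literals (also collapses
                                                 # GROUP BY 1 — acceptable for identity)
    return re.sub(r"\s+", " ", sql).strip().upper()

# captured_queries would come from QUERY_HISTORY / STL_QUERY / JOBS_BY_PROJECT
captured_queries = [
    "SELECT * FROM merchants WHERE merchant_id = 421",
    "SELECT * FROM merchants WHERE merchant_id = 87",
    "SELECT merchant_id, SUM(amount_inr) FROM payments WHERE captured_at >= '2025-01-01' GROUP BY 1",
]
counts = Counter(to_template(q) for q in captured_queries)
total = sum(counts.values())
for tmpl, n in counts.most_common():
    print(f"{n/total:.2f}  {tmpl}")   # these weights feed production_workload below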

# benchmark_replay.py — replay your production workload against candidate engines
import time, statistics, random
from concurrent.futures import ThreadPoolExecutor, as_completed

# --- stub candidate engines (replace with snowflake/duckdb/clickhouse clients)
class FakeEngine:
    def __init__(self, name, latency_distribution_ms):
        self.name = name
        self.dist = latency_distribution_ms  # (min, p50, p95, p99)
    def execute(self, sql):
        # simulate: bimodal — fast for small scans, slow for joins on skewed data
        is_join = "JOIN" in sql.upper()
        base = self.dist[2] if is_join else self.dist[1]
        jitter = random.gauss(0, base * 0.15)
        time.sleep(max(0.001, (base + jitter) / 1000.0))
        return {"rows": random.randint(1, 1000), "bytes_scanned": random.randint(1_000_000, 10_000_000_000)}

# --- the production workload, captured from query_history ----------------
production_workload = [
    # (query_template, weight = how often it runs in production)
    ("SELECT merchant_id, SUM(amount_inr) FROM payments WHERE captured_at >= ? GROUP BY 1", 0.30),
    ("SELECT * FROM merchants WHERE merchant_id = ?", 0.25),
    ("SELECT p.*, m.name FROM payments p JOIN merchants m ON p.merchant_id = m.merchant_id WHERE p.captured_at BETWEEN ? AND ?", 0.20),
    ("SELECT date_trunc('day', captured_at) d, COUNT(*), SUM(amount_inr) FROM payments WHERE captured_at >= ? GROUP BY 1 ORDER BY 1", 0.15),
    ("SELECT merchant_id, payment_id, amount_inr FROM payments WHERE status = 'refunded' AND captured_at >= ?", 0.10),
]

# --- candidate engines with realistic-ish latency profiles ---------------
candidates = [
    FakeEngine("snowflake_xs", (50, 250, 1500, 4000)),
    FakeEngine("clickhouse_4core", (20, 80, 800, 3500)),
    FakeEngine("duckdb_local", (10, 60, 2200, 7000)),
]

def sample_query():
    r = random.random(); cum = 0.0
    for tmpl, w in production_workload:
        cum += w
        if r <= cum: return tmpl
    return production_workload[-1][0]

def run_one_user(engine, n_queries):
    latencies_ms = []
    for _ in range(n_queries):
        sql = sample_query()
        t0 = time.perf_counter()
        engine.execute(sql)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    return latencies_ms

def benchmark(engine, concurrency, n_per_user):
    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        futs = [ex.submit(run_one_user, engine, n_per_user) for _ in range(concurrency)]
        all_latencies = []
        for f in as_completed(futs):
            all_latencies.extend(f.result())
    return {
        "engine": engine.name, "concurrency": concurrency, "samples": len(all_latencies),
        "p50_ms": round(statistics.median(all_latencies), 1),
        "p95_ms": round(statistics.quantiles(all_latencies, n=20)[18], 1),
        "p99_ms": round(statistics.quantiles(all_latencies, n=100)[98], 1),
        "max_ms": round(max(all_latencies), 1),
    }

# --- run at three concurrency levels to see the curve -------------------
print(f"{'engine':<22}{'concurrency':<12}{'p50':>8}{'p95':>10}{'p99':>10}{'max':>10}")
for c in [1, 8, 32]:
    for eng in candidates:
        r = benchmark(eng, concurrency=c, n_per_user=20)
        print(f"{r['engine']:<22}{r['concurrency']:<12}{r['p50_ms']:>8}{r['p95_ms']:>10}{r['p99_ms']:>10}{r['max_ms']:>10}")
# Output:
engine                concurrency      p50       p95       p99       max
snowflake_xs          1               254.6    1561.2    3892.4    4421.7
clickhouse_4core      1                81.3     820.4    3409.6    3680.5
duckdb_local          1                60.7    2228.9    6918.0    7250.3
snowflake_xs          8               278.4    1648.7    4012.3    4892.1
clickhouse_4core      8                94.7     891.2    3502.4    3781.6
duckdb_local          8               187.3    3120.5    8240.7    9120.4
snowflake_xs          32              349.1    1983.4    4578.2    5342.6
clickhouse_4core      32              141.8    1240.7    3924.5    4280.1
duckdb_local          32              892.4    7820.6   14920.3   16240.8

The replay harness does the work that vendor benchmarks elide. The production_workload list defines the workload as weighted query templates — 30% merchant-rollup aggregations, 25% point lookups, 20% joins, 15% time-bucketed rollups, 10% status filters. The weights come from the query-history capture step; without them, every query has equal weight and the benchmark optimises for the wrong queries. The FakeEngine stubs give each candidate a realistic latency profile — point lookups fast, joins slower, with jitter that mimics real-world variance. In production, replace FakeEngine with the actual SDK for each candidate (snowflake-connector-python, clickhouse-driver, duckdb). run_one_user is the sequence of queries one user issues over the benchmark window; benchmark runs concurrency parallel users, each issuing n_per_user queries — the production-shaped concurrency that vendor benchmarks omit. The final loop runs at three concurrency levels (1, 8, 32) so you can see the curve, not just one point on it.

The output reveals what vendor decks hide. At single-user concurrency, DuckDB looks fastest at p50 — but it's slowest at p95 because it has no admission control, so a long join blocks everything. At 32-way concurrency, DuckDB's p99 has degraded by ~2.2x from its single-user value while ClickHouse and Snowflake degrade by only ~1.2x. Why concurrency reveals the engine's true behaviour: single-user benchmarks measure the engine's best case; concurrency benchmarks measure how it shares resources. An engine that scales gracefully (Snowflake, ClickHouse with proper config) keeps p95 within 1.5x of the single-user p95; an engine that doesn't (DuckDB without external orchestration) can blow up by 3–10x. The blow-up factor is the number you actually pay for in production. The benchmark also reveals that for this workload at 32-way concurrency, ClickHouse's p99 is competitive with Snowflake's p99 — but the next thing you measure is rupees-per-query, and that is where the real decision gets made.

Figure: P99 latency vs concurrency for three candidate engines. Concurrency (1, 8, 32 parallel users) on the x-axis, P99 latency in milliseconds on the y-axis, with a shaded "SLO breach" zone above 5000 ms. Snowflake grows modestly (3892 to 4578 ms), ClickHouse likewise (3409 to 3924 ms), while DuckDB climbs steeply (6918 to 14920 ms) — a concurrency cliff where its P99 nearly doubles from c=8 to c=32.
The same three engines, the same workload, three concurrency points. DuckDB wins at single-user p50 and loses badly at 32-way p99. A vendor benchmark that reports the leftmost point only is technically true and operationally useless.

In production this harness has additions the stub omits: result-correctness checks (run the same query against the current warehouse and the candidate, compare row counts and aggregate sums — this catches dialect bugs the latency benchmark won't), dollar-per-query (capture each engine's billing telemetry alongside the latency — Snowflake credits, ClickHouse Cloud computational units, BigQuery slot-ms — and convert to rupees per 1000 queries), warm-vs-cold split (run each query template twice and report both numbers separately), and percentile-of-percentiles tracking over the benchmark window (a 30-minute benchmark catches transient hiccups that a 5-minute one misses). PhonePe's 2024 procurement retrospective for their warehouse migration reportedly used a 12-day capture-replay benchmark; the team converged on Snowflake over Databricks not because it was faster — it wasn't, by 12% — but because the cost-per-query was 38% lower for the specific shape of their workload (lots of small queries from the merchant-dashboard layer).
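A minimal version of the result-correctness check, assuming each engine exposes an execute(sql) that returns a list of dict rows (unlike the FakeEngine stub above, which returns a summary):

# correctness_check.py — compare candidate results against the current warehouse
def fingerprint(rows):
    # row count plus a sum over numeric fields — cheap, and catches most dialect bugs
    numeric_sum = sum(
        v for row in rows for v in row.values() if isinstance(v, (int, float))
    )
    return (len(rows), round(numeric_sum, 6))

def check_query(current_wh, candidate, sql):
    ok = fingerprint(current_wh.execute(sql)) == fingerprint(candidate.execute(sql))
    if not ok:
        print(f"MISMATCH: {sql[:60]}...")   # a fast engine with wrong answers loses
    return ok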

What to actually report — the four numbers that matter

A benchmark report that runs to 30 pages will not be read by the person making the decision. The honest report is one slide with four numbers, each with its methodology footnote attached. Number one: P95 latency at production concurrency on the P80 workload, in milliseconds. This is the user-facing performance number; if it's higher than your SLO, the engine fails regardless of any other metric. Number two: P99 latency at production concurrency, in milliseconds. The P99 is what kills dashboards and ruins the experience for the long tail of your users; an engine with a great P50 and a bad P99 is unacceptable for finance and customer-facing workloads. Number three: rupees per 1000 queries at production volume. This is what the CFO will ask about. Report it for the realistic month-1 utilisation, not for the marketing-deck "burst" utilisation; an engine that costs 10% less at peak utilisation but 40% more at average utilisation is more expensive in real life. Number four: time-to-correct-result on a representative 5% of queries. Run those 5% on both the candidate and your current warehouse; confirm row-by-row equality. An engine that's 3x faster but returns wrong answers because of a JSON-extraction dialect bug is not a winner.
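The one-slide artefact can literally be four lines. An illustrative sketch, borrowing the ClickHouse 32-way numbers from the replay run above (the cost and correctness figures are made up for shape):

# four_number_report.py — the four numbers, each with its methodology footnote
report = {
    "p95_ms_at_prod_concurrency": (1240.7, "P80 workload, 32 users, warm/cold mix"),
    "p99_ms_at_prod_concurrency": (3924.5, "same run; SLO ceiling 5000 ms"),
    "inr_per_1000_queries":       (410.0,  "month-1 utilisation, not burst"),
    "correct_results_pct":        (100.0,  "5% query sample vs current warehouse"),
}
for metric, (value, footnote) in report.items():
    print(f"{metric:<30}{value:>10}   [{footnote}]")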

Beyond the four numbers, the report has three small qualitative sections. Failure modes observed during the benchmark — every engine has a few queries it falls apart on; document which ones and why. Concurrency cliff — at what concurrency level does p99 latency more than double from baseline? This is the upper bound of useful production capacity. Operational ergonomics — how long did it take to set up, load data, configure access? An engine that takes 4 weeks to get running is paid for in engineer-hours that don't show up in the latency or cost numbers. The Bengaluru fintech that started this chapter, after running their own two-week benchmark, found that the third vendor (the one with the 8x marketing claim) had a P99 at 32-way concurrency that was slower than their existing Redshift cluster. They did not buy it. The vendor's reaction was "you used the wrong workload" — which is exactly what a vendor says when a customer benchmarks their own workload.

Going deeper

Why TPC-H/TPC-DS persist despite being unrepresentative

TPC-H dates to 1999, as the successor to the mid-1990s TPC-D; TPC-DS was designed in the mid-2000s and standardised in 2012. They persist in vendor benchmarks for one reason: every engine has been tuned for them for decades. A vendor running TPC-H is running a benchmark their team has optimised over thousands of engineer-hours. The numbers are reproducible (mostly), the data generator is well-understood, and the audit committee has ratified the rules. The cost of moving to a more representative benchmark is borne by the vendor: they would have to retune. So they don't. The pragmatic-customer move is to read TPC numbers as a floor — if a vendor can't even win TPC, they can't win your workload — but never as a ceiling on what your workload will cost. The newer benchmarks (ClickBench, the Join Order Benchmark, the Star Schema Benchmark) try to fix this; ClickBench in particular publishes per-query results, cold and hot, with reproducible scripts. Treat ClickBench as one data point, not the answer.

The "internal vendor benchmark" trap

Vendors increasingly publish their own benchmarks ("the Snowflake performance index", "the Databricks SQL performance benchmark"), built on workloads they designed. These are a step removed from TPC and a step closer to "we picked the workload where we win". They are useful for one thing only: comparing the vendor's product to itself across versions. A "20% improvement quarter-over-quarter" on the vendor's internal benchmark is a real internal improvement. A "we beat the competitor" on the vendor's internal benchmark is advertising. Treat them accordingly. The Indian fintech procurement playbook from 2025 reportedly excludes vendor-designed benchmarks from evaluation outright, on the grounds that they are not falsifiable (you can't audit a benchmark designed by the same team that wins it).

How to detect query rewrites in a published benchmark

TPC permits "minor query rewrites" in audited submissions. Vendors abuse this. The classic moves: pre-aggregating into a materialised view that the query then trivially scans (turning a 30-second aggregation into a 0.3-second scan), pushing a predicate into the table definition (turning a filter into a partition prune), or rewriting a window function into an equivalent self-join that the engine optimises better. To detect these in a published submission, read the Auditor's Report (TPC's audit reports are public for benchmark submissions). Look for any query whose published runtime is more than 10x faster than the median across all submissions of that query — that's almost always a rewrite. The internal-benchmark variant is harder to catch because there's no auditor; the only signal is "the vendor published a number, but the methodology section doesn't include the exact SQL". When the SQL is missing, assume it has been rewritten.
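The more-than-10x-below-median heuristic mechanises easily over published per-query runtimes. A sketch with made-up numbers:

# rewrite_detector.py — flag per-query runtimes suspiciously far below the field
import statistics

# runtimes_by_query: {query_id: [runtime_s for each published submission]}
runtimes_by_query = {
    "q14": [31.2, 28.7, 0.4, 29.9],   # one submission ~70x faster than the median
    "q23": [12.1, 11.4, 10.8, 13.0],
}
for qid, runtimes in runtimes_by_query.items():
    med = statistics.median(runtimes)
    for r in runtimes:
        if r < med / 10:
            print(f"{qid}: {r}s vs median {med}s — likely a rewrite or materialised view")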

Cost-per-query at realistic utilisation, not peak

Most vendor pricing pages show a "compute price per hour" or "credits per second of warehouse uptime". To convert that into rupees-per-query, you need to know what fraction of your warehouse's uptime is spent actually running queries — the utilisation. A warehouse running at 80% utilisation has a cost-per-query that's roughly 1.25x the marginal cost; a warehouse running at 12% utilisation (the realistic average for an analytics team) has a cost-per-query that's 8x the marginal cost. Vendor decks always cite cost-per-query at peak utilisation; production runs at average utilisation. The honest calculation is (total monthly bill) / (total queries that month). For Snowflake at a typical Indian fintech in 2024, this number is reportedly ₹0.40–₹2.50 per query depending on workload shape; the vendor deck's "per-query" claim is usually an order of magnitude lower because it assumes 95% utilisation that nobody actually achieves. The cost section of your benchmark report uses your utilisation pattern, not the vendor's.
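The arithmetic in one sketch — the rates are illustrative, not any vendor's actual pricing:

# cost_per_query.py — effective rupees-per-query at different utilisations
hourly_rate_inr = 260.0            # illustrative rate; use your contract price
avg_query_runtime_s = 4.0          # compute-seconds a typical query actually uses
marginal_cost = hourly_rate_inr * avg_query_runtime_s / 3600   # ≈ ₹0.29

for utilisation in (0.95, 0.80, 0.12):
    # the warehouse bills for all uptime, busy or idle, so the idle share
    # is smeared across the queries that did run
    effective = marginal_cost / utilisation
    print(f"utilisation {utilisation:>5.0%}: ₹{effective:.2f} per query "
          f"({effective / marginal_cost:.1f}x marginal)")
# 95% → ₹0.30 (1.1x)   80% → ₹0.36 (1.2x)   12% → ₹2.41 (8.3x)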

The ethics of asking the vendor for help

Vendors will offer to "help" you run the benchmark. The temptation is real because their performance team knows their product better than you. The risk is that they tune their engine for your workload during the benchmark — adding indexes, materialised views, tuning warehouse sizes — and you can't replicate the tuning in production. The honest move is to accept the vendor's tuning only if they document it as a configuration script you'll run unchanged in production. If the vendor's "help" includes a senior engineer hand-tuning queries during the benchmark, your benchmark result is the vendor's capacity under expert hand-tuning, not your operational reality with your team. State this up-front. The good vendors say "fair point, here's a config script you can replicate"; the bad vendors say "well, we'll give you support during onboarding"; the worst vendors say "you don't understand how to use the product". The third category disqualifies itself.

Where this leads next

Honest benchmarking is a discipline, and like every discipline it has a culture. The teams that benchmark honestly publish their methodology, share their replay harness, and accept that the answer is sometimes "none of these engines is good enough for our workload, we'll keep what we have". The teams that don't benchmark honestly buy the engine with the loudest marketing and discover the cost surprise four months in. The four-number report (p95, p99, rupees-per-1000-queries, time-to-correct-result) is the artefact that survives a leadership change; the 30-page report is the artefact that gets filed and forgotten.
