Benchmarking honestly: TPC-C, TPC-H, YCSB, and lies
A vendor PDF lands in your inbox: "Our database does 4 million transactions per second." Eighteen months later, on a Friday at 6 p.m., production p99 latency hits 12 seconds and your CTO wants a word. Somewhere between the PDF and the pager, a benchmark lied to you — not by being wrong, but by answering the wrong question.
Every published benchmark is a lie of omission. TPC-C, TPC-H, and YCSB measure useful things, but the things they measure are almost never the thing your workload does. The discipline of honest benchmarking is to (a) report tail latency under sustained load with HdrHistogram, (b) refuse to compress the result into a single number, and (c) build a small benchmark that mimics your traffic before you trust anyone's number.
Why every benchmark you read is true and useless
A benchmark answers exactly one question: how fast did this specific workload run on this specific hardware against this specific configuration of this specific version of this specific database with these specific connection pool settings, this specific data set size, and these specific contention patterns, on the day the test was run?
Vendors quote the answer. They do not quote the question.
The first thing to internalise is that there is no such thing as a database's "throughput" the way there is such a thing as a CPU's clock speed. Throughput is a function of the workload. A database that does 200,000 simple key-value reads per second on YCSB workload A might do 800 transactions per second on TPC-C with the same hardware, because TPC-C transactions touch nine rows across five tables and contend on a hot warehouse counter. Both numbers are real. Neither is "the throughput of the database."
The honest job, then, is not to find the database with the biggest number. It is to:
- Understand what each named benchmark actually measures.
- Read between the lines of any reported result — what was tuned, what was hidden, what was averaged away.
- Build a tiny benchmark of your own workload, run it under load you actually expect, and read the tails.
This chapter teaches all three.
TPC-C, TPC-H, YCSB — what they actually measure
The three benchmarks you will see cited most often were designed for three completely different worlds. Mixing them up is the first amateur mistake.
TPC-C — the hot-warehouse OLTP test
TPC-C, ratified by the Transaction Processing Performance Council in 1992, simulates an order-entry system for a wholesale supplier. There are warehouses, districts, customers, items, stock, orders. Five transaction types — new-order, payment, order-status, delivery, stock-level — each touch multiple rows across multiple tables. The benchmark reports tpmC, transactions per minute committed, but only counts the new-order transactions; the other four types must run at fixed mix ratios alongside.
The thing that makes TPC-C real is the hot row. Every new-order transaction reads and updates a row in the WAREHOUSE table for its warehouse, and a row in the DISTRICT table for its district. With 10 warehouses and 10 districts each, the entire write-side of the workload contends on 110 rows. Adding more clients does not linearly scale tpmC — it scales until the lock contention saturates, then it plateaus. This is exactly what real OLTP looks like in production: at IRCTC, the 10:00 a.m. tatkal opens, and a hundred thousand users contend on the same handful of train rows.
There is also a response-time clause. At least 90% of new-order transactions must complete within 5 seconds. A vendor cannot quote tpmC without honouring this — quietly inflating throughput at the cost of tail latency disqualifies the run. (Honest vendors honour this. Marketing decks frequently do not.)
Why the response-time clause matters: without it you can game tpmC by accepting unbounded queueing. Run 10× the optimal client count, the database keeps committing transactions at the same rate, but each transaction sits in queue for 30 seconds before its turn. Throughput unchanged, latency destroyed, users gone. The TPC-C clause forces the reported throughput to be the throughput you can sustain while staying responsive — which is the throughput a real production system actually has to deliver.
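The queueing effect is easy to see in a toy discrete-event sketch (this is not TPC-C itself; the 1 ms deterministic server and client counts are made-up illustration parameters): a single server driven by closed-loop clients. Past saturation, adding clients leaves throughput flat while latency grows linearly with the client count.

```python
import heapq

def simulate_closed_loop(clients, service_s=0.001, duration_s=5.0):
    """Single deterministic FIFO server; each client re-issues the moment
    its previous request completes (a closed loop)."""
    issue_heap = [0.0] * clients          # next issue time per client
    heapq.heapify(issue_heap)
    server_free, completed, total_latency = 0.0, 0, 0.0
    while True:
        issued = heapq.heappop(issue_heap)
        start = max(issued, server_free)  # queue behind earlier requests
        finish = start + service_s
        if finish > duration_s:
            break
        server_free = finish
        completed += 1
        total_latency += finish - issued  # latency includes queueing time
        heapq.heappush(issue_heap, finish)  # client immediately re-issues
    return completed / duration_s, total_latency / completed

for c in (1, 10, 100):
    tput, lat = simulate_closed_loop(c)
    print(f"{c:>4} clients: {tput:7.0f} req/s, mean latency {lat * 1000:6.1f} ms")
```

With 1, 10, and 100 clients the throughput stays pinned near 1000 req/s while mean latency climbs from 1 ms to roughly 10 ms and 100 ms — exactly the "throughput unchanged, latency destroyed" regime the response-time clause exists to disqualify.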
TPC-H and TPC-DS — the analytical question generator
TPC-H is the OLAP counterpart. The schema is a star-ish layout — LINEITEM, ORDERS, CUSTOMER, SUPPLIER, PART, PARTSUPP, NATION, REGION — and the workload is 22 fixed queries (Q1 through Q22) chosen to exercise joins, aggregations, sub-queries, and window functions. The size is parameterised by scale factor: SF=1 is 1 GB of raw data, SF=100 is 100 GB, SF=10000 is 10 TB.
The metric is QphH@SF — queries per hour at scale factor SF — built around a geometric mean of the 22 query times (combined with a throughput test). The geometric mean is important: a system that is fast on Q1 and Q3 but catastrophic on Q9 cannot hide it the way an arithmetic average would allow.
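A quick sketch of why the geometric mean resists cherry-picking (the per-query rates here are made up for illustration): express each query as a rate and compare the two means.

```python
import math

# Hypothetical per-query rates: fast on two queries, catastrophic on a third.
rates_qph = [9000.0, 6000.0, 4.0]   # queries per hour

arith = sum(rates_qph) / len(rates_qph)          # ~5001 q/h — disaster hidden
geo = math.prod(rates_qph) ** (1 / len(rates_qph))  # 600 q/h — disaster visible

print(f"arithmetic mean: {arith:7.1f} q/h")
print(f"geometric mean:  {geo:7.1f} q/h")
```

The arithmetic mean lets two fast queries buy forgiveness for one pathological one; the geometric mean makes the pathological query drag the headline number down by an order of magnitude.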
TPC-DS extends this to 99 queries and a more realistic retail-data warehouse schema. ClickHouse, Snowflake, BigQuery, Redshift, and Databricks all publish TPC-H and TPC-DS numbers. The ones to read sceptically are the ones that quote a single query out of 22 ("Q1 in 0.4 s!"), or that quietly disable a query because their planner crashes on it.
YCSB — the Yahoo Cloud Serving Benchmark
YCSB, released by Yahoo Research in 2010, measures single-row key-value operations. There are no transactions, no joins, no aggregations. Just read(key), update(key, value), insert(key, value), scan(start_key, count), and a parameter file that says what mix of operations to run.
The standard workloads are:
- Workload A — 50/50 read/update. Models a session store.
- Workload B — 95/5 read/update. Models a photo-tagging app.
- Workload C — 100% reads. Models a user profile cache.
- Workload D — 95/5 read/insert with recency bias. Models a status update timeline.
- Workload E — 95/5 scan/insert. Models a threaded discussion.
- Workload F — read-modify-write. Models read-modify-write on user records (e.g. a counter).
YCSB is the right benchmark to read when the database under test is genuinely a key-value store — Redis, Cassandra, DynamoDB, HBase, FoundationDB. It is the wrong benchmark to compare a relational database against, because it deliberately removes the relational features that a relational database is paying for.
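Concretely, a YCSB workload is just a parameter file. The following is roughly what workload A's properties look like (parameter names follow the CoreWorkload shipped with the YCSB distribution; the counts are illustrative):

```properties
recordcount=1000000
operationcount=1000000
workload=site.ycsb.workloads.CoreWorkload

readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

# zipfian: a few hot keys take most of the traffic, like real caches
requestdistribution=zipfian
```

The `requestdistribution` line deserves attention when reading results: zipfian (hot keys) and uniform (no hot keys) can produce wildly different numbers on the same store.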
The lies vendors tell — and how to spot them
Once you know what each benchmark measures, you can read a vendor PDF the way a lawyer reads a contract. Here is the catalogue of common omissions, in roughly the order of how often you will encounter them.
Lie 1: averages instead of percentiles
The most common lie. "Our database has an average latency of 1.2 ms." Average latency is meaningless for a user-facing system. If your average is 1.2 ms but your p99 is 800 ms, then 1 in 100 requests takes 800 ms — which on a page that does 50 backend calls means roughly 40% of your page loads include an 800 ms stall.
The numbers that matter are p50, p95, p99, p99.9, and p99.99. The right way to read a system's behaviour is by its tail, not its mean.
Why averages lie under fan-out: a typical web page makes 50–200 backend calls. Even if each call has a 1% chance of hitting the p99 stall, a page of 100 calls has a 1 - 0.99^100 ≈ 63% chance of including at least one stalled call. The user-perceived latency is dominated by the slowest of the fan-out, not the mean of the fan-out.
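The arithmetic, under the stated 1%-per-call stall assumption:

```python
p_stall = 0.01   # chance any single backend call lands in the p99 tail
for fanout in (1, 50, 100, 200):
    # probability the page includes at least one stalled call
    p_page = 1 - (1 - p_stall) ** fanout
    print(f"{fanout:>4} calls per page -> {p_page:5.1%} of pages hit a stall")
```

At 50 calls that is already ~40% of pages; at 200 calls, ~87%. Tail latency is a page-level property long before it is a request-level one.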
Lie 2: coordinated omission
This one is subtle and devastating, and it is the reason Gil Tene built the HdrHistogram library. Most home-grown benchmarks have a loop that looks like:
```python
for i in range(N):
    t0 = time.perf_counter()
    db.get(random_key())
    t1 = time.perf_counter()
    record(t1 - t0)
```
This loop measures the latency of each request the loop got around to issuing. But suppose the database stalls for 1 second. During that stall, the benchmark is blocked, so it does not issue any requests. When the stall ends, the loop records one slow request — the one it finally got back — and resumes. The thousand requests that would have been issued during the stall, and would have all observed latencies between 1 ms and 1 second, are simply not recorded.
The result: the histogram has one spike at 1 second and a long flat region of fast measurements before and after. The reported p99 looks fine. The actual p99 — what users would have felt if requests had arrived on schedule — is much worse.
This is coordinated omission. The benchmark coordinates with the system under test to omit measurements during slow periods.
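The effect can be demonstrated without a database: simulate, in virtual time, a server that answers in 0.5 ms but freezes for one second, and measure it both ways. (The stall window, rates, and service time here are arbitrary illustration values.)

```python
def percentile(samples, p):
    xs = sorted(samples)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

def service_time(t):
    # Virtual server: 0.5 ms normally, frozen during [5.0, 6.0).
    if 5.0 <= t < 6.0:
        return (6.0 - t) + 0.0005   # the request waits out the freeze
    return 0.0005

# Closed loop: the next request is issued only when the previous returns.
# The freeze is recorded as ONE slow sample; the rest are never issued.
t, closed = 0.0, []
while t < 10.0:
    d = service_time(t)
    closed.append(d)
    t += d

# Open loop: a request is due every 1 ms regardless; latency is measured
# from the intended send time, so queueing behind the stall is counted.
open_lat, server_free = [], 0.0
for i in range(10_000):
    intended = i * 0.001
    start = max(intended, server_free)
    finish = start + service_time(start)
    server_free = finish
    open_lat.append(finish - intended)

print(f"closed-loop p99: {percentile(closed, 99) * 1000:8.2f} ms")
print(f"open-loop   p99: {percentile(open_lat, 99) * 1000:8.2f} ms")
```

Same virtual server, same one-second freeze: the closed-loop p99 stays at 0.5 ms while the open-loop p99 lands near a second, because the open loop charges every queued request for the time it spent waiting.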
The fix is to issue requests on a fixed schedule (say, one per millisecond) regardless of what the previous request did, and to record the latency of every request from when it should have been issued to when it actually completed:
```python
import time

schedule_interval = 0.001   # 1 ms = 1000 req/s
start = time.perf_counter()
for i in range(N):
    intended_send_time = start + i * schedule_interval
    now = time.perf_counter()
    if now < intended_send_time:
        time.sleep(intended_send_time - now)
    actual_send = time.perf_counter()
    db.get(random_key())
    actual_complete = time.perf_counter()
    # latency from when it SHOULD have been sent, not when we got around to sending
    record(actual_complete - intended_send_time)
```
If the database stalls for a second, this loop falls behind by a second's worth of requests, and when it resumes it backfills the missing requests as fast as it can — and every backfilled request records the full waited-plus-served time. The p99 now reflects what users would have actually felt.
Why HdrHistogram exists: it records every value across 6 orders of magnitude at 0.1% precision in constant memory, at production speed in C/Java/Python, and its expected-interval correction backfills the measurements a stall would otherwise omit. Tene's library is now the de facto standard. If a benchmark report does not say "HdrHistogram" or describe an equivalent correction, assume coordinated omission is hiding a fat tail.
Lie 3: warm caches and small data
A vendor benchmark on a 1 GB dataset on a server with 64 GB of RAM measures the speed of memory, not the speed of the database. Every page lives in the buffer pool. The disk is never touched. The reported throughput is 100× higher than what you will see when your data outgrows RAM — which is the exact moment the database stops being interesting.
The TPC standards prevent this for compliant runs (TPC-C ties the data size to the claimed throughput — more tpmC requires more warehouses — and TPC-H scales the data with the metric via the scale factor). But in a vendor blog post, the dataset size is rarely mentioned. If the post does not state (a) the data size, (b) the RAM, and (c) the working-set ratio, the number is not interpretable.
Lie 4: cherry-picked single queries
Reporting Q1 of TPC-H without the other 21 queries. Reporting only the read throughput of a 95/5 workload because the writes are slow. Reporting steady-state throughput without the 5-minute startup transient where the buffer pool was filling. A benchmark is a distribution of behaviours; quoting a single point of it is at best lossy and at worst dishonest.
Lie 5: tuning the benchmark, not the workload
Vendor X publishes a TPC-C number with synchronous_commit = off, batch size 10000, and a custom kernel scheduler. The number is real. None of it generalises to a default install. Always read the configuration appendix of a benchmark; if it is not there, assume the configuration is exotic.
The USE method — what to instrument while benchmarking
Brendan Gregg's USE method (Utilization, Saturation, Errors) is the simplest checklist for finding the bottleneck during a benchmark run. For every resource in the system — CPU, memory, disk, network, locks — you ask three questions:
- Utilization — what fraction of the resource is busy? (`top` for CPU, `iostat` for disk.)
- Saturation — how much extra work is queued waiting for the resource? (`vmstat` run-queue, `iostat` await time, lock-wait counters.)
- Errors — how many failures? (Timeouts, retries, OOM kills, swap.)
The crucial insight is that utilisation alone does not tell you whether you are at the limit. A disk at 60% utilisation can still be saturated if the queue depth is 32 and average wait is 50 ms. A CPU at 100% utilisation is fine if the run-queue length is 0. You need all three.
```python
# bench_with_use.py — log USE metrics alongside latency
import threading
import time

import psutil  # cross-platform disk/network/CPU counters

def sample_use():
    while True:
        # Run-queue proxy: 1-minute load average (Linux)
        with open("/proc/loadavg") as f:
            loadavg = f.read().split()[0]
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        cpu = psutil.cpu_percent(interval=None)
        ts = time.time()
        with open("use.log", "a") as f:
            f.write(f"{ts:.3f},{cpu},{loadavg},"
                    f"{disk.read_bytes},{disk.write_bytes},"
                    f"{net.bytes_sent},{net.bytes_recv}\n")
        time.sleep(0.5)

threading.Thread(target=sample_use, daemon=True).start()
# ... your benchmark runs here ...
```
Run that alongside the benchmark. When you see a tail-latency spike, line it up against the USE log: which resource went saturated at that moment? Disk write queue? Network buffer? CPU run-queue? That is your bottleneck. Without USE data, you will guess wrong.
A 60-line benchmark that does not lie
Here is the smallest honest benchmark you can write. It has scheduled sends (no coordinated omission), records every latency to an HdrHistogram, captures USE-style metrics during the run, and reports the full percentile spectrum.
Open your editor. Type this in. Do not copy-paste — type it. Every line will matter when you debug your own benchmark later.
```python
# honest_bench.py — closed-loop is a lie; this is the open-loop fix.
import random
import threading
import time

import psutil
import psycopg  # or your driver of choice
from hdrh.histogram import HdrHistogram

# 1) Connect ONCE. Driver setup time is not part of latency.
conn = psycopg.connect("dbname=shop user=riya")
conn.autocommit = True
cur = conn.cursor()

# 2) Pre-load realistic data: 1 million rows. Bigger than buffer pool.
cur.execute("CREATE TABLE IF NOT EXISTS orders(id BIGINT PRIMARY KEY, "
            "customer TEXT, amount NUMERIC, ts TIMESTAMP DEFAULT NOW())")
cur.execute("SELECT COUNT(*) FROM orders")
if cur.fetchone()[0] < 1_000_000:
    cur.executemany(
        "INSERT INTO orders(id, customer, amount) VALUES (%s, %s, %s) "
        "ON CONFLICT DO NOTHING",
        [(i, f"cust_{i % 100_000}", round(random.uniform(50, 50000), 2))
         for i in range(1_000_000)])

# 3) Open-loop: schedule sends at exactly 2000 req/s, regardless of latency.
TARGET_RPS = 2000
DURATION_S = 60
hist = HdrHistogram(1, 60_000_000, 3)   # 1 us to 60 s, 3 sig figs

# 4) Resource sampler — write USE metrics every 500 ms.
done = threading.Event()

def sample():
    while not done.is_set():
        with open("use.log", "a") as f:
            f.write(f"{time.time():.3f},{psutil.cpu_percent(0.0)},"
                    f"{psutil.disk_io_counters().write_bytes}\n")
        time.sleep(0.5)

threading.Thread(target=sample, daemon=True).start()

# 5) The hot loop — open-loop send schedule.
N = TARGET_RPS * DURATION_S
start = time.perf_counter()
for i in range(N):
    intended = start + i / TARGET_RPS
    now = time.perf_counter()
    if now < intended:
        time.sleep(intended - now)
    cur.execute("SELECT amount FROM orders WHERE id = %s",
                (random.randint(0, 999_999),))
    cur.fetchone()
    completed = time.perf_counter()
    # Latency from intended-send to completion. THIS is the user-felt latency.
    hist.record_value(int((completed - intended) * 1_000_000))
done.set()

# 6) Report the full tail, not a single number.
for p in [50, 90, 95, 99, 99.9, 99.99]:
    print(f"p{p:>5} : {hist.get_value_at_percentile(p) / 1000:>8.2f} ms")
print(f"max   : {hist.get_max_value() / 1000:>8.2f} ms")
print(f"count : {hist.get_total_count():>8}")
```
A run of this against Postgres on a stock M2 MacBook with the Postgres data directory on local SSD looks like:
```
$ python honest_bench.py
p   50 :     0.34 ms
p   90 :     0.71 ms
p   95 :     0.92 ms
p   99 :     2.18 ms
p 99.9 :    14.40 ms
p99.99 :   142.66 ms
max   :   401.20 ms
count :   120000
```
Read the table top to bottom. The p50 says the database is fast. The p99.9 says one in a thousand requests takes 14 ms — fine for a backend service, alarming for a real-time game. The p99.99 of 142 ms means one in ten thousand requests takes nearly a fifth of a second; if your service does a hundred database calls per page, roughly one page in a hundred includes a 142 ms stall. The max says the worst single request in this 60-second run took 400 ms. Probably a checkpoint flush; a glance at use.log at that timestamp will confirm.
This is what an honest benchmark report looks like. Compare it to "average latency 0.4 ms, throughput 2000 req/s". Both describe the same run. Only one tells you whether to ship.
Common confusions

- "Average latency is the latency of the database." It is not. Average latency tells you about the bottom 80% of requests. p99.9 tells you what happens to your users on a bad page, and p99.99 tells you what happens during traffic spikes. Always demand a percentile spectrum, never a mean.
- "More requests per second is always better." Not under contention. A TPC-C run that adds clients past the saturation point sees throughput plateau and tail latency explode. The right metric is throughput at a fixed tail-latency budget — e.g. "max sustained req/s while p99.9 ≤ 50 ms" — not raw throughput.
- "YCSB benchmarks compare key-value databases to relational databases fairly." They do not, in either direction. YCSB has no transactions, joins, or constraints, so a relational database wins by ignoring features it would have to honour in production. Conversely, YCSB's flat key space favours hash-indexed stores; range scans pay differently. Use YCSB only when both candidates are actually key-value stores.
- "If I run my benchmark for one minute, I have a steady-state measurement." Probably not. Buffer pools warm up over tens of seconds. Compaction kicks in after the LSM tree has accumulated enough levels (often minutes). Postgres autovacuum runs on its own schedule. Run for at least 10 minutes after a 5-minute warm-up before reading a number.
- "My closed-loop benchmark gave the same answer as an open-loop one, so coordinated omission isn't a problem here." It probably is, and you cannot tell without comparing them. The two coincide only if the system never stalls. Any time the system stalls — GC pause, fsync flush, replica catch-up, network jitter — the closed-loop number underreports tail latency. The right default is open-loop with HdrHistogram and a fixed send rate.
- "TPC-C results across vendors are directly comparable." Only if both runs are TPC-published, audited, and at the same warehouse count. The audit process catches cheats like ignoring the 5-second response-time clause; vendor blog posts skip the audit. Compare TPC-C numbers from `tpc.org/tpcc/results/` only.
Going deeper
How HdrHistogram actually stores every measurement
HdrHistogram is a clever data structure: for a precision p (say 3 significant digits) and a max value M, it allocates buckets that are exponentially spaced — small ranges have small bucket widths, large ranges have larger widths, but each bucket spans at most one part in 10^p of its value. This lets it record every value in 6 orders of magnitude with 0.1% precision in a few kilobytes of memory.
The clever part is the value recording with expected interval: recordValueWithExpectedInterval(value, expectedInterval) adds the actual measurement and also synthetically adds correction values for any expected intervals that were missed during the long measurement. This is the formal fix for coordinated omission. Tene's HdrHistogram javadoc is required reading.
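A toy sketch of the log-linear bucketing principle (this is not HdrHistogram's actual internal layout, just the idea): quantize each integer value to a granularity of one part in 1024 of its magnitude, which caps relative error below 0.1% while needing only a bounded set of distinct stored values.

```python
def quantize(value: int) -> int:
    """Round value down to a granularity of 2^(bit_length - 11):
    ~1024 linear steps per power of two, so relative error < 1/1024."""
    unit = 1 << max(0, value.bit_length() - 11)
    return (value // unit) * unit

for v in (1_234, 56_789, 4_321_987, 59_999_999):
    q = quantize(v)
    print(f"{v:>10} -> {q:>10}  (error {(v - q) / v:.4%})")
```

Because the step size doubles with each power of two, the table of possible stored values grows logarithmically with the range, which is why six orders of magnitude fit in kilobytes rather than gigabytes.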
USE versus RED — both are right, both are partial
Brendan Gregg's USE is resource-centric: utilisation, saturation, errors per resource. Tom Wilkie's RED (Rate, Errors, Duration) is service-centric: per endpoint, what is its request rate, its error rate, its latency distribution? RED is what you graph in Grafana for SLOs. USE is what you graph when SLOs are red and you are trying to find the bottleneck. A good benchmark report shows both: top-level RED for the workload, drilled-down USE for the resource that saturated first.
TPC-C in production — IRCTC, Razorpay, the order-entry pattern
When IRCTC opens tatkal at 10:00 a.m., the load profile is essentially TPC-C's new-order transaction at extreme scale — every booking reads and updates a small set of seat-availability rows that the entire crowd contends on. Razorpay's payment ledger has the same shape: every settlement contends on the merchant balance row. The reason TPC-C remains the gold standard for OLTP is that every serious order-entry workload, including these, has a hot-row contention pattern that synthetic uniform-random benchmarks miss entirely. Read the TPC-C specification — yes, the actual PDF — once in your career.
Why TPC-H queries are hand-picked, not generated
TPC-H's 22 queries were chosen by humans to cover the join shapes (star, snowflake, self-join), aggregation patterns (group-by, having, distinct), sub-query forms (correlated, exists, in), and date-range filters that real BI workloads contain. Q1 is a streaming aggregation — easy. Q9 is a multi-way join with a non-equality predicate — punishing. Q21 has a correlated sub-query that catches a class of planner bugs. A database that wins on Q1 but loses on Q9 is a database with a weak query optimiser; this is exactly the diagnostic the geometric-mean reporting is designed to surface.
Running your own benchmark — the workflow
- Capture production traffic — use Postgres `pg_stat_statements`, the MySQL slow log, or app-level traces to extract the top-30 query templates with their actual frequency.
- Pick a load shape — usually a Poisson process around the mean rate, with realistic burstiness (peak/mean ratio of 3–5×).
- Replay with open-loop drivers — `pgbench -f` with a `--rate` limit, `sysbench` for OLTP-style loads, `oha` or `wrk2` for HTTP; all of these support scheduled-send semantics.
- Run for 30 minutes minimum — 5 min warm-up, 25 min steady state.
- Report the percentile spectrum and the USE log — not a single number.
- Re-run with the configuration of the system you are comparing against — same data size, same warm-up, same client mix.
If you cannot do all six, the result is suggestive, not conclusive. Most internal benchmarks are suggestive. That is fine — just label them honestly.
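The Poisson load shape in the workflow above is easy to generate — exponentially distributed inter-arrival gaps around the mean rate (a minimal sketch; extra burstiness shaping is left out):

```python
import random

def poisson_schedule(rate_rps: float, duration_s: float, seed: int = 42):
    """Intended send times for a Poisson arrival process at rate_rps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)   # exponential inter-arrival gap
        if t >= duration_s:
            return times
        times.append(t)

sends = poisson_schedule(rate_rps=200, duration_s=10)
print(len(sends))   # roughly 2000, with natural clumping
```

Feed these intended-send timestamps into the open-loop driver from earlier in the chapter: latency is still measured from the intended time, never from when the loop got around to sending.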
The one-paragraph history
The TPC was founded in 1988 by Omri Serlin to stop the "benchmark wars" of the late 1980s, where every database vendor published incomparable numbers from incomparable workloads. The TPC's contribution was audited, reproducible, comparable benchmarks. More than three decades later, vendors have rediscovered the original sin: cite a number, hide the workload. The discipline of running your own benchmark is the only defence.
Where this leads next
- Running a database in production — chapter 185: the operational discipline a benchmark cannot capture. Backups, upgrades, on-call.
- The 30-year arc and where databases go next — chapter 186: the long history that makes today's benchmarking arguments make sense.
- fsync, write barriers, and durability — chapter 3: the layer below benchmarks. If you benchmark an `fsync`-disabled run and quote it as durable, you are lying.
- The append-only log — chapter 2: where the read-versus-write speed asymmetry first appeared, and where every benchmark you will run hides it.
References
- Transaction Processing Performance Council, TPC-C Specification v5.11 — the canonical OLTP benchmark, with its rare-among-vendors response-time clause. tpc.org/tpcc.
- Brian F. Cooper et al., Benchmarking Cloud Serving Systems with YCSB (SoCC, 2010) — the YCSB paper. research.yahoo.com / ACM.
- Gil Tene, How NOT to Measure Latency — the talk and the slide deck that named coordinated omission. Strange Loop 2015. HdrHistogram source: github.com/HdrHistogram/HdrHistogram.
- Brendan Gregg, The USE Method — the canonical reference for resource-bottleneck diagnosis during benchmarks. brendangregg.com/usemethod.html.
- Andy Pavlo, What Are We Doing With Our Lives? Nobody Cares About Our Concurrency Control Research (SIGMOD, 2017) — keynote on benchmark drift in academic database papers. Andy Pavlo's CMU page.
- Mark Callaghan, Benchmarketing — Facebook's database engineer on how to read vendor benchmarks. smalldatum.blogspot.com.
- Frank McSherry, Scalability! But at what COST? (HotOS, 2015) — the Configuration that Outperforms a Single Thread paper. github.com/frankmcsherry.
- Kyle Kingsbury, Jepsen analyses — the consistency-correctness side of benchmarking. jepsen.io.