Trace storage at scale: Tempo's columnar approach
It is 02:14 IST on a Tuesday at a Pune SaaS company. Jishant is debugging a customer report that a single API call took 8 seconds last Friday around 19:30. He has the trace_id from the customer's screenshot. He pastes it into the Jaeger UI, watches the spinner for 38 seconds, and gets a query timeout — Cassandra coordinator overloaded. The trace exists. The Cassandra cluster that holds it has 14TB of indexed span data across 240 tables (one per attribute), and the secondary-index lookup is fanning out to nine partitions across six nodes. The customer's question — "what was slow?" — is gated on a query that costs ₹1.20 of compute every time someone asks. Last quarter's tracing bill was ₹38 lakh; this quarter it is on track for ₹62 lakh. Tempo exists because of this 02:14 moment, repeated across thousands of teams, and its central design choice is to invert everything Jaeger does about indexing. This chapter is about what that inversion actually buys and what it costs.
Tempo stores traces as Parquet files on object storage (S3, GCS, Azure Blob), grouped by trace_id, partitioned by tenant and time. The only index it maintains is a sparse bloom filter per block that says "trace_id might be in here" with a 1% false-positive rate. Lookups by trace_id are O(blocks_to_check × bloom_check + 1 fetch); searches by attribute use TraceQL, which scans Parquet columns directly and exploits column-pruning + predicate pushdown to read only the bytes it needs. The result is 10–30× cheaper storage than Jaeger-on-Cassandra at the cost of slower attribute-search and no Cassandra-style indexes — a trade most production fleets accept.
The indexing problem that broke Jaeger at scale
Distributed-tracing backends face a query mix nothing else in observability quite has. The dominant query is a trace_id lookup — "fetch the tree of spans for this exact id" — which behaves like a key-value GET on a 16-byte key. The secondary query is search by attribute — "find all traces where service.name=checkout-api AND http.status_code=500 AND duration_ms>1000 in the last 6 hours" — which behaves like an OLAP scan over millions of rows with selective predicates. The tertiary query, used by service maps and dependency graphs, is aggregate over all traces in a window — "count edges between services". A naive backend tries to serve all three from the same indexed storage layer and pays for it.
The first generation of OSS tracing — Zipkin and Jaeger — chose Cassandra (or Elasticsearch) and built secondary indexes for every searchable attribute. Span data lands in traces keyspace; copies of the trace_id indexed by service_name, by operation_name, by tag.environment, by tag.http.status_code, and so on land in service_name_index, operation_names_index, tag_index, etc. The Jaeger Cassandra schema (v003.cql) defines roughly a dozen tables; the Elasticsearch backend creates one index per day per service per attribute key. Each new attribute the application emits — customer.tier, feature.flag.experiment_id, aws.region — multiplies index size and write amplification. At Razorpay-scale fleets (30K RPS, 60 spans per trace, 40 indexed attributes per span), the index footprint exceeds the raw span footprint by 8×–14×. The storage bill is dominated not by traces but by the redundant indexes built so you can search them.
The second-order failure is read amplification during incidents. A trace_id lookup in Jaeger's Cassandra schema reads from traces (hits one partition by trace_id's hash). A search like service.name=checkout-api in the last 1 hour scatters across 12 hourly partitions, each fanning to 3 replicas, each potentially serving a stale view if the cluster's hinted handoffs are behind. Adding AND http.status_code=500 requires either a server-side filter (full scan of the service_name partitions) or a separate tag_index lookup intersected client-side. The query coordinator becomes the bottleneck, and at 02:14 when the on-call needs the trace right now, the cluster cannot deliver.
Why the index design dominates the bill: Cassandra's storage cost is a function of total bytes written across all tables, not just the primary traces table. Each index is a full copy of the trace_id keyed by a different attribute, with its own SSTable layout, its own compactions, its own repairs. At 9× write amplification, every byte of trace data costs you 9 bytes in Cassandra; on EBS gp3 SSD at roughly ₹70/GB/month, a 14TB raw fleet becomes a 126TB billed fleet. Tempo's bloom filter, by contrast, is a few megabytes per block that scale only with trace count — adding attributes adds nothing to the index. The bill is a function of trace bytes alone, and the per-GB cost on S3 standard tier is roughly ₹2/GB/month — a 35× headline difference per GB, partially eroded by S3's GET request charges, but still a 10–30× net reduction in production.
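The amplification arithmetic above fits in a few lines. A hedged back-of-envelope, using only this section's rough figures — none of these constants come from a real price sheet:

```python
# cost_model.py — back-of-envelope for the index-amplification bill
# (illustrative arithmetic using this chapter's rough figures only)

RAW_TB = 14                  # raw span bytes
WRITE_AMP = 9                # Cassandra: primary table + ~8 index copies
EBS_INR_PER_GB_MONTH = 70    # hot SSD (gp3-class), approximate
S3_INR_PER_GB_MONTH = 2      # object storage standard tier, approximate
TEMPO_OVERHEAD = 1.01        # bloom + sparse index ≈ 1% of data size

cassandra_gb = RAW_TB * 1024 * WRITE_AMP          # billed hot-storage bytes
tempo_gb = RAW_TB * 1024 * TEMPO_OVERHEAD         # billed object-storage bytes

print(f"Cassandra billed footprint : {cassandra_gb/1024:.0f} TB")
print(f"Tempo billed footprint     : {tempo_gb/1024:.1f} TB")
print(f"footprint ratio            : {cassandra_gb/tempo_gb:.1f}x")
print(f"per-GB price ratio         : {EBS_INR_PER_GB_MONTH/S3_INR_PER_GB_MONTH:.0f}x")
```

The two printed ratios are the two independent levers — write amplification and hot-versus-cold pricing; S3 GET request charges and query compute then erode the combined saving toward the 10–30× net figure quoted above.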
Tempo's storage block — Parquet, sorted by trace_id
Tempo's central choice is to lean entirely on Apache Parquet — the columnar format that Hive, Spark, and the rest of the OLAP ecosystem read natively — and to make the only required index a per-block bloom filter. Spans land in the ingester, which holds them in a memory ring keyed by trace_id. Every 5 minutes (configurable via max_block_duration and max_block_bytes), the ingester sorts every trace_id it has seen, lays out all spans of each trace contiguously, encodes the result as Parquet row groups, and uploads four files to object storage:
- data.parquet — columnar span data, sorted by trace_id, then by start_time_unix_nano.
- bloom-N.bloom — bloom filters partitioned across N shards (default 6), keyed by trace_id.
- index.proto — a sparse index mapping (trace_id range) → row_group_offset, written every ~64MB of column data.
- meta.json — block-level statistics: time range, total spans, total traces, compaction level.
A typical 5-minute block from a 30K-RPS fleet is roughly 600MB compressed Parquet, holding ~9M traces of ~60 spans each. The bloom filters together are ~6MB. The sparse index is ~120KB. The total per-block index overhead is under 1% of the data size, which is the root of the cost win.
The Parquet schema mirrors the OTLP Span proto closely. Columns include trace_id, span_id, parent_span_id, name, service_name (lifted from resource_attributes), start_time, duration_ns, status_code, plus a repeated attributes group with nested key, value columns. Resource attributes are stored once per trace if all spans share them (the common case for service-level fields like host.name or k8s.cluster.name). Span attributes are stored per-span. The columnar layout means that a query touching only trace_id and service_name reads roughly 2% of the file's bytes; a query touching trace_id, service_name, duration_ns, and http.status_code reads roughly 8%. This is column pruning, the OLAP property Tempo exploits hardest.
# tempo_block_anatomy.py — write a Tempo-shaped Parquet block and inspect it
# pip install pyarrow numpy
import os, hashlib, time, random
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

random.seed(42)
np.random.seed(42)

# 1. Generate 100,000 traces of ~6 spans each (a small block for demo)
N_TRACES = 100_000
SPANS_PER_TRACE = 6
SERVICES = ["frontend", "checkout-api", "payments-api",
            "ledger-api", "fraud-api", "razorpay-gateway"]

trace_ids, span_ids, parent_ids = [], [], []
service_names, op_names = [], []
start_times, durations = [], []
status_codes, http_codes = [], []
base_t = int(time.time() * 1e9)

for i in range(N_TRACES):
    tid = hashlib.sha256(f"trace-{i}".encode()).digest()[:16]
    parent = b"\x00" * 8  # root has no parent
    is_err = random.random() < 0.004
    for j in range(SPANS_PER_TRACE):
        sid = hashlib.sha256(f"span-{i}-{j}".encode()).digest()[:8]
        trace_ids.append(tid)
        span_ids.append(sid)
        parent_ids.append(parent)
        service_names.append(SERVICES[j % len(SERVICES)])
        op_names.append("POST /checkout" if j == 0 else f"step_{j}")
        start_times.append(base_t + i * 60_000 + j * 1000)
        durations.append(int(np.random.lognormal(11, 0.6)))  # ns
        status_codes.append(2 if (is_err and j == SPANS_PER_TRACE - 1) else 1)  # 2 = OTLP ERROR
        http_codes.append(500 if (is_err and j == 0) else 200)
        parent = sid

# 2. Build the Arrow table — sorted by trace_id, then start_time
order = sorted(range(len(trace_ids)),
               key=lambda k: (trace_ids[k], start_times[k]))
def reorder(xs): return [xs[k] for k in order]

table = pa.table({
    "trace_id": reorder(trace_ids),
    "span_id": reorder(span_ids),
    "parent_span_id": reorder(parent_ids),
    "service_name": pa.array(reorder(service_names)).dictionary_encode(),
    "name": pa.array(reorder(op_names)).dictionary_encode(),
    "start_time": reorder(start_times),
    "duration_ns": reorder(durations),
    "status_code": pa.array(reorder(status_codes), type=pa.int8()),
    "http_status_code": pa.array(reorder(http_codes), type=pa.int16()),
})

# 3. Write Parquet with row-group size and dictionary encoding
out = "/tmp/tempo_block.parquet"
pq.write_table(table, out,
               row_group_size=50_000,  # spans per row group
               compression="zstd",
               use_dictionary=True,
               write_statistics=True)

# 4. Inspect what we just wrote
size = os.path.getsize(out)
md = pq.read_metadata(out)
print(f"file size      : {size/1024/1024:.1f} MB")
print(f"total spans    : {md.num_rows:,}")
print(f"row groups     : {md.num_row_groups}")
print(f"per-span bytes : {size/md.num_rows:.2f}")

# Column pruning in action: how many on-disk bytes do just these two
# columns occupy? (This is what an S3 range read would actually fetch.)
WANT = {"trace_id", "service_name"}
pruned = sum(md.row_group(rg).column(c).total_compressed_size
             for rg in range(md.num_row_groups)
             for c in range(md.row_group(rg).num_columns)
             if md.row_group(rg).column(c).path_in_schema in WANT)
print(f"trace_id+service_name on disk : {pruned/1024/1024:.1f} MB")
print(f"column-pruned read fraction   : {pruned/size:.1%}")
A representative run shows 600,000 spans across 12 row groups at a few tens of bytes per span on disk, then the on-disk footprint of the two selected columns. Note that the column-pruned fraction comes out large in this demo — a sizeable share of the file — because the schema has only nine columns and the barely-compressible trace_id and span_id bytes dominate; a production Tempo block with 40-plus attribute columns puts the same two columns nearer 2–8% of the bytes.
Per-line walkthrough. The line order = sorted(... key=lambda k: (trace_ids[k], start_times[k])) is the property that makes Tempo's lookup work — because spans are sorted by trace_id on disk, fetching all spans of one trace is a contiguous read of one or two row groups, not a scatter across the file. The line pq.write_table(... row_group_size=50_000) controls the granularity at which Parquet's column statistics and dictionary encodings are written; a 50K-row group is roughly 800KB to 1.5MB depending on schema, sized to balance scan overhead against memory. The line pa.array(...).dictionary_encode() is where most of the compression comes from — service_name repeats 100K times across 6 services, so dictionary encoding stores 6 strings plus 600K small integers instead of 600K full strings. Why dictionary encoding plus zstd compression matters: the OTLP wire format averages roughly 400 bytes per span, so raw protobuf-on-disk would be ~240MB for 600K spans; Parquet lands an order of magnitude tighter because (i) sorting by trace_id places each id's six-span repetition contiguously, where dictionary and run-length encoding collapse it; (ii) low-cardinality fields like service_name, status_code, http.method dictionary-compress to a few bits per row; (iii) zstd cuts another ~2× across the residual. The headline "Tempo is cheap" claim is mostly Parquet, not Tempo's own code.
The final prints demonstrate column pruning — the Parquet footer records the on-disk byte range of every column chunk, so a reader that wants only trace_id and service_name can fetch exactly those ranges and skip everything else. In this nine-column demo the skipped fraction is modest because trace_id bytes dominate the file; in a real Tempo block with 40-plus attribute columns, a query touching service_name and duration_ns reads closer to 8–10% of the bytes. This is the OLAP property that makes attribute search affordable on object storage even without an index — you do not need to find the matching spans by index because reading just the columns you care about is fast enough.
Trace_id lookup — bloom filter, then read one row group
A trace_id GET is the dominant query in Tempo. The flow is: the querier receives a trace_id, asks each block in the time window "do you have it?", checks each block's bloom filter (loaded from object storage on first reference, then cached), and for every block whose bloom says "maybe" it reads the sparse index, finds the right row group, and fetches that single row group with an HTTP range request to S3. The whole sequence is typically two to four S3 GETs per matching block.
The bloom filter is the load-bearing structure. Tempo uses a sharded bloom, where the trace_id is hashed into one of N shards (default 6, sized via bloom_filter_shard_size_bytes). Each shard holds ~250K trace_ids at a 1% false-positive rate in roughly 300KB — so a 1.5M-trace block has a total bloom of ~1.8MB across shards. The querier loads only the relevant shard for a given trace_id (bloom shard = xxhash(trace_id) % N), so per-query bloom bandwidth is 300KB per block checked, not 1.8MB. A 30-day retention window produces 8,640 raw five-minute blocks (288/day × 30), but compaction keeps the steady-state count the querier must consult in the low hundreds; at a 1% false-positive rate that means the one true match plus a handful of false-positive blocks per lookup. The query therefore reads a few hundred KB of bloom shards per block checked (cached after first reference), a sparse-index read, and one ~800KB row group per "maybe" — on the order of 4MB of S3 traffic per lookup, completed in under 200ms on a regional S3 bucket.
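The shard-size figures quoted above follow from the standard bloom-filter sizing formula, m = −n·ln(p)/(ln 2)² bits for n items at false-positive rate p. A quick sketch to verify them — textbook formulas, not Tempo's actual sizing code:

```python
# bloom_sizing.py — check "~250K trace_ids at 1% FPR in roughly 300KB"
import math

def bloom_size(n_items: int, fpr: float):
    """Classic sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes."""
    m_bits = -n_items * math.log(fpr) / (math.log(2) ** 2)
    k = max(1, round((m_bits / n_items) * math.log(2)))
    return m_bits / 8, k  # (bytes, hash count)

size_bytes, k = bloom_size(250_000, 0.01)
print(f"250K ids @ 1% FPR : {size_bytes/1024:.0f} KB, k={k} hashes")

# Size scales with trace count only — attribute count never enters the formula
for n in (250_000, 1_500_000):
    b, _ = bloom_size(n, 0.01)
    print(f"{n:>9,} ids -> {b/1024:.0f} KB")
```

250K ids at 1% comes out just under 300KB with 7 hash functions, matching the per-shard figure — and since attribute count never appears in the formula, this one line is the entire index-cost story.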
# tempo_lookup_simulator.py — measure bloom + row-group reads for a trace_id GET
# pip install pybloom-live xxhash
from pybloom_live import BloomFilter
import xxhash, hashlib, random, statistics

random.seed(7)

# Build 100 blocks, each with 100K trace_ids and a sharded bloom (6 shards)
N_BLOCKS = 100
N_TRACES_PER_BLOCK = 100_000
N_SHARDS = 6
FPR = 0.01

blocks = []
for b in range(N_BLOCKS):
    block_trace_ids = []
    # 10% headroom per shard: hashing never splits ids perfectly evenly,
    # and pybloom_live raises once a filter is filled to capacity
    shards = [BloomFilter(capacity=int(N_TRACES_PER_BLOCK / N_SHARDS * 1.1),
                          error_rate=FPR) for _ in range(N_SHARDS)]
    for i in range(N_TRACES_PER_BLOCK):
        tid = hashlib.sha256(f"block-{b}-trace-{i}".encode()).digest()[:16]
        block_trace_ids.append(tid)
        shard = xxhash.xxh64(tid).intdigest() % N_SHARDS
        shards[shard].add(tid)
    blocks.append({"trace_ids": set(block_trace_ids), "shards": shards})

# 1. Query: ask every block's bloom (sharded), count maybes
def query(target):
    maybes, true_hits = 0, 0
    shard = xxhash.xxh64(target).intdigest() % N_SHARDS  # same hash as ingest
    for blk in blocks:
        if target in blk["shards"][shard]:
            maybes += 1
            if target in blk["trace_ids"]:
                true_hits += 1
    return maybes, true_hits

# 2. 1,000 real trace_ids drawn from random blocks (average over many distinct
#    targets — the bloom is deterministic for any single one)
maybes_dist = []
for _ in range(1000):
    b = random.randrange(N_BLOCKS)
    i = random.randrange(N_TRACES_PER_BLOCK)
    tid = hashlib.sha256(f"block-{b}-trace-{i}".encode()).digest()[:16]
    m, h = query(tid)
    maybes_dist.append(m)

# 3. Now the realistic miss mix: 999 unknown trace_ids
unknowns = [hashlib.sha256(f"unknown-{i}".encode()).digest()[:16]
            for i in range(999)]
unknown_maybes = [query(u)[0] for u in unknowns]

print(f"blocks scanned per query : {N_BLOCKS}")
print(f"true-hit query maybes : "
      f"avg {statistics.mean(maybes_dist):.2f} (1 real + {statistics.mean(maybes_dist)-1:.2f} false)")
print(f"unknown query maybes : "
      f"avg {statistics.mean(unknown_maybes):.2f} (all false-positive)")
print(f"FPR observed : "
      f"{statistics.mean(unknown_maybes)/N_BLOCKS:.3%} (target: {FPR:.1%})")
print(f"per-query bloom checks : {N_BLOCKS} (one per block, sharded)")
print(f"per-query S3 GETs : "
      f"~{1+round(statistics.mean(maybes_dist))} (bloom shards cached after warmup)")
Sample run:
blocks scanned per query : 100
true-hit query maybes : avg 1.95 (1 real + 0.95 false)
unknown query maybes : avg 0.99 (all false-positive)
FPR observed : 0.985% (target: 1.0%)
per-query bloom checks : 100 (one per block, sharded)
per-query S3 GETs : ~3 (bloom shards cached after warmup)
Per-line walkthrough. Each BloomFilter(..., error_rate=FPR) shard is sized for its share of the block's trace_ids at the configured FPR (with headroom, since hashing never splits ids perfectly evenly); sharding lets the querier load 1/N of the bloom per query instead of the whole thing. The line shard = xxhash.xxh64(target).intdigest() % N_SHARDS is the deterministic mapping from trace_id to shard — read and write must agree, so the same hash function is used in the ingester (when building the bloom) and the querier. The bloom check if target in blk["shards"][shard] is, in production, a memory-resident lookup after the shard is fetched once. The ground-truth check if target in blk["trace_ids"] stands in for what real Tempo does on a "maybe" — the querier reads the sparse index to find the row group, fetches the row group with an HTTP range request, and confirms.
The output shows the cost model in numbers: across 100 blocks, a real trace_id triggers ~2 row-group fetches (one true hit, ~1 bloom false-positive at 1% FPR × 100 blocks). An unknown trace_id triggers ~1 false-positive row-group fetch on average — wasted work, but bounded. The total per-query S3 cost is ~3 GETs, ~4MB transfer, completing in 100–250ms regional. Why this is dramatically different from Cassandra: Cassandra's trace_id lookup costs roughly the same wall-clock at low load — one partition read, one or two SSTable seeks. The difference is the steady-state cost: Cassandra's storage layer needs SSDs to keep that latency under load, and the cluster pays the SSD bill 24/7 even though the query rate is bursty. Tempo's storage is cold S3 at 3% the per-GB cost, and the query latency is the same because S3 GET is fast enough. The bill collapses by an order of magnitude not because Tempo is faster but because it stops paying for hot-storage capacity Tempo never needs.
TraceQL — column-scan attribute search without a secondary index
The harder query — "find all traces where service.name=checkout-api AND duration > 500ms AND http.status_code=500 in the last 6 hours" — does not have a trace_id, so the bloom filter is useless. Tempo's answer is TraceQL, a query language that compiles to a Parquet column scan with predicate pushdown. The query above looks like:
{ resource.service.name = "checkout-api"
&& span.http.status_code = 500
&& duration > 500ms }
The query planner identifies the time window (last 6 hours = 72 blocks at 5-minute cadence), then for each block:
1. Reads the column statistics from Parquet metadata (per row group: min, max, distinct count).
2. Prunes row groups whose service_name dictionary does not contain "checkout-api", or whose duration_ns max is below 500ms, or whose http.status_code distinct set excludes 500.
3. For surviving row groups, reads only the four columns the predicate touches (trace_id, service_name, duration_ns, http.status_code) — column pruning.
4. Applies the predicate row-by-row to identify matching trace_ids.
5. For each match, fetches the full span tree (a second pass at row-group granularity to read all columns for those trace_ids).
The key efficiency is that step 2 (row-group pruning) typically eliminates 80–95% of the data before any column read happens. Steps 3–4 read the residual columns at roughly 8% of the block size due to column pruning. Step 5 is bounded by the match rate — if 0.4% of traces match the predicate, only ~0.4% of trace bytes are fetched in full. The aggregate I/O is 2–4% of the time-windowed block bytes, completed in 1–6 seconds for a 6-hour query depending on selectivity.
This is meaningfully slower than Cassandra-with-secondary-indexes for the same predicate (which would complete in 200–800ms if the indexes are warm), but the cost calculus is different: TraceQL's per-query S3 GET cost is on the order of ₹0.04, and the storage substrate is 30× cheaper. A team running TraceQL accepts a 5–20× slower attribute-search at 1/10th the storage cost and 1/100th the operational overhead (no Cassandra repair, no compaction tuning, no hinted-handoff backlog). For most teams that is the right trade because most queries are still trace_id GETs, not attribute searches.
Why row-group pruning is the load-bearing optimisation, not column pruning: column pruning saves ~10× because most queries touch 4–5 of 40 columns. Row-group pruning saves another 8–10× because Parquet stores per-row-group min/max/dictionary statistics that let the planner skip whole row groups entirely. The two stack multiplicatively — roughly 100× I/O reduction — which is why a column-scan over object storage can compete with an indexed fetch from Cassandra. Without either optimisation, a 6-hour query at the fleet size above would scan all ~43GB of windowed block data (72 blocks × 600MB) to find ~120MB of matching traces and would not be a viable design. With both, Tempo reads on the order of 1GB, paid at S3 GET prices.
Compaction, retention, and the lifecycle of a block
Blocks land on object storage compressed at level 0 (recently flushed by an ingester). A background compactor reads multiple level-0 blocks within a time range, merges them into a larger level-1 block, deletes the originals, and repeats up the levels (default: 4 levels, with a level-4 block holding ~24h of data). Compaction sorts by trace_id again, rebuilds the bloom filter and sparse index, and re-encodes Parquet — typically yielding 10–20% better compression because dictionary encodings get richer at scale.
Compaction is the operational workhorse of the cluster. A 30-day retention setup with 5-minute level-0 blocks generates 8,640 small blocks; without compaction the querier would have to bloom-check all 8,640 on every query — thousands of object-storage reads before any span data is touched. With compaction, the steady state is ~50–80 large level-4 blocks plus the trailing edge of small recent blocks, cutting the per-query fan-out to a few dozen bloom fetches (total bloom bytes scale with trace count, so compaction reduces the request count far more than the bytes). Retention is enforced by the compactor as well — blocks older than the retention window are deleted from object storage, no separate process needed.
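The fan-out arithmetic behind that claim is worth making explicit — a rough sketch with this chapter's figures (real compaction output also depends on tenant count and block-size limits):

```python
# compaction_arithmetic.py — per-query block fan-out, with and without compaction
# (rough arithmetic using this chapter's figures, not Tempo's actual scheduler)

L0_PER_DAY = 288          # one 5-minute level-0 block per flush interval
RETENTION_DAYS = 30
L4_SPAN_DAYS = 1          # a level-4 block covers ~24h of data

no_compaction = L0_PER_DAY * RETENTION_DAYS
# Steady state: prior days fully compacted to level 4, plus today's trailing
# edge of small level-0 blocks still waiting to move up the levels
compacted = (RETENTION_DAYS - 1) // L4_SPAN_DAYS + L0_PER_DAY

print(f"blocks to bloom-check, no compaction : {no_compaction:,}")
print(f"blocks to bloom-check, steady state  : {compacted:,}")
print(f"fan-out reduction                    : {no_compaction // compacted}x")
```

A steady-state fan-out in the low hundreds is what keeps 1%-FPR bloom checks cheap enough to run on every lookup.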
The block-lifecycle property that matters most operationally is that everything is immutable. Once a block is written it is never modified; compaction writes a new block and deletes the old. This means object storage's eventual consistency is sufficient (no read-your-writes problem at the storage layer), backups are trivial (just sync the bucket), and disaster recovery is the same set of operations as normal queries. Compare with Cassandra, where SSTables are also immutable but the layer above them — partitions, hints, repairs, anti-entropy — is full of mutable, eventually-consistent state that breaks during partial cluster failures. Tempo's lifecycle simplicity is the second-largest reason teams migrate to it after the cost.
Common confusions
- "Tempo cannot search by attribute, only by trace_id." False — TraceQL handles attribute search by column-scanning Parquet. It is slower than a secondary index would be (1–6 seconds vs 200ms) but bounded and meaningfully cheaper. The misconception comes from Tempo's pre-1.0 design (2020–2021) which genuinely was trace_id-only; Tempo 2.0+ ships TraceQL.
- "Bloom filters are an index." Not in the database sense — a bloom is a probabilistic membership test, not a sorted lookup. It tells you "this block might contain trace_id X" with a 1% false-positive rate but cannot tell you "find all traces where service=foo". A real index (Cassandra's secondary index, Elasticsearch's inverted index) maintains a sorted mapping from attribute value to trace_id. Tempo deliberately does not maintain such a structure for non-trace_id fields.
- "Parquet is just a file format." Parquet is more like a query substrate — its row-group statistics, dictionary encoding, predicate pushdown, and column pruning are exploited by the query engine. Tempo's TraceQL would not work on raw protobuf or JSON; the columnar layout with statistics is what makes the column-scan affordable.
- "S3 is too slow for trace queries." Regional S3 GETs complete in 30–80ms p50 and 200–400ms p99. Aggregating 3–5 GETs in parallel for a single query yields ~200ms total — comparable to a Cassandra hot read and faster than a cold one. The hidden cost is GET request charges (~₹0.30 per 1,000 requests), which start to matter at >1 query/sec sustained — large fleets cache aggressively at the querier.
- "Tempo means you don't need a hot path." Wrong — Tempo's ingesters hold the most recent ~30 minutes of spans in memory, indexed by trace_id, before flushing to object storage. Queries hitting recent traces go to the ingester, not S3. The "everything on S3" story is for warm-and-cold; the most recent window is RAM.
- "All trace backends are roughly equivalent." Cost differences are not subtle. A 14TB-raw fleet on Jaeger-Cassandra runs at roughly ₹38–62 lakh/month at Indian cloud prices; the same fleet on Tempo-on-S3 is ₹3–6 lakh/month, with the Cassandra cluster dropped entirely. The query latency degradation on attribute search is real, but the bill difference is the bigger lever for most teams.
Going deeper
Why columnar (not row-stored) wins for OLAP-style trace search
A trace span has roughly 40 fields — IDs, timestamps, names, status, plus ~30 attribute key-value pairs. Row-stored systems (Cassandra, traditional RDBMS) lay out all 40 fields contiguously per row; reading the four fields a predicate touches still pulls the other 36 fields' bytes off disk because the storage unit is the row. Columnar systems lay out each field's values contiguously; reading four fields fetches roughly 4/40 = 10% of the bytes. For predicate-heavy workloads with high column count and low column-touch rate per query, columnar is 5–20× cheaper in I/O. Tracing workloads sit at the extreme end of this spectrum because spans have many attributes and queries typically touch 3–6 of them.
The Tempo Parquet schema and OTLP fidelity
Tempo's Parquet schema is a flattened version of the OTLP Span proto. Repeated fields (attributes, events, links) become Parquet LIST types. Resource attributes live in a top-level resource_attrs group, separated from span attributes (span_attrs) so that resource-level columns can be dictionary-encoded across all spans of a service. Span events (logs attached to a span) are stored as a sub-table that TraceQL can also query. The schema preserves OTLP semantics — trace_state, span_kind, status.code, status.message, the lot — so a round-trip OTLP-in → Parquet-on-disk → OTLP-out is lossless for the fields the SDK populated. Custom attribute types (bytes, double arrays, mixed-type arrays) are flattened to JSON strings if the schema cannot represent them, which is the only fidelity edge case in normal use.
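The JSON-fallback behaviour in that last sentence is easy to picture with a toy flattener — a sketch of the described behaviour, not Tempo's schema code:

```python
# attr_flatten.py — toy "native column or JSON-string fallback" flattener
import json

def flatten(attrs: dict) -> dict:
    """Map attribute values to column-friendly scalars; anything the schema
    cannot represent natively becomes a JSON string."""
    out = {}
    for k, v in attrs.items():
        if isinstance(v, bool) or type(v) in (str, int, float):
            out[k] = v              # representable as a native column value
        else:
            out[k] = json.dumps(v)  # arrays, maps, mixed types -> JSON string
    return out

span_attrs = {
    "http.status_code": 500,
    "customer.tier": "enterprise",
    "retry.delays_ms": [100, 200, 400],   # array -> JSON fallback
    "flags": {"a": True},                 # map   -> JSON fallback
}
print(flatten(span_attrs))
```

The fallback keeps ingestion lossless at the cost of making those values opaque strings to predicate pushdown — a query on retry.delays_ms degrades to string matching, which is why the edge case matters.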
Cardinality concerns even in Tempo
Tempo does not have Prometheus-style cardinality limits because there are no time-series series — each span is a row, and a high-cardinality attribute like customer_id just becomes another column with many distinct values. But cardinality still matters for TraceQL performance: a query predicate on a high-cardinality column cannot be answered from row-group dictionary statistics (the dictionary is too large to be useful for pruning), so the planner falls back to a full column scan over surviving row groups. A query like span.customer.id = "user_293847" over 6 hours of data scans every survivor row group's customer_id column — ~10–40GB of compressed bytes for a busy fleet. The fix is to use service-scoped queries (resource.service.name = "x" AND span.customer.id = "y") so service-name pruning eliminates most blocks before the high-cardinality column is touched.
Comparison with ClickHouse-based tracing (SigNoz, Uptrace)
A growing class of OSS tracing backends — SigNoz, Uptrace — store spans in ClickHouse, which is also columnar but with a richer secondary-index story: ClickHouse supports skip indexes (bloom_filter, tokenbf_v1, ngrambf_v1), materialised projections, and aggressive merge-tree compaction. The trade-off vs Tempo is: ClickHouse offers faster attribute search (200–800ms range) at higher operational cost (a stateful ClickHouse cluster instead of S3 + a stateless query layer). For teams already running ClickHouse for analytics, SigNoz is attractive; for teams who want zero-stateful-cluster overhead, Tempo's S3-only design wins. The Tempo-vs-ClickHouse decision is essentially "do I want to operate a clustered DB or an object-storage bucket" — the cost and latency curves cross at roughly 200K spans/sec sustained.
Reproduce this on your laptop
# Reproduce the storage-block experiments on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pyarrow numpy pybloom-live xxhash
python3 tempo_block_anatomy.py # write and inspect a Parquet block
python3 tempo_lookup_simulator.py # measure bloom + row-group fetches
# To run real Tempo locally:
docker run -d -p 3200:3200 -p 4317:4317 \
-v "$(pwd)/tempo.yaml:/etc/tempo.yaml" \
grafana/tempo:2.4.0 -config.file=/etc/tempo.yaml
# A minimal tempo.yaml uses local-disk backend; for S3 set storage.trace.backend=s3.
Where this leads next
- Exemplars: linking metrics to traces — once traces are stored cheaply, the next question is how to navigate from a histogram bucket increment back to a specific stored trace. Exemplars carry the trace_id pointer.
- TraceQL — querying traces like you query metrics — a deeper look at the query language, including span-relation operators ({ child.name = "x" }), arithmetic over span attributes, and the structural query patterns Cassandra-era backends could not express.
- Cardinality: the master variable — the Prometheus-side framing of the same problem; Tempo sidesteps it for storage but inherits it inside TraceQL.
- Long-term storage: Thanos, Cortex, Mimir — the metrics-side analogue of Tempo's design, where the same "bloom + sparse index + columnar blocks on S3" pattern is applied to time series.
The next chapter follows traces from Tempo into the metrics layer through exemplars — the small-but-load-bearing data that lets a Grafana panel showing histogram_quantile(0.99, ...) be one click away from the actual slow trace.
References
- Grafana Tempo documentation — block format — the canonical reference for Tempo's block layout, bloom filter sharding, and compaction levels.
- Apache Parquet specification — the format Tempo writes; row groups, column chunks, page-level statistics, dictionary encoding.
- Annanay Agarwal et al., "Introducing Tempo" (Grafana Labs blog, 2020) — the announcement post that explains the design rationale: trace_id-first lookups, object-storage economics.
- Cassandra at Jaeger — schema deep dive (Yuri Shkuro, 2017) — the Jaeger-on-Cassandra schema and the index-amplification problem this chapter contrasts against.
- TraceQL specification — the query language semantics, predicate pushdown rules, span-relation operators.
- Yuri Shkuro, Mastering Distributed Tracing (Packt, 2019), chapter 7 — the prior-art on tracing storage that Tempo's design responded to.
- Trace sampling: head, tail, adaptive — the previous chapter, covering how the traces that reach Tempo were selected.
- Zipkin, Jaeger, Tempo: an OSS lineage — the architectural lineage that places Tempo in context.