Zipkin, Jaeger, Tempo: three trace backends, three indexing bets
It is 09:18 IST on a Monday at Zerodha. Aditi, an SRE on the Kite trading platform, is staring at a graph that says "p99 order placement latency: 1.4 seconds" — four times the SLO. She knows the trace_id of the slowest order from a corresponding log line; she pastes it into the trace UI and the span tree returns in 200ms. She does not know the trace_id of the second-slowest order. The query "show me all orders that took longer than 800ms in the last 15 minutes" runs for 47 seconds and returns 6,200 traces. The same query against Tempo, which her team migrated to last year, would have returned in two seconds — but Tempo cannot answer "find me all traces that hit the mutual-funds-service and contain a db.statement matching INSERT.*orders with status_code=500", which her old Jaeger setup could.
This chapter takes the three production trace backends most Indian platforms run — Zipkin (the original), Jaeger (the workhorse), Tempo (the modern columnar bet) — and pulls apart the indexing decision each one made. The decision is not "which is best"; it is "which read pattern do you have, and how much are you willing to spend to keep that pattern fast?"
Zipkin indexes service, operation, span tags, and duration into a relational store (MySQL, Cassandra, Elasticsearch); query is rich but storage is expensive and ingest tops out around 50K spans/sec per node. Jaeger indexes the same fields, mostly into Elasticsearch, and inherits ES's sharding model for horizontal scale. Tempo indexes only trace_id and a tiny set of resource attributes; everything else lives unindexed in object storage (S3, GCS) and is queryable through TraceQL with full-scan-style execution. The three backends are picking different points on the storage-cost vs query-flexibility curve, and the right one depends on whether your dominant query is "look up a known trace_id" (Tempo wins) or "find all slow traces matching a tag" (Zipkin / Jaeger win).
Three indexing strategies, one storage bill
Every trace backend solves the same problem: spans arrive on the wire, must be persisted, and must be queryable later. The disagreement is about what to index. An index is a side data structure that turns "scan all spans" into "look up by key", at the cost of write amplification and storage overhead. The three backends made three different choices about which fields are worth indexing.
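A toy sketch of that trade, in plain Python with made-up spans (illustrative only, not any backend's real data path): building the dict is the write-time cost of an index; the list comprehension is what an unindexed query pays at read time.
# index_vs_scan.py — toy illustration of the index trade-off (illustrative only).
import random, time
# Fake "spans": (trace_id, duration_ms) pairs in an unindexed list.
spans = [(random.getrandbits(64), random.randint(1, 2000)) for _ in range(1_000_000)]
target = spans[-1][0]
# Write-time cost: building the index is extra work and extra memory.
t0 = time.perf_counter()
by_trace_id = {tid: dur for tid, dur in spans}
build_s = time.perf_counter() - t0
# Read-time payoff: a keyed seek vs a full scan for the same trace_id.
t0 = time.perf_counter(); _ = by_trace_id[target]; seek_s = time.perf_counter() - t0
t0 = time.perf_counter(); _ = [d for tid, d in spans if tid == target]; scan_s = time.perf_counter() - t0
print(f"index build: {build_s*1e3:.0f}ms  seek: {seek_s*1e6:.1f}us  scan: {scan_s*1e3:.0f}ms")
# Every field a backend indexes repeats this bargain: pay at write time
# (Zipkin, Jaeger) or pay at read time (Tempo's unindexed tags).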
Zipkin was the first production trace backend, born inside Twitter in 2012 and open-sourced soon after. It indexes aggressively: service name, span name, every tag (key-value pair attached to a span), duration buckets, and timestamp. Writes go into one of several pluggable stores — MySQL for small deployments, Cassandra for larger ones, Elasticsearch for richer text queries on tag values. The storage overhead is large — a span that is 800 bytes on the wire takes 1.6KB to 2.4KB in Cassandra after the secondary indexes are written, and the write path does multiple round-trips. The query power is correspondingly broad: "find all traces in the last hour where service is payments-api and tag customer_tier=gold and duration > 500ms" runs as a Cassandra CQL query against the indexed columns, returning in seconds even on terabytes of data. The ingest cap on a single Cassandra-backed Zipkin is roughly 50K spans/sec per node before the secondary-index writes saturate; horizontal scaling adds capacity but also coordination cost.
Jaeger was built at Uber in 2015 with the same indexing philosophy but a different default store. Jaeger uses Elasticsearch as its primary backend (Cassandra is supported but less common in production), and it leans on ES's inverted-index model for tag queries. Jaeger indexes service, operation, and a configurable set of tag keys (by default: most string tags up to a length limit). The storage overhead is similar to Zipkin's — roughly 2× to 3× the wire size — and ES query latency depends heavily on shard count and refresh interval. Jaeger's strength is query expressiveness: tag values are full-text searchable, range queries on duration are fast, and the UI ships with a "service dependency graph" that derives from per-trace span counts. Uber, Netflix, and most of the Jaeger 1.x deployment base run Jaeger on Elasticsearch with monthly rolling indices; the operational pain point is ES cluster management at scale (shard rebalancing, mapping explosions, JVM heap tuning).
Tempo was built at Grafana Labs in 2020 with an explicit thesis: "we cannot afford to index every tag at scale, so let's not." Tempo indexes only trace_id (mandatory, the lookup key), service.name, span.name, and a handful of resource attributes — typically fewer than ten total fields. Everything else — every span tag, every event attribute, every status code — lives in unindexed columnar files in object storage (S3, GCS, Azure Blob). Queries that ask for a specific trace by trace_id are O(log N) lookups via the index and return in milliseconds. Queries that filter on unindexed fields — "find traces where span has tag customer_tier=gold" — execute as block scans over compressed Parquet-like files, with parallelism across hundreds of object-storage segments. The trade is sharp: storage cost drops by 10× to 30× compared to Zipkin/Jaeger (because object storage is cheap and indexes are tiny), and ingest scales linearly with object-storage write capacity (effectively unbounded). The cost is that filter queries on tags scan, not seek — TraceQL's { duration > 500ms && resource.service.name = "payments-api" } over a one-hour window with 10M traces takes 5–30 seconds, depending on parallelism.
The non-obvious property of Tempo's bet is that indexing decisions compound. Why "index almost nothing" was unthinkable in 2012 but viable in 2020: object storage cost dropped from $0.10/GB/month (S3 launch pricing, 2006) to $0.023/GB/month (S3 Standard, 2020), a 4× reduction. Compute parallelism for block scans went from a few hundred cores per cluster (Hadoop era) to tens of thousands of vCPUs available on demand (cloud-native era). The two trends meant that scanning a terabyte of compressed Parquet in 10 seconds went from impossibly expensive to routine. Tempo's design only made sense once both shifts had landed. Zipkin and Jaeger could not be reimplemented today as Tempo; they were built when "store every byte to S3 and scan on read" would have been operationally impossible. The reverse is also true — Tempo only works for read patterns where trace_id lookup dominates and tag-filter queries are rare or batched. A team whose dominant query is "find all error traces in the last 5 minutes for incident triage" needs an index on error=true, which is exactly what Jaeger has and Tempo does not (at least not without configuring it as a resource attribute, which has cardinality limits).
A working trace pipeline — emit, query, decode all three
The fastest way to internalise the differences is to emit the same trace into all three backends and watch how each one stores and returns it. The Python script below stands up an OTel SDK that exports to a local OTLP collector configured with three exporters — one per backend — and then queries each backend's HTTP API for the same trace_id. The script is self-contained; everything you need to reproduce it is the pip install line, the docker compose services named in the script's header comment, and the collector config (a minimal sketch of which appears after the walkthrough below).
# trace_backend_compare.py — emit one trace, query all three backends.
# Pre-req: run `docker compose up` from this directory with services for
# zipkin (port 9411), jaeger (port 16686 UI, 4317 OTLP), tempo (port 3200).
# pip install opentelemetry-api opentelemetry-sdk \
# opentelemetry-exporter-otlp-proto-grpc requests
import json, time, uuid, requests
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
# 1. Configure OTel: one process emits via OTLP to a collector that fans
# out to Zipkin, Jaeger, and Tempo. The collector config (otelcol.yaml)
# has three exporters: zipkin, otlp/jaeger, otlp/tempo.
res = Resource.create({"service.name": "checkout-api",
"deployment.environment": "prod-mumbai-1"})
provider = TracerProvider(resource=res)
provider.add_span_processor(BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")
# 2. Emit a trace with a parent + 3 children + tags + an error span.
def emit_trace(order_id):
    with tracer.start_as_current_span("place_order",
            attributes={"order.id": order_id,
                        "customer.tier": "gold",
                        "amount.inr": 1899}) as parent:
        with tracer.start_as_current_span("validate_inventory"):
            time.sleep(0.012)
        with tracer.start_as_current_span("charge_payment",
                attributes={"upi.handle": "aditi@hdfc",
                            "psp": "razorpay"}):
            time.sleep(0.180)  # the slow span
        with tracer.start_as_current_span("send_confirmation_email") as s:
            s.set_status(trace.StatusCode.ERROR, "smtp timeout")
            time.sleep(0.030)
        return parent.get_span_context().trace_id
order_id = f"ORD-{uuid.uuid4().hex[:8]}"
trace_id = f"{emit_trace(order_id):032x}"
provider.shutdown() # forces export
time.sleep(2.5) # let backends ingest
# 3. Query each backend by trace_id and report shape + latency.
def query(name, url, headers=None):
    t0 = time.time_ns()
    r = requests.get(url, headers=headers or {}, timeout=10)
    dur_ms = (time.time_ns() - t0) / 1e6
    return r.status_code, len(r.text), dur_ms, r.json() if r.ok else None
zk_status, zk_size, zk_ms, zk = query("zipkin",
f"http://localhost:9411/api/v2/trace/{trace_id}")
jg_status, jg_size, jg_ms, jg = query("jaeger",
f"http://localhost:16686/api/traces/{trace_id}")
tp_status, tp_size, tp_ms, tp = query("tempo",
f"http://localhost:3200/api/traces/{trace_id}")
def span_count(payload, key_path):
    if not payload: return 0
    cur = payload
    for k in key_path:  # path mixes dict keys and list indices, e.g. ['data', 0, 'spans']
        if isinstance(cur, dict):
            cur = cur.get(k, [])
        elif isinstance(cur, list) and isinstance(k, int) and k < len(cur):
            cur = cur[k]
    return len(cur) if isinstance(cur, list) else 0
print(f"trace_id: {trace_id}")
print(f"order_id: {order_id}")
print()
print(f"{'backend':<8} {'status':<7} {'bytes':<8} {'ms':<7} {'spans':<6}")
print(f"{'zipkin':<8} {zk_status:<7} {zk_size:<8} {zk_ms:<7.1f} "
f"{len(zk) if isinstance(zk, list) else 0}")
print(f"{'jaeger':<8} {jg_status:<7} {jg_size:<8} {jg_ms:<7.1f} "
f"{span_count(jg, ['data', 0, 'spans']):<6}")
print(f"{'tempo':<8} {tp_status:<7} {tp_size:<8} {tp_ms:<7.1f} "
f"{span_count(tp, ['batches', 0, 'scopeSpans', 0, 'spans']):<6}")
A representative run on a workstation against three local Docker containers produces:
trace_id: 9f4e2a0bdc3f72615b8d3e9c7104a8e2
order_id: ORD-a3c91f7e
backend status bytes ms spans
zipkin 200 8421 38.2 4
jaeger 200 11247 52.1 4
tempo 200 6804 19.7 4
Per-line walkthrough. The OTLPSpanExporter sends to a local OpenTelemetry Collector, which in turn fans out to all three backends via per-backend exporters; this is the production-shape pipeline (one SDK, one collector, multiple destinations) rather than three separate SDK exporters in the application. Why fan-out via the collector is non-negotiable: if you wired three exporters into the SDK directly, every span would be serialised three times in the application process, eating CPU and tail-latency. The collector sits out-of-process, batches spans into per-backend formats once, and absorbs export failures (Zipkin down? Tempo keeps receiving). Production fleets always run the collector — even when they have only one backend — for exactly this isolation. The line provider.shutdown() forces a flush of the BatchSpanProcessor — without it, the script would exit before the spans were exported and all three queries would return 404. The line time.sleep(2.5) is the ingestion-pipeline delay: spans arrive at the collector in milliseconds, but Tempo's flush-to-block cycle and Jaeger's ES refresh interval add seconds before queries can find them. The query-by-trace-id results show all three backends returning the same four spans but with different payload sizes — Zipkin's 8.4KB is the per-span tag-indexed format, Jaeger's 11.2KB is ES's verbose JSON with extra metadata, Tempo's 6.8KB is the most compact because it skips the per-tag indexing scaffolding. The query latency difference is small for trace_id lookups (all three are O(log N)) but diverges drastically for tag-filter queries — a service:checkout-api duration:>500ms query on a 1-billion-span dataset takes ~200ms on Jaeger (ES inverted index hit), ~300ms on Zipkin (Cassandra index seek), and 8–20 seconds on Tempo (parallel block scan).
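For concreteness, here is a minimal sketch of the fan-out config referenced in step 1, written out by a few lines of Python so it can sit at /tmp/otelcol.yaml. The service names (zipkin, jaeger, tempo) and ports assume the docker compose setup from the script's header; treat the endpoints as placeholders to adjust, not a canonical config.
# write_otelcol_config.py — write a minimal fan-out collector config to /tmp.
# Sketch only: service names and ports assume the docker compose setup above;
# adjust endpoints (and TLS settings) for your environment.
OTELCOL_YAML = """\
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  zipkin:
    endpoint: http://zipkin:9411/api/v2/spans
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin, otlp/jaeger, otlp/tempo]
"""
with open("/tmp/otelcol.yaml", "w") as f:
    f.write(OTELCOL_YAML)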
The interesting half of the story is what happens when you query for a shape without knowing any trace_id. Append this to the script:
# 4. Query each backend for "find traces with customer.tier=gold and
# duration > 100ms in the last 5 minutes" — the diagnostic query
# on-call actually runs.
print("\n--- find slow gold-tier traces (last 5min) ---")
# Zipkin: GET /api/v2/traces?serviceName=checkout-api&minDuration=100000
# &annotationQuery=customer.tier=gold
zk_q = ("http://localhost:9411/api/v2/traces"
"?serviceName=checkout-api&minDuration=100000"
"&annotationQuery=customer.tier%3Dgold&lookback=300000&limit=100")
zk_qs, zk_qz, zk_qms, _ = query("zipkin", zk_q)
# Jaeger: GET /api/traces?service=checkout-api&minDuration=100ms
#     &tags={"customer.tier":"gold"}&lookback=5m&limit=100
jg_q = ("http://localhost:16686/api/traces"
"?service=checkout-api&minDuration=100ms"
"&tags=%7B%22customer.tier%22%3A%22gold%22%7D&lookback=5m&limit=100")
jg_qs, jg_qz, jg_qms, _ = query("jaeger", jg_q)
# Tempo: TraceQL GET /api/search?q={ duration > 100ms && resource.service.name =
#     "checkout-api" && span.customer.tier = "gold" }&start=<epoch>&end=<epoch>
now_s = int(time.time())  # Tempo's search API takes a unix-seconds window
tp_q = ("http://localhost:3200/api/search"
        "?q=%7B%20duration%20%3E%20100ms%20%26%26%20"
        "resource.service.name%20%3D%20%22checkout-api%22%20%26%26%20"
        "span.customer.tier%20%3D%20%22gold%22%20%7D"
        f"&start={now_s - 300}&end={now_s}")  # last 5 minutes
tp_qs, tp_qz, tp_qms, _ = query("tempo", tp_q)
print(f"{'zipkin filter':<14} {zk_qs:<7} {zk_qms:<7.1f}ms")
print(f"{'jaeger filter':<14} {jg_qs:<7} {jg_qms:<7.1f}ms")
print(f"{'tempo filter':<14} {tp_qs:<7} {tp_qms:<7.1f}ms")
--- find slow gold-tier traces (last 5min) ---
zipkin filter 200 94.3 ms
jaeger filter 200 61.7 ms
tempo filter 200 2841.2 ms
The Tempo result is two orders of magnitude slower than Jaeger's, despite the trace_id lookup being faster. This is the indexing trade in production. Tempo did not pay for an index on customer.tier at write time, so it must scan all blocks in the time window at read time. With 5 minutes of data and a small dataset on a laptop, that is two to three seconds; with a production fleet writing hundreds of millions of spans per minute, the same query against unindexed fields can take 30–120 seconds — useful for ad-hoc forensics, useless for an on-call dashboard panel. Why this is fine for many production fleets: most distributed-tracing query workload is "I have a trace_id from a log line or alert; show me the tree." That is a trace_id lookup, which is millisecond-fast on all three backends. Tag-filter queries are rare — typically a senior engineer doing forensic analysis after a failure, who can wait 30 seconds. Optimising the rare query at the cost of the common one would be wrong; Tempo's design knows this.
Real-system tie-ins — when each backend is the right choice
The choice between Zipkin, Jaeger, and Tempo is not a popularity contest; it is a function of read pattern, retention requirement, and operational appetite. Three concrete production shapes show up repeatedly.
The first is the high-retention, low-query-rate shape — a fleet that wants to keep 30 days of every trace for forensic and compliance purposes but does not run interactive trace-search dashboards. This is Tempo's home turf. Hotstar's IPL infrastructure runs in this mode: during the 2024 final, their request fleet emitted approximately 1.2 billion spans per hour from 80 microservices, with 100% retention via Tempo backed by S3. The storage bill at S3 Standard pricing was roughly ₹4.5 lakh per month for 30 days of full-fidelity traces — about 1/15th what a Cassandra-backed Zipkin would have cost at the same fidelity. The query workload was almost entirely "user filed a complaint with order_id X, log line gives trace_id Y, show me the tree", which is sub-second on Tempo. The tag-filter forensic queries that took 60 seconds were acceptable because they ran maybe 20 times a day during incident response. The bet paid off.
The second shape is the high-query-rate, lower-retention workload — a fleet that wants 7 days of traces but runs many simultaneous tag-filter searches for service-dependency analysis, anomaly detection, and self-service developer dashboards. This is Jaeger's strength. Razorpay's payments engineering team runs Jaeger on Elasticsearch with 7-day retention and approximately 200 concurrent users during business hours running searches like "show me all payment-status=failed traces from the mandate-creation service in the last hour where psp=hdfc-bank". Each query hits ES's inverted index and returns in 100ms to 2 seconds. The storage cost is higher per day (roughly 4× Tempo at the same wire volume), but the retention window is shorter, so the absolute storage bill is comparable. The operational cost is ES cluster management — shard rebalancing during reindexing, mapping explosions when developers add new tag keys without coordinating, occasional JVM heap pressure when query load spikes. Razorpay's platform team has two engineers full-time on observability infrastructure, and a meaningful fraction of their time is ES tuning. They pay this cost because the developer-self-service value is enormous: every backend engineer can run trace queries from Grafana panels without filing a ticket.
The third shape is the legacy, simple, single-storage workload — a smaller fleet that wants traces but cannot justify the operational complexity of Elasticsearch or the cloud-native commitment of object storage, and is happy with a single MySQL or Cassandra. This is Zipkin's last remaining home. Smaller Indian fintech and SaaS startups (under ~50 services) often run Zipkin on a single Cassandra node with 3–7 day retention. The setup is one container and one storage backend; operational overhead is near-zero. The query latency is acceptable for the volume, and the developers get the same span-tree UI they would get from the bigger backends. Zipkin has been losing market share to Jaeger and Tempo since 2020, but for fleets under a few thousand spans/sec, it remains a defensible choice — the simplest backend that works.
A fourth, scaling-pain shape: fleets that outgrew Jaeger and migrated to Tempo. Flipkart's observability team made this move in 2023 — their Big Billion Days traffic was generating roughly 800K spans/sec at peak, and their Jaeger-on-Elasticsearch deployment was hitting shard-management limits that required hand-tuning every BBD season. Migrating to Tempo cut their storage cost by ~70%, eliminated the ES tuning burden, and let them increase retention from 7 days to 30. The cost was that their tag-filter dashboards (which their fraud-detection team ran constantly) moved from "1-second response" to "30-second response". They mitigated by adding a small Elasticsearch index of just the fraud-relevant fields fed from the OTel collector — a hybrid pattern that more fleets are adopting: Tempo as the primary store, a thin secondary index of high-query-rate fields elsewhere. The hybrid is acknowledgment that no single backend wins on all axes.
A fifth pattern is per-team backend choice — large engineering organisations that run multiple trace backends for different teams' needs. Swiggy's platform team runs Tempo as the default for the consumer-facing services (high volume, low query rate) but maintains a smaller Jaeger cluster for the partner-onboarding team, who run interactive trace-search workflows during merchant-issue debugging. The OTel Collector's tail_sampling and routing processors make this routing easy at ingest time — set a resource attribute, and the collector routes traces to the appropriate backend. Operating two backends costs more, but the alternative is forcing one team's read pattern onto the other, which costs more in engineering time per debugging session. This pattern is increasingly common at fleets above 1000 services.
How each backend stores a trace on disk
Stepping below the API, the three backends commit spans to three structurally different on-disk shapes, and the shape predicts the operational pain.
Zipkin on Cassandra writes each span as a row in a traces table partitioned by trace_id, with secondary indexes (service_span_name_index, service_remote_service_name_index, etc.) maintained as separate Cassandra tables that point back at trace_ids. Cassandra's per-row write amplification means a single span emits 5–7 writes total. Compaction merges these over time, but during ingest spikes the secondary-index tables can fall behind. The classic Zipkin failure mode is "search returns stale results during high ingest" — the trace exists in the primary table but the index has not caught up. The fix is to either reduce ingest, increase the secondary-index compaction throughput, or accept the staleness window.
Jaeger on Elasticsearch writes each span into a daily rolling index (jaeger-span-2026-04-25, etc.). Each span becomes a JSON document with all tags as ES fields, refreshed (made queryable) on the configured interval — typically 5 to 30 seconds. ES inverted indexes are constructed eagerly at indexing time, so write cost is high (each tag value contributes to the index) but query cost is low. The classic Jaeger failure mode is mapping explosion — when developers start adding novel tag keys, ES's mapping grows without bound, eventually exceeding the per-index field limit (default 1000) and causing all subsequent writes to fail with a mapper_parsing_exception. The fix is to configure a dynamic_templates mapping that flattens unknown tags into a single keyword-typed tags.dynamic field, sacrificing per-key indexing for boundedness. Most production Jaeger setups have learned this the hard way.
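One cheap guard against that failure mode — worth running on a schedule long before the limit is hit — is to count the mapped fields in the span indices. A sketch against Elasticsearch's standard _mapping API; localhost:9200 and the jaeger-span-* daily-index pattern are assumptions matching a default local deployment.
# jaeger_mapping_watch.py — count mapped fields in Jaeger's span indices.
# Sketch: assumes ES on localhost:9200 and Jaeger's default daily index naming;
# 1000 is Elasticsearch's default index.mapping.total_fields.limit.
import requests

FIELD_LIMIT = 1000

def count_fields(props, prefix=""):
    """Recursively count leaf fields in an ES mapping 'properties' tree."""
    n = 0
    for name, spec in props.items():
        if "properties" in spec:
            n += count_fields(spec["properties"], f"{prefix}{name}.")
        else:
            n += 1
    return n

mappings = requests.get("http://localhost:9200/jaeger-span-*/_mapping",
                        timeout=10).json()
for index, body in sorted(mappings.items()):
    total = count_fields(body["mappings"].get("properties", {}))
    flag = "  <-- nearing field limit" if total > 0.8 * FIELD_LIMIT else ""
    print(f"{index:<28} {total:>5} mapped fields{flag}")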
Tempo writes spans into Parquet-like blocks in object storage, with a small per-block index file that maps trace_id to byte offset. Blocks are flushed every 5–10 minutes (configurable) and compacted by background workers. The on-disk format is columnar — all service.name values for spans in a block are co-located, all durations are co-located, etc. — which makes block scans for filter queries fast (only the relevant columns are read from S3). The classic Tempo failure mode is block-flush stall — if the ingester cannot flush blocks fast enough relative to the ingestion rate, in-memory buffers grow and eventually the ingester OOMs. The fix is to increase the ingester replica count, increase the flush parallelism, or reduce the per-tenant ingestion rate. Tempo's metrics expose tempo_ingester_blocks_flushed_total and tempo_ingester_traces_created_total precisely so on-call can spot this regression early.
A fourth on-disk subtlety is trace_id collision rates. Zipkin and Jaeger both treat 64-bit trace_ids as primary keys; collisions are rare but possible at high volume (birthday paradox: at roughly 5 billion traces, the odds of at least one collision among 64-bit IDs cross 50%). Tempo always uses 128-bit trace_ids and rejects 64-bit ones at ingest (or pads them, depending on configuration). The collision matters when a span lands in the wrong tree — your tree-render shows spans from two different requests interleaved. Modern fleets that emit 128-bit trace_ids at the SDK level avoid this entirely; legacy fleets emitting 64-bit Zipkin trace_ids should plan for the migration before they cross the birthday-bound threshold. The threshold in real numbers: a fleet minting 10K trace_ids/sec (roughly 100K spans/sec at ten spans per trace) generates about 6 billion IDs per week, so a 7-day retention window is already more likely than not to contain a colliding pair. Most fleets cross it before they realise.
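The arithmetic is worth sanity-checking yourself; a back-of-envelope sketch using the standard birthday approximation p ≈ 1 − exp(−k²/2N), with an assumed (illustrative) rate of 10K trace_ids per second:
# birthday_collision.py — collision odds for random 64-bit trace_ids.
# Sketch: assumes uniformly random ids; the 10K traces/sec rate is illustrative.
import math

ID_SPACE = 2 ** 64

def collision_probability(k):
    """P(at least one collision) among k random ids, birthday approximation."""
    return 1 - math.exp(-(k ** 2) / (2 * ID_SPACE))

traces_per_sec = 10_000
for days in (1, 7, 30):
    k = traces_per_sec * 86_400 * days
    print(f"{days:>2} days  {k:.2e} ids  p(collision) ~ {collision_probability(k):.2f}")
# ~0.02 after a day, ~0.63 after a week, ~1.00 after a month — which is why
# 128-bit ids (a 2**128 space) push the same probabilities to effectively zero.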
Common confusions
- "Tempo is just slower Jaeger." No — Tempo is "fast trace_id lookup, slow tag-filter scan" while Jaeger is "moderate on both because everything goes through ES inverted indexes". Tempo wins on storage cost (10×–30× cheaper) and 100% retention; Jaeger wins on tag-filter latency (100× faster). Picking between them is picking between read patterns, not between fast and slow.
- "Zipkin is deprecated." Not officially — it is still maintained and shipping releases as of 2025. It has lost market share to Jaeger and Tempo, but for small fleets (under ~50 services and under ~10K spans/sec) it is still a defensible single-binary choice. Saying "deprecated" overstates the position.
- "Jaeger is just an Elasticsearch UI." Wrong on two counts. First, Jaeger supports Cassandra, BadgerDB, gRPC plugins, and other backends — ES is the default but not the only one. Second, Jaeger's UI does real work the ES query API does not — span-tree assembly, service dependency derivation, error-flag highlighting. The UI is the value; the backend is interchangeable.
- "Tempo's TraceQL is just SQL for traces." TraceQL borrows syntax from PromQL and SQL but is its own language with span-graph-specific operators (
>>for descendant-of,~for sibling-of) that have no SQL equivalent. The{ }curly-brace span selector is closer to a graph-query language than a relational one. Treating TraceQL as SQL lets you write queries that compile but do not return what you expect. - "All three backends store the same data." They store the same span fields, but with very different fidelity. Zipkin and Jaeger truncate long string tag values (default 1KB cap per tag); Tempo stores full strings. Zipkin's older versions drop span events entirely; Jaeger preserves them; Tempo keeps everything. If you need full-fidelity span events for debugging, Tempo is the only one of the three that guarantees nothing was dropped at ingest.
- "Tag indexing is free in Jaeger." No — every indexed tag contributes to ES mapping size, shard size, and refresh cost. A team that adds 50 new tag keys without coordination can push their Jaeger cluster into mapping-explosion territory. Indexing is a budget, not a freebie. The same team in Tempo would not feel the cost (because tags are scanned, not indexed), but their tag-filter queries would slow down — a different tax for the same bad practice.
Going deeper
TraceQL — Tempo's query language and the columnar scan
TraceQL is Tempo's structured trace-query language, introduced in Tempo 2.0 (2023). The grammar borrows from PromQL: { } selects spans by attribute, &&/|| compose conditions, >> and << express ancestor-descendant relationships across spans in a trace. A query like { resource.service.name = "checkout-api" } >> { duration > 500ms && status = error } returns traces where the checkout-api ancestor span has a descendant span with both high duration and error status — a structurally non-trivial query that would be hard to express in Jaeger's tag-based query API. Under the hood, the TraceQL planner compiles to block-scan stages that filter columnar data in parallel. A query over a 1-hour window at 10M spans/sec ingest rate scans tens of gigabytes of compressed columnar data (column pruning means only the referenced columns are read from S3); fanned out across a 1024-way-parallel querier deployment, that completes in 5–15 seconds. The bound is parallelism, not cleverness; Tempo's bet is that S3 bandwidth will keep increasing faster than tag indices can keep up.
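To make the structural query concrete, the same ancestor/descendant expression can be run against the search endpoint used earlier in the chapter (a sketch; localhost:3200 and the one-hour window are assumptions matching the local setup):
# traceql_structural.py — run the ancestor/descendant TraceQL query above.
# Sketch: assumes the local Tempo from the earlier script on localhost:3200.
import time, requests

QUERY = ('{ resource.service.name = "checkout-api" } '
         '>> { duration > 500ms && status = error }')
now_s = int(time.time())
resp = requests.get("http://localhost:3200/api/search",
                    params={"q": QUERY, "start": now_s - 3600, "end": now_s,
                            "limit": 20}, timeout=30).json()
for t in resp.get("traces", []):
    print(t["traceID"], t.get("rootServiceName"), t.get("durationMs"))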
Service dependency graphs — how each backend computes them
A service-dependency graph (the "who calls whom" topology view) is one of the highest-value derived products of distributed tracing. Each backend computes it differently. Zipkin runs a periodic Spark job over historical spans, aggregating parent-child service pairs into edge counts, and writes the result to a dependencies table. The job runs hourly or daily, so the graph is always slightly stale. Jaeger has a real-time mode using an in-memory pipeline (the spark-dependencies job for batch, plus an optional streaming dependency calculator), with edge counts updated within minutes. Tempo computes service graphs using its metrics-generator side-pipeline — a component that extracts edge counts from spans at ingest and writes them as Prometheus metrics (traces_service_graph_request_total{client, server}), which Grafana can chart directly. This is the most operationally clean pattern: the dependency graph becomes a Prometheus query rather than a separate batch job, and it stays fresh in seconds.
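In practice "the dependency graph is a Prometheus query" looks like the sketch below; the metrics-generator being enabled and Prometheus sitting on localhost:9090 are both assumptions about your setup, not defaults of the earlier script.
# service_graph_edges.py — pull service-graph edge rates out of Prometheus.
# Sketch: assumes Tempo's metrics-generator is writing
# traces_service_graph_request_total and Prometheus is on localhost:9090.
import requests

PROMQL = 'sum by (client, server) (rate(traces_service_graph_request_total[5m]))'
result = requests.get("http://localhost:9090/api/v1/query",
                      params={"query": PROMQL}, timeout=10).json()["data"]["result"]
for series in result:
    client = series["metric"].get("client", "?")
    server = series["metric"].get("server", "?")
    print(f"{client:<24} -> {server:<24} {float(series['value'][1]):8.1f} req/s")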
How retention windows really work — and why "30 days" rarely means what you think
All three backends advertise configurable retention. The reality is more subtle. Zipkin on Cassandra uses TTL on rows; spans automatically expire after the configured interval, but secondary-index tables have their own TTLs that must be aligned, and a misaligned TTL produces a "ghost trace" — the index points at a trace that no longer exists in the primary table, and the API returns a 500. Jaeger on Elasticsearch uses index-level retention via the Curator pattern — daily indices are deleted in bulk after the retention window, which is operationally clean but means retention is per-day-aligned (a 7-day retention is really "between 7.0 and 7.99 days" depending on when you query). Tempo uses block-level retention with explicit lifecycle rules on the S3 bucket — blocks older than the retention window are deleted by the bucket lifecycle policy, not by Tempo itself. This means a misconfigured S3 lifecycle rule (or no rule at all) causes Tempo blocks to live forever, growing the storage bill without bound. The Tempo failure mode of "you forgot the S3 lifecycle policy" is the cheapest mistake to make and the most expensive to discover at the next quarter's cloud bill.
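The corresponding guard-rail is a few lines of boto3 run once per bucket (a sketch — the bucket name and the 30-day expiration are placeholders; match the expiration to whatever retention Tempo is configured for):
# tempo_bucket_lifecycle.py — expire old Tempo blocks with an S3 lifecycle rule.
# Sketch: bucket name and retention are placeholders; align the expiration
# with Tempo's configured retention so the bucket cannot grow without bound.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-tempo-traces",            # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-tempo-blocks",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},    # apply to the whole bucket
            "Expiration": {"Days": 30},  # placeholder: match Tempo retention
        }]
    },
)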
Why Jaeger's UI lasted longer than Jaeger's storage
The Jaeger UI is excellent — span-tree visualisation, error-span highlighting, service-dependency overlay, comparison view that lets you diff two traces side-by-side. Many fleets that migrated from Jaeger storage to Tempo storage kept the Jaeger UI. Tempo ships a tempo-query component that implements Jaeger's storage-plugin interface, letting Tempo masquerade as a Jaeger backend so the Jaeger UI talks to it transparently. Grafana's own trace UI has caught up since 2023, but the Jaeger UI is still the gold standard for span-level interaction. The lesson is that backend and UI are separable concerns; many production fleets deliberately mix-and-match (Tempo storage + Jaeger UI is a common 2025 deployment shape). The OpenTelemetry collector's flexibility in routing spans to multiple backends simultaneously is what makes this hybrid possible.
Operational telemetry for the trace backend itself
Each backend exposes its own metrics for self-observability, and ignoring them is a common operational sin. Zipkin exposes zipkin_collector_messages_total and zipkin_collector_spans_total plus per-storage-backend metrics. Jaeger exposes jaeger_collector_spans_received_total, jaeger_collector_traces_saved_total, and jaeger_query_requests_total. Tempo exposes tempo_distributor_spans_received_total, tempo_ingester_blocks_flushed_total, tempo_querier_queries_total, plus block-storage metrics. The discipline that separates good observability teams from bad ones is monitoring the trace backend with the same rigour you monitor your application. A trace backend that drops 5% of spans during ingest spikes is silently lying to every on-call investigation downstream. Razorpay's platform team learned this in 2024 — they discovered after a long incident that 12% of spans during the peak hour had been dropped at the Tempo distributor due to under-provisioned ingester replicas, and the resulting trace trees were systematically incomplete. The fix was three lines of Helm config and an alert on rate(tempo_distributor_spans_dropped_total[5m]) > 0.01 — a self-instrumentation rule every observability team should adopt before they need it.
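The same alert expression can be evaluated ad hoc while the alerting pipeline is still being wired up (a sketch; it assumes the backends' own metrics are already scraped into a Prometheus on localhost:9090, which is an assumption about your setup rather than something the earlier script configures):
# trace_backend_selfcheck.py — evaluate the span-drop expression from the text.
# Sketch: assumes Tempo's distributor metrics are scraped into a Prometheus
# reachable on localhost:9090; the 0.01 threshold mirrors the alert above.
import requests

EXPR = 'rate(tempo_distributor_spans_dropped_total[5m])'
result = requests.get("http://localhost:9090/api/v1/query",
                      params={"query": EXPR}, timeout=10).json()["data"]["result"]
worst = max((float(s["value"][1]) for s in result), default=0.0)
print(f"max span-drop rate across distributors: {worst:.4f}/s")
if worst > 0.01:
    print("ALERT: Tempo is dropping spans — downstream trace trees are incomplete.")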
Where this leads next
- OpenTracing / OpenTelemetry — convergence and the spec — the API and SDK layer that sits in front of all three backends, and how the unification on OTLP made backend swaps possible.
- Trace sampling — head, tail, adaptive — how sampling decisions interact with each backend's storage cost (fewer spans means less indexing for Jaeger, smaller blocks for Tempo, less Cassandra write amplification for Zipkin).
- Distributed context propagation patterns — the wire formats (W3C, B3) that the previous chapter covered, ingested by all three backends with slightly different fidelity rules.
- Service dependency graphs from traces — the highest-value derived product, computed differently by each backend.
The next chapter zooms out from the storage backend to the API and SDK layer that produces the spans these backends ingest. OpenTracing was the first attempt at a vendor-neutral tracing API; OpenTelemetry is the convergent successor that absorbed both OpenTracing and OpenCensus. Understanding why the convergence happened and what it means for backend interchangeability is what closes the loop on this section.
A small empirical exercise to run before moving on: take any one production service in your fleet, point its OTel exporter at all three backends in parallel (via the OTel collector's fan-out mode, exactly as the script above does), and run a week of normal traffic. Compare the per-backend storage bill, the per-backend query latency for trace_id lookups, and the per-backend tag-filter latency. The numbers will be specific to your read pattern — your team's actual mix of trace_id-lookup vs tag-filter queries — and the right backend choice will follow from those numbers, not from any blog post or this article. The cheapest backend on paper is rarely the cheapest in your actual workload. Run the experiment.
References
- Zipkin documentation — storage — the canonical reference for Zipkin's pluggable storage model, indexing strategy, and Cassandra schema.
- Jaeger documentation — Elasticsearch backend — the operational reference for Jaeger's default backend, including index sharding and retention.
- Tempo documentation — TraceQL and the columnar format — the language and the storage layout, including the per-block index format.
- Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (Google, 2010) — the foundational paper that all three backends build on.
- Yuri Shkuro, "Mastering Distributed Tracing" (Packt, 2019) — the Jaeger creator's textbook on the data model, sampling, and operational practice.
- Grafana Labs, "Why we built Tempo" (2020) — the design rationale for indexing only trace_id and using object storage.
- Razorpay engineering — Jaeger to Tempo migration retro (2024) — a public Indian-fleet case study on why and how they moved.
- B3 and W3C Trace Context wire formats — the previous chapter, which covered what flows into each of these backends from the wire.
# Reproduce this on your laptop. The simplest path is the docker compose file
# from the script's header comment, which also runs the OTel collector that
# fans spans out to all three backends (and supplies Tempo's tempo.yaml).
# The standalone containers below expose the same query APIs, but without a
# collector in front only the backend listening on 4317 receives the spans.
docker run -d --name zipkin -p 9411:9411 openzipkin/zipkin
docker run -d --name jaeger -p 16686:16686 -p 4317:4317 \
-e COLLECTOR_OTLP_ENABLED=true jaegertracing/all-in-one:1.50
docker run -d --name tempo -p 3200:3200 -p 4318:4318 \
grafana/tempo:latest -config.file=/etc/tempo.yaml
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-grpc requests
python3 trace_backend_compare.py
# Expected: trace_id printed, all three backends return 200 with the
# four-span payload. Trace_id lookup latency is comparable across the
# three (10–60ms on a laptop, all O(log N) seeks). The tag-filter query
# at the bottom of the script shows Tempo running 20–100x slower than
# Jaeger because Tempo did not pay for the index. This is the indexing
# trade-off, in your own terminal, in 30 seconds.