Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Exemplars: the metric ↔ trace bridge
It is 22:14 IST on a Saturday. Aditi, a platform engineer at a hypothetical Bengaluru-based ticketing firm we will call BookMyShoot, is staring at a Grafana panel. The panel shows http_request_duration_seconds for the seat-allocator service during the 22:00 IST drop of T20 Mumbai-vs-Chennai tickets. The p99 line has just spiked from 180ms to 4.2 seconds. The aggregate request rate is 38,000 RPS. The aggregate error rate is 0.02%. Nothing else on the dashboard looks wrong. She can do one of two things next: spend 25 minutes writing a TraceQL query against Tempo to find a 4-second-plus trace matching service.name=seat-allocator in this minute, then another 10 minutes proving the trace she found is representative; or she can click the orange dot Grafana has just rendered on top of the histogram bucket. The dot sits at (t=22:14:08, latency=4.18s). She clicks it. Grafana opens a new tab in Tempo with trace_id=8e3f2a7c…b1 already loaded. The span tree shows a 3.97-second wait on redis-seatlock-cluster-3. Total clicks: one. Total time: about four seconds. The orange dot is an exemplar, and this chapter is about the primitive that made the click possible.
An exemplar is a (timestamp, value, trace_id) triple attached to a histogram bucket increment — sparse, optional, and indexed alongside the bucket count. It is the only data structure in the metrics world that survives aggregation while still naming a specific request, which is what makes it the join key from "the p99 spiked" to "this is the slow trace". The mechanism is small (~50 bytes per histogram bucket per scrape); the operational leverage is enormous, because every other metric→trace correlation path is either lossy or 100x more expensive.
What an exemplar actually is — the data structure, not the marketing
A histogram in Prometheus is, at storage time, a set of cumulative counters. http_request_duration_seconds_bucket{le="0.1"} is the count of requests with latency ≤ 100ms; le="0.5" is the count with latency ≤ 500ms; le="+Inf" is the total count. The increments happen in the application: when a request finishes in 4.18s, the application increments the le="5" bucket by 1, and every higher bucket too. The bucket counter has no memory of which request caused the increment — once incremented, the request that caused it has dissolved into a shared count.
An exemplar is the optional repair to that loss. The OpenMetrics spec (now part of the Prometheus exposition format) defines an exemplar as a sparse annotation on a bucket increment: a label set, a value, and a timestamp. Critically, when prometheus-client (or the OpenTelemetry SDK) records histogram.observe(4.18, exemplar={"trace_id": "8e3f2a7c…b1"}), it stores the exemplar alongside the bucket counter. On the next scrape, the /metrics endpoint emits both the counter increment and the exemplar:
http_request_duration_seconds_bucket{le="5",service="seat-allocator"} 47823 # {trace_id="8e3f2a7c...b1"} 4.18 1714512848.103
The # {...} value timestamp syntax after the line is the exemplar. There is at most one exemplar per bucket per scrape interval — when the second slow request arrives during the same scrape window, it overwrites the exemplar of the first. This is the design: storage cost is bounded (one exemplar per bucket, regardless of QPS), and the exemplar is a sample of the requests that fell in that bucket, not a record of them all.
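To make both rules concrete — cumulative bucket increments, and one overwritable exemplar slot per bucket — here is a toy model. It is illustrative only, not how prometheus-client is implemented; the class name and the three sample observations are invented for the sketch:

# toy_histogram.py — toy model of the two rules described above; not the
# prometheus-client implementation
class ToyHistogram:
    def __init__(self, boundaries):
        self.boundaries = boundaries + [float("inf")]
        self.counts = {le: 0 for le in self.boundaries}
        self.exemplars = {le: None for le in self.boundaries}  # one slot per bucket

    def observe(self, value, trace_id):
        # cumulative buckets: every bucket at or above the value is incremented
        for le in self.boundaries:
            if value <= le:
                self.counts[le] += 1
        # the exemplar attaches to the lowest bucket the value falls in,
        # and last-write-wins: a later observation overwrites the slot
        slot = next(le for le in self.boundaries if value <= le)
        self.exemplars[slot] = (value, trace_id)

h = ToyHistogram([0.1, 0.5, 5.0])
h.observe(0.04, "aaa...")   # increments le=0.1, 0.5, 5.0, +Inf
h.observe(4.18, "8e3f...")  # increments le=5.0, +Inf; exemplar lands in le=5.0
h.observe(3.90, "bbb...")   # overwrites the le=5.0 exemplar slot
print(h.counts)             # {0.1: 1, 0.5: 1, 5.0: 3, inf: 3}
print(h.exemplars[5.0])     # (3.9, 'bbb...') — only the latest survives the scrape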
The "at most one per bucket per scrape" rule is the load-bearing design choice. Without it, exemplars would scale with QPS — at 38,000 RPS, recording an exemplar per request would emit 38,000 trace_ids per second from a single histogram, which is back to the problem high-cardinality labels created. With the cap, a 7-bucket histogram emits at most 7 exemplars per scrape, regardless of QPS. At a 15-second scrape interval, that is 28 exemplars per minute per histogram per service — three orders of magnitude smaller than the request stream itself. Storage cost stays bounded; correlation utility stays high.
Why "one per bucket" rather than "one per histogram" or "one per scrape": one per histogram would mean a single trace_id has to represent every bucket, which is meaningless — the request that produced a 4.2s exemplar has nothing to say about why the 50ms bucket also had increments. One per scrape would lose the information about which bucket the slow request fell into. One per bucket is the smallest useful granularity: it lets the dashboard render an exemplar dot at the correct bucket boundary, which is the visual cue that drives the click. The bucket-level granularity is what turns "the histogram looks bad" into "this specific bucket has a representative slow trace" — and that is a different cognitive operation, not just a quantitative refinement.
The contract has one more subtle property: exemplars are not part of the cumulative bucket counter. The _count and _sum series do not carry exemplars; only the _bucket series do. This matters because PromQL itself never sees exemplars — rate(), sum(), and histogram_quantile() operate on the bucket counters alone. When a Grafana panel computes histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))), it issues a second, parallel request to Prometheus's /api/v1/query_exemplars endpoint with the same bucket selector, and overlays the returned exemplars on the quantile line at their observed values. The dashboard panel can then render those exemplars as clickable dots without the user ever knowing which bucket the p99 came from. The quantile query hides the bucket arithmetic; the exemplar query surfaces the click target.
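A sketch of the two side-by-side requests a panel issues — both endpoint paths are Prometheus's real HTTP API; the host and the selector are assumptions matching the pipeline later in this chapter:

# panel_queries.py — sketch of the two parallel queries behind one panel
import time
import requests

PROM = "http://localhost:9090"
selector = 'http_request_duration_seconds_bucket{endpoint="/lock"}'

# 1. the aggregate: histogram_quantile over the bucket counters (no exemplars)
p99 = requests.get(f"{PROM}/api/v1/query", params={
    "query": f"histogram_quantile(0.99, sum by (le) (rate({selector}[5m])))",
}).json()

# 2. the dots: a separate exemplar query against the same bucket selector
dots = requests.get(f"{PROM}/api/v1/query_exemplars", params={
    "query": selector,
    "start": time.time() - 300,
    "end": time.time(),
}).json()

print("p99:", p99["data"]["result"])
print("exemplar dots:", sum(len(s["exemplars"]) for s in dots.get("data") or []))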
The bidirectional bridge — metrics to traces, traces back to metrics
Exemplars are usually pitched as "metrics → traces" — click a histogram dot, jump to a trace. That is the direction most teams wire first, and it is the dominant correlation walk. The reverse direction is less talked about but increasingly important: given a trace you are already looking at, find the metrics it contributed to. This is the trace-to-metric join, and it is what makes single-pane-of-glass actually work as a navigation graph rather than a one-way street.
The forward join (metric → trace) uses the exemplar's trace_id directly: take the trace_id from the bucket exemplar, deep-link into Tempo via https://tempo/ui/?traceID=<id>, render the span tree. This is mechanical and fast. The reverse join (trace → metric) needs a different path, because the trace itself does not know which histogram bucket it landed in — that decision was made by the application's histogram.observe() call, which the trace has no record of. The reverse join is reconstructed from the trace's resource attributes (service.name, pod, endpoint) and its duration: pick the right histogram ({service.name="seat-allocator"}), pick the bucket for the duration (4.18s → le=5.0), and now you can render the exemplars from that bucket on the histogram panel — including the trace you are looking at, plus its siblings (other slow traces in the same bucket during the same scrape window).
This is what Grafana's Tempo data source calls tracesToMetrics — a configuration that maps from a span's resource attributes to a Prometheus query, then links to a metric panel pre-filtered to that query. The result is bidirectional: from a trace, you can jump to "the histogram this trace contributed to, with exemplar siblings highlighted". From the histogram, you can jump back to any of those siblings. The navigation graph is undirected, which is what "single pane of glass" actually means in practice — not "all telemetry in one tool" but "every telemetry artefact links to its neighbours, in both directions".
Most teams wire the forward direction in the first quarter (it requires only prometheus-client 0.16+ and a Grafana exemplar data link) and discover the reverse direction during their first major incident triage where they wanted to ask "this trace looks bad — is it a one-off or part of a pattern across the fleet?". The reverse direction answers that in two clicks: trace → metric panel → all exemplar traces from the same bucket in the last 5 minutes. If twelve of them light up, it is a pattern. If it is just this one, it was a one-off — possibly a cosmic-ray bit-flip, possibly a hardware glitch on a specific pod. The pattern-vs-one-off question is the most common second question after "what is this trace doing", and the bidirectional bridge answers it natively.
Emitting and walking exemplars in Python — a working pipeline
Here is the runnable pipeline that emits a histogram with exemplars and walks them in both directions. It uses prometheus-client 0.20 and opentelemetry-sdk against a local Prometheus 2.43+ and Tempo, with the exemplar storage feature flag enabled (--enable-feature=exemplar-storage on Prometheus).
# exemplar_pipeline.py — emit histogram with exemplars; query both directions
# pip install prometheus-client opentelemetry-api opentelemetry-sdk \
#             opentelemetry-exporter-otlp flask requests
import random
import time

from flask import Flask, jsonify
from prometheus_client import Histogram
from prometheus_client.openmetrics.exposition import (
    CONTENT_TYPE_LATEST, generate_latest)
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# 1. OTel resource — the load-bearing attributes for the reverse join
resource = Resource(attributes={
    "service.name": "seat-allocator",
    "service.instance.id": "alloc-7d9f-xk2",
    "deployment.environment": "production",
    "host.name": "seat-allocator-7d9f-xk2",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# 2. histogram with explicit buckets sized to BookMyShoot's SLO ladder
LAT = Histogram(
    "http_request_duration_seconds",
    "request latency",
    labelnames=["endpoint"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

app = Flask(__name__)

@app.route("/lock/<seat_id>")
def lock_seat(seat_id):
    with tracer.start_as_current_span(
            "place_seat_lock",
            attributes={"seat.id": seat_id, "endpoint": "/lock"}) as span:
        ctx = span.get_span_context()
        trace_id = format(ctx.trace_id, "032x")
        t0 = time.time()
        # 3. simulate redis lock — sometimes slow due to a hot shard
        if random.random() < 0.003:  # 0.3% of requests hit the slow shard
            time.sleep(random.uniform(3.5, 4.5))
        else:
            time.sleep(random.uniform(0.02, 0.18))
        elapsed = time.time() - t0
        # 4. observe with exemplar — the trace_id is the forward join key
        LAT.labels(endpoint="/lock").observe(
            elapsed, exemplar={"trace_id": trace_id})
        return jsonify(ok=True, trace_id=trace_id, latency=elapsed)

@app.route("/metrics")
def metrics():
    # OpenMetrics format is required for Prometheus to parse exemplars;
    # the legacy text format silently drops the # {...} annotations
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(port=8080)
Sample /metrics scrape (truncated to the buckets that matter), showing the exemplar attached to the slow-bucket increment:
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/lock",le="0.05"} 8826
http_request_duration_seconds_bucket{endpoint="/lock",le="0.1"} 23536
http_request_duration_seconds_bucket{endpoint="/lock",le="0.25"} 47071
http_request_duration_seconds_bucket{endpoint="/lock",le="0.5"} 47071
http_request_duration_seconds_bucket{endpoint="/lock",le="1.0"} 47071
http_request_duration_seconds_bucket{endpoint="/lock",le="2.5"} 47071
http_request_duration_seconds_bucket{endpoint="/lock",le="5.0"} 47213 # {trace_id="8e3f2a7cb1d4e9a82b1"} 4.18 1714512848.103
http_request_duration_seconds_bucket{endpoint="/lock",le="10.0"} 47213
http_request_duration_seconds_bucket{endpoint="/lock",le="+Inf"} 47213
http_request_duration_seconds_count{endpoint="/lock"} 47213
http_request_duration_seconds_sum{endpoint="/lock"} 5274.9
The walker — querying Prometheus for exemplars, then Tempo for the trace, then back the other way:
# walk.py — both-directions exemplar walk
import time

import requests

PROM = "http://localhost:9090"
TEMPO = "http://localhost:3200"

def find_slow_exemplars(metric, lookback_s=300):
    """Forward direction: ask Prometheus for exemplars on a metric."""
    r = requests.get(f"{PROM}/api/v1/query_exemplars", params={
        "query": metric,
        "start": time.time() - lookback_s,
        "end": time.time(),
    }).json()
    out = []
    for series in r.get("data") or []:
        for ex in series.get("exemplars", []):
            # the API returns the value as a string — convert before sorting
            out.append((float(ex["value"]), ex["labels"]["trace_id"], ex["timestamp"]))
    return sorted(out, reverse=True)  # slowest first

def trace_summary(tid):
    """Forward direction continued: pull the trace from Tempo."""
    r = requests.get(f"{TEMPO}/api/traces/{tid}").json()
    spans = [s for batch in r["batches"]
             for ils in batch.get("scopeSpans",
                                  batch.get("instrumentationLibrarySpans", []))
             for s in ils["spans"]]
    return [(s["name"], int(s["endTimeUnixNano"]) - int(s["startTimeUnixNano"]))
            for s in spans]

def reverse_join(trace_id, duration_s):
    """Reverse direction: from a trace, find peer exemplars in the same bucket."""
    # re-apply the histogram's bucketing function to find the bucket boundary
    buckets = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
    le = next((b for b in buckets if duration_s <= b), "+Inf")
    # query exemplars on the same metric+bucket — peers are siblings of this trace
    expr = f'http_request_duration_seconds_bucket{{le="{le}"}}'
    peers = find_slow_exemplars(expr, lookback_s=600)
    return [p for p in peers if p[1] != trace_id]

if __name__ == "__main__":
    slow = find_slow_exemplars("http_request_duration_seconds_bucket")
    if not slow:
        print("no exemplars yet — run some load")
    else:
        v, tid, ts = slow[0]
        print(f"[forward] slowest exemplar: trace_id={tid[:16]}… value={v:.3f}s")
        for name, dur_ns in trace_summary(tid):
            print(f"  span {name:24s} {dur_ns/1e9:.3f}s")
        peers = reverse_join(tid, v)
        print(f"[reverse] peers in same bucket (last 10m): {len(peers)}")
        for v2, tid2, ts2 in peers[:5]:
            print(f"  peer {tid2[:16]}… value={v2:.3f}s")
Sample run (after a few minutes of load):
[forward] slowest exemplar: trace_id=8e3f2a7cb1d4e9a8… value=4.180s
span place_seat_lock 4.180s
span redis-seatlock-3 3.974s
span postgres-seats 0.178s
[reverse] peers in same bucket (last 10m): 11
peer a13c4f8b91e2a743… value=4.082s
peer fb2e0d6c8a1f93b2… value=3.971s
peer 47a8e1b3c4d5f6a9… value=3.834s
peer c92b8e1a4f7d3c5b… value=3.812s
peer 1f4a7c8b2e9d6f3a… value=3.787s
Walking through the load-bearing lines: Resource(attributes={"service.name": "seat-allocator", ...}) authors the resource once; every span carries service.name, which is the reverse-direction join key. Histogram(..., buckets=[..., 5.0, 10.0]) uses explicit buckets aligned with BookMyShoot's SLO ladder (p99 SLO is 500ms, so the 0.5 bucket boundary is load-bearing for the SLO panel; the 5 and 10 buckets exist purely to catch the long tail without polluting the SLO panel). LAT.labels(endpoint="/lock").observe(elapsed, exemplar={"trace_id": trace_id}) is the headline call — prometheus-client since 0.16 supports the exemplar= kwarg on observe(), which records the trace_id as the bucket-level exemplar. generate_latest() from prometheus_client.openmetrics.exposition is critical: the standard text exposition format does not include exemplars in scrapes, only OpenMetrics does. A common mis-wiring is to use the standard exporter and wonder why exemplars never appear in Prometheus — the /metrics endpoint must explicitly emit OpenMetrics-format text.
Why the OpenMetrics requirement is silent: Prometheus 2.43+ has exemplar support but it activates per-target based on the Content-Type header of the scrape. If your endpoint serves Content-Type: text/plain; version=0.0.4 (the legacy Prometheus format), Prometheus parses it as legacy and discards any # {...} annotations as comments. If it serves Content-Type: application/openmetrics-text; version=1.0.0; charset=utf-8, Prometheus parses it as OpenMetrics and stores the exemplars. There is no error, no warning, no log line. The exemplars are silently dropped. The diagnostic is to curl -H "Accept: application/openmetrics-text" http://localhost:8080/metrics | grep '#' and verify the exemplar comments are present, then curl http://localhost:9090/api/v1/query_exemplars?query=... and verify Prometheus stored them. Many "why aren't my exemplars showing up in Grafana" sessions end at the Content-Type header.
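The same two checks as a short Python sketch rather than curl — hosts and the metric name are assumptions matching the pipeline above:

# check_openmetrics.py — the Content-Type diagnostic, end to end
import time
import requests

# check 1: is the endpoint serving OpenMetrics, and are exemplars in the text?
r = requests.get("http://localhost:8080/metrics",
                 headers={"Accept": "application/openmetrics-text"})
print("content-type:", r.headers.get("Content-Type"))  # must say openmetrics-text
print("exemplar annotations in scrape:",
      sum(1 for line in r.text.splitlines() if " # {" in line))  # 0 ⇒ exporter problem

# check 2: did Prometheus parse and store them?
ex = requests.get("http://localhost:9090/api/v1/query_exemplars", params={
    "query": "http_request_duration_seconds_bucket",
    "start": time.time() - 600, "end": time.time()}).json()
print("exemplars stored by Prometheus:",
      sum(len(s["exemplars"]) for s in ex.get("data") or []))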
The reverse-direction walk is the more interesting one. next((b for b in buckets if duration_s <= b), "+Inf") maps the trace duration back to a bucket boundary — this re-applies the histogram's bucketing function, which the client library already applied once at observation time. The reverse query then asks "what other exemplars are sitting in the same bucket, recently?". The 11 peers in the sample run are the answer to "is this trace a one-off or a pattern" — eleven peers in ten minutes means there is a pattern, and the operator can now investigate the shared root cause (in this case, the hot Redis shard) with confidence rather than chasing a single-trace red herring.
Why peer-discovery via reverse exemplar walk beats grepping logs by timestamp: a timestamp-based "find slow logs around 22:14" query against Loki returns every log line in that minute regardless of which request it belonged to — typically thousands of lines for a busy service. The reverse exemplar walk returns only the trace_ids whose request-duration bucket matched the slow bucket, which is a much tighter filter — typically a few dozen across a 10-minute window. The precision is easily 10x higher because the filter is "requests that fell in the same bucket" rather than "anything that happened around that time". The exemplar is the unit of similarity for "slow requests like this one", not the timestamp.
A practical note: the /api/v1/query_exemplars endpoint is bounded by Prometheus's exemplar storage size (default: 100,000 exemplars total per Prometheus instance, configurable via the storage.exemplars.max_exemplars setting in prometheus.yml). Once the cap is hit, the oldest exemplars are evicted. In a production load with hundreds of histograms, this means exemplars typically have a 30-minute to 2-hour retention window — which is fine for incident triage (you usually look up an exemplar within minutes of the alert) but inadequate for retrospective analysis a day later. Teams that want longer retention either raise the cap (linear memory cost) or push exemplars into Tempo via the metricsGenerator component, which writes them as zero-duration spans with a prometheus.exemplar=true attribute — converting metric-side ephemeral storage into trace-side durable storage.
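The retention window follows directly from the cap and the ingest rate; a back-of-envelope sketch, in which every input except the default cap is an assumption to replace with your own fleet's numbers:

# exemplar_retention.py — rough retention window for the in-memory buffer
cap = 100_000     # default max_exemplars discussed above
histograms = 100  # exemplar-bearing histograms across the fleet (assumed)
buckets = 7       # exemplar slots per histogram (assumed)
scrape_s = 15     # scrape interval (assumed)

# worst case: every slot is refreshed every scrape
exemplars_per_s = histograms * buckets / scrape_s
retention_s = cap / exemplars_per_s
print(f"~{retention_s/60:.0f} minutes of exemplar retention")  # ≈ 36 minutes here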
When exemplars fail — the four wiring breaks
The pipeline above is the success path. In production, exemplars fail to appear in four characteristic ways. The diagnostic ladder for "I clicked the histogram and nothing happened" walks these in order.
Break 1: the exporter emits legacy text format. As covered in the OpenMetrics aside above, the application's /metrics endpoint must serve OpenMetrics-format text. Many prometheus-client integrations (older Django Prometheus middleware, some FastAPI exporters, custom pull endpoints) default to legacy text. The diagnostic: curl -H 'Accept: application/openmetrics-text' http://service:8080/metrics | grep '#' — if there are no # {trace_id="..."} annotations, the exporter is the problem. The fix: switch the response handler to prometheus_client.openmetrics.exposition.generate_latest.
Break 2: Prometheus exemplar storage is disabled. Prometheus 2.43+ supports exemplars only when started with --enable-feature=exemplar-storage. Without the flag, scraped exemplars are parsed and immediately discarded — the /api/v1/query_exemplars endpoint returns an empty result. The diagnostic: curl http://prometheus:9090/api/v1/status/flags | grep exemplar — if no flag is set, exemplar storage is off. The fix: add the flag to Prometheus's command-line arguments and restart.
Break 3: the application emits observe() without an exemplar= argument. A common pattern is to use the OpenTelemetry SDK's metric API, which (in version 1.20 and later) auto-attaches exemplars from the active span context. But if the metric is recorded outside an active span (a background goroutine, a metric updated from a cron job, a metric updated before the span has started), there is no active trace_id to attach. The metric increment goes through, but no exemplar. The diagnostic: scrape /metrics during normal load and grep for # { — if the count of exemplar lines is zero, instrumentation is the problem. The fix: ensure the metric observation happens inside a with tracer.start_as_current_span(...) block, or pass exemplar={"trace_id": ...} explicitly.
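A defensive pattern for the explicit-exemplar route is to attach the trace_id only when a valid, sampled span context exists — a sketch using the OTel API's get_current_span and prometheus-client's exemplar kwarg, both already used in the pipeline above; the helper name is invented:

from opentelemetry import trace

def observe_with_exemplar(histogram, value):
    """Attach a trace_id exemplar only when a sampled span is active."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid and ctx.trace_flags.sampled:
        histogram.observe(value,
                          exemplar={"trace_id": format(ctx.trace_id, "032x")})
    else:
        # background work, cron jobs: increment the bucket without an exemplar
        histogram.observe(value)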
Break 4: Grafana data source is not wired to Tempo. Even with exemplars in Prometheus, Grafana will not render them as clickable dots unless the Prometheus data source has exemplarTraceIdDestinations configured to point at the Tempo data source. The diagnostic: open Grafana's data source settings, scroll to the Prometheus data source, check if the Exemplars section has at least one entry pointing at Tempo. The fix: add the data link in the Prometheus data source config (or in provisioning/datasources/prometheus.yml if you provision via YAML). Without the link, exemplars exist in Prometheus but the dashboard never surfaces them visually — leaving you with "I know they're there but I can't click them" frustration.
The four breaks compose. A team can have all four wired correctly except Break 4, see no exemplars in Grafana, conclude that exemplars are broken, and rip out the instrumentation. The cure is to walk the diagnostic ladder in order: scrape format → Prometheus flag → application instrumentation → Grafana data source. Each step is a one-command check; the entire ladder takes about 90 seconds. Most "exemplars don't work for us" stories are one of these four breaks, fixed in a single PR.
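The ladder condenses into a single script — a sketch assuming the local hosts and ports used throughout this chapter. Break 4 is a Grafana data-source setting with no HTTP endpoint to probe, so it stays a comment:

# exemplar_ladder.py — the 90-second diagnostic ladder in one script
import requests

def check(label, ok, hint):
    print(f"[{'ok' if ok else 'BREAK'}] {label}" + ("" if ok else f" — {hint}"))

# break 1: scrape format
r = requests.get("http://localhost:8080/metrics",
                 headers={"Accept": "application/openmetrics-text"})
check("exporter emits OpenMetrics",
      "openmetrics-text" in r.headers.get("Content-Type", ""),
      "switch to prometheus_client.openmetrics.exposition.generate_latest")

# break 2: Prometheus feature flag
flags = requests.get("http://localhost:9090/api/v1/status/flags").json()["data"]
check("exemplar-storage enabled",
      "exemplar-storage" in flags.get("enable-feature", ""),
      "start Prometheus with --enable-feature=exemplar-storage")

# break 3: instrumentation actually attaches exemplars
check("scrape carries # {...} annotations",
      any(" # {" in line for line in r.text.splitlines()),
      "observe() inside an active span, or pass exemplar= explicitly")

# break 4: Grafana — verify exemplarTraceIdDestinations in the Prometheus
# data source points at the Tempo data source (config-only, no HTTP check)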
Common confusions
- "Exemplars are the same as high-cardinality labels." A label adds a dimension to every series for that metric —
tenant_id="hyd-cloud-kitchens"creates a new series per tenant per metric, multiplying cardinality. An exemplar is a sparse annotation on a bucket increment — at most one per bucket per scrape, regardless of how many distinct trace_ids existed. Adding 50 trace_ids per scrape per histogram is bounded; adding 50 distinct label values is unbounded by QPS and creates 50 new series per metric. The cardinality cost of exemplars is fixed; the cardinality cost of labels scales with cardinality. - "Exemplars store the full trace inline." Exemplars store a
trace_id(and a value, and a timestamp) — typically about 50 bytes per bucket. The trace itself lives in Tempo / Jaeger, retrieved on demand when the operator clicks the dot. Storing the full trace inline would balloon Prometheus's WAL by orders of magnitude; the pointer-only design is what keeps exemplars cheap. - "Exemplars are required for histograms to work." Histograms work fine without exemplars — they record bucket counts and that is enough for
histogram_quantile(). Exemplars are an optional, additive feature that improves the correlation story without changing the measurement story. A team can ship histograms first, add exemplars when they have a Tempo backend, and lose nothing in between. - "Once enabled, exemplars persist forever." Prometheus's exemplar storage is a fixed-size in-memory ring buffer (default 100,000 entries). Once full, the oldest exemplars are evicted regardless of the underlying metric retention. In a busy cluster, exemplar retention is typically minutes to a few hours — not days. Treat them as ephemeral incident-triage tools, not historical lookups; for long-term retention, push them through the Tempo
metricsGeneratorto convert them to durable trace spans. - "Exemplars work with all metric types." Exemplars are only defined on histogram and counter increments in OpenMetrics. Gauges (which represent a state, not an event) do not have exemplars — there is no "request" to point at. If your slow-thing is a queue depth or a connection-pool size, exemplars are not the right primitive; you need a separate event stream or a histogram-of-state-changes instead.
- "All exemplar trace_ids are equally useful." An exemplar's trace_id is only useful if the trace is still in Tempo when the operator clicks. If Tempo's retention is 24 hours and the histogram has a 7-day retention, exemplars older than 24 hours point at evicted traces — Grafana renders the dot, the click goes through, Tempo returns a 404. The diagnostic is to align Tempo retention with the typical incident-investigation window (≥48 hours covers most "what happened last night" investigations) and accept that older exemplars are visual but non-clickable. A future-version fix is the
metricsGeneratorpattern noted above, which keeps a sampled subset of exemplar traces durable beyond Tempo's main retention.
Going deeper
The exemplar storage data structure inside Prometheus
Prometheus's exemplar storage is a dedicated ring buffer in the TSDB head, separate from the chunk storage that holds bucket counts. Each exemplar is stored as a (series_ref, labels, value, ts) tuple — series_ref is a 64-bit pointer to the bucket counter series, labels is the exemplar's label set (typically just trace_id), value is the observed value, ts is the observation timestamp. The buffer is sharded by series_ref to make per-series queries efficient — when query_exemplars is asked for a metric, the storage walks the shards corresponding to the metric's series, which is bounded by the number of buckets times the number of label combinations. The per-series exemplar count is also bounded (default 4 per series); the global cap is the 100,000 limit. The implementation is in tsdb/exemplar.go in the Prometheus repo; it is a few hundred lines and worth reading once if you want to understand exactly when an old exemplar is overwritten by a new one (FIFO within each series).
Native histograms — the next-generation exemplar story
Prometheus 2.40 introduced native histograms (also called sparse histograms), which represent a histogram as a single time series with an exponential bucket scheme rather than a fixed set of _bucket{le=...} series. Native histograms are 10-100x more space-efficient than classic histograms (one series instead of N+2) and have native exemplar support — every bucket increment can carry an exemplar without the OpenMetrics text-format gymnastics. As of 2026, native histograms are still experimental and require Prometheus, Mimir, and the Grafana frontend to all support them; the rollout has been gradual. Teams that have adopted them report exemplar resolution improving (more buckets = more granular dot placement on the panel) and storage cost dropping. The migration path is non-trivial — existing dashboards and recording rules must be rewritten — but native histograms are where exemplar usage is heading. The prometheus-client Python library supports emitting native histograms via Histogram(..., native=True) from version 0.22.
Exemplar sampling and the rare-event problem
A 0.3%-rate slow path (the hot-Redis-shard scenario in the example) generates roughly 1 slow request per 333. With a 15-second scrape interval and an le=5.0 bucket, the exemplar is overwritten by the latest slow request in each scrape window. If the slow rate drops to 0.01% (1 per 10,000), exemplars become rare — many scrapes have no observation in the slow bucket, so no exemplar is emitted. The bucket count goes up only sporadically, the dashboard shows occasional dots, and the operator may not notice the pattern. The fix is not to oversample (that breaks the bounded-cost design); it is to add a recording rule that aggregates the slow bucket over a longer window and surface its exemplars on the alerting panel rather than the raw histogram. A 5-minute aggregation pulls in 20 scrape windows, so even a 0.01% rate has 2-3 exemplars per aggregation bucket on average — enough to drive a click.
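The arithmetic behind that fix, as a sketch — every input is an assumption to replace with your own service's numbers:

# rare_event_math.py — expected slow observations per scrape vs per window
qps = 120            # per-instance request rate (assumed)
slow_rate = 0.0001   # 0.01% slow path
scrape_s = 15        # scrape interval
window_s = 300       # 5-minute recording-rule aggregation

per_scrape = qps * scrape_s * slow_rate      # expected slow obs per scrape
per_window = per_scrape * (window_s / scrape_s)
print(f"{per_scrape:.2f} slow observations/scrape → most scrapes show no dot")
print(f"{per_window:.1f} expected per 5m window → aggregation makes dots reliable")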
The metricsGenerator — converting exemplars to durable spans
Tempo's metricsGenerator component runs alongside the trace ingester and synthesizes RED-style metrics (rate, errors, duration) from incoming spans. As of Tempo 2.3, it can also synthesize exemplars: for each span that contributes to a histogram bucket increment, it emits an exemplar with the span's trace_id. The exemplars flow into Prometheus via remote-write, populating the storage buffer with traceable pointers. Critically, because the exemplars come from spans Tempo already has, they are durable — the trace lookup will succeed even after Prometheus's exemplar buffer has rotated, as long as the trace is still in Tempo's retention. This pattern flips the default: instead of exemplars being shorter-lived than traces, they are now anchored to traces directly. Teams running heavy correlation workflows (Razorpay, Hotstar, Swiggy at scale) increasingly use the metricsGenerator pattern to avoid the "click the dot, get a 404" problem.
The OpenTelemetry exemplar-filter — what to attach when
The OTel metrics SDK has a configurable exemplar filter that decides which observations get exemplars attached. The default filter is trace_based: only observations made inside a sampled trace get an exemplar. This is sensible — if the trace was not sampled, the trace_id points at nothing in Tempo. But it has a subtle interaction with tail-based sampling: the metric observation (and its exemplar) happens before the tail-sampling decision is made. If the tail sampler keeps the trace — say, because it had an error — the exemplar's click works. If the tail sampler drops it, the exemplar carries a trace_id pointing at a trace that was never stored: the dot renders, the click 404s. The interaction is brittle in production and worth understanding when designing the sampling policy: head-based sampling is exemplar-friendly (the sampling decision is made before the metric observation); tail-based sampling can produce exemplars pointing at discarded traces. A pragmatic policy is to always keep traces slow enough to have landed in the tail buckets — which is what most production OTel collectors now do via the tail_sampling processor's latency policy.
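The filter is selected through the spec-defined OTEL_METRICS_EXEMPLAR_FILTER environment variable; SDK support varies by language and version, so treat this as a sketch:

# exemplar_filter.py — choosing the OTel metrics exemplar filter
import os

# trace_based (default): exemplars only for observations inside a sampled span.
# always_on: attach to every observation — pairs badly with tail sampling,
#   because the trace_id may point at a trace the sampler later drops.
# always_off: never attach.
# Must be set before the MeterProvider is constructed.
os.environ["OTEL_METRICS_EXEMPLAR_FILTER"] = "trace_based"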
Where this leads next
The exemplar mechanism this chapter describes is the load-bearing primitive for the broader correlation story in /wiki/the-one-pane-of-glass-promise-and-its-limits — single-pane-of-glass works only when the navigation graph between metrics, traces, logs, and profiles is wired as edges, and exemplars are the metric→trace edge. The log-side equivalent (trace_id propagated into log lines) is covered in /wiki/log-to-trace-correlation-trace-ids-in-logs; the metric→log path is covered in /wiki/metric-to-log-drill-down. All three together form the cross-pillar correlation walk that this article's bidirectional model is one face of.
Within Part 2 of the curriculum, /wiki/exemplars-linking-metrics-to-traces covers the OpenMetrics encoding and SDK-level mechanics in deeper detail — this article assumed the reader either already knows that material or can pick it up from the reference. The relationship is: chapter 28 is the protocol article; this chapter is the navigation article. They are complementary, not duplicative.
Cross-curriculum, the pattern of "sparse pointer attached to aggregated count" appears again in distributed databases (Cassandra hinted handoff sentinels), in stream processors (Kafka offset commits as sparse pointers into log positions), and in storage systems (B-tree leaf-page exemplar samples for query planning). The exemplar idea — pay a small fixed cost per aggregation unit to retain a pointer back to a representative input — is broadly applicable wherever aggregation otherwise destroys traceability.
# Reproduce this on your laptop
# note: the stock prom/prometheus image ships a config that scrapes only
# Prometheus itself — mount a prometheus.yml whose scrape_configs points at
# the app (host.docker.internal:8080 on Docker Desktop), 15s interval
docker run -d --name prom -p 9090:9090 \
  -v $PWD/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus --enable-feature=exemplar-storage \
  --config.file=/etc/prometheus/prometheus.yml
docker run -d --name tempo -p 3200:3200 -p 4318:4318 grafana/tempo:latest
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp flask requests
python3 exemplar_pipeline.py &
# generate load — including ~0.3% slow requests via the hot-shard simulation
for i in $(seq 1 5000); do curl -s http://localhost:8080/lock/seat-$i > /dev/null; done
python3 walk.py
References
- Prometheus — Exemplars (feature flag and API) — the canonical doc for --enable-feature=exemplar-storage and the query_exemplars endpoint.
- OpenMetrics specification — Exemplar — the wire-format definition; useful when debugging Content-Type issues.
- Grafana — Tempo data source tracesToMetrics — the configuration that enables the reverse-direction click from a span to its histogram.
- OpenTelemetry — Metrics SDK exemplar filter — defines the trace_based and always_on filter modes and their interaction with sampling.
- Pelkonen et al. — Gorilla: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015) — the compression scheme that makes exemplar-bearing histograms cheap to store.
- Charity Majors et al. — Observability Engineering, Chapter 8 — the case for high-cardinality correlation as the foundation of debuggability; exemplars are the bounded-cost approximation for systems that cannot pay full per-event cardinality.
- Tempo metricsGenerator documentation — durable exemplar generation from spans.
- /wiki/exemplars-linking-metrics-to-traces — internal: chapter 28's deeper coverage of the OpenMetrics encoding and SDK mechanics this chapter assumed.
- /wiki/drill-down-and-correlation — internal: the click-walk discipline this article's exemplar mechanism enables.