Exemplars: linking metrics to traces
It is 19:42 IST on the night of an IPL final. Aditi is the on-call SRE at a Hotstar-scale streaming service. The Grafana panel histogram_quantile(0.99, http_request_duration_seconds_bucket) for service=checkout-api just ticked from 280ms to 4,800ms. The alert is firing. Aditi opens the Tempo UI and types { resource.service.name = "checkout-api" && duration > 4s } for the last 5 minutes — TraceQL takes 12 seconds, returns 11,400 candidate traces, and she is now scrolling through the result list looking for the one that matches the spike. By the time she finds a representative slow trace, the autoscaler has already added pods, the spike has subsided, and the actual root cause — a slow query to a Mumbai-region Aurora replica — is buried in trace 1,847 of 11,400. Exemplars are the fix. A single trace_id, attached to the histogram bucket that recorded that 4.8-second observation, makes the panel-to-trace journey a click, not a query. This chapter is about what the link actually is, how it is encoded, and why it took the industry ten years to standardise the smallest possible piece of telemetry that mattered.
An exemplar is a (label set, value, trace_id, timestamp) tuple attached to one bucket increment of a Prometheus histogram (or one observation of any metric). When tracer.start_as_current_span records a 4.8-second checkout, the same histogram.observe(4.8) call also records the active span's trace_id against the bucket the value lands in. The metrics scrape exposes exemplars in the OpenMetrics text format alongside the bucket count; Grafana renders them as clickable diamonds on a heatmap, deep-linking to the trace in Tempo. Cost is negligible (one trace_id per bucket per scrape interval); benefit is that the slowest 0.01% of requests have a stored, navigable witness instead of a statistical shadow.
The metric–trace gap that exemplars close
A Prometheus histogram is a count of how many observations fell into each bucket: le=0.1 got 12,847,002 observations, le=0.5 got 12,891,455, le=1.0 got 12,894,201, le=+Inf got 12,894,290. Feed those counts to histogram_quantile and interpolation gives you p99 ≈ 0.099s. Useful for alerting and dashboards. Useless for debugging, because the histogram has destroyed the identity of the slow requests — it knows 89 of them were above 1 second, but it does not know which 89. Distributed tracing has the opposite problem: every span has an identity (trace_id, span_id, attributes), but most spans were dropped at sampling time and the survivors are an unbiased random subset that almost certainly does not include the bucket-tail observations that triggered your p99 alert.
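For concreteness, here is a sketch of the interpolation on those counts. It mimics histogram_quantile's linear interpolation within a bucket; it is not the Prometheus implementation, just the arithmetic.
# histogram_quantile_sketch.py — the interpolation arithmetic behind histogram_quantile (not the Prometheus source)
buckets = [(0.1, 12_847_002), (0.5, 12_891_455), (1.0, 12_894_201), (float("inf"), 12_894_290)]

def quantile(q: float, buckets) -> float:
    total = buckets[-1][1]
    rank = q * total                                   # the rank of the q-th observation
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le                         # Prometheus caps the result at the highest finite bucket
            # linear interpolation inside the bucket (prev_le, le]
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(f"p99 ≈ {quantile(0.99, buckets):.3f}s")              # ≈ 0.099s for these counts
print(f"observations above 1s: {12_894_290 - 12_894_201}")  # 89, and not one of them has an identity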
The two pillars are complementary in theory and disconnected in practice. The Hotstar-final scenario is the canonical case: the panel shows the symptom (p99 spike), the trace store has the ground truth (the slow span tree), but no pointer connects them. Aditi has to reconstruct the link by guessing the right TraceQL filter, hoping the slow trace was retained by the sampler, and hoping she can recognise the right one among thousands of candidates. The reconstructed link is also lossy — head-based sampling at 1% leaves a 99% chance the exact 4.8-second request was dropped at ingest. Exemplars eliminate the guesswork because the link is established at observation time, before sampling, with the bucket increment that produced the alerting quantile.
The mental shift exemplars require: stop thinking of metrics and traces as separate streams that the operator joins by hand. The metric is the index, the trace is the row, and the exemplar is the foreign key. A histogram bucket holding one trace_id is the smallest possible primary-key reference that lets a Grafana panel deep-link into a Tempo span tree. The pre-exemplar workflow ("alert → eyeball trace store → guess at filters") is replaced by a single click on the histogram heatmap diamond.
Why exemplars are not just "another label": a label expands cardinality multiplicatively (one new label with 1,000 values turns one series into 1,000), which is the classic way to blow up a Prometheus TSDB. An exemplar is out-of-band; it does not change series cardinality. A series like http_request_duration_seconds_bucket{service="checkout",le="1.0"} stays one series; the exemplar is a separate datum attached to bucket increments, and only the most recent exemplar per bucket is retained between scrapes. The cost is one trace_id (16 bytes) per bucket per scrape interval: for a 60-bucket histogram scraped every 15s, that is roughly 4KB/min per histogram, against the ~250 bytes/min the bucket counts themselves take. The metric stays cheap; the link costs kilobytes, not new series.
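A quick sanity check on those figures, assuming a 16-byte trace_id, 60 buckets, a 15-second scrape interval, and roughly one byte per compressed bucket sample (the last figure is a rough assumption about TSDB compression):
# exemplar_cost_check.py — back-of-the-envelope check of the overhead figures above
buckets = 60
trace_id_bytes = 16                       # raw trace_id payload per exemplar
scrapes_per_minute = 60 / 15              # one scrape every 15 s
compressed_bytes_per_sample = 1           # rough assumption for TSDB-compressed bucket counts

exemplar_bytes = buckets * trace_id_bytes * scrapes_per_minute
count_bytes = buckets * compressed_bytes_per_sample * scrapes_per_minute

print(f"exemplar payload : ~{exemplar_bytes / 1024:.1f} KB/min per histogram")   # ~3.8 KB/min
print(f"bucket counts    : ~{count_bytes:.0f} bytes/min per histogram")          # ~240 bytes/min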
OpenMetrics encoding — the wire format that ships the trace_id
The exemplar wire format is defined in the OpenMetrics specification (CNCF, 2020) and adopted by Prometheus from version 2.26 onwards. A Prometheus scrape endpoint that supports exemplars returns text that looks identical to the legacy text format with one trailing addition per bucket — a # followed by a label set in braces, a value, and an optional timestamp. The exemplar is not a new protocol; it rides on the same line as the bucket count, and a parser that does not understand it simply ignores everything after the #. This backward compatibility is why exemplars rolled out without breaking any existing scrape pipeline.
Concretely, a histogram bucket without exemplars looks like this:
http_request_duration_seconds_bucket{service="checkout",le="1.0"} 12894201
With exemplars enabled, the same bucket looks like this:
http_request_duration_seconds_bucket{service="checkout",le="1.0"} 12894201 # {trace_id="4f8a91d2c7e3b29d"} 0.83 1716565321.842
The #-suffixed segment is {labels} value timestamp_seconds. Labels are a free-form set, but in practice contain trace_id and optionally span_id. The value 0.83 is the actual observation that landed in the bucket — useful because the bucket count alone tells you the observation was below 1.0s, and the exemplar value tells you it was 0.83s specifically. The timestamp is when the observation happened, not when it was scraped, which lets the consumer correlate the exemplar against the bucket's time-window correctly even across slow scrapes.
# exemplar_emit_and_scrape.py — emit Prometheus histograms with exemplars and parse them back
# pip install prometheus-client requests opentelemetry-api opentelemetry-sdk
from prometheus_client import Histogram, start_http_server
from prometheus_client.openmetrics.parser import text_string_to_metric_families  # the OpenMetrics parser surfaces exemplars; the legacy parser in prometheus_client.parser does not
import requests, time, random, threading, hashlib
# 1. Define a histogram that supports exemplars
HIST = Histogram(
"http_request_duration_seconds",
"HTTP request duration in seconds",
labelnames=["service"],
buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
def fake_trace_id() -> str:
"""16-byte hex string, like an OTel trace_id."""
return hashlib.sha256(str(random.random()).encode()).hexdigest()[:32]
# 2. Emit observations with exemplars attached
def emit_load():
services = ["checkout-api", "payments-api", "fraud-api"]
for i in range(20_000):
svc = random.choice(services)
# 99% fast, 1% slow tail (lognormal upper tail)
if random.random() < 0.99:
dur = abs(random.gauss(0.08, 0.04))
else:
dur = abs(random.gauss(2.5, 1.2))
tid = fake_trace_id()
# The .observe() call accepts an exemplar dict
HIST.labels(service=svc).observe(dur, exemplar={"trace_id": tid})
if i % 4000 == 0:
time.sleep(0.05)
# 3. Start the /metrics endpoint and the load generator
start_http_server(8000) # binds 0.0.0.0:8000/metrics
threading.Thread(target=emit_load, daemon=True).start()
time.sleep(2) # let some load accumulate
# 4. Scrape with the OpenMetrics Accept header to get exemplars in the response
headers = {"Accept": "application/openmetrics-text; version=1.0.0; charset=utf-8"}
text = requests.get("http://localhost:8000/metrics", headers=headers).text
# 5. Parse and surface the exemplars
exemplars_seen = []
for fam in text_string_to_metric_families(text):
if fam.name != "http_request_duration_seconds":
continue
for sample in fam.samples:
if sample.exemplar is not None:
exemplars_seen.append({
"labels": dict(sample.labels),
"value": sample.exemplar.value,
"trace_id": sample.exemplar.labels.get("trace_id"),
"ts": sample.exemplar.timestamp,
})
print(f"buckets emitted : {sum(1 for f in text_string_to_metric_families(text) for s in f.samples if s.name.endswith('_bucket'))}")
print(f"buckets with exemplars : {len(exemplars_seen)}")
print(f"sample exemplar:")
for ex in exemplars_seen[:3]:
print(f" bucket le={ex['labels'].get('le')} svc={ex['labels'].get('service')}")
print(f" value={ex['value']:.3f}s trace_id={ex['trace_id'][:16]}… ts={ex['ts']:.0f}")
A representative run prints:
buckets emitted : 132
buckets with exemplars : 119
sample exemplar:
bucket le=0.05 svc=checkout-api
value=0.041s trace_id=8c4a7e9d12f3b6a0… ts=1716565321
bucket le=2.5 svc=checkout-api
value=2.143s trace_id=4f8a91d2c7e3b29d… ts=1716565322
bucket le=10.0 svc=payments-api
value=4.872s trace_id=ad03e1b4f2c8a91e… ts=1716565322
Per-line walkthrough. The line HIST = Histogram(...) declares a histogram. Recent releases of the prometheus-client library accept an exemplar argument, and .observe(value, exemplar={...}) is the entry point. The line HIST.labels(service=svc).observe(dur, exemplar={"trace_id": tid}) is the load-bearing call — it increments the bucket count for dur's bucket and attaches the exemplar to that bucket, replacing whatever exemplar was there before. Only one exemplar is retained per bucket (the most recent). The line headers = {"Accept": "application/openmetrics-text; ..."} is the negotiation that asks for the OpenMetrics text format instead of legacy Prometheus text — the legacy format does not include exemplars in the response, so without this header the client would never see them. Why one exemplar per bucket and not all of them: bucket increment rates can hit 100K/sec on a busy histogram. Storing every observation's trace_id would multiply the metrics-storage cost by 100,000× and turn the metric into a trace store. The single-most-recent retention per bucket is the deliberate compromise — you get one representative slow request per bucket per scrape interval, which is exactly what is needed to follow up the alert. The "I want every slow trace" workflow is what the trace store is for, not the metric store.
The line if sample.exemplar is not None: in the parser shows that the OpenMetrics parser bundled with prometheus_client (prometheus_client.openmetrics.parser) surfaces exemplars as first-class objects on each sample, with .value, .labels, and .timestamp attributes; the legacy text parser leaves the field as None. Most production code does not parse /metrics directly — Prometheus's TSDB does — but the parser is the contract: any tool consuming OpenMetrics text needs to handle this field.
How OpenTelemetry SDKs auto-attach the trace_id
In production, the application code does not pass exemplar={"trace_id": ...} by hand — the OpenTelemetry SDK does it automatically when both metrics and traces are configured. The mechanism is straightforward: the OTel metrics SDK's histogram .record() method checks the active span in the current context (opentelemetry.trace.get_current_span()); if the span is sampled and recording, its trace_id and span_id are captured as the exemplar's labels. If there is no active span (e.g., a metric recorded outside a request handler), no exemplar is attached. This zero-effort integration is why exemplars are ubiquitous in OTel-instrumented services — every Flask/FastAPI/Django handler decorated with the OTel auto-instrumentation gets exemplars for free.
The SDK-level rule is: exemplars only carry sampled trace_ids. If a trace was head-sampled at ingest with a 1% rate, 99% of bucket observations have no usable exemplar (the trace_id is captured, but the trace was not actually sent to the backend, so the link is dead). To keep exemplar usefulness high, OTel implementations either (a) attach only sampled trace_ids — if span.is_recording() and span.context.trace_flags.sampled, or (b) attach all trace_ids and rely on tail-based sampling at the trace pipeline to keep the slow ones. Approach (b) is the right design for histograms because the bucket-tail observations are exactly the ones the tail sampler keeps, so exemplars on slow buckets are far more likely to point to retained traces than exemplars on fast buckets. This is the exemplar-sampling alignment property that makes the link reliable in practice.
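A minimal sketch of the option-(a) check, written against the public OTel Python API; the helper name exemplar_labels_from_context is mine, not an SDK function, and real SDKs bury this logic inside their exemplar reservoirs:
# exemplar_filter_sketch.py — hypothetical "only sampled spans" exemplar filter (not an SDK API)
from typing import Optional
from opentelemetry import trace

def exemplar_labels_from_context() -> Optional[dict]:
    """Return exemplar labels for the active span, or None if there is no usable span."""
    span = trace.get_current_span()
    ctx = span.get_span_context()
    # No active span, or a span the sampler dropped: a trace_id here would point at nothing.
    if not span.is_recording() or not ctx.trace_flags.sampled:
        return None
    return {
        "trace_id": trace.format_trace_id(ctx.trace_id),
        "span_id": trace.format_span_id(ctx.span_id),
    }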
# otel_exemplar_pipeline.py — show OTel auto-attaching trace_ids to histogram observations
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-prometheus
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from prometheus_client import start_http_server
import time, random
# Configure tracing with 100% head-sampling for demo (in production, use 1–5%)
trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(1.0)))
tracer = trace.get_tracer("checkout-api")
# Configure metrics with the Prometheus exporter
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("checkout-api")
duration_hist = meter.create_histogram(
name="http_request_duration_seconds",
unit="s",
description="HTTP request duration",
)
start_http_server(8001) # exposes /metrics
# Simulate handlers — each handler is wrapped in a span; the metric inherits trace_id
def handle_checkout():
with tracer.start_as_current_span("POST /checkout") as span:
# Simulate work; some requests are slow
if random.random() < 0.97:
d = abs(random.gauss(0.08, 0.03))
else:
d = abs(random.gauss(3.0, 1.5))
        time.sleep(min(d, 0.002))  # cap wall time so the 2,000-request demo finishes in a few seconds
span.set_attribute("checkout.amount_inr", random.randint(199, 9999))
# The histogram observation runs inside the span context;
# the OTel SDK reads get_current_span() and attaches the trace_id
duration_hist.record(d, attributes={"service": "checkout-api"})
for _ in range(2_000):
handle_checkout()
# Inspect the /metrics response
import requests
text = requests.get("http://localhost:8001/metrics",
headers={"Accept": "application/openmetrics-text"}).text
# Count lines with exemplars
ex_lines = [ln for ln in text.split("\n") if "trace_id=" in ln]
print(f"lines with exemplars: {len(ex_lines)}")
for ln in ex_lines[:3]:
print(f" {ln[:140]}")
A representative run prints:
lines with exemplars: 47
http_request_duration_seconds_bucket{service="checkout-api",le="0.1"} 1942 # {trace_id="2c4af71d8e93b045",span_id="91a3e7b2"} 0.072 1716565412.318
http_request_duration_seconds_bucket{service="checkout-api",le="2.5"} 1981 # {trace_id="6b8d12fc5e7a4093",span_id="38e1f9c2"} 2.241 1716565413.127
http_request_duration_seconds_bucket{service="checkout-api",le="+Inf"} 2000 # {trace_id="ad03e1b4f2c8a91e",span_id="74f2c1ad"} 5.183 1716565413.412
Per-line walkthrough. The line with tracer.start_as_current_span("POST /checkout") as span: establishes the active span context. Until the with block exits, trace.get_current_span() returns this span anywhere in the call stack. The line duration_hist.record(d, attributes={"service": "checkout-api"}) is the metric observation — the attributes={...} dict becomes histogram labels (this part is normal labelling), but the trace_id is not in the attributes and not controlled by the user — the SDK's exemplar reservoir reads the active span at record time. Why exemplars must read the span at observation time, not at scrape time: scrape time can be 15+ seconds after the observation, and by then the span context has long since closed. The trace_id has to be captured synchronously when the bucket is incremented, and held in a per-bucket reservoir until the next scrape (or until a newer observation overwrites it). This synchronous capture is also why exemplars cost wall-clock time inside the request — about 100ns per observation in OTel SDK 1.20+, which is negligible against the ~50µs cost of a typical HTTP handler but visible on hot-path microservices that observe metrics millions of times per second.
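A minimal sketch of such a per-bucket reservoir, assuming the simplest possible policy (keep the most recent observation, hand it over and reset at collection time); real SDK reservoirs differ, e.g. the OTel default samples within the interval rather than always keeping the last observation:
# last_value_reservoir_sketch.py — a keep-latest exemplar reservoir per histogram bucket (illustrative only)
import time
from typing import Optional

class LastValueReservoir:
    """At most one exemplar per bucket; every new observation overwrites the previous one."""
    def __init__(self) -> None:
        self._exemplar: Optional[dict] = None

    def offer(self, value: float, trace_id: str, span_id: str) -> None:
        # Called synchronously on every observation, while the span context is still live.
        self._exemplar = {"value": value, "trace_id": trace_id,
                          "span_id": span_id, "timestamp": time.time()}

    def collect(self) -> Optional[dict]:
        # Called at scrape/export time: return the retained exemplar and clear the slot.
        ex, self._exemplar = self._exemplar, None
        return ex

# one reservoir per (series, bucket) pair
reservoirs = {("checkout-api", 1.0): LastValueReservoir()}
reservoirs[("checkout-api", 1.0)].offer(0.83, "4f8a91d2c7e3b29d", "91a3e7b2")
print(reservoirs[("checkout-api", 1.0)].collect())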
Storage, query, and retention — the Prometheus side
Prometheus stores exemplars in a separate circular buffer, not in the TSDB chunk format that holds bucket counts. The feature has to be switched on with --enable-feature=exemplar-storage; the buffer holds 100K exemplars per Prometheus instance by default, a hard cap raised or lowered via the max_exemplars setting under storage.exemplars in the server configuration. Exemplars are not subject to the standard TSDB retention (typically 15 days for metrics) — they are FIFO-evicted when the buffer fills, which on a busy fleet means ~30 minutes to 6 hours of exemplar retention. This shorter retention is fine for the workflow exemplars are designed for: chasing a fresh alert. It is wrong for forensic analysis a week later, where the exemplars are gone but the bucket counts are not.
The query protocol is the /api/v1/query_exemplars endpoint (Prometheus 2.26+), which takes a PromQL series matcher and a time range and returns the exemplars associated with the matching series:
GET /api/v1/query_exemplars
?query=http_request_duration_seconds_bucket{service="checkout-api"}
&start=1716565000&end=1716565600
The response is a JSON array of {seriesLabels, exemplars: [{labels, value, timestamp}]}. Grafana queries this endpoint when a dashboard panel has the "exemplars" toggle on, and renders them as diamonds on top of the bucket heatmap. Clicking a diamond extracts the trace_id from the exemplar's labels and constructs a Tempo URL — this is the deep-link that makes the workflow one-click.
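The same lookup from Python, for illustration; it assumes a local Prometheus started with --enable-feature=exemplar-storage on the default port, and the series matcher and timestamps are placeholders:
# query_exemplars_demo.py — pull exemplars for a series matcher via the Prometheus HTTP API
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query_exemplars",
    params={
        "query": 'http_request_duration_seconds_bucket{service="checkout-api"}',
        "start": "1716565000",
        "end": "1716565600",
    },
)
resp.raise_for_status()
for series in resp.json()["data"]:
    svc = series["seriesLabels"].get("service")
    for ex in series["exemplars"]:
        # each exemplar carries its own labels (trace_id), value, and timestamp
        print(f"{svc} value={ex['value']} ts={ex['timestamp']} trace_id={ex['labels'].get('trace_id')}")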
The two storage gotchas worth knowing in production: (1) Exemplars are not federated — Prometheus's /federate endpoint does not include exemplars, so a top-level Prometheus federating from regional Prometheuses sees buckets but not exemplar trace_ids. The fix is to use remote_write instead of federation; the remote_write protocol can carry exemplars when send_exemplars is enabled in the remote_write config. (2) Exemplar storage uses ~150 bytes per exemplar in memory, so the default 100K-exemplar cap costs ~15MB RAM. Scaling to 1M exemplars (a fleet emitting ~5K obs/sec across all services) costs ~150MB and noticeably increases startup time as the buffer hydrates from the WAL.
Common confusions
- "An exemplar is the same as a label." No — labels increase series cardinality multiplicatively and are stored in the TSDB index. Exemplars are out-of-band, retained in a circular buffer per series, do not affect cardinality, and do not participate in PromQL filtering. You cannot write
histogram_quantile(0.99, ... {trace_id="..."})becausetrace_idis not a label. - "Exemplars work without OpenMetrics." Wrong — the legacy Prometheus text format has no syntax for exemplars. The scrape client must send
Accept: application/openmetrics-textand the exporter must respond with that content type. Many old exporters (pre-2022) silently strip exemplars even when configured to emit them; check the Content-Type header on the response if exemplars seem missing. - "Exemplars survive forever like bucket counts." No — Prometheus's exemplar buffer is a small circular buffer (100K default), evicted FIFO. On a busy fleet, exemplars older than 30 minutes to 6 hours are gone. Bucket counts have full TSDB retention (15 days+). Plan to investigate within the exemplar window or fall back to TraceQL.
- "Every histogram observation gets an exemplar." Only one exemplar is retained per bucket per scrape interval (the most recent). If 50,000 observations land in
le=1.0between two scrapes, you get one exemplar pointing to the last one. This is a feature, not a bug — full retention would turn the metric store into a trace store. - "Exemplars from sampled traces are useless." Partially true: if a trace was head-sampled at 1% ingest, the exemplar's trace_id points to a trace that does not exist in the backend. The fix is tail-based sampling — keep all errors and slow traces — which aligns with the bucket-tail observations exemplars naturally point to. With tail-sampling, exemplar-pointed traces are almost always retained.
- "Exemplars only work for histograms." OpenMetrics specifies exemplars on counters and histograms; some implementations (Prometheus, OTel) restrict them to histograms because that is the dominant use case. Counter exemplars are useful for rare events ("which request triggered this 5xx counter increment") but are emitted by fewer SDKs.
Going deeper
The 2014–2020 history — why exemplars took six years
The metric–trace link was conceptually obvious by 2015 — Dapper-era Google was already doing it internally — but the OSS world stalled because the Prometheus text format had no way to encode side-data without breaking parsers. The OpenMetrics working group (CNCF, 2017) ratified the # {labels} value timestamp syntax precisely as a backward-compatible extension; legacy parsers ignore the #-suffix as a comment. Prometheus implemented the read side in 2.26 (March 2021) and Grafana picked up rendering in 7.4. The lag between concept and ubiquity was almost entirely format-compatibility politics — there was no technical reason the link could not have shipped in 2016.
High-cardinality exemplar labels and the gotcha
The OpenMetrics spec allows arbitrary labels in exemplars, not just trace_id. A team can attach {customer_id="xyz", trace_id="...", region="ap-south-1"} per exemplar. Tempting, but dangerous: while exemplars do not increase series cardinality, they do consume the per-series exemplar buffer, and a high-cardinality label set in exemplars makes deduplication useless and wastes the buffer. Best practice is to keep exemplar labels minimal (trace_id always, optionally span_id) and let the destination trace store carry the rest as span attributes.
Exemplar storage at scale — beyond Prometheus
For long-term exemplar retention (weeks, not hours), the standard pattern is to remote_write to a long-term backend that retains exemplars: Mimir (Grafana, since 2.0), Cortex, and VictoriaMetrics enterprise all carry the field. Mimir's exemplar storage uses object storage with the same block-and-bloom design as Tempo — exemplars are stored as a separate column alongside bucket counts, accessed via a query_exemplars API forwarded from the standard Prometheus protocol. The retention is configured independently of metric retention; a typical fleet keeps 30 days of metrics and 7 days of exemplars to balance cost.
Razorpay's exemplar-driven alert workflow
A Razorpay-style payment fleet emits histograms for every UPI hop (payment_initiate, bank_callback, npci_response, merchant_notify). Each hop's histogram has exemplars wired through OTel auto-instrumentation. When the on-call gets paged for payment_p99 > 800ms, the Grafana panel is configured to show exemplars by default — a single click on the upper-bucket diamond at the alert timestamp opens the Tempo trace view. Mean time to root cause dropped from ~22 minutes (pre-exemplar, 2021) to ~90 seconds (post-exemplar, 2023) for the workflows that follow this pattern. The remaining latency is reading the trace, not finding it. This is the operational impact exemplars are designed to deliver, and is the strongest case for wiring them through the entire OTel stack rather than treating them as a nice-to-have.
Reproduce this on your laptop
# Reproduce the exemplar emission and parsing experiments
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client requests opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-prometheus
python3 exemplar_emit_and_scrape.py # raw prometheus-client exemplars
python3 otel_exemplar_pipeline.py # OTel auto-attaches trace_id from active span
# To run real Prometheus + Tempo + Grafana with exemplar wiring:
git clone https://github.com/grafana/intro-to-mltp && cd intro-to-mltp
docker compose up -d
# Open Grafana at localhost:3000 → Explore → enable "Exemplars" toggle on histogram panels.
Where this leads next
- TraceQL — querying traces like you query metrics — once exemplars deep-link a histogram bucket to a trace, the next layer is querying traces structurally; exemplars feed the URL, TraceQL feeds the search.
- Trace storage at scale: Tempo's columnar approach — the storage substrate the exemplar URL points at; exemplar retention (hours) is much shorter than trace retention (days), so the workflow assumes the backing trace is still in Tempo when the diamond is clicked.
- Trace sampling: head, tail, adaptive — head-sampling kills exemplars for the dropped-trace fraction; tail-sampling rescues them by aligning kept-trace selection with bucket-tail observations.
- Cardinality: the master variable — exemplars are a deliberate design that adds linkage without inflating cardinality; understanding why they sidestep the problem is the same lens you bring to label hygiene generally.
The next chapter follows exemplars into the broader correlation pattern that ties metrics, logs, and traces into one navigable plane: how trace_id propagates not just through histogram buckets but through structured log lines and span events, so the operator can pivot from any pillar to either of the others in one click.
References
- OpenMetrics specification — exemplars — the canonical wire-format definition; the # {labels} value timestamp syntax and the rules for which sample types may carry exemplars.
- Prometheus 2.26 release notes — exemplar storage — the first Prometheus release to ingest, store, and serve exemplars; the design decisions on circular-buffer storage and the query_exemplars endpoint.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022), chapter 7 — the connect-the-pillars chapter that frames exemplars as the smallest possible navigation primitive.
- Grafana — exemplars in dashboards — how Grafana renders exemplars on histogram heatmaps and constructs the deep-link URLs to Tempo / Jaeger / Zipkin.
- OpenTelemetry SDK — exemplar reservoir specification — how the OTel metrics SDK selects which observation to retain per bucket; the default reservoir samples uniformly within the scrape interval.
- Mimir long-term exemplar storage — the design that extends exemplar retention from Prometheus's hours to weeks via object storage; relevant when bucket-to-trace lookups must work days after the alert.
- Trace storage at scale: Tempo's columnar approach — the trace store that exemplar URLs point into.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), chapter 4 — the foundational text on connecting observability pillars; exemplars are the realisation of the connection that book argued for.