Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Drill-down and correlation
It is 21:08 IST on a Tuesday and Karan, an SRE at a hypothetical Bengaluru-based food-delivery firm we will call FlavrDash, sees the tier-1 dashboard for order-orchestrator go red. The 1h burn-rate panel is at 18.2 — well past the 14.4 page threshold — and the worst-tenant panel is showing tenant=hyderabad-cloud-kitchens at 41% error rate. The aggregate error rate across the fleet is 0.7%, almost invisibly above the 0.5% SLO. Karan has 47 minutes of error budget left at the current burn rate. He clicks the worst-tenant panel. The drill-down tier-3 dashboard opens, pre-filtered to that tenant, and he sees a histogram of failed orders by Kafka partition — 38 of the 39 errors are on partition=14. He clicks the partition cell. A trace explorer opens with five exemplar traces from that partition; one of them, trace_id=a91c4f..., has a 4.8-second span on a downstream inventory-check call to a Postgres replica that just finished its VACUUM FULL. The whole walk — burn-rate → tenant → partition → trace → query — took 3 minutes 12 seconds. Karan rolls back the bad pg_cron change. The burn-rate gauge starts dropping at 21:15. Nobody else on the team woke up. This is the chapter on the mechanism that made that walk possible.
Drill-down is hierarchical reduction — each click narrows the scope by a known dimension (tenant, region, pod, endpoint, span). Correlation is the join — linking a red metric to the specific traces, logs, and code paths that produced it. Both fail when the artefacts are not authored to be linked: dashboards without click-throughs, metrics without exemplars, logs without trace IDs, traces without resource attributes. The discipline is to make the join points explicit at instrumentation time, before the first incident.
What drill-down actually is — hierarchical scope reduction
Drill-down is the operation of taking a measurement at one scope and asking "which sub-scope produced this?". The tier-1 panel says the fleet error rate is 0.7%. Drilling down means asking: which pod? Then: for that pod, which endpoint? Then: for that endpoint, which tenant? Each click is a WHERE clause appended to the underlying query, narrowing the dimensional cube by one axis. The dashboard is the visual frontend; the database underneath is doing dimensional aggregation, and drill-down is the inverse — picking a slice and re-aggregating at finer granularity.
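A sketch of what that click sequence compiles to, as successive PromQL queries — the metric and label names follow the FlavrDash narrative and are hypothetical:

# tier-1 panel: fleet-wide error rate
sum(rate(orders_failed_total{service="order-orchestrator"}[5m]))
  / sum(rate(orders_total{service="order-orchestrator"}[5m]))
# click 1 — same measurement, split by pod
sum by (pod) (rate(orders_failed_total{service="order-orchestrator"}[5m]))
# click 2 — WHERE pod=..., split by endpoint
sum by (endpoint) (rate(orders_failed_total{service="order-orchestrator",
    pod="order-orch-7d9f-xk2"}[5m]))
# click 3 — WHERE endpoint=..., split by tenant
sum by (tenant_id) (rate(orders_failed_total{service="order-orchestrator",
    pod="order-orch-7d9f-xk2", endpoint="/order"}[5m]))

Each click appends one matcher and changes the grouping axis — the WHERE clause made literal.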
The mathematical shape is a fan-out tree. The root is the top-level metric (error_rate{service="order-orchestrator"}). The first split is by the most-likely-relevant high-cardinality dimension (pod for fleet-wide pathology, tenant_id for noisy-neighbour pathology, region for regional-failure pathology, endpoint for code-path pathology). The second split goes one level deeper. The third lands on a span, a log line, or a query. A well-designed dashboard pre-orders these splits — the first drill-down level is the one that disambiguates the most common failure modes for that service shape.
The reason most teams get this wrong is that they think of drill-down as a UI feature — "click on a panel and Grafana magically shows you more". Drill-down is not a UI feature. It is a data architecture in which every metric carries enough labels (or exemplars, or links) to be joinable to the next level. A panel that does not link to anything when clicked is a panel that has no drill-down architecture behind it — the click has nowhere to go because the metric was never authored to be drillable. The fix is at instrumentation time, not at dashboard time.
Why the architecture has to come first: at incident time, the on-call engineer is in pattern-matching mode (per the Klein NDM framework discussed in the previous chapter), with maybe 5-10 minutes of error budget. They cannot stop to write a new PromQL query, attach a new label, or instrument a new span. Every drill-down step has to be a click, not a query authoring action. That requires the join points (pod label on the metric, tenant_id on the span, trace_id on the log line) to already exist in the telemetry — which means they had to be added at instrumentation time, weeks or months before the incident. The cost of adding them is paid up front; the value is realised only when an incident actually happens. Teams that have not paid that cost find out at 02:46 that their dashboard has no drill-down architecture.
Correlation — the join across pillars
Drill-down operates within a single telemetry pillar (metrics → metrics, or traces → traces). Correlation is the harder operation: linking a metric to a trace, a trace to a log, a log to a profile. The three pillars (and the fourth, profiles) are stored in separate backends — Prometheus / Mimir / Cortex for metrics, Tempo / Jaeger for traces, Loki / Elasticsearch for logs, Pyroscope / Parca for profiles. Correlation is a join across these backends, and the join key has to be carried through all four by your instrumentation.
The canonical join keys are: trace_id (W3C TraceContext, propagated through HTTP headers), service.name (OTel resource attribute), pod / host.name (OTel resource attribute or Prometheus label), tenant_id (custom span attribute and metric label), and request_id / correlation_id (custom, carried through baggage). When all four pillars share these keys, the join is mechanical — click a high-latency point on a histogram, the linked exemplar gives you a trace_id, paste it into the trace explorer, see the span tree, click a span, the linked log query ({trace_id="..."}) gives you the log lines from that exact span. Each click is a WHERE on a different backend, joined by a shared key.
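What the shared key looks like on the wire: a minimal sketch of the OTel propagation API, where the carrier dict stands in for outgoing HTTP headers (the value shown is the W3C TraceContext format, not a specific request):

from opentelemetry.propagate import inject

carrier = {}      # stands in for the outgoing HTTP request headers
inject(carrier)   # writes the current span context into the carrier
# (inside an active span; with no active span the carrier stays empty)
# carrier == {"traceparent": "00-<32-hex trace_id>-<16-hex span_id>-01"}
# the receiving service extracts the same trace_id, so its spans, logs,
# and exemplars share one join key with the caller's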
When the keys are missing, correlation collapses into manual archeology: look at the timestamp of the metric breach, scan logs by timestamp ± 30 seconds, hope the right log lines surface, hand-correlate by service name and message content. This is what most incident timelines actually look like, and it is why post-mortems often end with "we should add trace_id to our logs". The post-mortem is correct; the action item is real; and the team will only get around to half of it before the next incident.
The economics are stark. Teams that author the join keys up front pay a one-time cost: ~3-4 lines of OTel SDK setup (tracer.start_as_current_span, logger.bind(trace_id=...), pyroscope.tag_wrapper({"trace_id": ...})). Teams that retrofit pay the cost across every service, every language, every CI pipeline — and pay it under incident pressure, which is when the cost is highest. Razorpay's platform team published an internal style guide in 2024 that mandated trace_id propagation through every internal HTTP header, every Kafka message header, and every log line — and reported that mean-time-to-resolution dropped from 47 minutes to 12 minutes for cross-service incidents within two quarters. The drop was not because the bugs got easier; it was because the correlation walk became mechanical instead of archeological.
An additional subtlety: drill-down must be idempotent under repeat clicks. If the on-call engineer clicks the same drill-down link twice (because the page took a moment to load and they double-clicked), the destination dashboard must not double-filter or accumulate filter state across navigations. Grafana's data-link variable substitution is idempotent by design (each navigation replaces the variable rather than appending), but custom URL-construction code in dashboards-as-code generators sometimes gets this wrong — appending &var-tenant=hyd-cloud-kitchens to a URL that already carries var-tenant=hyd-cloud-kitchens produces a duplicated parameter, which Grafana may interpret as a multi-value selection rather than the single tenant that was clicked. The diagnostic: navigate twice from the same panel, compare the URL after each click, ensure they are identical.
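A minimal sketch of the idempotent construction for a dashboards-as-code generator — the helper name is hypothetical; the point is to replace the variable, never append it:

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def set_dashboard_var(url: str, key: str, value: str) -> str:
    """Set a Grafana URL variable idempotently: repeat calls are no-ops."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query[key] = value  # replace, never append
    return urlunsplit(parts._replace(query=urlencode(query)))

# the double-click diagnostic as a one-line check: applying the same
# drill-down twice yields the identical URL
url = "https://grafana.flavrdash/d/order-orch-tenant-detail?from=now-1h"
once = set_dashboard_var(url, "var-tenant", "hyd-cloud-kitchens")
assert set_dashboard_var(once, "var-tenant", "hyd-cloud-kitchens") == once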
A working drill-down + correlation pipeline in Python
The example below is a runnable Python harness that emits a metric with an exemplar, captures a trace with the same trace_id, and emits a log line bound to that trace_id — then queries Prometheus for the metric, Tempo for the trace, and Loki for the log line, joining them by the shared key. It is the minimal correlation pipeline that turns drill-down from a UI fantasy into a working operational tool. The pipeline is what FlavrDash's platform team would call its tier-0 instrumentation contract.
# correlation_walk.py — emit linked metric+trace+log, then walk them by trace_id
# pip install prometheus-client opentelemetry-api opentelemetry-sdk \
#   opentelemetry-exporter-otlp opentelemetry-instrumentation-flask \
#   flask requests loguru
import json
import random
import sys
import time

from flask import Flask
from loguru import logger
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from prometheus_client import Histogram
# exemplars survive only in the OpenMetrics exposition format — the plain
# text format silently drops them
from prometheus_client.openmetrics.exposition import (CONTENT_TYPE_LATEST,
                                                      generate_latest)

# 1. set up OTel with a resource that carries service.name + pod
resource = Resource(attributes={"service.name": "order-orchestrator",
                                "pod": "order-orch-7d9f-xk2",
                                "tenant_id": "default"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# emit flat JSON log lines so Loki's `| json` sees trace_id at the top level
def _json_sink(message):
    r = message.record
    print(json.dumps({"ts": r["time"].isoformat(), "level": r["level"].name,
                      "message": r["message"], **r["extra"]}))

logger.remove()
logger.add(_json_sink)

# 2. histogram with exemplar support — Prometheus needs
#    --enable-feature=exemplar-storage to store what we attach below
LAT = Histogram("order_latency_seconds", "order latency",
                buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0])

app = Flask(__name__)

@app.route("/order/<tenant>")
def place_order(tenant):
    with tracer.start_as_current_span(
            "place_order",
            attributes={"tenant_id": tenant, "endpoint": "/order"}) as span:
        ctx = span.get_span_context()
        trace_id = format(ctx.trace_id, "032x")
        span_id = format(ctx.span_id, "016x")
        # 3. log line bound to trace_id — this is the join key for Loki
        bound = logger.bind(trace_id=trace_id, span_id=span_id,
                            tenant_id=tenant, pod="order-orch-7d9f-xk2",
                            service="order-orchestrator")
        bound.info(f"order received for {tenant}")
        t0 = time.time()
        # 4. simulate a slow downstream call for one tenant
        time.sleep(random.uniform(0.02, 0.08) if tenant != "hyd-cloud-kitchens"
                   else random.uniform(2.5, 4.5))
        elapsed = time.time() - t0
        # 5. observe with exemplar — this is the join key for Tempo
        LAT.observe(elapsed, exemplar={"trace_id": trace_id, "span_id": span_id})
        bound.info(f"order completed in {elapsed:.3f}s")
        return {"ok": True, "trace_id": trace_id}

@app.route("/metrics")
def metrics():
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(port=8080)
A second script does the correlation walk — it queries Prometheus for the slowest histogram bucket, extracts the exemplar trace_id, queries Tempo for the trace, queries Loki for the log lines, and prints the joined view. This is what a Grafana "Explore" panel does when wired correctly, but seeing it as Python makes the join explicit:
# walk.py — given a Prometheus exemplar, walk to Tempo and Loki by trace_id
# Sample run output (what the reader sees on stdout):
#
#   [exemplar] order_latency_seconds bucket le=5.0 trace_id=a91c4f8b...e2 value=4.812s
#   [tempo]    trace a91c4f8b... has 7 spans, root=place_order tenant=hyd-cloud-kitchens
#              span inventory-check duration=4.78s status=ERROR
#              attributes: db.statement="SELECT ... FROM inventory WHERE ..."
#   [loki]     7 log lines for trace_id=a91c4f8b... in service=order-orchestrator
#              21:08:14.103 INFO  order received for hyd-cloud-kitchens
#              21:08:14.105 INFO  inventory-check started
#              21:08:18.881 ERROR inventory-check timeout after 4.78s
#              21:08:18.882 WARN  retry 1/3 for inventory-check
#              ...
import json
import time

import requests

PROM = "http://localhost:9090"
TEMPO = "http://localhost:3200"
LOKI = "http://localhost:3100"

def slowest_exemplar():
    r = requests.get(f"{PROM}/api/v1/query_exemplars",
                     params={"query": "order_latency_seconds_bucket",
                             "start": time.time() - 300, "end": time.time()})
    candidates = [(float(e["value"]), e["labels"]["trace_id"])  # value is a string
                  for series in r.json()["data"]
                  for e in series.get("exemplars", [])]
    return max(candidates) if candidates else (None, None)

def trace_view(tid):
    # Tempo returns OTLP-shaped JSON: batches → scopeSpans → spans
    r = requests.get(f"{TEMPO}/api/traces/{tid}").json()
    spans = [s for batch in r["batches"]
             for scope in batch.get("scopeSpans", [])
             for s in scope["spans"]]
    err = [s for s in spans
           if s.get("status", {}).get("code") == "STATUS_CODE_ERROR"]
    return spans, err

def logs_for_trace(tid):
    q = f'{{service="order-orchestrator"}} | json | trace_id="{tid}"'
    now_ns = int(time.time() * 1e9)  # Loki takes nanosecond epochs
    r = requests.get(f"{LOKI}/loki/api/v1/query_range",
                     params={"query": q, "start": now_ns - 600 * 10**9,
                             "end": now_ns})
    return [json.loads(line[1]) for stream in r.json()["data"]["result"]
            for line in stream["values"]]

if __name__ == "__main__":
    value, tid = slowest_exemplar()
    if tid is None:
        raise SystemExit("no exemplars in the last 5m — generate some load first")
    print(f"[exemplar] trace_id={tid} value={value:.3f}s")
    spans, err = trace_view(tid)
    print(f"[tempo]    trace {tid} has {len(spans)} spans, {len(err)} in error")
    for rec in logs_for_trace(tid):
        print(f"[loki]     {rec.get('level')} {rec.get('message')}")
Walking through the load-bearing lines: Resource(attributes={...}) authors the resource attributes once, and every span emitted through this tracer carries them — service.name, pod, tenant_id are now joinable from the trace backend without re-authoring per-span. logger.bind(trace_id=..., span_id=...) is the loguru pattern for structured logging — every log line emitted via bound.info(...) includes those keys as JSON fields, which Loki indexes (or | json extracts) for the log → trace join. LAT.observe(elapsed, exemplar={"trace_id": ...}) is the Prometheus exemplar API — it attaches a sparse trace_id to the histogram observation, which Grafana surfaces as a clickable point on the histogram panel; clicking it deep-links into the trace backend with the trace_id pre-filled. Why exemplars matter even when the metric and trace are in different backends: the exemplar is a side-channel on the metric — Prometheus stores at most one exemplar per histogram bucket per scrape interval, so the storage cost is negligible (a 50-bucket histogram costs at most 50 trace_ids per scrape). But that one trace_id is the only join key that survives metric aggregation. Once the histogram has been compressed to bucket counts, the original observations are gone — the exemplar is the only thread back to a specific request that fell in that bucket. Without exemplars, the click on a "p99 spike" panel can give you "the bucket that contains the spike", but not "a request that produced the spike". The exemplar is the difference between drill-down and drill-blind.
The correlation walk is mechanical because every step has a deterministic join: metric → exemplar → trace_id → tempo → spans → trace_id (still) → loki → log lines. Every join uses a key authored at step 1 of the pipeline (the resource attributes plus the exemplar). If any single step's key is missing, the walk breaks at that step. r = requests.get(f"{TEMPO}/api/traces/{tid}").json() — Tempo's HTTP API is the simplest of the three, returning the raw OTLP ResourceSpans payload as JSON; the spans within carry their own service.name and per-span attributes, so the click from trace → log can re-join on service.name if trace_id somehow drops. Why the join keys are redundant on purpose: any single key can drop — a buggy logging library can omit trace_id on some lines, an OTel SDK upgrade can change the resource attribute name, a Kafka consumer can fail to propagate baggage. Carrying multiple join keys (trace_id + service.name + pod + tenant_id + request_id) means the walk degrades gracefully — even if trace_id is missing on the log line, you can still filter by service.name and pod and approximate-join by timestamp ± 100ms to recover the correlation. The redundancy is the safety net.
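A sketch of that degraded join, assuming the log lines are already filtered to the right service and pod, and carry a ts_ns timestamp field (a hypothetical name, used here for illustration):

def approx_join(span_start_ns: int, span_end_ns: int, log_lines, slack_ms=100):
    """Fallback when trace_id is missing from the log line: keep lines whose
    timestamp falls inside the span's window, widened by +/- slack_ms."""
    lo = span_start_ns - slack_ms * 1_000_000
    hi = span_end_ns + slack_ms * 1_000_000
    return [line for line in log_lines if lo <= line["ts_ns"] <= hi]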
A second subtlety worth its own line: why the harness uses both trace_id and span_id even though trace_id alone is enough for the trace-level join. span_id lets the log line link to a specific span within the trace, not just the trace as a whole. A trace with 80 spans (typical for a Hotstar IPL request crossing 80 microservices) generates dozens of log lines per service; without span_id, you cannot tell which log line belongs to the 4.8-second inventory-check span versus the 200-microsecond auth-check span. The span_id is the difference between "logs from this request" and "logs from the failing piece of this request" — and the failing piece is what you actually need at 21:08 IST.
A practical note on operational discipline: the correlation walk works only when every service on the call path has been instrumented to the same contract. A single service that emits spans without trace_id propagation creates a gap in the trace tree — the parent span ends, then a black hole, then a child span starts (often in a different service) with no link back. Tempo represents the gap as a missing-parent reference, which Grafana's trace view renders as a disconnected fragment. The on-call engineer sees the gap and now has to guess which service produced the gap, then fix that service's instrumentation, then wait for the next incident to confirm the fix. The whole loop is on the order of weeks. The right policy is to fail-deploy any service that does not pass an instrumentation contract test in CI — a synthetic request that exercises the service's main endpoints and asserts that every emitted span has the required resource attributes and that downstream HTTP/Kafka calls carry the propagation headers. Razorpay's platform team calls this otel-contract-test and runs it in every service's CI pipeline; Hotstar's equivalent is trace-fitness (per their 2025 SREcon Mumbai talk). Both block merges if the contract is violated.
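A minimal sketch of such a contract test, in the spirit of the hypothetical otel-contract-test — the endpoint, the Tempo URL, and the required-attribute set below are assumptions of this chapter's harness, not a published tool:

# otel_contract_test.py — run in CI against a deployed test instance
import time
import requests

REQUIRED = {"service.name", "pod", "tenant_id"}

def test_spans_carry_contracted_resource_attributes():
    # fire a synthetic request that exercises the main endpoint
    r = requests.get("http://localhost:8080/order/contract-test")
    assert r.ok and r.json().get("trace_id"), "endpoint must echo its trace_id"
    tid = r.json()["trace_id"]
    time.sleep(2)  # let the batch span processor flush and Tempo ingest
    tr = requests.get(f"http://localhost:3200/api/traces/{tid}").json()
    for batch in tr["batches"]:
        attrs = {kv["key"] for kv in batch["resource"]["attributes"]}
        missing = REQUIRED - attrs
        assert not missing, f"resource attributes missing: {missing}"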
When drill-down breaks — the four common failures
The walk above is what success looks like. In production, drill-down breaks in four characteristic ways, each diagnosable by which step of the click sequence stalls.
Failure 1: panel link is missing. The tier-1 dashboard panel does not have a links field set in Grafana JSON. Clicking the panel does nothing, or opens the same panel in a new tab. The on-call engineer falls back to copy-pasting query expressions into Grafana's Explore view — slow, error-prone, and a break in the recognition pattern. The fix: every tier-1 panel must have at least one links entry pointing to the corresponding tier-2 dashboard with the panel's query parameters propagated — the CI policy from the previous chapter (alerts-without-panels) extends to panels-without-links, enforced at PR time. Diagnostic: open the dashboard JSON, grep for "links": [], and count the empty arrays — every empty array is a click-path dead end.
Failure 2: link target dashboard does not pre-filter. Clicking the worst-tenant panel navigates to a dashboard that shows all tenants instead of pre-filtering by the clicked tenant. The on-call engineer has to manually select the tenant from a dropdown — extra cognitive load, and the click-down flow breaks. The fix: link URLs must include var-tenant=${__field.labels.tenant_id} (Grafana's data-link template variable syntax) so the destination dashboard inherits the clicked dimension. This is one line of JSON per panel, but it has to be added on every panel that surfaces a high-cardinality label.
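In a dashboards-as-code generator this is one dict per panel; a sketch (the dashboard path is hypothetical; ${__field.labels.tenant_id} and ${__url_time_range} are Grafana's data-link variables):

def add_tenant_drilldown(panel: dict) -> dict:
    panel.setdefault("links", []).append({
        "title": "Drill down to tenant",
        "url": ("/d/order-orch-tenant-detail"
                "?var-tenant=${__field.labels.tenant_id}"  # inherit clicked dimension
                "&${__url_time_range}"),                   # inherit time window
    })
    return panel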
Failure 3: exemplar is missing on the metric. The tier-1 panel is a histogram showing a p99 spike. Clicking the spike point gives no trace_id because the metric was not configured to emit exemplars. The on-call engineer falls back to "find a trace from this service in this time window with high latency" — a slow query against Tempo, often returning thousands of candidate traces. The fix: every histogram on a tier-1 dashboard must emit exemplars; in prometheus-client, this is the exemplar={"trace_id": ...} argument to .observe(), exposed via the OpenMetrics format (the plain text exposition drops exemplars). The CPU and storage cost is negligible — a 50-bucket histogram with one exemplar per bucket per scrape adds ~3KB per series per hour.
Failure 4: trace_id on logs but not propagated through async boundaries. The synchronous request path correctly carries trace_id through HTTP headers, but a Kafka consumer reading the same request's downstream events does not — because the producer didn't put trace_id in the Kafka headers, or the consumer didn't read them. The result: logs from the synchronous path are joinable to the trace; logs from the async path are not. The fix: OTel's inject and extract API on the Kafka producer/consumer pair, called explicitly on every message. This is the most commonly-skipped instrumentation step and the one that produces the most "we have a trace gap from second 0.18 to second 4.7" post-mortems.
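A sketch of the explicit inject/extract pair, assuming kafka-python — the topic name and payload are hypothetical:

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def produce_order_event(producer, payload: bytes):
    carrier = {}
    inject(carrier)  # serialises the current span context as traceparent
    producer.send("order-events", value=payload,
                  headers=[(k, v.encode()) for k, v in carrier.items()])

def consume_order_event(record):
    carrier = {k: v.decode() for k, v in (record.headers or [])}
    parent = extract(carrier)  # rebuild the producer's context
    with tracer.start_as_current_span("process order-event", context=parent):
        ...  # spans and logs here join back to the producer's trace_id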
A real walk — the FlavrDash incident, click by click
The opening paragraph compressed Karan's 3m12s walk into a single narrative. Here is what the same walk looks like as a sequence of explicit clicks, with the URL parameters Grafana rewrites on each step. This level of detail is what the on-call runbook should encode — not "investigate the burn-rate" but the actual click path.
The walk has six steps; each step ends in a state that determines the next click. State at step end is what makes the walk debuggable — if the engineer gets stuck at step N, they know exactly which join failed (the panel-link, the variable-substitution, the exemplar, the trace-attribute, the log-trace-id, or the runbook-link). The post-incident review can then point at the missing wiring and fix it for next time. Walks that are not enumerated step-by-step degrade into "Karan figured it out" stories that nobody can replicate or improve on.
Click 1 (T+0:00). PagerDuty fires OrderOrchestratorErrorBudgetBurning{service="order-orchestrator",window="1h"}. The page annotation includes a deep-link to https://grafana.flavrdash/d/order-orch-overview?viewPanel=2&from=now-1h&to=now. Karan opens the link on his laptop — Grafana lands him directly on the burn-rate panel of the tier-1 dashboard, with the 1h window pre-set.
Click 2 (T+0:34). Burn-rate panel shows 18.2. The worst-tenant panel below it shows topk(1, error_rate_by_tenant) at 41% for tenant=hyd-cloud-kitchens. Karan clicks the worst-tenant panel cell. Grafana navigates to https://grafana.flavrdash/d/order-orch-tenant-detail?var-tenant=hyd-cloud-kitchens&from=now-1h&to=now — the per-tenant tier-3 dashboard, pre-filtered.
Click 3 (T+1:08). The per-tenant dashboard's "errors by Kafka partition" panel shows partition 14 with 38 of 39 errors. Karan clicks the partition-14 row. Grafana navigates to a TraceQL search: { resource.service.name="order-orchestrator" && resource.tenant_id="hyd-cloud-kitchens" && span.messaging.kafka.partition=14 && status=error && duration > 2s }. Five matching trace_ids return.
Click 4 (T+2:01). Karan clicks the trace_id a91c4f8b...e2. The Tempo span tree opens. The root span (place_order) is 4.83s; one child span (inventory-check) is 4.78s, marked ERROR, with attribute db.statement="SELECT ... FROM inventory ..." and peer.service="postgres-replica-2".
Click 5 (T+2:48). Karan clicks the "logs for this span" link in the Tempo span detail. Grafana opens Loki Explore with query {service="postgres-replica-2"} | json | trace_id="a91c4f8b..." — and the log lines show LOG: still vacuuming relation "public.inventory" (60% complete). Karan now knows the cause: the pg_cron VACUUM FULL change shipped at 21:03 IST is still running and holding an AccessExclusiveLock on the inventory table.
Click 6 (T+3:12). Karan rolls back the pg_cron change via kubectl rollout undo deployment/pg-cron-runner. The burn rate starts dropping at T+7m as in-flight retries succeed. PagerDuty auto-resolves the page at T+12m. The whole sequence — page → burn-rate → worst-tenant → partition → trace → span → log → root cause — fit in 6 clicks because every panel had a links field, every metric had exemplars, every trace had resource attributes, and every log had trace_id. None of that wiring was free; it was paid for over the previous 18 months by the platform team.
The post-incident review identified the eight wiring decisions that made the walk possible, each of which was added at a specific point in the platform's history:
- Tier-1 dashboard with a links field on every panel — added during the Q4 2023 dashboard hierarchy migration, after a 38-minute outage where the on-call had to manually search for the per-tenant dashboard.
- Worst-tenant topk(1) panel on tier-1 — added during the Q1 2024 multi-tenant SLO project, after a customer-success escalation about a tenant-specific outage that the aggregate dashboard missed.
- tenant_id as a span resource attribute (not a metric label) — added during the Q2 2024 cardinality budget review, after the metric tenant_id label exploded the series count past the Prometheus limit.
- Histogram exemplars on every tier-1 latency metric — added during the Q3 2024 OTel SDK upgrade, when the prometheus_client library gained native exemplar support.
- trace_id propagation through Kafka headers — added during the Q4 2024 async-tracing project, after a 6-hour debugging session where the order-orchestrator → fulfillment-service trace had a 4-second gap.
- Loki derivedFields for trace_id extraction — added during the Q1 2025 Grafana Cloud onboarding, alongside the Tempo tracesToLogs configuration.
- Pyroscope tracesToProfiles link from the Tempo span detail — added during the Q3 2025 continuous-profiling rollout, to close the loop from "this span is slow" to "this line of code is hot".
- Runbook deep-link in the PagerDuty alert annotation — added during the Q4 2025 alert-hygiene sweep, after a junior on-call spent 4 minutes finding the right dashboard URL during their first incident.
Each of those decisions cost between half a day and two weeks of platform-team work. The cumulative cost — about three engineer-months over 18 months — paid off in a single 12-minute incident response. Compare this with a counterfactual: if any one of the eight wirings had been missing, the walk would have stalled at that step, the engineer would have fallen back to manual archeology, and the resolution time would have been 30-90 minutes instead of 12. The wiring is not aesthetic; it is the difference between an SLO breach and an SLO save.
Common confusions
- "Drill-down is the same as filtering." Filtering narrows what is visible on the same panel; drill-down narrows the scope of the underlying query by appending a
WHEREclause and often navigating to a different dashboard. Filtering is "show me only errors"; drill-down is "show me the pod where the errors are happening, with a different set of panels appropriate to per-pod investigation". Conflating the two leads to dashboards where the only "drill-down" is a filter dropdown — which works for one-axis narrowing but fails when the next level needs different metrics entirely. - "Correlation is the same as having all telemetry in one place." Putting metrics, logs, and traces in one tool (Datadog, New Relic) does not produce correlation; it produces co-location. Correlation requires shared join keys —
trace_id,service.name,pod— that are authored consistently across pillar emitters. A team using Datadog withouttrace_idpropagation has the same correlation problem as a team using Prometheus + Tempo + Loki withouttrace_idpropagation; the bottleneck is instrumentation, not vendor. - "Exemplars are a vendor feature." Exemplars are an OpenMetrics specification feature, supported by Prometheus 2.43+, native histograms, OTel's metrics SDK, and most modern metric backends (Mimir, Cortex, Thanos, VictoriaMetrics). They are vendor-portable. The only thing that varies is the dashboard UI's deep-link behaviour — Grafana, Datadog, and Honeycomb all support exemplar click-through.
- "More join keys are always better." Each additional join key costs label cardinality (on metrics) and span attribute storage (on traces) and log field volume (on logs). The right join-key set is small and load-bearing:
trace_id,service.name,pod,tenant_id. Addingrequest_idis useful when you need within-trace per-request joining; addinguser_idis usually a privacy/cardinality mistake unless the user count is bounded and the requirement is auditable. Pick the keys deliberately; document them in a team style guide. - "Drill-down works as long as the data exists somewhere." Drill-down works only when the click path exists — panel link, query parameter propagation, destination panel, key join. Each click is a contract; if any step is unwired, the click stalls. The presence of data in the backend is necessary but not sufficient; the navigation has to be authored.
- "You can retrofit drill-down after the first incident." Retrofit is possible but expensive: every service has to be redeployed with new instrumentation, every dashboard updated with new links, every alert rule updated to reference new metrics. The retrofit happens under incident-response pressure, which is when engineering bandwidth is lowest. Up-front authoring is ~10x cheaper than retrofit, in our (and Razorpay's, and Hotstar's) experience.
Going deeper
The Tempo-Mimir-Loki correlation contract — what Grafana's "Explore" actually does
The Grafana Explore view's cross-pillar correlation is implemented by a set of data-link configurations on each datasource: the Mimir/Prometheus datasource has an exemplarTraceIdDestinations field listing which Tempo datasources to link exemplar trace_id values to; the Tempo datasource has tracesToLogs and tracesToMetrics fields listing which Loki and Mimir datasources to link span service.name and trace_id values to; the Loki datasource has a derivedFields field that extracts trace_id from log lines (via regex) and turns them into Tempo trace links. The configuration is a directed graph between datasources, with each edge labelled by which keys to propagate. When this graph is fully connected, Explore-mode correlation works without further effort. When edges are missing, the click silently fails — no error, just nothing happens. The diagnostic ladder for "my Explore correlation is broken" is to check the datasource provisioning YAML for exemplarTraceIdDestinations, tracesToLogs, and derivedFields — most production breaks live in one of those three.
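A minimal provisioning sketch of those three edges — the datasource names and UIDs are hypothetical; the jsonData field names are Grafana's:

# datasources.yaml — the correlation graph as provisioning config
datasources:
  - name: Mimir
    type: prometheus
    uid: mimir
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo      # exemplar click-through → Tempo
  - name: Tempo
    type: tempo
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki         # span → log lines, joined on trace_id
        filterByTraceID: true
  - name: Loki
    type: loki
    uid: loki
    jsonData:
      derivedFields:
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'
          url: '${__value.raw}'     # log line → Tempo trace deep-link
          datasourceUid: tempo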
Cross-cluster correlation — the join when one trace spans two regions
A request that crosses an availability-zone or region boundary (Mumbai user → ap-south-1 frontend → ap-south-1 backend → us-east-1 fraud-check service) produces a single logical trace_id but two metric backends (one Mimir per region), two log backends, and two trace backends. The correlation walk has to know which backend to query for each span — and the canonical pattern is to add a region resource attribute on every span, then have Grafana's Tempo datasource use the region value to pick the right Mimir/Loki datasource via Grafana's variable-driven datasource feature. This adds one click of latency to the correlation walk (the datasource picker) but keeps cross-region correlation possible without merging all telemetry into a single backend. PhonePe's NPCI-aware tracing pipeline (per their 2024 SREcon Bengaluru talk) uses exactly this pattern — region=npci-bbpou-1 versus region=npci-bbpou-2 as the discriminator.
The "drill-up" inverse — when you need to broaden, not narrow
Drill-down narrows scope; sometimes you need the inverse — drill-up — to ask "is this just the Hyderabad tenant or is something broader breaking?". The mechanical pattern is the same as drill-down but in reverse: start at the narrowest scope (one trace, one log line, one pod), strip the most-specific filter, and re-aggregate at the broader scope. Grafana does not have a native drill-up button; the workaround is a dashboard variable that is unset by default and gets set by drill-down clicks but can be unset by clicking a "clear filters" button. The conceptual model is the same: drill-down = WHERE clauses appended, drill-up = WHERE clauses removed. Both require the same dimensional architecture.
The cost model of high-cardinality drill-down
Drill-down to per-tenant or per-pod requires those labels to exist on the metric. The cost of carrying tenant_id as a metric label scales with the number of distinct tenants; for a 200-tenant SaaS with 50 metrics, each tenant adds 50 series, so the total is 10,000 series — manageable. For a 50,000-merchant marketplace (Flipkart, Meesho), merchant_id as a metric label produces 50,000 series per metric — 2.5M series across the same 50 metrics — which is unaffordable. The fix: bound the cardinality at instrumentation time — emit tenant_id only on metrics that are actually drilled down per-tenant (typically 5-10 tier-1 metrics), and carry it as an exemplar (which is sparse — at most one per bucket per scrape) on histograms. The exemplar pattern keeps tenant_id queryable for correlation walks without paying the per-tenant cardinality cost on every metric. This is the pattern that makes per-tenant drill-down affordable for fintech and large e-commerce.
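The arithmetic, as a two-line check (numbers from the scenarios above):

def label_series_cost(label_values: int, metrics: int) -> int:
    return label_values * metrics  # every label value multiplies every metric

label_series_cost(200, 50)     # 10_000 series — manageable
label_series_cost(50_000, 50)  # 2_500_000 series — unaffordable as a plain label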
Profiles as a fourth pillar — Pyroscope correlation by span_name
Continuous profiling (Pyroscope, Parca) is the fourth pillar that's increasingly joined into the correlation walk. The join key from a trace span to a flamegraph is service.name + span_name (or a custom pyroscope.tag_wrapper({"span_name": ...}) block in Python). Clicking a slow span in Tempo can deep-link to "the flamegraph filtered to this service and this span_name during this time window" — which often surfaces the offending function call (a hot lock, a misplaced json.loads in the request loop, an os.path.exists on every request) within seconds. Grafana's Tempo datasource supports this join via a tracesToProfiles data-link. As of 2026 most teams have not wired this — they treat profiles as a separate "I'll go look at flamegraphs when I have time" tool, when in fact profiles are the fastest way to go from "this span is slow" to "this line of code is slow". The teams that have wired profile-correlation report MTTR drops on the order of 30-50% for code-path-bound (as opposed to network-bound) latency issues.
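A sketch of the profile-side tagging, assuming the pyroscope-io Python client — the server address and the tagged region follow this chapter's harness:

import pyroscope

pyroscope.configure(application_name="order-orchestrator",
                    server_address="http://localhost:4040")

def inventory_check():
    # samples collected inside this block carry span_name as a profile tag,
    # so a Tempo span can deep-link to exactly this slice of the flamegraph
    with pyroscope.tag_wrapper({"span_name": "inventory-check"}):
        ...  # the hot path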
Where this leads next
The correlation walk this chapter describes is the foundation for /wiki/dashboard-as-code-grafana-json-terraform — once the click-paths exist, you encode them as version-controlled Grafana JSON / Terraform so they survive across team changes and dashboard rebuilds. The wall-correlation chapter (/wiki/wall-tying-pillars-together-needs-correlation) closes Part 12 by treating correlation as a cultural discipline — the engineering decisions that make it possible (instrumentation contract, naming convention, deployment policy) rather than the mechanical wiring this chapter has covered.
Within Part 13 (OpenTelemetry internals), the chapter on resource attributes and baggage covers the canonical names for the join keys (service.name, service.instance.id, host.name) and the propagation rules that make them survive across language and framework boundaries. The chapter on OTLP semantic conventions covers the agreed-upon attribute names (http.method, db.statement, messaging.kafka.partition) that make cross-team correlation possible without negotiation.
Cross-curriculum, this chapter cross-links to the data-engineering material on lineage and observability (/wiki/observability-for-data-pipelines-not-just-services) — the same drill-down + correlation discipline applies to batch and streaming pipelines, with the join keys being job_id, task_id, partition, and dag_run_id instead of trace_id and span_id. The architecture is the same; the keys differ.
# Reproduce this on your laptop
docker compose up -d # prom + tempo + loki + grafana
python3 -m venv .venv && source .venv/bin/activate
pip install prometheus-client opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-otlp opentelemetry-instrumentation-flask \
flask requests loguru
python3 correlation_walk.py & # the Flask app emitting metric+trace+log
# generate some load
for tenant in default hyd-cloud-kitchens default default; do
curl http://localhost:8080/order/$tenant; echo
done
python3 walk.py # the correlation walker that prints joined view
References
- Prometheus exemplars — official spec for histogram exemplars and the trace_id join.
- OpenTelemetry — semantic conventions for resource attributes — canonical names for service.name, host.name, pod.name.
- Grafana Tempo — TraceQL and data links — the trace-side join API.
- Charity Majors et al — Observability Engineering, Chapter 6 — high-cardinality correlation as the foundation of "why is this happening".
- Cindy Sridharan — Distributed Systems Observability, Chapter 4 — the correlation walk as a fundamental observability operation.
- Razorpay engineering — propagating trace context through Kafka — pattern for async-boundary correlation, the most-skipped instrumentation step.
- /wiki/exemplars-linking-metrics-to-traces — internal: deeper coverage of the exemplar mechanism this chapter assumes.
- /wiki/dashboard-anti-patterns — internal: the previous chapter on the structural failures drill-down compensates for.