Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Metric-to-log drill-down
It is 09:14 IST. Riya, an SRE at a hypothetical Bengaluru-based discount-broker we will call TradeKite, is watching the market-open dashboard. The orders-per-second line is climbing the way it does every weekday at 09:15 — a clean ramp from 12k to 38k RPS over forty seconds. At 09:14:47 the error_rate{service="order-router", code="5xx"} panel jumps from 0.04% to 2.1%. Not catastrophic. Recoverable. But two thousand orders a second are failing and her error-budget for the quarter just took a 6% hit in fourteen seconds. She needs to know which error. Right now. The pre-open queue is going to drain into the matching engine in another forty seconds and if the bug is in the path she has not identified yet, the post-open spike will be ten times worse.
She right-clicks the spike. The Grafana context menu offers Explore, View in Logs, Drill down → order-router logs. She clicks the third option. A new tab opens — Loki, pre-filtered to {service="order-router"} | json | level="ERROR" and time-bounded to the 60-second window centred on the spike. There are 1,247 log lines. The top three repeat the same string: redis lock contention on key=symbol_book:RELIANCE timeout_ms=4800. Riya pages the cache team's on-call. Time from spike to actionable diagnosis: 22 seconds. The post-open spike does not happen because the cache team scales the redis cluster before market-open finishes ramping. The drill-down click was the entire investigation.
Metric-to-log drill-down is a Grafana data-link configured on a metric panel that, on click, opens a pre-filtered Loki query bounded to the same time window and the same labels as the metric point. The mechanism is small (one panel-link with templated variables) but the wiring spans Prometheus label conventions, Loki stream-label parity, the Grafana data-link template grammar, and the human discipline of not clicking through panels with eight unrelated services in one query. Done right, time-to-diagnosis drops from minutes of tab-switching to one click.
Why a metric spike alone tells you nothing
A metric is a numerical aggregate over a time-window — error_rate{service="order-router"} = 0.021 at 09:14:47. The aggregate is the count of errors divided by the count of requests, both summed across every replica of order-router over the last 60 seconds. The numerator might be 2,100 and the denominator 100,000; or 47 and 2,200; or any pair that yields the same ratio. The metric does not carry the constituents. It cannot. A metric is a four-tuple of (timestamp, label-set, value, optional histogram bucket) — the entire reason TSDBs scale to billions of series is that they discard the original event stream and keep only the rolled-up shape.
A log line is the opposite shape. A single log carries the full content of one event — the merchant ID, the order symbol, the redis key that failed, the stack trace, the customer's session token. Loki indexes a small set of stream labels (service, level, env) and stores the rest as searchable content. There is no aggregation, no rollup, no quantile interpolation. To answer "what happened at 09:14:47", you need the logs, not the metric.
The drill-down is the navigation that bridges the two stores. A panel renders the metric. A click on the panel resolves to a query against the log store, constrained to the same time window and the same labels as the metric point that was clicked. The constraint is what makes the result useful — without it, "show me logs for order-router" returns 38 million lines from the last hour. With the time bound (09:14:17 → 09:15:17) and the label bound (service="order-router"), the result is the 1,247 lines Riya read. The constraint is the entire value of the drill-down; everything else is just URL encoding.
Why a click rather than a permanent dashboard panel showing both metrics and logs side by side: a dashboard panel that renders 1,247 log lines for every visible metric point is a 38,000-line scrollback even on a quiet minute; on a busy minute it is so much HTML that the browser drops frames. The drill-down click is the deferred equivalent — the panel renders the metric (cheap, indexed, pre-aggregated), the logs are fetched only when the SRE asks for them. This is the same lazy-evaluation discipline that exemplars use to attach trace IDs to histogram buckets without bloating the metric storage. The query expense is paid only on the rare path where it matters.
The deferred model also keeps the storage cost asymmetric in the right direction. Metrics are cheap (~1.3 bytes per sample after Gorilla XOR encoding); logs are expensive (~150–800 bytes per line, even after gzip). A dashboard that renders metrics as a graph reads megabytes of TSDB; one that also renders logs reads gigabytes. Deferring the log read until the click means the average dashboard view costs 1000× less than the worst case, while the rare incident path still has the full investigative depth available. This asymmetry — cheap aggregates on the home page, expensive details one click away — is the core design principle of every observability UI worth using; the metric-to-log drill-down is its most concrete instance.
The six wires that make the click work
The drill-down click looks magical from the SRE's point of view but it is six independent wires, all of which must be configured correctly. Drop any wire and the click resolves to "no data" or to the wrong logs entirely — silently, with no error.
Wire 1 — the metric must carry the labels you intend to drill on. If your panel renders error_rate{service="order-router"} but your logs are tagged app="order-router", the templated URL substitutes service into the LogQL filter, which never matches anything. The fix is a label-naming convention enforced across the platform. The OpenTelemetry resource-attribute convention says service.name for the metric (which Prometheus exposes as service_name after sanitisation), and the same convention propagates to the log shipper's stream labels. The convention costs once, in the platform team's logging library, and pays back every time an SRE drills down.
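Wire 1 is cheap to enforce mechanically. A minimal contract-test sketch — DRILL_LABELS, the helper names, and the sample log line are hypothetical; the sanitisation rule is the dots-to-underscores step applied when OTel attributes are exposed as Prometheus labels:

# wire1_contract_test.py — a minimal sketch of a Wire-1 parity check.
# DRILL_LABELS and the sample log line are hypothetical examples.
import json, re

DRILL_LABELS = {"service", "env"}          # the labels drill-downs filter on

def sanitise(otel_attr: str) -> str:
    """OTel resource attribute -> Prometheus label (service.name -> service_name)."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", otel_attr)

def check_parity(metric_labels: set, log_line: str) -> None:
    log_fields = set(json.loads(log_line))
    missing = DRILL_LABELS - (metric_labels & log_fields)
    assert not missing, f"drill-down labels missing on one side: {missing}"

check_parity(
    metric_labels={"service", "env", "status"},
    log_line='{"service": "order-router", "env": "production", "event": "x"}',
)
assert sanitise("service.name") == "service_name"
print("wire 1 OK: metric labels and log fields agree")

Run it in the logging library's test suite and the convention stops being a wiki page and becomes a failing build.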
Wire 2 — the time window must be bounded. Without a time bound, the drill-down opens a Loki query with the full retention window — 30 days, 100 billion log lines. Loki rejects the query, or accepts it and returns after eight minutes. The Grafana data-link template provides ${__from} and ${__to} variables that resolve to the panel's currently visible time range in epoch milliseconds, but a more useful default is a tight window centred on the clicked point: ${__value.time} (the clicked sample's timestamp) minus 30 seconds, plus 30 seconds. The 60-second window is enough to capture the constituents of one metric point at a 60-second resolution, and small enough that Loki returns in milliseconds.
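The window arithmetic is worth pinning down as code, since the PayWeave story later in this article turns on getting it wrong. A minimal sketch — the helper name is mine; Grafana expresses the same arithmetic inside the data-link template:

# drill_window.py — compute the drill-down time bounds from a click time.
# Hypothetical helper: the same computation the data-link template performs.
def drill_window(click_epoch_ms: int, scrape_interval_s: int = 60):
    """Window = click time +/- half the scrape interval, in epoch millis."""
    half = scrape_interval_s * 1000 // 2
    return click_epoch_ms - half, click_epoch_ms + half

# matches the range in the demo's sample URL further down this article
frm, to = drill_window(1714052117000, scrape_interval_s=60)
assert to - frm == 60_000          # exactly one metric point's worth of events
print(frm, to)                     # 1714052087000 1714052147000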
Wire 3 — the query must be syntactically valid LogQL. The data-link template is a string that gets URL-encoded and pasted into the Loki query parameter. Quotes need to be escaped (\"), curly braces need to be balanced, and the JSON-parse stage | json is required before filtering on parsed fields. A common bug is a template that produces {service=$service} instead of {service="$service"} — Loki errors with "expected matcher", and the click lands on an error banner in Explore rather than on log results, with nothing upstream flagging the broken template. The fix is to test the template by manually clicking in dev once, with a known service.
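Wire 3's two template bugs are cheap to catch before they reach a dashboard. A minimal rendering sketch (helper names are mine) that refuses to emit an unquoted or empty matcher:

# logql_selector.py — render a stream selector, failing loudly on the two
# Wire-3 template bugs: missing quotes and empty values.
def quote(v: str) -> str:
    return '"' + v.replace('"', '\\"') + '"'     # LogQL needs escaped quotes

def selector(**labels) -> str:
    for k, v in labels.items():
        if not v:
            raise ValueError(f"label {k!r} resolved empty — check the template")
    return "{" + ",".join(f"{k}={quote(v)}" for k, v in labels.items()) + "}"

assert selector(service="order-router") == '{service="order-router"}'
try:
    selector(service="")        # what a legend-click mis-resolution produces
except ValueError as e:
    print("caught:", e)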
Wire 4 — the query must be filtered, not just opened. Opening Explore with the right data source is not enough; the query must include the level filter (| level="ERROR") and any other filters that match the metric. A panel showing error_rate should drill to ERROR-level logs, not to all logs of the service. A panel showing latency_p99 > 1s should drill to logs where the request took longer than 1s — which means logs need a latency_ms field and the LogQL filter is | latency_ms > 1000. The drill-down is only as precise as the filter; an unfiltered drill-down to a noisy service is barely better than no drill-down.
Wire 5 — the Loki side must be ingesting the labels you filter on. If service is a Prometheus label but the log shipper is configured to drop or rename it on ingestion, the LogQL filter {service="order-router"} matches nothing. This is the same shipper-renaming pitfall described in /wiki/log-to-trace-correlation-trace-ids-in-logs — a Promtail relabel rule, a Vector transform, or a Fluent Bit parser can all drop or rename a label silently. The diagnostic is to query Loki directly without the drill-down — {service="order-router"} from the LogQL CLI — and confirm that lines come back. If they do not, the shipper is the problem; if they do, the data-link is the problem.
Wire 6 — the data source UID must resolve. The Grafana data link names a data-source UID (uid: loki-uid) and Grafana resolves the UID at click-time to the Loki HTTP URL. If the UID was changed during a Grafana re-provisioning (a common silent break — the YAML provisioning file regenerated UIDs), the click resolves to "data source not found" and the SRE sees a blank panel. The fix is to pin the UID in both the Loki provisioning YAML and the dashboard data-link YAML — the same alignment discipline as the trace-correlation wiring.
The six wires are independent, but they fail in characteristic combinations. Wires 1 and 5 (label-naming and shipper-preservation) usually fail together — a team that did not enforce the convention end-to-end has both gaps. Wires 2 and 6 (time-window and UID) usually fail in isolation — they are configuration mistakes, not architecture mistakes, and one team's CI lint can catch them without coordinating across services. Wires 3 and 4 (LogQL syntax and filter precision) are the everyday breakage — a refactor of the dashboard's data-link template introduces a subtle bug that the smoke test should catch but the team has not built the smoke test yet. The mitigation is to think of the six wires as a checklist with named sub-owners: the platform team owns Wires 1, 5, 6; the dashboard author owns Wires 2, 3, 4. Splitting the ownership stops the "everyone thinks someone else is testing this" failure mode that otherwise dominates.
# drill_down_demo.py — Flask service emitting metrics + logs with the labels
# Grafana drill-down expects, plus a script that builds the data-link URL.
# pip install flask prometheus-client structlog requests
import logging, random, time, json, urllib.parse, sys
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import structlog
# 1. structured-logging config — every log line carries service, level, env,
# and the time as a sortable ISO timestamp. These are the same labels the
# metric carries, which is what makes the drill-down work.
SERVICE = "order-router"
ENV = "production"
structlog.configure(
processors=[
structlog.processors.add_log_level,
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.JSONRenderer(),
],
wrapper_class=structlog.stdlib.BoundLogger,
logger_factory=structlog.stdlib.LoggerFactory(),
)
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format="%(message)s")
log = structlog.get_logger().bind(service=SERVICE, env=ENV)
# 2. the metric — same `service` label that Loki streams will carry
ORDERS = Counter(
"orders_total", "Total orders processed",
["service", "env", "status"],
)
ORDER_LATENCY = Histogram(
"order_latency_seconds", "Order processing latency",
["service", "env"],
buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
)
app = Flask(__name__)
@app.route("/order/<symbol>")
def order(symbol):
started = time.perf_counter()
# simulate a redis hot-key on RELIANCE during market open
if symbol == "RELIANCE" and random.random() < 0.4:
time.sleep(random.uniform(2.0, 4.8))
ORDERS.labels(service=SERVICE, env=ENV, status="5xx").inc()
log.error("redis_lock_contention",
symbol=symbol, key=f"symbol_book:{symbol}",
timeout_ms=4800, status_code=503)
ORDER_LATENCY.labels(service=SERVICE, env=ENV).observe(
time.perf_counter() - started)
return jsonify(ok=False, reason="redis_lock"), 503
time.sleep(random.uniform(0.001, 0.04))
ORDERS.labels(service=SERVICE, env=ENV, status="2xx").inc()
log.info("order_placed", symbol=symbol, qty=random.randint(1, 100))
ORDER_LATENCY.labels(service=SERVICE, env=ENV).observe(
time.perf_counter() - started)
return jsonify(ok=True)
@app.route("/metrics")
def metrics():
return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
# 3. the data-link URL builder — what Grafana resolves on click
def drill_down_url(grafana_base, loki_uid, service, env, t_epoch_ms):
payload = {
"datasource": loki_uid,
"queries": [{
"refId": "A",
"expr": f'{{service="{service}",env="{env}"}} | json | level="error"',
}],
"range": {
"from": str(t_epoch_ms - 30_000),
"to": str(t_epoch_ms + 30_000),
},
}
return (f"{grafana_base}/explore?orgId=1&left=" +
urllib.parse.quote(json.dumps(payload)))
if __name__ == "__main__" and "url" in sys.argv:
print(drill_down_url(
"https://grafana.tradekite.local", "loki-prod",
"order-router", "production",
int(time.time() * 1000)))
elif __name__ == "__main__":
app.run(port=8080)
Sample run — first hit the service to generate metrics + logs, then build the URL the panel data-link would emit:
$ python3 drill_down_demo.py &
$ for i in $(seq 1 80); do curl -s localhost:8080/order/RELIANCE > /dev/null; done
$ curl -s localhost:8080/metrics | grep '^orders_total{'
orders_total{env="production",service="order-router",status="2xx"} 47.0
orders_total{env="production",service="order-router",status="5xx"} 33.0
$ python3 drill_down_demo.py url
https://grafana.tradekite.local/explore?orgId=1&left=%7B%22datasource%22%3A%20%22loki-prod%22%2C%20%22queries%22%3A%20%5B%7B%22refId%22%3A%20%22A%22%2C%20%22expr%22%3A%20%22%7Bservice%3D%5C%22order-router%5C%22%2Cenv%3D%5C%22production%5C%22%7D%20%7C%20json%20%7C%20level%3D%5C%22error%5C%22%22%7D%5D%2C%20%22range%22%3A%20%7B%22from%22%3A%20%221714052087000%22%2C%20%22to%22%3A%20%221714052147000%22%7D%7D
The load-bearing lines:
- log = structlog.get_logger().bind(service=SERVICE, env=ENV) is the contract — every log line emitted via this logger automatically carries service and env as JSON fields, which Loki indexes as stream labels (after the Promtail/Vector pipeline maps them). The pair service + env is exactly the pair the metric carries, which is exactly the pair the drill-down URL substitutes — three independent stages agreeing on the same two field names.
- ORDERS = Counter("orders_total", ..., ["service", "env", "status"]) uses the same label names on the metric side.
- drill_down_url(...) constructs the templated URL: range.from is the click time minus 30 seconds, range.to is plus 30 seconds, and the expr is the LogQL query bounded by the same labels. The urllib.parse.quote call URL-encodes the JSON payload — Grafana's Explore URL takes its query state in the left parameter as an encoded JSON object.
- {service="order-router",env="production"} | json | level="error" is the actual LogQL — the stream-label selector {service=...,env=...} is the cheap part (it uses Loki's index); | json | level="error" is the parse-and-filter stage that runs over the chunks the index returned. Putting the cheap filter first is what makes the drill-down fast even at Hotstar scale.
Why the time window is ±30 seconds rather than ±5 minutes or ±1 minute: at a 60-second metric scrape interval, every metric point summarises 60 seconds of events, so a ±30 second window is the exact set of events that contributed to the clicked point. ±5 minutes returns 10× more lines and dilutes the signal — an SRE looking at a 09:14:47 spike does not need 09:11:00 logs. ±1 minute works but bleeds slightly into adjacent metric points if they are also interesting (often, the spike's leading edge is what you want). The number is empirical; the tradeoff is "more context vs more signal", and ±30 s sits at the inflection point. Tune it to your scrape interval and your typical investigation pattern.
A subtle corollary: if your dashboard is showing stale data (a Last 1 hour window viewed at minute 67), the ${__from} variable resolves to the dashboard's view time, not real time. Clicking on a spike that "just happened" five minutes ago will produce a Loki query bounded to that historical window — which is correct. The drill-down is click-time-aware, not real-time-aware, and the distinction matters when the SRE is investigating a now-stale incident or scrolling backwards through dashboard history. This is why keeping the time-range picker anchored to now (with auto-refresh on) matters during live triage.
Three real drill-down patterns and where they break
The vanilla "click on a metric, get the logs" is the simplest case. Production observability builds layers on top — and each layer has its own failure modes.
Pattern 1: panel-link from a singleton metric. A simple panel-link is fine when the panel shows one metric for one service — error_rate{service="order-router"} rendered as a single line. The data-link substitutes ${__field.labels.service} into the LogQL template, gets back exactly one value, and the URL resolves cleanly. This is the TradeKite case from the lead. The break: if the panel is later refactored to show multiple services on the same chart (a common dashboard evolution), the ${__field.labels.service} resolves to whichever line the user clicked on — which is correct — but if the click lands on the legend rather than the line, the resolution is empty and the URL contains {service=""}, returning zero log lines. The fix is to make the data-link template defensive: {service=~"${__field.labels.service:regex}"} with a regex matcher, so that an empty value matches nothing rather than producing a syntactically valid empty-match.
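What the defensive template looks like in the dashboard JSON model — a sketch that builds the fieldConfig.defaults.links[] entry in Python so the escaping stays readable. The variable names (${__field.labels.service:regex}, ${__from}, ${__to}) are Grafana's; the loki-prod UID is hypothetical, and the sentinel trick simply keeps the ${...} placeholders literal through URL-encoding so Grafana can interpolate them at click-time:

# datalink_fragment.py — emit a defensive fieldConfig.defaults.links[] entry.
# Sentinels keep the Grafana variables un-encoded: Grafana must see ${...}
# literally in the stored URL to interpolate it at click-time.
import json, urllib.parse

SENTINELS = {"X_SVC": "${__field.labels.service:regex}",
             "X_FROM": "${__from}", "X_TO": "${__to}"}
left = json.dumps({
    "datasource": "loki-prod",                            # hypothetical UID
    "queries": [{"refId": "A",
                 "expr": '{service=~"X_SVC"} | json | level="error"'}],
    "range": {"from": "X_FROM", "to": "X_TO"},
})
url = "/explore?orgId=1&left=" + urllib.parse.quote(left)
for token, grafana_var in SENTINELS.items():
    url = url.replace(token, grafana_var)
link = {"title": "Drill down → logs", "targetBlank": True, "url": url}
print(json.dumps({"fieldConfig": {"defaults": {"links": [link]}}}, indent=2))

With the =~ matcher and the :regex formatter in place, an empty legend-click resolution matches nothing rather than everything — the defensive behaviour Pattern 1 calls for.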
Pattern 2: dashboard-link from a templated dashboard variable. A multi-service dashboard typically has a $service template variable at the top — the user picks a service from a dropdown, and every panel re-renders. The drill-down link should respect the variable: {service="$service"} rather than the hard-coded order-router. Grafana resolves $service against the current dashboard variable selection at click-time. The break: when the variable is multi-value ($service = order-router + payments-api), the substituted value is a joined list that LogQL's exact-match = operator cannot consume. The fix is two characters: write the matcher as =~ and format the variable with ${service:regex}, producing {service=~"(order-router|payments-api)"} — the regex-match operator handles the multi-value case correctly. A surprising number of "drill-down works on single-service dashboards but not on multi-service ones" tickets resolve to this tiny fix.
Pattern 3: alert-to-log drill-down via the alert annotation. The previous two patterns fire from a panel click; the third fires from an alert notification. The alert manager builds a Slack/PagerDuty payload that includes a "view logs" link, generated by templating the alert's labels into the same LogQL URL builder. The break: alerts produced by recording rules sometimes carry fewer labels than the underlying metric — the recording rule may aggregate by service only, dropping env and region. The drill-down URL then constructs {service="order-router"} without an env filter, returning logs from production, staging, and dev simultaneously. The fix is to either (a) re-add the dropped labels to the recording rule, paying the cardinality cost; or (b) hard-code the env="production" constraint into the alert's URL builder, accepting that the alert is production-specific and the URL builder is environment-aware.
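A sketch of option (b) — an environment-aware URL builder on the alerting side. The alert label set is a hypothetical Alertmanager-style payload, and the URL shape deliberately matches drill_down_demo.py's drill_down_url:

# alert_drilldown.py — build the "view logs" URL from an alert's labels,
# hard-coding env (option b) because the recording rule dropped it.
import json, time, urllib.parse

def alert_logs_url(grafana_base, loki_uid, alert_labels, now_ms=None):
    now_ms = now_ms or int(time.time() * 1000)
    svc = alert_labels["service"]                 # the label the rule kept
    expr = f'{{service="{svc}",env="production"}} | json | level="error"'
    left = {"datasource": loki_uid,
            "queries": [{"refId": "A", "expr": expr}],
            "range": {"from": str(now_ms - 30_000), "to": str(now_ms + 30_000)}}
    return (f"{grafana_base}/explore?orgId=1&left=" +
            urllib.parse.quote(json.dumps(left)))

print(alert_logs_url("https://grafana.tradekite.local", "loki-prod",
                     {"service": "order-router", "severity": "page"}))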
Why these breaks are silent rather than loud: the data-link mechanism evaluates lazily — Grafana does not pre-validate that the templated query resolves correctly, because doing so would require Grafana to understand LogQL syntax, the cardinality of every variable, and the schema of every metric. Lazy evaluation is the correct architectural choice — it keeps Grafana general — but it pushes the validation responsibility onto the dashboard author. The mitigation is two practices: (1) a smoke test in the dashboard's pull-request review that clicks every drill-down link in dev and confirms a non-empty result; (2) a Grafana plugin or CI check that lints data-link templates against the metric labels they reference. Neither is built-in; both are engineering investments worth making once a team has more than ~20 dashboards.
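Practice (1) is small enough to automate on day one. A minimal sketch against Loki's /loki/api/v1/query_range HTTP API — the endpoint and its nanosecond start/end units are Loki's; the Loki address and the query list are hypothetical:

# drilldown_smoke_test.py — render each drill-down query and assert that
# dev Loki (where errors are routinely injected) returns at least one line.
import time, requests

LOKI = "http://localhost:3100"
QUERIES = ['{service="order-router"} | json | level="error"']

def smoke(query: str, window_s: int = 300) -> int:
    now_ns = time.time_ns()
    r = requests.get(f"{LOKI}/loki/api/v1/query_range", params={
        "query": query, "limit": 10,
        "start": now_ns - window_s * 1_000_000_000, "end": now_ns,
    }, timeout=10)
    r.raise_for_status()
    streams = r.json()["data"]["result"]
    return sum(len(s["values"]) for s in streams)

for q in QUERIES:
    n = smoke(q)
    assert n > 0, f"drill-down query returned zero lines: {q}"
    print(f"OK ({n} lines): {q}")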
When the click is wrong — three production stories
The drill-down is a tool. Like all tools, it has cases where the right answer is "do not click". Three real-shape stories from the trenches, plus a bonus one learned the slow way after a re-provisioning.
The misleading-correlation story. A hypothetical Pune-based logistics startup we will call ShipBee runs a delivery_failures_total counter that spikes every Tuesday at 14:00 IST. The on-call SRE, Aditi, drilled down into the spike for six straight Tuesdays before noticing that the matching log lines all had reason="address_unavailable" — and that the spike was caused by a batch retry job that retried failed-yesterday deliveries every Tuesday at 14:00 sharp. The drill-down was correct. The interpretation was wrong: the SRE assumed the spike was new deliveries failing, but it was old deliveries being retried. The fix was a separate metric (delivery_failures_retry_total) split from the headline counter, with its own panel and its own drill-down. The lesson: the drill-down shows you the constituents of a metric, not the cause of a metric. Causality lives in the narrative the SRE constructs from the constituents — and a metric that aggregates two semantically different events will mislead every drill-down it serves.
The cardinality-explosion story. A hypothetical Gurugram-based food-delivery service we will call ChaiSwift added a restaurant_id label to their order_failures_total metric so that the drill-down would filter logs by restaurant. The cardinality went from 12,000 series (1 service × 50 status codes × ~240 cities) to roughly 14 million — not the full 720-million cross-product against 60,000 restaurants, because a series is only created for label combinations that actually occur, but still a 1,200× blow-up. Prometheus's head_chunks count exceeded 6M and the WAL replay began taking 35 minutes on pod restart. The drill-down worked — clicking on a per-restaurant error rate did filter logs to that restaurant — but the cost was a TSDB outage every time a pod restarted. The fix was to remove the restaurant_id from the metric and instead make the drill-down filter on a parsed log field: {service="checkout"} | json | restaurant_id="$restaurant". The metric stayed low-cardinality (fast indexing); the drill-down stayed high-precision (Loki content-scans the parsed field). The lesson: drill-down does not require the filter labels to live on the metric. Stream labels for cheap filtering, content fields for precise filtering — pick the right side for each.
The wrong-window story. A hypothetical Mumbai-based payments processor we will call PayWeave runs a 5-second-resolution metric on the headline error rate during high-volume sales (Republic Day, Diwali). The default drill-down window of ±30 seconds was set when the metric was 60-second-resolution; it was never updated. After the resolution change, every drill-down click returned 12× more log lines than the metric point summarised, and the on-call started seeing logs from adjacent spikes mixed with the clicked spike. Time-to-diagnosis went up — not down — because the SRE could no longer trust that the logs corresponded to the clicked metric point. The fix was to update the time window to ±5 seconds in the dashboard's data-link config. The lesson: the drill-down's time window is a tunable parameter, not a constant; tune it to the metric's scrape interval whenever that changes.
The bonus story — cross-environment leak. A hypothetical Hyderabad-based co-working SaaS we will call DeskBaaz had a single Grafana instance shared between staging and production environments, with the data-source UID hardcoded into the dashboard's drill-down link. After a Grafana migration to a new Kubernetes cluster, the UIDs were regenerated. The hardcoded UID happened to land on the staging Loki instead of the production one. For ten days, every drill-down from the production dashboard returned staging logs — which looked plausible (same service names, same field schemas) but contained synthetic test traffic. An incident triage on day eight ran for forty minutes investigating "phantom" errors that did not match the real production error pattern, until someone noticed the trace IDs in the logs were not present in production Tempo. The fix was to stop hardcoding the UID: the link now resolves through a dashboard variable of type datasource, whose per-environment value is pinned by the same provisioning manifest that creates the data source. The lesson: every shortcut in the data-link wiring is a future cross-environment leak waiting for a re-provisioning event to surface it.
Common confusions
- "Metric-to-log drill-down and exemplars are the same thing." They are siblings, not the same. Exemplars (covered in
/wiki/exemplars-metrics-traces) attach a single trace ID to a single histogram bucket observation, letting you jump from a histogram bar to one specific trace. Metric-to-log drill-down opens all logs that match the metric's labels and time window — typically hundreds or thousands of lines. Exemplars are precise (one row, one trace); drill-down is broad (the constituents of an aggregate). You want both: exemplars for "show me the slowest request that contributed to this histogram", drill-down for "show me everything that contributed to this spike". - "The drill-down link uses the same query as the metric." It does not. The metric is a Prometheus query (PromQL); the drill-down is a Loki query (LogQL). They share labels and a time window — that is the entire correlation primitive — but the queries are syntactically and semantically different stores. The data-link template's job is to translate the labels-and-time-window into a syntactically valid LogQL query string, and it is responsible for the syntax conversion (
{service="x"}for both, but| json | level="error"is LogQL-only). - "If I add a label to the metric, the drill-down picks it up automatically." Only if the data-link template references it. The template is a string with explicit
${__field.labels.X}substitutions; ifXis a new label, you must add it to the template manually. Forgetting this is why "I added aregionlabel to my metric and now the drill-down does not filter by region" — the metric carries the label, the URL template does not interpolate it, the LogQL query never gains the constraint. - "Drill-down replaces the alert link in PagerDuty." It does not. Alerts include a drill-down URL as one of multiple links — the runbook, the dashboard, the trace explorer, the recent deploys for this service. The drill-down is one tool in the on-call's hand; pretending it replaces the runbook destroys the diagnostic ladder. (The runbook says "if redis lock contention, check whether the cache cluster is at >70% memory" — the drill-down found the contention; the runbook says what to do about it.)
- "Drill-down works the same on histograms as on counters." It does not. A histogram bucket is
latency_p99{le="0.5"}— when you click on it, the click-time labels includele="0.5", which is not a label on the underlying logs. The data-link template must strip thelelabel before substituting into LogQL:{service="${__field.labels.service}"}not{service="${__field.labels.service}",le="${__field.labels.le}"}. The strip is one of those one-line fixes everyone learns the third time their histogram drill-down returns no data. - "The 60-second window is fine for everything." It is fine for 60-second-resolution metrics. For high-resolution metrics (15s scrape, 5s scrape) the window should shrink to ±15s or ±10s, otherwise you fetch logs from neighbouring metric points and dilute the signal. For low-resolution metrics (5-minute downsampled long-term storage), the window should grow to ±2.5 minutes. The rule is "match the window to the scrape resolution"; the default 60s is a starting point.
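Both the le-strip and the window-matching rule fit in a dozen lines. A sketch of the label-cleaning step applied before templating — the helper names are mine:

# drill_labels.py — strip histogram-only labels before building the LogQL
# selector, and derive the window from the scrape interval.
METRIC_ONLY = {"le", "quantile"}       # labels that never exist on logs

def drillable(labels: dict) -> dict:
    return {k: v for k, v in labels.items() if k not in METRIC_ONLY}

def window_ms(scrape_interval_s: int) -> int:
    return scrape_interval_s * 1000 // 2   # +/- half = one point's constituents

clicked = {"service": "order-router", "le": "0.5", "env": "production"}
assert drillable(clicked) == {"service": "order-router", "env": "production"}
assert window_ms(60) == 30_000 and window_ms(300) == 150_000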
Going deeper
How Grafana data-link templating actually resolves at click-time
The data-link is a Grafana panel field configured in the dashboard JSON model under fieldConfig.defaults.links[]. Each link has title, url, and an optional targetBlank. The url is a string with ${variable} placeholders that Grafana resolves in three passes: (1) dashboard variables — $service, $env, $region — replaced by the current dashboard variable values; (2) panel field variables — ${__field.labels.X} — replaced by the clicked field's label set; (3) time variables — ${__from}, ${__to} — replaced by epoch milliseconds of the panel's time range. The order matters: dashboard variables are resolved first, so a templated query like {service="$service",region="${__field.labels.region}"} works if $service is a dashboard variable but region is a metric label that varies per data point.
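A toy resolver makes the ordering concrete — a sketch, not Grafana's implementation; it mimics the three passes with plain string substitution:

# resolve_order.py — toy three-pass resolution of a data-link template,
# mirroring the order described above (dashboard vars, field labels, time).
template = '{service="$service",region="${__field.labels.region}"}&from=${__from}&to=${__to}'

def resolve(tpl, dash_vars, field_labels, t_from, t_to):
    for k, v in dash_vars.items():                 # pass 1: dashboard variables
        tpl = tpl.replace(f"${k}", v)
    for k, v in field_labels.items():              # pass 2: clicked field labels
        tpl = tpl.replace("${__field.labels.%s}" % k, v)
    return (tpl.replace("${__from}", str(t_from))  # pass 3: time range
               .replace("${__to}", str(t_to)))

print(resolve(template, {"service": "order-router"},
              {"region": "ap-south-1"}, 1714052087000, 1714052147000))
# {service="order-router",region="ap-south-1"}&from=1714052087000&to=1714052147000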
The full grammar is documented at Grafana — Data links and the :regex, :percentencode, :queryparam formatters cover the most common substitution edge cases. A subtle feature is the ${__url_time_range} shortcut, which expands to from=...&to=... in one go, saving the dashboard author from constructing two ${__from} / ${__to} substitutions independently. Internal Grafana plugins (the trace data source, the Loki data source) use this same grammar — the drill-down system is not a special case but the most visible application of a general templating engine that also powers the Tempo data source's tracesToLogsV2 (used in the trace-to-log direction this article's siblings cover).
Indexed labels vs parsed fields in Loki — the cost difference
A LogQL query like {service="order-router"} | json | level="error" runs in two phases. The stream-label selector {service="order-router"} is index-served — Loki's inverted index over stream labels returns the relevant chunks in milliseconds. The | json | level="error" is content-scanned — Loki opens each chunk, parses each line as JSON, evaluates the level filter, and returns matching lines. The cost ratio is roughly 100×: a query that filters only on stream labels is 100× cheaper than one that filters on parsed fields.
The drill-down convention is to put as many constraints as possible into the stream selector. If level is frequently filtered, configure the log shipper to promote level from a JSON field to a stream label — at the cost of multiplying the stream count by 5 (info, warn, error, debug, fatal × the existing streams). Whether the index multiplication is worth the query speedup depends on cardinality budgets; at most teams it is, for level only — but never for high-cardinality fields like trace_id or user_id. The audit pattern is to query Loki's /loki/api/v1/series endpoint, group the active streams by label, and look at the cardinality of each candidate label — promote a label only if its distinct-value count is under ~10 per service.
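A sketch of that audit, assuming a Loki at localhost:3100 — /loki/api/v1/series is the real endpoint; the ≤10 promotion threshold is this article's convention, and the per-service breakdown is left out for brevity:

# label_audit.py — count distinct values per stream label and flag safe
# promotion candidates; each item in data[] is a label->value dict.
import requests
from collections import defaultdict

LOKI = "http://localhost:3100"
r = requests.get(f"{LOKI}/loki/api/v1/series",
                 params={"match[]": '{service=~".+"}'}, timeout=30)
r.raise_for_status()

values = defaultdict(set)
for stream in r.json()["data"]:
    for label, value in stream.items():
        values[label].add(value)

for label, vals in sorted(values.items(), key=lambda kv: len(kv[1])):
    verdict = "promotable" if len(vals) <= 10 else "keep as parsed field"
    print(f"{label:20s} {len(vals):6d} distinct  -> {verdict}")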
Production wiring at hypothetical Hotstar IPL-final scale
At a hypothetical 38,000 RPS streaming-services backend during the IPL final, the metric-to-log drill-down is the difference between "we know which microservice is breaking" and "we know which dependency of which microservice is breaking". The cost decomposes as: the metric panel renders from a Prometheus query that costs ~5ms (a single 1-day range query for a recording-rule-precomputed series); the drill-down click costs 80–250ms for the Loki query (60-second window, 80 services × 200 RPS each = 1M log lines indexed, ~12k matching the level filter, returned in 180ms).
The bottleneck is not the drill-down itself but the human's reading speed — 1,247 log lines is 90 seconds of reading. The mitigation is a follow-on | pattern extraction that groups similar log lines and shows counts: instead of 1,247 lines, the SRE sees 1,201 × redis_lock_contention key=symbol_book:RELIANCE, 38 × upstream_504 dependency=npci-rail, 8 × deserialisation_error. Pattern extraction runs on the Loki side (newer Loki versions support | pattern <_>) or on the SRE's side via awk/sort/uniq -c. Either way, the click is the entry point; pattern extraction is what makes the result digestible.
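The SRE-side equivalent of the pattern extraction is a dozen lines of Counter — a sketch that assumes one JSON log document per line on stdin (logcli --output=raw produces exactly that); the grouping keys mirror the demo's field names:

# group_events.py — collapse a drill-down result into counted patterns,
# the awk/sort/uniq -c equivalent for structured JSON logs.
import json, sys
from collections import Counter

counts = Counter()
for line in sys.stdin:                       # pipe the drill-down result in
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue                             # skip non-JSON noise
    counts[(rec.get("event"), rec.get("key") or rec.get("dependency"))] += 1

for (event, detail), n in counts.most_common():
    print(f"{n:6d} × {event}" + (f" {detail}" if detail else ""))

Feed it the raw lines Loki returned (logcli query '...' --output=raw | python3 group_events.py) and a 1,247-line wall collapses into a handful of counted patterns.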
The cost ratio inverts at idle. Outside an incident, the dashboard rarely fires drill-downs (one or two per shift), but the dashboard itself renders every 30 seconds for every SRE looking at it. The metric panels' cost dominates the dashboard's day-to-day footprint. The drill-down's cost dominates only during incidents — exactly when query volume on Loki is at its peak from the rest of the team also drilling. This is the case for the Loki query cache discussed below: incident-time clicks cluster on a small set of identical queries (everyone clicks the same spike), so the cache hit-rate during incidents is much higher than during the calm — the cache earns its operational complexity precisely when the team needs it most.
When the drill-down is the wrong primitive, and how to make it cheap when it is right
For very high-rate metrics with mostly homogeneous logs (kafka_messages_processed_total on a single-purpose consumer), the drill-down is over-engineered — every log line says the same thing, the metric already tells you the count, and clicking through is theatre. Use a sample-N tail (logcli query --limit=20) instead, or skip logs entirely and look at consumer lag (kafka_consumer_lag) which is the actual diagnostic. For very low-rate metrics with heterogeneous logs (cron_job_duration_seconds), the drill-down is right but the time window should shrink to ±5s — the metric points are minutes or hours apart, the constituents are a single run, and ±30s is unnecessary slack. The rule is "drill-down where the metric and the logs share a denser time-aligned aggregation"; outside that range, other primitives are better tools.
When the drill-down is the right tool, it is worth making cheap. A popular dashboard (the one a payment-platform on-call team opens every morning) might fire the same drill-down query forty times in a busy hour. Loki, by default, re-runs each one — opens the chunks, parses the JSON, evaluates the filter. The second-onwards queries are pure waste. Loki has a query-result cache (the results_cache block under query_range in the Loki config) backed by Redis or memcached that stores the result of each unique LogQL query for a configurable TTL. Setting a 5-minute TTL on the drill-down cache reduces the average query time from 180ms to 8ms for repeat clicks within the same window, and reduces Loki's chunk-read I/O proportionally. The cache key includes the time range, so a click at 09:14:47 and a click at 09:14:51 are different cache keys (different windows, different results); only identical clicks hit the cache. This is invisible to the SRE — the drill-down simply feels faster — but it is the difference between a Loki cluster that scales to 200 SREs and one that wedges at 40. Configuration lives in the Loki query caching docs.
The reverse direction and the dashboard-rollout hygiene checklist
The drill-down is one direction of an undirected graph. The reverse — clicking a log spike and opening the metric panel that summarises it — is rarer in dashboard-driven workflows but useful in log-led investigations. The wiring is symmetric: Loki labels → Prometheus labels (same names), Loki time range → Prometheus time range, Loki query filter → dashboard variable selection — so a data link on the Explore log view, or a hand-built dashboard URL templated with the same labels, carries an investigation from the log-volume histogram back to the metric dashboard. Where the forward direction starts from "I see a spike, what produced it", the reverse starts from "I see an unusual log pattern, what does the broader system look like at this moment". Both paths exist; teams usually configure one and forget the other, then rediscover the value when the rare investigation hits the unconfigured direction.
When a platform team rolls out a new dashboard template across a microservice fleet (typical when a new region or a new line-of-business is bootstrapped), the drill-down configuration is the most-skipped step — every panel "works" without it, so the rollout passes review and the missing drill-downs surface only when the first incident hits. A six-item hygiene checklist applied before the dashboard ships:
1. Every metric panel has a data-link to logs of the same service, with the time window matching the panel's scrape resolution.
2. Every error-rate panel filters to level="error" in the LogQL.
3. Every latency-percentile panel additionally filters to logs where the parsed latency_ms field exceeds the percentile threshold.
4. Every dashboard variable has the :regex formatter on its substitution.
5. The data-source UID is read from the same provisioning manifest as the dashboard, not hardcoded.
6. A smoke-test bot clicks each drill-down link in dev once per week and pages if any returns zero results in the dev environment (where errors are routinely injected).
The checklist is twenty minutes per dashboard and prevents the on-call experience documented in the lead from playing out for the seventh time. Razorpay's platform team published a one-page version of this checklist as part of their internal dashboard-review template; Hotstar's SRE org runs a CI lint on the Grafana JSON model that fails the PR if a metric panel of type=stat or type=timeseries lacks a data-link entry — a sketch of such a lint follows.
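A sketch of that CI lint — the dashboard JSON model's panels[], type, and fieldConfig.defaults.links fields are Grafana's; the policy of which panel types must carry a link is the checklist's, and nested row panels are ignored for brevity:

# lint_datalinks.py — fail CI when a metric panel lacks a drill-down link.
import json, sys

MUST_LINK = {"stat", "timeseries"}

def lint(path: str) -> list:
    dash = json.load(open(path))
    failures = []
    for panel in dash.get("panels", []):
        if panel.get("type") not in MUST_LINK:
            continue
        links = (panel.get("fieldConfig", {})
                      .get("defaults", {})
                      .get("links", []))
        if not links:
            failures.append(panel.get("title", "<untitled>"))
    return failures

if __name__ == "__main__":
    bad = lint(sys.argv[1])
    for title in bad:
        print(f"FAIL: panel {title!r} has no data-link")
    sys.exit(1 if bad else 0)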
Where this leads next
Metric-to-log drill-down is the metric → logs edge of the cross-pillar correlation graph. The trace → logs edge is covered in /wiki/log-to-trace-correlation-trace-ids-in-logs; the metric → trace edge in /wiki/exemplars-metrics-traces; the trace → metric edge (drilling from a span to its histogram contribution) in /wiki/drill-down-and-correlation. With this article, four of the six bidirectional edges are described.
The remaining two — log → metric (the rate-of-this-log-pattern panel, useful when a high-cardinality field cannot be a metric label) and the dashboard-spanning version of all six (the "single-pane debugger" that orchestrates them automatically) — are covered in /wiki/log-pattern-rate-as-metric and /wiki/the-single-pane-of-glass-anti-pattern-and-when-it-isnt. The cross-curriculum thread is consistent: every cross-pillar primitive is a sparse pointer (label-set, trace_id, time-window) shared between two stores at observation time, and queryable in either direction at investigation time.
For the underlying log-store mechanics that make this drill-down cheap (stream-label index vs content scan, chunk format, time-bound query plan), see /wiki/why-high-cardinality-labels-break-tsdbs and /wiki/structured-vs-unstructured-logging. For the alert-side drill-down (Pattern 3 in this article), see /wiki/alert-routing-and-on-call-context — alerts are the third-most-common drill-down entry point after dashboards and Explore, and the most demanding because the SRE who clicks is half-awake.
# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3100:3100 grafana/loki:latest
docker run -d -p 3000:3000 grafana/grafana:latest
python3 -m venv .venv && source .venv/bin/activate
pip install flask prometheus-client structlog requests python-logging-loki
python3 drill_down_demo.py &
for i in $(seq 1 100); do curl -s localhost:8080/order/RELIANCE > /dev/null; done
# in Grafana: add Prometheus + Loki data sources, build a panel for
# rate(orders_total{status="5xx"}[1m]), add a data-link with the URL the
# demo script prints, click the spike and watch the logs filter open.
# note: Prometheus needs a scrape config pointing at localhost:8080, and the
# demo logs to stdout — ship them into Loki with promtail or vector before
# the drill-down click has anything to return.
A note on terminology: in some orgs (Razorpay, Swiggy) the term is "drill-down"; in others (Hotstar, Flipkart) it is "explore" or "investigate". The mechanism is identical, and articles or runbooks that use one term should stay readable to teams that use the other. This article uses "drill-down" because it is the most common and is the term Grafana itself uses in its menu copy.
The skill the click is meant to teach is not the click itself — that is mechanical — but the habit of going from "the dashboard tells me something" to "the logs tell me what". An on-call shift that takes the dashboard at face value will misdiagnose the third Tuesday-batch-retry spike in a row. An on-call shift that drills down on every spike that crosses a threshold will read the right log line within ninety seconds.
The drill-down link is the affordance that nudges the second behaviour into a habit; well-wired dashboards make the right thing the easy thing, and that is the whole point of the engineering investment described in this article.
References
- Grafana — Configure data links — the canonical reference for data-link grammar, including ${__field.labels.X}, ${__from}, and the :regex / :queryparam formatters used in templated drill-down URLs.
- Loki — LogQL query language — stream-label selectors, parser stages (| json, | logfmt, | pattern), label filters, and the cost model that distinguishes index-served selectors from content-scanned filters.
- Prometheus — Naming conventions — the label-name conventions every drill-down depends on; alignment with OpenTelemetry resource attributes (service.name → service_name) is the same source of truth.
- Charity Majors et al., Observability Engineering, Chapter 9 — the case for cross-pillar correlation and the navigation primitives that turn dashboards into debuggers.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — foundational reasoning on why correlation matters and how the three pillars compose into one investigative workflow.
- Loki query result caching — operational reference for the query cache that turns a 180ms drill-down click into an 8ms one for repeat investigations during incidents.
- /wiki/log-to-trace-correlation-trace-ids-in-logs — internal: the trace ↔ log edge of the same correlation graph.
- /wiki/exemplars-metrics-traces — internal: the metric ↔ trace edge, the high-precision sibling of this article's high-recall mechanism.