Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The "one pane of glass" promise (and its limits)

It is 02:14 on a Tuesday. Aditi, an SRE at a hypothetical Bengaluru-based food-delivery service we will call SwadGo, is staring at her tenth browser tab. The PagerDuty alert says checkout_p99_breach. Tab one is the Grafana dashboard with the spiking p99 panel. Tab two is the Loki query for {service="checkout-api", level="error"} from the last fifteen minutes. Tab three is the Tempo trace search for service=checkout-api and duration > 2s. Tab four is the Datadog APM dashboard the previous team had migrated halfway. Tab five is the Pyroscope flamegraph. Tab six is the AWS console for RDS. Tab seven is Sentry. Tab eight is the deploy-history page in ArgoCD. Tab nine is the Slack channel where two engineers are typing past each other. Tab ten is a Notion runbook from 2024 that links to four broken Confluence pages.

The trace ID she has in tab three does not appear in the logs in tab two — different filters, different time range, different tenant. The flamegraph in tab five is from a pod that got recycled twelve minutes ago. The deploy in tab eight happened four hours ago — a fact none of the dashboards can correlate with the spike. By 02:36 she has the wrong root cause; by 03:04 she has the right one; by 09:15 she is in a postmortem explaining why thirty-four minutes of debugging happened across ten tabs that the vendor demo had promised would be one.

"One pane of glass" is the marketing phrase for a unified investigation surface where metrics, logs, traces, and profiles are linked by shared identifiers — trace_id, service.name, exemplars, deploy markers — so drilling from one to the next takes one click instead of ten tabs. The mechanism is real and worth building. What the demo never shows is that the unification is fragile: it breaks when telemetry is sampled at different rates, when correlation IDs aren't propagated across every hop, when one signal lives in vendor A and the next in vendor B, and when the surface itself adds a new failure mode — the "everything in one place" outage where the pane is dark and you have nothing.

What the unified surface actually requires

The "pane" is not a UI feature; it is the terminal node of a correlation graph that has to exist underneath. For a panel showing checkout p99 to drill into a specific slow trace, four facts must already be true at the data layer: (1) the histogram metric carries an exemplar — a single (trace_id, span_id, observed_value, timestamp) tuple stored alongside the bucket count — pointing into a trace that the trace store has actually retained; (2) the trace store and the metric store can be queried by the same UI under the same auth context; (3) the trace's service.name matches the metric's service label exactly (no checkout-api vs checkout_api drift); and (4) the timestamps agree closely enough that the trace falls inside the panel's time window. Miss any one and the click does nothing — or worse, returns "no traces found" while the bug is real.

Most teams discover these requirements one outage at a time. The metric-to-log drill-down covered in /wiki/metric-to-log-drill-down is the same correlation graph applied to logs instead of traces; the log-to-trace correlation in /wiki/log-to-trace-correlation-trace-ids-in-logs is the third edge. The "pane" is just the UI rendering of these three edges traversed in sequence: alert → metric panel → trace via exemplar → spans of trace → logs filtered by trace_id of one span → flamegraph for the pod that emitted the slow span. Six clicks if the graph is wired; six dead ends if it isn't.

Figure: the correlation graph the pane renders — four stores, three edges. Metrics store (Prometheus, Mimir/Cortex; low cardinality; 90-day retention) links to the trace store via the exemplar's trace_id. Trace store (Tempo, Jaeger, Honeycomb; sampled, often 1–10%; 7-day retention) links to the log store (Loki, Elasticsearch, ClickHouse; very high cardinality; 14-day retention) via the trace_id in the log line, and back via the extracted trace_id field. Profile store (Pyroscope, Parca; 100Hz sampling; 3-day retention) links to traces via labels matching service.name + pod. Above all four sits "the pane" — the UI rendering layer. Three different cardinality budgets, three different retention policies, one shared correlation key per edge: if any edge breaks (sampling drops the trace, the log shipper strips trace_id, profile labels drift), the click into that pane returns nothing.
Illustrative — the four telemetry stores share three edges. The pane is the UI on top; the unification lives in whether each edge is actually wired. A typical "pane is broken" outage is a single missing edge, not a UI bug.

Why the four stores cannot just be merged into one: the cardinality budgets and retention horizons are fundamentally different. Metrics live for 90 days at a cardinality of ~50K series per service. Logs live for 14 days at a cardinality of millions of unique log lines per service. Traces live for 7 days, sampled to 1–10% of total. Profiles live for 3 days at 100Hz sampling per pod. Forcing them into a single store would either inflate the metrics retention cost 10× (storing logs at metric retention) or strip the log-line detail (storing logs at metric cardinality budgets). The four-store, three-edge architecture is the cost-correct one; the pane is the unified-experience layer on top of an inherently heterogeneous storage layer.
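A back-of-envelope sketch of the same argument in code — the per-signal volumes below are illustrative assumptions in the spirit of the numbers above, not measurements:

# retention_envelope.py — why one engine loses (hypothetical volumes).
SIGNALS = {               # (bytes/day for one mid-size service, retention days)
    "metrics":  (50_000 * 2 * 5_760,        90),  # ~50K series, ~2B/sample, 15s scrape
    "logs":     (50_000_000 * 300,          14),  # ~50M lines/day at ~300B each
    "traces":   (1_000_000 * 0.02 * 2_000,   7),  # 1M req/day, 2% kept, ~2KB/trace
    "profiles": (400_000_000,                3),  # ~400MB/day of 100Hz profiles
}

for name, (bytes_per_day, days) in SIGNALS.items():
    print(f"{name:9s} {bytes_per_day * days / 1e9:8.1f} GB at native retention")

# Forcing logs to live at metric retention (90d instead of 14d) multiplies the
# dominant storage line by ~6.4x; trimming logs to a metric-style cardinality
# budget destroys the line-level detail instead. Either way, one engine loses.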

Building the smallest possible pane that actually works

Forget the vendor demo. The minimum viable unified surface is a single Python script that, given a trace_id, fetches every related signal from the four stores and prints them in a single timeline. If you can build that in under 100 lines, you understand the pane; if you cannot, no amount of Grafana JSON config will save you when the pane breaks.

# unified_pane.py — a 90-line investigation pane in one Python script.
# Given a trace_id, fetch metrics + logs + spans + profile labels and print
# them on a single timeline. Demonstrates the correlation graph in code.
# pip install requests tabulate
import sys, datetime as dt
import requests
from tabulate import tabulate

PROM = "http://prometheus.swadgo.internal:9090"
LOKI = "http://loki.swadgo.internal:3100"
TEMPO = "http://tempo.swadgo.internal:3200"
PYRO  = "http://pyroscope.swadgo.internal:4040"

def fetch_trace(trace_id: str) -> dict:
    """Tempo: pull spans for a given trace_id."""
    r = requests.get(f"{TEMPO}/api/traces/{trace_id}", timeout=5)
    if r.status_code == 404:
        return {"spans": [], "missing": True, "reason": "trace dropped by sampler"}
    return r.json()

def fetch_logs_for_trace(trace_id: str, t0: dt.datetime, t1: dt.datetime) -> list:
    """Loki: filter by trace_id, return matched log lines."""
    q = f'{{service=~".+"}} | json | trace_id="{trace_id}"'
    r = requests.get(f"{LOKI}/loki/api/v1/query_range", params={
        "query": q, "start": int(t0.timestamp()*1e9), "end": int(t1.timestamp()*1e9),
        "limit": 200,
    }, timeout=5)
    out = []
    for stream in r.json().get("data", {}).get("result", []):
        for ts_ns, line in stream["values"]:
            out.append({"ts": dt.datetime.fromtimestamp(int(ts_ns)/1e9),
                        "service": stream["stream"].get("service", "?"),
                        "line": line[:80]})
    return sorted(out, key=lambda x: x["ts"])

def fetch_metric_window(service: str, t0: dt.datetime, t1: dt.datetime) -> dict:
    """Prometheus: fetch p99 + error_rate for the service over the trace window."""
    p99 = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": f'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{{service="{service}"}}[1m])) by (le))',
        "start": t0.timestamp(), "end": t1.timestamp(), "step": "15s"}, timeout=5).json()
    err = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": f'sum(rate(http_requests_total{{service="{service}",status=~"5.."}}[1m])) / sum(rate(http_requests_total{{service="{service}"}}[1m]))',
        "start": t0.timestamp(), "end": t1.timestamp(), "step": "15s"}, timeout=5).json()
    return {"p99": p99, "err": err}

def fetch_profile_link(service: str, pod: str, t0: dt.datetime, t1: dt.datetime) -> str:
    """Pyroscope: build the URL of the flamegraph for this pod and time window."""
    return (f"{PYRO}/?query=process_cpu:cpu:nanoseconds:cpu:nanoseconds"
            f"{{service_name=\"{service}\",pod=\"{pod}\"}}"
            f"&from={int(t0.timestamp()*1000)}&until={int(t1.timestamp()*1000)}")

def render_pane(trace_id: str):
    trace = fetch_trace(trace_id)
    if trace.get("missing"):
        print(f"PANE-BROKEN: trace {trace_id} not in store ({trace['reason']})")
        return
    # Tempo returns OTLP-style JSON: one batch per emitting resource, with
    # service.name / k8s.pod.name in the batch's resource attributes, not on
    # each span. Flatten all batches and denormalise the resource onto spans.
    spans = []
    for batch in trace.get("batches", []):
        attrs = {a["key"]: a.get("value", {}).get("stringValue", "?")
                 for a in batch.get("resource", {}).get("attributes", [])}
        for scope in batch.get("scopeSpans", []):
            for s in scope.get("spans", []):
                s["resource"] = attrs
                spans.append(s)
    if not spans:
        print("PANE-BROKEN: trace exists but has no spans (corrupt or partial)")
        return
    t0 = min(dt.datetime.fromtimestamp(int(s["startTimeUnixNano"])/1e9) for s in spans)
    t1 = max(dt.datetime.fromtimestamp(int(s["endTimeUnixNano"])/1e9)   for s in spans)
    services = sorted({s["resource"].get("service.name", "?") for s in spans})
    pods     = sorted({s["resource"].get("k8s.pod.name", "?") for s in spans})

    print(f"=== unified pane for trace_id={trace_id[:16]}... ===")
    print(f"window: {t0.isoformat()} → {t1.isoformat()} (Δ={(t1-t0).total_seconds():.2f}s)")
    print(f"services touched: {services}")

    rows = [(s["startTimeUnixNano"], s["resource"].get("service.name", "?"), s["name"],
             (int(s["endTimeUnixNano"]) - int(s["startTimeUnixNano"]))/1e6,
             s.get("status", {}).get("code", "OK"))
            for s in spans]
    rows.sort(key=lambda r: int(r[0]))
    print("\nspans:")
    print(tabulate([(r[1], r[2], f"{r[3]:.1f}ms", r[4]) for r in rows],
                   headers=["service","span","dur","status"]))

    print("\nlogs filtered by trace_id (Loki):")
    logs = fetch_logs_for_trace(trace_id, t0, t1)
    print(tabulate([(l["ts"].isoformat(timespec='milliseconds'), l["service"], l["line"]) for l in logs[:8]],
                   headers=["ts","service","line"]))
    if not logs:
        print("  (no matching log lines — check log shipper strips trace_id field)")

    for svc in services:
        m = fetch_metric_window(svc, t0, t1)
        n_p99 = sum(1 for r in m["p99"]["data"]["result"] for _ in r["values"])
        print(f"\nmetric panel for {svc}: {n_p99} p99 samples in window")

    for pod in pods[:3]:
        for svc in services[:1]:
            print(f"\nprofile (Pyroscope) for {svc}/{pod}:")
            print(f"  {fetch_profile_link(svc, pod, t0, t1)}")

if __name__ == "__main__":
    render_pane("a3f7c812d4e9b1f6c3a8d92e7b1c4f56")

Sample run on a real (hypothetical) SwadGo trace at 02:14:23 IST:

=== unified pane for trace_id=a3f7c812d4e9b1f6... ===
window: 2026-04-25T02:14:23.142 → 2026-04-25T02:14:25.847 (Δ=2.71s)
services touched: ['cart-svc', 'checkout-api', 'payments-api', 'pricing-svc']

spans:
service        span                  dur     status
-------------  --------------------  ------  ------
checkout-api   POST /checkout        2710ms  ERROR
cart-svc       GET /cart/{id}          14ms  OK
pricing-svc    GET /price/{sku}       127ms  OK
payments-api   POST /charge          2580ms  ERROR

logs filtered by trace_id (Loki):
ts                       service        line
-----------------------  -------------  ----------------------------------------
2026-04-25T02:14:23.146  checkout-api   {"msg":"checkout begin","cart_id":"c12"}
2026-04-25T02:14:23.301  payments-api   {"msg":"charge attempt","amt":1280}
2026-04-25T02:14:25.831  payments-api   {"msg":"timeout","upstream":"npci-rail"}
2026-04-25T02:14:25.844  checkout-api   {"msg":"500","cause":"payments timeout"}

metric panel for cart-svc: 12 p99 samples in window
metric panel for checkout-api: 12 p99 samples in window
metric panel for payments-api: 12 p99 samples in window
metric panel for pricing-svc: 12 p99 samples in window

profile (Pyroscope) for checkout-api/checkout-api-7f8c9-abc12:
  http://pyroscope.swadgo.internal:4040/?query=...&from=...&until=...

The load-bearing lines:

  • fetch_trace(trace_id) is the root anchor of the entire pane — every subsequent fetch hangs off the trace's services, pods, and time window. If it returns 404 because the sampler dropped the trace, the pane has nothing to render, and the script honestly prints PANE-BROKEN.
  • q = f'{{service=~".+"}} | json | trace_id="{trace_id}"' is the LogQL query that depends on the log shipper having preserved the trace_id JSON field through the entire pipeline (filebeat → Kafka → Vector → Loki). A single misconfigured stage strips the field and the logs join fails silently, returning zero rows for a trace that did generate logs.
  • {s["resource"].get("service.name", "?") for s in spans} is the projection that turns a trace into the set of services to query metrics for. The service.name value must match the Prometheus service label exactly: a single service.name="checkout-api" vs service="checkout_api" typo silently produces an empty metric panel.
  • fetch_profile_link(svc, pod, t0, t1) does not fetch the profile inline — it constructs a URL — because Pyroscope flamegraphs are too large to embed in a CLI; the unified pane links out instead. Real Grafana panes do the same: profile drill-down opens a new tab.
  • if trace.get("missing"): is the most important line in the script — it forces the pane to fail honestly rather than render an incomplete picture. A pane that silently shows three of four signals when the fourth is missing trains the SRE to trust an incomplete view.

Why the script has to fail loudly when one signal is missing rather than rendering the partial pane: an SRE who looks at a pane with no log lines and assumes "no logs were produced" makes the wrong inference if the truth is "logs were produced but the shipper stripped the trace_id and they cannot be joined". The pane has two possible failure modes — no signal and signal-but-not-correlated — and they look identical to the eye but require completely different fixes (one is a code change, one is a pipeline config change). Honest panes distinguish them: "0 log lines matched" is different from "log query failed: trace_id field missing in 14% of streams in this window".
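Here is what the distinction can look like in code — a hedged sketch that measures trace_id coverage for the logs edge with two Loki instant queries (the endpoint and labels are the hypothetical SwadGo ones):

# trace_id_coverage.py — is the logs edge wired, or silently broken?
import requests

LOKI = "http://loki.swadgo.internal:3100"

def trace_id_coverage(service: str, window: str = "15m") -> float:
    """Fraction of the service's log lines in the window that carry a trace_id."""
    def count(q: str) -> float:
        r = requests.get(f"{LOKI}/loki/api/v1/query",
                         params={"query": q}, timeout=5).json()
        res = r.get("data", {}).get("result", [])
        return float(res[0]["value"][1]) if res else 0.0

    total   = count(f'sum(count_over_time({{service="{service}"}}[{window}]))')
    with_id = count(f'sum(count_over_time({{service="{service}"}} '
                    f'| json | trace_id != "" [{window}]))')
    return with_id / total if total else 0.0

cov = trace_id_coverage("checkout-api")
if cov < 0.95:
    # "0 lines matched" for one trace now has a likely cause: the pipeline
    # is stripping the field, not the service failing to log.
    print(f"WARNING: only {cov:.0%} of lines carry trace_id — logs edge degraded")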

Three fragility modes the demo never shows

The vendor demo runs against a perfectly-instrumented sample app. Production runs against five years of accumulated drift. Three fragility modes recur often enough that every team building a pane should plan for them.

Mode 1: sampling-drop dead-ends. The pane shows you the metric-panel spike. You click the exemplar to drill into the trace. Tempo returns 404. The trace was dropped by the head-based sampler (1% keep rate) — the bucket count in the metric is correct (every request was counted), but the exemplar points at a trace the sampler did not keep. How often this happens depends on the bucket's traffic, the keep rate, and the exemplar policy: a bucket with 100 requests/min at a 1% keep rate has a (0.99)^100 ≈ 37% chance that none of that minute's traces were retained at all, so any exemplar recorded in that minute is guaranteed to dead-end — and if the SDK picks exemplars without regard to the sampling decision, the dead-end rate approaches 99%. The fix is exemplar-aware sampling: record exemplars only from traces the sampler has decided to keep, or refuse to drop a trace that a live exemplar points at. Head-based samplers cannot do the latter (the drop decision precedes the exemplar choice); a tail-sampling stage such as the OpenTelemetry Collector's tail_sampling processor can approximate it by keeping exactly the slow and erroring traces that exemplars on a spiking panel tend to point at. Most teams configure neither, and a large fraction of pane-clicks dead-end.
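The 37% figure is just (0.99)^100, and the gulf between the two exemplar policies is easy to check with a short simulation sketch:

# deadend_prob.py — why exemplar policy dominates the dead-end rate.
import random

REQS_PER_MIN, KEEP_RATE, MINUTES = 100, 0.01, 20_000

dead_naive = dead_aware = 0
for _ in range(MINUTES):
    kept = [random.random() < KEEP_RATE for _ in range(REQS_PER_MIN)]
    # naive policy: exemplar is a uniformly random request, sampling ignored
    dead_naive += not kept[random.randrange(REQS_PER_MIN)]
    # aware policy: exemplar drawn only from kept traces — dead-ends only
    # when the sampler kept nothing at all this minute
    dead_aware += not any(kept)

print(f"naive exemplars dead-end: {dead_naive/MINUTES:.0%}  (expected ~99%)")
print(f"aware exemplars dead-end: {dead_aware/MINUTES:.0%}  "
      f"(expected (0.99)^100 = {0.99**100:.0%})")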

Mode 2: vendor-boundary islands. A team that uses Datadog APM for traces, Grafana for dashboards, Splunk for logs, and Sentry for errors has four panes, not one. The exemplars from the metric click through to Datadog APM (good); the trace's trace_id does not link to a Splunk log filter (bad — different vendor, different auth, different trace_id format because Datadog uses 64-bit IDs and W3C uses 128-bit). The team eventually writes a custom proxy that translates exemplar-clicks across vendors, but the proxy is brittle and breaks every time one vendor changes their URL schema. The "single pane" promise from any one vendor only holds within that vendor's stack; cross-vendor unification is a custom integration project that never appears in the sales deck. Razorpay reportedly hit this in 2024 when they were running both Datadog (legacy) and Grafana Cloud (new) and an on-call engineer would routinely have both APMs open in adjacent tabs because neither could resolve the other's trace IDs.
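The ID-format half of the problem is mechanical. One commonly documented convention (Datadog's OpenTelemetry interop among them) is that the 64-bit ID is the low 64 bits of the W3C 128-bit ID rendered as unsigned decimal. A sketch of the translation the custom proxy has to perform — verify the convention against your own vendors before relying on it:

# id_bridge.py — core of the cross-vendor trace-id translation (hedged sketch).
def w3c_to_64bit_decimal(w3c_trace_id: str) -> str:
    assert len(w3c_trace_id) == 32, "W3C trace ids are 32 hex chars (128 bits)"
    return str(int(w3c_trace_id[16:], 16))   # low 64 bits, as unsigned decimal

def decimal_64bit_to_w3c(vendor_id: str) -> str:
    # lossy in reverse: the high 64 bits are gone — zero-pad and, if the trace
    # store supports it, search by suffix instead of exact match
    return f'{0:016x}{int(vendor_id):016x}'

tid = "a3f7c812d4e9b1f6c3a8d92e7b1c4f56"
print(w3c_to_64bit_decimal(tid))   # low 64 bits of the demo trace id, decimal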

Mode 3: the pane itself becomes the SPOF. Once everyone in the company depends on the unified surface for incident investigation, the surface itself becomes a tier-0 dependency. When Grafana goes down (auth-provider outage, control-plane upgrade gone wrong, certificate expiry on the gateway), the company has no observability — even though Prometheus, Loki, and Tempo are all up and accepting writes. The data is fine; the rendering is broken; the on-call engineer cannot see it. The mitigation mature teams add is a fallback investigation kit — a small set of promtool, logcli, and tempo-cli commands that hit the data stores directly without going through the pane. Hotstar's SRE team reportedly maintains a "no-Grafana incident playbook" that walks the on-call through these CLI commands when the pane is dark; the IPL final 2025 outage was allegedly diagnosed entirely from logcli because Grafana was down for 30 minutes due to an unrelated TLS certificate issue.
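A sketch of what the fallback kit contains — real CLI entry points (promtool, logcli, curl) pointed at the hypothetical SwadGo endpoints:

# no-pane fallback kit — run these when Grafana is dark but the stores are up
promtool query instant http://prometheus.swadgo.internal:9090 \
  'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le))'

logcli --addr=http://loki.swadgo.internal:3100 query \
  --since=15m --limit=50 '{service="checkout-api"} |= "error"'

# Tempo's HTTP API is the stable surface — plain curl works when tooling doesn't
curl -s http://tempo.swadgo.internal:3200/api/traces/a3f7c812d4e9b1f6c3a8d92e7b1c4f56 \
  | python3 -m json.tool | head -40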

The deeper structural problem under Mode 3: the pane's availability budget is now the binding constraint on the combined observability stack. If Grafana has 99.9% uptime and the four underlying stores have 99.99% each, user-visible observability availability is the serial product — about 99.86% — dominated by Grafana's three nines, not the four nines of the data layer. Investing in higher-availability data stores stops paying off above the pane's own SLA. Most teams accept this and run Grafana multi-region active-active to push its number towards 99.95%; some run two genuinely independent panes (Grafana plus a practised CLI fallback), in which case the rendering layer's availability is the parallel composition — one minus the product of the two unavailabilities — rather than the weaker serial product. Neither approach is free.
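The arithmetic is worth doing once explicitly:

# availability.py — serial vs parallel composition of pane + stores.
stores_avail = 0.9999 ** 4          # four data stores, all needed, in series
pane_avail   = 0.999                # a single Grafana deployment

serial = pane_avail * stores_avail  # user sees signal only if EVERYTHING is up
print(f"single pane, serial:   {serial:.4%}")        # ≈ 99.86% — pane dominates

cli_fallback = 0.99                 # practised CLI path, generously discounted
either_pane  = 1 - (1 - pane_avail) * (1 - cli_fallback)   # parallel rendering
print(f"pane OR CLI fallback:  {either_pane * stores_avail:.4%}")  # ≈ 99.96%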

Figure: three fragility modes — every production pane hits all three eventually. (1) Sampling-drop dead-end: the metric panel is fine, but the exemplar click returns 404 because the sampler dropped the trace; at a 1% sample rate a large share of clicks dead-end; fix: exemplar-aware tail sampling that keeps the traces exemplars currently point at. (2) Vendor-boundary islands: Datadog (64-bit trace_id) and Grafana (128-bit trace_id) share no link — two vendors means two panes and an SRE with both tabs open; fix: a custom proxy that translates trace_id and URL formats, brittle and never in the sales deck. (3) Pane as SPOF: Grafana is down while Prometheus, Loki, Tempo, and Pyroscope are all up and accepting writes — the data is fine, the rendering is broken; fix: a CLI fallback kit (promtool, logcli, tempo-cli) and a no-pane runbook that hits the stores directly.
Illustrative — three fragility modes, each one of which silently breaks a different part of the unified-pane experience while the marketing claim remains intact. Plan for all three; assume the pane will fail in at least one mode within the first six months of production use.

Why the SPOF problem is structurally worse than it sounds: the same compounding effect that makes the pane valuable during a normal incident — single surface, faster correlation, less context-switching — makes the pane catastrophic during a pane outage. An SRE who has trained for two years on the Grafana drill-down workflow is not just slower without it; they have forgotten the underlying CLI commands. Mature teams therefore run an annual "pane-blackout drill" — a chaos-engineering exercise where Grafana is taken offline for an hour during a non-incident period and the on-call team must investigate a synthetic incident using only promtool, logcli, and tempo-cli. The drill exposes which incidents the team can still solve and which require the pane.

Common confusions

  • "One pane of glass means all signals in one product." It does not — and it cannot, because the storage requirements of metrics, logs, traces, and profiles are too divergent to fit one engine cost-effectively. What "one pane" actually means is a UI layer over four storage layers with shared correlation keys. The vendors that say "all in one product" are usually giving you four products with one login and one bill, not actual storage convergence. Honeycomb is the exception that gets closest by treating everything as wide events with high-cardinality columns, but even Honeycomb separates events from continuous profiling.
  • "If I add an exemplar to my metric, the drill-down will work." Exemplars are a necessary condition, not a sufficient one. The trace store must have retained the exemplar's trace (sampling), the trace's service.name must match the metric's labels (naming convention), the auth context must be shared (vendor / SSO), and the time alignment must be within the panel's window. Drop any one and the drill-down dead-ends despite the exemplar being present.
  • "The pane is the same as the dashboard." It is not. A dashboard is a static layout of panels at known time ranges; the pane is the ad-hoc investigation surface that opens when an alert fires. Dashboards are pre-built; the pane is composed in the moment from a starting signal (an alert, a customer report, a spike) to whichever signals you drill into. A great dashboard with no drill-down arrows is not a pane.
  • "More integrations = a better pane." The opposite past a small number — every new integration adds a new failure mode (a new auth boundary, a new trace_id format, a new retention horizon) that must be maintained. Razorpay's reported experience is that pane reliability decreases as the number of integrated tools grows past four, because the cross-tool correlation matrix grows quadratically while the engineering team to maintain it grows linearly.
  • "The vendor's demo unified pane will work for me out of the box." It will work for the demo app — usually a freshly-instrumented Spring Boot or Node.js app with one service, one log shipper, one trace exporter, and full OTLP coverage. It will not work for your five-year-old fleet with mixed instrumentation eras, three log shippers, two trace formats, and a few legacy services that emit only Graphite metrics. The pane requires an audit and retrofit phase before the click-through paths actually work end-to-end.
  • "If the pane is broken, I can use the CLI." Only if you keep the CLI muscle alive. SREs who have spent two years exclusively in the Grafana UI cannot fluently promtool query under pressure; they have to look up the syntax, find the right Prometheus URL, remember the rate-vs-irate distinction. The "CLI fallback" only works if it is actively practised in drills; otherwise it is a runbook that nobody knows how to run when the pane goes dark.

Going deeper

The OpenTelemetry Collector as the missing correlation enforcer

The Collector (opentelemetry-collector) sits between the SDKs and the backends, and it is the only place in a typical pipeline where you can enforce correlation invariants — that every log record carries the trace_id of the active span context, that every span carries the same service.name as its emitting pod's metric scrape, that every profile sample is labelled with the pod and time window of the spans it overlaps. OTLP log records already carry a trace_id field when the SDK wires span context into logging; the Collector's transform processor (OTTL) can promote that field into a queryable attribute and normalise service.name across signals, and the attributes processor can copy or rename attributes record by record (from_attribute).
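A hedged sketch of what that looks like in Collector configuration — set and replace_pattern are real OTTL functions in the contrib transform processor, but treat the pipeline wiring and exporter names as illustrative:

processors:
  transform/correlation:
    log_statements:
      - context: log
        statements:
          # promote the log record's span-context trace_id into a queryable
          # attribute so Loki's `| json | trace_id=...` filter can match it
          - set(attributes["trace_id"], trace_id.string)
    trace_statements:
      - context: resource
        statements:
          # normalise naming drift: checkout_api -> checkout-api
          - replace_pattern(attributes["service.name"], "_", "-")

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [transform/correlation]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [transform/correlation]
      exporters: [otlphttp]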

A team that runs the Collector with these processors as the single egress path for all telemetry has a much higher pane reliability than a team that lets each SDK emit directly. The Collector becomes the quality gate: if a log line arrives without a trace_id and the spec requires one, the Collector can drop it, log a warning, or — in production — auto-extract the field from the OTel context. The cost of running the Collector at this position (one CPU core per ~5K events/sec, ~200MB RAM) is small compared to the pane-reliability gain.

Honeycomb's wide-event model and why "all in one query" is harder than it looks

Honeycomb pioneered the wide-event approach: instead of storing metrics, logs, and traces as three separate signal types with three storage engines, store them all as wide events — a structured row with arbitrary high-cardinality columns. A single Honeycomb query can group by service.name, filter by trace_id, aggregate duration_ms as a percentile, and return rows that are simultaneously logs, span entries, and metric data points. This is genuinely the closest the industry has come to a single store with no correlation-graph fragility.

The catch — and there is always a catch — is that the wide-event model trades compression for fidelity. Honeycomb's pricing scales with event volume; teams that try to migrate their existing Prometheus + Loki + Tempo setup to Honeycomb often see costs increase 3–5× because their old pipeline used cheap aggregation (Prometheus histograms compress 10K samples into ~13KB), while Honeycomb stores every event individually. The pane reliability is higher; the storage bill is higher too. Choosing Honeycomb is choosing to spend money on telemetry quality rather than telemetry compression.

Auth, tenancy, and the hidden cost of "one login"

A pane spanning multiple data stores requires a shared auth context — the SRE who logs in once must be authorised to query all four stores. In a multi-tenant SaaS (e.g. a vendor platform serving fifty customer companies, each with their own Prometheus/Loki/Tempo namespace), the auth model is non-trivial: a query for a trace_id must scope correctly across the customer's metrics, logs, and traces, but cannot leak across customer boundaries.

Grafana's data-source-per-tenant model and Mimir/Loki/Tempo's X-Scope-OrgID header convention solve this for the open-source stack, but the boundary is fragile: a misconfigured Grafana dashboard can issue a query under the wrong tenant header and silently return another customer's data. Razorpay reportedly dedicates a small platform-team headcount specifically to auditing the cross-tenant query paths in their pane every quarter, because the failure mode (cross-tenant data leak through the unified surface) is an audit-grade incident, not just an observability bug.
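At the query layer the entire boundary is one header — a sketch with hypothetical org IDs:

# tenant_scope.py — X-Scope-OrgID is the whole tenancy boundary on the
# open-source stack; get the header wrong and you query the wrong customer.
import requests

LOKI = "http://loki.swadgo.internal:3100"

def query_as_tenant(org_id: str, logql: str) -> list:
    r = requests.get(f"{LOKI}/loki/api/v1/query",
                     params={"query": logql},
                     headers={"X-Scope-OrgID": org_id},  # the whole auth story
                     timeout=5)
    r.raise_for_status()
    return r.json()["data"]["result"]

# same query, two tenants, two disjoint result sets — a pane that caches or
# hardcodes this header is one misconfiguration away from a cross-tenant leak
q = 'sum(count_over_time({service="checkout-api"}[5m]))'
print(len(query_as_tenant("customer-417", q)))
print(len(query_as_tenant("customer-802", q)))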

Deploy-correlation as a fourth pane edge

The four-store correlation graph above covers metrics, logs, traces, and profiles. Production-quality panes add a fifth signal — deploys — as the graph's fourth edge. Every time a service is deployed, an annotation is recorded against the dashboards (Grafana's annotation store, in the open-source stack) with the deploy's git SHA, the user who triggered it, and the timestamp. A panel showing p99 latency renders these annotations as vertical lines; an SRE looking at a 02:14 spike can immediately see that it correlates with a deploy that landed at 02:11.

The implementation is one CI-pipeline hook per service (a curl to Grafana's /api/annotations endpoint at the end of each deploy) and one annotation overlay in the dashboard. Teams that add this correlation report a 3–5× reduction in mean-time-to-root-cause for incidents caused by a deploy — roughly 60–70% of incidents in fast-moving fleets. The remaining 30–40% (capacity problems, upstream dependencies, traffic shifts) still need the four-store pane, but the deploy edge handles the easy majority. Hotstar reportedly added deploy annotations as part of their IPL 2025 readiness work and credit it for catching three pre-final regressions before the broadcast started.
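The hook is small enough to show in full — a sketch against Grafana's documented /api/annotations HTTP API, with hypothetical host, token, and tag conventions:

# deploy_annotation.py — run as the last step of the deploy pipeline.
import time, os, requests

GRAFANA = "http://grafana.swadgo.internal:3000"

def annotate_deploy(service: str, git_sha: str, deployer: str):
    r = requests.post(f"{GRAFANA}/api/annotations",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={
            "time": int(time.time() * 1000),      # epoch millis
            "tags": ["deploy", service],          # dashboards filter on these
            "text": f"deploy {git_sha[:8]} by {deployer}",
        }, timeout=5)
    r.raise_for_status()

annotate_deploy("checkout-api", os.environ.get("GIT_SHA", "deadbeef"),
                os.environ.get("CI_USER", "ci"))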

When NOT to build a pane — staying with separate tools

Not every team needs a unified pane. A small team running a single service with low traffic — say, a 3-engineer ed-tech startup in Pune with one Flask app and 50 RPS — gets very little value from the correlation graph because the entire system fits in one engineer's head. The cost of building and maintaining the pane (running a Collector, configuring exemplars, normalising service names, drilling deploy markers) exceeds the benefit, because the alternative — three separate tabs that the engineer can context-switch between in five seconds — works fine at that scale.

The pane becomes worth building somewhere around 10–15 services and a dedicated SRE function. Below that, "tab one is Prometheus, tab two is Loki, tab three is Tempo" is the correct architecture; building a pane is over-engineering. The signal that you have crossed the threshold: when the SRE team starts complaining that "I cannot find the trace for the metric spike" more than once a week, the pane is now worth the investment.

The economics of pane investment

A common pattern at hypothetical Indian unicorns: the pane gets built by one motivated platform engineer in their 20% time, ships as an internal Grafana dashboard with three drill-down arrows wired up, and is celebrated. Six months later, the engineer has moved to a different team. The drill-down arrows have rotted because the trace store changed retention, the log shipper was upgraded, and the exemplar coverage dropped from 80% to 12% as new services rolled out without the OTel middleware that emits exemplars. Nobody owns the pane; everybody uses it; nobody fixes it. The on-call rotation grumbles for another year before someone proposes "maybe we should have a real platform team for this".

The lesson: a pane is not a project, it is a product with an SLA. If the pane is critical to incident response, it needs a named owner, a quarterly correlation-graph audit, and a budget for the engineering time to fix the edges that drift. Without that, the pane is technical debt with a friendly UI on top. The platform-engineering function that emerges from this realisation — typically called "Observability Platform" or "Reliability Tooling" at Indian companies large enough to afford it — is the team whose actual product is the pane's correlation graph, not the dashboard panels they build on top.

# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3100:3100 grafana/loki:latest
docker run -d -p 3200:3200 -p 4317:4317 grafana/tempo:latest  # in practice tempo needs a -config.file mount; see the Tempo quickstart
docker run -d -p 4040:4040 grafana/pyroscope:latest
python3 -m venv .venv && source .venv/bin/activate
pip install requests tabulate opentelemetry-sdk  # otel-sdk only needed to emit a test trace
python3 unified_pane.py  # uses the 90-line script above
# adjust the PROM/LOKI/TEMPO/PYRO constants at the top of the script
# to match your local URLs, then pass any trace_id from your local Tempo

Where this leads next

The unified pane is the consumer of every drill-down edge described in earlier chapters: exemplars from /wiki/exemplars-linking-metrics-to-traces, the metric-to-log path from /wiki/metric-to-log-drill-down, the log-to-trace path from /wiki/log-to-trace-correlation-trace-ids-in-logs, and the service graph from /wiki/service-graphs-from-traces. Each of those is one edge of the correlation graph; the pane is the rendering of all of them traversed in sequence during an investigation.

The natural next reading is /wiki/drill-down-and-correlation for the deeper mechanism of how each edge is implemented at the data layer, and /wiki/dashboard-anti-patterns for the pane's older sibling — the dashboard — and the patterns that make panes and dashboards untrustworthy. For the question of what good telemetry-platform engineering looks like when you are building the pane rather than consuming it, see /wiki/wall-dashboards-are-where-observability-touches-leadership and the broader Part 17 thread on observability as a discipline.

The deeper question — "given the pane works 95% of the time, how do I structure the on-call experience for the remaining 5%?" — is answered partly in /wiki/reducing-on-call-pain and partly in postmortem-driven culture, where every pane-broken incident produces a small fix to the correlation graph itself. Over a year, those small fixes compound into a pane that works 99% of the time. That ratio is the realistic ceiling; vendors who promise 100% are selling demos, not production systems.

The next architectural-level question after this one is the observability platform's own platform — when the pane is treated as a product with users (the on-call SREs across all product teams), the team that owns it must operate exactly like any other platform-engineering team: SLAs, roadmaps, deprecation policies, customer interviews. That shift in framing — from "Grafana dashboard everyone uses" to "internal product with quarterly user research" — is the inflection point at which a company's observability culture becomes self-sustaining rather than dependent on a few motivated engineers' goodwill.
