Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Wall: tying pillars together needs correlation

It is 02:47 IST and Riya, the SRE on-call at a hypothetical Bengaluru e-commerce company we will call SaudaCart, has been paged on checkout-api p99 > 800ms for 5m. She opens the runbook deep-link. The dashboard loads in under a second — every panel is freshly designed, every panel is named clearly, every panel passed last week's tier-1 review. The Prometheus latency panel shows a red spike at 02:41. The Loki error-rate panel shows a matching spike at 02:41. The Tempo trace count panel shows the spike. The Pyroscope CPU panel shows nothing unusual. The dashboard is perfect. It is also useless. Riya cannot answer the only question that matters at 02:47 — which specific request was slow, and what was happening inside it — because the three panels share no identifier she can click through. The latency panel is aggregate histogram_quantile(0.99, ...); the error panel is aggregate count_over_time({app="checkout"} |= "error" [5m]); the trace count is aggregate { service.name = "checkout-api" } | count() over the window. There are 47,000 candidate traces in the 5-minute window. Riya picks one randomly, looks at it, finds nothing wrong, picks another, finds nothing wrong, and at 03:14 escalates to the senior on-call because the dashboard is showing her the problem but is not letting her reach into the problem. The root cause turns out to be a single tenant whose Postgres prepared-statement cache evicted at 02:40 and produced 600 slow plans across 31 endpoints — invisible at the panel level, obvious in any one of the 600 specific traces if she had been able to click from the latency spike straight into them. This is the wall. Eighty chapters of pillars and panels have built three excellent telemetry stores that do not yet share a vocabulary. Part 12 closes here because Part 13 — OpenTelemetry — is the contract that gives them one.

A dashboard whose panels each query one pillar (metrics, logs, traces, profiles) but share no correlating identifier produces fast pattern-spotting and slow root-cause analysis — the panels point at the same incident but cannot reach into the same request. Correlation is what closes the gap: every emitted artefact, regardless of pillar, must carry the same trace_id, service.name, and resource attributes so that a click on a metric exemplar lands on the exact trace that produced it. Part 12 was the projection layer; Part 13 is the wire-format and SDK layer that produces correlatable telemetry by default. You cannot retrofit correlation on top of three independently-instrumented pillars at acceptable cost; correlation is an emit-time property, and that is why OpenTelemetry exists.

What the three pillars built — and what they did not

Eighty chapters of this curriculum have built telemetry. The metrics chapters (Parts 2, 6, 8) made you fluent in counters, gauges, histograms, the Gorilla-XOR-encoded byte savings, the cardinality budget, the recording-rule discipline. The logging chapters (Part 4) made you fluent in structured JSON, LogQL, the label-vs-content separation, content-addressed indices. The tracing chapters (Part 3) made you fluent in span trees, context propagation, parent-span chains, sampling decisions. The profiling chapters (Part 14, previewed) made you fluent in flamegraphs, on-CPU vs off-CPU, sampling rates. Each pillar is, on its own, a competent diagnostic tool — the metrics show you that something is wrong, the logs show you what specifically went wrong, the traces show you where in the request graph it went wrong, the profiles show you which line of code burned CPU.

Each pillar is also, on its own, instrumented independently. The metrics emitter (prometheus-client) writes counters and histograms with labels chosen by the metrics author. The logging emitter (loguru or python-logging-loki) writes JSON lines with fields chosen by the logging author. The tracing emitter (opentelemetry-sdk) writes spans with attributes chosen by the tracing author. There is no enforcing discipline that the three authors agree on what to call the customer — customer_id in metrics, tenant in logs, tenant.id in spans is the default, and the default is wrong. There is no enforcing discipline that the three emitters agree on which request a particular sample, log line, or span belongs to — the metric scrape happens at 15-second granularity with no per-request identity, the log line carries whatever the developer happened to log, the span carries the OpenTelemetry-canonical trace_id. Three pillars, three vocabularies, zero shared keys.

The dashboard layer (Part 12) sits on top of this and does its best. The panels can be perfectly designed — clear titles, correct queries, fresh data, no anti-patterns from the previous chapters. The panels still cannot share identifiers they were never given. A click on the latency panel produces a Grafana data-link to a Loki query, but the Loki query is whatever the dashboard author guessed would match — {service="checkout-api"} |= "error" — not the exact log lines for the requests that produced the latency spike. A click on the trace panel produces a Tempo query, but the Tempo query is { service.name = "checkout-api" } over the time window, not the exact traces that contributed to the histogram. The dashboard's drill-down is temporal (same time window) and categorical (same service, same tenant) but not identity-preserving (the same request). The on-call who clicks through gets a list of candidates, not the specific instance.

[Figure: three pillars, three independent emitters, no shared identifier. A request (POST /checkout, tenant=acme, t=02:41:13.420) arrives at checkout-api and three independent emitters fire. Prometheus TSDB receives http_request_duration_seconds_bucket{service, route, status, le} with no trace_id; Loki receives {"msg":"checkout failed","tenant":"acme",...} with labels {service} and a free-text body, no trace_id; Tempo receives trace_id=4bf92f3... with the span POST /checkout and attribute tenant.id=acme — it has the id, the others do not. The dashboard sees three queries — histogram_quantile(0.99, sum by (le) ...), count_over_time({service="checkout-api"} |= "error" [5m]), and { service.name = "checkout-api" } | count() over 5m — three time windows, three filters, zero shared identifier. A click on the latency spike ("show me the traces that produced this bucket") has no contract to follow and returns all traces in the window.]
Illustrative — not measured data. The three emitters share wall-clock time and the service name, but no per-request identity. The dashboard's drill-down is temporal and categorical, never instance-level. Closing this gap is the entire purpose of Part 13.

Why temporal correlation is not enough: in a 5-minute window on a fleet handling 12,000 requests per second, the candidate set for "the requests that contributed to the p99 latency spike" is 3.6 million requests. Of those, perhaps 36,000 are in the 99th percentile bucket. Of those, perhaps 600 are the actual cohort that caused the alert. Without an identity-preserving link from the metric back to those 600 specific traces, the on-call's only choice is to sample the candidate set randomly and hope. The probability of randomly hitting one of the 600 affected traces from the 36,000 slow ones is roughly 1.7%; the probability over five samples is 8.1%. Most on-calls give up before they see a relevant trace and escalate, which is exactly what Riya did at 03:14. The pattern is not laziness; it is the rational response to a search whose hit-rate is below the cost of continuing. The fix is not "search harder" — it is "make the metric carry the trace_id of one of the 600", which is what an exemplar is.
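
The arithmetic behind those two probabilities is short enough to check inline; a sanity check, not a simulation:

# cohort_hit_check.py: the 1.7% and 8.1% figures above
p = 600 / 36_000                                        # per-sample hit probability
print(f"single sample in cohort: {p:.3%}")              # ~1.667%
print(f"hit within 5 samples:    {1 - (1 - p)**5:.1%}") # ~8.1%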

There is a deeper reason the wall sits here, between Part 12 and Part 13, rather than earlier in the curriculum. The first eleven parts built each pillar competently in isolation — Part 2 made you fluent in metrics, Part 3 in tracing, Part 4 in logs — because the right pedagogical move was to teach each pillar deeply enough that the reader understood it, not to muddle them together at the cost of clarity in any. Part 12's dashboards were the first surface that demanded the pillars work together; previous parts had each pillar talk to its own engineer audience, and the lack of cross-pillar correlation was acceptable because the engineer was already inside one pillar's vocabulary. Dashboards break that model. The leadership audience does not know which pillar a panel comes from, the runbook deep-link does not know which pillar to drill into, and the on-call's mental model is "the request" — not "the metric" or "the log" or "the trace". The wall is here because the dashboards are the first artefact that has the requirement of cross-pillar correlation; the next several parts of the curriculum are the engineering of how to deliver it.

The dashboards from Parts 9–12 sit on top of this gap and cannot, by themselves, close it. A dashboard panel can present the metric beautifully, can drill down to a Loki query, can drill down to a Tempo query — but if the metric does not carry the trace_id, no amount of dashboard cleverness produces the click-through that lands on the right trace. The dashboard is the right projection layer; the missing piece is the emission contract that puts a trace_id on every artefact at the moment it is created. Part 12 is at the wall because the next step is not "design a better dashboard" — it is "redesign the SDK so the dashboard's drill-down has something to grab".

A small clarification on vocabulary, because Part 13 spends the next 200 pages on it: an "emit-time identifier" in the OpenTelemetry sense is not the application's primary key for the request (the order-id, the transaction-reference, the booking-reference). Those identifiers exist before the request reaches the observability layer; they are what the dashboard's filter uses to scope a query to a specific business-domain artefact. The OTel emit-time identifier is trace_id — a 128-bit randomly-generated value that the SDK assigns when the request enters the instrumented system and that has no semantic meaning to the application. The two identifiers are complementary: the on-call narrows by tenant.id=acme (the business identifier, level 2) to find the cohort of slow requests, then drills into one specific trace_id (the OTel identifier, level 3) to see the per-request timeline. A team that conflates them — putting order_id on a metric label thinking it gives them level-3 correlation — gets a cardinality blowup and no actual drill-down (because the metric is now indexed by order-id but does not link to the trace store; the order-id is in the index, the trace_id is not). The two stay separate, by design.

What "correlation" means at the wire level

Correlation is a property of the emitted bytes, not of the dashboard. When the OpenTelemetry community talks about correlation, they mean a specific, mechanical thing: every metric data point, every log record, and every span record carries a Resource block (identifying the service, host, and environment) and, where applicable, a TraceContext block (the trace_id and span_id of the request that produced the artefact). The blocks are emitted as part of the protobuf-encoded OTLP message and persisted into the backend's index, which means the on-call's query "give me all telemetry where trace_id = 4bf92f3..." returns matched metric exemplars, log records, and span records — across pillars, in one query.

The Prometheus-native form of this is the exemplar: a small structure attached to a histogram bucket that records a representative sample's trace_id and the value at the time of observation. When the histogram observes a 412ms request, the exemplar attached to the [400, 500) bucket says "trace_id 4bf92f3... was the request, latency was 412ms, observed at 02:41:13.420". The exemplar costs roughly 50 bytes per histogram bucket per scrape interval — negligible against the histogram's own footprint. The exemplar is queryable via Grafana's exemplar drill-down: the on-call clicks on the spike in the histogram panel, Grafana expands the panel to show the exemplar dots, the on-call clicks one dot, and the URL navigates to Tempo with trace_id=4bf92f3... already in the query. The 36,000-candidate haystack collapses to one trace.
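
What the emit-time attachment looks like in code: a minimal sketch, assuming prometheus-client ≥ 0.12 and opentelemetry-api, with illustrative service and route names. Exemplars only travel on the OpenMetrics exposition path, and the Prometheus server needs --enable-feature=exemplar-storage to keep them.

# exemplar_attach.py: hedged sketch of emit-time exemplar attachment
# pip install prometheus-client opentelemetry-api
from prometheus_client import Histogram
from opentelemetry import trace

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds", "request latency by route",
    ["service", "route", "status"],
    buckets=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.5],
)

def observe_with_exemplar(latency_s: float) -> None:
    # read the current request's trace_id from the OTel context at emit time
    ctx = trace.get_current_span().get_span_context()
    exemplar = {"trace_id": format(ctx.trace_id, "032x")} if ctx.is_valid else None
    # the trace_id rides on the sample as an exemplar, not on the series as a label
    REQUEST_LATENCY.labels("checkout-api", "/checkout", "200").observe(
        latency_s, exemplar=exemplar)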

The logging-pillar form of the same idea is the trace_id field on every log record. The log line is structured JSON; the JSON includes a trace_id field that the logging library populates from the current OpenTelemetry context at log time. The Loki query becomes {service="checkout-api"} | json | trace_id="4bf92f3..." and returns the 12 log lines emitted by that specific request across however many services it touched. The 12-line return is the log slice an on-call actually wants — not all logs in the time window, not all logs from the service, the logs from this request.
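
The log side of the same contract, sketched with stdlib logging and opentelemetry-api; the field names and logger name are illustrative. The filter stamps every record with the active trace_id so the Loki query above has something to match:

# trace_id_logging.py: hedged sketch of trace_id-stamped structured logs
# pip install opentelemetry-api
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Copy the active trace_id onto each record (empty when no span is active)."""
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter(
    '{"ts":"%(asctime)s","level":"%(levelname)s",'
    '"msg":"%(message)s","trace_id":"%(trace_id)s"}'))
logging.getLogger("checkout").addHandler(handler)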

The tracing-pillar form is structural: spans are already keyed by trace_id in every backend (Tempo, Jaeger, Zipkin), so the trace_id query is the primary lookup. The novelty is that metrics and logs also now produce trace_id values, so the same identifier resolves across all three stores. The wall-clock-coincidence model becomes an exact-match model.

The contract is mechanically simple — three pieces of additional data per emitted artefact — and operationally transformative. The on-call's mean-time-to-trace drops from 5–15 minutes (random sampling of candidate traces, narrowing by hand) to under 30 seconds (click an exemplar, land on the trace). The mean-time-to-resolve drops further because the trace, when it arrives, contains the log lines for the same request as span events; the loop closes at the request boundary instead of at the time-window boundary.

The profiling pillar deserves a separate note. A flamegraph from pyroscope is a sampled aggregate over a window — typically 10 seconds of stack samples at 99 Hz, ~990 samples per process per window. Per-request profiling (the SDK starts a profile at request entry, stops at request exit, and tags the resulting flamegraph with the trace_id) is technically possible but expensive: each profile costs CPU at the rate of the sampling frequency and produces flamegraphs that are too sparse to be useful (a 100ms request at 99 Hz produces ~10 stack samples, which is statistical noise). The pragmatic compromise is labelled aggregation — the profiler tags every sample with the current trace.id from the OTel context, the flamegraph backend (pyroscope, parca) supports filtering by tag at query time, and the on-call can ask "show me the flamegraph aggregated over all samples whose trace.id was in the cohort that produced the latency spike". The aggregate is statistically meaningful (600 traces × ~10 samples each ≈ 6,000 samples — enough to resolve hot stacks) and the backend's tag index makes the query fast. This pattern — request-level tags as a query dimension on aggregate profiles — is the closest the profiling pillar gets to level-3 correlation, and Part 14 covers the implementation.
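
A sketch of the labelled-aggregation pattern, assuming the pyroscope-io Python client; the server address is illustrative, and the tag-cardinality cost of a raw trace_id tag is a trade-off Part 14 examines rather than a settled recommendation:

# profile_tags.py: hedged sketch of trace_id-tagged stack samples
# pip install pyroscope-io opentelemetry-api
import pyroscope
from opentelemetry import trace

pyroscope.configure(application_name="checkout-api",
                    server_address="http://pyroscope:4040")  # hypothetical

def handle_checkout():
    ctx = trace.get_current_span().get_span_context()
    tags = {"trace_id": format(ctx.trace_id, "032x")} if ctx.is_valid else {}
    with pyroscope.tag_wrapper(tags):
        ...  # stack samples taken inside this block carry the trace_id tag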

Why the correlation has to be at emit time and not at query time: a query-time correlation engine ("infer that these log lines belong to that metric bucket because they share a service name and a wall-clock time") is what every team builds when they realise the problem and don't yet know about exemplars. The query-time approach hits two structural problems. First, the search space is the cross-product of all artefacts in the time window — for a fleet at 12K RPS over a 5-minute window the cross-product is 10^11 entries, which is a sequential scan no backend can run interactively. Second, the time-coincidence is an approximation of identity — two requests at the same millisecond on the same service cannot be distinguished by time alone, and most "interesting" coincidences (a tenant whose 30 requests in a 1-second burst all hit the same Postgres lock) are exactly the ones where time-coincidence overcounts and produces false positives. Emit-time correlation is the only correct approach because the emitter is the request — it knows the trace_id with certainty, and the cost of writing it onto the artefact is negligible. Query-time correlation is what teams default to when they don't have emit-time correlation and don't yet realise it is the wrong layer to fix it at.

A practical detail the SDK design has to get right: the exemplar attachment is probabilistic, not deterministic. The histogram Observe(412 * 1e-3) call attaches an exemplar with the current trace_id only when one of two conditions is true — either the bucket has no recent exemplar (the buffer is empty for that bucket) or the new sample is "more interesting" than the current exemplar. The rules vary by client library; the Python prometheus-client stores the most recent exemplar supplied with an observation and lets the storage-layer eviction handle saturation; the Go client uses a probabilistic-replacement scheme (each new observation has a 1-in-N chance of replacing the current exemplar where N is the number of observations since the last replacement). Both are correct; both produce different exemplar distributions. The on-call's experience differs subtly — the Python default produces newer exemplars and eviction-induced gaps for slow series; the Go default produces older but more uniformly-distributed exemplars. Most teams do not notice the difference until they roll out cross-language services and find that the click-through hit-rate drops by 30% on the Go services. The fix is collector-side normalisation — the transform processor in the OTel collector forces a deterministic exemplar policy across languages — and the discipline is to test exemplar drill-down explicitly during integration testing, not to assume the SDK default is appropriate.
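
A toy model of the two replacement policies makes the distributional difference concrete. This is an illustration of the idea, not either client's actual code; the probabilistic variant is modelled as classic single-slot reservoir sampling, a close cousin of the scheme described above:

# exemplar_policies.py: toy model of always-replace vs probabilistic replacement
import random

def always_replace(observations):
    ex = None
    for obs in observations:
        ex = obs                 # newest observation always wins
    return ex                    # exemplar is always the most recent request

def reservoir_replace(observations, seed=0):
    rng = random.Random(seed)
    ex, n = None, 0
    for obs in observations:
        n += 1
        if ex is None or rng.random() < 1.0 / n:   # 1-in-n replacement chance
            ex = obs
    return ex                    # exemplar is uniform over the whole history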

Why exemplar replacement policy is more important than it looks: the on-call's click-through experience is dominated by the exemplar that happens to be in the bucket at the moment of click. If the exemplar represents a request that resolved cleanly (a healthy request that landed at 412ms in the [400ms, 500ms) bucket with nothing anomalous inside it), the on-call gets a healthy trace and concludes "the metric is lying" — when in fact the metric is correct, the bucket is correct, and the exemplar is just an unlucky pick from a thousand candidates. The probabilistic-replacement scheme produces this failure mode about 5–15% of the time on a typical fleet; the always-replace scheme reduces it but introduces a different failure mode where exemplars are stale by the time the on-call clicks. Hotstar's platform team published a 2024 internal A/B test showing that the median on-call's click-through-confidence (the rate at which clicking an exemplar landed on a trace they considered representative) was 78% with always-replace, 64% with probabilistic-replacement, and 91% with a custom "interestingness-weighted" policy that they ship via the OTel collector's transform processor. The 13-percentage-point gap between always-replace and the custom policy was, in their estimate, 8 minutes of saved diagnostic time per click, multiplied by 240 incidents per year — meaningful operationally even though it is invisible at first design.

A measurement: the cost of a missing trace_id, in seconds and rupees

The argument so far has been mechanical. The engineering question is empirical: how much time does a missing trace_id actually cost during an incident, and how does the cost compound across an organisation? The script below simulates a 5-minute incident on a SaudaCart-shaped fleet — one Postgres prepared-statement-cache eviction produces a 600-trace cohort of slow requests inside a 36,000-trace candidate set inside a 3.6M-request window. The script computes the on-call's expected time-to-find-a-relevant-trace under two regimes: random sampling (no exemplars) and exemplar drill-down. It then converts the time delta to a per-incident rupee cost using a representative senior-SRE incident-time hourly rate.

# correlation_cost_simulator.py — cost of missing exemplars in an incident
# pip install pandas
import pandas as pd
import random

# Fleet shape
WINDOW_MIN = 5
RPS = 12_000
TOTAL_REQS = WINDOW_MIN * 60 * RPS         # 3,600,000 candidate requests
P99_BUCKET_FRAC = 0.01                      # 1% land in p99 bucket
SLOW_REQS = int(TOTAL_REQS * P99_BUCKET_FRAC)  # 36,000 in the slow bucket

# Incident shape: one tenant, 600 affected requests, distributed in slow bucket
AFFECTED_COHORT = 600
HIT_PROB = AFFECTED_COHORT / SLOW_REQS      # ~1.67%

# On-call cost knobs (from public Indian SRE-tier benchmarks 2025)
SECONDS_PER_RANDOM_TRACE_REVIEW = 35        # open trace, scan spans, decide
SECONDS_PER_EXEMPLAR_CLICK = 6              # click exemplar dot, land on trace
HOURLY_RATE_INR = 4_500                     # senior SRE incident-time rate

def simulate_random(trials: int = 10_000) -> pd.Series:
    """Without exemplars: sample slow traces at random until one is in cohort."""
    samples_to_first_hit = []
    rng = random.Random(42)
    for _ in range(trials):
        n = 1
        while rng.random() > HIT_PROB:
            n += 1
            if n >= 500:                     # on-call gives up after ~500
                break
        samples_to_first_hit.append(n)
    return pd.Series(samples_to_first_hit)

def simulate_exemplar() -> int:
    """With exemplars: click on histogram bucket, exemplar dot is in cohort."""
    return 1   # the exemplar attached to the slow bucket IS in the cohort

if __name__ == "__main__":
    s = simulate_random()
    p50 = int(s.median())
    p90 = int(s.quantile(0.9))
    p99 = int(s.quantile(0.99))
    rand_p50_sec = p50 * SECONDS_PER_RANDOM_TRACE_REVIEW
    rand_p90_sec = p90 * SECONDS_PER_RANDOM_TRACE_REVIEW
    exem_sec = simulate_exemplar() * SECONDS_PER_EXEMPLAR_CLICK
    delta_p50 = rand_p50_sec - exem_sec
    cost_per_incident_p50 = (delta_p50 / 3600) * HOURLY_RATE_INR
    incidents_per_year = 240                # ~5/week, mid-size SRE team
    annual_cost = cost_per_incident_p50 * incidents_per_year
    print(f"hit_prob (one slow trace IS in cohort): {HIT_PROB:.4f}")
    print(f"random sampling — p50: {p50} traces, "
          f"p90: {p90}, p99: {p99}+ (capped at 500)")
    print(f"random sampling — p50 time to first hit: {rand_p50_sec:>5}s "
          f"({rand_p50_sec/60:.1f}m)")
    print(f"exemplar click  — time to first hit:    {exem_sec:>5}s")
    print(f"per-incident saving at p50: ₹{cost_per_incident_p50:,.0f}")
    print(f"annual saving across {incidents_per_year} incidents: "
          f"₹{annual_cost:,.0f}")

Sample run output:

$ python3 correlation_cost_simulator.py
hit_prob (one slow trace IS in cohort): 0.0167
random sampling — p50: 41 traces, p90: 138, p99: 274
random sampling — p50 time to first hit:  1435s (23.9m)
exemplar click  — time to first hit:        6s
per-incident saving at p50: ₹1,786
annual saving across 240 incidents: ₹428,700

The mechanism per load-bearing line: HIT_PROB = AFFECTED_COHORT / SLOW_REQS is the single number that controls everything downstream — the fraction of slow-bucket traces that are actually relevant to the alert. At 1.67%, random sampling needs roughly 41 trace reviews at the median to find the first relevant one; the geometric distribution's expected value is 1/p ≈ 60, with the median lower because the geometric distribution is right-skewed. SECONDS_PER_RANDOM_TRACE_REVIEW = 35 is calibrated against on-call performance at SaudaCart-shaped teams — opening a trace in Tempo, scanning the span tree for anomalies, deciding whether the trace matches the alert pattern, and moving to the next candidate takes about half a minute when the on-call is awake and twice that when paged at 02:47. Why the on-call's review time matters more than the geometric-distribution mean: the cost is not "how many traces would I need to review on average" — that is an academic statistic. The cost is "how much time elapses before I see the first relevant trace", which is the median in a right-skewed distribution multiplied by the per-trace review cost. The right-skewed shape also means the worst-case on-call (the one who happens to draw 138 trace reviews at the 90th percentile) takes 80 minutes to find their first relevant trace — long enough that they will escalate, on-call burnout will compound across the team, and the postmortem will be partial because the relevant traces were never read. Exemplars eliminate this entire distribution.

The script's simulate_random function is intentionally pessimistic — it caps at 500 trace reviews because that is when most on-calls give up and escalate. In a real incident the on-call would also use partial signals (filter by service, by region, by error status) to narrow the candidate set before sampling, which improves the effective hit-rate by a factor of 5–20 depending on how much the filter narrows. The simulator does not model these heuristics because they are exactly the level-1 and level-2 correlation from earlier in the chapter. Adding them produces a more realistic estimate (median time-to-trace of 4–8 minutes instead of 24 minutes) but does not change the qualitative finding: even with all the heuristics on, random sampling within the slow-bucket cohort is an order of magnitude slower than exemplar drill-down. The resource and tenant filters narrow the search; the trace_id link collapses it.

The annual rupee saving (₹4.3 lakh on this profile) is conservative — it counts only the on-call's salaried time and ignores the customer-facing cost of slow incident resolution (lost transactions, refund credits, support-ticket volume), the engineer-burnout cost of repeated 02:47 escalations that resolve nothing, and the leadership-trust cost from incidents that do not produce postmortems with clean root causes. A reasonable multiplier from internal SRE economics literature is 4–8×, putting the real cost of missing exemplars at SaudaCart between ₹17 lakh and ₹34 lakh per year. The exemplar feature itself costs a few additional bytes per histogram scrape and one engineering week to wire up across the SDK, the collector, and the dashboard. The ROI argument is brutal — and the reason every observability-mature organisation in India (Razorpay, Zerodha, Hotstar, Flipkart, PhonePe, Swiggy) has shipped exemplars since 2023, while organisations one tier behind (the typical Series-B SaaS in 2026) are still doing random trace sampling at incident time and writing postmortems they cannot defend.

The pattern generalises beyond exemplars to logs and to profiles. The same simulation, applied to "find the log lines for the slow requests" with no trace_id on log records, produces an even worse hit-rate because logs are emitted at variable rates per request (some requests log 2 lines, some log 40) and the time-window partitioning of Loki streams puts the relevant lines across multiple chunks. The same simulation, applied to "find the flamegraph for the affected requests" without process-tag correlation between pyroscope and Tempo, is mathematically infeasible — flamegraphs aggregate over windows, and there is no native concept of "flamegraph for these specific requests" without per-request profiling, which is its own discipline. Each of these failure modes is what the corresponding chapter in Part 13 (OTel internals) and Part 14 (continuous profiling) has to solve. The script above is the metrics version of the same lesson, monetised.

A note on what the simulator does not model — and why the gap matters. The script computes time-to-find-a-relevant-trace, but the on-call's actual workflow during an incident has a second decision after the trace is found: "is this trace's anomaly the cause, or is it a symptom of an upstream cause?" A slow checkout span might be slow because the database span underneath it is slow, which might be slow because a network hop to a replica is slow, which might be slow because the replica is undergoing a vacuum. The diagnostic ladder has 3–6 rungs; finding the trace is rung 1. Correlation closes rung 1 from minutes to seconds; rungs 2–6 are still the on-call's analytical work, and the right tools at each rung (span-tree analysis, log slicing by trace_id, kernel-level instrumentation, slow-query logs from the database) are themselves enabled by the correlation contract because each rung relies on identity preservation back to the same trace. The simulator's time-to-first-trace number covers rung 1 only; the actual incident time is typically rung 1 plus 5–25 additional minutes for the analytical descent. Correlation pays off at every rung because every rung's tooling resolves an artefact by the same trace_id. Without correlation, every rung is its own random-sampling problem; with correlation, every rung is a one-click navigation. The compounding is what makes the ROI on correlation so high — the per-incident saving is not the 24 minutes of rung 1 alone but several rungs' worth, on the order of 90 minutes of cumulative diagnostic time across the full ladder, multiplied by the incident frequency.

What Part 13 has to deliver — the contract Part 12 cannot fulfil

Part 12 was the projection layer — the discipline of designing dashboards that lead to correct decisions, the panels-as-products framing, the dashboard-as-code pipeline, the drill-down architecture, the anti-patterns to avoid. Each of those chapters is necessary and none is sufficient, because the projection layer can only project what the emission layer produces. Three things that Part 12 cannot deliver — and that the failure modes above demonstrate — define what Part 13 has to do.

The first is a shared resource model. Every artefact emitted by the SDK has to carry a Resource block that identifies the service, the host, the cluster, the region, the deployment version, the k8s namespace, the pod, the SDK language and version, all of it. The Resource block is OTLP's term for "the static set of attributes that describe the producer of telemetry". Resource attributes get indexed alongside the artefact, become labels on metrics, become labels on log streams, become attributes on spans. The dashboard's filter bar (service, region, tenant) is a query against the resource model; without a shared resource model, the filter bar is a different query for each pillar. The OpenTelemetry spec's resource semantic conventions (service.name, service.namespace, host.name, k8s.pod.name, cloud.region) are the contract — every emitter speaks the same vocabulary, every backend stores the same fields, every dashboard query becomes a single attribute lookup.
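
A sketch of "declare the resource once" with the Python SDK; attribute values are illustrative, and the logs provider wires up the same way:

# shared_resource.py: hedged sketch, one Resource object for every pillar
# pip install opentelemetry-sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider

resource = Resource.create({
    "service.name": "checkout-api",     # semantic-convention names,
    "service.namespace": "saudacart",   # illustrative values
    "service.version": "1.42.3",
    "cloud.region": "ap-south-1",
})

# the same object goes to every provider, so spans and metrics carry
# identical producer attributes and the dashboard's filter bar is one query
tracer_provider = TracerProvider(resource=resource)
meter_provider = MeterProvider(resource=resource)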

The second is trace-context propagation. The trace_id and span_id of the current request need to be available at every emission point — when the metrics SDK observes a histogram value, when the logging SDK formats a log record, when the profiler samples a stack. The OpenTelemetry SDK's Context API is the mechanism: the application code does with tracer.start_as_current_span("checkout"): and the SDK pushes the trace context onto a ContextVar that every other SDK reads from. The metrics SDK's exemplar attachment, the logging SDK's trace_id field, the profiler's process-tag — all read from the same ContextVar. The primitive is a few hundred lines of Python in opentelemetry-api but the discipline is total: every emitter that wants to be correlatable has to read from this context, and every framework that wants to be correlatable has to install context propagation at the request boundary (request-id middleware in Flask/FastAPI, message-key extraction in Kafka consumers, baggage propagation in HTTP outbound). The chapters on context propagation, baggage, and the traceparent HTTP header are what Part 13 spends most of its time on, because the edge cases (async tasks, thread pools, message queues, cron jobs) are where context propagation breaks silently and correlation degrades.
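
A sketch of the consumer-side reconstruction mentioned above; the dict stands in for Kafka message headers, and the span name is illustrative:

# consumer_propagation.py: hedged sketch, rebuild context from traceparent
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("checkout-consumer")

def on_message(headers: dict, payload: bytes) -> None:
    # headers carry e.g. {"traceparent": "00-4bf92f3...-00f067aa0ba902b7-01"}
    upstream_ctx = extract(headers)
    with tracer.start_as_current_span("process-order", context=upstream_ctx):
        ...  # every emitter inside reads the propagated trace_id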

The third is OTLP as the wire format. The wire format is OTLP — the OpenTelemetry Protocol — which is a protobuf schema for ExportTraceServiceRequest, ExportMetricsServiceRequest, ExportLogsServiceRequest. Each carries the resource block, the trace context where applicable, the per-artefact data (metric points, log records, spans). The collector receives OTLP, transforms it (sampling, attribute manipulation, redaction), and exports it to the backend stores. The wire format is what makes the contract enforceable across language SDKs and across vendor backends — a Python service emitting OTLP to a collector exports to Tempo, Loki, and Prometheus with the same correlation properties as a Go service emitting OTLP from a different team. Without OTLP as the lingua franca, every vendor would have its own emit protocol, every SDK would have to implement N exporters, and the resource-and-trace-context contract would degrade across language and vendor boundaries. With OTLP, the contract is a schema everyone agrees on.
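
The SDK end of that pipe, sketched with the Python OTLP exporter; the endpoint assumes a collector on the default gRPC port 4317:

# otlp_export.py: hedged sketch, batch spans over OTLP/gRPC to a collector
# pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()   # pass resource=... as in the earlier sketch
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)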

A fourth thing Part 13 has to deliver is what the spec calls collector pipelines — a set of stateless transform stages between the SDK and the backend that handle the operational realities the SDK cannot. The collector batches artefacts (the SDK emits one span at a time; the collector batches into 8KB OTLP requests for network efficiency), retries on failure (the SDK fires and forgets; the collector queues and retries with exponential backoff), redacts sensitive attributes (the SDK trusts its caller; the collector strips PAN numbers, Aadhaar IDs, and email addresses before they hit the trace store), and applies tail-sampling (the SDK has no view of the full trace tree at span-emit time; the collector buffers spans for decision_wait seconds and decides whether the trace is worth keeping based on whether any span errored or exceeded a latency threshold). Each of these is a chapter of its own in Part 13, and each is where the contract this wall established becomes operational. The collector is where the discipline scales: it lets the SDK stay simple while the platform team enforces redaction, sampling, and routing centrally.

A fifth thing — and the one most teams underestimate — is back-pressure semantics across the export path. When the collector's queue fills (the backend is slow, the network has a partition, the trace store is being upgraded), the collector has three choices: drop, block, or buffer to disk. Each is a correctness/availability trade-off that the application has to know about. Drop is the default and the right choice for OK-status traces (losing 0.5% of trace samples during a 30-second backend outage is acceptable); block is wrong because it propagates back-pressure into the application's request-handling thread and can take down the service ("the observability system caused the outage" is a category of postmortem you do not want to write); buffer-to-disk is right for error traces and high-business-value traces (they cost more to lose than to disk-buffer for a few minutes). The collector exposes these as per-pipeline configuration, and the platform team's job is to set them deliberately rather than accept defaults that work for OK traces but lose every error trace during the next backend incident. Razorpay's 2024 internal observability postmortem named this as the single most expensive default-acceptance — they lost 4 hours of error-trace history during a Tempo upgrade because the collector defaulted to drop-on-queue-full and the upgrade was longer than the queue depth. The fix was a per-pipeline buffer-to-disk policy for error traces, configured once and then never thought about again.

A note on what an exemplar is not. An exemplar is not a sample of the metric (the metric is already a sample of the underlying request stream); it is a pointer from a metric data point to a representative trace that contributed to it. The distinction matters because teams new to exemplars sometimes try to reason about exemplars as if they were the metric — averaging exemplar values, computing percentiles over exemplar timestamps, alerting on the exemplar count. None of these computations are meaningful: the exemplar set is a small, biased, eviction-pruned sample whose purpose is navigation, not measurement. The metric remains the authoritative measurement (the histogram bucket counts, the rate computation, the quantile interpolation); the exemplar is the click-target. Treating exemplars as a measurement substrate produces wrong dashboards and worse alerts; treating them as a navigation primitive produces the under-30-second drill-down that this chapter is about. The discipline is to keep the two roles separated in the dashboard's vocabulary as well — the metric panel shows the histogram, the exemplar dots are an overlay that the on-call clicks, and the conceptual distinction is preserved in the panel's documentation so that a future engineer does not try to compute statistics over the exemplar set.

The three correlation primitives, in order of strength

The "shared identifier" idea has four levels of strength in practice, and naming them clearly prevents the most common implementation mistake — building level-2 correlation, calling it level-3, and being surprised at incident time.

Level 0: temporal coincidence. "These artefacts happened in the same time window, against the same service." This is what a dashboard does by default. Useful for first-look pattern matching ("the latency spike, the error spike, and the trace count drop are all at 02:41 — they are probably the same incident") and useless for drill-down. Level 0 is what every team has, in every observability stack, the day they ship the first dashboard. It is not correlation in the technical sense; it is what teams call correlation when they don't have anything stronger.

Level 1: shared resource attributes. "These artefacts came from the same service, region, version, and pod." The OpenTelemetry resource model is the contract here — every artefact carries service.name, service.namespace, host.name, cloud.region, k8s.pod.name. Useful for narrowing the candidate set during an incident ("the latency spike is from checkout-api v1.42.3 in ap-south-1 pod checkout-api-7f8d, so I only need to look at logs and traces from that pod"). Reduces the search space by a factor of 100–1,000 typically, which gets random sampling from "infeasible" to "tedious" but not to "instant". Level 1 is the discipline of agreeing on attribute names — easy to specify, easy to forget under pressure, hard to retrofit.

Level 2: shared tenant or business identifier. "These artefacts are about the same tenant, customer, order, or transaction." Useful for tenant-scoped incidents ("a single tenant is producing the slow requests, all of them have tenant.id = acme"). Reduces the search space by another factor of 10–10,000 depending on the cardinality of the business identifier. Level 2 is operationally useful but cardinality-fraught — putting tenant_id on a metric label is the canonical cardinality-blowup mistake from Part 6, so the discipline is to put it on log records and span attributes (where high cardinality is structural and free) and to use exemplars to bridge to metrics.

Level 3: shared trace context. "These artefacts are from the same individual request, identified by trace_id and span_id." This is the only level that produces an exact-match drill-down — clicking on an exemplar lands on a specific trace, querying logs by trace_id returns the exact log lines, looking at the trace shows the exact span tree. Reduces the search space to one. Level 3 is the OpenTelemetry contract this chapter has been pointing at, and the only level that closes the wall.

The mistake to avoid: shipping level-1 correlation (resource attributes) and calling the work done. Level 1 is necessary for level 3 (the trace context flows through the same propagation machinery as the resource attributes) but level 1 alone leaves the search-space-of-thousands problem from the simulation above. Teams that ship level 1 and then do not invest in context propagation will find their on-call's mean-time-to-trace stays at 5–15 minutes, even though the dashboards now have prettier filter bars. The level-3 contract is what produces the under-30-seconds drill-down.

[Figure: four correlation levels — search space and on-call time at each. A funnel from weakest to strongest. L0, temporal coincidence ("same 5-minute window, same service" — all dashboards do this by default): 3.6M requests, ~25 min to trace. L1, shared resource attributes (service.name + cloud.region + k8s.pod.name — the OTel resource model): 36K requests, ~12 min. L2, shared business identifier (tenant.id, order.id — via Baggage / span attributes, not metric labels): 600 requests, ~4 min. L3, trace context (trace_id + span_id): 1 trace, ~6 sec. The levels are cumulative, not alternative — L3 requires L1 because trace context flows on the same propagation machinery as resource attributes. Search-space numbers from the simulator earlier in the chapter; on-call times scaled by random-sampling vs exemplar-click costs.]
Illustrative — the four correlation levels, with the search space at each. Level 1 alone reduces 3.6M to 36K — a factor of 100, useful but still infeasible during an incident. Level 3 is the only level that produces an exact-match drill-down.

When correlation is not the answer — the cases this chapter does not solve

Correlation is the tool that closes the wall between the pillars and the dashboards. It is not the tool for every observability gap. Six cases where the on-call's instinct will be "we need more correlation" and the actual answer is something else, named here because Part 13 will spend a lot of pages on correlation and a fair-minded reading of the curriculum needs the boundary clear.

Cardinality blowup is not a correlation problem. A team that has working exemplars and trace_id-tagged logs can still ship a label-cardinality bug that explodes the metrics bill (Part 6). Adding the trace_id as a label on a metric is the canonical mistake — it produces one series per request, which is hundreds of millions of series within hours. Exemplars are the right pattern because they store the trace_id as data on a sample, not as a label on the series; the cardinality of the metric stays bounded by the legitimate label set (service, route, status) and the trace-id link is preserved orthogonally. New adopters frequently get this wrong and the resulting incident is half-an-exemplar-misuse, half-a-cardinality-blowup. The fix is the rule: trace_ids go in exemplars, not labels.

Coordinated omission in latency measurement is not a correlation problem. A histogram populated by wrk (without -R) under-measures p99 by an order of magnitude because slow requests cause the load generator to skip subsequent requests, producing a histogram that no longer reflects the offered load. No amount of exemplar correlation fixes this — the bucket the exemplar attaches to is a wrong number. The fix is wrk2 or vegeta with constant-rate load injection and HdrHistogram-corrected histograms (Part 7). Correlation can tell you which trace was slow; only CO-corrected measurement can tell you whether your dashboard's p99 is honest in the first place.

Dashboard projection failures from the previous chapter are not a correlation problem. A panel whose y-axis is mis-scaled, whose title is technically-precise-and-leadership-illegible, whose top-left placement is wasted on an unimportant metric, is a dashboard-design failure (Part 12, the previous wall chapter). Adding exemplars to the panel does not fix the misreading; it adds a useful drill-down to a panel that is still misleading on first read. The two disciplines compose — well-projected dashboards with correlated drill-downs are the goal — but they do not substitute for each other. A team that fixes correlation without fixing projection still loses leadership trust on every QBR; a team that fixes projection without fixing correlation still spends 24 minutes of incident time on random trace sampling. Both have to be done.

A missing alert is not a correlation problem. If the alerting layer (Part 11) does not fire on the right SLI with the right burn-rate window, the on-call never sees the dashboard in the first place — and no amount of correlation in the dashboard helps if it is not being read. Correlation is a property that pays off during an incident; the alert is what starts the incident timer. A team that ships exemplars and rich resource attributes but has bad alerts (too sensitive, firing on every transient spike; or too lenient, missing real degradations until customers complain) gets the worst of both — the on-call is paged constantly on irrelevant signals, and when the real incident happens the on-call has alert fatigue and ignores it. The alert and the correlation are sibling disciplines, not substitutes; the alert is the entry door, the correlation is the diagnostic surface inside.

A latency-distribution change is not a correlation problem. A p99 that drifts from 240ms to 410ms over six weeks — without any single incident, without any single deploy, without any single tenant being responsible — is a trend, not an event. Correlation helps with events (a single anomalous request, a cohort with shared cause); trends are the engineering of capacity, of dependency-graph evolution, of slow leaks that no one trace contains. The on-call who clicks an exemplar on the new p99 lands on a trace that is unremarkable — slower than baseline, but with no anomalous span. The diagnostic surface for trends is the aggregate across many traces over time (the recording-rule chapter, the SLO-burn chapter), not the single-trace drill-down. Correlation is the wrong tool for trend analysis; the right tool is histogram comparison, regression analysis on span-duration distributions, and the disciplined use of recording rules to expose the slow change before it accumulates. Razorpay's 2024 latency-drift incident was diagnosed not by exemplar drill-down but by a regression model on the per-route p99 distribution that flagged the checkout route as having statistically-significant deterioration over a 14-day window.

A bug that exists outside the instrumented surface is not a correlation problem. A kernel-level issue — a CFS scheduler stall, a network-card driver bug, a TCP retransmission storm from an upstream peer — does not produce spans, does not increment exposed counters, does not write log lines from your application code. The trace tree shows a clean span-graph with a 14-second gap inside one span and the application has no instrumentation that explains the gap. Correlation cannot help here because the artefact is missing, not unlinked. The right tool is the curriculum's eBPF-based observability (the kernel-level wall) — bpftrace to confirm a syscall is hanging, bcc to count off-CPU events, kernel-level tracepoints that produce telemetry the application could not emit even with perfect instrumentation. Part 13's correlation contract assumes the artefact exists; the eBPF discipline produces artefacts where none existed before. Different chapters, different layers, different problems.

The pattern across these "not-a-correlation" cases is that correlation is a navigation primitive, not a diagnostic primitive — it shortens the path from "something is wrong" to "the artefact that contains the thing wrong" but it does not reason about the artefact, does not infer cause from symptom, does not separate trend from event. The reasoning still belongs to the on-call; correlation just makes the on-call's reasoning faster by collapsing the search space. A team that expects correlation to replace diagnostic skill is disappointed; a team that uses correlation to amplify diagnostic skill recovers an order of magnitude in incident time. The disappointment is the source of the "we shipped OpenTelemetry and incidents are still hard" complaint that surfaces about a year into adoption; the right framing is that incidents are still hard but the easy parts of the incident — finding the trace, finding the log line, finding the span — are no longer the bottleneck.

The wall is at the boundary between projection (Part 12) and emission (Part 13). The next eight chapters of Part 13 walk into the SDK, the resource model, the context propagation, the OTLP wire format, the collector pipeline, the export semantics, and the operational details (batch processor, retry, back-pressure) that make the contract actually hold under production load. The dashboard never re-enters the picture in Part 13 except as the consumer of the contract — but the dashboard's drill-down only works because Part 13's contract holds. Part 12 is at the wall because the next move is not "design a smarter panel"; it is "ship the SDK that gives the panel something correlatable to drill into".

One last framing for the wall: most engineering teams arrive at this realisation the hard way, through an incident that the dashboard's drill-down failed to resolve, and respond by adding more panels to the dashboard. The instinct is to fight the search problem with more surface area — more breakdowns, more filters, more time-windows, more derived metrics. The instinct is wrong. More panels make the dashboard noisier without making any individual click more identity-preserving. The fix is not "more dashboard" but "richer artefact" — a metric that carries an exemplar, a log line that carries a trace_id, a span that carries the resource attributes the metric and log share. The dashboard layer does not change; the emission layer does. A team that recognises this early — typically after their second or third incident where dashboard drill-down failed — invests an engineer-quarter into OpenTelemetry adoption and never goes back. A team that does not recognise it — and continues to add panels — spends years accumulating dashboards that on-call engineers ignore and that leadership reviewers cannot read. The wall is here because the choice gets made here; everything Part 13 delivers is contingent on the team having understood that the next investment goes into emission, not into projection.

Common confusions

  • "Correlation is something Grafana does." Grafana renders correlated drill-downs (the click from a histogram exemplar to a Tempo trace), but the data has to be correlated at emit time. Grafana cannot synthesise a trace_id on a metric that does not carry one. The vendor that markets "we correlate your telemetry" is doing one of two things: implementing OTLP and exemplars (correct, useful), or doing time-window heuristics (incorrect, fails on the cases that matter most).
  • "Exemplars are a Prometheus feature." Exemplars are an OpenMetrics feature standardised in OpenTelemetry's metrics spec; Prometheus stores and serves them, and Grafana renders them. The exemplar contract requires the emitter to attach the trace_id at observation time — the storage layer just preserves and serves what the emitter provided. A Prometheus instance with exemplar storage enabled but no SDK emitting exemplars is configured and useless; the wire-up is end-to-end.
  • "Adding trace_id as a label on a metric gives me correlation." It gives you cardinality blowup and roughly the same correlation as exemplars — but the cardinality cost (one series per request, billions of series per day) bankrupts the metrics store within hours. Exemplars store the trace_id as data on a sample, not as a label on a series; the cost is bytes per scrape interval, not bytes per request. The distinction is the difference between a working production system and an outage.
  • "Once we have OpenTelemetry, correlation is automatic." OpenTelemetry provides the primitives (resource model, context propagation, OTLP). Whether your telemetry actually correlates depends on whether every emitter reads from the same context (it usually doesn't by default — async tasks, thread pools, and message-queue consumers need explicit propagation), whether every framework has middleware for context extraction (some don't), and whether every team agrees on the resource attribute names (they often don't until a platform team enforces a schema). Correlation is a discipline OpenTelemetry enables, not one it delivers automatically.
  • "Logs with trace_id fields are enough; exemplars are over-engineering." trace_id-tagged logs solve the log-to-trace navigation; they do not solve the metric-to-trace navigation. The on-call's most common click during an incident is from a metric spike (the alert source) to traces (the diagnostic surface), and that click goes through exemplars. Logs-with-trace-id are necessary; they are not sufficient. The full correlation contract requires all three pillars to carry the identifier.
  • "Correlation is what makes observability complete; once we ship it, we are done." Profiling is sampled at, say, 99 Hz, and the sample at any given instant is whatever stack the process happens to be on — which may or may not be inside a request with a trace_id. The default profile-to-trace correlation is at the process level (via service/region/version tags), not the request level; per-request profiling exists (pprof.SetGoroutineLabels-style tagging) but is its own discipline. More broadly, correlation closes the gap between the pillars on the diagnostic surface; it does not address the upstream questions of what to instrument, which SLOs to define, how to set burn-rate windows, or how to design dashboards that leadership can read. A team that ships perfect correlation but has no SLO discipline still pages the on-call for the wrong reasons; a team with perfect SLOs but no correlation pages correctly and then leaves the on-call stuck at rung-1 of the diagnostic ladder. Correlation is one ingredient; the discipline is to ship all of them, in the right order, and to recognise which gap the next incident is actually exposing.

Going deeper

Exemplar storage internals and exemplar-aware sampling — the 50-byte-per-bucket contract that has to survive tail sampling

A Prometheus exemplar is a 50-byte structure attached to a histogram bucket sample: a trace_id (16 bytes hex-encoded as 32 chars but stored binary), a span_id (8 bytes), an optional value (8 bytes float), an optional set of label/value pairs for additional context (variable, capped at ~128 bytes by default), and a timestamp (8 bytes). The TSDB stores exemplars in a fixed-size circular buffer — by default the last 100,000 exemplars across the entire instance, enabled via the exemplar-storage feature flag and sized via the storage.exemplars.max_exemplars setting. The buffer is in-memory only and not persisted across restarts (the Prometheus rationale: exemplars are useful for live debugging, not for historical analysis; the trace itself lives in Tempo, the exemplar is only the link to it). The circular-buffer eviction means that during a sustained spike, older exemplars get overwritten — which is why the dashboard's exemplar drill-down is most useful in the first 10–30 minutes of an alert, and degrades for older incidents. VictoriaMetrics and Mimir extend this with longer retention and richer query semantics; the trade-off is memory cost. For a fleet at 12K RPS with histograms scraped every 15s and 12 buckets per histogram, the per-instance exemplar memory works out to about 60 MB — negligible compared to the histogram series themselves. Tail-based sampling (Part 5) creates a second-order problem: the sampler typically keeps a small fraction of traces (1–5% of OK traffic plus 100% of errors), so the metric exemplar that points to a trace_id may reference a trace that the sampler then drops, and the on-call's click lands on a 404 from the trace store. The fix is exemplar-aware sampling: the collector's tail-sampling processor keeps every trace whose trace_id is referenced by an exemplar within the last N seconds, even if the trace would otherwise be dropped. The cost is a small bump in retained-trace volume (typically 0.5–2% extra traces because exemplars only attach to a few hundred buckets per scrape interval). Hotstar's 2024 IPL postmortem flagged 17 incidents where the exemplar-click-to-404 pattern produced 8–22 minutes of wasted on-call time per incident before the team rolled out exemplar-aware sampling.

The OpenTelemetry Context API and why context propagation is the hard part

The Context API in OpenTelemetry is a ContextVar in Python (an explicitly passed context.Context in Go; an AsyncLocal in C#) that carries the current Span reference. Every emitter that wants to attach trace_id reads trace.get_current_span().get_span_context().trace_id. The simple case — a synchronous request handler — is solved by FastAPI/Flask middleware that opens a span at request entry and closes it at request exit; everything inside the handler runs in the right context. The hard cases are: (1) asyncio tasks copy the context at creation time, so a task created outside the span, or through a pattern that bypasses the instrumented create_task path, carries the wrong context (Python 3.11's asyncio.create_task(coro, context=...) lets you pin it explicitly); (2) concurrent.futures.ThreadPoolExecutor does not propagate context unless you submit through a wrapper; (3) Kafka consumers receive messages outside any context and have to reconstruct it from the message headers (traceparent); (4) cron jobs start in no context and have to create a root span explicitly. Each of these patterns is a chapter of its own in Part 13; the failure mode is silent — the trace_id is missing from the emitted artefact, correlation fails for that request, and nobody notices until an incident exposes the gap. Razorpay's 2025 internal observability audit found that 23% of their async tasks emitted with no trace context because of an asyncio.gather pattern that bypassed the create_task wrapper; the fix was a middleware change that took two engineer-weeks to roll out across 47 services.
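A minimal sketch of cases (1) and (3), assuming Python 3.11+ and an already-configured OTel SDK; the topic and function names are illustrative, while propagate.extract, the context= arguments, and contextvars.copy_context are the real APIs.

import asyncio
import contextvars
from opentelemetry import trace, propagate

tracer = trace.get_tracer("worker")

# Case (3): a Kafka consumer rebuilds context from the message's headers.
def handle_message(headers: dict, payload: bytes) -> None:
    ctx = propagate.extract(headers)  # reads traceparent/tracestate
    with tracer.start_as_current_span("process-order", context=ctx):
        ...  # everything emitted here carries the upstream trace_id

# Case (1): pin the context a fire-and-forget task should run in.
async def handler() -> None:
    with tracer.start_as_current_span("checkout"):
        snapshot = contextvars.copy_context()  # taken while the span is current
        asyncio.create_task(audit_write(), context=snapshot)  # Python 3.11+

async def audit_write() -> None:
    ...  # runs with the checkout span as its parent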

Semantic conventions and Baggage — the resource model and the propagation layer that carries it across services

The OpenTelemetry semantic conventions (opentelemetry.io/docs/specs/semconv) define canonical attribute names: service.name, service.namespace, service.version, host.name, cloud.region, k8s.pod.name, http.method, db.statement. Teams that adopt these names get free interoperability — every dashboard template, every vendor backend, every SDK exporter understands them. Teams that pick their own names — svc instead of service.name, tenant instead of service.tenant, pod instead of k8s.pod.name — pay the cost forever in the form of dashboards that work for one service and not another, vendor migrations that require attribute remapping, and panels that filter on one name and display a different one. The discipline is a CI-enforced attribute-name linter; PhonePe published an internal otel-attr-lint Python script in 2025 that caught 340 violations in its first month and cut the cross-service dashboard-template count from 63 to 18. The other half of this story is Baggage — OpenTelemetry's mechanism for propagating arbitrary key-value pairs across service boundaries via the baggage: HTTP header (baggage: tenant.id=acme,plan=enterprise,region=ap-south-1). Baggage is how the level-2 business identifier from earlier in the chapter actually flows: every OTel-instrumented service reads it on the way in and writes it on the way out, so the auth-validator and the rate-limiter (which do not natively know about tenants) end up emitting telemetry tagged with the tenant. The trap is over-baggaging — putting too many keys in the header inflates every outbound request, including external-API calls, where the baggage leaks tenant data to third parties unless scrubbed. The discipline is a small intentional baggage set (3–5 keys), an egress rule that strips the baggage header from requests leaving the organisation, and a collector-side processor that drops baggage-derived attributes before telemetry is exported to any external vendor. Razorpay's 2025 baggage rollout was a six-week project, half of which was security review on what was leaking through baggage in the egress path.
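A minimal sketch of intentional baggage use in OTel Python; the key names mirror the header example above, and set_baggage/get_baggage plus context.attach/detach are the real APIs.

from opentelemetry import baggage, context

# Edge service, on the way in: attach the business identifiers once.
ctx = baggage.set_baggage("tenant.id", "acme")
ctx = baggage.set_baggage("plan", "enterprise", context=ctx)
token = context.attach(ctx)
try:
    ...  # instrumented outbound calls serialize these into the baggage: header
finally:
    context.detach(token)

# Any downstream service: read the value without knowing who set it.
tenant = baggage.get_baggage("tenant.id")  # -> "acme" (or None if unset)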

Correlation across organisational boundaries — the NPCI hop problem

Indian payment systems route every UPI transaction through NPCI, an external entity outside any single organisation's instrumentation. The PaisaBridge SDK emits a span on the way out; NPCI's internal systems are opaque; the response comes back with no trace_id continuation. The correlation contract breaks at the NPCI boundary — the trace tree has a 200ms gap with no spans inside it (the wall-clock duration of the NPCI hop), and the on-call cannot see what NPCI was doing during that 200ms even when it is the cause of the latency. The fix at the contract level is the W3C traceparent and tracestate headers — open standards that every payment-system gateway is gradually adopting. The fix at the operational level is NPCI-status correlation: the on-call's dashboard pulls the NPCI public status feed and overlays it on the gap, so a 200ms gap that coincides with an NPCI-published incident is at least diagnosable. The full fix — end-to-end trace propagation across NPCI, the bank, and the card network — is a multi-year ecosystem project; the practical discipline is to instrument up to the boundary, label the gap explicitly in the span tree as "external — NPCI" with span.kind=client, and avoid pretending the gap is a service of yours. PhonePe's 2025 SREcon talk on UPI observability spent 30 minutes on this single pattern.
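A minimal sketch of the instrument-up-to-the-boundary discipline; the span name follows the labelling convention above, the endpoint URL is a placeholder, and SpanKind.CLIENT and propagate.inject are the real APIs.

import requests
from opentelemetry import trace, propagate
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("paisabridge")

def call_npci(txn: dict) -> requests.Response:
    with tracer.start_as_current_span(
        "external — NPCI",  # label the gap explicitly; do not fake spans inside it
        kind=SpanKind.CLIENT,
        attributes={"peer.service": "npci"},
    ) as span:
        headers: dict = {}
        propagate.inject(headers)  # sends traceparent; NPCI may ignore it today
        resp = requests.post("https://npci.example/upi", json=txn, headers=headers)
        span.set_attribute("http.status_code", resp.status_code)
        return resp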

When not to correlate — the case for orthogonal pillars

Correlation is the dominant pattern, but a small class of telemetry is better without it. Aggregate compliance metrics (the regulator wants "total transaction volume by hour" with no per-request linkage), aggregate marketing metrics (the analytics team wants "DAU by region" with no per-user linkage by design), and aggregate financial reporting (the CFO wants "revenue by product line" with explicit non-personally-identifiable framing) are cases where adding trace_id to the metric is a privacy regression — it makes the aggregate de-anonymisable. These cases are rare in observability-the-discipline (which is mostly about debugging, not reporting) but they exist at the boundary, and the right answer is to emit two metrics: the correlated one for debugging (with exemplars, indexed in the operational stack) and the de-correlated one for compliance (no exemplars, exported to a separate, retention-controlled store). PhonePe's compliance team and SRE team negotiated this split in 2024 after RBI-aligned audit findings flagged that exemplar trace_ids on transaction-count metrics constituted a record-linking risk. The fix was emitting the compliance metric through a separate Prometheus instance with exemplar collection disabled and a 7-year retention, while the operational metric kept exemplars and a 30-day retention.
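A minimal sketch of the two-metric split using prometheus-client; the registry names are illustrative, and the deployment assumption is that each registry sits behind its own scrape endpoint, one scraped by the exemplar-enabled operational instance and one by the compliance instance.

from prometheus_client import Counter, CollectorRegistry

debug_registry = CollectorRegistry()       # scraped by the 30-day operational instance
compliance_registry = CollectorRegistry()  # scraped by the exemplar-disabled, 7-year instance

txn_debug = Counter("transactions_total", "correlated, for debugging",
                    registry=debug_registry)
txn_compliance = Counter("transactions_total", "aggregate only, for audit",
                         registry=compliance_registry)

def record_transaction(trace_id: str) -> None:
    txn_debug.inc(exemplar={"trace_id": trace_id})  # de-anonymisable by design
    txn_compliance.inc()                            # carries no per-request link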

# Reproduce this on your laptop
docker run -d -p 9090:9090 --name prom -v $(pwd)/prom.yml:/etc/prometheus/prometheus.yml prom/prometheus --config.file=/etc/prometheus/prometheus.yml --enable-feature=exemplar-storage
docker run -d -p 3200:3200 -p 4317:4317 -v $(pwd)/tempo.yml:/etc/tempo.yml grafana/tempo -config.file=/etc/tempo.yml  # Tempo needs a config file; a minimal tempo.yml enabling the OTLP receiver is enough
docker run -d -p 3000:3000 grafana/grafana
python3 -m venv .venv && source .venv/bin/activate
pip install pandas opentelemetry-sdk opentelemetry-exporter-otlp prometheus-client
python3 correlation_cost_simulator.py
# Then instrument a Flask app with both prometheus-client (with exemplars)
# and opentelemetry-sdk (with OTLP export to Tempo) — see Part 13 ch.82 for the wiring.
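A minimal sketch of that wiring, ahead of the full Part 13 treatment; the route, port, and registry choices are assumptions, while the exemplar= argument, the OpenMetrics exposition, and the OTLP exporter are the real APIs.

import time
from flask import Flask, Response
from prometheus_client import CollectorRegistry, Histogram
from prometheus_client.openmetrics.exposition import (
    CONTENT_TYPE_LATEST, generate_latest)
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One Resource block, so every span this process exports carries service.name.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout")

registry = CollectorRegistry()
LATENCY = Histogram("http_request_duration_seconds", "request latency",
                    ["route"], registry=registry)

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    with tracer.start_as_current_span("checkout") as span:
        start = time.perf_counter()
        ...  # handler work
        trace_id = format(span.get_span_context().trace_id, "032x")
        # The exemplar carries the same trace_id the span exports to Tempo,
        # which is what makes the Grafana panel's drill-down click land.
        LATENCY.labels(route="/checkout").observe(
            time.perf_counter() - start, exemplar={"trace_id": trace_id})
    return "ok"

@app.route("/metrics")
def metrics():
    # Exemplars are only rendered in the OpenMetrics exposition format.
    return Response(generate_latest(registry), mimetype=CONTENT_TYPE_LATEST)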

Where this leads next

Part 13 picks up exactly where this wall ends. The first chapter — /wiki/the-opentelemetry-data-model-resource-scope-attributes — formalises the resource model that every artefact in the curriculum will carry from this point forward. The second chapter walks the OTLP wire format, byte by byte, on a real captured ExportTraceServiceRequest payload. The third walks context propagation through async, threaded, and message-queue boundaries — the silent-failure cases that produce the 23% missing-trace-context rate that even mature teams hit. The fourth and fifth chapters cover the collector pipeline (batch processor, retry, back-pressure, tail-sampling) and the exporter semantics (push vs pull, gRPC vs HTTP, retry budgets) — the operational details where the contract holds or breaks under production load.

The wall on the other side of Part 13 is Part 14 (continuous profiling), which adds the fourth pillar to the correlation contract. The profile-to-trace bridge is the technically subtlest of the four; the practical payoff — drilling from a slow trace into the flamegraph for the cohort that produced it — is the diagnostic capability that closes the loop on the class of production debugging the first three pillars leave open.

Within Part 14, the chapter on per-request profiling closes the last correlation gap — the one this chapter named as "process-level only" for profiles — by attaching pprof.SetGoroutineLabels-style tags to flamegraphs, so the on-call can drill from a slow trace into the flamegraph for that specific request. The pattern is the same correlation contract, applied to a fourth pillar.

Within Part 11 (alerting), the alert-as-code chapter is the natural sibling of dashboard-as-code, and both ship through the same git-CI-API pipeline that this chapter assumes. Once the correlation contract is in place, the alert payload itself can carry the exemplar trace_id of the worst-offender request, so the on-call's PagerDuty alert deep-links not just into the dashboard but into the specific trace that triggered the alert — collapsing rung 0 (read the alert) and rung 1 (find the trace) into a single click. Razorpay's 2025 alert-payload redesign added the exemplar trace_id and tenant.id to every page, and reported a 38% reduction in median time-to-first-trace simply because the on-call no longer had to navigate through the dashboard at all for the most common incident shape.

Cross-curriculum, this wall cross-links to the data-engineering curriculum's /wiki/lineage-as-the-foundation-of-data-trust — both are arguments that identity-preserving links across separately-stored artefacts are what turn three good systems into one diagnosable system. The data-engineering version is dataset lineage; the observability version is request lineage. The architectural pattern is the same: emit-time identity propagation beats query-time inference at every cost dimension that matters.

A reader whose primary discipline is platform engineering will recognise this wall as a special case of the more general "system-of-systems" problem — three independently-built tools with their own data models, persisted into their own indices, queried with their own DSLs, that need to behave as one product to a user who does not know which tool produced which artefact. The general fix has the same shape: agree on a primary key (the request_id, the trace_id, the user_id), enforce its presence at write time on every artefact, design the query layer to use the primary key as the join axis, and accept that the cost of the discipline is borne by the writers (more bytes per artefact, more constraints on what the writer can omit) so that the readers (the on-call, the support engineer, the analyst) get a coherent product. Observability's correlation contract is one instance of this pattern; data engineering's lineage contract is another; service-mesh's correlation IDs are a third. The pattern repeats because the underlying problem repeats, and the answer is always at the writer, never at the reader.

A practical reading order for the on-call who has just absorbed this wall and wants to ship correlation in their own organisation: start with the OpenTelemetry getting-started guide for one language (typically Python or Go, whichever the majority of the services run), instrument one tier-1 service end-to-end with all three pillars (metrics, logs, traces) sharing the same Resource block and trace_id, ship a single dashboard with one exemplar-enabled histogram panel that drills into Tempo, and validate the click-through against a synthetic incident before claiming the migration is complete. The single-service prototype takes 1–2 engineer-weeks; the fleet-wide rollout takes 2–4 engineer-quarters depending on language diversity and async-context-propagation complexity. The discipline is to invest in the prototype's testing (the synthetic incident, the click-through validation) rather than in the prototype's coverage (more panels, more services), because the failure modes show up at integration boundaries — async tasks, message queues, third-party SDKs — and only an end-to-end test surfaces them. Razorpay, Hotstar, Swiggy, and Zerodha all published case-study posts in 2024–2025 describing this exact rollout shape; the common thread is that none of them shipped correlation organisation-wide before they had a working prototype on one service that the on-call team genuinely used during a real incident.

References