Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Distributed tracing (W3C, Dapper, Jaeger)

It is 14:22 IST. Karan, a backend engineer at MealRush, is staring at a dashboard that says the lunch-hour /checkout endpoint's p99 jumped from 240 ms to 1.8 seconds at exactly 14:07. Eleven services touch that request. The CPU graphs are flat. The Kafka consumer lag is fine. The database p99 is 18 ms — also fine. Each service in isolation looks healthy. And yet, somewhere in the fan-out, 1.5 seconds is being spent. He opens the Jaeger UI, picks one slow request, and sees a flame graph: the restaurant-availability span is 1,612 ms long, with a single child span called geo-radius-lookup that takes 1,608 ms of it. A correlation ID would have told him which logs belong to this request; the trace tells him which hop took the time. A trace is not a log search — it is a structured causality tree, where the shape of the tree itself is the answer.

A distributed trace is a tree of spans. Each span records start time, duration, a parent pointer, and a service name; spans share a 128-bit trace ID. The W3C Trace Context spec (traceparent header) standardises propagation; Google's 2010 Dapper paper invented the architecture; Jaeger and OpenTelemetry are the open-source descendants. Sampling is the load-bearing engineering decision — keeping every trace at MealRush scale would cost ₹4 crore/year; head-based sampling at 1% is the default, tail-based on errors and slow requests is the upgrade.

What a span is, and why the tree shape matters

A span is one unit of work — usually one RPC, one database query, or one significant code block. It records four things every time: a unique 64-bit span_id, a parent_span_id (or null for the root), a start timestamp, and a duration. Optionally it carries attributes (key-value tags like http.status=200 or db.statement="SELECT ...") and events (log-line-equivalents that happened mid-span). Every span in one user-facing request shares the same 128-bit trace_id. Because each span knows its parent, the spans in one trace assemble into a tree — the root span at the API gateway, child spans for each downstream RPC, grandchildren for the database calls those RPCs made, and so on.
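
To make the record concrete, here is a minimal sketch (not a real SDK; the IDs and durations are the illustrative ones from the trace figure below) of the mandatory fields plus attributes, and of how the parent pointers assemble spans into a tree you can walk:

# span_tree.py: a toy span record and the tree its parent pointers imply.
# Values mirror the illustrative MealRush trace; nothing here is real tracer code.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str                 # shared by every span in one request (128-bit, hex)
    span_id: str                  # unique within the trace (64-bit, hex)
    parent_span_id: str | None    # None for the root
    service: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)   # e.g. {"http.status": 200}

TRACE = "9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c"
spans = [
    Span(TRACE, "a1a1a1a1a1a1a1a1", None,               "api-gateway",             1812.0),
    Span(TRACE, "b2b2b2b2b2b2b2b2", "a1a1a1a1a1a1a1a1", "order-svc",               1798.0),
    Span(TRACE, "c3c3c3c3c3c3c3c3", "b2b2b2b2b2b2b2b2", "restaurant-availability", 1612.0),
    Span(TRACE, "d4d4d4d4d4d4d4d4", "c3c3c3c3c3c3c3c3", "geo-radius-lookup",       1608.0),
]

# Assembling the tree is a group-by on the parent pointer; no timestamp guessing.
children = defaultdict(list)
for s in spans:
    children[s.parent_span_id].append(s)

def hot_path(span: Span, depth: int = 0) -> None:
    """Follow the fattest child at each level: the chain a flame graph makes obvious."""
    print("  " * depth + f"{span.service}: {span.duration_ms:.0f} ms")
    kids = children.get(span.span_id, [])
    if kids:
        hot_path(max(kids, key=lambda k: k.duration_ms), depth + 1)

hot_path(children[None][0])   # api-gateway > order-svc > restaurant-availability > geo-radius-lookup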

The shape of the tree is what you read.

[Figure: flame-graph view of trace 9a4f7c1b… (MealRush /checkout, total 1,812 ms). x-axis = wall-clock time within the trace; box width = span duration. Root span api-gateway (1,812 ms); children auth (12 ms), order-svc (1,798 ms), notify (8 ms), aligned by start time. Under order-svc: cart-svc (32 ms) and restaurant-availability (1,612 ms); under that, geo-radius-lookup (1,608 ms), the culprit: a PostGIS ST_DWithin query missing its GIST index after a migration. The shape, one fat child swallowing its parent, is the diagnostic; logs cannot show this.]
Flame-graph view of one trace. The visual obviously says "geo-radius-lookup is the entire latency". A correlation ID would tell you the logs belong together; a trace tells you the geometry of the slowness. Illustrative.

A trace solves a category of question that logs cannot. A log line is a point in time tagged with a service name; it carries no parent pointer and no duration. To compute the latency of restaurant-availability from logs, you have to find its entered and exited lines, subtract timestamps, and trust that the host clocks agree to within a millisecond — none of which is reliable across services. A span carries the duration as a measured value in one place, and the tree carries the causality as a structural pointer, not as a timestamp coincidence.

The trace is also not a metric. A metric tells you "p99 of /checkout is 1.8s". The trace tells you which hop, in which trace, exactly contributed how many milliseconds. Metrics summarise across thousands of requests; traces preserve one specific request in full. You read metrics first to know something is slow; you read traces to know what.

How propagation works: the W3C traceparent header

A trace works because every service knows the trace ID and the parent span ID before it does any work. That requires propagation — the same discipline as correlation IDs, but with structured parts.

The W3C Trace Context Recommendation (2020) defines two HTTP headers. The first is traceparent, mandatory:

traceparent: 00-9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c-7d9e1f2a4b6c8d0e-01

Four fields, dash-separated: 00 is the version; the next 32 hex chars are the trace ID (128 bits); the next 16 hex chars are the parent span ID (64 bits) — i.e. the span ID of the caller (which is this request's parent); the last 2 hex chars are flags, where the lowest bit is the sampled flag. The second header is tracestate, which carries vendor-specific extensions as comma-separated key-value pairs (tracestate: paysetu=cid:c8f4a2,jaeger=variant:b). Both are propagated unchanged across hops except that each service updates the parent-span-id in the outgoing traceparent to the span ID it just generated for itself — that is what wires up the parent-child pointer.

The propagation contract has three rules: (1) on inbound, parse traceparent; if it is absent, you are the root and you mint a new trace ID; if it is present, inherit the trace ID and treat its parent-span-id as your span's parent. (2) Generate a new span ID for the work you do. (3) On every outbound call, emit traceparent with the same trace ID and with the parent-span-id set to your own span ID. The sampled flag, once decided at the root, propagates unchanged — every hop in the trace makes the same sampling decision, so traces are never half-sampled.

# tracer.py — minimal W3C-Trace-Context propagator (asyncio).
# Demonstrates inbound parse, span generation, outbound serialise.
import asyncio, contextvars, secrets, time, json

trace_ctx: contextvars.ContextVar[dict] = contextvars.ContextVar("trace", default={})

def parse_traceparent(h: str) -> dict:
    parts = h.split("-")
    if len(parts) != 4 or parts[0] != "00":
        return {}
    return {"trace_id": parts[1], "parent_span_id": parts[2], "flags": parts[3]}

def emit_traceparent(trace_id: str, my_span_id: str, flags: str) -> str:
    return f"00-{trace_id}-{my_span_id}-{flags}"

async def handle(service: str, headers: dict):
    parent = parse_traceparent(headers.get("traceparent", ""))
    trace_id = parent.get("trace_id") or secrets.token_hex(16)  # 128 bits
    my_span_id = secrets.token_hex(8)                            # 64 bits
    flags = parent.get("flags") or ("01" if secrets.randbelow(100) < 1 else "00")
    span = {"service": service, "trace_id": trace_id, "span_id": my_span_id,
            "parent": parent.get("parent_span_id"), "start": time.monotonic(), "flags": flags}
    trace_ctx.set(span)
    await asyncio.sleep(0.01)  # do work
    span["duration_ms"] = round((time.monotonic() - span["start"]) * 1000, 2)
    if flags.endswith("1"):
        print(json.dumps({k: v for k, v in span.items() if k != "start"}))
    return {"traceparent": emit_traceparent(trace_id, my_span_id, flags)}

async def main():
    # root request — no inbound traceparent
    h_gw = await handle("api-gateway", {})
    # downstream call inherits trace_id, this gateway's span becomes the new parent
    h_order = await handle("order-svc", h_gw)
    h_avail = await handle("restaurant-availability", h_order)

asyncio.run(main())

Realistic output (sampled trace):

{"service": "api-gateway", "trace_id": "9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c", "span_id": "7d9e1f2a4b6c8d0e", "parent": null, "flags": "01", "duration_ms": 10.31}
{"service": "order-svc", "trace_id": "9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c", "span_id": "1c3e5a7b9d2f4061", "parent": "7d9e1f2a4b6c8d0e", "flags": "01", "duration_ms": 10.18}
{"service": "restaurant-availability", "trace_id": "9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c", "span_id": "8b6d4f2e0c1a3957", "parent": "1c3e5a7b9d2f4061", "flags": "01", "duration_ms": 10.42}

Walkthrough. The trace_ctx ContextVar holds the current span — async-safe, copied across await points. The parse_traceparent function pulls the four fields out of the inbound header; if there is no header (root request), it returns empty and handle mints a fresh 128-bit trace ID with secrets.token_hex(16). The flags decision is the sampling decision — at the root, we flip a 1% biased coin; on subsequent hops we inherit, so the trace is either fully kept or fully dropped. Why 128 bits for trace ID and 64 for span ID: at 100k traces/sec for 10 years (roughly 3 × 10¹³ traces), the birthday-bound collision probability for 128-bit IDs is on the order of 10⁻¹² — effectively impossible. A 64-bit space would collide routinely at that global volume, but span IDs only need to be unique among the few hundred siblings inside one trace, where 64 bits leaves the per-trace collision probability around 10⁻¹⁴. The emit_traceparent function builds the outbound header with this service's span ID slotted as the parent for the next hop — that pointer-flip is what builds the tree. Sampled spans get printed; un-sampled ones are dropped at the source and never sent to the collector. Three log lines, three spans, one trace tree, root → child → grandchild.
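
The birthday-bound arithmetic behind those ID widths is short enough to check directly; the volumes below are this section's hypothetical ones:

# id_collisions.py: back-of-envelope birthday bound, p ≈ n(n-1) / (2 · 2^bits).
traces = 100_000 * 60 * 60 * 24 * 365 * 10      # 100k traces/sec for 10 years ≈ 3.2e13
p_trace_128 = traces * (traces - 1) / 2 / 2**128
print(f"128-bit trace-ID collision over 10 years: ~{p_trace_128:.1e}")   # ~1.5e-12

siblings = 500                                   # span IDs only compete within one trace
p_span_64 = siblings * (siblings - 1) / 2 / 2**64
print(f"64-bit span-ID collision within one trace: ~{p_span_64:.1e}")    # ~6.8e-15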

Sampling — the engineering decision that decides your storage bill

If you keep every span at MealRush scale — say 80k requests/sec at lunch hour, average fan-out 14 spans per trace, 1 KB per span — you generate 1.1 GB/sec of trace data, or about 95 TB/day. At cloud-storage and indexing prices that runs to roughly ₹4 crore/year just for the trace tier. Every production tracing system therefore makes a sampling decision, and the choice of sampling strategy is the single biggest engineering question in the trace pipeline.
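
The arithmetic behind that bill is short enough to redo on a napkin; the inputs below are the hypothetical peak figures from this paragraph:

# trace_volume.py: span volume at a hypothetical MealRush lunch-hour peak.
req_per_sec   = 80_000
spans_per_req = 14
span_bytes    = 1_000          # ~1 KB per span once attributes are included

bytes_per_sec = req_per_sec * spans_per_req * span_bytes
print(f"{bytes_per_sec / 1e9:.2f} GB/s")                              # ≈ 1.12 GB/s
print(f"{bytes_per_sec * 86_400 / 1e12:.0f} TB/day at sustained peak")  # ≈ 97 TB/day
print(f"kept at 1% head sampling: {bytes_per_sec * 86_400 * 0.01 / 1e12:.1f} TB/day")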

[Figure: head-based vs tail-based sampling, two panels. Head-based (the default in Jaeger and the OpenTelemetry SDK): the root flips a 1-in-100 coin, the flag propagates, and the whole trace is kept or dropped end-to-end; 99% of traces never reach the collector. Cheap, but when an error lands in the un-sampled 99% you have nothing. Tail-based (the upgrade): every span is buffered for ~10 s at the collector, then the trace is kept or dropped on whole-trace properties (any error? duration over a threshold?). Higher signal, capturing every error trace, at a buffer cost of roughly 10 s × span rate × 1 KB of memory.]
Head-based sampling decides at trace creation; tail-based decides after the fact. Most production systems start head-based and add tail-based on errors and p99 slow traces once the trace pipeline is mature. Illustrative.

Head-based sampling is the default. The root service flips a coin once, sets the sampled bit in traceparent, and every downstream hop reads it. Cheap to run, but it has a structural blindness: when an error happens in the un-sampled 99%, you have nothing — only the flat correlation-ID logs, no tree shape, no span durations.

Tail-based sampling buffers every span at the collector for ~10 seconds, assembles the trace, and only then decides whether to keep it based on whole-trace properties: did any span have error=true? Did total duration exceed 1 second? Did this trace touch a high-priority customer? The trade-off is collector memory — you have to hold every span for 10 seconds, which at 1.1 GB/s is 11 GB of resident buffer. Why 10 seconds, not 1 or 60: the buffer must be longer than the slowest trace in your sampling tail. Most systems' p99.9 trace duration is 2–6 seconds; 10 seconds covers the long tail of slow traces and stragglers from background async work like email-send. Going to 60 seconds would 6× the memory; going to 1 second would drop genuinely slow traces — exactly the ones you want to keep — because they wouldn't have completed before the sampling decision.

Probabilistic adaptive sampling is the production refinement: keep 100% of error traces, 100% of slow traces (p99+), 1% of everything else. This is what Jaeger's adaptive sampler and the OpenTelemetry tail-sampling processor implement. The bill drops to about 3% of full retention while keeping every trace anyone is going to ask about.
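
A hedged sketch of that collector-side policy (not Jaeger's or the OpenTelemetry Collector's actual processor, just the shape of the decision) looks like this:

# tail_sampler.py: toy tail/adaptive sampling at the collector (illustrative policy).
import random, time
from collections import defaultdict

BUFFER_SECONDS    = 10      # must outlast your slowest interesting trace
SLOW_THRESHOLD_MS = 1_000   # keep every trace slower than this
BASELINE_RATE     = 0.01    # plus 1% of healthy traces

_buffers    = defaultdict(list)   # trace_id -> list of span dicts
_first_seen = {}                  # trace_id -> arrival time of its first span

def ingest(span: dict) -> None:
    tid = span["trace_id"]
    _buffers[tid].append(span)
    _first_seen.setdefault(tid, time.monotonic())

def _keep(trace: list[dict]) -> bool:
    if any(s.get("error") for s in trace):
        return True                                   # 100% of error traces
    root = next((s for s in trace if s.get("parent") is None), None)
    if root and root["duration_ms"] > SLOW_THRESHOLD_MS:
        return True                                   # 100% of slow traces
    return random.random() < BASELINE_RATE            # 1% of everything else

def flush(now: float) -> list[list[dict]]:
    """Called periodically; returns the traces that survive the policy."""
    kept = []
    for tid in [t for t, t0 in _first_seen.items() if now - t0 >= BUFFER_SECONDS]:
        trace = _buffers.pop(tid)
        del _first_seen[tid]
        if _keep(trace):
            kept.append(trace)                        # would be written to storage
    return kept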

The architecture: SDK → collector → storage → UI

A production tracing system has four tiers, each with its own scaling problem:

  1. Instrumentation SDK in the application process (OpenTelemetry SDK, Jaeger client) — generates spans, decides sampling, batches, and exports asynchronously over OTLP/gRPC. Must be low-overhead: ≤5% CPU, ≤2% latency. The OpenTelemetry Java agent achieves ~1.8% CPU at 50k spans/sec. (A minimal SDK-configuration sketch follows this list.)
  2. Collector (Jaeger Collector, OpenTelemetry Collector) — accepts spans over OTLP/gRPC or Thrift, optionally runs tail-sampling, batches by trace, and writes to storage. This is where most of the production tuning happens — load balancing collectors, partitioning by trace ID so all spans of one trace land on the same collector instance for tail-sampling.
  3. Storage — Elasticsearch (Jaeger default), Cassandra (Dapper-style), ClickHouse (modern OTel choice), Grafana Tempo's object-store backend, or the proprietary backends from Honeycomb / Datadog / New Relic. Trace data is write-heavy and read-rare: 100k writes/sec but only ~100 trace lookups/sec from the UI.
  4. UI — Jaeger UI, Grafana's trace view (backed by Tempo), or a vendor console — turns trace IDs into flame graphs and supports search by service, operation, duration, tag, error.
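
As promised above, here is a hedged sketch of tier 1 using the OpenTelemetry Python SDK: a parent-respecting 1% sampler plus a batching OTLP/gRPC exporter. The endpoint, tracer name, and attribute values are illustrative, not MealRush's real configuration:

# otel_sdk_setup.py: tier 1 in Python (opentelemetry-sdk + opentelemetry-exporter-otlp).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    sampler=ParentBased(root=TraceIdRatioBased(0.01)),  # inherit the inbound flag; 1% at the root
)
provider.add_span_processor(
    # Batches spans and exports them off the request path over OTLP/gRPC.
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("restaurant-availability") as span:
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "SELECT ... WHERE ST_DWithin(...)")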

The collector tier is where the production pain concentrates. Tail-sampling requires that every span of one trace land on the same collector — otherwise the buffer can't see the whole trace. Jaeger and OpenTelemetry Collector solve this with a load-balancing exporter in front: the collector tier is two layers, the front layer hashes by trace ID and routes to the back layer, and the back layer does the actual sampling and storage write. This is exactly the pattern that consistent hashing was invented for — see hashring diagrams.
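
The routing rule in that front layer is small; the point is only that the hash key is the trace ID, so one trace's spans can never be split across tail-sampling buffers. Collector names below are hypothetical:

# trace_router.py: front-layer routing, every span of a trace goes to one back-layer collector.
BACK_LAYER = ["collector-b0", "collector-b1", "collector-b2", "collector-b3"]

def route(trace_id_hex: str) -> str:
    # Modulo on a stable function of the trace ID is enough to co-locate a trace's spans;
    # consistent hashing matters once the pool resizes and you want to avoid a full reshuffle.
    return BACK_LAYER[int(trace_id_hex, 16) % len(BACK_LAYER)]

# Two spans of the same trace always land on the same instance:
assert route("9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c") == route("9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c")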

PaySetu hit this pain at 60k spans/sec when their single-tier collector pool kept dropping traces because tail-sampling buffer evictions didn't see all spans. The two-tier rebuild took two sprints, dropped trace-loss rate from 4% to 0.02%, and added 3 ms of collector latency — invisible to applications, since exporters batch asynchronously.

Common confusions

  • "A trace ID is the same as a correlation ID" They share a propagation pattern but answer different questions. A correlation ID is a flat string logged on every line of every request — sampled at 100% so every incident is debuggable in the log tier. A trace ID is the root of a structured causality tree, sampled at 1–5% in production for cost reasons. Most mature systems carry both: cid in every log line, trace ID for the sampled deep-dive case.

  • "Distributed tracing is just about latency" Latency is the most common use, but tracing is also how you debug causality — "which service made the bad write?", "did the auth service even get called?", "was this request retried?". An error span with error=true and a stack-trace attribute, anchored in the parent tree, shows you not just what failed but which path through the system led there. Logs lose the path; traces preserve it.

  • "Jaeger and OpenTelemetry are competitors" They are different layers. OpenTelemetry is the instrumentation SDK and wire protocol standard — what runs inside your application to generate spans. Jaeger is a trace storage and UI backend — the destination where spans are stored and queried. OpenTelemetry SDKs export to Jaeger backends; Jaeger is migrating its own SDKs to be OpenTelemetry-native. The clean modern stack is "OpenTelemetry SDK in app → OTel Collector → Jaeger or Tempo or ClickHouse for storage → Jaeger or Grafana UI for query."

  • "Sampling at 1% means I lose 99% of debugging power" With probabilistic adaptive sampling — keeping 100% of error and slow-trace cases plus 1% of healthy traces — you lose almost nothing. The 99% you drop are healthy successful requests that all look identical; the metric tier already captures their aggregate latency distribution. The bill drops to 3% of full retention; the diagnostic value drops to ~95% of full retention. It's the best ratio in observability.

  • "Spans need millisecond-accurate clocks across hosts" Span durations are measured within one process using time.monotonic() — host-local, never crosses machine boundaries — so they are unaffected by NTP clock skew between hosts. The wall-clock start times are used only for ordering display in the UI; the duration values that matter for diagnosis are exact within each span. This is one of the few distributed-systems tools that doesn't fall to clock-skew fallacies.

  • "traceparent and X-B3-TraceId do the same thing" They do the same thing under different conventions. B3 (Zipkin's headers, X-B3-TraceId, X-B3-SpanId, X-B3-Sampled) was the dominant scheme pre-2020. W3C traceparent is the post-2020 standard and what every new system should use. OpenTelemetry SDKs propagate both by default for compatibility — but a request that crosses an org boundary should preserve traceparent, and B3 should be treated as legacy.

Going deeper

The Dapper paper (2010) and what it got right and wrong

Sigelman et al.'s "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (Google Technical Report dapper-2010-1) is the foundational paper. It introduced the core ideas every modern tracer still uses: trace IDs propagated as RPC metadata, spans as the unit of work, head-based sampling as the cost-control mechanism, low-overhead instrumentation libraries baked into the RPC framework (Stubby, in Google's case). Dapper got two things spectacularly right: (1) instrumenting the RPC framework, not the application code, so adoption is automatic when teams use the standard libraries; (2) aggressive head-based sampling at 0.1%, which made the storage bill viable at Google's 2010 scale. It got two things wrong, which OpenTelemetry has been correcting for a decade: (1) Dapper's spans were missing structured attributes — debugging required navigating to logs anyway; modern OTel spans carry http.method, db.statement, error.type, etc. directly. (2) Dapper had no tail-based sampling — every error trace in the un-sampled 99.9% was lost. Both fixes came from the open-source ecosystem (Zipkin → Jaeger → OpenTelemetry), not from Google's internal evolution.

Span links: when one parent is not enough

A span has one parent. That works for synchronous RPCs, but breaks for fan-in async patterns: a queue consumer that batches 100 messages from 100 different traces and processes them in one span. Which message is its parent? OpenTelemetry's answer is span links — a span can reference one parent (the conventional causality) and an unbounded number of links to other spans (the fan-in inputs). The batch-consumer span has parent=null (it's a root in its own trace), and 100 links pointing back to the original 100 message traces. The UI renders these as a "this span was caused by" list, which is the right semantic — the consumer is genuinely a join point, not a child of any single producer.
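
A hedged sketch of the fan-in consumer with the OpenTelemetry Python API follows; the message fields and the integer trace/span IDs are illustrative, and in practice the producers' SpanContexts would be extracted from message headers:

# span_links.py: a fan-in consumer span linked to the producer spans it joins.
from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext, TraceFlags

tracer = trace.get_tracer("batch-consumer")

def process_batch(messages: list[dict]) -> None:
    # Each message carries the integer trace_id/span_id of the producer span that enqueued it.
    links = [
        Link(SpanContext(trace_id=m["trace_id"], span_id=m["span_id"],
                         is_remote=True, trace_flags=TraceFlags(0x01)))
        for m in messages
    ]
    # The consumer span is the root of its own trace, with one link per input message.
    with tracer.start_as_current_span("process-batch", links=links):
        for m in messages:
            ...  # handle each message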

The CricStream tail-sampling rebuild

CricStream's recommendation pipeline ran head-based sampling at 1% for two years. It worked fine for routine debugging. Then a partition of users started seeing duplicate notifications during the IPL final, and the root cause was a retry-storm pattern that only manifested for ~0.4% of requests. Head-based sampling at 1% caught approximately one example trace of the bug per hour — not enough signal to characterise the pattern. The fix was a tail-sampling rebuild: keep 100% of traces with error=true OR duration_ms > 800 OR a custom flag the recommendation service set on suspect requests. Memory cost on the collector tier went from 4 GB to 38 GB; storage cost went up 7%; bug characterisation went from "we cannot reproduce" to a one-day analysis. The lesson: head-based sampling is the right default; tail-based sampling is a debt you have to pay before your first major outage where the bug lived in the un-sampled tail.

The exemplar pattern — linking metrics to traces

A modern observability stack ties metric points back to trace IDs through exemplars. Prometheus's exemplar feature lets a histogram bucket carry a sample trace ID alongside the count. When Karan looks at the /checkout p99 latency dashboard and sees the spike at 14:07, the dashboard offers him "click here for an example slow trace from this bucket" — pivoting directly into the Jaeger UI for one specific trace ID that landed in the slow bucket. This is the production-engineer's superpower: metric tells you something is wrong, exemplar gives you one specific instance, trace shows you the shape of that instance. Three layers, one click, no log archaeology.
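
A toy sketch of the mechanism (a plain-Python stand-in, not the Prometheus client library's actual exemplar API): a latency histogram whose buckets each remember one recent trace ID, which is exactly the join key the dashboard needs for the one-click pivot:

# exemplars.py: a latency histogram whose buckets each keep one example trace ID.
import bisect

BUCKETS_MS = [50, 100, 250, 500, 1000, 2500, float("inf")]   # upper bounds, "le" semantics

class ExemplarHistogram:
    def __init__(self):
        self.counts = [0] * len(BUCKETS_MS)
        self.exemplars = [None] * len(BUCKETS_MS)   # one trace_id per bucket

    def observe(self, duration_ms: float, trace_id: str) -> None:
        i = bisect.bisect_left(BUCKETS_MS, duration_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace_id                # keep the most recent example

h = ExemplarHistogram()
h.observe(230.0, "1f2e3d4c5b6a7988aabbccddeeff0011")           # a healthy checkout
h.observe(1812.0, "9a4f7c1b8e2d6f0a3b5c7d9e1f2a4b6c")          # the 14:07 slow request
print(h.exemplars[BUCKETS_MS.index(2500)])   # the dashboard's "example slow trace" link target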

Where this leads next

Tracing closes the second leg of the observability triangle (logs, metrics, traces). The next chapters in Part 18 build on this: structured logs joined by trace ID for the un-sampled cases; SLO dashboards that pivot to exemplar traces; on-call workflows that start at a metric anomaly and end at a flame graph. The recurring claim is that incident debugging is a dataflow problem — you need every layer to share at least the trace ID and the cid as join keys, or you lose minutes of MTTR every time the on-call has to retype-and-grep.

The deeper pattern: tracing is the only observability tool that preserves the shape of the request, not just its summary or its log fragments. Once your tracing pipeline works, you stop debugging by reading log lines and start debugging by reading flame graphs — and the latter is roughly 10× faster for the class of bugs that lives in the fan-out, the retry, or the async hop.

References

  1. W3C, Trace Context (Recommendation 2020) — the binding standard for traceparent and tracestate.
  2. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (Google Technical Report dapper-2010-1) — the foundational paper; every modern tracer descends from it.
  3. OpenTelemetry Specification, Trace API and SDK — the canonical instrumentation contract that has subsumed Jaeger client and Zipkin Brave.
  4. Jaeger documentation, Architecture — the four-tier reference design (SDK / agent / collector / storage / UI) that influenced every OSS tracing stack.
  5. Yuri Shkuro, Mastering Distributed Tracing (Packt 2019) — book by Jaeger's creator; the operational chapters on tail-sampling and collector partitioning are the practitioner's reference.
  6. Cindy Sridharan, Distributed Systems Observability (O'Reilly 2018), Chapter 6 — the "logs vs metrics vs traces" framing that the exemplar pattern operationalises.
  7. Grafana Labs, Tempo and trace-to-logs — modern reference for ClickHouse/object-store-backed trace storage and exemplar-driven workflows.
  8. See also: correlation IDs; wall: observability in distributed systems is a data problem; wall clocks and NTP.