Span, trace, context: the data model

It is 11pm in Bengaluru. Aditi, an SRE at a payments startup, has Tempo open on her left monitor. A customer just complained that a UPI mandate took 9.4 seconds. She pastes the trace_id into Tempo and gets back a tree: 47 spans, six services, three retries, one HTTP 504. She finds the slow span in 30 seconds — the trace is shaped like a Gantt chart, and one bar is conspicuously wider than the others. None of that would have worked if the span she wanted to see did not have the right parent_span_id, the right trace_id, or the right propagated context. The data model is the difference between a tree and a flat list of disconnected JSON blobs.

This chapter takes the three primitives the Dapper paper introduced — span, trace, context — and turns them into precise data structures with a query language. You will write spans, propagate context across an HTTP boundary, reconstruct a trace tree, and see the exact bytes that flow on the wire. By the end, when an on-call says "the parent_span_id was wrong and the tree split into two orphan trees", you will know exactly what went wrong.

A span is a (trace_id, span_id, parent_span_id, name, start, end, attributes) tuple representing one unit of work. A trace is the set of spans sharing one trace_id, assembled into a tree at query time by walking parent_span_id pointers. Context is the small wire-format payload (traceparent: 00-{trace_id}-{span_id}-{flags}) that propagates the active span's identity to a downstream service so its first span can claim the right parent. Get any of the three wrong and your trace is either incomplete, unparented, or merged with a stranger's request.

What is in a span — every field, exactly

A span is the fundamental unit of trace data. The OpenTelemetry specification fixes its shape precisely; every modern tracer (Tempo, Jaeger, Zipkin, AWS X-Ray, Datadog APM) emits spans that map to this schema with at most a renaming of fields. Knowing the schema cold is what lets you read a Tempo response, a Jaeger UI tooltip, or an OTLP protobuf dump and translate between them in your head.

The required fields, in the order the OpenTelemetry data model lists them, are: trace_id (16 bytes, 128 bits, unique per trace, shared across every span in the same trace); span_id (8 bytes, 64 bits, unique per span, generated independently at every service); parent_span_id (8 bytes; 0x0000000000000000 if the span is a root); name (a short, low-cardinality string like HTTP POST /checkout or db.query payments); kind (one of CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL — denoting which side of an RPC this span is); start_time_unix_nano and end_time_unix_nano (timestamps in nanoseconds since Unix epoch); status (OK, ERROR, or UNSET, with an optional message); attributes (a map of key-value pairs — http.method=POST, http.status_code=200, db.statement=SELECT...); events (a list of (timestamp, name, attributes) tuples — annotations on the span's timeline); and links (references to other spans that influenced this one but are not its parent — used for batch jobs and async fan-in). On top of those, every span carries a resource describing the emitting process (service.name=payments, service.version=1.4.2, host.name=ip-10-0-2-91) and a scope describing the instrumentation library that produced the span (opentelemetry-instrumentation-flask, version 0.43b0).
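Laid out as data, the whole schema fits in one small record. The sketch below shows one illustrative SERVER span in the same flattened, readable JSON shape the scripts later in this chapter use (real OTLP nests attributes as key/value lists and encodes the identifiers as bytes; every value here is made up):

# one_span.py - one illustrative SERVER span in the flattened JSON shape this
# chapter's scripts use (real OTLP nests attributes as key/value lists and
# encodes ids as bytes; all values below are made up)
import json

span = {
    "trace_id": "9f4e2a0bdc3f7261d4e8b75c821ae8a2",  # 32 hex chars = 128 bits, shared by the whole trace
    "span_id": "3d51b07ef2c99814",                   # 16 hex chars = 64 bits, unique to this span
    "parent_span_id": "0000000000000000",            # all zeros: this span is the root
    "name": "POST /checkout",
    "kind": "SERVER",
    "start_time_unix_nano": 1714053023412888000,
    "end_time_unix_nano":   1714053023687451000,
    "status": "OK",
    "attributes": {"http.method": "POST", "http.status_code": 200},
    "events": [{"time_unix_nano": 1714053023424888000, "name": "db.query.start"}],
    "links": [],
    "resource": {"service.name": "checkout", "service.version": "1.4.2"},
    "scope": {"name": "opentelemetry.instrumentation.flask", "version": "0.43b0"},
}
print(json.dumps(span, indent=2))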

The two pieces of the schema that most often get confused are kind and resource. Kind is per-span: one HTTP call between checkout and payments produces a CLIENT span on the checkout side and a SERVER span on the payments side, with the same http.method attribute on both. Resource is per-process: every span emitted by the checkout service shares the same service.name=checkout resource, regardless of kind. Tempo and Jaeger let you query by resource ({ service.name = "checkout" }) and by attribute ({ http.status_code = 504 }), and most production trace queries combine both — { service.name = "checkout" && http.status_code >= 500 }.

Figure: Anatomy of a span — every required field, with example values. A boxed diagram of the OpenTelemetry span schema: identifiers (trace_id, span_id, parent_span_id, trace_state), timing and identity (name, kind, start and end timestamps), payload (attributes, events, status, links), and context (resource, scope), each field shown with a representative value from a payments checkout service.
Illustrative — the fields a single SERVER-kind span carries when a Razorpay checkout service emits it. Identifiers tie the span into the trace tree; timing answers "how long"; payload answers "what happened"; resource and scope answer "where and how was this measured".

The non-obvious thing about the schema is what is not in it. There is no field called "is_root" — a span is a root iff its parent_span_id is all zeros. There is no field called "duration" — duration is end_time - start_time, computed at query time. There is no field called "service" — that lives on the resource. There is no field called "trace name" — a trace has no name; it inherits the name of its root span at query time. Each absence is deliberate: the data model keeps span records small (so they ship cheaply) and pushes derived properties to the consumer side, which is where the cost of computation can be amortised across many queries.
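To make those absences concrete, here is a minimal sketch of how a consumer derives each missing property at query time, assuming span records shaped like the ones in this chapter (the helper names are illustrative, not part of any SDK):

# derived_at_query_time.py - is_root, duration, service, and trace name are all
# computed by the consumer, never stored on the span record
def is_root(span):
    return span["parent_span_id"] == "0" * 16          # all-zeros parent means root

def duration_ms(span):
    return (span["end_time_unix_nano"] - span["start_time_unix_nano"]) / 1e6

def service(span):
    return span["resource"]["service.name"]            # lives on the resource, not the span

def trace_name(spans):
    root = next(s for s in spans if is_root(s))         # a trace inherits its root span's name
    return root["name"]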

Build a span tree end-to-end — runnable in 80 lines

The fastest way to internalise the data model is to build it. The Python script below stands up a small payments-style trace generator (no Flask this time — just direct function calls so you can see the data structures clearly), emits spans into a JSONL file in OTLP-shaped JSON, and reconstructs the trace tree at the end with a Tempo-style ASCII waterfall. Why JSONL on disk rather than running a real OTLP collector: the goal here is to see the data, not the network. JSONL is the most readable persistent format, and reading the file with cat /tmp/spans.jsonl | jq shows exactly what the OTLP exporter would have shipped, minus the protobuf wrapper. Running this against a real Tempo or Jaeger backend just adds CPU and a UI tab between you and the data.

# build_a_trace.py — emit spans in OTLP-shaped JSON, reconstruct the tree
# at "query time", print a Tempo-style waterfall with timing and depth.
# no pip install needed - this script uses only the stdlib
import json, os, time, uuid, random
from contextlib import contextmanager
from collections import defaultdict

SPAN_LOG = "/tmp/spans.jsonl"
open(SPAN_LOG, "w").close()

# ---- per-thread "active span" stack — the propagation machinery in miniature ----
_active = []                              # stack of dicts; top = current parent

def _now_ns(): return time.time_ns()

@contextmanager
def span(name, kind="INTERNAL", attrs=None):
    parent = _active[-1] if _active else None
    sp = {
        "trace_id":  parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id":   uuid.uuid4().hex[:16],
        "parent_span_id": parent["span_id"] if parent else "0"*16,
        "name":  name, "kind": kind,
        "start_time_unix_nano": _now_ns(),
        "attributes": attrs or {},
        "resource": {"service.name": os.environ.get("SVC", "checkout")},
    }
    _active.append(sp)
    try:
        yield sp
        sp["status"] = "OK"
    except Exception as e:
        sp["status"] = "ERROR"; sp["error"] = str(e); raise
    finally:
        sp["end_time_unix_nano"] = _now_ns()
        _active.pop()
        with open(SPAN_LOG, "a") as f:
            f.write(json.dumps(sp) + "\n")

# ---- simulate a UPI mandate: gateway → orchestrator → fraud + npci ----
def fake_io(low_ms=2, high_ms=20):
    time.sleep(random.uniform(low_ms, high_ms) / 1000)

with span("POST /upi/mandate", kind="SERVER",
          attrs={"http.method": "POST", "http.route": "/upi/mandate"}) as root:
    fake_io(3, 8)
    with span("orchestrate_mandate", kind="INTERNAL"):
        fake_io(5, 12)
        with span("fraud_check", kind="CLIENT",
                  attrs={"net.peer.name": "fraud-svc"}):
            fake_io(15, 35)
        with span("npci_call", kind="CLIENT",
                  attrs={"net.peer.name": "npci-adapter"}) as npci:
            fake_io(40, 90)
            npci["attributes"]["npci.response_code"] = "00"
        with span("write_settlement", kind="CLIENT",
                  attrs={"db.system": "postgresql", "db.name": "payments"}):
            fake_io(8, 18)
    fake_io(2, 5)

# ---- "query time": load all spans for our trace_id, build the tree ----
trace_id = root["trace_id"]
spans = [json.loads(l) for l in open(SPAN_LOG)]
mine  = [s for s in spans if s["trace_id"] == trace_id]
by_id = {s["span_id"]: s for s in mine}
kids  = defaultdict(list)
for s in mine:
    kids[s["parent_span_id"]].append(s["span_id"])

t0 = min(s["start_time_unix_nano"] for s in mine)
total_ns = max(s["end_time_unix_nano"] for s in mine) - t0   # trace duration in nanoseconds
def bar(s, width=40):
    start = (s["start_time_unix_nano"] - t0) / total_ns * width
    dur   = (s["end_time_unix_nano"] - s["start_time_unix_nano"]) / total_ns * width
    return " " * int(start) + "█" * max(1, int(dur))

print(f"trace {trace_id} — {len(mine)} spans, "
      f"total {total_ns/1e6:.1f}ms\n")
def walk(sid, depth=0):
    s = by_id[sid]
    dur_ms = (s["end_time_unix_nano"] - s["start_time_unix_nano"]) / 1e6
    print(f"{'  '*depth}{s['name']:<22} {dur_ms:6.1f}ms  |{bar(s)}|")
    for c in kids[sid]:
        walk(c, depth + 1)
for root_id in kids["0"*16]:
    walk(root_id)

A representative run (numbers vary because the simulated I/O is randomised):

trace 9f4e2a0bdc3f7261d4e8b75c821ae8a2 — 5 spans, total 168.4ms

POST /upi/mandate         168.4ms  |████████████████████████████████████████|
  orchestrate_mandate     159.7ms  |██  ██████████████████████████████████  |
    fraud_check            27.3ms  |    ██████                              |
    npci_call              82.6ms  |          ████████████████████          |
    write_settlement       14.1ms  |                              ███       |

Per-line walkthrough. The line "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex is the trace-id propagation rule: if there is an active parent, inherit; otherwise mint a new 128-bit identifier. This is the single most important rule of the data model — get it wrong and your "trace" is actually 47 separate one-span traces. The line "parent_span_id": parent["span_id"] if parent else "0"*16 is the tree-link rule: every non-root span points at its parent's span_id, and roots point at all-zeros. That is what makes the tree-walk at the bottom (kids[s["parent_span_id"]].append(...)) correct: it inverts the parent pointer into a children list.

The _active stack is the in-process propagation primitive. When control flow enters a with span(...) block, that span becomes the new parent for any nested span; when control exits, the previous parent is restored. The OpenTelemetry SDK calls this exact structure a "Context" object and stores it in a thread-local (or contextvars for async) — the shape is identical to the one in this 80-line script.

Why a stack and not just a single "current span" variable: nested spans are common — a SERVER span contains an INTERNAL span which contains a CLIENT span which makes a downstream HTTP call. When the CLIENT span ends, the active parent must revert to the INTERNAL span, not jump back to the SERVER span or to nothing. A stack handles arbitrary nesting depth correctly without special cases. Async code complicates this — contextvars in Python and AsyncLocalStorage in Node are async-aware versions of the same primitive — but the data structure is the same.
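For async code, the same stack can ride in a contextvars.ContextVar so that each task carries its own chain of parents. This is a minimal sketch of the idea, not the OpenTelemetry SDK's actual implementation; the chain is deliberately an immutable tuple so that tasks that inherit a copy of the context never mutate each other's stack:

# async_active_stack.py - the _active stack carried in a contextvars.ContextVar,
# so each asyncio task inherits its parent chain at spawn time but pushes onto
# its own copy (a sketch of the idea, not the OpenTelemetry SDK's implementation)
import asyncio, contextvars
from contextlib import contextmanager

_active = contextvars.ContextVar("active_spans", default=())   # immutable tuple, not a list

@contextmanager
def span(name):
    chain = _active.get()
    sp = {"name": name, "parent": chain[-1]["name"] if chain else None}
    token = _active.set(chain + (sp,))    # "push": build a new tuple, leave the old one alone
    try:
        yield sp
    finally:
        _active.reset(token)              # "pop": restore the caller's chain

async def handle(i):
    with span(f"request-{i}"):
        await asyncio.sleep(0.01)         # another task may run here without interfering
        with span("db.query") as child:
            print(child["name"], "-> parent:", child["parent"])

async def main():
    await asyncio.gather(handle(1), handle(2))

asyncio.run(main())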

Context propagation — the wire format that makes a trace cross processes

The _active stack above only works inside one process. The moment your span's child is going to run in a different process, you need to serialise the parent context into a string, attach it to the outgoing request, and have the downstream service deserialise it back into a parent span_id. That string is the context, and the W3C standardised its wire format in 2020 as the traceparent HTTP header.

The format is traceparent: 00-<trace_id>-<span_id>-<flags>. The leading 00 is the version (only one defined so far). The trace_id is 32 hex characters (128 bits). The span_id is 16 hex characters (64 bits) — and crucially, this is the caller's span_id, which becomes the callee's parent_span_id when the callee creates its first span. The flags are 2 hex characters carrying a one-bit sampled flag (and 7 reserved bits). A real header looks like this:

traceparent: 00-9f4e2a0bdc3f7261d4e8b75c821ae8a2-3d51b07ef2c99814-01

That is the entire propagation payload that crosses the network. There is also an optional tracestate header for vendor-specific extensions and a baggage header for application-level key-value propagation, but neither is required to reconstruct the trace tree — traceparent alone is sufficient. The script below shows the full propagation cycle: a checkout service emits a traceparent, the payments service reads it, both spans land in the same tree, and the tree-walk at the end shows them stitched correctly across the process boundary.

# propagate_across_processes.py — show the W3C traceparent header doing its job.
# pip install flask requests
import json, threading, time, uuid, random
from flask import Flask, request
import requests

SPAN_LOG = "/tmp/cross_spans.jsonl"
open(SPAN_LOG, "w").close()

def emit(span):
    with open(SPAN_LOG, "a") as f: f.write(json.dumps(span) + "\n")

def parse_traceparent(h):
    if not h: return None, None
    parts = h.split("-")
    if len(parts) != 4 or parts[0] != "00": return None, None
    return parts[1], parts[2]      # (trace_id, parent_span_id)

def make_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def serve(name, port, downstream=None):
    app = Flask(name)
    @app.route("/handle")
    def handle():
        trace_id, parent_sid = parse_traceparent(request.headers.get("traceparent"))
        if trace_id is None:                  # we are the root
            trace_id = uuid.uuid4().hex
            parent_sid = "0"*16
        my_sid = uuid.uuid4().hex[:16]
        start = time.time_ns()
        time.sleep(random.uniform(0.01, 0.04))
        if downstream:
            requests.get(downstream, headers={
                "traceparent": make_traceparent(trace_id, my_sid)}, timeout=2)
        end = time.time_ns()
        emit({"trace_id": trace_id, "span_id": my_sid,
              "parent_span_id": parent_sid, "name": f"{name}.handle",
              "kind": "SERVER", "service": name,
              "start_time_unix_nano": start, "end_time_unix_nano": end})
        return {"ok": True, "trace_id": trace_id}
    threading.Thread(target=lambda: app.run(port=port, use_reloader=False),
                     daemon=True).start()

serve("payments", 9102)
serve("checkout", 9101, downstream="http://localhost:9102/handle")
time.sleep(0.4)

r = requests.get("http://localhost:9101/handle").json()
time.sleep(0.2)
print(f"trace_id: {r['trace_id']}\n")

from collections import defaultdict
spans = [json.loads(l) for l in open(SPAN_LOG)]
mine = [s for s in spans if s["trace_id"] == r["trace_id"]]
kids = defaultdict(list)
for s in mine: kids[s["parent_span_id"]].append(s)
def walk(s, d=0):
    dur = (s["end_time_unix_nano"] - s["start_time_unix_nano"]) / 1e6
    print(f"{'  '*d}[{s['service']:9}] {s['name']:18} sid={s['span_id'][:8]} "
          f"pid={s['parent_span_id'][:8]} {dur:.1f}ms")
    for c in kids[s["span_id"]]: walk(c, d+1)
for root in kids["0"*16]: walk(root)

Running it produces output like the following (identifiers and timings vary per run):

trace_id: 9f4e2a0bdc3f7261d4e8b75c821ae8a2

[checkout ] checkout.handle    sid=3d51b07e pid=00000000 47.2ms
  [payments ] payments.handle    sid=a0c47def pid=3d51b07e 24.8ms

The two spans came from two separate Flask apps (different threads of one process in this demo, but the protocol is the same as for two pods on different nodes). The link between them is exactly the traceparent header on the outgoing requests.get. Notice that checkout's span_id (3d51b07e) becomes payments' parent_span_id (pid=3d51b07e) — the wire format encodes the parent-of-the-next-span, and the callee's first action is to mint its own span_id whose parent is the value it received. This is the entire mechanism.

Why the wire format propagates the caller's span_id rather than the caller's name or some other identifier: span_ids are unique per span, so they unambiguously identify a node in the tree. A name like "checkout.handle" identifies the service but not which invocation of it — if two requests are in flight simultaneously, both produce spans named "checkout.handle" and their children would not know which parent they belong to. The span_id is the only identifier that disambiguates one specific node in one specific tree. That is why it goes on the wire, not the name.

The flags byte carries the sampled decision, and that decision must propagate untouched through every hop. If checkout decides to sample this trace (flags=01), payments must also keep its span; if checkout decides to drop (flags=00), payments must also drop. This is what guarantees no half-traces. A misbehaving service that re-evaluates the sampling decision per-hop produces partial trees where some spans are present and others are missing — the most common subtle bug in tracer implementations, and the reason every modern SDK ships with the propagator already wired into the auto-instrumentation.
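In terms of the illustrative parse and make helpers from the script above, honouring the upstream decision simply means carrying the flags field through unchanged instead of re-deciding at each hop (a sketch):

# propagate_flags.py - carry the upstream sampled decision through unchanged
def parse_traceparent(h):
    if not h:
        return None, None, None
    parts = h.split("-")
    if len(parts) != 4 or parts[0] != "00":
        return None, None, None
    return parts[1], parts[2], parts[3]              # (trace_id, parent_span_id, flags)

def make_traceparent(trace_id, span_id, flags):
    return f"00-{trace_id}-{span_id}-{flags}"        # reuse the incoming flags verbatim

trace_id, parent_sid, flags = parse_traceparent(
    "00-9f4e2a0bdc3f7261d4e8b75c821ae8a2-3d51b07ef2c99814-01")
sampled = bool(int(flags, 16) & 0x01)                # low bit of flags is the sampled bit
print(sampled, make_traceparent(trace_id, "a0c47def81b25320", flags))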

Figure: Context propagation — how a traceparent header travels with a request. A timeline diagram: the user's request reaches the gateway with no traceparent; the gateway mints a trace_id and a span_id and attaches a traceparent header to its outgoing call to checkout; checkout reads the header, mints its own span_id with the gateway's span_id as parent; payments repeats the pattern. One traceparent, three hops, one tree: the trace_id is invariant, each service mints its own span_id, parent_span_id is the previous hop's span_id, and flags=01 (sampled) propagates untouched with no per-hop re-decision.
Illustrative — one user request traverses gateway → checkout → payments. The trace_id stays constant; each service mints its own span_id; the traceparent header carries the caller's span_id, which becomes the callee's parent_span_id. Sampled flag (01) propagates untouched.

Real-system tie-ins — what the data model fixes in production

The argument for spending a chapter on the data model — rather than jumping to instrumentation guides — is that production tracing almost always fails at one of three places, and all three are data-model bugs.

The first failure mode is broken parent_span_id propagation. A service uses an HTTP library that does not honour the configured propagator (or that the OTel auto-instrumentation does not patch — httplib2 and some async libraries are common offenders), so its outgoing requests do not carry traceparent. The downstream service receives a request without the header, mints a new trace_id, and starts what looks like a fresh trace. In Tempo this shows up as trace fragmentation: the user's one request becomes three or four "traces", none of which is complete. The fix is to audit every outgoing call library and verify the propagator is wired in. Razorpay's 2023 OpenTelemetry rollout retro publicly cited this as the single biggest source of incomplete traces in their first three months — a requests.Session initialised before the OTel auto-instrumentation patched requests produced unparented child spans. The root cause was that RequestsInstrumentor().instrument() patches requests.Session.send at module-load time; a Session instantiated before that point bypasses the patch for its entire lifetime.
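A quick way to audit the propagator itself is the sketch below, assuming the opentelemetry-api and opentelemetry-sdk packages are installed; it checks that the globally configured propagator injects a traceparent, which is separate from verifying that a particular client library is actually patched (for that, capture the outgoing headers, as the cross-process script above does):

# audit_propagator.py - check that the globally configured propagator actually
# injects traceparent (this does not prove any particular HTTP library is patched)
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("propagation-audit"):
    headers = {}
    inject(headers)          # ask the global propagator to fill the carrier from the current context
    assert "traceparent" in headers, "no propagator wired: outgoing calls will orphan their children"
    print(headers["traceparent"])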

The second failure mode is the span that is started but never ended. A service starts a CLIENT span for an outgoing call by hand and never closes it (an exception skips the end_span call, or the process exits before the exporter flushes); the span never makes it to the collector, and the corresponding SERVER span on the downstream side is recorded but appears unparented. This produces orphan spans: visible in Tempo, but not connected to any tree. The fix is hygiene — always use a with block, always set status on exception, never start_span without a matching end_span. The OTel SDK helps by ending and emitting the span on exit from the with block even when the status is ERROR; raw use of the API without with blocks in Python is the failure mode.
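The hygiene pattern in OpenTelemetry API terms, sketched under the assumption that opentelemetry-api and opentelemetry-sdk are installed; call_npci is a hypothetical stand-in for the real downstream call:

# span_hygiene.py - the with block always ends the span; manual start_span does not
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

def call_npci():
    pass        # hypothetical stand-in for the real downstream call, which may raise

# Good: the context manager ends the span and marks it ERROR even if the body raises.
with tracer.start_as_current_span("npci_call") as span:
    span.set_attribute("net.peer.name", "npci-adapter")
    call_npci()

# Risky: a manually started span; if call_npci() raises before end(), the span is
# never exported and the downstream SERVER span shows up as an orphan.
manual = tracer.start_span("npci_call")
call_npci()
manual.end()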

The third failure mode is clock skew. Two services on different hosts have their wall clocks 50ms apart (NTP not running, or a drifted virtual machine, or a container with a problematic time source). A child span's start_time_unix_nano is earlier than its parent's start_time_unix_nano, which violates the tree's temporal invariant and confuses the Tempo UI's Gantt rendering — child bars start to the left of parent bars. The data model itself does not require synchronised clocks (the tree is built from parent_span_id, not timestamps), but the visualisation does. The fix is twofold: run NTP everywhere, and have the SDK record durations as relative offsets from the parent's start where possible. Production teams that run on AWS or GCP rarely see this; teams running on bare metal or on-prem with custom timekeeping see it constantly. Hotstar's 2024 Asia Cup retro mentioned that a flapping NTP server on the analytics pod's host caused a one-hour window of "impossible traces" with negative durations — a data quality issue that masqueraded as application bugs for two days.
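Skew is also easy to flag at query time once the parent pointers are in hand. A small sketch over the JSONL records the first script writes (on a single laptop it should print nothing, because every span shares one clock):

# detect_skew.py - flag children that appear to start before their parent
import json

spans = [json.loads(line) for line in open("/tmp/spans.jsonl")]
by_id = {s["span_id"]: s for s in spans}

for s in spans:
    parent = by_id.get(s["parent_span_id"])
    if parent and s["start_time_unix_nano"] < parent["start_time_unix_nano"]:
        skew_ms = (parent["start_time_unix_nano"] - s["start_time_unix_nano"]) / 1e6
        print(f"impossible child: {s['name']} starts {skew_ms:.1f}ms before its "
              f"parent {parent['name']} - suspect clock skew between hosts")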

A fourth failure mode worth knowing about is trace_id collision. The OpenTelemetry default is a cryptographic-random 128-bit trace_id, which has a 50% collision probability after 2^64 traces — at 1M traces/sec, you would hit one collision every 600,000 years. Some legacy systems use 64-bit trace_ids (Zipkin's original B3 single-header format), where collisions become possible after 2^32 traces — on a high-volume system, that is days, not centuries. If you mix 64-bit and 128-bit trace_ids in the same backend, the older spans end up in the wrong trees. Modern B3 supports 128-bit; W3C traceparent requires it. Audit any old Zipkin instrumentation and upgrade.
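A quick sanity check of that arithmetic, using the birthday bound (roughly 50% collision probability once about 2^(n/2) identifiers exist in an n-bit space); the traffic rate is an assumption:

# collision_math.py - birthday-bound arithmetic for trace_id collisions
rate = 10_000                                        # assumed traces per second on a busy system
for bits in (128, 64):
    ids_until_likely = 2 ** (bits // 2)              # ~50% collision odds near 2**(n/2) identifiers
    days = ids_until_likely / rate / 86400
    print(f"{bits}-bit trace_id: ~{days:,.1f} days until a collision is likely at {rate:,}/sec")
# prints roughly 21 billion days (tens of millions of years) for 128-bit ids
# and about 5 days for 64-bit ids at this rate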

A second concrete production example: Swiggy's 2024 internal blog described a sampling configuration bug where the head-based sampler in their gateway was reading the configured rate from a stale environment variable and effectively sampling at 0% during a deployment window. Tempo received the trace_id-and-flags but the flags were 00 (not sampled), so every collector dropped every span. The result was a four-hour outage of tracing visibility during which on-call engineers had to debug a real customer-impacting issue with logs alone. The data-model fix is straightforward — the sampled flag is propagated, and dropping behaviour is consistent — but the configuration error is invisible unless your monitoring dashboards include "sampled trace count" as a top-line metric, which most teams do not. The data model lets you build the right alarms; whether you build them is a discipline question.

A third example, smaller scale: a Cred SRE team in 2025 added customer.id as a span attribute (not a label; attributes can be high-cardinality without breaking anything, because spans are not aggregated by default) so that an on-call could query { customer.id = "abc123" } in TraceQL and pull all traces for one customer. This is a data-model-aware design — it works because spans in Tempo are stored block-columnar with attribute indexing, and customer.id adds zero cardinality cost to the tracing system (unlike Prometheus, where customer_id as a label is a cardinality bomb; see Cardinality, the master variable). The chapter on cardinality budgets discusses why; the data-model insight here is that trace data is request-scoped and naturally high-cardinality, while metric data is aggregate-scoped and breaks under high cardinality.

Common confusions

Going deeper

The OpenTelemetry data model spec — read it once, reference it forever

The single most valuable hour you can spend on tracing is reading the OpenTelemetry data-model specification end-to-end (opentelemetry.io/docs/specs/otel/trace/api/). It is roughly 30 pages and defines every field above with examples. The spec is the source-of-truth for all 12 supported language SDKs; reading it once means you can answer questions like "does the Python SDK encode span links the same as the Go SDK?" by reference rather than by experimentation. The bits worth memorising are: SpanKind enumeration (CLIENT, SERVER, PRODUCER, CONSUMER, INTERNAL — five values, no others), span attribute key-name conventions (lowercase dot-separated, e.g. http.request.method not HttpRequestMethod), and the SpanContext immutability rule (once a span is created, its context is fixed; you cannot retroactively change its parent). The spec also defines the OTLP wire format (next subsection) by reference.

The OTLP protobuf — what bytes actually flow on the wire

OpenTelemetry's wire protocol, OTLP, is a Protocol Buffers schema that the SDK serialises spans into and the collector deserialises out of. The relevant message is ExportTraceServiceRequest containing ResourceSpans containing ScopeSpans containing Span. Inspecting an actual OTLP byte payload is illuminating — the schema's nesting is hierarchical (resource → scope → span) precisely because services emit many spans per resource and many spans per scope, so factoring out the resource and scope cuts redundancy in the wire format dramatically. A 100-span batch from one service, one instrumentation library, takes about 18KB of OTLP-encoded protobuf — about 180 bytes per span average — versus roughly 3KB per span if each span were sent independently with its own copy of the resource. Reading opentelemetry-proto/python and dumping a captured batch with trace_pb2.TracesData().ParseFromString(bytes) shows the structure exactly. The collector's job is to deserialise, possibly transform (drop attributes, sample, add resource), and re-serialise to the next hop — which is why the OpenTelemetry Collector's processors configuration looks the way it does.
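A sketch of walking that nesting with the generated protobuf classes, assuming pip install opentelemetry-proto and a captured payload at the illustrative path /tmp/batch.bin:

# decode_otlp.py - walk resource -> scope -> span in a captured OTLP trace payload
# pip install opentelemetry-proto  (the path /tmp/batch.bin is illustrative)
from opentelemetry.proto.trace.v1 import trace_pb2

data = trace_pb2.TracesData()
with open("/tmp/batch.bin", "rb") as f:
    data.ParseFromString(f.read())

for rs in data.resource_spans:                       # one entry per emitting process
    attrs = {a.key: a.value.string_value for a in rs.resource.attributes}
    for ss in rs.scope_spans:                        # one entry per instrumentation library
        for sp in ss.spans:                          # the spans themselves
            print(attrs.get("service.name"), ss.scope.name, sp.name,
                  sp.trace_id.hex(), sp.span_id.hex())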

Span events vs span logs — and why span logs lost

Early Dapper-era and Zipkin-era tracers had a concept of span logs — log entries attached to a span, with their own timestamps and key-value payloads. OpenTelemetry redefined this as span events, which are simpler: just (timestamp, name, attributes) tuples that mark a moment within the span's timeline. The reason for the rename is that "log" was overloaded — a span log could be confused with a regular log line shipped to Loki, and the relationship between the two was murky. Span events are explicitly annotations on the span's timeline, not logs in the logging-backend sense. The right mental model: an event marks "something happened mid-span" (db.query.start, cache.miss, retry.attempt), and the data lives inside the span record in Tempo — it is queryable along with the span but does not create a separate document. When a regular log line correlates by trace_id with a span, it is a separate record in Loki, joined at query time. Both are useful; conflating them makes the architecture confusing.
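In the OpenTelemetry API an event is a single call on the live span. A minimal sketch, assuming opentelemetry-api and opentelemetry-sdk are installed; the attribute values are illustrative:

# span_events.py - events are annotations stored inside the span record
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("db.query payments") as span:
    span.add_event("db.query.start", {"db.statement": "SELECT ... FROM mandates"})
    # ... the query runs here ...
    span.add_event("cache.miss", {"cache.key": "mandate:abc123"})
# both events live inside this one span, queryable with it; neither is a Loki log line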

Span links and async fan-in — the case where the tree breaks down

The trace tree assumes every span has at most one parent. This is true for synchronous request-response chains. It breaks for async fan-in: a Kafka consumer that batches messages from 100 producers and processes them in one span has, conceptually, 100 parents. The data model handles this with links: the consumer span's links field references the 100 producer spans, but the consumer's parent_span_id points only at its immediate caller (the consumer's own dispatch). The tree-walk treats links as out-of-tree references — they are not edges of the tree, just pointers. Tempo and Jaeger render them as side-arrows in the UI rather than as parent-child edges. This is the right design for batch workloads (a Spark job, a Kafka consumer with batch-mode, an aggregation cron) but trips up engineers expecting "all related spans live in one tree". They live in one trace, but the tree is no longer fully connected; the links carry the rest. The OpenTelemetry semantic conventions document gives example uses for messaging, batch jobs, and outbox patterns.
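In API terms a link is supplied at span creation, built from another span's SpanContext. A sketch, assuming opentelemetry-api and opentelemetry-sdk are installed; the producer contexts here are stand-ins for contexts that a real consumer would extract from each message's carried traceparent:

# span_links.py - a batch-consumer span linking back to its producer spans
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.trace import Link
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# stand-ins: in a real consumer these contexts come from each message's traceparent
producer_contexts = []
for _ in range(3):
    with tracer.start_as_current_span("producer.publish") as p:
        producer_contexts.append(p.get_span_context())

# the consumer's parent is its own dispatch; the producers ride along as links,
# which Tempo and Jaeger render as side-arrows rather than tree edges
with tracer.start_as_current_span(
        "consume_batch", links=[Link(ctx) for ctx in producer_contexts]) as consumer:
    consumer.set_attribute("messaging.batch.message_count", len(producer_contexts))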

Trace context vs baggage — two headers, two purposes

traceparent carries the trace identity (trace_id, span_id, sampled flag) and is required for trace stitching. baggage is a separate, optional W3C header that carries arbitrary application-level key-value pairs (baggage: customer.tier=premium,region=ap-south-1) — values that should propagate through the entire trace but are not part of the trace identity. Baggage is what you use when a downstream service needs to know "is this premium customer?" without re-deriving it from the request body. Critically, baggage values are not automatically copied to span attributes — if you want them recorded as attributes too, you call set_attribute() explicitly in the span's instrumentation. Baggage values are cleartext on the wire (no encryption beyond TLS), so do not put PII or secrets in them. The OpenTelemetry SDK's baggage API has been stable since 2021; it is the right primitive for "one-time set, propagate through the request, read in any downstream service".
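A sketch of setting and reading baggage with the OpenTelemetry API, assuming opentelemetry-api and opentelemetry-sdk are installed (everything happens in one process here; across processes the baggage propagator serialises the values onto the baggage header during inject). Note the explicit set_attribute call, since baggage never copies itself onto spans:

# baggage_demo.py - set once, read anywhere downstream; copying onto a span is explicit
# pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import baggage, context, trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

token = context.attach(baggage.set_baggage("customer.tier", "premium"))   # one-time set
try:
    tier = baggage.get_baggage("customer.tier")       # any downstream code can read it back
    with tracer.start_as_current_span("apply_limits") as span:
        span.set_attribute("customer.tier", tier)     # not automatic: must be copied explicitly
finally:
    context.detach(token)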

Where this leads next

The data model is the spine of the next ten chapters. Every visualisation you build, every query you write, every sampling decision you make depends on what this chapter covered: span identity, parent linkage, context propagation, and the tree assembled at query time. When the next chapter compares b3 vs traceparent, the comparison is about how each format encodes the same three identifiers; when the chapter on tail-based sampling shows the collector buffering in-flight traces, the buffer is keyed on trace_id; when the chapter on Tempo's columnar storage discusses block layout, the blocks are partitioned by trace_id prefix. The data model is upstream of all of it.

References

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install flask requests
python3 build_a_trace.py            # the in-process tree
python3 propagate_across_processes.py  # the cross-process propagation
# Expected: a five-span tree from the first script (POST /upi/mandate root,
# orchestrate_mandate child, fraud_check + npci_call + write_settlement leaves);
# a two-span tree from the second showing checkout → payments stitched via
# the W3C traceparent header.