Wall: logs alone can't stitch a request across services

At 09:14:22 IST on a Wednesday a Swiggy customer named Asha taps "Pay" on a ₹687 biryani order. The checkout fails with a generic "something went wrong" toast. Asha files a support ticket, the ticket lands in a queue, and at 11:42 the on-call engineer Karan picks it up. Karan knows the user_id, the rough timestamp, and the failure was during checkout. He pulls up Loki, types {service=~".+"} |= "user_id=8924711" and gets back 47 log lines from 6 different services in a 4-second window around 09:14:22. The lines are real, the timestamps are accurate, the user_id matches. He cannot tell which of those 47 lines belonged to this checkout call. The cart-service emitted three "user lookup" lines in that window because Asha's app retried; the payments-service emitted four "gateway response" lines because two other users with adjacent user_ids were on the same pod; the rider-service emitted six lines because a separate ETA request from the same user fired at almost exactly the same moment. The data is all there. The thread that ties it into one story is missing.

This is the wall. It is the wall every team running microservices on logs alone hits, and it is the reason distributed tracing exists as a separate pillar. Logs are good at what happened in a process — perfect, even. They are useless at what happened across processes, because the only thing that ties a log line on service A to a log line on service B is a value the developer manually decided to write into both, and developers forget. This chapter is the close of Part 3 (Logs) and the gateway to Part 4 (Distributed Tracing). The argument is simple: at the network boundary, logs degrade from "the truth recorder" to "a pile of correlated-by-luck strings", and the only fix is a piece of data — the trace context — that propagates with the request.

A log line is a within-process record; it has no native concept of "the request this happened during". Across N services, identifying which log lines belonged to which request is an O(N²) ambiguity problem that user_id, timestamp, or any single application-level field cannot solve. The fix is propagated context — a trace_id and span_id that ride with the request across every network hop — which is exactly what distributed tracing standardised. Until that propagation exists, your logs are accurate but unstitchable.

Why a user_id and a timestamp are not enough

The first thing every team tries when they need to follow a request across services is to log a "request identifier" — something that should be the same across all services for the same user-facing action. The natural candidates are the user_id, the session_id, and a timestamp. All three fail, but they fail in different ways, and the failure modes are worth naming because they lead the team toward what actually works.

The user_id fails because a single user produces many simultaneous requests. Asha tapping "Pay" fires the checkout call, but her open Swiggy app is also refreshing the cart preview, polling for the order's ETA, and sending a heartbeat to the recommendations service. All four of those concurrent requests carry her user_id; all four hit different services; all four emit log lines that include user_id=8924711 in the same 2-second window. There is no field on the log line that says "I belonged to checkout, not to the cart preview." The user_id is shared across all of them.

The timestamp fails because a single millisecond contains thousands of unrelated events. At Swiggy's peak load — IPL match-end with food delivery promotions firing — the payments-service alone emits ~15,000 log lines per second across all pods. A 100ms window contains 1,500 unrelated payment events. Filtering the user_id grep down to "lines within 200ms of 09:14:22.451" still returns ~30 lines from 6 services, with no way to tell which subset is the failed checkout. Timestamp resolution does not save you because the events are intrinsically simultaneous; tightening the window past the actual request duration loses the lines you wanted.

The session_id fails because a session lives for hours and contains dozens of requests. A Swiggy session that started at 08:47 with the user opening the app is still open at 09:14 when checkout fails; in those 27 minutes the user browsed restaurants, added items to cart, removed an item, applied a coupon, and tapped "Pay" twice (the first tap timed out; the second went down the retry path and failed for a different reason). All of these logged the session_id. Filtering by session_id returns every request the user has made for the duration of the session, and the user has made dozens. The session_id is the unit "user's interaction with the app for this app-open"; the request is the unit "one HTTP call's path through the system". They are different units, and conflating them — which is what session-id-as-correlation-id does — defeats the purpose.

The combination — user_id AND timestamp_within_500ms AND endpoint=checkout — gets closer, but it is brittle in three ways: the endpoint label is per-service so it doesn't cross hops, the 500ms window is a guess that fails on slow days when the real request took 3 seconds, and the user_id is missing on internal service-to-service calls (the rider-service's call to the payment-gateway-validator does not know about the user). Each fix introduces a new failure case, and the team eventually realises they are reinventing — badly — the trace_id concept.

[Figure: Why user_id, timestamp, and session_id all fail to stitch a request across services. Three columns, one per candidate correlator. user_id (unit: this user, all activity) — a grep for user_id=8924711 returns 47 lines with all four concurrent requests entangled; the user_id is the user, not the request. timestamp ±200ms (unit: this clock window) — at ~15,000 log lines/sec on payments-service alone, a ±200ms filter mixes in unrelated traffic, and tightening to ±50ms loses the slow request; time is shared, not causal. session_id (unit: app-open to app-close) — a session grep returns dozens of requests folded over 27 minutes; a session is a session, not a request. None of these is the unit "one HTTP call's path"; that unit is what trace_id was invented to be.]
Illustrative — three candidate correlators that teams reach for, and the unit each one actually represents. None of them is the unit "one request as it traversed the system", which is exactly the gap distributed tracing fills.

Why the candidates fail in different ways: each candidate is a real, useful identifier — but for a different unit. user_id identifies the user, timestamp identifies the moment, session_id identifies the app-session. None of them identifies "one request, end-to-end, as it crossed five services and did or did not return successfully". That unit has no natural name in the application code; it must be invented and propagated. Inventing it is easy (uuid.uuid4() at the entry point); propagating it across every network hop without losing it on a queue, a retry, a fan-out, or a service boundary is the actual hard problem. Every team that thinks they can correlate by user_id + timestamp discovers within six months that the units do not match the question, and they end up writing — badly, ad-hoc, per-service — a worse version of the trace_id propagation that already has a standard.

A measurable demonstration — three Flask services, no propagation, broken correlation

The cleanest way to feel the wall is to stand up three real services that talk to each other over HTTP, log every event with loguru to a structured JSON file, and then try to reconstruct a single user's request after the fact using only the logs. The script below does exactly that — it runs a gateway → cart → payments request chain inside one Python process (each service is a Flask app on a different port), simulates concurrent traffic from three users hitting the gateway in the same 200ms window, and then prints the recovered correlation result.

# logs_alone_wont_stitch.py — three Flask services, structured JSON logs,
# concurrent traffic; show that user_id + timestamp cannot recover the
# request boundary cleanly.
# pip install flask loguru requests
import json, threading, time, random
from collections import defaultdict
from flask import Flask, request
from loguru import logger
import requests

LOG_PATH = "/tmp/services.jsonl"
logger.remove()
logger.add(LOG_PATH, format="{message}", serialize=True, enqueue=True)

# ---- Three services, each on its own port, each logging structured JSON ----
def make_service(name, port, downstream_url=None):
    app = Flask(name)
    @app.route("/handle")
    def handle():
        user_id = request.args.get("user_id")
        endpoint = request.args.get("endpoint", "checkout")
        logger.bind(service=name, user_id=user_id, endpoint=endpoint,
                    event="enter").info("entered handler")
        time.sleep(random.uniform(0.005, 0.040))   # variable inter-service latency
        downstream_ok = True
        if downstream_url:
            try:
                resp = requests.get(downstream_url,
                                    params=request.args.to_dict(), timeout=2)
                downstream_ok = resp.json().get("ok", False)
            except Exception as e:
                logger.bind(service=name, user_id=user_id, event="error",
                            err=str(e)).error("downstream call failed")
                downstream_ok = False
        time.sleep(random.uniform(0.001, 0.015))
        # each service fails 1-in-10 on its own; a downstream failure cascades
        ok = downstream_ok and random.random() > 0.1
        logger.bind(service=name, user_id=user_id, endpoint=endpoint,
                    event="exit", ok=ok).info("left handler")
        return {"ok": ok}
    threading.Thread(target=lambda: app.run(port=port, use_reloader=False),
                     daemon=True).start()

make_service("payments", 8003)
make_service("cart",     8002, downstream_url="http://localhost:8003/handle")
make_service("gateway",  8001, downstream_url="http://localhost:8002/handle")
time.sleep(0.5)

# ---- Three users hit the gateway in the same 200ms window, concurrently ----
USERS = [8924711, 8924712, 8924713]
def fire(uid, endpoint):
    requests.get("http://localhost:8001/handle",
                 params={"user_id": uid, "endpoint": endpoint}, timeout=3)
threads = []
for uid in USERS:
    for ep in ("checkout", "cart_preview", "checkout"):  # user 8924711 retries
        t = threading.Thread(target=fire, args=(uid, ep))
        threads.append(t); t.start()
        time.sleep(random.uniform(0.005, 0.060))   # within ~200ms total
for t in threads: t.join()
time.sleep(0.5)   # let async logs flush

# ---- Now: the on-call's reconstruction attempt, logs only, no trace_id ----
target_user = 8924711
target_ts_lo, target_ts_hi = time.time() - 1.5, time.time() + 0.5
matches = []
with open(LOG_PATH) as f:
    for raw in f:
        line = json.loads(raw)["record"]
        rec = line.get("extra", {})
        if str(rec.get("user_id")) != str(target_user):
            continue
        ts = line["time"]["timestamp"]
        if not (target_ts_lo <= ts <= target_ts_hi):
            continue
        matches.append((ts, rec.get("service"), rec.get("event"),
                        rec.get("endpoint"), rec.get("ok")))
matches.sort()
print(f"recovered {len(matches)} lines for user_id={target_user}")
print("by service:")
by_service = defaultdict(int)
for _, svc, _, _, _ in matches:
    by_service[svc] += 1
for svc, n in sorted(by_service.items()):
    print(f"  {svc}: {n} lines")
print("\nthe three checkout-or-preview requests are mixed together:")
for ts, svc, ev, ep, ok in matches[:18]:
    print(f"  {ts:.3f}  {svc:9}  {ev:6}  {ep:13}  ok={ok}")
print("\n--> which subset belonged to the *failed* checkout?")
print("    no field on these lines tells you. they are correlated by user, not by request.")

A representative python3 logs_alone_wont_stitch.py run on a 2026-vintage laptop produces:

recovered 18 lines for user_id=8924711
by service:
  cart: 6 lines
  gateway: 6 lines
  payments: 6 lines

the three checkout-or-preview requests are mixed together:
  1745588063.118  gateway    enter   checkout      ok=None
  1745588063.124  cart       enter   checkout      ok=None
  1745588063.131  payments   enter   checkout      ok=None
  1745588063.149  payments   exit    checkout      ok=True
  1745588063.156  cart       exit    checkout      ok=True
  1745588063.158  gateway    exit    checkout      ok=True
  1745588063.166  gateway    enter   cart_preview  ok=None
  1745588063.173  cart       enter   cart_preview  ok=None
  1745588063.181  payments   enter   cart_preview  ok=None
  1745588063.198  payments   exit    cart_preview  ok=True
  1745588063.204  cart       exit    cart_preview  ok=True
  1745588063.207  gateway    exit    cart_preview  ok=True
  1745588063.215  gateway    enter   checkout      ok=None
  1745588063.221  cart       enter   checkout      ok=None
  1745588063.232  payments   enter   checkout      ok=None
  1745588063.247  payments   exit    checkout      ok=False
  1745588063.249  cart       exit    checkout      ok=False
  1745588063.252  gateway    exit    checkout      ok=False

--> which subset belonged to the *failed* checkout?
    no field on these lines tells you. they are correlated by user, not by request.

Per-line walkthrough. The line logger.bind(service=name, user_id=user_id, endpoint=endpoint, event="enter").info("entered handler") is the per-service emission. It records every field the application knows — service name, user, endpoint, event type — and writes them as structured JSON. It does not record a request_id because no request_id has been generated. The line if str(rec.get("user_id")) != str(target_user): continue is the reconstruction filter — the on-call's grep, expressed in Python. It correctly narrows the corpus to the user. The line if not (target_ts_lo <= ts <= target_ts_hi): continue narrows by time. Together they recover 18 lines, which is more than zero (the data is there) but mixes three different requests from the same user. The three requests are visually separable in the printed output because the timestamps come in three clusters of six lines, but that separability is an accident — it relied on the requests being sequential rather than overlapping.

Why the visual clustering does not generalise: the demo's time.sleep(random.uniform(0.005, 0.060)) between fires happens to be larger than the request's internal latency (~30ms), so the requests serialise into clean clusters. Production traffic does not give you this. A real Swiggy peak fires concurrent requests for the same user — checkout in flight, cart preview in flight, ETA poll in flight — and their log lines interleave at sub-millisecond resolution. The on-call sees gateway-enter, gateway-enter, cart-enter, payments-enter, gateway-enter, cart-enter, ... and cannot tell which gateway-enter started which downstream cart-enter. The visual separation in the demo is an artefact of low concurrency; the wall is what you hit at real concurrency.

The demonstration captures the structural problem in under a hundred lines of Python: every service writes a complete, structured, queryable log; every grep recovers the right user; no recovery procedure can identify which request a given line belonged to without a field that does not exist. Re-run the script with the inter-fire sleep set to time.sleep(0) (perfectly concurrent traffic) and the output becomes unrecoverable — payments-enter lines appear in arbitrary order with no way to associate each one with the upstream gateway-enter that triggered it. That is the wall, in a Python script you can run on your laptop in 30 seconds.

The fix is one extra line at the gateway and one extra logger.bind() field everywhere downstream: generate a trace_id = str(uuid.uuid4()) at the entry point, propagate it via an HTTP header to every downstream call, and include it in every log line. Replace the script's requests.get(downstream_url, params=...) with requests.get(downstream_url, params=..., headers={"x-trace-id": trace_id}), have each service read the header and bind it, and the reconstruction filter becomes if rec.get("trace_id") != target_trace_id: continue — and now the recovered lines are exactly the right set, no more, no less. The technical change is small. The discipline change — making sure every service, every queue consumer, every retry, every async fan-out propagates the header — is the actual cost, and it is what Part 4 is about.
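
Here is that patch as a minimal, self-contained sketch — the same three services, with the gateway minting the id and every hop forwarding it. The x-trace-id header name is a local convention (the standards-track equivalent is the W3C traceparent header, covered in Part 4), and the ports are shifted so it can run alongside the original script.

# logs_with_trace_id.py — the demo above, patched with propagation. A sketch:
# uuid4 minted at the gateway, carried on an x-trace-id header (a local
# convention, not a standard), bound onto every log line downstream.
# pip install flask loguru requests
import threading, time, uuid
from flask import Flask, request
from loguru import logger
import requests

logger.remove()
logger.add("/tmp/services_traced.jsonl", format="{message}",
           serialize=True, enqueue=True)

def make_service(name, port, downstream_url=None, is_entry=False):
    app = Flask(name)
    @app.route("/handle")
    def handle():
        # the entry point mints the id; every other service reads the header
        trace_id = (str(uuid.uuid4()) if is_entry
                    else request.headers.get("x-trace-id", "MISSING"))
        log = logger.bind(service=name, trace_id=trace_id,
                          user_id=request.args.get("user_id"))
        log.info("entered handler")
        if downstream_url:
            requests.get(downstream_url, params=request.args.to_dict(),
                         headers={"x-trace-id": trace_id},  # the propagation
                         timeout=2)
        log.info("left handler")
        return {"ok": True}
    threading.Thread(target=lambda: app.run(port=port, use_reloader=False),
                     daemon=True).start()

make_service("payments", 8013)
make_service("cart",     8012, downstream_url="http://localhost:8013/handle")
make_service("gateway",  8011, downstream_url="http://localhost:8012/handle",
             is_entry=True)
time.sleep(0.5)
requests.get("http://localhost:8011/handle",
             params={"user_id": 8924711}, timeout=3)
time.sleep(0.5)
# the reconstruction filter is now exact:
#   if rec.get("trace_id") != target_trace_id: continue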

What logs are actually good at — and where the boundary sits

The argument in this chapter is not "logs are bad". Logs are good — the previous five chapters made the case for structured logging, retention tiers, sampling, and a sensible cost model. The argument is that logs are good at within-process observation and structurally inadequate for cross-process observation. The boundary is the network call. Inside a process, a log line records a moment in the program's execution that the engineer can later trace by reading the code and matching the line. Across processes, a log line records a moment in one program's execution and has no causal link to the moment in another program's execution that produced it.

The within-process strengths are real and worth re-stating because they explain why teams keep reaching for logs even after they have tracing. A single log line in a payments-service can carry the SQL query that ran, the gateway response payload, the retry count, the customer's KYC tier, the rupee amount, the currency, and a stack trace if an exception was raised — twenty fields, all in the same JSON blob, all tied to the same line of code. A trace span carries a name, a duration, a status, and ~10-20 attributes — and nobody puts a stack trace in a span. When something exotic happens (a NULL where it shouldn't be, a string where the parser expected JSON, a UPI gateway returning a 200 with an error body) the log line is what tells you. Tracing tells you the call happened and how long it took; the log tells you what was inside.

The cross-process weakness is also real, and the demonstration above pinpointed it: there is no field on a log line that records "the request this line belonged to" unless the developer manually arranged for one. The default — what every codebase does on day one — is no such field. The fix in distributed tracing is to make the trace_id and span_id part of the runtime's request context, so that any log line emitted within that context automatically carries them. The OpenTelemetry SDK's logging integration does this for you: from opentelemetry.instrumentation.logging import LoggingInstrumentor; LoggingInstrumentor().instrument(set_logging_format=True) and every logger.info() call emits the trace context as part of the line. Logs become trace-aware. The wall does not disappear; it becomes a door.
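
A minimal sketch of what that integration looks like end-to-end — stdlib logging rather than loguru, since that is what LoggingInstrumentor patches; the span name and log messages are illustrative:

# trace_aware_logging.py — stdlib log lines pick up the active span's ids.
# pip install opentelemetry-sdk opentelemetry-instrumentation-logging
import logging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.instrumentation.logging import LoggingInstrumentor

trace.set_tracer_provider(TracerProvider())
# patches the log record factory and the default format so every record
# carries otelTraceID / otelSpanID fields
LoggingInstrumentor().instrument(set_logging_format=True)

tracer = trace.get_tracer("payments")
with tracer.start_as_current_span("checkout"):
    logging.getLogger().warning("inside the span: line carries the trace ids")
logging.getLogger().warning("outside any span: trace id renders as zeros")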

[Figure: Where logs win, where logs hit the wall. A vertical wall labelled "the network call" splits the diagram. Within-process (logs win — a tape recording of one process): 20-field JSON per line; SQL, payloads, retry counts, KYC tier; stack traces on exception; exotic data shapes (NULL where forbidden, string-as-JSON); total ordering per emitter, so forensic reconstruction works. Verdict: keep logs, structured and sampled. Across-process (logs hit the wall — unstitchable without propagated context): no shared request identifier (user_id is the user, not the request); ordering only per emitter, no causal order across processes; retry / queue / async break the link; timestamp ambiguity at scale (15k lines/sec compresses time resolution). Verdict: add a propagated trace_id. The fix is one field on every line — and the discipline to keep it; that field, plus its propagation rules, is what Part 4 is about.]
Illustrative — the boundary is the network call, and the gap on the cross-process side is exactly the gap distributed tracing fills. Logs do not become wrong on the right side; they become *insufficient*, which is a different problem with a different fix.

The argument generalises beyond logs. Every observability primitive has a within-process strength and a cross-process weakness, and the cross-process fix in every case is some form of propagated context. Metrics are great per-pod and lose meaning across pods unless every pod tags the same service and region labels. Profiles are great per-process and useless across processes unless the profiler reads a shared trace context. The trace_id is not just a logging concept; it is the universal currency for tying anything observable on one machine to anything observable on another. Once you accept that, the architectural question becomes "where does the trace_id originate, how does it propagate, and which fields ride alongside it" — which is the agenda of Part 4 (Distributed Tracing) and Part 13 (OpenTelemetry Internals).

A small but important caveat: there are cases where logs alone do work. A monolith with one process and a database — the application most early-stage Indian startups run for their first three years — has no cross-process problem because there is no cross-process. The within-process tape recording is enough. The wall arises specifically when the system is split into multiple processes that talk over the network, and it deepens monotonically as the service count grows. A 3-service system can be debugged from logs alone with discipline (manual request_id field, one engineer's whole-system mental model). A 30-service system cannot, no matter the discipline. Most production Indian SaaS systems sit somewhere between — and the wall starts to bite around 8-12 services, exactly the count where teams typically discover they need tracing.

Why the wall sharpens at 8-12 services and not earlier or later: in a system with N services, the number of distinct (caller, callee) pairs grows roughly as N² in the worst case (full mesh) and as N·log(N) in the typical case (each service depends on a handful of others). Each pair is an opportunity for a propagation gap, and the cognitive load of remembering "did I add the trace header on the call from cart to inventory? what about the call from inventory to warehouse?" grows superlinearly with the pair count. At N=3 a single engineer can hold the whole graph in her head; at N=10 the graph is large enough that no single engineer knows every call, and the gaps appear at the boundaries between team-owned subsystems. The empirical 8-12 threshold is where most teams first encounter a postmortem in which "the trace stopped at service X because service X did not propagate the header" appears as a finding — at which point the team adopts an SDK that propagates by default, which moves the discipline cost from per-call to per-library and resets the threshold upward.

Edge cases — when logs alone fail in non-obvious ways

Five edge cases bend the wall in ways that are hard to anticipate from a clean three-service demo. Each one is the kind of pattern teams discover during an outage when their logs-only correlation breaks down and they cannot tell why.

Async fan-out — a payments-service that publishes a "settlement-request" message to Kafka and returns immediately, then a settlement-worker that consumes the message hours later, has no way to log a shared request identifier without propagating one through the message header. The producer's request is over; the consumer's request is brand new. The settlement that fails at 14:32 is logically the continuation of a checkout that succeeded at 09:14, but the logs from those two moments share no field. The fix is to put the trace_id in the Kafka message headers (which OpenTelemetry's Kafka instrumentation does automatically), but teams that bolt logging onto an existing async pipeline often forget the producer-side instrumentation and lose the link.
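
A sketch of the manual version with kafka-python (broker address and topic name are assumptions; OpenTelemetry's Kafka instrumentation does the equivalent automatically with the standard traceparent header):

# kafka_trace_header.py — carry the trace_id in the message headers so the
# consumer, hours later, can bind the same id on its own log lines.
# pip install kafka-python
import json, uuid
from kafka import KafkaProducer, KafkaConsumer

trace_id = str(uuid.uuid4())   # in real code: the current request's trace_id

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("settlement-requests",
              value=json.dumps({"order_id": 42, "amount": 687}).encode(),
              headers=[("x-trace-id", trace_id.encode())])  # rides the message
producer.flush()

consumer = KafkaConsumer("settlement-requests",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for msg in consumer:
    headers = dict(msg.headers or [])
    # every log line the worker emits for this message should bind this id
    print("processing with trace_id =",
          headers.get("x-trace-id", b"").decode())
    break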

Retries with new IDs — when a service generates a fresh request_id on each retry attempt (which a lot of older codebases do, treating retries as fresh requests for idempotency reasons), the log lines from attempt 1, attempt 2, and attempt 3 carry three different IDs. Filtering by any one of them gets you one attempt. Filtering by the user_id gets you all three attempts plus every other request the user made in that window. The right answer is to keep the trace_id stable across retries — the trace_id identifies the user-perceived operation, not the network attempt — but a per-attempt request_id is fine alongside the trace_id, as a span attribute. Distinguishing these two layers (logical operation vs network attempt) is something the trace data model handles well and the log-only model handles poorly.
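
A sketch of the two-layer scheme — one id per logical operation, one per network attempt; the header names and retried URL are illustrative:

# retry_ids.py — trace_id stable across retries; request_id fresh per attempt.
import uuid
import requests
from loguru import logger

trace_id = str(uuid.uuid4())          # the user-perceived operation
for attempt in range(1, 4):
    request_id = str(uuid.uuid4())    # this network attempt only
    log = logger.bind(trace_id=trace_id, request_id=request_id,
                      attempt=attempt)
    try:
        requests.get("http://localhost:8001/handle",
                     headers={"x-trace-id": trace_id,
                              "x-request-id": request_id},
                     timeout=2)
        log.info("attempt succeeded")
        break
    except requests.RequestException as exc:
        log.warning(f"attempt failed: {exc}")
# grep trace_id   -> all attempts of this operation, and nothing else
# grep request_id -> exactly one network attempt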

Service-mesh re-issue — when a service-mesh sidecar (Istio, Linkerd) re-issues a request after a transient 502, the application code never sees the retry; the sidecar handles it. The sidecar's access log records both attempts; the application's log records one. If the trace_id is application-generated and not mesh-aware, the second attempt's application log carries the same trace_id as the first (which is correct) but the engineer cannot tell from the application logs that two network attempts happened. The fix is to read the mesh-injected x-request-id header and emit it as a span attribute alongside the trace_id, so that retries are visible at the trace layer even when invisible to the application.

Multi-tenant cardinality on logs — a logging system that keys log streams by tenant (which Loki labels often do for cost-attribution) discovers that a single user-facing request crosses three tenants' streams, and the LogQL query has to fan out across all three. Without a trace_id, the fan-out is "all lines in tenant A within window W AND all lines in tenant B within window W", which has the same ambiguity as the user_id+timestamp filter. With a trace_id, the LogQL query becomes {tenant=~".+"} | json | trace_id="abc..." and the right lines come back regardless of which tenant emitted them. The cost angle is real: a multi-tenant log fleet without trace-id propagation forces every cross-tenant debug query to scan every tenant's stream.

Background batch jobs that read user-tagged data — a fraud-detection batch job that runs every hour and reads payments from the last hour processes thousands of users' data in a single process. The batch's logs are full of user_ids, but each user_id appears in the context of the batch's processing, not the original payment request. An on-call who greps user_id=8924711 after a fraud incident gets back lines from the batch run and lines from the original payment request and lines from any other request that user made — and the lines from the batch carry a different service field but no link to the original payment trace. Linking requires the batch to emit, per row processed, the original payment's trace_id as an attribute. The batch becomes a trace span itself (or a series of spans), and the link is explicit.
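
A sketch of the per-row link, assuming the payment rows were persisted with the trace_id of the request that created them (a schema decision, not a default):

# batch_link.py — a batch job re-emits each row's original trace_id so its
# log lines link back to the payment request that produced the row.
from loguru import logger

rows = [  # assumed schema: payments stored with their request's trace_id
    {"user_id": 8924711, "amount": 687, "trace_id": "4bf92f35..."},
    {"user_id": 8924712, "amount": 240, "trace_id": "a1b2c3d4..."},
]
for row in rows:
    logger.bind(service="fraud-batch", user_id=row["user_id"],
                origin_trace_id=row["trace_id"]).info("scored row")
# grep origin_trace_id=<id> now joins the batch's lines to the original trace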

A sixth pattern worth mentioning is the one most teams encounter first because it surfaces during local development.

Local dev — single process, no propagation needed, mistaken extrapolation — when a developer runs the whole stack locally on docker-compose, every service is on the same host, every log line is in the same docker logs stream, and the timestamp ordering is reliable enough that user_id-based correlation visually works. The developer concludes that logs alone are sufficient and ships to production. Production has 50× the concurrency, multiple regions with clock skew, async queues, and retries — and the same correlation breaks instantly. The lesson is that single-host concurrency is not a stand-in for multi-host concurrency; the correlation strategy that works in dev is not a tested correlation strategy. Production-realistic concurrency in a load test (using locust or vegeta to fire 1000 concurrent requests for the same user) catches this in dev rather than at 02:00 IST during an incident.

Going deeper

The propagation contract — what every service must do, and what it costs

Distributed tracing's correctness depends on every participant in the request path doing three things: read an incoming context (typically the W3C traceparent header), establish a span scoped to its work, and write an outgoing context on every downstream call (the same header, with the current span as the new parent). Miss any of those three on any service, and the trace gets a gap — the span tree visible in Tempo or Jaeger is missing a subtree, and the missing subtree is exactly the work that service did. The W3C Trace Context spec (traceparent and tracestate headers) is the standard; OpenTelemetry's auto-instrumentation handles it for the common HTTP, gRPC, and Kafka libraries. The cost of the propagation is roughly 50µs per service (parsing the header, generating the span, writing the outgoing header), which on a 5-service request adds 250µs to the end-to-end latency — usually invisible against the request's millisecond-scale total. The bigger cost is the discipline of keeping it intact: a service that swallows the context across an internal ThreadPoolExecutor (Python's concurrent.futures does not propagate context vars by default) loses the trace from that point on. Maintaining propagation is a continuous code-review task, and most observability postmortems trace one of their gaps to a missing with trace.use_span(...) block.
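
The three steps, sketched with the OpenTelemetry Python SDK's propagation API (the service and span names are illustrative):

# propagation_contract.py — read incoming context, span the work, write
# outgoing context. pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.propagate import extract, inject

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("cart-service")

def handle(incoming_headers: dict) -> dict:
    ctx = extract(incoming_headers)          # 1. read the W3C traceparent
    with tracer.start_as_current_span("cart.handle", context=ctx):  # 2. span
        outgoing: dict = {}
        inject(outgoing)   # 3. write traceparent with this span as new parent
        return outgoing    # send these headers on every downstream call

out = handle({"traceparent":
              "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"})
print(out)  # same trace_id, this service's span as the new parent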

What "context" means at a programming-language level — and why Python made it explicit

The core implementation of trace context across a process is a thread-local or context-local variable that holds the "currently active span". When the application calls logger.info("foo"), the logging framework consults that variable, retrieves the span, extracts the trace_id and span_id, and adds them to the log record. Python before 3.7 used threading.local(), which did not work for asyncio (an async coroutine could resume on a different task and lose the local). Python 3.7 introduced contextvars as the standard for asyncio-safe context, and OpenTelemetry's Python SDK uses contextvars.ContextVar for the active span. Other languages have analogous concepts — Go's context.Context (passed explicitly), Java's ThreadLocal plus the Javaagent's bytecode rewriting for thread-pool propagation, Rust's tokio::task_local! macro. The mechanism differs, the contract is the same: the active span is a piece of runtime state that the logging integration reads. Understanding which Python construct holds your active span (and whether your codebase's thread-pool usage will lose it) is the difference between "tracing works" and "tracing has gaps". The code to verify is from opentelemetry import trace; print(trace.get_current_span().get_span_context()) inside any worker, executor, or callback in your code paths — if the printed context is invalid (SpanContext(trace_id=0x0, ...)), the propagation is broken at that point and every log line emitted from there will be unstitchable.
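
A sketch of the failure mode this paragraph describes — a ContextVar set on the main thread is invisible inside a ThreadPoolExecutor worker unless the caller carries the context over explicitly (the variable here stands in for the SDK's active span):

# context_loss.py — contextvars do not flow into pool threads by default.
import contextvars
from concurrent.futures import ThreadPoolExecutor

active_span = contextvars.ContextVar("active_span", default=None)
active_span.set("trace_id=abc123")   # stands in for the SDK's active span

def worker():
    return active_span.get()         # what does the pool thread see?

with ThreadPoolExecutor() as pool:
    lost = pool.submit(worker).result()            # None: context lost
    ctx = contextvars.copy_context()               # snapshot caller's context
    kept = pool.submit(ctx.run, worker).result()   # 'trace_id=abc123'

print("without copy_context:", lost)
print("with copy_context:   ", kept)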

Sampling, the trace_id, and the asymmetry of "I have a trace_id but no spans"

A subtle interaction between logs and tracing: the trace_id is generated at the entry point regardless of whether the trace is sampled or dropped. A 1% head-based sampling rate keeps 1% of traces in Tempo but generates 100% of trace_ids — which means 100% of log lines have a trace_id attribute, but only 1% of those trace_ids resolve to anything in the trace backend. The on-call who greps a log line for its trace_id and pulls it up in Tempo gets an empty result 99% of the time. This is by design: tracing's storage cost grows with kept-trace volume, and 1% sampling cuts that cost by 100×. The mitigation is tail-based sampling — keep all errors, keep all slow requests, keep a 1-5% random sample of OK requests — so that the trace_ids most likely to be queried are the ones most likely to resolve. This is the topic of Part 5 (Sampling). The lesson for the logs-alone-can't-stitch chapter is that even with trace_id propagation, the storage discipline matters: a trace_id that points to nothing is only marginally better than no trace_id at all. Both layers — propagation and sampling — have to land for the wall to come down.
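
The asymmetry is easy to observe directly with the SDK's ratio sampler — a sketch, with the 1% ratio matching the example above:

# sampling_asymmetry.py — every request gets a trace_id; only ~1% of spans
# are recorded. pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

trace.set_tracer_provider(TracerProvider(sampler=TraceIdRatioBased(0.01)))
tracer = trace.get_tracer("demo")

recorded = 0
for _ in range(1000):
    with tracer.start_as_current_span("req") as span:
        ctx = span.get_span_context()
        assert ctx.trace_id != 0         # a trace_id exists either way...
        recorded += span.is_recording()  # ...but only ~1% will be kept
print(f"{recorded} of 1000 spans recorded")   # ~10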

When the wall looks like it has come down, but hasn't — Razorpay's 2024 retro

Razorpay's 2024 internal retro on a UPI-callback latency incident is a useful study because the team had trace_id propagation in place, the trace backend was healthy, and the on-call was still unable to stitch the failed callbacks across 7 services for the first 40 minutes of the incident. The reason was that the upstream NPCI gateway returned the callback POST without the traceparent header — NPCI is outside Razorpay's trust boundary and does not honour W3C trace context — so the callback handler started a new trace on each call, with no connection to the originating UPI mandate's trace. The fix was to extract the NPCI callback's txn_id (which Razorpay had emitted on the outgoing mandate request and which NPCI was contractually obliged to echo back) and use it as a baggage attribute on the new trace, plus a recording-rule that reconstructed the link via a join on txn_id in the trace backend. The general lesson — that trust-boundary crossings break propagation, and the fix requires a domain-specific identifier alongside the standard trace_id — is one most Indian fintechs eventually meet because the regulator-mandated network identifiers (UTR for IMPS, RRN for cards, txn_id for UPI) are the only things that survive the gateway. Designing the trace schema to carry those identifiers as baggage from day one saves this rediscovery later.

The bigger picture — propagated context as the universal observability primitive

The thread that runs through this chapter and into Part 4 is that propagated context is the universal observability primitive at the network boundary — the thing that every other primitive ends up needing whether or not its designers anticipated it. Logs need it (this chapter). Metrics need it whenever you want to attribute a metric to a specific request (exemplars are the mechanism). Profiles need it whenever you want to attribute CPU time to a specific user-facing operation (pyroscope's tags are the mechanism). Even synthetic monitoring tools that fire test requests need it to find their own logs and traces in the system. The trace_id is not "the tracing pillar's identifier"; it is "the request's identifier, exposed to whatever is observing". A team that thinks of it that way ends up with a coherent observability story across all four pillars; a team that thinks of it as "Tempo's primary key" ends up with disconnected silos that share a coincidental string. The next part of the curriculum is built on this distinction.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install flask loguru requests
python3 logs_alone_wont_stitch.py
# Expected: 18 lines for user_id=8924711, three checkout/preview requests mixed.
# No field tells you which subset is the failed checkout.
# Now add trace_id propagation: generate uuid4 at the gateway, pass via x-trace-id
# header, bind on every service, filter by trace_id. The mixing disappears.

Where this leads next

The next part of the curriculum (Part 4 — Distributed Tracing) is the answer to the question this chapter posed. The argument structure is that logs are the within-process tape recording, traces are the cross-process tape recording, and the two are joined by the propagated trace context. Part 4 will dissect the trace, the span, the span's relationship to its parent, sampling, baggage, and the wire format. Read this chapter as the motivation; read Part 4 as the resolution.

There is also a longer-arc point worth making explicit before leaving Part 3. The chapters of Part 3 — structured logging, JSON and schema drift, sampling, shippers, backends, the cost model, and this wall chapter — tell a single story: logs are a powerful primitive that rewards discipline and punishes neglect. The discipline is in three layers — what you emit (structured, sampled), how you ship it (Vector or Fluentd or Filebeat with explicit routing), and what you store it in (Elasticsearch or Loki or ClickHouse, picked against your query mix). Even with all three layers right, the cross-process wall remains, and it is structural — no amount of log-side work removes it. The structural fix is in a different pillar, which is exactly why observability has multiple pillars: each is structurally optimal for a different question. Part 4 starts the next pillar.

The closing thought is that the design of distributed tracing was not a "logs are bad, replace them" move; it was a "logs are correct but insufficient for cross-process questions, here is the missing piece" move. The two pillars compose: a trace tells you what happened across services, the logs (correlated by trace_id) tell you what happened inside each service. The on-call who can fluently move between the two — pull a trace, click into a span, see the span's logs in the same UI — is operating at the level the observability tooling has been pointing toward for the last fifteen years. That level is what Part 4 will train.

A final practical note for teams in the middle of this transition: the cheapest, lowest-risk first move is to start emitting a trace_id field on every log line before you stand up a trace backend. Generate the UUID at the gateway, propagate it via header, bind it via loguru or the OpenTelemetry logging integration, and the field is in your logs from day one. When you later turn on Tempo or Jaeger, your historical logs already have the correlation key — they were trace-ready before there was anything to correlate them to. Teams that defer the propagation work until "we are ready for tracing" end up with a sharp before/after boundary in their log corpus where pre-cutover logs cannot be stitched and post-cutover logs can. Bake the field in early; the trace backend is a separate decision you can make on its own timeline.
