Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Correlation IDs

It is 21:47 IST on a Friday. Aditi, the on-call SRE at PaySetu, gets a Slack ping from a merchant: "₹4,82,300 settlement failed at 21:31, no email, no error in our dashboard, money is gone." Aditi has 11 minutes before this becomes a Twitter post. She opens Grafana, finds the merchant's gateway request at 21:31:14.302, and sees the API gateway returned 200 OK. She switches to the settlement service logs — nothing matching that timestamp. She switches to the bank-rail adapter — 14,000 lines in that minute. She switches to Kafka consumer lag — fine. She has logs from every one of the 14 services this request touched, but she cannot tell which line in service B corresponds to which line in service A. The correlation ID is the one piece of metadata that makes the difference between "I can read 14 service logs in parallel and reconstruct one journey" and "I have an ocean of timestamped strings and no way to pair them up".

A correlation ID is a single short string — typically a UUID or a 64-bit random value — generated at the system edge (API gateway, mobile SDK) and threaded through every log line, every RPC header, every queue message, and every database write tag for a single user-initiated request. Without it, debugging across services is timestamp-archaeology; with it, a one-line log query reconstructs the entire journey. The mechanism is trivial; the failure is always in the propagation discipline — the one service that drops the header, the async hop that forgets to copy it, the retry that generates a new one.

What a correlation ID actually is, and what it is not

A correlation ID is one identifier per logical user request. It is born at the entry point — the moment a mobile app, web page, or upstream caller initiates work — and it travels with every byte of work that originates from that request, no matter how many services, queues, threads, or async hops it traverses. Every log line emitted by every service while handling that request carries the same string. Every outgoing RPC carries it as a header. Every Kafka or RabbitMQ message carries it in its envelope. Every database row written by that request carries it as a column or tag.

The promise is mechanical and modest: given any one log line, any one trace span, any one queue message, or any one database row, you can pivot to all the others touched by the same user request with a single substring search.

[Figure: A correlation ID propagating through a 7-hop request. A user request enters at the API gateway with correlation ID c8f4a2-7e91; the same ID propagates through the auth service, payment orchestrator, ledger service, fraud check, settlement queue (Kafka topic settlement.events, message header cid=c8f4a2-7e91), bank rail adapter, and notification service, stamped on each hop's log line, gRPC metadata, message header, or DB column. A vertical dashed line marks the synchronous-to-async boundary at the queue. Illustrative.]
One identifier, threaded through HTTP headers, gRPC metadata, Kafka message headers, and database column tags. The dashed line marks the synchronous→asynchronous boundary at the queue — the place where most correlation-ID propagation breaks in real systems. Illustrative.

A correlation ID is not a trace ID. They are related but different. A trace ID (W3C Trace Context, OpenTelemetry) identifies one trace, with structured parent-child spans showing causality and timing. A correlation ID is a flat string with no structure beyond identity — it just answers "do these two log lines / messages / rows belong to the same user request?". Most production systems carry both: the trace ID is sampled (kept for ~1% of requests so the trace store stays affordable), but the correlation ID is logged on every line of every request, so that even un-sampled requests remain debuggable through the log tier. Why both, when traces seem to subsume correlation IDs: trace data lives in the sampled trace store and is dropped for 99% of requests under realistic sampling budgets; the correlation ID lives in every log line, which is retained at ~50× the volume but is searchable for every request, even un-sampled ones. When a single merchant complains about one specific failure, that request was almost certainly in the un-sampled 99%. The correlation ID is what saves you.

A correlation ID is also not a user ID, a session ID, an order ID, or a transaction ID. Those identify entities that persist beyond a single request. The correlation ID identifies one specific user-initiated workflow instance, even if the same user makes the same call again three seconds later. The repeat call gets a new correlation ID. (User ID and order ID should also be logged on every line — they are joins to other dimensions, not replacements for the correlation ID.)

How the propagation actually works — and where it breaks

The implementation has two halves: generation at the entry point and propagation at every hop. Both are simple to write and easy to break.

Generation happens once per request, at the outermost edge of the system: an API gateway, a load balancer with a Lua filter, the mobile SDK before its first network call, or the cron driver that initiated a batch job. The ID is generated with a CSPRNG (or uuid.uuid4() in Python), and it is the first thing the gateway does, before any auth, before any routing, before any business logic. Why the edge, and why first: if you generate the ID inside service A, then service A's own log lines that fired before the ID was generated are unjoinable. If you let service B generate one when it discovers service A didn't send one, you now have two correlation IDs for one request, and the join key is broken. The earliest-possible birth point is the rule.

Propagation happens on every hop. The convention is to use the HTTP header X-Correlation-ID (or X-Request-ID, used interchangeably; pick one and enforce it across the org) for synchronous calls, gRPC metadata for gRPC, and a message-envelope header for async queues. Every service has a middleware that extracts the header on inbound requests, stores it in a context-local (Python contextvars, Go context.Context, Java MDC), and re-attaches it to every outbound call.

# correlation_propagator.py — minimal correlation-ID middleware in asyncio.
# Run: python correlation_propagator.py
import asyncio, contextvars, uuid, logging, json

cid_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("cid", default="-")

class CIDFormatter(logging.Formatter):
    def format(self, record):
        record.cid = cid_ctx.get()
        # LogRecord has no asctime attribute until a formatter computes it, so do that here.
        asctime = self.formatTime(record, self.datefmt)
        return f"{asctime} svc={record.name} cid={record.cid} {record.getMessage()}"

logging.basicConfig(level=logging.INFO)
for h in logging.root.handlers:
    h.setFormatter(CIDFormatter())

async def inbound(headers: dict, log: logging.Logger):
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())[:8]
    cid_ctx.set(cid)
    log.info("received request")
    await downstream_call(log)
    log.info("returned 200")

async def downstream_call(log: logging.Logger):
    headers = {"X-Correlation-ID": cid_ctx.get()}
    log.info(f"calling ledger with headers={json.dumps(headers)}")
    await asyncio.sleep(0.01)  # network
    log.info("ledger returned 200")

async def main():
    gw = logging.getLogger("api-gateway")
    # request 1: client supplied a cid
    await inbound({"X-Correlation-ID": "c8f4a2-7e91"}, gw)
    # request 2: client did not supply one — gateway mints
    await inbound({}, gw)

asyncio.run(main())

Realistic output:

2026-04-28 21:47:03,120 svc=api-gateway cid=c8f4a2-7e91 received request
2026-04-28 21:47:03,121 svc=api-gateway cid=c8f4a2-7e91 calling ledger with headers={"X-Correlation-ID": "c8f4a2-7e91"}
2026-04-28 21:47:03,131 svc=api-gateway cid=c8f4a2-7e91 ledger returned 200
2026-04-28 21:47:03,131 svc=api-gateway cid=c8f4a2-7e91 returned 200
2026-04-28 21:47:03,131 svc=api-gateway cid=4f3a91d2 received request
2026-04-28 21:47:03,131 svc=api-gateway cid=4f3a91d2 calling ledger with headers={"X-Correlation-ID": "4f3a91d2"}
2026-04-28 21:47:03,142 svc=api-gateway cid=4f3a91d2 ledger returned 200
2026-04-28 21:47:03,142 svc=api-gateway cid=4f3a91d2 returned 200

Walkthrough. The cid_ctx ContextVar is the load-bearing primitive — it is per-asyncio-task, automatically carried across await points, and isolated between concurrent requests. The CIDFormatter reads it on every log call, so every line emitted while handling a request inherits the current cid without the application code having to thread it through. The inbound middleware extracts the header if present (request 1) or mints a new one (request 2) — the "first hop in the system" branch. The downstream_call function reads the same ContextVar to attach the header on the outgoing call. Those four steps — extract on inbound, store in the ContextVar, read in the formatter, re-attach on outbound — are the entire mechanism. Why ContextVar and not a thread-local: in asyncio, multiple coroutines run on the same OS thread, so a thread-local would collide between concurrent requests. contextvars is async-aware — when one coroutine suspends and another resumes, Python swaps in that task's context, so each request sees its own cid value.
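
The example above runs its two requests sequentially, so it does not exercise the isolation claim. A minimal sketch of that isolation — two concurrent requests on one event loop, each keeping its own cid across an await — might look like this (the cid values are the illustrative ones from above):

# concurrent_cid_demo.py — sketch: per-task ContextVar isolation on a single OS thread.
import asyncio, contextvars

cid_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("cid", default="-")

async def handle(cid: str) -> None:
    cid_ctx.set(cid)                 # set inside this request's task
    await asyncio.sleep(0.01)        # suspend; the other request runs meanwhile
    assert cid_ctx.get() == cid      # still our own value after resuming
    print(f"cid={cid_ctx.get()} survived the await")

async def main() -> None:
    # gather() wraps each coroutine in its own task; each task gets its own context copy,
    # so the two set() calls cannot clobber each other the way a thread-local would here.
    await asyncio.gather(handle("c8f4a2-7e91"), handle("4f3a91d2"))

asyncio.run(main())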

The propagation has three failure modes that account for the vast majority of "missing correlation ID" incidents:

  1. A service that doesn't propagate. A team writes a service in a language without the org's standard middleware library, or copies an older codebase that pre-dates the convention. Inbound logs show the cid; outbound calls don't carry it; the chain breaks at that node. Detection: a query that joins logs by cid and counts service-distinct hits — services that never appear are either not handling requests, or not propagating.
  2. The async-boundary drop. A request hits a queue. The producer code path was instrumented to read the ContextVar and put it on the message header. But the consumer code path runs in a different process, with a fresh ContextVar that defaults to - until the first message arrives — and on cold-start, the consumer's bootstrap code emits 30 log lines with cid=- before it pulls the first message. Worse, if the consumer forgets to re-set the ContextVar after pulling the message, every line it emits during the message's processing is also cid=-. The fix is symmetrical to inbound HTTP middleware, but the discipline of "every consumer is also a service edge" is what most teams forget.
  3. The retry that generates a new one. A service times out calling a downstream and retries, but the retry library strips or regenerates the headers on the second attempt. Now the original request's logs and the retry's logs carry different correlation IDs, and reconstructing the full journey requires joining via a secondary identifier (order ID, idempotency key). The fix is a retry middleware that explicitly preserves the cid header — a minimal sketch follows this list.
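
A minimal sketch of the failure-mode-3 fix: a retry wrapper that rebuilds the cid header from the ContextVar on every attempt. The flaky_send function and the ledger URL are illustrative stand-ins for a real HTTP client and endpoint.

# retry_preserving_cid.py — sketch: retries keep the original correlation ID.
import asyncio, contextvars

cid_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("cid", default="-")

async def call_with_retries(send, url: str, attempts: int = 3):
    last_exc = None
    for attempt in range(1, attempts + 1):
        # Rebuild headers on *every* attempt, so a client that clones or strips
        # headers between retries cannot silently drop the cid.
        headers = {"X-Correlation-ID": cid_ctx.get(), "X-Retry-Attempt": str(attempt)}
        try:
            return await send(url, headers)
        except TimeoutError:
            last_exc = TimeoutError(f"attempt {attempt} timed out")
            await asyncio.sleep(0.05 * attempt)  # crude linear backoff
    raise last_exc

async def main():
    cid_ctx.set("c8f4a2-7e91")
    calls = {"n": 0}

    async def flaky_send(url, headers):          # stand-in for a real HTTP call
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError("downstream timed out")
        return f"200 OK attempt={headers['X-Retry-Attempt']} cid={headers['X-Correlation-ID']}"

    print(await call_with_retries(flaky_send, "https://ledger.internal/settle"))

asyncio.run(main())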

Production patterns: where to put it, what to log it as

The cid lives in five places in a mature system, and each placement has a small but important convention:

[Figure: Where the correlation ID lives — five rows. 1. HTTP header — X-Correlation-ID: c8f4a2-7e91, synchronous service-to-service calls (REST, REST-over-HTTP/2). 2. gRPC metadata — lowercase x-correlation-id key in metadata.MD, same rules as HTTP. 3. Message envelope — Kafka headers, RabbitMQ AMQP headers, SQS message attributes. 4. Structured log — top-level JSON field cid on every line, never inside the message string. 5. DB row tag — column cid TEXT on transactional tables, so an incident can later be joined from row to logs. Illustrative.]
Treating the cid as a first-class field at each layer — header, metadata, envelope, structured log, DB column — is the discipline that makes incident archaeology mechanical instead of forensic.

The structured-log field is the most overlooked. Many teams emit logger.info(f"processed payment for cid={cid}") — the cid is in the message string, which is searchable but not indexable. The right pattern is logger.info("processed payment", extra={"cid": cid, "merchant_id": m_id, "amount_paise": amount}) — the cid is a top-level structured field, indexed by the log backend, and queryable in O(log N) instead of O(N) full-text scan. The difference is incident-defining: at 50 TB/day of logs, a full-text scan for "cid=c8f4a2-7e91" takes 15 minutes; an indexed equality query on cid takes 200 ms.
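
A minimal sketch of the structured variant — a JSON formatter that emits cid as a top-level field and copies through fields passed via extra=. The field names (merchant_id, amount_paise, order_id) follow the illustrative example above; a real deployment would use the org's log schema.

# json_cid_logs.py — sketch: cid as an indexable top-level JSON field, not a substring.
import contextvars, json, logging

cid_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("cid", default="-")

class JSONFormatter(logging.Formatter):
    PASSTHROUGH = ("merchant_id", "amount_paise", "order_id")  # fields accepted via extra=

    def format(self, record):
        doc = {
            "ts": self.formatTime(record),
            "svc": record.name,
            "level": record.levelname,
            "cid": cid_ctx.get(),          # top-level field -> indexed equality query
            "msg": record.getMessage(),
        }
        for key in self.PASSTHROUGH:       # extra= values land as attributes on the record
            if hasattr(record, key):
                doc[key] = getattr(record, key)
        return json.dumps(doc)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
log = logging.getLogger("settlement")
log.addHandler(handler)
log.setLevel(logging.INFO)

cid_ctx.set("c8f4a2-7e91")
log.info("processed payment", extra={"merchant_id": "m_482", "amount_paise": 48230000})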

The DB-column placement is the rarest and most valuable. PaySetu's settlement service writes a cid TEXT column on the payments table. When Aditi gets the merchant complaint at 21:47, she runs SELECT cid FROM payments WHERE merchant_id=? AND amount_paise=48230000 AND created_at BETWEEN '21:30' AND '21:35' and gets one cid in 40 ms. She pivots that cid into the log query, gets every log line across all 14 services in 800 ms, and has the timeline reconstructed by 21:51. Without the DB-column placement, she has to start from a guessed timestamp and grep — easily 10× slower, and "she guessed wrong on the timestamp" is a real failure mode that turns 11 minutes into 45.
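
A sketch of that pivot using SQLite in place of the real payments database; the table layout and values are illustrative.

# cid_column_pivot.py — sketch: from a merchant complaint to one cid via the DB column.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE payments (
    id INTEGER PRIMARY KEY,
    merchant_id TEXT, amount_paise INTEGER, created_at TEXT,
    cid TEXT NOT NULL)""")
db.execute("INSERT INTO payments VALUES (1, 'm_482', 48230000, '2026-04-28 21:31:14', 'c8f4a2-7e91')")

# Pivot: merchant + amount + time window -> the single correlation ID to search logs for.
(cid,) = db.execute(
    "SELECT cid FROM payments WHERE merchant_id = ? AND amount_paise = ? "
    "AND created_at BETWEEN ? AND ?",
    ("m_482", 48230000, "2026-04-28 21:30:00", "2026-04-28 21:35:00"),
).fetchone()
print(cid)   # -> c8f4a2-7e91; this string is the log query across all 14 services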

Common confusions

  • "A correlation ID is the same as a trace ID" A trace ID (W3C Trace Context) is part of a structured causality graph with parent-child spans, sampled at ~1% in production for cost reasons. A correlation ID is a flat identity string logged on every line of every request, sampled at 100%. Most mature systems carry both — the trace ID for the deep-debug case where the request was sampled, the correlation ID for the 99% of incidents where the user-affecting request was not.

  • "A correlation ID is the same as a user ID or session ID" A user ID identifies the human; a session ID identifies the login. A correlation ID identifies one specific request workflow instance. The same user making the same API call twice gets two different correlation IDs. All three should be logged together — they answer different questions ("show me everything for user 482", "show me everything in this session", "show me everything for the request that just failed").

  • "We can just join logs by timestamp" Timestamps drift between hosts by 1–50 ms (NTP-best-case) or seconds (NTP-broken-case, see wall-clocks-and-ntp). Two requests arriving 4 ms apart are indistinguishable by timestamp on a 47-service fan-out. Worse, retries and async hops scramble the ordering — a request that started at 21:31 may emit log lines at 21:34 from the retry consumer. Timestamp join is what you do before you have correlation IDs; it does not work past 3 services.

  • "X-Request-ID and X-Correlation-ID are different things" They are the same thing under different names. Some orgs use X-Request-ID (Heroku, AWS), some use X-Correlation-ID (Microsoft, Spring Cloud). A few orgs distinguish: X-Request-ID is per-hop (a new ID for each service-to-service call), X-Correlation-ID is end-to-end. The end-to-end version is what this article is about. Pick one and enforce it across the whole org — having both with subtly different semantics is a guaranteed source of incidents where one team's logs are joinable and another team's are not.

  • "Generating a UUID is expensive — we should use a shorter ID" A uuid.uuid4() call is ~1 microsecond. At 100k requests/sec, that is 100 ms of CPU per second per host — about 0.01% of one core. The actual cost concern is log storage: a 36-character UUID multiplied by 50 log lines per request × 100k requests/sec × 86,400 seconds/day = 15 GB/day of cid bytes alone. Some orgs trim to 16 hex characters (64 bits of entropy, ~1 in 18 quintillion collision over a year, 2.7 GB/day). The collision math always wins; UUID-trimming for storage is the only valid optimization.

  • "Correlation IDs solve observability" They solve the join problem — they tell you which log lines belong together. They do not solve cardinality budgets (metrics break differently), retention costs, sampling decisions, or causality (a span tree shows who called whom; cid alone does not). Treat the cid as the floor of observability infrastructure: necessary, not sufficient.

Going deeper

W3C Trace Context — the standard that subsumes most homegrown correlation-ID schemes

The W3C Trace Context spec (Recommendation, 2020) defines two HTTP headers that carry both a trace ID and the per-hop span: traceparent: 00-{32-char-trace-id}-{16-char-parent-span-id}-{2-char-flags} and tracestate for vendor-specific extensions. The trace ID portion (32 hex chars / 128 bits) is the modern correlation ID — it serves both purposes when you adopt OpenTelemetry. The two-header design is intentional: traceparent is the standardised, mutable-per-hop part; tracestate is the extensible part where vendors and orgs add their own keyed values. A modern PaySetu-scale system runs OpenTelemetry SDKs that do trace-context propagation automatically across HTTP, gRPC, and Kafka — a service authored against the SDK gets correlation propagation "for free" without writing middleware. The teams that still maintain hand-rolled X-Correlation-ID schemes are usually pre-2020 codebases that have not yet migrated, or polyglot services where one of the languages has weak OTel support.
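
A minimal sketch of minting a spec-shaped traceparent value, whose trace-id portion doubles as the flat correlation ID. This is generation only; a real service would use the OpenTelemetry SDK's propagators rather than hand-rolling the header.

# traceparent_sketch.py — sketch: the W3C traceparent layout, hand-assembled.
import secrets

def new_traceparent(sampled: bool = False) -> str:
    trace_id = secrets.token_hex(16)    # 32 hex chars / 128 bits — doubles as the cid
    parent_id = secrets.token_hex(8)    # 16 hex chars / 64 bits — this hop's span
    flags = "01" if sampled else "00"   # 01 = sampled, 00 = not sampled
    return f"00-{trace_id}-{parent_id}-{flags}"

header = new_traceparent()
print(header)                 # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00
print(header.split("-")[1])   # the trace-id portion — usable as the flat correlation ID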

The async-boundary problem and the ContextVar-restoration discipline

Every queue consumer is a fresh process with a fresh ContextVar. The "right" pattern for Kafka or RabbitMQ consumers in Python: a wrapper around the message-handling loop that reads the cid header from the message envelope, sets the ContextVar, calls the handler, then resets it. Skipping this step is the single most common source of "the chain breaks at the queue" incidents. CricStream's recommendation engine had this bug for 6 months: every recommendation event landed in Kafka with a cid header, but the consumer logged everything as cid=- because the bootstrap library's message dispatcher predated the cid convention. The fix was 14 lines; the diagnostic — figuring out why one specific service was a black hole in incident timelines — took 4 sprints and three failed war-room sessions before someone read the consumer-bootstrap code.
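
A minimal sketch of that wrapper. The header key and the poll-loop shape are assumptions — real Kafka clients deliver headers as byte pairs, and the decoding depends on the client library.

# consumer_cid_wrapper.py — sketch: every consumer is also a service edge.
import contextvars, uuid

cid_ctx: contextvars.ContextVar[str] = contextvars.ContextVar("cid", default="-")

def handle_with_cid(headers: dict, payload: dict, handler) -> None:
    # Restore the cid from the message envelope; mint one only if the producer
    # dropped it, so the break at least shows up as a fresh cid rather than "-".
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())[:8]
    token = cid_ctx.set(cid)
    try:
        handler(payload)            # every log line emitted in here inherits this cid
    finally:
        cid_ctx.reset(token)        # back to "-" between messages, so idle lines stay honest

# Usage with any poll loop (the client library and header decoding are assumptions):
#   for msg in consumer:
#       handle_with_cid(decode_headers(msg), msg.value, process_settlement)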

Sampling and storage: the cid is the cheapest piece of metadata you ship

A 16-byte cid added to every log line costs ~3% extra storage. A 16-byte cid added to every Kafka message costs ~1% extra wire. A cid TEXT column on a transactional table costs ~16 bytes per row (negligible vs typical row width). The economics are not even close — there is no real cost reason to drop the cid from any layer. Compare this to traces, where keeping every span at PaySetu scale would cost ~$5M/year — the cid serves as the "always-on, always-cheap" join key, and traces serve as the "deep-dive when you need it, sampled" mechanism. The mental model: cid is the permanent steel cable through every record; trace is the inspection lamp you turn on for 1% of records.

Reproduce on your laptop

# Reproduce the propagator on your laptop
python3 -m venv .venv && source .venv/bin/activate
# Stdlib only — no extra packages needed
python3 correlation_propagator.py
# Tail the output and verify each "received" line carries through to "returned 200"

Where this leads next

Correlation IDs are the foundation; the next chapters in Part 18 build the layers above them: structured logging (so the cid is a queryable field, not a substring), distributed tracing (so the cid grows into a causality graph), and the SLO/SLI vocabulary that makes incident response actionable. Each layer assumes the cid is already correct — if your cid propagation is broken, no downstream observability investment pays off, because every dashboard, every alert, every incident-pivot tool ultimately reduces to "give me everything for cid=X".

The thread that runs from here through Part 18 is one repeated claim: observability across services is an identity problem, not a logging-volume problem. The cid is the identity; the rest of Part 18 is what you build on top of it.

References

  1. W3C, Trace Context (Recommendation 2020) — the modern standard that subsumes ad-hoc correlation-ID headers.
  2. OpenTelemetry Specification, Context Propagation — the canonical SDK-level propagation contract across HTTP, gRPC, and message queues.
  3. Cindy Sridharan, Distributed Systems Observability (O'Reilly 2018), Chapter 4 — the canonical "correlation IDs are the floor" framing.
  4. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" (Google Technical Report dapper-2010-1) — origin of trace-ID propagation at hyperscale; correlation IDs are the structureless precursor.
  5. Charity Majors, "Why I Strongly Prefer Structured Events Over Logs" — the argument for cid as a top-level structured field, not embedded in the message body.
  6. Heroku, Request IDs — the public reference for the X-Request-ID convention.
  7. AWS, X-Ray Trace Header documentation — the X-Amzn-Trace-Id variant, used across AWS-native services.
  8. See also: wall: observability in distributed systems is a data problem, wall: clocks and NTP.