The Dapper paper, 2010

In 2008 a Google engineer named Ben Sigelman could not figure out why a single web search was sometimes 800ms slower than usual. The query touched roughly two thousand servers across the index-serving fleet — frontends, mixers, leaf nodes, ad-mixers, spelling-correction services, image servers — and the slow ones were different every time. Logs from any one machine were useless. Logs from all two thousand machines did not align. The team had spent months hand-stitching logs to chase slowdowns, using ad-hoc request IDs that half the binaries did not propagate, and the on-call rotation was burning out. Sigelman and colleagues — among them Mike Burrows and Luiz Barroso — wrote a tracing system that was always-on, aggressively sampled in production (the default rate was 1 in 1024), propagated through every Google RPC library by default, and produced a span tree for any query the team chose to inspect. They called it Dapper.

In April 2010 they published the design as Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — a 14-page Google Research technical report that became the founding document of the entire distributed-tracing field. Every tracer SDK you can pip-install today (opentelemetry-sdk, jaeger-client, py_zipkin) is a direct descendant; the trace_id, span_id, and parent_span_id fields those SDKs spend their entire codepath shuffling around were defined in that paper. This chapter is what is in it, what they got right, what they got wrong, and why a paper from 2010 still dictates how Razorpay traces a UPI mandate in 2026.

Dapper introduced three ideas that became the entire field's foundation: every request carries a trace context (trace_id, span_id, parent) that propagates across RPC boundaries by default; spans are recorded with timing and annotations and stitched into a tree later; and sampling makes always-on tracing cheap enough to run in production. The paper also got two things almost-wrong (single-rate global sampling, agent-side trace assembly) that the next decade fixed. Read it as the design document for an entire industry.

What was actually in the paper — three primitives, one decision

Dapper was designed under three constraints that Google cared about more than any individual feature. First, ubiquitous deployment — the system had to instrument every binary in the fleet, not just the binaries the team liked, because the slow span on any given query could be in any binary. Second, continuous monitoring — sampled tracing had to run in production all the time, not just when an engineer turned it on, because the engineer wanting to turn it on was usually too late. Third, low overhead — the per-request CPU and bandwidth cost had to be small enough that no team would ever ask to disable it. Those three constraints determined every design choice in the paper, and the choices that followed are what the rest of the industry copied.

The first primitive is the trace tree. Dapper modelled a distributed request as a tree of spans, where the root span is the entry point (a Google search frontend receiving a query) and each child span is a downstream call made within its parent. Each span records four core fields: a trace_id (shared by every span in the tree), a span_id (unique to that span), a parent_span_id (the calling span's id, or null at the root), and a name plus start/end timestamps. Tagging a span with (trace_id, span_id, parent_span_id) is enough to reconstruct the tree later — you do not need to keep the tree in memory while the request is live, you let each span ship to a collector independently and stitch them together at query time. This separation between per-span recording and tree-time assembly is what makes large-scale tracing tractable; the tree exists only when someone asks for it.

The second primitive is context propagation. Dapper required that every RPC call inside Google's infrastructure carry the trace context as part of the call metadata. When binary A made a call to binary B, the calling span's trace_id and span_id were attached to the outgoing RPC; binary B read them, generated a new span_id for itself with the calling span_id as its parent, and continued. The paper's key insight here was that Google's RPC library (Stubby, the precursor to gRPC) already had a uniform metadata channel — every binary used it, every binary parsed it, every binary forwarded it. Adding trace context to that channel meant Dapper became universal in a single library change, with zero per-team adoption work. Why universal-by-default mattered: in a federated organisation with thousands of teams, any feature that requires per-team adoption work plateaus at maybe 30-40% coverage and stays there forever. Adoption is the bottleneck, not implementation. Dapper's authors realised that the only way to hit "every binary, every request" was to put the change in the one library every binary already linked. Modern OpenTelemetry's auto-instrumentation strategy — patch the standard HTTP, gRPC, and database client libraries so trace context flows without application code changes — is the direct lineage of this realisation. The libraries propagate so the developers do not have to remember.

The third primitive is sampling. Tracing every request would have multiplied Google's logging budget by an unaffordable factor — at scale, "always on" and "always recorded" are not the same thing. Dapper sampled at the trace level: at the entry point, a sampling decision was made for the whole request (keep this trace, or drop it), and that decision was propagated down with the trace context so every span in a sampled trace was kept and every span in a dropped trace was discarded. The initial production rate was 1 in 1024, later tuned per-service. Crucially, the sampling decision was made once at the entry and respected across the whole tree, which guaranteed that a sampled trace was complete — no half-traces with most spans missing. This head-based sampling model is what every modern tracer still implements as the default; tail-based sampling (decide at the end of the trace, after seeing whether it errored) came later, in part because Dapper's collectors could not have afforded the buffering required.

[Figure — Dapper's three primitives in one picture. Panel 1: a trace tree (assembled at query time) rooted at a frontend span, branching into mixer, ad-mixer, and spell-check spans and their leaf nodes; every span shares trace_id=a3f2, parent_span_id encodes the tree, and spans ship independently and asynchronously. Panel 2: context propagation — the caller's trace_id and span_id ride the outgoing RPC metadata, so the RPC library carries adoption at zero per-team cost. Panel 3: sampling at the entry — 1 in 1024 requests kept, the decision rides on the context, and whole trees are kept or dropped, never halves.]
Illustrative — Dapper's three primitives in one picture: a trace tree assembled at query time from independently-shipped spans, context propagation that rides on every RPC's metadata, and a single sampling decision at the entry that keeps the whole tree consistent. Every modern tracer reproduces this shape.

The fourth, less-celebrated decision was that Dapper was an observation-only system — it could read what binaries did but never change their behaviour. There was no fault injection, no request-replay, no in-place patching. The argument the paper made was that a tracing system that altered the program's behaviour would force teams to defend against the tracing system, and the social contract of "you can leave Dapper on in production because it cannot break you" would collapse. This decision sounds obvious in 2026; in 2008 the question of "should the tracing system also be allowed to fail-fast on errors it sees" was contested, and Dapper's strict observability boundary set the template the rest of the industry followed.

A measurable demonstration — implement Dapper's primitives in 60 lines

Reading the Dapper paper without writing the data structures it describes is like reading the SQL paper without writing a SELECT. The three primitives — trace tree, context propagation, sampling — are concrete enough to implement in a single Python script. The script below stands up two Flask services that talk to each other, propagates a trace context across the HTTP boundary using the same shape as the Dapper paper (trace_id, span_id, parent_span_id), records spans into a local file, and reconstructs the tree at the end. It is a Dapper-in-miniature, runnable on your laptop.

# dapper_in_miniature.py — implement Dapper's three primitives end-to-end:
# trace tree, context propagation, head-based sampling. Two Flask services,
# one trace, one tree at the end.
# pip install flask requests
import json, threading, time, uuid, random
from flask import Flask, request
import requests

SPAN_LOG = "/tmp/spans.jsonl"
SAMPLE_RATE = 1 / 4   # 25% — exaggerated so you see results in 8 requests
open(SPAN_LOG, "w").close()

def emit_span(trace_id, span_id, parent_span_id, name, start, end):
    rec = {"trace_id": trace_id, "span_id": span_id,
           "parent_span_id": parent_span_id, "name": name,
           "start": start, "end": end, "duration_ms": (end - start) * 1000}
    with open(SPAN_LOG, "a") as f:
        f.write(json.dumps(rec) + "\n")

def make_service(name, port, downstream=None):
    app = Flask(name)
    @app.route("/handle")
    def handle():
        # ---- read context from incoming request, or start a new trace ----
        trace_id = request.headers.get("x-trace-id")
        parent_span_id = request.headers.get("x-span-id")
        sampled = request.headers.get("x-sampled") == "1"
        if trace_id is None:                 # entry point — make sampling call
            sampled = random.random() < SAMPLE_RATE
            trace_id = uuid.uuid4().hex[:16]
            parent_span_id = None
        span_id = uuid.uuid4().hex[:8]
        start = time.time()
        time.sleep(random.uniform(0.005, 0.030))   # local work
        if downstream:
            requests.get(downstream, headers={
                "x-trace-id": trace_id, "x-span-id": span_id,
                "x-sampled": "1" if sampled else "0"}, timeout=2)
        time.sleep(random.uniform(0.001, 0.010))
        end = time.time()
        if sampled:                           # respect the entry-point decision
            emit_span(trace_id, span_id, parent_span_id, name, start, end)
        return {"trace_id": trace_id, "sampled": sampled}
    threading.Thread(target=lambda: app.run(port=port, use_reloader=False),
                     daemon=True).start()

make_service("payments", 9002)
make_service("checkout", 9001, downstream="http://localhost:9002/handle")
time.sleep(0.4)

# ---- fire 8 requests; ~2 will be sampled and recorded ----
for i in range(8):
    r = requests.get("http://localhost:9001/handle", timeout=3).json()
    print(f"req {i}: trace_id={r['trace_id']} sampled={r['sampled']}")
time.sleep(0.3)

# ---- assemble the trace tree(s) at "query time" ----
from collections import defaultdict
spans_by_trace = defaultdict(list)
with open(SPAN_LOG) as f:
    for line in f:
        s = json.loads(line)
        spans_by_trace[s["trace_id"]].append(s)

print(f"\nrecorded {sum(len(v) for v in spans_by_trace.values())} spans "
      f"across {len(spans_by_trace)} sampled traces")
for tid, spans in spans_by_trace.items():
    print(f"\ntrace {tid}:")
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    for s in spans:
        children[s["parent_span_id"]].append(s["span_id"])
    def walk(sid, depth):
        s = by_id[sid]
        print(f"  {'  '*depth}{s['name']:9} sid={s['span_id']} "
              f"pid={s['parent_span_id']} dur={s['duration_ms']:.1f}ms")
        for c in children[sid]:
            walk(c, depth + 1)
    for root in children[None]:
        walk(root, 0)

A representative python3 dapper_in_miniature.py run produces (sampling is random, so the kept count varies):

req 0: trace_id=7c3d2a1e9f4b8a6c sampled=False
req 1: trace_id=2f1a8b3c7d9e4f5a sampled=True
req 2: trace_id=a8b3c1d2e4f5a6b7 sampled=False
req 3: trace_id=3e9f1a2b4c5d6e7f sampled=True
req 4: trace_id=b1c2d3e4f5a6b7c8 sampled=False
req 5: trace_id=c2d3e4f5a6b7c8d9 sampled=False
req 6: trace_id=d3e4f5a6b7c8d9e0 sampled=True
req 7: trace_id=e4f5a6b7c8d9e0f1 sampled=False

recorded 6 spans across 3 sampled traces

trace 2f1a8b3c7d9e4f5a:
  checkout  sid=a4f29c1b pid=None dur=42.7ms
    payments  sid=82c7e3f9 pid=a4f29c1b dur=18.3ms

trace 3e9f1a2b4c5d6e7f:
  checkout  sid=6d3a8e2c pid=None dur=37.1ms
    payments  sid=4f8b1c5d pid=6d3a8e2c dur=15.6ms

trace d3e4f5a6b7c8d9e0:
  checkout  sid=9c4e7a3b pid=None dur=51.2ms
    payments  sid=2b8d6f1e pid=9c4e7a3b dur=21.8ms

Per-line walkthrough. The key line if trace_id is None: sampled = random.random() < SAMPLE_RATE is the entry-point sampling decision — it runs once per trace, exactly where Dapper specified. The line requests.get(downstream, headers={"x-trace-id": ..., "x-span-id": ..., "x-sampled": ...}) is the propagation contract: the trace_id and the parent's span_id ride on the outgoing HTTP request as headers. The Dapper paper used Stubby's metadata channel for this; modern OpenTelemetry uses the W3C traceparent header; in a single-org Python demo, plain HTTP headers are equivalent. The line if sampled: emit_span(...) ensures the sampling decision is respected — if the entry chose to drop, no span in the tree gets recorded, which guarantees no half-traces. The final block walks the tree by parent_span_id and prints it, which is exactly what Tempo or Jaeger do at query time.

Why tree assembly is a query-time concern, not an emission-time concern: each service emits its span independently, asynchronously, with no knowledge of whether other services in the trace have also emitted. The collector only sees a stream of disconnected spans. Building the tree requires (a) all the spans to have arrived and (b) a query that filters by trace_id and walks the parent_span_id pointers. Doing that work at emission time would require every service to hold its spans until the full trace completed, coupling the request path to the slowest branch of the tree. Doing it at query time is asynchronous, cheap, and only happens when an engineer asks. This separation — emit independently, assemble on demand — is the central performance trick that makes Dapper scale.

The 60 lines above are functionally complete. Add a few thousand lines to handle (a) an out-of-process collector with backpressure, (b) RPC-library auto-instrumentation for the major frameworks, (c) a query interface that walks span trees over millions of stored spans, and (d) the sampling decision being made by something smarter than a coin flip — and you have rebuilt Dapper. Every commercial tracer is, at the architecture level, the script above with those four expansions. Reading the script and re-reading the Dapper paper section 2 ("Distributed Tracing in Dapper") side by side is the single fastest way to understand what tracing is.

What Dapper got right, and what came after

The paper's lasting contributions are easier to see now than they were in 2010. The trace_id / span_id / parent_span_id triple is the universal identifier set; the W3C Trace Context spec from 2020 is essentially a wire-format standardisation of Dapper's idea, with traceparent carrying the same three fields plus a sampled bit. The async, fire-and-forget span emission with collector-side assembly is how Zipkin (2012, ex-Twitter), Jaeger (2017, ex-Uber), and Tempo (2020, Grafana Labs) all work. The "instrument the RPC library, not the application code" stance is the architectural ancestor of OpenTelemetry's auto-instrumentation. The continuous-monitoring-with-sampling stance — the idea that a fraction of a percent of traces is enough to find the slow ones — is the assumption behind every modern always-on tracer. None of these were obvious before Dapper. After Dapper, they were the default.
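
A minimal sketch (not a full implementation of the W3C spec) of how Dapper's triple maps onto the traceparent header: version 00, a 32-hex-character trace-id, a 16-hex-character parent span-id, and a flags byte whose low bit is the sampled decision.

# traceparent_sketch.py — a minimal sketch of the W3C traceparent wire format
# that standardised Dapper's trace_id / span_id / sampled triple:
# 00-<32-hex trace-id>-<16-hex parent span-id>-<2-hex flags>
import re
import secrets

def build_traceparent(trace_id, span_id, sampled):
    flags = "01" if sampled else "00"            # bit 0 of trace-flags = sampled
    return f"00-{trace_id}-{span_id}-{flags}"    # version 00 is the only version defined today

def parse_traceparent(header):
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None                              # malformed → start a new trace
    trace_id, parent_span_id, flags = m.groups()
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "sampled": bool(int(flags, 16) & 0x01)}

if __name__ == "__main__":
    tp = build_traceparent(secrets.token_hex(16), secrets.token_hex(8), sampled=True)
    print(tp)
    print(parse_traceparent(tp))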

The paper got two things almost-wrong, both of which the next decade fixed in plain sight. First, single global sampling rate — Dapper started with one rate for the whole fleet (1 in 1024) because per-service rates were operationally complex. That meant high-traffic services were over-traced (1 in 1024 of 100k QPS is 100 traces/sec, plenty) and low-traffic services were under-traced (1 in 1024 of 10 QPS is 1 trace per 100 seconds, useless for an on-call hunting a rare bug). Modern tracers offer per-service sampling rates and adaptive sampling that boosts the rate when error rates spike; the principle that the sampling decision is made at the entry remains, but the rate is no longer global. Second, head-based-only sampling — the decision is made before the trace's outcome is known, so error traces are sampled at the same rate as success traces. The on-call who needs to debug the 0.05% of UPI mandates that fail at NPCI gets a sampled error trace 1 in 2 million calls — too rare to be useful. Tail-based sampling, where the decision is deferred until the trace's outcome is observed at the collector and errors are kept at 100%, is the modern fix; it requires the collector to buffer in-flight traces, which is exactly the cost Dapper avoided in 2010. Both fixes are reactions to Dapper's original constraints relaxing as hardware got cheaper.
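
A minimal sketch (hypothetical names, not any real collector's API) of what the tail-based fix requires: buffer every span of an in-flight trace at the collector, decide once the trace is complete or times out, keep all errored traces and only a small fraction of healthy ones. The buffering is exactly the cost the paragraph above says Dapper avoided.

# tail_sampling_sketch.py — collector-side tail-based sampling, in miniature
import random
from collections import defaultdict

KEEP_OK_FRACTION = 0.01                 # illustrative policy: keep 1% of healthy traces
pending = defaultdict(list)             # trace_id -> buffered spans (the cost Dapper avoided)

def on_span(span):
    pending[span["trace_id"]].append(span)

def on_trace_complete(trace_id):
    """Called when the root span arrives, or when an idle timeout fires."""
    spans = pending.pop(trace_id, [])
    errored = any(s.get("error") for s in spans)
    if errored or random.random() < KEEP_OK_FRACTION:
        return spans                    # ship the whole tree to storage
    return []                           # drop the whole tree

if __name__ == "__main__":
    on_span({"trace_id": "t1", "name": "checkout", "error": False})
    on_span({"trace_id": "t1", "name": "payments", "error": True})
    print(on_trace_complete("t1"))      # kept, because the trace contains an error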

A third thing the paper underemphasised — and the next decade discovered the hard way — is baggage. Dapper's trace context carried the trace_id, span_id, and a sampled bit, but did not carry application-level fields ("the user is in tier 2", "the call is for a premium customer", "the request originated in region ap-south-1"). Adding such fields to the propagated context is what OpenTelemetry calls baggage; it lets downstream services condition on entry-point information without re-deriving it. Dapper's omission of baggage was deliberate (they wanted the propagated context to be small and uniform), but it left a gap that every production tracer eventually filled. Reading the paper today, the baggage absence is the most visible thing missing — every line in modern OpenTelemetry that says "set on baggage, read in a downstream service" is filling a gap Dapper deliberately left.
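
A minimal sketch of what filling that gap looks like in OpenTelemetry's Python API (the baggage and context modules ship in opentelemetry-api; the key name customer.tier is illustrative, not a convention): set a field on the propagated context at the entry point, read it downstream without re-deriving it. Across service boundaries the SDK's default propagators carry it on the W3C baggage header alongside traceparent.

# baggage_sketch.py — attach an application-level field to the propagated context
# pip install opentelemetry-api
from opentelemetry import baggage, context

# entry point: put the field on the context that will be propagated
ctx = baggage.set_baggage("customer.tier", "premium")
token = context.attach(ctx)

# ...anywhere downstream in the same context (or in another service, once the
# propagators have carried it on the W3C baggage header)...
print(baggage.get_baggage("customer.tier"))   # -> "premium"

context.detach(token)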

A fourth area that has shifted since 2010 is the storage substrate. Dapper persisted spans into Bigtable, with rows keyed by trace_id giving constant-time lookup of an entire trace once you knew its id. That made the lookup cheap but the search expensive — finding the slow trace from the last hour without knowing its trace_id required a scan. Modern descendants (Tempo's columnar Parquet blocks on object storage, Jaeger's Cassandra or Elasticsearch backends) experiment with different points on the search-vs-scan tradeoff, with Tempo's design explicitly choosing "no index, just block storage" and pairing trace lookup with metric-based discovery (find the slow request in Prometheus first, then pull its trace_id). The architectural lesson — that the trace store and the trace-discovery mechanism can be different systems — is one Dapper hinted at and Tempo formalised. The 16 years between the two are largely about making this separation cheaper.
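
A toy sketch of that tradeoff (an in-memory dict standing in for the trace store; all data invented): fetching a trace by id is a constant-time key lookup, while discovering the slow trace without an id means touching every stored trace — which is why a separate discovery path has to supply the trace_id first.

# store_tradeoff_sketch.py — lookup-by-id is cheap, discovery-without-id is a scan
store = {                                       # trace_id -> list of spans
    "a3f2": [{"name": "checkout", "duration_ms": 42.7}],
    "9b1c": [{"name": "checkout", "duration_ms": 1840.0}],
}

# cheap: fetch a known trace (what row keys / block lookups give you)
print(store["9b1c"])

# expensive: find the slowest trace without knowing its id — touch every trace
slowest = max(store, key=lambda tid: max(s["duration_ms"] for s in store[tid]))
print("slowest trace:", slowest)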

[Figure — Dapper's design lineage, 2005 to 2026. A timeline: 2005-2008, ad-hoc request_ids in some Google binaries, on-call stitching by hand; 2010, Dapper defines trace_id / span_id / parent_span_id plus sampling; 2012, Twitter open-sources Zipkin and its B3 header format; 2016, OpenTracing; 2017, Jaeger at Uber under the CNCF, with the first large tail-based-sampling deployments; 2018, OpenCensus; 2019 onward, OpenTracing and OpenCensus merge into OpenTelemetry, joined by Tempo and the W3C trace context. Every modern tracer's data model is Dapper's with one or two extensions — baggage (2016+), tail-based sampling (2017+), per-service rates (2018+), exemplars (2020+).]
Illustrative — the 16-year arc from Dapper's publication to today's OpenTelemetry-shaped industry. Every box on the right of the timeline is a system that copied Dapper's primitives and extended them along one axis the original paper had to defer.

Why a Dapper-shaped tracer is what an Indian fintech needs in 2026

The argument that the paper is still relevant is not nostalgic. It is that the constraints Dapper was solving for — trace a request across many services, instrument every service, run continuously in production, do it cheaply enough to leave on — are exactly the constraints a Razorpay or PhonePe or Cred is solving for in 2026. The shape of an Indian fintech's distributed system in 2026 looks a lot like Google's 2008: tens to low hundreds of microservices, regulatory requirements for end-to-end audit trails (RBI rules require transaction reconstruction across the full stack for chargebacks), per-request fan-outs of 5-15 services for a UPI mandate, and SLOs that are tight enough that "guess the slow service from logs" is operationally untenable.

A concrete example: a UPI mandate at PhonePe in 2026 traverses the mobile app → edge gateway → mandate-orchestrator → fraud-check → risk-scoring → NPCI-adapter → bank-rails-adapter → settlement-recorder → notification-dispatcher. Nine hops (the app plus eight backend services), average request 800ms, p99 around 1800ms. The team's PromQL alert fires when p99 crosses 2000ms; an on-call engineer in Bengaluru pulls up Tempo and queries {resource.service.name = "mandate-orchestrator" && duration > 2s}. Tempo returns 47 sampled traces from the last 5 minutes. Each trace is exactly the data structure Dapper described — a tree, rooted at the gateway span, branching into the downstream services' spans, each annotated with timing. The on-call clicks one trace, sees that the bank-rails-adapter span is consuming 1.6s of the 2.1s end-to-end, drills into the span's logs (correlated by the same trace_id from the previous chapter), finds a TCP retransmit on the bank-side connection. Total time from page to root cause: 3 minutes. The same investigation in 2008 — before Dapper — would have taken hours of manual log correlation across every service in the chain, if it could be done at all.

The 2026 stack is OpenTelemetry on the producer side, Tempo on the storage side, Grafana on the query side. The shape is Dapper's. The difference is that the team did not have to write the tracer; they pip-installed it. Concretely, the Python instrumentation code looks like this:

# enable Dapper-shaped tracing on a Flask service in 6 lines
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True))
)
FlaskInstrumentor().instrument()   # auto-instruments every Flask route

Six lines of setup and the service emits Dapper-shaped traces — root spans for every incoming HTTP request, child spans for every outgoing call (when the requests instrumentation is also enabled — see the sketch below), context propagated via W3C traceparent headers, span batches shipped via OTLP gRPC to Tempo, sampling decision made by the SDK's configured sampler. The same six lines on every service in the fleet, and the trace-tree investigation from the PhonePe example above becomes possible. The cost is pip install opentelemetry-distro opentelemetry-exporter-otlp plus a one-time Tempo deployment. The ROI calculation — three minutes vs four hours per incident, multiplied by typical incident volume — is the most consistently positive ROI in the observability stack, which is why adoption has gone from "experimental" in 2018 to "table stakes" in 2026.

Why this matters for the curriculum's argument: every chapter from here through Part 5 is going to talk about extending or refining one of Dapper's primitives — baggage as an extension of context, tail-based sampling as a refinement of head-based, columnar trace storage as a scaling improvement on Dapper's per-machine collectors, exemplars as the metric-side join key. The original paper is the spine, and the 16 years of follow-up work are the muscles. Reading the paper without the follow-up gets you 2010-era tracing; reading the follow-up without the paper gets you a confused understanding of why the data model looks the way it does — why span_id is unique per span rather than per call, why a parent_span_id rather than a parent pointer, why the sampling decision is at the entry rather than per span. Each of those choices is in the paper, with reasoning. Both are needed, and Dapper is the single most economical entry point for the "why" half.
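
A hedged extension of the six-line setup — a sketch, assuming the opentelemetry-instrumentation-requests package is installed, of the two pieces mentioned above in passing: instrumenting the outbound HTTP client so child spans appear for downstream calls, and adding a manual span with application-specific attributes (the span name, order.id, and payment.method attributes are illustrative, not conventions).

# in addition to the Flask setup above:
# pip install opentelemetry-instrumentation-requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()      # child spans for every outgoing requests.* call

tracer = trace.get_tracer("checkout")    # manual spans for work auto-instrumentation cannot see

def charge_card(order_id):
    with tracer.start_as_current_span("charge-card") as span:
        span.set_attribute("order.id", order_id)        # illustrative attribute names
        span.set_attribute("payment.method", "upi")
        # ... call the PSP here; any requests.post() inside becomes a child span ...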

A second concrete example, with a number from a public retro: Hotstar's 2024 Asia Cup retro reported that the tracing system caught a 12-second tail in the cricket-score-fan-out service that had been hidden in averages for two months. The fan-out talked to 22 downstream services, and the slow one was a single feature-flag-evaluation service that had degraded 30× under load. Traces showed it; metrics did not (the slow path was 0.4% of requests, well under the alert threshold); logs showed nothing wrong (every individual log line said "OK"). The trace's tree structure made the bottleneck visible at a glance — one branch of the tree was 12 seconds wide while the other 21 were under 100ms. Dapper's data model is what made that visible. A flat list of log lines, even with perfect grep, would not have.

A third example, smaller in scale but instructive: a Bengaluru-based Series A fintech with 14 services and ~200 QPS at peak adopted OpenTelemetry-on-Tempo in early 2025 after spending six weeks debugging a payment-failure rate that hovered at 0.8% with no obvious cause. Pre-tracing, the team had 14 separate Loki queries that they ran in sequence whenever a customer complained, and the answer was usually "we cannot tell". Post-tracing, the first failure they investigated revealed the root cause within seven minutes: a downstream KYC verification service was timing out at exactly 5 seconds for a small subset of transactions, and the upstream caller was retrying twice before giving up — the user-perceived failure was at 15 seconds, but the trace showed three identical 5-second KYC spans, immediately readable as "this service is the problem". The team's post-adoption metric was time-to-root-cause, which dropped from a median of 4 hours to a median of 11 minutes. The fix — Dapper's data model — was cheap; what was costly was its absence.

The pattern across these examples is that trace data answers a different shape of question than metrics or logs do. Metrics answer "is something wrong, in aggregate" — fast, but only at the level of granularity the metrics carry (typically per-service-per-endpoint). Logs answer "what happened on this one machine" — high detail, but bounded to one process. Traces answer "what was the structure of this one request, end-to-end" — the only one of the three that crosses the network boundary as a first-class concern. A team trying to answer trace-shaped questions with metrics-shaped data ends up building progressively more elaborate metric labels (which blow up cardinality, see Part 6) until the labels approximate a trace_id and the team realises they have re-invented Dapper, badly. Going to Dapper directly is cheaper.

Going deeper

The paper's trace-collection pipeline — the on-host daemon

Dapper's data path was not "service emits span directly to a central server". Instrumented binaries wrote span records to local log files on each machine; a per-machine Dapper daemon and the collection infrastructure pulled those files, batched the records, and shipped them to a central Bigtable repository (the store the paper's Depot API later reads). This three-tier shape — application library → on-host daemon → central store — is what every modern OpenTelemetry deployment also looks like (SDK → Collector → backend). The paper's reasoning was twofold: the on-host stage decouples service-side latency from collection-side backpressure (a write to a local file is microseconds; a write to a remote service might be seconds), and the daemon can buffer during transient collector outages without dropping spans. Modern OpenTelemetry Collectors are direct descendants of that daemon. Reading the paper's trace-collection section alongside the OpenTelemetry Collector's documentation is illuminating: the architecture is unchanged in 16 years.
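
A minimal sketch of that three-tier shape (every name here is hypothetical; an in-process queue stands in for the local log file, and ship() stands in for the export to a central store): the request path does only a cheap local hand-off, and a separate loop batches and ships asynchronously, absorbing collector hiccups without touching request latency.

# onhost_daemon_sketch.py — app library → on-host daemon → central store, in miniature
import queue, threading, time

spool = queue.Queue()                      # stands in for the local log file

def record_span(span):                     # called on the request path: cheap, local
    spool.put(span)

def ship(batch):                           # stands in for the upstream export (e.g. OTLP)
    print(f"shipped batch of {len(batch)} span(s)")

def daemon_loop(batch_size=100, flush_every_s=1.0):
    batch = []
    while True:
        try:
            batch.append(spool.get(timeout=flush_every_s))
        except queue.Empty:
            pass                           # nothing new; fall through and flush what we have
        if batch and (len(batch) >= batch_size or spool.empty()):
            ship(batch)                    # retry/backoff would wrap this call in a real daemon
            batch = []

threading.Thread(target=daemon_loop, daemon=True).start()
record_span({"trace_id": "a3f2", "name": "checkout"})
record_span({"trace_id": "a3f2", "name": "payments"})
time.sleep(2)                              # give the loop a chance to flush before exiting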

The "out-of-band trace recording" decision and why it matters

A tempting design Dapper rejected was to ship spans with the RPC response — every service replies to its caller with both the response and any spans the service recorded. That would have eliminated the need for a separate collection pipeline, but at the cost of bloating every RPC response with trace data the caller does not necessarily want. Dapper chose the out-of-band path: spans go to the collector, RPC responses are clean. The justification was that callers and tracing systems have different latency budgets, different failure modes, and different scaling axes, and conflating them couples two systems that should be independent. Modern OpenTelemetry inherits this: traces ship to OTLP collectors via gRPC or HTTP, completely off the request hot path. The performance implication is that adding tracing does not add latency to the user's request; only the collection pipeline absorbs the extra load. This separation is non-obvious; every team that has tried to "save infrastructure" by piggybacking trace data on responses re-discovers why Dapper said no.

The Dapper paper's blind spot — what it under-specified

Reading the paper carefully exposes three under-specifications that the next decade had to fill in. First, clock skew across machines — Dapper assumed Google's NTP-synchronised fleet had millisecond-or-better synchronisation, so span start/end timestamps from different machines were comparable. In 2026 Indian fintechs running across AWS Mumbai and an on-prem NPCI gateway often see 10-50ms skew between regions; the trace's apparent timing can be misleading. The practical fix is to present child spans relative to their parent's start rather than trusting absolute timestamps across machines, which is what trace backends and UIs commonly do when rendering. Second, schema evolution of span attributes — the paper showed annotations as free-form key-value pairs, with no story for evolving the schema as services change. OpenTelemetry's "semantic conventions" (standardised attribute names such as http.method, db.system, messaging.destination.name) are a 2020+ reaction to a decade of every team naming their attributes differently. Third, sampling rate observability — the paper assumes you know the sampling rate, so you can scale sampled counts up to estimate true counts. In a system with adaptive per-service sampling, the rate is itself a time-varying quantity, and recovering the true count from the sampled count requires the sampling probability to be recorded alongside the span. Modern tracers do this; Dapper did not need to.
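
A toy sketch of the clock-skew problem and the usual mitigation (a rendering heuristic, not any specific tracer's algorithm; the timestamps are invented): the child span was recorded on a machine whose clock runs behind, so its absolute start appears to precede its parent's; expressing the child as clamped offsets from the parent's start gives a plausible picture again.

# clock_skew_sketch.py — a child span from a skewed clock, re-anchored to its parent
parent = {"name": "checkout", "start": 100.000, "end": 100.050}   # seconds, machine A
child  = {"name": "payments", "start":  99.985, "end": 100.020}   # machine B, clock behind

def relative_to_parent(parent, child):
    # express child timing as offsets from the parent's start, clamped into its interval
    start_off = max(0.0, child["start"] - parent["start"])
    end_off = min(parent["end"], child["end"]) - parent["start"]
    return start_off, max(end_off, start_off)

print(relative_to_parent(parent, child))   # (0.0, 0.020): plausible, no negative start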

The Razorpay + UPI angle — why every Indian fintech eventually reads this paper

Razorpay's engineering blog (2023) describes their migration from a homegrown tracer (a request_id field bolted onto each service's logs) to OpenTelemetry-on-Tempo. The migration document explicitly cites the Dapper paper as the design source for the new system. The reason the migration was urgent was an RBI audit requirement: post-March 2023, all UPI payment service providers must be able to reconstruct the full chain of an individual transaction within 30 seconds, including timing on each hop, for any transaction in the last 12 months. The pre-OTel system could partially do this (logs joined by request_id) but failed when an internal retry generated a new request_id, or when an async settlement job touched the transaction hours later. The Dapper-shaped data model — trace_id stable across retries, baggage for txn_id, async-friendly out-of-band collection — solved every gap. The paper that was published to describe a Google-internal system in 2010 ended up being the design document for an RBI-compliant payment-tracing infrastructure in 2023. That kind of longevity is rare in systems papers; it is worth understanding why.

The bigger picture — Dapper as the start of "request-oriented observability"

Before Dapper, observability was machine-oriented: collect CPU, memory, log lines per machine; debug by querying that machine. Dapper proposed reorganising the data around requests: every observation is tagged with the request it belonged to, and the observation's relationship to other observations is captured by the request graph. This shift is the conceptual centre of modern observability. Metrics with exemplars (a metric carries the trace_id of the slowest request in its histogram bucket) are request-oriented. Continuous profiling with trace-aware tags (a flamegraph attributable to a specific user-facing span) is request-oriented. Even logs — once correlated by trace_id, as the previous chapter argued — are request-oriented. The 2010 paper articulated the unit of observation that the entire industry has been refining for 16 years. That unit is "the request, end to end, across every machine it touched".

Reading Dapper alongside the X-Trace and Magpie papers

Dapper is the most-cited but not the first paper in the distributed-tracing lineage. X-Trace (Fonseca et al., NSDI 2007) introduced the trace_id-and-causality-edge data model from a research perspective, with explicit support for non-RPC propagation (UDP, multicast, kernel-level pushdown). Magpie (Barham et al., OSDI 2004) was earlier still, focused on Windows tracing and per-event causality reconstruction from kernel ETW logs. Reading the three papers in chronological order — Magpie 2004, X-Trace 2007, Dapper 2010 — shows the field converging on the trace_id + span_id + parent_span_id triple as the minimum sufficient data model, with each paper choosing a different production constraint to optimise for. Magpie chose "no application changes" (kernel-side reconstruction, expensive at scale); X-Trace chose "any propagation medium" (general but adoption-hostile); Dapper chose "ubiquitous in one organisation" (specific but operationally tractable). Dapper won the design-pattern war because its constraint matched what production teams actually had — uniform RPC libraries — and the tradeoff was acceptable: lose the ability to trace into the kernel, gain ubiquitous coverage of application-level work.

What a 2026 reader should still take from the paper

The paper's most-quoted framing — ubiquitous deployment, continuous monitoring, and low overhead as the governing design constraints — is still the right starting point for evaluating any tracing system you build or adopt. If a tracing system requires per-team adoption work, ubiquity will fail. If it cannot run continuously in production without a feature-flag, you will turn it on too late. If the per-request CPU or bandwidth cost is more than ~1%, teams will negotiate to disable it under load. These three constraints are not 2010-specific; they are properties of how observability systems coexist with the systems they observe. A useful exercise when evaluating a vendor's tracing product is to score it against these three constraints honestly — most score well on two and badly on the third, and the third is usually the one that determines whether the system survives its first production crisis.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install flask requests
python3 dapper_in_miniature.py
# Expected: 8 requests fire; ~25% are sampled and kept;
# the kept traces print as small two-span trees (checkout → payments).
# Read alongside the Dapper paper's section 2 (Distributed Tracing in Dapper).

Where this leads next

The four chapters that follow this one in Part 4 (data model, propagation format, sampling, storage) are an extended commentary on the original Dapper paper's design choices. Read this chapter as the paper's summary; read the next four as the modern industry's annotations on it. The voice that speaks across all five chapters is the one that says "here is what Dapper got right, here is what it got almost-wrong, here is what the field has done since" — and the answer to the third clause is what makes a 2026-vintage observability stack feel different from a 2010-vintage one even though the bones are the same.

A practical pointer for readers wanting to read the paper itself: the version everyone cites is the Google Research technical report from April 2010 (linked below), 14 pages, no equations, mostly prose with a handful of figures and tables. It reads in about 90 minutes. Section 2 ("Distributed Tracing in Dapper") and Section 3 ("Dapper Deployment Status") are the most-quoted; Section 6 ("Experiences") is the war stories that motivate the design and is worth a careful read alongside any Hotstar, Razorpay, or Flipkart engineering blog post. Section 7 ("Other Lessons Learned") includes the famous observation that the engineering work was 80% of the effort and the algorithmic novelty was 20% — a sentence that applies to most successful observability systems and that the rest of this curriculum will keep proving.

A closing pointer for the student reading this at 11pm in Hyderabad: Dapper is not a closed chapter of computer science history. The paper is the entry point, but the field is alive — OpenTelemetry's specification grows monthly, Tempo's columnar storage is being rebuilt every two years, the sampling literature has had a renaissance since 2020. Reading the paper as background and then following the OpenTelemetry CNCF working group's GitHub issues is how a curious engineer keeps current. The paper is the floor; the ceiling is wherever the on-call's curiosity takes them.

One last note for anyone planning to instrument a real service after reading this chapter: the gap between "trace data is being emitted" and "trace data is useful" is bigger than the SDK setup suggests. A service that emits well-named, semantically-conventional spans (http.method=POST, http.route=/checkout, db.system=postgresql) is debuggable; a service that emits spans named func1, inner_call, do_thing is no more debuggable than a flat log file. The OpenTelemetry semantic conventions are the modern equivalent of "name your variables well" — they are what makes a trace queryable months after the engineer who wrote it has forgotten what they were doing. The chapter on the data model takes this up next; the lesson it inherits from Dapper's paper is that the data model only pays off when the data carries meaning a future on-call can read.
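
A short sketch of the difference, using the same OpenTelemetry Python API as earlier in the chapter (the ConsoleSpanExporter is used here only so the spans are visible without a backend; attribute values are invented): the same unit of work emitted once opaquely and once with the semantic-convention attribute names named above — only the second is answerable months later with a query such as {span.http.route = "/checkout"}.

# span_naming_sketch.py — opaque span vs semantically-conventional span
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("demo")

with tracer.start_as_current_span("do_thing"):                  # tells a future on-call nothing
    pass

with tracer.start_as_current_span("POST /checkout") as span:    # queryable months later
    span.set_attribute("http.method", "POST")
    span.set_attribute("http.route", "/checkout")
    span.set_attribute("db.system", "postgresql")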

References