Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Service graphs from traces

It is the second day at a hypothetical Bengaluru-based wallet startup we will call CoinPaisa. Karan, a new SRE on his second on-call shift, opens Grafana to debug a payments-api p99 spike and gets stuck within thirty seconds. Three engineers in three time zones have edited the architecture diagram in Confluence over the last eighteen months; the most recent version shows nine services, a Redis, and a Postgres. The trace he is looking at touches sixteen services he has never heard of — idempotency-cache-shard-2, merchant-tier-router, npci-rail-adapter-v3, consent-state-machine. Two of them depend on each other in the opposite direction from what the Confluence diagram shows. The actual call graph has been a black box since the previous architect left.

He clicks Service Graph in the Tempo data source. A tab opens with a force-directed graph: each service a node, each call an edge, the edge thickness proportional to QPS, the edge colour mapped to error rate. The payments-api node is fed by the API gateway and fans out to seven downstream services. One edge — payments-api → merchant-tier-router — is glowing red at 14% error rate. Karan has the diagnosis in ninety seconds. The Confluence diagram never named merchant-tier-router. The traces did. The service graph did not get built by a human; the spans built it, and the build runs continuously.

A service graph is a derived view computed from the parent-child edges in your distributed traces — for every span pair (parent.service.name, child.service.name), you tally calls, error rate, and latency, then render the result as a directed graph. It is the only architecture diagram in your org that cannot drift, because it is regenerated from the traffic itself every fifteen seconds. The sharp edges are cardinality (operation-level edges blow up the index), aggregation window (too short = sparse, too long = stale during incidents), and asymmetric coverage (services without OTel SDKs become invisible nodes in your network).

What the trace tree already tells you about your topology

A distributed trace is a tree of spans. The root span belongs to whichever service first received the request — typically the API gateway or the edge ingress. Every child span has a parent_span_id and lives inside some service identified by the service.name resource attribute. From those two facts alone, the architecture of your system is implicit in every trace: the root span's service.name calls the next-hop span's service.name, that calls its children's service.name, and so on, recursively, until the leaves.

The service graph is what you get when you discard everything except the service-to-service edges — strip the request-specific data, strip the per-span latency, strip the trace-level branching, keep only (parent.service.name, child.service.name). Aggregate over a fifteen-second or one-minute window, count occurrences, and you have a directed graph where the edges are call-rate-weighted. Add the average error rate over those edges (proportion of child spans with status.code = ERROR) and you have an edge colour. Add the p99 latency across the children and you have an edge tooltip. The whole structure is a pure function of the trace stream — no manual maintenance, no Confluence drift, no architect-with-a-whiteboard-marker.

This sounds simple because, mechanically, it is. The hard parts are not in deriving the graph but in the operational decisions around it: how do you scale the aggregation when 100% of traces become 38,000 RPS? How do you handle services that don't emit OTel spans at all (third-party SDKs, Lambdas, legacy mainframes, the NPCI hop)? How do you keep the graph readable when your microservice fleet is 240 services strong and the result looks like a dense ball of yarn? Each of these is a real engineering decision; this article walks through them.

Figure: From a single trace to a service graph edge — the projection from spans to services. Left: a single trace as a span tree (api-gateway → payments-api, fraud-check, audit-log; payments-api → merchant-tier-router, ledger-write; merchant-tier-router → npci-rail-adapter) — 7 spans, 6 service-to-service edges, each contributing +1 to an edge. Right: the derived service graph over a fifteen-second window, with edges annotated by aggregated call counts (320 rps, 280 rps, ...) and the 14% error rate on the npci edge.
Illustrative — one trace contributes one increment to each of the six service-to-service edges; thousands of traces in a fifteen-second window aggregate to the call counts and error rates shown on the right.

Why the projection is from (parent.span.service, child.span.service) rather than from service A → service B directly: traces don't natively carry "service A called service B" — they carry parent-child spans, and the service identity comes from the resource attribute on each span. The projection requires the service.name attribute to be present on every span (handled by the OpenTelemetry SDK's resource auto-detection). When two consecutive spans in the parent-child chain belong to the same service (e.g. an HTTP-server span and an HTTP-client span both inside payments-api), the projection produces a self-loop edge that you typically filter out — internal calls are not architectural edges. The filter is one line: if parent.service != child.service. Forgetting this is the second-most-common reason a freshly-rolled service graph looks "weirdly busy".
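
To make the projection concrete, here is a minimal sketch of it as a pure function over one window's worth of finished spans — the span dicts and field names are simplified stand-ins for real OTLP spans, with the self-loop filter from the paragraph above on one line:

# projection_sketch.py — span pairs → service-graph edges, as a pure function
from collections import Counter

def project_edges(spans):
    """spans: list of dicts with trace_id, span_id, parent_span_id, service, status."""
    # index every span's service by (trace_id, span_id) so children can find their parent
    service_of = {(s["trace_id"], s["span_id"]): s["service"] for s in spans}
    calls, errors = Counter(), Counter()
    for s in spans:
        parent = service_of.get((s["trace_id"], s["parent_span_id"]))
        if parent is None:          # root span, or parent not present in this window
            continue
        if parent == s["service"]:  # self-loop: internal span pair, not an architectural edge
            continue
        edge = (parent, s["service"])
        calls[edge] += 1
        if s["status"] == "error":
            errors[edge] += 1
    return calls, errors

# one tiny trace: api-gateway -> payments-api -> merchant-tier-router
window = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "api-gateway", "status": "ok"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "payments-api", "status": "ok"},
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b", "service": "merchant-tier-router", "status": "error"},
]
print(project_edges(window))
# (Counter({('api-gateway', 'payments-api'): 1, ('payments-api', 'merchant-tier-router'): 1}),
#  Counter({('payments-api', 'merchant-tier-router'): 1}))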

How the aggregation actually works (and where it breaks)

Tempo, Jaeger, OpenTelemetry Collector's servicegraph processor, and Grafana Cloud Traces all compute service graphs the same way at the ten-thousand-foot level — they tail the trace stream, project span pairs to service pairs, and emit a stream of (source, target, count, error_count, latency) tuples on a fixed window. The differences are operational. Tempo emits Prometheus metrics (traces_service_graph_request_total, traces_service_graph_request_failed_total, traces_service_graph_request_server_seconds_bucket); Jaeger maintains an in-memory adjacency table flushed to a backing store. The Prometheus-metrics path is what most production setups use because the rest of the observability stack is already wired to ingest Prometheus.

The aggregation runs in two stages. Stage 1 — span pairing: as spans arrive, the processor maintains a hash map keyed by trace_id + parent_span_id. When a child span arrives whose parent is not yet seen, it sits in the map waiting. When the parent later arrives, the processor emits the pair (parent.service, child.service) and increments the counter. Pairs that never resolve (parent never arrived, span timed out) are dropped after a max-wait — typically 10 seconds in Tempo. Stage 2 — windowed counting: the resolved pairs feed a Prometheus counter with labels client="parent_service", server="child_service". Prometheus scrapes this counter every 15s, and the service-graph UI computes rate-of-counter over a 1-minute or 5-minute window for the edge weights.
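
As a concrete example of stage 2, the numbers the UI renders for a single edge are ordinary PromQL over the counters named above — the exact queries a given Grafana version issues may differ, but they are equivalent to:

# call rate (edge thickness) over a 1-minute window
sum(rate(traces_service_graph_request_total{client="payments-api", server="merchant-tier-router"}[1m]))

# error ratio (edge colour)
sum(rate(traces_service_graph_request_failed_total{client="payments-api", server="merchant-tier-router"}[1m]))
/
sum(rate(traces_service_graph_request_total{client="payments-api", server="merchant-tier-router"}[1m]))

# p99 latency (edge tooltip)
histogram_quantile(0.99,
  sum by (le) (rate(traces_service_graph_request_server_seconds_bucket{client="payments-api", server="merchant-tier-router"}[1m])))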

The 10-second pairing wait is the first sharp edge. If a parent span is very slow (say, 12 seconds — a long batch operation, or a service stalled on a lock), it finishes and arrives after its children's pairs have already timed out, and the edges they would have contributed are dropped. Your service graph silently undercounts the call rate to the slow service — the edge looks thinner, not thicker, which is the opposite of what an SRE expects. A genuine outage that slows the parent service can therefore hide itself in the service graph by causing pair-timeout dropouts. The mitigation is to raise the max wait (in Tempo, the service-graphs processor's wait and max_items settings, sketched below), but this trades memory for accuracy — the in-flight pair table grows linearly with offered RPS × max wait. At 38,000 traces/sec and a 30-second wait, you are caching ~1.1M pending pairs continuously.
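
A hedged sketch of where those knobs live in Tempo — the key names are paraphrased from the Tempo metrics-generator configuration and should be checked against your Tempo version before copying:

# tempo.yaml (fragment) — metrics-generator service-graphs processor
metrics_generator:
  processor:
    service_graphs:
      wait: 30s            # how long an unmatched span waits for its parent (memory vs accuracy)
      max_items: 1500000   # cap on the in-flight pair table; pairs beyond this are dropped
      histogram_buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]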

The second sharp edge is cardinality. The Prometheus counter traces_service_graph_request_total{client, server} has cardinality equal to the number of distinct service pairs. With 240 services, the upper bound is 57,600 series — fine. Add connection_type="messaging|database|http" and you double it. Add client_operation="GET /api/v1/orders|GET /api/v1/payments|..." and the cardinality explodes. Most production teams set the processor to service-pair edges only, no operation, no connection-type — accepting that "GET /orders is slow" needs a different diagnostic (the latency-by-operation metric, not the service graph). The service graph answers "which services talk to which", not "which endpoints of which services".
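
Two quick queries against the Prometheus that scrapes the processor tell you how many edge series you are actually paying for and who is contributing them:

# number of distinct service-pair edges currently in the graph
count(count by (client, server) (traces_service_graph_request_total))

# top contributors if the number surprises you
topk(10, count by (client) (traces_service_graph_request_total))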

The third sharp edge is drift between traces and metrics. Tempo's service-graph processor runs on a sampled subset of traces (typically the same 1–10% sample rate as your trace pipeline). The Prometheus counters built from this sample are then assumed to represent total traffic; compensating by multiplying by 1/sample_rate works only if the sample rate is constant. Tail-based sampling, dynamic sampling, or per-service sample rates break this assumption and produce edges with systematically wrong call counts. A common bug: an SRE notices the service-graph QPS for payments-api is reporting 320 rps while the payments-api Prometheus counter directly reports 8,500 rps. The 26× ratio is exactly the inverse of the trace sampling rate; the service graph is right per its sample, just unreconciled with the unsampled metrics.
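
A reconciliation panel makes this drift visible — the server-side counter name (http_requests_total here) is a placeholder for whatever your services actually export:

# ratio of the service-graph-derived call rate to the service's own request counter;
# a stable ratio equal to your sampling rate is expected, a drifting one is not
sum(rate(traces_service_graph_request_total{server="payments-api"}[5m]))
/
sum(rate(http_requests_total{service="payments-api"}[5m]))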

# service_graph_demo.py — minimal end-to-end service-graph builder,
# the OTel Collector's servicegraph logic translated into a Python demo.
# pip install opentelemetry-api opentelemetry-sdk prometheus-client requests
import time, random, threading, collections, json
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from prometheus_client import Counter, Histogram, start_http_server

# 1. each "service" is just a tracer with its own service.name resource.
#    Shown for reference — this is how a real service plants its identity on every
#    span; the simulation below fakes finished-span dicts directly instead.
def tracer_for(service_name: str):
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    trace.set_tracer_provider(provider)  # in a real process this can only be set once
    return trace.get_tracer(service_name)

# 2. the service-graph processor — keyed by (client, server) pairs
EDGE_RPS = Counter(
    "traces_service_graph_request_total",
    "Service graph edge call counts",
    ["client", "server"],
)
EDGE_ERR = Counter(
    "traces_service_graph_request_failed_total",
    "Service graph edge error counts",
    ["client", "server"],
)
EDGE_LAT = Histogram(
    "traces_service_graph_request_server_seconds",
    "Server-side latency per service-graph edge",
    ["client", "server"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
)

# 3. the in-flight pair-resolver — keyed by trace_id + parent_span_id
pending: dict = {}
PAIRING_TIMEOUT_S = 10.0  # a real processor evicts pending entries after this; the demo never sweeps

def on_span_end(span_dict):
    """Called for every finished span; pairs to its parent and emits an edge."""
    key = (span_dict["trace_id"], span_dict["parent_span_id"])
    parent_service = pending.get(key)
    if parent_service is None:
        # parent not yet seen — register self so children can pair with us.
        # (the demo assumes spans arrive parent-first; a real processor buffers
        #  out-of-order spans until the pairing timeout expires)
        pending[(span_dict["trace_id"], span_dict["span_id"])] = span_dict["service"]
        return
    if parent_service != span_dict["service"]:  # skip self-loops
        EDGE_RPS.labels(client=parent_service, server=span_dict["service"]).inc()
        if span_dict["status"] == "error":
            EDGE_ERR.labels(client=parent_service, server=span_dict["service"]).inc()
        EDGE_LAT.labels(client=parent_service, server=span_dict["service"]) \
                .observe(span_dict["duration"])
    pending[(span_dict["trace_id"], span_dict["span_id"])] = span_dict["service"]

# 4. simulate four services calling each other in a chain for ten seconds
def simulate_traffic():
    edges = [("api-gateway", "payments-api"),
             ("payments-api", "merchant-router"),
             ("merchant-router", "npci-adapter")]
    end_time = time.time() + 10
    trace_n = 0
    while time.time() < end_time:
        trace_n += 1
        tid = f"trace-{trace_n}"
        # root span: api-gateway has no parent, so on_span_end only registers it
        on_span_end({"trace_id": tid, "span_id": f"sp-{trace_n}-0",
                     "parent_span_id": None, "service": "api-gateway",
                     "status": "ok", "duration": random.uniform(0.010, 0.300)})
        for i, (client, server) in enumerate(edges):
            err = "error" if (server == "npci-adapter" and random.random() < 0.14) else "ok"
            on_span_end({"trace_id": tid, "span_id": f"sp-{trace_n}-{i+1}",
                         "parent_span_id": f"sp-{trace_n}-{i}", "service": server,
                         "status": err, "duration": random.uniform(0.005, 0.250)})
        time.sleep(0.001)
    print(f"emitted {trace_n} traces")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus exposes at :8000/metrics
    t = threading.Thread(target=simulate_traffic, daemon=True)
    t.start()
    t.join()
    # dump the resulting service graph as JSON (raw counts over the 10s run)
    import requests
    text = requests.get("http://localhost:8000/metrics").text
    edges = collections.defaultdict(lambda: {"calls": 0, "errs": 0})
    for line in text.split("\n"):
        if line.startswith("traces_service_graph_request_total{"):
            parts = line.split("}")[0].replace('traces_service_graph_request_total{', '')
            kv = dict(p.split("=") for p in parts.replace('"', '').split(","))
            edges[(kv["client"], kv["server"])]["calls"] = float(line.split()[-1])
        if line.startswith("traces_service_graph_request_failed_total{"):
            parts = line.split("}")[0].replace('traces_service_graph_request_failed_total{', '')
            kv = dict(p.split("=") for p in parts.replace('"', '').split(","))
            edges[(kv["client"], kv["server"])]["errs"] = float(line.split()[-1])
    print(json.dumps({f"{k[0]} -> {k[1]}": v for k, v in edges.items()}, indent=2))

Sample run:

$ python3 service_graph_demo.py
emitted 9847 traces
{
  "api-gateway -> payments-api":   {"rps": 9847.0, "errs": 0.0},
  "payments-api -> merchant-router": {"rps": 9847.0, "errs": 0.0},
  "merchant-router -> npci-adapter": {"rps": 9847.0, "errs": 1382.0}
}

The load-bearing lines: Resource.create({"service.name": service_name}) is what plants the service identity on every span — without it, the projection has no key to group by and the entire mechanism collapses. pending: dict = {} is the in-flight pair-resolver, the same hash table Tempo's processor maintains in production (sized in millions of entries at 38k RPS); the Python dict is sufficient at demo scale, but a real implementation uses a bounded LRU with a max-wait timeout to avoid unbounded growth on orphan spans. if parent_service != span_dict["service"] is the self-loop filter — it is one line that decides whether your graph shows internal RPC calls (rarely useful) or only architectural edges (almost always what you want). EDGE_RPS.labels(client=..., server=...).inc() is the Prometheus counter that the Grafana service-graph UI scrapes to render edges; the cardinality of this counter is the cardinality of your service-pair set, which is the master variable controlling how much memory the processor consumes. if (server == "npci-adapter" and random.random() < 0.14) simulates the 14% error rate on the NPCI edge that Karan saw in the lead — in production, the rate would come from real span.status.code = ERROR annotations propagated by the SDK on RPC exceptions.

Why the pairing key is (trace_id, parent_span_id) and not just parent_span_id: span IDs are 8 bytes, randomly generated, and only guaranteed unique within their own trace — nothing stops two unrelated traces in the same window from carrying the same span ID. The trace_id (16 bytes in W3C Trace Context; B3 also allows 8) is required to disambiguate. At 38,000 traces/sec with an average tree depth of 12 spans, you have roughly 450,000 span IDs in flight in any 1-second window; keyed on span ID alone, a collision between unrelated traces is rare but inevitable at scale, and when it happens it silently merges unrelated edges.

Three production realities the textbook diagram doesn't show

A clean four-node service graph in a blog post is one thing; a 240-service production graph is another. Three realities surface as the deployment scales, each with a known-good mitigation.

Reality 1: services with no SDK become invisible nodes — or worse, ghost nodes. A common failure mode at a hypothetical Mumbai-based payments processor we will call PayWeave: their primary services run OpenTelemetry, but their Python Lambdas, their legacy Java mainframe adapter, and their NPCI client SDK do not. When payments-api calls into the NPCI client SDK, no child span is emitted; the trace tree ends at payments-api. The service graph then shows payments-api as a leaf node — no outgoing edge to NPCI. To an SRE staring at the graph mid-incident, this looks fine. In reality, the NPCI hop is the source of the 240ms latency tail, because the adapter is queueing. The service graph cannot show a service that does not emit spans, and the absence is silent.

The mitigation is a span-emission contract: every service in the request path must emit at least an entry/exit span carrying service.name. Where the service is third-party (NPCI), an adjacency span is emitted by the calling service to record the outgoing call — span.kind = CLIENT, with peer.service set to the third-party service. Tempo and Jaeger both honour peer.service when computing edges — a CLIENT-only span with peer.service="npci-rail" produces an edge payments-api → npci-rail even though no NPCI service ever emitted a span. The fix is one span attribute and one span-kind setting; the engineering investment is the cross-team agreement that every team actually sets it.
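
A minimal sketch of that pattern in the OTel Python SDK — the function, span name, and npci_sdk stub are illustrative; the load-bearing parts are span.kind = CLIENT and the peer.service attribute:

# npci_call.py — emitting a CLIENT span on behalf of an un-instrumented third party
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.trace import SpanKind, Status, StatusCode

trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "payments-api"})))
tracer = trace.get_tracer("payments-api")

class _NpciSdkStub:                          # stand-in for the real, span-less SDK
    def debit(self, request): return {"status": "ok"}
npci_sdk = _NpciSdkStub()

def debit_via_npci(request):
    # the NPCI SDK emits no spans of its own, so the caller records the hop
    with tracer.start_as_current_span("npci debit", kind=SpanKind.CLIENT) as span:
        span.set_attribute("peer.service", "npci-rail")   # becomes the npci-rail node/edge
        try:
            return npci_sdk.debit(request)
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR))
            span.record_exception(exc)
            raise

print(debit_via_npci({"amount": 100}))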

Reality 2: messaging fan-out pollutes the graph. A hypothetical Hyderabad-based logistics service we will call ShipBee Express runs a Kafka-based event bus. When order-created is published to the bus, six downstream consumers (notifications, billing, warehouse, partner-router, audit, ledger) receive it. Each consumer creates a child span linked to the producing trace. The service graph then shows order-api with six outgoing edges — but operationally, the producer does not "call" the consumers; they consume asynchronously. The graph implies a synchronous request-response shape that does not match the system. A naive SRE looking at the graph during a billing latency incident will assume order-api is waiting on billing-service; in fact, order-api returned to the user 20ms after the publish, and billing-service is processing the message hours later from a backlog.

The OpenTelemetry messaging convention exists for this case: spans for messaging operations carry messaging.operation = "publish" | "receive" and span.kind = PRODUCER | CONSUMER. Tempo's service-graph processor honours these and renders messaging edges with a different visual style (dashed line, an "async" badge, a different colour). When the convention is followed, the SRE distinguishes "synchronous downstream call" from "async fan-out" at a glance. When it is not (the dev who wrote the consumer didn't tag the span), the graph looks plausible but lies. Auditing the messaging spans is a one-time engineering investment worth making as soon as your fleet has any pub-sub or queue-based component.
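
A sketch of the producer and consumer sides with the span kinds and attributes the processor relies on — the topic name is illustrative, and the exact messaging semantic-convention keys have shifted across OTel versions, so treat them as assumptions to verify:

# async_spans.py — producer/consumer span kinds and messaging attributes (illustrative)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.trace import SpanKind

def tracer_for(name):
    return TracerProvider(resource=Resource.create({"service.name": name})).get_tracer(name)

producer_tracer = tracer_for("order-api")
consumer_tracer = tracer_for("billing-service")

# producer side: a PRODUCER span around the publish
with producer_tracer.start_as_current_span("order-created publish",
                                            kind=SpanKind.PRODUCER) as span:
    span.set_attribute("messaging.system", "kafka")
    span.set_attribute("messaging.operation", "publish")
    span.set_attribute("messaging.destination.name", "order-created")  # topic, illustrative
    # kafka_producer.send(...) would go here

# consumer side, possibly much later: a CONSUMER span around the processing.
# In a real consumer, the trace context extracted from the message headers is what
# links this span back to the producing trace.
with consumer_tracer.start_as_current_span("order-created process",
                                            kind=SpanKind.CONSUMER) as span:
    span.set_attribute("messaging.system", "kafka")
    span.set_attribute("messaging.operation", "receive")
    span.set_attribute("messaging.destination.name", "order-created")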

Reality 3: graph readability collapses past ~30 nodes. A force-directed layout of 240 nodes with hundreds of edges is a hairball; no human can extract anything from it. Hotstar's SRE org reportedly hit this exact wall during the IPL final — their service graph rendered correctly but was visually useless. The mitigation is graph filtering: filter to a single service's neighbourhood (1-hop or 2-hop), filter to edges over a QPS threshold (drop the <1 RPS edges that clutter the rendering with curiosities), filter by namespace or product line ("show only payments-related services"). Grafana's service-graph panel supports these filters directly. The reader's mental model becomes "the graph for my service and its immediate neighbours" — a manageable five-to-fifteen-node subgraph that fits on one screen and answers the question "what is downstream of payments-api right now?".
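
The 1-hop filter, expressed directly against the service-graph metrics (the service name and the 10 rps threshold are placeholders):

# 1-hop neighbourhood of payments-api, keeping only edges above 10 rps
sum by (client, server) (
  rate(traces_service_graph_request_total{client="payments-api"}[5m])
  or
  rate(traces_service_graph_request_total{server="payments-api"}[5m])
) > 10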

Figure: Three realities — invisible nodes, async fan-out, hairball collapse. Left: payments-api as a leaf with a missing edge to npci-rail (no SDK = no spans; fix: peer.service on the CLIENT span). Middle: order-api fanning out over dashed async edges to notifications, billing, warehouse, partner, audit, and ledger (fix: messaging.operation on the span). Right: a 240-node hairball with no signal (fix: a 1-hop filter plus a QPS threshold, yielding a readable 5-to-15-node subgraph).
Illustrative — three production realities every service graph hits as the fleet grows. Each has a single-line mitigation, but you have to know to apply it; the graph silently lies if you don't.

Why the hairball problem is fundamentally about visualisation, not data: the data is correct — every edge is real, every weight is accurate. The problem is that human working memory holds roughly 7±2 chunks; a 240-node graph is two orders of magnitude past that. Filtering to a 1-hop neighbourhood is not a clever trick but a perceptual necessity. The same data, rendered as a sortable table of "top-10 incoming and top-10 outgoing edges for payments-api", is often more useful than the visual graph for a senior SRE — text summaries scale linearly with the number of services; force-directed layouts degrade roughly quadratically as the edge count grows.
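
A sketch of that table view, pulling the top-10 incoming and outgoing edges for one service from the Prometheus HTTP API — the Prometheus URL, the service name, and the 5-minute window are assumptions:

# edge_table.py — the "table instead of hairball" view for one service
import requests

PROM = "http://localhost:9090"
SERVICE = "payments-api"

def top_edges(label):
    """label is 'client' (outgoing edges of SERVICE) or 'server' (incoming edges)."""
    other = "server" if label == "client" else "client"
    query = (f'topk(10, sum by ({other}) '
             f'(rate(traces_service_graph_request_total{{{label}="{SERVICE}"}}[5m])))')
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()
    return [(r["metric"].get(other, "?"), float(r["value"][1]))
            for r in resp["data"]["result"]]

print(f"top outgoing edges from {SERVICE}:")
for svc, rps in top_edges("client"):
    print(f"  {SERVICE} -> {svc:30s} {rps:8.1f} rps")
print(f"top incoming edges to {SERVICE}:")
for svc, rps in top_edges("server"):
    print(f"  {svc:30s} -> {SERVICE} {rps:8.1f} rps")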

Common confusions

  • "A service graph is the same as my architecture diagram." It is not. Your architecture diagram is what an architect intended; the service graph is what the traffic actually does. The two diverge as soon as the first dev adds a "temporary" lookup call to a service that "shouldn't" be called from there. Service graph = ground truth. Confluence diagram = aspiration. When they conflict, fix the diagram (or the code), but trust the graph.
  • "The service graph shows all my services." It shows only services that emit spans and appear as either client or server in a recent trace. A service that runs a batch job once a day, or a service called only by a deprecated client, or a service that crashes before its instrumentation initialises, is invisible until it next emits a span. The graph is a recency view, not an inventory.
  • "Edge weights are total traffic." Edge weights from Tempo's service-graph processor are sampled traffic × inverse-sample-rate. If your tail-based sampler keeps 1% of OK traces and 100% of error traces, the inverse-rate calculation is per-bucket and not always correct in the metric. Reconcile against direct service metrics (http_requests_total on the server side) when the numbers matter for capacity planning.
  • "A service graph replaces APM dashboards." It does not — it complements them. APM dashboards show RED metrics per service (rate, error, duration) but rarely show which other service is the source of the rate or the cause of the errors. The service graph attributes traffic to its origin; the APM panel quantifies its volume. Both, side by side, is the workflow — graph for the topology, panel for the magnitude.
  • "More services = more useful service graph." The opposite, past ~30 services. The cognitive load of a 240-node graph exceeds the diagnostic value, and most production teams use the graph filtered to a single service's neighbourhood — which they could have hand-drawn. The graph's value scales with how often the topology changes, not with how many nodes it has; a fast-moving fleet of 80 services benefits more than a stable fleet of 240.
  • "Self-loops in the graph mean a service is calling itself." Sometimes. More commonly, it means the same service.name is being applied to two logically distinct services — a deployment misconfiguration where two pods carry the same OTel resource attribute. The service-graph processor sees two distinct services as one and renders the cross-call as a self-loop. The fix is auditing your deployment manifests for unique service.name values; the symptom is the surprising self-loop.

Going deeper

How Tempo's service-graph processor scales to 38k traces/sec

Tempo computes service graphs in its metrics-generator component — an optional, separately run process that receives spans from the distributor. The processor maintains an in-memory hash table keyed by (trace_id, span_id) mapping to service.name. When a child span arrives, the table is consulted; if the parent is present, a Prometheus counter is incremented. The hash table is bounded — typically around 1.5M entries in production setups — and entries that exceed the max_items limit or the pairing timeout are dropped. The dropped pairs become the "edges that don't exist in the graph but did happen in reality" — a known false negative the operator must monitor.

At 38k traces/sec with an average tree depth of 12 spans, the processor sees ~456k span events per second. Each event is an O(1) hash-table lookup and a counter increment — roughly 200ns per event on a modern x86, or ~91ms of CPU per second of traffic, comfortably within a single core's budget. The bottleneck is not CPU but memory: at 1M in-flight pairs at ~120 bytes per entry, the hash table sits at ~120MB — fine on a 4GB pod, problematic on a 1GB sidecar. This is why production setups size the metrics-generator deployment independently of the rest of the ingest path once the fleet grows past ~100 services, where the footprint becomes meaningful. The reference docs are at grafana.com/docs/tempo/latest/metrics-generator.
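
The same back-of-envelope arithmetic as a helper you can rerun with your own numbers, using the model from the paragraphs above (one pending pair per trace per second of wait, ~120 bytes per entry, ~200ns per event — all assumptions, not measurements):

# sizing.py — back-of-envelope sizing for the in-flight pair table
def pair_table_estimate(traces_per_sec, avg_spans_per_trace, wait_s,
                        bytes_per_entry=120, ns_per_event=200):
    spans_per_sec = traces_per_sec * avg_spans_per_trace
    pending_pairs = traces_per_sec * wait_s   # unresolved pairs waiting for a parent
    return {
        "span_events_per_sec": spans_per_sec,
        "cpu_ms_per_sec": round(spans_per_sec * ns_per_event / 1e6, 1),
        "pending_pairs": pending_pairs,
        "table_memory_mb": round(pending_pairs * bytes_per_entry / 1e6, 1),
    }

print(pair_table_estimate(traces_per_sec=38_000, avg_spans_per_trace=12, wait_s=30))
# {'span_events_per_sec': 456000, 'cpu_ms_per_sec': 91.2,
#  'pending_pairs': 1140000, 'table_memory_mb': 136.8}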

The OpenTelemetry Collector's servicegraph processor — wire format and configuration

Outside Tempo, the OpenTelemetry Collector (contrib distribution) ships a servicegraph processor that does the same work, language-agnostically, before traces are forwarded to any backend. The configuration (in YAML) is small but every knob matters:

processors:
  servicegraph:
    metrics_exporter: prometheus
    latency_histogram_buckets: [0.001s, 0.01s, 0.1s, 1s, 10s]
    dimensions: [http.method, http.status_code]  # adds labels — cardinality risk
    store:
      ttl: 10s          # the pairing timeout — see the pairing-wait sharp edge above
      max_items: 1000   # the in-flight cap — see scaling section
    cache_loop: 10s     # how often expired pairs are swept
    store_expiration_loop: 2s

The dimensions list is the cardinality landmine: each dimension multiplies the edge count. A 240-service fleet with, say, 600 observed service-pair edges, gaining http.method (5 values) × http.status_code (10 values), explodes to 30,000 metric series — still manageable. Adding http.url (high-cardinality, thousands of distinct values) blows past the cardinality budget and OOMs the metrics-generator. The rule: never add a high-cardinality dimension to the service-graph processor. If you need per-endpoint analysis, use a different metric (http_server_requests_seconds_bucket{handler="..."}) or use exemplars to drill from the service-graph metric to the specific trace.

The asynchronous-call honesty problem

Reality 2 above mentioned messaging fan-out; the deeper question is what edge weights mean for asynchronous flows. If order-api publishes to Kafka and billing-service consumes 400ms later from a backlog, the service-graph processor records an edge order-api → billing-service — but the latency it records is the consumer's processing time, not the end-to-end wait time. The edge p99 of 12ms makes the system look healthy; the actual user-visible billing latency might be 400ms because of the queue.

The OpenTelemetry messaging conventions define separate operation-duration metrics for exactly this gap. Some service-graph implementations honour the messaging attributes and render async edges with a queueing-aware p99; most don't. The practical workaround at hypothetical Hyderabad-based ShipBee Express was to render two service graphs side by side: one filtered to span.kind = SERVER only (synchronous request-response edges), another filtered to span.kind = CONSUMER only (async edges including queueing time). The two together tell the story; either alone misleads. This is also why the /wiki/log-to-trace-correlation-trace-ids-in-logs drill-down is critical for async investigations — the trace tells you the consumer's processing path; the logs tell you when it actually happened relative to the publish.

Detecting topology changes — the diff that pages

A service graph is most useful not as a static view but as a delta. If payments-api suddenly starts calling a new downstream service experimental-pricing-engine, that is information: a feature flag flipped, a deploy went out, an SDK upgraded. A change in the graph that no one announced is a candidate for an incident. Some teams (notably Hotstar's SRE org) run a topology-diff alert: every 60 seconds, compute the symmetric difference between the current edge set and the edge set from one hour ago; if the diff exceeds a threshold (e.g., 5 new edges or 5 removed edges in a single hour), page the platform team. The alert catches deployment-induced topology changes within minutes rather than during the next incident.

The implementation is one Prometheus query plus one alert rule:

# edges present now that did not exist one hour ago (new edges)
count by (cluster) (
  traces_service_graph_request_total
  unless
  traces_service_graph_request_total offset 1h
) > 5

# removed edges: the same rule with the operands swapped
# (traces_service_graph_request_total offset 1h unless traces_service_graph_request_total)

The unless operator computes the set difference over label sets; the wrapping count by (cluster) collapses the result to a per-cluster integer; the > 5 threshold fires only on meaningful changes. This is the kind of derived-view observability the service graph enables once it exists: the traces are the primary signal, the graph is a first derivation, and the topology-diff alert is a second derivation on top of it. The data flows once; the value compounds.
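
Wrapped as an alerting rule — the group name, labels, and for: duration are placeholders to adapt:

# prometheus-rules.yaml (fragment) — topology-diff alert; names are placeholders
groups:
  - name: service-graph-topology
    rules:
      - alert: ServiceGraphNewEdges
        expr: |
          count by (cluster) (
            traces_service_graph_request_total
            unless
            traces_service_graph_request_total offset 1h
          ) > 5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $value }} new service-graph edges in cluster {{ $labels.cluster }} within the last hour"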

When the service graph is wrong on purpose

For some debugging tasks, the service graph misleads by construction: the filters that make it readable also hide exactly the traffic you are hunting. A misbehaving canary deploy might receive 1% of traffic for the first ten minutes — the service graph shows the canary as a node with an edge from the gateway, weighted at 1% of total. This is correct. But the SRE chasing a "why did p99 spike?" question wants to know about the 1%, not the 99%. Filtering the graph to "only edges with QPS over 100" hides the canary; the SRE then looks at "the production graph", concludes "everything is fine", and the fault sits in the canary subgraph they filtered out.

The correction is procedural: when a deploy is in progress, run unfiltered service-graph queries; otherwise run the filtered version for normal navigation. Some teams encode this in their on-call runbook: "if the question is 'is anything new', show all edges; if the question is 'what is the steady-state shape', filter to QPS > 10". This is the same kind of filter-discipline that applies to dashboards generally, and the service graph is no exception. The graph is a tool, not a truth — it is as honest as the filter you apply to it.

Where this leads next

The service graph sits at the apex of the trace-derived view stack. The lower layers — span-trace context (/wiki/span-trace-context-the-data-model), trace-context propagation across service boundaries (/wiki/b3-w3c-trace-context), and trace storage (/wiki/trace-storage-at-scale-tempos-columnar-approach) — are what make the service graph computable. The higher layer — operational topology dashboards, deploy-correlated topology diffs, SLO-aware service-graph annotations — is where you go once the graph is stable enough to alert from.

The natural next reading is /wiki/the-one-pane-of-glass-promise-and-its-limits (chapter 86) for the meta-question of how the service graph composes with metrics dashboards and log streams into a single investigation surface, and /wiki/trace-sampling-head-tail-adaptive for why the sampling decision upstream of the processor systematically biases edge weights — the graph you see is conditional on the sampling regime that produced it.

For the broader theme of "telemetry that builds a derived view of the system you didn't have to maintain", see /wiki/exemplars-linking-metrics-to-traces — exemplars are the per-bucket version of what service graphs are at the edge level, and both share the design principle that the primary signal (a metric, a trace span) carries enough information to construct a secondary signal (a graph edge, an exemplar pointer) without storing it explicitly. Your traces already encode the architecture; the service-graph processor is just the pure function that extracts it.

# Reproduce this on your laptop
# tempo.yaml: any minimal Tempo config with the OTLP receiver and the
# metrics-generator's service-graphs processor enabled (see the Tempo docs)
docker run -d -p 3200:3200 -p 4317:4317 \
  -v "$(pwd)/tempo.yaml:/etc/tempo.yaml" grafana/tempo:latest \
  -config.file=/etc/tempo.yaml
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3000:3000 grafana/grafana:latest
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-api opentelemetry-sdk prometheus-client requests
python3 service_graph_demo.py
# the demo prints the edge JSON and exposes traces_service_graph_* metrics at
# localhost:8000/metrics; point Prometheus at that target and Grafana at Prometheus
# to chart the edges. To see Grafana's Service Graph tab proper, send real OTLP
# traces to Tempo on :4317 with the metrics-generator enabled and watch the
# four-node graph render with the 14% NPCI error rate.

References