Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Service dependency graphs
It is 03:14 IST on a Tuesday at PaySetu, and the wallet-balance service has started returning 502s. Riya, the on-call SRE, sees the alert and asks the question every responder asks: what does wallet-balance actually call? The runbook says "talks to ledger and risk-score". The reality, reconstructed from traces twenty minutes later, is that wallet-balance calls 11 services across three teams, one of which is a six-month-old experiment that nobody owns and that has been responding in 8 seconds when the upstream timeout is 2. The graph in someone's head was wrong; the graph in production was the truth. A service dependency graph is the production-truth structure of who calls whom — and at any fleet over ~40 services, you cannot debug an outage without one.
A service dependency graph is a directed multigraph where nodes are services and edges are RPC call relationships, weighted by call rate, latency, and error rate. The graph is reconstructed continuously from distributed-tracing data (spans), service-mesh telemetry (Envoy access logs), or eBPF-captured TCP flows. It enables four things engineers cannot do by reading code: blast-radius prediction (who breaks if I deploy this?), critical-path identification (which edges dominate p99?), cycle detection (am I about to ship a deadlock?), and drift detection (when did the graph change without a PR?). The graph is never the architecture diagram in Confluence — it is the runtime topology, which is always more entangled.
What the graph actually is, and why it is not your architecture diagram
The architecture diagram on the team Wiki shows boxes and arrows that someone drew when the service was launched. The dependency graph is the set of edges your binaries actually traverse at 14:37 on Tuesday. They diverge within weeks of launch — a junior engineer adds a quick HTTP call to risk-score from a request handler that was supposed to be CPU-only; a refactor leaves a dead RPC client wired into health checks; a feature flag flips on a path that calls four services nobody remembered. By month six, the architecture diagram and the runtime graph share maybe 60% of their edges. By year two, the diagram is a museum piece.
A service dependency graph is, formally, a directed multigraph G = (V, E) where V is the set of running services and E is the set of (caller, callee) pairs over a time window W. Each edge carries a tuple of metrics: call rate (RPS), error rate, p50/p99 latency, and metadata such as the protocol (HTTP, gRPC, Kafka), the operation name, and an owner team derived from the caller. Why a multigraph and not a simple graph: the same pair (A, B) can appear under multiple operation names — wallet-balance calls ledger:GetBalance and ledger:GetTransactions and these are different edges with different latency profiles, different blast radii on failure, and possibly different owners. Collapsing them into one "A→B" edge throws away the information you most need during an incident. The window W matters too: a 60-second window over a low-traffic dawn period misses edges that fire only during business hours. Production graphs are typically computed over rolling 5-minute and 1-hour windows.
The three production sources for these edges are distributed traces (every span carries parent_service and service, so the edge falls out of the span aggregation directly), service-mesh telemetry (Envoy or Linkerd emits per-RPC access logs that include source and destination; no in-process tracing required), and eBPF kernel hooks (Cilium Hubble, Pixie), which capture every TCP connect call and the corresponding endpoint. Each has a bias: traces miss services that don't propagate trace context (legacy ones), mesh telemetry misses services not on the mesh, eBPF misses application-level semantics (it sees 10.4.5.7:8080, not "this was the GetBalance call"). The mature observability stack uses all three and reconciles them.
Building the graph from spans — the actual algorithm
The simplest correct algorithm: scan every span in a time window, group by (parent.service, span.service, span.operation), and aggregate metrics per group. The output is one row per edge per window. Why this is enough: a distributed trace is by definition a tree of spans with parent pointers; the union of all parent→child service transitions across all traces in a window is exactly the dependency graph for that window. You do not need graph-theoretic cleverness — you need an aggregation pipeline that scales to billions of spans per hour.
The Python below builds the graph from a span stream and ranks edges by traffic. It runs on a laptop; the production version is a streaming Flink job over a Kafka topic of spans, but the logic is identical:
from collections import defaultdict
from dataclasses import dataclass, field
@dataclass
class EdgeStats:
    calls: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)

    def p99(self):
        if not self.latencies_ms: return 0
        s = sorted(self.latencies_ms)
        return s[min(len(s)-1, int(0.99*len(s)))]

def build_graph(spans):
    edges = defaultdict(EdgeStats)
    for sp in spans:
        if sp.get("parent_service") is None:
            continue  # root span, no incoming edge
        key = (sp["parent_service"], sp["service"], sp["operation"])
        e = edges[key]
        e.calls += 1
        if sp.get("error"): e.errors += 1
        e.latencies_ms.append(sp["latency_ms"])
    return edges
# example span stream — what a real trace exporter produces
spans = [
    {"parent_service": "api-gateway", "service": "wallet-balance", "operation": "GetBalance", "latency_ms": 42, "error": False},
    {"parent_service": "wallet-balance", "service": "ledger", "operation": "Read", "latency_ms": 18, "error": False},
    {"parent_service": "wallet-balance", "service": "risk-score", "operation": "Score", "latency_ms": 31, "error": False},
    {"parent_service": "wallet-balance", "service": "exp-fraud-v2", "operation": "Check", "latency_ms": 8400, "error": True},
    {"parent_service": "wallet-balance", "service": "ledger", "operation": "Read", "latency_ms": 22, "error": False},
    {"parent_service": "wallet-balance", "service": "exp-fraud-v2", "operation": "Check", "latency_ms": 7900, "error": False},
] * 1000  # simulate a busy minute

g = build_graph(spans)
ranked = sorted(g.items(), key=lambda kv: -kv[1].calls)
print(f"{'caller':<16}{'callee':<18}{'op':<14}{'calls':>7}{'err%':>7}{'p99':>9}")
for (caller, callee, op), s in ranked:
    err_pct = 100.0 * s.errors / s.calls
    print(f"{caller:<16}{callee:<18}{op:<14}{s.calls:>7}{err_pct:>6.1f}%{s.p99():>7}ms")
Realistic output:
caller          callee            op              calls   err%      p99
wallet-balance  ledger            Read              2000   0.0%     22ms
wallet-balance  exp-fraud-v2      Check             2000  50.0%   8400ms
wallet-balance  risk-score        Score             1000   0.0%     31ms
api-gateway     wallet-balance    GetBalance        1000   0.0%     42ms
The walkthrough: the EdgeStats dataclass holds the running counters and a list of latencies (in production you would replace this with a t-digest — see /wiki/tail-latency-aggregation — because keeping every latency is unbounded). build_graph scans every span and skips root spans (parent_service is None); root spans are entry points (the API gateway itself), not incoming edges. The grouping key is (parent, child, operation) — the multigraph form. The output sorts edges by call count, which is the most common starting question during an incident: "what does this service mostly call?". The realistic output above immediately surfaces the broken edge: exp-fraud-v2:Check has a 50% error rate and an 8.4-second p99, while every other edge is clean. In production, this is the table that pages the on-call.
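If you want to see what bounding that latency list looks like without pulling in a t-digest library, here is a minimal sketch that swaps the unbounded list for reservoir sampling. The class name and capacity are illustrative, and a real pipeline would use a proper quantile sketch rather than a uniform sample:

import random

class BoundedLatencies:
    """Fixed-size uniform sample of latencies (reservoir sampling).

    Stand-in for a t-digest: memory is O(capacity) no matter how many spans
    the edge sees, at the cost of an approximate p99.
    """
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.seen = 0
        self.sample = []

    def add(self, latency_ms):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(latency_ms)
        else:
            # Replace a random slot with probability capacity/seen, which keeps
            # the sample uniform over everything seen so far.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = latency_ms

    def p99(self):
        if not self.sample:
            return 0
        s = sorted(self.sample)
        return s[min(len(s) - 1, int(0.99 * len(s)))]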
Four things the graph lets you answer that nothing else can
Once you have the graph, four classes of question that were previously unanswerable become routine. First, blast radius: if I deploy a breaking change to wallet-balance, who breaks? Run a reverse BFS from wallet-balance over incoming edges; every node reached is a potential casualty. The teams that own those nodes get a pre-deploy notification. KapitalKite's deploy pipeline blocks any change that increases blast radius beyond a configured limit without explicit sign-off — this is how a 4-line config tweak in pricing-cache was stopped from breaking 14 unrelated trading dashboards.
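A minimal sketch of that reverse BFS, reusing the edge dict that build_graph produces above; the blast_radius function name is illustrative:

from collections import defaultdict, deque

def blast_radius(edges, target):
    """Every service that transitively calls `target`: the set of potential casualties."""
    # Invert the graph once: callee -> set of direct callers.
    callers = defaultdict(set)
    for caller, callee, _op in edges:
        callers[callee].add(caller)

    seen, queue = set(), deque([target])
    while queue:
        for caller in callers[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

# blast_radius(g, "wallet-balance") returns {"api-gateway"} for the toy spans above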
Second, critical-path identification. For a given user-facing endpoint, the critical path is the chain of services whose latencies add up (along the longest dependency path) to dominate end-to-end p99. Reading code can't tell you this — parallel fan-out calls don't add (only the slowest branch counts), serial calls do, and the mix changes per request. Why the longest weighted path and not the longest unweighted path: a chain of three fast services contributes less to p99 than a chain of two services where one is slow. The critical path is defined by accumulated latency on serial edges, not by hop count, so the graph algorithm is a longest-path-with-edge-weights — solvable in linear time on a DAG, NP-hard with cycles, which is one more reason cycle detection matters first. The graph, weighted by latency, lets you compute the longest-weighted path and find the one slow service that is gating everything. PaySetu found that 78% of checkout p99 was contributed by a single edge to tax-calculation, a service no one had measured because it "felt fast" in dev.
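A sketch of that longest-weighted-path computation, assuming the edge dict from build_graph, per-edge p99 as the weight, and a graph already verified to be acyclic; the names are illustrative, and a real implementation would also model parallel fan-out instead of treating every edge as serial:

from collections import defaultdict

def critical_path(edges, entry):
    """Heaviest p99-weighted path from `entry`, assuming the sync graph is a DAG.

    `edges` maps (caller, callee, op) -> EdgeStats as produced by build_graph.
    Returns (total_ms, path). Run cycle detection first: with a cycle the
    longest simple path becomes NP-hard and this memoised DFS would recurse
    without bound.
    """
    children = defaultdict(list)
    for (caller, callee, _op), stats in edges.items():
        children[caller].append((callee, stats.p99()))

    memo = {}
    def longest_from(node):
        if node in memo:
            return memo[node]
        best = (0, [node])
        for child, weight in children[node]:
            cost, path = longest_from(child)
            if weight + cost > best[0]:
                best = (weight + cost, [node] + path)
        memo[node] = best
        return best

    return longest_from(entry)

# critical_path(g, "api-gateway") picks the api-gateway -> wallet-balance -> exp-fraud-v2 chain for the toy spans above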
Third, cycle detection. Synchronous dependency cycles are deadlock factories — A calls B which calls C which calls A under the same request, and the moment any of them slows down the whole cycle stalls. Tarjan's strongly-connected-components algorithm over the graph surfaces every cycle in O(V+E). MealRush ran this once and found seven cycles, two of which had been silent for months because traffic had not yet hit the conditional branch that closed the loop. Async-via-queue cycles are fine; sync RPC cycles are bugs.
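A compact sketch of that SCC pass, assuming the edges have already been classified and filtered down to synchronous RPC pairs; the function name and the size-two threshold follow the text above:

from collections import defaultdict

def find_cycles(sync_edges):
    """Tarjan's SCC over synchronous edges; every SCC of size >= 2 is a cycle.

    `sync_edges` is an iterable of (caller, callee) pairs already filtered to
    synchronous RPC edges, with async/queue edges excluded.
    """
    graph, nodes = defaultdict(set), set()
    for caller, callee in sync_edges:
        graph[caller].add(callee)
        nodes.update((caller, callee))

    index_of, lowlink, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        index_of[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index_of:
                strongconnect(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index_of[w])
        if lowlink[v] == index_of[v]:  # v is the root of a strongly connected component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            if len(scc) >= 2:
                sccs.append(scc)

    for v in nodes:
        if v not in index_of:
            strongconnect(v)
    return sccs

# find_cycles([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]) returns one SCC containing a, b and c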
Fourth, drift detection. The graph at noon today should look almost identical to the graph at noon last Tuesday. When a new edge appears that did not exist a week ago, something happened — a deploy, a feature flag, an experiment, a reconfiguration. PlayDream runs a daily diff of the graph and posts new edges to a #graph-drift Slack channel; this caught the time a third-party A/B-testing SDK started phoning home through the entire user-service tree without anyone reviewing it.
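The diff itself is a set operation over edge keys; a sketch, with the Slack posting left as a hypothetical callback:

def graph_drift(today_edges, last_week_edges):
    """Edge keys present in one graph but not the other; keys are (caller, callee, op)."""
    added = set(today_edges) - set(last_week_edges)
    removed = set(last_week_edges) - set(today_edges)
    return added, removed

# added, removed = graph_drift(build_graph(todays_spans), build_graph(last_weeks_spans))
# for caller, callee, op in sorted(added):
#     post_to_slack(f"new edge: {caller} -> {callee}:{op}")  # post_to_slack is a hypothetical notifier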
What is hard about this in production
The hard part is not the algorithm — every step above is undergraduate graph theory. The hard parts are scale, completeness, and trust. Scale: BharatBazaar processes 2.4B spans/hour at peak. Aggregating these into a graph in real time means a streaming aggregation in Flink or Spark Structured Streaming, with stateful per-edge accumulators and watermarks for late-arriving spans. Completeness: services that don't propagate trace context disappear from the graph; legacy systems, third-party SDKs, and database calls (which usually don't have outbound spans of their own) all leave holes. The mesh + eBPF backstop closes most of these, but the reconciliation logic that decides "this Envoy edge and this trace edge are the same edge" is non-trivial. Trust: every dashboard built on the graph implicitly trusts that the graph reflects reality. When the graph misses an edge, debugging is harder than if there were no graph at all because engineers stop double-checking.
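To make the streaming shape concrete, here is a toy stand-in for the stateful operator such a job runs: tumbling five-minute windows keyed by edge, a watermark driven by event time, and a lateness allowance before a window is finalised. It reuses EdgeStats from the example above, assumes each span carries a timestamp_ms field, and the constants and emit callback are illustrative:

from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000      # 5-minute tumbling windows
LATENESS_MS = 60 * 1000        # accept spans up to 60 seconds behind the watermark

class WindowedGraphBuilder:
    def __init__(self, emit):
        self.emit = emit                                   # callback: (window_start_ms, edge_dict)
        self.windows = defaultdict(lambda: defaultdict(EdgeStats))
        self.watermark = 0                                 # highest event time seen so far

    def on_span(self, sp):
        if sp.get("parent_service") is None:
            return                                         # root span, no incoming edge
        ts = sp["timestamp_ms"]
        self.watermark = max(self.watermark, ts)
        window = ts - ts % WINDOW_MS
        if window + WINDOW_MS + LATENESS_MS < self.watermark:
            return                                         # span arrived after its window was finalised; drop or side-output
        e = self.windows[window][(sp["parent_service"], sp["service"], sp["operation"])]
        e.calls += 1
        if sp.get("error"):
            e.errors += 1
        e.latencies_ms.append(sp["latency_ms"])
        self._flush()

    def _flush(self):
        # Finalise every window the watermark has moved past (plus the lateness allowance).
        closed = [w for w in self.windows if w + WINDOW_MS + LATENESS_MS < self.watermark]
        for w in closed:
            self.emit(w, self.windows.pop(w))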
The accepted production architecture is two graphs: a high-fidelity one from traces (for latency / error analysis) and a high-completeness one from mesh + eBPF (for blast radius and security). Reconciling discrepancies between them is itself a signal — an edge that mesh sees but traces miss is a service that needs trace instrumentation; an edge that traces see but mesh misses is a service that bypassed the mesh and should not have.
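Once both graphs are normalised to the same granularity, the reconciliation reduces to set algebra over edge keys; a sketch, with category names that are illustrative rather than any product's schema:

def reconcile(trace_edges, mesh_edges):
    """Compare a trace-derived and a mesh-derived graph at service granularity.

    trace_edges: (caller, callee, operation) keys from build_graph.
    mesh_edges:  (source, destination) service pairs from mesh telemetry.
    """
    trace_pairs = {(caller, callee) for caller, callee, _op in trace_edges}
    mesh_pairs = set(mesh_edges)
    return {
        "needs_trace_instrumentation": mesh_pairs - trace_pairs,  # mesh sees it, traces do not
        "bypassed_the_mesh": trace_pairs - mesh_pairs,            # traces see it, mesh does not
        "confirmed_by_both": trace_pairs & mesh_pairs,
    }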
Common confusions
- "The architecture diagram is the dependency graph." No — the diagram is what someone wrote down once. The graph is what the running fleet actually does, and the two diverge within weeks of any service launch. Treat the diagram as documentation of intent and the graph as documentation of reality.
- "Tracing alone gives a complete graph." No — anything that doesn't propagate trace context (legacy services, most database drivers, every third-party HTTP client until you wrap it) is invisible to traces. You need mesh telemetry or eBPF as a backstop, otherwise your "complete" graph is missing 20–40% of edges.
- "Edge weights should be call counts." Mostly — but for blast-radius and critical-path queries you also need latency and error-rate weights. A 0.1-RPS edge with 100% error rate matters more than a 1000-RPS edge with 0% error rate, and call count alone hides this.
- "Service dependency graphs are the same as call graphs." No — a call graph is per-process (which functions call which functions inside one binary); a dependency graph is across services (which binaries call which binaries across the network). The algorithms are similar but the data sources, time windows, and failure modes are completely different.
- "Cycles in the graph mean a deadlock." Only if the cycle is synchronous and on the same request path. A → queue → B → queue → A is fine because the queue decouples the timing. Tarjan's algorithm finds all cycles; you then have to filter by edge type (sync RPC vs async messaging).
- "Async edges don't belong in the graph." They do, but as a separate edge type. A Kafka publish from
paymentsto a topic thatnotificationsconsumes is a real dependency —notificationswill be empty ifpaymentsstops publishing — and ignoring it gives you a false sense of decoupling.
Going deeper
Tarjan's algorithm and the dependency cycle question
Tarjan's strongly-connected-components algorithm finds, in O(V+E), every maximal set of nodes mutually reachable from each other. Applied to the runtime dependency graph filtered to synchronous edges only, every non-trivial SCC (size ≥ 2) is a cycle and a potential deadlock. The classic cited paper is Tarjan, "Depth-First Search and Linear Graph Algorithms" (1972) — the algorithm is simple enough that the production version usually fits in 50 lines, and the bottleneck is not the graph traversal but the upstream edge-classification (sync vs async, with-timeout vs without).
Architecture-as-code and the graph as a contract
Some shops invert the relationship: they declare the dependency graph as code (in YAML or Cue), check it into version control, and use the runtime graph to verify compliance. New edges that were not declared trigger a build failure or an alert. This is the inverse of drift detection: drift detection asks "did the runtime graph change?", architecture-as-code asks "is the runtime graph still equal to the declared one?". The pattern is most common in regulated industries — Spotify's Backstage and AWS's Application Composer both lean this direction, and the open-source dependency-cruiser is a building block.
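A minimal sketch of the compliance check, with a hand-written declared graph standing in for the YAML or Cue file; the structure and names are illustrative:

# A declared graph, standing in for a YAML/Cue file checked into the repo:
DECLARED = {
    "api-gateway": {"wallet-balance"},
    "wallet-balance": {"ledger", "risk-score"},
}

def undeclared_edges(runtime_edges, declared):
    """Runtime edges with no matching declaration; a non-empty result fails the build."""
    return [
        (caller, callee, op)
        for caller, callee, op in runtime_edges
        if callee not in declared.get(caller, set())
    ]

# undeclared_edges(g, DECLARED) flags the wallet-balance -> exp-fraud-v2 edge for the toy spans above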
The graph during a partition
A network partition splits the graph into components that cannot reach each other; the dependency graph during a partition is the graph computed over only the spans that completed, which means partitioned services disappear from the graph entirely. This is actually a useful signal — a sudden drop in edge count from a service to its database is exactly the symptom of a partition. The partition section of /wiki/wall-distributed-is-a-failure-first-design covers the broader implications; the graph view of partitions is one of the cleaner ways to see them.
The graph as a feature for autonomous remediation
The graph is increasingly the input to autonomous remediation systems. A degraded edge (high error rate, high latency) becomes a candidate for automated rollback if the deploy that introduced it can be identified, automated traffic shedding via the mesh, or automated capacity expansion. Discord and Cloudflare have published variants of this; the catch is that the graph has to be authoritative — a remediation system that acts on a stale or incomplete graph will make outages worse. This is why production graph pipelines run with sub-minute end-to-end latency.
Where this leads next
Once you have the graph, the next questions are about what to compute on it. The graph is the substrate for service-level objectives (/wiki/slo-definition-error-budgets), for tail-latency aggregation across edges (/wiki/tail-latency-aggregation), and for incident-response tooling that uses the graph to suggest probable root causes (/wiki/incident-response-tooling). The graph is also the structural input to chaos engineering — you cannot intelligently choose a fault-injection target without a graph telling you which services have the highest blast radius.
The deeper arc is that observability in distributed systems is increasingly graph-shaped rather than time-series-shaped. Metrics, logs, and traces all become projections of the underlying graph: a metric is a property of a node, a trace is a path through the graph, a log is an annotation on an edge. The graph is the unifying structure (/wiki/wall-observability-in-distributed-systems-is-a-data-problem).
References
- Sigelman, B. et al. — Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google Technical Report (2010). The foundational paper; every modern dependency graph descends from Dapper's span model.
- Tarjan, R. — Depth-First Search and Linear Graph Algorithms, SIAM J. Computing (1972). The strongly-connected-components algorithm used for cycle detection.
- Ousterhout, K. et al. — Making Sense of Performance in Data Analytics Frameworks, NSDI (2015). Critical-path analysis adapted to distributed call graphs; the longest-weighted-path technique referenced in §3.
- The Linkerd / Istio service-mesh documentation — Envoy access-log format and the canonical schema for mesh-derived edges.
- Cilium Hubble and Pixie documentation — eBPF-based dependency-graph extraction without application instrumentation.
- Beyer, B. et al. — Site Reliability Engineering (Google, 2016), chapter "Monitoring Distributed Systems". The blast-radius and drift framing comes from here.
- /wiki/distributed-tracing-w3c-dapper-jaeger — the upstream tracing pipeline that produces the spans this article aggregates.
- /wiki/wall-observability-in-distributed-systems-is-a-data-problem — why the graph is the right unifying structure for observability data.