Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Debugging cross-service outages
It is 21:47 on a Friday at CricStream during the second over of a knockout match, and the player-stats panel has stopped updating for 9% of viewers. The on-call channel lights up with seventeen alerts in ninety seconds: stats-aggregator p99 has spiked, commentary-service is throwing 503s, the CDN is reporting cache-miss surges in the Mumbai PoP, the recommendation feed has frozen, and somewhere a Kafka consumer-lag graph has gone vertical. Arjun, the responder, has six dashboards open and a critical question with no obvious answer: of the seventeen things that look broken, which one is actually broken, and which sixteen are just downstream casualties? Every minute he spends on the wrong service costs roughly 180,000 viewer-seconds of degraded experience. Cross-service debugging is the discipline of answering that question fast enough to matter.
Debugging a cross-service outage is the inverse of building one — you start with a fan-out of symptoms across the dependency graph and work backwards along the critical path until you find the single edge that originated the failure rather than propagated it. The signals that distinguish origin from propagation are temporal (who started failing first, by milliseconds), structural (which service has no failing dependencies of its own), and statistical (whose error pattern leads the others by a constant lag). The mature playbook narrows from "everything is broken" to "this one bad deploy" in under ten minutes by combining traces, the dependency graph, and a small repertoire of scripted queries — not by reading dashboards in a panic.
What makes cross-service outages structurally different
A single-service outage is a tractable engineering problem: one binary, one log stream, one memory profile, one git history to bisect. A cross-service outage is a different kind of object — it is a propagation pattern through a graph of services, where the visible symptoms are usually nowhere near the cause. The classic version: a database in one service starts returning slow queries, the service holds connections open longer, the connection pool of every upstream caller drains, the upstream callers' health checks fail, the load balancer takes them out of rotation, the remaining instances are crushed by the redistributed traffic, and an entirely different set of services starts paging because their queues are backing up. By the time the first human looks at a dashboard, the original slow query is two services deep and four minutes in the past.
This shape — small early cause, large late symptom — has three implications that govern every debugging strategy. First, the loudest service is almost never the cause. The service that owns the most user-visible traffic shows the largest dashboard spike, but it shows it because it is the integration point where many failing dependencies collide; treating that dashboard as the crime scene, rather than as the place to start walking inward from, is the standard novice mistake. Second, the wall-clock ordering of alerts encodes the propagation path. If you can reconstruct who started failing first, by milliseconds, you can read the causal arrow off the timeline directly. Third, services with no failing dependencies of their own are origin candidates. Every other failing service has at least one dependency that is also failing; the origin is the unique service whose error rate climbed without any of its dependencies climbing first. This is the structural definition of root cause in a graph: a node whose failure was not predicted by the failures of its neighbours.
The propagation arrow — finding origin from temporal ordering
Distributed-tracing systems give you per-span timestamps with millisecond resolution; the dependency graph (/wiki/service-dependency-graphs) gives you the directed edges. The combination lets you compute, for any incident window, an ordered list of "who started failing when" — and the origin is approximately the service at the head of that list. Why approximately and not exactly: clock skew between hosts can be tens of milliseconds even with NTP, propagation delays add 1–5 ms per hop, and the very first errors are statistical noise indistinguishable from the steady-state error rate. So the head of the list is a candidate, not a verdict — but in practice the candidate is correct >90% of the time when the gap to the next candidate is more than a few seconds.
The query is a roughly 30-line script. For each service, find the first one-minute window in the incident period where its error rate exceeded its trailing-week 99th percentile, and rank services by that timestamp ascending. The Python below runs the calculation on an in-memory span stream; production versions run as Spark or Flink jobs over months of indexed trace data, but the algorithm is identical:
from collections import defaultdict
from statistics import quantiles

def find_origin(spans, incident_start, incident_end, baseline_pct=99):
    by_service_min = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # svc -> minute -> [errs, total]
    for sp in spans:
        m = int(sp["ts"] // 60) * 60
        by_service_min[sp["service"]][m][1] += 1
        if sp.get("error"):
            by_service_min[sp["service"]][m][0] += 1
    first_breach = {}
    for svc, mins in by_service_min.items():
        baseline = [errs / max(1, total)
                    for ts, (errs, total) in mins.items()
                    if ts < incident_start and total > 50]
        if len(baseline) < 20:
            continue  # not enough history to baseline
        # trailing-week 99th percentile plus a 0.5% absolute margin
        threshold = quantiles(baseline, n=100)[baseline_pct - 1] + 0.005
        for ts in sorted(mins):
            if ts < incident_start or ts > incident_end:
                continue
            errs, total = mins[ts]
            if total > 50 and errs / total > threshold:
                first_breach[svc] = ts
                break
    return sorted(first_breach.items(), key=lambda kv: kv[1])

# example: synthesised spans where 'pricing-cache' goes bad first
import random

incident = 1714410000  # epoch seconds, Mon 2024-04-29 22:30 IST
spans = []
for t in range(incident - 7 * 86400, incident + 600, 30):
    for svc, base in [("pricing-cache", 0.001), ("order-router", 0.002), ("checkout", 0.001)]:
        bad = (svc == "pricing-cache" and t >= incident) or \
              (svc == "order-router" and t >= incident + 45) or \
              (svc == "checkout" and t >= incident + 110)
        rate = 0.30 if bad else base
        for _ in range(60):
            spans.append({"ts": t + random.random() * 30, "service": svc,
                          "error": random.random() < rate})

ranking = find_origin(spans, incident, incident + 600)
for svc, ts in ranking:
    print(f"{svc:<16} first breach at +{int(ts - incident):>4}s")
Realistic output:
pricing-cache    first breach at +   0s
order-router     first breach at +  60s
checkout         first breach at + 120s
The walkthrough: by_service_min collects per-service per-minute error and total counts directly from the span stream. baseline is the trailing week of error rates excluding the incident window; its 99th percentile, plus a small 0.5% absolute margin, gives the per-service alert threshold. We walk forward through the incident minutes and stop at the first breach for each service. The ranked output is the propagation arrow: pricing-cache breached first, order-router followed 60 seconds later (it is pricing-cache's direct caller), and checkout followed 60 seconds after that, 120 seconds after the origin and one hop further upstream. The responder's eye lands on the bottom of the list because the bottom is what the user-facing dashboards show, but the top of the list is the bug. This single inversion is the hardest habit to teach; experienced responders look at the head of the list first by reflex.
The structural test — origin nodes have clean dependencies
The temporal ranking is the first cut, but it has failure modes: clock skew, alert thresholds that fire at slightly different sensitivities, and bursts of cascading failure where the millisecond-level ordering is genuinely ambiguous. The structural test sharpens it. Take the dependency graph (computed continuously per /wiki/service-dependency-graphs), restrict it to services that breached during the incident, and find the nodes whose outgoing edges are clean — i.e., whose dependencies did not breach. Those nodes are the origin candidates; every other failing node has at least one failing dependency that is a more parsimonious explanation.
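A minimal sketch of that test, assuming the dependency graph is available as a plain mapping from each service to the set of services it calls; the service names, the extra dependencies (user-profile, inventory, redis-cluster), and the breached set below are hypothetical:

def origin_candidates(depends_on, breached):
    """Return breached services none of whose own dependencies also breached."""
    candidates = []
    for svc in breached:
        if not (depends_on.get(svc, set()) & breached):  # all outgoing edges are clean
            candidates.append(svc)
    return sorted(candidates)

# hypothetical graph matching the synthesised incident above
depends_on = {
    "checkout":      {"order-router", "user-profile"},
    "order-router":  {"pricing-cache", "inventory"},
    "pricing-cache": {"redis-cluster"},
}
breached = {"checkout", "order-router", "pricing-cache"}
print(origin_candidates(depends_on, breached))  # ['pricing-cache']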
The combination of temporal first and structural source is the strongest cheap signal a debugger has. PaySetu's incident-response tooling computes both within thirty seconds of the first page and posts the candidate list to the war room channel before anyone has finished opening their laptop. Why this combines well with traces specifically: the trace data already carries both the temporal signal (span timestamps) and the structural signal (parent-child service edges) in one stream. You do not need to join two systems together; one query over the span store yields both ranking and graph topology, which is why distributed tracing earns its substantial cost only when the responder workflow uses it this way. Without this tooling, the team relies on whoever shouts loudest in the channel — usually whoever owns the most user-facing service, who is by construction the last service to fail.
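The span stream can also produce the graph itself. A minimal sketch, assuming each span carries span_id, parent_id, and service fields (the field names are illustrative; real span schemas differ):

from collections import defaultdict

def service_edges(spans):
    """Derive caller -> callee service edges from parent-child span links."""
    svc_of = {sp["span_id"]: sp["service"] for sp in spans}
    depends_on = defaultdict(set)
    for sp in spans:
        parent = sp.get("parent_id")
        if parent in svc_of and svc_of[parent] != sp["service"]:
            depends_on[svc_of[parent]].add(sp["service"])  # parent's service called this span's service
    return depends_on

# hypothetical three-span trace: checkout -> order-router -> pricing-cache
trace = [
    {"span_id": "a", "parent_id": None, "service": "checkout"},
    {"span_id": "b", "parent_id": "a",  "service": "order-router"},
    {"span_id": "c", "parent_id": "b",  "service": "pricing-cache"},
]
print(dict(service_edges(trace)))  # {'checkout': {'order-router'}, 'order-router': {'pricing-cache'}}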
The statistical test — leading-lag correlation as a backstop
Both tests above can fail when the cause and effect are very close in time, when the dependency graph is incomplete (services that don't propagate trace context), or when there are multiple correlated origins (two services failing for shared reasons, neither downstream of the other). The statistical backstop is leading-lag correlation: for every pair of failing services, compute the cross-correlation of their error-rate time series at a range of short positive lags. A strong correlation at a positive lag is a lead-follow relationship that suggests causality, and the lag itself estimates the propagation delay. KapitalKite uses this to disambiguate the case where two services share a database, both fail simultaneously, and neither obviously precedes the other in the trace data — the database itself, which is rarely instrumented, is unmasked by being the common lead-time signal that beats every service-to-service pair.
Lag correlation is not causal proof; two services that share an upstream load balancer will correlate without a direct causal link.
Why the lag has to be positive and bounded: a negative-lag correlation says the "cause" started failing after the "effect", which is anti-causal and means you have either flipped the sign or you are looking at coincident shared-cause failure. An unboundedly large positive lag says the two series happen to share a slow trend (daily traffic curve, weekly seasonality) rather than incident-level causality. Production implementations restrict the search to lags between +1s and +120s and discard correlations whose absolute value falls below ~0.6, which trades sensitivity for false-positive control.
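A minimal sketch of the lag search under those constraints, using plain Pearson correlation over per-second error-rate series on a shared clock; the series below are synthesised so one leads the other by 45 seconds, and all names are illustrative:

from statistics import mean, pstdev
import random

def best_lead_lag(leader, follower, max_lag=120, min_corr=0.6):
    """Return (lag, correlation) where `leader` best predicts `follower`, or None.

    A positive lag means `leader` started failing `lag` samples before `follower`.
    """
    best = None
    for lag in range(1, max_lag + 1):
        a, b = leader[:-lag], follower[lag:]  # shift follower back by `lag` samples
        if len(a) < 30 or pstdev(a) == 0 or pstdev(b) == 0:
            continue
        ma, mb = mean(a), mean(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
        corr = cov / (pstdev(a) * pstdev(b))
        if abs(corr) >= min_corr and (best is None or corr > best[1]):
            best = (lag, corr)
    return best

# synthesised per-second error rates: the 'cache' series steps up 45 seconds before 'router'
n = 600
cache  = [0.30 if t >= 200 else 0.002 + random.random() * 0.002 for t in range(n)]
router = [0.30 if t >= 245 else 0.002 + random.random() * 0.002 for t in range(n)]
print(best_lead_lag(cache, router))  # expect roughly (45, 0.99...)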
But in practice, when temporal-first and structural-source agree, lag correlation only refines the answer; when they disagree, lag correlation is the tiebreaker. Three independent signals lining up on the same service is what passes the bar for an automated rollback.
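A hypothetical sketch of that gate, in which an automated rollback is proposed only when all three signals nominate the same service (the function and its wiring are placeholders, not any real platform's API):

def rollback_candidate(temporal_first, structural_sources, lag_leader):
    """Return a service only when the three independent localisation signals agree."""
    if temporal_first is None or lag_leader is None:
        return None
    if temporal_first in structural_sources and temporal_first == lag_leader:
        return temporal_first
    return None  # disagreement: surface all candidates to a human instead

# hypothetical wiring of the three sketches above
candidate = rollback_candidate("pricing-cache", {"pricing-cache"}, "pricing-cache")
if candidate:
    print(f"proposing automated rollback of the last deploy on {candidate}")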
The MealRush incident — a worked example
MealRush ran a delivery rider-tracking service that started failing on a Saturday evening. Within four minutes, the on-call had alerts from rider-tracker, eta-prediction, order-status, notifications, and the customer mobile-app BFF — five services across three teams. The instinct was to investigate rider-tracker because that was where the alert chain started visually on the dashboard. The temporal-first query took eleven seconds to run and put geo-index at the head of the list, breaching the error threshold 73 seconds before rider-tracker. geo-index was a shared service nobody had paged because nobody owned it any more — the original team had been re-orged six months prior. The structural test confirmed it: geo-index had no failing outgoing edges (its dependencies, two Cassandra clusters, were green); every other failing service had geo-index as a failing dependency. Total time from first page to an identified and mitigated origin: 9 minutes and 12 seconds, of which 8 minutes were spent waiting for the rollback to deploy; it was the right rollback on the first attempt because the diagnosis was correct. The post-incident write-up estimated that without the temporal-first tooling the responders would have spent another 20–30 minutes investigating rider-tracker first.
Common confusions
- "The service with the loudest alert is the cause." No — it is the most distant symptom. The cause is upstream of the loudest alert, in the direction of dependencies, and is usually only a few percent off baseline. Walk inward, not at the dashboard.
- "Distributed tracing is enough on its own." No — traces give you per-request paths and timings, but the outage-level questions (whose error rate climbed first, which service has clean dependencies) require aggregations over traces, which are a different system on top of the trace store. Most teams that "have tracing" can answer per-request questions and not outage questions, and conflate the two.
- "Cross-service debugging is the same as distributed tracing." Tracing is the data substrate; cross-service debugging is the discipline of querying that substrate during an incident. The skill is choosing the right query first, not having the data available.
- "If two services failed at the same time, they have a common cause." Often true, but not always — clocks skew by tens of milliseconds, alert thresholds fire at slightly different sensitivities, and "the same time" at the dashboard granularity (15s) hides ordering at the millisecond granularity. Always pull the millisecond-resolution view before concluding simultaneity.
- "Once you find the origin, the bug is also there." Sometimes — but the origin is the first service to misbehave, not necessarily the service whose code is wrong. A service can misbehave because its config was wrong, because a dependency it doesn't directly call (a shared queue, a DNS server) is degraded, or because a kernel parameter changed under it. Origin localisation narrows the search; it does not end it.
- "You should always look at the most recent deploy." Useful but not sufficient — many cross-service outages are triggered by traffic-pattern changes (a marketing campaign, an external integration spamming you), data changes (a corrupted record propagating through a pipeline), or environmental shifts (a noisy neighbour on the underlying VM host) with no recent deploy at all. Deploy correlation is one of several signals, not the master signal.
Going deeper
Why mean-time-to-recovery is mostly mean-time-to-localise
Industry MTTR data consistently shows that for cross-service incidents, the time from "first alert" to "we know which service to fix" is 60–80% of the total recovery time, and the time from "we know which service" to "we have shipped the fix" is the remaining 20–40%. This is why every minute spent improving localisation has a much higher ROI than every minute spent improving deploy speed or rollback automation — those late stages are already short. Google's SRE book makes this point obliquely; the explicit measurement comes from VOID (Verica Open Incident Database) reports across hundreds of public post-mortems.
The role of synthetic monitoring during the incident
Synthetic probes (continuous black-box transactions executed from outside the system) often show a different and more sensitive picture than internal alerts during an outage. A synthetic transaction that is failing at 100% but for which no internal alert has fired yet often points at the originating service before the metric thresholds catch up. The trade-off is that synthetic probes are sparse — you have a handful of representative paths, not coverage of the full service surface. The pattern in mature organisations is to run both: synthetics to see the user view in real time, internal metrics to localise. Discord and Cloudflare have written about this; the observability section of /wiki/wall-observability-in-distributed-systems-is-a-data-problem covers the broader frame.
Topology-aware alert grouping
Most paging systems will fire one alert per service per threshold breach, which during a cross-service outage produces dozens of pages within a minute. Topology-aware alert grouping — where the paging system uses the dependency graph to collapse alerts into clusters and pages only the origin candidates — has become a feature of recent observability platforms. The semantics are: if service A is paging and service A has a failing dependency on service B which is also paging, suppress A's page until B is acknowledged. This works because the structural-source test is exactly the right inversion of "what to suppress". PagerDuty's "incident intelligence" and Grafana's incident features both lean this direction, with mixed results — the failure mode is over-suppression when the graph is incomplete.
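A minimal sketch of that suppression rule, assuming alerts arrive as a set of currently paging services plus the same dependency mapping used earlier; acknowledgement state and time windows, which real platforms track, are omitted here:

def pages_to_send(paging, depends_on):
    """Suppress a service's page while one of its own dependencies is also paging."""
    to_send, suppressed = [], []
    for svc in sorted(paging):
        failing_deps = depends_on.get(svc, set()) & paging
        if failing_deps:
            suppressed.append((svc, sorted(failing_deps)))
        else:
            to_send.append(svc)
    return to_send, suppressed

# hypothetical page storm during the pricing-cache incident
paging = {"checkout", "order-router", "pricing-cache"}
depends_on = {"checkout": {"order-router"}, "order-router": {"pricing-cache"}, "pricing-cache": set()}
send, held = pages_to_send(paging, depends_on)
print("page now:", send)  # ['pricing-cache']
print("held:", held)      # checkout and order-router, each tagged with its failing dependency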
The role of chaos engineering in debugging muscle
Chaos engineering — deliberately injecting faults in production — is mostly framed as a resilience-testing tool, but its underrated benefit is that it builds the cross-service debugging muscle in calm conditions. Responders who have repeatedly diagnosed a latency injection on payment-service during an exercise are dramatically faster at diagnosing the real version when it happens at 03:00 IST. Netflix's Chaos Monkey was the original; Gremlin productised the practice; the discipline is now standard at organisations operating large fleets.
Where this leads next
Cross-service debugging sits at the intersection of three of the most important upstream chapters: distributed tracing (/wiki/distributed-tracing-w3c-dapper-jaeger), the dependency graph (/wiki/service-dependency-graphs), and incident-response tooling (/wiki/incident-response-tooling). The next chapter on the arc — /wiki/postmortems-blameless-culture — is about what happens after the incident is resolved, when the same data is re-examined for systemic patterns rather than acute root cause.
The deeper trajectory is that debugging is becoming a structural problem rather than a heroic one. The senior engineer who could "smell" the right service is being replaced — not entirely, but mostly — by tooling that runs the temporal-first and structural-source queries automatically and surfaces a candidate list before the human has logged in. The remaining heroism is in the long tail: the 5% of incidents where the tooling is wrong because the graph was incomplete, the data was stale, or the cause was outside the system entirely. That tail will not vanish, but it is shrinking, and the discipline of cross-service debugging is the discipline of making it shrink faster.
References
- Verica — VOID (Verica Open Incident Database) Annual Reports (2021–2024). The empirical source for the "MTTR is mostly MTTL" claim, drawn from hundreds of public post-mortems.
- Sigelman, B. et al. — Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google Technical Report (2010). The trace data model that makes temporal-first localisation possible.
- Beyer, B. et al. — Site Reliability Engineering (Google, 2016), chapters on monitoring and incident response.
- Allspaw, J. — Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages (Master's thesis, Lund University, 2015). The cognitive-science foundation for how responders actually narrow causality.
- Netflix Tech Blog — Tooling for Production Debugging (various posts, 2018–2024). Practical patterns for topology-aware alerting and rollback automation.
- Grafana / PagerDuty incident-management documentation — current implementations of topology-aware alert grouping.
- /wiki/service-dependency-graphs — the upstream graph data this article queries during incidents.
- /wiki/distributed-tracing-w3c-dapper-jaeger — the upstream span pipeline this article reads.