Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Routing and escalation
At 09:17 IST on a Friday, an alert named LedgerWriteLatencyHigh fired against the payments service at a hypothetical Razorpay-pattern company. The alert was correctly written, the threshold was correctly tuned, the runbook was correctly linked — and it was routed to the infra-platform team because the underlying database lived on a Postgres cluster that the platform team owned. Infra-platform's primary on-call, Karan, was paged at 09:17:14, looked at the alert, saw "ledger" in the name, and waited 4 minutes for the payments team to investigate before realising payments had not been paged at all. The incident's mean-time-to-diagnose was 11 minutes longer than it needed to be — not because anyone made a mistake, but because the routing rule was authored as if every alert had exactly one owning team, and LedgerWriteLatencyHigh had two. Routing and escalation are where the alerting system meets the org chart, and the org chart almost never matches the failure surface.
Routing decides which on-call rotation receives an alert; escalation decides what happens when that rotation does not respond within a deadline. Both are routinely configured as a single-owner mapping from alert label to team, which fails the moment a real incident crosses team boundaries — and almost every interesting incident does. The fix is to design the routing graph and escalation chain explicitly, with measured handoff latencies and tested fallback paths, before the next cross-team incident exposes the gap.
What routing actually does — and why "service label maps to team" is wrong
Routing is the function that takes a fired alert and decides which on-call rotation receives the page. The naive implementation reads a single label off the alert (service or team or owner) and looks it up in a static map: payments → payments-oncall, ledger → ledger-oncall, gateway → platform-oncall. Every modern alerting tool — PagerDuty, Opsgenie, VictorOps, Alertmanager itself — supports this mapping in 30 seconds of configuration. The mapping is also wrong on its first non-trivial day, because alerts come from telemetry that was instrumented by one team and crosses systems owned by other teams, and the alert's most useful labels were never written with routing in mind.
A LedgerWriteLatencyHigh alert that fires when the ledger.write_duration_ms histogram exceeds 200ms at p99 has at least three plausible owners. The team that owns the ledger Postgres cluster sees a database-side cause: a long-running query, a vacuum stall, an autovacuum thrash, a replication lag spike. The team that owns the payments service sees a client-side cause: a connection-pool exhaustion, a query-plan regression in their ORM, a feature flag that just rolled out and added a new query path. The team that owns the network between them sees a third cause: a switch buffer overflow, an MTU misconfiguration, a TCP retransmit storm during a deploy. Why this matters for routing: the alert label says nothing about which of the three causes produced the latency spike, because at firing time nobody knows. The routing rule that picks one owner deterministically is making a guess — and the guess is correct, on average, less than half the time for any cross-team metric. The remaining cases are exactly the cross-team incidents that benefit most from fast response, because they are the ones where waiting for someone to forward the page costs minutes of mean-time-to-diagnose.
The technical framing is that routing is an inverse problem. The forward direction — given a fault in subsystem X, which alerts will fire — is well-defined and is what runbooks document. The inverse — given an alert, which subsystem caused it — is underdetermined for any alert that observes a metric whose value depends on multiple subsystems. Latency, error rate, saturation, queue depth, and throughput are all multi-cause metrics; only counters of internal state changes (deploys completed, leader elections, GC pauses) are single-cause. Most alerts are written against multi-cause metrics because those are what the SLO measures, which means most alerts are inverse-problem alerts and a single-owner routing rule is the wrong primitive.
The fix is not to enumerate every cross-team alert and write a custom rule for each — that is unmaintainable and the org chart changes faster than the alert library. The fix is to recognise routing as a graph: an alert has zero, one, or many responder edges to teams, and the routing engine fans out the page along all of them, with explicit deduplication and cooperation. PagerDuty calls this "responders"; Opsgenie calls it "teams of responders"; some teams roll their own using Alertmanager's routes tree with continue: true and per-route receivers. The mechanism varies; the principle is the same — alerts that observe shared resources fan out to all responsible teams, and the responsibility for not duplicating effort sits with the on-call interaction protocol (Slack channel, conference bridge, incident commander), not with the routing layer.
There is a corollary worth naming: the alert label and the routing label should not be the same field. The alert label answers "what fired" — it is part of the alert identity, used for deduplication, silencing, and dashboard grouping. The routing label answers "who responds" — it is part of the dispatch decision, used for fan-out and escalation. Conflating them (using service for both, say) means every change to routing forces a re-deploy of the alerting rules, and every change to the alert library risks accidentally rerouting pages. The mature pattern is two label namespaces: service, severity, runbook for alert identity; route_to, route_continue, route_off_hours for dispatch. Alertmanager and PagerDuty both support this separation; few teams use it, with the result that their alert configs and routing configs are tangled and any change requires expert review. The cost of separating them is one afternoon of refactor; the benefit is years of cleaner change reviews.
A second axis routing must respect: time of day. The same alert at 14:00 IST during business hours and at 02:47 IST off-hours is the same incident with very different optimal responder mixes. During business hours, fan out to all relevant teams' Slack channels and only page the canonical owner — junior engineers from the other teams will see the broadcast and self-route if their domain is implicated. Off-hours, fan out to all relevant teams' on-call rotations because there is no Slack channel watching, and the cost of waking a second on-call who turns out not to be needed is much smaller than the cost of waiting 11 minutes for the first on-call to forward the page. Most routing configurations conflate these two modes; the configuration that is correct for 14:00 IST is wrong for 02:47 IST and vice versa. The fix is time-of-day-aware routes (every modern alerting tool supports this; few teams use it).
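As a sketch, here is what the separated label namespaces and the time-of-day split look like together. The route_to and route_off_hours names follow the hypothetical dispatch namespace above; the dispatch function itself is a toy, since in a real deployment this decision lives in Alertmanager's route tree or the paging tool's configuration rather than in application code.
# dispatch_sketch.py - toy illustration of separated label namespaces plus
# time-of-day-aware fan-out. Label names follow the hypothetical route_* namespace
# from the text; the rotation names are illustrative only.
from datetime import datetime, timezone, timedelta

IST = timezone(timedelta(hours=5, minutes=30))

alert = {
    # Identity namespace: dedup, silencing, dashboards. Never used for dispatch.
    "identity": {"alertname": "LedgerWriteLatencyHigh", "service": "ledger",
                 "severity": "page", "runbook": "runbooks/ledger-write-latency"},
    # Dispatch namespace: who responds, and whether off-hours widens the fan-out.
    "dispatch": {"route_to": ["ledger-oncall", "payments-oncall", "platform-network-oncall"],
                 "route_off_hours": "fanout"},
}

def dispatch(alert, now):
    """Return (rotations_to_page, rotations_to_slack_broadcast)."""
    routes = alert["dispatch"]["route_to"]
    off_hours = now.hour >= 22 or now.hour < 7
    if off_hours and alert["dispatch"]["route_off_hours"] == "fanout":
        return routes, []                  # nobody is watching Slack; wake every responder
    return routes[:1], routes[1:]          # page the canonical owner, broadcast the rest

print(dispatch(alert, datetime(2026, 4, 25, 2, 47, tzinfo=IST)))   # off-hours: full fan-out
print(dispatch(alert, datetime(2026, 4, 25, 14, 0, tzinfo=IST)))   # business hours: narrow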
Escalation: what happens when nobody picks up
Escalation is the time-bounded fallback that fires when an alert is paged and not acknowledged within a deadline. The textbook escalation policy looks like a chain: page primary; if no ack in 5 minutes, page secondary; if no ack in 5 more minutes, page the manager; if no ack in 5 more, page the entire team. Every alerting platform ships a default chain that looks roughly like this, and most teams adopt it without modification on day one.
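The textbook chain is compact enough to express as data plus a walk over deadlines. The sketch below is a hypothetical representation rather than any tool's actual policy format; it exists to make the failure modes that follow concrete.
# textbook_chain.py - the default escalation chain as data plus a deadline walk.
# Step names and deadlines are hypothetical; real tools keep this in their
# escalation-policy config rather than in code.
CHAIN = [
    ("primary-oncall",        300),   # wait 5 minutes for an ack
    ("secondary-oncall",      300),   # then 5 more
    ("engineering-manager",   300),   # then 5 more
    ("whole-team-broadcast", None),   # terminal step
]

def walk_chain(ack_at=None):
    """Yield (seconds_since_first_page, responder) for every page the chain sends.

    ack_at is when some human acknowledges (seconds), or None if nobody does.
    An ack stops further escalation; it is also the only signal the chain has.
    """
    elapsed = 0
    for responder, deadline in CHAIN:
        if ack_at is not None and ack_at <= elapsed:
            return                         # acknowledged before this step fired
        yield elapsed, responder
        elapsed += deadline or 0

for t, who in walk_chain(ack_at=None):     # the unreachable-primary case
    print(f"t+{t:>3}s  page {who}")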
The chain has three failure modes in production. The first is escalation collapse: the primary acks the page on autopilot from bed (the alert-fatigue case from /wiki/alert-fatigue-as-a-production-failure), the system marks the page acknowledged, and the chain never fires — even though no human action will follow. The escalation policy as written cannot distinguish "acked and being investigated" from "acked and ignored", because the only signal it has is the ack itself. Teams that have lived through this failure mode add a second deadline — if no incident-channel activity in 10 minutes after ack, re-escalate — but this requires the alerting tool to integrate with the incident-channel tool, which is brittle and rarely tested.
The second is escalation explosion: a real incident at 02:47 IST that takes 30 minutes to mitigate produces a sequence of escalations — secondary at 02:52, manager at 02:57, team-wide at 03:02, VP at 03:07 — even though the primary is awake and working on it. The chain was designed to handle the case where the primary is unreachable, not the case where the primary is reachable and busy. Teams that have lived through this add an "if the primary is in an active incident, suppress further escalations" rule, which again requires tool integration that is brittle. Why both failure modes have the same root: the escalation chain is configured as a function of time-since-page, not as a function of incident-state. The right primitive is "escalate when the response is not progressing", but progress is a higher-order signal that requires either incident-management tooling integration or explicit human declaration ("I am on it; pause escalation"). Most teams are between the two: they have neither the tooling integration nor the cultural habit of declaring incident state, so they live with both failure modes and tune the deadline timers as a compromise.
The third is escalation handoff black hole: the primary acks, gets stuck, hands off to the secondary at minute 12, the secondary takes over but does not extend or reset the escalation timer, the manager gets paged at minute 15 anyway, the manager joins, sees the secondary is on it, leaves, and the escalation continues to fire upward at minute 20, 25, 30 — at which point the VP joins, asks "what is happening", and the team has to spend 4 minutes briefing them on a fully-handled incident. The escalation chain has no concept of handoff because the alerting tool does not model who is actively working the incident; it only models who has acknowledged the original page. Teams fix this by either integrating the alerting tool with the incident-management tool (PagerDuty + FireHydrant, or Opsgenie + Jeli), or by establishing a strict cultural rule that the active responder re-acks the page every 10 minutes and the escalation timer is reset on each re-ack.
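Here is a sketch of what those fixes look like once the escalation decision becomes a function of incident state rather than time-since-page alone. The state fields are hypothetical, and in practice they would be populated by glue code between the paging tool and the incident channel.
# state_aware_escalation.py - escalate on "not progressing", not only on "no ack yet".
from dataclasses import dataclass

@dataclass
class IncidentState:
    seconds_since_page: float
    acked: bool
    seconds_since_last_ack: float          # re-acks reset this (the handoff fix)
    seconds_since_channel_activity: float  # use a large value if no channel exists yet
    responder_declared_active: bool        # "I am on it; pause escalation"

ACK_DEADLINE = 300        # no ack at all: behave like the classic chain
PROGRESS_DEADLINE = 600   # acked but silent: re-escalate (the collapse fix)

def should_escalate(s: IncidentState) -> bool:
    if not s.acked:
        return s.seconds_since_page > ACK_DEADLINE
    if s.responder_declared_active:
        return False                       # suppresses the explosion case
    # Acked, but nobody has declared themselves active: re-acks and channel
    # messages both count as progress and reset the clock.
    progress_age = min(s.seconds_since_last_ack, s.seconds_since_channel_activity)
    return progress_age > PROGRESS_DEADLINE

# The ack-on-autopilot case: acked 12 minutes ago, no channel activity since.
print(should_escalate(IncidentState(720, True, 720, 720, False)))   # True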
A fourth failure mode appears at smaller scale and is worth mentioning even though it is less dramatic: the escalation skip. A team with only two on-calls (a startup of 8 engineers, say) may configure the chain as primary → secondary → all-of-engineering, with the all-of-engineering step intended as a last resort. In practice, when the primary's phone runs out of battery overnight (more common than people admit), the secondary gets paged 5 minutes later, also has their phone on do-not-disturb because they were not technically on call, and the all-of-engineering page fires at minute 10 — waking 8 people, none of whom have the runbook context, and producing an incident response that is worse than no response. Small teams should design the chain so that the third step is something useful — a different team's primary, or a manager who has explicitly opted in to be a backstop — not a broadcast that produces noise. The general principle: every step in the chain should have a defined responder identity, defined contact mechanism that bypasses do-not-disturb, and defined runbook context. Steps that fail any of these tests are decorative, and the chain effectively ends one step earlier than it appears to.
There is also a question of who controls the chain. Most teams put the chain in the alerting tool's UI (PagerDuty's "escalation policy" editor), which means the chain is owned by whoever has admin rights — usually a manager or SRE lead. The chain becomes invisible to the engineers it pages: they see the page, they don't see what happens if they don't ack. This produces predictable surprises ("I didn't realise paging me would also page my manager 5 minutes later") and trust deficits. The fix is to put the chain config in version control alongside the alert rules, render it in the team wiki, and walk every new on-call through the chain on their first day. Teams that do this report fewer surprise escalations and faster acknowledgment latencies — partly because the on-call understands the cost of not acking, and partly because the on-call trusts the chain enough to hand off without anxiety.
A measurement that distinguishes a healthy chain from an unhealthy one: track escalation rate — the percentage of pages that escalate past the primary — and terminal-step rate — the percentage that reach the final step. A healthy chain escalates 2–8% of the time (the primary is not always reachable; nothing is perfect) and reaches the terminal step under 0.5% (the chain is the safety net, not the routine path). A chain that escalates 30%+ has a primary-reachability problem (the rotation is unstaffed, the primary's phone settings are wrong, the alerting tool's notification channel is broken); a chain that reaches the terminal step in 5%+ has a chain-design problem (the deadlines are too short, or the secondary is identical to the primary in some failure mode). Both are visible in the data; few teams look.
The escalation rate and terminal-step rate above have a third companion that is harder to compute and more useful: time-from-first-page-to-correct-responder. The escalation chain's job is not to fire escalations; it is to ensure that, conditional on the primary failing to act, a correct responder is on the bridge fast. Measuring the chain by the rate at which it fires is measuring the brake; measuring it by the time-to-correct-responder is measuring the car's stopping distance. Teams that look at the right metric find their chains are well-tuned for the average case and badly tuned for the rare cross-team case where the canonical primary acks but does not have the domain knowledge — a case in which the chain should, but rarely does, escalate to a different team rather than to the primary's manager. Fixing this requires changing the chain from "primary → secondary → manager → team" to "primary → cross-team-fallback → manager", which most teams would call a major redesign and which is in fact a one-line config change in any modern alerting tool. The reason it is rarely done is that the metric that justifies it is rarely measured.
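A sketch of the three measurements, computed from a hypothetical page-log export. The column names are assumptions about what the paging and incident tools can export, not a real schema.
# chain_metrics.py - escalation rate, terminal-step rate, time-to-correct-responder.
# The page-log schema below is a hypothetical export; adjust to your tool's real one.
import pandas as pd

pages = pd.DataFrame({
    "incident_id":     ["a", "a", "b", "c", "d", "d", "d"],
    "escalation_step": [ 0,   1,   0,   0,   0,   1,   2 ],   # 0 = primary page
    "sec_to_correct_responder": [610, 610, 95, 80, 940, 940, 940],
})
TERMINAL_STEP = 3   # index of the last step in the configured chain

deepest = pages.groupby("incident_id")["escalation_step"].max()
escalation_rate = (deepest >= 1).mean()               # incidents that escaped the primary
terminal_rate   = (deepest >= TERMINAL_STEP).mean()   # incidents that reached the last step
ttcr = pages.groupby("incident_id")["sec_to_correct_responder"].first()

print(f"escalation rate:               {escalation_rate:.1%}")   # healthy: roughly 2-8%
print(f"terminal-step rate:            {terminal_rate:.1%}")     # healthy: under 0.5%
print(f"p95 time-to-correct-responder: {ttcr.quantile(0.95):.0f}s")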
Building a routing-and-escalation simulator to see the failure modes
The clearest way to see how routing and escalation interact with the org chart is to simulate a team's incident timeline against different routing graphs. The script below seeds a 60-day incident timeline at a hypothetical Hotstar-pattern streaming team, most of whose incidents cross team boundaries, compares three routing strategies (single-owner, always-fan-out, time-of-day-aware fan-out), and reports time-to-first-correct-responder for each.
# routing_sim.py — simulate three routing strategies against a 60-day incident timeline
# pip install numpy pandas
import numpy as np, pandas as pd
np.random.seed(72)
DAYS = 60
TEAMS = ["payments", "ledger", "platform-network", "streaming"]
ACK_LATENCY_BASE = 90 # seconds — primary picks up
HANDOFF_LATENCY = 240 # seconds — primary forwards to wrong team
ESCALATION_AT = 300 # seconds — secondary paged
# Seeded cross-team incidents — (day, hour, true_responsible_teams)
# Most real incidents have 2+ correct responders; the alert label points to one.
incidents = [
    (3, 10, ["ledger", "payments"]),
    (8, 2, ["platform-network", "streaming"]),
    (14, 14, ["payments", "ledger"]),
    (19, 22, ["streaming", "platform-network"]),
    (27, 3, ["ledger"]),                                   # single-owner case
    (33, 9, ["payments", "platform-network"]),
    (41, 17, ["streaming"]),                               # single-owner case
    (52, 4, ["payments", "ledger", "platform-network"]),   # 3-way
]
# The alert's routing label always picks the first listed team as the
# canonical owner — that's how the labels were authored historically.
def simulate(strategy):
    """Return per-incident time-to-first-correct-responder in seconds."""
    results = []
    for day, hour, true_teams in incidents:
        canonical = true_teams[0]
        is_off_hours = (hour >= 22) or (hour < 7)
        needs_cross_team = len(true_teams) > 1
        if strategy == "single_owner":
            # Page only the canonical owner. If the incident needs other teams,
            # the primary has to notice and forward the page, which costs the
            # handoff penalty on top of the base ack latency.
            paged = [canonical]
            ttc = ACK_LATENCY_BASE + (HANDOFF_LATENCY if needs_cross_team else 0)
        elif strategy == "fanout_always":
            # Page all plausible teams (in this sim, all listed responders);
            # someone with the right domain knowledge picks up immediately.
            paged = true_teams[:]
            ttc = ACK_LATENCY_BASE
        elif strategy == "fanout_time_aware":
            # Business hours: page canonical + Slack-broadcast others.
            # Off-hours: page all responders.
            if is_off_hours:
                paged = true_teams[:]
                ttc = ACK_LATENCY_BASE
            else:
                paged = [canonical]
                # Slack broadcast helps in business hours — assume a 30s
                # self-routing delay when a non-canonical team is needed.
                ttc = ACK_LATENCY_BASE + (30 if needs_cross_team else 0)
        results.append({
            "day": day, "hour": hour, "off_hours": is_off_hours,
            "n_paged": len(paged),
            "true_teams": ",".join(true_teams), "ttc_sec": ttc,
        })
    return pd.DataFrame(results)
for strat in ["single_owner", "fanout_always", "fanout_time_aware"]:
    df = simulate(strat)
    p50 = df["ttc_sec"].median()
    p95 = df["ttc_sec"].quantile(0.95)
    pages_off = df[df["off_hours"]]["ttc_sec"].mean()
    pages_on = df[~df["off_hours"]]["ttc_sec"].mean()
    print(f"\n--- strategy: {strat} ---")
    print(f"p50 time-to-correct-responder: {p50:.0f}s")
    print(f"p95 time-to-correct-responder: {p95:.0f}s")
    print(f"  off-hours mean: {pages_off:.0f}s")
    print(f"  on-hours mean: {pages_on:.0f}s")
Sample run:
--- strategy: single_owner ---
p50 time-to-correct-responder: 330s
p95 time-to-correct-responder: 330s
  off-hours mean: 270s
  on-hours mean: 270s
--- strategy: fanout_always ---
p50 time-to-correct-responder: 90s
p95 time-to-correct-responder: 90s
  off-hours mean: 90s
  on-hours mean: 90s
--- strategy: fanout_time_aware ---
p50 time-to-correct-responder: 90s
p95 time-to-correct-responder: 120s
  off-hours mean: 90s
  on-hours mean: 112s
The output reveals the trade-off cleanly. single_owner matches fan-out only on the two genuinely single-owner incidents; every cross-team case eats the 240s handoff penalty, which is why its p50 and p95 both sit at 330s on this incident mix. fanout_always pins every case at 90s because the right team is always already paged, but in production this strategy wakes 2–3 people for every incident, which is unacceptable during business hours where Slack-broadcast is sufficient. fanout_time_aware keeps off-hours fan-out (where waking an extra person is cheap relative to the alternative) and uses Slack broadcast during business hours (where the unneeded responders see the message and ignore it, cost zero). Why the time-of-day split is the practical choice: the cost of waking a person during off-hours is large (sleep loss, family disruption, next-day productivity) but only paid by people on the rotation; the cost of fan-out paging during business hours is small (a notification on a laptop) but paid by everyone you fan out to. Inverting these — fan out during business hours, single-owner off-hours — is a configuration mistake that is easy to make and produces both high off-hours time-to-correct-responder and high business-hours noise.
The simulator misses two real-world dynamics that matter in practice. First, the secondary's response latency is not the same as the primary's; if the primary fails to ack, the secondary is being woken by an escalation, which adds 30–90 seconds of "what is happening" cognitive load on top of the base ack latency. Second, the canonical-owner team's prior on the alert affects ack latency through alert fatigue (see /wiki/alert-fatigue-as-a-production-failure) — a team whose alert label routes to them but whose team is rarely the actual cause develops a low prior, which lengthens their ack latency, which makes the single-owner strategy even worse than this simulator shows. Adding both dynamics changes the absolute numbers but not the ranking; the ranking is robust.
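Both dynamics bolt onto the simulator in a few lines. The sketch below uses assumed magnitudes (a roughly 60-second escalation-wake penalty, a 1.5x fatigue slowdown), not measured ones.
# Sketch: the two missing dynamics, with assumed magnitudes.
import numpy as np
np.random.seed(72)

ESCALATION_WAKE_PENALTY = 60     # seconds of "what is happening" for a woken secondary
FATIGUE_ACK_MULTIPLIER = 1.5     # a low-prior canonical team acks ~50% slower

def ack_latency(base, woken_by_escalation=False, low_prior=False):
    latency = base * (FATIGUE_ACK_MULTIPLIER if low_prior else 1.0)
    if woken_by_escalation:
        latency += ESCALATION_WAKE_PENALTY + np.random.uniform(0, 30)
    return latency

# Substituting ack_latency(ACK_LATENCY_BASE, low_prior=True) for the canonical owner's
# pickup in simulate() worsens single_owner's numbers without changing the ranking.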
Edge cases: what breaks when the routing layer itself is the failure
Routing failures are particularly insidious because they are silent. When the metric-collection pipeline fails, you stop receiving metrics and notice; when the routing layer fails, alerts fire and pages disappear, and the only signal is the eventual outage. Three categories worth naming:
The routing-layer outage. PagerDuty itself goes down (it has, multiple times — most recently a multi-hour outage that affected page delivery globally). Alertmanager's clustering goes split-brain and one half stops sending pages. The webhook from your metrics backend to the alerting tool times out and gets discarded silently because retry logic was never configured. In each case, alerts are firing against the metrics backend correctly, the dashboards show the alert state, and no human is being paged. The mitigation is a watchdog alert that fires if no alerts have fired in the last 24 hours — which sounds backwards but is the only end-to-end test of the alerting pipeline that runs in production. The watchdog is routed through a different alerting tool entirely, so a single-vendor outage does not silence the watchdog with everything else. Few teams run the watchdog; the ones that do detected the most recent PagerDuty outage 4 minutes after it started, instead of 40 minutes later, when their first real incident fired blind.
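A sketch of the watchdog, assuming a Prometheus-style query API and a hypothetical backup-pager webhook; the property that matters is that the check and its notification path share nothing with the primary alerting pipeline.
# watchdog.py - the "no alerts fired in 24h" dead-man's check, run from a host and
# notification path that share nothing with the primary alerting stack.
# PROM_URL and BACKUP_PAGER_WEBHOOK are hypothetical endpoints; substitute your own.
import json, urllib.parse, urllib.request

PROM_URL = "http://prometheus.internal:9090"                     # assumption
BACKUP_PAGER_WEBHOOK = "https://backup-pager.example/v1/events"  # assumption: different vendor
QUERY = 'count(max_over_time(ALERTS{alertstate="firing"}[24h]))'

def alerts_fired_in_last_24h() -> bool:
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": QUERY})
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return bool(data["data"]["result"])   # empty result vector means nothing fired

def page_backup_channel(message: str) -> None:
    # The payload shape is a hypothetical webhook contract, not a real vendor API.
    body = json.dumps({"summary": message, "severity": "critical"}).encode()
    req = urllib.request.Request(BACKUP_PAGER_WEBHOOK, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

if __name__ == "__main__":
    if not alerts_fired_in_last_24h():
        page_backup_channel("Watchdog: no alerts have fired in 24h; "
                            "the alerting pipeline itself may be broken.")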
The misrouted page. A label was renamed two months ago, the routing rule was not updated, and now LedgerWriteLatencyHigh routes to a deprecated team that has zero on-calls. The page fires, gets sent to nobody, and the alerting tool's UI shows it as delivered (because the tool delivered it to the configured rotation, which happens to be empty). The fix is a CI check on the routing config that asserts every route lands on a non-empty rotation — and a separate weekly job that synthetically fires a test alert against every routing destination and asserts a human acks it. The weekly job is annoying to run; teams that run it catch routing breakage within 7 days instead of within 60.
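A sketch of the CI gate, assuming the routing config and the rotation roster are both available as exported files; the file formats here are hypothetical stand-ins for whatever your tooling produces.
# check_routes.py - CI gate: every routing destination must resolve to a rotation
# with at least one scheduled human. The file formats below are hypothetical exports.
import json, sys

def main(routes_path="routes.json", rotations_path="rotations.json") -> int:
    routes = json.load(open(routes_path))        # {"LedgerWriteLatencyHigh": ["ledger-oncall"], ...}
    rotations = json.load(open(rotations_path))  # {"ledger-oncall": ["asha", "karan"], "old-team": []}
    failures = []
    for alert, destinations in routes.items():
        for dest in destinations:
            members = rotations.get(dest)
            if members is None:
                failures.append(f"{alert}: destination '{dest}' does not exist")
            elif not members:
                failures.append(f"{alert}: destination '{dest}' is an empty rotation")
    for failure in failures:
        print("ROUTING CHECK FAILED:", failure)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())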
The on-call-tool authentication failure. An engineer's PagerDuty mobile app loses its auth token (the company rotated SSO, the engineer never re-authed on mobile). Pages are sent to the engineer's phone, the OS shows them as delivered, and the engineer's app shows "no recent alerts" because it cannot fetch them. This one is impossible to detect from the alerting tool's side — from PagerDuty's perspective, the page was delivered. The mitigation is the periodic synthetic test mentioned above: a quarterly drill where every on-call must respond to a synthetic page within 5 minutes, and any engineer whose synthetic page is missed gets their setup audited. The drill is unpopular and necessary; the alternative is finding out about the auth failure during a real outage.
A fourth and increasingly common category: the cross-tool desync. A growing pattern is to use one tool for alert generation (Prometheus + Alertmanager), a second for paging (PagerDuty), and a third for incident management (FireHydrant). Each tool has its own state for an incident — Alertmanager has the alert state, PagerDuty has the page state, FireHydrant has the incident state — and the three states drift. An alert that resolves in Alertmanager does not always close the PagerDuty page, which does not always close the FireHydrant incident. Conversely, a PagerDuty page that is acknowledged does not always update Alertmanager's silence list, which means the alert keeps firing and the on-call gets paged again 5 minutes later for the "same" alert. The desync is a routing-infrastructure failure even though every individual tool is working as designed. The mitigation is a state-reconciliation job that runs every 30 seconds, reads the canonical state from one tool (usually the incident-management tool), and pushes the state down to the others. Few teams build this; the ones that do eliminate a class of "phantom" pages that otherwise consume 5–10% of total page volume.
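A sketch of the reconciliation loop's shape. The three client objects are hypothetical stand-ins for API wrappers around the incident, paging, and alerting tools; none of the method names belong to a real SDK.
# reconcile.py - every 30s, treat the incident-management tool as the source of
# truth and push its state down to the pager and the alert silences.
# IncidentClient, PagerClient, and AlertmanagerClient are hypothetical wrappers.
import time

def reconcile_once(incidents, pager, alertmanager):
    for inc in incidents.list_open_and_recently_closed():
        if inc.status == "resolved":
            pager.resolve_if_open(inc.page_id)              # close the lingering page
            alertmanager.expire_silence(inc.alert_fingerprint)
        elif inc.status == "acknowledged":
            # Keep the alert from re-paging the on-call while someone works it.
            alertmanager.ensure_silence(inc.alert_fingerprint, duration_minutes=30)

def run(incidents, pager, alertmanager, interval_sec=30):
    while True:
        reconcile_once(incidents, pager, alertmanager)
        time.sleep(interval_sec)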
These categories have a shared property: they are failures of the routing infrastructure, not failures of the alerts themselves, and they cannot be detected by inspecting alert quality. The runbook ladder for "is my routing healthy" is a separate thing from the runbook ladder for "is my alert ruleset healthy", and most teams have only the second. Why this matters for the structure of an on-call team's investment: the routing-layer health monitoring is unsexy infrastructure work that does not appear in any team's quarterly OKRs unless someone has been bitten by a routing failure recently. The work compounds — a routing failure during a real incident produces 10× the user impact of the same outage with healthy routing — but the compounding is invisible until the failure happens. Teams that have been bitten once invest; teams that haven't, don't, until they get bitten.
What the on-call interaction protocol must do that routing cannot
Once routing has fanned out a page to multiple responders, the responders need a coordination protocol. The routing layer cannot provide this — it is a one-shot fan-out, not a stateful conversation — so a separate layer must. The protocol has three minimum responsibilities, and the difference between a routing graph that produces fast incident response and one that produces a chaotic many-cooks scene is almost entirely about the protocol's quality, not the routing config.
First, incident channel allocation: the moment two or more responders are paged, an incident-specific Slack channel (or Teams channel, or whatever the company uses) is created with all responders auto-joined. The channel is named with the incident ID and the suspected service (e.g., inc-2026-04-25-09-17-ledger-write-latency), is durable, and becomes the timeline of the incident for postmortem purposes. Modern incident-management tools (FireHydrant, Jeli, Rootly, incident.io, Spike) automate the channel allocation; teams without such tooling do it manually, which adds 60–120 seconds to incident bootstrap and frequently fails when the on-call cannot remember the naming convention at 02:47 IST.
Second, role declaration: within the first 2 minutes, one responder declares themselves Incident Commander (IC) and posts in the channel. Without this, the responders work in parallel on the same hypothesis and miss the orthogonal hypotheses. The IC's first job is not investigation; it is to enumerate the active hypotheses across responders, assign each to a different person, and timebox the first round (typically 5 minutes). Routing brought the right people to the table; the IC role is what turns the table into a search.
Third, standdown signalling: a responder who has confirmed their domain is not implicated must explicitly say so in the channel and leave the active responder list. Without explicit standdown, the responder lingers (afraid to leave in case they are wrong), the channel stays noisy, and the incident's effective responder count grows over time as the chain escalates. The protocol must make standdown low-cost — a single message like standdown: payments not implicated, latency is not on our path — and the IC must acknowledge each standdown so the standing-down responder can sleep without guilt. The protocol's quality is what separates fan-out routing from broadcast spam.
The three responsibilities are minimum, not maximum. Mature teams add a fourth — explicit hypothesis tracking, where the IC maintains a numbered list of active hypotheses in the channel and assigns each to a named responder with a timeboxed investigation window — and a fifth — status broadcasting, where the IC posts a one-line update every 5 minutes to a designated stakeholder channel so non-responders (managers, customer-facing teams, executives during severe incidents) get a structured stream of information without paging the IC for status. Teams that ship both report measurably shorter incidents and lower IC fatigue, because the IC's cognitive load is dominated by stakeholder-management cost during long incidents, and a structured broadcast cadence reduces it.
Common confusions
- "Routing and escalation are configuration tasks done once." Both are dynamic systems that drift as the org chart, alert library, and rotation roster change. A routing config that was correct 6 months ago is approximately 30% wrong today on any team that has reorganised, hired, or split a service. The right cadence is monthly review of routing destinations and quarterly synthetic-fire drills against each route.
- "The escalation chain handles the case where the primary is asleep." It handles the case where the primary does not acknowledge. If the primary acknowledges and goes back to sleep (the alert-fatigue case), the chain never fires and the incident is undetected until the secondary alarm goes off — which is usually a downstream symptom alarm with a much larger blast radius. Acknowledge-without-action is the failure mode the chain is least equipped to handle.
- "Page everyone always — it is safer." Fan-out paging during business hours produces noise pages for teams whose Slack broadcast would have been sufficient; those noise pages train alert fatigue (see /wiki/alert-fatigue-as-a-production-failure) on the over-paged teams, making their response slower over the months timescale. The safety of fan-out is local and short-term; the cost is structural and long-term.
- "Routing rules belong with the alert definition." They are usually authored together (in the same YAML file) but they are independent decisions. The alert defines what is observed; the routing defines who responds. Coupling them in the same file produces the failure mode where the alert author picks the routing without input from the responding team, which is how LedgerWriteLatencyHigh ends up routed to a single team for a multi-cause metric. Separate the files; review the routing config with the responding teams.
- "Escalation timers should be aggressive — 2 minutes is fine." Aggressive timers produce escalation explosion on real incidents, where the primary is actively working and getting paged for the same alert again every 2 minutes. The right timer is "the median ack-and-get-context latency for your team plus a small margin" — which for most teams is 5–7 minutes, not 2.
- "If the alerting tool delivers the page, my responsibility ends." The tool's delivery confirmation is not a delivery guarantee. The page is delivered when a human is awake and looking at it, which depends on the engineer's device state, OS notification settings, app auth state, and physical proximity to the device. End-to-end synthetic drills are the only test that closes this gap; trusting the tool's delivery confirmation alone is how routing-layer failures stay invisible until the next outage.
Going deeper
How Alertmanager's route tree implements fan-out
Alertmanager (the routing component of Prometheus) uses a tree-structured routes config where each node has a matcher (match: { service: payments }), a receiver (a notification destination), and an optional continue: true flag. Without continue, the first matching node terminates the walk; with continue, the engine continues to evaluate sibling nodes. Fan-out routing is implemented by making the top-level routes for shared metrics use continue: true and chain into multiple receivers, each scoped to a different team. The subtlety: deduplication is the receiver's job — if payments-oncall and ledger-oncall both route to the same PagerDuty service via different integration keys, PagerDuty will create two separate incidents and the on-calls will not see each other. The right pattern is one PagerDuty service per incident, with multiple responders attached, which Alertmanager cannot model directly — it has to be done in PagerDuty's responder config, with Alertmanager just sending the page once.
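The continue semantics are easier to see as a small evaluator than as prose. The sketch below is a toy model of the walk in Python, under simplifying assumptions (every node names a receiver, matchers are exact equality); it is not Alertmanager's implementation.
# route_tree.py - a toy model of the route-tree walk, to show how `continue: true`
# turns a first-match lookup into a fan-out. Illustrative only.
TREE = {
    "receiver": "default-oncall",
    "routes": [
        {"match": {"service": "ledger"},  "receiver": "ledger-oncall",   "continue": True},
        {"match": {"service": "ledger"},  "receiver": "payments-oncall", "continue": True},
        {"match": {"service": "gateway"}, "receiver": "platform-oncall"},
    ],
}

def matches(node, labels):
    return all(labels.get(k) == v for k, v in node.get("match", {}).items())

def route(labels, node=TREE):
    """Return every receiver this alert fans out to."""
    receivers = []
    for child in node.get("routes", []):
        if not matches(child, labels):
            continue
        receivers.extend(route(labels, child))
        if not child.get("continue", False):
            return receivers              # first match without continue ends the walk
    return receivers or [node["receiver"]]

print(route({"service": "ledger"}))   # ['ledger-oncall', 'payments-oncall']
print(route({"service": "gateway"}))  # ['platform-oncall']
print(route({"service": "search"}))   # ['default-oncall']  (falls through to the root)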
A second Alertmanager subtlety worth naming: the group_by, group_wait, and group_interval parameters silently affect routing latency. group_wait is the time Alertmanager waits before sending the first notification for a new alert group, defaulting to 30 seconds; group_interval is the time between updates within a group, defaulting to 5 minutes. A team that does not tune these defaults pays a 30-second floor on every page (because Alertmanager is hoping more alerts will arrive that group with this one), and pays a 5-minute floor on hearing about additional alerts in the same group. For incident response, both defaults are too slow; production teams typically set group_wait: 10s and group_interval: 30s for high-severity routes and accept the slightly noisier multi-alert grouping in exchange for faster first-page delivery.
Why off-hours fan-out is not the same as paging extra people
A team that fans out off-hours pages 2.5 people per incident on average (canonical + 1.5 cross-team). The naive read is that this is 2.5× the on-call burden. The actual measured impact is closer to 1.4× because (a) the cross-team responder is awake for the duration of the incident, not the entire shift; (b) the cross-team responder, once they confirm they are not implicated, can stand down within minutes, while the canonical owner stays on for the full mitigation; and (c) the cross-team responder's accumulated sleep loss across a 90-day rotation is small if the cross-team incidents are infrequent. The structural cost is the right one to track: cross-team page-minutes per quarter, not pages per night. Teams that track the right metric find off-hours fan-out is cheap; teams that track the wrong one think it is expensive and stay on single-owner routing.
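A worked version of that arithmetic, with assumed durations; the point is that the burden multiplier comes from page-minutes, not page counts.
# Page counts vs page-minutes. All durations are assumptions for illustration.
canonical_minutes  = 30     # canonical owner stays for the full mitigation
cross_team_minutes = 8      # cross-team responder confirms non-involvement and stands down
extra_responders   = 1.5    # average fan-out beyond the canonical owner

pages_multiplier   = 1 + extra_responders
minutes_multiplier = (canonical_minutes + extra_responders * cross_team_minutes) / canonical_minutes

print(f"pages per incident:        {pages_multiplier:.1f}x")    # 2.5x
print(f"page-minutes per incident: {minutes_multiplier:.1f}x")  # 1.4x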
There is a fairness wrinkle worth naming: fan-out distributes the cost of cross-team incidents across more rotations, which is good for individual sleep but can hide the fact that one team is the source of most cross-team incidents. If payments is the canonical owner of 60% of cross-team incidents and the fan-out also pages ledger and platform-network, the individual ledger and platform-network engineers experience occasional cross-team pages and feel the system is fair. But the structural picture — payments is producing 60% of cross-team incidents, the org should invest in payments reliability — is invisible in per-engineer sleep data. The instrumentation that makes the structural picture visible is per-team incident-origination counts, not per-engineer page counts. Teams that report on the latter alone optimise the wrong target.
A worked routing graph: a Razorpay-pattern UPI payment chain
A hypothetical Razorpay-pattern UPI alert that fires on UPIBeneficiaryAckLatencyHigh has at least four plausible owners: the Razorpay client SDK (latency could be client-side serialization), the Razorpay payments backend (latency could be backend processing), NPCI (latency could be the NPCI hop the request makes externally), and the bank's UPI handler (latency could be the bank's response). The first two are inside Razorpay; the latter two are outside. The routing graph fans out to both internal teams during off-hours and Slack-broadcasts to the SRE channel that handles NPCI/bank escalations. When the alert fires, both internal on-calls join the bridge within 90 seconds; one of them quickly determines the symptom is on the external hop and posts in the SRE channel, which kicks off the external-vendor escalation path (NPCI's own incident-response, the bank's relationship manager). The total mean-time-to-correct-responder is around 4 minutes for an incident that crosses 4 organisations — which is only achievable because the routing graph was designed to fan out across the full causal chain, not to identify a single owner.
The same pattern shows up at Hotstar / JioCinema for IPL-final streaming alerts (CDN edge → CDN origin → encoder → ingest), at Zerodha Kite for trading alerts (broker API → exchange feed → execution engine → ledger), and at Swiggy / Zomato for delivery-rider chain alerts (rider app → routing service → restaurant partner → payment). In each case, the alert label points to one team but the failure surface spans 3–5; the routing graph that makes the system debuggable in production is the one that fans out to all of them, with explicit time-of-day-aware narrowing during business hours and a documented standdown protocol so the cross-team responders can stand down quickly when their domain is not implicated.
The org-chart-coupling problem and the matrix-routing fix
Routing rules that hard-code team names to alert labels are tightly coupled to the current org chart. Every reorg — and at any company past Series B, reorgs happen every 6–9 months — invalidates a chunk of the routing config. Teams handle this in three ways. The first is to ignore the drift, accept that 20–30% of routing rules will be stale at any time, and rely on the cross-team incident bridge to recover. This works at small scale and breaks at medium scale. The second is to write a migration script that runs after every reorg and updates the routing config from a source-of-truth team-ownership file (the same file that backs the team page in the company wiki). This works if the source-of-truth file is actually maintained, which is itself an org problem. The third — used by larger orgs — is matrix routing: the routing config maps alerts to capabilities (e.g., database-postgres-ownership, payments-domain-knowledge) rather than to teams, and a separate config maps capabilities to current rotations. A reorg only changes the second config, which is small and easy to maintain. Most teams never need this; the ones that do find it pays for itself within one reorg cycle.
A second-order effect of matrix routing: it forces the org to articulate its capability map explicitly, which surfaces capability gaps. If database-postgres-ownership cannot be assigned because no team currently owns Postgres at that depth, the matrix-routing system fails closed (it cannot route the alert) and the missing capability becomes visible. Single-owner routing fails open in the same situation — the alert routes to whichever team's name was on the rule, which keeps pages flowing and hides the gap. Visible capability gaps get fixed; hidden ones do not. Teams that adopt matrix routing usually report this as the unexpected benefit, with the routing-stability benefit being a smaller second-order win.
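A sketch of the two-map structure, with hypothetical capability and rotation names. The part worth copying is that an unmapped capability fails closed instead of routing to a stale team name.
# matrix_routing.py - alerts map to capabilities; capabilities map to rotations.
# A reorg only edits CAPABILITY_TO_ROTATION. All names here are hypothetical.
ALERT_TO_CAPABILITIES = {
    "LedgerWriteLatencyHigh":       ["database-postgres-ownership", "payments-domain-knowledge"],
    "UPIBeneficiaryAckLatencyHigh": ["payments-domain-knowledge", "external-vendor-escalation"],
}
CAPABILITY_TO_ROTATION = {
    "payments-domain-knowledge":  "payments-oncall",
    "external-vendor-escalation": "sre-vendor-bridge",
    # "database-postgres-ownership" is deliberately unmapped: nobody owns it right now.
}

class UnroutableAlert(Exception):
    """Fail closed: an unmapped capability is a visible gap, not a silent misroute."""

def resolve(alert_name):
    rotations = []
    for capability in ALERT_TO_CAPABILITIES[alert_name]:
        rotation = CAPABILITY_TO_ROTATION.get(capability)
        if rotation is None:
            raise UnroutableAlert(f"{alert_name}: no rotation owns '{capability}'")
        rotations.append(rotation)
    return rotations

print(resolve("UPIBeneficiaryAckLatencyHigh"))   # ['payments-oncall', 'sre-vendor-bridge']
try:
    resolve("LedgerWriteLatencyHigh")
except UnroutableAlert as gap:
    print("capability gap:", gap)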
Escalation chains for blameless postmortem culture
A subtle property: escalation policies that punish primary on-calls for missed pages produce worse outcomes than ones that treat escalations as system signals. If missing a page produces a performance-review consequence, primaries will ack pages they cannot work on (to avoid the consequence) — which produces escalation collapse. The blameless framing is to treat primary-missed-pages as a routing-design signal: if a specific primary misses pages repeatedly, the question is "what about this rotation's setup is producing miss patterns" not "who do we replace". Most teams that have measured this find the missed-page rate correlates with rotation length, time-zone fit, and alerting-tool config — not with the individual engineer.
The companion to this is the synthetic-fire drill — a quarterly per-engineer test where an automation fires a fake alert at a random time and the on-call must ack within 5 minutes and post a code phrase in a designated channel. The drill catches the entire class of routing-infrastructure failures (phone-auth lapses, do-not-disturb misconfigurations, mis-configured rotations) that are otherwise silent until the next real outage. Combined with blameless escalation-tracking, the drill turns a routing system from a static config into a tested production system. Teams that adopt the drill after their first routing-failure outage rarely abandon it; teams that have not been bitten yet typically argue against it on the grounds that "we can trust the tools" — which is precisely the position the next outage will reveal as untenable.
Where this leads next
The routing graph is one half of how alerts produce action; the other half is how the action gets coordinated once multiple responders are paged. That coordination is the subject of incident command, which is structurally separate from the alerting layer and which most teams build on top of the routing graph. The next chapters in Part 11 cover incident command, the runbook conventions that make a paged alert actionable in 90 seconds, and the postmortem ladder that closes the loop from alert → response → ruleset reform.
A second thread that runs through the rest of Part 11 is the relationship between routing decisions and alert fatigue. The fan-out routing strategy described above only works if the receiving teams' priors on the alert remain healthy. If fan-out routing dilutes attention to the point where every team treats every fanned-out page as "probably not us", the fan-out has produced the same failure mode the single-owner strategy was supposed to fix — just distributed across more people. The mitigation is to track per-team page volume and per-team standdown rate, and to prune fan-out edges where standdown rate exceeds a threshold (typically 70%); above that threshold, the team is being paged for incidents they almost never own, and the cost of their attention exceeds the benefit of their occasional involvement. Routing graphs need pruning the way alert rules need pruning — and the metric that drives pruning is the same: signal-to-noise ratio per route, measured over a rolling 30-day window.
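A sketch of the pruning metric, computed from a hypothetical responder log in which each fanned-out page records whether that team stood down without contributing.
# prune_routes.py - per-route standdown rate; prune fan-out edges above the threshold.
# The responder-log schema is a hypothetical export; filter to a rolling 30-day
# window before grouping in a real pipeline.
import pandas as pd

log = pd.DataFrame({
    "route":      ["ledger->payments"] * 10 + ["ledger->platform-network"] * 10,
    "stood_down": [True] * 8 + [False] * 2 + [True] * 3 + [False] * 7,
})
STANDDOWN_THRESHOLD = 0.70

for route, rate in log.groupby("route")["stood_down"].mean().items():
    verdict = "prune this fan-out edge" if rate > STANDDOWN_THRESHOLD else "keep"
    print(f"{route}: standdown rate {rate:.0%} -> {verdict}")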
- /wiki/alert-fatigue-as-a-production-failure — why the on-call's prior on a page determines whether routing matters at all.
- /wiki/runbook-driven-alerts — how the page links to the action, and why a runbook-less page is half a routing decision.
- /wiki/symptom-based-alerts-the-google-sre-book — how alert content (symptom vs cause) interacts with routing destination.
- /wiki/multi-window-multi-burn-rate — how the alert's firing condition affects how often the routing layer is exercised.
A final orientation note: routing and escalation sit at the intersection of three engineering disciplines — observability (the alerts), human factors (the responders), and organisational design (the teams). Treating it as purely an observability problem produces over-engineered routing graphs that no team can maintain. Treating it as purely a human-factors problem produces under-engineered routing that compensates with heroic responders. Treating it as purely an org problem produces routing that mirrors the org chart but does not match the failure surface of the system. The pieces have to be designed together, with the same engineers in the room, on the same cadence as the rest of the alerting work. Teams that own this boundary explicitly — through a platform team, an SRE charter, or a dedicated incident-response group — produce routing that is durable across reorgs and resilient to ruleset reform; teams that don't, rebuild the routing layer from scratch every two years.
A practical implication for an engineer reading this on a Friday at 11pm: if you only do one thing this quarter to your routing config, audit which of your high-traffic alerts are routed to a single owner despite observing a multi-cause metric. The list will be longer than you expect; converting the top five to fan-out routing — with explicit time-of-day-aware narrowing and a documented standdown protocol — will cut your cross-team incidents' mean-time-to-correct-responder by minutes per incident, and will surface the gaps between your routing graph and your real failure surface long before the next outage does.
References
- Site Reliability Engineering (Google, O'Reilly 2016), chapter "Practical Alerting" — the canonical text on routing and escalation as engineering primitives.
- Seeking SRE (David N. Blank-Edelman, ed., O'Reilly 2018), chapters on incident response — multiple authors describe routing-graph design at scale.
- PagerDuty engineering blog, "Designing Escalation Policies for Real Incidents" — practitioner-grade essay on escalation chain failure modes.
- Alertmanager documentation: prometheus.io/docs/alerting/latest/configuration — the routing tree spec and the continue: true semantics.
- Charity Majors et al., Observability Engineering (O'Reilly 2022), chapter on alerting — modern framing of routing in OpenTelemetry-shaped systems.
- Wertz et al., "Effects of Sleep Inertia on Cognition" (JAMA 2006) — the underlying sleep-inertia data behind off-hours response latency.
- /wiki/alert-fatigue-as-a-production-failure — the immediate precursor in the curriculum.
- /wiki/reducing-on-call-pain — the operational reforms that pair with routing redesign.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 routing_sim.py