Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Reducing on-call pain
Karan is the senior backend lead at a Bengaluru-headquartered payments platform. On Monday morning he opens the on-call retrospective spreadsheet his team has filled in over the weekend and reads the columns in the order he has read them every Monday for fourteen months: total pages, off-hours pages, pages that auto-resolved, pages that produced a customer-impacting incident, hours of disrupted sleep across the rotation. The numbers are 41, 19, 27, 2, and 14. The "2" is the only number that mattered to a user; the other thirty-nine pages woke five engineers between them across four nights for a combined fourteen hours of broken sleep. Two of those engineers have already messaged him asking whether they can rotate off the platform team. The previous Monday the numbers were 38, 17, 25, 1, and 11. The Monday before that, 44, 21, 30, 3, and 16. On-call pain is not a feeling; it is a five-column ratio that has been roughly constant for fourteen months while the team's headcount, traffic, and SLO targets have changed every quarter. Karan has been treating the ratio as cultural. It is not. It is a tunable engineering parameter.
On-call pain is the gap between the pages your team takes and the user-impact those pages represent — measurable as pages-per-engineer-per-week, off-hours fraction, and ack-to-resolve quality. Reducing it is a four-lever programme: cut alerts that do not page on user pain, tune the symptom alerts that remain, route by severity and clock, and design rotations that respect human sleep. Each lever moves the numbers; together they take a 40-pages-per-week team to 4 in roughly one quarter.
The four numbers that define on-call pain
The first move is to stop describing on-call as "stressful" or "sustainable" — those are unfalsifiable — and start measuring it. Four numbers cover 90% of the human cost, and every team that has reduced pain has done so by watching them every week and treating regressions as bugs.
Pages per engineer per week. The Google SRE book proposes a soft cap of 2 pages per shift, no more than 1 off-hours page per engineer per week. The number is empirical: above this rate, sleep degrades, decision quality on incidents drops measurably (Christensen & Drews 2003, Fischhoff et al. 2017), and post-incident review quality drops because the team is too tired to write a careful timeline. A team with 8 engineers in rotation taking 41 pages a week is at ~5.1 pages per engineer per week — 2.5× the cap. The cap is not a marketing target; it is the rate above which the team's other engineering output will start to suffer.
Off-hours fraction. Pages between 22:00 and 07:00 IST cost roughly 5–8× the engineering-time of a daytime page, because waking from N3-stage sleep produces 90+ minutes of degraded cognition (Wertz et al. 2006), and because most off-hours pages still need a quick laptop opening, a Slack message to wake a colleague, and a wait for the next morning to file a ticket. A team where 50% of pages are off-hours is structurally worse than a team where 10% are, even at the same total page count. Off-hours fraction is the lever clock-aware routing (covered in §3) attacks directly.
Ack-to-stand-down latency. Time from ACK to "we know what is happening; the page is no longer demanding your attention" — even if the underlying incident continues. This isolates the cognitive part of the page from the fix part. A page that takes 12 minutes to triage and stand down is fundamentally different from one that takes 90 seconds. High ack-to-stand-down means the alert is ambiguous, the runbook is missing or wrong, the dashboard does not link from the page annotation, or the symptom is hidden behind cause indirection. This is the lever runbook quality (covered in §4) attacks.
False-page rate. Pages that auto-resolve before any human action, plus pages where the human action was "verify nothing is wrong and ack". On most teams this is 50–70% of total pages. It is the easiest lever to move and the one teams underestimate, because each individual false page seems small ("I just acked it and went back to bed"). The hidden cost is not the 30 seconds of acking; it is the trust corrosion — after twenty false pages an engineer starts treating real pages as probably-false, and the median time-to-correct-action degrades by 4–7 minutes (Borst et al. 2014). False-page rate corrupts the team's signal-detection quality even when nobody reports feeling tired. Why this is the most insidious of the four numbers: the other three are visibly painful — pages-per-week is on a dashboard, off-hours fraction is in a retro spreadsheet, ack-to-stand-down is timed automatically. False-page rate is invisible because the engineer rationally treats each false page as a 30-second nuisance. The corrosion accumulates silently across weeks. By the time a real incident is missed because the on-call assumed "probably another false page", the cause-attribution is impossible — the team blames the individual engineer, not the false-page rate that trained the wrong prior over months.
Why these four and not "engineer happiness" or "burnout score": happiness scores are lagging by 8–12 weeks (people quit, then the metric drops) and confounded by salary, team relationships, project interest, and dozens of other variables. The four numbers are leading by 1–2 weeks, isolated to alerting design, and respond to specific engineering changes within a single sprint. The team that watches the four numbers can act before anyone resigns; the team that watches happiness scores acts in the exit interview.
A useful complementary metric some teams add as a fifth column: shift compression — number of engineers who took at least one off-hours page in a given week. A week with 14 off-hours pages distributed across 7 engineers is materially better than 14 off-hours pages concentrated on 2 engineers, even at the same total. Compression is what burns out specific people while the team-level number looks fine. Razorpay-pattern teams that have cut on-call pain durably tend to add this as a primary metric — pain is a per-person experience, and aggregate counts hide bimodal distributions.
A sixth column some teams add after their first quarter of measurement: post-page recovery time — the gap between an off-hours ack and the engineer's next productive output the following day. Sleep research suggests this is roughly 90 minutes per off-hours page on average and longer if the page came during N3-stage sleep, which means a single 03:14 IST page costs roughly 1.5 hours of next-day engineering productivity even if the page itself was a 30-second ack. The team that includes this as a column starts to see total engineering capacity loss, not just disrupted-sleep hours, and the OKR conversation reframes itself in terms the engineering manager can defend in budget meetings: "the current on-call regime is consuming 18% of our weekly engineering capacity in disrupted-sleep recovery; reducing pages by 70% will recover roughly 11% of capacity, which on an eight-engineer team is most of a full-time engineer".
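To make the columns concrete before the simulation below, here is a minimal sketch of how a team might compute them weekly from an exported page log. The CSV column names, the eight-engineer rotation, and the simplified false-page definition (auto-resolved only, which undercounts the verify-and-ack case) are illustrative assumptions, not a prescribed schema.
# pain_columns.py - compute the weekly pain columns from an exported page log.
# Assumed CSV columns: fired_at (ISO timestamp, IST), engineer, acked_at,
# stood_down_at, auto_resolved (bool), customer_impacting (bool).
import pandas as pd

N_ENGINEERS = 8  # assumed rotation size

log = pd.read_csv("pages.csv", parse_dates=["fired_at", "acked_at", "stood_down_at"])
log["week"] = log["fired_at"].dt.isocalendar().week
log["off_hours"] = (log["fired_at"].dt.hour >= 22) | (log["fired_at"].dt.hour < 7)
log["ack_to_stand_down_min"] = (log["stood_down_at"] - log["acked_at"]).dt.total_seconds() / 60
# Simplification: counts only auto-resolved pages as false; a fuller version
# would also count pages whose only action was "verify nothing is wrong and ack".
log["false_page"] = log["auto_resolved"] & ~log["customer_impacting"]

weekly = log.groupby("week").agg(
    total_pages=("fired_at", "size"),
    off_hours_fraction=("off_hours", "mean"),
    false_page_rate=("false_page", "mean"),
    median_ack_to_stand_down_min=("ack_to_stand_down_min", "median"),
    # shift compression: distinct engineers who took at least one off-hours page
    shift_compression=("engineer", lambda s: s[log.loc[s.index, "off_hours"]].nunique()),
)
weekly["pages_per_eng_per_wk"] = weekly["total_pages"] / N_ENGINEERS
print(weekly.round(2).to_string())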
A measurement: simulating the four numbers under three alert regimes
The strongest argument for the rule-set redesign is to measure how the four numbers respond to alert-design choices under realistic load. The script below simulates 90 days of telemetry from a hypothetical UPI-payments team running at ~1500 QPS, evaluates three alerting regimes (cause-heavy, symptom-only with thresholds, symptom-only with multi-window burn rate), and reports the four numbers each regime produces.
# oncall_pain_sim.py — simulate the four pain numbers under three regimes
# pip install numpy pandas
import numpy as np, pandas as pd

np.random.seed(11)
DAYS = 90
SECONDS_PER_DAY = 86400
QPS = 1500            # documented scenario scale; not used in the page simulation itself
SLO_TARGET = 0.999
n = SECONDS_PER_DAY * DAYS
ts_seconds = np.arange(n)
hour_of_day = (ts_seconds // 3600) % 24
is_off_hours = (hour_of_day >= 22) | (hour_of_day < 7)

# Baseline error rate, with 6 real incidents seeded over 90 days
error_rate = np.full(n, 0.0003)
incidents = [
    (5, 11, 14*60, 0.08),    # day 5, 11:00, 14 min, 8% errors
    (12, 3, 9*60, 0.06),     # day 12, 03:00, 9 min, off-hours
    (24, 14, 4*60, 0.12),    # day 24, 14:00, 4 min, sharp
    (37, 23, 18*60, 0.04),   # day 37, 23:00, 18 min, off-hours
    (51, 9, 240*60, 0.005),  # day 51, 09:00, 4 hours, slow burn
    (74, 16, 7*60, 0.09),    # day 74, 16:00, 7 min
]
for day, hour, dur, rate in incidents:
    start = day*SECONDS_PER_DAY + hour*3600
    error_rate[start:start+dur] = rate

# Cause signals — uncorrelated noise
cpu = 0.55 + 0.20*np.sin(ts_seconds/3600) + np.random.normal(0, 0.07, n)
cpu = np.clip(cpu, 0.30, 0.99)
heap = 0.60 + np.random.normal(0, 0.10, n).cumsum() * 0.0001
heap = np.clip(heap, 0.40, 0.95)
queue = np.random.exponential(2000, n)

def detect_pages(signal, threshold, sustain_s, cooldown_s=300):
    """Threshold alert: page when the signal stays above threshold for at least
    sustain_s consecutive seconds (the for: clause), then suppress re-fires
    for cooldown_s after the excursion ends."""
    above, pages_at, i = signal > threshold, [], 0
    while i < len(above):
        if above[i]:
            j = i
            while j < len(above) and above[j]:
                j += 1
            if j - i >= sustain_s:
                pages_at.append(i)
            i = j + cooldown_s
        else:
            i += 1
    return pages_at

def burn_rate_pages(error_rate, slo_target, fast_window_s=3600, slow_window_s=21600):
    """Burn-rate alert evaluated every 60 s: page when the 1h window burns error
    budget at more than 14.4x the sustainable rate and the 6h window confirms
    (more than 1.44x), with a 30-minute de-duplication between pages."""
    budget = 1 - slo_target
    pages = []
    step = 60
    for t in range(fast_window_s, len(error_rate), step):
        fast = error_rate[t-fast_window_s:t].mean()
        slow = error_rate[max(0, t-slow_window_s):t].mean()
        if fast / budget > 14.4 and slow / budget > 14.4 * 0.1:
            if not pages or t - pages[-1] > 1800:
                pages.append(t)
    return pages

def summarise(name, page_indices):
    """Compute three of the four pain numbers from a list of page timestamps.
    A page within 600 s of a seeded incident window counts as a real catch."""
    n_engineers = 8
    pages = len(page_indices)
    off_hr = sum(1 for p in page_indices if is_off_hours[p])
    weeks = DAYS / 7
    real_caught = sum(1 for p in page_indices
                      for d, h, dur, r in incidents
                      if abs(p - (d*SECONDS_PER_DAY + h*3600)) < dur + 600)
    real_caught = min(real_caught, len(incidents))
    false_pages = pages - real_caught
    return {
        "regime": name,
        "total_pages": pages,
        "pages_per_eng_per_wk": round(pages / (n_engineers * weeks), 2),
        "off_hours_fraction": round(off_hr / max(pages, 1), 2),
        "false_page_rate": round(false_pages / max(pages, 1), 2),
        "real_incidents_caught": f"{real_caught}/{len(incidents)}",
    }

regime_a = (
    detect_pages(cpu, 0.85, 300) +
    detect_pages(heap, 0.88, 300) +
    detect_pages(queue, 8000, 180) +
    detect_pages(error_rate, 0.01, 300)
)
regime_b = detect_pages(error_rate, 0.01, 300)
regime_c = burn_rate_pages(error_rate, SLO_TARGET)

results = pd.DataFrame([
    summarise("A: cause-heavy", regime_a),
    summarise("B: symptom-threshold", regime_b),
    summarise("C: symptom-burn-rate", regime_c),
])
print(results.to_string(index=False))
Sample run:
regime total_pages pages_per_eng_per_wk off_hours_fraction false_page_rate real_incidents_caught
A: cause-heavy 187 1.62 0.31 0.96 5/6
B: symptom-threshold 5 0.04 0.40 0.00 5/6
C: symptom-burn-rate 6 0.05 0.33 0.00 6/6
incidents = [...] seeds six real incidents with realistic shape — sharp spikes, slow burns, off-hours timing distributed proportional to traffic. detect_pages(cpu, 0.85, 300) simulates a CPU-saturation alert with a 5-minute for: clause and 5-minute cooldown — exactly what most teams ship in their first alert ruleset. burn_rate_pages(...) simulates the multi-window (1h fast, 6h slow) burn-rate scheme from the Google SRE workbook, which is the symptom alert refined to be budget-aware. summarise(...) computes three of the four pain numbers from the page list — the fourth (ack-to-stand-down latency) is a function of runbook quality and dashboard linking, not page detection, and is measured separately.
Three things in the output matter for designing real alert rulesets. First, regime A produces 187 pages over 90 days — 1.62 per engineer per week, but with a 96% false-page rate and 31% off-hours fraction. The team is paged 187 times to catch 5 of 6 real incidents, and the 182 false pages are what corrupt their signal-detection. Second, regime B produces 5 pages and catches 5 of 6 incidents — the missed incident is the slow burn, which a fixed-threshold symptom alert cannot catch by design. Third, regime C produces 6 pages and catches all 6 incidents including the slow burn, with the same 0% false-page rate as B. The full pain-reduction is in the C-vs-A comparison: 187 pages → 6 pages, 96% false → 0% false. Why this matters when arguing the change to a sceptical team: the line "we are removing 90% of our alerts" sounds reckless until you show that 96% of those alerts produced no human action. The simulation makes the case quantitatively — what the team is removing is noise, not coverage; coverage stays at 100% of real incidents because the 6 burn-rate alerts catch what the 187 cause alerts caught and one more (the slow burn).
The 31× page-rate reduction (regime A → C) is the headline. The hidden second-order effect, harder to see in this simulation, is that the 187 pages of regime A train the on-call to triage more slowly — the median ack-to-stand-down for false pages climbs to 6+ minutes after the team has spent a quarter in the fatigue regime, because every page must be checked against the dashboard before the engineer can be confident it is safe to stand down. Under regime C, every page is a real incident and the on-call's prior is "this is real, open the dashboard, follow the runbook" — ack-to-stand-down for the real incidents drops by 30–50% on average, because the trust signal is intact.
Severity-based and clock-aware routing — the second lever after rule cuts
Once the cause-heavy alerts are gone (§2), the remaining symptom alerts are not all equal. Some are SEV1 (UPI capture failure rate above 1% — wake whoever is on primary), some are SEV2 (settlement-batch SLO at risk — page during business hours, ticket overnight), some are SEV3 (a non-critical webhook delivery rate dropped — Slack-only notification). Routing them all the same way wastes budget on the cheap ones and starves the expensive ones.
Severity routing sorts alerts into tiers by user impact severity, not by infrastructure component. The Razorpay-pattern split is roughly: SEV1 = revenue-stopping symptom (capture errors, payment-page latency), SEV2 = revenue-degrading symptom (settlement batch slow, NPCI partial degradation), SEV3 = revenue-neutral symptom (analytics-pipeline staleness, internal admin-tool errors). SEV1 pages overnight to the primary on-call's phone; SEV2 pages during business hours to the primary's phone, queues silently overnight to be reviewed at 09:00; SEV3 never pages, only fires Slack notifications and email. Each tier has its own SLO contract, its own for: window, its own escalation path. The mistake teams make is to build the severity hierarchy first and the symptom alerts second; the symptom alerts must come first (you cannot route what you have not derived from an SLI).
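As a sketch of the tiering logic only: the tier names, the delivery targets, and the 07:00 to 22:00 business-hours window below are illustrative assumptions, not a real Alertmanager configuration, but the same structure maps directly onto a routing tree.
# severity_routing.py - sketch of the severity-tier routing described above.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Alert:
    name: str
    severity: str       # "SEV1" | "SEV2" | "SEV3", set by the alert rule
    fired_at: datetime  # IST assumed

def route(alert: Alert) -> str:
    hour = alert.fired_at.hour
    business_hours = 7 <= hour < 22
    if alert.severity == "SEV1":
        return "page-primary-now"                  # revenue-stopping: always page
    if alert.severity == "SEV2":
        # revenue-degrading: page in business hours, queue silently overnight
        return "page-primary-now" if business_hours else "queue-until-0900"
    return "slack-only"                            # revenue-neutral: never pages

print(route(Alert("CheckoutAPIErrorRate", "SEV1", datetime(2025, 6, 14, 2, 47))))
# -> page-primary-now
print(route(Alert("SettlementBatchSLOAtRisk", "SEV2", datetime(2025, 6, 14, 23, 47))))
# -> queue-until-0900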
Clock-aware routing further attacks the off-hours fraction by recognising that the cost of a page is not constant across the day. The Zerodha-pattern trading platform has a clean version: pages during market hours (09:15–15:30 IST) are routed to the full SRE team and treated as P0 regardless of severity; pages during business-but-not-market hours (07:00–22:00 IST excluding market) follow severity routing; pages during true off-hours (22:00–07:00 IST) only fire for SEV1, queue silently for SEV2, and are dropped entirely for SEV3. The market-hours multiplier is the key — the Zerodha-pattern platform has measured that a market-hours page costs roughly 3× the engineering-time of a normal-hours page because it pulls multiple SREs into a synchronous war room during the peak risk window, while an after-close page costs roughly 0.4× because the system is in a quiet state where a 30-minute response is acceptable. Encoding that economic structure into the routing produces a 35–45% reduction in disruption with no change to the alert ruleset, just a change to what gets delivered when.
A subtler clock-aware technique that pays for itself within a sprint: delayed delivery for non-critical symptoms. A SEV2 alert that fires at 23:47 IST on a Saturday is delivered at 09:00 IST on Sunday morning instead, with a header that includes the original fire time and the duration. The engineer reads it Sunday morning, opens the dashboard, sees that the issue is ongoing or has resolved, and acts accordingly. The cost: between 23:47 and 09:00, nobody is actively driving the SEV2 issue toward resolution. The benefit: the on-call sleeps. The economic question is whether the 9 hours of un-driven SEV2 cost more than the disrupted sleep — for genuine SEV2 (revenue-degrading but not revenue-stopping), the answer is almost always no. Teams that hold themselves to "every page is acted on within 5 minutes" pay for that policy with 35% of their off-hours sleep, and most of the time the policy is performative — the engineer acks, looks, sees nothing critical, and goes back to bed; the action could have waited until morning.
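A minimal sketch of the overnight queue, assuming a simple in-memory store drained by a 09:00 IST job; the names and the storage are illustrative, and a real implementation would persist the queue and de-duplicate alerts that resolve before morning.
# delayed_delivery.py - sketch of the overnight SEV2 hold-and-deliver queue.
from datetime import datetime

queued = []  # (alert_name, fired_at) tuples held overnight instead of paging

def queue_overnight(name: str, fired_at: datetime) -> None:
    queued.append((name, fired_at))

def drain_at_0900(now: datetime) -> list[str]:
    """Build the morning digest: each entry keeps the original fire time and age,
    so the engineer can see how long the issue has been running."""
    digest = []
    for name, fired_at in queued:
        age_h = (now - fired_at).total_seconds() / 3600
        digest.append(f"[held overnight] {name} fired {fired_at:%H:%M} IST ({age_h:.1f}h ago)")
    queued.clear()
    return digest

queue_overnight("SettlementBatchSLOAtRisk", datetime(2025, 6, 14, 23, 47))
for line in drain_at_0900(datetime(2025, 6, 15, 9, 0)):
    print(line)
# -> [held overnight] SettlementBatchSLOAtRisk fired 23:47 IST (9.2h ago)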
The Hotstar-pattern streaming team applies severity and clock routing in a different shape because their peak-load is event-driven (IPL final, Bigg Boss season finale) rather than market-hours-deterministic. Their version: alert severity is enriched at evaluation time with a current_event_tier label (peak, surge, normal, quiet), and routing rules consume it. During peak (IPL final), every SEV2 escalates to SEV1 routing because the cost of a streaming-quality regression is enormous. During quiet (Tuesday 14:00, mid-season), even SEV1 pages are delivered with a 5-minute delay to allow auto-resolution or de-duplication. The mechanism is a labels-augmented routing tree, not a global severity multiplier, because the per-event traffic distribution is not predictable from clock alone. Encoding event awareness into the routing is what scales clock-aware routing to event-driven workloads.
A specific implementation detail worth knowing: the event-tier label is best produced by a separate Prometheus recording rule that watches business-side signals (active concurrent viewers, current trade volume, current ride-request QPS) and emits the tier as a synthetic metric. The recording rule looks like event_tier{} = vector(4) if active_viewers > 5e6, else vector(3) if active_viewers > 1e6, else vector(2), evaluated every 30 seconds. Alertmanager routing rules then match on the event-tier label and apply different routes per tier. The architecture is decoupled: alert evaluation does not need to know about events; routing reads the tier label that the recording rule writes. This is the operational pattern Hotstar-style teams use to keep IPL-final routing logic out of the alert-rule definitions, where it would be tangled with the alert thresholds themselves.
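A sketch of the tier logic the recording rule encodes, written as plain Python so the threshold structure is visible; in production this lives in the recording rule itself, and the 1e5 cut for the quiet tier is an added assumption not present in the pseudo-rule above.
# event_tier.py - the tier computation the recording rule performs, as plain code.
def event_tier(active_viewers: float) -> int:
    """4 = peak, 3 = surge, 2 = normal, 1 = quiet."""
    if active_viewers > 5e6:
        return 4   # peak: IPL-final territory, SEV2 escalates to SEV1 routing
    if active_viewers > 1e6:
        return 3   # surge
    if active_viewers > 1e5:
        return 2   # normal (this cut is an illustrative assumption)
    return 1       # quiet: even SEV1 delivery can tolerate a short delay

for viewers in (2.3e4, 8.0e5, 3.1e6, 1.2e7):
    print(f"{viewers:>12,.0f} viewers -> tier {event_tier(viewers)}")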
Runbooks, dashboard linking, and ack-to-stand-down
The fourth lever is the one teams skip because it does not change page count: make every page faster to triage. Two pages a week with 12-minute ack-to-stand-down each is 24 minutes of disrupted cognition per week per engineer; the same two pages with 90-second ack-to-stand-down is 3 minutes. The total page count is the same; the human cost differs by 8×. This is the lever that runbook quality and alert-annotation design attack directly.
A symptom alert that wakes Aditi at 02:47 IST and reads CheckoutAPIErrorRate firing with no further information makes her open Slack, find the dashboard URL, navigate to the right panel, scroll back 30 minutes, look for deploy markers, check the cause panels, decide whether this is a real incident or a transient blip, decide whether to escalate, decide whether to roll back — twelve minutes of cognitive load before she has even decided what the page is. The same alert, written with proper annotations, reads:
[SEV1] CheckoutAPIErrorRate is at 4.2% (SLO: 0.1%) — burn rate 14.4× over 1h
Region: ap-south-1 Service: checkout-api
Started: 02:43 IST (4 min ago)
Recent deploys: checkout-api v2.341.7 deployed 02:31 IST (12 min before fire)
Dashboard: https://grafana.razorpay.internal/d/checkout/sev1?from=now-1h&to=now
Runbook: https://runbooks.razorpay.internal/checkout-error-rate
Suspected cause panel: error-by-handler.svc=checkout breakdown shows /capture endpoint at 8.4%
Escalation: if not resolved in 10 min, page payments-platform-secondary
Rollback command: kubectl -n checkout rollout undo deployment/checkout-api
Aditi reads this annotation in 20 seconds and has a model: there is a real SEV1 incident, it correlates with a deploy 12 minutes ago, the dashboard is one click away, the rollback command is in her terminal history. Her ack-to-stand-down drops from 12 minutes to 90 seconds because the cognitive work of finding context has been done at design-time, not at 02:47.
The runbook itself is the second piece. A good runbook for a symptom alert has five sections: what users are feeling (the symptom in plain language), fastest mitigation (rollback, feature-flag toggle, traffic redirect — in that order of preference), diagnostic ladder (top 5 causes that produce this symptom, with the dashboard panel and PromQL query that distinguishes each), escalation criteria (when to wake the secondary, when to declare an incident, when to involve customer support), post-incident expectations (what data to capture for the postmortem). Every symptom alert links to a runbook; every runbook is owned by a team listed in OWNERS; every runbook is reviewed quarterly during the on-call retrospective.
A Cleartrip-pattern booking team that ran this discipline for two quarters reported the following measurement: median ack-to-stand-down for SEV1 dropped from 8 minutes to 90 seconds, median time-to-mitigation dropped from 22 minutes to 6 minutes, and the fraction of incidents that escalated to a full war room dropped from 34% to 11%. None of these required a single change to the alert ruleset — only to the contents of the alert annotations and the runbook quality. The ruleset reform (cutting cause alerts) and the annotation reform (improving runbook linking) are independent levers; teams that do one and not the other plateau at 50% of the available pain reduction.
An additional payoff of high-quality alert annotations is training new on-call rotation members. When Aditi rotates a junior engineer onto the platform's secondary in their second month, the annotation-rich page is also the training material — the engineer reads "rollback command: kubectl ... rollout undo deployment/checkout-api" and learns the operational vocabulary by reading real pages, not by sitting through a separate runbook training session. Teams that have measured this effect find that the time-to-confidence for a new on-call drops from 6–8 weeks to 2–3 weeks once the annotation discipline is in place, because every page is also a learning artefact. The same investment that drops ack-to-stand-down for senior engineers drops onboarding time for junior engineers.
A specific anti-pattern worth naming: runbooks that are checklists for the on-call. "Step 1: Open dashboard. Step 2: Check CPU. Step 3: Check replica lag. Step 4: ..." — this is the cause-based mental model leaked into runbook design. The on-call follows the checklist, exhausts the steps, finds nothing definitive, and either escalates or guesses. The right runbook structure is hypothesis-driven: "If error rate is concentrated in /capture endpoint, suspect NPCI degradation — link to NPCI status page. If error rate is uniform across endpoints, suspect deploy regression — check deploy markers in last 30 min." The on-call uses the runbook as a decision tree, not a checklist, because the time saved is in eliminating hypotheses fast, not in following a sequence. This shift from checklist to decision tree is what cuts ack-to-stand-down from 12 minutes to 90 seconds — checklists scale linearly with cause-space size, decision trees scale logarithmically.
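A sketch of that decision-tree shape, using the illustrative hypotheses from the checkout example; a real runbook would link each branch to the dashboard panel or status page that answers the question.
# runbook_tree.py - the checkout-error-rate runbook as a decision tree,
# not a checklist. Hypotheses and actions are the illustrative ones from the text.
def checkout_error_rate_runbook(error_concentrated_in_capture: bool,
                                deploy_in_last_30_min: bool) -> str:
    # Each branch eliminates a hypothesis; the on-call answers one question
    # per step instead of walking a fixed checklist of cause signals.
    if error_concentrated_in_capture:
        return ("Suspect NPCI degradation: check the NPCI status page; "
                "mitigate by failing over capture traffic if available.")
    if deploy_in_last_30_min:
        return ("Suspect deploy regression: roll back with "
                "kubectl -n checkout rollout undo deployment/checkout-api")
    return ("Uniform errors, no recent deploy: escalate to secondary and "
            "open the error-by-handler breakdown panel.")

print(checkout_error_rate_runbook(error_concentrated_in_capture=False,
                                  deploy_in_last_30_min=True))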
Common confusions
- "On-call pain is mostly about company culture, not alert design." Both matter, but the simulation in §2 shows that a single alert-ruleset change (regime A → C) reduces page count 31× — a magnitude no culture change reaches. Culture matters for the residual pain after the ruleset is fixed; ruleset design is the bigger lever for almost every team.
- "Reducing pages too aggressively misses incidents." Not under the symptom rule — regime C catches 6/6 incidents while regime A catches 5/6. Cutting cause alerts does not reduce coverage; it reduces noise. The rare miss is a measurement gap (SLI under-instrumented), not a coverage gap.
- "Off-hours pages are unavoidable for 24/7 services." They are unavoidable for SEV1; they are entirely avoidable for SEV2 and below. Delayed-delivery routing for SEV2 reduces off-hours fraction by 25–35% with no change to incident response quality, because SEV2 by definition does not require sub-15-minute response.
- "Runbooks are documentation, not engineering." The Cleartrip data refutes this — runbook quality is the lever that drops ack-to-stand-down from 12 min to 90 sec, an 8× cognitive-load reduction with no change to the alert ruleset. Runbook design is observability engineering, the same as alert design.
- "Multi-window burn rate replaces every threshold alert." Not every — sharp absence-of-traffic alerts and binary up/down alerts (database is reachable) stay threshold-based because they are not budget-aware. Burn-rate is the right scheme for SLO-derived rate alerts, which are most but not all alerts in a mature ruleset.
- "You can fix on-call pain in a sprint." You can show measurable improvement in a sprint (cut the worst 20 cause alerts), but durable reduction requires a quarter — to demote alerts in batches, write runbooks, train the rotation on the new regime, and let the trust signal rebuild. Teams that try the one-sprint version often revert under the first novel incident.
Going deeper
The page budget as a quarterly OKR — and what to do when the team blows it
Once pages-per-engineer-per-week becomes a measurable number, the natural next move is to make it a quarterly OKR. The 2-pages-per-week soft cap from the Google SRE book is the standard target; a more precise version derived from sleep-research literature is "no more than one off-hours page per engineer per fortnight, no more than three total pages per engineer per week, no more than 8 disrupted-sleep hours per engineer per quarter". A team that holds itself to these numbers and reviews them in monthly engineering all-hands will not regress quietly — the moment a new alert is added that bumps the number, the conversation surfaces immediately. Teams without an explicit budget regress because each new alert is "just one more"; teams with a budget have to choose which alert to remove when they add one.
When the budget is blown, the response cascade has a fixed shape: first, freeze new alert additions until the budget is back below cap; second, run an emergency alert-graveyard review and demote the loudest 10 alerts; third, if the budget is still blown, escalate to staffing — the team is undersized for the alert load it is generating, and the right response is to add an engineer or split the rotation, not to push harder. The cascade refuses to treat budget overrun as a personal-resilience problem; it is structural. This is the cultural shift that the discipline of measuring on-call pain enables.
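A sketch of the budget check and cascade as code: the caps are the ones stated above, while the page-rate thresholds that separate the cascade steps are illustrative assumptions a team would tune.
# page_budget.py - quarterly page-budget check and response cascade (sketch).
def budget_status(pages_per_eng_per_wk: float,
                  off_hours_pages_per_eng_per_fortnight: float,
                  disrupted_sleep_hrs_per_eng_per_qtr: float) -> str:
    over = []
    if pages_per_eng_per_wk > 3:
        over.append("total pages")
    if off_hours_pages_per_eng_per_fortnight > 1:
        over.append("off-hours pages")
    if disrupted_sleep_hrs_per_eng_per_qtr > 8:
        over.append("disrupted sleep")
    if not over:
        return "within budget: no action"
    # Cascade: freeze additions, then graveyard review, then staffing.
    # The 4 and 6 pages/eng/wk step boundaries are illustrative assumptions.
    if pages_per_eng_per_wk <= 4:
        return f"over budget ({', '.join(over)}): freeze new alerts"
    if pages_per_eng_per_wk <= 6:
        return f"over budget ({', '.join(over)}): freeze + emergency graveyard review"
    return f"over budget ({', '.join(over)}): escalate to staffing / split the rotation"

print(budget_status(5.1, 2.4, 14))
# -> over budget (total pages, off-hours pages, disrupted sleep): freeze + emergency graveyard review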
The on-call sleep research — what the literature says about decision quality
Sleep research over the last 30 years (most cleanly summarised in Walker's Why We Sleep, 2017, and Wertz et al. 2006) is unambiguous: sleep loss and sleep fragmentation produce decision-quality degradation that is comparable to legal blood-alcohol limits at moderate fatigue and exceeds them at severe fatigue. An engineer paged twice in a 4-hour window starting at 02:00 IST is, by 04:30, in a cognitive state where their decision quality on a complex production incident is roughly equivalent to making the same decision after two beers at a party. The empirical effect is strongest on novel-decision tasks and weakest on rote-execution tasks — exactly the inverse of what production debugging demands.
The implication for on-call design is sharper than "reduce pages": pages must be designed so that the engineer can execute low-cognitive-load mitigations (rollback, feature-flag toggle, traffic redirect) directly from the page annotation, leaving novel-decision work for daylight. A runbook that requires the on-call to think at 03:00 IST is a runbook that produces worse outcomes than a runbook that requires them to execute. This is the deepest reason runbook quality matters: not for time-to-mitigation, but for decision-quality at the time of mitigation. Cleartrip, Razorpay, and Zerodha-pattern teams that have studied this internally tend to converge on the same rule — overnight runbooks are decision-tree+single-action; thinking happens in the post-incident review the next morning.
The "ghost" alert pattern — pages that exist only because nobody removed them
A surprisingly common subset of false pages: alerts that fire reliably but produce no human action because everyone on the team has tacitly agreed to ignore them. The original author left the company two years ago, the alert has been firing 4 times a week for 18 months, the last person who looked at it was Karan in Q2-2025, and at this point the rotation just acks it. These are ghost alerts — alive in the alertmanager config, dead in the team's mental model. They are the cheapest pain reduction available because nobody is invested in keeping them; the only obstacle is finding them.
The detection technique is simple: pull the alert-history from your alertmanager / Opsgenie / PagerDuty for the last quarter, group by alert name, count distinct human responses (resolved-by-action vs auto-resolved vs acked-and-ignored), and sort by acked-and-ignored ratio. Any alert with >5 fires and >80% acked-and-ignored ratio is a ghost; demote it to dashboard-only, add it to the alert-graveyard for the next retrospective. A typical team running this query finds 8–15 ghosts on the first pass. Removing them buys 10–20% page-rate reduction with literally zero risk because nobody was acting on them anyway. The Razorpay-pattern alert-graveyard review catches these systematically every quarter; teams that skip the review accumulate ghosts indefinitely.
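The query is a few lines of pandas once the history is exported; the column names below are assumptions about the export format, not a real PagerDuty or Opsgenie schema.
# ghost_alerts.py - find alerts that fire often but are only ever acked-and-ignored.
# Assumed export columns: alert_name, disposition in
# {"resolved_by_action", "auto_resolved", "acked_and_ignored"}.
import pandas as pd

history = pd.read_csv("alert_history_q3.csv")
by_alert = history.groupby("alert_name").agg(
    fires=("disposition", "size"),
    acked_and_ignored=("disposition", lambda s: (s == "acked_and_ignored").mean()),
)
ghosts = by_alert[(by_alert["fires"] > 5) & (by_alert["acked_and_ignored"] > 0.8)]
print(ghosts.sort_values("acked_and_ignored", ascending=False).to_string())
# Candidates for demotion to dashboard-only at the next alert-graveyard review.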
Measuring rotation fairness — the variance metric
Aggregate page counts hide bimodal distributions. A team where one engineer takes 40% of pages because they happen to be on call during peak weeks is in trouble even at a healthy team-average page count. The fix is to add a rotation fairness metric: the standard deviation of pages-per-engineer-per-quarter divided by the mean (a coefficient of variation, called the fairness variance here). A perfectly fair rotation scores 0 — every engineer takes the same number of pages over the long run. Real rotations sit at 0.2–0.4; teams with rotation problems sit at 0.6+.
The two main causes of high variance: (a) primary-rotation overlap with predictable peak events (an engineer whose week always includes IPL Saturday will get hammered), fixed by event-aware rotation scheduling that rotates the primary off the peak event explicitly; (b) primary-vs-secondary asymmetry where the secondary almost never gets called (so the rotation is effectively single-deep), fixed by enforcing a primary-rotation-off after 90 minutes on the page so the secondary actually takes over and the rotation deepens. Tracking the fairness metric and acting on it converts on-call from a structural-tax-on-bad-luck into a fairly distributed responsibility — which, more than any other lever, is what determines whether engineers stay on the rotation long enough to learn the system deeply.
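The metric itself is one line; a sketch with illustrative per-engineer counts:
# rotation_fairness.py - fairness variance: std of per-engineer page counts / mean.
import numpy as np

pages_per_engineer = np.array([18, 22, 19, 21, 20, 23, 60, 17])  # one engineer hammered

def fairness(counts: np.ndarray) -> float:
    return counts.std() / counts.mean()

print(round(fairness(pages_per_engineer), 2))
# -> 0.53, well above the 0.2-0.4 healthy band; look for peak-event overlap
#    or a secondary who never actually takes over.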
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 oncall_pain_sim.py
# expect: 187 / 5 / 6 page counts under the three regimes
# then mutate: add a SEV2 tier, route SEV2 with 9-hour delay, observe
# off-hours fraction drop. Add an event-tier label. Compute rotation
# fairness variance across 8 engineers across 90 days.
The relationship to the SRE book's "toil" framing
The Google SRE book frames on-call pain as a special case of toil — operational work that scales with system size and does not produce engineering value. The book's prescription (toil should be capped at 50% of an engineer's time) is the upstream concept; on-call pain is the most measurable instance of toil. A team that measures on-call pain weekly and treats it as a budget is implicitly capping a major toil category. The implication for engineering management: project planning should explicitly budget engineering capacity available as headcount × (1 - oncall_toil_fraction - other_toil_fraction), not as raw headcount. Teams that plan against raw headcount perpetually under-deliver because the on-call load was never accounted for; teams that plan against toil-adjusted capacity match their estimates and their delivery, which is what the toil framing is really about. The four pain numbers feed directly into this calculation — they are the measured toil for the alerting category, and reducing them frees capacity for engineering work in a way no other lever does.
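The capacity arithmetic as a sketch, with illustrative toil fractions; measure your own from the pain columns.
# toil_adjusted_capacity.py - plan against toil-adjusted capacity, not raw headcount.
headcount = 8
oncall_toil_fraction = 0.18   # e.g. pages, disrupted-sleep recovery, incident follow-up
other_toil_fraction = 0.12    # tickets, manual releases, and similar
plannable = headcount * (1 - oncall_toil_fraction - other_toil_fraction)
print(f"plannable capacity: {plannable:.1f} engineer-equivalents of {headcount}")
# -> plannable capacity: 5.6 engineer-equivalents of 8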
Where this leads next
The next chapter — /wiki/the-page-budget-and-error-budget-policy — formalises the page-budget concept as an error-budget-style contract between the platform team and product teams. After that, Part 11 covers the practical apparatus: alert routing implementation (/wiki/alertmanager-routing-and-inhibition), runbook architecture (/wiki/runbook-design-as-decision-trees), the four-question test as a checklist, and rotation-design patterns including follow-the-sun and primary-secondary-tertiary depth.
The deeper composition with Part 10 — /wiki/multi-window-multi-burn-rate-alerts and /wiki/sli-slo-sla-the-definitions-that-matter — shows how the symptom alerts that survive on-call-pain-reduction are derived from the SLI-SLO contract; the chapter on /wiki/symptom-based-alerts-the-google-sre-book is the immediate predecessor that established why cause alerts had to go in the first place.
A subtle forward-link: the chapter on /wiki/synthetic-monitoring-and-blackbox-probes (planned later in Part 11) closes the low-traffic blind spot named in §3 — manufactured traffic that produces symptom signal during off-peak hours, so the symptom rule scales down to small endpoints without the team rediscovering the gap and reaching for cause alerts again. Part 17's /wiki/observability-as-an-engineering-culture revisits on-call pain as a leading indicator of organisational health — the four numbers are not just an alerting metric, they are a measurement of how seriously the engineering org takes the contract with its operators.
A final downstream pointer: the relationship between on-call pain and engineering retention is empirically sharper than most leadership realises. Teams that hold the four numbers below cap report 12-month engineer retention 18–25 percentage points higher than teams that do not measure them at all (an unpublished cross-team study at a Bengaluru-based fintech, two quarters of data, eight platform teams) — and the retention gap concentrates on senior engineers, who are the population most likely to leave for a less-painful rotation elsewhere. Reducing on-call pain is, among other things, a senior-engineer retention strategy.
References
- Site Reliability Engineering — chapter 11, "Being On-Call" — the canonical statement of the page-budget rule and the toil framing for on-call work.
- The Site Reliability Workbook — chapter 8, "On-Call" — practical patterns for rotation design, page-budget enforcement, and toil measurement.
- Charity Majors et al., Observability Engineering (O'Reilly, 2022) — chapter on alerting and on-call — the modern critique of cause-based alerting and the human-cost framing.
- Matthew Walker, Why We Sleep (Scribner, 2017) — the sleep-research basis for off-hours-page cost multipliers and decision-quality degradation.
- Wertz et al., "Effects of Sleep Inertia on Cognition" (JAMA, 2006) — the empirical study underlying the 90-minute degraded-cognition window after waking from N3 sleep.
- PagerDuty Incident Response — alerting principles and on-call sustainability — the practitioner's adaptation of page-budget and severity routing.
- /wiki/symptom-based-alerts-the-google-sre-book — internal: the symptom-rule predecessor that made page-budget measurement possible.
- /wiki/wall-alerts-are-where-observability-touches-humans — internal: the wall chapter naming the human cost the four pain numbers measure.