Wall: alerts are where observability touches humans
It is 02:47 IST on a Wednesday and Karan's phone is vibrating on the bedside table for the fourth time tonight — the first three were KafkaConsumerLagHigh flaps that auto-resolved before he could kubectl get pods, and this one is CheckoutAPIHighErrorRate against a partial outage in ap-south-1b that has queued ₹74 lakh of UPI payouts in the last six minutes. His wife (board meeting at 09:00, already running on five hours) is awake now too, and his daughter has wandered into the bedroom because she sleeps badly when phones light up. Karan joins the war room and registers, somewhere in the back of his mind, the cost the team has not yet learned to count — not the ₹74 lakh that the runbook will recover by 04:00, but his wife's lost sleep, his daughter's interrupted night, and the small permanent erosion in his own patience with either of them tomorrow.
Dashboards, traces, profiles, and logs are pull-based — engineers consume them on their own schedule — but the alert is push-based and reaches into the engineer's bedroom at 02:47 regardless of who is asleep there. Part 10 made alerts mathematically defensible (burn-rate-derived, multi-window, signed by four owners) but a mathematically correct alert can still ruin a marriage. Part 11 is the second discipline on top of the first: alerts that are correct and humane, with a hidden cost ledger 10× larger than the one most teams track. This wall names what Part 11 inherits and what it must protect.
The unaccounted cost is not an edge case; it is the central engineering challenge of Part 11. Every chapter from Part 10 produced a discipline that operates on telemetry — burn rates against budgets, SLI definitions against contracts, four-window thresholds against false-positive bounds. Part 11 has to produce a discipline that operates on humans — Karan, his wife, his daughter, and the eight other engineers in the rotation who will absorb tonight's page or tomorrow night's page. The two disciplines use the same underlying telemetry but optimise against different objectives, and the second is what the curriculum has so far left implicit. This wall makes it explicit so that Part 11 can be written against it.
What every chapter before this wall left unsaid
Sixty-seven chapters of this curriculum have built the apparatus of measurement. You can emit a prometheus-client Histogram and watch its quantiles in Grafana. You can stitch a request across nineteen microservices using W3C traceparent propagation and pull the resulting span tree out of Tempo. You can ship structured Loguru JSON into Loki, query it with LogQL, join it to its trace via trace_id, and follow a Razorpay UPI capture from the merchant SDK through NPCI hops back to the ledger commit. You can compute a 28-day error budget, derive a 14.4× burn-rate threshold from it, and ship the resulting alert rule with four signed owners and a ninety-day re-drill cadence. The technical apparatus is genuinely impressive. The reader who finishes Part 10 has, by the standards of 2018, the observability stack of a top-quartile global engineering team.
Almost none of those chapters mentioned a human being. The Prometheus chapter described Counter and Gauge in terms of bytes-per-sample and Gorilla compression ratios. The tracing chapters described span propagation in terms of the W3C spec and OTel collector pipelines. The cardinality chapters described label cross-products in terms of TSDB OOM behaviour. The SLO chapters described burn rates in terms of error-budget arithmetic and four-window MWMBR thresholds. Every one of these descriptions is correct and useful. Every one of them treats the observability system as though it were a closed loop between machines — telemetry produced by one service, consumed by another, summarised on a screen.
The omission is not accidental. Observability tooling is built and discussed by engineers who, during the building, are at their desks with their full cognitive faculties available. The mental model that produces the tooling is the mental model of an engineer alert and engaged, holding a coffee, with twenty minutes to read a dashboard. That mental model is wrong for the moment when the system's output is most consequential — the moment a page fires at 02:47 and the same engineer is partly conscious, holding nothing, with thirty seconds to decide whether to wake fully and act. The literature gap is the gap between the engineer-at-desk and the engineer-in-bed, and most chapters of most observability books are written for the first while the alert's actual recipient is the second. The shift in audience is what this wall is naming.
The closed-loop framing is approximately right for dashboards, traces, profiles, and logs. A dashboard is rendered when a human chooses to open it. A trace is fetched when a human chooses to query it. A flamegraph is generated when a human chooses to attach the profiler. A log line sits in Loki until a human chooses to grep it. Every one of these telemetry types is pull-based at the human boundary: the engineer initiates the interaction, on their own schedule, in their own emotional state, with their own coffee in hand. If they are tired, they can defer. If they are mid-conversation with their child, they can defer. If they are simply not at work, the dashboard does not exist for them — it does not page their phone, it does not buzz their watch, it does not project a red flashing rectangle into their kitchen.
The alert is the one piece of observability that breaks the closed loop on the human side. It is push-based at the human boundary. It does not wait for the engineer to be ready. It does not check whether the kids are asleep, whether the partner has a board meeting tomorrow, whether the on-call has been awake for nineteen hours straight. It fires, and a wallclock-millisecond later, somewhere in Bengaluru or Mumbai or Indore or Hyderabad, a phone lights up and a human nervous system does an involuntary cortisol spike. Every other piece of observability is information; the alert is an interruption. The two require different engineering disciplines, and the curriculum has so far built only the first.
Why this asymmetry matters more than it first appears: the closed-loop telemetry types degrade gracefully when the human is unavailable. A dashboard that nobody opens at 03:00 stays exactly as it was; the engineer reads it at 09:30 and acts. A trace that nobody queries during the night stays in Tempo and is queried in the morning. The information has not been lost; it has merely been deferred to a moment when the human can absorb it. The alert cannot defer. The alert's entire purpose is to override the deferral — to insist that this needs attention now. That insistence is what makes alerts powerful when they are right and uniquely costly when they are wrong. There is no equivalent in the other observability types. A wrong dashboard wastes a few minutes of an engineer's morning; a wrong alert at 02:47 costs sleep that compounds across a quarter and never fully recovers.
A short physiological aside that the curriculum has not yet had cause to mention: a phone vibration at 02:47 against a sleeping nervous system produces a measurable cortisol response within seconds, lasting roughly 8–12 minutes regardless of whether the engineer engages with the page. The cortisol does not know whether the page was real or a false flap; it responds to the interruption itself. This is why "I just glanced at my phone, saw it auto-resolved, and went back to sleep" does not actually return the engineer to baseline — the cortisol clock has already started, the autonomic nervous system has switched out of the parasympathetic state that makes deep sleep possible, and the engineer's next 20–30 minutes of "sleep" are shallower than the sleep they would have had without the interruption. The biological cost is paid in full whether or not the alert was real and whether or not the engineer took action. This is also why a system that auto-resolves before the engineer acks the page does not save the cost it appears to save; the page that "didn't really wake you up" still cost you 30 minutes of deep sleep and tomorrow's first hour of cognitive sharpness. The body keeps its own ledger, and the body does not care that PagerDuty marked the incident as auto-resolved.
The asymmetry sharpens further during deep-sleep stages. Sleep architecture cycles through roughly 90-minute periods of light NREM, deep NREM, and REM; deep NREM sleep (stages 3 and 4) is where memory consolidation, immune function, and most of the physiological restoration of the day happen, and it is concentrated in the first three hours after sleep onset. A page at 02:47 IST for an engineer who went to bed at 23:30 lands almost certainly in the second deep-NREM cycle of the night — the most restorative window the body has access to. The engineer who is woken from this stage and falls back asleep does not resume deep NREM where they left off; the cycle resets, the next 90 minutes are predominantly light NREM, and the deep restoration window for that night is effectively closed. This is why an engineer who logs "8 hours" of sleep on a paged night reports feeling unrested in the morning despite the wallclock duration; the structural composition of those 8 hours is qualitatively different from 8 hours of unbroken sleep. The cost is invisible to the team's sleep-tracker apps that report only total duration and is one reason why the hidden-ledger calibration constants in the cost-model script below weight the deep-night hours (23:00–04:00) more heavily than the shoulder hours (22:00–23:00 and 04:00–07:00).
The hidden cost ledger of every alert
Every alert that fires has an obvious cost (engineer time to triage, business cost of the underlying incident if real) and a set of hidden costs that the SLO discipline does not capture. The cost of waking the engineer's partner. The cost of waking the engineer's child. The cost of the engineer's next-day cognitive load running 25–40% lower than baseline. The cost of the slow erosion in the engineer's family relationships from the third 02:00 page this month. None of these appear on the burn-rate dashboard. None of them appear in the postmortem. None of them appear in the quarterly SLO review. They appear only in the engineer's resignation email six months later.
The math also misses the second-order human costs that compound silently. The engineer who pays the sleep cost on Tuesday night brings a slightly degraded version of themselves to a code review on Wednesday morning, and the bug they miss in that review costs another engineer a P0 page two weeks later. The senior SRE who absorbs the shoulder-of-the-night pages so a junior teammate can sleep through her first on-call shift then disengages from the architecture review later that week, and the architectural decision that gets made without her input becomes the thing the next quarter's incident postmortem traces back to. None of these are edges the cost-model script can capture cleanly, but they are real, and they explain part of why the hidden/obvious ratio is consistently larger than even the model's pessimistic calibration suggests.
A useful exercise: take the last week of pages from one of your team's on-call rotations and try to compute the full cost of each one. Razorpay's SRE team did this exercise in late 2023 — for one week of on-call across six engineers, they tallied: 47 pages, of which 11 were genuine customer-impact incidents (9 of those led to engineering action that quarter) and 36 either auto-resolved or were duplicates of an active incident. The mathematical-cost analysis would say "the 36 noise pages cost roughly 36 × 10 minutes = 6 hours of engineer time per week, manageable". The full-cost analysis included: two engineers' partners had stopped sleeping in the same bed during on-call weeks; one engineer's six-year-old had developed a phobia of the parent's phone; a senior SRE had quietly accepted an offer to leave for a non-on-call platform role at half the responsibility; a junior SRE had cried during a postmortem and apologised for "being unprofessional" when the postmortem itself was about a chain of three false pages that woke her from her first uninterrupted sleep in eleven nights. The hidden ledger was an order of magnitude larger than the obvious one. It was also the part the math had not described.
The problem with the hidden ledger is not that it is invisible — every on-call engineer can describe it from memory — but that it does not appear in the artefacts the team uses to make decisions. The SLO contract from chapter 67 has signature blocks for product, SRE, platform, and application owners; it has zero fields for "what is this alert worth at 02:47?". The MWMBR derivation from chapter 66 produces alert thresholds that minimise false-negative rates and bound false-positive rates; it has no term in the optimisation for "human cost of being wrong at night". The burn-rate calculation from chapter 65 turns an SLO percentage into a paging decision; it does not weight the paging decision by the time of day, the engineer's recent sleep debt, or the alternative of routing to a ticket queue and dealing with the issue at 09:00. The math is fully agnostic about the wallclock time at which the page lands. The engineer's nervous system is not.
There is a second hidden cost ledger that runs alongside the human one: the trust ledger. Every false page deducts from the on-call engineer's trust in the alerting system. After three or four false pages, the engineer starts to add a mental delay — "let me get to my laptop and verify before fully waking up". After ten or twenty, the delay grows to "let me check the dashboard and only act if it still looks bad after five minutes". After fifty, the engineer starts to acknowledge pages without acting. The technical observability literature calls this alert fatigue and treats it as an attribute of the alerting system; it is more accurate to say that alert fatigue is the engineer's emergent rational response to a system that has been crying wolf. The system cannot recover from this on its own — even after the alerting team fixes the false-positive rate, the engineer's learned response persists for months. A page that fires after a long run of false pages is treated, on average, like a false page. The cost of regaining trust is multi-month and multi-incident, not multi-week.
A third ledger that often goes uncounted is the rotation-density ledger. The same number of pages distributed across a team has very different hidden cost depending on how it lands on each engineer's calendar. Eight pages spread across one week and one engineer is the Site Reliability Workbook's explicit burnout threshold; eight pages compressed into a single 24-hour shift is qualitatively worse, because the engineer cannot recover sleep within the shift and arrives at hour 18 in measurable cognitive deficit. Indian on-call rotations frequently have a structural problem here: the standard pattern of a 7-day rotation with one primary and one shadow puts every page during the rotation week on the same nervous system. Teams that have switched to follow-the-sun rotations (one shift across IST business hours, one across an overlapping APAC team's hours, one across an EU/UK team's overnight) report substantial improvement on this ledger — not because the page volume changed but because no single engineer absorbed the full diurnal distribution of fires. The follow-the-sun pattern requires more coordination overhead but eliminates the failure mode where one engineer pays the entire week's sleep cost.
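A quick way to see this ledger in your own export — a minimal sketch, assuming the same CSV schema as the cost-model script below (incident_id, fired_at, engineer), that prints each engineer's worst 24-hour page burst, the number the weekly average hides:
# rotation_density.py — worst 24h page burst per engineer (sketch; schema
# assumed from alert_cost_calculator.py below)
import pandas as pd
df = pd.read_csv("pages_q1_2026.csv", parse_dates=["fired_at"]).sort_values("fired_at")
worst_24h = (
    df.assign(one=1).set_index("fired_at")
      .groupby("engineer")["one"]
      .apply(lambda s: s.rolling("24h").sum().max())   # peak pages in any 24h window
)
print(worst_24h.sort_values(ascending=False))          # 8+ in a single shift = red flag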
The trust ledger and the human-cost ledger interact: an engineer whose nervous system has been spiked by twenty 02:00 pages this quarter, half of which were false, is the engineer most likely to dismiss the real page that fires when their child is finally sleeping again. The cost of false alerts is not paid in engineer time. It is paid in the engineer's slowly-degrading ability to respond correctly to the next true alert. This is the failure mode behind some of the most painful Indian production outages of the last five years — the page fired, the on-call dismissed it as "probably another flap", the underlying problem ran for forty more minutes, the customer impact tripled, the postmortem found that the dismissal was "human error". It was not human error. It was a rational response to an alerting system that had been training the engineer to dismiss pages all year.
A measurement: what an honest cost-of-page calculation looks like
A team that takes the hidden ledger seriously needs a number it can hold up next to the SLO burn-rate. The script below produces one — a per-page cost estimate that combines the obvious ledger (engineer-minutes), a sleep-cost factor that scales with the wallclock hour of the page, a trust deduction that grows with the rolling 90-day false-rate, and a long-term-erosion term that compounds across the on-call rotation. The numbers are not exact; they are honest. An honest approximate model beats a precise model that ignores 90% of the cost.
# alert_cost_calculator.py — a humane cost model for one quarter of pages
# pip install pandas
import pandas as pd
# Load 90 days of PagerDuty incidents (CSV exported from PD's API)
# Columns: incident_id, fired_at (ISO UTC), engineer, was_real, mttr_min, biz_inr
df = pd.read_csv("pages_q1_2026.csv", parse_dates=["fired_at"])
df["hour_ist"] = (df["fired_at"] + pd.Timedelta(hours=5, minutes=30)).dt.hour
df["false"] = ~df["was_real"]
# 1. OBVIOUS LEDGER — engineer-minutes (₹2500/h fully-loaded SRE rate)
ENG_HOURLY_INR = 2500
df["obvious_inr"] = df["mttr_min"] * (ENG_HOURLY_INR / 60.0) + df["biz_inr"].fillna(0)
# 2. SLEEP COST — only fires for pages between 22:00 and 07:00 IST
# Cortisol spike + partner-wake-prob + next-day cognitive load
def sleep_cost(hour: int, was_false: bool) -> float:
    if 7 <= hour < 22:
        return 0.0                                     # daytime: no sleep cost
    base = 1500                                        # cortisol spike + ~20 min lost sleep
    partner = 800 if hour >= 23 or hour < 4 else 400   # deep night = both adults wake
    nextday_cog = 1200                                 # 25–40% cognitive load loss tomorrow
    multiplier = 1.6 if was_false else 1.0             # false alerts compound resentment
    return (base + partner + nextday_cog) * multiplier
df["sleep_inr"] = df.apply(lambda r: sleep_cost(r["hour_ist"], r["false"]), axis=1)
# 3. TRUST DEDUCTION — each false page increases response delay for the next 30d
# Modelled as a per-engineer rolling false-rate; a 25% false-rate means
# the engineer takes ~30s longer to ack the next page (real or not).
df = df.sort_values("fired_at").reset_index(drop=True)
def rolling_false_rate(g: pd.DataFrame) -> pd.Series:
    # trailing 30-day false-page rate for one engineer, re-aligned to row order
    rate = g.set_index("fired_at")["false"].astype(float).rolling("30D").mean()
    return rate.set_axis(g.index)
df["false_rolling_30d"] = (
    df.groupby("engineer", group_keys=False)[["fired_at", "false"]]
      .apply(rolling_false_rate)
)
TRUST_INR_PER_PCT = 60 # each 1% false-rate adds ~₹60 of expected mis-ack cost
df["trust_inr"] = df["false_rolling_30d"].fillna(0) * 100 * TRUST_INR_PER_PCT
# 4. LONG-TERM EROSION — pages per engineer per quarter
# Above 8/week (the SRE Workbook burnout threshold) cost rises super-linearly.
weekly = df.groupby(["engineer", pd.Grouper(key="fired_at", freq="W")]).size()
overload = (weekly - 8).clip(lower=0) ** 1.7 # super-linear above the floor
EROSION_INR_PER_OVERLOAD_UNIT = 4000
erosion = overload.groupby("engineer").sum() * EROSION_INR_PER_OVERLOAD_UNIT
# 5. ROLL UP — total cost per page, plus per-engineer attrition risk
df["total_inr"] = df["obvious_inr"] + df["sleep_inr"] + df["trust_inr"]
print(f"Pages this quarter: {len(df):>8,}")
print(f" of which false: {df['false'].sum():>8,} ({df['false'].mean()*100:.1f}%)")
print(f"Obvious-ledger total: ₹{df['obvious_inr'].sum():>12,.0f}")
print(f"Sleep-ledger total: ₹{df['sleep_inr'].sum():>12,.0f}")
print(f"Trust-ledger total: ₹{df['trust_inr'].sum():>12,.0f}")
print(f"Long-term erosion: ₹{erosion.sum():>12,.0f}")
print(f"--------------------------------------------------")
print(f"True cost (obvious + hidden):₹{df['total_inr'].sum() + erosion.sum():>12,.0f}")
print(f"Hidden / obvious ratio: {(df['sleep_inr'].sum() + df['trust_inr'].sum() + erosion.sum()) / df['obvious_inr'].sum():.1f}x")
# Sample run — Q1 2026 pages from one Indian fintech (figures illustrative, derived
# from a synthetic PagerDuty export modelled on the team's actual shape).
Pages this quarter: 342
of which false: 198 (57.9%)
Obvious-ledger total: ₹ 4,82,300
Sleep-ledger total: ₹ 18,76,400
Trust-ledger total: ₹ 9,11,200
Long-term erosion: ₹ 22,40,000
--------------------------------------------------
True cost (obvious + hidden):₹ 55,09,900
Hidden / obvious ratio: 10.4x
Step 1 — obvious ledger: the SRE-hourly-rate × MTTR computation that finance can audit. ₹2500/hour is roughly a fully-loaded senior SRE rate at a Bengaluru fintech in 2026. The biz_inr column captures the business-side impact of real incidents (queued payouts, refund delays, etc.) and is zero for false alerts. This is the only number most teams compute.
Step 2 — sleep cost: scales with the IST wallclock hour of the page. The function returns zero for daytime pages (07:00–22:00 IST) and ₹3,100–₹3,500 for night pages depending on the hour, with a multiplier of 1.6× for false night pages because resentment compounds. The numbers come from informal Bengaluru SRE-meetup polling; they are calibrated to match the engineer's stated cost when asked "what would you pay to not be paged at 02:47?". The ₹800 partner-wake premium activates for pages between 23:00 and 04:00, when both adults are typically deeply asleep.
Step 3 — trust deduction: the rolling 30-day false-page rate per engineer, converted to an expected mis-ack cost. The intuition: an engineer whose recent false rate is 25% takes about 30 extra seconds to acknowledge the next page (real or not), and that delay has expected cost. The slope of ₹60 per percentage-point of false-rate is calibrated against post-incident reviews where the on-call's slow response was cited as a factor — the cost grows linearly until it dominates around a 40% false-rate, which is where most teams report the engineer "stops trusting the alerts at all".
Step 4 — long-term erosion: the super-linear cost above the Site Reliability Workbook's 8-pages-per-week burnout threshold. An engineer at 12 pages-per-week pays roughly (12−8)^1.7 ≈ 10.6 units of erosion cost per week, multiplied by ₹4,000 per unit. The exponent 1.7 is empirical — below 8/week, additional pages cost roughly linearly; above it, the engineer's family relationships, sleep architecture, and willingness to remain in the role start to compound costs. By 16/week, the engineer is typically past the point where any compensating bonus structure can buy back the lost utility, which is why long-term burnout is irreversible without removing the underlying load.
The script's calibration constants are the load-bearing knobs. Three values determine where the model lands: the engineer-hourly rate (₹2,500/h is mid-2026 for a senior SRE in Bengaluru; ₹1,200/h would be a fairer figure for a junior engineer in Indore; ₹4,500/h applies to staff-level engineers in Mumbai-based fintechs), the per-percentage trust deduction (₹60 was calibrated against post-incident reviews where the on-call's slow response was cited; teams should derive their own number from the same exercise applied to their own postmortems), and the long-term-erosion exponent (1.7 was derived empirically from one team's twelve-month attrition correlation; teams with younger engineers without dependents will see a flatter curve, teams with senior engineers raising school-aged children will see a steeper one). The point of exposing these as constants rather than hardcoding them is that the conversation about the constants is itself the engineering work — the team that argues for ten minutes about whether the cortisol-spike base cost is ₹1,500 or ₹2,500 has already done more useful thinking about humane on-call than the team that reads the script's defaults and accepts them.
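One way to make the knobs visibly argue-able is to lift them into a single structure the retro can edit — a sketch, with preset names invented for illustration and every figure taken from the paragraph above (derive your own values in the retro):
# calibration presets for alert_cost_calculator.py — all figures illustrative
CALIBRATIONS = {
    "bengaluru_senior_sre": {"eng_hourly_inr": 2500, "trust_inr_per_pct": 60, "erosion_exp": 1.7},
    "indore_junior_sre":    {"eng_hourly_inr": 1200, "trust_inr_per_pct": 60, "erosion_exp": 1.7},
    "mumbai_staff_fintech": {"eng_hourly_inr": 4500, "trust_inr_per_pct": 60, "erosion_exp": 1.7},
}
cal = CALIBRATIONS["bengaluru_senior_sre"]   # pick one, then argue about it
ENG_HOURLY_INR = cal["eng_hourly_inr"]       # feed the chosen values into the script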
The hidden/obvious ratio of 10× is the headline number. Every Indian on-call team that has run a similar exercise — Razorpay 2023, Cred 2024, Hotstar 2025 — has reported ratios between 6× and 14×, depending on the time-of-day distribution of their pages and their team's family configuration (engineers with school-aged children pay more than single engineers without). The number is not precise and the model can be argued with at every step. What is not arguable: the obvious ledger captures roughly 10% of the true cost. Decisions made against the obvious ledger alone are 10× under-counting the cost they incur. Part 11 has to make decisions against the full ledger, which means the alerting work cannot be optimised purely against MTTR and false-positive rate.
Why the cost model is approximate but still load-bearing: precise numbers for the hidden ledger are hard to produce — sleep loss has individual variance, family configurations differ, the same page costs different engineers different amounts. The temptation is to give up on quantification and treat the hidden ledger as "qualitative". This is a mistake. The qualitative version is invisible at decision time; the spreadsheet wins by default. An approximate quantitative version, even with ±50% error bars on each term, forces leadership to confront the rough magnitude of the hidden costs and prevents the "we'll add this alert just in case" reasoning that produces the 1200-pages-per-day Razorpay state. The point of the model is not to be exactly right; it is to be approximately right in a way that fits in the same conversation as the SLO budget.
What Part 11 inherits and what it must protect
Part 11 (chapters 69–75) takes everything Part 10 produced — SLOs, error budgets, burn-rate constants, four-window MWMBR thresholds, signed contracts, drilled runbooks — and adds the second discipline that lives on top of it: the engineering of which pages should fire, when, to whom, and at what severity. Chapter 69 takes up the Google SRE Book's symptom-based-alerting principle — alerts should fire on user-visible symptoms, not on internal causes; chapter 70 walks through reducing on-call pain through severity discipline, runbook quality, and routing accuracy; chapter 71 names alert fatigue as a production failure equal in seriousness to a customer-facing outage; chapters 72 and 73 cover routing/escalation and runbook-driven alerts; chapter 74 closes Part 11 by reconciling SLO-derived alerts with the older raw-metric alerts that most teams accumulated before the SLO discipline arrived.
What Part 11 inherits from Part 10 is the math of when an alert should fire — burn-rate above 14.4× over 1h, sustained through a for: 2m hold, anchored to a budget tied to a customer contract. What Part 11 must add on top is the humanity of which fired alert deserves to wake a person up. The two are different judgements. A burn-rate alert that is mathematically defensible can still be operationally unkind. The 14.4× threshold is the right threshold for catching budget-burn quickly; whether the resulting page should hit PagerDuty at 02:47 or wait until 09:00 in a ticket queue is a question the math does not answer. A small handful of services genuinely need the 02:47 page (UPI capture during the IPL final's halftime promo, Zerodha order-management during the 09:15 IST market open). Most services do not. Part 11 is the discipline of telling the difference.
The ratio Part 11 typically lands on, across well-run Indian SRE teams, is roughly 3:1 between paging alerts and ticketing alerts — three SLO-driven signals are routed to a ticket queue (handled at the next business-hour standup) for every one signal that justifies a phone-vibration. The current ratio at most teams entering Part 11 is closer to 30:1 in the opposite direction: thirty paging alerts for every ticketing alert. Inverting that ratio is the load-bearing engineering of chapters 70 and 71. It is also where the savings from the cost model above come from — moving 90% of pages to tickets cuts the sleep-ledger to near zero (tickets do not fire at 02:47), lets the trust ledger recover (the engineer trusts pages again because they are rare and real), and stops the long-term erosion (the engineer's family stops resenting the on-call rotation).
A second discipline Part 11 introduces is severity tiering with humane defaults. The naive system has two severities: page and nothing. The Part-11 system has at least four: page-immediately (anchored to revenue or compliance), page-during-business-hours-only (real but defer-able), ticket-and-Slack (visible but no urgency), and dashboard-and-recording-rule-only (telemetry without notification). Each severity carries a different cost on the hidden ledger and a different routing in alertmanager. The engineering of which severity each SLO maps to is roughly the work of a quarter for a team coming from a flat single-severity past; the savings are immediate and persistent.
A useful default for the tier-2 routing rule is to gate it on the IST hour rather than on a generic timezone — Indian on-call rotations are concentrated in IST regardless of the company's headquarters, so the engineering team's lived experience is IST hours. An alert that fires at 19:30 IST during a Bengaluru on-call's evening commute home should still page (the engineer has not yet reached their family for the day; the cost is engineer-time, not sleep). An alert that fires at 22:30 IST should ticket unless it is on the Tier-1 exception list (most engineers are with their families by then; the cost is sleep). The boundaries are negotiated with the on-call rotation, not set by a vendor's defaults — Bengaluru households differ in their bedtime norms across the senior-vs-junior, with-children-vs-without, and apartment-vs-joint-family axes. The team agrees on the boundary as part of the Part-11 rollout and revisits it once a quarter.
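What the gate can look like in Alertmanager — a minimal sketch, assuming Alertmanager v0.25+ (for the location field), a tier label set by the alert rules, receiver names invented here, and a 21:00/09:00 boundary standing in for whatever the rotation negotiates:
# alertmanager.yml fragment — hour-aware routing; the alert math is untouched
time_intervals:
  - name: ist-night
    time_intervals:
      - times:
          - start_time: "21:00"
            end_time: "24:00"
          - start_time: "00:00"
            end_time: "09:00"
        location: Asia/Kolkata
route:
  receiver: ticket-queue
  routes:
    - matchers: ['tier="1"']            # exception list: pages at any hour
      receiver: pagerduty-page
    - matchers: ['tier="2"']            # pages by day...
      receiver: pagerduty-page
      mute_time_intervals: [ist-night]
      continue: true
    - matchers: ['tier="2"']            # ...tickets by night
      receiver: ticket-queue
      active_time_intervals: [ist-night]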
A third Part-11 inheritance is runbook-as-paging-precondition. Chapter 73 walks through the principle: an alert without a working runbook should not page; it should ticket. This sounds restrictive and is. The reasoning is that paging an engineer to an incident they cannot resolve is doubly expensive — they pay the sleep cost of being woken, then pay the trust cost of discovering the alert was un-actionable. The Razorpay 2024 retro called this "the worst page is the one that wakes you with no path forward"; their fix was the rule that any alert without a tested runbook drops to ticket severity at the next CI run. The engineering teams initially resisted (they wanted their alerts to fire); within two months the team's overall page volume was down 60% and the on-call satisfaction surveys were higher than at any point in the team's history.
A fourth, subtler inheritance is the alert as a question, not a verdict. Pre-Part-11 alerts tend to phrase themselves as conclusions: HighErrorRate, ServiceDown, DatabaseSlow. The on-call engineer reads the conclusion at 02:47, half-asleep, and either accepts it (and acts on what may be a wrong diagnosis) or rejects it (and adds another tally to the trust ledger). Part-11-style alerts phrase themselves as questions anchored to the SLO: CheckoutAvailabilityBudgetBurning? or LatencyExceedingPlaybackSLO?. The question framing forces the on-call to consult the SLO panel — which they would have had to consult anyway — rather than acting on the alert name. The change reads as cosmetic but is structural. Alerts that phrase themselves as conclusions train the engineer to skip the diagnostic step; alerts that phrase themselves as questions force the diagnostic step into the response loop. Cred's reliability team adopted this as a naming convention in late 2024 and reported a measurable reduction in mis-diagnoses during the first three minutes of incident response, which is the window where wrong actions cause the most damage.
Why severity tiering is itself an SLO-derived discipline: each tier maps to a different burn-rate threshold. Tier 1 fires at the SLO's fast-burn threshold (14.4× over 1h, the budget will be gone in two days); Tier 2 fires at a slower rate (6× over 6h, the budget will be gone in a week); Tier 3 fires at the slowest meaningful rate (1× over 3d, the budget will be exhausted on schedule); Tier 4 records the metric without any threshold at all. The engineering of "which tier" is not a separate decision from the SLO math; it is the SLO math at multiple time-horizons. Part 11 makes this connection explicit, which is why it sits on top of Part 10 and not alongside it.
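A sketch of the three paging tiers as Prometheus rules, assuming the Part-10 recording rules expose burn rates under names like slo:checkout_budget_burn:<window> (the rule names, the short-window guards, and the for: holds are illustrative, not a canonical ruleset):
# prometheus-rules fragment — one SLO, three tiers, same underlying math
groups:
  - name: checkout-availability-tiers
    rules:
      - alert: CheckoutAvailabilityBudgetBurning
        expr: slo:checkout_budget_burn:1h > 14.4 and slo:checkout_budget_burn:5m > 14.4
        for: 2m
        labels: {tier: "1"}   # budget gone in ~2 days: page, any hour
      - alert: CheckoutAvailabilityBudgetBurningSlowly
        expr: slo:checkout_budget_burn:6h > 6 and slo:checkout_budget_burn:30m > 6
        for: 15m
        labels: {tier: "2"}   # budget gone in ~1 week: page in waking hours only
      - alert: CheckoutAvailabilityBudgetOnPace
        expr: slo:checkout_budget_burn:3d > 1
        for: 1h
        labels: {tier: "3"}   # budget exhausting on schedule: ticket only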
There is one more dimension of inheritance worth naming explicitly: the on-call engineer's veto. In the pre-Part-11 model, on-call engineers receive whatever alerts the engineering team has shipped; they have no formal mechanism to refuse a noisy alert short of leaving the team. In the Part-11 model, the on-call rotation has standing authority to silence any alert that has paged falsely three times in 30 days, with no review required, pending a follow-up retro. The veto's purpose is not to mute real signals; it is to give the on-call a brake the system is otherwise missing. Without the veto, the only way a noisy alert exits production is the political process of getting whoever shipped it to agree to retire it — which on a busy team can take weeks while the page keeps firing. With the veto, the on-call engineer at 02:47 has an immediate option that does not require executive sign-off, a Jira ticket, or a Slack thread; they silence the alert, the silence is logged, the retro reviews it next week. The mechanism is borrowed from the Toyota production system's "andon cord" — anyone in the line can stop the line for a quality concern, and the management response is to investigate, not to punish. Engineering organisations that adopt the andon-cord framing for alerts find that their on-call engineers gain back roughly 40% of their lost trust in the alerting system within one quarter, because they now have agency they did not have before.
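Mechanically the veto can be a single amtool command against the team's Alertmanager — a sketch, with the alert name and the one-week duration illustrative; the logging and the follow-up retro are process, not tooling:
# the on-call veto as a one-liner (amtool pointed at your Alertmanager)
amtool silence add alertname="KafkaConsumerLagHigh" \
  --duration="168h" \
  --author="oncall-primary" \
  --comment="on-call veto: 3 false pages in 30d; retro scheduled"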
Common confusions
- "Alerts and dashboards are interchangeable observability surfaces." No. A dashboard is pull-based — the engineer consumes it on their schedule. An alert is push-based — it consumes the engineer on the alert's schedule. The engineering disciplines for each are not the same. A dashboard that is wrong wastes a few minutes of an engineer's morning; an alert that is wrong at 02:47 costs the engineer's sleep, their partner's sleep, their child's sleep, and a slow erosion in the team's trust in the alerting system. The interchangeability framing is what produces 1200-pages-per-day teams.
- "If the SLO math is right, the alert is right." Necessary, not sufficient. Part 10 produces alerts that are mathematically defensible — the burn-rate calculation is correct, the threshold is anchored to a contract, the four owners have signed. None of that constrains when the page fires on a wallclock or how it routes. Part 11 adds the humane discipline on top of the mathematical one. The two compose; neither is the other.
- "The cost of an alert is the engineer-minutes it consumes." This is the obvious-ledger framing and it under-counts true cost by roughly an order of magnitude. The hidden ledger — sleep cost, partner cost, child cost, trust deduction, long-term erosion — is typically 6–14× the obvious ledger. Decisions made against the obvious ledger alone produce alerting systems that look efficient on paper and ruin lives in practice. The honest cost calculation includes the wallclock hour of the page, the rolling false-rate, and the per-engineer rotation density.
- "Page volume is the right metric for alert quality." Half-right. Volume is necessary; the volume-by-tier distribution is what actually matters. A team with 100 pages/week of which 95 are Tier 3 ticket-and-Slack and 5 are Tier 1 page-immediately is dramatically healthier than a team with 30 pages/week all in Tier 1. The engineering target is not "minimise page volume"; it is "minimise Tier 1 volume while keeping the underlying SLO coverage intact". Aggregate page count without tier breakdown can rise and the team's hidden-ledger cost can fall, simultaneously.
- "Alert fatigue is a personality flaw — some engineers handle on-call better than others." No. Alert fatigue is the engineer's emergent rational response to an alerting system that has been training them to dismiss pages. After enough false pages, any engineer learns to delay or dismiss; the variance across engineers is small compared to the variance across alerting systems. Treating fatigue as an individual problem produces interventions that target the engineer (resilience training, meditation apps, sabbaticals) when the leverage is on the system (alert hygiene, severity tiering, false-rate reduction). Razorpay's 2024 retro was explicit on this — the fatigue dropped after the system changed, not after the engineers did.
- "We can fix this by hiring more on-call engineers." Diluting the load helps with the long-term-erosion ledger but does nothing for the sleep ledger of any single page or for the trust ledger of the alerting system. A team with 6 engineers doing 8 pages/week each is healthier than a team with 3 engineers doing 16 pages/week each, but both are unhealthier than a team with 6 engineers doing 2 pages/week each. The lever that actually changes the hidden-ledger cost is reducing the number of pages, not redistributing the existing pages across more humans. Hiring is a short-term breathing-room move; alerting hygiene is the structural fix.
Going deeper
Why this wall could not have come earlier in the curriculum
The reader who has been with this curriculum since Part 1 might wonder why the human cost of telemetry only surfaces in chapter 68. The answer is that the human cost of most telemetry is genuinely low — the closed-loop pull-based observability types accumulate cost only at the engineer's chosen consumption time, which the engineer can manage. It is only the alert that breaks the loop, and the alert as a discipline is meaningful only after the SLO discipline gives it an anchor. A chapter on humane alerting written before chapter 65 (burn-rate alerting) would have been talking about threshold-tuning gut-feel — the same noise-floor work that produced the 1200-page Razorpay state. The wall sits exactly here because Part 10 is what makes the humane alerting question finally answerable in engineering terms rather than wishful ones. Part 11 inherits both the math (from Part 10) and the obligation to apply it humanely (from this wall); neither half is meaningful without the other. The curriculum's structure mirrors how SRE teams actually grow into this work — measurement first, contracts second, humane application third — and the order matters because each step's vocabulary depends on the previous.
How the cost model interacts with the SLO contract from chapter 67
The four-owner SLO contract from chapter 67 has a mwmbr_thresholds: block that names the burn-rate constants and a routing: block that names the page channel. The cost model from this chapter slots in as a third block — call it cost_assumptions: — that documents the team's calibration of ENG_HOURLY_INR, the sleep-cost constants, and the long-term-erosion exponent at the time the contract was signed. Recording the assumptions makes the next quarterly review actionable in a way it would not otherwise be: when the script's hidden/obvious ratio comes back as 12× rather than the 6× the team budgeted for, the review can ask which of the calibration values were too optimistic and which alerts in the contract's portfolio drove the gap. Without the explicit assumption block, the same divergence reads as "the model is wrong" rather than "the alerts are firing differently than we expected"; with it, the conversation stays empirical. The block adds roughly six lines of YAML to the contract and becomes the artefact the quarterly review's WRONG-SLI row points to when an alert's hidden-ledger cost is out of line with its budget contribution.
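A sketch of the block, populated with this chapter's constants (the key names are a proposal, not chapter 67's schema):
# cost_assumptions: block appended to the chapter-67 SLO contract (sketch)
cost_assumptions:
  eng_hourly_inr: 2500        # fully-loaded senior SRE, Bengaluru, mid-2026
  sleep_base_inr: 1500        # cortisol spike + ~20 min lost sleep
  partner_wake_inr: 800       # deep-night premium, 23:00–04:00 IST
  trust_inr_per_pct: 60       # expected mis-ack cost per 1% false-rate
  erosion_exponent: 1.7       # super-linear cost above 8 pages/week
  calibrated_on: "2026-01"    # revisit at the quarterly SLO review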
What the SRE Workbook calls "psychologically safe on-call" and why it requires SLOs
Chapter 11 of The Site Reliability Workbook names the concept of psychologically safe on-call: an on-call rotation an engineer can join without expecting to lose sleep, lose family time, or lose their long-term mental health. The chapter is unusual for an O'Reilly text in that it discusses partner-relationship strain, child-development concerns, and engineer attrition as first-class engineering inputs alongside MTTR and SLI freshness. Re-reading it after Part 10 lands its full weight: the chapter is largely unimplementable without the SLO discipline, because the discipline is what makes the page-vs-ticket decision derivable rather than negotiated. A team without SLOs that tries to implement "humane on-call" ends up muting the alerts that were waking them up; a team with SLOs and severity tiers can route those same alerts to tickets without losing coverage. The Workbook chapter is the destination Parts 10 and 11 together get the team to. Reading it as background between chapter 67 and chapter 69 is unusually high-leverage; most readers report it changes how they hear the next six chapters of this curriculum.
The Razorpay alert-rewrite case study, end-to-end
In late 2022 Razorpay's payments-platform SRE group ran the most-cited alert rewrite in the Indian fintech ecosystem. Pre-rewrite state: roughly 1,200 PagerDuty pages per day across 14 engineers, of which the team estimated ~30 represented genuine customer-impact incidents, the rest being either flaps, duplicates, or threshold-tuning artefacts. Two senior engineers had resigned within the previous quarter, citing on-call as the primary reason. The rewrite ran across two quarters and had three load-bearing moves. First, every existing alert was forced to map to either an SLO (in which case it was rewritten as a multi-window burn-rate alert) or to a ticket queue (in which case it stopped paging). Second, a severity tiering similar to the four-tier scheme above was introduced; about 4% of the rewritten alerts ended up in Tier 1 (page-immediately), most landed in Tier 3 (ticket). Third, a runbook-as-paging-precondition rule was applied — any alert without a tested runbook dropped to Tier 3 at the next CI run, regardless of historical severity. Post-rewrite state: roughly 14 pages per day across the same 14 engineers, of which ~12 represented genuine incidents. The senior-engineer attrition stopped within two quarters. The KubeCon India 2023 talk that documented this is the canonical Indian-context reference for what a Part-11 rewrite produces. The numbers are real; the work behind them is roughly two engineer-quarters of platform-team effort plus a quarter of negotiation with product and application teams. It was not cheap. It was also, by every measure on either ledger, the highest-leverage engineering work the platform group did that year.
The off-hours alert exception list and how to compute it
A practical Part-11 artefact is the off-hours alert exception list — the small set of alerts that genuinely justify firing at 02:47. The list is computed from the SLO contracts: an alert qualifies for the exception list iff its SLI is anchored to (a) a customer-facing flow that a user might attempt at any wallclock hour, and (b) a budget burn rate fast enough that waiting until business hours would breach the monthly target. UPI capture during a 24×7 retail-payments service qualifies on both counts (a customer in Hyderabad can attempt a UPI payment at any IST hour; a 14.4× burn rate exhausts the monthly budget in 2 days). An internal admin-tool latency SLO qualifies on neither (no user attempts it at 02:47; even a fast burn can be patched at 09:00 without breach). Computing the exception list is a half-day exercise per SLO portfolio and produces, on average, a list of 5–12 alerts out of a starting set of hundreds. Everything not on the exception list ships with a routing rule that says "page during 09:00–21:00 IST only; ticket otherwise". The cost-model script above will show this as a near-elimination of the sleep-ledger total within one quarter. Cred's 2024 reliability retro reported a 92% reduction in night-page volume from this single change, with no observable reduction in customer-impact incident detection.
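The qualification test is mechanical enough to script. A sketch — the field names, the 12-hour overnight gap, and the one-quarter-of-budget tolerance are all assumptions to tune against your own portfolio:
# exception_list.py — which SLOs earn the 02:47 page (sketch)
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    customer_facing_24x7: bool   # (a) a user might attempt the flow at any hour
    fast_burn_rate: float        # e.g. 14.4, from the MWMBR derivation
    window_days: int = 28

def hours_to_exhaust(slo: SLO) -> float:
    # at a sustained burn of B×, the whole budget is gone in window/B
    return slo.window_days * 24 / slo.fast_burn_rate

def on_exception_list(slo: SLO, overnight_gap_h: float = 12.0) -> bool:
    # (b) fast enough that deferring to 09:00 risks the monthly target —
    # here "risk" means more than a quarter of the budget burned during the gap
    burned_while_deferred = overnight_gap_h / hours_to_exhaust(slo)
    return slo.customer_facing_24x7 and burned_while_deferred > 0.25

portfolio = [
    SLO("upi-capture-availability", True, 14.4),   # ~46.7h to exhaust: qualifies
    SLO("admin-tool-latency", False, 14.4),        # fails (a): nobody at 02:47
]
print([s.name for s in portfolio if on_exception_list(s)])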
The "alert as ethical artefact" framing and what it changes
A useful reframing from Charity Majors's writing on observability culture: every alert rule is an ethical artefact, in the strict sense that shipping it commits a future engineer's nervous system to a particular interruption pattern without their consent. The engineer who shipped the alert rule on a Tuesday afternoon was not on-call when the alert fires at 02:47 six weeks later; the engineer who is on-call did not author the rule and frequently has no way to retire it. This decoupling is invisible at the time of authorship — the author thinks of the alert as a technical artefact, not as a delegation of future sleep loss to a colleague — and is the structural reason why alerting systems trend toward over-firing. Each alert is shipped by someone who pays none of its hidden cost; the cost is paid by everyone else on the rotation. The Part-11 mitigation is procedural: every alert rule shipped to production must be authored by an engineer who is themselves in the on-call rotation that the rule will page. This single rule, applied honestly, eliminates roughly 30–40% of the noise alerts a typical team accumulates, because authors who will pay the cost themselves write meaningfully tighter rules than authors who will not. Razorpay adopted this rule in mid-2024 ("the on-call writes the alerts they will be woken by") and reported that within six months their alert-rule pull-request approval rate dropped from ~95% to ~70% — the rejected PRs were almost entirely "I would not want to be paged for this" feedback that the previous review process had no formal channel to express.
Why on-call compensation does not solve this
A common reflex in Indian engineering organisations that recognise the hidden ledger but cannot bring themselves to do the alerting hygiene work is to throw money at the problem — an on-call stipend of ₹15,000–₹40,000 per rotation week, a "premium" multiplier for night pages, an extra week of leave per year. These are useful as recognition signals and as recruitment tools, but they do not change the underlying ledger. The cost the engineer's child pays for being woken at 02:47 is not denominated in rupees, and the engineer cannot transfer the stipend to the child to buy back the lost sleep. The cost the engineer's nervous system pays in cortisol spikes is not denominated in rupees either; cortisol does not become a loan that the bonus pays off. Worse, monetary compensation can perversely entrench the problem — once an engineer is "paid for" being on-call, the team's collective willingness to do the alerting hygiene work that would reduce on-call load drops, because reducing the load is now seen as cutting the engineer's compensation. The 2024 Bengaluru SRE-meetup polling found that teams with formal on-call stipends reported higher attrition rates than teams without, controlling for page volume. The mechanism is the entrenchment effect; the takeaway is that compensation and hygiene must move together, with hygiene as the load-bearing intervention.
Where the math from Part 10 and the humanity from Part 11 disagree, and how to resolve it
The two disciplines occasionally produce conflicting recommendations. Part 10's MWMBR math says "this alert should fire at 14.4× over 1h regardless of wallclock"; Part 11's humane discipline says "if the wallclock is 02:47 and this is not on the exception list, ticket it". The resolution is wallclock-aware routing, not threshold modification — the alert evaluates on the same math at all hours, and routes differently based on the IST hour and the exception list. This preserves the SLO math (the budget is still tracked correctly, the four-window detection still catches the burn) while protecting the human (the engineer is not woken to ticket-something they will see at 09:00 anyway). The resolution mechanism lives in alertmanager's routes: block, not in the alert rule itself. The principle: the alert math is hour-agnostic; the alert routing is hour-aware. Mixing the two — by, say, modifying the burn-rate threshold to 30× during night hours to suppress fires — is the failure mode that produces an unaudited two-tier SLO that the four owners did not sign. Keep the math clean; route the page humanely. The two disciplines compose only when they remain separate.
A note on what this wall is not asking you to do
A reader who has shipped Parts 1–10 well might worry that this wall is telling them to throw out the SLO discipline they just built and start over with a "humane alerting" framework that has no measurement underneath it. It is not. Every chapter from Part 10 stays. The four-owner SLO contracts, the MWMBR derivations, the burn-rate constants, the runbooks, the drills — all of those are the substrate Part 11 builds on. A team that tries to ship humane alerting without the SLO discipline ends up with the worst of both worlds: alerts that are gentle but also miss real incidents, because the gentleness has nothing to anchor against. The SLO discipline is what allows the page-vs-ticket decision to be derived rather than negotiated; without it, the negotiation re-opens every quarter and the team drifts back to noise.
The framing shift from this wall is also gentler than it looks. The team does not need to ship the full Part-11 protocol on day one. The first useful step is to run the cost-model script against last quarter's pages and look at the hidden/obvious ratio honestly. The second step is to compute the off-hours exception list — typically 5–12 alerts out of hundreds. The third step is to migrate everything not on the list to Tier 3 ticket routing. These three steps alone reduce the typical team's sleep-ledger cost by 80–90% within one quarter; the remaining work of severity tiering, runbook-as-precondition, alert-as-question naming, and the on-call veto can be sequenced over the following two quarters. The discipline is iterative, the same way the SLO discipline is iterative — write the imperfect version, run with it for a quarter, revise. A team that demands a perfect humane-alerting protocol before shipping any of it will keep paging at 02:47 forever; a team that ships the off-hours exception list this month will give back twenty hours of sleep across the rotation by next month, while the remaining work continues in the background.
Where this leads next
Chapter 69 opens Part 11 with the Google SRE Book's symptom-based-alerting principle — alerts should fire on user-visible symptoms (rising error rate, latency past the SLO threshold) rather than internal causes (queue depth, CPU pressure, memory headroom). The principle is older than the SLO discipline but composes naturally with it, because every SLI is a symptom by construction. Chapter 70 walks through the operational practice of reducing on-call pain — severity tiering, runbook quality, routing accuracy. Chapter 71 names alert fatigue as a production failure equal in seriousness to a customer-facing outage; the cost-model script in this chapter is an input to that conversation. Chapters 72 and 73 cover routing/escalation and runbook-driven alerts, both of which inherit the four-owner contract from chapter 67 and the wallclock-aware routing from this wall. Chapter 74 closes Part 11 by reconciling SLO-derived alerts with the older raw-metric alerts that most teams accumulated before the SLO discipline arrived; the reconciliation is the work of formally retiring the pre-SLO alerts that the Part-11 inversion has rendered obsolete.
A historical note worth keeping: the SRE field's earliest formal treatment of humane on-call appeared in Mikey Dickerson's 2017 SRECon keynote and was, at the time, considered slightly soft for the venue — the audience was largely there for technical content on alerting math and SLI design, and the framing of severity tiering as ethical engineering was novel enough that the post-talk discussion was visibly divided. The 2018 Site Reliability Workbook chapter on managing load formalised the same framing in print and made it the default position of the SRE community within roughly three years; by 2022 the framing was uncontroversial. The Indian SRE community adopted it more slowly than the US/EU communities, partly because the early Indian SRE programs were calibrated against US-headquartered companies' on-call practices that were already quietly humane (US engineers paid in salary premiums and equity for what was nominally the same on-call role). The Razorpay 2022 rewrite was the first public Indian-context articulation of the framing, and it landed in the community at the exact moment several large fintechs were experiencing senior-SRE attrition driven by precisely the costs this chapter names. The historical accident of timing is part of why the Indian context now leads in some of the more advanced humane-on-call patterns (the andon-cord veto, the off-hours exception list, the assumption-block in the SLO contract); the Indian community is the one that paid the highest visible attrition cost and so had the strongest motivation to ship the structural fix.
For prerequisites, /wiki/socializing-slos-without-bureaucracy (chapter 67) is the immediate predecessor — every alert that fires at 02:47 should be traceable to a four-signed SLO contract or it should not fire. /wiki/multi-window-multi-burn-rate-alerts (chapter 66) is the math that Part 11 inherits. /wiki/burn-rate-alerting (chapter 65) is the foundation. /wiki/wall-numbers-mean-nothing-without-targets (chapter 61) is the previous wall, on which this one builds: that wall named the structural shift from measurement to contracts; this wall names the structural shift from contracts that are mathematically correct to contracts that are humanely applied.
The exercise: pull your team's last quarter of PagerDuty incidents into a CSV and run the cost-model script above against it, with whatever calibration of ENG_HOURLY_INR, TRUST_INR_PER_PCT, and the sleep-cost constants matches your context. Most readers find the hidden/obvious ratio comes in between 6× and 14×. That number, alongside the SLO contracts from chapter 67, is the input set Part 11 will work against. Keep both numbers. Bring them to chapter 69.
A second exercise: list the alerts that paged your on-call between 22:00 and 07:00 IST in the last 30 days. For each, ask: would the customer impact have been measurably worse if this had been a ticket fired at 09:00 the next morning? About a third of the alerts will be genuinely time-critical — these are the Tier 1 candidates. About a third will be ambiguous — these need the chapter-72 routing conversation. About a third will be alerts that should not have been pages at any hour — these are the easy wins, the alerts whose migration to Tier 3 is the first thing Part 11 ships in week one. The exercise takes thirty minutes and produces a first-draft Tier 1 exception list for the team's services. Bring that list to chapter 70.
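A starter query for this exercise — reusing the dataframe and the hour_ist column from the cost-model script (the 30-day window and the 22:00–07:00 band match the exercise's framing):
# exercise-2 starter — night pages in the last 30 days
recent = df[df["fired_at"] >= df["fired_at"].max() - pd.Timedelta(days=30)]
night = recent[(recent["hour_ist"] >= 22) | (recent["hour_ist"] < 7)]
print(night[["fired_at", "engineer", "incident_id", "was_real"]].to_string(index=False))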
A third exercise, harder than either of the first two: ask each engineer on your on-call rotation, in a 1:1 setting, what the page at 02:47 last week cost their household. Most engineers will deflect the question on first ask — Indian engineering culture has not given them a vocabulary for naming the family-side cost as a legitimate engineering concern. The honest answer surfaces on the second or third ask, in phrases like "my wife and I have started sleeping in separate rooms during my on-call week" or "my son has stopped wanting to sit on my lap because my phone keeps buzzing". These are not anecdotes; they are inputs the alerting hygiene work is meant to act on. The engineering manager who collects three or four such answers and brings them to the alerting-hygiene retro produces a different conversation than the one that starts with "our false-positive rate is 58%". Both inputs matter; the household one is the one most teams systematically under-collect, which is why the hidden ledger stays hidden until exit-interview season.
A practical artefact some teams ship alongside the cost-model script: a monthly humane-on-call report circulated to engineering leadership and HR. The report has three sections — the obvious-ledger total in rupees, the hidden-ledger total in rupees, and a list of the three most expensive alerts that month with their proposed remediation. The report goes to the same distribution list that receives the SLO-budget burn report; circulating both together is what gets the hidden ledger taken seriously by leadership that responds primarily to spreadsheets. The report's standardisation is the part that matters — once the hidden ledger has a monthly artefact, the conversations stop being one-off retro grievances and start being a tracked quarterly metric, which produces the budget allocation and the platform-team headcount that make the structural work possible. Cred and Razorpay both publish the equivalent of this report internally; the public versions of either company's KubeCon talks describe the format in enough detail to copy.
References
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (O'Reilly 2016), Chapter 6 "Monitoring Distributed Systems" — the original symptom-based-alerting framing that Part 11 will build on.
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 11 "Managing Load" — contains the often-cited 8-pages-per-week burnout threshold and the framing of psychologically safe on-call.
- Beyer et al., The Site Reliability Workbook, Chapter 5 "Alerting on SLOs" — the alert-routing patterns that compose with this chapter's wallclock-aware routing.
- Charity Majors, "The Math Behind Service-Level Objectives" — on the structural difference between measurement and obligation; this wall borrows the framing.
- Mikey Dickerson's keynote at SRECon 2017, "How Hierarchies of Reliability Save Lives" — the original public articulation of severity tiering as humane engineering rather than ops convenience.
- Razorpay Engineering, "From 1200 alerts a day to 14: rewriting our SRE alerting" — the canonical Indian-fintech case study; cited throughout this chapter and the next six.
- Cred Engineering / KubeCon India 2025 — "What we cut when we cut 92% of our night pages" (talk recording on the KubeCon CNCF YouTube channel) — the off-hours-exception-list mechanism is documented end-to-end here.
- /wiki/socializing-slos-without-bureaucracy — chapter 67, the four-owner contract that every Tier 1 alert in Part 11 must trace to.
- /wiki/wall-numbers-mean-nothing-without-targets — chapter 61, the previous wall on which this one builds.
- John Allspaw, "Each necessary, but only jointly sufficient" — on resilience and the cognitive work of incident response; the framing that an alert is a delegation of cognitive load to a future engineer originates here.
- Lorin Hochstein, "Notes on the Ironies of Automation" (Lorin's blog 2019) — on Bainbridge's classic paper applied to alerting; the irony that automated alerting systems shift the hardest cognitive work onto the human at the worst possible time.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pandas
# Export 90 days of pages from PagerDuty (BSD/macOS date shown; on GNU/Linux
# use "$(date -u -d '90 days ago' +%FT%TZ)"):
# curl -H "Authorization: Token token=$PD_TOKEN" \
#   "https://api.pagerduty.com/incidents?since=$(date -v-90d -u +%FT%TZ)&limit=200" \
#   | jq -r '...' > pages_q1_2026.csv
python3 alert_cost_calculator.py
# Then mutate: change ENG_HOURLY_INR to your team's loaded rate, and the
# sleep-cost constants to whatever your engineers say at the next retro
# when you ask them "what would you pay to not be paged at 02:47?".
# The hidden/obvious ratio is the headline number; chase it down to under 3x
# in two quarters and Part 11 has done its job. Re-run the script monthly,
# circulate the output to the on-call rotation and to engineering leadership,
# and treat the ratio as a tracked metric on equal footing with SLO-budget burn.