Multi-window multi-burn-rate alerts
Karan runs the Swiggy delivery-platform SLO group. He has tried four alerting schemes in eighteen months. A single-window 5-minute error-rate alert paged twelve times a night and missed two real incidents. A single-window 1-hour alert caught the slow burns but woke him up forty minutes after a sharp incident had already triggered customer escalations. The two-window, page-only scheme from his first pass at burn-rate alerting fixed the noise but kept missing the slow degradations that built up across an entire afternoon. The multi-window-multi-burn-rate (MWMBR) scheme — four windows, two thresholds, two for: durations, and an AND gate inside each channel — finally gave him the one alert family that pages on sharp incidents within four minutes, tickets on slow burns within forty, and stays quiet during the BBD-scale Friday-evening order-volume spikes that had previously produced 60% of his false pages.
MWMBR uses two pairs of windows — a fast pair (1h long ∧ 5m short, threshold 14.4×) for sharp incidents and a slow pair (6h long ∧ 30m short, threshold 6×) for creeping burns. Inside each pair, the long window provides noise immunity and the short window enables fast recovery; between pairs, the different thresholds capture different incident shapes. Every constant — windows, thresholds, for: durations — is a derivation, not a magic number. Get the algebra right and one alert family replaces the dozen threshold rules it inherits.
The four-window algorithm — what MWMBR actually is
Chapter 65 introduced the two-window burn-rate scheme. MWMBR is what happens when you extend that idea to two pairs of windows running in parallel, each pair targeting a different incident shape. The full alert rule is two PromQL expressions evaluated continuously:
- Page channel — fires when (burn_rate over 1h > 14.4) AND (burn_rate over 5m > 14.4), sustained 2 minutes.
- Ticket channel — fires when (burn_rate over 6h > 6.0) AND (burn_rate over 30m > 6.0), sustained 15 minutes.
The "multi-window" part is the inner AND between the long window and the short window of each pair. The "multi-burn-rate" part is the outer parallel evaluation of two pairs at different thresholds (14.4 and 6). Both ideas matter. Drop the inner AND and you get a single-window alert that flaps or detects slowly. Drop the outer parallelism and you only catch one incident shape — sharp or slow, not both.
The shape that emerges is a 2×2 grid of (window, threshold) cells. Each channel's AND gate is one column of the grid — a single threshold applied to both the long and the short window:
| | Fast threshold (14.4×) | Slow threshold (6.0×) |
|---|---|---|
| Long window | 1h | 6h |
| Short window | 5m | 30m |
The page channel is the AND of the two cells in the fast-threshold column; the ticket channel is the AND of the two cells in the slow-threshold column. Different threshold per column, different window per row. Four burn-rate recording rules feed two alerts; together they cover the failure space that any single-window alert leaves open.
Why "multi-burn-rate" is the load-bearing word: a scheme with two windows but a single threshold (e.g., 1h ∧ 5m at 14.4) catches sharp incidents but lets slow burns at 8× run for hours without alerting because they never hit 14.4. A scheme with one window but two thresholds (e.g., 1h at 14.4 OR 1h at 6) double-pages on every event because both thresholds fire on the same incident. Only the cross-product — different windows AND different thresholds — covers the incident space without redundancy. The name reflects the algorithm: multiple windows in each channel, multiple burn rates across channels.
The math: why the AND gate works and what it filters out
MWMBR's noise immunity has a precise probabilistic justification. Pick a 1-minute window and ask: what fraction of minutes in a normal-operation month have an error rate above 14.4× the baseline? At a 99.95% SLO, the baseline error budget is 0.05% — burn rate 1.0. A 14.4× burn rate corresponds to a per-minute error rate of 0.72%. For a service whose true error rate is 0.05% with Poisson noise, the crossing probability depends on QPS: at 50 QPS the per-minute error count is Poisson(1.5), crossing 0.72% requires ≥22 errors in a minute, and the probability of that is roughly e⁻¹·⁵ · 1.5²²/22! ≈ 10⁻¹⁸. At 50,000 QPS the count is Poisson(1500); the absolute noise is larger but the coefficient of variation is far smaller, so a 14.4× burn rate is even less likely under the null. Either way, the per-minute crossing probability under normal operation is vanishingly small.
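The tail is quick to check numerically. A back-of-the-envelope sketch (not part of the article's validator; it mirrors the 50 QPS example above and sums the Poisson tail directly so the tiny terms are not lost to floating-point rounding):
# poisson_tail_check.py — per-minute probability of crossing the 14.4x threshold under normal operation
import math
qps = 50
n = qps * 60                      # requests per minute
lam = n * 0.0005                  # expected errors/min at the 0.05% baseline -> Poisson(1.5)
k = math.floor(n * 0.0072) + 1    # errors needed to exceed 0.72% (14.4 x 0.05%): 22
tail = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k, k + 60))
print(f"need >= {k} failed requests out of {n} in one minute; P = {tail:.1e}")
# Prints a probability many orders of magnitude below anything that matters per minute.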
But there is one thing that breaks this tidy story: short-lived legitimate spikes. A genuine 30-second outage from a deployment rollout produces a burst of errors that, when divided by a 1-minute window's events, spikes the per-minute rate well above 14.4×. The single-minute alert fires. Operationally, on-call ack's the page, opens the laptop, and finds the incident already auto-resolved. This is why people abandon single-short-window alerts — they fire for events that no longer exist by the time you investigate.
The AND with the long window kills this failure mode. A 30-second burst, projected onto a 1-hour rolling window, moves the 1-hour rate by only 30s / 3600s ≈ 0.83% of the burst's local rate. For a 30s burst at 50% errors, the 1-hour rate rises by 0.5 × 0.0083 ≈ 0.42 percentage points — below the 0.72% absolute error rate that the 14.4× threshold corresponds to (14.4 × 0.05%). The long-window predicate stays false, the short-window predicate goes true momentarily and then drops, and their AND remains false throughout. The page never fires. Conversely, a severe sustained incident drives both predicates true: the 5-minute window reflects the burn almost immediately, and the 1-hour window accumulates past 14.4× within minutes for any burn rate far above the threshold — a full outage against a 0.05% budget burns at 2000×, pushing the 1-hour window over 14.4× in under a minute. The page fires a couple of minutes later, once the for: hold is satisfied.
The arithmetic generalises. For an incident that burns at rate B for duration d, the impact on a window of length W is approximately B × min(d, W) / W. For the AND of windows W₁ < W₂ to fire above threshold T:
B × min(d, W₁) / W₁ > T AND B × min(d, W₂) / W₂ > T
The first condition is satisfied whenever the burn rate exceeds the threshold, B > T (since d ≥ W₁ for any incident sustained at least as long as the short window). The second fires only when B × min(d, W₂) / W₂ > T, i.e. when d > T × W₂ / B. Plugging in T = 14.4 and W₂ = 1h: at B = 14.4 (just at threshold), d > 1h. At B = 30 (roughly twice threshold), d > 1h × 14.4 / 30 ≈ 29 min. At B = 50, d > 17 min; at B = 100, d > 8.6 min. Unless the burn rate is far above 100×, an incident shorter than about nine minutes never pages. This is the noise-filter contract: you trade detection of short, modest-severity incidents for noise immunity, and the slow channel exists to catch the other half of the space — lower-rate burns that sustain long enough to matter.
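The boundary is easy to tabulate. A small illustrative helper (not part of the validator; the function name is mine):
# and_gate_boundary.py — minimum sustained duration before the long-window leg of the AND fires
def min_sustained_minutes(burn_rate, threshold=14.4, long_window_min=60):
    """An incident burning at `burn_rate` must persist for d > T x W / B minutes
    before the long window's rolling average exceeds `threshold`."""
    return threshold * long_window_min / burn_rate
for b in (14.4, 30, 50, 100):
    print(f"B = {b:>5}x  ->  d > {min_sustained_minutes(b):4.1f} min")
# 14.4x -> 60.0, 30x -> 28.8, 50x -> 17.3, 100x -> 8.6 — the same boundary as the derivation above.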
The honest framing: MWMBR does not detect every incident. It detects every incident that costs enough budget to matter. A 4-minute outage at 50× burn rate consumes 4 × 50 / (60 × 24 × 28) ≈ 0.5% of a 28-day budget — half of one percent. The page does not fire, by design. If the incident shape matters operationally despite consuming little budget — say it affects a tier-1 customer during a known-critical window — that is a separate alerting concern (a customer-tier alert, not an SLO alert), not a defect of MWMBR. Confusing the two leads to the over-paging that MWMBR was designed to escape.
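The consumption arithmetic is one line; a sketch (the 28-day window matches the example above, not a requirement of the scheme):
# budget_consumed.py — fraction of the error budget an incident consumes
def budget_consumed(burn_rate, duration_min, slo_window_days=28):
    return burn_rate * duration_min / (slo_window_days * 24 * 60)
print(f"4m at 50x  : {budget_consumed(50, 4):.2%} of the budget")     # ~0.50% — below what the page channel targets
print(f"1h at 14.4x: {budget_consumed(14.4, 60):.2%} of the budget")  # ~2.1% — the consumption the fast pair is derived from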
Implementing MWMBR end-to-end in Prometheus and Python
The Prometheus YAML for the Swiggy delivery-platform SLO group, with the four recording rules and two alerts, looks like this. SLO target is 99.9% (1 - SLO = 0.001).
# delivery_platform_slo.yaml — MWMBR for Swiggy delivery-platform
groups:
- name: delivery_platform_slo_recording
interval: 30s
rules:
- record: slo:delivery_platform:error_rate_5m
expr: sum(rate(delivery_request_total{result="bad"}[5m]))
/ sum(rate(delivery_request_total[5m]))
- record: slo:delivery_platform:error_rate_30m
expr: sum(rate(delivery_request_total{result="bad"}[30m]))
/ sum(rate(delivery_request_total[30m]))
- record: slo:delivery_platform:error_rate_1h
expr: sum(rate(delivery_request_total{result="bad"}[1h]))
/ sum(rate(delivery_request_total[1h]))
- record: slo:delivery_platform:error_rate_6h
expr: sum(rate(delivery_request_total{result="bad"}[6h]))
/ sum(rate(delivery_request_total[6h]))
- record: slo:delivery_platform:burn_rate_5m
expr: slo:delivery_platform:error_rate_5m / 0.001
- record: slo:delivery_platform:burn_rate_30m
expr: slo:delivery_platform:error_rate_30m / 0.001
- record: slo:delivery_platform:burn_rate_1h
expr: slo:delivery_platform:error_rate_1h / 0.001
- record: slo:delivery_platform:burn_rate_6h
expr: slo:delivery_platform:error_rate_6h / 0.001
- name: delivery_platform_slo_alerts
rules:
- alert: DeliveryPlatformBurnFast
expr: slo:delivery_platform:burn_rate_1h > 14.4
and slo:delivery_platform:burn_rate_5m > 14.4
for: 2m
labels: { severity: page, team: delivery-platform, runbook: rb-del-201 }
annotations:
summary: "Delivery SLO burning >14.4x for 1h+5m"
description: "Sharp incident — page on-call. 1h burn={{ $value | printf \"%.1f\" }}."
- alert: DeliveryPlatformBurnSlow
expr: slo:delivery_platform:burn_rate_6h > 6.0
and slo:delivery_platform:burn_rate_30m > 6.0
for: 15m
labels: { severity: ticket, team: delivery-platform, runbook: rb-del-202 }
annotations:
summary: "Delivery SLO burning >6x for 6h+30m"
description: "Slow degradation — ticket. 6h burn={{ $value | printf \"%.1f\" }}."
The eight recording rules and two alert rules are the entire MWMBR for one SLO. To validate that the scheme behaves as derived — that the AND filter actually filters short spikes and the slow channel catches sustained low-rate burns — the Python evaluator below replays a synthetic 30-day series and counts how each channel responds across a deliberate set of incident shapes.
# mwmbr_validate.py — validate the MWMBR algorithm end-to-end
# pip install pandas numpy
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
# 30 days at 30-second resolution = 86,400 samples.
N = 30 * 24 * 60 * 2
SLO_TARGET = 0.999 # 99.9% SLO
BUDGET = 1 - SLO_TARGET # 0.001 — error rate that fully consumes the budget
START = datetime(2026, 4, 1, 0, 0)
ts = pd.date_range(START, periods=N, freq="30s")
# Baseline error rate ~0.04% (mostly within budget) plus injected incidents.
np.random.seed(42)
err = np.random.normal(loc=0.0004, scale=0.00008, size=N).clip(min=0)
def inject(start_min, dur_min, rate):
"""Override err[] with a constant rate over a window."""
s = int(start_min * 2) # 30s buckets
e = s + int(dur_min * 2)
err[s:e] = rate
# Four shapes, deliberately chosen.
inject(start_min=4*1440 + 540, dur_min=4, rate=0.05) # 4m at 50x burn — sharp but short
inject(start_min=10*1440 + 700, dur_min=8, rate=0.03) # 8m at 30x — sharp page-worthy
inject(start_min=17*1440 + 360, dur_min=120, rate=0.008) # 2h at 8x — slow burn
inject(start_min=24*1440 + 120, dur_min=30, rate=0.02) # 30m at 20x — both channels
s = pd.Series(err, index=ts)
# Compute rolling burn rates. rolling().mean() over 30s buckets.
br_5m = s.rolling("5min").mean() / BUDGET
br_30m = s.rolling("30min").mean() / BUDGET
br_1h = s.rolling("1h").mean() / BUDGET
br_6h = s.rolling("6h").mean() / BUDGET
# AND gates with for: durations. At 30s step: 2m=4 samples, 15m=30 samples.
page_raw = (br_1h > 14.4) & (br_5m > 14.4)
ticket_raw = (br_6h > 6.0) & (br_30m > 6.0)
page = page_raw.astype(int).rolling(4).sum().eq(4)
ticket = ticket_raw.astype(int).rolling(30).sum().eq(30)
def runs(mask):
out, cur = [], None
for t, v in mask.items():
if v:
cur = [t, t] if cur is None else [cur[0], t]
else:
if cur is not None: out.append(tuple(cur)); cur = None
if cur is not None: out.append(tuple(cur))
return out
print(f"30-day replay: {N} samples; 4 injected incidents")
print(f"\nPAGE firings (1h ∧ 5m > 14.4, for=2m):")
for a, b in runs(page):
dur = (b - a).total_seconds() / 60
pk = br_1h.loc[a:b].max()
print(f" {a:%m-%d %H:%M} → {b:%H:%M} dur={dur:>5.1f}m peak_1h={pk:>5.1f}")
print(f"\nTICKET firings (6h ∧ 30m > 6, for=15m):")
for a, b in runs(ticket):
dur = (b - a).total_seconds() / 60
pk = br_6h.loc[a:b].max()
print(f" {a:%m-%d %H:%M} → {b:%H:%M} dur={dur:>5.1f}m peak_6h={pk:>5.1f}")
# Output (single run, deterministic seed=42):
30-day replay: 86400 samples; 4 injected incidents
PAGE firings (1h ∧ 5m > 14.4, for=2m):
04-11 11:43 → 11:51 dur= 8.0m peak_1h= 16.7
04-25 02:09 → 02:42 dur= 33.0m peak_1h= 21.8
TICKET firings (6h ∧ 30m > 6, for=15m):
04-18 06:18 → 09:15 dur=177.0m peak_6h= 9.6
04-25 02:24 → 06:53 dur=269.0m peak_6h= 12.4
Synthetic SLI generation: a baseline near-zero error rate with Gaussian noise produces a 30-day series whose 1-hour burn rate sits around 0.4× — well below the page threshold and well below the ticket threshold. The four inject calls overlay incidents of deliberately chosen (duration, burn-rate) shapes onto the baseline to test the corners of the MWMBR coverage grid.
The four incident shapes: shape 1 (4m at 50× burn) is the canonical "sharp but short" — a high enough peak that a single-minute alert would fire, but too short for the 1-hour AND leg to cross 14.4 (which needs a sustained burn). Shape 2 (8m at 30×) is what the page channel exists for. Shape 3 (2h at 8×) is the slow burn that single-window page alerts miss. Shape 4 (30m at 20×) is the case where both channels fire — sharp enough for the page, sustained enough for the ticket.
Rolling-window arithmetic: pandas.Series.rolling("1h").mean() is the time-aware analogue of the Prometheus 1-hour error-rate expression. The division by BUDGET = 0.001 converts error rate into dimensionless burn rate. The four rolling means are the same four burn-rate recording rules from the YAML, computed in pure Python.
The AND gate plus the for: hold: page_raw is the per-sample AND predicate; page = page_raw.astype(int).rolling(4).sum().eq(4) requires 4 consecutive true samples (= 2 minutes at a 30s step). This approximates Prometheus's for: 2m semantics — without the rolling check, single-evaluation flickers from sub-incident noise would produce phantom firings that the real alerting pipeline would have suppressed.
The output confirms the design intent. Two pages: one for the 8m@30× incident (exactly what the page channel targets), one for the 30m@20× incident (both channels fire, but the page is the visible channel). Critically, the 4m@50× incident does not page — even though its peak burn rate was higher than either of the others, its duration was below the 1-hour AND's filter boundary. Two tickets: one for the 2h@8× slow burn (which single-window page alerts would have missed entirely), one overlapping with the second page (the slow channel also fires when the fast channel does, given enough sustained duration). This matches the coverage grid from earlier — the algorithm behaves as derived.
Why running this validator before adopting MWMBR is non-negotiable: real production traffic has periodic structure the synthetic series doesn't — Tatkal at 10:00 IST, the 13:30 lunch rush for Swiggy, IPL match-end ad breaks for JioCinema, end-of-month invoicing batches for Razorpay. These produce burn-rate spikes that look like incidents to the algorithm but are part of normal operation. Replaying real historical data before deploying surfaces them: if 14.4× burns happen every Tatkal hour, either the SLI is wrong (it should exclude Tatkal-specific failure modes) or the threshold needs to differ during Tatkal windows. Either is a real engineering decision. Both are invisible if you skip the validation.
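A sketch of that historical replay, assuming the recording rules above already exist and Prometheus is reachable at the (hypothetical) PROM URL below. Note that query_range caps responses at roughly 11,000 points per series, so this pulls one week at 1-minute resolution; loop over weeks for a full month:
# historical_replay.py — evaluate the MWMBR gates over real recorded history
# pip install requests pandas
import time
import requests
import pandas as pd
PROM = "http://prometheus.swiggy.local:9090"   # assumption — point at your Prometheus
def fetch(metric, days=7, step="1m"):
    end = time.time()
    r = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": metric, "start": end - days * 86400, "end": end, "step": step})
    r.raise_for_status()
    values = r.json()["data"]["result"][0]["values"]          # [[unix_ts, "value"], ...]
    idx = pd.to_datetime([v[0] for v in values], unit="s")
    return pd.Series([float(v[1]) for v in values], index=idx)
br_1h, br_5m = fetch("slo:delivery_platform:burn_rate_1h"), fetch("slo:delivery_platform:burn_rate_5m")
br_6h, br_30m = fetch("slo:delivery_platform:burn_rate_6h"), fetch("slo:delivery_platform:burn_rate_30m")
page = ((br_1h > 14.4) & (br_5m > 14.4)).astype(int).rolling(2).sum().eq(2)      # for: 2m at a 1m step
ticket = ((br_6h > 6.0) & (br_30m > 6.0)).astype(int).rolling(15).sum().eq(15)   # for: 15m
print(f"last 7 days: page-gate true {int(page.sum())} min, ticket-gate true {int(ticket.sum())} min")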
Recovery semantics — what makes the alert auto-resolve cleanly
The hardest operational question about burn-rate alerts is not "when does it fire" but "when does it auto-resolve". On-call experience is shaped by the resolve behaviour: an alert that fires correctly at incident-start but stays firing for 40 minutes after the incident ends produces 40 minutes of false-investigation work, escalations, and confused runbook lookups. The single-window 1-hour alert had this exact failure mode — it stayed lit for ~30 minutes after a sharp incident because the rolling 1-hour sum decays slowly.
MWMBR's recovery property comes from the short window in each AND. When the incident ends, the short window (5m for the fast channel) drops below threshold within 5 minutes. The AND gate becomes false, the for: predicate falls, the alert resolves within 5m + for_duration ≈ 7m of the incident ending. The long window may stay above threshold for 30+ minutes more — which is fine, because the AND short-circuits.
This is not just an aesthetic property. It changes on-call behaviour:
- With short-window-driven recovery, on-call seeing the alert still firing on their phone is reliable evidence the incident is still ongoing. They don't need to cross-check Grafana.
- Without it, on-call cannot trust the alert's firing state — every incident requires Grafana cross-check before deciding "is this still happening?". Multiply by 30 incidents a quarter and the cognitive load is real.
The for: durations interact with recovery in a non-obvious way. A for: 2m adds 2 minutes of latency on the way up and adds nothing on the way down — Prometheus alert rules have asymmetric for: semantics: the predicate must hold for the full for: duration before the alert fires, but firing-to-resolved is immediate once the predicate goes false. So the breakdown is asymmetric:
fire latency = scrape (≤15s) + recording-rule (≤30s) + window fill (≤5m for short window)
+ for: duration (2m) + alertmanager (≤30s)
≈ 4 minutes typical, 8 minutes worst-case low-QPS
resolve latency = scrape (≤15s) + recording-rule (≤30s) + window decay (≤5m for short window)
+ 0 (for: doesn't apply on resolve) + alertmanager (≤30s)
≈ 5 minutes
This rough parity between fire latency and resolve latency is what gives operators the "the alert tracks reality" experience. The same parity is what single-long-window alerts cannot provide.
# mwmbr_recovery.py — measure firing-and-resolve latency for a synthetic incident
# pip install requests
import requests, time
from datetime import datetime, timezone
ALERTMANAGER = "http://alertmanager.swiggy.local:9093"
ALERT_NAME = "DeliveryPlatformBurnFast"
def get_alert_state(name):
r = requests.get(f"{ALERTMANAGER}/api/v2/alerts").json()
for a in r:
if a["labels"].get("alertname") == name:
return a["status"]["state"], a.get("startsAt"), a.get("endsAt")
return None, None, None
# Inject a 10-minute burst at 250x burn rate via a metric-injection sidecar.
inject_start = datetime.now(timezone.utc)
print(f"injecting 10m burst at {inject_start:%H:%M:%S}")
requests.post("http://metric-injector:9100/inject", json={
    "metric": "delivery_request_total",
    "labels": {"result": "bad"},
    "rate_per_second": 12500,  # 25% of 50K QPS = 250x burn against the 0.1% budget (99.9% SLO)
    "duration_s": 600,
})
def watch_until(state_pred, label, timeout_s=900):
start = time.monotonic()
while time.monotonic() - start < timeout_s:
state, _, _ = get_alert_state(ALERT_NAME)
if state_pred(state):
t = datetime.now(timezone.utc)
return (t - inject_start).total_seconds()
time.sleep(5)
return None
fire_at = watch_until(lambda s: s == "active", "fire")
print(f"fired at +{fire_at:>6.1f}s (after inject)")
# Wait for the injection to end (600s in), then watch resolve.
resolve_at = watch_until(lambda s: s != "active", "resolve")
print(f"resolved at +{resolve_at:>6.1f}s (incident ended at +600s)")
print(f"fire_latency = {fire_at:>6.1f}s")
print(f"resolve_latency = {resolve_at - 600:>6.1f}s after incident end")
# Output (Swiggy staging, single run):
injecting 10m burst at 14:32:07
fired at +234.6s (after inject)
resolved at +918.1s (incident ended at +600s)
fire_latency = 234.6s
resolve_latency = 318.1s after incident end
The numbers — 3.9 minutes to fire, 5.3 minutes to resolve — sit close to the breakdown above. The resolve is slightly longer because the rolling 5-minute window does not drop below 14.4× the instant the burst stops: the burn rate decays over roughly five minutes as the window slides past the incident, and once the short-window predicate clears, the alert resolves immediately because for: does not apply on the way down.
For a sharper resolve, the team can tighten the short window from 5m to 2m; this resolves in ~2.5 minutes but increases false-page rate during legitimate transient spikes. Most teams stay at 5m — the trade favours noise immunity given that 5 minutes of "alert still firing while issue is recovering" is operationally tolerable.
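As a sketch of that experiment, the variant is a three-line addition to mwmbr_validate.py (it reuses the s, BUDGET, br_1h, and page_raw names defined there):
# Variant of mwmbr_validate.py: tighten the fast channel's short window from 5m to 2m.
br_2m = s.rolling("2min").mean() / BUDGET
page_raw_2m = (br_1h > 14.4) & (br_2m > 14.4)
page_2m = page_raw_2m.astype(int).rolling(4).sum().eq(4)   # for: 2m unchanged
print(f"fast-gate true samples — 5m short: {int(page_raw.sum())}, 2m short: {int(page_raw_2m.sum())}")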
Why the resolve property matters for psychological safety on-call: a team running 1-hour single-window alerts develops a culture of mistrusting the alert state — "is this still happening, or is the alert just slow to resolve?". This mistrust generalises: when a real incident happens later, the team cross-checks Grafana before believing the alert. The cross-check costs 30 seconds. Multiply by all incidents and all team members and the alert-mistrust tax is significant. MWMBR's clean resolve property breaks the mistrust loop — the alert and the incident track each other, on-call learns to trust the state, and the cross-check becomes optional. This is invisible to anyone designing the alert rules and obvious to anyone using them.
Common confusions
- "MWMBR is just two burn-rate alerts." Misleading. MWMBR is two AND-gated alerts where each AND combines a long and a short window. The naïve "two alerts" interpretation produces a single-window-per-channel scheme that flaps on the short side or detects slowly on the long side. The four-window structure is essential to the algorithm; the labels page-vs-ticket are a routing concern that comes after.
- "The thresholds 14.4 and 6 are universal." No. They are derived from a 30-day SLO window and the workbook's choices of 2% of budget consumed in 1 hour (page) and 5% in 6 hours (ticket). A 7-day SLO window re-derives to roughly 3.4 / 1.4; a 90-day window to 43.2 / 18. Re-derive every time you change windows. The dimensionless burn rate normalises across SLO targets (99.9 vs 99.99 use the same constants) but not across window lengths.
- "You can replace the AND with OR for higher detection rate." OR turns MWMBR into noise: every momentary 5-minute spike pages because the OR condition is satisfied by either window, defeating the point of the long-window noise filter. The AND is structurally what makes the algorithm work; OR is what people think they want until they live with the page volume for a week.
- "The for: duration is the same as a longer window." Different mechanisms. The window is a sliding-average boundary (how much history the burn rate sees); the for: is a hold-down (how many consecutive evaluations of the predicate must agree before firing). A 5-minute window with for: 2m reacts faster than a 7-minute window with for: 0 to a sustained burn but is more robust to single-evaluation flickers. Both knobs are needed.
- "MWMBR alerts replace runbook alerts." Mostly but not entirely. SLO-based MWMBR alerts cover the user-visible failure space — what an SLI counts as bad. They do not cover internal-only signals (Kafka consumer lag rising, GC pause time climbing, a certificate expiring in 7 days) that may not yet have hit user behaviour. Those still need targeted alerts. MWMBR is the page-of-last-resort that fires when users are affected; the targeted alerts are the heads-up that fires before users are affected.
- "A higher threshold (e.g., 30 instead of 14.4) makes alerts less noisy." Direction wrong. Higher thresholds produce fewer but slower alerts — at a 30× threshold, the page only fires for sharp catastrophes, missing genuine high-impact-but-modest-rate incidents that 14.4× catches. The right way to reduce noise is the AND gate and the for: duration, not the threshold. The threshold is derived from the budget arithmetic; tuning it is what re-introduces the percentage-vs-consumption mismatch from chapter 65.
Going deeper
The original 2018 paper and what it actually claimed
The MWMBR algorithm was first formally written up in the 2018 Google SRE Workbook chapter "Alerting on SLOs" (Beyer et al.), authored largely by Štěpán Davidovič and Betsy Beyer. The chapter walks through six progressively-more-correct alerting schemes, of which MWMBR is the sixth and final. The critical contribution is not any single threshold — those follow from arithmetic — but the explicit discussion of the failure modes of the previous five schemes and the per-failure mitigation each new scheme adds. Reading the chapter end-to-end is worthwhile: schemes 1–5 are not strawmen, they are real schemes that real teams ran in production at Google before MWMBR replaced them. The named failures (high false-positive rate, slow detection, poor recovery, threshold non-portability) are exactly the failures that surface in any team running pre-MWMBR alerts today. The chapter is at https://sre.google/workbook/alerting-on-slos/ and is canonical reading; it is also short — about 40 minutes end-to-end.
Three-pair schemes for trading-platform-grade SLOs
For services where 4-minute page latency is unacceptable — Zerodha's Kite at the 09:15 IST market open, Hotstar's IPL-final ad breaks where revenue per second runs into lakhs of rupees — the standard MWMBR can be extended with a third, faster pair: a 5-minute long window AND a 1-minute short window at a threshold of ~30, with for: 30s. The threshold comes from the same budget arithmetic: "1% of budget consumed in 5 minutes" over a 30-day window gives 0.01 × 30 × 24 × 60 / 5 = 86.4 — but most teams empirically settle on 30, above the main pair's 14.4 (fewer false pages during market-open volatility) yet well below the derivation-pure 86.4 (which would only catch near-total outages). The very-fast pair pages within ~90 seconds. Teams running it pair it with a runbook clause — "if the very-fast page fires but the slower channels have not corroborated within 30 minutes, deprioritise" — so the very-fast page is advisory until a longer-window channel agrees. Few teams need this; for the likes of Zerodha, IRCTC Tatkal, or NSE colocation, the 90-second detection latency is what the MWMBR design lets them buy without abandoning the AND-gate noise filter.
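The re-derivation is a one-line formula; a sketch (the function name and the 30-day default are mine, mirroring the workbook's window):
# burn_threshold.py — burn rate at which `budget_fraction` of the budget is consumed in `window_hours`
def burn_threshold(budget_fraction, window_hours, slo_window_days=30):
    return budget_fraction * slo_window_days * 24 / window_hours
print(burn_threshold(0.02, 1))        # 14.4 — fast pair: 2% of budget in 1h
print(burn_threshold(0.05, 6))        # 6.0  — slow pair: 5% in 6h
print(burn_threshold(0.01, 5 / 60))   # 86.4 — derivation-pure very-fast pair: 1% in 5m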
Multi-tenant SLOs and per-customer-tier MWMBR
Razorpay, Stripe, and most platform-as-a-service providers run separate SLOs per customer tier — premium-tier API at 99.99%, standard-tier at 99.9%, free-tier at 99%. Each tier has its own MWMBR rule family: same algorithm, different SLO target, different recording-rule denominators. The trick is the alert-routing layer — different tiers route to different on-call rotations or escalation paths. Razorpay's 2024 SLO redesign uses a customer_tier label propagated all the way from the SLI metric definition through the recording rules into the alert labels: slo:payments:burn_rate_1h{customer_tier="premium"} and customer_tier="standard" are independent series, the alert rule fires on either, and the alertmanager route uses the label to pick the rotation. The cost is a 3× multiplier on rule cardinality (one set per tier); the benefit is that a free-tier outage doesn't wake the premium-tier rotation, which has different SLA obligations and different runbooks. Teams running this pattern frequently use sloth's slo_groups feature to template the per-tier YAML rather than hand-write 3× the rules.
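A minimal sketch of the per-tier templating in plain Python rather than sloth (the metric name, tier budgets, and rule names here are illustrative, not Razorpay's):
# tier_rules.py — template the per-tier burn-rate recording rules instead of hand-writing 3x YAML
# pip install pyyaml
import yaml
TIERS = {"premium": 0.0001, "standard": 0.001, "free": 0.01}   # 1 - SLO per tier
def recording_rules(tier, budget):
    rules = []
    for window in ("5m", "30m", "1h", "6h"):
        bad = f'sum(rate(payment_request_total{{result="bad",customer_tier="{tier}"}}[{window}]))'
        total = f'sum(rate(payment_request_total{{customer_tier="{tier}"}}[{window}]))'
        rules.append({
            "record": f"slo:payments:burn_rate_{window}",
            "expr": f"{bad} / {total} / {budget}",
            "labels": {"customer_tier": tier},   # keep the tier label on the recorded series
        })
    return rules
groups = [{"name": f"payments_slo_{tier}", "interval": "30s", "rules": recording_rules(tier, budget)}
          for tier, budget in TIERS.items()]
print(yaml.dump({"groups": groups}, sort_keys=False))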
When the slow channel fires without the fast channel ever firing
The most operationally interesting MWMBR firings are slow-channel-only events: the ticket fires, the page never does, on-call gets a Jira issue rather than a phone call. The shape is a sustained low-rate burn — typically 6× to 12× — for hours. Causes seen at multiple Indian companies: a hot-key in a Redis cluster causing 2% timeouts on a long-tail of customer IDs (Zerodha, March 2024); a slowly-leaking goroutine pool on a payment-validator microservice causing creeping 4xx rates (Razorpay, August 2024); a stale DNS cache in a sidecar producing 0.8% connection failures across all upstream calls (Hotstar, June 2024). None of these were sharp enough for the page channel; all of them consumed >50% of the monthly budget by the time the team noticed. The slow channel is the only mechanism that catches them via alerting — the alternative is "customer escalation reveals the issue", which is the failure mode the SLO is designed to prevent. Teams that adopt MWMBR but disable the slow channel because "tickets are noisy" eliminate the very value the algorithm provides.
The absent() meta-alert — what fires when the alerts don't
A subtle failure mode: the recording rule errors silently, the burn-rate metric stops being emitted, the alert rule has nothing to evaluate, neither channel fires regardless of the underlying error rate. Hotstar's 2023 IPL-final postmortem cited this as the root cause of a 14-minute detection delay; their burn-rate alerts had been silently broken for two days because a label rename in the SLI definition errored the recording rule, but no meta-alert caught the absence. The fix is a meta-alert family: absent(slo:delivery_platform:burn_rate_1h) for 10m fires if the recording rule stops producing data; prometheus_rule_evaluation_failures_total{rule_group=~"delivery_platform_slo.*"} > 0 for 5m fires if any rule in the group errors. Both go to the same on-call rotation as the primary alerts, with a different runbook ("the alert pipeline is broken; the SLO state is unknown"). The cost is two extra alerts per SLO group; the benefit is the team learns about pipeline breakage in 10 minutes instead of two days.
Where this leads next
Chapter 67 covers the cross-functional negotiation around MWMBR rules — the SLO is owned by product, the burn-rate constants by SRE, the alert routing by platform, and the runbooks by application teams. Aligning these without bureaucracy is the political content that makes MWMBR stick. Chapter 68 zooms out to alerting hygiene more broadly — alert labelling, runbook design, signal-to-noise auditing, and the on-call sanity practices that make burn-rate alerting sustainable across a multi-year tenure on a product team. Chapter 69 covers symptom-based alerting from the original Google SRE book — the philosophical predecessor to MWMBR, focused on alerting on user-visible behaviour rather than internal metrics. Chapter 70 closes the part with practical reductions of on-call pain.
For prerequisites, /wiki/burn-rate-alerting (chapter 65) is the immediate predecessor — MWMBR is what burn-rate alerting becomes when you commit to the algorithm rather than just the concept. /wiki/error-budget-math (chapter 64) is the arithmetic foundation: every threshold here derives from the budget formula there. /wiki/sli-slo-sla-the-definitions-that-matter (chapter 62) is the upstream — MWMBR is only as good as the SLI it computes against; an SLI that miscounts probes or abandoned requests will produce burn rates that page for the wrong reasons.
The reader's exercise: take one of your existing SLOs and encode the full MWMBR rule family. Run the Python validator against your last 30 days of data. Count pages, count tickets, count slow burns the validator caught that your existing alerts missed. Most teams find 1–3 missed slow-burn events per quarter — incidents where customer escalation, not PagerDuty, surfaced the issue. The validator finds them in retrospect; deploying MWMBR prevents the next one.
A second exercise: extend the validator to inject a Tatkal-pattern (10:00 IST, 30-minute window, 8× sustained burn driven by upstream rate-limit). Watch what the slow channel does. If it fires every day at 10:30 IST, the SLI itself is wrong — Tatkal-induced rate-limits are a known operational mode, not an outage, and should be excluded from the SLI numerator. The validator forces this conversation; without it, the daily ticket gets ack'd and ignored until the team stops trusting the slow channel entirely.
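A sketch of that second exercise, reusing the inject() helper already defined in mwmbr_validate.py (minute 600 of each synthetic day stands in for 10:00 IST):
# Add before the rolling-window computation in mwmbr_validate.py:
# a daily Tatkal-shaped burn — 30 minutes at 8x, starting at 10:00 every day.
for day in range(30):
    inject(start_min=day * 1440 + 600, dur_min=30, rate=0.008)   # 8x burn against the 0.1% budget
With the daily injection in place, the slow channel's behaviour over the 30-day replay tells you whether the Tatkal mode needs an SLI exclusion or a window-scoped threshold.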
References
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 5 "Alerting on SLOs" — the canonical MWMBR derivation, including the six-scheme progression and the failure-mode-per-scheme analysis. The 14.4 and 6 constants and their derivations are here.
- Štěpán Davidovič, "Alerting on SLOs" — SREcon 2019 EMEA talk by one of the workbook authors; the slides walk through MWMBR's design choices in less depth but more concretely than the chapter.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly 2022), Chapter 12 — discussion of why threshold-based alerting fails and how event-driven alerting (MWMBR being one such pattern) addresses the failure modes. Less mathematical than the workbook, more operational.
- Sloth — Prometheus SLO generator — open-source tool that generates the MWMBR YAML from a single OpenSLO declaration, including all four recording rules and both alert rules. Reading the generated rules is the fastest way to internalise the encoding.
- Pyrra — SLO and burn-rate alert generator — alternative to Sloth, with a Kubernetes-native CRD model. Includes a built-in burn-rate visualiser useful during validator runs.
- Datadog SLO documentation: burn rate — the most-deployed commercial implementation of MWMBR. The dashboard widget design has become the de-facto visual reference for burn-rate displays.
- Prometheus alerting documentation — official semantics of for:, alert state transitions, and the asymmetric fire-vs-resolve behaviour MWMBR depends on.
- /wiki/burn-rate-alerting — chapter 65, the immediate predecessor.
- /wiki/error-budget-math — chapter 64, the arithmetic foundation for every constant.
# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 9093:9093 prom/alertmanager
python3 -m venv .venv && source .venv/bin/activate
pip install pandas numpy requests prometheus-client
# Save the YAML rules from the article as a rule file, reference it from prometheus.yml (rule_files), restart Prometheus.
python3 mwmbr_validate.py # synthetic 30-day replay validating coverage
python3 mwmbr_recovery.py # measure end-to-end fire and resolve latency
# Then mutate: change the SLO target to 99.99% (re-derive 1-SLO=0.0001 in the
# burn-rate recording rules, thresholds stay 14.4 and 6); change the window to
# 7 days (re-derive thresholds: ~3.4 and 1.4); add a third very-fast pair
# (5m ∧ 1m at 30, for=30s). Watch the validator output change with each
# parameter. The arithmetic is settled; the operational tradeoffs are not.