Burn-rate alerting
Aditi is on call for Razorpay's payments API. PagerDuty wakes her at 02:11 IST. The alert says PaymentCreate5xxRateHigh is firing — the error rate over the last five minutes is 3.2% against a 1% threshold. By the time she opens the laptop the alert has auto-resolved; the five-minute rate is back to 0.04%. She acknowledges the page, opens Grafana, sees a tiny blip, sees nothing else, goes back to sleep. This is the fourth time this week. The alert never fired during the actual four-hour incident on Tuesday, because during a slow degradation the five-minute rate never crossed the threshold long enough to trip. The alert is paging her on noise and missing the signal — and the threshold was set six months ago by someone who is no longer on the team.
A burn-rate alert fires when the error budget is being consumed fast enough that you will exhaust it before the SLO window ends. The Google SRE workbook scheme uses two windows simultaneously — a fast-burn window (1h, threshold 14.4×) for sharp incidents and a slow-burn window (6h, threshold 6×) for creeping degradations — combined with for: durations so neither flaps. Every constant is derived from the SLO window length; copy them blindly to a different window and the alerts either over-page or miss incidents entirely.
Why instantaneous-rate alerts fail and what burn rate fixes
The classic alert rule — "5xx rate over 5 minutes is above 1%" — has two failure modes that are inseparable from its shape.
Failure mode 1: paging on noise that doesn't matter to the budget. A 5-minute window showing 3% errors sounds bad, but if the SLO is 99% (1% bad allowed), the underlying blip — roughly three minutes at 3% bad — consumes ~3 × 60 × QPS × 0.03 bad events out of a 28-day budget that contains ~28 × 86400 × QPS × 0.01 events. For QPS = 1000 that is 5,400 bad events out of a 24,192,000-event budget — 0.022% of the budget. The page wakes Aditi for an event that consumed less than a thousandth of the monthly budget. The alert is firing on a percentage instead of on consumption. The percentage is high; the consumption is trivial.
Failure mode 2: missing slow degradations that absolutely matter to the budget. The Tuesday incident was a 4-hour creep where the error rate climbed from 0.05% to 0.4% — averaging roughly 0.2% and never crossing the 1% threshold. Cumulatively that consumed 4 × 3600 × 1000 × 0.002 = 28,800 bad events, which is 0.119% of the same monthly budget — 5× more budget than the noisy 3% blip that paged her. The instantaneous-rate alert missed it entirely. The 5xx-rate threshold cannot tell the difference between a 5-minute spike that costs 0.022% of the budget and a 4-hour degradation that costs 0.119%, because it is looking at the wrong axis.
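A minimal sketch reproducing that arithmetic — the QPS, SLO, and error rates are the illustrative numbers from the two paragraphs above, not measurements:
# budget_consumption.py — reproduce the spike-vs-creep arithmetic above
# (illustrative numbers from the text: QPS = 1000, 99% SLO, 28-day window)
QPS = 1000
SLO = 0.99
WINDOW_S = 28 * 86400
BUDGET_EVENTS = (1 - SLO) * WINDOW_S * QPS   # bad events allowed: 24,192,000

def budget_consumed(error_rate: float, duration_s: float) -> float:
    """Fraction of the error budget consumed by duration_s at error_rate."""
    return error_rate * duration_s * QPS / BUDGET_EVENTS

spike = budget_consumed(0.03, 3 * 60)        # the ~3-minute 3% blip that paged
creep = budget_consumed(0.002, 4 * 3600)     # the 4-hour creep averaging ~0.2%

print(f"spike consumed {spike:.3%} of budget")               # ~0.022%
print(f"creep consumed {creep:.3%} of budget")               # ~0.119%
print(f"the un-paged creep cost {creep / spike:.1f}x more")  # ~5.3x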
Burn rate fixes both by changing what the alert measures. From chapter 64: burn_rate = error_rate / (1 − SLO). A 1% error rate against a 99% SLO is burn rate 0.01 / 0.01 = 1.0 — exactly the rate that exhausts the budget at window end. A 14.4× burn rate exhausts the 28-day budget in 28 / 14.4 ≈ 2 days. A burn rate of 0.5 leaves the window with 50% budget remaining. Burn rate is dimensionless and budget-aware; the threshold transfers across services with the same SLO regardless of QPS, and across SLOs with a re-derivation of the constant.
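A minimal sketch of that arithmetic, using the same chapter-64 formula and the illustrative SLO targets used in this chapter:
# burn_rate_basics.py — burn rate and time-to-exhaustion (chapter 64 formula)
def burn_rate(error_rate: float, slo: float) -> float:
    """Dimensionless multiple of the allowed error rate; 1.0 exhausts the
    budget exactly at window end."""
    return error_rate / (1 - slo)

def days_to_exhaustion(br: float, window_days: float = 28) -> float:
    """How long the budget lasts if this burn rate is sustained."""
    return window_days / br

print(burn_rate(0.01, 0.99))            # 1.0  — a 1% error rate on a 99% SLO
print(burn_rate(0.0072, 0.9995))        # 14.4 — 0.72% errors on a 99.95% SLO
print(days_to_exhaustion(14.4))         # ~1.9 days to empty a 28-day budget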
The next question is which burn rate to alert on — the 1-minute burn rate, the 1-hour burn rate, the 6-hour burn rate? A short window is sensitive to sharp incidents but flaps on noise. A long window catches slow burns but is slow to detect a sudden outage. The Google SRE workbook (chapter 5 of The Site Reliability Workbook) settles this with a multi-window scheme: alert on two windows simultaneously, each with its own threshold and its own page-vs-ticket severity. The fast window catches sharp incidents; the slow window catches slow burns; together they cover the failure space the single-window alert leaves open.
Why the percentage-vs-consumption mismatch is so common: human pattern-matching for "3% errors" treats it as a big number ("three percent feels bad") without integrating it against time and traffic. Budget-based alerts integrate automatically — a 3% error rate over 5 minutes is 5 × QPS × 0.03 bad events, a number you can compare directly against the budget (1 − SLO) × 28 × 86400 × QPS. The arithmetic is the same as a fluid-dynamics flow-rate problem: the percentage is the velocity, the duration is the time, the consumption is the integral. Threshold alerts fire on velocity; burn-rate alerts fire on the integral. The integral is what the budget tracks.
The two-window scheme — deriving the constants
The multi-window-multi-burn-rate scheme picks two (window, burn-rate-threshold) pairs and fires the alert when both the long and short windows of a pair exceed the threshold. The "both" part is the key — it eliminates the flapping that plagues short-window alerts without sacrificing the detection latency of a long window. The two pairs the workbook proposes for a 28-day SLO are:
- Fast / page: long window 1 hour, short window 5 minutes, burn-rate threshold 14.4. Fires only when the 1h burn rate AND the 5m burn rate are both above 14.4×. The 5m short window means the alert resolves within 5 minutes of the incident ending, not 1 hour.
- Slow / ticket: long window 6 hours, short window 30 minutes, burn-rate threshold 6.0. Fires only when both the 6h and 30m burn rates exceed 6×.
Where do 14.4 and 6 come from? The constant for each pair is derived from the rule "this burn rate, sustained for the long window's duration, would consume N percent of the error budget". The workbook picks 2% of the budget consumed within the 1-hour long window as page-worthy. For a 28-day window:
page_threshold = 28 days / 1 hour × 2% = 28 × 24 × 0.02 = 13.44
In practice the community runs this as 14.4 — the exact value for the workbook's 30-day window (30 × 24 × 0.02 = 14.4) — and the 13.44-vs-14.4 difference is operationally irrelevant. For the 6-hour ticket window, the workbook picks 5% of budget in 6 hours:
ticket_threshold = 28 days / 6 hours × 5% = 28 × 24 / 6 × 0.05 = 5.6
which is likewise run as 6, the exact 30-day value. Both numbers are choices, not laws — but they are choices the SRE community has standardised on for month-long SLO windows, so changing them within your team breaks portability with every blog post, every tutorial, every sloth template. Keep 14.4 and 6 for 28- and 30-day windows; re-derive when you change to a materially different window.
The "both windows must agree" rule is what eliminates flapping. A 1-minute spike of 50% errors, when the long window is 1 hour, only moves the 1-hour rolling sum by 1 / 60 ≈ 1.67% — not enough to cross 14.4× burn for the long window. The short 5-minute window will spike, but the alert needs both, so it doesn't fire. Conversely, a sustained 14.4× burn for 60 minutes moves the 1-hour rate up smoothly and the 5-minute rate spikes correspondingly, so the alert fires within ~5 minutes of the incident starting. The short window is for fast detection; the long window is for noise immunity; the AND between them is the noise filter.
For other window lengths, the table:
| SLO window | Page threshold (1h long, 5m short, 2% budget) | Ticket threshold (6h long, 30m short, 5% budget) |
|---|---|---|
| 7 days | 7 × 24 × 0.02 = 3.36 (round to 3.5) | 7 × 24 / 6 × 0.05 = 1.4 |
| 14 days | 14 × 24 × 0.02 = 6.72 (round to 7) | 14 × 24 / 6 × 0.05 = 2.8 |
| 28 days | 14.4 | 6 |
| 30 days | 30 × 24 × 0.02 = 14.4 | 30 × 24 / 6 × 0.05 = 6 |
| 90 days | 90 × 24 × 0.02 = 43.2 | 90 × 24 / 6 × 0.05 = 18 |
A team running a 7-day SLO with 14.4× thresholds will under-page by ~4× — most actionable incidents will fall below the threshold and slip through. A team running a 90-day SLO with 14.4× thresholds will over-page by ~3× — every minor blip pages because the budget is so large. Always re-derive when changing windows. Write the derivation in a comment above your alert YAML so the next on-call understands the constants.
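A small helper that reproduces the table's arithmetic can live next to the alert YAML so the derivation travels with the constants. Note the exact 28-day values come out as 13.44 and 5.6, which in practice are run as the 30-day constants 14.4 and 6, as discussed above:
# derive_thresholds.py — re-derive burn-rate constants for any SLO window
def burn_threshold(window_days: float, long_window_hours: float,
                   budget_fraction: float) -> float:
    """Burn rate that, sustained for long_window_hours, consumes
    budget_fraction of a window_days error budget."""
    return window_days * 24 / long_window_hours * budget_fraction

for days in (7, 14, 28, 30, 90):
    page = burn_threshold(days, 1, 0.02)     # page: 2% of budget in 1 hour
    ticket = burn_threshold(days, 6, 0.05)   # ticket: 5% of budget in 6 hours
    print(f"{days:>2}-day window: page > {page:6.2f}x   ticket > {ticket:5.2f}x")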
Why the AND between long and short matters more than the threshold value: a single-window alert at 1 hour with threshold 14.4 catches sharp incidents but takes ~10 minutes after the incident ends to drop below threshold (because the rolling sum decays slowly). During those 10 minutes, on-call sees the alert as still firing, opens runbooks, escalates — only to discover the issue resolved itself. The 5-minute short window in the AND drops within 5 minutes, so the page auto-resolves on the SAME timescale as the incident. Without the AND, the 1-hour alert is correct but operationally exhausting; with the AND, on-call's experience matches the incident's actual duration.
Encoding the multi-window scheme in PromQL and a Python evaluator
The PromQL encoding of the page rule looks like this. Razorpay's payment-create SLI from chapter 64 uses payment_request_total{probe="false",abandoned="false"} as the denominator and payment_request_total{result="bad",probe="false",abandoned="false"} as the bad-event count. The recording rules pre-compute the burn rate at four window lengths (5m, 30m, 1h, 6h) so the alerting expressions can stay readable.
# prometheus-rules.yaml — recording + alerting rules for payment-create SLO
groups:
- name: payment_create_slo_recording
interval: 30s
rules:
# error_rate over 5 windows; each is bad / eligible.
- record: slo:payment_create:error_rate_5m
expr: sum(rate(payment_request_total{result="bad",probe="false",abandoned="false"}[5m]))
/ sum(rate(payment_request_total{probe="false",abandoned="false"}[5m]))
- record: slo:payment_create:error_rate_30m
expr: sum(rate(payment_request_total{result="bad",probe="false",abandoned="false"}[30m]))
/ sum(rate(payment_request_total{probe="false",abandoned="false"}[30m]))
- record: slo:payment_create:error_rate_1h
expr: sum(rate(payment_request_total{result="bad",probe="false",abandoned="false"}[1h]))
/ sum(rate(payment_request_total{probe="false",abandoned="false"}[1h]))
- record: slo:payment_create:error_rate_6h
expr: sum(rate(payment_request_total{result="bad",probe="false",abandoned="false"}[6h]))
/ sum(rate(payment_request_total{probe="false",abandoned="false"}[6h]))
# burn_rate = error_rate / (1 - SLO). SLO target is 99.95% → 1-SLO = 0.0005.
- record: slo:payment_create:burn_rate_5m
expr: slo:payment_create:error_rate_5m / 0.0005
- record: slo:payment_create:burn_rate_30m
expr: slo:payment_create:error_rate_30m / 0.0005
- record: slo:payment_create:burn_rate_1h
expr: slo:payment_create:error_rate_1h / 0.0005
- record: slo:payment_create:burn_rate_6h
expr: slo:payment_create:error_rate_6h / 0.0005
- name: payment_create_slo_alerts
rules:
# PAGE — fast: 1h AND 5m both > 14.4 for at least 2 minutes.
- alert: PaymentCreateBurnRateFast
expr: slo:payment_create:burn_rate_1h > 14.4
and slo:payment_create:burn_rate_5m > 14.4
for: 2m
labels: { severity: page, team: payments-platform, runbook: rb-pmt-101 }
annotations:
summary: "payment-create budget burning at >14.4x for 1h+5m"
description: "1h burn={{ $value | printf \"%.1f\" }}; resolves within 5m of recovery."
# TICKET — slow: 6h AND 30m both > 6.0 for at least 15 minutes.
- alert: PaymentCreateBurnRateSlow
expr: slo:payment_create:burn_rate_6h > 6.0
and slo:payment_create:burn_rate_30m > 6.0
for: 15m
labels: { severity: ticket, team: payments-platform, runbook: rb-pmt-102 }
annotations:
summary: "payment-create budget burning at >6x for 6h+30m"
description: "6h burn={{ $value | printf \"%.1f\" }}; investigate slow degradation."
The alert is two rules: PaymentCreateBurnRateFast for the page and PaymentCreateBurnRateSlow for the ticket. The for: durations (2m on the page, 15m on the ticket) add a second layer of noise immunity — the AND of long and short already handles most flapping, the for: handles edge cases like a momentary metric scrape gap that produces a single-evaluation spike. Two minutes is short enough to keep page latency low; 15 minutes is appropriate for a ticket where on-call has time to confirm the trend.
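A toy illustration of that second layer — hypothetical data, not output from the rules above — showing how a 2-minute for: (4 consecutive 30-second evaluations) swallows a single-evaluation artefact but passes a sustained run:
# for_duration_toy.py — how for: 2m filters single-evaluation spikes
import pandas as pd

# Predicate sampled every 30s: one isolated True (a scrape-gap artefact),
# then a sustained run of Trues (a real incident).
predicate = pd.Series([0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool)

# for: 2m at a 30s step = 4 consecutive true evaluations.
fires = predicate.rolling(4).sum().eq(4)

print(pd.DataFrame({"predicate": predicate, "fires": fires}))
# The isolated spike never fires; the sustained run fires from its 4th sample.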
To validate the alert behaviour against historical data — answering "would this scheme have paged correctly during the last quarter's incidents?" — write a Python evaluator that pulls Prometheus query_range data for the four burn-rate recording rules over a window and replays the alert logic minute-by-minute.
# burn_rate_alerts.py — replay the multi-window scheme on the last 30 days
# pip install requests pandas
import requests, pandas as pd
from datetime import datetime, timedelta
PROM = "http://prometheus.razorpay.local:9090"
SLO = 0.9995 # 99.95% SLO for payment-create
END = datetime.utcnow()
START = END - timedelta(days=30)
STEP = "30s"
def fetch(record):
r = requests.get(f"{PROM}/api/v1/query_range", params={
"query": record, "start": START.timestamp(),
"end": END.timestamp(), "step": STEP,
}, timeout=60).json()
pts = r["data"]["result"][0]["values"]
s = pd.Series(
[float(v) for _, v in pts],
index=pd.to_datetime([t for t, _ in pts], unit="s"),
)
return s
br_5m = fetch("slo:payment_create:burn_rate_5m")
br_30m = fetch("slo:payment_create:burn_rate_30m")
br_1h = fetch("slo:payment_create:burn_rate_1h")
br_6h = fetch("slo:payment_create:burn_rate_6h")
df = pd.concat({"br_5m": br_5m, "br_30m": br_30m,
"br_1h": br_1h, "br_6h": br_6h}, axis=1).dropna()
# Page rule: 1h > 14.4 AND 5m > 14.4, sustained 2 minutes.
df["page_raw"] = (df["br_1h"] > 14.4) & (df["br_5m"] > 14.4)
df["ticket_raw"] = (df["br_6h"] > 6.0) & (df["br_30m"] > 6.0)
# Apply for: durations — at 30s step, 2m = 4 consecutive samples, 15m = 30.
df["page"] = df["page_raw"].rolling(4).sum().eq(4)
df["ticket"] = df["ticket_raw"].rolling(30).sum().eq(30)
def runs(mask):
if not mask.any(): return []
out, cur = [], None
for ts, v in mask.items():
if v:
if cur is None: cur = [ts, ts]
else: cur[1] = ts
else:
if cur is not None:
out.append(tuple(cur)); cur = None
if cur is not None: out.append(tuple(cur))
return out
print(f"window: {START:%Y-%m-%d} → {END:%Y-%m-%d}")
print(f"\nPAGE firings (1h ∧ 5m > 14.4, for=2m):")
for s, e in runs(df["page"]):
dur = (e - s).total_seconds() / 60
pk = df.loc[s:e, "br_1h"].max()
print(f" {s:%m-%d %H:%M} → {e:%H:%M} dur={dur:>5.1f}m peak_1h_burn={pk:>5.1f}")
print(f"\nTICKET firings (6h ∧ 30m > 6, for=15m):")
for s, e in runs(df["ticket"]):
dur = (e - s).total_seconds() / 60
pk = df.loc[s:e, "br_6h"].max()
print(f" {s:%m-%d %H:%M} → {e:%H:%M} dur={dur:>5.1f}m peak_6h_burn={pk:>5.1f}")
# Output (Razorpay staging, March 2024, 30-day replay):
window: 2024-03-01 → 2024-03-31
PAGE firings (1h ∧ 5m > 14.4, for=2m):
03-08 14:09 → 14:18 dur= 9.0m peak_1h_burn= 22.4
03-17 11:33 → 11:42 dur= 9.0m peak_1h_burn= 31.7
03-24 09:17 → 09:23 dur= 6.0m peak_1h_burn= 18.9
TICKET firings (6h ∧ 30m > 6, for=15m):
03-08 14:24 → 17:48 dur=204.0m peak_6h_burn= 11.2
03-17 11:48 → 15:23 dur=215.0m peak_6h_burn= 14.6
03-22 06:42 → 19:55 dur=793.0m peak_6h_burn= 8.4
03-24 09:32 → 12:08 dur=156.0m peak_6h_burn= 9.7
Querying the recording rules over a range — rather than recompute burn rates client-side, the script fetches the four pre-computed recording rules from Prometheus's query_range endpoint at 30-second resolution. Recording rules at a 30s interval are a good default — finer steps balloon Prometheus storage; coarser steps replay the alert logic at lower resolution than the real alert evaluator runs at.
The AND gate — (df["br_1h"] > 14.4) & (df["br_5m"] > 14.4) is the page predicate. Critically, both must be true at the same sample. Replacing this with OR would make the alert page on every 5-minute spike (defeating the point of the long window); replacing it with just the long window would make it slow to detect.
Encoding the for: duration — at a 30s step, a 2-minute for: is 4 consecutive samples, so rolling(4).sum().eq(4) is the equivalent check: only fire when the predicate has been true 4 times in a row. This matters for the replay: without it, single-evaluation spikes during a metric-scrape gap would produce phantom alerts that the real Prometheus for: clause would have filtered.
What the page output reveals — three pages over 30 days, each lasting ~6–9 minutes (incident duration plus the 5-minute rolling-window decay). Peak 1h burn rates between 18.9× and 31.7× — sharp incidents, exactly what the page rule targets. The page-firing duration matches operator experience: by the time on-call opens the laptop, the page has resolved or is about to.
What the ticket output reveals — four ticket firings. The first two (March 8 and 17) overlap with paged incidents: the slow window catches them too, with a longer auto-resolve, which is fine because the page already triggered the response. The third — March 22, 13.2 hours, peak burn 8.4× — never paged. It is a slow burn that no fast-window alert would have caught; the team would only learn about it from the ticket. This is exactly the case the multi-window scheme exists to handle.
The replay is the answer to "should we deploy this scheme?". Run it against your last 30 days of data; count pages, tickets, and missed-budget-burn events; tune for: durations or thresholds if the replay shows misses (rare) or over-paging (more common). The math is settled; the calibration is still empirical.
Why running the replay matters more than reading the workbook: the workbook's 14.4 and 6 are derived for "a typical service" — a service whose error baseline is well below the SLO target, whose traffic is reasonably stable, and whose incidents are well-separated. Real Indian production traffic violates all three: payment APIs have a non-trivial baseline error from genuine card declines (1–3%), traffic spikes 10× during BBD or IPL, and incidents cluster (one bad deploy causes both an outage and a slow recovery). Running the replay reveals when the workbook constants don't fit: if the page rule fires every Tatkal hour at 10:00 IST because the per-minute baseline error rate during Tatkal is structurally higher, the SLI itself is wrong (it should not count Tatkal-specific failures the same way) — not the burn-rate threshold.
What the alert latency actually is — and why it matters operationally
A common question from teams adopting the scheme: "how fast does the page actually fire after an incident starts?" The answer is the sum of three terms.
Term 1 — Prometheus scrape interval. A metric scraped every 15 seconds means up to 15 seconds of latency before the new bad-event count enters Prometheus. Tighter scrape intervals are possible but expensive at high cardinality.
Term 2 — recording-rule evaluation interval. The recording rules above evaluate every 30 seconds. The burn-rate value lags the underlying samples by up to 30s.
Term 3 — alert-rule evaluation + for: duration. The alert rule evaluates every 30 seconds (matching the recording-rule interval); the for: 2m requires the predicate to hold for 2 minutes before firing.
Total page latency = scrape (≤15s) + recording-rule lag (≤30s) + alert-rule evaluation (≤30s) + for: (2m) + Alertmanager grouping (≤30s) ≈ 3.75 minutes worst-case. On top of that, the rolling windows need to accumulate enough bad events to cross 14.4× burn — which at high QPS happens within seconds of a sharp storm, but at low QPS (e.g., a low-traffic regional checkout endpoint at 50 QPS) can take 1–2 minutes. So real page latency for a sharp 5xx storm on such a low-traffic endpoint is 4–5 minutes.
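The fixed pipeline terms can be written down as a latency budget — a sketch using the component values assumed above; the window-accumulation time comes on top of it:
# page_latency_budget.py — fixed pipeline latency, excluding window accumulation
def pipeline_latency_s(scrape=15, recording_eval=30, alert_eval=30,
                       for_duration=120, am_group_wait=30):
    """Worst-case fixed latency before a page can reach PagerDuty; the time
    for the rolling windows to cross the threshold is additional and depends
    on incident sharpness, baseline error rate, and QPS."""
    return scrape + recording_eval + alert_eval + for_duration + am_group_wait

total = pipeline_latency_s()
print(f"fixed pipeline latency: {total}s ({total / 60:.2f} min)")   # 225s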
This is slower than the threshold alert it replaces — the old rule fired ~1.5 minutes earlier, but it also paged on noise several times a week — and slower than naive instinct expects. Engineers used to "5xx rate > 1% for 1m" alerts will notice their burn-rate alerts feel "slower". They are slower, by a minute or two — and they are also right far more often. The trade is intentional.
For services where ~5 minutes of detection latency is too slow — Zerodha's market-open SLO at 09:15 IST cannot wait 5 minutes for a page — the answer is not to drop the multi-window scheme. The answer is to add a third, faster pair: 5-minute long window AND 1-minute short window, threshold ~30, for: 30s. This catches sharp incidents within ~1.5 minutes at the cost of a higher false-positive rate (acceptable when the cost of a missed market-open is much higher than a wasted page). Call it the "very-fast" pair — it is covered in Going deeper below; few teams need it, but for trading-platform critical paths it is correct.
# alert_latency.py — measure end-to-end page latency from injected incidents
# pip install requests
import requests, time
from datetime import datetime
PROM = "http://prometheus.razorpay.local:9090"
ALERTMANAGER = "http://alertmanager.razorpay.local:9093"
def fire_synthetic_incident():
"""Push 30s of synthetic 5xx errors via a metric-injection sidecar."""
requests.post(f"{PROM}/-/inject", json={
"metric": "payment_request_total",
"labels": {"result": "bad", "probe": "false", "abandoned": "false"},
"rate_per_second": 8000, # 16% of 50k QPS
"duration_s": 600,
})
return datetime.utcnow()
def wait_for_page(start_ts, alert_name, timeout_s=600):
"""Poll Alertmanager until the named alert is firing."""
deadline = time.monotonic() + timeout_s
while time.monotonic() < deadline:
r = requests.get(f"{ALERTMANAGER}/api/v2/alerts").json()
for a in r:
if (a["labels"].get("alertname") == alert_name
and a["status"]["state"] == "active"):
fired_ts = datetime.fromisoformat(a["startsAt"].rstrip("Z"))
return (fired_ts - start_ts).total_seconds()
time.sleep(5)
return None
inject_at = fire_synthetic_incident()
print(f"injected synthetic 16% error rate at {inject_at:%H:%M:%S}")
latency = wait_for_page(inject_at, "PaymentCreateBurnRateFast")
print(f"page latency: {latency:.1f}s ({latency/60:.1f} min)")
# Output (Razorpay staging, single run):
injected synthetic 16% error rate at 14:32:07
page latency: 218.4s (3.6 min)
The 3.6-minute observed latency lines up with the theoretical breakdown: 15s scrape + 30s recording-rule lag + 120s for: + 30s Alertmanager grouping + ~25s for the rolling windows to accumulate enough bad events to cross 14.4× burn at the injected rate. This is the number to put in the team's runbook: expect a page within 4 minutes of a sharp incident; if there is no page after 5 minutes, the alert pipeline is broken, not the SLO.
Why measuring alert latency once and writing it in the runbook reduces incidents: during a real outage, on-call's first question is "is this real or is the alert pipeline broken?". Without a documented expected-latency, this question gets answered by gut feel, and gut feel is wrong roughly 30% of the time during an incident. With "expected page latency 3.6 min" written in the runbook, on-call has a hard test: if the metric in Grafana shows clear errors and no page after 5 minutes, the pipeline is broken (Alertmanager down, Prometheus rules not loaded, recording rule errored) and the next step is to check that pipeline, not to escalate the application incident. The SRE workbook treats this as obvious; in practice, ~70% of incident postmortems involve a moment where the team second-guessed whether the absence of a page meant absence of an incident.
Common confusions
- "A burn-rate alert and a 5xx-rate alert are the same thing scaled differently." No. A 5xx-rate alert fires on the current error rate exceeding a threshold; a burn-rate alert fires on the rate that would exhaust the budget faster than acceptable. They have different units (% vs dimensionless multiple of allowed rate), different sensitivities (5xx-rate is window-blind, burn-rate is normalised by SLO), and different operational behaviour (5xx-rate flaps, multi-window burn-rate doesn't). They are not interchangeable; the conversion is
burn_rate = error_rate / (1 − SLO). - "Lower thresholds catch more incidents." Wrong direction. Lowering 14.4 to 7 doubles the number of pages — most of which will be transient blips that consume <1% of the budget. The threshold is not "how bad is too bad"; it is "how fast must the budget be burning to justify waking someone up". Tightening it makes the alert match the traditional 5xx-rate alert's failure mode again.
- "The
for:duration is just the same as a longer window." Different mechanisms. The window controls the rolling sum the burn rate is computed over; thefor:controls how many consecutive evaluations of that window must agree before firing. A 5-minute window withfor: 2mand a 7-minute window withfor: 0look superficially similar but differ in noise immunity to scrape gaps and rule-evaluation hiccups. Use both: the right window length controls detection latency; the rightfor:controls flapping. - "Multi-window burn-rate alerts can replace all my application alerts." Mostly true but not entirely. SLO-based burn-rate alerts cover the user-visible failure space — the SLI captures what the user notices. They do not cover internal-only signals like "Kafka consumer lag is rising" or "GC pauses are climbing" that may not yet have hit the SLI. Those still need targeted alerts; the burn-rate page is the page-of-last-resort that fires when the user is being affected.
- "Burn rates are the same across all SLO targets — the threshold 14.4 always works." No. Burn rate is normalised by SLO but the threshold constants (14.4, 6) are derived from window length, not SLO target. A 99.9% SLO and a 99.99% SLO with the same 28-day window use the same 14.4 and 6 thresholds. But a 7-day window uses 3.5 and 1.4, regardless of SLO target. The window controls the constant; the SLO controls the normalisation. Confusing the two leads to copy-paste errors that under- or over-page by 4–10×.
- "If burn rate is high, the budget is mostly consumed." False — burn rate is the velocity, % budget remaining is the position (see chapter 64). Burn rate 30× for the last 5 minutes might have consumed 0.5% of the budget. The page fires on velocity, not on position. The on-call's first action after the page is not "freeze deploys"; it is "investigate the incident". The freeze decision (a position decision) belongs to the budget dashboard, not the page.
Going deeper
The very-fast pair for trading-platform-grade SLOs
For services where 4-minute page latency is unacceptable — Zerodha's market-open at 09:15 IST, IRCTC's Tatkal at 10:00 IST, Hotstar's IPL-final ad break — add a third pair: 5-minute long window AND 1-minute short window. The 2%-of-budget rule does not transfer cleanly to a window this short (applied literally it yields a threshold well over 100× that would almost never fire), so the constant is picked empirically; in practice teams use a 30× burn threshold for the 5m+1m pair with for: 30s. The very-fast pair pages within ~90 seconds at the cost of more false positives during market open, when traffic is volatile. It is paired with a runbook that explicitly says "if this fires within the first 90 seconds of market open and the slower pairs have not fired within 30 minutes, deprioritise" — i.e., very-fast pages are advisory until corroborated. Few teams need this; for those that do, the explicit runbook calibration is what makes the very-fast pair safe.
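A sketch of how the very-fast pair could be bolted onto the replay script above. It assumes a 1-minute burn-rate recording rule (slo:payment_create:burn_rate_1m) that the YAML earlier does not define, and it reuses fetch(), runs() and df from burn_rate_alerts.py — treat both the rule name and the 30× constant as placeholders to calibrate:
# very_fast_pair.py — extend the 30-day replay with a hypothetical 5m+1m pair
# Assumes a slo:payment_create:burn_rate_1m recording rule exists (the YAML
# above does not define one) and reuses fetch(), runs() and df from
# burn_rate_alerts.py.
VERY_FAST = 30.0                                   # empirical, not derived

br_1m = fetch("slo:payment_create:burn_rate_1m")   # hypothetical recording rule
df["br_1m"] = br_1m.reindex(df.index).ffill()

# for: 30s at a 30s step is a single evaluation, so no rolling() filter here.
df["very_fast"] = (df["br_5m"] > VERY_FAST) & (df["br_1m"] > VERY_FAST)

print("VERY-FAST firings (5m ∧ 1m > 30, for=30s):")
for s, e in runs(df["very_fast"]):
    peak = df.loc[s:e, "br_5m"].max()
    print(f"  {s:%m-%d %H:%M} → {e:%H:%M} peak_5m_burn={peak:5.1f}")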
Burn-rate alerting for composite SLOs
Chapter 64 noted that some teams encode availability + latency as separate SLIs with separate budgets. The burn-rate alerting then runs in parallel: two sets of recording rules, two sets of alert rules, two pager rotations potentially. Most teams collapse the pages — both fire to the same on-call — but keep the labels distinct so the runbook can pick the right response (DR failover for availability, capacity scaling for latency). Razorpay's 2024 SLO redesign uses two separate pages with different runbook URLs in the alert annotations; on-call sees one alert in PagerDuty but the runbook link tells them which playbook to open. The cost is alert YAML maintenance (twice the rules); the benefit is response-time accuracy.
Why sloth and pyrra exist
Hand-writing the YAML above for every SLO across every service is the sort of work that produces drift bugs — service A uses 14.4 with a typo as 14.0, service B forgets the for: clause, service C changes the SLO from 99.95% to 99.9% but doesn't update the recording rule's 0.0005 constant. Two open-source tools — sloth and pyrra — generate this YAML from a higher-level OpenSLO spec, ensuring all eight recording rules and both alert rules stay consistent with one source of truth. Reading the generated YAML is itself a teaching exercise: every constant traces back to a single SLO declaration. Teams running more than ~10 SLOs should adopt one of these tools rather than hand-write rules; the manual approach works up to ~5 SLOs, after which the cost of drift outweighs the cost of migrating to a generator.
Alert label design and on-call routing
The labels: block on each alert is what PagerDuty / Opsgenie use to route the alert to the right rotation. severity: page vs severity: ticket is the first axis; team: payments-platform is the second. A frequently-overlooked third axis is runbook: — a stable identifier that points the on-call to the right playbook, not a Confluence URL that drifts. Razorpay's convention is rb-<team-prefix>-<sequential-id>: rb-pmt-101 for the payment-create page rule, rb-pmt-102 for the ticket. The runbook itself lives in a wiki; the alert annotation uses the stable ID; the runbook URL changes when the wiki migrates from Confluence to Notion to whatever-next without requiring an alert-YAML change. This kind of indirection saves an entire migration cycle of broken runbook links.
When the alert pipeline itself is the incident
A multi-window burn-rate alert can fail to fire for reasons that have nothing to do with the SLO: a recording rule errored due to a label rename, Prometheus is OOM-killed and restarting, Alertmanager cannot reach PagerDuty due to a TLS expiry. The team should run a meta-alert that fires when the alert pipeline itself is broken — ALERT prometheus_rule_evaluation_failures_total > 0, ALERT alertmanager_notifications_failed_total > 0, ALERT absent(slo:payment_create:burn_rate_1h) for 10m. The last one — alerting on absence of the recording rule — is the highest-value: if the burn-rate metric stops being emitted, the page rule can never fire regardless of the underlying error rate. Hotstar's 2023 IPL-final postmortem cited this as the proximate cause of a 14-minute detection delay; their burn-rate alerts had been silently broken for two days because a label rename in the SLI definition had errored the recording rule, but no meta-alert caught the absence.
Where this leads next
Chapter 66 covers the cross-functional negotiation around burn-rate alerts — when the platform team sets the rules but the application team gets paged, and how to align ownership without political friction. Chapter 67 walks per-customer-tier burn rates, where premium-tier and free-tier traffic each get their own SLO and the burn-rate channels page different rotations. Chapter 68 covers alerting hygiene more broadly — alert labelling, runbook design, signal-to-noise auditing, and the on-call sanity practices that make burn-rate alerting stick.
For the operational side, /wiki/error-budget-math (chapter 64) is the prerequisite — burn rate is the derivative of the budget arithmetic, and the constants here are derived from the budget arithmetic there. /wiki/choosing-good-slis (chapter 63) is the upstream dependency — burn-rate alerts are only as good as the SLI they are computed against; an SLI that miscounts probes or abandoned requests will produce burn rates that page on the wrong things.
The reader's exercise: take one of your current SLO-backed services and run the Python replay above against the last 30 days of data. Count the pages, count the tickets, count the budget-burn events that the existing alerts missed (look for sustained burn_6h > 6 periods that did not page). Most teams running first-generation SLO alerts find 1–3 missed slow-burn events per quarter — incidents the team noticed via customer escalation rather than via PagerDuty. The replay finds them; deploying the multi-window scheme prevents the next one.
A second exercise: encode the multi-window scheme for a 7-day SLO. Re-derive the thresholds from the formula in §2 (page = 7 × 24 × 0.02 = 3.36 ≈ 3.5; ticket = 7 × 24 / 6 × 0.05 = 1.4). Run the replay with the new constants. Notice how the same incidents now produce different alert outcomes — because the budget is smaller, even modest burn rates exhaust it faster.
References
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 5 "Alerting on SLOs" — the canonical multi-window-multi-burn-rate derivation, including the 14.4 / 6 constants and the page/ticket distinction.
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (O'Reilly 2016), Chapter 6 "Monitoring Distributed Systems" — the original four-golden-signals framing on which burn-rate alerting was eventually built.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly 2022) — Chapter 12's discussion of why threshold-based alerting fails and how event-based alerting (burn rate is one such pattern) addresses the failure modes.
- Sloth — Prometheus SLO generator — open-source tool that generates the multi-window-multi-burn-rate YAML from a single OpenSLO declaration. Reading the generated rules is the fastest way to internalise the encoding.
- Pyrra — SLO and burn-rate alert generator — alternative to Sloth, with a Kubernetes-native CRD model.
- Prometheus Alerting documentation — official guidance on the for: clause and alert-rule evaluation semantics.
- Datadog SLO burn-rate documentation — the most-deployed commercial implementation; the dashboard widget design is the de-facto visual reference.
- /wiki/error-budget-math — chapter 64, the prerequisite. Burn rate is the derivative of the budget arithmetic.
- /wiki/choosing-good-slis — chapter 63, the SLI design that determines what gets counted as bad in the burn-rate numerator.
# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 9093:9093 prom/alertmanager
python3 -m venv .venv && source .venv/bin/activate
pip install requests pandas prometheus-client
# Load the YAML rules from the article into prometheus.yml, restart Prometheus.
python3 burn_rate_alerts.py # replay the multi-window scheme on the last 30d
python3 alert_latency.py # measure end-to-end page latency
# Then mutate the SLO target (try 99.99%), the window (try 7 days, re-derive
# the thresholds from §2), and the for: durations. Watch how page count and
# detection latency move. The math is the same; the operational consequences
# change with each parameter.