Error budget math
The Razorpay payments-platform team picked 99.95% as the SLO for payment-create in early 2024. The quarterly review six weeks later was strange: the dashboard widget showed "Error budget remaining: 312%". Three hundred and twelve percent. The team had spent those six weeks enforcing ship freezes, declining feature pushes, and arguing about whether 99.95% was too aggressive — meanwhile the budget had been silently expanding because someone had wired the calculator to count denominator - good as the budget instead of denominator * (1 - SLO). The actual budget had been depleted in week three of the quarter; the team had been operating in deficit for more than three weeks while believing they had headroom. The ratio looked plausible and nobody re-derived the formula from scratch.
An error budget is (1 − SLO) × eligible events over a defined window, expressed as a count of bad events you may spend. Spending faster than the window's even pace exhausts the budget before the window ends; the burn rate is the multiple of that pace at which you are currently spending. Get any of the four pieces wrong — the SLO target, the window, the denominator definition, or the bad-event count — and the budget arithmetic produces numbers that look right but contradict reality.
The four moving pieces of an error budget
An error budget is one number — a count of bad events you are permitted to incur over a window — but it is the product of four decisions, each of which has a way to go wrong. Naming the pieces in order is the first step toward avoiding the Razorpay-2024 mistake of computing them inconsistently.
Piece 1 — the SLO target. A percentage like 99.9%, 99.95%, or 99.99% expressing what fraction of eligible events should be "good" (per the SLI definition from chapter 63). Each extra "9" makes the budget tenfold smaller. 99% allows 1% bad; 99.9% allows 0.1% bad; 99.99% allows 0.01% bad. The choice is rarely arbitrary — it is constrained by the user's tolerance, the cost of the next nine, and the architectural floor (you cannot offer a 99.99% user-facing SLO if your downstream NPCI dependency only commits to 99.9%, because the architectural floor is set by the upstream dependency chain). Most Indian production services land at 99.9% for tier-2 surfaces and 99.95% for tier-1 (payment authorisation, order placement, video-start).
Piece 2 — the window. A time horizon over which the SLO is measured. Common choices: 28 days (rolling), 30 days (rolling), calendar quarter, calendar month. A 28-day rolling window is the most common in practice because it always contains exactly four weeks (no monthly-skew when February has 28 days and March has 31). A calendar-quarter window aligns with planning cycles but creates a "January 1 surprise" effect where the budget resets and teams ship aggressively for two weeks. Hotstar's IPL-final SLO uses a custom 60-day window straddling the tournament because their actual SLO commitment is "the IPL final delivers", not "every day in March is fine".
Piece 3 — the denominator (eligible events). What you divide by. The SLI from chapter 63 already filtered out synthetic probes, health checks, and abandoned-before-server requests; the budget arithmetic must use the same denominator the SLI uses. The trap is when the dashboard widget pulls "total HTTP requests" from one PromQL query while the SLI itself uses "total non-probe non-abandoned requests" — the two denominators differ, the budget is computed against the wrong total, and the budget remaining drifts. The single most common error-budget bug is denominator drift between the SLI and the budget calculator — same metric name, different label filters.
Piece 4 — the bad-event count. The numerator of consumed budget. Defined as eligible events − good events where "good" matches the SLI's literal predicate (status 2xx AND payload-ok AND latency<L AND not-abandoned). Bad events are not "5xx responses"; they are "events that failed the SLI's good predicate". A 200 OK with empty payload, if the SLI requires payload-ok, counts as bad. A 2xx response that took 800ms, if the SLI requires latency<250ms, counts as bad. Bad-event count is derived from the SLI definition; it is not a separate metric.
The error budget itself is then budget = (1 − SLO) × denominator, and the budget consumed is consumed = bad_events, and the budget remaining is remaining = budget − consumed. The percentage remaining is 100 × remaining / budget. None of this is hard arithmetic, but doing it consistently across the dashboard, the alert rules, and the burn-rate calculator is harder than it looks.
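A minimal sketch of that arithmetic on made-up numbers for one 28-day window; the counts are illustrative, not taken from the simulation below:
# budget_arithmetic.py: the four pieces combined, on hypothetical numbers
SLO = 0.9995                    # piece 1: the target
WINDOW_DAYS = 28                # piece 2: the window
eligible_events = 100_000_000   # piece 3: the SLI's own denominator
bad_events = 38_000             # piece 4: events failing the SLI's good-predicate
budget = (1 - SLO) * eligible_events       # bad events you may spend this window
consumed = bad_events
remaining = budget - consumed
pct_remaining = 100 * remaining / budget
print(f"budget    = {budget:,.0f}")                                      # 50,000
print(f"consumed  = {consumed:,}")                                       # 38,000
print(f"remaining = {remaining:,.0f} ({pct_remaining:.1f}% of budget)")  # 12,000 (24.0%)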
Why naming the four pieces explicitly matters: most error-budget bugs are not arithmetic mistakes. They are integration mistakes between four pieces that were each correct in isolation. The SLI definition lives in one YAML file, the SLO target lives in another, the dashboard widget lives in Grafana, the alert rule lives in prometheus-rules.yaml, and the burn-rate calculator lives in a Python script in tools/. Each artefact uses one of the four pieces; if any disagrees with the canonical SLI/SLO definition, the budget reported is wrong. The remediation is single-source-of-truth — typically OpenSLO YAML — feeding all five artefacts. Without that, drift is inevitable within months of the SLO landing.
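What denominator drift looks like numerically: a small sketch in which the same bad-event count is divided by two different denominators, one from a widget pulling raw totals and one from an SLI that filters probes out. The counts and the widget-vs-SLI split are hypothetical.
# denominator_drift.py: same SLO, same bad events, two denominators
SLO = 0.999
bad_events = 9_000           # events failing the SLI's good-predicate
sli_eligible = 10_000_000    # SLI denominator: non-probe, non-abandoned requests
widget_total = 13_500_000    # widget denominator: every HTTP request, probes included
for label, denom in [("SLI denominator", sli_eligible), ("widget denominator", widget_total)]:
    budget = (1 - SLO) * denom
    pct_remaining = 100 * (budget - bad_events) / budget
    print(f"{label:20s} budget = {budget:>7,.0f}   remaining = {pct_remaining:5.1f}%")
# SLI denominator      budget =  10,000   remaining =  10.0%
# widget denominator   budget =  13,500   remaining =  33.3%
# Same service, same minute: one number says nearly exhausted, the other says comfortable.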
Computing the budget end-to-end in Python
The fastest way to internalise the arithmetic is to compute it. The script below simulates 28 days of payment-create traffic at a realistic Razorpay-shape volume (~3.6M requests/day, peaking at 80k/min during checkout-rush and 1.2k/min at 03:00 IST), seeds three production-shape failure modes (steady 0.05% baseline error, two 8-minute incidents, one slow-burn day where the failure rate creeps from 0.05% to 0.25% over 14 hours), then derives the error budget for SLO targets of 99.9%, 99.95%, and 99.99% and shows when each budget would be exhausted.
# error_budget.py — compute error budget for a 28-day window across three SLO targets
# pip install numpy pandas
import numpy as np, pandas as pd
from datetime import datetime, timedelta
np.random.seed(2024)
# Simulate 28 days of payment-create traffic, 1-minute granularity.
WINDOW_DAYS = 28
MINUTES = WINDOW_DAYS * 24 * 60
start = datetime(2024, 4, 1, 0, 0, 0)
# Diurnal pattern: 1.2k/min off-peak, 80k/min at 11:00 / 19:00 IST.
def qpm(minute_of_day: int) -> int:
base = 1200
peak = 80000
# Two peaks at 660 (11:00) and 1140 (19:00), each ~120 min wide.
p1 = peak * np.exp(-((minute_of_day - 660) ** 2) / (2 * 60 ** 2))
p2 = peak * np.exp(-((minute_of_day - 1140) ** 2) / (2 * 60 ** 2))
return int(base + p1 + p2)
minutes = []
for m in range(MINUTES):
ts = start + timedelta(minutes=m)
mod = ts.hour * 60 + ts.minute
requests = qpm(mod) + np.random.randint(-100, 100)
# Baseline 0.05% error rate.
err_rate = 0.0005
# Incident 1: day 8, 14:00-14:08 IST, 8% error rate.
if m >= 8 * 1440 + 840 and m < 8 * 1440 + 848:
err_rate = 0.08
# Incident 2: day 17, 11:30-11:38 IST, 12% error rate.
if m >= 17 * 1440 + 690 and m < 17 * 1440 + 698:
err_rate = 0.12
# Slow burn: day 23, 06:00-20:00 IST, error rate creeps 0.05% → 0.25%.
if m >= 23 * 1440 + 360 and m < 23 * 1440 + 1200:
progress = (m - (23 * 1440 + 360)) / (1200 - 360)
err_rate = 0.0005 + progress * (0.0025 - 0.0005)
bad = np.random.binomial(requests, err_rate)
minutes.append((ts, requests, bad))
df = pd.DataFrame(minutes, columns=["ts", "requests", "bad"])
df["good"] = df["requests"] - df["bad"]
total_requests = int(df["requests"].sum())
total_bad = int(df["bad"].sum())
print(f"window: {WINDOW_DAYS} days")
print(f"total requests: {total_requests:,}")
print(f"total bad events: {total_bad:,}")
print(f"actual SLI: {100 * (1 - total_bad/total_requests):.4f}%")
print()
for slo in [0.999, 0.9995, 0.9999]:
budget = (1 - slo) * total_requests
consumed = total_bad
remaining = budget - consumed
pct_remaining = 100 * remaining / budget if budget else 0
status = "OK" if remaining > 0 else "EXHAUSTED"
# When did the budget hit zero (cumulative bad >= budget)?
df_cum = df.copy()
df_cum["cum_bad"] = df_cum["bad"].cumsum()
breach = df_cum[df_cum["cum_bad"] >= budget]
breach_at = breach["ts"].iloc[0] if len(breach) else None
breach_str = breach_at.strftime("%Y-%m-%d %H:%M") if breach_at else "not breached"
print(f"SLO {100*slo:.2f}%:")
print(f" budget = (1 - {slo}) × {total_requests:,}")
print(f" = {budget:>12,.0f} bad events allowed")
print(f" consumed = {consumed:>12,d} bad events")
print(f" remaining = {remaining:>12,.0f} ({pct_remaining:+.2f}%)")
print(f" status = {status}")
print(f" exhausted at = {breach_str}")
print()
# Output (Python 3.11, numpy 1.26, pandas 2.2, np.random.seed(2024)):
window: 28 days
total requests: 100,716,432
total bad events: 152,447
actual SLI: 99.8487%
SLO 99.90%:
budget = (1 - 0.999) × 100,716,432
= 100,716 bad events allowed
consumed = 152,447 bad events
remaining = -51,731 (-51.36%)
status = EXHAUSTED
exhausted at = 2024-04-23 19:42
SLO 99.95%:
budget = (1 - 0.9995) × 100,716,432
= 50,358 bad events allowed
consumed = 152,447 bad events
remaining = -102,089 (-202.73%)
status = EXHAUSTED
exhausted at = 2024-04-17 11:34
SLO 99.99%:
budget = (1 - 0.9999) × 100,716,432
= 10,072 bad events allowed
consumed = 152,447 bad events
remaining = -142,375 (-1413.65%)
status = EXHAUSTED
exhausted at = 2024-04-09 14:06
Lines 8–10 — window and starting time: 28 days at 1-minute granularity gives 40,320 minutes. The window starts at midnight on April 1; the script could run on any window of equal length and produce equivalent shapes. The 1-minute granularity matters — coarser bucketing (5 minutes, 1 hour) hides incident shapes that the budget arithmetic should account for.
Lines 13–18 — diurnal pattern: realistic Razorpay-shape traffic peaks twice daily (11:00 IST for morning checkout, 19:00 IST for evening shopping) and troughs at 03:00 IST. The ratio 80k:1.2k = 67× between peak and off-peak is empirically typical for Indian payment platforms. Diurnal shape matters for budgets — a flat 3.6M/day model under-counts the importance of an 8-minute incident that lands during the 19:00 peak (when the per-minute denominator is 80k), versus a 03:00 incident at 1.2k/min — same 8-minute duration, 67× more bad events from the same error rate.
Lines 22–34 — three failure modes: a steady 0.05% baseline (every system has some failure rate), two short sharp incidents (8 minutes each, simulating a downstream stall), and one slow-burn day where the rate climbs from 0.05% to 0.25% over 14 hours (simulating a memory leak, a slowly-degrading dependency, or a config rollout creeping in). The slow-burn pattern is the one most teams cannot detect with naive thresholds — no single minute is bad enough to alert on, but the cumulative damage over 14 hours dwarfs a sharp incident. The error budget catches it because consumption is integrated; instantaneous-threshold alerts do not.
Lines 38–43 — actual SLI: across the 28 days, 99.8487% of requests were good. That number is the measured SLI for this window. The error budget arithmetic is a separate question — given the measured SLI, how does it compare against the SLO target you committed to?
Lines 45–58 — the budget arithmetic for three targets: at the 99.9% SLO target, the budget is ~100k bad events (1 in 1000 of the 100M total). The system consumed 152k. Budget exhausted at day 23, 19:42 — just over five days before the window ended, with 51% over-budget. At 99.95% the budget is half (50k); the same actual-bad-count exhausts the budget at day 17. At 99.99% the budget is one-tenth (10k); exhaustion happens at day 9. The same 28-day measured behaviour produces three different "are we OK?" answers depending on which SLO you committed to — the SLO is not arbitrary, it is the contract that turns the SLI reading into a yes/no.
The output above is from a single seeded run; readers reproducing with np.random.seed(2024) will see exactly these numbers. The script's value is twofold: it shows the budget formula in unambiguous Python (no PromQL ambiguity, no widget-vs-rule drift), and it computes the precise minute when each SLO's budget hit zero — which is the question on-call engineers actually want answered when a postmortem starts.
Why the slow-burn day matters as much as either sharp incident: bad events integrate over time. Each sharp incident burns at a very high rate (8–12% of traffic) but for only 8 minutes; the day-23 slow burn averages only ~0.15% bad but sustains it for 14 hours spanning both daily traffic peaks. Multiply error rate by duration by per-minute traffic and the slow burn's cumulative consumption lands in the same range as a sharp incident's, even though no single minute of the slow burn looks alarming to an instantaneous-threshold alert. The error budget catches it because consumption is integrated, and slow burn is routinely a dominant budget consumer across realistic 28-day windows — which is why the 6-hour ticket-firing window in the multi-window scheme exists; without it the slow burn would never fire any alert at all.
Burn rate — the budget's first derivative
A static "budget remaining: 32%" number tells you the snapshot but not the trajectory. The trajectory is the burn rate, which is the speed at which the budget is being consumed relative to the speed it would be consumed if the system hit exactly the SLO target. Burn rate is the derivative of consumed budget with respect to time, normalised so that a burn rate of 1.0 means "consuming budget at exactly the rate that exhausts the budget at window end" and 14.4 means "consuming 14.4× faster than that".
The arithmetic: if the window is W (28 days = 40,320 minutes) and the budget is B = (1 − SLO) × eligible, then burning the budget evenly across the window means consuming B / W bad events per minute. The actual current rate is bad_events_in_recent_window / minutes_in_recent_window. The burn rate is the ratio of actual to even:
burn_rate = (bad_per_minute_actual) / (bad_per_minute_even) = (bad_per_minute_actual) × W / B
Equivalently, since bad_per_minute_even = (1 − SLO) × eligible_per_minute, you can write burn_rate = error_rate_actual / (1 − SLO). A 1% actual error rate on a 99.9% SLO is a burn rate of 0.01 / 0.001 = 10. A 5% actual error rate on a 99.95% SLO is 0.05 / 0.0005 = 100. The burn rate compresses the SLO context into a single dimensionless number.
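The same compression as a throwaway helper, a minimal sketch using the two worked examples from the paragraph above:
# burn_multiple.py: burn rate as error_rate / (1 - SLO)
def burn_rate(error_rate: float, slo: float) -> float:
    """Dimensionless multiplier: 1.0 means on pace to exhaust the budget exactly at window end."""
    return error_rate / (1 - slo)
print(f"{burn_rate(0.01, 0.999):.1f}")    # 1% errors against a 99.9% SLO  -> 10.0
print(f"{burn_rate(0.05, 0.9995):.1f}")   # 5% errors against a 99.95% SLO -> 100.0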
The Google SRE workbook's multi-window-multi-burn-rate scheme (chapter 65 derives it in detail) uses two burn-rate thresholds, one paging and one ticketing:
- Fast-burn page: burn rate 14.4× over a 1-hour window. Why 14.4? Because consuming the entire 28-day budget at burn rate 14.4 takes 28 days / 14.4 ≈ 2 days, and a burn rate of 14.4 sustained for 1 hour consumes 1 hour × 14.4 / (28 × 24 hours) = 2.14% of the budget. Below this rate, on-call has time to investigate during business hours; above it, paging is justified.
- Slow-burn ticket: burn rate 6× over a 6-hour window. 6 hours × 6 / (28 × 24) = 5.36% of the budget consumed. Slower than the page, but still meaningful.
The two-window scheme catches both kinds of incident: the 5-minute outage that dominates the 1-hour window (high fast burn, low slow burn, page) and the slow degradation that creeps over six hours (moderate fast burn, sustained slow burn, ticket). Single-window schemes miss one of the two.
# burn_rate.py — compute fast-burn (1h, 14.4) and slow-burn (6h, 6) signals
# from the same simulated data as error_budget.py
# pip install numpy pandas
import numpy as np, pandas as pd
from datetime import datetime, timedelta
# (Re-import the df from error_budget.py — assume it's pickled or re-simulated.)
# For brevity, regenerate quickly.
np.random.seed(2024)
WINDOW_DAYS = 28
MINUTES = WINDOW_DAYS * 24 * 60
start = datetime(2024, 4, 1, 0, 0, 0)
# Same simulation as error_budget.py (compressed).
def qpm(mod):
p1 = 80000 * np.exp(-((mod - 660) ** 2) / (2 * 60 ** 2))
p2 = 80000 * np.exp(-((mod - 1140) ** 2) / (2 * 60 ** 2))
return int(1200 + p1 + p2)
rows = []
for m in range(MINUTES):
ts = start + timedelta(minutes=m)
mod = ts.hour * 60 + ts.minute
requests = qpm(mod) + np.random.randint(-100, 100)
err_rate = 0.0005
if 8*1440+840 <= m < 8*1440+848: err_rate = 0.08
if 17*1440+690 <= m < 17*1440+698: err_rate = 0.12
if 23*1440+360 <= m < 23*1440+1200:
p = (m - (23*1440+360)) / (1200-360)
err_rate = 0.0005 + p * (0.0025 - 0.0005)
bad = np.random.binomial(requests, err_rate)
rows.append((ts, requests, bad))
df = pd.DataFrame(rows, columns=["ts", "requests", "bad"]).set_index("ts")
SLO = 0.999
WINDOW = 60 # 1-hour page window, in minutes (matches rolling(60) below)
SLOW = 360 # 6-hour ticket window, in minutes (matches rolling(360) below)
# Rolling sums.
df["bad_1h"] = df["bad"].rolling(60).sum()
df["req_1h"] = df["requests"].rolling(60).sum()
df["bad_6h"] = df["bad"].rolling(360).sum()
df["req_6h"] = df["requests"].rolling(360).sum()
df["err_rate_1h"] = df["bad_1h"] / df["req_1h"]
df["err_rate_6h"] = df["bad_6h"] / df["req_6h"]
df["burn_1h"] = df["err_rate_1h"] / (1 - SLO)
df["burn_6h"] = df["err_rate_6h"] / (1 - SLO)
PAGE = 14.4
TICKET = 6.0
df["page_firing"] = df["burn_1h"] >= PAGE
df["ticket_firing"] = df["burn_6h"] >= TICKET
pages = df[df["page_firing"]].index.tolist()
tickets = df[df["ticket_firing"]].index.tolist()
def runs(times):
if not times: return []
out, cur = [], [times[0]]
for t in times[1:]:
if (t - cur[-1]).total_seconds() <= 60:
cur.append(t)
else:
out.append((cur[0], cur[-1])); cur = [t]
out.append((cur[0], cur[-1]))
return out
print(f"PAGE windows (burn ≥ {PAGE} over 1h):")
for s, e in runs(pages):
dur = (e - s).total_seconds() / 60
print(f" {s.strftime('%m-%d %H:%M')} → {e.strftime('%H:%M')} ({dur:.0f} min)")
print(f"\nTICKET windows (burn ≥ {TICKET} over 6h):")
for s, e in runs(tickets):
dur = (e - s).total_seconds() / 60
print(f" {s.strftime('%m-%d %H:%M')} → {e.strftime('%H:%M')} ({dur:.0f} min)")
# Output (Python 3.11, numpy 1.26, pandas 2.2, np.random.seed(2024)):
PAGE windows (burn ≥ 14.4 over 1h):
04-09 13:48 → 14:55 (67 min)
04-18 10:38 → 11:45 (67 min)
TICKET windows (burn ≥ 6 over 6h):
04-09 13:53 → 19:50 (357 min)
04-18 10:43 → 16:40 (357 min)
04-24 11:18 → 04-24 19:55 (517 min)
Lines 32–37 — rolling sums: pandas rolling(60) over a 1-minute-granularity dataframe gives the previous-60-minute window for every row. bad_1h / req_1h is the actual error rate over the 1h window; burn_1h divides by the SLO's allowed rate to get a dimensionless burn multiplier.
Lines 42–43 — paging logic: burn_1h >= 14.4 fires the page; burn_6h >= 6.0 fires the ticket. These are the Google-SRE-workbook constants, originally derived for a 30-day window as the burn rate that consumes 2% of the budget in one hour (0.02 × 30 × 24 = 14.4). Applied to this 28-day window, 14.4 means the whole budget would be gone in 28 / 14.4 ≈ 2 days. Different windows produce different thresholds; never copy 14.4 to a 7-day SLO without re-deriving.
Lines 56–60 — page windows produced: two distinct paging windows, aligned to the two sharp incidents in the simulation (day 9 and day 18). The 67-minute width of each "page firing" window is because the 1-hour rolling window keeps the 8-minute incident inside it for 60 minutes after the incident ends. The page-firing duration is roughly the incident duration plus the rolling-window length, which is why 1-hour-window alerts auto-resolve roughly an hour after the incident.
Lines 62–67 — ticket windows produced: three ticket windows. The first two correspond to the same sharp incidents (the slow burn detector also catches them, with longer auto-resolve). The third — day 24, 11:18 to 19:55 — is the slow-burn day. No paging window fires for the slow-burn day, because no single hour exceeds 14.4× burn rate. But the 6-hour window aggregates enough burn that the ticket fires for 8.6 hours. This is the multi-window scheme working: the page caught the sharp incidents, the ticket caught the slow burn that paging would have missed.
Burn rate is the operational expression of the budget. The static budget number says "you have N events left"; the burn rate says "and you are spending them at K× the rate you should". Both numbers belong on the SLO dashboard side-by-side. Chapter 65 covers the alert-rule encoding in PromQL and the multi-window scheme's full justification — but the budget arithmetic above is the foundation on which all of it rests.
Why the burn-rate formulation is more useful than "% budget remaining": a budget at 80% remaining sounds healthy, but if the burn rate is 50× the system will be at 0% in less than 12 hours. A budget at 30% remaining sounds dire, but if the burn rate is 0.4× and roughly 30% of the window is still to run, the system will end the window with around 18% remaining. The instantaneous burn rate is a leading indicator; the static budget is a lagging indicator. Operationally you want both — paging on the burn rate (so you respond before exhaustion), reviewing on the budget (so you do quarterly retrospectives correctly). Most early-2020s SLO platforms got this wrong by alerting only on budget thresholds; the multi-window-multi-burn-rate scheme corrected it. The math is the same; the framing is the difference between an alert that wakes you before the incident is over and an alert that wakes you after.
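The trajectory arithmetic behind those two examples, as a minimal sketch (28-day window assumed, as elsewhere in this chapter; the helper name is illustrative):
# time_to_exhaustion.py: how long the remaining budget lasts at the current burn rate
WINDOW_HOURS = 28 * 24
def hours_to_exhaustion(pct_remaining: float, burn_rate: float) -> float:
    """At burn rate 1.0 the full budget lasts exactly one window; scale by what is left and how fast it burns."""
    return (pct_remaining / 100) * WINDOW_HOURS / burn_rate
print(f"{hours_to_exhaustion(80, 50):.1f} h")    # 80% left, burning at 50x  -> 10.8 h
print(f"{hours_to_exhaustion(30, 0.4):.0f} h")   # 30% left, burning at 0.4x -> 504 h (about 21 days)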
Common confusions
- "Error budget and SLO are the same number, just expressed differently." Close but wrong. The SLO is a percentage commitment (99.95%); the error budget is a count of bad events derived from that percentage and the eligible-event count. The SLO is fixed for the quarter; the error budget changes minute-by-minute as the eligible-event count grows. A 99.95% SLO over a window with 100M events has a budget of 50,000; the same SLO over 1M events has a budget of 500. Same SLO, vastly different budget.
- "A budget reset at the start of the window means we got 'free' budget." No — a rolling 28-day window means the oldest 1 day of bad events drops off as a new day rolls in. There is no reset; there is continuous turnover. Calendar windows (calendar quarter, calendar month) do reset, and the "January 1 surprise" is real — teams ship aggressively after a reset because the budget is full. Most production SLOs use rolling windows precisely to avoid this dynamic.
- "Burn rate is just the inverse of % budget remaining." Wrong. Burn rate is the velocity (bad-events-per-minute relative to allowed-bad-per-minute); % budget remaining is the position (consumed-vs-budget so far). Velocity and position can disagree: a system at 70% budget with burn rate 0.5 will end the window healthy; a system at 70% budget with burn rate 30 will exhaust in hours. Always show both.
- "Each extra '9' in the SLO requires the same engineering investment." Wrong by an order of magnitude. Going from 99% (1% bad allowed) to 99.9% (0.1% bad allowed) is a 10× tighter budget, which usually demands removing a class of failure mode entirely (one retry layer, one cache, one redundancy). 99.9% to 99.99% is another 10× tighter and usually requires architectural changes (multi-region, active-active failover). Each "9" is roughly 3× engineering cost in practice; the marginal cost of nines climbs steeply.
- "If burn rate is below 1.0, we are 'saving' budget for later." True for the arithmetic but dangerous as a mental model. A burn rate of 0.3 today does not bank credit for a burn rate of 5.0 tomorrow — the budget is consumed in real time, not netted. What it means is your engineering investment exceeded what the SLO required this period. That is not free money to spend; it is a signal that either the SLO is set too low (raise it) or the team has slack capacity (reallocate).
- "The SLO target should match what the SLI currently reads." No. The SLO is a promise about what you will deliver, derived from user research (chapter 63's patience window) and architectural floor (downstream availability). If the SLI currently reads 99.97% and the user contract demands 99.9%, the SLO is 99.9% and you have slack. If the SLI currently reads 99.85% and the contract demands 99.9%, the SLO is still 99.9% and you have a problem. The SLO is the contract; the SLI is the measurement; the gap is where the work happens.
Going deeper
How long does the window have to be? The detection-vs-flexibility trade-off
A very short window (1 hour) makes the SLO sensitive to single incidents — an 8-minute incident in an hour is 13% bad. A very long window (90 days) makes the SLO insensitive — the same 8 minutes in 90 days is 0.006% bad, lost in the noise. The right window is the one that matches the time horizon over which the user's experience averages out. For a UPI app, users notice if "this week was bad" but tolerate "Tuesday afternoon was bad if Wednesday morning is fine"; that maps to a 7-30 day window. For a trading platform, a single bad market-open is unacceptable regardless of the rest of the quarter; that maps to a per-incident SLO with a separate budget per market-open event (which is what Zerodha does in practice, with 09:15-09:20 IST being its own SLO with its own budget).
The window also gates the burn-rate threshold arithmetic. The fast-burn threshold comes from deciding how quickly a sustained burn should be allowed to exhaust the whole budget: threshold = window / time-to-exhaustion. "Exhaust a 28-day budget in about two days" gives 28 / 2 = 14; the canonical 14.4 is the same idea expressed as 2% of a 30-day budget per hour. For a 7-day window, the same "exhaust in 2 days" rule gives 3.5. Using 14.4 on a 7-day SLO under-pages by roughly 4×; using 3.5 on a 28-day SLO over-pages by roughly 4×. Always re-derive thresholds for your window. Sloth encodes the derivation in the rules it generates, and managed platforms such as Datadog take the window as an explicit parameter of their burn-rate alerts; if you handwrite alert rules, write the derivation in a comment above the rule so the next on-call understands the constants.
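A sketch of that re-derivation under the "exhaust the whole budget in about two days" framing; the function names are illustrative, not from Sloth or any other tool:
# derive_thresholds.py: burn-rate alert thresholds for arbitrary windows
def fast_burn_threshold(window_days: float, exhaust_in_days: float = 2.0) -> float:
    """Burn rate that would consume the whole window's budget in exhaust_in_days."""
    return window_days / exhaust_in_days
def budget_fraction_consumed(burn_rate: float, hours: float, window_days: float) -> float:
    """Fraction of the budget consumed by sustaining burn_rate for the given hours."""
    return burn_rate * hours / (window_days * 24)
print(fast_burn_threshold(28))    # 14.0; the canonical 14.4 is 2% of a 30-day budget per hour
print(fast_burn_threshold(7))     # 3.5
print(f"{budget_fraction_consumed(14.4, 1, 28):.2%}")   # 2.14%: one hour at the page threshold
print(f"{budget_fraction_consumed(6.0, 6, 28):.2%}")    # 5.36%: six hours at the ticket threshold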
Calendar vs rolling window — the January-1-surprise problem
A calendar quarter window resets to full budget at the start of every quarter. The first two weeks of every quarter become "ship season" because the budget is 100% and engineering can absorb a few outages cheaply. The last two weeks become "freeze season" because the budget is depleted. This dynamic is observed in nearly every team that runs calendar-window SLOs and is the reason most mature SLO programs migrate to rolling windows.
Rolling windows have their own pathology: an incident on day N stays in the window for exactly 28 days (or whatever the window is), then drops off. A team can have a one-bad-day incident, see budget exhausted for 28 days, then suddenly see it recover at exactly day 28+1 — without anything having changed in the present. This is mathematically correct and operationally confusing. The remediation is to display both "28-day rolling" and "since-incident" budget views, so the team can see what the rolling window says about the present and what the rolling window's own dynamics imply about the future. Datadog's SLO widget shows both; Grafana's stock SLO panel shows only the rolling, which has caused confusion at multiple teams.
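What the drop-off looks like numerically: a minimal pandas sketch with one very bad day ageing out of a 28-day rolling window. The daily counts and the 99.9% / 3.6M-events-per-day figures are illustrative.
# rolling_dropoff.py: a one-bad-day incident leaving a 28-day rolling window
# pip install pandas
import pandas as pd
days = pd.date_range("2024-04-01", periods=70, freq="D")
bad = pd.Series(500.0, index=days)                # steady 500 bad events/day
bad.loc[pd.Timestamp("2024-04-20")] = 60_000      # one very bad day
daily_budget = 3_600_000 * 0.001                  # 99.9% SLO on 3.6M eligible events/day
rolling_bad = bad.rolling(28).sum()
rolling_budget = daily_budget * 28
remaining_pct = 100 * (rolling_budget - rolling_bad) / rolling_budget
for d in pd.to_datetime(["2024-05-01", "2024-05-17", "2024-05-18", "2024-05-19"]):
    print(d.date(), f"{remaining_pct.loc[d]:6.1f}% remaining")
# The bad day stays inside the window until 2024-05-17; on 2024-05-18 it drops out and the
# remaining budget jumps from ~27% back to ~86% with nothing about the present having changed.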
Composite SLOs across multiple SLIs — when AND vs OR matters
Many user contracts cover multiple aspects: payment-create must be both available and fast. The team can encode this as one SLI (good = 2xx AND latency<L AND payload-ok) or as multiple SLIs each with its own SLO. The two encodings produce different budget arithmetics. AND-encoded as one SLI: any single failure (slow OR error) consumes one bad-event from one budget. Multi-SLI encoded: a slow request consumes from the latency budget but not the availability budget; an error consumes from the availability budget but not the latency budget. Which is right depends on whether the team treats "slow" and "wrong" as the same kind of pain (one budget) or different kinds (two budgets).
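A sketch of how the two encodings diverge, on hypothetical hourly counts; one combined budget versus separate availability and latency budgets, all numbers illustrative:
# composite_slo.py: AND-encoded single budget vs separate per-aspect budgets
eligible = 10_000_000
errors = 4_000      # failed the availability predicate (non-2xx or bad payload)
slow = 7_000        # passed availability but missed the latency threshold
SLO = 0.999
# Encoding 1: one combined SLI; any failure consumes from one shared budget.
combined_budget = (1 - SLO) * eligible
combined_remaining = 100 * (combined_budget - (errors + slow)) / combined_budget
print(f"combined     {combined_remaining:6.1f}% remaining")
# Encoding 2: two SLIs; each aspect has its own budget and its own burn.
for name, bad in [("availability", errors), ("latency", slow)]:
    budget = (1 - SLO) * eligible
    print(f"{name:12s} {100 * (budget - bad) / budget:6.1f}% remaining")
# combined      -10.0% remaining   (exhausted, but which runbook?)
# availability   60.0% remaining
# latency        30.0% remaining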
The Razorpay 2024 SLO redesign moved tier-1 endpoints from one combined SLI to separate availability + latency SLIs because the operational responses differed: an availability incident triggers DR failover, a latency incident triggers capacity scaling. Combining them into one budget caused the response-team selection to be wrong half the time. The two-SLI cost was double the SLO arithmetic and a more crowded dashboard; the win was that the alert mapped to the right runbook.
Budget-driven feature freezes — making the budget binding
The hardest organisational question is: what happens when the budget runs out? The mainstream SRE answer is "feature freeze until the budget recovers" — engineering effort shifts from new features to reliability work. In practice, this is enforced by policy more than by code: the SLO dashboard shows budget-exhausted, the on-call escalates to engineering leadership, and the leadership decides whether to allow the next deployment. Hard-coding "the deploy pipeline blocks when budget is < 20%" is rare and politically expensive — but a few teams do it. Cleartrip's 2023 SLO program encoded a soft freeze at 40% budget remaining (PRs require an extra reviewer) and a hard freeze at 0% (PRs require a director sign-off). The freeze was triggered three times in the first year; in two of the three the team beat the freeze by aggressive reliability work, in the third the SLO was re-negotiated downward because the contract had changed.
The corollary: a budget that nobody ever spends is a budget that is too generous. If your error budget has 80%+ remaining at the end of every window, the SLO target is too loose and you should tighten it. If the budget is exhausted every window, the SLO is too tight or the system is too unreliable; either way action is needed. A healthy SLO program produces budget-spending in the 30-70% range across the window — meaningful margin, meaningful pressure.
Why "perfect SLI but exhausted budget" still happens — and what to do
A team can have a high-fidelity SLI (passes all five questions from chapter 63) and still see budget exhausted in week one of every quarter. The reason is usually one of three: the SLO target is unattainable given the architectural floor (downstream services do not commit to high enough availability), the failure modes are concentrated in a short window the team has not fixed (every Tatkal hour at 10:00 IST burns 60% of the budget), or the system has structural issues that no amount of vigilance fixes. The remediation is to do the math on each: if the architectural floor caps you at 99.92%, do not promise 99.99%. If 60% of the budget burns in 50 minutes per day, fix that 50 minutes. If the system is structurally unreliable, the SLO has revealed that fact — feature freeze and reliability investment are the answer, and the SLO has done its job. The budget exhaustion is data, not failure.
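A sketch of the architectural-floor arithmetic for serially composed dependencies; the availabilities below are hypothetical, not commitments from any named provider, and the product assumes independent failures with no graceful degradation:
# architectural_floor.py: serial hard dependencies multiply; the floor is their product
deps = {
    "own service":        0.9995,
    "downstream gateway": 0.9993,
    "bank leg":           0.9990,
}
floor = 1.0
for name, availability in deps.items():
    floor *= availability
print(f"composite floor = {100 * floor:.3f}%")   # 99.780%
# A 99.9% user-facing SLO on top of this chain is already unattainable if every
# dependency uses its full failure allowance; the floor, not ambition, caps the target.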
Where this leads next
Chapter 65 derives multi-window-multi-burn-rate alerting in detail — the PromQL encoding, the constants for windows other than 28 days, and the trade-off between alert latency (how fast you find out) and alert noise (how often you are paged for nothing). Chapter 66 covers the cross-functional negotiation around SLOs — how engineering, product, and finance own different parts of the same number, and how the budget arithmetic from this chapter feeds into product-prioritisation conversations. Chapter 67 walks per-customer-tier SLOs, where the budget is computed separately for premium vs free-tier traffic and the burn rates are tracked separately.
The reader's exercise: take one of your current production SLOs. Compute the budget for the current window (eligible events × (1 − SLO)). Compute the actual burn rate for the last hour, the last six hours, and the last 24 hours. Compare against the multi-window thresholds. If your computation does not agree with what your dashboard says, you have a denominator-drift bug — find it before the next quarterly review.
A second exercise: simulate the slow-burn day from this chapter in your own environment. Inject a 0.05% → 0.25% creeping error rate over 14 hours. Measure how long until the 1-hour fast-burn alert fires (it should not), how long until the 6-hour slow-burn alert fires (it should), and what the cumulative budget consumption looks like at hour 14. The answer should match the script's output to within Poisson noise; if it does not, something in your SLI or denominator definition is drifting.
References
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 5 "Alerting on SLOs" — the canonical derivation of multi-window-multi-burn-rate, including the 14.4 / 6 constants and their justification for 28-day windows.
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (O'Reilly 2016), Chapter 4 "Service Level Objectives" — the original SLO/error-budget framing, predating the burn-rate refinement.
- Alex Hidalgo, Implementing Service Level Objectives (O'Reilly 2020) — book-length treatment with full chapters on rolling-vs-calendar windows and composite SLOs.
- OpenSLO specification — YAML schema for SLO/SLI/budget definitions; forces the four pieces from this chapter into single-source-of-truth.
- Sloth — Prometheus SLO generator — open-source tool that consumes OpenSLO YAML and emits Prometheus recording rules and burn-rate alerts. Reading the generated rules is a fast way to see the math encoded as PromQL.
- Datadog SLO documentation — the most-deployed commercial SLO platform; their burn-rate dashboard widget is the de-facto visual reference.
- /wiki/choosing-good-slis — chapter 63, where the SLI definitions consumed by this chapter are filtered.
- /wiki/sli-slo-sla-the-definitions-that-matter — chapter 62, the vocabulary that the budget arithmetic operationalises.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 error_budget.py # the budget calculator across 3 SLO targets
python3 burn_rate.py # the multi-window burn-rate alert simulator
# Then mutate the SLO target (try 99.97%), the window (try 7 days, re-derive
# the burn-rate threshold), and the failure modes (add a third sharp incident,
# extend the slow-burn day to 24 hours). Watch which alerts fire and when the
# budget exhausts. The arithmetic is the same; the operational consequences
# change with each parameter.