Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Alerting on SLOs vs on raw metrics

At 11:42 IST on a Tuesday, Riya — a backend engineer at a hypothetical Zerodha-pattern broking platform — was paged by an alert named OrderAPI_CPU_High. She acked, opened the dashboard, saw CPU at 87% on three pods out of twelve, and watched the alert auto-resolve four minutes later without her doing anything. At 12:09 IST, she was paged by OrderAPI_LatencyP99_Above500ms. Same response: dashboard, p99 at 540ms, auto-resolve at 12:13. At 12:31, OrderAPI_ErrorRate_Above1pct. By 13:00 she had been paged six times, the trading session was about to end, no customer had complained, and her on-call diary recorded none of these as real incidents. The fault was not the system; the fault was that every one of those alerts was on a raw metric. None of them was on the property the team had actually promised: that 99.5% of order placements would succeed within 800ms during market hours. The promise had not been broken. The alerts fired anyway.

Raw-metric alerts page when a number crosses a threshold; SLO-based alerts page when a promise to the user is being broken faster than the error budget can absorb. The two answer different questions and produce different on-call experiences. A raw-metric alert says "CPU is high" — true, possibly meaningless. An SLO-based alert says "you will exhaust your monthly error budget in 4 hours at the current rate" — true, always meaningful. Migrating from one to the other is the single largest reduction in alert noise most teams ever ship.

What "alerting on a raw metric" actually means — and why it feels safe

A raw-metric alert is an alert rule whose expr is a direct query against a single time series, with a static threshold. cpu_usage > 0.8, latency_p99 > 500ms, error_rate > 0.01, queue_depth > 10000, disk_free_percent < 15. The threshold is picked by an engineer at a moment in time — usually during a postmortem, sometimes during an architecture review — and frozen into the alert ruleset until somebody either tunes it or deletes it. The alert fires when the metric crosses the line for some for: duration, and the on-call is paged.

Raw-metric alerting feels safe to authors because it is concrete. You can point at the line in the dashboard, you can argue about the threshold value, you can write a runbook that says "if CPU is above 80%, scale". The mechanism is legible at the alert-rule level. The cost of this legibility is paid by the on-call: every threshold is a guess about a future condition, and most guesses are wrong in one of two directions. Either the threshold is set too low (the metric crosses it during normal load), in which case the alert fires often without user impact — alert fatigue, normalisation of pages, eventual desensitisation. Or the threshold is set too high (the metric only crosses it during severe degradation), in which case the alert fires after users have already noticed — late detection, customer-reported incidents, postmortem theatre.

Why thresholds drift wrong: the engineer setting the threshold has a mental model of the system based on the last 90 days of operation, but the system's behaviour shifts as load patterns change, as new features add request types, as deployment cadence changes pod restart rates. A threshold of 80% CPU was correct for the system in March, ambiguous in April, and noisy in May — without anyone touching the threshold itself. The threshold did not change; the system underneath did. Raw-metric alerts therefore decay in quality monotonically, which is why every alerting team that has been around for more than two years has an alert-tuning backlog that grows faster than it shrinks.

There is a deeper problem. A raw metric is a measurement of one component; the user's experience is the composition of many components. CPU at 87% on three pods does not tell you whether order placements are succeeding — those depend on the database, the matching engine, the broker network, and a half-dozen other systems whose individual metrics may all look fine. The raw-metric alert is asking the wrong question. It is asking "is this number high?" when the question that matters is "are users experiencing the service the way we promised?". The two questions have different answers most of the time, and every raw-metric rule drifts further from user experience as the system's composition changes — which is how an alert ruleset, rule by rule, diverges from operational truth.

The architectural failure mode of raw-metric alerting is cause-based alerting — paging on every plausible cause of degradation, hoping that catching causes early prevents user impact. The intent is defensive: if CPU is high, latency might rise, so page now and prevent the latency spike. The empirical reality is that most cause-alerts fire without the downstream symptom appearing — either because the system is robust enough to absorb the cause, or because the cause was transient, or because the chain from cause to symptom is more complex than the alert author understood. Cause-based alerting therefore generates more pages than symptom-based alerting for the same level of user-visible reliability, which is the wrong direction on the cost-of-on-call axis. (See /wiki/symptom-based-alerts-the-google-sre-book for the full argument.)

A second failure mode is threshold-on-aggregate. A team alerts on latency_p99 > 500ms evaluated across all endpoints and all regions — in practice, the rule fires if any per-endpoint p99 series crosses the line. The alert fires at 11:42 because one obscure admin endpoint, called twice per hour, took 5 seconds; with only two requests in the evaluation window, that endpoint's p99 is simply the latency of its slowest request. The on-call investigates a phantom incident; the user-facing endpoints were fine. Meanwhile, the aggregate across all endpoints would have hidden the admin endpoint entirely — its two requests vanish among hundreds of thousands of fast ones. Either granularity hides which user is affected, which makes the alert non-actionable even when it is technically correct.
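
A quick numerical illustration of both distortions — hypothetical latencies, with numpy's percentile standing in for the monitoring system's quantile estimator:

# Sparse series make p99 degenerate; aggregation hides them. Numbers illustrative.
import numpy as np

checkout = np.random.default_rng(1).normal(0.12, 0.02, 50_000)  # 50k fast requests
admin = np.array([5.1, 4.9])                                    # 2 slow admin calls

print(round(float(np.percentile(checkout, 99)), 3))  # ~0.17s — healthy
print(round(float(np.percentile(admin, 99)), 3))     # ~5.1s — pages as an "incident"
# The aggregate hides the admin endpoint entirely — still ~0.17s:
print(round(float(np.percentile(np.concatenate([checkout, admin]), 99)), 3))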

[Figure: Raw-metric vs SLO-based alerting — same week, same service. Left panel (raw-metric, CPU > 80%): a horizontal threshold line with five spikes crossing it over Mon–Fri; four are transient phantoms, one coincides with real impact — annotated "4 pages, 0 user-visible incidents". Right panel (SLO, 99.5% over 30 days): an error-budget consumption curve over days 0–30 against the budget-pace line, with a burn-rate band; the alert fires once, when consumption outruns the budget pace — annotated "1 page, 1 user-visible incident".]
Illustrative — not measured data. Both panels show the same week. Raw-metric pages four times on transient CPU; SLO pages once, when the error-budget consumption curve is bending faster than the budget can absorb. The right panel's signal-to-noise ratio is the operational difference between the two regimes.

What an SLO-based alert is — and why the math is different

An SLO-based alert does not page on the metric. It pages on the rate at which the error budget is being consumed. The shift is small in code and large in semantics.

A Service Level Objective is a contract: "99.5% of order placements will succeed within 800ms over a 30-day window." The 0.5% of permitted failure is the error budget. Over 30 days, if the service receives 100 million order placements, the error budget is 500,000 failed-or-slow placements. Spend the budget evenly, and the service maintains the SLO. Spend it faster than evenly, and the service is on track to break the SLO unless something changes; the rate at which the budget is being spent (the burn rate) is what the alert evaluates. (See /wiki/error-budget-math for the derivation.)

The burn rate is dimensionless: a burn rate of 1 means the budget is being spent at the pace that exactly exhausts it over the SLO window. A burn rate of 14.4 means it is being spent fourteen times faster than that — at this rate, the 30-day budget will be gone in about 50 hours. A burn rate of 0.3 means the service is well within budget. The burn-rate threshold for an alert is chosen based on how quickly you want to be paged when the service is degrading: a burn rate of 14.4 over a 1-hour window means "if this rate continues, you'll exhaust 2% of the monthly budget in 1 hour" — page-worthy. A burn rate of 6 over a 6-hour window means "if this rate continues, you'll exhaust 5% of the monthly budget in 6 hours" — also page-worthy, but slower-burning.
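
The arithmetic is small enough to check by hand; a sketch of the two derived quantities, using the numbers above:

# Error-budget size and time-to-exhaust for a 30-day window — numbers from the text.
SLO_TARGET = 0.995
WINDOW_DAYS = 30
MONTHLY_REQUESTS = 100_000_000

budget = (1 - SLO_TARGET) * MONTHLY_REQUESTS
print(f"budget: {budget:,.0f} failed-or-slow placements")  # 500,000

def hours_to_exhaust(burn_rate: float) -> float:
    # burn rate 1.0 spends the budget in exactly the SLO window
    return WINDOW_DAYS * 24 / burn_rate

print(f"{hours_to_exhaust(14.4):.0f} hours at burn rate 14.4")  # ~50 hours
print(f"{hours_to_exhaust(1.0):.0f} hours at burn rate 1.0")    # 720 — the full window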

The expr of an SLO-based alert in PromQL looks like this:

(
  sum(rate(http_requests_failed_total{service="order-api"}[1h]))
  /
  sum(rate(http_requests_total{service="order-api"}[1h]))
) > 14.4 * (1 - 0.995)

It is computing the error rate over a 1-hour window and comparing it to 14.4 times the failure budget rate (1 minus the SLO target). If the error rate is more than 14.4 times the budget pace, fire. The two numbers (14.4, 0.995) are the only knobs; the rest is mechanics.

Why this is different from error_rate > 0.01: the raw-metric alert fires when the current error rate is above a fixed threshold, regardless of context. The SLO alert fires when the error rate is high relative to what the SLO permits. If the SLO is 99% (1% allowed failure), then a 1.2% error rate is mildly over budget — a burn rate of about 1.2x. If the SLO is 99.99% (0.01% allowed failure), the same 1.2% error rate is a burn rate of 120x — catastrophic. The threshold-on-error-rate alert treats both cases identically; the SLO alert distinguishes them automatically because the budget itself is in the denominator.
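
A minimal illustration of the denominator doing the work — same observed error rate, two SLO targets:

def burn_rate(error_rate: float, slo_target: float) -> float:
    return error_rate / (1 - slo_target)

print(f"{burn_rate(0.012, 0.99):.1f}x")    # 1.2x — mildly over budget at 99%
print(f"{burn_rate(0.012, 0.9999):.0f}x")  # 120x — catastrophic at 99.99%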

A second property: SLO-based alerts are self-tuning to load. If traffic doubles overnight (say, an IPL final or a Diwali Big Billion Days event), the absolute count of allowed failures doubles too, because the error budget is a percentage. The burn rate stays meaningful regardless of traffic shape. A raw-metric alert on absolute error count (say, errors > 1000/sec) breaks under a 2x traffic increase: either it fires constantly (because more traffic means more absolute errors at the same percentage), or it requires manual retuning every time the load profile shifts. The SLO alert needs no retuning across traffic changes; that is a structural advantage that compounds as the team's load patterns evolve.

A third property: SLO-based alerts express user impact directly. The metric being measured is the percentage of user requests that succeeded within the latency target — that is, the user's experience. A burn rate of 14.4 translates directly to "users are seeing degradation 14.4 times faster than the contract permits". A raw-metric alert requires the on-call to translate the metric into user-impact-language during the incident; the SLO alert does the translation upfront in the alert rule. The on-call's first 30 seconds of consciousness at 02:47 IST should not be spent doing arithmetic.

The cost of SLO-based alerting is the upfront engineering: you must define the SLO, instrument the SLI (the service-level indicator that feeds the SLO), and pick the burn-rate windows. (See /wiki/sli-slo-sla-the-definitions-that-matter and /wiki/choosing-good-slis.) Most teams find this work uncomfortable because it requires committing to a number — the SLO target — that previously lived in a vague intuition about "how reliable should this be". The discomfort is the work; once the SLO is committed and instrumented, the alert ruleset shrinks dramatically because most raw-metric alerts can be deleted in favour of one SLO alert per user-facing surface.

Building a comparator: same workload, two alerting regimes

The clearest way to see the difference is to run the same incident sequence through both alerting regimes and count the pages. The Python script below simulates 30 days of traffic at a hypothetical Razorpay-pattern UPI payment service, applies a realistic mix of degradation events (transient pod-level slowness, brief region-wide errors, a sustained ledger-DB problem), and reports how many pages each regime emits.

# slo_vs_raw_alerting.py — compare raw-metric and SLO-based alert volumes
# pip install numpy
import numpy as np

np.random.seed(74)
DAYS = 30
SECONDS_PER_DAY = 86400
BUCKET = 60  # 1-minute aggregation buckets
N_BUCKETS = DAYS * SECONDS_PER_DAY // BUCKET

# Baseline: 99.5% SLO at a UPI payments service.
SLO_TARGET = 0.995
ERROR_BUDGET_RATE = 1 - SLO_TARGET  # 0.005 — allowed failure pace
BURN_RATE_PAGE = 14.4  # burn-rate-over-1h that triggers a page

# Per-minute counts. Baseline ~10k requests/min and ~50 errors/min — a 0.5%
# error rate, i.e. exactly at the budget pace (burn rate 1.0).
baseline_rpm = 10000 + np.random.normal(0, 200, N_BUCKETS)
errors = np.random.poisson(50, N_BUCKETS).astype(float)

# Inject a realistic mix of degradation events:
# 1. Twelve transient CPU spikes (5 min each) — no user impact, raw CPU > 80%
# 2. Three brief region errors (15 min each) — small user impact, recoverable
# 3. One sustained ledger-DB problem (4 hours) — significant user impact
events = []
for _ in range(12):
    t = np.random.randint(0, N_BUCKETS - 5)
    events.append(("cpu_transient", t, 5))
for _ in range(3):
    t = np.random.randint(0, N_BUCKETS - 15)
    errors[t:t+15] += np.random.poisson(800, 15)  # 800 extra errors/min
    events.append(("region_brief", t, 15))
sustained_t = np.random.randint(0, N_BUCKETS - 240)
# 800 extra errors/min on ~10k req/min is an ~8.5% error rate — a burn rate of
# ~17x against the 0.5% budget, comfortably above the 14.4x page threshold
errors[sustained_t:sustained_t+240] += np.random.poisson(800, 240)
events.append(("ledger_sustained", sustained_t, 240))

cpu_pct = 0.55 + np.random.normal(0, 0.05, N_BUCKETS)
for kind, t, dur in events:
    if kind == "cpu_transient":
        cpu_pct[t:t+dur] += 0.30 + np.random.normal(0, 0.03, dur)

# --- Regime 1: raw-metric alerts ---
raw_pages = []
# Rule A: CPU > 80% for 2+ minutes
cpu_high = cpu_pct > 0.80
i = 0
while i < N_BUCKETS:
    if cpu_high[i] and (i+1 < N_BUCKETS and cpu_high[i+1]):
        raw_pages.append(("cpu_high", i))
        # skip until below threshold (alert auto-resolves; next firing is a new page)
        while i < N_BUCKETS and cpu_high[i]:
            i += 1
    i += 1
# Rule B: error_rate > 1% for 5+ minutes
err_rate = errors / baseline_rpm
err_high = err_rate > 0.01
i = 0
while i < N_BUCKETS:
    run = 0
    while i + run < N_BUCKETS and err_high[i + run]:
        run += 1
    if run >= 5:
        raw_pages.append(("error_rate_high", i))
        i += run
    else:
        i += run + 1

# --- Regime 2: SLO-based burn-rate alert (1h window, threshold 14.4) ---
slo_pages = []
window = 60  # 1-hour rolling window in minutes
firing = False
for i in range(window, N_BUCKETS):
    err_window = errors[i-window:i].sum()
    req_window = baseline_rpm[i-window:i].sum()
    burn_rate = (err_window / req_window) / ERROR_BUDGET_RATE
    above = burn_rate > BURN_RATE_PAGE
    # debounce: page on the rising edge only, at most once per hour —
    # a sustained burn is one incident, not one page per evaluation
    if above and not firing and (not slo_pages or i - slo_pages[-1][1] > 60):
        slo_pages.append(("slo_burn", i))
    firing = above

print(f"Simulation: {DAYS} days, {N_BUCKETS:,} 1-minute buckets")
print(f"Injected events: 12 transient CPU, 3 brief regional, 1 sustained ledger\n")
print(f"Raw-metric regime:")
print(f"  CPU>80% pages:        {sum(1 for k,_ in raw_pages if k=='cpu_high')}")
print(f"  error_rate>1% pages:  {sum(1 for k,_ in raw_pages if k=='error_rate_high')}")
print(f"  total pages:          {len(raw_pages)}\n")
print(f"SLO-based regime (burn-rate>14.4 over 1h):")
print(f"  pages: {len(slo_pages)}")

# Map SLO pages back to events
real_incident_buckets = set()
for kind, t, dur in events:
    if kind == "ledger_sustained":
        real_incident_buckets.update(range(t, t+dur))
slo_real = sum(1 for _, b in slo_pages if b in real_incident_buckets)
print(f"  pages mapping to user-impact event: {slo_real}/{len(slo_pages)}")

Sample run:

Simulation: 30 days, 43,200 1-minute buckets
Injected events: 12 transient CPU, 3 brief regional, 1 sustained ledger

Raw-metric regime:
  CPU>80% pages:        12
  error_rate>1% pages:  4
  total pages:          16

SLO-based regime (burn-rate>14.4 over 1h):
  pages: 1
  pages mapping to user-impact event: 1/1

The output exposes the regime difference cleanly. The raw-metric ruleset emits 16 pages over 30 days — 12 from CPU transients (none of which produced user impact) and 4 from error-rate spikes (some of which were brief and self-resolved). The SLO-based alert emits one page, and that page maps to the sustained ledger problem — the only event in the simulation that meaningfully spent the error budget.

Why the SLO regime ignored the brief regional errors even though they were real degradations: each lasted 15 minutes at ~800 extra errors/min. Over a 1-hour window, that contributes 12,000 extra errors, plus roughly 3,000 baseline errors, against ~600,000 total requests — about a 2.5% error rate over the window. With an SLO of 99.5%, that is a burn rate of 2.5% / 0.5% = 5x. Above the budget pace, but well below the 14.4x page threshold. The SLO regime's design choice is deliberate: a brief, self-recovering event that consumes well under 1% of the monthly budget is not worth waking someone up for. The team's monthly budget can absorb it. Pages should fire only when the budget consumption rate meaningfully threatens the contract — that is, when degradation is sustained or severe enough to warrant intervention. The 14.4 threshold is calibrated for exactly that.

A subtler point: the simulation undercounts the raw-metric noise because it does not model the secondary alerts that real teams have layered on top. A typical mature ruleset has 30–80 alert rules per service: CPU, memory, GC pause, disk I/O, network, error rate, latency p50/p95/p99, queue depth, connection-pool saturation, DB lag, replication lag, cache hit ratio. If the simulation included the full ruleset, the page count for the raw-metric regime would be 50–150 over the same 30 days. The SLO-based regime has typically 2–4 alerts (one per user-facing SLO surface, possibly with multi-window variants — see /wiki/multi-window-multi-burn-rate-alerts). The reduction is more than 90% in steady state and approaches 95% over a year as ruleset cruft accumulates differently in each regime.

The simulation also under-models the severity dimension. The raw-metric regime cannot distinguish a transient CPU spike from a sustained ledger problem — both produce the same alert shape and wake the on-call the same way. The SLO regime naturally encodes severity in the burn rate: a burn rate of 50x is more serious than a burn rate of 2x. A multi-window-multi-burn-rate ruleset (1h+5min for fast burns, 6h+30min for slow burns) further encodes severity in the window choice. The result is that on-calls in SLO-based regimes have implicit prioritisation built into the alert itself; on-calls in raw-metric regimes triage every page from scratch.

[Figure: Error-budget burn — when does the SLO-based alert fire? A 30-day error-budget chart. The budget-pace line runs diagonally from 100% remaining at day 0 to 0% at day 30 (linear spend); the actual consumption curve tracks it until the sustained incident, where its slope steepens into the burn-rate band (>14.4x the pace) and the page fires, then flattens to a recovery pace after mitigation.]
Illustrative — not measured data. The diagonal line is the pace at which the budget would be spent if errors were uniform; the bold curve is the actual consumption. The alert fires when the consumption curve enters the burn-rate band, which is when the local slope is more than 14.4 times the budget pace.

Edge cases and the cases where raw-metric alerting is still correct

SLO-based alerting is not universal. There are a small number of legitimate cases where raw-metric alerting is the right primitive, and ignoring them produces a different failure mode — incidents that the SLO ruleset cannot detect.

Saturation alerts on bounded resources. Disk space, file-descriptor count, connection-pool size, queue depth at maximum capacity. These are not user-experience metrics; they are physical-limit metrics. When the disk is 95% full, the next write may fail, and no SLO covers the latency-after-out-of-disk case because there is no latency — just hard failure. Saturation alerts must fire before the limit is hit, which means they must use raw-metric thresholds. The discipline here is to alert only on resources whose exhaustion is unrecoverable in the alert's response window: disk-full takes 10–30 minutes to fix (provision more storage, archive old data); connection-pool exhaustion takes 30 seconds to fix (scale the pool size, restart). Alert on the former with a raw threshold; rely on the SLO to catch the latter when it actually impacts users.
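
The standard shape of such a pre-exhaustion alert is a linear extrapolation of recent growth — what PromQL's predict_linear does. A minimal Python sketch of the same idea; sample values are illustrative:

import numpy as np

def hours_until_full(t_hours: np.ndarray, used_frac: np.ndarray) -> float:
    # Fit a line to recent usage and extrapolate to 100%.
    slope, _ = np.polyfit(t_hours, used_frac, 1)
    if slope <= 0:
        return float("inf")  # flat or shrinking: no exhaustion in sight
    return (1.0 - used_frac[-1]) / slope

t = np.array([0.0, 1.0, 2.0, 3.0])           # one sample per hour
usage = np.array([0.82, 0.85, 0.88, 0.91])   # disk used fraction
if hours_until_full(t, usage) < 4.0:         # page inside the response window
    print(f"PAGE: disk projected full in {hours_until_full(t, usage):.1f} hours")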

Telemetry-pipeline health alerts. The SLO-based ruleset depends on the telemetry pipeline. If Prometheus scrapes are failing, the SLO query has no data, and the burn-rate alert silently does not fire — a false negative of the worst kind. You need a small ruleset of raw-metric alerts on the telemetry pipeline itself: Prometheus up, scrape success rate, recording-rule freshness, alertmanager queue depth. These are the alerts about your alerting; they cannot themselves be SLO-based because their failure mode is the SLO machinery breaking. (See the meta-runbook discussion in /wiki/runbook-driven-alerts.) Most production rulesets have 5–10 such meta-alerts and the rest are SLO-based; that mix is healthy.
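
A sketch of the freshness check such a meta-alert encodes — in production this is a raw-metric rule on up and scrape staleness inside the monitoring system itself; the helper below is a hypothetical illustration:

import time

SCRAPE_INTERVAL_S = 30
STALENESS_LIMIT_S = 5 * SCRAPE_INTERVAL_S  # tolerate a few missed scrapes

def sli_is_stale(last_sample_ts: float, now: float | None = None) -> bool:
    # A stale SLI must page as a telemetry incident — never read as "zero errors".
    now = time.time() if now is None else now
    return (now - last_sample_ts) > STALENESS_LIMIT_S

print(sli_is_stale(time.time() - 600))  # True — ten minutes of silence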

Pre-customer detection. A new feature is being rolled out behind a flag to 1% of users. The SLO has not been instrumented for the new code path yet; the user volume is too small for the SLO ruleset to detect issues. During the rollout window (typically 1–7 days), a raw-metric alert on the new feature's error rate, latency, or saturation can detect issues earlier than the SLO ruleset. Once the feature is fully rolled out and instrumented, the raw-metric alert is decommissioned in favour of the SLO. The ruleset has a "rollout drawer" of temporary alerts that are explicitly time-boxed; this is hygiene, not architecture.

Compliance or regulatory thresholds. Some industries have hard regulatory SLAs that map to specific raw metrics. A hypothetical Indian banking regulator might mandate that every UPI transaction be logged within 5 seconds, with an alert on log-write latency. The regulation is the threshold; the metric is what the regulator audits against. A bank's compliance ruleset is therefore raw-metric-based by mandate. The SLO ruleset coexists, alerting on user-experience separately. The two rulesets don't conflict; they answer to different stakeholders.

Pre-SLO services. Not every service has an SLO. New services in early development, internal tools with no user-facing surface, batch jobs, scheduled tasks — these often run without an SLO because the cost of defining one outweighs the benefit. Raw-metric alerts on critical failures (job didn't run, batch missed deadline) are appropriate here. The hygiene rule is to mark these alerts explicitly as "pre-SLO" and to revisit them when the service graduates to user-facing status. Without the marker, the pre-SLO alerts accumulate forever and become indistinguishable from production alerts; with it, they are visible as a backlog item.

A subtle anti-pattern: teams sometimes create an SLO whose target matches a raw-metric threshold one-to-one — "p99 < 500ms" becomes "SLO: 99.5% of requests in 500ms" with the burn-rate alert tuned to fire at exactly the same rate as the old latency_p99 > 500ms alert. This is the worst of both worlds: the SLO machinery's overhead with the raw-metric alert's behaviour. The SLO target should be chosen based on what the user actually needs (user-research, customer SLAs, business commitments), not by retrofitting an existing alert threshold. Why retrofitting is wrong: the value of SLO-based alerting comes from the gap between the SLO target and the engineering's natural behaviour. If the SLO is 99.5% but the service naturally runs at 99.95%, the error budget is large, transient blips are absorbed silently, and pages fire only when degradation is sustained. If the SLO is 99.95% and the service naturally runs at 99.95%, every blip consumes budget and pages fire constantly — the SLO is too tight. Retrofitting a raw-metric threshold typically lands the SLO too tight, recreating the alert-fatigue problem in new clothing. Pick the SLO from user need first, then tune the burn-rate windows to that — never the other way round.

These exceptions account for roughly 10–15% of a mature production alert ruleset. The other 85–90% should be SLO-based. Teams that achieve this ratio report on-call rotations that are sustainable for years; teams that don't churn engineers off-rotation faster than they hire them.

Common confusions

  • "An SLO alert is just a complicated way to write a raw-metric alert." The expression looks similar but the semantics are inverted. A raw-metric alert asks "is this number high right now?". An SLO alert asks "is the rate of budget consumption threatening the contract?". The first is point-in-time; the second is a derivative against a budget. The two answer different questions and produce different page volumes for the same service degradation.

  • "You should alert on causes (CPU, memory) so you can fix them before users notice." Cause-based alerting fires on conditions that might lead to user impact. Most cause-alerts fire without producing user impact — the system absorbs the cause, the cause is transient, or the chain to symptom is more complex than the alert author understood. Symptom-based alerting (which SLO alerting is the canonical form of) fires on user-visible degradation, which is the only thing that actually matters. (See /wiki/symptom-based-alerts-the-google-sre-book.)

  • "SLO-based alerting will miss problems that don't yet show up in user metrics." It will — by design. The argument is that pre-symptom degradations are statistically more often false alarms than true precursors, and the on-call cost of paging on every potential precursor exceeds the benefit of catching the rare true precursor. Engineering teams that have run both regimes side-by-side report that the SLO regime catches every real incident with shorter MTTR, because the on-call is not desensitised by a flood of pre-symptom pages.

  • "You need separate alerts for latency and availability — one SLO can't cover both." A well-formed SLO covers both: "99.5% of requests succeed within 800ms" is one SLO whose error budget is consumed by both error-counts and latency-overruns. The single alert on this SLO's burn rate will fire whether the service is failing on availability or on latency. Some teams choose to split into two SLOs (one availability-only, one latency-only) for clearer attribution; that's an editorial choice, not a technical requirement.

  • "Raw-metric alerts are simpler, so they're better for small teams." Small teams suffer alert fatigue more, not less, because the on-call rotation is two or three people and each false page consumes a larger fraction of the team's sleep. The simplicity of raw-metric alerting is paid for in pages-per-week. SLO-based alerting requires more upfront engineering but pays back in page-volume reduction within weeks. For a 3-person on-call rotation, the SLO investment typically breaks even at 4–6 weeks and produces compounding returns thereafter.

  • "If we delete the raw-metric alerts and go SLO-only, we'll miss the precursor signals." Precursor signals belong on dashboards, not in pages. The on-call should glance at the dashboard at the start of their shift, see CPU at 75% on three pods, and decide whether to investigate during business hours — or ignore. The page is the highest-cost interrupt the system has; reserve it for the symptoms the user is experiencing. (Some teams keep precursor signals as Slack notifications rather than pages — same data, lower interrupt cost.)

Going deeper

The mathematics of burn-rate windows: why 14.4 over 1h

The burn-rate threshold of 14.4 over a 1-hour window is not arbitrary. The Google SRE workbook derives it from the policy "page if more than 2% of the monthly budget will be consumed in 1 hour at the current rate". A 30-day month has 720 hours; 2% of monthly budget over 1 hour is 2% / (1/720) = 14.4 times the linear consumption rate. The derivation determines the threshold; the threshold is not tuned empirically. Why the derivation matters: a team that picks 14.4 by reading the SRE book and copying it has the right number for the wrong reason — they will not know what to change if the policy changes. A team that derives it from "2% of monthly budget in 1 hour" can change the policy ("we want to be paged if 5% of budget is consumed in 6 hours") and re-derive the threshold (5% / (6/720) = 6) without consulting an external source. The math is the policy expressed numerically; understanding it is what lets the team own the policy.
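
The derivation fits in a few lines of code; a sketch that turns a paging policy into a threshold, assuming a 30-day SLO window:

def burn_threshold(budget_fraction: float, window_hours: float,
                   slo_window_hours: float = 30 * 24) -> float:
    # Linear spend consumes window_hours/slo_window_hours of the budget;
    # the threshold is how many times faster than that the policy tolerates.
    return budget_fraction / (window_hours / slo_window_hours)

print(f"{burn_threshold(0.02, 1):.1f}")  # 14.4 — "2% of budget in 1 hour"
print(f"{burn_threshold(0.05, 6):.1f}")  # 6.0  — "5% of budget in 6 hours"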

A second derivation: the time-to-exhaust. At burn rate B, the budget is consumed in (window / B) where window is the SLO window in absolute time. At burn rate 14.4 with a 30-day SLO, the budget exhausts in 30/14.4 ≈ 2.08 days. The team's response window — how long they have to investigate, mitigate, and recover — is the time-to-exhaust minus the time the page spends in delivery. If the page-to-page-acked latency is 5 minutes and the time-to-mitigate is 30 minutes, the team needs the time-to-exhaust to be at least 35 minutes plus a safety margin. Burn rate 14.4 over 1h gives 50 hours; comfortable. Burn rate 144 (10x more aggressive) would give 5 hours; tight. Burn rate 1.44 (10x less) would give 500 hours; too lenient. The mathematics constrains the choice tightly.

Multi-window multi-burn-rate: why one window is not enough

A single burn-rate window has a known weakness: it cannot distinguish a fast burn (severe but brief) from a slow burn (mild but sustained). The fast-burn case wants a short window (1h, threshold 14.4) to catch it within an hour; the slow-burn case wants a long window (6h, threshold 6) to catch it before too much budget is consumed. A single window forces a compromise that catches one case poorly.

The multi-window-multi-burn-rate pattern (see /wiki/multi-window-multi-burn-rate-alerts for full mechanics) runs both windows simultaneously. The alert fires if either: the 1h burn rate exceeds 14.4 (fast burn detected within an hour), or the 6h burn rate exceeds 6 (slow burn detected within 6 hours). Each window has a short-window confirmation (5min for the 1h, 30min for the 6h) that reduces false positives from telemetry noise. The compound rule is more robust than either single window and is the default at every team that has invested in SLO-based alerting at production scale.
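
A sketch of the compound predicate with the window pairs and thresholds named above; the burn-rate inputs are assumed precomputed from the SLI:

def should_page(burn_1h: float, burn_5m: float,
                burn_6h: float, burn_30m: float) -> bool:
    fast_burn = burn_1h > 14.4 and burn_5m > 14.4  # severe: caught within the hour
    slow_burn = burn_6h > 6.0 and burn_30m > 6.0   # mild but sustained
    return fast_burn or slow_burn

print(should_page(20, 18, 3, 2))   # True  — fast burn
print(should_page(4, 3, 7, 8))     # True  — slow burn
print(should_page(20, 1, 3, 2))    # False — 1h high but last 5min recovered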

Hypothetical Hotstar IPL: how SLO-based alerting would handle the 25M-concurrent peak

A hypothetical Hotstar-pattern streaming team running the IPL final at 25M concurrent viewers would face a load profile that breaks raw-metric alerting catastrophically. Absolute error counts spike with traffic — with 25M concurrent viewers each generating on the order of a request per second, even a 0.1% error rate is ~25,000 errors per second at peak. A raw-metric alert on errors > 5000/sec would fire continuously; the on-call would silence it and lose the ability to detect actual problems. A raw-metric alert on errors > 5000/sec during non-IPL plus a manually-incremented threshold for IPL hours would be operational toil that nobody maintains correctly.

The SLO-based alert handles this naturally. With a 99.9% SLO (0.1% error budget), the burn rate at 0.1% sustained errors is exactly 1.0 — at budget pace, no alert. The burn rate at 1% sustained errors is 10x — page-worthy regardless of whether 25M or 25k viewers are concurrently watching. The error-budget machinery automatically adjusts the threshold to traffic volume because the budget is a percentage. The on-call is paged when user-experience degrades meaningfully; they are not paged when traffic merely scales up.

The same hypothetical team running a UPI integration at peak NPCI hours (around 12:00–13:00 IST when payroll batches hit) would benefit similarly: their SLO-based alert ruleset would fire on user impact regardless of whether the day's NPCI traffic was 50M transactions or 80M. The raw-metric alternative would require manual retuning every NPCI traffic shift, which nobody does.

When the SLO is wrong: detecting and re-targeting

An SLO that is set wrong is harder to detect than a raw-metric alert that is set wrong, because the SLO ruleset is quieter by design. A team can run for months with an SLO of 99.5% on a service that the users actually need at 99.95% — the SLO alert never fires because the service stays above 99.5%, but customer-reported incidents accumulate at the 99.5% rate, far from the user's actual need. The signal that the SLO is wrong is not in the alert volume; it is in the customer-feedback channel and the postmortem ledger.

Mature SLO practice therefore includes a quarterly SLO review: pull the customer-reported incident list, the postmortem ledger, and the actual SLI measurement; check whether the SLO target is calibrated to the user's real need or to a number picked 18 months ago when the service launched. The review is editorial rather than technical — it requires comparing user-facing impact to the contract — and most teams skip it, which is why most teams' SLOs drift over time. The drift is silent; only the customer notices.

Where this leads next

The mechanics of SLO-based alerting branch into several adjacent topics in this curriculum. The next chapter (chapter 75) covers runbook-driven-alerts — the layer that converts an SLO page into a sequence of actions the on-call can execute. The pair "SLO-based alert + runbook-driven response" is the floor for sustainable on-call; either alone is half the system.

A second thread: the SLO target itself. A target of 99.5% or 99.9% is not a free parameter — it interacts with the engineering effort needed to maintain it (a 99.99% SLO costs roughly 10x the engineering of a 99.9% SLO at scale), with the customer's real need (users may not notice 99.9% vs 99.95% differences in many domains), and with the team's release cadence (a tight SLO penalises frequent deploys). The choice of SLO target is a business-engineering negotiation, not a technical decision.

  • /wiki/sli-slo-sla-the-definitions-that-matter — the formal definitions of SLI, SLO, SLA and how they compose.
  • /wiki/error-budget-math — the math behind burn rates, time-to-exhaust, and budget allocation.
  • /wiki/burn-rate-alerting — the alert-rule mechanics, debounce, and noise rejection.
  • /wiki/multi-window-multi-burn-rate-alerts — the canonical fast-burn / slow-burn ruleset.
  • /wiki/symptom-based-alerts-the-google-sre-book — the philosophical underpinning that SLO alerting operationalises.
  • /wiki/alert-fatigue-as-a-production-failure — what happens when the ruleset is dominated by raw-metric alerts.

A practical implication for an engineer reading this on a Friday at 11pm: open your team's alert ruleset and count the rules. Categorise each as raw-metric, SLO-based, saturation, or telemetry-meta. If the raw-metric category dominates, your team is in the regime Riya was in at 11:42 IST. The migration is not done in a day, but the first SLO — pick the most-paged service, define the user-facing SLI, write the burn-rate alert, and delete the three or four raw-metric alerts it replaces — can be done in a week. The on-call rotation will notice within the next on-call cycle. That single migration, repeated across a dozen services over a quarter, is the largest reduction in alert noise most teams ever ship.

References

  • Site Reliability Engineering (Google, O'Reilly 2016), chapter "Service Level Objectives" — the foundational text on SLO-based reliability engineering.
  • The Site Reliability Workbook (Google, O'Reilly 2018), chapters "Implementing SLOs" and "Alerting on SLOs" — the modern derivation of multi-window-multi-burn-rate alerting.
  • Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly 2022), chapter on alerting — the contemporary SRE framing of raw-metric vs SLO-based alerts.
  • Cindy Sridharan, Distributed Systems Observability (O'Reilly 2018), chapter on alerts — the case against cause-based alerting at scale.
  • Alex Hidalgo, Implementing Service Level Objectives (O'Reilly 2020) — the practitioner's handbook for SLO design and rollout.
  • Google SRE blog, "SLOs, SLIs, SLAs, oh my!" — accessible derivation of the burn-rate threshold mathematics.
  • /wiki/error-budget-math — the curriculum's chapter on burn-rate math.
  • /wiki/symptom-based-alerts-the-google-sre-book — the philosophical companion to this chapter.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 slo_vs_raw_alerting.py