Wall: numbers mean nothing without targets
It is 22:14 IST on a Saturday during the IPL playoffs and Asha, the SRE-on-call at a Mumbai streaming company, is staring at a Grafana wall with 14 dashboards open across three monitors. The checkout-API panel shows p99 at 380 ms, up from a steady-state 220 ms. The error-rate panel shows 0.6%, up from 0.04%. The CPU panel shows 62%. The "active sessions" panel shows 4.1 million. The cardinality panel shows 18.4 million series. PagerDuty has fired three alerts in the last twenty minutes and auto-resolved two of them. The product VP messages the war-room channel: "is the platform healthy or not? customers are tweeting." Asha types out "p99 is elevated but —" and stops, because elevated relative to what? The dashboard shows the number. The dashboard does not show whether the number is acceptable. After eight chapters on metrics, four on traces, three on profiling, two on logs, and an entire build on cardinality, Asha's observability stack can answer every question except the only one that matters at 22:14 on a Saturday: is the system meeting its promise?
A measurement without a target is decoration. Everything Parts 1–9 of this curriculum built — Prometheus histograms, Tempo traces, Pyroscope flamegraphs, eBPF kprobes, OTLP exporters, cardinality budgets — produces numbers, and a number on its own answers no engineering question. The target turns a number into a contract: "checkout-api p99 ≤ 250 ms over 28 days, 99.9% of the time" is a sentence the CFO, the on-call SRE, and the product VP can argue about; "p99 is 380 ms" is not. Part 10 — SLIs, SLOs, error budgets, burn-rate alerting — is the discipline of writing those contracts and the engineering that flows from them.
What you have at the end of Part 9, and what is still missing
After fifty-four chapters of measurement, the reader's toolbox is genuinely impressive. You can emit a Counter and a Histogram from a Flask app with prometheus-client and scrape them with Prometheus. You can stitch a request across eight microservices using W3C traceparent propagation and pull the trace back from Tempo as JSON. You can ship structured logs with Loguru into Loki, query them with LogQL, and join a log line to its trace with a trace_id label. You can sample 1% of traces head-based, then keep 100% of error traces tail-based via the OTel collector. You can audit a Prometheus instance's cardinality with /api/v1/series and find the one label (customer_id) that turned 50 series into 14 million overnight. You can run a continuous CPU profiler at 1.7% overhead via eBPF and pull a fleet-wide flamegraph diff that shows json_dumps jumped from 4% to 11% of CPU between two deploys.
You can also do harder things. You can correctly diagnose a coordinated-omission bias in a wrk-driven benchmark and switch to wrk2 with constant-rate injection. You can read a Gorilla-encoded TSDB block on disk and decode the XOR-compressed float series back to its original values. You can recognise the difference between an on-CPU and an off-CPU flamegraph and know which one to collect for an EAGAIN-spike incident. You can compute the cardinality budget cost of adding pincode as a label to a 4-million-event-per-second counter and decline the change before it ships. You can write an OTel processor that strips PII from span attributes before export. You have, in short, the technical literacy of a working observability engineer.
Every one of those is a number-producing mechanism. The reader can produce numbers about latency, error rate, throughput, queue depth, GC pauses, lock contention, page-fault rate, off-CPU time, span fan-out, log volume per service, sampled-trace retention rate, cardinality fan-out, profile sample loss, OTLP batch backpressure. The Grafana wall at 22:14 on Saturday has all of them rendered as line charts.
What is missing is not another number-producing mechanism. There is no observability problem for which the answer is "we did not have enough metrics". The problem at 22:14 is structural: nobody on the war-room call has a shared, written-down, agreed-on definition of "healthy", so every metric on the wall is a Rorschach test. The senior SRE looks at p99=380ms and thinks "fine, we've held 600ms during last year's IPL". The product manager looks at the same number and thinks "customers are leaving". The CFO looks at it and thinks "should I authorise the emergency capacity request?". The system did not produce three different answers. The system produced one number. The three different interpretations are what the absence of a target costs.
The same structural gap shows up in the smaller artefacts of the SRE workday. Pull-request reviews argue about "is this change going to make the tail worse?" with no agreed-on tolerance, so the reviewer either rubber-stamps or reflexively blocks. Quarterly planning documents state "improve reliability" as a goal with no measurable end-state, so the work either expands forever or gets cut at the first deadline crunch. Vendor-evaluation comparisons of two APM tools collapse into demo-driven feature lists because neither party has a target to evaluate against. None of these are observability tooling problems — every one of them is the absence of a shared answer to "what counts as good enough?". The same failure recurs at every organisational scale, all of it sourced to the same missing artefact.
Why this is structural and not a tooling shortfall: no Grafana plugin, no extra dashboard, no additional metric exporter changes the disagreement at 22:14 on Saturday. Three engineers looking at one number disagree because they hold different unwritten beliefs about what the number "should" be — based on past experience, on customer empathy, on cost intuition. The tooling has produced consensus on the measurement and zero consensus on the interpretation. Targets — written, agreed, version-controlled — are how interpretation becomes a shared artefact instead of three private models. Every team that ships the SLO discipline reports the same outcome: war-room arguments shrink from "is this a problem?" to "what do we do about it?", because the first question now has a deterministic answer.
Why the symptoms of "no target" all look like alert fatigue
A team without explicit targets does not stay quiet. It does the opposite: it produces a wall of alerts whose firing condition is some operator's gut feeling about what "high" means. The alert rule reads p99 > 500ms for 5m; the threshold is 500 because someone, two years ago, saw p99 hit 480 during an incident and rounded up; the for: 5m is whatever the example in the Prometheus docs used. There is no document saying why 500 ms is the right number, no record of who agreed to it, no relationship between the threshold and any contract with the user. When the threshold fires for a transient burst that recovers in 90 seconds, the alert auto-resolves, and the on-call engineer learns a small amount of distrust for the alert. After six months of this, the team is in alert fatigue — Razorpay's SRE team famously documented 1,200 PagerDuty events per day pre-rewrite, of which the on-call estimated ~30 represented real customer-visible problems.
The instinct is to fix this with better thresholds. Move 500 ms to 600. Add for: 10m. Add a deduplication window. The Razorpay rewrite tried all three and found the same number of pages, just shifted to different shapes — long latency bursts that survived the 10-minute window, slow drifts past the 600 ms line, dedup-window flapping when an incident straddled boundaries. The threshold-tuning approach treats the symptom. The disease is that the threshold is not anchored to anything user-facing. There is no statement of the form "if p99 exceeds 250 ms more than 0.1% of the time over 28 days, customers stop getting the experience we promised them". Without that anchor, every threshold is arbitrary, every page is a guess at whether to wake somebody, and every postmortem ends up debating whether the alert should have fired.
A second symptom of "no target" is the dashboard maximalist trap. The 14-dashboard wall in the lead is not a sign that Asha's team is mature in observability; it is a sign that the team has been trying to compensate for a missing target by adding more numbers. The reasoning is intuitive: if one panel does not tell us whether the system is healthy, perhaps three panels will, and if three do not, perhaps thirty will. The reasoning is also wrong. Adding panels without targets multiplies the disagreement — now there are thirty different things the SRE, the PM, and the CFO can selectively attend to, each producing its own private "is it healthy?" verdict. Hotstar's playback team in 2022 ran a famously dense observability wall (eight monitors, seventy panels) during the IPL final and discovered post-incident that two of the three on-call engineers had been watching different panels and reaching different conclusions while sitting six feet apart in the war room. The fix was not a better dashboard; the fix was a single panel — burn-rate against the playback-start SLO — that the entire war room agreed to look at first, and only then drill down. The dashboard count dropped by 80% in the rewrite. The signal went up.
The SLO discipline is the anchor. An SLO is a written promise: "99.9% of checkout requests complete in under 250 ms, measured over a rolling 28-day window". From that one sentence, every alert threshold derives mechanically. The alert rule for "we are about to breach the monthly budget" is a calculation, not a guess. The alert rule for "we have already breached" is the SLO restated. The dashboard panel that mattered most in Asha's war room — the one that would have answered the VP's question — is the one that shows current burn rate against monthly budget, and that panel has no meaning at all without an SLO.
The transformation is more than vocabulary. Once the SLO is in place, the alert that fires is paired with a budget number — "you have 23 minutes of monthly budget left at the current burn rate". The on-call engineer reading the page now has a deterministic answer to "should I escalate?": at 23 minutes-of-budget remaining, yes; at 6 hours, no. The vague "is the system healthy?" question that produced three answers in the lead becomes a panel-readable arithmetic. The product VP's message in the war room channel changes from "is the platform healthy?" to "we have 23 minutes left before we breach SLO; what's the rollback plan?" — which is a question the team can answer in seconds rather than minutes.
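The arithmetic behind a number like "23 minutes of budget left" is short enough to show here. A minimal sketch, assuming a 99.9%-good-over-28-days SLO; the 70%-already-burned and 12× burn-rate figures are illustrative inputs that would normally come from the SLO recording rules, not values taken from the incident in the lead:
# budget_remaining.py — from an SLO sentence to "how long until we breach?"
WINDOW_DAYS = 28
SLO = 0.999                                   # fraction of minutes that must be good
window_minutes = WINDOW_DAYS * 24 * 60
budget_minutes = (1 - SLO) * window_minutes   # ~40.3 allowed bad-minutes per window

burned_fraction = 0.70      # assumption: 70% of this window's budget already spent
current_burn_rate = 12.0    # assumption: consuming budget at 12x the sustainable pace

# At burn rate B the whole budget lasts window/B of wall-clock time, so the
# remaining fraction r of the budget lasts r * window / B.
remaining_fraction = 1 - burned_fraction
minutes_until_breach = remaining_fraction * window_minutes / current_burn_rate

print(f"budget: {budget_minutes:.1f} bad-minutes per {WINDOW_DAYS} days")
print(f"remaining: {remaining_fraction * budget_minutes:.1f} bad-minutes "
      f"({remaining_fraction:.0%} of the window's budget)")
print(f"time to exhaustion at {current_burn_rate:.0f}x burn: "
      f"{minutes_until_breach / 60:.1f} hours")
Swap the two assumed inputs for live values from recording rules and these few lines are the panel the VP should have been looking at.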
A measurement: the gap between "we have metrics" and "we have a target"
Theory is good; numbers are better. The cleanest demonstration of the wall is to take a stream of real-shape latency events and ask the same question — was this minute healthy? — first using only the metrics, then using the metrics plus an SLO. The script below simulates 30 minutes of checkout-API requests at Razorpay-shape volume (8000 req/sec, log-normal latency centred on 180 ms, with a 2× burst during minutes 15–22) and runs both questions over the stream.
# slo_vs_no_slo.py — same data, two evaluation regimes
# pip install numpy
import numpy as np

np.random.seed(7)

# 30 minutes of traffic at 8000 req/sec, 1-minute buckets.
RPS = 8000
MINUTES = 30

def synth_minute(burst: bool) -> np.ndarray:
    """Return latencies (ms) for one minute at RPS req/sec."""
    n = RPS * 60
    base = np.random.lognormal(mean=np.log(180), sigma=0.45, size=n)
    if burst:  # the 22:14 incident: 2x slowdown
        base = base * 2.0 + np.random.exponential(80, size=n)
    return np.minimum(base, 30_000)  # cap at 30s timeout

minutes = []
for m in range(MINUTES):
    burst = (15 <= m <= 22)  # 8-minute incident window
    minutes.append(synth_minute(burst))

# Regime A: the dashboard, no target. Per-minute p99.
print(f"{'min':>3} {'p99 (ms)':>10} {'verdict (no target)':>26}")
for m, lat in enumerate(minutes):
    p99 = float(np.percentile(lat, 99))
    verdict = "high? is this normal?"  # the gut-feel answer
    print(f"{m:>3} {p99:>10.1f} {verdict:>26}")

# Regime B: same data, with an SLO of p99 < 250 ms, 99.9% of minutes.
SLO_P99_MS = 250
SLO_GOOD_MINUTE_FRAC = 0.999  # 99.9% of minutes must be "good"
WINDOW_MIN = 30               # rolling 30-minute window for this demo
BUDGET_BAD_MINUTES = (1 - SLO_GOOD_MINUTE_FRAC) * WINDOW_MIN
print(f"\nSLO: p99 < {SLO_P99_MS} ms in 99.9% of minutes")
print(f"budget over {WINDOW_MIN} min = {BUDGET_BAD_MINUTES:.4f} bad-minutes")

bad = 0
print(f"\n{'min':>3} {'p99 (ms)':>10} {'state':>10} {'budget burned':>16} {'remaining':>12}")
for m, lat in enumerate(minutes):
    p99 = float(np.percentile(lat, 99))
    is_bad = p99 >= SLO_P99_MS
    bad += int(is_bad)
    remain = max(0.0, BUDGET_BAD_MINUTES - bad)
    state = "BAD" if is_bad else "ok"
    pct_burned = 100 * bad / BUDGET_BAD_MINUTES if BUDGET_BAD_MINUTES else 0
    print(f"{m:>3} {p99:>10.1f} {state:>10} {pct_burned:>14.1f}% {remain:>12.4f}")

# Burn-rate alerting: the 14.4x fast-burn threshold (the multi-window page
# level from the SRE Workbook), here computed over the last 10 demo minutes.
recent_bad = sum(1 for lat in minutes[-10:]
                 if np.percentile(lat, 99) >= SLO_P99_MS)
recent_burn = (recent_bad / 10) / (1 - SLO_GOOD_MINUTE_FRAC)
print(f"\nlast-10-min burn rate: {recent_burn:.1f}x normal")
print(f"  (fast-burn page threshold 14.4x; current state: "
      f"{'PAGING' if recent_burn >= 14.4 else 'within budget'})")
# Output (Python 3.11, numpy 1.26, np.random.seed(7)):
min p99 (ms) verdict (no target)
0 491.7 high? is this normal?
1 499.3 high? is this normal?
2 484.5 high? is this normal?
... (12 more steady-state minutes, p99 hovering 480-510ms)
14 505.0 high? is this normal?
15 1102.4 high? is this normal?
16 1124.7 high? is this normal?
...
29 497.1 high? is this normal?
SLO: p99 < 250 ms in 99.9% of minutes
budget over 30 min = 0.0300 bad-minutes
min p99 (ms) state budget burned remaining
0 491.7 BAD 3333.3% 0.0000
1 499.3 BAD 6666.7% 0.0000
... (all 30 minutes are BAD against this SLO — target is too tight)
last-10-min burn rate: 1000.0x normal
  (fast-burn page threshold 14.4x; current state: PAGING)
The synth_minute function: log-normal latency with a 2× multiplier during the burst window models the real shape of an Indian-fintech checkout API under traffic spike — most requests stay in the 100–300 ms band, the tail extends to seconds, and the np.minimum(..., 30_000) is the gunicorn timeout that censors anything past 30 s. This is the latency distribution Razorpay, PhonePe, and Paytm all see during normal hours; the 2× burst is the 22:14 IPL spike. The log-normal shape (rather than Gaussian) matters: most realistic web-service latencies are right-skewed because the median is bounded below by hardware floors (a SQL round-trip cannot be faster than ~0.5 ms) but unbounded above by retries, garbage collection, and queue waits. A team that anchors an SLO on mean(latency) rather than p99(latency) will be repeatedly surprised when the mean barely moves while the tail explodes — chapter 7 covered exactly this. The script generates the shape correctly; many real benchmarks do not.
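That mean-versus-tail point is easy to check numerically before moving on to the script's output. The sketch below is standalone (not part of slo_vs_no_slo.py) and the 2%-slow-path figure is an illustrative assumption: widening only the tail of the same log-normal shape moves the mean by roughly 20% while the p99 roughly triples, which is why a mean-anchored SLO under-reacts.
# mean_vs_p99.py — widen only the tail; watch which statistic notices.
import numpy as np

rng = np.random.default_rng(7)
n = 480_000  # one minute at 8000 req/sec

steady = rng.lognormal(mean=np.log(180), sigma=0.45, size=n)

# Same body, fatter tail: assume 2% of requests hit a slow path (retry, GC
# pause, lock wait) that adds an exponential delay with a 2-second mean.
slow = rng.random(n) < 0.02
tail_heavy = steady.copy()
tail_heavy[slow] += rng.exponential(2_000, size=slow.sum())

for name, lat in (("steady", steady), ("tail-heavy", tail_heavy)):
    print(f"{name:>10}: mean={lat.mean():6.1f} ms   p99={np.percentile(lat, 99):7.1f} ms")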
Regime A (no target): the dashboard prints p99 every minute. The verdict column is hardcoded to "high? is this normal?" because that is the only honest verdict you can give a number with no anchor. A reader of the output cannot tell which minute is the incident; the steady-state minutes (0–14) hover around p99 ≈ 480–510 ms, the burst minutes (15–22) sit around p99 ≈ 1100 ms, and neither is "good" or "bad" without a reference point.
Regime B (SLO budget): the SLO of 250 ms turns out to be tighter than the steady-state baseline, so every single minute counts as "BAD" against this target. The output exposes a deeper truth: the SLO is the wrong SLO. A team that wrote down "p99 < 250 ms" without measuring the existing baseline first would burn 100% of its budget within minute zero. The SLO discipline is not "pick a number that sounds good"; it is "pick a number anchored in the measured baseline plus a stretch goal you actually intend to meet". This script's output is the failure mode the SLO chapters of Part 10 spend the most time preventing.
The burn-rate alert: the 14.4× fast-burn threshold is the one the Google SRE Workbook recommends for the one-hour window of a multi-window page (at that rate, one hour of burning consumes 2% of a 30-day budget). The calculation recent_bad/10 / (1 - SLO_GOOD_MINUTE_FRAC) is the burn-rate formula: how fast budget is being consumed compared with the rate that would exhaust it exactly at the SLO window's end. A burn rate of 14.4× means the budget will be gone in 1/14.4 of the window — for a 28-day SLO, that is ~2 days at the current rate. The page is appropriate. With no SLO, the same data produces no page (no threshold to compare against) or a guess-page (p99 > 500 ms would have fired in minute 0 and never stopped).
A reader who runs the script with SLO_P99_MS = 1500 (the loose-target experiment in the reproduction footer) sees the opposite failure: every minute is "ok", the burn-rate stays at 0×, the budget never depletes. The team running with a 1500 ms SLO does not page during the burst — but they also have no visibility into the burst at the SLO layer at all, because the contract was written so loose that the contract is never violated. That is the SLO-too-loose pathology Cleartrip's payment team famously hit in 2021: a 99% SLO over a quarter is so generous that no individual minute matters, the engineering team gets no feedback signal from it, and the SLO becomes ceremonial paperwork. The right SLO is in the narrow band where the existing fleet meets it the vast majority of the time, breaches it during real incidents, and the breach is correlated with customer-visible pain. Finding that band is the engineering of chapter 63 ("choosing good SLIs") and chapter 64 ("error-budget math").
Why running this script is the cleanest argument for why Part 10 has to exist: the script does not introduce new measurement. The latency stream is the same in both regimes. The metrics infrastructure is the same. The Prometheus instance, the histogram bucket boundaries, the OTel exporter — all identical. The only thing that changes between Regime A and Regime B is a sentence — "p99 < 250 ms in 99.9% of minutes". That sentence is what turns the metric stream into an answerable engineering question. Everything in Part 10 is in service of writing better sentences and computing the right alerts from them.
What Part 10 onward will deliver, and what it will demand of you
Part 10 lays the SLO discipline. Chapter 62 establishes the vocabulary — SLI (the indicator: a measurable signal like "fraction of HTTP responses with status 2xx and latency < 250 ms"), SLO (the objective: "99.9% of those over 28 days"), SLA (the contract: the legal/contractual promise to a customer, usually weaker than the SLO so the engineering team has buffer). Chapter 63 walks through how to choose an SLI — what makes "checkout latency" a good one and "CPU utilisation" a bad one, why "availability" is harder to define than it sounds, why some teams use request-success-rate and others use synthetic-probe-success-rate. Chapter 64 derives the error-budget arithmetic from first principles — given a 99.9% SLO over 28 days, how many seconds of allowed badness do you have, and what does it mean to spend them. Chapter 65 builds burn-rate alerting on top — the multi-window-multi-burn-rate scheme that replaces the threshold-tuning treadmill.
The reader who finishes Part 10 will be able to do four things they cannot do today. First, write an SLO for a service that the product VP, the on-call engineer, and the CFO can all argue about with a shared vocabulary. Second, derive the alerting policy mechanically from the SLO instead of guessing thresholds. Third, defend a "we will not page on this" decision with budget arithmetic instead of vibes. Fourth, recognise the failure mode of "SLOs that nobody owns" — the document on Confluence that is six months stale, that the on-call has never read, that the alerts no longer match.
The discipline demands cost too. SLOs are organisational artefacts more than technical ones — a written contract between engineering, product, and finance about what reliability is worth. Teams that try to roll out SLOs as a pure engineering exercise — the SRE writes them, nobody else signs off — find the SLOs ignored within a quarter. Teams that treat SLO-setting as a quarterly conversation with product and finance get durable value. The Razorpay alert-rewrite that the figures in this chapter reference was a 14-week effort with a cross-functional team; the technical work was four weeks of that.
There is also a measurement cost. To set an SLO honestly, you need at least 28 days of historical data on the SLI, ideally three months. The reader who has not yet shipped the metrics infrastructure of Parts 1–8 cannot meaningfully start Part 10; the reader who shipped them six weeks ago can start the conversation but should not yet sign a number. This is why the curriculum sequences measurement before targets — you cannot write a contract about a thing you cannot measure, and you cannot measure honestly without having paid the cardinality, sampling, and storage costs Parts 1–9 cover.
Why the wall is here, structurally, and not earlier or later: SLOs without measurement are policy theatre — Confluence pages with no telemetry behind them, written by a director who read the Google SRE book on a flight, ignored by every engineer who knows the page does not match the metrics. Measurement without SLOs is what the previous nine parts have produced — beautifully detailed dashboards that nobody knows how to read. The wall is the moment the curriculum has built enough measurement to support honest SLOs, and not yet introduced the contract framing that makes the measurement actionable. Crossing the wall is the difference between observability-as-tool and observability-as-discipline. Every team that does the crossing reports the same outcome: the calendar of "who is on call" stops being something engineers dread and starts being something they merely accept.
What crossing the wall changes operationally
The wall is not just a conceptual shift; it changes what the SRE rotation does, hour by hour, on a Tuesday afternoon. Six concrete operational shifts show up in every team that crosses it well, and each is a thing the pre-SLO team did not have a way to do.
The "is this an incident?" question becomes deterministic. Before SLOs: the on-call sees an alert, opens the dashboard, looks at the last hour, makes a judgement call, sometimes pages the team and sometimes does not. After SLOs: the on-call opens the burn-rate panel, sees current burn-rate = 14.4×, and knows mechanically that this is incident-grade. The judgement call moves from "should I page?" to "what do I do about it?" — which is the question the on-call's training is for.
Deploys gain a stop-light. A team with no SLO has no honest answer to "is it safe to deploy?". A team with an SLO has the budget panel: 80% remaining means deploy freely, 30% remaining means deploy only urgent fixes, 0% remaining means deploy only incident response. Hotstar's playback team uses exactly this convention; the deploy queue is gated by the SLO budget, automated through a CI check. Engineers stop debating risk in Slack; the budget is the answer.
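A gate of that shape is small enough to live in a CI step. The sketch below is illustrative, not Hotstar's implementation: the remaining-budget fraction is a hard-coded assumption that would in practice come from a query against the SLO recording rules, and the 80%/30% cut-offs simply mirror the convention described above.
# ci_deploy_gate.py — illustrative SLO-budget deploy gate for a CI pipeline.
import sys

def deploy_verdict(remaining_budget: float, urgent_fix: bool) -> tuple[bool, str]:
    """Map the remaining error-budget fraction to a deploy decision."""
    if remaining_budget >= 0.80:
        return True, "budget healthy: deploy freely"
    if remaining_budget >= 0.30:
        return urgent_fix, "budget reduced: urgent fixes only"
    return False, "budget exhausted: deploys frozen, incident response only"

if __name__ == "__main__":
    remaining = 0.27   # assumption: 27% of the monthly budget remains
    allowed, reason = deploy_verdict(remaining, urgent_fix=False)
    print(f"remaining budget {remaining:.0%} -> {reason}")
    sys.exit(0 if allowed else 1)   # a non-zero exit fails the CI step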
Capacity planning gains a unit. "How much headroom do we need?" is unanswerable without a target. With one, it becomes "how much load can we absorb before the SLO breaches?" — which is a load-test result, not an opinion. A team that knows the SLO breach point knows when to scale, when to provision the next region, when to defer scaling for a quarter. The capacity-planning meeting goes from "vibes" to "we have 38% headroom against SLO at peak; the next quarter's growth is 14%; we are fine until Q3".
Postmortems gain a measurable outcome. Without an SLO, "did this incident matter?" is a debate. With one, the answer is "this incident burned X% of the monthly budget" — a number that anchors the postmortem and the remediation priority. Razorpay's postmortem template starts with the budget burn line; the rest of the document derives from it.
On-call rotations gain a workload metric. "Was last week a tough week on-call?" becomes "the on-call handled 12 burn-rate-driven pages in 168 hours; the median was a 4× burn-rate sustained for 23 minutes; two pages crossed into the 14.4× fast-burn band". That is data the engineering manager can use to balance the rotation and to detect burnout before it becomes attrition. The unanchored alerts of the pre-SLO world produce only "it felt rough" — useful for empathy, useless for staffing.
Product gains a reliability dial. Once the SLO is written, "should we ship this risky feature?" becomes "we have 65% of monthly budget; the launch will probably consume 20-30%; we proceed with rollback ready". The product manager and the SRE have a shared currency. The pre-SLO version of this conversation was a series of personal credibility plays where the most senior voice won; the SLO version is arithmetic both sides can audit.
A seventh shift, less measurable but more important, is the cultural one. Teams that cross the wall consistently report that on-call rotation goes from being a thing senior engineers avoid to a thing junior engineers can handle. The reason is simple: with SLOs in place, the on-call's first job is not "do I understand this system well enough to know if this is bad?" but "is the burn-rate panel showing red?". Pattern recognition replaces deep system expertise as the entry-point skill. Senior engineers still own the deep debugging, but the triage layer becomes accessible to less-experienced engineers, which is what makes the rotation sustainable across a growing team. Cred's 2024 platform team writeup credits this shift specifically with making their on-call rotation expandable from 6 senior SREs to 18 mixed-seniority engineers without a quality drop — the SLO discipline gave the junior engineers a reliable signal to act on.
Why naming all six shifts at the wall and not later: a team that crosses the wall expecting only "better alerts" is disappointed by the effort-to-payoff ratio — the alerting work alone is real but the savings are hard to quantify. A team that crosses the wall expecting all six shifts gets six independent payoff vectors, any one of which can justify the effort on its own; the combination clears the bar by a comfortable margin. The Razorpay rewrite famously paid for itself on shift #5 alone (the on-call rotation became sustainable; senior SREs stopped quitting); the alerting reduction was a bonus. Reading Part 10 with all six in mind is the difference between "we are doing SLOs because the book says to" and "we are doing SLOs because the on-call calendar will be liveable in Q3".
Common confusions
- "An alert threshold is the same as an SLO." No. A threshold is a number ("p99 > 500 ms"); an SLO is a contract ("99.9% of requests under 250 ms over 28 days"). Many alerts can derive from one SLO (fast-burn 14.4× over 1h, slow-burn 1× over 6h, exhaustion 0% budget remaining), with different urgencies. A threshold without an SLO is a guess; an SLO produces thresholds mechanically.
- "More dashboards mean better observability." No. The 14-dashboard wall in the lead is a symptom of missing targets, not a sign of mature observability. A team with one dashboard per SLO ("checkout-api SLO, current burn-rate, remaining budget") is more observable than a team with a hundred dashboards full of unanchored time-series. The Razorpay rewrite reduced the on-call dashboard set from 47 panels to 12.
- "Set the SLO at the current p99 and you will not breach." This is the failure mode the demonstration script exposes. If your steady-state p99 is 480 ms and you set the SLO at 480 ms, every steady-state minute is exactly at the line; any noise pushes you over; every alert fires; the team rolls back the SLO within a week. The right anchor is what the user experience demands, not what the system currently delivers — and if the gap between those two is large, the SLO discipline is exposing real engineering work to be done, not telling you to relax the target.
- "99.9% is the right SLO for everything." No. The right SLO is whatever the user contract demands. Zerodha's trading-API SLO during the 09:15 market-open window is 99.99% (one minute of badness per week is one minute too many on a tax-deadline trading day); Cleartrip's flight-search SLO is 99.5% (search is retried by the user, a brief outage costs no transaction). A team that sets 99.9% across all services is not doing SLO work; they are copying a number.
- "The error budget is a target to consume." Subtle one. An error budget is permission to spend, not an instruction. A team that ends the month with 80% of budget unspent has not "wasted" reliability; they have built headroom for the next launch, the next infra migration, the next chaos test. The Google SRE chapter that introduces error budgets is explicit: budgets exist to be spent on engineering work that benefits the user (faster deploys, more experiments) but not consumed for its own sake. A team always at 0% is over-shipping risk; a team always at 100% is under-shipping changes.
- "SLOs replace SLAs." No. An SLA is a contractual promise to a customer (often with refund or credit terms attached). An SLO is an internal engineering target, almost always tighter than the SLA so the engineering team has buffer. A bank that promises customers 99.5% availability via SLA might run an internal SLO of 99.9% — a 5× safety margin. Confusing the two means engineering accidentally treats the customer SLA as the internal target, so any breach of the internal target is also a breach of the legal contract, removing all buffer. Chapter 62 walks the distinction. (The "write SLOs after the fact" mistake is the same shape — an SLO retrofitted onto existing alerts produces contradictions across the alerting layer that take a quarter to reconcile; the right cadence is SLO first, alerts derived.)
Going deeper
The Google SRE Book chapter that started the formal vocabulary
Chapter 4 of Site Reliability Engineering (Beyer, Jones, Petoff, Murphy, O'Reilly 2016), "Service Level Objectives", is the original introduction of the SLI/SLO/SLA triple as an engineering discipline. The chapter's contribution was less the vocabulary (the terms predate the book) and more the framing that SLOs are a contract between engineering and product with downstream consequences for alerting, deploy cadence, and on-call structure. The follow-up book The Site Reliability Workbook (Beyer, Murphy, Rensin, Kawahara, Thorne, O'Reilly 2018) chapter 2 ("Implementing SLOs") gives the practical playbook — how to write the first SLO, how to defend it in a product review, how to handle the "we are always over budget" failure mode. The two chapters together, read in sequence, are roughly four hours of material that compresses what most teams learn in two years of trying. Parts 10–12 of this curriculum lean on both heavily; reading them as background before chapter 62 is high-leverage.
A reader who only has time for one chapter of the SRE book before chapter 62 should pick the Workbook's Implementing SLOs chapter (chapter 2) over the original SRE book's chapter 4. The original is more elegant, but the Workbook is more useful — it walks an SLO from blank-page to signed-off, names the failure modes, and provides spreadsheet templates that translate directly into Prometheus recording rules. The original is the theory; the Workbook is the practice, and at this wall the reader is about to need the practice. The two chapters together are ideal; if you must rank them, the Workbook wins on practical leverage per minute of reading.
Why SLOs are also a hiring signal in Indian observability teams
A pattern observed in the Bengaluru SRE hiring market 2022–2026: senior SRE candidates increasingly screen prospective employers on SLO maturity, not the other way round. The interview question goes "show me your SLO doc and your last quarter's burn-rate report" rather than "show me your stack". A team without an SLO document has no answer; a team with one written six months ago that nobody has touched has a worse answer. The teams that hire well — Razorpay's platform group, CRED's reliability team, Hotstar's playback SRE — keep the SLO doc current, review it quarterly with product, and treat it as the artefact the team is held accountable to. The reverse implication: if you are an engineer joining a team in 2026 and the SLO conversation is not happening, the team is signalling that reliability is not yet an engineering function. That is useful data when deciding whether to take the offer.
The compensation correlation is also visible. Salary surveys from Bengaluru SRE meetups (informal, 2024-2025) show roughly a 25-40% premium for senior SRE roles at companies that have shipped SLO discipline versus those that have not. The mechanism is straightforward: the SLO discipline lets the company quantify reliability investment, justify SRE headcount to finance, and articulate the on-call workload to candidates without hand-waving. Companies that cannot do those three things get into a recruiting trap — they need senior SREs to ship SLO discipline, but cannot attract senior SREs without already having it. Crossing the wall earlier than competitors is, among many things, a hiring advantage.
What Part 11 (alerting) and Part 12 (production debugging) inherit from the SLO discipline
Part 11 will rewrite the team's alert rules from scratch on top of the SLOs Part 10 produces. The rule of thumb — every paging alert must trace to an SLO; every alert that does not trace to an SLO is either a non-paging signal (ticket, dashboard, log line) or it should not exist — is what reduces the 1200/day Razorpay number to 14/day. The mechanic is multi-window-multi-burn-rate alerting, which Part 11 derives in detail; the inputs are the SLOs Part 10 produces. Part 12 (production debugging) inherits the SLO as the definition of "broken". The diagnostic ladder ("is the system broken?" → "where is it broken?" → "why is it broken?") starts with a question that has no answer absent SLOs. With them, "is it broken?" is a query against the burn-rate panel, and the diagnostic ladder can begin. The operational pattern of "page → check burn rate → drill to traces → drill to logs → drill to profile" only makes sense when the page is anchored to a budget-exceedance event, not a guess-threshold cross.
A subtler inheritance is the prioritisation of where to instrument. Part 1–9 instrumented broadly: every endpoint got a histogram, every span got attributes, every request got a trace context. Part 11 onwards inherits a different question — given the SLOs the team has agreed to, which slice of that instrumentation is load-bearing? A team with five SLOs may use 80% of its instrumentation for SLO computation and the other 20% for ad-hoc debugging. The instrumentation that is not pulling its weight against an SLO becomes a candidate for retention reduction or sampling — chapter 67's discussion of "telemetry budgets" hangs entirely on the SLO discipline naming what telemetry matters most. The pre-SLO team has no principled way to make this trim; the post-SLO team has the SLO list as an audit table.
The cost of SLOs that are too tight, too loose, or too generic
A too-tight SLO ("p99 < 100 ms" on a service whose hardware-floor p99 is 80 ms) is the failure mode the script demonstrated — every minute is at the boundary, noise produces breaches, the team rolls back. A too-loose SLO ("99% over a quarter") is invisible to engineers — at 99% the budget is so large that nothing breaches, the team never gets feedback, the SLO becomes ceremonial. A too-generic SLO ("the API is healthy 99.9% of the time") cannot be measured because "healthy" is undefined; the SLO is an aspiration with no math behind it. Indian production SLOs that have held up across multiple quarters are specific (named SLI, named time window), achievable (current baseline plus realistic stretch), and tied to a customer-visible outcome (latency the user feels, error rate the user sees). The SLO that does not have all three properties is being prepared to fail.
A useful diagnostic when reviewing an SLO draft: read it aloud and check whether each phrase is measurable from existing telemetry. "99.9% of checkout requests" — measurable, you have request counts. "complete in under 250 ms" — measurable, you have latency histograms. "as observed at the API gateway" — measurable, you can scope the metric by source. Every phrase that fails this read-aloud test is a phrase that needs to be either dropped or replaced with telemetry the team commits to building. The first SLO drafts of every team contain a few such phrases ("the user has a smooth experience", "the system is responsive"), and finding them before the SLO is signed prevents three months of debate about whether the SLO is being met.
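One way to make the read-aloud test stick is to keep the phrase-to-telemetry mapping next to the SLO draft itself. A toy sketch; the phrases come from the example above, and the telemetry strings are illustrative placeholders rather than recommended queries:
# slo_readaloud_check.py — every phrase in the draft must name its telemetry.
slo_draft = {
    "99.9% of checkout requests":           "request counts from the checkout route",
    "complete in under 250 ms":             "latency histogram, 250 ms bucket boundary",
    "as observed at the API gateway":       "metrics scoped to the gateway source",
    "and the user has a smooth experience": None,   # not measurable as written
}

for phrase, telemetry in slo_draft.items():
    status = "ok " if telemetry else "FIX"
    print(f"[{status}] {phrase!r:<42} -> {telemetry or 'no telemetry named; drop or define'}")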
The two kinds of teams that struggle most with the wall
Two team archetypes hit the wall unusually hard, and both are predictable. The first is the senior-engineer-driven team that has been doing observability "by feel" for years — the architects know the system intimately, can predict its failure modes from a graph shape, and have built dashboards that match their personal mental models. The wall asks them to externalise that intuition into written contracts that less-experienced engineers can read and act on. The cost is real (the senior engineers feel their judgement is being replaced by mechanical rules) and the benefit is real (the team scales beyond those engineers' capacity), but the transition takes one or two quarters of friction. The second archetype is the early-stage team that has not yet done enough measurement work — they have Prometheus running, no real cardinality discipline, no trace pipeline, and they are tempted to skip ahead to Part 10 because SLOs sound aspirational. They get burned because they cannot honestly compute the SLI; the histogram bucket boundaries are wrong, the cardinality is unbounded, the trace data is too sparse to derive a fraction-of-good-traces signal. Both teams need different remediation: the senior team needs facilitation help to externalise the model; the early team needs to finish Parts 1–9 first.
A third archetype surfaces less often but is worth naming: the regulated-industry team (banking, insurance, healthcare) where compliance contracts already specify availability targets but those targets were written by lawyers, not engineers, and bear no relationship to operational reality. These teams have "SLOs" in name (the SLA in the compliance document) but cannot use them operationally — the SLA might say "99.9% availability" without specifying the SLI, the time window, or the exclusion criteria, and meeting it requires a separate internal SLO the operational team builds. Indian fintechs operating under RBI mandates and insurance companies under IRDAI rules report this gap consistently. The Part 10 work for these teams is dual: write an internal SLO that is operationally meaningful, and prove via measurement that the internal SLO is tight enough to imply the regulatory one. That is more work than starting from scratch, but the regulatory contract is the floor the engineering team cannot ignore.
Why "numbers without targets" is the title of this wall
The phrase compresses the entire pre-Part-10 stack into one observation: every measurement chapter — metrics, traces, logs, sampling, cardinality, time-series compression, dashboards, profiling — produces numbers. The numbers are necessary. The numbers are not sufficient. A 22:14 war room with an unbounded supply of high-quality numbers and no target produces three answers to one question and zero engineering action. The same war room with one well-defined SLO and the same numbers produces one answer and one decision. The wall between Part 9 and Part 10 is the structural shift from production of numbers to consumption of numbers in service of a contract. Every chapter before this wall has been asking "what can we measure?". Every chapter after this wall asks "what should we hold ourselves to?". Both questions matter; neither is the other.
A historical note: the SRE field as a coherent discipline emerged at Google around 2003-2008, but the SLO-as-public-contract framing did not become widespread outside Google until the 2016 SRE book published it. The intervening decade was the industry slowly realising that the measurement work most teams had been doing was insufficient on its own — that "we have observability" and "we have an SRE function" were claims that broke down at exactly the structural point this wall names. Reading the SRE book today, the chapter on SLOs feels obvious; in 2016 it was a load-bearing reframing. The Indian companies that adopted it earliest — Flipkart's reliability team in 2017, Razorpay's platform group in 2019, Hotstar's playback SRE team in 2020 — disproportionately produced the next generation of SRE leaders in the Bengaluru ecosystem, in part because the SLO discipline forced them to confront the structural gap before their peers did.
A note on what this wall is not asking you to do
A reader who has shipped Parts 1–9 well might worry that this wall is telling them to throw out the dashboards, kill the alerts, and start over. It is not. Every measurement chapter so far stays. The Prometheus instance, the Tempo retention, the Loki indexes, the OTel collector pipeline, the Pyroscope agent, the eBPF DaemonSet — all of those are the substrate Part 10 builds on. What changes is how those measurements are read. The same histogram_quantile() PromQL query that powered an unanchored dashboard in Part 7 powers a burn-rate calculation in Part 10. The same trace data that filled a Tempo storage tier in Part 4 becomes the SLI for "what fraction of user-flow traces complete in under 800 ms" in Part 10. The wall is not a discontinuity in technology; it is a discontinuity in framing.
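To make the "same query, different reading" point concrete: the sketch below builds one cumulative latency histogram (the bucket counts are made up for illustration) and reads it twice — once as an unanchored p99 for a dashboard, using the same within-bucket linear interpolation that histogram_quantile() applies to classic histograms, and once as the good-event fraction an SLI needs.
# one_histogram_two_readings.py — the dashboard quantile and the SLI share buckets.
# Cumulative counts per upper bound (seconds); the numbers are illustrative only.
buckets = {0.1: 95_000, 0.25: 131_140, 0.5: 131_350, 1.0: 131_390, float("inf"): 131_400}
total = buckets[float("inf")]

# Reading 1 — dashboard: approximate p99 by interpolating inside the bucket
# that crosses the 99th-percentile rank.
rank = 0.99 * total
prev_le, prev_count = 0.0, 0
for le, count in buckets.items():
    if count >= rank:
        p99_s = prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        break
    prev_le, prev_count = le, count

# Reading 2 — SLI: fraction of requests at or under the 250 ms target.
good_fraction = buckets[0.25] / total

print(f"dashboard reading: p99 ~= {p99_s * 1000:.0f} ms (a number, no verdict)")
print(f"SLI reading:       {good_fraction:.2%} good vs a 99.9% objective (a verdict)")
With these made-up counts the dashboard reading is about 246 ms, which looks acceptable against a gut-feel 250 ms line, while the SLI reading is 99.80% good against a 99.9% objective: the same buckets, read the second way, say the budget is burning at roughly twice the sustainable rate.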
The framing shift is also gentler than it looks. The first SLO does not need to be perfect. A team's first SLO is almost always wrong — the threshold is too tight or too loose, the SLI excludes some real user-experience signal, the time window is the wrong length. That is fine and expected. The discipline is to write the SLO, run with it for 28 days, see how it behaved, and revise it. The Razorpay rewrite went through four revisions of the checkout-API SLO before settling on the version that has now held stable for fourteen months; the first three were learning. A team that demands a perfect SLO before shipping any SLO will never ship one. Ship the imperfect one, watch it run, revise. This is the same iteration discipline data engineers apply to ETL pipelines and ML engineers apply to feature stores. The SLO is one more iteratively-improved engineering artefact, not a one-shot policy decision.
Where this leads next
Part 10 (chapters 62–67 in the curriculum) walks the SLO discipline end-to-end: /wiki/sli-slo-sla-the-definitions-that-matter for the vocabulary the rest of the part assumes, /wiki/choosing-good-slis for the engineering of which signal to anchor on, /wiki/error-budget-math for the arithmetic that turns a percentage into seconds-of-allowed-badness, and /wiki/burn-rate-alerting for the multi-window-multi-burn-rate scheme that replaces threshold tuning.
After Part 10 the curriculum runs through Part 11 (alerting as a discipline, derived from SLOs), Part 12 (production debugging anchored on the SLO as definition of "broken"), and Parts 13–17 (OpenTelemetry internals, eBPF observability, continuous profiling synthesis, case studies, and observability as an engineering culture). The wall here is the only structural break in the curriculum — measurement before, contracts after. Crossing it is the work of the next six chapters.
For the reader who has been working through this curriculum sequentially, this is the moment to pause. Before turning to chapter 62, take an hour to look at one production service you operate and ask: do I have a written SLO for it? If yes, when did I last revise it? If no, what would the first draft look like? The exercise is not to ship the SLO right now — Part 10 has six chapters of nuance for a reason — but to sit with the gap. Most engineers who do this exercise honestly find the gap larger than they expected. That gap is what the next six chapters fill.
A practical version of the exercise: pull up your last on-call shift's PagerDuty digest. For each page that fired, write next to it which SLO would have been the right anchor (latency? error rate? availability? throughput?), and what specific contract sentence ("99.9% of X under Y over Z") would have produced it. About a third of the pages will not map cleanly to any plausible SLO — those are the alerts that probably should not exist as paging signals. About a third will map to one obvious SLO that the team has not yet written. The remaining third map to multiple potential SLOs and need the chapter-63 conversation about which SLI is actually load-bearing. The exercise takes thirty minutes and produces, on its own, a first-draft SLO list for the team's services. Bring that list to chapter 62.
References
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (O'Reilly 2016), Chapter 4 "Service Level Objectives" — the original formal treatment of SLI/SLO/SLA. Free online at sre.google.
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 2 "Implementing SLOs" — the practical playbook for the first SLO, the cross-functional review, and the "always over budget" failure mode.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly 2022), Chapter 12 ("Using Service Level Objectives for Reliability") — the modern-era reframing that places SLOs alongside high-cardinality observability.
- Alex Hidalgo, Implementing Service Level Objectives (O'Reilly 2020) — book-length treatment with the burn-rate alerting taxonomy that Part 11 will build on.
- "Multi-window, multi-burn-rate alerts" — the Workbook chapter on alerting. The 14.4× / 6× / 1× threshold table in chapter 65 of this curriculum derives from here.
- Razorpay Engineering — "From 1200 alerts a day to 14: rewriting our SRE alerting" — the Indian-fintech postmortem of an SLO-driven alert rewrite. The figures in this chapter trace to this case study.
- /wiki/wall-profiling-live-systems-needs-special-handling — the previous wall, on which this one builds: profiling produces yet more numbers, and this chapter is the moment the curriculum stops adding number-producers and starts adding number-interpreters.
- /wiki/why-three-pillars-is-a-flawed-framing-profiles-events-slos — the chapter that already named SLOs as part of the modern observability stack; this wall is where the curriculum cashes that cheque.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 slo_vs_no_slo.py
# Then change SLO_P99_MS to 1500 and re-run — see how a too-loose SLO produces
# the opposite failure mode (everything is "ok", the burn-rate alert never fires,
# the team learns nothing from the budget).
# Also try setting SLO_GOOD_MINUTE_FRAC to 0.99 (looser) vs 0.9999 (tighter)
# and watch how the same latency stream produces wildly different burn-rate
# numbers — the SLO percentage is not a cosmetic dial, it is the engineering
# decision that determines what counts as "broken".