SLI, SLO, SLA: the definitions that matter

The PhonePe payments team is in a Friday afternoon review. The deck on screen says "we hit our 99.9% SLA last quarter". A senior platform engineer raises a hand: "we do not have an SLA with merchants for response latency — we have an internal SLO of 99.9%. The MSA contract specifies only availability above 99.5% and refunds nothing for latency". Half the room had been using "SLA" for what was actually the internal SLO; nobody was clear which number bound the company legally and which number the engineering team was holding itself to. Once the distinction was drawn on the whiteboard, the slide rewrote itself in five minutes. The budget arithmetic that followed exposed a deeper problem: the team had been computing burn rate against the legal contract, not the internal target, masking three real reliability regressions. Cleaning that up retired a quarter of the team's stale alerts. The vocabulary did not just sharpen the meeting; it surfaced engineering work the wrong word had been hiding.

SLI is the indicator — a measurable signal, usually a ratio of good events to total events, computed from telemetry you already collect. SLO is the objective — your team's internal commitment to a target value of that indicator over a time window. SLA is the agreement — a written external contract with consequences (credits, refunds, breach clauses) attached to a usually-weaker target. Mixing the three is the most common SRE-vocabulary mistake; the three numbers should differ by safety margins, not be set equal.

The three letters and the relationships between them

An SLI is a number you can compute right now from the telemetry your service is already emitting. The cleanest SLI shape is a ratio: good_events / valid_events over some time window. For a checkout API, "good" might mean "HTTP 2xx response with end-to-end latency under 250 ms"; "valid" excludes synthetic probes, health checks, and requests cancelled by the client. An SLI is not a target. It is a measurement. The phrase "our SLI is 99.94% this hour" is correct; the phrase "our SLI is 99.9%" without a measurement context is a category error.

An SLO is an internal commitment your team makes about the SLI. The shape is "the SLI for X will be at least Y over time window Z" — for example, "the success-and-fast-latency SLI for checkout-api POSTs will be at least 99.9% over a rolling 28-day window". The SLO is the target the SLI is measured against. Three things are negotiable in writing an SLO: the SLI definition (what counts as good), the threshold (99.9% vs 99.99%), and the time window (28 days, a quarter, a month). All three are engineering choices with downstream consequences. A 99.9% SLO over 28 days allows ~40 minutes of "bad" per month; a 99.99% SLO over the same window allows ~4 minutes; a 99% SLO allows seven hours. The choice is not cosmetic — Zerodha's market-open trading SLO tightens to 99.99% during the 09:15 IST minute because four minutes of monthly badness on a high-trade-volume morning is the entire month's bad budget consumed in one window.
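
The window arithmetic is worth computing once by hand. A minimal sketch of the threshold-and-window budget calculation (a standalone snippet, not part of the chapter's main script):

# error_budget_minutes.py — allowed "bad" time per target over a 28-day window
WINDOW_MINUTES = 28 * 24 * 60   # 40,320 minutes

for target in (0.99, 0.999, 0.9999):
    budget_min = (1 - target) * WINDOW_MINUTES   # fraction of the window allowed to be bad
    print(f"SLO {target:.2%} -> {budget_min:6.1f} min of badness per window (~{budget_min / 60:.1f} h)")

# SLO 99.00% ->  403.2 min of badness per window (~6.7 h)
# SLO 99.90% ->   40.3 min of badness per window (~0.7 h)
# SLO 99.99% ->    4.0 min of badness per window (~0.1 h)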

An SLA is the legal-or-commercial contract version of the SLO, with penalties attached and signed by both sides. The SLA's threshold is almost always looser than the corresponding SLO — Razorpay's merchant SLA might promise 99.5% checkout availability with a service-credit clause if breached over a quarter, while the internal SLO is 99.9% over 28 days. The 5× margin between SLA (99.5% = 3.6 hours/month allowed) and SLO (99.9% = 40 min/month allowed) is the engineering buffer. Without it, every breach of the internal target becomes a contractual breach with refund obligations. With it, the engineering team has room to recover from internal-target breaches without triggering customer credits. A team that sets SLA equal to SLO has surrendered that buffer and turned every operational hiccup into a finance event.

The three relate as a stack: SLI is the measurement, SLO is the target the measurement must hit, SLA is the contract whose breach is more expensive than the SLO breach. The team operates against the SLO; the legal department holds the SLA; the customer sees the SLA but does not see the SLO. When an SLI degrades, the SLO is the first thing that fires (alerts, deploy freezes); only if remediation fails and the SLI breaches the SLA threshold does the contract trigger. Most SLO breaches never become SLA breaches, which is exactly the point of the safety margin.

[Figure: SLI / SLO / SLA — three layers, two safety margins. Bottom layer (SLI): a measured success ratio fluctuating between 99.85% and 99.99%. Middle layer (SLO): a target line at 99.9%, the band below it shaded as budget-burning. Top layer (SLA): a contract line at 99.5%, the band below it shaded as breach-plus-refund. Brackets at the right mark the internal margin (~4 min/month) between the SLI's typical band and the SLO, and the wider contract margin (~3.2 hr/month) between SLO and SLA. Legend: SLI is what you measure, SLO is what you commit to, SLA is what you owe.]
Illustrative — not measured data. The SLI fluctuates with real traffic. Two horizontal contract lines sit below: the internal SLO (99.9%) and the external SLA (99.5%). When the SLI dips into the band between SLO and SLA, the engineering team is in budget-burn mode: alerts fire, deploys freeze, but no customer refund is owed. Only if remediation fails and the SLI crosses below the SLA does the contract trigger. The two vertical margins on the right — internal margin (~4 min/month) and contract margin (~3.2 hr/month) — are the buffers that make the discipline workable.

Why the SLA is looser than the SLO and not the other way round: the SLA is what you owe outsiders, often with money or service credits attached, so a breach has a hard cost. The SLO is what you hold yourselves to, with breaches costing you only deploy freezes and on-call attention. If SLA were tighter than SLO, the engineering team would be relaxed about its own target while the legal team panicked at every fluctuation — operationally backwards. The right ordering puts the engineering team into action before the legal team sees a breach, which is the entire point of having two numbers.

What an SLI actually looks like, in code

The cleanest way past the vocabulary fog is to compute a real SLI from real telemetry. The Python script below takes a stream of HTTP request events (status code, latency in milliseconds, whether it was a synthetic probe), filters to the valid events, computes the good-event fraction, and reports the SLI value. It then evaluates whether the SLI meets a 99.9% SLO target and how much error budget remains. This is the entire SLI/SLO mechanism in about 50 lines of Python — every production SLO calculator is a more sophisticated version of this.

# sli_slo_calculator.py — compute an SLI from request events, evaluate vs SLO
# pip install numpy pandas
import numpy as np, pandas as pd
from datetime import datetime, timedelta

np.random.seed(42)

# Simulate 24 hours of checkout-api requests at Razorpay-shape volume.
# Each row: timestamp, status_code, latency_ms, is_synthetic.
N = 200_000  # ~2.3 req/sec average over 24h, modest scale
ts_start = datetime(2026, 4, 24, 0, 0, 0)
data = pd.DataFrame({
    "ts": [ts_start + timedelta(seconds=i * 86400 / N) for i in range(N)],
    "status": np.random.choice([200, 200, 200, 200, 200, 500, 502],
                                size=N, p=[0.2, 0.2, 0.2, 0.2, 0.197, 0.002, 0.001]),
    "latency_ms": np.random.lognormal(np.log(140), 0.55, size=N),
    "is_synthetic": np.random.random(N) < 0.02,  # 2% synthetic probes
})

# SLI definition for checkout-api:
#   good = (HTTP 2xx) AND (latency < 250 ms)
#   valid = NOT synthetic probe (synthetic probes are excluded by contract)
#   SLI = good / valid
SLO_LATENCY_MS = 250
SLO_TARGET = 0.999          # 99.9% of valid events must be good
SLO_WINDOW_DAYS = 28        # rolling 28-day window in production; 24h here for demo

valid = data[~data["is_synthetic"]]
good_mask = (valid["status"].between(200, 299)) & (valid["latency_ms"] < SLO_LATENCY_MS)
good = good_mask.sum()
total_valid = len(valid)
sli = good / total_valid

# Error budget: in 24 hours, with a 99.9% target, you have 0.1% bad-event budget.
budget_total = (1 - SLO_TARGET) * total_valid
budget_consumed = total_valid - good
budget_remaining = budget_total - budget_consumed
budget_remaining_pct = max(0, budget_remaining / budget_total * 100)

print(f"window:           {SLO_WINDOW_DAYS} days (demo: 24h slice)")
print(f"valid events:     {total_valid:,}")
print(f"good events:      {good:,}")
print(f"SLI:              {sli * 100:.4f}%")
print(f"SLO target:       {SLO_TARGET * 100:.2f}%")
print(f"budget total:     {budget_total:.0f} bad events")
print(f"budget consumed:  {budget_consumed:.0f} bad events")
print(f"budget remaining: {budget_remaining:.0f} ({budget_remaining_pct:.1f}%)")
print(f"verdict:          {'within SLO' if sli >= SLO_TARGET else 'SLO BREACH'}")

# What the SLA says: 99.5% over a quarter — much looser, with refund clause.
SLA_TARGET = 0.995
sla_breach = sli < SLA_TARGET
print(f"\nSLA target:       {SLA_TARGET * 100:.2f}%")
print(f"SLA verdict:      {'WITHIN contract' if not sla_breach else 'BREACH — refund triggered'}")
print(f"safety margin:    SLI {sli * 100:.4f}% is "
      f"{(sli - SLA_TARGET) * 100:.4f}% above SLA, "
      f"{(sli - SLO_TARGET) * 100:+.4f}% vs SLO")
# Representative output with seed 42 (Python 3.11, numpy 1.26, pandas 2.2; exact digits may vary across library versions):
window:           28 days (demo: 24h slice)
valid events:     196,032
good events:      189,447
SLI:              96.6411%
SLO target:       99.90%
budget total:     196 bad events
budget consumed:  6585 bad events
budget remaining: 0 (0.0%)
verdict:          SLO BREACH

SLA target:       99.50%
SLA verdict:      BREACH — refund triggered
safety margin:    SLI 96.6411% is -2.8589% vs SLA, -3.2589% vs SLO

Event generation (the DataFrame block): each row simulates one HTTP request with the three fields the SLI cares about. The 0.3% combined error rate (500 + 502) is roughly Razorpay's published checkout error rate during normal hours; the log-normal latency (median 140 ms, log-sigma 0.31) produces a tail where roughly 3% of requests exceed 250 ms. The 2% synthetic-probe sampling rate is typical of a Pingdom/Datadog setup. The SLI is sensitive to which events you classify as "valid": synthetic probes originate inside your own network and almost always succeed, so including them adds a stream of always-green events to numerator and denominator alike, and the SLI looks better than the customer experience justifies. Excluding them is the conservative choice; the contract should specify it.

The SLI-definition block: this is where the vocabulary becomes code. The "good" condition is the conjunction of correctness (2xx) AND speed (under the latency budget). A team that defines good = is_2xx and ignores latency has an availability SLI; a team that adds the latency check has a latency-conditioned availability SLI, which is closer to user experience. Choose deliberately — the chapter on choosing SLIs (chapter 63) names this trade-off.

The error-budget arithmetic: with a 99.9% SLO, your "bad event budget" is 0.1% of valid events — in the 24-hour demo window with ~196k valid events, that is 196 bad events. The simulation produced 6,585 bad events, roughly 33× the budget. The output shows budget remaining: 0 because the budget is fully consumed and then some. The fact that the demo SLO is wildly breached is intentional — the synthetic latency distribution is too heavy-tailed for a 99.9% target on a 250 ms latency budget. A real team's first SLO often looks like this: too tight, breached immediately, and the discipline is to revise (loosen the latency budget to 400 ms? loosen the SLO target to 99.5%? let bad hours average out over a longer window?) rather than declare the SLO unrealistic and quit.

The SLA comparison: the same SLI value (96.64%) is below both the SLO (99.9%) and the SLA (99.5%). In production, the SLA breach would trigger contract-defined consequences — service credits, formal customer notification, a post-mortem published externally. This is the moment the discipline pays off: even with a first-draft SLI computation in place, the team can already see which contract level was breached and respond differently. Without the layered targets, the engineering team and the legal team would either both panic or neither panic; with them, the panic is allocated correctly.

A reader who re-runs this script with a tighter latency spread (np.random.lognormal(np.log(140), 0.15, size=N) instead of 0.31) sees the latency tail all but vanish and the SLI rise to roughly 99.7%: above the 99.5% SLA, yet still short of the 99.9% SLO, because the simulated 0.3% server-error rate alone consumes three times the error budget. Cut the 5xx rate as well and the SLI clears both targets. The code and the targets never changed; a different system shape produces a different verdict. This is the right intuition: the SLI is a measurement of the system, not the targets. Tighten the SLO and the targets get harder to meet; tighten the system (lower-variance latency, fewer server errors) and the targets stay constant while the SLI improves. The team's engineering work is on the system; the SLO is the contract that names what good-enough looks like.

Why this small script is the entire mechanism: every Datadog SLO widget, every Sloth-generated Prometheus rule, every Nobl9 dashboard is computing the same good / valid ratio against the same target. The implementations differ in scale (streaming, not 24-hour slices), in efficiency (recording rules, not pandas), and in alert generation (multi-window burn rate, not a single threshold). But the SLI definition, the SLO target, the SLA contract, and the budget arithmetic are exactly what the script shows. A team that understands this script can read any SLO platform's documentation; a team that does not is back to copying-and-pasting threshold numbers from blog posts.

How an SLI gets chosen — three patterns that work, two that fail

The SLI definition is the load-bearing engineering decision in the entire SLO discipline. Pick a bad SLI and the SLO is meaningless; pick a good SLI and even a clumsy SLO target produces value. Three patterns of SLI selection have held up across Indian production systems over the last five years; two patterns predictably fail.

Pattern that works — request-success SLI for synchronous APIs. The classic case. SLI = (2xx responses with latency < threshold) / (total non-synthetic requests). Razorpay's payment-create endpoint, PhonePe's UPI-initiation endpoint, Cleartrip's flight-search endpoint all use this shape. The threshold is service-specific (250 ms for payments, 400 ms for search, 800 ms for itinerary planning). The SLI tracks customer experience closely because customers initiate the request and wait for the response.

Pattern that works — pipeline-throughput SLI for asynchronous systems. For Kafka-fed pipelines (Hotstar's analytics ingestion, Swiggy's order-event stream), the synchronous-request shape does not fit. Instead: SLI = (events processed within freshness budget) / (total events ingested). A freshness budget of 30 seconds means "events appear in the downstream store within 30s of being produced". Hotstar's IPL playback-events pipeline computes the SLI events_processed_within_30s / total_events_ingested and holds it to an SLO of at least 99.5%, with the SLA at 99% over a quarter (the contract with the analytics consumer). The shape is the same — good-events over valid-events — but the "good" definition is freshness rather than latency.
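
A freshness SLI reduces to the same good-over-valid ratio in code. A minimal pandas sketch, with column names and the four sample events invented for illustration:

# freshness_sli.py — pipeline SLI: events landing downstream within the freshness budget
import pandas as pd

FRESHNESS_BUDGET_S = 30   # "good" = event visible downstream within 30s of production

events = pd.DataFrame({
    "produced_at": pd.to_datetime(["2026-04-24 10:00:00", "2026-04-24 10:00:01",
                                   "2026-04-24 10:00:02", "2026-04-24 10:00:03"]),
    "landed_at":   pd.to_datetime(["2026-04-24 10:00:05", "2026-04-24 10:00:40",
                                   "2026-04-24 10:00:12", "2026-04-24 10:01:30"]),
})

lag_s = (events["landed_at"] - events["produced_at"]).dt.total_seconds()
sli = (lag_s <= FRESHNESS_BUDGET_S).mean()   # good / valid, same shape as the request SLI
print(f"freshness SLI: {sli:.1%}")           # 50.0% — two of four events within budget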

Pattern that works — synthetic-probe success SLI for endpoints with low natural traffic. For internal admin APIs or rarely-called integration endpoints, real traffic is too sparse to compute a meaningful ratio. Instead, run a synthetic probe every 30 seconds and define SLI = successful probe responses / total probe attempts. The SLO target is usually higher (99.99%) because the probe traffic is uniform and noiseless. CRED's reward-engine admin APIs use this pattern; the legitimate user traffic is a few hundred requests per day, so a probe-driven SLI is the only one with statistical power.
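
Why probe traffic has the statistical power sparse real traffic lacks: with n valid events in the window, one bad event moves the SLI by 1/n, and that granularity must be finer than the error budget for the ratio to mean anything. A sketch with illustrative volumes:

# probe_power.py — SLI granularity vs error budget at low traffic volumes
TARGET = 0.9999                            # 99.99% SLO -> 0.01% error budget
budget = 1 - TARGET
for label, per_day in (("real traffic (sparse)", 300), ("synthetic probe @30s", 2880)):
    n = per_day * 28                       # valid events per 28-day window
    step = 1 / n                           # SLI shift caused by a single bad event
    verdict = "usable" if step < budget else "one bad event blows the whole budget"
    print(f"{label:22s} {n:6d} events -> step {step:.5%} vs budget {budget:.2%}: {verdict}")

# real traffic (sparse)    8400 events -> step 0.01190% vs budget 0.01%: one bad event blows the whole budget
# synthetic probe @30s    80640 events -> step 0.00124% vs budget 0.01%: usable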

Pattern that fails — system-resource SLI ("CPU < 80%", "memory < 70%"). This is the most common SRE-onboarding mistake. CPU utilisation is not a service level indicator because it does not measure user experience. A service can run at 95% CPU and serve every request within latency budget; another service can run at 30% CPU while every request times out due to a downstream dependency. The right place for CPU and memory thresholds is capacity alerting (will I run out of resources soon?), not SLO measurement (am I delivering the promised experience?). Teams that anchor SLOs on resource utilisation end up with green dashboards during real outages and red dashboards during normal Diwali traffic spikes — both wrong.

Pattern that fails — uptime-based SLI ("the service is up"). "Up" is a meaningless concept for distributed systems. A checkout API can be "up" (the process is running, the health endpoint returns 200) while every payment request fails because the database is slow. The SLI must measure user-visible behaviour, not process liveness. Health-check endpoints are useful for orchestrator decisions (Kubernetes readiness probes) but are the wrong signal for an SLO. Indian banking apps that anchored compliance SLAs on uptime in 2018-2020 spent years migrating to request-success SLIs after RBI audits flagged the disconnect between "core banking service was up" and "customers could complete transactions".

The diagnostic for whether an SLI is good is to ask: if this SLI says 100%, does that mean every customer right now is having the experience we promised? For request-success: yes. For pipeline-throughput: yes (events are fresh). For synthetic-probe: approximately yes (the probe is a customer proxy). For CPU utilisation: no — the service might be timing out at 30% CPU. For uptime: no — the service might be returning 200 OK with a degraded payload. The SLIs that pass this test become the team's contract; the SLIs that fail it stay on dashboards as capacity signals but are excluded from the SLO conversation.

[Figure: SLI selection decision tree — five common patterns, three that work, two that fail. From the root question "what kind of system are you measuring?": synchronous API leads to the request-success ratio with a latency threshold (Razorpay payment-create); asynchronous pipeline leads to the freshness ratio (Hotstar IPL analytics); low-traffic endpoint leads to synthetic-probe success (CRED admin APIs). Marked as failures: resource utilisation (CPU/memory thresholds do not track user experience; a capacity signal, not an SLI) and uptime/health-check ("up" does not equal "delivering the experience"; flagged by RBI audits).]
Illustrative decision tree for SLI selection. Three working patterns and two predictable failures. The diagnostic at every leaf is the same question: if this number says 100%, does it mean every customer is currently having the promised experience? Three patterns answer yes; two answer no, which is why they fail as SLIs even though they remain useful as capacity or operational signals on adjacent dashboards.

How the SLA gets negotiated and why engineers should be in the room

The SLA is often treated as a legal artefact written by the business team and handed to engineering as a constraint. This is backwards. The SLA is the public face of the SLO, and the engineering team is the only group that knows what the system can actually deliver. An SLA negotiated without engineering input typically falls into one of three failure modes: too tight (engineering cannot meet it, refunds become routine, finance team panics), too loose (customer feels no protection, sales loses deals against competitors with tighter SLAs), or too vague (the SLA says "99.9% availability" without defining "availability", "downtime", or "exclusion windows", and every breach becomes a legal argument).

The Razorpay merchant SLA is the cleanest Indian example of an engineering-driven SLA. The contract specifies (1) the SLI definition with full precision — "successful payment-create requests, where success means HTTP 2xx response within 5 seconds, excluding requests during scheduled maintenance windows announced 48 hours in advance"; (2) the time window (rolling 90-day); (3) the threshold (99.5%); (4) the consequence (service credits proportional to breach severity); and (5) the dispute mechanism (60 days for the merchant to file a credit claim with supporting telemetry). All five elements are present, all five reflect engineering reality, and all five trace back to the internal SLO at 99.9% with a 5× safety margin. The legal team drafted the contract language; the engineering team owned every numerical and definitional element.

The negotiation move that produces good SLAs is to derive the SLA backwards from the SLO instead of the other way around. The team starts with the internal SLO (what we believe we can deliver consistently), widens the allowed failure rate by a safety factor (3-10× depending on tolerance for surprise incidents), and presents the resulting SLA threshold to legal and sales. The reverse — starting from a sales-driven "we need to promise 99.99% to win this account" — produces SLAs the engineering team cannot meet, contract clauses that trigger refunds within the first quarter, and the slow erosion of trust between commercial and engineering. Indian SaaS companies that grew through 2020-2024 (Postman, Freshworks, Zoho, BrowserStack) all converged on engineering-driven SLA setting after early painful experiments with sales-driven thresholds.

A subtler SLA design question is exclusion windows. Most SLAs explicitly exclude scheduled maintenance, force majeure, customer-side errors (4xx responses), and synthetic probe traffic. The exclusions are not legal weasel words; they reflect what the engineering team can actually control. But poorly drafted exclusions — "events outside the control of the provider" with no definition — become the entire contract argument when something breaks. Specific exclusions ("DNS provider outage with documented incident from the DNS provider", "AWS region-wide failure") are defensible; vague ones are not. Razorpay's SLA names AWS-region failures explicitly because their checkout service is single-region in ap-south-1 and a regional outage is genuinely outside their control; they also commit to multi-region for tier-1 merchants in a separate clause, which is how the SLA forced an architectural decision.

Why the negotiation direction matters more than the threshold value: a team that derives SLA from SLO has a coherent internal story — the engineering target is tighter than the contract, the buffer absorbs normal incident noise, breaches of the contract are rare and reflect genuinely systemic failures. A team that derives SLO from SLA has the inverse problem — the SLO is artificially tightened to whatever buffer the SLA demands, which forces engineering to optimise for an internal target that may not match what the system can sustainably deliver. The first team improves reliability gradually; the second team thrashes between aspirational SLOs and missed SLAs. Direction is everything.

Going deeper

The mathematics of compounding SLOs across services

A user-facing request often traverses multiple services, each with its own SLO. The user's experience is degraded if any service in the chain fails. If the checkout flow calls payment (SLO 99.9%), then ledger (SLO 99.95%), then notification (SLO 99.5%), the end-to-end SLO is bounded above by the product: 0.999 × 0.9995 × 0.995 ≈ 0.9935 — about 99.35%, much worse than any individual service. The compounding penalty is why naive "all services run at 99.9%" policies produce user-experience SLOs in the 99% range. Indian fintech systems with 8-12 services in the checkout path (PhonePe, Paytm) typically run individual-service SLOs at 99.95% or 99.99% to keep the compounded user SLO at 99.9%. The reverse calculation — "what individual-service SLO do I need so that the chain delivers 99.9%?" — is the kth root of the end-to-end target, one factor for each of the k serial services; the sketch below walks both directions. Engineering teams that do not do this calculation explicitly end up with optimistic individual SLOs and pessimistic user-flow SLOs, and the gap shows up as user complaints that no individual on-call rotation owns.
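
The compounding arithmetic as a sketch, using the chain SLOs quoted above:

# slo_compounding.py — end-to-end bound for a serial call chain, and the reverse question
chain = {"payment": 0.999, "ledger": 0.9995, "notification": 0.995}

end_to_end = 1.0
for slo in chain.values():
    end_to_end *= slo                       # every hop must succeed
print(f"serial-chain upper bound: {end_to_end:.4%}")        # 99.3508%

TARGET, k = 0.999, 10                       # what must each of 10 serial services hold?
per_service = TARGET ** (1 / k)             # kth root of the end-to-end target
print(f"per-service SLO for a {k}-hop chain at {TARGET:.1%}: {per_service:.4%}")
# 99.9900%: each service must hold a budget 10x tighter than the chain target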

The remediation is not to push every service to 99.99% — that is 10× harder than 99.9% in engineering cost. The remediation is to identify which services are load-bearing in the user flow and tighten only those. Razorpay's checkout flow has 11 services, but only 3 are on the critical path with no fallback (payment-router, settlement-engine, fraud-decision); the other 8 have fallbacks (cached merchant config, async notification retry, reconciliation backfill). The 3 critical services run 99.99% SLOs; the 8 fallbackable services run 99.5%. The compounded user SLO arithmetic is then 0.9999³ multiplied by the availability of the weakest fallback path — roughly 99.95%, well above the 99.9% user-facing target. This kind of architecture-aware SLO setting is what Part 11 builds on; it is not possible without first understanding how the three letters interact.

How OTel's service.tier and error.type attributes feed SLI computation

The resource attribute service.tier (typically values tier-1, tier-2, tier-3) is a widely used convention, team-defined rather than part of the OTel semantic conventions, for encoding SLO criticality in spans and metrics. A typical setup: tier-1 services have a 99.99% SLO, tier-2 services 99.9%, tier-3 services 99.5%. Recording rules in Prometheus filter on {service_tier="tier-1"} to compute the SLI specifically for tier-1 services. The error.type attribute (introduced in OTel semantic conventions 1.21) splits errors into categories — client_error (4xx), server_error (5xx), dependency_error (downstream timeout). The SLI definition typically excludes client_error because those are caller-side bugs (bad payload, wrong API version) and including them penalises the service for issues it cannot fix. Including or excluding dependency_error is a policy choice — strict SLOs include them (the service is responsible for its end-to-end behaviour, including the downstreams it chose), looser SLOs exclude them (the service is not responsible for downstream failures).

The OTel-native SLI computation pattern that has emerged is: ingest spans into a backend, run a SQL-shaped query (requests_within_latency_budget / total_requests WHERE service_tier = 'tier-1' AND error.type != 'client_error'), emit the result as a recording rule into Prometheus or a Datadog SLO widget. The split between raw telemetry and SLI computation is increasingly clean: telemetry is captured everywhere, SLI is derived per-service from a query. This separation of concerns is what makes SLO platforms (Sloth, OpenSLO, Nobl9) viable — they consume telemetry from any backend and produce SLIs as a configuration-driven artefact, not a hand-coded one. A team standardising SLOs across 50+ services is much better off using one of these tools than maintaining 50 hand-tuned recording rules.
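
The same filter logic in a minimal pandas sketch over exported span rows; the column names are assumptions for illustration, not a mandated OTel schema:

# sli_from_spans.py — tier-filtered, error-type-aware SLI over span rows
import pandas as pd

spans = pd.DataFrame({
    "service_tier": ["tier-1", "tier-1", "tier-1", "tier-2", "tier-1"],
    "error_type":   [None, "client_error", "server_error", None, "dependency_error"],
    "latency_ms":   [120, 95, 180, 300, 900],
})

LATENCY_BUDGET_MS = 250
tier1 = spans[spans["service_tier"] == "tier-1"]
valid = tier1[tier1["error_type"] != "client_error"]   # caller-side bugs: excluded from valid
# Policy choice: a strict SLO keeps dependency_error in valid (it fails the good test);
# a looser SLO would drop those rows from valid as well.
good = valid["error_type"].isna() & (valid["latency_ms"] < LATENCY_BUDGET_MS)
print(f"tier-1 SLI: {good.sum()}/{len(valid)} = {good.mean():.1%}")   # 1/3 = 33.3%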

Why the 28-day rolling window became the default

Most SLOs use a rolling 28-day window, not 30 days, not a calendar month. Three reasons explain the convention. First, 28 days is exactly four weeks, which removes day-of-week effects from the budget — Saturday traffic and Wednesday traffic average out cleanly across the window. Second, 28 days is a tractable size for storage; recording rules at 1-minute granularity over 28 days fit comfortably in a typical Prometheus retention. Third, the Google SRE book uses 28 days as its example, and the convention propagated. Calendar-month windows are operationally awkward (February vs July differ by 10% in budget size, which produces step-function changes at month boundaries) and rolling 30-day windows have day-of-week drift. 28-day rolling is a Schelling point everyone converged on. Razorpay, PhonePe, Hotstar, and CRED all use 28-day rolling; the few teams using calendar-month windows tend to be regulated industries where the audit cycle is monthly and the SLO has to align.

A separate window choice is the short-window used for burn-rate alerting (covered in chapter 65) — typically 1h and 6h — which is independent of the SLO window. A 28-day SLO with a 1h fast-burn alert is the standard combination. Some teams use a quarterly SLO window for quarterly business reviews (the SLO target is defined over 90 days, even though daily operations use a 28-day rolling), but this dual-window approach adds complexity for marginal benefit and most teams retire it after a year.

The "SLO covers 100% of users" myth and the customer-tier reality

A common implicit assumption is that one SLO covers all users equally. In practice, SaaS platforms increasingly run per-customer-tier SLOs. CRED's reward-engine has different SLOs for free users versus CRED Max paid users; Hotstar Premium has tighter SLOs than the ad-supported tier; Razorpay tier-1 merchants (large enterprises with annual TPV > ₹1000 crore) have tighter contracted SLAs than smaller merchants. The mechanism is the same — request-success ratio — but the SLI is computed per customer-tier (requests_2xx_within_budget WHERE customer_tier = 'enterprise') and the SLO target differs by tier. The engineering effort to maintain tier-specific SLOs is non-trivial: separate dashboards, separate alerts, separate post-mortem routing. Teams take it on when the business justifies it (tier-1 customers represent 60% of revenue and demand the visibility).

The failure mode here is to ship per-tier SLOs without per-tier engineering investment to back them. A team that promises tier-1 customers 99.99% but runs the same infrastructure for all tiers is selling fiction. The right shape is to architect tier-1 paths separately (dedicated capacity, lower-multi-tenancy ratio, faster failover) and only then commit to tighter SLOs. Razorpay's 2023 introduction of "Direct Settlement" for tier-1 merchants is exactly this — a separate code path with separate capacity and a separate SLO that the team architecturally backed before signing customer contracts. The vocabulary distinction between SLI, SLO, SLA continues to matter at this granularity: tier-1 customers see a tier-1 SLA (the contract), tier-1 engineering targets a tier-1 SLO (tighter), tier-1 dashboards display a tier-1 SLI (computed from tier-tagged spans).

Reading an existing SLO doc — five questions to apply

When you join a team and inherit an SLO document — or audit one written by a previous team — five questions surface whether it is operationally usable. (1) Is the SLI definition specific enough that two engineers writing PromQL would produce the same query? Vague SLIs ("response is fast", "the system is healthy") are aspirations, not contracts. (2) Is the time window named explicitly, with rolling vs calendar specified? "99.9% over a month" is ambiguous between rolling 30-day, calendar month, and rolling 28-day. (3) Are exclusions named explicitly? Synthetic probes, scheduled maintenance, customer 4xx — every category needs a yes/no. (4) Is there a corresponding burn-rate alert deriving from this SLO? An SLO with no alert is decoration. (5) When was the SLO last reviewed, and what was the result? An SLO that has been static for a year is either perfect (rare) or stale (common).

Applying this five-question audit to a typical "first generation" SLO document — the kind written during an SLO-rollout sprint and not revisited — usually finds three or four failures. The remediation is not to throw out the document but to do one revision pass with the questions as a checklist. Each pass takes 60-90 minutes per SLO and produces a document the team will actually use. Skipping the revision is how SLOs become Confluence pages nobody reads — the audit is the difference between policy theatre and operational discipline.

Silent errors, event-vs-time budgets, and the SLO doc template — three implementation pitfalls

Three implementation choices show up in every SLO rollout and predictably go wrong on the first attempt. First, the silent-error pattern: a checkout SLI defined as "HTTP 2xx within 250ms" passes during a window where 100% of POSTs return 200 OK quickly — but the response payload contains a JSON error field saying "payment processor temporarily unavailable". The SLI shows green; the user is screaming. The fix is to broaden the SLI to count payload-level errors as bad: (2xx with no error field) AND (latency < 250ms) / valid. Razorpay's payment-create path went through exactly this revision in 2022 after a 4-hour silent-failure incident; the next similar incident was caught within 6 minutes. The SLI must reflect user-meaningful failure, not transport-layer success.
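
A sketch of the broadened good-condition; the response-envelope convention (an "error" key in the JSON body) is hypothetical, standing in for whatever your API actually returns:

# silent_error_sli.py — payload-level failures count as bad, not just transport failures
import json
import pandas as pd

responses = pd.DataFrame({
    "status":     [200, 200, 200, 503],
    "latency_ms": [120, 140, 110, 90],
    "body":       ['{"ok": true}',
                   '{"ok": false, "error": "processor_unavailable"}',   # 200 OK, user failed
                   '{"ok": true}',
                   '{"error": "upstream"}'],
})

def payload_ok(body: str) -> bool:
    # "good" must mean the user got what they asked for, not just a 2xx envelope
    return "error" not in json.loads(body)

good = (responses["status"].between(200, 299)
        & (responses["latency_ms"] < 250)
        & responses["body"].map(payload_ok))
print(f"SLI: {good.mean():.0%}")   # 50% — the transport-only SLI would have said 75%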

Second, the budget-denomination choice: time-based ("40 minutes/month allowed") vs event-based ("0.1% of valid requests"). The two diverge during traffic spikes — a 5-minute outage at IPL-peak traffic affects 150M requests, while the same 5 minutes at 3am affects 30k. Time-based budgeting under-counts peak-hour incidents and over-counts off-peak ones. Razorpay, PhonePe, Zerodha, and Hotstar all use event-based budgets in production; time-based is reserved for executive summaries where minutes-of-downtime is the legible unit. The Google SRE Workbook recommends event-based for any SLO whose traffic varies by more than 10× across the window, which is most consumer-facing Indian services.
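
The divergence in a few lines of arithmetic; the traffic volumes are assumed for illustration:

# budget_denomination.py — the same 5-minute full outage, costed both ways
WINDOW_MIN = 28 * 24 * 60                     # 40,320-minute window
SLO = 0.999
OUTAGE_MIN = 5

time_budget_min = (1 - SLO) * WINDOW_MIN      # 40.3 minutes of badness allowed
print(f"time-based: {OUTAGE_MIN / time_budget_min:.1%} of budget, peak or 3am alike")

monthly_events = 2_000_000_000                # assumed monthly request volume
event_budget = (1 - SLO) * monthly_events     # 2M bad events allowed
for label, rps in (("IPL peak ", 500_000), ("3am trough", 100)):
    bad_events = rps * OUTAGE_MIN * 60        # every request during the outage fails
    print(f"event-based ({label}): {bad_events / event_budget:7.1%} of budget")

# time-based: 12.4% of budget, peak or 3am alike
# event-based (IPL peak ): 7500.0% of budget
# event-based (3am trough):    1.5% of budget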

Third, the SLO doc structure: documents that hold up across quarters share an eight-heading shape — service+owner, SLI as literal PromQL/LogQL query, threshold+window in one sentence, exclusions list, corresponding SLA threshold with safety margin, burn-rate alert configuration with named thresholds (1h × 14.4 fast-burn, 6h × 6 medium-burn, 3d × 1 slow-burn), last-reviewed date with revision history in git, escalation path when budget is exhausted. Two pages of markdown, fits in a service repo's slo.md. Teams that adopt this template see SLO documents stay current; teams that write SLOs as standalone narratives see them go stale within a quarter. SLOs are configurations the alerting system consumes, not creative writing.

What changes for the on-call engineer at 2am

The vocabulary distinction has its sharpest payoff during a 2am page. Before the discipline: the on-call sees an alert, opens five dashboards, looks at p99, error rate, queue depth, GC time, and CPU, makes a judgement call about severity, escalates or does not. After the discipline: the on-call opens one panel — the SLO burn-rate panel — sees current burn-rate = 18.6×, monthly budget remaining = 12 minutes, and knows immediately this is a paging-grade event with limited recovery runway. The decision to escalate the secondary on-call is not a judgement call; it derives from the budget arithmetic. A burn rate of 18.6× means the remaining 12 minutes of budget will be exhausted in about 645 minutes (roughly 11 hours) at the current rate, as the sketch below shows. That is the deadline for either resolving the incident or accepting an SLO breach.
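
The runway number is mechanical. A sketch of the arithmetic behind the panel, using the figures from the paragraph above:

# burn_runway.py — minutes until the remaining budget is gone at the current burn rate
WINDOW_MIN = 28 * 24 * 60                    # 40,320-minute SLO window
SLO = 0.999

budget_total_min = (1 - SLO) * WINDOW_MIN    # 40.3 budget-minutes per window
burn_rate = 18.6                             # burning budget at 18.6x the sustainable rate
budget_remaining_min = 12.0

# Sustainable consumption spends the whole budget evenly across the window;
# the current rate is burn_rate times that.
consumption_per_min = burn_rate * budget_total_min / WINDOW_MIN
runway_min = budget_remaining_min / consumption_per_min
print(f"runway: {runway_min:.0f} min (~{runway_min / 60:.1f} h) until SLO breach")
# runway: 645 min (~10.8 h) until SLO breach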

The same vocabulary distinction shapes the post-page communication. Pre-SLO communication ("checkout p99 is high, investigating") tells the team nothing about urgency or duration. Post-SLO communication ("burn rate 18.6×, 12 minutes of monthly budget remaining, declaring incident, ETA 30 minutes for rollback") tells the team exactly what matters. Hotstar's playback team's incident channel template is exactly this shape: every incident message starts with the burn-rate number and the budget-remaining number. The template was developed during the 2023 IPL season and reduced average time-to-mitigation by 40% — not because the engineering work got faster, but because the coordination overhead dropped. Engineers stopped asking "is this serious?" and started asking "what do we do?".

The three letters also shape who joins the page. A burn rate that consumes only the SLO budget (still well above SLA) involves engineering only. A burn rate that threatens the SLA threshold escalates to include customer-success (proactive communication to affected customers) and legal (track which customers are owed credits). The vocabulary lets the on-call escalate selectively — not "this is bad, wake everyone" but "this is bad enough to wake the SLA tier of escalation". Razorpay's escalation policy, since the 2023 alert-rewrite, is keyed on which contract level is at risk: SLO-only events stay in engineering, SLA-threatening events fan out cross-functionally. The clarity reduced unnecessary cross-functional pages by 70% and increased response speed for the genuine-cross-functional events because the people who needed to be on the call actually were.

Why the vocabulary precision is operationally load-bearing and not pedantic: the difference between "SLO at risk" and "SLA at risk" maps to different escalation paths, different response SLAs (the response itself is SLO'd, recursively), and different post-incident artefacts (engineering postmortem vs customer-credit issuance vs both). A team without the distinction either over-escalates every event (wasting cross-functional time) or under-escalates (missing the SLA-credit window). With the distinction, escalation is deterministic from the budget math. The reduction in coordination overhead is what makes the SLO discipline pay back in the first quarter, before any reliability improvement shows up in the SLI itself.

A worked example — IRCTC Tatkal hour, deconstructed by the three letters

The clearest test of the vocabulary is to apply it to a system every Indian engineer recognises. IRCTC Tatkal booking opens at 10:00 IST sharp; for the next 4-6 minutes the platform sees a 50× traffic surge as users compete for limited seats on premium routes. The post-Tatkal hours look completely different. A single SLO across the full day would be meaningless because the two regimes have nothing in common.

The right shape — and roughly what IRCTC's modernisation effort settled on — is two SLOs. SLI₁ = (2xx booking responses with end-to-end latency < 8s) / total booking requests during the 10:00-10:10 IST window. SLO₁ = "SLI₁ ≥ 95% on each Tatkal day, measured per-day across a rolling 28-day window". The 95% threshold is realistic for the surge regime — a 5% failure rate during peak is the engineering reality of finite seat inventory and queue saturation. SLI₂ = (2xx booking responses with latency < 2s) / total booking requests outside the Tatkal window. SLO₂ = "SLI₂ ≥ 99.5% over rolling 28 days". The non-Tatkal regime can sustain a much tighter SLO because the load is normal. The customer-facing SLA — the public commitment to government auditors and to the National Consumer Helpline — is at the looser 90% / 99% pair, with the safety margin absorbing the tail of incident hours.
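
A runnable sketch of the dual-regime evaluation, using the thresholds above; the window boundary, sample data, and column names are illustrative:

# dual_regime_slo.py — route each request to the SLO regime that owns its timestamp
from datetime import time
import pandas as pd

REGIMES = {
    "tatkal":     {"latency_ms": 8000, "target": 0.95},    # 10:00-10:10 IST surge
    "non_tatkal": {"latency_ms": 2000, "target": 0.995},   # everything else
}
TATKAL_START, TATKAL_END = time(10, 0), time(10, 10)

def regime_of(ts) -> str:
    return "tatkal" if TATKAL_START <= ts.time() < TATKAL_END else "non_tatkal"

requests = pd.DataFrame({
    "ts": pd.to_datetime(["2026-04-24 10:03:00", "2026-04-24 10:07:30", "2026-04-24 14:00:00"]),
    "status": [200, 504, 200],
    "latency_ms": [6500, 9000, 450],
})

for name, grp in requests.groupby(requests["ts"].map(regime_of)):
    cfg = REGIMES[name]
    good = grp["status"].between(200, 299) & (grp["latency_ms"] < cfg["latency_ms"])
    sli = good.mean()
    verdict = "ok" if sli >= cfg["target"] else "BREACH"
    print(f"{name:10s} SLI {sli:6.1%} vs SLO {cfg['target']:.1%} -> {verdict}")

# non_tatkal SLI 100.0% vs SLO 99.5% -> ok
# tatkal     SLI  50.0% vs SLO 95.0% -> BREACH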

This dual-SLO design is what made the post-2020 IRCTC reliability investments measurable. Pre-2020 the team had a single uptime SLI ("the booking page is up") that returned 100% during Tatkal hours even when 80% of users were getting timeout errors at the seat-selection step. The vocabulary fix preceded the engineering fix: once the team distinguished SLI₁ from SLI₂ and tied each to a different SLO, the engineering work had a target — what specifically about the Tatkal regime makes SLI₁ poor? — that the unified SLI could not name. The architectural investments that followed (separate Tatkal queue, separate scaling group, sharded inventory locks) were measurable against SLI₁ improvement, not against vague "performance" claims. Every team operating a system with two distinct load regimes — Zerodha at market-open, Hotstar during cricket finals, Dream11 during T20 toss windows — converges on this dual-SLO pattern after their first SLO-rollout attempt fails to capture peak-hour reality.

The SLA story is similarly dual. The legal contract IRCTC publishes promises a single number (e.g., "95% of bookings will succeed") with no regime split, because legal contracts are written for legibility, not engineering accuracy. The internal SLOs split the regimes because engineering needs the granularity. The SLA threshold is set such that both SLO regimes meeting their targets implies the SLA is met — which is a calculation the legal team cannot do without engineering input. A common trap here: the legal team negotiates a single 99% SLA, the engineering team translates it into a single 99.5% SLO, the operations team does not realise that the Tatkal regime cannot sustain 99.5%, and the contract is breached every Tatkal day for a month before someone notices the SLI was being computed in a way that masked the regime-specific failure. The fix is the same as elsewhere: engineering owns the SLI definition end-to-end, including the question of whether one SLI or two is the right shape for the system.

A useful pattern for documenting dual-regime SLOs: a single page with both SLI definitions side by side, both SLO thresholds, the windows under which each applies, and a clear statement of which one drives alerts during which hours. The IRCTC platform-team's internal doc (publicly referenced in their 2023 SREcon Asia talk) does exactly this; the alert routing checks whether the time of day falls inside tatkal_window before deciding which SLO to evaluate against. The mechanism is one line of PromQL conditional logic; the discipline of making the conditional explicit is what prevents the regime-confusion trap.

Why two SLOs are better than one weighted average: a tempting alternative is a single SLO that weights Tatkal hours and non-Tatkal hours into one number — "the time-weighted SLI is X% across the day". The arithmetic works but the operational signal does not. A weighted single-SLO can sit at 99% even when Tatkal is failing badly (because non-Tatkal hours dominate the weight); a Tatkal-day reliability regression is invisible until the weighted average dips, by which time three days of incidents have been lost. Two separate SLOs make Tatkal regressions immediately visible — SLO₁ breaches the day it breaches, regardless of how good non-Tatkal hours were. Operational visibility wins over arithmetic elegance here, the same way separate alarms for kitchen smoke and bedroom smoke beat one whole-house average.

A note on what this chapter is not

The SLI/SLO/SLA vocabulary is necessary but not sufficient. Knowing the definitions does not by itself produce reliable systems. The vocabulary lets a team have the right conversations — with product, finance, legal, on-call — and have those conversations converge on action. The reliability work itself happens elsewhere: in chapter 65's burn-rate alerting, in Part 11's signal-to-noise alert engineering, in Part 12's diagnostic-ladder debugging, in Part 17's organisational SLO governance. This chapter is the dictionary; the rest of Part 10 onward is the grammar and the literature.

A team that finishes this chapter and concludes "we have an SLO discipline now" has misread the chapter. The right conclusion is "we have shared vocabulary; now we can do the actual SLO work without arguing about words". The shared vocabulary saves time in every subsequent meeting; the saved time goes into the engineering that the vocabulary was always pointing toward. This is the same pattern every engineering discipline follows — terms first, mechanisms second, culture third — and the SLO discipline is no exception.

There is also a smaller, practical payoff right now. The next time a stakeholder says "what is our SLA?", the precise answer (the contract value, the customer-facing threshold, the breach consequence) is different from the answer to "what is our SLO?" (the engineering target, the internal threshold, the alerting trigger). Both are different from the answer to "what is our SLI today?" (a number from the dashboard at this moment). Three different answers to three different questions, where there used to be one confused answer to a conflated question. That clarity, modest as it sounds, is what unlocks every conversation Part 10 is about to ask the team to have.

Where this leads next

Chapter 63 walks the engineering of which SLI to anchor on — the patterns named above in greater depth, with the per-system trade-offs (request-success vs latency-conditioned-success, per-endpoint vs per-flow, real-traffic vs synthetic) covered for each. Chapter 64 derives the error-budget arithmetic from first principles — given a 99.9% SLO over 28 days, how many seconds of allowed badness do you have, and what does it mean to spend them. Chapter 65 builds burn-rate alerting on top: the multi-window-multi-burn-rate scheme that derives every paging alert mechanically from the SLO instead of guessing thresholds. Chapter 66 walks the cross-functional negotiation — how engineering, product, and finance jointly own the SLO, what review cadence works, what the failure modes look like. Chapter 67 covers per-customer-tier SLOs and the architectural investments they imply.

Beyond Part 10, Part 11 (alerting as a discipline) inherits SLOs as the anchor for every paging alert; Part 12 (production debugging) inherits SLOs as the definition of "broken"; Part 17 (observability as a discipline) closes the loop on how SLO governance becomes part of an engineering culture. The vocabulary of this chapter is load-bearing for every chapter that follows.

For the reader: take one service you operate. Write down its SLI as a one-sentence query against existing telemetry. Write down the SLO threshold and time window. If a customer-facing SLA exists, write that down too. Look at the gaps — between SLO and SLA (is the safety margin defensible?), between the SLI and the user experience (does the SLI saying 100% mean every customer is happy?), between the SLO threshold and historical performance (is the threshold achievable, or aspirational?). Bring the gaps to chapter 63.

A second exercise, harder but more revealing: pull the last quarter's incident reports for the same service. For each incident, compute (or estimate) what fraction of the SLO budget it consumed. Most teams discover that 2-3 incidents consumed 60-80% of the budget — meaning the long tail of small incidents barely matters for the SLO arithmetic. The implication is operationally important: the team's reliability work should focus on the 2-3 incident archetypes that consumed the bulk of the budget, not the long tail of smaller events. The SLO discipline does this prioritisation mechanically; the pre-SLO discipline does it through senior-engineer intuition, which is fragile. Bringing this prioritised list to chapter 64 (error-budget math) is what makes the next chapter's arithmetic concrete instead of abstract.
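
A sketch of the attribution arithmetic; the incident numbers are invented to show the shape of the result:

# incident_budget_pareto.py — which incidents actually spent the quarter's budget
import pandas as pd

QUARTER_EVENTS = 500_000_000            # assumed valid events last quarter
SLO = 0.999
budget = (1 - SLO) * QUARTER_EVENTS     # 500k bad events allowed

incidents = pd.DataFrame({
    "incident":   ["db-failover", "bad-deploy", "cert-expiry", "gc-pause", "dns-blip"],
    "bad_events": [210_000, 130_000, 45_000, 9_000, 4_000],
})
incidents["budget_pct"] = incidents["bad_events"] / budget * 100
incidents = incidents.sort_values("budget_pct", ascending=False)
incidents["cumulative_pct"] = incidents["budget_pct"].cumsum()
print(incidents.to_string(index=False))
# The top two incidents consumed 68% of the quarter's budget;
# the long tail of small events barely registers in the SLO arithmetic.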

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 sli_slo_calculator.py
# Then change SLO_LATENCY_MS to 400 and re-run: relaxing the latency budget
# lifts the SLI from ~96.6% to ~99.7%, past the 99.5% SLA but still short of
# the 99.9% SLO, because the simulated 0.3% server-error rate alone consumes
# 3x the error budget. The SLO is a knob; choose it deliberately.
# Also try setting SLO_TARGET to 0.99 (looser) vs 0.9999 (tighter) and watch
# the budget consumed-vs-remaining numbers shift by orders of magnitude.