Choosing good SLIs
The Cleartrip flight-search team rolled out their first SLO in late 2023. The SLI they picked was the obvious one: (HTTP 2xx responses) / (total requests) on the search endpoint. Target 99.9%. For two months the dashboard sat at a comfortable 99.94% and nobody paged. Then a senior engineer noticed the customer-support queue had quietly tripled — users were complaining about empty result pages on Mumbai-Bengaluru searches during morning peak. Investigation showed the search service was returning HTTP 200 with an empty results: [] payload whenever the upstream fare-cache timed out. The SLI was perfect; the user experience was broken; the gap between them had been silently eroding business for sixty-eight days. The fix took an afternoon — broaden the SLI to count empty-payload responses as bad — but the lesson cost a quarter's worth of trust with the search PM.
A good SLI tracks user-meaningful success, not transport-layer success. Five questions filter a candidate SLI: does it match the user's contract, can two engineers compute it identically, does 100% mean every customer is currently happy, is it computable from telemetry you already have, and does it respond within minutes when the system actually breaks. Most first-attempt SLIs fail at least one — usually the third.
The five-question filter for any candidate SLI
Picking the right SLI is the load-bearing engineering decision in the entire SLO discipline. Get it right and a clumsy threshold still produces operational value; get it wrong and even a perfectly-tuned threshold tracks the wrong thing. Five questions, applied in order, separate good SLIs from candidates that look good on a slide.
Question 1 — Does it match the user's contract? A user invoking your service has an implicit promise in mind. For Razorpay's payment-create, the promise is "my charge succeeds and the merchant gets notified within a few seconds". For Hotstar's video-start, the promise is "I tap play, video begins within 2 seconds, no buffering for the next 30". For Zerodha's order-place, the promise is "my order reaches the exchange before the price moves". The SLI must measure success against that promise, not against a transport-layer proxy of it. A 2xx response that delivers an empty payload is a failure of the user contract for search; a 2xx that takes 12 seconds is a failure for trading even if the order was eventually placed. The SLI definition has to be specific enough that "I succeeded" in SLI-land matches "the user got what they asked for" in real life.
Question 2 — Can two engineers writing the query independently produce the same result? Vague SLIs like "the system is healthy" or "the response is fast" are aspirations, not contracts. A good SLI reduces to a literal query — usually PromQL, LogQL, or TraceQL — that two engineers writing it from scratch would converge on. sum(rate(http_request_duration_seconds_bucket{job="checkout-api",code=~"2..",le="0.25"}[5m])) / sum(rate(http_request_duration_seconds_count{job="checkout-api"}[5m])) is unambiguous (2xx checkout requests answered within 250ms, as a fraction of all checkout requests); "checkout latency stays under 250ms most of the time" is not. The query specifies the metric, the labels, the bucket, the aggregation, and the rate window. If your SLI definition cannot be written down this way, it is not yet an SLI.
Question 3 — If this number reads 100%, is every customer right now having the experience you promised? This is the diagnostic question, and it eliminates more bad SLIs than the other four combined. CPU < 80% reads "100% in budget" while the service times out at 30% CPU because of a downstream stall. Health-check uptime reads 100% while every payment fails because the upstream tokenisation service is slow. Synthetic-probe success reads 100% while a Hyderabad-ISP routing issue blackholes 8% of real traffic. The question forces you to imagine a world where the SLI is perfect, then check whether reality matches. Most "interesting" candidate SLIs fail this test — the SLI is a proxy for something else, and the proxy stays green when the actual thing breaks.
Question 4 — Is it computable from telemetry you already collect, at a cost you can afford? A theoretically perfect SLI that would require instrumenting every internal microservice with new spans, adding three new high-cardinality labels, and provisioning 4× more Prometheus storage is not deployable. Razorpay's first attempt at a "true end-to-end success" SLI for the checkout flow needed seven new attributes on the trace span and would have multiplied their Tempo storage by 6×; the realistic SLI used three attributes already on existing spans and was good enough. The reasoning is not "settle for less"; it is that the cost-quality curve for SLIs flattens fast — an improvement in the third significant figure rarely justifies another multiple of telemetry spend.
Question 5 — Does it respond within minutes when the system actually breaks? An SLI computed over a 24-hour window cannot detect a 5-minute incident — by the time the day-long ratio dips below threshold, the incident is already 12 hours old. A good SLI is computed at fine enough granularity (1m or 5m windows feeding recording rules) that a real incident moves the number visibly inside the incident's own duration. Burn-rate alerting (chapter 65) depends on this: a 1h fast-burn alert needs an SLI computable over a 1h window. Pipeline SLIs computed only over 24h windows can hide entire morning-peak failures. The granularity is part of the SLI definition, not an afterthought.
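To see why the window is part of the definition, here is a minimal sketch with synthetic numbers (nothing below comes from a production system): inject a 10-minute total outage into an otherwise healthy day of per-minute good/total counts, then compute the same SLI over a 5-minute and a 24-hour rolling window.
# window_granularity.py: one outage, two window sizes (illustrative sketch)
# pip install numpy pandas
import numpy as np, pandas as pd
rng = np.random.default_rng(7)
minutes = 24 * 60                                    # one day at 1-minute resolution
s = pd.DataFrame({
    "good": rng.binomial(1000, 0.999, size=minutes).astype(float),
    "total": np.full(minutes, 1000.0),
})
s.loc[600:609, "good"] = 0.0                         # 10-minute total outage at 10:00
sli_5m = s.rolling(5).sum().eval("good / total")
sli_24h = s.rolling(minutes, min_periods=1).sum().eval("good / total")
print(f"5m window during the outage:  {sli_5m.iloc[609]:.2%}")   # 0.00%, obvious
print(f"24h window during the outage: {sli_24h.iloc[609]:.2%}")  # ~98.3%, a shrug
The 5-minute reading collapses inside the incident's own duration; the 24-hour reading barely registers it. Same arithmetic, different window, completely different operational meaning.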
A candidate SLI that passes all five becomes a real SLI. A candidate that fails one or two can usually be fixed by tightening the definition. A candidate that fails three or more should be replaced — it is not measuring what the team thinks it is measuring.
Why the questions are ordered the way they are: Q1 (user contract) is the purpose test — if you fail this, no amount of technical correctness saves the SLI. Q2 (reproducible query) is the engineering rigour test — without it, you have a discussion-doc, not a contract. Q3 (100% means happy) is the fidelity test — it catches the proxy-metric trap that defeats most CPU-based or health-check-based SLIs. Q4 (cost) is the practicality test — many "perfect" SLIs are not deployable. Q5 (responsiveness) is the operational test — an SLI that does not move during incidents is decorative. Skip any one of them and the SLI is broken in its own characteristic way.
Building a real SLI from real traffic, in code
The fastest way to internalise the filter is to apply it. The Python script below takes a stream of synthetic Razorpay-shape checkout events with three failure modes interleaved (HTTP 5xx errors, slow-but-2xx responses, and 2xx-with-empty-payload silent failures), then computes four candidate SLIs against the same data: a naive 2xx-rate SLI, a latency-conditioned 2xx SLI, a payload-validated SLI, and an uptime-style health-check SLI. Each SLI's score is compared against ground-truth user happiness (where "happy" means the user got a non-empty correct response within their patience window).
# choosing_slis.py — compute four candidate SLIs against the same traffic
# and compare each against ground-truth user happiness.
# pip install numpy pandas
import numpy as np, pandas as pd
np.random.seed(2026)
# Simulate 50,000 Razorpay checkout requests over 1 hour.
N = 50_000
df = pd.DataFrame({
    "request_id": range(N),
    "ts_seconds": np.sort(np.random.uniform(0, 3600, size=N)),
    # 98.5% HTTP 200, 1.0% HTTP 500, 0.5% HTTP 502: 1.5% server errors overall.
    "status": np.random.choice([200, 500, 502], p=[0.985, 0.01, 0.005], size=N),
    # Log-normal latency: median 140ms, sigma 0.50, a realistic heavy tail.
    "latency_ms": np.random.lognormal(np.log(140), 0.50, size=N),
    "payload_ok": np.random.random(N) > 0.04,    # 4% silent-failure 2xx empty payload
    "is_synthetic": np.random.random(N) < 0.02,  # 2% synthetic probes, excluded below
})
# Ground truth: a user is happy iff they got a 2xx, non-empty payload, latency < 1500ms.
df["truly_happy"] = (df["status"].between(200, 299) &
df["payload_ok"] &
(df["latency_ms"] < 1500))
valid = df[~df["is_synthetic"]]
total = len(valid)
truly_happy_rate = valid["truly_happy"].mean()
# SLI candidate A — naive 2xx rate.
sli_a = valid["status"].between(200, 299).mean()
# SLI candidate B — latency-conditioned 2xx rate.
sli_b = (valid["status"].between(200, 299) & (valid["latency_ms"] < 250)).mean()
# SLI candidate C — payload-validated, latency-conditioned 2xx rate.
sli_c = (valid["status"].between(200, 299) &
valid["payload_ok"] &
(valid["latency_ms"] < 250)).mean()
# SLI candidate D — uptime-style: was the service responsive at all?
sli_d = (valid["status"].between(200, 599)).mean() # any HTTP response = "up"
# How well does each SLI track ground-truth user happiness?
def gap(sli_value, truth):
    return sli_value - truth
print(f"window: 1 hour, {total:,} non-probe requests")
print(f"truly happy users: {truly_happy_rate * 100:.4f}%")
print()
print(f"SLI A (naive 2xx): {sli_a * 100:.4f}% gap vs truth: "
f"{gap(sli_a, truly_happy_rate) * 100:+.4f}%")
print(f"SLI B (2xx + latency): {sli_b * 100:.4f}% gap vs truth: "
f"{gap(sli_b, truly_happy_rate) * 100:+.4f}%")
print(f"SLI C (2xx + lat + body): {sli_c * 100:.4f}% gap vs truth: "
f"{gap(sli_c, truly_happy_rate) * 100:+.4f}%")
print(f"SLI D (uptime — any resp):{sli_d * 100:.4f}% gap vs truth: "
f"{gap(sli_d, truly_happy_rate) * 100:+.4f}%")
print()
# Apply Q3 of the filter: when each SLI says "100%", what fraction of
# users are actually happy? Approximate by examining 1-minute windows
# where the SLI scored full marks.
df["minute"] = (df["ts_seconds"] // 60).astype(int)
def fidelity(predicate):
    out = []
    for m, grp in valid.groupby("minute"):
        if predicate(grp).all():
            out.append(grp["truly_happy"].mean())
    return np.mean(out) if out else None
fid_a = fidelity(lambda g: g["status"].between(200, 299))
fid_c = fidelity(lambda g: g["status"].between(200, 299) &
                           g["payload_ok"] &
                           (g["latency_ms"] < 250))
print(f"Q3 fidelity — SLI A perfect minutes: avg user happiness = {fid_a*100:.2f}%"
if fid_a else "Q3 — SLI A: no perfect minute")
print(f"Q3 fidelity — SLI C perfect minutes: avg user happiness = {fid_c*100:.2f}%"
if fid_c else "Q3 — SLI C: no perfect minute")
# Expected output (Python 3.11, numpy 1.26, pandas 2.2, np.random.seed(2026)).
# The figures below are the analytic expectations for these parameters; the
# seeded run should land within a few tenths of a point of them.
window: 1 hour, ~49,000 non-probe requests
truly happy users: ~94.56%
SLI A (naive 2xx): ~98.50% gap vs truth: ~+3.9%
SLI B (2xx + latency): ~86.38% gap vs truth: ~-8.2%
SLI C (2xx + lat + body): ~82.93% gap vs truth: ~-11.6%
SLI D (uptime — any resp): 100.00% gap vs truth: ~+5.4%
Q3 — SLI A: no perfect minute at this traffic volume
Q3 — SLI C: no perfect minute at this traffic volume
The traffic generator: 50k requests over one hour with three interleaved failure modes — 1.5% server errors (5xx), a 4% rate of 2xx-with-empty-payload silent failures, and a heavy-tailed latency distribution (log-normal, median 140ms, sigma 0.50). The 2% of synthetic-probe traffic is filtered out before any SLI is computed. The interleaved silent failure is the realistic case — every Indian production system has at least one upstream that returns 200 with a degraded payload when its dependency stalls; the SLI either catches this or hides it.
The ground-truth column: the question every SLI must answer is "what fraction of users got the experience we promised?". Here we define that explicitly: status 2xx AND payload non-empty AND latency under 1500ms. Real systems do not have a ground-truth column — that is the entire problem the SLI is trying to solve. The simulation cheats by knowing the answer; the SLI candidates have to discover it from observable signals. The gap between an SLI's reading and the ground truth is the SLI's lie.
The four candidate SLIs: each computes a different definition of "good". A reads only HTTP status; B adds a latency cutoff; C adds payload validation; D treats any response (including 5xx) as "up". The arithmetic is identical across all four — good / total — but the definition of "good" differs. This is the entire engineering choice: every SLI is good / total, and the work is in defining "good".
The gap report: SLI A reports ~98.5% (looks great) while only ~94.6% of users are happy — a roughly +4-point lie, because A ignores both latency and silent-payload failures. SLI D reports 100% (any HTTP response counts) and lies harder still, a ~+5.4-point gap. SLI C reports ~82.9% (looks bad) but understates the happy by ~11.6 points, because its 250ms latency cutoff is far tighter than the 1500ms patience window — C is too strict and reports more pain than the user feels. Both directions of error are real failures — the SLI that overstates happiness lets bad pages stay green; the SLI that understates happiness produces alert fatigue.
The Q3 fidelity check: this is where the filter's third question becomes a measurement. For each 1-minute window where a candidate SLI scored 100%, what was the average true happiness? At this traffic volume (~800 requests per minute) and these failure rates, neither SLI records a perfect minute — with a 1.5% error rate, an all-2xx minute is vanishingly rare — so the script prints the fallback branch. Shrink the per-minute traffic or soften the failure rates and the asymmetry appears: SLI A starts recording perfect minutes whose true happiness sits below 100%, because silent payload failures hide inside them, while SLI C can only record a perfect minute when every user in it was genuinely happy — C's conditions are strictly stronger than the ground-truth definition. That asymmetry is Q3 in miniature: A is a proxy that can read perfect while users suffer. The remediation is not to abandon A but to carry the payload-validation check across the board (which is what production systems converge on after the first silent-failure incident).
The output above shows the analytic expectations for this parameterisation — readers reproducing the script with np.random.seed(2026) should land within a few tenths of a point of these figures. The script's value is not the specific gaps; it is that changing the SLI definition while holding the data constant makes the SLI's lies visible. Every team's first SLO rollout should run a version of this script against a recorded slice of their own traffic before committing to the SLI definition. The exercise typically reveals at least one failure mode the team had not noticed and forces a definition change — usually the addition of payload validation, or the loosening of the latency budget to better match the patience window.
Why the gap matters more than the absolute SLI value: a team aiming for a 99.9% SLO target can pick any SLI definition that produces stable readings near 99.9% in normal operation. The choice of which definition matters because the SLI's lies determine when alerts fire. SLI A at ~98.5% would breach a 99.9% SLO instantly; SLI C at ~82.9% would breach catastrophically; SLI D at 100% would never breach. Each choice produces a different alert burden and a different relationship to user pain. The right SLI is the one whose lies are small and known — small enough that a 99.9% SLI reading correlates with 99.9%-ish user happiness, known enough that the team can describe what the SLI fails to catch and what compensating signals (logs, customer support tickets, separate dashboards) cover that gap.
The four SLI shapes and when each applies
Across hundreds of Indian production services, four SLI shapes recur. Pick the right shape first; the threshold and window come second.
Shape 1 — request-success ratio for synchronous APIs. Used by Razorpay payment-create, PhonePe UPI-init, Cleartrip flight-search, Cred reward-redeem, IRCTC booking. The form is (2xx with payload-OK and latency < L) / (total non-synthetic requests). The latency threshold L is service-specific — 250ms for payments, 400ms for search, 800ms for itinerary planning — and chosen to match the user's perceived patience. Three knobs: status-code filter (usually 2xx-only, sometimes 2xx+3xx), payload validation (a hash check, an emptiness check, a schema validation), latency budget (often p95-of-current-distribution rounded up). The shape applies when the user makes a request and waits for the response — most consumer-facing APIs.
Shape 2 — pipeline-freshness ratio for asynchronous systems. Used by Hotstar IPL analytics ingestion, Swiggy order-event stream, Dream11 leaderboard updates, Flipkart order-status broadcasts. The form is (events processed within freshness budget B) / (total events ingested). The freshness budget is 30 seconds for Hotstar's real-time analytics, 5 seconds for Swiggy's order-state machine, 15 minutes for Flipkart's catalogue updates. The SLI is computed by tagging each event with an ingestion timestamp at the producer, comparing against the processing timestamp at the consumer, and counting "fresh" iff the difference is under B. The shape applies whenever the user does not synchronously wait — pipeline data delivery, async notifications, batch updates.
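A minimal sketch of the Shape 2 arithmetic, assuming each event carries a producer-side ingest timestamp and a consumer-side processed timestamp; the column names and the lag model are illustrative, not a standard schema:
# freshness_sli.py: Shape 2, (events fresh within budget B) / (events ingested)
# pip install numpy pandas
import numpy as np, pandas as pd
rng = np.random.default_rng(42)
n = 100_000
ingest_ts = np.sort(rng.uniform(0, 3600, n))         # producer-side timestamps
lag = rng.exponential(0.8, n)                        # consumer lag, mostly sub-second
stalled = rng.random(n) < 0.02                       # 2% of events hit a stalled consumer
lag[stalled] += rng.uniform(10, 60, stalled.sum())
events = pd.DataFrame({"ingest_ts": ingest_ts, "processed_ts": ingest_ts + lag})
BUDGET_S = 5.0                                       # a Swiggy-style 5-second budget
fresh = (events["processed_ts"] - events["ingest_ts"]) <= BUDGET_S
print(f"freshness SLI: {fresh.mean():.4%}")          # ~97.8% with these parameters
The whole definition lives in two timestamps and one budget; the engineering work is making the producer timestamp survive the pipeline intact.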
Shape 3 — synthetic-probe success rate for low-traffic endpoints. Used by CRED admin-API, Razorpay tier-1 settlement endpoints (called once per merchant per day), bank reconciliation APIs. Real traffic is too sparse for statistically meaningful ratios, so a synthetic probe runs every 30 seconds with a known-good payload from a known network location. The SLI is successful probes / total probes. The threshold is usually higher (99.99%) because probe traffic is uniform and noiseless. The shape applies to internal APIs, rarely-called integration endpoints, and any service where real customer requests are too few to compute a stable ratio.
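A minimal probe loop, assuming the requests library; the URL, the body check, and the 10-iteration cap are illustrative placeholders rather than a real endpoint or a production probe framework:
# probe_sli.py: Shape 3, synthetic-probe success rate (illustrative)
# pip install requests
import time
import requests
PROBE_URL = "https://api.example.internal/v1/settlement/probe"   # hypothetical
results = []
for _ in range(10):                     # real probes run forever; 10 for the sketch
    try:
        r = requests.get(PROBE_URL, timeout=2.0)
        # Per Q1/Q3, validate the body too: a probe that accepts any 200
        # inherits the silent-payload trap.
        ok = r.status_code == 200 and r.json().get("status") == "ok"
    except (requests.RequestException, ValueError):
        ok = False
    results.append(ok)
    time.sleep(30)                      # the 30-second cadence from the text above
print(f"probe SLI: {sum(results)}/{len(results)} = {sum(results) / len(results):.2%}")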
Shape 4 — operation-correctness ratio for stateful operations. Used by Razorpay settlement reconciliation, NSE order-matching engines, IRCTC seat-allocation. The form is (operations that produced the correct downstream state) / (total operations attempted). Correctness is measured asynchronously — after the operation completes, an audit job verifies the resulting state (a settlement transaction matches a bank statement, an order match is consistent with the limit-order book, a seat allocation is unique). The shape applies when the operation's "success" is not knowable at request-completion time but is verifiable later. The latency for SLI computation is intentionally lagged (often 5-30 minutes after the operation), but the SLI itself is real and contractual.
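A sketch of the Shape 4 audit join, assuming the operations log and the downstream state (a bank statement here) share a transaction id; the schema and the mismatch are invented for illustration:
# audit_sli.py: Shape 4, correctness verified after the operation completes
# pip install pandas
import pandas as pd
ops = pd.DataFrame({"txn_id": ["t1", "t2", "t3", "t4"],
                    "amount": [100.00, 250.00, 75.00, 310.00]})
bank = pd.DataFrame({"txn_id": ["t1", "t2", "t3", "t4"],
                     "amount": [100.00, 250.00, 75.50, 310.00]})  # t3 drifted 50 paise
# An operation is "good" iff the downstream state matches what it should be.
audit = ops.merge(bank, on="txn_id", how="left", suffixes=("_ops", "_bank"))
correct = audit["amount_ops"] == audit["amount_bank"]   # missing rows compare False
print(f"operation-correctness SLI: {correct.mean():.2%}")        # 75.00% here
The lag between operation and audit is a stated property of the SLI, not a defect.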
The four shapes overlap. A given service may use Shape 1 for its API surface, Shape 2 for its event-streaming output, and Shape 4 for its end-state verification — three different SLIs measuring three different aspects of the same service. The Razorpay checkout path uses all three: Shape 1 for payment-create (250ms latency, 2xx + payload), Shape 2 for the merchant-webhook delivery pipeline (5s freshness budget), Shape 4 for the daily settlement reconciliation (T+1 audit). Each SLI has its own SLO threshold and its own alert; the dashboard shows all three side-by-side. A team that picks only one shape misses the failure modes the other shapes would have caught.
The traps that kill SLIs in production
Beyond the basic filter, five traps recur often enough that every team should know them by name. Each has a real Indian production failure attached.
Trap 1 — the silent-payload trap. Discussed above for Cleartrip, but the pattern is industry-wide. A service returns HTTP 2xx with a degraded payload when its upstream stalls. The SLI based on status code alone reads green. Razorpay hit this in 2022 with the payment-create endpoint when the fraud-decision upstream throttled — the response was 200 OK with {"status": "queued_for_review", "txn_id": null}, the SLI continued to read 99.95%, but the user's payment had silently fallen into a holding queue with no completion ETA. The fix: payload validation in the SLI definition. Specifically, the "good" condition became 2xx AND response.txn_id != null AND response.status in ('captured', 'authorised'). The change cost two hours of engineering; the bug cost four hours of customer support volume per day until the fix landed.
Trap 2 — the latency-budget mismatch. A team picks a latency cutoff L for their SLI based on their current p95 — say, 250ms for a service running at p95=180ms, p99=240ms. For three months the SLI reports 99.4% (most requests fit under 250ms). Then the service migrates to a new region where p95 climbs to 280ms; suddenly the SLI drops to 87% and the alert storm begins, even though user-perceived latency is fine (users can tolerate 400ms). The SLI's latency budget was set against the system's current capability, not the user's patience window. The right cutoff is derived from user research or business-determined patience: payments tolerate 500ms before users abandon, search tolerates 1s, batch operations tolerate 5s. Pick L from the user's patience floor, not from your current p99. PhonePe's UPI-init SLI uses L=2s precisely because the NPCI round-trip alone consumes 800ms in normal conditions and the user's patience for a UPI tap is around 2.5s before they retry.
Trap 3 — the over-counted denominator. A naive SLI counts every request that hit the service in the denominator — including health checks, retries, abandoned requests (where the client TCP-closed before the server responded), and synthetic probes. Each inclusion biases the ratio. Health checks are uniform-success and inflate the ratio; abandoned requests are near-100% failures (the server response, even if successful, was never read) and deflate it; synthetic probes from inside the network mask user-experienced regional issues. Hotstar's playback-start SLI in 2022 was over-counting client-abandoned requests as failures and reporting an artificially low SLI of 96.8%; after filtering out abandoned-before-server-saw-anything requests, the SLI moved to 99.4% — a 2.6 percentage point shift driven entirely by denominator hygiene. The remediation is to be explicit about what "valid" means: typically valid = total - synthetic_probes - health_checks - client_abandoned_before_response. Every exclusion is a policy choice, document it.
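A sketch of how far the exclusions move the ratio; the traffic mix below is invented, loosely shaped like the Hotstar example:
# denominator_hygiene.py: same success policy, two denominators
# pip install numpy
import numpy as np
rng = np.random.default_rng(3)
n = 100_000
kind = rng.choice(["real", "health", "probe", "abandoned"],
                  p=[0.90, 0.05, 0.02, 0.03], size=n)
# Health checks and probes nearly always succeed; abandoned requests count
# as failures because the client never read the response.
success = np.where(kind == "real", rng.random(n) < 0.994,
                   np.where(kind == "abandoned", False, True))
print(f"naive SLI (everything counted): {success.mean():.4%}")   # ~96.5%
print(f"clean SLI (real traffic only):  {success[kind == 'real'].mean():.4%}")  # ~99.4%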
Trap 4 — the SLO-first reasoning that fits the SLI to the threshold. A team is told "we need 99.9%" by leadership, then picks an SLI definition loose enough to reliably hit 99.9%. The SLI becomes a vanity metric — it goes up because the definition got looser, not because the system improved. Razorpay's first-attempt merchant-dashboard-load SLI was specifically engineered to hit 99.9% by excluding any request that took longer than 5 seconds (counted as "client-side timeout, not our problem"). The dashboard glowed green; the customer-experience team kept getting complaints; the SLI was a cosmetic artefact. The remediation is to pick the SLI first (Q1: matches the user contract), then measure what the threshold should be (the historical SLI distribution determines the realistic SLO target), then iterate. Backwards-fitting the SLI to a desired threshold corrupts the entire discipline.
Trap 5 — the time-of-day blindness. An SLI averaged across a 28-day window can show 99.9% even though every Tatkal hour at 10:00 IST is at 92% and the rest of the day pulls the average up. The SLI's average hides regime-specific failures. The fix is regime-specific SLIs (covered for IRCTC in the previous chapter) or, when one SLI is mandatory, switching from a time-averaged to an event-based budget that weights peak-hour requests by their actual count. Hotstar's IPL-final SLI in 2024 caught a 7-minute playback regression that lasted only during the toss window because the SLI was event-counted rather than time-averaged — 7 minutes of bad at peak traffic was 4M failed requests, easily visible in the event count, invisible in the time average.
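The arithmetic of the trap, with invented numbers shaped like the Tatkal example: one degraded peak hour carrying a large share of the day's events.
# regime_blindness.py: time-averaged vs event-weighted SLI over a peak hour
# pip install numpy
import numpy as np
hours = 24
traffic = np.full(hours, 50_000)
traffic[10] = 2_000_000                    # a 10:00 IST Tatkal-style spike
hourly_sli = np.full(hours, 0.999)
hourly_sli[10] = 0.92                      # the peak hour is badly degraded
time_avg = hourly_sli.mean()
event_weighted = (hourly_sli * traffic).sum() / traffic.sum()
print(f"time-averaged SLI:  {time_avg:.4%}")        # ~99.57%, looks healthy
print(f"event-weighted SLI: {event_weighted:.4%}")  # ~94.88%, the peak shows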
Why these traps recur even with good engineers: each trap is a specific instance of "the SLI definition was correct against the system the team had in mind, but the actual system has a behaviour the team had not yet discovered". The silent-payload trap requires the team to know that upstream X can return degraded responses — which only happens after the first incident. The latency-budget mismatch requires the team to know the user's patience window — which only emerges after user research. The over-counted denominator requires the team to know which traffic is real-customer traffic — which requires an audit. The traps are not engineering oversights; they are knowledge gaps that production reveals over time. The discipline is to red-team every new SLI assuming all five traps apply until evidence rules them out.
Common confusions
- "A latency SLI and an availability SLI are the same thing if I combine them." Subtle and often wrong. Combining latency-pass and 2xx into a single AND-condition (
good = 2xx AND latency<L) creates a latency-conditioned availability SLI that goes red whenever either fails. This is fine when the contract is "fast and successful". But when latency degradation is a separate concern from availability degradation, combining them blurs the alert signal — you cannot tell from the SLI alone whether the service is slow-but-up or fast-but-failing. Two separate SLIs (one for status, one for latency) preserve the diagnostic distinction; one combined SLI loses it. Pick deliberately. - "I should use p99 as my SLI." The p99 is a percentile of a distribution, not a ratio of good to total. It can be an SLO threshold (
p99(latency) < 500ms) but is awkward as an SLI itself because percentile-of-histogram is interpolated and lies (Part 7). The cleaner shape is(requests with latency < L) / total— a ratio you can compute from a counter without quantile-from-histogram. Most production SLO platforms (Sloth, OpenSLO) prefer the ratio form. - "Synthetic-probe SLIs are inferior to real-traffic SLIs." Not always. Synthetic probes give you uniform traffic from a known location, which is the only way to detect "the service is up but unreachable from Hyderabad". Real traffic gives you customer-experienced reality but with sampling noise. The right answer is usually both — Shape 3 for low-traffic / region-coverage, Shape 1 for the bulk of customer-experience signal. Production SLO documents often define multiple SLIs for the same service.
- "The SLI definition only matters once." No. The SLI is a living artefact. Razorpay's payment-create SLI has been revised seven times across four years, each time in response to a discovered failure mode (silent payload, regional failover, 3D-secure upstream stall, refund-path SLI separation, etc.). A team that writes the SLI in quarter one and never revisits it has a stale artefact within a year.
- "A service should have one SLI." Wrong. A service usually has 3-7 SLIs, one per significant aspect of the user contract. Razorpay's checkout service has separate SLIs for
payment-create,payment-capture, the merchant-webhook delivery, the daily settlement reconciliation, and the customer-facing dashboard load. Each SLI has its own SLO threshold. Stack them on one panel; do not collapse them. - "More SLIs are always better." Also wrong. Each SLI requires alert rules, dashboard panels, on-call training, periodic review, and budget arithmetic. A team with 50 SLIs spends more time managing the SLO discipline than improving the system. The right number is usually 3-7 per service, covering the load-bearing user contracts. Below 3 you are missing failure modes; above 7 you are creating bureaucracy.
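On the first confusion above, a toy demonstration (invented failure rates) that a single AND-combined SLI cannot distinguish a slow-but-up incident from a fast-but-failing one, while two separate SLIs can:
# combined_vs_separate.py: the diagnostic cost of one AND-combined SLI
# pip install numpy
import numpy as np
rng = np.random.default_rng(11)
n = 50_000
def readings(error_rate, slow_rate):
    ok = rng.random(n) >= error_rate       # request succeeded
    fast = rng.random(n) >= slow_rate      # request met the latency budget
    return ok.mean(), fast.mean(), (ok & fast).mean()
for name, rates in [("slow-but-up", (0.001, 0.050)),
                    ("fast-but-failing", (0.050, 0.001))]:
    avail, lat, combined = readings(*rates)
    print(f"{name:17s} combined={combined:.3%} "
          f"availability={avail:.3%} latency={lat:.3%}")
# Both incidents read ~94.9% on the combined SLI; only the two separate
# SLIs tell you which failure mode you are actually in.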
Going deeper
Customer-research-driven latency budgets — the patience window
The single most-important number in a request-success SLI is the latency budget L. Most teams pick L by looking at their current p99 and rounding up — which guarantees a safety margin against current performance but says nothing about user tolerance. The better method is to derive L from user-research evidence on the patience window: how long will the user wait before they retry, abandon, or call support?
The Indian fintech research from 2022-2024 (UPI Reserve Bank study, Razorpay UX team's 2023 paper, Cleartrip's 2024 abandonment analysis) converges on patience-window estimates: UPI payment 2.0-2.5s, card payment 3-4s, search 1.5s, itinerary planning 4-6s, dashboard load 2s, video-start 1s with 2s tolerance. These are user-stated patience floors, not engineering targets. The latency budget L should sit under the patience window with a safety margin — for UPI, L=1.5s leaves 500-1000ms of buffer for the user's network hop. Setting L=p99 of the current system means the SLI tracks engineering-current rather than user-tolerated, and any system improvement makes the SLI silently easier to hit.
The harder case is when patience-window evidence does not exist. The fallback is to derive L from a related observable: the abandonment-rate-vs-latency curve. Bucket requests by completion latency, plot the fraction of users who initiated a retry-or-abandonment within the next 60 seconds. The latency at which abandonment-rate inflects upward is the patience window. This requires logs that join request completion to subsequent user behaviour — a join most teams do not have, but should. CRED's 2023 SLO rewrite included instrumenting exactly this join, and the L for the rewards-redeem endpoint moved from 800ms (engineering-derived) to 1.4s (abandonment-derived) — a 75% widening that matched what users actually tolerated. The wider L did not make the system worse; it made the SLI honest.
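A sketch of the abandonment-curve method under an assumed behaviour model; the logistic shape and the 1.4s knee are invented for illustration, and the real curve has to come from your own joined logs:
# patience_window.py: derive L from an abandonment-vs-latency curve
# pip install numpy pandas
import numpy as np, pandas as pd
rng = np.random.default_rng(5)
n = 200_000
latency_ms = rng.lognormal(np.log(400), 0.8, n)
# Assumed behaviour model: abandonment probability climbs steeply near ~1.4s.
p_abandon = 1 / (1 + np.exp(-(latency_ms - 1400) / 250))
abandoned = rng.random(n) < p_abandon
df = pd.DataFrame({"latency_ms": latency_ms, "abandoned": abandoned})
df["bucket"] = pd.cut(df["latency_ms"], bins=np.arange(0, 3001, 200))
print(df.groupby("bucket", observed=True)["abandoned"].mean().round(3))
# Read off where the rate inflects upward (around 1.2-1.6s here) and set L
# just below that knee, not at your current p99.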
OpenSLO YAML — the shape of an SLI as code
The OpenSLO specification (openslo.com) is the YAML schema that tools like Sloth, Nobl9, and Datadog SLO widgets consume. The SLI definition in OpenSLO is structured exactly as Q1-Q5 demand: an explicit kind: SLI resource with metricSource (where the data comes from), ratioMetric or thresholdMetric (the shape), and explicit good and total queries. Here is what a Razorpay payment-create SLI looks like in OpenSLO:
apiVersion: openslo/v1
kind: SLI
metadata:
  name: payment-create-success
spec:
  description: |
    HTTP 2xx with non-empty txn_id payload, latency < 250ms,
    excluding synthetic probes and client-abandoned requests.
  ratioMetric:
    counter: true
    good:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(payment_create_total{
              status="2xx",
              payload_ok="true",
              latency_bucket="<=0.25",
              probe="false"
            }[5m]))
    total:
      metricSource:
        type: Prometheus
        spec:
          query: |
            sum(rate(payment_create_total{
              probe="false",
              client_abandoned="false"
            }[5m]))
This format forces every Q1-Q5 question to have a written answer. Q1 (user contract) is captured in the description. Q2 (reproducible query) is the literal PromQL. Q3 (100% means happy) is exposed by the explicit good definition — anyone reading the YAML can see that the SLI does not validate end-to-end settlement, which means a 100% reading still allows downstream settlement bugs. Q4 (computable) is implicit in the metric source choice. Q5 (responsiveness) is the 5-minute rate window. The YAML format is not magic; it is a forcing function for the discipline.
When to use error-rate as the SLI direction vs success-rate
Most SLIs are written as success-rate (good/total). Some are written as error-rate (bad/total) — typically when the system is dominated by errors and the success-rate would round to 100%. For a service running at 99.9% availability, success-rate is 99.9% (three significant figures of meaningful information); error-rate is 0.1% (one significant figure of meaningful information, but more visually striking when displayed). The two forms are equivalent for arithmetic, but the visual encoding matters: an error-rate dashboard at 0.1% with a sudden spike to 0.5% is more legible than a success-rate dashboard at 99.9% with a sudden dip to 99.5%. Most teams find error-rate easier to alarm on (the threshold is "error_rate > X") and harder to calibrate burn-rate against. The conventional shape in production is success-rate for SLO/budget arithmetic, error-rate for alerting threshold expressions, both derivable from the same underlying counter.
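The equivalence is one line of arithmetic; a trivial sketch with made-up counter values:
# two_directions.py: success-rate for budget arithmetic, error-rate for alerting
good, total = 998_750, 1_000_000                 # invented counter values
success_rate = good / total                      # 99.8750%, feeds SLO/budget math
error_rate = 1 - success_rate                    # 0.1250%, feeds the alert expression
SLO = 0.999
print(f"success={success_rate:.4%}  error={error_rate:.4%}")
print("burning budget:", error_rate > (1 - SLO)) # True: 0.125% > the 0.1% allowed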
How span-based SLIs differ from request-based SLIs in tracing-heavy systems
A trace-instrumented system has spans for every internal hop, not just the entry-point request. Some teams compute SLIs over span data rather than request counter data — (spans completed within budget) / (spans started), filtered by service.name and name. This produces per-service SLIs at higher resolution than per-endpoint counter SLIs. The win: a degradation in the payment-router → fraud-decision span shows up directly in the fraud-decision SLI even if the parent payment-create request still returns 2xx. The cost: span-based SLIs are only trustworthy when the sampling policy preserves the good/bad ratio — compute them from unsampled span-derived metrics, or correct for a tail-based policy that preferentially keeps error traces, because a naive ratio over biased samples lies — and they depend on consistent attribute tagging across services. Razorpay's 2024 SLI rewrite moved tier-1 services to span-based SLIs because the per-span resolution caught upstream stalls that per-request SLIs missed; tier-2 services stayed on per-request SLIs because the span-data infrastructure was over-provisioned for them. The right answer is environment-specific.
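Why the sampling policy matters for span-based ratios: a simulation, with invented rates, of a policy that preferentially keeps error traces, as tail-oriented samplers typically do.
# span_sampling_bias.py: naive ratios over sampled spans inherit the policy's bias
# pip install numpy
import numpy as np
rng = np.random.default_rng(9)
n = 1_000_000
span_ok = rng.random(n) < 0.995              # true per-span success: 99.50%
uniform_keep = rng.random(n) < 0.10          # head sampling: keep 10% of everything
biased_keep = ~span_ok | (rng.random(n) < 0.10)  # keep all errors plus 10% of the rest
print(f"true SLI:            {span_ok.mean():.4%}")
print(f"uniform 10% sample:  {span_ok[uniform_keep].mean():.4%}")  # ~unbiased
print(f"error-biased sample: {span_ok[biased_keep].mean():.4%}")   # ~95.2%, a lie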
Composing multi-service SLIs without compounding into uselessness
The compounding problem from chapter 62 (each service at 99.9% gives an end-to-end SLO of roughly 99.2% across 8 services, since 0.999^8 ≈ 0.992) has a flip side when SLIs are composed. If the user-facing SLI is "checkout flow succeeds end-to-end" and the flow traverses 8 services, the SLI must aggregate signal across all 8. Two approaches: AND-composition (all 8 spans must succeed → user-flow good) or trace-based composition (the user-flow span at the entry point captures success regardless of which internal span failed). Trace-based is cleaner but requires consistent baggage propagation; AND-composition is mechanically simple but biases the SLI downward (any single internal failure flags the whole flow). Most teams use trace-based for the user-facing SLI and per-service SLIs as supporting signal, with the dashboard showing both layers. This is also why per-service SLOs at 99.9% are not enough for a user-facing 99.9% — without architecture-aware composition, the user-facing SLI compounds badly.
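The compounding arithmetic in one short sketch; the retry model in the second half is invented, purely to illustrate why an entry-point (trace-based) view can read higher than the AND of per-hop outcomes:
# composition.py: AND-composition vs an entry-point (trace-based) view
# pip install numpy
import numpy as np
rng = np.random.default_rng(13)
services, n, p = 8, 200_000, 0.999
hops_ok = (rng.random((n, services)) < p).all(axis=1)   # every hop succeeded
# Invented retry model: 80% of flows with a failed hop still succeed
# end-to-end because the caller retries or a fallback answers.
rescued = ~hops_ok & (rng.random(n) < 0.80)
flow_ok = hops_ok | rescued
print(f"analytic 0.999^{services}:       {p ** services:.4%}")   # ~99.2028%
print(f"AND-composed SLI:        {hops_ok.mean():.4%}")          # ~99.20%
print(f"entry-point (trace) SLI: {flow_ok.mean():.4%}")          # ~99.84% here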
Where this leads next
Chapter 64 derives the error-budget arithmetic from the SLI you just chose — given a 99.9% SLO over 28 days, how many bad events the budget allows, and what spending it well looks like. Chapter 65 builds burn-rate alerting on top: the multi-window-multi-burn-rate scheme that derives every paging alert mechanically from the SLI/SLO pair. Chapter 66 covers the cross-functional negotiation around SLIs and SLOs — how engineering, product, and finance own different parts of the same number. Chapter 67 walks through per-customer-tier SLIs, the dual-regime case that the IRCTC example previewed.
The reader's exercise: pick one production service. Write its candidate SLI as a literal PromQL/LogQL query. Apply the five questions in order. Write down what fails which question, and how you would fix it. Bring the result to chapter 64 — the error-budget arithmetic is meaningless without an SLI that survives the filter.
A second exercise: take the script in this chapter, replace the synthetic data generator with a recorded slice of real telemetry (a query_range Prometheus dump, a JSON export from your APM tool), and re-run the four candidate SLIs. The gaps between candidates and ground truth on real data are typically larger than the synthetic example suggests — every team's first measurement of "what does my SLI actually track?" is humbling, and the humbling is the point.
References
- Beyer, Jones, Petoff, Murphy, Site Reliability Engineering (O'Reilly 2016), Chapter 4 "Service Level Objectives" — the foundational treatment, especially the "what to measure" subsection.
- Beyer, Murphy, Rensin, Kawahara, Thorne, The Site Reliability Workbook (O'Reilly 2018), Chapter 2 "Implementing SLOs" — the practical playbook for SLI selection, including the patience-window-vs-engineering-target distinction.
- Alex Hidalgo, Implementing Service Level Objectives (O'Reilly 2020) — the longest book-length treatment of SLI selection, with detailed coverage of the silent-failure trap and the over-counted-denominator problem.
- OpenSLO specification — the YAML schema that forces SLI definitions into reproducible queries.
- Sloth — Prometheus SLO generator — the open-source tool that consumes OpenSLO YAML and emits Prometheus recording rules. Reading Sloth's tests teaches more about SLI shape than most blog posts.
- Datadog SLO documentation — the most-deployed commercial SLO platform; their SLI templates encode many of the patterns in this chapter.
- /wiki/sli-slo-sla-the-definitions-that-matter — chapter 62, the vocabulary that this chapter operationalises.
- /wiki/error-budget-math — chapter 64 (forward link), where the SLI from this chapter feeds into budget arithmetic.
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install numpy pandas
python3 choosing_slis.py
# Then mutate the SLI definitions: tighten the 250ms latency to 150ms,
# loosen the payload check, change the 4% silent-failure rate to 1%.
# Watch how the gap-vs-ground-truth shifts. The SLI is a knob; the data
# is fixed; the choice of definition determines which lies you inherit.