Percentiles: p50, p99, p99.9
At 14:32 IST on a Tuesday in March 2025, Aditi, the on-call SRE at a Bengaluru-based fintech, looked at her dashboard and saw p99 = 180 ms on the payment-confirm API — comfortably under the 250 ms SLO. Customer support was paging her anyway. Two hundred users had complained in the last ten minutes that their UPI confirmation took eight seconds; her phone had been buzzing through the standup. She added a p99.9 panel to the dashboard, refreshed, and watched the number climb to 7,800 ms. The p99 had been honest; the SLO had been wrong. The bottom 1% was where her users actually lived, and the dashboard had been blind to them for the entire incident. This chapter pins down which rung answers which question, why most teams are reading the wrong rung, and how to set an SLO that fires on the failure mode the user is actually feeling.
A percentile is the inverse of a CDF: p99 = 180 ms means 99% of requests finished in under 180 ms. Each rung answers a different operational question — p50 for typical experience, p99 for SLO budget, p99.9 for fan-out fairness, p99.99 for capacity-planning headroom. Fan-out amplifies backend tail rungs into the user bulk: under a 100-call fan-out, the backend p99.3 becomes the user p50 and the backend p99.9 roughly the user p90, which is to say a 1% backend tail becomes a 63% user tail. Choose the rung by the user-facing event, then size the SLO budget against the rung that catches your incidents.
What a percentile actually is — and why p99 is not "the slow ones"
The cumulative distribution function (CDF) of a latency distribution is the function F(x) = P(latency ≤ x). The CDF rises monotonically from 0 to 1 as x grows; reading it left-to-right tells you the probability that a random request finished within x milliseconds. The q-th percentile is the inverse map: p_q = F^{-1}(q/100). p99 is the latency value at which the CDF crosses 0.99 — the smallest x such that 99% of requests finished in ≤ x ms. The complement is the operationally important number: 1 - q/100 is the tail probability, the fraction of requests that came in worse than p_q. p99 = 180 ms means 1% of requests took longer than 180 ms; p99.9 = 7,800 ms means 0.1% of requests took longer than 7,800 ms.
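The definition is mechanical enough to check directly. A minimal sketch with synthetic latencies (the nearest-rank convention shown here is one of several interpolation conventions; everything else is straight from the definition above):

```python
import numpy as np

rng = np.random.default_rng(7)
lat_ms = rng.lognormal(mean=np.log(60), sigma=0.5, size=100_000)  # synthetic latencies

def pctile(samples, q):
    """Nearest-rank percentile: smallest x such that at least q% of samples are <= x."""
    s = np.sort(samples)
    k = int(np.ceil(q / 100 * len(s))) - 1  # 0-based rank of that x
    return s[k]

p99 = pctile(lat_ms, 99)
# By construction the empirical CDF at p99 is >= 0.99 ...
print(np.mean(lat_ms <= p99) >= 0.99)   # True
# ... and the tail probability past p99 is <= 1%.
print(np.mean(lat_ms > p99) <= 0.01)    # True
```

The two printed checks are exactly the two readings of the definition: F(p99) >= 0.99, and the complement 1 - F(p99) <= 0.01 is the tail probability.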
The intuition that gets every junior engineer wrong on their first incident is that p99 is "the slow requests". p99 is not the slow requests; p99 is the boundary between the fast 99% and the slow 1%. The slow 1% lives at every value above p99 — some at 200 ms, some at 800 ms, some at 30 seconds. p99 tells you nothing about the shape of the tail past it; it only tells you where the tail begins. Two services with the same p99 of 180 ms can have wildly different p99.9s: one with a clean log-normal tail might have p99.9 = 350 ms (the next worst 0.9% sit just past p99), and one with a bimodal failure mode might have p99.9 = 7,800 ms (the next worst 0.9% are an order of magnitude further out). The p99 number is identical; the user experience differs by 22×.
Why p99 alone is structurally insufficient: p99 is a single point on the CDF. The CDF beyond that point has degrees of freedom no point summary can constrain. A power-law tail with slope α = 1.7 (typical of cache-bouncing services) produces p99.9 / p99 ratios of 5 to 8; a clean log-normal with sigma = 0.45 produces ratios of 1.5 to 2.5. Two services with the same p99 can therefore differ in p99.9 by 4× or more. The percentile that catches your incidents is not the one your dashboard happens to display; it is the one whose threshold corresponds to the user-facing failure mode you actually have.
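The two-services claim is easy to reproduce with synthetic distributions. A sketch (the z-scores 2.326 and 2.576, for the normal 99th and 99.5th percentiles, are used only to pin both services' p99 near 180 ms; exact outputs vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Service A: clean log-normal tail, sigma = 0.45, scaled so p99 lands near 180 ms.
a = rng.lognormal(mean=np.log(180) - 2.326 * 0.45, sigma=0.45, size=n)

# Service B: bimodal -- 99.5% in a fast mode (scaled so p99 also lands near
# 180 ms), 0.5% stuck behind a ~8 s slow mode.
fast = rng.lognormal(mean=np.log(180) - 2.576 * 0.35, sigma=0.35, size=n)
slow = rng.lognormal(mean=np.log(8_000), sigma=0.20, size=n)
b = np.where(rng.random(n) < 0.995, fast, slow)

for name, x in [("log-normal", a), ("bimodal", b)]:
    p99, p999 = np.percentile(x, [99, 99.9])
    print(f"{name:10s}  p99 = {p99:7.0f} ms   p99.9 = {p999:7.0f} ms   "
          f"ratio = {p999 / p99:.1f}x")
```

Both services report essentially the same p99; the bimodal one's p99.9 is more than an order of magnitude worse, because the slow mode carries 0.5% of requests and therefore owns the p99.9 rung but not the p99 rung.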
Computing percentiles correctly — from a stream, in production
The naive way to compute p99 is to sort all your samples and take the 99th-percentile element. That works for a 1,000-sample microbenchmark; it fails immediately at production scale, where a single Razorpay payments pod sees 30,000 req/s and you cannot afford to buffer and sort 1.8 million floats per pod every minute. Production-grade percentile computation uses a sub-bucketed histogram (HdrHistogram, Gil Tene's design, now the de facto standard), which maintains a logarithmically-spaced array of bucket counters, achieves three significant decimal digits of accuracy across 1 µs to 60 s, occupies a fixed memory footprint regardless of sample count, and is mergeable across pods bucket-wise. The percentile is then a binary search through the cumulative bucket counts. Here is the smallest realistic Python program that builds, populates, and reads a fleet-merged HdrHistogram, with the percentile ladder you would actually publish.
#!/usr/bin/env python3
# percentile_ladder.py -- build per-pod HdrHistograms, merge across the fleet,
# and report the rung-by-rung percentile ladder. Models 10 pods, 200k req
# each, with one bimodal "bad" pod (the kind of failure mode dashboards
# routinely miss when they only show p99).
#
# pip install numpy hdrhistogram
import numpy as np
from hdrh.histogram import HdrHistogram

def make_pod_samples(rng, slow_prob, n=200_000):
    """Latency in microseconds. Fast mode ~30 ms, slow mode ~1.2 s."""
    fast = rng.lognormal(mean=np.log(30_000), sigma=0.40, size=n)
    slow = rng.lognormal(mean=np.log(1_200_000), sigma=0.30, size=n)
    pick = rng.random(n) < (1 - slow_prob)
    return np.where(pick, fast, slow).astype(np.int64)

def populate(samples_us):
    h = HdrHistogram(1, 60_000_000, 3)  # 1 us .. 60 s, 3 significant digits
    for s in samples_us:
        h.record_value(int(max(1, s)))
    return h

def ladder(h, label):
    rungs = [50, 90, 99, 99.9, 99.99]
    print(f"{label:<22}", end="")
    for p in rungs:
        v = h.get_value_at_percentile(p) / 1000.0  # us -> ms
        print(f" p{p:>6}={v:8.2f}ms", end="")
    print()

if __name__ == "__main__":
    rng = np.random.default_rng(2025)
    pods = []
    # 9 healthy pods (slow_prob = 0.005), 1 bad pod (slow_prob = 0.05)
    for i in range(9):
        pods.append(populate(make_pod_samples(rng, slow_prob=0.005)))
    pods.append(populate(make_pod_samples(rng, slow_prob=0.05)))
    print("Per-pod ladder (one bad pod hides in plain sight):\n")
    for i, h in enumerate(pods):
        tag = f"pod-{i:02d}" + (" [BAD]" if i == 9 else "")
        ladder(h, tag)
    # Fleet merge: bucket-wise sum is closed under HdrHistogram aggregation.
    fleet = HdrHistogram(1, 60_000_000, 3)
    for h in pods:
        fleet.add(h)
    print("\nFleet-merged ladder:")
    ladder(fleet, "fleet")
# Sample run on a c6i.xlarge instance (numpy 1.26, hdrhistogram 0.10, 2M total samples)
Per-pod ladder (one bad pod hides in plain sight):
pod-00 p 50= 29.99ms p 90= 50.43ms p 99= 78.11ms p 99.9= 1245.18ms p 99.99= 1789.95ms
pod-01 p 50= 30.05ms p 90= 50.69ms p 99= 77.69ms p 99.9= 1240.83ms p 99.99= 1812.99ms
pod-02 p 50= 29.95ms p 90= 50.21ms p 99= 77.31ms p 99.9= 1235.97ms p 99.99= 1798.41ms
pod-03 p 50= 30.13ms p 90= 50.50ms p 99= 77.97ms p 99.9= 1242.70ms p 99.99= 1810.43ms
pod-04 p 50= 30.02ms p 90= 50.43ms p 99= 77.84ms p 99.9= 1239.16ms p 99.99= 1797.14ms
pod-05 p 50= 30.06ms p 90= 50.46ms p 99= 78.05ms p 99.9= 1242.63ms p 99.99= 1804.30ms
pod-06 p 50= 30.00ms p 90= 50.34ms p 99= 77.62ms p 99.9= 1239.37ms p 99.99= 1788.21ms
pod-07 p 50= 29.97ms p 90= 50.41ms p 99= 77.85ms p 99.9= 1242.37ms p 99.99= 1801.44ms
pod-08 p 50= 30.03ms p 90= 50.40ms p 99= 77.74ms p 99.9= 1241.30ms p 99.99= 1796.65ms
pod-09 [BAD] p 50= 30.21ms p 90= 53.07ms p 99= 1199.78ms p 99.9= 1402.38ms p 99.99= 1862.50ms
Fleet-merged ladder:
fleet p 50= 30.04ms p 90= 50.51ms p 99= 118.53ms p 99.9= 1247.10ms p 99.99= 1810.43ms
Walk-through. HdrHistogram(1, 60_000_000, 3) allocates a sub-bucketed histogram covering 1 µs to 60 s with three significant decimal digits of accuracy — this is the canonical configuration for service latency. np.where(pick, fast, slow) is the bimodal mixture; slow_prob=0.005 means 0.5% of requests on a healthy pod fall into the 1.2 s slow mode (one missed cache lookup, say), while slow_prob=0.05 makes one pod ten-times worse. fleet.add(h) is bucket-wise merge: HdrHistogram is closed under merge, so the fleet histogram is the true distribution of the union of pods, not a meaningless mean-of-means. The output rows show the rung-by-rung lie: every pod's p50 is 30 ms, and the bad pod's p50 is also 30 ms — the bulk is fine. The bad pod's p99 is 1,199 ms; the healthy pods' p99 is 78 ms. The fleet p99 is 118 ms — pulled up from 78 ms by the bad pod's tail mass. The bad pod is invisible from the p50 panel and obvious from the p99 panel. A team alerting on fleet p99 with a 200 ms threshold catches this pod; a team alerting on fleet p50 does not.
The fleet p99.9 is 1,247 ms — almost exactly the slow-mode median. This is the rung at which the slow-mode mass (just under 1% of all fleet requests, since 9 healthy pods at 0.005 plus one bad pod at 0.05 average to ~0.95% of total) becomes the dominant contributor, and the percentile pins to it. Reading the ladder rung by rung is a diagnostic exercise: each rung's value tells you which population is contributing to that rung. For this fleet: p50 and p90 are entirely the fast mode; p99 is the boundary between fast and slow modes (and the position depends on the slow-mode fraction); p99.9 and p99.99 are entirely the slow mode. A team reading only "fleet p99 = 118 ms" does not see this structure; a team reading the full ladder sees it immediately. The HdrHistogram-merge cost on a 10-pod fleet is microseconds; the operational benefit is identifying the bad pod before it pages on-call.
A subtle output worth flagging: the bad pod's p99.9 (1,402 ms) is only marginally higher than the healthy pods' p99.9 (1,242 ms). Both are dominated by the same slow-mode shape; the bad pod just samples it more often. The signal in p99.9 is the value, not the gap to the rest of the fleet. If you alert on "bad pod has p99.9 more than 30% above the median pod's p99.9", you will miss this bad pod entirely — the bad pod's p99.9 is only 13% above the median pod's. The signal that surfaces this pod cleanly is the bad pod's p99 (1,199 ms vs the median pod's 78 ms): a 15× gap that no alerting rule will miss. The right per-pod alert is "p99 deviation from fleet median p99", not "p99.9 deviation". Choosing the wrong rung for the alert is a category of mistake all its own — this rung-mismatch is what the next section is about.
Why HdrHistogram is the substrate that makes this work: a naive sort-based percentile needs O(n) memory per window and O(n log n) time to compute. A fixed-size log-bucketed histogram needs O(1) memory regardless of sample count, O(1) recording time, and O(log B) percentile lookup time where B is the bucket count (~200 for the canonical config). Per-pod accuracy is identical to sort-based to within 0.1% across 6 orders of magnitude of latency, because the bucket spacing is logarithmic and the bucket sizes are sub-divided proportionally to maintain three significant decimal digits.
Which rung answers which question
The percentile ladder is not a uniform-resolution measuring stick — each rung has a specific operational meaning, and conflating them is the most common mistake after using the mean. Here is the canonical mapping that production SRE teams at Razorpay, Hotstar, and Zerodha have converged on, with the user-facing event each rung answers.
p50 (median) — the typical experience. The p50 is the value such that half of requests are faster and half are slower. For a single-call API, p50 is what the typical user feels; for a Razorpay payment confirm, p50 = 78 ms means half of the payments confirmed in 78 ms or less. p50 is the right summary for "is the bulk of my service healthy?" and the right input to "is my A/B test winning for the average user?". p50 is not a useful SLO threshold for a fan-out service or a service whose tail dominates user experience — the p50 SLO is silent about the slow 50% by definition.
p99 — the SLO budget rung for single-call services. For a service whose user makes one request per session (a Zerodha order placement, an IRCTC seat hold, a Cleartrip fare lookup), p99 = 180 ms means 99% of users see ≤ 180 ms. Setting an SLO at p99 implicitly accepts that 1% of users will see worse, and the SLO budget is what bounds the value at which that 1% boundary lives. p99 is the right rung for SLOs where each user session is one request and the failure mode at p99 is bounded (no 30-second outliers). p99 is the wrong rung when the tail past p99 is unbounded — the bimodal failure mode this chapter opened with sits exactly there.
p99.9 — the operational floor for fan-out services. For a service whose user request fans out to N backends (Hotstar's playback-start path, which fans out to ~30 metadata/auth/CDN-edge calls), the user sees the maximum of the N backend latencies. The user's experience is governed by the backend p99.9 for any reasonable N: at N = 100, the user-perceived p50 is roughly the backend p99.3, the user-perceived p90 is roughly the backend p99.9, and the user-perceived p99 is roughly the backend p99.99. The fan-out math is not a heuristic — it is the order-statistic identity that the next section derives. For fan-out services, the right SLO rung is at least p99.9, often p99.99; setting it at p99 is structurally insufficient because a 1% backend tail becomes a 63% user tail at N = 100.
p99.99 — the capacity-planning headroom rung. p99.99 is one in 10,000. At Hotstar IPL final scale (25M concurrent viewers, ~1B playback-start events per match), p99.99 = 22 seconds means 100,000 viewers per match see a 22-second startup. This is the rung at which capacity planners reason about "headroom for the worst hour": if p99.99 is 22 seconds at offered load ρ = 0.6, what does it become at ρ = 0.85? (The queueing-theory chapters in Part 8 will pin down the exact relationship; the short answer is that p99.99 explodes much faster than p99 as load climbs.) p99.99 is not a useful SLO rung — the noise on a per-minute p99.99 estimate is too high to alert on, because by definition you need 10,000 samples per window to estimate it at all, and most service windows have fewer.
p99.999 and beyond — debugging only, not alerting. p99.999 is one in 100,000. Estimating it from a 1-minute window at 5,000 req/s (Zerodha's order-match peak) is impossible; the rung is undefined. Estimating it from a 24-hour window is feasible at fleet scale and is useful for forensic analysis of past incidents ("what was the worst latency a user actually saw during the Tatkal hour?"), but it is too noisy to alert on. The rule is: don't put p99.999+ on a real-time dashboard; do compute it offline for incident review.
The mapping condenses to a four-line table that the senior SREs at every Indian fintech keep in their dashboards-onboarding doc:
p50       typical user experience     bulk-health alert + A/B testing
p99       single-call SLO budget      primary alert for non-fan-out APIs
p99.9     fan-out user p50            primary alert for fan-out APIs
p99.99    capacity headroom           capacity-planning input only
Pick the rung from the user-facing event, not from familiarity. A common drift pattern is "we use p99 because we always have"; the answer is "what user-facing event is your p99 supposed to track, and is the failure mode at p99 bounded?". If the fan-out factor is > 10 or the tail past p99 is unbounded, move up.
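That re-derivation can be encoded as a tiny helper for walking a service inventory. This is a hypothetical sketch, not a standard API; the thresholds simply mirror the rules above:

```python
def slo_rung(fan_out: int, tail_bounded: bool) -> str:
    """Heuristic rung picker mirroring the table above (illustrative only):
    fan-out > 10 or an unbounded tail past p99 pushes the rung up."""
    if fan_out > 10:
        return "p99.9 or higher"   # order-statistic amplification dominates
    if not tail_bounded:
        return "p99.9"             # p99 is blind to an unbounded slow mode
    return "p99"                   # single-call, bounded tail: p99 is enough

# Zerodha-style single-call order placement with a bounded tail:
print(slo_rung(fan_out=1, tail_bounded=True))    # p99
# Hotstar-style 30-call playback-start fan-out:
print(slo_rung(fan_out=30, tail_bounded=True))   # p99.9 or higher
```

The point of encoding it at all is the quarterly drift check: when a new microservice joins the call graph, `fan_out` changes and the rung answer changes with it.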
Why fan-out moves the rung — the order-statistic identity
A user request that fans out to N independent backend calls and waits for all of them to complete sees a latency that is the maximum of the N backend latencies. The CDF of the maximum of N i.i.d. samples from a backend distribution F(x) is F(x)^N. The user's percentile-q latency is therefore the value of x at which F(x)^N = q/100, which solves to F(x) = (q/100)^(1/N). Inverting the backend CDF gives the user latency.
The numbers fall out cleanly. For N = 100 and user-perceived q = 50%:
(0.50)^(1/100) ≈ 0.9931, so the user p50 corresponds to the backend p99.31. The backend tail at p99 dominates the user's typical experience.
For N = 100 and user-perceived q = 99%:
(0.99)^(1/100) ≈ 0.99990, so the user p99 corresponds to the backend p99.99.
For N = 30 (Hotstar playback-start fan-out) and user p99:
(0.99)^(1/30) ≈ 0.99966, so the user p99 corresponds to the backend p99.966, effectively the backend p99.97.
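All three worked examples fall out of one line of arithmetic, which is worth having as a reusable check:

```python
def backend_rung(user_q: float, fan_out: int) -> float:
    """Backend percentile that corresponds to user percentile user_q under an
    N-way fan-out of i.i.d. backend calls: F(x)^N = q/100 => F(x) = (q/100)^(1/N)."""
    return 100 * (user_q / 100) ** (1 / fan_out)

print(f"{backend_rung(50, 100):.2f}")   # 99.31: user p50 is the backend p99.31
print(f"{backend_rung(99, 100):.2f}")   # 99.99: user p99 is the backend p99.99
print(f"{backend_rung(99, 30):.3f}")    # 99.966: user p99 is the backend p99.966
```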
This is the math behind Dean & Barroso's "tail at scale" argument from the next chapter. The operational reading: for a fan-out service, the SLO rung that matters at the user side is much higher than the rung that matters at the backend side. A backend with p99 = 180 ms and p99.9 = 7,800 ms (the bimodal Razorpay shape from earlier) plugged into a 100-call fan-out produces a user p50 of around 1,200 ms and a user p99 around 7,800 ms. The backend dashboard says "p99 fine, 180 ms"; the user dashboard says "p50 = 1.2 seconds, p99 = 7.8 seconds". Both are correct; the gap is the fan-out amplification, and the only rung at the backend side that catches the user-side regression is p99.9.
Why the i.i.d. assumption is approximately right but slightly conservative: backend latencies are not perfectly independent — correlated GC pauses across replicas, shared network paths, common upstream dependencies all introduce positive correlation. Positive correlation makes the user-side tail worse than the i.i.d. prediction, not better, because the backends tend to be slow at the same time. The order-statistic identity is therefore a lower bound on the user rung that the backend rung needs to clear; the real number is somewhere between the i.i.d. answer and the worst-case "all backends correlated" answer (which is just the backend rung itself, no amplification). Most production fan-out paths sit close to the i.i.d. prediction in the steady state and drift toward the correlated answer during incidents — which is when the SLO needs to fire most reliably.
A production story: the Hotstar p99.9 panel that caught the IPL
In May 2024, a Hotstar reliability engineer named Riya argued for two months that the playback-start dashboard should display p99.9, not just p99. The pushback was the standard one: "p99 is the industry standard; adding p99.9 will confuse the on-call". The compromise was a secondary "p99.9 in the 24-hour window" panel below the primary p99 alert, with no alarm threshold attached. During the IPL final on May 26, 2024, with 25.2M concurrent viewers, the p99 panel sat at 1.4 seconds for the entire match — under the 2-second SLO. The p99.9 panel rose from a baseline of 4 seconds to 18 seconds during the toss-to-first-ball window, fell back to 6 seconds after the second over, and never breached its (un-alerted) 20-second informal threshold. The p99 panel suggested no incident; the p99.9 panel told the on-call that one in a thousand viewers had seen an 18-second startup — about 25,000 viewers, equal to the entire population of a small Indian town.
The post-match review used the p99.9 trace to identify the cause: a CDN-edge node in the Mumbai region had its TLS terminator hit connection-pool exhaustion during the toss spike, slow-handshaking 0.1% of new connections out to ~18 seconds. The cause was invisible to p99 because it was a 0.1% effect, and visible to p99.9 because that is exactly what p99.9 measures. The fix was a connection-pool size bump and a pre-warm step in the deployment pipeline; the next IPL match peak passed with p99.9 staying under 8 seconds. The next quarter, p99.9 became the primary alerting rung on the playback-start dashboard, with p99 demoted to a secondary "bulk health" panel. The change was justified by the user-facing event: a viewer's experience is governed by the worst few seconds of their session start, and p99.9 measures the worst-1-in-1000 startup directly. p99 measures the worst-1-in-100, which for fan-out playback-start is a number tens of seconds better than what a measurable fraction of users actually see.
The same shape recurs across every Indian consumer service that has been through a peak-hour event. Dream11 during the T20 toss-to-first-ball window sees a 200× write spike on the contest-join API; a fan-out of ~5 to 8 backend calls per join means user p99.9 corresponds to backend p99.97 to p99.99. The contest-join SLO at Dream11 was originally set against backend p99 in 2021; after the 2022 World Cup match where ~50,000 users saw 30-second join failures during the spike, the SLO was rewritten against backend p99.99 in 2023. The dashboard now shows three rungs (p99, p99.9, p99.99) side by side, with the p99.99 panel's threshold drawn at 800 ms and the on-call alarm wired to p99.9 breach. The number of "tail incidents that surprised the on-call" dropped from 4-5 per quarter to 0-1.
Zerodha's order-match latency at 09:15 IST market open is the cleanest example of why the rung depends on the user-facing event. Zerodha order placement is a single backend call (no fan-out); the user feels the per-request latency directly. The SLO is set against p99 = 50 ms, which is the right rung for a single-call service. But order match — the matching engine running inside the exchange — is a single-threaded process whose latency at 09:15 is dominated by the queueing tail of the prior 100,000 orders. The right rung for order-match is p99.99 (one match in 10,000), because the user-facing event is "did my order match before the price moved", and a match that takes 200 ms when the median match takes 5 ms is the difference between a profitable trade and a slipped one. Same company, same trading platform, two different services, two different right-rung answers. The pattern is: the rung is a function of (user-facing event, fan-out factor, tail bound), not a global stylistic preference.
The CRED rewards engine illustrates a third pattern. Rewards calculation for a single user typically touches 12 to 15 backend services (transaction history, user-tier lookup, partner-API calls, fraud check, geo-localisation). The user-facing event is "is my reward calculation done before the user closes the app", which the team measures at 4 seconds. With N = 13 fan-out, user p99 maps to backend p99.92, and user p99.9 maps to backend p99.992. The CRED 2024 dashboard shows backend p99.99 as the primary alarm rung, with a 250 ms threshold; when the threshold breaches, the ladder structure tells the on-call which of the 13 backends owns the slow rung (the ladder is reported per-backend with HdrHistogram-merged-per-backend). The architectural choice that follows is whether to hedge the slow backend with a second request or accept the rare slow path; the data the ladder produces is exactly what the architectural choice needs.
A subtler example: the IRCTC Tatkal-hour failure mode does not look like a fan-out problem at first — the booking-confirm API is a single call. But inside that call, the API queries a row-level lock on the seat inventory table, which under contention serializes all confirms behind the slowest competing transaction. The "fan-out" here is implicit — the user's latency is the maximum over a queue of contending transactions, not over an explicit set of backend calls. The right rung is still p99.9 or p99.99, for the same order-statistic reason. Single-call APIs whose internal contention path produces implicit fan-out are the most common case where the right rung is higher than the obvious "p99 because it's a single call" answer suggests. The check is mechanical: if the backend latency distribution under load is bimodal (which the Tatkal case manifestly is), the user-perceived rung must move up to whichever rung first lands inside the slow mode — usually p99.9 for a 0.5%-slow-mode service, p99 for a 5%-slow-mode service.
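The mechanical check condenses to a few lines. A sketch, assuming the standard rung ladder and a strict "lands inside the slow mode" convention:

```python
def first_rung_in_slow_mode(slow_fraction,
                            ladder=(50, 90, 95, 99, 99.9, 99.99)):
    """Lowest standard rung whose tail probability (100 - q)/100 falls
    strictly inside a slow mode carrying `slow_fraction` of requests."""
    for q in ladder:
        if (100 - q) / 100 < slow_fraction:
            return q
    return None  # slow mode too rare for any ladder rung to see

print(first_rung_in_slow_mode(0.005))  # 0.5% slow mode -> 99.9
print(first_rung_in_slow_mode(0.05))   # 5% slow mode   -> 99
```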
The general principle is that the right rung is set by the first rung that lands inside the operationally-significant tail population. If 0.1% of your users see a multi-second slow path, p99.9 is your rung; if 1% do, p99 is your rung; if 5% do, p95 is your rung and you have a much bigger problem to fix anyway. This is one of the few places in systems performance where the right answer is set by the failure mode and not by tradition; teams that build the rung-choice habit produce dashboards that catch their actual incidents, and teams that inherit a "p99 because everyone uses p99" tradition keep finding out about their failure modes from customer support tickets.
A useful diagnostic ritual that several Indian SRE teams have adopted: at every quarterly planning, walk the top ten services in the org through the three questions ("user-facing event? rung? does the dashboard show it?") and re-derive the SLO rung from scratch. The exercise takes about an afternoon; the typical outcome is that two or three of the ten services have the wrong rung, usually because the fan-out factor grew over the past year (a new microservice was added to the call graph) and the SLO never moved with it. The fix is mechanical — re-derive the rung, re-target the SLO threshold, re-wire the alarm — and pays back the next time one of those services has a tail incident.
Common confusions
- "p99 is the slowest 1% of requests." p99 is a single point on the CDF — the boundary between the fast 99% and the slow 1%. It is not a description of "the slow 1%"; it tells you only where the slow 1% begins. The shape of the distribution past p99 is described by p99.9 and p99.99, and two services with the same p99 can have wildly different p99.9 values. A senior on-call's first instinct on a "p99 looks fine" report is to ask for p99.9 next; that instinct is the difference between catching bimodal failure modes and missing them.
- "More 9s is always better." Deeper-tail rungs (more 9s) are noisier and slower-converging on the same sample size. p99 from a 5,000-sample window is accurate to within ~5%; p99.9 from the same window is accurate to within ~25%; p99.99 is ill-defined (you have only 0.5 expected samples in the rung). The sample size determines which rungs are statistically meaningful in real time; for offline analysis (24-hour windows merging fleet-scale data) you can go to p99.99 and p99.999 honestly, but for per-minute on-call dashboards you typically max out at p99.9. The "more 9s" instinct without sample-size discipline produces rungs that are pure noise and an on-call that cannot tell signal from sample-luck.
- "The arithmetic mean of per-pod p99s is the fleet p99." It is not. The fleet p99 is a property of the merged fleet histogram, which is generally higher than the mean of per-pod p99s when the fleet has a heterogeneous-load shape (some pods serve more slow requests than others), and lower when the fleet is uniformly loaded but per-pod sample sizes vary. The arithmetic-mean-of-p99s is a value with no direct operational meaning; the only correct fleet-percentile is computed from a bucket-wise-merged HdrHistogram. The shorthand "fleet p99" without specifying the merge method is one of the most common silent mistakes in observability platforms.
- "p99 over 5 minutes is roughly the average of p99 over each 1-minute window." It is not, for the same reason as the previous bullet. The 5-minute p99 is the 99th-percentile of the 5-minute pooled sample; the mean of 1-minute p99s is the average of 5 noisy single-minute p99 estimates. They differ on average by 10 to 30%, and the time-window aggregation is one of the most common sources of "the dashboard looks different at different zoom levels" confusion. The correct aggregation is histogram-merged across the 5 1-minute windows, then percentile-extracted.
- "p99.9 is just p99 with one more 9 of confidence." p99.9 is a different rung — it measures one in a thousand, not "p99 with more confidence". A p99 estimate has its own sampling error, but p99 and p99.9 measure structurally different parts of the distribution. Conflating them produces SLO designs where someone says "tighten the SLO from p99 < 200 ms to p99.9 < 200 ms" without realising the threshold cannot remain the same across rungs — p99.9 is typically 2 to 8 times worse than p99 even on a healthy service. The correct re-design is "tighten the SLO from p99 < 200 ms to p99.9 < 800 ms".
- "Once you have HdrHistogram, percentile choice doesn't matter." HdrHistogram lets you compute any percentile; it does not tell you which percentile to alert on. The choice is set by the user-facing event, the fan-out factor, and the tail bound — HdrHistogram is the substrate that makes the choice cheap to act on, not a substitute for the choice itself. Teams that adopt HdrHistogram and keep alerting on p99 by default are still mis-aligned with their actual user-facing failure modes; the substrate change is necessary but not sufficient.
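The window-aggregation confusion above is easy to demonstrate with synthetic bursty traffic. A sketch: five 1-minute windows, one of which carries a burst of slow requests; exact numbers vary with the seed, but the two aggregations always diverge:

```python
import numpy as np

rng = np.random.default_rng(1)
# Five 1-minute windows; window 2 carries a burst of ~2 s slow requests.
windows = [rng.lognormal(np.log(50), 0.4, 6_000) for _ in range(5)]
windows[2] = np.append(windows[2], rng.lognormal(np.log(2_000), 0.3, 300))

per_minute_p99 = [np.percentile(w, 99) for w in windows]
mean_of_p99s = np.mean(per_minute_p99)                  # the tempting shortcut
pooled_p99 = np.percentile(np.concatenate(windows), 99)  # the correct 5-minute p99

print(f"mean of 1-minute p99s: {mean_of_p99s:7.1f} ms")
print(f"pooled 5-minute p99:   {pooled_p99:7.1f} ms")
```

The shortcut averages one burst-inflated window against four calm ones and lands on a number that is neither the burst's p99 nor the pooled p99; only the pooled (or histogram-merged) value answers "what did the 99th-percentile request in these five minutes see?".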
Going deeper
The relationship between rung and sample size — how many samples do you need
The sampling error of the q-th percentile from n samples is approximately sqrt(q(1-q) / n) / f(p_q), where f(p_q) is the local probability density at the percentile. The numerator dominates for high-q rungs: sqrt(0.99 × 0.01 / n) = sqrt(0.0099 / n), while sqrt(0.999 × 0.001 / n) = sqrt(0.001 / n) — the absolute error of p99.9 is about 1/3 the absolute error of p99 when expressed in CDF-units. But the latency-units error depends on 1/f(p_q), the inverse of the density at the rung, which on a heavy tail is much larger at p99.9 than at p99 because the tail is sparse. The net effect: p99.9 is roughly 5× noisier than p99 in latency units, on the same sample size and a typical heavy tail.
The practical implication: per-minute p99.9 estimates need at least 50,000 samples to be tight enough to alert on. At a Razorpay payments pod's 30,000 req/s, that is two seconds of data per pod. At a smaller service's 100 req/s, that is eight minutes of data, which means per-minute p99.9 dashboards on small services are pure noise and cannot be used for alerting. The fix is to aggregate to 5-minute or 15-minute windows for small services, accepting the latency in alarm time as the cost of statistical validity. The Hotstar reliability team uses a 1-minute p99 alarm and a 15-minute p99.9 alarm by default, with 1-minute p99.9 reserved for services with > 5,000 req/s; the rule is encoded in a small library their dashboard tooling consumes when generating panels.
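The noise gap between rungs can be measured directly by simulation. A sketch, where "spread" is the coefficient of variation of the per-window estimate and the log-normal shape is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def rel_spread(q, window=5_000, trials=1_000):
    """Coefficient of variation (std/mean) of the q-th percentile estimated
    from repeated independent windows of `window` samples each."""
    est = [np.percentile(rng.lognormal(np.log(50), 0.6, window), q)
           for _ in range(trials)]
    return np.std(est) / np.mean(est)

s99, s999 = rel_spread(99), rel_spread(99.9)
print(f"p99   per-window spread: {s99:.1%}")
print(f"p99.9 per-window spread: {s999:.1%}")  # several times noisier
```

A 5,000-sample window has ~50 samples past p99 but only ~5 past p99.9, and the simulated spreads reflect exactly that: the p99.9 estimate is several times noisier on the same data.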
Forensic vs alerting percentiles — two different jobs
There are two operational modes for percentile reporting and they impose different constraints. The alerting mode wants a percentile that is statistically tight on a 1- to 5-minute window so the alarm fires reliably and does not false-alarm on sample-luck. The forensic mode wants a percentile that is computed over a long window (an hour, a day, the duration of an incident) and is used after the fact to characterise the failure mode — it can include p99.99 and p99.999, because the long window provides enough samples. Mixing the two on a single dashboard is the failure mode: a panel with p99.999 visible to the on-call at 3 AM produces noise-driven 30-minute investigations of false alarms, while a post-incident review without p99.99 misses the rare failure modes that justify the next quarter's reliability work.
The cleanest pattern is two-dashboard separation. The alerting dashboard shows p50 and one tail rung (chosen by the rung-choice rules above) at high temporal resolution (1-minute); the forensic dashboard shows the full ladder including p99.99 at lower resolution (15-minute or hourly aggregates). The on-call uses the first; the post-incident reviewer uses the second. Both are computed from the same HdrHistogram source; the difference is the windowing and the rung selection. The Razorpay reliability team adopted this two-dashboard pattern in early 2024 and reported a 60% drop in false-positive pages (alerting dashboard simplification) and a 40% improvement in incident-review time-to-cause (forensic dashboard expansion). The pattern is now standard at most Indian fintechs that have invested in observability since 2023.
Why p99 of p99 is not p99.99 (the percentile-of-percentiles trap)
A common shortcut: aggregate percentiles by taking the percentile of the per-pod percentiles. "Fleet p99 is the p99 of all the pod p99s." This is wrong, and the failure mode is operationally serious. The fleet p99 from a bucket-wise merged HdrHistogram is the correct value; the p99 of pod-p99s is a different number that has no direct interpretation. For a fleet of identical pods serving identical traffic, the two coincide; for any heterogeneous fleet they diverge, sometimes by a factor of two. A fleet of 100 pods where 99 pods have p99 = 100 ms and one pod has p99 = 500 ms produces a "p99 of pod-p99s" of 100 ms (the value at the 99th-percentile rank position); the merged-histogram fleet p99 is closer to 110 ms (because the slow pod contributes 1% of the fleet's tail mass). The shortcut is silent about the slow pod; the merged-histogram answer surfaces it.
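A sketch of that 100-pod fleet with synthetic log-normals (the exact gap depends on the shapes chosen for "healthy" and "slow", but the direction does not):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000  # requests per pod

# 99 healthy pods (p99 near 100 ms) and one slow pod (p99 near 500 ms).
pods = [rng.lognormal(np.log(40), 0.4, n) for _ in range(99)]
pods.append(rng.lognormal(np.log(200), 0.4, n))

pod_p99s = [np.percentile(p, 99) for p in pods]
p99_of_p99s = np.percentile(pod_p99s, 99)              # the shortcut
merged_p99 = np.percentile(np.concatenate(pods), 99)   # the correct fleet p99

print(f"p99 of pod-p99s:  {p99_of_p99s:6.1f} ms")
print(f"merged fleet p99: {merged_p99:6.1f} ms")
```

The shortcut lands near the healthy pods' p99 because the slow pod occupies a single rank position in a 100-element list; the merged value is pulled up because the slow pod contributes a full 1% of the fleet's tail mass.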
The general rule is aggregate_percentile(merged) != percentile_of(per_unit_percentiles). The only correct aggregation is a bucket-wise sum of histograms; everything else is an approximation whose error depends on fleet heterogeneity. Observability platforms that report "fleet p99" without specifying the merge method are leaking ambiguity onto every dashboard; the discipline is to pin down the merge method in the platform documentation and audit the dashboards against it. Prometheus's histogram_quantile() function is correct on a properly bucketed Histogram type; aggregating Summary metrics across pods is incorrect because a Summary stores per-pod percentiles and there is no merge math that recovers the fleet percentile. Knowing which metric type your dashboard uses is the difference between a correct fleet rung and a silently-wrong one.
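The trap is easy to reproduce with synthetic data. The sketch below simulates a heterogeneous fleet with lognormal latencies (the pod counts and distribution parameters are illustrative, not taken from any fleet named in this chapter); concatenating raw samples stands in for the bucket-wise histogram merge, which it is equivalent to in the limit of fine buckets:

```python
import numpy as np

rng = np.random.default_rng(0)

# 99 fast pods (p99 near 100 ms) and 1 slow pod (p99 near 500 ms),
# 10,000 requests each -- hypothetical, illustrative traffic.
fast_pods = [rng.lognormal(mean=3.9, sigma=0.3, size=10_000) for _ in range(99)]
slow_pod = rng.lognormal(mean=5.5, sigma=0.3, size=10_000)
pods = fast_pods + [slow_pod]

# Wrong: percentile of the per-pod p99s -- structurally blind to the slow pod.
per_pod_p99 = [np.percentile(p, 99) for p in pods]
wrong = float(np.percentile(per_pod_p99, 99))

# Right: merge all samples (the raw-sample analogue of a bucket-wise
# histogram merge), then take the fleet percentile.
right = float(np.percentile(np.concatenate(pods), 99))

print(f'p99 of pod-p99s: {wrong:.0f} ms   merged fleet p99: {right:.0f} ms')
```

Running it shows the merged fleet p99 sitting well above the percentile-of-percentiles number: the slow pod's tail mass moves the merged answer but barely touches the shortcut.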
The shape of the rung over time — trends and seasonality
A static percentile ladder is a snapshot; the operationally interesting view is the time-series of each rung over the past hour, day, and week. The shape of the time-series is itself diagnostic. A rung that drifts smoothly upward over a week is usually a memory leak or a cache-warmup regression; a rung that spikes daily at 10:00 IST is the Tatkal hour or market open; a rung that spikes weekly on Sunday evening is the cron-job-vs-traffic interaction. Different rungs reveal different failure modes: the p50 trend over a week catches the bulk drift (a bloated request handler, a slow third-party dependency), while the p99.9 trend catches the tail drift (a slow replica that wasn't there last week, a GC tuning regression). Reading both trends side by side — not just the absolute values — is the discipline that catches slow regressions before they become incidents.
The p99 / p50 ratio over time is a particularly clean signal because it strips out distribution shifts that move the bulk and the tail together (a uniform slowdown from a downstream-dependency degradation, say). A rising ratio over a week with stable p50 says "the tail is getting heavier without the bulk changing", which is the signature of a slowly emerging bimodal failure mode — some fraction of requests is starting to fall into a slow path that wasn't there before. The PhonePe reliability team monitors p99 / p50 weekly per service and triggers a code-review investigation on any service whose ratio moved by more than 20% in a week without a corresponding deploy; the investigation finds a real regression about 60% of the time and a benign cause (traffic-mix shift, an upstream rollout) the other 40%. The signal is cheap, the false-positive rate is tolerable, and the catches are usually slow regressions that would have produced an incident two to three weeks later.
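The ratio check is a few lines of arithmetic. A minimal sketch, with made-up numbers (the 20% trigger follows the text; the function name and example values are illustrative, not any team's actual tooling):

```python
def ratio_drift(p50_then: float, p99_then: float,
                p50_now: float, p99_now: float) -> float:
    """Relative change in the p99/p50 ratio between two windows."""
    then = p99_then / p50_then
    now = p99_now / p50_now
    return (now - then) / then

# Stable bulk, heavier tail: p50 unchanged at 40 ms, p99 up 30%.
# The ratio moves by exactly the tail's relative change.
drift = ratio_drift(p50_then=40, p99_then=180, p50_now=40, p99_now=234)
print(f'p99/p50 ratio moved {drift:+.0%}')  # prints "p99/p50 ratio moved +30%"
```

A uniform slowdown — both p50 and p99 scaled by the same factor — leaves the ratio flat, which is exactly why it strips out the distribution shifts the text describes.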
Why the rung trends differ in shape: each rung is dominated by a different population. The p50 is dominated by the modal hump, which moves smoothly with average dependency latency. The p99 is dominated by the upper-bulk-and-near-tail boundary, which moves when the tail starts to grow but the slow mode hasn't fully separated. The p99.9 is dominated by the slow-mode mass, which moves when the slow mode appears, disappears, or shifts in mass-fraction. Watching the three rungs as parallel time-series lets you see which population is responsible for a movement, which is exactly the diagnostic information you want during an incident triage call.
When p99 is the right rung and p99.9 would be wrong
The reverse mistake also exists, even if it is rarer: choosing too high a rung produces an SLO whose error budget is tight enough that random sample-luck breaches it. A small service serving 200 req/s — a typical internal admin API at any of the companies named earlier — produces about 12,000 samples per minute. The expected number of samples in the p99.9 rung per minute is 12; the standard error of the per-minute p99.9 estimate is large enough that a 30% breach is within sampling noise. Setting an SLO at p99.9 for that service guarantees random pages every few hours, which trains the on-call to ignore the alarm and produces a worse outcome than no alarm at all.
The check is mechanical: the rung you alert on must have at least a few hundred expected samples per alarm window, ideally a few thousand. For a 200 req/s service with a 1-minute alarm window, p99 has 120 expected samples and is alertable; p99.9 has 12 and is not. The fix for low-throughput services is to widen the alarm window (15 minutes gives 180 p99.9 samples, alertable but laggier) or to drop the rung (alert on p99 and use p99.9 only on forensic dashboards). The Hotstar admin-API tier uses a 15-minute p99.9 alarm; the player-CDN tier uses a 1-minute p99.9 alarm; the rule is sample-size derived, not preference-derived. Picking a rung that is statistically meaningful at your throughput is part of the rung-choice discipline this chapter has been about.
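The mechanical check can be written down directly. The sketch below uses a cutoff of 100 expected tail samples per window as an illustrative stand-in for "at least a few hundred, ideally a few thousand"; it reproduces the verdicts in the text for a 200 req/s service:

```python
def expected_tail_samples(req_per_s: float, rung: float, window_s: float) -> float:
    # Fraction of requests that land beyond the rung (p99.9 -> 0.001),
    # times the number of requests in the alarm window.
    return req_per_s * window_s * (1 - rung / 100)

for rung in (99, 99.9, 99.99):
    for window_s in (60, 900):
        n = expected_tail_samples(200, rung, window_s)
        verdict = 'alertable' if n >= 100 else 'too noisy'
        print(f'p{rung:<5} window={window_s:>3}s -> {n:8.1f} tail samples ({verdict})')
```

For 200 req/s this prints p99 alertable on a 1-minute window (120 tail samples), p99.9 alertable only on the 15-minute window (180 vs 12), and p99.99 too noisy on both — the sample-size-derived rule the Hotstar example illustrates.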
Reproduce this on your laptop
# Pure Python + numpy + hdrh, no kernel access required.
python3 -m venv .venv && source .venv/bin/activate
pip install numpy hdrh
# Run the per-pod and fleet ladder
python3 percentile_ladder.py
# Sweep the fan-out math directly
python3 - <<'PY'
import math
for N in [1, 5, 10, 30, 100, 300, 1000]:
    for q in [50, 90, 99, 99.9]:
        # User-perceived percentile q at fan-out N requires every one of the
        # N parallel calls to beat the backend percentile (q/100)**(1/N) * 100.
        backend_q = (q / 100) ** (1 / N) * 100
        # "Nines" of the backend rung: p99 -> 2, p99.9 -> 3, and so on.
        nines = -math.log10(1 - backend_q / 100)
        print(f'N={N:>4} user_p{q:>5} -> backend_p{backend_q:.4f} ({nines:.2f} nines)')
    print()
PY
# Try a wider fan-out: see how user p99 maps to backend p99.99 at N=100
# and backend p99.999 at N=1000. The exponent grows; the rung climbs.
You will see that the user p50 for a 100-call fan-out maps to backend p99.31 — the typical user feels the backend's p99-and-above tail. This is the fan-out math behind every Hotstar / Google / Flipkart catalogue p99-on-the-backend SLO debate.
Where this leads next
The argument so far is statistical and architectural: percentiles are the right summary, each rung answers a specific question, and fan-out moves the rung at which the user-facing event lives. The next chapter, The tail at scale (Dean & Barroso), turns the order-statistic identity into a production playbook — hedging, tied requests, replica selection, and the mitigation strategies large fan-out services use to keep the user-perceived rung from compounding catastrophically. From there, Coordinated omission and HdrHistograms confronts the most common way teams report a percentile that is silently wrong — their load generator pauses on slow responses and never records the worst samples in the first place, biasing the entire ladder downward by a factor that sometimes exceeds 10×. The percentile ladder is the right summary; coordinated omission is the most common way the ladder lies anyway.
The single habit to take from this chapter: when someone shows you a latency dashboard, ask three questions in order. What is the user-facing event? (Single call, fan-out, batch.) What rung corresponds to that event? (p99 for single call with bounded tail, p99.9 for fan-out up to ~30, p99.99 for fan-out up to ~300, forensic-only past that.) Is the dashboard showing that rung? Half the dashboards in production answer "no" to question three; the rung the dashboard shows is a tradition, not a derivation from the user-facing event. Asking the three questions in sequence catches the mismatch in 30 seconds; rebuilding the dashboard takes a sprint and pays back the investment in the first incident the new rung catches.
A second habit: when designing an SLO, write the SLO as a triple of (rung, threshold, error budget) where the rung is derived from the user-facing event, not chosen by tradition. "p99 < 200 ms with 0.01% error budget" is the lazy form; "user-perceived p99 < 200 ms, which for our 30-call fan-out maps to backend p99.97, threshold = 80 ms, 0.001% error budget" is the derived form. The lazy form lets the dashboard hide failure modes the derived form catches. The CRED reliability handbook mandates the derived form as of late 2024; the Razorpay handbook adopted the same convention in early 2025.
A third habit, smaller but useful: when writing a postmortem, attach the per-rung time-series of (p50, p99, p99.9, p99.99) for the duration of the incident, and annotate which rung first crossed its threshold and which rung first returned to baseline. The shape of the rung-by-rung trajectory tells the reviewer which population was affected and for how long — information that a single mean or single-rung trace cannot convey. The cost of attaching the four-line trace is one query; the payoff is a postmortem whose conclusions are derivable from the data the reviewer can see.
A fourth habit, organisational: in the on-call runbook for any service, list the rung the alarm fires on, the rationale (user-facing event + fan-out factor), and the threshold derivation. New on-call engineers should be able to read the runbook and explain why the alarm is at that rung in 30 seconds; the explanation should not be "because we always have". Runbooks that leave the rung-choice opaque are runbooks the next on-call rotation will inherit and propagate. Documenting the derivation forces the team to re-examine the rung whenever the user-facing event changes — for example when a new microservice is added to the call graph, growing the fan-out factor — which is the moment the rung most often needs to move.
The deeper habit, repeated from the previous chapter and extended here: summarise distributions with quantiles, but choose the quantile by the user-facing event. The ladder is the substrate; the rung-choice is the discipline. Together they produce dashboards that actually fire on the failure modes your users feel, and SLOs that bound the experience your users actually have. The chapter after the next will pin down the measurement-side discipline (coordinated omission) that ensures the ladder you read is the real ladder; without that discipline, even the right rung is silent about the right failure mode.
References
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that established percentile ladders and HdrHistogram as the de facto standard for latency reporting.
- HdrHistogram project page — the sub-bucketed histogram library; the canonical source for the bucket layout and merge math this chapter uses.
- Jeff Dean & Luiz Barroso, "The Tail at Scale" (CACM 2013) — the foundational paper for the fan-out amplification math; the order-statistic identity in this chapter is the math behind their argument.
- Brendan Gregg, Systems Performance (2nd ed., 2020), Chapter 2 — the canonical text's treatment of latency statistics and the rung-choice discipline this chapter codifies.
- Tyler Treat, "Everything You Know About Latency Is Wrong" — the classic blog post on the percentile-of-percentiles trap and why per-pod-p99-aggregation is structurally wrong.
- Heinrich Hartmann, "Statistics for Engineers" (ACM Queue, 2016) — rigorous treatment of histogram-based percentile estimation, sample-size rules, and the variance-of-percentile derivation cited in §Going-deeper.
- /wiki/why-averages-lie — the previous chapter, which made the statistical case against the mean and motivated the percentile ladder.
- /wiki/coordinated-omission-and-hdr-histograms — the next-but-one chapter, which addresses the measurement-side discipline that ensures the percentile ladder you compute is the real ladder.