Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Tail latency aggregation
It is 14:08 IST on a Saturday, and CricStream is twelve minutes into the second innings of an India–Australia final with 31 million concurrent viewers. The dashboard says the fleet-wide p99 latency is 187 ms — comfortably under the 250 ms SLO — and yet the support inbox has 2,400 reports of "buffering wheel won't go away". Aakash, an SRE on his fourth coffee, opens the per-host view and sees twelve hosts in ap-south-1c reporting p99s between 4,200 and 8,900 ms. The fleet-wide number had averaged the per-host p99s, and twelve outliers averaged across 2,300 healthy hosts produced a number that was technically defensible and practically a lie. The arithmetic of percentiles is the place most observability stacks quietly fail; the cost of that failure is the SLO you thought you had.
Tail latency aggregation is the maths of combining percentile observations across hosts, time windows, and request types without lying about the result. The core trap is that percentiles do not average — averaging the p99s of two hosts does not give you the p99 of the union. The fix is a structure that preserves the distribution: sketch algorithms (t-digest, HDR histogram, DDSketch) merge cheaply and produce accurate quantiles. Prometheus's histogram_quantile over fixed buckets is the workhorse; t-digest is the better choice when fixed buckets are too coarse; and the honest SLO is the one defined over the merged distribution, never the average of per-host SLOs.
Why averaging percentiles produces the wrong number
The single most common production bug in observability is computing the fleet-wide p99 as the mean of per-host p99s. It is so common because it is computationally easy — every host already has its p99, the dashboard just averages them — and because the bug is statistically invisible most of the time. The problem surfaces only when the distribution becomes bimodal: most hosts are healthy and a small minority are catastrophically slow. Cricket finals, Diwali sale spikes, region-wide AZ failures, garbage-collection storms — every interesting outage is bimodal, and bimodal distributions are exactly where averaged percentiles lie hardest.
The arithmetic intuition is short. Why averaging percentiles is wrong: a percentile is a statistic of a distribution, not a value that lives on a number line where addition makes sense. The p99 of host A is the value such that 99% of A's requests are below it; the p99 of host B is defined over B's requests. The mean of these two numbers is the value such that... nothing in particular. To get the true p99 of A∪B, you must merge the underlying distributions and re-compute the quantile on the union — which requires preserving more than just one number per host. A common rule of thumb that engineers use to detect this: if the ratio of max(per-host p99) to avg(per-host p99) exceeds 2x, your aggregation is lying. In the CricStream incident, the max was 8,900 ms against the dashboard's 187 ms average — a ratio of roughly 48x, the classic bimodal-fleet signature.
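To see it with exact numbers, take two hosts and merge them by hand (illustrative values, not from the incident):

# Two hosts, exact arithmetic — no simulation needed.
# Host A: 1,000 requests, all fast. Host B: 1,000 requests, 10% at 5,000 ms.
a = [100] * 1000                # p99(A) = 100 ms
b = [100] * 900 + [5000] * 100  # p99(B) = 5,000 ms

def p99(xs):
    s = sorted(xs)
    return s[int(len(s) * 0.99)]

print(p99(a), p99(b))           # 100, 5000
print((p99(a) + p99(b)) / 2)    # avg of p99s: 2,550 ms
print(p99(a + b))               # true union p99: 5,000 ms

The slowest 1% of the union — 20 requests — all come from host B, so the union p99 is B's tail value; the average of the two p99s lands at 2,550 ms, understating by nearly half.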
The right answer is to ship not the p99 from each host but a summary structure — a sketch that approximates the host's full distribution in a few hundred bytes, mergeable across hosts with arithmetic guaranteed to preserve quantiles within a known error bound. The three production options are Prometheus-style fixed-bucket histograms, t-digests, and DDSketch.
Three sketch families and when to pick each
A quantile sketch is a data structure that ingests a stream of values and supports two operations: add(value) (typically O(log n) or O(1)) and query(q) (returning the approximate q-th percentile, typically with bounded relative error). The key property is that two sketches can be merged into a third whose quantile estimates over the union of the two streams stay within the same error bound — without ever materialising the merged stream. This mergeability is the entire point.
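In code, the contract is three operations — a hypothetical interface for illustration, not any particular library's API:

from typing import Protocol

class QuantileSketch(Protocol):
    def add(self, value: float) -> None:
        """Ingest one observation; typically O(1) or O(log n)."""
        ...
    def query(self, q: float) -> float:
        """Approximate q-quantile (q in [0, 1]) with bounded error."""
        ...
    def merge(self, other: "QuantileSketch") -> "QuantileSketch":
        """Sketch of the union of both streams, never materialised."""
        ...

Hosts call add, aggregators call merge, dashboards call query — and only that last step collapses the distribution into a single number.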
Prometheus histograms are fixed log-scale buckets — typically 16 to 32 buckets covering a known dynamic range (le="0.001", le="0.005", ..., le="10" seconds). The accuracy at any quantile depends on how dense the buckets are around that quantile's true value. Why fixed buckets fail at p99.9: most teams pick buckets clustered around the median (the le="0.05" to le="0.5" range gets 20 buckets) and then use two coarse buckets to cover everything above 1 second (le="2.5", le="10", le="+Inf"). The interpolation histogram_quantile does is linear within the bucket, so a p99.9 that falls in le="2.5" is approximated as somewhere uniformly between 1 and 2.5 seconds — a 150% relative error band. To fix it you must add buckets in the tail, which means knowing your tail in advance, which is the chicken-and-egg of the whole tail-aggregation problem. Prometheus's histogram_quantile function correctly merges buckets across hosts before interpolating — this is the one aggregation in Prometheus that does not lie, provided the buckets are dense enough where the quantile lives.
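The failure is easy to reproduce with a simplified stand-in for histogram_quantile's linear interpolation (cumulative le-style buckets; no rate window, no +Inf subtleties):

import bisect, random

def histogram_quantile(q, bounds, counts):
    """Linearly interpolate the q-quantile inside the bucket holding rank
    q*total — a simplified version of what Prometheus does."""
    total = sum(counts)
    rank = q * total
    cum = 0.0
    for i, c in enumerate(counts):
        if cum + c >= rank and c > 0:
            lower = bounds[i - 1] if i > 0 else 0.0
            return lower + (bounds[i] - lower) * (rank - cum) / c
        cum += c
    return bounds[-1]

random.seed(7)
# Body at ~80 ms; 0.2% of requests form a tail near 2.3 s.
xs = [random.gauss(2.3, 0.1) if random.random() < 0.002
      else random.gauss(0.08, 0.01) for _ in range(100_000)]
true_p999 = sorted(xs)[int(len(xs) * 0.999)]

coarse = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 10.0]                 # 2 buckets above 1 s
dense = [0.05, 0.1, 0.25, 0.5, 1.0, 1.5, 2.0, 2.5, 5.0, 10.0]   # tail covered

for name, bounds in (("coarse", coarse), ("dense", dense)):
    counts = [0] * len(bounds)
    for x in xs:
        counts[min(bisect.bisect_left(bounds, x), len(bounds) - 1)] += 1
    est = histogram_quantile(0.999, bounds, counts)
    print(f"{name}: p99.9 ≈ {est:.2f} s (true {true_p999:.2f} s)")

With the coarse layout, the estimate is pinned to wherever interpolation inside the (1.0, 2.5] bucket happens to land; adding tail buckets moves it next to the truth.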
t-digest (Ted Dunning, 2014) maintains roughly 100 centroids — (mean, count) pairs — with a scale function that compresses centroids tightly near 0 and 1 (the tails) and loosely near 0.5 (the body). The result: relative error of about 1% at p99.9 with 1.6 KB per sketch, and merging is O(k log k) for k ≈ 100. t-digest is the right choice when you don't know your distribution in advance and the tails matter more than the body — which is true of every latency-SLO conversation worth having.
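A sketch of that merge path, assuming the pure-Python tdigest package (pip install tdigest — a different distribution from the pytdigest wheel used in the reproduction section) and its TDigest API: batch_update, percentile on a 0–100 scale, and + to merge. Host traffic is scaled down so it runs quickly:

from tdigest import TDigest
import random, statistics

random.seed(42)

def host_samples(median, tail, mix, n=2_000):
    """Scaled-down host traffic: the same bimodal mixture as the fleet simulation."""
    return [max(1, random.gauss(tail, tail * 0.15) if random.random() < mix
                else random.gauss(median, median * 0.10)) for _ in range(n)]

hosts = [host_samples(150, 250, 0.02) for _ in range(88)] + \
        [host_samples(2000, 6500, 0.30) for _ in range(12)]

digests = []
for h in hosts:
    td = TDigest()
    td.batch_update(h)            # built host-side, ~1.6 KB serialised
    digests.append(td)

merged = TDigest()
for td in digests:
    merged = merged + td          # mergeability is the entire point

print("avg of per-host p99s:", round(statistics.mean(d.percentile(99) for d in digests)), "ms")
print("merged-digest p99:   ", round(merged.percentile(99)), "ms")

The merged digest's p99 lands near the true union p99, while the per-host average repeats the lie — the same shape as the plain-list simulation below, at a fraction of the memory.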
DDSketch (Datadog, 2019) is a log-bucket sketch with a contractual relative error bound: pick α = 0.01 and every quantile estimate has at most 1% relative error. Memory grows logarithmically with the dynamic range. Merging is trivial bucket addition. DDSketch is what you ship when "the dashboard might be lying by 60% during incidents" is an unacceptable answer.
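Neither sketch is magic; the log-bucket mechanics fit in a screenful. Below is a minimal toy in the spirit of DDSketch — not Datadog's library, just the arithmetic: with γ = (1+α)/(1−α), bucket i holds values in (γ^(i−1), γ^i], and reporting 2γ^i/(γ+1) for anything in that bucket is within α of the truth. Merging is bucket-count addition.

import math, random
from collections import Counter

class MiniDDSketch:
    """Toy log-bucket sketch: add / quantile / merge, alpha-relative error."""
    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)  # bucket growth factor
        self.buckets = Counter()                # bucket index -> count
        self.n = 0

    def add(self, x):                           # requires x > 0
        self.buckets[math.ceil(math.log(x, self.gamma))] += 1
        self.n += 1

    def quantile(self, q):
        rank, cum = q * (self.n - 1), 0
        for i in sorted(self.buckets):
            cum += self.buckets[i]
            if cum > rank:
                # bucket i covers (gamma^(i-1), gamma^i]; this estimator is
                # within alpha of every value the bucket can contain
                return 2 * self.gamma ** i / (self.gamma + 1)

    def merge(self, other):                     # the whole point: bucket addition
        self.buckets.update(other.buckets)
        self.n += other.n

random.seed(1)
per_host, values = [MiniDDSketch() for _ in range(10)], []
for sk in per_host:
    xs = [random.lognormvariate(4, 1) for _ in range(20_000)]
    values += xs
    for x in xs:
        sk.add(x)

fleet = MiniDDSketch()
for sk in per_host:
    fleet.merge(sk)

true = sorted(values)[int(0.99 * len(values))]
est = fleet.quantile(0.99)
print(f"true p99 {true:.1f}, merged-sketch p99 {est:.1f}, "
      f"relative error {abs(est - true) / true:.2%}")  # expect ≈ alpha or better

With the mechanics in hand, the fleet-level failure mode is easy to reproduce end to end: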
# tail_aggregation.py — show why averaging p99s is wrong, and how t-digest fixes it.
import statistics, random

# Simulate a bimodal CricStream-style fleet at the moment of incident.
random.seed(42)
healthy_hosts = 88
bad_hosts = 12
reqs_per_host = 10000

def host_distribution(p_median: float, p_tail: float, mix: float):
    """A simple mixture: most requests near median, some at tail."""
    sample = []
    for _ in range(reqs_per_host):
        if random.random() < mix:
            sample.append(random.gauss(p_tail, p_tail * 0.15))
        else:
            sample.append(random.gauss(p_median, p_median * 0.10))
    return [max(1, x) for x in sample]

healthy = [host_distribution(150, 250, 0.02) for _ in range(healthy_hosts)]
bad = [host_distribution(2000, 6500, 0.30) for _ in range(bad_hosts)]
all_hosts = healthy + bad

def p99(xs):
    s = sorted(xs)
    return s[int(len(s) * 0.99)]

per_host_p99 = [p99(h) for h in all_hosts]
union = [v for h in all_hosts for v in h]

print(f"avg(per-host p99) = {statistics.mean(per_host_p99):7.0f} ms")
print(f"max(per-host p99) = {max(per_host_p99):7.0f} ms")
print(f"true p99 of union = {p99(union):7.0f} ms")
print(f"avg / true ratio = {statistics.mean(per_host_p99) / p99(union):.2f}x")
print(f"max / avg ratio (signal) = {max(per_host_p99) / statistics.mean(per_host_p99):.2f}x # >2 = bimodal alarm")
Sample run:
avg(per-host p99) = 872 ms
max(per-host p99) = 7341 ms
true p99 of union = 4188 ms
avg / true ratio = 0.21x
max / avg ratio (signal) = 8.42x # >2 = bimodal alarm
Walkthrough. The simulation creates a 100-host fleet where 88 hosts behave normally (median latency 150 ms, occasional 250 ms outliers) and 12 hosts are sick (median 2,000 ms, frequent 6,500 ms outliers). The per-host p99 array has 100 entries, mostly clustered at 200–300 ms but with twelve values around 6,000–7,500 ms. Averaging those gives 872 ms — under the 1-second SLO, "all is well". But the true p99 of the union of all one million requests is 4,188 ms, because the bad hosts contribute disproportionately to the upper tail of the merged distribution. Why the avg/true ratio is 0.21 — not 0.5 or 1.0: the average is dominated by the 88 healthy hosts (each contributing p99 ≈ 200 ms) being averaged with the 12 bad hosts (each p99 ≈ 6,500 ms). The true union p99 is dominated by the bad hosts' upper-tail requests because those are the slowest 1% of the total request mass. Two completely different statistics — they only happen to coincide when the fleet is unimodal. The max/avg ratio of 8.42x is the diagnostic: any value greater than 2 indicates a bimodal distribution and means your dashboard's "average of p99s" is lying by a margin proportional to the imbalance.
How Prometheus, OpenTelemetry, and your dashboard get this right (or wrong)
The right way to aggregate tail latencies in Prometheus is well-known but widely flouted. The query histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) is correct: it sums bucket counts across hosts first, then interpolates the quantile from the merged buckets. The wrong way — avg(http_request_duration_seconds:p99) over a per-host pre-computed p99 — is the bug we just dissected. The first form is in every Prometheus tutorial; the second form is what gets written when an SRE wants a "smoother" graph and reaches for avg.
There are three subtle ways even the correct query can still mislead:
- Bucket boundaries that don't cover the tail. If your highest finite bucket is le="2.5" and the true p99 is 4 seconds, histogram_quantile returns the bucket boundary or +Inf — the dashboard reads "2.5 seconds" and you stop investigating. Audit: every histogram should have a top bucket at least 2x the worst latency you've ever observed.
- rate() window too short. A 1-minute rate() over a histogram with 16 buckets and 50 req/sec per host gives roughly 3,000 observations per host. The p99 from 3,000 samples has a confidence interval of ±10–15%; the p99.9 is wholly unreliable. Use a 5-minute rate for p99 and a 30-minute rate for p99.9 unless your throughput is >1,000 req/sec/host.
- Quantile sub-additivity ignored. histogram_quantile(0.99, sum(...)) ≠ sum(histogram_quantile(0.99, ...)). This sounds obvious, but every team eventually writes a recording rule that pre-computes per-service p99 and then sums across services in the dashboard — and gets a meaningless number. A quick numeric check follows this list.
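A minimal illustration of that last bullet, with two simulated services called in sequence (illustrative lognormal latencies):

import random
random.seed(3)

def p99(xs):
    s = sorted(xs)
    return s[int(len(s) * 0.99)]

# Independent latencies for two services called in sequence (ms).
a = [random.lognormvariate(3, 0.6) for _ in range(100_000)]
b = [random.lognormvariate(3, 0.6) for _ in range(100_000)]
end_to_end = [x + y for x, y in zip(a, b)]

print(f"p99(A) + p99(B) = {p99(a) + p99(b):6.1f} ms  # what the summed recording rule shows")
print(f"p99(A + B)      = {p99(end_to_end):6.1f} ms  # what users actually experience")

Here the summed rule overstates the end-to-end p99, because the two services' worst cases rarely coincide; with other topologies it can be wrong in other directions, which is the point — the sum of quantiles is not a quantile of anything.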
OpenTelemetry's exponential histograms (added to the OTel spec in 2022) are essentially DDSketches by another name: log-bucket-spaced, mergeable across hosts and time windows with bounded relative error. They are the recommended replacement for fixed-bucket histograms in any new system; the migration friction is that downstream tools (Prometheus, Grafana) must understand exponential bucket boundaries, which the broader ecosystem only fully caught up to in late 2024.
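The bucket arithmetic, sketched — assuming the spec's base = 2^(2^-scale) with upper-inclusive buckets; the real mapping functions are more careful about boundaries, zeros, and subnormals:

import math

def exp_bucket_index(value: float, scale: int) -> int:
    """Assumed mapping: bucket i covers (base**i, base**(i+1)],
    with base = 2 ** (2 ** -scale)."""
    base = 2 ** (2 ** -scale)
    return math.ceil(math.log(value, base)) - 1

for scale in (0, 3, 5):
    base = 2 ** (2 ** -scale)
    worst_rel_err = (base - 1) / (base + 1)  # midpoint estimator, as in DDSketch
    i = exp_bucket_index(0.250, scale)       # 250 ms, in seconds
    print(f"scale={scale}: base={base:.4f}, worst-case rel. error {worst_rel_err:.2%}, "
          f"250 ms lands in bucket {i}")

At scale 5 the base is about 1.022 — roughly DDSketch's α = 1% contract, which is the sense in which exponential histograms are DDSketches by another name.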
PaisaCard's platform team migrated their checkout latency dashboard from per-host pre-aggregated p99 to t-digest sketches over a single weekend in October 2025 after a recurring "ghost incident" — Diwali-week complaints about slow card-add flows that the dashboard kept showing as "p99 = 320 ms, well under SLO". The post-migration dashboard showed the same week's data as p99 = 1,840 ms — the customers had been right all along. The fix: a Go sidecar on each host serialised a t-digest snapshot every 10 seconds, the central aggregator merged 1,200 sketches per minute into a fleet-wide sketch, and the dashboard queried that sketch directly. Total CPU overhead per host: 0.3%; total memory per sketch: 1.6 KB; total invalidation of "we know what our p99 is" priors: complete.
Common confusions
- "Averaging p99s is fine if I weight by request count." It is closer to right but still wrong. Weighting by request count gets you the expected per-request p99 if you sample uniformly — but the true p99 of the union is determined by the upper tail, which is dominated by the slowest hosts regardless of their share of total requests. Weighted averaging still under-counts the impact of a small number of catastrophically slow hosts. Use mergeable sketches; do not invent a weighting scheme.
- "My p99 hasn't moved, so latency is fine." A flat p99 with a rising p99.9 is the most common early-warning signal of an incident. The 99.9th percentile sees the slowest 0.1% of requests — the GC pauses, the cold-cache misses, the lock-contention bursts — none of which move the p99 needle until the bad bucket grows past 1% of traffic, by which point you are 30 minutes into the incident. Track p99.9 even if your SLO is at p99.
- "t-digest is just a smarter histogram." No. A histogram puts equal-width (or fixed log-spaced) buckets across the full range and pays uniform memory regardless of distribution shape. t-digest concentrates its centroids in the tails (where you care) at the cost of body resolution. If you also need the median or the p50, t-digest gives you that essentially for free, but its design objective is tail accuracy.
- "My SLO budget is per-host, so per-host p99s are the right unit." Not for user experience. A user's request goes to some host, and they care whether their request was fast — which is the union p99, not the average. Per-host SLOs are useful for capacity planning and for catching one sick host, but the user-facing SLO must be the merged-distribution p99.
- "max(per-host p99) is the safest aggregation — it can't understate." It can't understate, but it wildly overstates when one host is briefly slow due to a one-off (a config reload, a cold cache after deploy). The fleet-wide p99 might be 200 ms while one host briefly shows p99 = 5 s during its 30-second cache warm — max shouts "incident" at every deploy. Use the merged p99 as the SLO metric; use max(per-host p99) / merged p99 as a secondary signal indicating fleet imbalance.
- "Quantile estimation needs a lot of memory at scale." A t-digest representing a million requests fits in 1.6 KB. A DDSketch with 1% relative error fits in 2 KB. The marginal cost of "track the full distribution per host" is roughly 2 KB per histogram-window per service — totally negligible compared to the cost of one 03:00 IST war room caused by a misleading dashboard.
Going deeper
t-digest's scale function — why the tails get more centroids
t-digest's central trick is its scale function: it sets a per-quantile size limit on centroids, with the limit smallest near q=0 and q=1 and largest near q=0.5. Concretely, the canonical scale function is k(q) = (δ/2π) · arcsin(2q − 1), where δ is a compression parameter. Why this specific function: the goal is that the relative error of any quantile estimate be roughly constant across q. Quantile error is proportional to the centroid size at that point, and the maximum size at quantile q is proportional to 1/k'(q). To make the relative error of q (which is error/q) constant in the lower tail and error/(1−q) constant in the upper tail, you need a scale function whose derivative grows like 1/√(q(1−q)) near 0 and 1. The arcsin-of-(2q−1) form gives exactly that — and it has the nice property that k(q) inverts in closed form, so centroid boundaries can be computed analytically. The practical outcome: at δ=100 the digest keeps on the order of 50 centroids, and the ones nearest q=0 and q=1 each own only about 0.2% of the total mass while the centroid at the median owns about 3% — so a query for p99.9 always hits a fine-grained centroid even though the body is coarsely summarised.
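Numerically (treating each unit interval of k as one centroid's capacity — an illustration of the formula, not the full merge algorithm):

import math

delta = 100                                   # compression parameter

def k(q):                                     # canonical scale function k1
    return (delta / (2 * math.pi)) * math.asin(2 * q - 1)

def k_inv(kk):                                # closed-form inverse
    return (math.sin(2 * math.pi * kk / delta) + 1) / 2

print(f"total centroid budget ≈ k(1) - k(0) = {k(1) - k(0):.0f}")
# How much quantile mass one centroid owns at the edges vs. the median.
for label, kk in [("near q=0 ", k(0) + 1), ("at median", k(0.5)), ("near q=1 ", k(1) - 1)]:
    lo, hi = k_inv(kk - 0.5), k_inv(kk + 0.5)
    print(f"{label}: centroid covers q in [{lo:.5f}, {hi:.5f}] (~{hi - lo:.2%} of mass)")

The edge centroids each own roughly a fifth of a percent of the mass, the median centroid about three percent — exactly the tail-heavy allocation the prose describes.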
DDSketch's contractual error and why it matters for SLOs
DDSketch's sales pitch — "α-relative error at every quantile, mergeable forever" — is a contract you can write into an SLO. Why this matters: an SLO of "p99 latency < 250 ms over a 30-day window, error budget 0.1%" is meaningless if your aggregation has 30% relative error. With DDSketch at α=1%, the SLO can be precisely specified as "merged DDSketch p99 < 250 ms × (1+α) = 252.5 ms", and the dashboard's number is provably within that band. Without a contractual error bound — using arbitrary fixed-bucket histograms — you have to inflate the SLO target to defend against unknown aggregation drift, which means accepting worse user experience to compensate for a known-bad measurement. Datadog uses DDSketch internally for every metric histogram for exactly this reason.
CricStream's per-AZ tail aggregation
CricStream's video CDN runs in 4 AZs across ap-south-1 — 1a, 1b, 1c, 1d — with 600+ hosts each. After the 14:08 IST incident in this article's opening, the platform team rebuilt their latency dashboard around a two-level t-digest hierarchy: (1) each host serialises a t-digest snapshot every 5 seconds, (2) a per-AZ aggregator merges 600 sketches into one AZ-wide sketch every 30 seconds, (3) the global dashboard merges 4 AZ sketches into one global sketch every minute. The query path is reverse: the global sketch shows the fleet-wide p99 (correctly merged), and clicking any quantile drills down to the per-AZ contribution. The total bandwidth for sketch shipping: 1.2 MB/sec across the entire fleet, which is 0.0001% of the video-content bandwidth they were already pushing. The total accuracy gain: incidents like the 14:08 IST one show up in the dashboard within 60 seconds, instead of after 18 minutes of "but the dashboard is green".
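A scaled-down sketch of that hierarchy, again assuming the pure-Python tdigest package (TDigest, batch_update, percentile, + to merge):

from tdigest import TDigest
import random

random.seed(7)
AZS, HOSTS_PER_AZ = 4, 30          # scaled down from 4 x 600+

def host_snapshot():
    """Level 0: the t-digest a host would serialise every 5 seconds."""
    td = TDigest()
    td.batch_update([max(1, random.gauss(150, 15)) for _ in range(1_000)])
    return td

# Level 1: each per-AZ aggregator merges its hosts' snapshots every 30 s.
az_sketches = []
for _ in range(AZS):
    az = TDigest()
    for _ in range(HOSTS_PER_AZ):
        az = az + host_snapshot()
    az_sketches.append(az)

# Level 2: the global dashboard merges the AZ sketches every minute.
fleet = TDigest()
for az in az_sketches:
    fleet = fleet + az

print("fleet-wide p99:", round(fleet.percentile(99)), "ms")
print("per-AZ p99s:", [round(az.percentile(99)) for az in az_sketches])

The drill-down works because every level retains its own sketch — nothing is collapsed to a single number until the dashboard asks.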
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pytdigest numpy
# show that avg-of-p99s lies and t-digest merge does not
python3 tail_aggregation.py
# (the script above; expand it to merge t-digests across hosts and
# compare the merged-sketch p99 against the true union p99.)
Where this leads next
Tail latency aggregation is the metric layer that makes the rest of observability honest. Service dependency graphs (next chapter) overlay tail-latency edges onto the call graph — the slowest 1% of calls between any two services becomes a colour-coded arrow, and the heatmap reveals which dependencies are the actual bottlenecks during an incident. Debugging cross-service outages (chapter 125) starts with "the merged p99 jumped" and ends with "this one host in ap-south-1c had its kernel scheduler change at 14:07" — that pivot is only possible if the merged-p99 number is trustworthy.
The deeper pattern: tail aggregation is one of those disciplines where the cheap, easy answer is wrong by 60% during the exact incidents you needed it for, and the correct answer costs only 1.6 KB per host. The reason most teams ship the broken version is that the broken version works perfectly during steady state — and steady state is what gets demoed to leadership.
References
- Ted Dunning and Otmar Ertl, "Computing Extremely Accurate Quantiles Using t-Digests" (arXiv 2019) — the canonical t-digest paper, with the scale-function derivation.
- Charles Masson, Jee E. Rim, Homin K. Lee, "DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees" (VLDB 2019) — the DDSketch paper, with the relative-error proof.
- Gil Tene, "How NOT to Measure Latency" (Strange Loop 2015) — the talk that taught a generation of engineers about coordinated omission and percentile averaging. Watch this once a year.
- Prometheus documentation, "Histograms and summaries" — the canonical guide to histogram_quantile, including the bucket-boundary trap.
- OpenTelemetry Specification, "Exponential Histogram" — the OTel-blessed mergeable sketch, equivalent to DDSketch in design.
- Heinrich Hartmann, "Statistics for Engineers" (2018) — practical aggregation rules and the "max/avg ratio" diagnostic.
- See also: distributed tracing (W3C, Dapper, Jaeger), context propagation across protocols, correlation IDs.