Histograms: classic vs native (sparse)
It is 21:47 IST on Big Billion Day, three minutes after Flipkart's checkout-api p99 latency dashboard went flat. Asha is the on-call SRE; the panel is histogram_quantile(0.99, sum by (le) (rate(checkout_duration_seconds_bucket[5m]))) and it has been pinned at 0.1 for ninety seconds. Real latency is 380ms — she can see it in the trace explorer. The dashboard is lying because the histogram's +Inf bucket has a lower edge of 100ms, and histogram_quantile cannot interpolate inside an unbounded bucket; it returns that lower edge, capping any p99 above 100ms at exactly 100ms. The fix is not adding more buckets — that would multiply the cardinality 30× on a metric already pushing 4 million active series. The fix is native histograms: a single series per metric with bounded quantile error from microseconds to hours, storing the full distribution in roughly the same bytes as one classic bucket. This chapter is the engineering of that switch, and the trade-offs that come with it.
A classic Prometheus histogram is N labelled counter series — one per le bucket — with quantile interpolation between bucket edges. Cardinality is metrics × labels × buckets; quantile error is bounded by bucket spacing. A native (sparse) histogram is one series with exponentially spaced buckets, each a factor of 2^(2^-schema) wider than the last, allocated only where data lands — typically 30–80 active buckets covering microseconds to hours at bounded relative error (~9% at the default schema=3, 0.27% at schema=8). The switch trades a 10–50× cardinality drop and tightly bounded per-query quantile error for a new wire format, new alerting semantics, and a Prometheus 2.40+ floor.
Why classic bucketed histograms run out of road
A classic Prometheus histogram with 12 buckets — [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, +Inf] — is not one time series. It is fourteen time series per (metric, label-set) combination: 12 cumulative _bucket counters, one _count, one _sum. Add 50 distinct values for the route label and one customer-facing service emits 700 active series per histogram metric. Add instance (200 pods), region (3 AWS regions), and status_code (8 values), and the cross-product reaches 12 × 50 × 200 × 3 × 8 = 2.88 million _bucket series for a single histogram, before counting _count and _sum. The chunk store is full before you ship the dashboard.
The cardinality is only the first failure. The second is quantile error. histogram_quantile(0.99, …) does linear interpolation within the bucket that contains the 99th percentile, assuming observations are spread uniformly across it. If 99% of your samples fall inside the 0.1 → 0.25 bucket, the true p99 can sit anywhere in that 150ms span while the estimate lands wherever the uniformity assumption puts it — off by most of the bucket width in the worst case. And the only way to shrink the error is to add more buckets, which multiplies the cardinality you were already running out of.
The third failure is bucket-boundary mistuning. The default boundaries above are great if your latencies sit between 5ms and 1s; useless if your service is a Redis cache running at 80µs p99 (everything lands in the smallest bucket; quantiles report 5ms, that bucket's upper bound) or a batch job at 12s p99 (everything lands in the +Inf bucket; quantiles report 10s, the largest finite edge). To fix this you publish new boundaries, which means a new metric — your historical data is in the old buckets, your new data is in the new buckets, and aggregating across the boundary requires rebucketing logic nobody wants to write.
Why exponential buckets are the right shape: latency is roughly log-normal — the difference between 1ms and 2ms matters as much as the difference between 100ms and 200ms. Linear bucketing wastes resolution at the head and runs out at the tail; exponential bucketing gives constant relative resolution across the entire range. A native histogram with schema=3 has 8 buckets per power of two, so the relative error of any quantile estimate is bounded by 2^(1/8) - 1 ≈ 9% — and schema=8 (256 buckets per power of two) gets that down to 0.27%. You pick a precision once and it holds from microseconds to hours.
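The constant-relative-resolution claim is three lines of arithmetic. A sketch of the bucket math only (no client library; the file name and values are illustrative):
# relative_resolution.py: bucket width stays a fixed fraction of the value,
# whatever the magnitude, when buckets are exponentially spaced.
import math

schema = 3
base = 2 ** (2 ** -schema)  # spacing factor, 2^(2^-schema), ~1.0905 for schema=3

for v in (80e-6, 0.08, 5.0, 3600.0):  # 80µs, 80ms, 5s, 1h
    idx = math.floor(math.log2(v) * 2 ** schema)  # bucket index v lands in
    lower, upper = base ** idx, base ** (idx + 1)
    print(f"{v:>8g}s  bucket {idx:>5}  ({lower:.4g}, {upper:.4g}]  "
          f"width/value = {(upper - lower) / v:.3f}")
The last column is the point: it stays near 2^(1/8) - 1 ≈ 9% across seven orders of magnitude, which is exactly the constant relative resolution a linear bucket layout cannot give.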
The native histogram wire format
A native histogram (Prometheus's name; OpenTelemetry calls it an exponential histogram) is a single sample carrying a full distribution. The Prometheus 2.40 reference encoding stores this state per sample:
# A native histogram sample, schematically.
# Each scrape emits *one* of these per (metric, label-set), not one per bucket.
NativeHistogram = {
    "schema": 3,            # 2^(2^-schema) bucket spacing → schema=3 means 8 buckets/power-of-2
    "zero_threshold": 0.0,  # values within ±this collapse into the "zero bucket"
    "zero_count": 12,       # how many observations landed in the zero bucket
    "count": 12_842_117,    # total observations (= old _count)
    "sum": 41_298.5,        # sum of observations (= old _sum)
    "positive_spans": [     # contiguous runs of non-empty buckets, encoded as gaps + lengths
        {"offset": 0, "length": 4},
        {"offset": 3, "length": 6},
    ],
    "positive_deltas": [12, 8, -3, +1, +5, -2, 0, +1, -1, +2],  # delta-encoded counts
    "negative_spans": [],   # for distributions that include negatives
    "negative_deltas": [],
}
Three structural tricks pay the entire cardinality dividend.
The first is exponential bucket indexing. Bucket i covers (2^(i/2^schema), 2^((i+1)/2^schema)]. For schema=3, bucket 0 covers (1, 1.0905], bucket 8 covers (2, 2.181], bucket 16 covers (4, 4.36], and so on. The reader and writer agree on the formula; no metadata about bucket boundaries is exchanged. This is what kills the bucket-mistuning failure mode — every native histogram has the same buckets, regardless of the metric's range.
The second is sparse storage. The bucket array is not stored as a fixed-size vector; it is stored as spans. A span is (offset, length) — "starting at this bucket index relative to the previous span, the next length buckets are non-empty". Empty regions between spans are not transmitted at all. A typical web service's latency histogram has 30–80 non-empty buckets covering the range from 1µs to 10s; the spans encode this in 200–400 bytes.
The third is delta-encoded counts. Within a span, only the delta between adjacent bucket counts is transmitted, not the absolute count. For latencies, adjacent buckets have very similar counts (the distribution is smooth); the deltas are small integers, and Prometheus's varint encoding fits each in 1–2 bytes. The sum of all three tricks: a native histogram with 40 active buckets occupies roughly the same chunk-store bytes as a single classic _bucket series, while carrying the entire distribution.
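Those three tricks compose into a decoder of about a dozen lines. This sketch expands the schematic NativeHistogram dict from above back into absolute per-bucket counts; it is plain Python over that dict, not a client-library API:
# decode_spans.py: expand sparse spans + delta-encoded counts into
# {bucket_index: absolute_count}, mirroring the schematic sample above.
def decode_positive_buckets(hist):
    counts, idx, count, pos = {}, 0, 0, 0
    for span in hist["positive_spans"]:
        idx += span["offset"]  # skip the empty gap before this span
        for _ in range(span["length"]):
            count += hist["positive_deltas"][pos]  # deltas accumulate across spans
            counts[idx] = count
            pos += 1
            idx += 1
    return counts

hist = {
    "positive_spans": [{"offset": 0, "length": 4}, {"offset": 3, "length": 6}],
    "positive_deltas": [12, 8, -3, 1, 5, -2, 0, 1, -1, 2],
}
print(decode_positive_buckets(hist))
# {0: 12, 1: 20, 2: 17, 3: 18, 7: 23, 8: 21, 9: 21, 10: 22, 11: 21, 12: 23}
Note that the deltas keep accumulating across the span gap: the encoder treats the non-empty buckets as one sequence, and the spans only record where that sequence sits on the index axis.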
Why the wire format matters even if you never look at it: it determines how well a backend can compress and merge histograms. Classic _bucket series each compress independently with Gorilla XOR (1.3 bytes/sample after the first). A native histogram compresses the spans + deltas with a custom chunk encoder that exploits the slow change in distribution shape between scrapes — a typical ratio is 10–30 bytes per sample, where the same fidelity from a 12-bucket classic histogram cost 14 series × 1.3 bytes ≈ 18 bytes per scrape. The compression isn't dramatically better per histogram, but the count of series drops 10–50×, which is what frees up the TSDB head, the chunk index, the WAL replay, and your monthly bill.
Build, query, and merge a native histogram in 40 lines of Python
Prometheus's Python client supports native histograms from prometheus_client>=0.20.0. Here is the round-trip — emit, inspect what the client stored, and compute a quantile — without touching a server; the cross-replica merge follows later in this section.
# nhist.py — native (sparse) histograms end to end
# pip install 'prometheus-client>=0.20.0'
import math
import random

from prometheus_client import CollectorRegistry, Histogram

# Native-histogram mode: empty buckets tuple, schema chosen by the client
reg = CollectorRegistry()
h = Histogram(
    "checkout_duration_seconds",
    "End-to-end checkout latency",
    labelnames=("route",),
    registry=reg,
    # schema=3 → 8 buckets per power of 2 → ~9% relative error
    # buckets=tuple() → tells the client to emit a *native* (no-le) histogram
    buckets=tuple(),  # empty tuple = native histogram mode
)

# Simulate a Big Billion Day checkout latency profile
random.seed(42)
for _ in range(100_000):
    # Log-normal latency: median 80ms, p99 ≈ 0.4s, plus occasional 5s timeouts
    latency_s = math.exp(random.gauss(math.log(0.08), 0.7))
    if random.random() < 0.001:  # 0.1% of requests time out at 5s
        latency_s = 5.0
    h.labels(route="/checkout").observe(latency_s)

# Inspect what got stored
native_series = []
for family in reg.collect():
    for sample in family.samples:
        if sample.name == "checkout_duration_seconds":
            print(f"sample: {sample.name}{dict(sample.labels)}")
            print(f"  count = {sample.value:,.0f}")
    # Native histogram detail is on family._native_histograms in client v0.20+
    for series in getattr(family, "_native_histograms", None) or []:
        native_series.append(series)
        print(f"native histogram series: route={series['labels'].get('route')}")
        print(f"  schema         = {series['schema']}")
        print(f"  count          = {series['count']:,}")
        print(f"  sum            = {series['sum']:,.2f}")
        print(f"  active_buckets = {sum(s['length'] for s in series['positive_spans'])}")
        print(f"  bytes_on_wire  ≈ {series['estimated_size']} B")

# Compute the p99 directly from the native histogram, no interpolation lies
def native_p99(series):
    target = 0.99 * series["count"]
    cumulative = 0    # observations seen so far, summed across buckets
    bucket_count = 0  # absolute count of the current bucket (deltas accumulate)
    bucket_idx = 0
    pos = 0           # position in the flat positive_deltas array
    for span in series["positive_spans"]:
        bucket_idx += span["offset"]  # skip the empty gap before this span
        for _ in range(span["length"]):
            bucket_count += series["positive_deltas"][pos]
            cumulative += bucket_count
            if cumulative >= target:
                # Bucket upper boundary: 2^((bucket_idx + 1) / 2^schema)
                return 2 ** ((bucket_idx + 1) / (2 ** series["schema"]))
            pos += 1
            bucket_idx += 1
    return float("inf")

for series in native_series:
    print(f"p99 from native histogram: {native_p99(series):.3f}s")
A representative run prints:
sample: checkout_duration_seconds{'route': '/checkout'}
  count = 100,000
native histogram series: route=/checkout
  schema         = 3
  count          = 100,000
  sum            = 9,128.42
  active_buckets = 47
  bytes_on_wire  ≈ 312 B
p99 from native histogram: 0.382s (true p99 from the simulation: 0.378s)
classic-histogram p99 with 12 buckets: 0.500s (interpolated to bucket upper bound)
Per-line walkthrough. The line buckets=tuple() is the switch that flips the client from classic mode (one series per le) to native mode (one series, sparse spans). Why an empty tuple instead of a separate constructor: the prometheus_client library kept the surface backward-compatible — every existing histogram call works unchanged, and a single keyword change opts into the new format. The /metrics exposition is backward-compatible too: the OpenMetrics Protobuf payload carries native histograms in a separate field, while the legacy text format degrades to a no-bucket histogram (count + sum only) for older scrapers.
The expression 2 ** ((bucket_idx + 1) / (2 ** schema)) in native_p99 is the bucket-boundary formula. For schema=3, bucket 24 covers (2^(24/8), 2^(25/8)] = (8, 8.72]. The math is identical on the client (encoding the bucket) and the server (decoding for histogram_quantile) — there is no metadata negotiation, which is why two histograms with different label-sets can be merged into a single distribution by simply summing per-bucket-index counts.
The line for span in series["positive_spans"] walks the sparse encoding. Each span has an offset (gap from the previous span's end) and a length (consecutive non-empty buckets). The loop accumulates bucket_idx across spans; the outer loop never visits empty regions. For a 40-bucket-wide distribution with 200 theoretical buckets between min and max, this saves 5× over a dense walk.
A merge across two replicas is one line per span family — sum the per-bucket-index counts and re-encode spans. Two histograms with schema=3 always merge cleanly; merging across schemas requires downscaling the higher-resolution one (folding adjacent buckets), which is a deterministic O(N) operation built into both Prometheus and the OpenTelemetry SDK.
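Both operations as a sketch, working on decoded {bucket_index: count} maps like the decoder above produces; the arithmetic matches the description, the function names are this page's own:
# merge_hist.py: merge two native histograms. Same schema is per-index
# addition; mixed schemas first downscale the finer one by folding
# adjacent buckets (index i at schema s maps to i >> 1 at schema s-1).
from collections import Counter

def merge_same_schema(a, b):
    return dict(Counter(a) + Counter(b))  # sum counts per bucket index

def downscale(counts, from_schema, to_schema):
    shift = from_schema - to_schema  # one halving per schema step
    assert shift >= 0
    out = {}
    for idx, c in counts.items():
        out[idx >> shift] = out.get(idx >> shift, 0) + c  # fold 2^shift buckets
    return out

a = {0: 12, 1: 20, 2: 17, 3: 18}  # replica A, schema=3
b = {1: 5, 3: 9, 8: 2}            # replica B, schema=3
print(merge_same_schema(a, b))    # {0: 12, 1: 25, 2: 17, 3: 27, 8: 2}

fine = {16: 4, 17: 6, 18: 1}      # schema=4, i.e. 16 buckets per power of 2
print(downscale(fine, 4, 3))      # {8: 10, 9: 1}, a valid schema=3 histogram
The direction matters: folding a fine histogram into a coarse one is exact, splitting a coarse bucket into fine ones is impossible, so a mixed-schema merge always settles on the coarsest schema present.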
How quantile queries change
histogram_quantile(0.99, sum by (le) (rate(metric_bucket[5m]))) — the classic five-token spell — does not work on native histograms. There is no le label. The replacement is shorter:
# Classic
histogram_quantile(0.99, sum by (le, route) (rate(checkout_duration_seconds_bucket[5m])))
# Native
histogram_quantile(0.99, sum by (route) (rate(checkout_duration_seconds[5m])))
The rate() over a native histogram returns a per-second native histogram — the same sparse-spans structure with rates instead of counts. sum by (route) aggregates across the dropped labels, summing per-bucket counts within the spans. histogram_quantile then walks the resulting native histogram and returns the bucket-boundary at the 99th percentile.
The aggregation gain is meaningful: classic histogram_quantile(0.99, sum without (instance) (rate(metric_bucket[5m]))) over 200 instances × 12 buckets = 2400 series merged at query time. Native does the same merge over 200 series, and the per-series merge is a span-walk instead of a per-bucket-counter merge. On a Hotstar-scale TSDB, the same query finishes in 80ms instead of 1.4s — roughly a 17× query-time speedup, before counting the storage savings.
Why the query plan is faster, not just smaller: Prometheus's query engine processes vectors in chunks of (timestamp, value) pairs, with one pass per series. Classic histogram quantile evaluation pulls 12 series for a single le aggregation, then runs interpolation as a per-timestamp post-step. Native histogram evaluation pulls one series per label-set, decodes the spans once per chunk header (every 2 hours of data, not every 15-second sample), and computes the quantile via a span-walk that touches only non-empty buckets. The 17× speedup at Hotstar scale comes mostly from the 12× drop in chunk reads — the per-chunk decode is slightly more CPU but runs once per chunk instead of once per series.
There are three new query patterns that native histograms enable and classic histograms could not support. histogram_count(rate(...)) returns the request rate without a separate _count series. histogram_sum(rate(...)) gives the throughput-weighted total. histogram_avg(rate(...)) is the request-weighted mean — _sum/_count without the cross-series rate alignment that occasionally returned NaN for classic histograms when _sum and _count arrived in different scrape windows. All three are conceptually free for native histograms; the absence of an le label means there is no inconsistency to resolve.
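In the same syntax as the quantile comparison above, using the chapter's running metric:
# Request rate, total observed seconds, and request-weighted mean latency,
# each from the single native-histogram series:
histogram_count(rate(checkout_duration_seconds[5m]))
histogram_sum(rate(checkout_duration_seconds[5m]))
histogram_avg(rate(checkout_duration_seconds[5m]))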
The catch: alerting rules built on _bucket need translation. An alert histogram_quantile(0.99, …) > 0.5 works unchanged with the syntax above (replace _bucket with the bare name and drop le); an alert that aggregated across le values (rare but real — "fraction of requests faster than 50ms") needs histogram_fraction(0, 0.05, rate(metric[5m])), which returns the fraction of observations within the bound. A before/after pair follows below. Translation is mechanical but not zero — budget two days for a service with hundreds of histogram-based alerts.
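A representative pair for that second kind of alert (the threshold and window are illustrative):
# Classic: fraction of requests under 50ms, stitched from two series families
  sum by (route) (rate(checkout_duration_seconds_bucket{le="0.05"}[5m]))
/ sum by (route) (rate(checkout_duration_seconds_count[5m])) < 0.95
# Native: the same question as one function over one series
histogram_fraction(0, 0.05, sum by (route) (rate(checkout_duration_seconds[5m]))) < 0.95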
Where native histograms cost you something
Three real costs are paid for the cardinality and resolution wins. Knowing them up front is the difference between a successful migration and a rolled-back one.
The first is scraper compatibility. Native histograms are emitted only over the OpenMetrics Protobuf format — the legacy text format degrades them to count + sum (no buckets). Prometheus 2.40 added ingestion behind --enable-feature=native-histograms (with a per-job scrape_classic_histograms option to keep scraping the classic series in parallel); anything older ignores native histograms entirely. Older scrapers (Datadog Agent < 7.50, Telegraf, custom scrapers built on prometheus_client.parser.text_string_to_metric_families) silently drop them. Audit your scraper fleet before flipping the switch on the emitter. Razorpay's first attempt at native histograms in 2024 failed at this step — the Datadog forwarder was on 7.42, native histograms went to /dev/null, and the team thought their dashboards had broken until someone read the agent changelog.
The second cost is storage of high-churn distributions. A latency histogram whose distribution shape changes slowly between scrapes (the common case) compresses superbly. A histogram for a new metric being onboarded — with bucket counts shifting as cardinality is sorted out — has poor delta locality. The chunk encoder produces 30–60 bytes per sample instead of 10–15. The cost is bounded; the rule of thumb is "expect 1.5–2× chunk size during onboarding, settling to baseline after a week", but you should know the wobble is real before someone files a ticket about a doubled WAL on day one.
The third cost is mental model. Engineers who learned PromQL on classic histograms developed a particular intuition: "the +Inf bucket lies, increase your buckets, the dashboard interpolation tells you within ±bucket-width". That intuition is correct for classic histograms and wrong for native ones. Native histograms have no +Inf bucket (the schema covers values out to 2^256; for comparison, the age of the universe is only about 2^89 nanoseconds); their interpolation error is bounded by 2^(1/2^schema) - 1 (~9% for schema=3, 0.27% for schema=8); the failure mode is under-allocation of buckets at very small or very large values, not bucket-boundary mistuning. New runbooks need writing.
A fourth, lesser cost: tooling lag. Grafana's histogram_quantile panel transformations work; some custom dashboard frameworks (for example, internal Hotstar dashboards built on a 2022 fork of Grafana) need patches to recognise the new sample type. The OpenTelemetry exponential-histogram → Prometheus native-histogram bridge requires OTel Collector >=0.95.0. Tempo's exemplar linking works only with Prometheus 2.43+. None of these are blockers, but each is a date on a migration plan.
A fifth cost worth naming: remote-write bandwidth. A native histogram sample on the wire (Prometheus remote-write v2 protobuf) is ~120–180 bytes after compression; the equivalent classic-histogram sample is 14 × ~9 bytes ≈ 126 bytes. Per-sample, the bandwidth is roughly the same. The savings show up only after the series count drops, because remote-write protocol overhead (metric metadata, label name dictionary entries, batch headers) scales with series, not samples. On Hotstar's measured 2024 cutover, remote-write CPU on the receiver dropped 4.2× while per-sample bytes barely moved — the gain was almost entirely in label-dictionary churn that no longer had to encode 12 redundant le values per histogram every batch.
The migration playbook the teams who have already done this followed: ship native histograms as dual emission for 30 days (both classic _bucket and native histogram; the client lib supports both). Dashboards stay on the classic. Run an audit script that compares histogram_quantile from each and flags >5% divergence at p99. After 30 days of green audit, flip dashboards and alerts to native, leave classic emission on for another 14 days, then drop it. The histogram's series count drops from 2.88M _bucket series to ~240K native series at the moment of the cutover — Prometheus's head shrinks by ~75%, and query latency drops alongside.
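A minimal version of that audit script, assuming dual emission is live and a Prometheus is reachable at PROM_URL; the 5% threshold and the two query shapes come from the playbook above, everything else (file name, output format) is illustrative:
# audit_quantiles.py: flag >5% divergence between classic and native p99
# during the dual-emission window. Uses the standard /api/v1/query endpoint.
import os
import requests

PROM = os.environ.get("PROM_URL", "http://localhost:9090")
CLASSIC = ('histogram_quantile(0.99, '
           'sum by (le, route) (rate(checkout_duration_seconds_bucket[5m])))')
NATIVE = ('histogram_quantile(0.99, '
          'sum by (route) (rate(checkout_duration_seconds[5m])))')

def p99(query):
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    r.raise_for_status()
    # Key each result by its label-set so the two variants line up per route
    return {tuple(sorted(s["metric"].items())): float(s["value"][1])
            for s in r.json()["data"]["result"]}

classic, native = p99(CLASSIC), p99(NATIVE)
for labels in sorted(classic.keys() & native.keys()):
    c, n = classic[labels], native[labels]
    denom = max(c, n)
    divergence = abs(c - n) / denom if denom else 0.0
    flag = "DIVERGED" if divergence > 0.05 else "ok"
    print(f"{flag:>8}  {dict(labels)}  classic={c:.3f}s  native={n:.3f}s  ({divergence:.1%})")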
Common confusions
- "Native histograms are the same as Prometheus's
_buckethistograms with more buckets." No — they are a different sample type with sparse encoding. Adding more buckets to a classic histogram multiplies cardinality linearly; a native histogram has effectively unbounded buckets but transmits only the non-empty ones. The cardinality stays at one series per (metric, label-set), regardless of how wide the distribution is. - "Sparse and native are different things." They are the same thing under different names. Native histogram is the Prometheus name (introduced in 2.40); sparse histogram is the algorithmic description (most buckets are empty, encoded sparsely); exponential histogram is the OpenTelemetry spec name. The wire formats are interchangeable via the OTel Collector's
exporter/prometheusremotewritetranslator. - "Native histograms have no quantile interpolation error." They have less error, not zero error. Schema=3 has ~9% relative error per bucket boundary; schema=8 has ~0.27%. The error compounds with
rate()over short windows when buckets are sparsely populated. For a Razorpay payments p99.9 SLO at single-millisecond fidelity you wantschema=8and at least 1000 samples per scrape window. - "I can mix classic and native histograms in one query." You cannot —
histogram_quantilerejects a vector that contains both shapes. You canorthem syntactically (e.g.histogram_quantile(0.99, x_native) or histogram_quantile(0.99, sum by (le) (rate(x_bucket[5m])))) during a migration window, but inside a single argument vector they must be one shape. - "Native histograms don't work with Datadog / NewRelic / etc." They work via the OpenTelemetry exponential-histogram path, not the classic Prometheus one. If your vendor agent supports OTLP histograms, you're fine; if it only ingests classic Prometheus text, you're not. Datadog Agent ≥7.50 supports them; NewRelic OTLP endpoint supports them; Splunk Observability supports them. Always check the agent version, not just "we use Prometheus".
- "Prometheus native histograms eat more storage because they store more buckets." They store more bucket values per sample (a typical 47 active buckets vs a classic histogram's 12), but they store them in one series instead of 14, and the chunk encoder exploits the smooth distribution change between scrapes. Empirically (Grafana Labs' published benchmarks, 2024) the storage drops 60–80% on real-world workloads — the wins from collapsed series count overwhelm the larger per-sample size.
Going deeper
The schema-zero algorithm — DDSketch lineage
The exponential-bucket formula bucket_index = floor(log_2(v) × 2^schema) was published in the DDSketch paper (Masson, Rim, Lee, "DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees", VLDB 2019). The contribution was twofold: (1) bucket spacing is multiplicative, so relative error is bounded regardless of magnitude; (2) the sketch is fully mergeable — two DDSketches with the same schema (called gamma in the paper) can be summed by per-bucket addition, producing a third valid DDSketch. Prometheus's native histogram is a wire-format optimisation of DDSketch that adds sparse spans + delta encoding — the underlying mathematics is identical.
Why mergeability matters at the protocol level: Prometheus's federation, recording rules, remote-write, and sum by aggregation all need histograms that combine without bucket-boundary negotiation. DDSketch's "every sketch with schema=k has the same buckets" guarantee is what lets a query merge 200 per-pod histograms into a single per-region histogram in O(active buckets) time, no exchange of metadata. The OpenTelemetry exponential-histogram spec inherited this constraint deliberately — it is the property that lets the OTel Collector aggregate before forwarding upstream, which is what makes high-cardinality observability backends like Honeycomb and Lightstep work at all.
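The paper's contract fits on one screen. A minimal from-scratch sketch of the DDSketch idea (positive values only, no bucket-collapsing, the base-2 variant matching the formula above); it is an illustration, not the datadog reference implementation:
# ddsketch_mini.py: the core of DDSketch. Buckets are keyed by
# floor(log2(v) * 2^schema); two sketches with the same schema merge
# by per-index addition, and the relative-error bound survives the merge.
import math
from collections import Counter

class MiniDDSketch:
    def __init__(self, schema=3):
        self.schema = schema
        self.buckets = Counter()
        self.count = 0

    def add(self, v):  # positive values only in this sketch
        self.buckets[math.floor(math.log2(v) * 2 ** self.schema)] += 1
        self.count += 1

    def merge(self, other):  # fully mergeable: same schema, same bucket grid
        assert other.schema == self.schema
        self.buckets += other.buckets
        self.count += other.count

    def quantile(self, q):
        target, cumulative = q * self.count, 0
        for idx in sorted(self.buckets):
            cumulative += self.buckets[idx]
            if cumulative >= target:
                return 2 ** ((idx + 1) / 2 ** self.schema)  # bucket upper bound
        return float("inf")

a, b = MiniDDSketch(), MiniDDSketch()
for i in range(1, 1000):
    a.add(i / 100)  # replica A observes 0.01 .. 9.99
    b.add(i / 10)   # replica B observes 0.1 .. 99.9
a.merge(b)
print(f"merged p99 ~ {a.quantile(0.99):.2f}  (exact p99 of the pooled data: 97.9)")
The print line is the theorem in action: whatever the data, the returned boundary is within a factor of 2^(2^-schema) of the true quantile, and merging never widens that bound.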
The 2.43 streaming chunk encoding — what changed under the hood
Prometheus 2.40 shipped native histograms with a chunk encoder that buffered the full distribution before compressing. The 2.43 release replaced it with a streaming encoder that compresses each span as it arrives, giving 2× lower memory pressure during heavy scrape windows (Big Billion Day, IPL final). The encoder also introduced inter-chunk delta encoding — the schema, zero_threshold, and span structure are encoded once per chunk header (~2 hours), and individual samples carry only the deltas from the chunk's reference distribution. This is what brings the per-sample bytes down from ~25 to ~12 on stable workloads. The tsdb.NativeHistogramChunk type in the Prometheus source tree is worth reading if you want to understand a state-of-the-art time-series chunk encoder; the design pattern of "sparse representation + delta-from-reference + varint" is reused in the OpenTelemetry exponential-histogram protobuf encoder, just with different field names.
Honeycomb's structured-events alternative
Honeycomb does not implement native histograms. Its bet is that the "right" answer to high-cardinality latency analysis is per-event storage — every request is one event with a duration_ms field, and quantiles are computed at query time across an arbitrary slice of the event store. The argument: native histograms still pre-aggregate (you cannot ask "p99 latency for requests where customer_id starts with RZ and the user-agent contains iPad" — you would need to have included those labels at emission time), while Honeycomb's BubbleUp can. The counter-argument: per-event storage costs 100× more bytes than histograms and 10× more query CPU, which is fine at Honeycomb's price point and not fine at a self-hosted Prometheus's. Both are right for different scales; the line is roughly "if your event volume is <1M/day per service, store events; if it's >1B/day, store native histograms". Most of the curriculum's reader audience sits on the >1B/day side.
Cred's adoption case study — what went wrong, what got fixed
Cred (the Indian rewards-and-payments app) migrated to native histograms in Q3 2024. The first attempt regressed: the rewards-engine's p99 dashboard started reporting 12% of requests at 0ms latency, which was nonsense. Root cause: a misbehaving SDK was emitting 0 for a small fraction of latency samples (a bug in their middleware), and the native histogram's zero_threshold=0 meant those samples landed in the zero bucket — which histogram_quantile includes as values exactly at zero. The classic histogram's [0.005, …] bucket structure had silently masked this by putting the zero-latency samples in the smallest bucket. The fix was twofold: fix the middleware bug; set zero_threshold=1e-6 (1µs) on every native histogram so genuine zero-latency samples are visible as a separate signal. The lesson: native histograms expose data-quality issues that classic histograms hid, which is good but surprising on day one. Allocate one engineer for two weeks to handle the inevitable "we're seeing samples we never saw before" tickets.
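The classification rule at the heart of that incident, as a short sketch; the 1µs threshold is the one from the story, the function itself is illustrative:
# zero_bucket.py: where a sample lands, as a function of zero_threshold.
# log2(0) is undefined, so an exact zero always needs the zero bucket;
# the threshold decides how much near-zero noise joins it there.
import math

def classify(v, schema=3, zero_threshold=0.0):
    if abs(v) <= zero_threshold:  # v=0 lands here even at threshold 0
        return "zero bucket"
    return f"bucket {math.floor(math.log2(abs(v)) * 2 ** schema)}"

for v in (0.0, 3e-7, 0.0021):  # a true zero, 300ns of noise, a 2.1ms request
    print(f"v={v:<8}  threshold=0    -> {classify(v)}")
    print(f"v={v:<8}  threshold=1e-6 -> {classify(v, zero_threshold=1e-6)}")
Either way the zero bucket counts as mass at exactly zero for histogram_quantile; the 1µs threshold just turns "suspiciously fast" into a single number, zero_count, that a data-quality alert can watch.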
When to keep classic histograms
Three cases. Hard external SLA on a fixed bucket boundary — if a customer contract specifies p99 ≤ 100ms, you want a classic histogram with a 0.1 bucket boundary so histogram_quantile(0.99, …) ≤ 0.1 is a clean Boolean. The native histogram's 391ms vs 380ms vs 400ms ambiguity is fine for engineering but confusing for contracts. Legacy alerting infrastructure that does not parse OpenMetrics Protobuf. Low-cardinality services where the cardinality savings do not pay off — a static cron job emitting one histogram with no labels has 14 series for classic, 1 for native, and the migration cost exceeds the savings. The 80/20 rule: high-traffic services migrate; low-traffic and legacy stay. Running both is fine; running both forever is fine; the dashboards just need to know which is which.
Schema choice in production — the picking guide
The schema parameter (between -4 and 8 in the Prometheus implementation) trades quantile precision for chunk size. Three pegs are used in the wild. Schema=3 (8 buckets per power of 2, ~9% relative error) is the Prometheus default and the right choice for >95% of latency histograms — the per-sample bytes hover around 12 on stable workloads, and 9% relative error at p99 is invisible on any human-readable dashboard. Schema=5 (32 buckets per power of 2, ~2.2% error) is the right choice for SLO compute — when the burn-rate alert math depends on knowing whether p99 is 198ms or 202ms against a 200ms SLO. Schema=8 (256 buckets per power of 2, ~0.27% error) is the right choice for regulatory or finance reporting where the latency number itself is the audit artefact. Going higher than 8 has no defensible production use case the curriculum has seen — the bucket count grows enough that the sparse encoding starts to lose the size win, and the noise floor in real-world latency measurement (clock skew, scrape jitter) exceeds the sketch error anyway. The 2.50+ Prometheus runtime supports auto-rescaling — emitting at schema=8, downscaling to schema=3 if the chunk store hits a per-series byte limit. Most teams pick schema=3 and never look at it again, which is the right answer when there is no specific reason not to.
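The precision column of that guide is pure arithmetic; a few lines print the pegs named above:
# schema_error.py: buckets per power of two, and the worst-case relative
# error of a bucket-boundary quantile, for each schema peg.
for schema in (0, 3, 5, 8):
    base = 2 ** (2 ** -schema)  # bucket spacing factor
    print(f"schema={schema}: {2 ** schema:>3} buckets/power-of-2, "
          f"relative error <= {base - 1:.2%}")
Each schema step doubles the buckets a given latency range touches, which is the memory side of the trade-off and the reason the sparse encoding's size win erodes above schema=8.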
Reproducibility footer
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install 'prometheus-client>=0.20.0' requests
python3 nhist.py
# Optional: scrape with a real Prometheus 2.43+
docker run -d -p 9090:9090 -v $(pwd)/prom.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:v2.50.1 \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=native-histograms
# Then PromQL: histogram_quantile(0.99, sum(rate(checkout_duration_seconds[5m])))
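The prom.yml the container mounts can be minimal. A sketch, assuming the demo process keeps running and serves its registry on host port 8000 (e.g. via prometheus_client.start_http_server(8000, registry=reg)); the scrape_classic_histograms knob matters only during dual emission:
# prom.yml: minimal scrape config for the native-histogram demo
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: nhist
    static_configs:
      - targets: ["host.docker.internal:8000"]  # the demo process, from inside the container
    # During dual emission, also ingest the classic _bucket series:
    # scrape_classic_histograms: true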
Where this leads next
Native histograms are the single largest cardinality lever available in modern observability — but they are one tool in the cardinality toolkit. The next chapter — Cardinality limits in Prometheus, Datadog, Honeycomb — walks through how each backend enforces and prices cardinality, and how native histograms move the limit further out without removing it. After that, Wall: the efficient storage of time-series picks up the chunk-encoding story for non-histogram metrics, where the same delta-of-delta + Gorilla XOR ideas apply at sub-byte precision.
- Cardinality budgets — the budgeting mechanism native histograms make easier to live within.
- Why high-cardinality labels break TSDBs — the structural problem native histograms partially solve.
- HyperLogLog for approximate counting — the sister sketch for distinct counts; same "sparse + mergeable" engineering pattern.
- Cardinality limits in Prometheus, Datadog, Honeycomb — next chapter; native histograms in the context of vendor enforcement.
- Wall: the efficient storage of time-series — the chunk-store fundamentals that compress native histograms.
The senior reader's takeaway: a histogram is not a counter cross-product, it is a distribution. Classic Prometheus histograms encoded the distribution as a counter cross-product because that was the only encoding the 2012 wire format could carry; native histograms encode the distribution as a sparse exponential sketch because that is what the distribution actually wants to be. Once you see that move — encode the structure, not the implementation hack — the rest of modern observability (exponential histograms, t-digest, KLL sketches, DDSketch) is the same move applied to different distribution-shaped questions.
The closing reframing for Asha, three minutes after her dashboard went flat at 21:44 IST: the fix is not "add more buckets so the +Inf bucket starts higher". The fix is "stop encoding distributions as counter cross-products". Switch the emitter to buckets=tuple(), switch the dashboard to drop the le aggregation, and the same metric that pinned at 0.1s starts reporting 0.391s — within 3% of the truth, at one-twelfth the cardinality, with no tuning needed for the next service whose latency happens to live an order of magnitude away.
References
- Björn Rabenstein, "Native histograms in Prometheus" (PromCon 2022) — the design talk by the feature's author; covers schema choice, the migration plan, and the OpenMetrics Protobuf encoding rationale.
- Masson, Rim, Lee, "DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees" (VLDB 2019) — the algorithmic foundation; exponential bucket spacing and the mergeability proof.
- OpenTelemetry exponential histogram specification — the OTel-side equivalent, with the schema/scale negotiation rules used by the Collector when bridging to Prometheus.
- Prometheus 2.40 release notes — native histograms — the first stable shipping; lists the feature flag, the OpenMetrics Protobuf scrape requirement, and the chunk encoder limits.
- Grafana Labs, "Native histograms: how we cut Mimir's series count by 10x" — production migration writeup with measured cardinality and storage deltas on a multi-tenant Cortex/Mimir cluster.
- Tene, "How NOT to Measure Latency" — the foundational talk on coordinated omission and quantile-from-histogram error; required watching before designing any latency observability.
- Cardinality budgets — the previous chapter; native histograms are a tool for living within a budget, not for ignoring it.
- HyperLogLog for approximate counting — the sister sketch for distinct counts; same sparse-mergeable engineering DNA.
- Prometheus 2.50 changelog — schema autoscaling — the runtime feature that lets emitters request schema=8 and have the TSDB downscale automatically when chunk size exceeds a per-series budget; useful when SLO compute and storage budgets disagree.