HyperLogLog for approximate counting

It is 21:14 IST on 26 May, fifteen minutes into the IPL final, and Karan — a senior engineer at Hotstar — is staring at a Grafana panel that should be answering one question: how many unique viewers are on the stream right now? The naive query (count(distinct viewer_id) over 5m) hit Druid at 21:00, the heap on the broker climbed to 38 GB, the broker OOMed at 21:09, and the panel has been blank for five minutes. The Slack channel from product is asking why the live concurrent-viewer number disappeared during the Mumbai-Chennai final, and Karan knows the answer is not a bigger broker — it is that exact distinct counts at this scale are physically impossible to maintain in real time. The fix is to stop counting and start sketching: a 12 KB struct, a hash function, and a counter of leading zeros that gives a sub-1% standard-error answer to "how many unique viewers" whether the input is one million or one billion. That struct is HyperLogLog, and this chapter is the engineering of the trick that lets you price distinct-counting as a fixed memory cost instead of a linear one.

HyperLogLog estimates the cardinality of a set using a fixed array of tiny registers, each needing only O(log log N) bits — 4 KB of one-byte buckets for a 1.62% standard error, or Redis's default 12 KB for 0.81% — regardless of whether the set has 100 elements or 100 billion. It works by hashing each input, splitting the hash into a bucket index and a bit pattern, recording the maximum number of leading zeros seen per bucket, and combining the buckets via a harmonic mean. The output is the only practical answer to count(distinct …) at observability scale — used in Druid, Redis PFCOUNT, BigQuery APPROX_COUNT_DISTINCT, and every metrics backend that exposes a cardinality estimate.

Why exact distinct counting fails at observability scale

The naive way to count distinct viewers is to keep a set and call len() at the end. For 1.4 billion unique viewer IDs (Hotstar's peak across an IPL season), each ID stored as a 16-byte UUID, the set carries 22 GB of entries — before Python's hash-table overhead, which roughly triples it.

No single broker has the heap. Sharding the set across 64 brokers still leaves roughly 1 GB per shard once the hash-table overhead is counted, and the merge step (union the shards to get the true distinct count) needs to ship every entry to a coordinator. That coordinator now needs the 22 GB itself, plus deduplication CPU.

The naive approach is O(N) in memory and, with a hash set, O(N) in time — but the constant factors are large enough to exceed any reasonable cluster's headroom. Sorting trades the hash table for sequential access — sort, then count distinct adjacent values — at O(N log N) time and still O(N) memory. Bloom filters fail the other way: they answer "have I seen this before?" probabilistically with no false negatives, but the cardinality estimate from a Bloom filter degrades sharply once the filter saturates.

What you want is an algorithm whose memory is independent of N — and whose error is small and bounded. That is exactly what HyperLogLog gives you.

[Figure: Memory vs cardinality — exact set grows linearly; HLL is constant. Two curves on a log-x axis from 1K to 10B unique elements: the exact-set curve grows linearly in memory (1 KB at 64 elements, 1 MB at 64K, 1 GB at 64M, 22 GB at 1.4B); the HyperLogLog curve is a flat line at 12 KB across the entire range. A vertical marker at 1.4B (Hotstar IPL peak) contrasts the exact set's 22 GB with HLL's 12 KB, and an annotation notes that HLL's standard error stays fixed across all cardinalities — the error is bounded, not the memory.]
Illustrative — not measured data. The flat HLL line is the entire engineering value: at any cardinality from a thousand to a hundred billion, the memory cost is fixed. The crossover with the exact set happens around 800 elements; above that, the sketch is never the more expensive choice.

Why exact counts cannot scale even with sharding: the merge problem. Two shards each holding 700M distinct viewer IDs cannot tell you the global cardinality without exchanging at least one of the full sets — set-union has no shortcut for exactness. The transfer cost is the size of the smaller shard. HLL inverts this: two HLL sketches merge by taking the bucket-wise maximum, an O(m) operation where m is the bucket count, typically 4096. Merging is what makes HLL the right choice for distributed cardinality counting; the storage saving is a secondary benefit.

The other failure mode of exact counting is streaming. The viewer-ID set has to be updated on every event — 25 million events per second at IPL peak. A set.add() operation does a hash and a possible bucket reallocation, both of which fight cache. HLL also hashes once per event, but its update is a single integer max — the inner loop is small enough to fit in L1, which matters when you are doing 25 million of them per second.

The algorithm in 30 lines of Python

The trick of HLL is to use the hash of each element as a stand-in for the element itself. A good hash function maps inputs to uniform random bits; the leading-zero count of a uniform random integer follows a known distribution; that distribution lets you back out the cardinality from a maximum-leading-zero observation across many elements. The algorithm is short enough to fit on one screen.

# hll.py — HyperLogLog from first principles, no external deps for the core algorithm
# pip install xxhash  (a fast non-cryptographic hash; you can use hashlib.md5 too)
import xxhash, math

class HLL:
    """HyperLogLog with m = 2^p buckets. p=12 → m=4096, error ≈ 1.04/sqrt(m) ≈ 1.6%."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                      # number of buckets
        self.buckets = bytearray(self.m)     # one byte per bucket holds rho_max
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias-correction constant

    def add(self, item) -> None:
        h = xxhash.xxh64_intdigest(str(item).encode())
        idx = h >> (64 - self.p)             # top p bits → bucket index
        w = h & ((1 << (64 - self.p)) - 1)   # remaining 64-p bits, kept in place
        rho = (64 - self.p) - w.bit_length() + 1 if w else (64 - self.p) + 1
        if rho > self.buckets[idx]:
            self.buckets[idx] = rho

    def count(self) -> int:
        # Harmonic-mean estimator over per-bucket 2^-rho values
        z = sum(2.0 ** -r for r in self.buckets)
        e = self.alpha * self.m * self.m / z
        if e <= 2.5 * self.m:                # small-range correction
            zeros = self.buckets.count(0)
            if zeros: e = self.m * math.log(self.m / zeros)
        return int(e)

    def merge(self, other: "HLL") -> "HLL":
        assert self.p == other.p
        out = HLL(self.p)
        for i in range(self.m):
            out.buckets[i] = max(self.buckets[i], other.buckets[i])
        return out

# Validate against a real distinct count
import random
random.seed(42)
true_n = 5_000_000
hll = HLL(p=12)
seen = set()
for _ in range(true_n):
    x = random.randint(0, 10**12)
    hll.add(x); seen.add(x)
print(f"true distinct:  {len(seen):>12,}")
print(f"hll estimate:   {hll.count():>12,}")
print(f"relative error: {abs(hll.count() - len(seen)) / len(seen) * 100:>11.2f}%")
print(f"hll memory:     {len(hll.buckets):>12,} bytes (= {len(hll.buckets)/1024:.1f} KB)")
print(f"set memory:     ~{len(seen) * 28:>12,} bytes (= {len(seen)*28/1024/1024:.1f} MB)")

A representative run prints:

true distinct:     4,996,839
hll estimate:      4,973,108
relative error:        0.47%
hll memory:            4,096 bytes (= 4.0 KB)
set memory:     ~139,911,492 bytes (= 133.4 MB)

Per-line walkthrough. The line idx = h >> (64 - self.p) extracts the top p bits of the hash as the bucket index — for p=12, that is the top 12 bits, giving 4096 buckets. Why the top bits and not the bottom: the bucket index needs to be uniform across hash values, and for a good 64-bit hash the top bits are uniform and independent of the trailing bits. Using the top bits leaves the trailing 52 bits available for the rho-count step, with no shared structure between the two — that independence is what makes the variance analysis hold.

The line rho = (64 - self.p) - w.bit_length() + 1 computes the position of the leftmost 1-bit in the remaining 64 - p = 52 bits, counting from 1. For uniform random bits, the probability of seeing at least k leading zeros is 2^-k (exactly k zeros followed by a 1 is 2^-(k+1)); so observing a maximum rho of k across a bucket's elements suggests that bucket saw on the order of 2^k distinct values.
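The 2^-k law is easy to check empirically — a stdlib-only sketch, independent of the HLL class:

import random, collections

random.seed(1)
counts = collections.Counter()
for _ in range(1_000_000):
    w = random.getrandbits(52)
    counts[52 - w.bit_length() + 1 if w else 53] += 1

for k in range(1, 8):
    print(f"rho={k}: observed {counts[k] / 1e6:.4f}   predicted {2.0 ** -k:.4f}")
    # rho=1 ~0.50, rho=2 ~0.25, ... — each extra leading zero halves the probability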

The line e = self.alpha * self.m * self.m / z is the HyperLogLog estimator: the harmonic mean of the per-bucket 2^rho values, scaled by alpha * m^2. Why the harmonic mean and not the arithmetic mean: the per-bucket estimators carry heavy multiplicative noise; the arithmetic mean is dominated by the largest estimator (a bucket that happened to see a 30-zero hash), giving wildly biased results. The harmonic mean down-weights large values, producing a low-variance combined estimate. The Flajolet 2007 paper proved that the harmonic-mean estimator achieves a relative standard error of about 1.04/sqrt(m) — close to the best achievable with m registers of this kind.

The line if e <= 2.5 * self.m is the small-range correction. The harmonic-mean estimator is biased downward at low cardinalities; when the estimate is below 2.5 × m, we switch to linear counting — counting the empty buckets and using m × ln(m / zeros). This is what gives HLL its smooth error curve from cardinality 1 up to billions; without it, estimates below ~10K are systematically too low.
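You can watch the correction engage using the HLL class above — a quick sanity check, not a benchmark; for the smaller counts the estimate comes from the zeros-counting branch, since e stays below 2.5 × m = 10,240:

for n in (100, 1_000, 10_000):
    h = HLL(p=12)
    for i in range(n):
        h.add(f"user_{i}")
    print(f"true {n:>6,} -> estimate {h.count():>6,}")
    # all three land within a few percent; without linear counting the
    # small cardinalities would skew systematically low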

The line out.buckets[i] = max(self.buckets[i], other.buckets[i]) is the merge — the only operation HLL needs that exact-set counting cannot do cheaply. Two HLL sketches merge in O(m) time, no matter the cardinality of either side.

This is the property that lets Hotstar shard the IPL viewer-counting across 200 ingest pods, each maintaining its own HLL, and merge them into a single live count on the Grafana panel — the merge runs in microseconds because it walks 4096 bytes, not 1.4 billion IDs.
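A toy version of that topology, built on the class above — eight pods standing in for 200, with a simple modulo partitioner standing in for Kafka:

from functools import reduce

pods = [HLL(p=12) for _ in range(8)]
for i in range(1_000_000):
    pods[i % 8].add(f"viewer_{i}")              # each event lands on exactly one pod

merged = reduce(lambda a, b: a.merge(b), pods)  # bucket-wise max, 4096 bytes per step
print(f"merged estimate: {merged.count():,}")   # ≈ 1,000,000, within the ~1.6% error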

A concrete worked example — Riya audits unique merchants for Q4

Riya is a data engineer at a mid-size payments company. She has been asked: how many unique merchants transacted in Q4 2025?

The naive query against the warehouse — select count(distinct merchant_id) from transactions where ts between '2025-10-01' and '2025-12-31' — would scan 1.8 billion rows, hit a 4-hour timeout, and return a number that nobody can sanity-check.

She reframes the problem. The warehouse already has a daily_active_merchants table with one row per (date, merchant_id). She runs PFADD q4_merchants <merchant_id> for every distinct (date, merchant_id) pair (92 days × ~1.4M merchants = 128M PFADDs), each touching a single Redis HLL key.

Total Redis memory: 12 KB. Total runtime: 38 minutes on a single thread.

The result: PFCOUNT q4_merchants returns 1,732,418. The standard error at p=14 (Redis's default) is 0.81%, so the 95% confidence interval is roughly 1,732,418 ± 28,000 — a window narrow enough that the leadership team can plan against it, with the explicit acknowledgement that the count is approximate.

The audit's value is in what it enables next. The same Redis HLLs, retained per day, let Riya answer follow-ups without re-running the audit: PFMERGE oct_merchants q4_merchants:2025-10-01..31 for October only, PFCOUNT q4_minus_q3 via inclusion-exclusion (with the documented compounded-error caveat) for "merchants who transacted in Q4 but not Q3".
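In redis-py the whole follow-up pattern is a handful of calls. A sketch with hypothetical key names, assuming a local Redis; note that the difference is derived arithmetic, so the errors of the two counts compound:

import redis

r = redis.Redis()
q3, q4 = "merchants:2025-q3", "merchants:2025-q4"
r.pfadd(q3, "m_1", "m_2", "m_3")
r.pfadd(q4, "m_2", "m_3", "m_4", "m_5")

# |Q4 \ Q3| = |Q3 ∪ Q4| - |Q3|   (inclusion-exclusion)
r.pfmerge("merchants:q3q4-union", q3, q4)
q4_only = r.pfcount("merchants:q3q4-union") - r.pfcount(q3)
print(q4_only)   # 2 here; at real scale, ± the compounded standard error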

Each follow-up is 12 KB of memory and milliseconds of runtime — a sustainable shape for the question class. A finance team that gets this answer in milliseconds asks better questions; a finance team that gets it in 4 hours asks fewer questions and trusts the answers less.

Where HLL is already in your stack

You almost certainly use HyperLogLog without naming it. Three high-leverage places:

Druid hyperUnique and HLLSketch: Hotstar's live-viewer dashboard runs on Druid; the column is a HLLSketch aggregator that pre-computes HLL during ingest and merges sketches at query time. A count(distinct viewer_id) group-by-minute over 24 hours touches 1440 sketches × 4 KB = 5.6 MB of state, returning in 80 ms. The exact equivalent over 1.4B viewer-events would never finish.

Redis PFADD / PFCOUNT / PFMERGE: the PF prefix is "Philippe Flajolet", the algorithm's inventor. A Redis HLL uses 12 KB per key; merging is a single PFMERGE command. Razorpay tracks daily-active-merchants this way: every payment event runs PFADD daily:merchants:2026-04-25 <merchant_id>, the dashboard runs PFCOUNT daily:merchants:2026-04-25 and gets back a 0.81% standard-error count in <1 ms.

BigQuery APPROX_COUNT_DISTINCT: replacing COUNT(DISTINCT user_id) with APPROX_COUNT_DISTINCT(user_id) over a 12-billion-row table at Flipkart drops query cost from $42 to $0.18 and runtime from 6 minutes to 11 seconds. The APPROX_QUANTILES and APPROX_TOP_COUNT functions use related sketches (KLL, Misra-Gries) and are worth the same review.

A fourth, lower-profile site: Postgres extensions. The hll extension from Citus and the postgresql-hll package give Postgres native HLL columns and operators. Razorpay's analytics warehouse keeps a daily_uniques table with one HLL-typed column per dimension; a select hll_cardinality(hll_union_agg(ips)) from daily_uniques where day between '2025-01-01' and '2025-01-31' returns the unique-IP count for January in 200 ms over 31 input rows.

The pattern across all four: the user writes count(distinct …), the engine substitutes a sketch, the result is correct to within a percent, and the cost is fixed. Sketching is the only practical answer to distinct counting at scale; the question is which sketch and where it lives.

[Figure: Where an HLL sketch sits in a real ingestion pipeline — the sketch lives close to the data; the merge happens at query time. A horizontal flow with four stages: client SDK (app, browser, mobile player — 25M events/sec at IPL peak, no HLL state on the client) → ingest workers (200 pods in a Kafka consumer group, ~125K events/pod/sec, per-minute HLL update via PFADD) → sketch store (Redis Cluster or Druid HLL column, 4 KB per minute, 5.6 MB per 24h vs 22 GB for an exact set) → query (Grafana panel, PFMERGE of 1440 sketches in 80 ms, ±1.6% error). Total memory across all minutes of a day: ~5.6 MB vs ~22 GB for an exact set.]
Illustrative — not measured data. The architecture is the same whether the sketch store is Redis (`PFADD`/`PFMERGE`) or Druid (`HLLSketch` columns). What matters is that ingest pods do not coordinate during writes — each maintains its own HLL — and queries merge cheaply.

What is actually inside the sketch — a 4096-byte tour

Reading a real HLL sketch byte by byte is what makes the algorithm stop feeling like magic. The HLL instance from the previous section is a bytearray(4096) — exactly 4096 bytes. Each byte stores a number between 0 and 53 (the maximum possible rho for a 64-bit hash with 12 bucket bits is 64 - 12 + 1). The byte at offset i is the largest leading-zero count seen for any element whose hash's top 12 bits equal i. That is the entire data structure.

Walk through what one update does. The hash of viewer_id="rk_4827" under xxHash64 is 0xa3f1...c904 — a 64-bit integer. Strip the top 12 bits as the bucket index: 0xa3f = 2623. Take the remaining 52 bits and find the leftmost 1: say it sits after four leading zeros, so rho = 5. The update is if 5 > buckets[2623]: buckets[2623] = 5.
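The hash shown above is illustrative; this snippet computes the real bucket and rho for that viewer ID with the same bit manipulation add() uses (hashing the raw bytes directly, which changes the exact numbers but not the mechanics):

import xxhash

h = xxhash.xxh64_intdigest(b"rk_4827")
idx = h >> 52                          # top 12 bits -> bucket index, 0..4095
w = h & ((1 << 52) - 1)                # trailing 52 bits
rho = 52 - w.bit_length() + 1 if w else 53
print(f"hash=0x{h:016x}  bucket={idx}  rho={rho}")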

That is one Python attribute lookup, one shift, one bit-scan, one comparison, one byte write. The runtime is small enough that on commodity hardware a Python implementation hits ~3-4M add() calls/sec; the native implementations in Redis (C) and Druid (Java) hit 50-100M/sec, easily the throughput needed for any real ingest.

The estimator's reverse direction — going from buckets back to a count — is where the harmonic-mean magic happens. After 5 million add() calls on uniform random inputs, a p=12 sketch has seen ~1200 elements per bucket, so the per-bucket maxima cluster around log2(1200) ≈ 10-11: most buckets sit at rho between 9 and 13, with a thin tail of higher values from the occasional long run of leading zeros, and almost none below 8.

The harmonic mean m^2 / sum(2^-rho) produces ~5M; the alpha correction shaves the small bias; the result lands within 1% of the true count. The buckets you see in dump form are not "5 million viewers"; they are a fingerprint of how the maximum-leading-zeros across many independent samples is distributed, and that fingerprint determines the cardinality estimate.

Two practical consequences of this internal shape. Sketch comparison is meaningful: two sketches with similar bucket distributions saw similar input populations even if one was on a million events and the other on five million. Sketch dumping is cheap: a 4096-byte sketch fits in a single Redis value, a single Kafka record, a single Postgres bytea column.

Hotstar persists nightly snapshots of every per-minute HLL into S3 — 1440 minutes × 4 KB × 200 streams = ~1.1 GB per day, recoverable forever, replayable into any new analysis. The exact-set equivalent would be petabytes per day; the sketch makes the historical record practical.

Tuning p — the precision dial

The single tuning parameter is p, the number of bits used for the bucket index. m = 2^p, error ≈ 1.04/sqrt(m), memory = m bytes (one byte per bucket). The table walks the practical range:

p    m (buckets)   memory   std error   typical use
10   1,024         1 KB     3.25%       low-fidelity counters, IoT telemetry
11   2,048         2 KB     2.30%       per-second per-pod streaming counters
12   4,096         4 KB     1.62%       default in Druid and BigQuery
14   16,384        16 KB    0.81%       Redis's default (dense form packs 6-bit buckets into 12 KB); dashboards visible to product/finance
16   65,536        64 KB    0.40%       regulatory reporting, audit-grade counts
18   262,144       256 KB   0.20%       when 0.4% is not enough — rarely needed

A team's first instinct is to push p high "for accuracy"; the right move is usually to keep p=12 and merge across more dimensions. Why p=12 is the practical default: doubling m halves the error variance but doubles the memory; the marginal value of going from 1.6% to 0.8% is rarely worth the 4× memory in production. The error of 1.6% on a count of 1.4 billion is ±22 million — visually imperceptible on a Grafana time-series panel where the y-axis is logarithmic. Engineering effort is better spent on instrumenting more dimensions (per-state, per-region, per-OS) than on tightening the sketch precision on a single dimension.

The choice of p is also a forward-compatibility decision. Two sketches with different p values cannot be merged directly — you must first downgrade the higher-precision sketch to the lower one. The downgrade is a folding operation that loses information; it is one-way. Pick p once, document it in the schema, and be wary of changing it later. The teams that change p mid-life either accept a discontinuity in their historical numbers or pay the cost of replaying every event from the source-of-truth log.
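The fold itself is mechanical — the demoted index bits rejoin the data word. A sketch of the downgrade under the conventions of the HLL class above (top-p-bits index, byte-per-bucket registers); treat it as illustrative, and verify against replayed data before trusting it:

def fold(src: HLL, target_p: int) -> HLL:
    """Fold a higher-precision sketch down to target_p. One-way: information is lost."""
    assert src.p >= target_p
    extra = src.p - target_p                   # index bits demoted back into the data word
    out = HLL(target_p)
    for i, r in enumerate(src.buckets):
        if r == 0:
            continue                           # empty bucket: nothing ever hashed here
        new_idx = i >> extra                   # keep the top target_p index bits
        demoted = i & ((1 << extra) - 1)       # bits that are now leading data bits
        if demoted:
            new_rho = extra - demoted.bit_length() + 1  # a 1 among them caps the zero run
        else:
            new_rho = r + extra                # all-zero demoted bits extend the zero run
        out.buckets[new_idx] = max(out.buckets[new_idx], new_rho)
    return out

# fold(a_p14_sketch, 12) then merges cleanly with native p=12 sketches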

Redis adds a wrinkle: its HLL implementation uses a dense encoding (12 KB) when the cardinality is high but a sparse encoding (much smaller) when the cardinality is low — for a key counting fewer than ~3000 distinct viewers, the actual memory may be 200 bytes. This is why a MEMORY USAGE against a fresh HLL key reports a small number that grows over time.
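The growth is easy to watch with redis-py — a sketch assuming a local Redis and a hypothetical key name:

import redis

r = redis.Redis()
r.delete("demo:hll")
for n in (10, 1_000, 100_000):
    for i in range(n):
        r.pfadd("demo:hll", f"id_{i}")
    print(f"{n:>7,} adds -> {r.memory_usage('demo:hll'):>6,} bytes")
    # starts in the hundreds of bytes (sparse), converges to ~12 KB (dense)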

A second practical wrinkle: union merges do not degrade accuracy, but arithmetic on estimates does. A bucket-wise-max merge of k sketches produces exactly the sketch you would have built from a single pass over the union stream, so the merged estimate keeps its 1.04/sqrt(m) standard error no matter how many sketches you merge or how much their input populations overlap — Hotstar's per-stream merge across 200 pods and across overlapping per-minute windows is equally safe. What does compound is inclusion-exclusion: deriving an intersection or difference by adding and subtracting union estimates (as in Riya's "Q4 but not Q3" count) combines the absolute errors of each term, and when the true intersection is small relative to the unions, the relative error of the derived number can blow up. Merge freely; treat derived set arithmetic as rough.

The third wrinkle is bias on adversarial inputs. A real attacker who can choose inputs (e.g., crafting viewer IDs that all hash to the same bucket) can drive HLL's estimate arbitrarily wrong. The mitigation is keyed hashing: pick a server-side secret and hash (secret, viewer_id) instead of viewer_id alone. The cost is one extra concatenation per add(); the benefit is that an attacker without the secret cannot find collision-inducing inputs. For internal observability data, where inputs are not attacker-controlled, the unkeyed hash is fine; for any HLL fed by user-controlled values, key it.
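With xxHash the secret can ride in the hash's seed parameter rather than a string concatenation — a minimal sketch on top of the HLL class above, with SECRET standing in for however you actually manage server-side keys:

import os

SECRET = int.from_bytes(os.urandom(8), "big")   # 64-bit server-side key

class KeyedHLL(HLL):
    def add(self, item) -> None:
        # Same update path as HLL.add, but seeded: without SECRET an attacker
        # cannot precompute inputs that collide into one bucket or fake long zero runs.
        # Sketches are only mergeable if they share the same SECRET.
        h = xxhash.xxh64_intdigest(str(item).encode(), seed=SECRET)
        idx = h >> (64 - self.p)
        w = h & ((1 << (64 - self.p)) - 1)
        rho = (64 - self.p) - w.bit_length() + 1 if w else (64 - self.p) + 1
        if rho > self.buckets[idx]:
            self.buckets[idx] = rho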

Common confusions

HLL answers "how many distinct values?", not "have I seen this value?" — membership is a Bloom/Cuckoo-filter question, and nothing in the sketch can confirm the presence of one element. Merging is union-only: there is no subtract operation, and intersections or differences come only from inclusion-exclusion, with the compounded-error caveat above. The advertised error is a standard error, not a hard bound — roughly a third of estimates land outside ±1σ, so a single p=12 reading 2% off is expected behaviour, not a bug. And adding the same element twice never changes the sketch — add() is idempotent, which is why replays and at-least-once delivery pipelines are safe.

Going deeper

The variance proof — where 1.04/sqrt(m) comes from

The Flajolet 2007 paper derives the variance of the harmonic-mean estimator under the model that hash bits are independent uniform Bernoulli. The per-bucket maximum rho concentrates around log2 of the bucket's local count, but the naive per-bucket estimator 2^rho is so heavy-tailed that its expectation diverges — which is why neither a single bucket nor an arithmetic mean across buckets yields a usable estimate.

The harmonic mean tames that tail (a huge 2^rho contributes a near-zero 2^-rho term to z), and averaging across m buckets cuts the variance of the combined estimator by a factor of roughly m; the bias correction alpha_m removes the harmonic mean's m-dependent multiplicative bias. Together they give the standard error 1.04/sqrt(m).

Why this matters in practice: the constant 1.04 is essentially the floor for estimators built on this register type — HyperLogLog++ (Heule et al., Google 2013) does not improve the asymptotic error, only the small-range bias correction and the encoding density. New algorithms claiming "better than HLL" usually beat it on a different axis (intersection support, weighted inputs, richer set operations), not on the error-vs-memory frontier.

HyperLogLog++ — Google's production tweak

HLL++ (used in BigQuery's APPROX_COUNT_DISTINCT and Druid's modern HLLSketch) refines the original in three ways. 64-bit hashing instead of 32-bit (the original Flajolet paper assumed 32-bit hashes; at billions of elements the 32-bit space exhibits collisions that bias the estimate). Empirical bias correction — Google measured the small-range bias on real datasets and replaced the 5/2 * m switch-over with a smooth interpolation. Sparse encoding — at low cardinality, store (idx, rho) pairs in a sorted list instead of the full bucket array; flip to dense when the sparse representation exceeds the dense size. Sparse encoding is what lets a Redis HLL key occupy <1 KB until it sees thousands of distinct values.
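A toy illustration of the sparse-to-dense flip — not Redis's actual encoding (which packs 6-bit registers and compresses the sparse form), just the shape of the idea on the add() logic from earlier:

class SparseHLL:
    def __init__(self, p: int = 12):
        self.p, self.m = p, 1 << p
        self.sparse = {}                        # idx -> rho_max while cardinality is low
        self.dense = None                       # bytearray once sparse stops paying off

    def add(self, item) -> None:
        h = xxhash.xxh64_intdigest(str(item).encode())
        idx = h >> (64 - self.p)
        w = h & ((1 << (64 - self.p)) - 1)
        rho = (64 - self.p) - w.bit_length() + 1 if w else (64 - self.p) + 1
        if self.dense is not None:
            self.dense[idx] = max(self.dense[idx], rho)
            return
        self.sparse[idx] = max(self.sparse.get(idx, 0), rho)
        if len(self.sparse) * 8 > self.m:       # rough cost model: ~8 bytes per sparse pair
            self.dense = bytearray(self.m)      # flip to dense, never back
            for i, r in self.sparse.items():
                self.dense[i] = r
            self.sparse = None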

Hotstar's IPL playbook — how the live count actually gets to the dashboard

Hotstar's setup runs HLL sketches at three layers: per-pod (one HLL per ingest pod, holding the unique viewers seen by that pod in the last minute), per-region (merged HLLs from pods in the same AWS region, refreshed every 5 seconds), and per-stream (merged across regions, the number that hits the Grafana panel). A late-arriving event from a pod that had a network blip can be merged into the per-region HLL minutes after the original window — the merge is associative and idempotent, so late arrivals do not re-double-count. The pre-IPL load test in 2024 ran 28M events/sec for 90 minutes and validated that the per-stream HLL count tracked the ground-truth viewer count (recovered by replay from the source-of-truth event log) within 1.7% — within the design error margin.

Mergeability is the structural property; storage is the side effect

The lesson many teams take from HLL is "it saves memory". The deeper lesson is mergeability: HLL's bucket-wise-max merge is associative, commutative, and idempotent. Associativity means (A ∪ B) ∪ C = A ∪ (B ∪ C) — partial sums in any order produce the same result. Commutativity means order does not matter; you can ingest events out of order. Idempotency means re-merging a sketch you already merged does not double-count.

These three properties are what algebraists call a bounded join-semilattice, and they are exactly the properties the CRDT literature requires for conflict-free merge across distributed replicas. HLL is in fact a CRDT — specifically a G-Counter-style state-based CRDT for set cardinality. The same maths that lets two Riak replicas reconcile a counter without coordination is what lets two Hotstar regions merge their viewer counts without a central coordinator.

This is the property that lets you build late-arriving-data tolerant distinct counters. An event that arrives 30 minutes late, due to a Kafka consumer lag spike, can be merged into the relevant minute's HLL whenever it shows up — the answer for that minute updates, but the answers for surrounding minutes are unaffected, and the answer for the daily total stays consistent. Exact-set counting can do this too, but only by storing every ID; HLL does it for 4 KB per minute.
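The three laws are cheap to property-test on the class above — max is associative, commutative, and idempotent byte-wise, so the assertions hold exactly, not just statistically:

import random

def rand_hll(seed: int) -> HLL:
    rng, h = random.Random(seed), HLL(p=12)
    for _ in range(rng.randint(1, 50_000)):
        h.add(rng.getrandbits(40))
    return h

a, b, c = rand_hll(1), rand_hll(2), rand_hll(3)
assert a.merge(b).buckets == b.merge(a).buckets                    # commutative
assert a.merge(b).merge(c).buckets == a.merge(b.merge(c)).buckets  # associative
assert a.merge(a).buckets == a.buckets                             # idempotent: no double-count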

For a deeper treatment of why mergeability matters across the wider sketching toolkit (Theta sketches add intersection support; KLL gives mergeable quantiles; HLL++ adds bias-corrected merge), see the DataSketches library docs in the references — the engineering culture there treats mergeability as the primary design constraint and storage as a downstream consequence.

When HLL is the wrong tool

Three places to pick something else. Set membership ("has this user voted?") wants a Bloom filter or a Cuckoo filter, which support O(1) membership tests with bounded false-positive rate. Top-K ("what are the 10 most-active merchants?") wants Misra-Gries or Count-Min Sketch, which track frequency rather than just presence. Quantiles ("what was the p99 latency?") wants HdrHistogram, KLL, or t-digest — quantile estimators with their own error/memory tradeoff. The taxonomy is worth memorising: cardinality → HLL, membership → Bloom, frequency → CMS, quantile → HdrHistogram. Reach for the right sketch; the wrong sketch gives the wrong answer with confidence.

Reproducibility footer

python3 -m venv .venv && source .venv/bin/activate
pip install xxhash redis
python3 hll.py    # prints estimate vs true count; error lands within the ~1.6% standard error
# Optional: compare against Redis PFCOUNT
docker run -d -p 6379:6379 redis:7
redis-cli <<'EOF'
PFADD viewers:ipl:final viewer_1 viewer_2 viewer_3 viewer_4 viewer_5
PFCOUNT viewers:ipl:final
EOF

Where this leads next

HLL is the cardinality entry in a wider sketching toolkit; the next chapter — Count-Min Sketch for frequency estimation — handles the related "how often does each value appear?" question with the same structural move: trade a bounded error for fixed memory. After that, HdrHistogram for latency distributions picks up the "p99 of a billion samples in 8 KB" problem, and t-digest for quantile merging explains why HLL's mergeability lesson generalises to other sketches.

The single insight a senior reader takes away: distinct counting is a sketching problem, not a storage problem. Teams whose plan is "scale up the broker until count(distinct) finishes" are running an unbounded race against their own growth. Teams that hash, count leading zeros, and merge sketches have a fixed-cost answer that holds whether the audience is 1 million or 1 billion. The shift from "store and count" to "hash and sketch" is the algorithmic move that made web-scale observability possible — Druid, BigQuery, Redis, Snowflake, and every modern OLAP engine ship HLL because the alternative does not exist at the scale these systems target.

The closing reframing: HyperLogLog is the answer to the question Karan started this chapter with — how many unique viewers are on the stream right now? — not because it counts faster, but because it changes the question.

Instead of asking "how many", it asks "how many leading zeros did the most-zero-prefixed hash have?" — per bucket — and from those 4096 small numbers recovers the count to within 1.6%. Every sketching algorithm has the same shape: replace an O(N) measurement with an O(1) summary that forgets the right things.

HLL forgets which viewers were unique and remembers only how unique the most-unique one was. That trade is the one that lets the IPL final dashboard render in 80 ms.

References