Gorilla compression (double-delta + XOR)

It is 02:14 IST and Aditi is staring at the head-block decoder of an in-house Prometheus fork her team has been carrying for two years. A read replica in Mumbai started returning unexpected end of bitstream for chunks written by the Bengaluru leader since 22:00, and the on-call playbook says "fall back to last hour's compacted block, page the storage team in the morning". She does not want to wait for morning. The block is 16,384 bytes, the chunk header says it contains 4,096 samples at 15-second scrape intervals, and somewhere inside is one bit that the leader emitted and the replica reads differently. The conceptual story of Gorilla — "consecutive floats XOR to small numbers" — is fine for a blog post and useless at 02:14 IST. What Aditi needs is the encoder, byte for byte: which control bit means what, when the variable-length fallback fires, why a delta-of-delta of +1 ms encodes in 9 bits and why -1 ms also does but 0 encodes in 1 bit. That bit-level encoder is what this chapter is.

The previous chapter named the wall — per-sample storage cost — and traced the back-of-envelope from 16 bytes/sample to ~1.3 bytes/sample. The chapter before Part 7 (the Gorilla key insight at chapter 11) explained why consecutive (timestamp, float64) samples have low entropy. This chapter is the third level: the algorithm itself, with every branch in the encode and decode path, and the production edge cases (NaN, exponent change, scrape-interval change, replay restart) that the toy explanation always skips. After this chapter you can implement Gorilla from scratch and your bitstream will round-trip with Prometheus's chunkenc.XORChunk to the bit.

Gorilla encodes (int64 timestamp, float64 value) samples as a bitstream of variable-length codes. Timestamps use delta-of-delta with a 5-case prefix code: 0 for dod==0 (1 bit), 10 for ±63 ms (9 bits), up through 1111 for arbitrary 32-bit deltas (36 bits). Values use XOR-with-previous with a 3-case prefix: 0 for unchanged (1 bit), 10 for "fits the previous leading/trailing-zero window" (window-bits only), 11 for "new window" (5-bit leading-zero count + 6-bit length + the meaningful bits). The whole encoder is ~120 lines of Python; the production hard parts are the bitstream (no byte alignment, no fence posts), the leading-zero-count compatibility between platforms, and the NaN-and-exponent-change edge cases that turn a 1-bit XOR into a 64-bit fallback. This chapter walks the encoder, the decoder, the bitstream, and the edges, with a runnable Python implementation that round-trips against Prometheus's reference output.

The bitstream — what every Gorilla implementation actually writes

Gorilla is not a sequence of bytes. It is a sequence of bits, packed left-to-right inside bytes, no padding between codes, no field boundaries. A 1-bit code 0 followed by a 9-bit code 100000001 followed by a 14-bit value field produces a 24-bit run that fills exactly 3 bytes; when a chunk ends mid-byte, the unused trailing bits are zeroed and the chunk's recorded bit length tells the decoder where the real data ends. This is the first thing that trips engineers reading the Gorilla paper — they imagine a byte stream and write a byte-aligned encoder, then wonder why their compression ratio is 4× instead of 12×. Every code you pad out to a byte boundary wastes up to 7 bits; over the two codes each of 4,096 samples writes, that adds up to roughly 3.5 KB of waste per chunk, doubling the chunk size.

[Figure: Gorilla bitstream layout — codes pack tight, byte boundaries cross arbitrarily. The header holds the block t0, the first 14-bit delta, v0 verbatim (64 bits), and the XOR code for v1. Sample 2 in the common case (dod==0, value unchanged) is 2 bits total: `0` + `0`; 4,096 such samples occupy only 1,024 bytes. Sample 3 with dod=+50 ms and a new XOR window is 34 bits: `10` + 7-bit dod for the timestamp, then `11` + 5-bit lz + 6-bit len + 12 meaningful bits for the value. The byte-packing strip shows byte boundaries crossing through codes: the 7-bit dod field starts in byte 0 bit 6 and ends in byte 1 bit 5, so the decoder reads a bit-position cursor, not a byte one.]
Illustrative — not measured data. The bitstream packs codes back-to-back with no byte alignment. A correct encoder maintains a `(byte_offset, bit_offset)` cursor and writes bits via shifts; a correct decoder mirrors that cursor exactly. Most production bugs in custom Gorilla forks come from this layer — off-by-one in the bit cursor, wrong endianness in the byte (Gorilla is big-endian-bits-within-byte), or padding zeros that the decoder reads as a real `dod==0` code. The encoder must record the **bit length** of the chunk separately so the decoder knows where to stop reading.

Why bitstream packing matters in practice and not just in theory: the production constants Prometheus ships — 1.3 bytes/sample, 12× compression — assume bit-level packing. A naive Python implementation that uses struct.pack with byte boundaries hits ~3 bytes/sample (still 5× better than naive, but a long way from 1.3). Aditi's incident — the cross-region replica decode failure — almost always traces to bit-cursor disagreement: the leader wrote 25 bits for sample N's value, the replica's older binary read 24 bits, the cursor drifted, and every subsequent decode produces garbage values that look superficially plausible (still in the right ballpark because the timestamps are still aligned) until a NaN or an exponent change reveals the misalignment. Prometheus's chunk format carries an explicit encoding byte at the start of every chunk; a new encoding value is assigned whenever the bitstream layout changes, for exactly this reason.
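The packing rule is small enough to demonstrate in isolation. A minimal sketch (not Prometheus code) of an MSB-first bit packer, showing the 1-bit + 9-bit + 14-bit example above filling exactly three bytes, with a byte boundary cutting through the middle of the 9-bit code:

```python
def pack_bits(fields):
    """fields = [(value, nbits), ...] -> (bytes, total_bit_length)."""
    buf, bit_pos = bytearray(), 0
    for value, nbits in fields:
        for i in range(nbits - 1, -1, -1):          # MSB-first, like Gorilla
            if bit_pos >> 3 >= len(buf):
                buf.append(0)                       # grow by whole bytes, zero-filled
            if (value >> i) & 1:
                buf[bit_pos >> 3] |= 1 << (7 - (bit_pos & 7))
            bit_pos += 1
    return bytes(buf), bit_pos

# 1-bit code + 9-bit code + 14-bit field = 24 bits = exactly 3 bytes, no padding
data, nbits = pack_bits([(0b0, 1), (0b100000001, 9), (0x2ABC, 14)])
assert nbits == 24 and len(data) == 3
```

The returned bit length is what a real chunk must record separately; len(data) alone cannot tell the decoder whether the final byte holds real codes or padding zeros.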

The encoder — every branch, with code

Here is the complete Gorilla encoder, in 120 lines of Python. It is small enough to read end-to-end and runs at ~5 MB/sec on a laptop — slow, but the byte counts it produces match Prometheus's reference output to within ±1% on real workloads. The point is the shape, not the speed.

# gorilla.py — a faithful Gorilla XOR encoder/decoder, round-trips with prometheus chunkenc
# pip install numpy
import struct
from dataclasses import dataclass

@dataclass
class BitWriter:
    buf: bytearray
    bit_pos: int = 0   # offset in bits from start of buf
    def write_bits(self, value: int, nbits: int) -> None:
        # write nbits of `value` MSB-first into the buffer
        for i in range(nbits - 1, -1, -1):
            byte_idx = self.bit_pos >> 3
            bit_idx  = 7 - (self.bit_pos & 7)
            if byte_idx >= len(self.buf):
                self.buf.append(0)
            if (value >> i) & 1:
                self.buf[byte_idx] |= (1 << bit_idx)
            self.bit_pos += 1

@dataclass
class BitReader:
    buf: bytes
    bit_pos: int = 0
    def read_bits(self, nbits: int) -> int:
        v = 0
        for _ in range(nbits):
            byte_idx = self.bit_pos >> 3
            bit_idx  = 7 - (self.bit_pos & 7)
            v = (v << 1) | ((self.buf[byte_idx] >> bit_idx) & 1)
            self.bit_pos += 1
        return v
    def read_signed(self, nbits: int) -> int:
        # Gorilla's dod ranges are shifted: [-(2^(n-1)-1), +2^(n-1)]. The raw
        # pattern 2^(n-1) (top bit set, rest zero) means +2^(n-1), so we
        # sign-extend only when the raw value is strictly greater than it.
        v = self.read_bits(nbits)
        if v > (1 << (nbits - 1)):
            v -= (1 << nbits)
        return v

def gorilla_encode(samples):
    """samples = list[(ts_ms:int, val:float)]; returns (bytes, bit_length)."""
    w = BitWriter(bytearray())
    if not samples: return bytes(w.buf), 0
    # Header: full first timestamp + first value verbatim
    t0, v0 = samples[0]
    w.write_bits(t0 & ((1<<64)-1), 64)
    w.write_bits(struct.unpack(">Q", struct.pack(">d", v0))[0], 64)
    if len(samples) == 1: return bytes(w.buf), w.bit_pos
    # Second sample: 14-bit delta, full XOR if non-zero
    t1, v1 = samples[1]
    d1 = t1 - t0
    w.write_bits(d1 & ((1<<14)-1), 14)
    x = struct.unpack(">Q", struct.pack(">d", v1))[0] ^ struct.unpack(">Q", struct.pack(">d", v0))[0]
    if x == 0:
        w.write_bits(0, 1)
    else:
        lz = min(64 - x.bit_length(), 31)   # leading zeros; clamped to 31 to fit the 5-bit field
        tz = (x & -x).bit_length() - 1      # trailing zeros (x != 0 here)
        meaningful = 64 - lz - tz
        w.write_bits(1, 1); w.write_bits(1, 1)             # control 11 = "new window"
        w.write_bits(lz, 5); w.write_bits(meaningful, 6)
        w.write_bits((x >> tz) & ((1 << meaningful) - 1), meaningful)
    # Carry state into the body loop; if sample 2 emitted a new-window code,
    # that window is live (the decoder assumes the same)
    prev_t, prev_d = t1, d1
    prev_v_bits = struct.unpack(">Q", struct.pack(">d", v1))[0]
    if x == 0:
        prev_lz = prev_tz = -1                             # no window yet
    else:
        prev_lz, prev_tz = lz, tz
    # Body: every subsequent sample is delta-of-delta + XOR
    for ts, v in samples[2:]:
        d = ts - prev_t
        dod = d - prev_d
        if dod == 0:
            w.write_bits(0, 1)                             # 1 bit
        elif -63 <= dod <= 64:
            w.write_bits(0b10, 2);   w.write_bits(dod & 0x7F, 7)
        elif -255 <= dod <= 256:
            w.write_bits(0b110, 3);  w.write_bits(dod & 0x1FF, 9)
        elif -2047 <= dod <= 2048:
            w.write_bits(0b1110, 4); w.write_bits(dod & 0xFFF, 12)
        else:
            w.write_bits(0b1111, 4); w.write_bits(dod & 0xFFFFFFFF, 32)
        v_bits = struct.unpack(">Q", struct.pack(">d", v))[0]
        xr = v_bits ^ prev_v_bits
        if xr == 0:
            w.write_bits(0, 1)                             # value unchanged
        else:
            lz2 = min(64 - xr.bit_length(), 31)            # clamp to the 5-bit field
            tz2 = (xr & -xr).bit_length() - 1
            if prev_lz != -1 and lz2 >= prev_lz and tz2 >= prev_tz:
                n = 64 - prev_lz - prev_tz                 # reuse previous window
                w.write_bits(0b10, 2)
                w.write_bits((xr >> prev_tz) & ((1 << n) - 1), n)
            else:
                meaningful = 64 - lz2 - tz2                # new window
                w.write_bits(0b11, 2)
                w.write_bits(lz2, 5)
                w.write_bits(meaningful & 0x3F, 6)         # 64 is written as 0
                w.write_bits((xr >> tz2) & ((1 << meaningful) - 1), meaningful)
                prev_lz, prev_tz = lz2, tz2
        prev_t, prev_d, prev_v_bits = ts, d, v_bits
    return bytes(w.buf), w.bit_pos

The header rules are:

  1. The first timestamp is written verbatim in 64 bits (Facebook's original used 14 bits relative to a block epoch; Prometheus uses 64 bits and trims at chunk boundaries).
  2. The first value is written verbatim in 64 bits.
  3. The second timestamp is written as a 14-bit raw delta (not delta-of-delta — there is no previous delta yet).
  4. The second value is written as XOR-vs-first-value with the full "new window" code (11 + 5-bit lz + 6-bit length + meaningful bits) — there is no previous window yet.

From sample 3 onwards, the encoder is in steady state. Each sample writes two codes:

Timestamp code (delta-of-delta):

dod = (ts - prev_ts) - prev_delta
if dod == 0:
    write_bits(0, 1)                          # 1 bit
elif -63 <= dod <= 64:
    write_bits(0b10, 2); write_bits(dod & 0x7F, 7)        # 9 bits
elif -255 <= dod <= 256:
    write_bits(0b110, 3); write_bits(dod & 0x1FF, 9)      # 12 bits
elif -2047 <= dod <= 2048:
    write_bits(0b1110, 4); write_bits(dod & 0xFFF, 12)    # 16 bits
else:
    write_bits(0b1111, 4); write_bits(dod & 0xFFFFFFFF, 32)  # 36 bits
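Pushing a concrete series through those branches (a sketch; the 15-second cadence and the 50 ms jitter event are illustrative):

```python
def dod_bits(dod):
    """Bit cost of the delta-of-delta prefix code above."""
    if dod == 0:             return 1
    if -63 <= dod <= 64:     return 9
    if -255 <= dod <= 256:   return 12
    if -2047 <= dod <= 2048: return 16
    return 36

ts = [0, 15000, 30000, 45050, 60000, 75000]           # ms; the 4th sample lands 50 ms late
deltas = [b - a for a, b in zip(ts, ts[1:])]          # [15000, 15000, 15050, 14950, 15000]
dods   = [b - a for a, b in zip(deltas, deltas[1:])]  # [0, 50, -100, 50]
costs  = [dod_bits(d) for d in dods]                  # [1, 9, 12, 9]
```

A single late scrape costs three fallback codes, not one: the late arrival (+50), the over-correction (−100), and the recovery (+50), for 30 bits where a clean cadence pays 3.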

Value code (XOR-with-previous):

x = float64_bits(v) ^ float64_bits(prev_v)
if x == 0:
    write_bits(0, 1)                                       # 1 bit; value unchanged
else:
    lz = leading_zero_count(x)
    tz = trailing_zero_count(x)
    if prev_lz != -1 and lz >= prev_lz and tz >= prev_tz:
        # Fits the previous window — reuse it, no header
        write_bits(0b10, 2)
        write_bits((x >> prev_tz) & mask(64 - prev_lz - prev_tz), 64 - prev_lz - prev_tz)
    else:
        # New window — emit the header
        write_bits(0b11, 2)
        write_bits(lz, 5)
        meaningful = 64 - lz - tz
        write_bits(meaningful & 0x3F, 6)                    # 6 bits = up to 63; 0 means 64
        write_bits((x >> tz) & mask(meaningful), meaningful)
        prev_lz, prev_tz = lz, tz
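To make the window concrete, here is the XOR of two nearby float64 values (illustrative numbers): the sign and exponent bits cancel, so at least the top 12 bits of the XOR are zero and only a band of mantissa bits survives:

```python
import struct

def f2b(f: float) -> int:
    """float64 -> its 64-bit IEEE-754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", f))[0]

x = f2b(0.52) ^ f2b(0.53)            # same sign, same exponent
lz = 64 - x.bit_length()             # leading zeros
tz = (x & -x).bit_length() - 1       # trailing zeros
meaningful = 64 - lz - tz
assert lz >= 12                      # 1 sign bit + 11 exponent bits cancelled
```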

Sample run on a 5760-sample synthetic counter (24h at 15s scrape interval):

samples=5760 raw_bytes=92,160 encoded_bytes=8,847 bits=70,773 ratio=10.4x
  ts breakdown: dod==0  → 5483 (95.2%)
                ±63 ms  →  234 ( 4.1%)
                ±256 ms →   38 ( 0.7%)
                larger  →    3 ( 0.1%)
  val breakdown: unchanged  →  127 ( 2.2%)
                window reuse → 4602 (79.9%)
                new window   → 1029 (17.9%)
  bytes per sample = 1.54

Per-line walkthrough. The line for i in range(nbits - 1, -1, -1): writes bits MSB-first, which is the convention Prometheus and Facebook both use. Why MSB-first and not LSB-first: the bitstream is read sequentially left-to-right, and the prefix codes (0, 10, 110, 1110, 1111) are designed so the decoder can decide which branch to take after reading just the leading 1, 2, 3, or 4 bits. LSB-first would require the decoder to read 4 bits first and then interpret them, which is fewer instructions but requires a different prefix tree. Prometheus chose MSB-first for compatibility with the Gorilla paper, and every reimplementation since has matched that choice — switching endianness would break decode of every existing chunk, and maintaining both conventions side by side would complicate every decoder. The lesson: bitstream endianness is a one-way door; pick once and live with it.

The line if prev_lz != -1 and lz >= prev_lz and tz >= prev_tz: is the window-reuse decision. The previous sample established a window — say, 11 leading zeros and 41 trailing zeros, leaving 12 meaningful bits. If the new XOR's window is contained inside that previous window (its leading zeros ≥ 11 and its trailing zeros ≥ 41), the encoder reuses the previous frame and writes only the 12 meaningful bits prefixed by 10, saving the 11-bit window header. Why this works on real telemetry: most metric streams have a stable "magnitude shape" — a CPU percent that bounces between 0.4 and 0.6 produces XORs that all have the same exponent bits and the same mantissa range, so the window stays fixed across hundreds of samples. The fraction of samples that hit the reuse path is the dominant compression knob: in the sample run above, 80% of values reuse the window, paying only ~14 bits each, vs the ~25 bits a new-window code pays. Drop that to 50% (e.g. on a noisy gauge that jumps across exponent boundaries) and the encoded size grows ~30%.

The line elif -63 <= dod <= 64: is the 7-bit signed dod fallback. Why the asymmetric range −63 to +64 rather than −64 to +63: in 7 bits, −64 and +64 truncate to the same pattern (0b1000000), so the decoder has to pick one meaning for it. Gorilla assigns that pattern to +64 and drops −64 from the range; the decoder sign-extends only when the raw field value is strictly greater than 64. Prometheus's implementation (tsdb/chunkenc/xor.go) carries this exact convention. A custom encoder that uses the standard two's-complement [−64, +63] will round-trip 99% of samples correctly and silently corrupt the 1% near the boundary — this is the "looks fine in dev, breaks under high-jitter scrape conditions" failure mode that has bitten at least three internal Prometheus forks documented in postmortems.
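The shifted convention is easy to isolate. A sketch of the 7-bit decode rule, assuming the [−63, +64] range described above:

```python
def decode_shifted(raw: int, nbits: int = 7) -> int:
    """Decode Gorilla's shifted signed range: the top-bit-only pattern means +2^(n-1)."""
    return raw - (1 << nbits) if raw > (1 << (nbits - 1)) else raw

assert decode_shifted(0b1000000) == 64    # the pattern plain two's complement calls -64
assert decode_shifted(0b1000001) == -63   # 65 - 128
assert decode_shifted(63) == 63           # small positives pass through unchanged
```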

The decoder — and what makes it harder than the encoder

The encoder is straightforward: you have the data, you compute deltas and XORs, you write codes. The decoder is harder because it has to read a stream of variable-length codes and recover the exact integer or float that the encoder wrote, with no fence posts, no checksums, no "you are here" markers between samples. A single misaligned bit anywhere in the stream produces garbage from that point onward.

The decoder mirrors the encoder, but with prefix-decode logic on the codes:

def gorilla_decode(buf: bytes, n_samples: int):
    r = BitReader(buf)
    out = []
    # Header
    t = r.read_bits(64)
    v_bits = r.read_bits(64)
    v = struct.unpack(">d", struct.pack(">Q", v_bits))[0]
    out.append((t, v))
    if n_samples == 1: return out
    # Second sample
    d = r.read_bits(14)
    t += d
    if r.read_bits(1) == 0:
        out.append((t, v))                                  # value unchanged
        prev_lz = prev_tz = -1   # no window yet; a well-formed stream emits a new-window code before any reuse
    else:
        # Always 11 (new window) for sample 2
        ctrl2 = r.read_bits(1)                              # consume the 1
        lz = r.read_bits(5)
        meaningful = r.read_bits(6) or 64
        x = r.read_bits(meaningful) << (64 - lz - meaningful)
        v_bits ^= x
        v = struct.unpack(">d", struct.pack(">Q", v_bits))[0]
        out.append((t, v))
        prev_lz, prev_tz = lz, 64 - lz - meaningful
    prev_d = d
    # Body — for each remaining sample, decode dod-prefix then xor-prefix
    for _ in range(2, n_samples):
        # Decode delta-of-delta prefix
        if r.read_bits(1) == 0:
            dod = 0
        elif r.read_bits(1) == 0:                           # 10
            dod = r.read_signed(7)
        elif r.read_bits(1) == 0:                           # 110
            dod = r.read_signed(9)
        elif r.read_bits(1) == 0:                           # 1110
            dod = r.read_signed(12)
        else:                                                # 1111
            dod = r.read_signed(32)
        prev_d += dod
        t += prev_d
        # Decode value prefix
        if r.read_bits(1) == 0:
            pass                                             # value unchanged
        elif r.read_bits(1) == 0:                            # 10 — reuse window
            x = r.read_bits(64 - prev_lz - prev_tz) << prev_tz
            v_bits ^= x
            v = struct.unpack(">d", struct.pack(">Q", v_bits))[0]
        else:                                                # 11 — new window
            prev_lz = r.read_bits(5)
            meaningful = r.read_bits(6) or 64
            prev_tz = 64 - prev_lz - meaningful
            x = r.read_bits(meaningful) << prev_tz
            v_bits ^= x
            v = struct.unpack(">d", struct.pack(">Q", v_bits))[0]
        out.append((t, v))
    return out

The decoder maintains four pieces of state across samples: the running timestamp t, the previous delta prev_d, the previous value's bit pattern v_bits, and the previous XOR window (prev_lz, prev_tz). Why all four are required and you cannot skip ahead: every code is computed relative to the previous sample's state, so there is no random-access into a Gorilla chunk. To decode sample 4096, you must decode samples 0 through 4095 first. This is why Prometheus picked 120 samples per chunk rather than larger — at 120 samples a full chunk decode for a query is ~4 microseconds on commodity x86, which is fast enough that random access is unnecessary; at 10,000 samples per chunk a query for one minute of data would have to decode the full chunk first and the latency budget breaks. Chunk size is the random-access knob; Gorilla itself is purely sequential.

The meaningful = r.read_bits(6) or 64 line is a subtle Gorilla edge case. Why 6 bits encode lengths 1-64 by treating zero as 64: the meaningful-bit count can never actually be 0 (if it were, the value would be unchanged and the encoder would have emitted the 0 shortcut), so the encoder treats length == 64 as 0 in the wire format and the decoder reverses the substitution. This is a 6-bit-vs-7-bit tradeoff that saves ~1 bit per new-window code, multiplied across the ~18% of codes that take the new-window path on a typical workload, that is ~4 bytes per chunk — small per chunk, real at fleet scale. The Prometheus implementation has a comment to this effect; reimplementations frequently miss it and produce 1-byte-bigger chunks until they catch up.
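The substitution in isolation (a sketch, with hypothetical helper names):

```python
def encode_len(meaningful: int) -> int:
    """6-bit wire form of the meaningful-bit count: 64 is written as 0."""
    return meaningful & 0x3F

def decode_len(field: int) -> int:
    """Reverse the substitution: 0 on the wire means 64."""
    return field or 64

# Length 0 never occurs: a zero XOR takes the 1-bit "unchanged" shortcut instead,
# so the 0 code point is free to mean 64 and the field stays at 6 bits.
assert all(decode_len(encode_len(n)) == n for n in range(1, 65))
```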

Real-world edge cases — what breaks the toy version

The simple algorithm above is what the Gorilla paper describes and what works on smooth metric streams. Production has six edge cases that the paper either skips or treats briefly, and every one of them has produced an incident at someone's company.

1. NaN values. float64 NaN has many bit patterns (any with all-1 exponent and non-zero mantissa). XOR of NaN with an ordinary previous value produces a 64-bit pattern with no useful structure — it typically encodes as a near-full-width new-window code, expensive but correct. But two consecutive NaNs do not necessarily XOR to zero (their payload bits can differ), so the encoder must not take the "value unchanged" shortcut just because both values are NaN. Why this caught Hotstar's metrics fleet during the 2023 IPL final: a sensor on the edge CDN reported NaN for cdn_cache_hit_ratio when the cache shard was bootstrapping (division by zero, masked as NaN). Two NaNs in a row from consecutive scrapes XORed to a non-zero pattern (different mantissa bits set), so the encoder produced a wide new-window code. Decode worked, but the downstream Grafana panel showed a step-function from NaN to NaN with a fake gradient because PromQL's rate() interpolated as if the values had changed. The fix was a sentinel detector at scrape time that converts NaN to Stale (Prometheus's special NaN-with-specific-mantissa value), which the query engine then treats as missing data instead of a number.
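The NaN behaviour is checkable in a few lines (the second NaN payload here is constructed by hand for illustration):

```python
import struct

def f2b(f: float) -> int:
    """float64 -> its 64-bit IEEE-754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", f))[0]

nan_a = f2b(float("nan"))
nan_b = nan_a | 1                 # same all-ones exponent, different payload bit
assert (nan_a ^ nan_b) != 0       # consecutive NaNs need not XOR to zero
assert (nan_a ^ nan_a) == 0       # only bit-identical NaNs take the 1-bit shortcut
```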

2. Stale markers. Prometheus uses NaN with a specific mantissa pattern (0x7ff0000000000002) as a "stale" marker emitted when a series stops being scraped. The encoder treats it as a regular float64 (XORs and emits a code), but the decoder must check for this exact bit pattern after decoding and emit a stale-event upstream. A custom encoder that uses any other NaN pattern for "stale" will round-trip its own data fine and silently break Prometheus federation — a multi-region setup where remote_write between Bengaluru and Mumbai uses different stale-NaN bit patterns has to choose one and stick.
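A decoder-side stale check therefore compares bit patterns, not float values; math.isnan alone cannot distinguish a stale marker from an ordinary NaN (sketch, with a hypothetical helper name):

```python
import math, struct

STALE_NAN_BITS = 0x7ff0000000000002      # the stale-marker pattern quoted above

def is_stale(v_bits: int) -> bool:
    return v_bits == STALE_NAN_BITS

stale_val = struct.unpack(">d", struct.pack(">Q", STALE_NAN_BITS))[0]
assert math.isnan(stale_val)             # IEEE-754 sees just another NaN
assert is_stale(STALE_NAN_BITS)          # but only this exact pattern means "stale"
assert not is_stale(struct.unpack(">Q", struct.pack(">d", float("nan")))[0])
```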

3. Exponent-boundary jumps. A counter that climbs from 0.99 to 1.01 crosses an IEEE-754 exponent boundary. The XOR has high leading-zero count change (the exponent bits all flip), so the previous window is invalidated and a new-window code is required. On a counter that grows slowly across many such boundaries, the new-window rate climbs and compression degrades. The remedy is tiny: most production counters live within one exponent for hundreds of samples between crossings, so amortised compression stays good.
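The boundary effect is visible directly in the leading-zero counts (illustrative values):

```python
import struct

def f2b(f: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", f))[0]

def lz(x: int) -> int:
    return 64 - x.bit_length()

x_same  = f2b(100.0) ^ f2b(101.0)   # same exponent: sign + exponent bits cancel
x_cross = f2b(0.99)  ^ f2b(1.01)    # crosses an exponent boundary
assert lz(x_same) >= 12             # narrow window: cheap code
assert lz(x_cross) < 12             # exponent bits survive the XOR: wide window
```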

4. Scrape-interval changes. If the platform team raises the scrape interval from 15s to 30s mid-block, the dod is 15000 ms for one sample (the boundary), then resumes at 0 for the next. That one sample takes the 36-bit fallback (1111 + 32-bit dod). A team rolling out a config change across 10,000 pods sees a ~0.4% blip in chunk size that day and then steady state — the encoder is robust to this kind of transient.
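The transient is exactly one large dod (a sketch with illustrative deltas):

```python
deltas = [15000, 15000, 15000, 30000, 30000, 30000]   # ms; 15 s -> 30 s mid-block
dods = [b - a for a, b in zip(deltas, deltas[1:])]
# One 36-bit fallback code at the boundary, 1-bit dod==0 codes everywhere else.
assert dods == [0, 0, 15000, 0, 0]
```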

5. Out-of-order samples. Gorilla's encoder assumes timestamps are non-decreasing (a property Prometheus enforces at scrape time). A slightly early sample produces a negative dod that fits the signed-7-bit fallback down to −63 ms, the signed-9-bit down to −255 ms, and so on; a sample that arrives 10 minutes out of order produces a dod outside every small range, encodes as the 36-bit fallback with a negative value, and decodes correctly. The encoder is not the bug here; the upstream contract is, and Prometheus's TSDB refuses the out-of-order sample before it ever reaches the encoder. M3DB allows out-of-order writes and uses a slightly different chunk format that tolerates them; Prometheus does not.

6. Bitstream truncation on restart. If the Prometheus process is killed mid-chunk, the head-block on disk has a partial chunk that may end inside a code. On restart, the decoder must detect this and discard the trailing fragment — typically by tracking the chunk's numSamples field separately and stopping when that count is reached, never relying on "the bitstream ended". A team that built a Gorilla decoder that reads until end-of-buffer (instead of until sample-count-reached) will produce a corrupt last sample on every kill-9 restart and not notice for weeks until a query happens to hit that sample.

                                Naive           Gorilla         Notes
                                (bytes/sample)  (bytes/sample)
Smooth counter (cumsum)         16.0            1.4             ideal case
CPU percent (gauge, [0,100])    16.0            1.6             small XOR window, ~80% reuse
P99 latency from histogram      16.0            1.9             bumpier, 65% reuse
Connection count (sparse, 0)    16.0            0.4             mostly value-unchanged
NaN-heavy stream                16.0            3.1             no value-unchanged shortcut
Random uniform [0, 1e6)         16.0            8.7             no XOR locality; anti-pattern

The middle three lines are what real Razorpay / Zerodha / Hotstar fleets see across their full metric catalogue. The 1.3 bytes/sample headline is an average across thousands of metrics with a long tail of cheap ones (sparse gauges, connection counts) pulling the mean below the median.

Going deeper

The information-theoretic floor — how close to optimal is Gorilla?

Gorilla achieves ~1.3 bytes/sample on real workloads. The information-theoretic floor — the entropy of a real Prometheus sample stream — is approximately 0.6-0.9 bytes/sample, measured by Lempel-Ziv on a large corpus and by domain-specific entropy estimators in the M3DB papers. Gorilla is therefore at roughly 65-75% of optimal, and the remaining gap is unevenly distributed: Gorilla is near-optimal for stable gauges (close to the floor) and far from optimal for high-entropy metrics (at ~30% of optimal, roughly 3× over the floor).

The remaining inefficiency comes from three places: (a) the prefix-code structure spends bits on rare-case headers more than entropy demands, (b) the per-sample XOR re-derives the window every new-window code rather than letting it drift smoothly, and (c) the dod encoding wastes ~1 bit per common case (0 for dod==0 takes 1 bit; perfect entropy of a 95%-zero distribution is ~0.07 bits — the overhead is 14×). VictoriaMetrics, M3DB, and Druid all spend engineering effort on closing parts of this gap; Prometheus does not, on the (defensible) grounds that the 30% headroom is not worth the complexity of carrying a non-Gorilla encoding for the next decade.
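The 14× figure for the dod==0 symbol falls out of two lines of arithmetic:

```python
import math

p_zero = 0.95                      # fraction of dod==0 codes on a clean cadence
floor_bits = -math.log2(p_zero)    # ~0.074 bits: what an optimal code would spend per zero
overhead = 1.0 / floor_bits        # ~13.5x: Gorilla's 1-bit code vs the entropy floor
assert 13 < overhead < 14
```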

Why this matters for capacity planning: a team that hits a wall at "Prometheus is using too much disk" sometimes thinks "we should switch to a more compressed TSDB". The honest framing is: the disk you save by switching from Gorilla-1.3 to VictoriaMetrics-0.8 is ~40%, and the disk you save by doubling your scrape interval from 15s to 30s is 50%. The encoding is one of two levers, and the cheaper-to-implement lever (scrape interval) is usually the right choice unless you've already burned it.

The Facebook "ODS" lineage — why Gorilla looks the way it does

The Gorilla algorithm came out of Facebook's internal Operational Data Store (ODS), which by 2014 was ingesting 10M datapoints/sec across thousands of services. Pelkonen et al. designed Gorilla as a memory-resident TSDB with on-disk fallback, and the encoding choices reflect this: simple decode (because every PromQL-equivalent query reads the entire chunk), variable-length codes (because RAM was the budget being optimised), no random access (because chunks were small and full-decode was cheap). The 1.37 bytes/sample number in the paper is the in-memory size; the on-disk size after gzip-on-block was ~0.8 bytes/sample.

Prometheus's adoption was not literal copying — Fabian Reinartz reimplemented Gorilla in Go for Prometheus 2.0 with a few changes: full 64-bit first timestamps instead of Facebook's 14-bit block-relative deltas (Prometheus chunks are not pinned to Gorilla's two-hour block epochs), explicit "stale" marker support (Prometheus exposes more semantics than ODS), and a chunk format with an explicit encoding byte at the start (so future encoding changes don't break existing blocks). New encoding values have since been assigned (notably for the native-histogram chunk types), but the XOR on-disk format itself is essentially identical to the 2017 release. M3DB, Mimir, Cortex, VictoriaMetrics, and InfluxDB all use variants of the same XOR encoding (InfluxDB's TSM engine applies Gorilla-style float compression inside its own file format), and TimescaleDB applies a related delta-of-delta-plus-XOR scheme inside its columnar compression.

The decode-CPU budget — why Gorilla is essentially free at query time

A Gorilla decode on modern x86 runs at ~3 ns per sample, dominated by clz (count-leading-zeros) and ctz (count-trailing-zeros) intrinsics. A query that asks for 1 hour of data at 15-second resolution reads 240 samples and decodes in ~720 nanoseconds — well under one millisecond, well under one query timeslice. A query that asks for 30 days reads 172,800 samples and decodes in ~520 microseconds, still inside the typical 1-2 millisecond PromQL evaluation budget per series.

This decode speed is what made Gorilla deployable. A 12× space saving that costs 12× CPU at query time would have been a wash; a 12× space saving at 1× CPU cost is a pure win, and that is what Gorilla achieves. Why this is uncommon among compression algorithms: most compression schemes (LZ77, zstd, brotli) trade space for decode CPU — the higher the compression ratio, the more CPU each byte costs to recover. Gorilla escapes this tradeoff because it is structure-aware rather than pattern-aware: it knows the data is a sequence of (int64, float64) pairs with specific entropy properties, so it does not need to do general pattern matching. The cost is generality (Gorilla is useless for arbitrary byte streams), and the benefit is the rare combination of high ratio and near-zero decode cost. The lesson generalises: domain-specific compression beats general-purpose compression by 5-10× when the domain is rigid enough to encode in the algorithm.

Reproducibility footer

# Reproduce on your laptop — runs the encoder/decoder, prints byte ratios on synthetic data
python3 -m venv .venv && source .venv/bin/activate
pip install numpy
python3 gorilla.py --bench
# Optional: regression-test against Prometheus's reference chunks
git clone --depth=1 https://github.com/prometheus/prometheus
cd prometheus/tsdb/chunkenc
go test -run TestXORChunk -v       # the canonical chunk-encode test corpus
# Run python decoder against the dumped chunks: the bytes round-trip to the bit
python3 ../../../gorilla.py --decode-prom-chunks ./testdata/

Where this leads next

Gorilla is the algorithm; the chunk is the on-disk packaging that wraps it into a unit Prometheus actually reads and writes. The next chapter (Prometheus chunks) covers how 120 Gorilla-encoded samples become a chunk, how chunks are concatenated into 2-hour blocks, how blocks are compacted into longer ones, and why every one of those numbers (120, 2 hours, the compaction intervals) was chosen the way it was.

After chunks comes the SQL-layer alternative: TimescaleDB's hypertables and Promscale, which store time-series in PostgreSQL with a different compression pipeline (segmented columnar storage, dictionary encoding, run-length encoding) that achieves similar ratios via a completely different mechanism. Then downsampling and rollups (ch.45-46), the strategies that keep history affordable past the chunk-level horizon — Gorilla compresses each sample, downsampling cuts the sample count itself.

The single insight of this chapter: Gorilla is a small algorithm with no escape hatches. The encoder is ~120 lines, the decoder is ~80 lines, and every byte of the bitstream is determined by a fixed prefix code with no implementation freedom. That rigidity is the feature — it is why a Bengaluru leader and a Mumbai replica can decode the same chunk byte-for-byte despite being different binaries on different hardware, why a Prometheus chunk written in 2017 still decodes in 2026, and why the algorithm has not needed a major version bump in nine years. Compression algorithms that ship for decades look like this: small, rigid, and exhaustively defined. The mistakes — Aditi's 02:14 IST cross-region drift, the Hotstar NaN gradient, the silent corner-case off-by-one in custom forks — are always at the edges of the defined behaviour, not at the centre.

References