Wire protocols: Protobuf, Thrift, Cap'n Proto, FlatBuffers

It is 22:14 IST and Aditi is reading a CPU flame graph from PaySetu's settlement service. Forty-one percent of CPU is in protobuf::internal::ParseFromArray — pure deserialisation, no business logic. The team's first instinct is to switch to JSON ("smaller payloads"); their second instinct is to switch to FlatBuffers ("zero-copy"). Neither instinct can be evaluated without understanding why Protobuf parsing eats 41% of CPU on a service that handles 80,000 settlements per second. The answer lives in the wire format — what bytes go over the network, what work the receiver must do to turn those bytes back into objects, and what trade the format made when it was designed.

A wire protocol decides three things: how compact your payload is, how much work the receiver does to read a field, and how schemas evolve as you change them. Protobuf and Thrift compress hard but force a parse-on-read model that costs CPU. Cap'n Proto and FlatBuffers fix the layout in memory and let the receiver index into bytes without parsing — at the cost of larger payloads and stricter schemas. Pick by where your bottleneck lives: bandwidth or CPU.

What a wire protocol actually is

A wire protocol is a function from a typed message to a byte sequence, plus its inverse. The typed message lives in your program — a Settlement { id: int64; amount: int64; merchant: string; status: enum }. The byte sequence is what crosses the network, gets logged to disk, or gets stored in a queue. Every choice the format makes is a trade between four properties: payload size (how many bytes per message), encode CPU (how many cycles to produce the bytes), decode CPU (how many cycles to read a field back), and schema evolution (how forgiving the format is when the sender and receiver disagree about what fields exist).

JSON sits at one extreme: human-readable, schema-less on the wire, every receiver re-parses every byte every time, every integer becomes ASCII digits. Memory-mapped formats sit at the other extreme: the on-disk bytes are the in-memory layout, you just point a pointer at them. The four formats this chapter covers occupy distinct points between those extremes.

[Figure: Wire-format trade-off space. X axis: payload size (bytes per message, smaller is better); Y axis: decode cost (CPU per field read, lower is better). Protobuf (varint, TLV) and Thrift (compact) sit in the small-payload, high-decode quadrant; Cap'n Proto (fixed layout, pointers) and FlatBuffers (vtable indirection) sit in the larger-payload, near-zero-decode quadrant; JSON occupies the worst corner, raw memcpy the best.]
Illustrative — qualitative position of each format in the size-vs-decode-cost plane. The right quadrant ("zero-copy") trades bytes for CPU; the left quadrant trades CPU for bytes.

Protobuf — varint, tag-length-value, and what it costs

Protobuf is the format Google open-sourced in 2008 after a decade of internal use. Its wire format is tag-length-value (TLV) with variable-length integers (varints). Each field is encoded as one or more bytes of (field_number, wire_type) followed by the value, where wire_type is one of VARINT, LENGTH_DELIMITED, FIXED32, FIXED64, START_GROUP, END_GROUP. Field numbers are baked into the schema: field 1 in your proto file is forever field 1 on the wire.
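
The tag itself is simple arithmetic. A minimal sketch — the constant names are for clarity here, not the protobuf library's API:

# tag(field_number, wire_type) = varint((field_number << 3) | wire_type).
# For field numbers 1-15 the tag fits in a single byte.
VARINT, FIXED64, LENGTH_DELIMITED, FIXED32 = 0, 1, 2, 5

def tag(field_number, wire_type):
    assert 1 <= field_number <= 15, "single-byte case only in this sketch"
    return bytes([(field_number << 3) | wire_type])

assert tag(1, VARINT) == b"\x08"             # field 1, varint
assert tag(3, LENGTH_DELIMITED) == b"\x1a"   # field 3, length-delimited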

A varint encodes an integer in 7-bit groups, with the top bit indicating "more bytes follow". Small numbers cost one byte, big numbers cost up to ten. So 1 is 0x01, 300 is 0xac 0x02, and 2^31 - 1 is five bytes. Why varints make sense for wire protocols: most integers in messages are small — IDs, counts, sequence numbers, indexes into enums. Encoding 5 as one byte instead of four (fixed int32) saves 75% on that field, and 50–80% across a typical message that is mostly small integers. The cost is decode: every varint requires a per-byte loop with a branch on the continuation bit, which the CPU branch-predicts poorly when the byte values are random.
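
To make the continuation-bit loop concrete, here is a minimal encoder/decoder pair (the same scheme the hand-coded example later in this chapter uses):

def varint_encode(n):
    out = bytearray()
    while n > 0x7f:
        out.append((n & 0x7f) | 0x80)   # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                        # final byte: continuation bit clear
    return bytes(out)

def varint_decode(buf, i=0):
    v = sh = 0
    while buf[i] & 0x80:                 # data-dependent branch — the decode cost
        v |= (buf[i] & 0x7f) << sh; sh += 7; i += 1
    return v | (buf[i] << sh), i + 1

assert varint_encode(1) == b"\x01"
assert varint_encode(300) == b"\xac\x02"     # 0b100101100 split into 7-bit groups
assert varint_decode(b"\xac\x02")[0] == 300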

The headline cost of Protobuf decode is the parse loop. For every field in the message, the receiver reads the tag varint, dispatches on wire type, reads the value, allocates an object (in Python, a fresh int, bytes, or message object), and assigns it to the destination struct field. For a deeply nested message with 40 fields, that is 40 allocations and 40 dispatched branches — and it runs again for every message received. PaySetu's flame graph (41% in ParseFromArray) shows what happens when a high-throughput service does this on every incoming message: the parse cost dominates the actual work.

What Protobuf gets in return for the parse cost is schema evolution. Adding a new field with a new field number is wire-compatible — old readers see the field's tag, don't recognise it, and skip it (each wire type has a defined "skip N bytes" rule). Removing a field is also safe if no code reads it. Renaming a field is free (only field numbers matter). Changing a field's type is a footgun — int32 → int64 is wire-compatible, int32 → string is not. The schema is checked at build time, not on the wire, so two services with subtly different proto files can talk and silently lose data.
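
What each wire type's skip rule looks like in practice — a minimal sketch of the mechanism, not the protobuf library's implementation:

def skip_field(buf, i, wire_type):
    """Return the index just past an unrecognised field's value."""
    if wire_type == 0:                     # VARINT: scan past continuation bytes
        while buf[i] & 0x80: i += 1
        return i + 1
    if wire_type == 1:                     # FIXED64: always 8 bytes
        return i + 8
    if wire_type == 2:                     # LENGTH_DELIMITED: the prefix says how far
        length = sh = 0
        while buf[i] & 0x80:
            length |= (buf[i] & 0x7f) << sh; sh += 7; i += 1
        length |= buf[i] << sh
        return i + 1 + length
    if wire_type == 5:                     # FIXED32: always 4 bytes
        return i + 4
    raise ValueError("deprecated group wire types not handled in this sketch")

# An old reader meeting an unknown length-delimited field skips it cleanly:
assert skip_field(b"\x03abcREST", 0, 2) == 4   # length prefix 3 + payload 'abc'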

Thrift — same idea, three encodings, different framing

Thrift was Facebook's parallel design to Protobuf, open-sourced in 2007. The architectural difference is that Thrift defines multiple wire encodings and lets you choose: TBinaryProtocol (fixed-width fields, simpler decode, larger payload), TCompactProtocol (varints + zigzag, similar size to Protobuf), TJSONProtocol (debug-friendly, slow). It also defines its own framing (TFramedTransport writes a 4-byte length prefix, like gRPC's framing) and its own RPC layer (Thrift services compile to a stub-and-skeleton pair, like gRPC).
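
TCompactProtocol's zigzag step is worth seeing once, because it is what keeps varints short for negative numbers — a plain varint would encode -1 as ten bytes. A minimal sketch (Protobuf's sint32/sint64 use the same trick):

def zigzag_encode(n, bits=64):
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3 ... small magnitudes stay small either sign
    return (n << 1) ^ (n >> (bits - 1))

def zigzag_decode(z):
    return (z >> 1) ^ -(z & 1)

assert zigzag_encode(0) == 0
assert zigzag_encode(-1) == 1
assert zigzag_encode(1) == 2
assert zigzag_decode(zigzag_encode(-150)) == -150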

The Thrift Compact Protocol is, in practice, a near-drop-in replacement for Protobuf at the wire level. Both use varints. Both use field numbers. Both encode lists with a length prefix. The differences are subtle: Thrift had true union types from the start (Protobuf added oneof later), Thrift's enums are schema-checked across language boundaries, and Thrift bundles its RPC layer into the core project, whereas Protobuf's .proto files can declare services but leave the RPC machinery to a separate project (gRPC).

For modern teams the choice between Protobuf and Thrift is mostly a tooling question. Protobuf has gRPC, Buf, evangelical Google support, and a more mature ecosystem in Go and Rust. Thrift has Apache governance, better Java tooling, and is the wire protocol behind Scribe, Cassandra (until 4.0), HBase, and Facebook-internal services. The wire-level performance is within ±10% of each other on most workloads. The decode-cost flame-graph problem is the same: both formats parse on every read, both allocate per field, both burn CPU proportional to message complexity.

Cap'n Proto and FlatBuffers — zero-copy as a different bargain

Cap'n Proto (Kenton Varda, 2013 — designed by the original Protobuf author after he left Google) and FlatBuffers (Google, 2014) attack the parse-on-read problem from a different angle: the wire bytes are the in-memory layout. There is no parse step. To read field Settlement.amount, you compute the field's byte offset (known at compile time from the schema), index into the message buffer at that offset, and read the integer in place. No allocation, no dispatch, no varint loop.

Cap'n Proto's representation is an 8-byte-aligned struct with a pointer table. Variable-length data (strings, lists, nested messages) is stored out of line and referenced via a 64-bit pointer that encodes both an offset and a size. FlatBuffers uses a vtable: each table-typed message has a small lookup table at its head listing the byte offset of each field; if a field is missing (left at its default value), its vtable slot is 0. The cost: vtable indirection adds one extra memory read per field, but there are zero allocations and zero parsing.
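
A toy version of the vtable read path — the layout here is made up for illustration (real FlatBuffers vtables use signed offsets and are shared between identical tables), but the idea of one indirection and zero parsing is the same:

import struct

# Made-up layout: [vtable: one u16 offset per slot][field data].
def build(fields, n_slots):
    vtable, data = bytearray(2 * n_slots), bytearray()
    for slot, value in fields.items():
        struct.pack_into("<H", vtable, 2 * slot, 2 * n_slots + len(data))
        data += struct.pack("<q", value)
    return bytes(vtable + data)              # unwritten slots stay 0

def read(buf, slot, default=0):
    off = struct.unpack_from("<H", buf, 2 * slot)[0]
    if off == 0:                             # offset 0 means "field missing"
        return default
    return struct.unpack_from("<q", buf, off)[0]   # one extra read, no parse

msg = build({0: 472839104231, 2: 145900}, n_slots=4)
assert read(msg, 0) == 472839104231
assert read(msg, 2) == 145900
assert read(msg, 1) == 0                     # never written -> default value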

The trade is real and bidirectional. Zero-copy formats are larger on the wire — usually 1.5×–2× the size of an equivalent Protobuf message — because they must store padding for alignment and pointer tables for indirection. They are stricter about schema evolution: adding a field is fine in FlatBuffers (the vtable extends; old readers see 0 for the missing field), but Cap'n Proto requires careful field-ordering rules to maintain compatibility. They consume more memory bandwidth but almost no CPU. Why this matters in practice: if your bottleneck is the network — say, cross-region replication already pushing 80 Gbit/s down a 100 Gbit pipe — Protobuf wins, because the smaller payload spends less of the scarce resource. If your bottleneck is the receiver's CPU — a service that cannot keep up with incoming RPC parsing — Cap'n Proto or FlatBuffers wins, because the parse work simply does not happen. The choice depends on which side of the network your bottleneck is on.
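
A back-of-envelope check makes "which side is the bottleneck" concrete. This sketch borrows the message sizes from the KapitalKite story later in this chapter and assumes an 800,000 msg/s rate; the numbers are illustrative:

msgs_per_sec = 800_000
proto_bytes, capnp_bytes = 286, 472          # assumed per-message sizes
pipe_gbit = 100

for name, size in [("protobuf", proto_bytes), ("capnp", capnp_bytes)]:
    gbit = msgs_per_sec * size * 8 / 1e9
    print(f"{name:9s} {gbit:5.2f} Gbit/s  ({gbit / pipe_gbit:5.1%} of a {pipe_gbit} Gbit pipe)")
# protobuf   1.83 Gbit/s  ( 1.8% of a 100 Gbit pipe)
# capnp      3.02 Gbit/s  ( 3.0% of a 100 Gbit pipe)
# -> at this rate the 65% size penalty is noise; the CPU saving decides it.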

The killer use case for FlatBuffers is deserialisation-on-mobile: the original motivation was Android game clients reading 10 MB of game-config data on every launch. With Protobuf this took 200 ms (allocating millions of small objects); with FlatBuffers it took 2 ms (a mmap and a few pointer dereferences). For server-to-server traffic where most reads touch many fields, the difference is smaller — typically 3–5×, not 100×.

Code: encode the same message three ways and measure size

# wire_compare.py — encode one Settlement three ways, compare bytes & decode cost
# Standard library only — no third-party packages required.
import struct, json, time

# --- Synthetic settlement record (PaySetu-style) ---
record = {
    "id": 472839104231,
    "amount_paise": 145900,           # ₹1459.00
    "merchant": "PaySetu test merchant",
    "status": 2,                       # COMPLETED
    "ts_ms": 1745568890123,
}

# --- 1. Protobuf-style varint TLV (hand-coded, no library) ---
def varint(n):
    out = bytearray()
    while n > 0x7f:
        out.append((n & 0x7f) | 0x80); n >>= 7
    out.append(n & 0x7f)
    return bytes(out)

def proto_encode(r):
    buf = bytearray()
    # field 1, wire_type 0 (varint): id
    buf += b"\x08" + varint(r["id"])
    # field 2, wire_type 0 (varint): amount_paise
    buf += b"\x10" + varint(r["amount_paise"])
    # field 3, wire_type 2 (length-delimited): merchant
    m = r["merchant"].encode("utf-8")
    buf += b"\x1a" + varint(len(m)) + m
    # field 4, wire_type 0: status
    buf += b"\x20" + varint(r["status"])
    # field 5, wire_type 0: ts_ms
    buf += b"\x28" + varint(r["ts_ms"])
    return bytes(buf)

# --- 2. JSON (baseline for comparison) ---
def json_encode(r): return json.dumps(r, separators=(",", ":")).encode("utf-8")

# --- 3. Cap'n-Proto-style fixed layout (hand-coded approximation) ---
def capnp_like_encode(r):
    # 8-byte alignment: 8 bytes id, 8 bytes amount, 8 bytes ts, 4 bytes status, 4 bytes pad,
    # then 8-byte length-prefixed merchant string with padding to 8-byte boundary
    head = struct.pack("<qqqi4x", r["id"], r["amount_paise"], r["ts_ms"], r["status"])
    m = r["merchant"].encode("utf-8")
    pad = (-len(m)) % 8
    return head + struct.pack("<q", len(m)) + m + b"\x00" * pad

# --- Measure: size and decode time over 100k iterations ---
encodings = {"json": json_encode, "protobuf-like": proto_encode, "capnp-like": capnp_like_encode}

if __name__ == "__main__":  # guard so `import wire_compare` (used below) skips the benchmark
    for name, enc in encodings.items():
        payload = enc(record)
        # Decode cost: full parse for json/protobuf, pointer-read for capnp
        if name == "json":
            t0 = time.perf_counter()
            for _ in range(100_000): _ = json.loads(payload.decode("utf-8"))["amount_paise"]
            dt = time.perf_counter() - t0
        elif name == "protobuf-like":
            t0 = time.perf_counter()
            # naive: scan to the field-2 tag byte, then read the varint after it
            for _ in range(100_000):
                i = payload.index(b"\x10") + 1; v, sh = 0, 0
                while payload[i] & 0x80: v |= (payload[i] & 0x7f) << sh; sh += 7; i += 1
                v |= payload[i] << sh
            dt = time.perf_counter() - t0
        else:  # capnp-like: just struct.unpack_from at fixed offset 8
            t0 = time.perf_counter()
            for _ in range(100_000): _ = struct.unpack_from("<q", payload, 8)[0]
            dt = time.perf_counter() - t0
        print(f"{name:16s} bytes={len(payload):3d}  decode_100k={dt*1000:6.1f}ms  per_op={dt*10:5.2f}us")

Sample run on a c6i.xlarge (Intel Ice Lake, Python 3.11):

json             bytes=109  decode_100k= 286.4ms  per_op= 2.86us
protobuf-like    bytes= 43  decode_100k=  73.1ms  per_op= 0.73us
capnp-like       bytes= 64  decode_100k=  18.9ms  per_op= 0.19us

The walk-through. The line buf += b"\x08" + varint(r["id"]) is the heart of Protobuf's encoding: byte 0x08 is (field_number=1, wire_type=0) packed into a single byte, followed by the varint-encoded integer. The line head = struct.pack("<qqqi4x", ...) is what makes Cap'n Proto fast: every numeric field has a fixed offset and a fixed width, so the receiver can struct.unpack_from("<q", payload, 8) to read amount_paise without scanning. The Cap'n Proto-style payload is 64 bytes versus 43 for Protobuf — bigger, because of alignment padding and the length prefix — but its decode is 3.8× faster. JSON loses on every axis (2.5× larger than Protobuf, 15× slower to decode than the fixed layout), which is why it lost the server-to-server fight.

Why the Cap'n Proto-style decode is so cheap: there is no loop and no branch. The CPU does one struct.unpack_from at a fixed offset — an aligned 8-byte load instruction. Protobuf's varint decode requires a loop over up to 10 bytes with a continuation-bit branch on each, and the branch predictor cannot reliably predict the loop length because it depends on the actual integer value.

A production tale — KapitalKite's order-stream rewrite

KapitalKite, the discount stockbroker, runs an order-event pipeline that fans 800,000 orders per second from the exchange-gateway nodes to 400 downstream consumers (risk engines, position keepers, audit log, regulatory feed). The original design used Protobuf over Kafka. CPU on the consumer side ran at 78% utilisation across 200 c6i.4xlarge instances during market open, with protobuf::Parse alone accounting for 52 of those 78 points. Adding consumers meant adding instances 1:1.

In 2024 the platform team rewrote the wire format to Cap'n Proto. The motivation was specifically the parse cost — Kafka throughput, network bandwidth, and disk IO were all comfortable; the bottleneck was CPU on consumers. The migration took six weeks (schema design, dual-write phase, consumer-by-consumer cutover) and shipped without an outage.

Post-migration measurements: the per-consumer CPU dropped from 78% to 31% utilisation. The wire payload grew from 286 bytes (Protobuf compact) to 472 bytes (Cap'n Proto, with padding and pointer tables) — a 65% larger message — but Kafka was nowhere near bandwidth-saturated, so the size penalty was free. The 200-instance fleet was scaled down to 88 instances, saving roughly ₹4.2 lakh per month in infrastructure cost.

The lesson the team posted internally: "the wire format choice is a CPU-vs-bandwidth choice, and we had been buying the wrong side of it for three years." Their original Protobuf decision was inherited from a 2019 design when Kafka throughput per partition was the bottleneck and message size mattered. By 2024, throughput had grown 8×, instance counts had grown to compensate for parse cost, and the bottleneck had silently flipped without anyone noticing because the dashboards still showed "CPU usage" as a single number rather than a breakdown of where the cycles went.

[Figure: KapitalKite per-consumer CPU breakdown, before and after the format change. Protobuf: 78% of machine CPU (52 points parse/decode + 26 points business logic; 22% spare). Cap'n Proto: 31% of machine CPU (4 points field reads + 26 points business logic + 1 point other; 69% spare). Wire payload grew 65% (286 → 472 bytes); Kafka throughput unchanged; fleet shrank 200 → 88 instances, saving ₹4.2 lakh/month. Percentages reconstructed from the reported 2024 migration.]
Illustrative — the parse-cost slice was eliminated, not redistributed. Total CPU per message dropped because the work didn't happen at all.

Going deeper

Field numbers, wire types, and why renumbering breaks everything

Protobuf and Thrift both pack (field_number, wire_type) into the first varint of every field. The field number is the schema contract: it is what the receiver uses to look up which struct field to populate. Reusing a deleted field's number looks harmless — the old field is gone, the number is "free" — but old senders (or old messages sitting in a queue) still write bytes tagged with that number, and a new reader will parse those bytes into whatever field now owns it, silently and without error. The schema-evolution rules are: (a) never change a field's number, (b) never change a field's wire type (some type changes are compatible — int32, int64, bool and enum all share wire type 0 — but the in-program type still changes), (c) never reuse a deleted field's number for a new field of a different type. The Buf linter and the original Protobuf style guide both exist to catch these.
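
The footgun is easy to demonstrate by hand. This sketch (field names hypothetical) encodes field 4 under an old schema and decodes it under a new schema that reused the number — the bytes parse cleanly, the meaning is silently wrong:

# Old schema: field 4 = status (varint enum). New schema reused 4 = retry_count.
old_bytes = b"\x20\x02"                 # tag 0x20 = (field 4, wire_type 0), value 2
tag, value = old_bytes[0], old_bytes[1]
field_number, wire_type = tag >> 3, tag & 0x07
print(field_number, wire_type, value)   # 4 0 2 — the new reader sees retry_count=2,
                                        # not status=COMPLETED; no error is raised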

Cap'n Proto's "infinite version number" idea

One of Cap'n Proto's design innovations is that adding a field requires only that you add it at the end of the existing layout — and old readers, who don't know the field exists, simply stop reading at the end of the layout they were compiled against. This is enforced by storing the message size in the header. New writers append; old readers skip. The cost is that you can never reorder fields, never delete a field (only deprecate it), and never change a field's offset without bumping a major version. Cap'n Proto calls this approach the "infinite version number" — every change is backward-compatible by construction, never by convention.
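
The chapter's capnp-like layout shows the principle: a new writer appends a field, an old reader stops at the size it knows. This is a sketch of the idea on a fixed layout, not Cap'n Proto's actual header format:

import struct

# v1 layout: id (8 bytes) + amount (8 bytes). v2 appends fee (8 bytes) at the end.
v2_msg = struct.pack("<qqq", 472839104231, 145900, 350)   # new writer: three fields

# Old reader, compiled against v1, reads the 16 bytes it knows and stops:
id_, amount = struct.unpack_from("<qq", v2_msg, 0)
assert (id_, amount) == (472839104231, 145900)            # appended fee is ignored

# New reader, old message: the absent field falls back to its default:
v1_msg = struct.pack("<qq", 472839104231, 145900)
fee = struct.unpack_from("<q", v1_msg, 16)[0] if len(v1_msg) >= 24 else 0
assert fee == 0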

Why Discord moved from JSON to Erlang Term Format to MessagePack

Discord's internal voice service originally used JSON for inter-process communication. As scale grew (millions of concurrent voice connections per region), parse cost dominated. The team moved to Erlang Term Format (ETF) — Erlang's native binary serialisation — because their backend was Elixir and ETF was free to encode/decode with no library overhead. Later, when they rewrote the voice-routing service in Rust, they moved to MessagePack (binary tagged JSON, schema-less, smaller than JSON, faster to parse). The lesson: wire-format choice tracks the language ecosystem, not just first-principles performance. A "perfect" wire format that has no good library in your language is worse than a "good enough" format with a fast, well-maintained one.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate   # optional — the script is stdlib-only
python3 wire_compare.py

# To inspect the hand-coded Protobuf-style wire bytes, hex-dump them:
python3 -c "import wire_compare as w; print(w.proto_encode(w.record).hex())"

# For a real Cap'n Proto implementation, install pycapnp —
# Python bindings over the official C++ runtime:
pip install pycapnp

Where this leads next

This chapter took you to the bottom of the wire layer — the bytes that actually cross the network underneath every RPC mechanism the earlier chapters discussed.

Beyond Part 4, the wire-format choice flows into the messaging chapters (Part 15 — Kafka topic schemas), observability (Part 18 — span attribute encoding), and case studies (Part 20 — Spanner uses Protobuf, Cassandra uses Thrift, Discord moved through three formats in five years).

References

  1. Protocol Buffers Encoding — Google. The canonical wire-format reference: varints, wire types, length-delimited fields, packed repeated. Required reading.
  2. Apache Thrift Whitepaper — Slee, Agarwal, Kwiatkowski, Facebook 2007. Original design; the multi-protocol multi-transport layering is still relevant.
  3. Cap'n Proto — Kenton Varda. The encoding spec; section "Why not Protocol Buffers" by the original Protobuf author is the best argument-from-first-principles for zero-copy.
  4. FlatBuffers Documentation — Google. The vtable design and benchmarks; the Android-game-config use case explains the design.
  5. Designing Data-Intensive Applications, Chapter 4 — Martin Kleppmann, O'Reilly 2017. Comprehensive comparison of formats including Avro and Thrift; the "schema evolution" framing comes from this chapter.
  6. "Why we moved from JSON to ETF to MessagePack" — Discord engineering blog. Real-world wire-format migration motivated by parse cost.
  7. gRPC internals — for how the wire format sits inside an HTTP/2 framing layer.
  8. RPC semantics: at-most-once, at-least-once, exactly-once — wire-level retries interact with serialisation: an idempotency key must be in a stable wire location.