Versioning RPCs without breaking clients
It is a Tuesday afternoon at PaySetu and Karan ships a one-line change to the Refund RPC: he renames a field from reason to reason_code because the old name "wasn't descriptive". The deploy goes green, the integration tests pass, the canary shows no errors. Forty minutes later the support queue lights up — merchants are seeing successful refunds in their dashboard but no money in the customer's account, and a 2019-vintage Android SDK that nobody on the current team has ever read is silently dropping the field on parse. The wire format did not break. Nothing crashed. The RPC kept its 200 OK. And ₹38 lakh of refunds were processed without a reason_code, every one of them rejected by the downstream reconciliation job that ran at 02:00 the next morning.
An RPC is a contract between code you control today and code you do not. Versioning means evolving the contract such that an old client and a new server, or a new client and an old server, never disagree about what a message means. The discipline is not "bump v1 to v2"; it is field-level rules — never renumber, never reuse, never change a type — plus a deprecation cycle, plus enough wire-level introspection to know which clients are still out there.
The shape of the problem — what "breaking" actually means
A breaking RPC change is any change that causes a deployed client and a deployed server to disagree on the meaning of bytes that pass between them. The wire format does not have to fail to parse. The status code does not have to be non-200. The compiler does not have to flag anything. The break can be silent, and silent breaks are the dangerous ones — they corrupt state slowly, in shapes the dashboards do not measure, until a downstream job notices days later. Karan's reason → reason_code rename was silent in exactly this way: the old SDK's generated parser did not know about reason_code, the new server stopped writing reason, and the field disappeared from every refund originating from a 2019 Android binary.
There are three populations of clients you have to think about, and they overlap in nasty ways. The callable population is the set of client binaries that can call your RPC right now — for a public mobile app, this includes every version installed on a device that has not been force-upgraded in the last 18 months. The invokable population is the subset that actually does call the RPC in any given week — usually a long tail with the median client several versions behind HEAD. The understood population is the set whose generated stubs match the current server's idea of the schema. The job of versioning is to keep the invokable population a strict subset of the understood population, while letting both evolve.
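To make the invariant concrete, here is a toy model of the three populations as sets (the version strings are invented for illustration):

# populations.py — toy model of the three client populations
callable_clients = {"1.8", "2.0", "2.4", "3.1"}    # installed, able to call the RPC
invokable_clients = {"2.0", "2.4", "3.1"}          # actually called it this week
understood_clients = {"2.4", "3.1"}                # generated stubs match today's schema

# The versioning invariant: everyone who calls must understand.
stragglers = invokable_clients - understood_clients
print("live callers with stale stubs:", stragglers or "none")   # here: {'2.0'}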
The four moves and what each one costs
Every change to an RPC is one of four moves: add a field, remove a field, change a field's meaning, or change a field's wire shape. The first two are tractable; the last two are where careers are lost.
Adding a field is wire-compatible in Protobuf, Thrift, and Cap'n Proto — old readers see an unknown tag, skip it, and proceed. The cost is purely on the writing side: every new sender must send a sensible default, because old readers will not know to compute one. Why "old readers skip unknown fields" works: the wire format encodes (field_number, wire_type) ahead of every value, and each wire type defines a "skip N bytes" rule (varint reads bytes until the continuation bit is 0; length-delimited reads a length prefix and skips that many bytes). The receiver does not need the field's name or type to step over it. This is the single property that makes Protobuf evolvable at all.
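A hand-rolled sketch makes the skip rule tangible. This is not the protobuf library, just enough of the wire format written out by hand to show that a reader needs only the (field_number, wire_type) tag to step over a field it has never heard of:

# wire_skip.py — why unknown fields are skippable (hand-rolled, not the protobuf lib)

def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        lo, n = n & 0x7F, n >> 7
        out.append(lo | (0x80 if n else 0x00))  # continuation bit on all but the last byte
        if not n:
            return bytes(out)

def decode_varint(buf: bytes, i: int) -> tuple:
    result = shift = 0
    while True:
        b = buf[i]; i += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, i
        shift += 7

def tag(field_number: int, wire_type: int) -> bytes:
    return encode_varint((field_number << 3) | wire_type)

def skip(buf: bytes, i: int, wire_type: int) -> int:
    """Step over a value knowing only its wire type: no name, no schema needed."""
    if wire_type == 0:                      # varint: read until continuation bit clears
        return decode_varint(buf, i)[1]
    if wire_type == 2:                      # length-delimited: length prefix says how far
        length, i = decode_varint(buf, i)
        return i + length
    if wire_type == 5:                      # fixed32
        return i + 4
    if wire_type == 1:                      # fixed64
        return i + 8
    raise ValueError(f"unknown wire type {wire_type}")

# New writer: field 1 (known to everyone) plus field 9 (unknown to old readers).
msg = tag(1, 0) + encode_varint(150) + tag(9, 2) + encode_varint(5) + b"hello"

# Old reader: its schema only knows field 1, yet it parses the whole message.
i = 0
while i < len(msg):
    t, i = decode_varint(msg, i)
    num, wt = t >> 3, t & 0x07
    if num == 1:
        value, i = decode_varint(msg, i)
        print(f"field 1 = {value}")
    else:
        i = skip(msg, i, wt)
        print(f"skipped unknown field {num}")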
Removing a field is safe only after a multi-stage deprecation: stop reading the field on servers, stop writing it on clients, wait long enough that no in-flight binary still emits it, then remove it from the schema. Skipping the cycle and just deleting the field leaves its number free for whatever you add next, quietly turning a removal into a change of meaning for that number (the silent-break twin of move three). The duration of "long enough" is set by the longest-lived client tail: for a server-only RPC that might be 30 days; for a public mobile API it can be 18 months.
Changing a field's meaning without changing its wire shape is the worst move because the compiler cannot catch it. If amount: int64 was always paise and you redefine it to mean rupees, every old client sends and every old server reads the same bytes, but now amount=100 means ₹100 instead of ₹1, a 100× error that passes every type check. This is "semantic versioning through code review", and the only safe form is to introduce a new field (amount_paise) and deprecate the old one over months. Renaming reason to reason_code was a milder version of this — the wire shape (length-delimited string) was identical, but the field number changed, so old clients dropped the value entirely.
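The failure mode fits in four lines; a toy sketch of why nothing catches it:

# Same bytes on the wire before and after the redefinition: varint 100 in field 2.
amount = 100
rupees_if_amount_is_paise = amount / 100   # old contract: ₹1.00
rupees_if_amount_is_rupees = amount        # redefined contract: ₹100
# Identical bytes, identical types; the 100x disagreement is visible only to humans.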
Changing wire shape — int32 to string, repeated int32 to map<string, int32>, optional to required — is wire-incompatible by definition. A new server cannot parse old clients, and vice versa. This requires a fresh field number and a multi-version cutover. The Buf breaking linter enumerates the precise rules; see the references. Why optional → required is a hard break even though both labels share the same wire encoding: an old client that never set the field emits zero bytes for it; a new server with required semantics treats the absence as a parse failure and rejects the whole request. The wire bytes are valid Protobuf in both directions; the failure happens at the validation layer above the parser. This is also why Protobuf 3 dropped required entirely — once a field becomes required, there is no graceful path back.
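A sketch of that validation layer, reusing the field names from the running example (the helper is hypothetical, not gRPC machinery): the parse succeeds, and the request dies one layer up.

# required_check.py — why optional->required rejects old clients
REQUIRED = {"merchant_id", "amount_paise", "reason_code"}   # the new server's contract

def validate(parsed: dict) -> dict:
    missing = REQUIRED - parsed.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return parsed

# An old client never set reason_code: it emitted zero bytes for it, legally.
old_request = {"merchant_id": "m_42", "amount_paise": 100}
try:
    validate(old_request)
except ValueError as e:
    print("rejected above the parser:", e)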
Three real strategies — and which one fits your reality
Strategy 1: never break — field-level evolution within one schema
The Google-internal default. There is one .proto file per service, it lives forever, and you only ever add fields. Removed fields go to reserved (Protobuf has explicit syntax for this — reserved 7, 13, 21; and reserved "old_field_name"; — to prevent number or name reuse). Renames are forbidden in code review. Type changes are forbidden in code review. The Buf linter or the Google-internal proto_breaking_change analyser runs in CI and rejects merges that violate the rules.
This strategy works when the service owner controls every client (server-to-server within one company) and when CI gating is strict. It does not work when third-party SDKs exist in the wild.
Strategy 2: parallel versions — /v1/ and /v2/ as separate APIs
The public-API default. The URL or method name encodes the major version: payments.v1.RefundService/Create and payments.v2.RefundService/Create are different RPC endpoints, with different generated stubs. Clients pick a version explicitly. The server runs both implementations side by side, often as thin wrappers over a shared core. v1 enters maintenance mode (only critical security patches), v2 gets new features. v1 sunsets when traffic falls below a threshold — usually <0.1% of total — typically 12 to 24 months after v2 ships.
This strategy is expensive: every new feature has to be backported (or explicitly not backported) to v1, every server deploy carries both version handlers, every documentation page has to explain which version a feature lives in. But it is the only honest path when you do not control client binaries.
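What that looks like in server code, as a minimal sketch (route and handler names are illustrative, not a real framework): both versions stay registered, each a thin adapter over one shared core.

# parallel_versions.py — v1 and v2 as thin wrappers over a shared core

def refund_core(merchant_id: str, amount_paise: int, reason_code) -> dict:
    # One implementation; every version funnels into it.
    return {"status": "ok", "merchant_id": merchant_id, "amount_paise": amount_paise}

def refund_v1(req: dict) -> dict:
    # v1 called the field `reason`; adapt the old vocabulary to the core's.
    return refund_core(req["merchant_id"], req["amount_paise"], req.get("reason"))

def refund_v2(req: dict) -> dict:
    return refund_core(req["merchant_id"], req["amount_paise"], req.get("reason_code"))

ROUTES = {
    "/payments.v1.RefundService/Create": refund_v1,   # maintenance mode
    "/payments.v2.RefundService/Create": refund_v2,   # active development
}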
Strategy 3: capability negotiation — clients tell servers what they understand
The hybrid path. There is one schema, but every request includes a client_capabilities field — a set of opaque tokens like "refund.reason_code.v2" or "checkout.split_tender" — and the server tailors its response to what the client claims to understand. New fields can ship as soon as even half of clients set the capability; servers stop sending the old field once 99.9% of clients have moved on. Browsers and HTTP do this with the Accept header; gRPC services do it with metadata; PaySetu's internal RPC mesh does it with a caps repeated-string in every request envelope.
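A sketch of the capability check (the token and field names follow the examples above; the envelope shape is invented):

# caps.py — server tailors its response to what the client claims to understand

def build_refund_response(refund: dict, caps: set) -> dict:
    resp = {"refund_id": refund["id"], "amount_paise": refund["amount_paise"]}
    if "refund.reason_code.v2" in caps:
        resp["reason_code"] = refund["reason_code"]   # new field, opted-in clients only
    else:
        resp["reason"] = refund["reason_code"]        # legacy shape for everyone else
    return resp

refund = {"id": "r_91", "amount_paise": 100, "reason_code": "CUSTOMER_DISPUTE"}
print(build_refund_response(refund, caps=set()))                       # old client
print(build_refund_response(refund, caps={"refund.reason_code.v2"}))   # new client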
The cost is a permanent N×M test matrix: every server change has to be tested against every meaningful capability subset still in production. The benefit is that you can ship features incrementally without bumping the major version and without forcing every client to upgrade in lockstep.
Code: a wire-compatibility checker that catches the four moves
This script takes two .proto-like schema descriptions (just Python dicts, for portability) and reports whether the new schema is wire-compatible with the old. It encodes the rules every linter (Buf, proto_breaking_change, protolock) implements internally.
# wire_compat.py — compatibility checker for schema evolution
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class FieldSpec:
    number: int
    name: str
    wire_type: str      # "varint", "length-delimited", "fixed32", "fixed64"
    label: str          # "optional", "required", "repeated"
    semantic_type: str  # "int64", "string", "Money[paise]", etc.

@dataclass
class Schema:
    name: str
    fields: Dict[int, FieldSpec] = field(default_factory=dict)
    reserved_numbers: Set[int] = field(default_factory=set)
    reserved_names: Set[str] = field(default_factory=set)

def check_compat(old: Schema, new: Schema) -> List[Tuple[str, str]]:
    """Return list of (severity, message). 'break' = silent or hard wire break."""
    issues = []
    for num, of in old.fields.items():
        if num in new.fields:
            nf = new.fields[num]
            if nf.wire_type != of.wire_type:
                issues.append(("break", f"#{num} {of.name}: wire_type {of.wire_type}->{nf.wire_type}"))
            if nf.semantic_type != of.semantic_type and nf.wire_type == of.wire_type:
                issues.append(("break", f"#{num} {of.name}: SEMANTIC change {of.semantic_type}->{nf.semantic_type} (wire identical — silent break)"))
            if of.label == "optional" and nf.label == "required":
                issues.append(("break", f"#{num} {of.name}: optional->required rejects old absent-field clients"))
            if nf.name != of.name:
                issues.append(("warn", f"#{num} {of.name}->{nf.name}: rename — old clients drop field"))
        else:
            if num not in new.reserved_numbers:
                issues.append(("break", f"#{num} {of.name}: removed without `reserved {num};` — number can be reused later (silent break)"))
    for num, nf in new.fields.items():
        if num not in old.fields:
            if num in old.reserved_numbers:
                issues.append(("break", f"#{num} {nf.name}: reuses a previously-reserved number"))
            if nf.name in old.reserved_names:
                issues.append(("break", f"#{num} {nf.name}: reuses a previously-reserved name"))
    return issues

# --- Karan's actual change: rename `reason` to `reason_code` ---
old = Schema("RefundRequest", fields={
    1: FieldSpec(1, "merchant_id", "length-delimited", "optional", "string"),
    2: FieldSpec(2, "amount_paise", "varint", "optional", "int64"),
    3: FieldSpec(3, "reason", "length-delimited", "optional", "string"),
})
new = Schema("RefundRequest", fields={
    1: FieldSpec(1, "merchant_id", "length-delimited", "optional", "string"),
    2: FieldSpec(2, "amount_paise", "varint", "optional", "int64"),
    7: FieldSpec(7, "reason_code", "length-delimited", "optional", "string"),
})

for sev, msg in check_compat(old, new):
    print(f"[{sev:5s}] {msg}")
Sample run:
[break] #3 reason: removed without `reserved 3;` — number can be reused later (silent break)
The walkthrough. The line if nf.wire_type != of.wire_type: catches the int32 → string class — the hardest break, the one that fails to parse. The line if nf.semantic_type != of.semantic_type and nf.wire_type == of.wire_type: catches the rupees-to-paise class — same bytes, different meaning, silent. The check if num not in new.reserved_numbers: is the key insight in Karan's case: the old field number 3 was deleted but not reserved, which means the very next deploy could add a new field 3 of a different type and corrupt every old client. The output flags exactly that.

Why marking a deleted field's number as reserved is non-negotiable: the Protobuf wire format identifies fields by number, not name. If you delete reason (field 3) without reserving the number, and six months later someone adds is_priority: bool as field 3, every old client still sending reason="customer_dispute" collides with the new definition. With mismatched wire types the stale string is, depending on the implementation, shunted into unknown fields or rejected as a parse error; had the new field shared the old wire type, the stale value would be silently read as the new field's value. The reserved keyword is a permanent epitaph: this number was used once, do not touch it.
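To close the loop on Karan's change, a short continuation of the script above (the is_priority field is a hypothetical future deploy): reserve the dead number, and the checker catches the reuse before it ships.

# --- Six months later: someone wants field 3 back for `is_priority: bool`. ---
# With the number reserved, the checker turns the silent break into a CI failure.
fixed_new = Schema("RefundRequest", fields=dict(new.fields), reserved_numbers={3})
later = Schema("RefundRequest", fields={
    **fixed_new.fields,
    3: FieldSpec(3, "is_priority", "varint", "optional", "bool"),
})
for sev, msg in check_compat(fixed_new, later):
    print(f"[{sev:5s}] {msg}")
# Output: [break] #3 is_priority: reuses a previously-reserved number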
Common confusions
- "Bumping the version number protects us." Bumping v1.proto to v2.proto only protects callers who actually opt into v2. Every existing v1 caller continues calling v1. If you change v1's schema thinking "we'll move people to v2 soon", you have just broken v1 traffic — the version bump did nothing because no client moved.
- "required fields are like type safety — they catch errors." Protobuf 2's required is widely considered a design mistake; Protobuf 3 removed it. A field marked required cannot ever be made optional later without breaking every old reader, and once a required field exists in a deployed schema you can never delete it — your schema becomes write-only. The Google internal style guide explicitly bans required.
- "Renaming a field is safe because Protobuf uses field numbers." The bytes on the wire are safe. The generated code on the client is not. The old client's stub has a getter named getReason() that maps to field 3; the new server stops emitting field 3 (it now emits field 7, named reason_code). The wire is intact; the client's view is empty. This is exactly the silent break that cost PaySetu ₹38 lakh.
- "optional versus repeated is just a label." repeated fields use a different wire encoding (packed encoding for primitives is the default in Protobuf 3), and changing optional int32 to repeated int32 produces parse errors on old clients reading the new server's output. Wire shape and label are coupled.
- "We can use the URL /v2/ for breaking changes and never bump again." Empirically, /v2/ becomes /v3/ within 24 months for any actively developed public API. The version axis is real and grows. Plan for /v3/ from the day you ship /v2/.
- "exactly-once semantics let us skip version negotiation." Delivery semantics and schema versioning are orthogonal. Even an exactly-once pipeline (which is, strictly, idempotent at-least-once — see RPC semantics) ships messages that have a schema, and that schema has to evolve.
Going deeper
Stripe's API versioning by date
Stripe's public API uses date-stamped versions (Stripe-Version: 2024-04-10) instead of major numbers. Every breaking change pins to a date; every account locks to the date it was created on; the server runs every version simultaneously by maintaining a per-version transformation pipeline that translates incoming requests "up" to the latest internal schema and outgoing responses "down" to the requested version. This decouples "when did this change ship" from "is this client opted into it" and lets the server team ship breaking changes without coordinating a v1→v2 migration. The cost is the transformation pipeline — every version becomes a permanent piece of code Stripe must maintain. Their engineering blog estimates the pipeline contains over 200 transforms after 13 years, and the team explicitly budgets engineering time per quarter to keep the oldest versions working.
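A toy version of that pipeline (the names and the single transform are invented, modelled on the public description rather than Stripe's code): each transform knows how to translate a response down one version, and the server walks the chain from the latest internal schema to the caller's pinned date.

# stripe_style.py — date-pinned versions via a downgrade pipeline (toy model)

def downgrade_2024_04_10(resp: dict) -> dict:
    # This version renamed `reason` to `reason_code`; older pins still get `reason`.
    resp = dict(resp)
    resp["reason"] = resp.pop("reason_code", None)
    return resp

TRANSFORMS = [                      # newest first: (date it shipped, downgrade fn)
    ("2024-04-10", downgrade_2024_04_10),
]

def render(resp: dict, pinned: str) -> dict:
    for shipped, downgrade in TRANSFORMS:
        if pinned < shipped:        # ISO dates compare correctly as strings
            resp = downgrade(resp)  # translate down past every change newer than the pin
    return resp

print(render({"reason_code": "CUSTOMER_DISPUTE"}, pinned="2023-06-01"))  # gets `reason`
print(render({"reason_code": "CUSTOMER_DISPUTE"}, pinned="2024-04-10"))  # unchanged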
Avro's reader-schema / writer-schema duality
Avro takes a different angle: every message on the wire embeds (or references via schema registry) the writer's schema. The reader supplies its own reader schema at decode time. Avro's library then matches the two field-by-field: fields in both, by name, with compatible types, get translated; fields only in writer get dropped; fields only in reader get default values. This shifts the compatibility check from compile time (Protobuf) to runtime (Avro) and lets you evolve in directions Protobuf forbids — for instance, you can rename a field and tell the reader to alias the old name to the new. The cost is that every message carries a schema fingerprint or fetches one from a registry, adding a network round trip on first contact.
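A simplified model of that resolution step (name matching, aliases, and defaults only; the real Avro spec also handles type promotion and unions):

# avro_resolution.py — reader-schema / writer-schema matching, heavily simplified

writer = {"fields": [{"name": "merchant_id"}, {"name": "reason"}]}
reader = {"fields": [
    {"name": "merchant_id"},
    {"name": "reason_code", "aliases": ["reason"], "default": "UNKNOWN"},  # a legal rename
    {"name": "initiated_by", "default": "system"},                        # reader-only field
]}

def resolve(record: dict, writer: dict, reader: dict) -> dict:
    writer_names = {f["name"] for f in writer["fields"]}
    out = {}
    for f in reader["fields"]:
        # First matching name (own name, then aliases) present in the writer's schema.
        src = next((n for n in [f["name"], *f.get("aliases", [])] if n in writer_names), None)
        out[f["name"]] = record[src] if src else f["default"]
        # Fields only the writer knows are dropped; fields only the reader knows get defaults.
    return out

print(resolve({"merchant_id": "m_42", "reason": "CUSTOMER_DISPUTE"}, writer, reader))
# {'merchant_id': 'm_42', 'reason_code': 'CUSTOMER_DISPUTE', 'initiated_by': 'system'}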
Why gRPC's unary / streaming are themselves a versioning axis
A gRPC method's cardinality (unary, server-streaming, client-streaming, bidi-streaming) is part of the method signature and cannot be changed compatibly. Promoting a unary method to server-streaming — a tempting move when adding pagination — produces clients that hang waiting for a single response that will never come (the new server is sending the response in chunks the old client doesn't know to read). The only safe path is a new method name (ListV2 instead of List), which is just parallel-versions strategy at the method level.
Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install protobuf  # optional: wire_compat.py itself is stdlib-only
python3 wire_compat.py
# To run a real Buf check on a real .proto:
brew install bufbuild/buf/buf
buf breaking --against '.git#branch=main' your.proto
Where this leads next
This chapter shipped you the rules of schema evolution at the wire level — the bytes-on-the-wire view. Versioning intersects with three downstream concerns:
- Idempotency keys, request hashing, and dedup tables — when you add a new field, what does it do to the request hash that your dedup table is keyed on? A schema bump that changes the canonical hash invalidates every in-flight retry.
- Wire protocols: Protobuf, Thrift, Cap'n Proto, FlatBuffers — Cap'n Proto's "infinite version number" is a different point in the design space; FlatBuffers' vtable makes adding fields cheaper but reordering catastrophic.
- gRPC internals — gRPC's HTTP/2 framing carries a :path that includes the service version; routing different versions to different server pools is a Layer-7 problem that grpc-gateway and Envoy both solve at the proxy layer.
Beyond Part 4, RPC versioning surfaces in messaging (Part 15 — Kafka topic schemas in a Schema Registry), in workflows (Part 16 — Temporal's workflow versioning is the same problem with a 90-day workflow lifetime), and in case studies (Part 20 — Stripe, Twilio, Twitter all wrote down their versioning policies and they are worth reading).
References
- Protocol Buffers — Updating a Message Type — Google. The canonical wire-compatibility rules: which type changes are safe, which require a reserved declaration, why required is banned in proto3.
- Buf's Breaking-Change Detector — the open-source linter that encodes the same rules. Read the rule list once; it is the operational form of the spec.
- Stripe API Versioning — Brandur Leach's writeup of the date-stamped versioning model and the transformation pipeline. The single best public-API-evolution case study online.
- Apache Avro Specification — Schema Resolution — reader-schema / writer-schema duality, the alternative point in the design space to Protobuf's "no rename".
- Designing Data-Intensive Applications, Chapter 4 — Martin Kleppmann, O'Reilly 2017. The "schema evolution" framing for Protobuf, Thrift, Avro, and JSON; foundational for this chapter.
- "The Hyrum's Law of Public APIs" — Hyrum Wright. "With a sufficient number of users, all observable behaviours of your API will be depended on by somebody." The reason renames are silent breaks even when the rename is "obviously safe".
- RPC semantics: at-most-once, at-least-once, exactly-once — internal companion. Versioning interacts with retry and idempotency in ways the field-rules table doesn't capture.
- Wire protocols: Protobuf, Thrift, Cap'n Proto, FlatBuffers — internal companion. The wire-format choice constrains the versioning moves available.