gRPC internals
It is 09:42 IST on a Wednesday and Asha is staring at a flame graph from MealRush's restaurant-search service. The dashboard says p99 latency for the Search RPC is 380 ms, but TCP-level latency from the proxy to the search service is 4 ms. The 376 ms gap is hiding inside gRPC. Her teammate Kiran says "it's serialisation"; the language docs say "it's stream queueing"; the gRPC source code says it's flow_control_window exhaustion combined with HOL blocking on a multiplexed connection that opened 87 concurrent streams to the same backend. None of those phrases mean anything until you know what gRPC is doing under the stub. This chapter is the walk down.
gRPC is three things stacked: Protobuf for the message bytes, HTTP/2 for the framing and multiplexing, and a code generator that turns .proto files into language-specific stubs. Every RPC is one HTTP/2 stream — request headers, length-prefixed Protobuf payload, optional trailers — running on a connection that multiplexes many streams. Once you internalise that, gRPC's deadline propagation, cancellation, streaming RPCs, and flow control all fall out of HTTP/2's primitives.
What gRPC actually is on the wire
A gRPC call is not a magic incantation; it is one HTTP/2 POST to a path that encodes the service and method, with a Protobuf body wrapped in a 5-byte length prefix. If you point tcpdump -X at port 50051 and decode the HTTP/2 frames, you can read the entire RPC by hand. Anything that sounds mystical about gRPC — load balancing, deadlines, streaming — is a layer on top of those wire bytes.
The message on the wire for a single unary RPC looks like this. The client sends a HEADERS frame on a fresh stream:
:method = POST
:scheme = https
:path = /mealrush.search.SearchService/Search
:authority = search.mealrush.internal
content-type = application/grpc+proto
te = trailers
grpc-timeout = 200m
grpc-encoding = identity
user-agent = grpc-python/1.60
Then a DATA frame whose payload is [1 byte compressed-flag][4 bytes message length big-endian][N bytes Protobuf]. The 5-byte prefix is gRPC's own framing — HTTP/2 has no concept of "messages", only byte streams, so gRPC needs its own delimiter. The server replies with a HEADERS frame (:status = 200, content-type = application/grpc+proto), zero or more DATA frames carrying the response message, and then a HEADERS frame with END_STREAM flag carrying grpc-status: 0 and grpc-message: "" — the gRPC trailers. Why trailers instead of headers for the status: the server may not know the final status until after it has streamed all the data — for a server-streaming RPC, the status reflects the entire stream's outcome. HTTP/2 trailers exist exactly to let the sender finalise metadata after the body. HTTP/1.1 defined trailers in the spec, but support in clients and proxies was so patchy they were effectively unusable; HTTP/2 made them a reliable, first-class part of the framing, and gRPC was designed around them (hence the te = trailers request header).
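To make the 5-byte prefix concrete, here is a minimal sketch (not part of any gRPC library) that splits a buffer of concatenated length-prefixed messages back into individual Protobuf payloads; a real client would also handle a message that straddles two DATA frames.
# grpc_framing_demo.py — parse gRPC's length-prefixed framing by hand (illustrative sketch)
import struct

def split_grpc_messages(buf: bytes):
    # [1 byte compressed flag][4 bytes big-endian length][payload], repeated back-to-back
    offset = 0
    while offset < len(buf):
        compressed = buf[offset]
        (length,) = struct.unpack_from(">I", buf, offset + 1)
        yield compressed, buf[offset + 5 : offset + 5 + length]
        offset += 5 + length

for flag, payload in split_grpc_messages(b"\x00\x00\x00\x00\x04\x0a\x02hi"):
    print(flag, payload)  # 0 b'\n\x02hi': one uncompressed 4-byte Protobuf message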
How HTTP/2 multiplexing makes one connection do work for thousands of RPCs
The performance story of gRPC is mostly the multiplexing story of HTTP/2. A single TCP connection between two peers carries an unbounded number of streams; each stream is a bidirectional sequence of frames belonging to one logical request/response. Streams have IDs (odd for client-initiated, even for server-initiated), and frames on the same connection interleave by stream ID. The peer demultiplexes by reading the stream-id header on each frame.
This means an HTTP/1.1-style "connection per request" is replaced by "stream per request, connection persists for the channel's lifetime". For MealRush, a typical service might open one channel to restaurant-service at start-up; the channel opens one TCP connection per subchannel (the load-balancing layer, covered below); and that connection then carries every RPC for the process's lifetime, multiplexing 50–500 concurrent streams.
The trade-off this introduces is head-of-line blocking at the TCP layer. HTTP/2 multiplexes at the application layer, but TCP is still a single byte-stream — if a packet is dropped on that one connection, every in-flight stream stalls until retransmit, because TCP cannot deliver any later byte until the missing one arrives. HTTP/3 (which uses QUIC instead of TCP) was designed to fix this — each QUIC stream has its own loss-recovery, so a dropped packet stalls only that stream. For most intra-datacentre gRPC traffic the HTTP/2 HOL problem is rare (loss rates are <0.01%), but on lossy networks (mobile, cross-region) it becomes a real source of tail latency.
The other constraint is per-connection flow control. HTTP/2 has a WINDOW_UPDATE frame; each stream has a flow-control window (default 65,535 bytes), and the connection itself has a separate window. The sender cannot push more bytes than the smaller of the two windows would allow; once exhausted, the sender stalls until the peer sends WINDOW_UPDATE to grant more credit. Why both stream-level and connection-level windows: stream-level prevents one runaway stream from starving siblings; connection-level prevents the aggregate from overrunning the receiver's socket buffer regardless of how many streams there are. A stream-level window of 64 KB is fine for an RPC carrying 4 KB messages, but for a server-streaming RPC pushing 50 MB of search results, the default 64 KB window must be raised at startup or every stream stalls every 64 KB until the receiver acks.
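A back-of-envelope sketch of what the default window costs, assuming a 1 ms intra-datacentre round trip for the WINDOW_UPDATE to come back (illustrative numbers, not measurements):
# window_math.py — why a 64 KB window throttles a 50 MB server-streaming response
WINDOW = 65_535           # default HTTP/2 stream flow-control window, bytes
RESPONSE = 50 * 1024**2   # 50 MB of streamed search results
RTT = 0.001               # assumed 1 ms round trip for a WINDOW_UPDATE

stalls = RESPONSE // WINDOW        # sender pauses roughly once per window
added = stalls * RTT               # each pause waits about one RTT for credit
cap = WINDOW / RTT / 1024**2       # per-stream throughput ceiling, MB/s
print(f"~{stalls} stalls, ~{added:.1f} s added latency, ~{cap:.0f} MB/s per-stream cap")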
How a gRPC channel becomes one or more subchannels
A Channel is the client's handle to a service name (search.mealrush.internal). The channel does not know about IP addresses; it owns a resolver (typically DNS, sometimes xDS or a custom name resolver) that periodically maps the name to a list of addresses, and a load balancer that decides which address to send each call to. Each address that the load balancer picks gets a subchannel — a single TCP+HTTP/2 connection to that endpoint. A channel with pick_first LB has one subchannel; a channel with round_robin LB across 12 backends has 12 subchannels.
When you call stub.Search(...), the channel asks the load balancer to pick a subchannel; the call is then sent as a new HTTP/2 stream on that subchannel's connection. The most-cited misconfiguration here is using gRPC behind a Layer-4 load balancer (an AWS NLB or HAProxy in TCP mode) without telling the channel about the multiple backends. The L4 LB picks one backend per connection; the channel makes one connection; every RPC pins to one backend forever; horizontal scaling stops working. The fix is either (a) use round_robin LB at the gRPC channel level with DNS resolving to all backend IPs, or (b) put gRPC behind a Layer-7 proxy (Envoy, NGINX with grpc_pass, or a gRPC-aware cloud load balancer) that understands HTTP/2 streams and can balance per-RPC.
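A minimal sketch of option (a) in Python grpcio, assuming search.mealrush.internal resolves to every backend IP; the service-config JSON is the documented way to select the LB policy, but verify the channel-arg names against the grpcio version you run:
# round_robin_channel.py — client-side per-RPC balancing instead of the default pick_first
import json
import grpc

service_config = json.dumps({"loadBalancingConfig": [{"round_robin": {}}]})
channel = grpc.insecure_channel(
    "dns:///search.mealrush.internal:50051",   # dns:/// so the resolver returns all A records
    options=[("grpc.service_config", service_config)],
)
# stub = search_pb2_grpc.SearchServiceStub(channel)  # generated stub, used as usual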
Code: read a unary RPC's full HTTP/2 framing
Because gRPC is "just" HTTP/2 + Protobuf, you can implement a working gRPC client without grpcio — by speaking HTTP/2 directly to the server. The script below uses Python's h2 library (a pure HTTP/2 protocol implementation) to send one RPC, then dumps every frame the wire produces. Run it against any gRPC server.
# grpc_raw_demo.py — speak gRPC by hand, no grpcio
# Demonstrates: HTTP/2 HEADERS frame, length-prefixed DATA, gRPC trailers
import socket, ssl, struct
import h2.connection
import h2.config

# Pretend we have this proto: rpc Echo(EchoRequest) returns (EchoResponse)
# EchoRequest { string message = 1; } — Protobuf wire bytes for "hi": 0a 02 68 69
PROTO_PAYLOAD = b"\x0a\x02hi"  # field 1 (string), length 2, "hi"

def length_prefix(payload: bytes) -> bytes:
    # gRPC framing: [1 byte compressed flag][4 bytes big-endian length][payload]
    return b"\x00" + struct.pack(">I", len(payload)) + payload

def call(host: str, port: int, service_method: str, body: bytes, deadline_ms: int = 200):
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2"])
    sock = ctx.wrap_socket(socket.create_connection((host, port)), server_hostname=host)
    conn = h2.connection.H2Connection(config=h2.config.H2Configuration(client_side=True))
    conn.initiate_connection()
    sock.sendall(conn.data_to_send())
    headers = [
        (":method", "POST"),
        (":scheme", "https"),
        (":path", service_method),
        (":authority", host),
        ("content-type", "application/grpc+proto"),
        ("te", "trailers"),
        ("grpc-timeout", f"{deadline_ms}m"),
        ("user-agent", "grpc-raw-demo/0.1"),
    ]
    stream_id = 1
    conn.send_headers(stream_id, headers, end_stream=False)
    conn.send_data(stream_id, length_prefix(body), end_stream=True)
    sock.sendall(conn.data_to_send())
    # Drain frames and print them
    while True:
        data = sock.recv(65535)
        if not data:
            break
        events = conn.receive_data(data)
        sock.sendall(conn.data_to_send())  # flush anything h2 queued in response (e.g. SETTINGS ack)
        for ev in events:
            print(f" EVENT {type(ev).__name__}: {ev}")
            if ev.__class__.__name__ == "StreamEnded":
                return

if __name__ == "__main__":
    # Point at any gRPC server you have. Below assumes an Echo service exists.
    call("localhost", 50051, "/demo.EchoService/Echo", PROTO_PAYLOAD, deadline_ms=200)
Sample run against a local Echo server:
EVENT RemoteSettingsChanged: <RemoteSettingsChanged changed_settings:{...}>
EVENT SettingsAcknowledged: <SettingsAcknowledged>
EVENT ResponseReceived: <ResponseReceived stream_id:1, headers:[
(b':status', b'200'),
(b'content-type', b'application/grpc+proto'),
(b'grpc-accept-encoding', b'identity,deflate,gzip')]>
EVENT DataReceived: <DataReceived stream_id:1, flow_controlled_length:9, data:b'\x00\x00\x00\x00\x04\n\x02hi'>
EVENT TrailersReceived: <TrailersReceived stream_id:1, headers:[
(b'grpc-status', b'0'),
(b'grpc-message', b'')]>
EVENT StreamEnded: <StreamEnded stream_id:1>
The walk-through. The line (":path", service_method) is how gRPC encodes service+method in the URL — /<package>.<Service>/<Method>; this is what the server uses to dispatch the call to the right handler. The line length_prefix(body) wraps the Protobuf bytes in gRPC's 5-byte framing; the server reads the first 5 bytes, learns the message length, then reads exactly that many bytes — so a single HTTP/2 stream can carry many messages back-to-back (this is how server-streaming RPCs work). The DataReceived event shows flow_controlled_length:9: 5 bytes of gRPC framing prefix plus 4 bytes of Protobuf payload (0a 02 68 69). The TrailersReceived event with grpc-status: 0 is the actual signal that the call succeeded — :status: 200 only says the HTTP layer accepted the call, not that the application succeeded. Why :status: 200 does not mean the gRPC call succeeded: a server can return HTTP 200 with grpc-status: 13 (INTERNAL) when the application threw an exception. The two layers are separate. A gRPC client checks grpc-status from trailers, not :status from headers. Code that checks only the HTTP status will treat application errors as successes.
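The rule is small enough to state as code; this is a hypothetical helper for the events printed above, not a grpcio API:
def call_succeeded(http_status: int, trailers: dict) -> bool:
    # :status says the HTTP layer delivered a response; grpc-status in the
    # trailers is the application-level verdict. Only the trailer counts.
    if http_status != 200:
        return False
    return trailers.get("grpc-status") == "0"

print(call_succeeded(200, {"grpc-status": "0"}))    # True: grpc-status 0 (OK)
print(call_succeeded(200, {"grpc-status": "13"}))   # False: HTTP 200 but grpc-status 13 (INTERNAL)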
Streaming RPCs: client-stream, server-stream, bidi — all the same primitive
gRPC defines four kinds of method:
- Unary: one request, one response. (Most RPCs.)
- Server-streaming: one request, many responses. (Restaurant search returning results as they're computed.)
- Client-streaming: many requests, one response. (Uploading a video in chunks; server returns the upload-complete confirmation.)
- Bidirectional: many of each, interleaved. (Real-time chat, telemetry pipelines, collaborative editing.)
All four use the same HTTP/2 stream model — one stream per RPC. The difference is just which side calls end_stream first. For unary, the client ends its half-stream after the request, then the server ends after the response. For server-streaming, the client ends after the request, the server pushes many DATA frames each with one length-prefixed message, and finally sends trailers with END_STREAM. For bidi, both sides interleave DATA frames freely until either calls END_STREAM.
Critically, a streaming RPC's flow control window is the same primitive as a unary RPC's. If the receiver does not drain its buffer (e.g. the server is slow to read incoming chunks of a client-stream upload), the sender's WINDOW_UPDATE credits run out and the sender stalls. This is why gRPC streaming pipelines need explicit application-level concurrency: a server that calls stream.Recv() once per loop iteration and then does 200 ms of CPU work between iterations will stall the client at the HTTP/2 flow-control layer, even though the network has bandwidth to spare. The diagnostic, visible in tcpdump or with GRPC_TRACE=flowctl,http, is WINDOW_UPDATE frames arriving at long, irregular intervals.
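A sketch of one way to keep the window drained in a Python grpcio servicer: read eagerly, process separately. The proto names (UploadAck) and process() are hypothetical, and the bounded queue means backpressure still exists, just at a coarser granularity than per-Recv.
# upload_servicer.py — method on a client-streaming servicer (class body omitted)
import queue
import threading

def Upload(self, request_iterator, context):
    chunks = queue.Queue(maxsize=64)

    def drain():
        for chunk in request_iterator:   # pulling promptly returns WINDOW_UPDATE credit to the client
            chunks.put(chunk)
        chunks.put(None)                 # sentinel: client half-closed its stream

    threading.Thread(target=drain, daemon=True).start()

    total = 0
    while (chunk := chunks.get()) is not None:
        total += process(chunk)          # the slow CPU work no longer sits between consecutive reads
    return UploadAck(bytes_received=total)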
A second production tale — CricStream's chat service
CricStream's match-day chat service handles 4M concurrent viewers during a final, with each phone holding one bidirectional gRPC stream to a chat backend. In 2024 the team's p99 chat-message latency mysteriously jumped from 80 ms to 1,800 ms during the 19th over of an India-vs-Australia match, even though CPU on the chat backends sat at 30%. The dashboard pointed at the chat service; the chat service's own dashboard pointed at the network; the network team showed clean charts.
The actual cause was HTTP/2 connection-level flow control on the Envoy proxy in front of the chat service. Envoy's default initial_connection_window_size is 64 KB. The chat traffic was a sustained 25 MB/s per connection (250 messages/sec × 100 KB average for messages-with-emoji-payloads). With 4M streams across 800 backends that meant ~5,000 streams per backend per connection. Each stream's flow-control window was cycling through WINDOW_UPDATE healthily, but the connection-level window was being consumed faster than acks could return, especially when an Envoy worker was momentarily slow to schedule. The result: the connection stalled briefly, every stream on it stalled, the next message a user sent waited 1.8 seconds for the connection to unstick.
The fix was to raise initial_stream_window_size from 64 KB to 4 MB and initial_connection_window_size from 64 KB to 32 MB on the Envoy listener, plus the matching window-size settings on the gRPC servers themselves. The chat-message p99 dropped back to 90 ms within minutes of the rollout. Why bigger windows fix this without buying bigger buffers: HTTP/2's flow-control window does not buffer the data — it just lets the sender ship more bytes before pausing. The data still flows through the same TCP socket buffer the kernel manages. The window is purely a credit-tracking mechanism; raising it costs nothing in memory until the receiver actually queues bytes (which it would not, if it is keeping up). Raising the window only makes the sender pause less aggressively when the receiver is momentarily slow.
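The arithmetic that made this failure inevitable is short (illustrative, using the numbers above):
# connection_window_math.py
WINDOW_OLD = 64 * 1024       # default connection-level window
WINDOW_NEW = 32 * 1024**2    # the post-fix connection-level window
RATE = 25 * 1024**2          # sustained bytes/sec on one connection

print(f"64 KB window drains in {WINDOW_OLD / RATE * 1000:.1f} ms")  # ~2.6 ms
print(f"32 MB window drains in {WINDOW_NEW / RATE:.2f} s")          # ~1.28 s
# Any receiver hiccup longer than the drain time stalls every stream on the connection.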
Common confusions
- "gRPC is faster than REST because it uses Protobuf." Mostly wrong. The serialisation gap between Protobuf and JSON for a typical RPC payload (a few hundred bytes) is single-digit microseconds — invisible compared to network RTT. gRPC's real performance advantage comes from HTTP/2 multiplexing — one connection serving thousands of concurrent streams without HTTP/1.1's connection-pool exhaustion. A REST API on HTTP/2 with msgpack would have most of the same wins.
- "gRPC's deadline is the same as a TCP timeout." No. grpc-timeout is an HTTP/2 header the server reads and uses to bound its own work — it tells the server when to stop processing the call. Deadlines also propagate through chained RPCs (server A calls server B with the remaining deadline, not a fresh one; see the sketch after this list). TCP timeouts are at the transport layer and have nothing to do with the application's deadline.
- "HTTP/2 streams are like threads." They are like cooperative coroutines, not threads. Streams share one TCP connection and one set of TCP buffers; if the connection stalls (packet loss, slow receiver), every stream on it stalls together. Threads (in a thread-per-request server) are independent and one slow thread does not slow others.
- "gRPC retries are exactly-once because of the framework." No. gRPC's retryPolicy (set in the service config) gives at-least-once retries with backoff on the status codes you mark as retryable. The framework cannot make a non-idempotent operation safe to retry — see RPC semantics: at-most-once, at-least-once, exactly-once for the application-level dedup pattern that actually gives you exactly-once-effect.
- "HTTP/2 multiplexing means head-of-line blocking is gone." Only at the application layer. A dropped TCP segment still blocks every stream on that connection until retransmit, because TCP delivers a single in-order byte stream. HTTP/3 (QUIC) is what actually fixes per-stream HOL — each QUIC stream has its own loss-recovery state.
- "gRPC's load balancing works through any proxy." Only L7 proxies (Envoy, NGINX with grpc_pass, gRPC-aware ALBs). An L4 TCP proxy load-balances connections, not streams; since gRPC keeps one connection per channel, you get one backend per client process, which is the opposite of horizontal scaling.
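The deadline-propagation point above, as a sketch of a Python grpcio servicer calling downstream. time_remaining() and abort() are real grpc.ServicerContext methods; the downstream stub and its Fetch method are hypothetical stand-ins.
# deadline_propagation.py
import grpc

def call_downstream(context, downstream_stub, request):
    remaining = context.time_remaining()   # seconds left on the caller's grpc-timeout (None if no deadline)
    if remaining is not None and remaining < 0.05:
        context.abort(grpc.StatusCode.DEADLINE_EXCEEDED, "caller's deadline nearly spent")
    # Pass the remaining budget down, not a fresh timeout of our own choosing.
    return downstream_stub.Fetch(request, timeout=remaining)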
Going deeper
gRPC's three-tier resolution: name → addresses → subchannel → connection
The channel is initialised with a target like dns:///search.mealrush.internal:50051. The resolver (DNS by default) returns a list of A/AAAA records. The load balancer policy (pick_first or round_robin or xds) decides how to distribute calls. Each chosen address gets a subchannel — the gRPC-internal abstraction for "we want to be connected to this address". The subchannel state machine has states IDLE → CONNECTING → READY → TRANSIENT_FAILURE → SHUTDOWN. When READY, the subchannel owns one TCP+HTTP/2 connection. The channel's overall state aggregates its subchannels: the channel is READY if any subchannel is READY. This three-tier resolution is why a gRPC channel can survive a backend node restart without dropping in-flight RPCs on other backends — a single subchannel goes through TRANSIENT_FAILURE → CONNECTING → READY while the rest stay healthy.
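A sketch of watching those state transitions from Python; channel.subscribe() is a real grpcio API, and the target is this chapter's example name.
# watch_channel_state.py
import time
import grpc

channel = grpc.insecure_channel("dns:///search.mealrush.internal:50051")
channel.subscribe(lambda state: print("channel:", state), try_to_connect=True)
time.sleep(5)   # watch CONNECTING -> READY, or TRANSIENT_FAILURE flaps during a backend restart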
Why HPACK matters for high-fanout services
HPACK is HTTP/2's header compression. It maintains a dynamic table of header name+value pairs on each end of the connection; subsequent headers can be sent as a 1-byte index reference into the table. For gRPC this is enormous: every RPC carries :path, content-type, te, grpc-encoding, user-agent, etc., totalling 200–400 uncompressed bytes per call. With HPACK, after the first call on a connection these are 5–10 bytes. For a service handling 50,000 RPCs/sec on one connection, HPACK saves ~15 MB/sec (roughly 120 Mbit/s) of header bandwidth; uncompressed, the headers alone would more than saturate a 100 Mbit link, while compressed they amount to a few Mbit/s. The downside: HPACK's dynamic table is per-connection, so cold connections pay the full header cost on the first RPC. This is why long-lived channels matter — connection-level state amortises across calls.
Cancellation: how ctx.cancel() becomes a wire frame
When a client cancels a context, the gRPC client library sends an HTTP/2 RST_STREAM frame on that stream with error code CANCEL (0x8). The server's HTTP/2 implementation sees the frame, marks the stream as cancelled, and signals the per-stream context the application handler is holding. The application then sees its ctx.Done() channel close (Go) or its handler task cancelled (Python asyncio), and can short-circuit any work it has not finished — releasing the database connection, abandoning the downstream RPC. Crucially, cancellation does not automatically reach a remote service the cancelled call had already invoked unless the server made its downstream call with the inbound call's context or otherwise propagated the deadline; this is why every gRPC server's middleware should respect the incoming deadline and pass it down.
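On the server side, the sketch below shows a Python grpcio handler noticing the cancellation; is_active() and add_callback() are real ServicerContext methods, while fetch_page(), release_db_connection() and SearchResponse stand in for application code.
# cancellable_search.py — method on a servicer (class body omitted)
def Search(self, request, context):
    context.add_callback(release_db_connection)   # runs when the RPC terminates, including on cancel
    results = []
    for page in range(100):
        if not context.is_active():               # client sent RST_STREAM or the deadline passed
            return SearchResponse()               # short-circuit; the reply is discarded anyway
        results.extend(fetch_page(page))
    return SearchResponse(results=results)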
Compression: when it helps and when it makes things worse
gRPC supports per-message compression via the grpc-encoding header (identity, gzip, deflate, optionally snappy). Compression helps when (a) messages are larger than 1–2 KB and (b) network bandwidth is the bottleneck. It hurts when messages are small (< 200 bytes — gzip overhead exceeds savings), when the workload is CPU-bound (compression adds 50–200 microseconds per message), or when the payload is already binary-incompressible (most Protobuf is partly so — Protobuf already varint-encodes integers, removing easy redundancy). The default in most language SDKs is identity (no compression), and that default is correct for most intra-datacentre traffic. Enable compression deliberately for cross-region links where bandwidth matters, or for streaming RPCs carrying repetitive payloads.
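A sketch of turning gzip on deliberately in Python grpcio; grpc.Compression is the real enum, and the target and stub are this chapter's running example.
# enable_compression.py
import grpc

channel = grpc.insecure_channel(
    "search.mealrush.internal:50051",
    compression=grpc.Compression.Gzip,      # channel-wide default for outgoing messages
)
# ...or override on a single large, repetitive call while leaving the rest uncompressed:
# reply = stub.Search(request, compression=grpc.Compression.Gzip, timeout=0.2)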
Reproduce this on your laptop
# Set up a venv and try the raw HTTP/2 demo
python3 -m venv .venv && source .venv/bin/activate
pip install h2 grpcio grpcio-tools
# Start a tiny gRPC echo server (or use any local gRPC service)
# Then run:
python3 grpc_raw_demo.py
# Inspect the wire with tcpdump, decode HTTP/2 with Wireshark
sudo tcpdump -i lo -w /tmp/grpc.pcap port 50051
wireshark /tmp/grpc.pcap # set decode-as: HTTP2
# Enable gRPC tracing for flow control
GRPC_VERBOSITY=debug GRPC_TRACE=flowctl,http,api python3 your_client.py
Where this leads next
This chapter covered the wire and the channel; the next chapters in Part 4 build on top of those primitives.
- Idempotency keys, request hashing, and dedup tables — how to ride gRPC's at-least-once retry policy without producing duplicate effects.
- Wire protocols (Protobuf, Thrift, Cap'n Proto, FlatBuffers) — gRPC uses Protobuf by default but is not limited to it; the trade-offs against the alternatives.
- Message ordering: FIFO, causal, total — once delivery is at-least-once on a stream, what guarantees do you get about order across streams?
Beyond Part 4, gRPC's stream-per-RPC model is the load-bearing primitive for the service-mesh layer (Part 5 — service discovery), the proxy-based load-balancing patterns (Part 6 — heterogeneous backends balanced by an L7 Envoy), and the bi-directional control planes used by xDS, Temporal workers, and Kafka-Connect-style streaming controllers.
References
- gRPC Concepts — the official model: channels, stubs, RPC types, deadlines, status codes. The starting point for any deeper read.
- HTTP/2 specification (RFC 7540) — Belshe, Peon, Thomson, IETF 2015. The framing, multiplexing, and flow-control primitives gRPC builds on; sections 5–6 are essential.
- HPACK: Header Compression for HTTP/2 (RFC 7541) — Peon, Ruellan, IETF 2015. How HPACK's dynamic table compresses repeated headers, which is what makes gRPC viable for high-fanout services.
- Protocol Buffers Encoding — Google. The varint, tag-length-value wire format gRPC carries inside its DATA frames; required reading for anyone debugging payload bytes.
- Performance Best Practices with gRPC — gRPC team. The official "what to tune" list: keep-alive intervals, initial window sizes, max-concurrent-streams, channel re-use.
- HTTP/3 explained — Daniel Stenberg. Why QUIC's per-stream loss-recovery removes the HOL-blocking that HTTP/2 still has at the TCP layer; relevant for deciding when to migrate.
- RPC semantics: at-most-once, at-least-once, exactly-once — the application-layer reasoning gRPC's retry policy interacts with.
- The fallacies of distributed computing (revisited) — why the network unreliability fallacy makes flow control and deadlines load-bearing.