Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
The OTLP protocol
It is 21:47 IST on the night of an IPL playoff at a hypothetical Hotstar streaming-platform NOC, and Aditi, the on-call SRE for the telemetry pipeline, is watching the OTLP exporter queue depth on her own dashboard climb past 90% for the third time this hour. Every climb is followed by a cliff: queue drops to zero, span-emit-rate spikes briefly, then the climb begins again. The pattern is too regular to be load — load is a smooth ramp during an over change, not a sawtooth — and the obvious culprit, "the collector is OOMing", is wrong because the collector logs show nothing. Twelve minutes in, she runs tcpdump -i any -w otlp.pcap port 4317 against the Java app's exporter, opens the pcap in Wireshark with the OTLP protobuf dissector loaded, and sees the answer: every 30 seconds the exporter is sending a 4 MB batch, the gRPC server returns RESOURCE_EXHAUSTED because the default max_recv_msg_size is 4 MiB, the exporter retries with the same batch, and after 5 retries it drops the batch and resets the queue. 30k spans per cliff, 3 cliffs an hour, 90k spans per hour silently lost — and absolutely no alert because the SDK counts the drops in a metric nobody had wired into Prometheus.
OTLP is the wire protocol that every OpenTelemetry SDK speaks to send telemetry off the host. It is a protobuf schema (defined in the opentelemetry-proto repo), wrapped in either gRPC (OTLP/gRPC, port 4317) or HTTP/protobuf (OTLP/HTTP, port 4318), framed in batches that the SDK builds up and the exporter ships. Every SDK in every language speaks exactly this protocol; every collector speaks exactly this protocol; every backend that accepts OTel telemetry speaks exactly this protocol. The bytes-on-wire are the only universal interface in the OTel project — the API, the SDK, the semantic conventions all vary by language and version, but the OTLP bytes are the contract that lets a Java agent's spans land in a Tempo cluster written in Go and queryable from a Grafana dashboard rendered in TypeScript.
OTLP is OpenTelemetry's protobuf-encoded wire format, shipped over gRPC (port 4317) or HTTP/protobuf (port 4318), carrying spans, metrics, logs, and (now) profiles in a Resource → Scope → Signal three-level tree. The SDK batches signals in memory, the exporter sends one Export*ServiceRequest message per batch, and the receiver returns a partial_success or a gRPC error. Knowing the protobuf schema, the size limits, the retry semantics, and the compression ratio is what separates a fleet that absorbs IPL-night load from one that silently drops 90k spans an hour to a 4 MB message-size limit nobody set.
The shape of the message — Resource, Scope, Signal
OTLP is a small, deliberately repetitive protobuf schema. The repo github.com/open-telemetry/opentelemetry-proto holds about 1,500 lines of .proto definitions covering the common, resource, trace, metrics, logs, profiles, and collector packages. Read opentelemetry/proto/trace/v1/trace.proto once and the rest follow the same shape — every signal type is a three-level nested message with the same outer two levels and a signal-specific innermost level.
The three levels are:
- Resource — a flat list of KeyValue attributes describing the thing emitting telemetry. service.name, service.version, host.name, k8s.pod.name, cloud.region live here. Emitted once per ResourceSpans/ResourceMetrics/ResourceLogs block; applies to every signal inside.
- InstrumentationScope (formerly InstrumentationLibrary) — the name, version, and schema_url of the library that produced the signals inside. opentelemetry-instrumentation-flask 0.45b0 with semconv schema 1.27.0 is one scope; the same process's manual instrumentation is another scope; a vendor SDK is a third. Spans inside a scope share the scope's schema_url, which is the only OTLP field that tells a downstream consumer which semantic-conventions version the keys conform to.
- Signal — the actual Span, Metric, LogRecord, or Profile. Each signal carries its own attribute list (the per-event attributes, not the resource ones), timestamps, and signal-specific bodies (span name and timing for spans, sum/gauge/histogram/exponential-histogram payloads for metrics, body and severity for logs).
The same three-level shape works for every signal. ExportMetricsServiceRequest is ResourceMetrics → ScopeMetrics → Metric, where the innermost Metric carries one of Sum, Gauge, Histogram, ExponentialHistogram, or Summary payloads. ExportLogsServiceRequest is ResourceLogs → ScopeLogs → LogRecord. The 2024-stable profiles signal is ResourceProfiles → ScopeProfiles → Profile carrying a pprof-derived inner structure. A receiver that parses the outer two levels generically can dispatch the innermost to a signal-specific handler — and this is exactly what every collector implementation does. Why the repetition matters: a fleet ships spans, metrics, and logs over the same OTLP transport, often through the same Collector pipeline. Sharing the outer two levels means the Resource and Scope are deduplicated across signal types — the Collector's batch processor can group ResourceSpans and ResourceMetrics from the same Resource block, and a downstream backend can join a metric with a log purely by the Resource match. Without the shared outer structure, joining metrics-to-traces for "this error_rate spike came from these 14 traces" would require attribute-level fuzzy matching instead of a structural join.
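A sketch of that generic outer-level walk, using the generated opentelemetry-proto classes (the handle_signal callback is a hypothetical placeholder):
# Walk the shared Resource -> Scope outer levels for any of the three request types
# and hand each innermost signal, plus its Resource and Scope context, to a handler.
from opentelemetry.proto.collector.trace.v1 import trace_service_pb2
from opentelemetry.proto.collector.metrics.v1 import metrics_service_pb2
from opentelemetry.proto.collector.logs.v1 import logs_service_pb2
_FIELDS = {
    trace_service_pb2.ExportTraceServiceRequest: ("resource_spans", "scope_spans", "spans"),
    metrics_service_pb2.ExportMetricsServiceRequest: ("resource_metrics", "scope_metrics", "metrics"),
    logs_service_pb2.ExportLogsServiceRequest: ("resource_logs", "scope_logs", "log_records"),
}
def dispatch(request, handle_signal):
    resource_field, scope_field, signal_field = _FIELDS[type(request)]
    for resource_block in getattr(request, resource_field):
        for scope_block in getattr(resource_block, scope_field):
            for signal in getattr(scope_block, signal_field):
                # Resource and Scope travel once per block; every signal inside shares them.
                handle_signal(resource_block.resource, scope_block.scope, signal)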
The protobuf schema also defines ExportTraceServiceResponse, the response message every Export RPC returns. It carries one optional field: partial_success, with rejected_spans (or rejected_data_points, rejected_log_records) and an error_message. This is how a receiver tells the exporter "I accepted 195 of your 200 spans but dropped 5 because they exceeded a per-span size limit". Most exporters log the partial-success message at WARN level and increment a metric; most fleets do not query that metric and discover the partial drops weeks later. The same metric pipeline that catches semconv_violations_total from chapter 93 should be wired to catch otlp_partial_rejected_spans_total.
See OTLP on the wire
The cleanest way to internalise OTLP is to look at the bytes. The script below stands up a tiny gRPC server that implements the OTLP traces service, has a Python SDK push a single span to it, parses the protobuf message that arrives, and prints both the structured fields and the raw byte length. This is the same exercise Aditi did with tcpdump, but reproducible on a laptop.
# otlp_dissect.py — receive an OTLP/gRPC export, parse it, print the structure
# and the raw byte length so you can see what the wire format looks like.
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
import time
from concurrent import futures
import grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.proto.collector.trace.v1 import (
trace_service_pb2, trace_service_pb2_grpc)
CAPTURED = []
class TraceCollector(trace_service_pb2_grpc.TraceServiceServicer):
def Export(self, request, context):
# request is ExportTraceServiceRequest — already parsed protobuf.
raw_bytes = request.SerializeToString()
CAPTURED.append((request, len(raw_bytes)))
return trace_service_pb2.ExportTraceServiceResponse()
# 1) Stand up a gRPC server on 127.0.0.1:24317 implementing OTLP traces.
srv = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
trace_service_pb2_grpc.add_TraceServiceServicer_to_server(TraceCollector(), srv)
srv.add_insecure_port("127.0.0.1:24317"); srv.start()
# 2) Configure an OTLP/gRPC exporter pointing at our server.
provider = TracerProvider(resource=Resource.create({
"service.name": "hotstar-playback-api",
"service.version": "2.5.0",
"host.name": "ip-10-0-3-87",
}))
provider.add_span_processor(BatchSpanProcessor(
OTLPSpanExporter(endpoint="http://127.0.0.1:24317", insecure=True)))
tracer = provider.get_tracer("instr.flask", "0.45b0",
schema_url="https://opentelemetry.io/schemas/1.27.0")
# 3) Emit a single span — this is what travels over the wire.
with tracer.start_as_current_span("GET /play/{contentId}") as s:
s.set_attribute("http.request.method", "GET")
s.set_attribute("url.template", "/play/{contentId}")
s.set_attribute("http.response.status_code", 200)
time.sleep(0.012) # 12ms of work
provider.force_flush(); time.sleep(0.5)
# 4) Inspect what arrived.
for req, raw_len in CAPTURED:
print(f"raw protobuf size on wire: {raw_len} bytes")
for rs in req.resource_spans:
res_attrs = {a.key: a.value.string_value for a in rs.resource.attributes}
print(f" Resource: {res_attrs}")
print(f" schema_url(resource): {rs.schema_url!r}")
for ss in rs.scope_spans:
print(f" Scope: {ss.scope.name} v{ss.scope.version} schema={ss.schema_url}")
for sp in ss.spans:
attrs = {a.key: (a.value.string_value or a.value.int_value) for a in sp.attributes}
dur_ms = (sp.end_time_unix_nano - sp.start_time_unix_nano) / 1e6
print(f" Span: name={sp.name!r} dur={dur_ms:.2f}ms")
print(f" trace_id={sp.trace_id.hex()} span_id={sp.span_id.hex()}")
print(f" attributes={attrs}")
Sample run:
raw protobuf size on wire: 281 bytes
Resource: {'service.name': 'hotstar-playback-api', 'service.version': '2.5.0', 'host.name': 'ip-10-0-3-87'}
schema_url(resource): ''
Scope: instr.flask v0.45b0 schema=https://opentelemetry.io/schemas/1.27.0
Span: name='GET /play/{contentId}' dur=12.34ms
trace_id=4c8e2a1b9f3d7e6c5a4b8d2e1f0c7a3b span_id=8e7d6c5b4a392817
attributes={'http.request.method': 'GET', 'url.template': '/play/{contentId}', 'http.response.status_code': 200}
281 bytes for one fully-attributed HTTP-server span with three resource attributes and three span attributes. Five lessons in those bytes.
Trace and span IDs are raw bytes, not strings. A trace_id is 16 random bytes (128 bits, hexed for display); a span_id is 8 random bytes (64 bits). On the wire they are bytes fields in the protobuf, not string fields — saving 16 bytes on every trace_id (and 8 on every span_id) over hex-encoding, and producing the binary 0x4c8e2a1b... on the wire that you'd see in tcpdump. Why this matters: every backend stores and queries traces by these IDs. Tempo's bloom filters, Jaeger's Cassandra schema, Honeycomb's column store all index on trace_id as binary. A propagation header (traceparent: 00-4c8e2a1b...-8e7d6c5b...-01) re-hexes the bytes for HTTP transport, but inside OTLP they stay binary. A team that builds a custom OTLP receiver in Python and .decode("utf-8")s the trace_id field gets a UnicodeDecodeError — the bytes are not text.
Timestamps are nanoseconds since epoch as fixed64. start_time_unix_nano and end_time_unix_nano are not human dates and not Unix seconds; they are 64-bit unsigned integers counting nanoseconds. This gives sub-microsecond precision (the span above lasted 12,340,000 nanoseconds) and a maximum representable date around 2554 AD. The fixed64 (vs int64) protobuf encoding means they always take 8 bytes on the wire, no varint encoding — which is the right choice because nanosecond timestamps are always near the upper end of the range and varint would be 9–10 bytes anyway.
Attributes are KeyValue with a oneof value. The AnyValue protobuf has string_value, bool_value, int_value, double_value, array_value, kvlist_value, bytes_value as a oneof. Each attribute pays for its key string and its encoded value plus 2–3 bytes of protobuf framing. A typical 12-character key plus a 3-character string value is about 22 bytes on the wire. The 200-attribute spans some teams ship are 4–5 KB each before compression, which is what feeds the 4 MB batch limit Aditi hit at the top of this article.
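The per-attribute cost is easy to measure with the generated protobuf classes that ship alongside the exporter (the key and value below are just an illustration):
from opentelemetry.proto.common.v1.common_pb2 import AnyValue, KeyValue
kv = KeyValue(key="http.response.status_code", value=AnyValue(int_value=200))
print(len(kv.SerializeToString()))  # ~32 bytes: the 25-char key dominates; the int itself is 2 bytes of varint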
The schema_url lives on the Resource AND on the Scope. Two different fields, two different concepts. The resource-level schema_url is the conventions version that the resource attributes conform to (service.*, host.*, k8s.*); the scope-level schema_url is the version the signal attributes conform to. They can disagree — and in a fleet running mixed agent versions they routinely do. A backend that wants to translate between versions has to honour both, separately.
The whole 281-byte payload is what gRPC compresses. With gRPC's gzip compression enabled (grpc.default_compression_algorithm = GRPC_COMPRESS_GZIP), this typically shrinks 60–70% — a 200-span batch dropping from 64 KB to 24 KB on the wire. The protobuf encoding is already compact; gzip recovers the entropy from repeated attribute keys and host-name strings. Why this is worth knowing: an OTLP exporter in production should always have compression enabled. The CPU cost is ~3% on the sender and ~2% on the receiver; the bandwidth savings are 60%+ on real telemetry. Disabling it (the default in some older SDK versions) means paying 2–3× the cross-AZ data-transfer bill for no win — and on a Hotstar-shape platform with 80 microservices, that bill is real money.
The same script with the OTLP/HTTP exporter (opentelemetry-exporter-otlp-proto-http) produces nearly identical bytes — same protobuf schema, same encoding, just framed in an HTTP POST to /v1/traces instead of a gRPC stream. The choice between OTLP/gRPC and OTLP/HTTP is operational (gRPC needs HTTP/2 and modern load balancers, HTTP/protobuf works through any HTTP proxy) not protocol-level. Both speak the same protobuf shape.
One more detail worth seeing in the bytes: the kind field on a span. The protobuf enum Span.SpanKind has values INTERNAL, SERVER, CLIENT, PRODUCER, CONSUMER. The script above produced an INTERNAL span (the SDK's default when you call start_as_current_span without a kind argument), which is technically wrong for an HTTP-server endpoint. A correctly-instrumented Flask GET /play/{contentId} span should be SERVER, and the corresponding outbound requests.get call inside it should be CLIENT. The kind affects how backends compute service-level metrics — Tempo's metrics-generator and Honeycomb's auto-RED dashboards both filter on kind=SERVER when computing per-service request rates, because counting INTERNAL spans would double-count work. Auto-instrumentation libraries (opentelemetry-instrumentation-flask, opentelemetry-instrumentation-requests) set kind correctly; manually-written spans frequently do not, and the symptom is "my dashboard shows zero traffic" because the kind-filter is eliminating all the manual spans.
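Reusing the tracer from the script above, a minimal sketch of setting the kind explicitly (the child span name is hypothetical):
from opentelemetry.trace import SpanKind
with tracer.start_as_current_span("GET /play/{contentId}", kind=SpanKind.SERVER) as s:
    s.set_attribute("http.request.method", "GET")
    # the outbound call this endpoint makes; SpanKind.CLIENT marks it as such
    with tracer.start_as_current_span("drm-license-lookup", kind=SpanKind.CLIENT):
        pass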
Batching, retries, and the failure modes the spec defines
The OTLP spec (opentelemetry.io/docs/specs/otlp) does more than define the protobuf schema — it specifies how an exporter must batch, retry, and degrade. These are the rules Aditi's agent was breaking, and they are the rules every production exporter has to honour to avoid silently dropping data.
The exporter side is conceptually simple: the SDK's BatchSpanProcessor (or BatchLogRecordProcessor, or the PeriodicExportingMetricReader for metrics) accumulates signals in an in-memory queue, and on a timer or queue-fullness trigger it pulls a batch out and hands it to the exporter. The exporter packs the batch into an ExportTraceServiceRequest, opens (or reuses) a gRPC stream, sends the request, and waits for the response. The defaults in the Python SDK are 512 spans per batch, 2048 spans queue capacity, 5 second flush interval, 30 second export timeout. These look reasonable until you do the arithmetic — 512 spans at ~600 bytes each is ~300 KB uncompressed, ~120 KB compressed, well under any realistic gRPC limit. The trouble starts when teams crank max_export_batch_size to 4096 or 8192 to reduce overhead, or when individual spans grow past 5 KB because someone added 50 attributes including a stringified request body. 8192 spans × 5 KB = 40 MB uncompressed, 16 MB compressed — past the default gRPC max_recv_msg_size of 4 MiB on the server side.
The retry behaviour is what makes the failure modes subtle. The OTLP spec says exporters MUST retry on transient errors (UNAVAILABLE, DEADLINE_EXCEEDED, RESOURCE_EXHAUSTED, ABORTED, internal HTTP 502/503/504) and MUST NOT retry on permanent errors (INVALID_ARGUMENT, PERMISSION_DENIED, HTTP 400/401/403/422). The retry MUST use exponential backoff with jitter — typically 1s, 2s, 4s, 8s, 16s — and MUST cap at a configurable maximum (5 attempts in the Python SDK, 5 in the Go SDK). After exhausting retries, the batch is dropped and a metric is incremented; the spans inside are not re-queued for the next batch. This is the correct behaviour — re-queueing would let a stuck downstream poison the in-memory queue indefinitely — but it means the dropped spans are gone forever.
The "retry with the same batch" part is what bit Aditi. A 4 MB batch that exceeds the receiver's max_recv_msg_size returns RESOURCE_EXHAUSTED, which is in the retry list; the exporter retries the same too-large batch five times, fails five times, then drops. The spec is correct that RESOURCE_EXHAUSTED is retryable in general (it could mean "the receiver is briefly out of memory, try again"); it just happens to also mean "the batch is too big" in this specific case, and the exporter cannot tell the difference. The fix the SDKs are converging on is to detect the message-too-large error specifically (gRPC has a separate error string for it) and split the batch, but as of Python SDK 1.30 the heuristic is not yet enabled by default. The pragmatic answer in 2026 is to monitor the grpc.status of failed exports and to set the receiver's max_recv_msg_size to at least 16 MiB — high enough that no realistic batch hits it, even when someone misconfigures max_export_batch_size.
The other failure mode the spec defines is partial success. An ExportTraceServiceResponse with partial_success.rejected_spans = 5 and partial_success.error_message = "span exceeds max_attributes_per_span" tells the exporter "I accepted 195 of your 200 spans but dropped 5". The exporter increments the partial-failure metric and does not retry — partial success is a terminal state. Most exporters log this at WARN; few teams alert on the metric. A team that does, catches the slow-creep of "span attributes growing past the receiver's limit" months before any user-visible impact.
The OTLP/HTTP variant has the same retry rules with HTTP status codes mapped onto them — 408 (request timeout), 429 (too many requests), 500-503 are retryable; 400, 401, 403, 422 are not. 429 is special: the response can include a Retry-After header in seconds, and the exporter MUST respect it. This is the spec-level mechanism for backpressure — a backend overloaded at 22:00 IST during the IPL final can return 429 Retry-After: 30 to every exporter for 30 seconds, and well-behaved exporters back off, queue locally, and retry. Backends that refuse to implement 429 (just OOM and drop the connection) inflict the work of backpressure on every exporter individually, with worse results because individual exporters have no global view.
The interaction between batch sizing and the retry budget is the second-order effect most platform teams miss. Why batch size is a load-shedding lever: a batch that normally takes 200ms to send but exhausts its retries, timing out at 30s on the final attempt, has consumed ~62 seconds of total exporter time (1+2+4+8+16 = 31s of backoff plus the 30s export timeout). During those 62 seconds the in-memory queue is filling at the app's full emit rate, so a sustained downstream outage produces queue-full drops within ~30 seconds even if the per-batch retry budget looks generous. Smaller batches with shorter timeouts give the exporter more chances to recover before the queue fills; larger batches are more efficient per-byte but starve the queue under failure. The sweet spot for most fleets is max_export_batch_size = 512, export_timeout = 10s, max_queue_size = 4096 — small enough to clear a transient blip in 5 seconds, large enough to not waste CPU on per-batch overhead.
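A minimal sketch of those settings applied to the Python SDK's BatchSpanProcessor (parameter names per opentelemetry-sdk; the endpoint is hypothetical):
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector.internal:4317", insecure=True,
                     timeout=10),        # per-attempt export timeout, seconds
    max_queue_size=4096,                 # ~8 batches of headroom while a blip clears
    max_export_batch_size=512,           # keeps batches far below any gRPC message-size limit
    schedule_delay_millis=5000,          # flush every 5s even when the batch is not full
    export_timeout_millis=10_000,        # give up on a single flush after 10s
))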
Compression, transport, and the operational tuning that pays back
The OTLP spec is transport-agnostic but the two transports — OTLP/gRPC over HTTP/2 on port 4317 and OTLP/HTTP (protobuf-bodied POST) on port 4318 — have measurably different operational profiles, and the choice affects bandwidth, latency, and how telemetry traverses your network.
Compression is the first lever. Both transports support gzip; OTLP/gRPC also supports deflate and zstd (where the implementation has it); OTLP/HTTP supports gzip via the Content-Encoding header. On real telemetry — repeated attribute keys, repeated host names, repeated string enums — the compression ratio is typically 60–75% with gzip and 65–80% with zstd. The CPU cost on the sender is ~2–4% of process CPU at 10k spans/sec; the cost on the receiver is similar. The bandwidth savings on an 80-microservice fleet at IPL-final scale are real money — a hypothetical Hotstar-shape platform shipping 50 GB/hour of uncompressed OTel traffic across availability zones is paying ₹8–12 lakh/month in cross-AZ data-transfer charges; gzip drops that to ₹3–5 lakh/month, zstd to ₹2.5–4 lakh. The default-on-compression setting is what every fleet should run; the Compression field in the OTLP exporter config exists precisely so you can name it explicitly in code rather than inheriting whatever the SDK default happened to be in the version you pinned.
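Naming it explicitly can be as small as setting the spec-defined environment variables before the exporter is constructed (the endpoint below is hypothetical); SDKs also accept a compression argument in code, as the repro snippet at the end of this article shows:
import os
# Read by the SDK when the OTLP exporter is constructed.
os.environ["OTEL_EXPORTER_OTLP_PROTOCOL"] = "grpc"
os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"] = "http://otel-collector.internal:4317"
os.environ["OTEL_EXPORTER_OTLP_COMPRESSION"] = "gzip"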
The transport choice — gRPC vs HTTP — is mostly about your network infrastructure, not protocol differences. gRPC needs HTTP/2 end-to-end, which means modern L7 load balancers (AWS ALB, GCP HTTPS LB, Envoy, Nginx ≥1.13.10 with http2 directive); legacy L4 TCP load balancers that don't speak HTTP/2 will break gRPC streams. OTLP/HTTP works through any HTTP/1.1-capable proxy, including older corporate proxies, and is the right choice when telemetry has to traverse a network you don't fully control. The latency difference is small (gRPC saves ~5–10ms per export by reusing a single multiplexed stream; HTTP needs a new HTTP/1.1 request per batch unless keepalive is preserved). The throughput difference is also small at typical telemetry rates. Pick by infrastructure capability, not by perceived "gRPC is faster".
The Collector (chapter 91) is where most fleets do the transport translation: an in-cluster Collector accepts OTLP/gRPC from sidecar agents (low overhead, fast intra-cluster) and re-exports to the backend over OTLP/HTTP (works through cloud egress proxies, simpler TLS). This is the standard topology and is worth understanding because it affects where you tune the message-size limits. The intra-cluster Collector should accept up to 16 MiB messages (max_recv_msg_size); the backend can be tuned independently. A common subtle bug: the Collector accepts a 12 MiB batch from the agent fine, then tries to forward it to the backend whose limit is 4 MiB, and the forward fails. The fix is to set the Collector's batch processor's output size cap to match the minimum of all downstream limits.
The mTLS layer is mandatory for production. Both OTLP/gRPC and OTLP/HTTP support TLS; the spec describes the client-side configuration knobs (OTEL_EXPORTER_OTLP_CERTIFICATE, OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE, OTEL_EXPORTER_OTLP_CLIENT_KEY). The non-spec part is rotation discipline — if your client cert expires, every span emitted in the next 12 hours fails at export and is dropped, and the only signal is the export-failure metric. The Razorpay-shape platform team holds cert rotation in the same automation as their semconv version pinning; the Hotstar-shape one rotates via a service mesh's automatic mTLS (Istio, Linkerd) and trades the spec-level configuration for a mesh-level one. Either is fine; the antipattern is "we'll get to it" combined with no expiration alert.
Edge cases that bite — what the SDKs do not warn you about
The spec tells you what every conformant exporter MUST do; what it does not tell you is the half-dozen edge cases where exporters silently differ across SDKs and where production fleets discover bugs only at scale. Knowing these saves you the 03:00 IST debugging session.
The force_flush race on shutdown. BatchSpanProcessor.force_flush() blocks until the in-flight batch finishes exporting or the timeout (default 30s) expires. On a clean shutdown — Kubernetes sends SIGTERM, the app's atexit hook calls force_flush — this works. On a SIGKILL (the 30-second graceful-shutdown window expires), the queue's pending spans are lost and there is no opportunity to even count them. The fix is to keep the queue small enough that any in-memory contents fit inside one force_flush-sized export — max_queue_size = 2 * max_export_batch_size is a safer default than the SDK's 4×. Razorpay-shape services that handle UPI payments cannot tolerate even a 30-second loss of trace data on rolling deploys, and they pin the queue size accordingly.
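A minimal sketch of the SIGTERM half of that discipline, assuming provider is the TracerProvider from earlier (the SDK already registers its own atexit shutdown hook; the signal handler is the part you add):
import signal, sys
def _drain_telemetry(signum, frame):
    # Bounded flush of whatever is still queued, then let the normal shutdown continue;
    # sys.exit() triggers the SDK's atexit shutdown hook as well.
    provider.force_flush(timeout_millis=5_000)
    sys.exit(0)
signal.signal(signal.SIGTERM, _drain_telemetry)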
The async-exporter thread crashes silently. The Python BatchSpanProcessor runs the exporter on a background thread; if that thread crashes (a malformed span causes a serialization error, the exporter's gRPC client throws an unexpected exception type), the thread dies and the queue keeps filling but no export ever fires. The SDK does not restart the worker thread. The signal is a steady climb in otel_sdk_span_processor_dropped_total{reason="queue_full"} with zero corresponding otel_exporter_export_* activity — exports stopped happening entirely. The fix is to wrap the exporter in a process that catches and logs unhandled exceptions, and to alert specifically on "queue dropping with no recent export activity" as a separate condition from "queue dropping during high export rate".
This failure is harder to detect than it sounds because the Python SDK does not currently expose an "exporter thread alive" health signal. A proxy is to wire otel_exporter_export_duration_seconds_count (the export-attempt counter) into Prometheus and alert on rate(...[5m]) == 0 AND rate(otel_sdk_span_processor_dropped_total[5m]) > 0 — exports stopped, drops climbing, exporter is dead. A handful of platform teams have gone further and run a periodic synthetic-span emitter (a small library that emits one span per minute and verifies the exporter actually transmitted it) as a continuous health check; this catches the dead-thread bug within 60 seconds rather than waiting for the queue to fill.
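A sketch of that synthetic emitter, assuming force_flush's boolean return is a usable proxy for "the batch actually went out within the timeout" (scope name and warning path are hypothetical):
import threading, time
from opentelemetry import trace
def telemetry_heartbeat(provider, interval_s=60):
    tracer = trace.get_tracer("telemetry.heartbeat")
    while True:
        with tracer.start_as_current_span("telemetry.heartbeat"):
            pass
        if not provider.force_flush(timeout_millis=10_000):
            # Exporter thread dead or receiver unreachable; surface this loudly,
            # e.g. bump a counter the alerting pipeline watches.
            print("WARNING: OTel heartbeat span failed to flush")
        time.sleep(interval_s)
threading.Thread(target=telemetry_heartbeat, args=(provider,), daemon=True).start()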
Span attribute value-size limits truncate silently. The spec defines an attribute_value_length_limit (commonly set to 4096 chars per string attribute, and unset by default in some SDKs), and receivers and Collectors can impose their own caps on top. Strings longer than the limit are truncated with no partial-success signal — the truncated value is simply what gets stored. Code that puts a 50 KB JSON request body on a span as request.body discovers six months later that every value is truncated to 4 KB. The fix is to never put unbounded blobs on attributes; structured request bodies belong in linked logs, not span attributes. Truncation is the only safe behaviour once a limit is hit — the silence about the truncation is the trap.
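Setting the SDK-side cap explicitly is one line of defence, assuming the SpanLimits knobs exposed by recent opentelemetry-sdk versions:
from opentelemetry.sdk.trace import TracerProvider, SpanLimits
provider = TracerProvider(span_limits=SpanLimits(
    max_attributes=64,           # per-span attribute count cap
    max_attribute_length=4096,   # string values longer than this are truncated, silently
))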
Clock-skew across processes shows up as negative durations. A span's start_time_unix_nano is set on the emitting process; a client span's parent uses the parent's clock; if the two processes' clocks differ by 50ms (NTP drift, especially on VMs in different AZs), a child span can have a start_time earlier than its parent's start_time. Backends that compute critical-path latency by subtracting timestamps see negative values and either clamp them to zero (Tempo) or display them weirdly (Jaeger's old UI showed negative-width bars). OTLP itself does not adjust for this — timestamps are wall-clock and that is exactly what the protocol says they should be. The discipline is to run good NTP across all hosts and to be suspicious of "this span finished before it started" displays — they are usually skew, not bugs in your code.
The trace-id zero-value bug. A trace_id of all zeros (0x00000000000000000000000000000000) is a valid 16-byte value but the OTel spec defines it as "invalid" — a span carrying it should be rejected at the receiver. Most receivers do reject it; some older Collector versions did not, leading to a phantom "trace" containing every span that lost its trace_id during a propagation bug. The signal is "one trace_id with millions of spans" in your trace-search UI. The fix is on the source side — debug the propagation library that emitted a zero trace_id — but knowing the receiver is supposed to reject it lets you pinpoint where the bug crept in.
A related symptom is a span_id of all zeros, which the spec also defines as invalid. This typically manifests when a manually-written span explicitly sets parent_span_id = b'\x00' * 8 instead of leaving it unset for a root span — protobuf treats an unset bytes field as empty (length 0), which receivers correctly interpret as "this is a root span with no parent", but a forced zero-valued parent_span_id is interpreted as "this span has a parent whose ID is invalid" and the span is rejected or assigned to a phantom parent. The Python SDK never produces this bug; manually-instrumenting code in Go and Java occasionally does when developers paste in a [8]byte{} literal. The fix is to leave parent_span_id unset entirely for root spans.
The 64-bit-vs-128-bit trace_id ambiguity. The OTel spec mandates 128-bit trace_ids; the older Jaeger and Zipkin specs allowed 64-bit. A Jaeger-instrumented service propagating its 64-bit trace_id to an OTel-instrumented service via the uber-trace-id header (Jaeger's propagation format) will produce a 128-bit trace_id where the upper 64 bits are zero — 0x0000000000000000abcd1234deadbeef. This is technically valid, technically queryable, but breaks any backend that uses the upper 64 bits as a sharding key (Tempo's bloom filters do). Mixed-instrumentation fleets see uneven shard utilisation as a symptom. The fix is to migrate every service to 128-bit propagation (W3C tracecontext) and decommission the legacy headers — the spec change has been stable since 2020 but legacy libraries linger.
The rare-but-painful corollary is when a fleet has both 64-bit-prefix-zero traces and full 128-bit traces in the same backend. Cross-trace search by partial-trace-id-prefix returns either set depending on whether the user typed 16 hex chars or 32; on-call engineers occasionally pull the wrong half of the trace tree without realising. The discipline is to enforce 128-bit at the ingestion edge — the Collector can reject traces whose upper 64 bits are zero with an OTTL rule — and to fix the upstream services whose spans are being rejected. This forces the migration rather than letting it linger indefinitely.
Common confusions
- "OTLP and OpenTelemetry are the same thing." They are not. OpenTelemetry is the project (API spec, SDKs in 11+ languages, semantic conventions, the Collector). OTLP is one of OTel's outputs — the wire protocol. Other parts of OTel — the API a developer codes against, the SDKs that buffer and process spans before export — never appear on the wire. OTLP is how telemetry leaves the process, not how it gets created.
- "OTLP/gRPC and OTLP/HTTP carry different data." They do not. Both transports carry the exact same protobuf-encoded
ExportTraceServiceRequest(or metrics/logs/profiles equivalent). The wire bytes are identical; only the framing differs (gRPC stream-framed vs HTTP body-framed). A mitmproxy of either transport would reveal the same protobuf shapes. - "A span's
trace_idis a string." It isbytes— 16 raw bytes, hex-encoded only for human display. The W3Ctraceparentheader re-hexes it for HTTP propagation, but inside OTLP it stays binary. A custom OTLP receiver that string-decodes the field will crash on the first non-UTF-8-valid byte. - "OTLP retries forever on errors." It does not. The spec mandates exponential-backoff retries with a maximum (typically 5 attempts) on a defined list of transient error codes. After max retries, the batch is dropped and counted in the export-failure metric. A drop is final — the spans are gone, not re-queued.
- "
partial_successmeans I should retry the rejected spans." It does not. Partial success is a terminal state — the receiver has accepted what it accepted and the exporter must move on. Re-sending the rejected spans would re-trigger the same rejection. The fix is to address the root cause (span too large, attribute count over limit, payload exceeded a value-size cap), not to retry. - "I can ignore
schema_urlbecause all my services run the same SDK." You cannot. As soon as one team upgrades their auto-instrumentation jar, they ship spans with a differentschema_url. Every queryable attribute key may rename. A backend or downstream pipeline that ignoresschema_urlcannot translate between versions and dashboards silently break — exactly the failure mode in chapter 93.
Going deeper
The four metrics every fleet should wire from the SDK
Every OTel SDK exposes a small set of self-observability metrics that a production fleet must scrape and alert on. They are:
- otel_sdk_span_processor_dropped_total{reason="queue_full"} — drop #1: the queue filled and new spans were rejected at intake; signals "the exporter cannot keep up with the app's emit rate".
- otel_exporter_export_failure_total{reason="<grpc_status>"} — drops #2 and #3: the batch failed to send after retries; signals "the receiver is unhealthy or unreachable, or the batch is malformed".
- otel_exporter_export_duration_seconds — a histogram of export latencies; the p99 climbing past the export timeout is the leading indicator that drops are coming.
- otel_exporter_partial_rejected_total — the partial-success path: spans accepted-with-rejections; signals attribute hygiene problems before they snowball.
Wire all four into Prometheus with alert rules: rate(otel_sdk_span_processor_dropped_total[5m]) > 0 is a paged alert because it means user-visible telemetry is being lost; the others are warning-level. The Razorpay-shape platform team that wired these metrics three years ago has a span-loss rate measurable in single digits per million; the team that did not has periodic "we lost an hour of payment traces" incidents that nobody can root-cause until the export-failure metric is finally added.
The metric names above match the Python SDK's exposed names; Java and Go SDKs use slightly different names (otelcol_exporter_send_failed_spans on the Collector side, for example), but the four conceptual signals are universal. The cleanest pattern is to set up a single Grafana dashboard called "OTel SDK health" with four panels — queue-fill rate, export failure rate by gRPC status code, export-duration p99, and partial-rejected rate — and to import that dashboard into every service's Grafana folder via the Grafana JSON API. Service teams see their own health; the platform team sees a fleet-wide aggregate panel that lights up the moment any service's exporter starts struggling. This is the meta-observability layer — observability of the observability pipeline itself — and it is the single most underinvested area in most fleets, second only to the schema-version pinning discipline from chapter 93.
The protobuf wire-encoding tricks that make OTLP small
Protobuf is already a compact binary format; OTLP's schema design takes advantage of three specific encoding tricks. Varint encoding for small integers: most attribute values that are integers (HTTP status codes, port numbers, retry counts) are small enough to fit in 1–2 bytes via varint, vs the fixed 4 bytes a fixed32 field would cost. Repeated scalars packed by default: an array of 50 integer values takes ~50–100 bytes packed, vs 150–200 bytes if each value were separately tag-prefixed. Nested messages with shared resource: a ResourceSpans message holds the Resource once and references it by structure, not by repetition; 200 spans in one ResourceSpans share the ~120-byte Resource block once. The protobuf wire format is also friendly to forward compatibility — a new field added in a future spec version is silently ignored by an older receiver, which is why the project can add fields like entity_refs (a 2024 addition) without breaking older tooling.
The trade-off this design makes is human-readability. OTLP bytes are not text-grep-able; you cannot tcpdump -A to spot-check them. The mitigation is the debug exporter, which logs OTLP messages as pretty-printed protobuf, and the OTel Collector's logging exporter, which does the same. In production debugging, run a sidecar Collector with logging exporter to dump OTLP messages to a JSON-lines file you can jq against.
gRPC keepalives and the long-lived-stream traps
OTLP/gRPC is designed to use a single long-lived gRPC stream per process, multiplexing many Export calls across it. This is more efficient than per-batch connection setup (no TLS handshake per batch, no TCP slow-start) but introduces failure modes specific to long-lived HTTP/2 streams. The most common are load-balancer idle timeouts — AWS ALB defaults to 60s, GCP HTTPS LB to 600s — which silently close the stream if no batch is sent for the timeout. The exporter detects this on the next send (UNAVAILABLE error), reconnects, retries; the retry succeeds and the user sees no impact, but every reconnect adds 5–10ms latency to that batch. The fix is to enable gRPC keepalives (keepalive_time_ms = 30000 on the client) so the stream stays warm.
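The keepalive knobs are grpcio channel options; whether a given SDK version lets you pass them through the exporter constructor varies, so the sketch below names them on a raw channel (the endpoint is hypothetical):
import grpc
channel = grpc.insecure_channel(
    "otel-collector.internal:4317",
    options=[
        ("grpc.keepalive_time_ms", 30_000),          # ping after 30s of silence so idle LBs don't reap the stream
        ("grpc.keepalive_timeout_ms", 10_000),       # declare the connection dead if no ack within 10s
        ("grpc.keepalive_permit_without_calls", 1),  # ping even when no Export call is in flight
    ],
)
# On the receiver side, the message-size cap discussed earlier uses the same option scheme:
# grpc.server(thread_pool, options=[("grpc.max_receive_message_length", 16 * 1024 * 1024)])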
The other trap is gRPC's HTTP/2 max-concurrent-streams limit, which defaults to 100 on most servers. A high-fanout service with 100+ concurrent in-flight exports (rare, but possible at peak load with many parallel exporters in the same process) can hit the limit and queue locally. This manifests as export latency spiking sharply at a load threshold, then plateauing. The fix is to raise the server-side limit (grpc.max_concurrent_streams = 1000) or, more honestly, to consolidate exporters into one per process — there is no good reason to have multiple OTLP exporters in a single process.
A third subtle gRPC behaviour is HTTP/2 GOAWAY frames. When a Collector or backend wants to gracefully drain (rolling deploy, scale-down), it sends a GOAWAY frame on every active stream telling clients "do not start new streams; complete in-flight ones and reconnect to a different instance". Well-behaved gRPC clients honour this — they finish the current Export call, reconnect (typically through a service-mesh sidecar that picks a new healthy instance), and continue. Older or buggy clients ignore GOAWAY and keep sending into the closing stream, which the receiver then has to reject with UNAVAILABLE, causing a brief retry storm. Newer Python and Go SDKs handle this correctly; the failure mode lives mostly in legacy Java agents pinned to gRPC versions before 1.40. A platform team migrating its Collector pool should test rolling-deploy behaviour explicitly with each language's agent before rolling out at scale.
The 2024 additions: profiles, entity refs, and the spec's evolving surface
The OTLP spec is not frozen. The 2023–2024 release line added a fourth signal type — Profile, carrying pprof-derived continuous-profiling data — covered in chapter 105. The 2024 stable release added entity_refs to Resource, a way to declare that a resource refers to an external entity (an AWS instance ID, a Kubernetes pod UID) without copying every attribute of that entity into every signal — a structural deduplication that backends with entity catalogs can use for join optimisations. The 2025 line is adding events_v2 (a richer log-shaped event signal that supersedes span events) and incremental improvements to the exponential-histogram metric type. A fleet's OTLP version-pinning discipline is the same shape as their semconv version-pinning discipline: track the changelog, plan the upgrades, audit before agent upgrades.
The spec also has an ongoing debate about whether to add a server-streaming response — currently OTLP/gRPC is unary (one request, one response), but a server-streaming Export could let the receiver send progress updates during a slow batch ingest. The design tension is between the simplicity of unary RPC (every existing exporter implementation works) and the operational visibility of streaming (the receiver can warn the sender mid-export that it is approaching a quota). As of 2026 the consensus is to keep unary as the default and add streaming as an opt-in feature for specific high-throughput use cases. Watching this debate gives a platform team early signal on what their exporter configuration will need in 12 months.
The two specs co-evolve closely — a new Resource attribute in semantic conventions immediately becomes a new field on the wire (always backwards-compatible because new fields are silently ignored by old receivers); a new top-level field in OTLP often appears in semantic conventions later as the canonical way to populate it. Reading the OTLP CHANGELOG and the semantic-conventions CHANGELOG together is what gives a platform team the right lead-time for upgrades — typically 6 months from "added experimentally" to "default in latest agents", which is plenty of runway to update Collector pipelines and dashboards if anyone is paying attention to the changelogs.
Why OTLP is faster to evolve than its predecessors — Jaeger and Zipkin
OTLP replaced two prior protocols whose evolution was painful enough to motivate building OTLP from scratch. Jaeger's Thrift-over-UDP protocol limited messages to ~63 KB (the UDP datagram size) and forced agents to fragment large traces; Zipkin's JSON-over-HTTP protocol was easy to read but bandwidth-expensive (typical 5–10× the protobuf-encoded size) and lacked a formal schema. Both predated the cross-vendor consensus that the OTel project later built. OTLP's design choices — protobuf for compactness, gRPC for streaming, a versioned schema with schema_url, an explicit partial_success channel — are direct responses to the operational pain of running Jaeger and Zipkin at scale.
The clearest evidence is the migration footprint. A fleet running Jaeger's Thrift protocol pre-2023 typically had three different agent versions in production at once because the Thrift schema kept evolving in non-backwards-compatible ways; an OTLP fleet running protobuf can absorb four years of spec evolution because every new field is additive. The Hotstar-shape platform that ran Jaeger pre-OTel (hypothetically) spent 18 months migrating to OTLP and reported the migration paid back in 12 months purely from bandwidth savings — Zipkin JSON traces averaged 1.8 KB per span over the wire, OTLP+gzip averages 280 bytes, a 6× reduction at no fidelity loss. The cross-AZ data-transfer bill alone justified the migration before any developer-experience improvements were factored in. This is the unsung reason every major observability vendor — Honeycomb, Datadog, New Relic, Splunk — adopted OTLP within 24 months of the spec stabilising: the wire-format efficiency gain compounds across every customer simultaneously.
Reproduce this on your laptop
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc grpcio
python3 otlp_dissect.py
# Watch: 281 bytes for one fully-attributed HTTP span. Add 199 more spans
# in the same scope and rerun — the byte count grows ~600 bytes per span,
# Resource and Scope blocks shared across all of them.
# To see compression effects: enable gzip on the exporter via
# OTLPSpanExporter(endpoint=..., insecure=True, compression=Compression.Gzip)
# and compare raw_bytes before vs after.
Where this leads next
The next chapter /wiki/the-w3c-tracecontext-spec covers the propagation half of the protocol — how trace_id and span_id travel between processes via HTTP headers (traceparent, tracestate) so a downstream service knows it is participating in an upstream trace. OTLP is what carries the finished spans to the backend; tracecontext is what carries the in-flight trace identity between services. Together they let a Hotstar request that touches 80 microservices land in Tempo as a single connected tree.
The orthogonal direction is what runs between the SDK and the backend — the OTel Collector (chapter 91) accepts OTLP, runs processors (filtering, attribute manipulation, tail-based sampling, semconv translation) and re-exports, often again as OTLP. Understanding OTLP at the wire level is the prerequisite for understanding what a Collector pipeline can and cannot manipulate without re-encoding.
Two further chapters build directly on the wire format covered here. /wiki/auto-instrumentation covers the agents that produce OTLP messages without code changes, and the version-coupling discipline between agent version, semconv version, and OTLP schema. /wiki/the-data-model covers the conceptual model — Resource, Scope, Signal — at a higher altitude than the protobuf schema in this article, and is the right read after this one if you want the why behind the three-level structure.
Aditi's incident at 21:47 IST ended at 22:33 with a Collector deployed in front of the Java app and max_recv_msg_size set to 16 MiB on both endpoints. The 30-second sawtooth on the queue-depth dashboard flattened to a smooth ramp. Two days later the platform team merged a BatchSpanProcessor config that capped max_export_batch_size at 1024 and added all four self-observability metrics to the platform Prometheus. The next IPL playoff went out with 12 million concurrent viewers, 220k spans/sec across the fleet, and zero spans dropped to message-size limits — because the protocol was no longer a black box.
The deeper lesson from Aditi's incident is the one every senior SRE eventually internalises about telemetry pipelines: the protocol is part of the system, not an implementation detail. A team that treats OTLP as "the wire format the SDK happens to use" is one configuration knob away from silently dropping data. A team that treats OTLP as a system component — with its own SLOs, its own dashboards, its own runbooks — catches drops within minutes instead of weeks. The shift is not technical, it is organisational: someone must own the telemetry pipeline as a product with a stated availability target. At a Razorpay-shape platform that owner is typically the platform-engineering lead; at a Hotstar-shape one it is the SRE platform team. Either way, the OTLP protocol's bytes, retries, and partial-success semantics are now part of their on-call surface, not a vendor's problem.
That ownership is what closes the loop between the protocol and the people who depend on it. The Razorpay platform team that owns the OTel pipeline runs a quarterly "OTLP audit" — they pull a week of otel_exporter_export_failure_total data, group it by service and gRPC status code, and identify the top three services whose failure rates have crept up. Each one becomes a small platform-engineering project: a batch-size tune, a Collector route change, a cert rotation. The work is unglamorous but compounds — the fleet's tail of "we lost spans here last month" incidents shrinks every quarter, and the on-call rotation gradually loses the OTLP-related pages entirely. That, ultimately, is the test of whether you understand the protocol: can you make it boring?
References
- OpenTelemetry Protocol Specification — the binding spec for OTLP/gRPC and OTLP/HTTP, including retry, batching, and partial-success semantics.
- opentelemetry-proto repository — the protobuf source-of-truth for Resource, Scope, Span, Metric, LogRecord, Profile, and the collector-service RPCs.
- OTLP/gRPC vs OTLP/HTTP — choosing a transport — the spec's own comparison and the operational guidance behind it.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 9 on event-shaped data and the wire-level economics of telemetry.
- gRPC keepalive guide — the keepalive parameters every long-lived OTLP/gRPC client should set.
- Protocol Buffers encoding guide — the underlying wire encoding that makes OTLP messages compact.
- /wiki/semantic-conventions — internal: the attribute-naming spec whose schema_url field travels in every OTLP message.
- /wiki/the-collector-receivers-processors-exporters — internal: the box that sits between the SDK exporter and the backend, speaking OTLP on both sides.