Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Wall: OpenTelemetry is the standard — understand it deeply
It is 11:42 IST on a Wednesday at a hypothetical Bengaluru travel marketplace we will call YatraNow. Aditi, a platform engineer two years into her first job, is asked to "wire up OTel" for the new pricing-engine service so the dashboard team can correlate price-quote latency with the rest of the request graph. She follows the official quickstart, drops in three pip install lines, ticks the auto-instrumentation checkbox, deploys to staging — and the dashboard team comes back the same evening with three complaints. The service shows up in Tempo but with service.name=unknown_service:python. The metrics from prometheus-client and the metrics from opentelemetry-sdk have the same names with different histograms attached. The Collector pod is using 4.2 GB of memory in staging — an order of magnitude more than the service it observes. Aditi reads the OpenTelemetry docs for two hours, finds three separate "getting started" guides that disagree about what to put in OTEL_EXPORTER_OTLP_ENDPOINT, and quietly starts thinking of OTel as "that thing where every config line is in a different env variable spelling".
The mistake here is not Aditi's. The mistake is the framing — the framing that OpenTelemetry is a library you install instead of a spec, an SDK, a wire format, and an operational pattern you have to understand at all four levels. Part 13 ends here because the next eight chapters (Part 14) only make sense if you stop treating OTel as a checkbox and start treating it as the engineering object the rest of the curriculum depends on.
OpenTelemetry is four things at once — a specification (semantic conventions, the data model), an SDK (the in-process emitter), a wire format (OTLP over gRPC and HTTP), and an out-of-process processor (the Collector). Treating it as one thing — usually "the SDK you pip install" — produces the predictable failures: missing service.name, double-counted metrics, exploding Collector memory, vendor lock-in through proprietary attribute names. The reader who finishes this chapter knows which layer to debug when something breaks at 02:00 IST and why the answer is almost never "upgrade the SDK".
The four things OpenTelemetry actually is
The marketing page lists three pillars and one logo. The engineering reality is four distinct surfaces, each with its own failure modes and its own debugging tools.
The first is the specification — a set of versioned documents at opentelemetry.io/docs/specs that define the data model (what a span, a metric, a log record looks like, field by field), the semantic conventions (how to name http.method, db.system, messaging.operation), and the API contracts (what methods an SDK must expose). The spec is where vocabulary is decided. When a span has a db.statement attribute, the spec says it is the "database statement being executed" with "Reasonable sanitization" — a normative line that the dashboard team and the database team and the security team all interpret differently in practice, and the disagreements are not bugs in the SDK; they are interpretive disputes the spec deliberately leaves room for.
The second is the SDK — language-specific implementations that make the spec callable from your code. opentelemetry-sdk for Python, the OTel Java agent, the Node.js SDK. The SDK is where the in-process plumbing lives: trace context propagation through async boundaries, batch processors that accumulate spans before exporting, instrumentation libraries (opentelemetry-instrumentation-flask, opentelemetry-instrumentation-psycopg2) that monkey-patch popular packages so the developer does not have to write with tracer.start_as_current_span(...) around every database call. The SDK is what most developers see and the thing they hold most opinions about.
The third is the wire format — OTLP, the OpenTelemetry Protocol. OTLP is a Protocol Buffers schema (opentelemetry-proto) carried over either gRPC (the high-throughput default, port 4317) or HTTP/JSON (the firewall-friendly fallback, port 4318). The wire format is what makes OpenTelemetry portable across languages and vendors — a Java service emits OTLP, a Python service emits OTLP, a Go service emits OTLP, and a single backend (or Collector) accepts all three by speaking one protocol. OTLP is also where most multi-tenant scaling decisions live: batching, compression, retry, gRPC stream multiplexing.
The fourth is the Collector — a separate process (otelcol) that receives OTLP from your services, optionally transforms it (sampling, attribute mutation, redaction, fan-out), and exports it onward to one or more backends. The Collector is the operational layer that most teams under-appreciate until their first cardinality incident — at which point they realise the Collector is the only place to enforce the "drop-this-label-before-it-hits-Prometheus" rule without redeploying every service. The Collector is also where service.name defaulting, resource detection (Kubernetes pod metadata, EC2 instance metadata), and protocol bridging (Prometheus scrape → OTLP push, OTLP → Datadog API) happens.
Why naming the layer is the actual diagnostic skill: a "service.name is missing" symptom can originate in any of the four — the spec says it is required (so a missing one is a spec violation), the SDK has a default that depends on OTEL_SERVICE_NAME and several fallback rules, the OTLP message can have an empty Resource block if the SDK was configured before resource detection ran, or the Collector's resourcedetection processor can overwrite a correctly-set name with a wrong one. An on-call who jumps straight to "let me upgrade the SDK" wastes 40 minutes; an on-call who first asks "which layer drops this attribute" inspects the OTLP payload itself (with a dissection script like the one in the next section, or by temporarily routing the Collector pipeline through a debug exporter), confirms whether the resource block is correct on the wire, and then decides whether to fix the SDK config or the Collector pipeline. The diagnostic skill is not "know OTel"; it is "know which of the four surfaces owns the bug".
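A minimal way to make that first check, assuming a recent Collector build where the debug exporter is available (it replaced the older logging exporter): route the traces pipeline through it for a few minutes and read the resource block the Collector actually received. The config below is a sketch, not a drop-in pipeline.
# otelcol-debug.yaml (sketch): dump incoming resource blocks to see which layer dropped service.name
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
    verbosity: detailed            # prints every span, with its resource attributes, to the Collector's stdout
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]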
Walking the wire — what an OTLP message actually looks like
The fastest way to stop treating OTel as a black box is to read one OTLP message end to end. The protocol is a Protobuf schema; you can dissect a captured message with opentelemetry-proto Python bindings in fifteen lines and see exactly what your SDK puts on the wire. The script below sets up a tiny instrumented service, captures a single OTLP export, and prints the resource attributes, the span tree, and the byte breakdown.
# otlp_dissect.py — show what a single OTLP TracesData message looks like.
# Captures one export from a tiny Flask service into a TCP listener masquerading
# as an OTLP gRPC endpoint, then parses the protobuf and prints the structure.
# pip install flask opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-flask opentelemetry-proto requests
import socket, threading, time, requests
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.proto.collector.trace.v1 import trace_service_pb2

CAPTURED = bytearray()

def fake_otlp_server(port=14317):
    """Tiny TCP listener that records every byte the gRPC exporter sends."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port)); s.listen(1)
    conn, _ = s.accept()
    conn.settimeout(2.0)  # the exporter never gets a reply; stop reading after a quiet period
    try:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            CAPTURED.extend(chunk)
    except socket.timeout:
        pass
    conn.close(); s.close()

def extract_grpc_message(raw: bytes) -> bytes:
    """Skip the HTTP/2 preface, splice the DATA-frame payloads, strip the gRPC frame header."""
    preface = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
    if raw.startswith(preface):
        raw = raw[len(preface):]
    data, i = b"", 0
    while i + 9 <= len(raw):                    # each HTTP/2 frame: 9-byte header + payload
        length = int.from_bytes(raw[i:i + 3], "big")
        if raw[i + 3] == 0x0:                   # type 0x0 = DATA frame, which carries the gRPC message
            data += raw[i + 9:i + 9 + length]
        i += 9 + length
    if len(data) < 5:
        return b""
    msg_len = int.from_bytes(data[1:5], "big")  # gRPC framing: 1-byte compression flag + 4-byte length
    return data[5:5 + msg_len]                  # assumes no gRPC compression (the exporter default)

threading.Thread(target=fake_otlp_server, daemon=True).start()

resource = Resource.create({
    "service.name": "yatranow-pricing-engine",
    "service.version": "0.7.3",
    "deployment.environment": "staging",
    "host.name": "pricing-engine-7c4f9aef-x9k2m",
    "k8s.pod.name": "pricing-engine-7c4f9aef-x9k2m",
    "k8s.namespace.name": "pricing",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://127.0.0.1:14317", insecure=True)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("pricing.quote")

app = Flask(__name__); FlaskInstrumentor().instrument_app(app)

@app.route("/quote/<route_id>")
def quote(route_id):
    with tracer.start_as_current_span("price.lookup",
            attributes={"route.id": route_id, "tenant.id": "acme-travels"}):
        with tracer.start_as_current_span("db.fetch",
                attributes={"db.system": "postgresql", "db.statement": "SELECT base_fare FROM ..."}):
            return {"fare_inr": 4250}

threading.Thread(target=lambda: app.run(port=15000, debug=False), daemon=True).start()
time.sleep(0.5)
requests.get("http://127.0.0.1:15000/quote/BLR-DEL-21APR")
provider.force_flush(timeout_millis=2000)
time.sleep(3)  # give the capture thread time to hit its read timeout

body = extract_grpc_message(bytes(CAPTURED))
req = trace_service_pb2.ExportTraceServiceRequest()
req.ParseFromString(body)
for rs in req.resource_spans:
    print("=== Resource ===")
    for kv in rs.resource.attributes:
        print(f"  {kv.key} = {kv.value.string_value or kv.value.int_value}")
    print(f"=== {len(rs.scope_spans)} scope(s) ===")
    for ss in rs.scope_spans:
        print(f"  scope: {ss.scope.name} v{ss.scope.version}")
        for sp in ss.spans:
            print(f"    span: {sp.name} trace_id={sp.trace_id.hex()[:16]}...")
            print(f"      duration={(sp.end_time_unix_nano - sp.start_time_unix_nano) / 1e6:.1f}ms")
            for kv in sp.attributes:
                print(f"      attr: {kv.key} = {kv.value.string_value or kv.value.int_value}")
total_spans = sum(len(ss.spans) for rs in req.resource_spans for ss in rs.scope_spans)
print(f"\nbytes on wire: {len(body)} spans: {total_spans}")
Sample run:
=== Resource ===
service.name = yatranow-pricing-engine
service.version = 0.7.3
deployment.environment = staging
host.name = pricing-engine-7c4f9aef-x9k2m
k8s.pod.name = pricing-engine-7c4f9aef-x9k2m
k8s.namespace.name = pricing
telemetry.sdk.language = python
telemetry.sdk.name = opentelemetry
telemetry.sdk.version = 1.27.0
=== 2 scope(s) ===
scope: opentelemetry.instrumentation.flask v0.48b0
span: GET /quote/<route_id> trace_id=4bf92f3577b34da6...
duration=12.4ms
attr: http.method = GET
attr: http.route = /quote/<route_id>
attr: http.status_code = 200
scope: pricing.quote v
span: price.lookup trace_id=4bf92f3577b34da6...
duration=8.7ms
attr: route.id = BLR-DEL-21APR
attr: tenant.id = acme-travels
span: db.fetch trace_id=4bf92f3577b34da6...
duration=5.1ms
attr: db.system = postgresql
attr: db.statement = SELECT base_fare FROM ...
bytes on wire: 612 spans: 3
The load-bearing lines: Resource.create({...}) is the single most under-attended part of OTel adoption — the resource attributes here travel with every span, every metric, every log emitted from this process for the rest of its lifetime. Forgetting service.name here means every artefact this service ever produces ends up labelled unknown_service:python in the backend, and you cannot retrofit it without reprocessing the storage tier. BatchSpanProcessor is the in-process buffer that holds spans for up to 5 seconds (or 512 spans, whichever first) before exporting — the choice between BatchSpanProcessor and SimpleSpanProcessor is the choice between "1.4× CPU under load" and "every test slows by the export timeout". OTLPSpanExporter(endpoint=..., insecure=True) is the SDK-to-Collector seam where most TLS / mTLS pain lives in production; in this script the fake server stands in for localhost:4317 so we can inspect the bytes. req.resource_spans[0].resource.attributes is the part of the OTLP message that the Collector's resourcedetection processor either enriches (adding cloud.region, host.id) or overwrites (the failure mode that makes service.name mysteriously change between SDK and backend). trace_id=4bf92f3577b34da6... is the same 128-bit identifier that Part 12's correlation chapter spent 200 lines explaining; reading it on the wire here closes the loop — the trace_id you query in Tempo is the same trace_id the SDK put on the wire, with no transformation in between.
Why reading the bytes once is worth more than reading the spec ten times: the spec describes the abstract data model, but the abstract data model leaves room for SDK-specific decisions (when does BatchSpanProcessor flush; what does service.version default to if unset; does the SDK include telemetry.sdk.* resource attributes that you did not ask for and that count against your cardinality budget). The wire-level dissection makes those decisions concrete: the script's output shows that the SDK auto-injects telemetry.sdk.name=opentelemetry, telemetry.sdk.language=python, and telemetry.sdk.version=1.27.0, which means every span you emit carries three extra resource attributes that end up as labels on metrics derived from spans (via spanmetrics Collector processor) and which can blow up cardinality if you do not strip them in the Collector. You only know to strip them because you saw them on the wire; the spec mentions them in passing, the SDK docs do not flag them.
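If you do decide to strip them, the Collector's resource processor is one place to do it. A sketch, assuming nothing downstream depends on these attributes; the processor still has to be added to the relevant pipelines.
# otelcol fragment (sketch): drop the SDK-injected resource attributes before spanmetrics sees them
processors:
  resource/strip-sdk:
    attributes:
      - key: telemetry.sdk.name
        action: delete
      - key: telemetry.sdk.language
        action: delete
      - key: telemetry.sdk.version
        action: delete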
Why "the standard" is now a default, not a choice
Five years ago, "should we use OpenTelemetry?" was a real question — the SDK was incomplete, the Collector was experimental, and Datadog / New Relic / Honeycomb each had their own SDK that was strictly better-supported in their backend. The question stopped being real around 2024 because three forcing functions converged. First, every major observability vendor began accepting OTLP natively — Datadog, New Relic, Honeycomb, Splunk, Dynatrace, AWS CloudWatch, Google Cloud Trace, Azure Monitor — which meant the SDK choice no longer determined the backend choice. Second, the cloud providers started emitting OTel-compliant telemetry from their managed services by default — RDS query logs, ALB access logs, Lambda traces — which meant the OTel data model was now the lingua franca of platform telemetry, not just application telemetry. Third, the Collector reached production-grade in late 2023 and gave teams a single place to enforce policy (sampling, redaction, cardinality control) that no proprietary SDK could match because no proprietary SDK is shared across all your services.
The practical consequence for an Indian platform team in 2026 — the YatraNow / Razorpay / Zerodha / Hotstar pattern — is that "should we use OTel" is no longer a meaningful planning question. The meaningful question is "which of the four OTel surfaces do we own, and which do we lease to a vendor". A team running self-hosted Prometheus + Loki + Tempo owns all four (spec adoption, SDK config, OTLP transport, Collector deployment). A team running Datadog APM owns surfaces 1–3 but leases surface 4 (Datadog's agent acts as a Collector with vendor-specific extensions). A team running AWS X-Ray uses a slightly older non-OTLP wire format, though the default migration path is now the OTLP-compatible AWS Distro for OpenTelemetry (ADOT). The shape of the choice has shifted from "OTel vs vendor-SDK" to "how much of the OTel stack to operate ourselves vs delegate to a vendor that speaks OTLP at the edge".
The convergence is most visible in the cost economics. A hypothetical YatraNow-scale fleet emitting 800K spans/second across 240 services pre-OTel paid for: a Datadog tracing license (~₹38L/month), a separate Splunk logging license (~₹22L/month), and a self-hosted Prometheus stack (~₹6L/month operating cost). The SDK overhead was three different libraries per service, each with its own context-propagation rules that occasionally disagreed. Post-OTel, the same fleet emits OTLP from one SDK per language to a Collector tier, and the Collector fans out: 100% of traces to Tempo (₹4L/month self-hosted), 5% sampled to Datadog for the dashboards the leadership team has already built (~₹8L/month), all logs to Loki (~₹3L/month), all metrics to Prometheus (₹6L/month). Total: ₹21L/month, down from ₹66L/month — and the SDK overhead is now one library per language with consistent context propagation. The migration cost (engineer-quarters of work) was real; the steady-state savings paid it back inside two quarters. The forcing function for adoption is not technical elegance; it is the bill.
Why "vendor choice deferred" is operationally larger than the cost saving: the typical observability vendor commitment cycle is three years, the typical engineer tenure on the platform team is two years. A team that hard-binds its SDK to a vendor in year 1 is committing engineers in year 3 — who did not pick the vendor — to either pay the renewal or absorb a full SDK migration. A team that emits OTLP and lets the Collector fan out treats vendor choice as a Collector-config change: swap the export pipeline from otlphttp/datadog to otlphttp/honeycomb, restart the Collector, done. The political cost of vendor renegotiation drops from "convince leadership to fund a quarter of engineer time" to "show the Collector diff in the PR review". This is the lever Razorpay reportedly used in 2025 to renegotiate their tracing-vendor contract — credible threat of migration without engineering cost made the next-year price drop 35%.
Three failure modes Aditi (and you) will hit
The four-surface mental model maps directly onto the three failure modes that show up most often when teams onboard OTel. Each failure has a layer-name; the layer-name is the diagnostic.
Failure mode 1: service.name=unknown_service:python (SDK layer). The SDK falls back to a defaulted resource if OTEL_SERVICE_NAME is unset and the application code does not pass a Resource to the TracerProvider. The fallback is unknown_service:<language>. This is the most common first-day OTel mistake; readers see the service show up in Tempo but with the wrong name and assume the backend is broken. The fix is in the SDK config (set OTEL_SERVICE_NAME or pass Resource.create({"service.name": "..."})), but the diagnostic is at the OTLP layer (dump the resource block; if it says unknown_service, you know which surface is responsible).
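Both fixes side by side, for reference; the service name is the hypothetical one from the script above, and which fix you use depends on whether your team owns the deployment env vars or the application code.
# fix_service_name.py (sketch): the two ways to avoid unknown_service:python
# Option 1, deployment config:  export OTEL_SERVICE_NAME=yatranow-pricing-engine
# Option 2, application code: pass an explicit Resource to the TracerProvider
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

provider = TracerProvider(
    resource=Resource.create({"service.name": "yatranow-pricing-engine"}))
trace.set_tracer_provider(provider)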
Failure mode 2: double-counted metrics (Collector + SDK overlap). The team has been emitting Prometheus metrics from prometheus-client for two years. They add OTel SDK auto-instrumentation, which emits the same HTTP request metrics as OTel histograms. Both pipelines arrive at Prometheus — one via the existing /metrics scrape, one via the Collector's prometheusremotewrite exporter — and the dashboards now show double the request rate. The fix is at the Collector layer (drop the OTel-emitted HTTP histograms via a filter processor, or disable opentelemetry-instrumentation-flask metrics emission, depending on which you trust more). The diagnostic is at the SDK layer (the SDK emits both because both are configured) but the fix lives at the Collector. This is why naming the layer matters — the layer that produces the bug and the layer that fixes the bug are not always the same.
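If the team decides to trust the existing prometheus-client metrics and drop the OTel duplicates, a Collector filter processor is one way to express it. A sketch only: the metric name shown (http.server.duration, the pre-rename Flask instrumentation histogram) and the exact filter syntax both depend on your instrumentation and Collector versions.
# otelcol fragment (sketch): drop the duplicate OTel HTTP histograms, keep the scraped prometheus-client ones
processors:
  filter/drop-otel-http:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          - http.server.duration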
Failure mode 3: Collector OOM at 4 GB on a service that emits 200 spans/second (Collector layer). The team enables tail-based sampling in the Collector (Part 5's pattern) but configures the tail_sampling processor with the default 30-second decision window and no memory cap. A fleet that emits 200 spans/second per service across 240 services is 48,000 spans/second; held for 30 seconds, that is 1.4 million spans buffered in the Collector's memory at any moment. At ~3KB per span, the working set is ~4.2 GB. The fix is at the Collector layer (drop the decision window to 10 seconds and add memory_limiter), and the diagnostic is also at the Collector layer (zpages shows the buffer occupancy in real time). This one is pure-Collector; the SDK and the spec are blameless.
# otelcol-config.yaml — the four-line fix for failure mode 3
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500                  # hard cap; spill is dropped, not buffered
    spike_limit_mib: 500
  tail_sampling:
    decision_wait: 10s               # was 30s — the line that mattered
    num_traces: 50000                # cap on in-flight traces
    expected_new_traces_per_sec: 5000
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 800 } }
      - { name: sample, type: probabilistic, probabilistic: { sampling_percentage: 1 } }
The line that actually matters is decision_wait: 10s. The default 30-second window is reasonable for a small fleet but pathological at scale; the trade-off is that traces longer than 10s are decided based on partial data. For 99% of HTTP request workloads the trade-off is fine; for batch-job traces or long-running streams, 10s is too short and you should sample those workloads at the SDK layer instead. The point is not which value to pick — it is to know that the Collector's memory budget is a function of (decision_wait × incoming_spans_per_sec × bytes_per_span) and to configure it deliberately, not to default it and discover the OOM at 02:00 IST.
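The budget is simple enough to check on a napkin before the config ships. A back-of-envelope sketch with the numbers from the incident above; the ~3 KB per-span figure is the assumption to validate against your own spans.
# collector_memory_budget.py (sketch): tail-sampling working set = decision_wait x span_rate x bytes_per_span
spans_per_sec = 200 * 240      # 200 spans/s per service x 240 services
decision_wait_s = 30           # the default that caused the OOM
bytes_per_span = 3 * 1024      # ~3 KB per buffered span (assumption: measure your own)

working_set = spans_per_sec * decision_wait_s * bytes_per_span
print(f"{working_set / 1024**3:.1f} GiB buffered at 30s")                          # ~4.1 GiB
print(f"{spans_per_sec * 10 * bytes_per_span / 1024**3:.1f} GiB buffered at 10s")  # ~1.4 GiB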
Common confusions
- "OpenTelemetry is just a tracing library." It is a spec for traces, metrics, logs, and profiles, plus an SDK, plus a wire format, plus a Collector. Reducing it to "tracing" is the framing that produces teams who emit beautiful traces, contradictory metrics, and unstructured logs — three pillars again, with OTel only solving one.
- "If we use OpenTelemetry, we don't need a vendor." You still need a backend (Tempo, Loki, Prometheus, or a paid vendor) to store the telemetry. OTel is the SDK + wire format + Collector; the storage tier is a separate decision, and self-hosting it is a real operational commitment that should not be hidden from the budget.
- "The Collector is optional — the SDK can export directly to the backend." Technically yes; in production almost never. The Collector is where you enforce sampling policy, redact PII before it leaves your network, fan out to multiple backends, and protect the backend from SDK retry storms. Skipping it is fine for a 5-service POC and a planning failure for any production fleet.
- "OTel auto-instrumentation gives you observability for free." Auto-instrumentation gives you spans for the libraries it knows about (Flask, requests, psycopg2). It does not know about your business logic — the
apply_promo_codestep, thevalidate_kyccheck, thescore_fraud_riskmodel. Auto-instrumentation is a starting point; the spans that matter for debugging are the ones you write by hand around your own functions. - "Once we adopt OTel, we are vendor-neutral." You are wire-format-neutral. You are not yet attribute-neutral — if your dashboards depend on
dd.service(a Datadog-specific resource attribute that the Datadog Collector sets), migrating to Honeycomb means rewriting the dashboards. Real vendor-neutrality requires sticking to OTel semantic conventions in your own attribute names and not depending on vendor-specific enrichments. Most teams discover this on the migration day and not before. - "Upgrading the OTel SDK is always safe." The SDK is on a 1.x semver track; minor versions are backwards-compatible by policy. The semantic conventions (the spec layer) are not on the same cadence —
http.methodwas renamed tohttp.request.methodin 2024, and SDK versions that follow the new convention emit different attribute names than older ones. Mixed-version fleets produce dashboards with two sets of labels for the same metric, and the silent-fix is at the Collector (transform old → new in atransformprocessor). Read the changelog before upgrading.
Going deeper
Semantic conventions as the contract — and where they break
OpenTelemetry's semantic conventions (opentelemetry.io/docs/specs/semconv) define the canonical names for common attributes: http.request.method, http.response.status_code, db.system, messaging.destination.name, cloud.provider, cloud.region, k8s.pod.name, service.name. When every service in a fleet uses these names, the LLM-correlation tooling from chapter 87 has a stable vocabulary, the dashboards from chapter 81 have predictable labels, and the cross-service queries ({ http.request.method = "POST", http.response.status_code = 500 }) work without per-service translation. When teams diverge — http_method here, http.method there, httpMethod elsewhere — the dashboards become per-service rather than fleet-wide, and the cardinality of __name__ labels in Prometheus doubles for no operational reason.
The conventions are versioned and stabilising in stages. As of mid-2026, the HTTP, database, messaging, RPC, and Kubernetes conventions are stable; the GenAI, browser, and feature-flag conventions are experimental. Teams that adopt experimental conventions get the upgrade-day pain when those conventions change; teams that stay on stable conventions get the consistency benefit but lose access to the newer attribute spaces. The pragmatic split is to use stable conventions in production telemetry and experimental conventions in a separate namespace (org.yatranow.genai.*) until they stabilise — that way the migration day is a controlled rename rather than a fleet-wide breakage.
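In code, the split is nothing more than disciplined attribute naming: stable convention names spelled exactly as the spec spells them, experimental material parked under an org prefix. A sketch using the hypothetical YatraNow namespace; the attribute names under org.yatranow.* are made up.
# semconv_naming.py (sketch): stable conventions as-is, experimental ones behind an org namespace
from opentelemetry import trace

tracer = trace.get_tracer("pricing.quote")

with tracer.start_as_current_span("genai.rerank", attributes={
    "http.request.method": "POST",             # stable semconv name, safe to dashboard fleet-wide
    "db.system": "postgresql",                 # stable semconv name
    "org.yatranow.genai.model": "rerank-v3",   # experimental space, org-prefixed until semconv stabilises
    "org.yatranow.genai.prompt_tokens": 412,
}):
    pass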
Sampling decisions: SDK, Collector, or backend?
A request is sampled exactly once in OTel-land — but "exactly once" is split across three layers and you have to pick which. Head-based sampling at the SDK (the TraceIdRatioBased sampler) is cheap and consistent (the same trace_id sampled across all services because the decision is deterministic on the trace_id), but it is blind to outcome — you cannot keep all errors under head-based sampling because you do not know yet which spans will error. Tail-based sampling at the Collector keeps all errors and slow traces (because the decision waits until the trace completes), but it costs the Collector memory in proportion to (decision_wait × span_rate). Backend-side sampling (Tempo's dynamic_sampler) is cheapest because the storage tier handles it, but it loses the cohort coherence — two services may end up with different sample decisions for the same trace, breaking the trace tree.
The default-good-choice for most Indian platform teams is tail-based at the Collector with a 10-second decision window and a 1-2% probabilistic floor for OK traces, because it preserves all error traces (which are what the on-call needs) and pays the memory cost in one place (the Collector pool, which can be sized predictably). The default-good-choice is wrong if your traces are long-running streams (Hotstar's CDN-edge spans for video sessions can be 90 minutes long, far exceeding any sane decision_wait), in which case you fall back to head-based with parentbased_traceidratio at the SDK and accept the loss of error retention. Knowing which mode applies to which workload is the actual sampling design; "we use OTel sampling" is not an answer.
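The head-based fallback for long-running workloads is a one-line SDK change; a sketch of the parent-based ratio sampler described above, with an illustrative 1% ratio.
# head_sampling.py (sketch): parent-based head sampling for workloads that outlive any sane decision_wait
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# keep ~1% of root traces; child spans follow the decision already made for their trace_id
sampler = ParentBased(root=TraceIdRatioBased(0.01))
provider = TracerProvider(sampler=sampler)
trace.set_tracer_provider(provider)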
Resource detection — what the SDK adds for you, and what you need to add
Resource detection is the SDK feature that auto-fills host.name, process.pid, process.runtime.*, and (with the opentelemetry-resource-detector-* packages) cloud-provider-specific attributes like cloud.region, cloud.account.id, aws.ecs.task.arn. The detector runs once at SDK init and the result is cached for the process lifetime. This is invaluable for fleet-wide queries — { k8s.namespace.name = "pricing" } works because the detector populated k8s.namespace.name from the downward API. It is also a cardinality risk — process.pid is unique per process restart and will show up as a label on metrics derived from spans (via spanmetrics), and a fleet that restarts pods for deploys ten times a day will have process.pid taking ~2400 values per service per day. The Collector's attributes processor lets you drop process.pid (and similar high-churn fields) before they hit the metric backend; the SDK does not let you drop them because the SDK does not know which attributes downstream consumers will index. The split — SDK enriches eagerly, Collector trims for downstream — is the right design but requires explicit Collector configuration.
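The enrich-then-trim split, expressed as two Collector processors. A sketch, assuming the contrib distribution; the detector list varies by environment, and the processors still need wiring into the pipelines.
# otelcol fragment (sketch): Collector-side enrichment plus trimming of high-churn resource attributes
processors:
  resourcedetection:
    detectors: [env, system]       # enrich with host.name, os.type; use cloud/k8s detectors as appropriate
  resource/trim-churn:
    attributes:
      - key: process.pid           # unique per restart, poison as a metric label
        action: delete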
The Collector is your policy enforcement point — design it like one
The Collector is unique in the OTel stack because it is the only surface where a single change affects every service simultaneously. That makes it the natural location for cross-cutting policy: PII redaction (strip user.email and user.phone before they leave the cluster), cardinality control (drop labels that exceed budget), service-name enforcement (overwrite unknown_service with the Kubernetes Deployment name from k8s.labels.app), tenant attribution (add cost.tenant from the request's API key for chargeback). Teams that under-invest in the Collector pipeline end up doing each of these in N services, and the N implementations drift. Teams that over-invest in the Collector — putting business logic in the Collector that should live in the application — make the Collector their single point of failure and discover that a Collector restart now breaks all telemetry simultaneously. The middle ground is to reserve the Collector for cross-cutting observability policy (sampling, redaction, attribution) and to keep business-domain decisions (which traces are billable, what the service should report) in the application, where they have a code review and a test suite.
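The redaction case is the clearest example of Collector-as-policy-point, because it has to hold for every service including the ones written next quarter. A sketch with hypothetical attribute keys; an attributes processor like this still has to be placed in every pipeline whose data leaves the cluster.
# otelcol fragment (sketch): strip PII attributes before telemetry leaves the cluster
processors:
  attributes/redact-pii:
    actions:
      - key: user.email
        action: delete
      - key: user.phone
        action: hash               # keeps joinability without keeping the value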
A useful Collector hygiene practice: run the Collector with its own observability stack pointed at itself. The Collector emits its own otelcol_* metrics (queue depth, drop rate, processor latency); scrape them. The Collector's zpages extension at port 55679 shows live queue state; expose it for SREs. The Collector's pprof extension at port 1777 lets you take CPU and heap profiles; collect them on a schedule and feed into pyroscope. A Collector that observes itself is the only Collector you can debug at 02:00.
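The self-observation setup is a handful of lines in the same config file. A sketch; the ports are the conventional defaults and the service telemetry block has shifted shape across Collector versions, so check the docs for the release you run.
# otelcol fragment (sketch): make the Collector observable with its own metrics, zpages, and pprof
extensions:
  zpages:
    endpoint: 0.0.0.0:55679        # live pipeline and queue state
  pprof:
    endpoint: 0.0.0.0:1777         # CPU and heap profiles on demand
service:
  extensions: [zpages, pprof]
  telemetry:
    metrics:
      level: detailed              # emit the otelcol_* metrics: queue depth, drop rate, processor latency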
Reproduce this on your laptop
# Reproduce this on your laptop
docker run -d -p 9090:9090 prom/prometheus
docker run -d -p 3200:3200 -p 4317:4317 grafana/tempo:latest
python3 -m venv .venv && source .venv/bin/activate
pip install flask opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-flask opentelemetry-proto requests
python3 otlp_dissect.py
# observe: resource attributes, span tree, byte count on the wire
# then deliberately break the service name and watch the resource block change:
# delete the "service.name" entry from Resource.create() in otlp_dissect.py and rerun
python3 otlp_dissect.py   # resource will now say unknown_service:python
Where this leads next
Part 13's wall is the closing argument for OpenTelemetry as the cross-pillar substrate, and Part 14 is the engineering of how the substrate is implemented. The next eight chapters dive into the spec layer (the data model — /wiki/the-data-model — and how Resource, Scope, Span, MetricDataPoint, LogRecord fit together as one schema), the SDK layer (the lifecycle from API call → SDK buffer → exporter, the propagators, the samplers), the wire format (OTLP framing on gRPC and HTTP, the why-protobuf decision, batch and compression strategy), and the Collector internals (receivers, processors, exporters, the pipeline graph). Reading them in order is reading the four surfaces in depth.
The wall here also reframes the rest of the curriculum. Parts 6 (cardinality) and 11 (alerting) become Collector-configuration problems as much as backend problems. Part 5 (sampling) becomes an SDK-vs-Collector design choice, not a single configuration. Part 12 (correlation) becomes a semantic-convention enforcement problem — the trace_id correlation Riya needed in chapter 81 only works because every emitter speaks the same OTel-spec attribute name. The earlier parts taught you what to observe; OTel is what makes the observations interoperate.
For the broader question of how observability platforms ship correlation across pillars at production scale, see /wiki/exemplars-linking-metrics-to-traces and /wiki/log-to-trace-correlation-trace-ids-in-logs. For the cautionary view of automating the layer above OTel, /wiki/llms-for-correlation-a-cautious-view covers the shape of LLM-assisted investigation that depends on OTel-compliant telemetry as its grounding.
A closing thought on the framing of this chapter as a wall. The "wall" pattern in this curriculum marks the points where the curriculum cannot progress without a substrate change — Part 1's wall was the cardinality variable, Part 5's wall was the impossibility of keeping every event, Part 12's wall was the missing correlation contract. Part 13's wall is different in kind. The previous walls revealed problems the next part solved; this wall reveals a solution the next part operationalises. OpenTelemetry is the substrate the rest of the curriculum runs on; treating it as a vendor checkbox is the framing that produces teams who ship beautiful spans, contradictory metrics, exploding Collector pods, and bills they cannot explain. Treating it as a spec, an SDK, a wire format, and a Collector — four surfaces, four debugging tools, four layers of config — is the framing that lets the rest of Part 14 land.
References
- OpenTelemetry specification — the binding document for the data model, semantic conventions, and API contracts. Read the data-model and trace specs end-to-end at least once.
- OpenTelemetry Collector documentation — receivers, processors, exporters, the pipeline graph; the Collector internals chapter of Part 14 builds on this.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 11 on OTel adoption is the most measured industry treatment of the migration.
- CNCF OpenTelemetry project page — the governance context that makes OTel a default rather than a vendor's choice.
- opentelemetry-proto on GitHub — the protobuf schema that defines OTLP. The trace.proto, metrics.proto, and logs.proto files are the actual contract.
- Ben Sigelman, "Three Pillars with Zero Answers" (LightStep, 2018) — the foundational critique that motivated the OTel project's correlation-first design.
- /wiki/exemplars-linking-metrics-to-traces — internal: the metric-to-trace edge that OTel makes practical.
- /wiki/the-one-pane-of-glass-promise-and-its-limits — internal: the human investigation surface that depends on OTel-compliant telemetry.