Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

The Collector: receivers, processors, exporters

It is 22:14 IST on the night of an India-Pakistan T20. Karan, an SRE on the Hotstar streaming-platform team, is staring at a Grafana panel that should show a clean 3 million spans per second flowing into Tempo and instead shows a hairy sawtooth oscillating between 800k and 4.2M. The application services are healthy. The Tempo backend is healthy. The OTLP wire bytes leaving the application pods are healthy. The thing in the middle — fifty-two pods running the OpenTelemetry Collector — is alternately stuffing its export queues, OOMing, restarting, dropping spans on the floor, and back-pressuring the SDKs which silently shed load. The dashboard is not lying about traffic; it is lying about what it is plotting, because the spans Tempo sees are no longer the spans the SDKs emitted.

The Collector is a small Go binary with an enormous job: it is the choke-point through which every metric, log, span, and profile in a multi-tenant fleet flows on its way from emitter to backend. Its three-stage shape — receivers that decode bytes off the wire, processors that mutate batches in memory, exporters that push transformed batches downstream — is what makes it the place every production policy ends up living. Sampling decisions, redaction, label dropping, batching to backend-friendly sizes, multi-tenant routing, retry-and-spill — all of it ultimately runs in the Collector, not in the SDK and not in the backend. The cost of misunderstanding the three-stage flow is exactly the kind of incident Karan is in tonight.

The OpenTelemetry Collector is a pipeline daemon: receivers (OTLP, Prometheus scrape, Filelog, Jaeger, Zipkin, ...) decode wire formats into the in-memory pdata structure, processors (batch, memory_limiter, attributes, tail_sampling, ...) transform that pdata in place, and exporters (OTLP, Loki, Prometheus remote-write, Tempo, ...) push the transformed batches to backends. Pipelines are typed (traces, metrics, logs, profiles), composed declaratively in YAML, and the order of processors is the policy. Everything operationally hard about OTel — backpressure, retry, multi-tenancy, tail sampling, redaction — is configured here, not in the SDK.

A pipeline made of three boxes — and why the boxes exist

The Collector exists because the SDK cannot make every decision and the backend should not. The SDK decides what to emit (spans, metrics, log records); the backend decides what to index and persist. Everything between those two decisions — what to keep, what to drop, what to add, what to redact, where to ship it, how to retry, who to charge — has no good home in either layer. So OTel created a third layer: a stateless (mostly) Go process that owns the policy.

The Collector's processing model is a typed pipeline. Configuration declares a set of components — receivers, processors, exporters, with optional connectors and extensions — and a set of pipelines that wire them together. Each pipeline has exactly one signal type (traces, metrics, logs, or in newer builds profiles). A receiver of type otlp can feed all four pipelines because OTLP carries all four signal types; a receiver of type prometheus (which scrapes a /metrics endpoint) can only feed the metrics pipeline. Every processor and exporter is similarly typed: a tail_sampling processor only exists in trace pipelines; a loki exporter only exists in log pipelines. The type system is what stops you accidentally piping a span batch into a Prometheus remote-write endpoint.

The shape of a Collector pipeline is identical to a Unix shell pipeline conceptually, but the analogy hides the most important difference: batches, not records, are the unit of flow. A receiver does not hand spans to the next stage one at a time. It assembles incoming OTLP messages into ptrace.Traces objects (a tree of ResourceSpans → ScopeSpans → Span) and hands the whole batch to the first processor. The processors mutate that batch in place — adding attributes, dropping spans, splitting one batch into two by tenant — and the exporter sees a single batch object to ship. Batch-orientation is what makes the Collector fast (a 5000-span batch goes through twelve processors with twelve function calls, not 60,000) and what makes a few of its processors subtle (the batch processor exists to re-batch upstream batches into exporter-optimal sizes, which means upstream batches and downstream batches are not the same objects).

[Figure — OpenTelemetry Collector pipeline shape: receivers on the left (otlp, prometheus, filelog, jaeger, hostmetrics, kafka) decode wire bytes; three typed processor chains run in the middle (traces: memory_limiter, attributes, tail_sampling, batch; metrics: memory_limiter, resourcedetection, metricstransform, batch; logs: memory_limiter, transform, redaction, batch); exporters on the right (otlp/tempo, prometheusremotewrite, loki, kafka, file, debug) push to backends. Batches flow through processors in declared order; one Collector binary runs all three pipelines side by side.]
Illustrative — three pipelines wired in one Collector. Receivers (left) feed all pipelines whose signal type they support; each pipeline runs its own processor chain in declared order; exporters (right) ship to backends. The same `otlp` receiver feeds the trace, metric, and log pipelines simultaneously because OTLP carries all three signal types on the same gRPC stream.

The pipeline shape is a deliberate inversion of the SDK's shape. The SDK is one process emitting one type of signal at a time, mostly through one exporter. The Collector is one process receiving every type of signal from every emitter in the fleet, multiplexing them through configurable processor chains, and fanning out to multiple backends. Where the SDK's job is "produce honest telemetry", the Collector's job is "transform fleet-wide telemetry into what the platform team and the cost-budget allow". Why move policy here instead of into the SDK: the SDK runs in the application's address space, on the application's CPU, billed to the application's owner. Anything you make the SDK do — tail sampling, redaction, multi-tenant routing — is paid for by every product team that ships the binary. The Collector runs on platform-team-owned pods, charged to the platform-team budget, and is upgraded on the platform team's cadence. Pushing decisions out of the SDK into the Collector is what lets the platform team change sampling rates without redeploying every microservice in the fleet. It is the same separation-of-concerns logic that drives the API/SDK split, applied one layer up.

A real pipeline you can run on your laptop

The cleanest way to understand the three layers is to run a Collector with all three configured, send traffic through it, and observe what each layer touches. The script below builds a minimal-but-real Collector configuration, sends OTLP traces through it, and prints what arrives at the (fake) downstream exporter so you can see receivers / processors / exporters at work.

# collector_walkthrough.py — exercise a real OTel Collector pipeline.
# pip install opentelemetry-api opentelemetry-sdk \
#             opentelemetry-exporter-otlp-proto-grpc grpcio
# Also needs the otelcol-contrib binary on PATH; install via:
#   curl -L https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.105.0/otelcol-contrib_0.105.0_linux_amd64.tar.gz | tar xz -C /tmp
import time, tempfile, subprocess
from concurrent import futures
import grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.proto.collector.trace.v1 import (
    trace_service_pb2, trace_service_pb2_grpc)

# 1) Stand up a fake "downstream backend" that the Collector will export to.
CAPTURED = []
class FakeTempo(trace_service_pb2_grpc.TraceServiceServicer):
    def Export(self, req, ctx):
        CAPTURED.append(req)
        return trace_service_pb2.ExportTraceServiceResponse()
srv = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
trace_service_pb2_grpc.add_TraceServiceServicer_to_server(FakeTempo(), srv)
srv.add_insecure_port("127.0.0.1:24317"); srv.start()

# 2) Write a minimal Collector config that exercises all three layers.
CFG = """
receivers:
  otlp: { protocols: { grpc: { endpoint: 127.0.0.1:14317 } } }
processors:
  memory_limiter: { check_interval: 1s, limit_mib: 256 }
  attributes:
    actions:
      - { key: deployment.environment, value: production, action: insert }
      - { key: customer.email, action: delete }   # PII redaction
  batch: { send_batch_size: 8, timeout: 200ms }
exporters:
  otlp: { endpoint: 127.0.0.1:24317, tls: { insecure: true } }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [memory_limiter, attributes, batch], exporters: [otlp] }
"""
cfg_path = tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False)
cfg_path.write(CFG); cfg_path.close()
col = subprocess.Popen(["otelcol-contrib", "--config", cfg_path.name],
                       stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
time.sleep(2.0)  # let the Collector come up

# 3) Send 25 spans through the Collector with a PII attribute that should be stripped.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://127.0.0.1:14317", insecure=True),
    schedule_delay_millis=200))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("walkthrough")
for i in range(25):
    with tracer.start_as_current_span(f"checkout-{i}") as s:
        s.set_attribute("order.id", 7000 + i)
        s.set_attribute("customer.email", f"riya{i}@example.in")  # to be redacted
provider.force_flush(); time.sleep(1.5)

# 4) Inspect what the fake downstream actually received.
total = sum(len(ss.spans) for r in CAPTURED for rs in r.resource_spans for ss in rs.scope_spans)
sample_attrs = {}
for r in CAPTURED:
    for rs in r.resource_spans:
        for ss in rs.scope_spans:
            for sp in ss.spans:
                # Collect every attribute key of the first span that arrived.
                for a in sp.attributes:
                    sample_attrs.setdefault(a.key, str(a.value)[:40])
                break  # one span is enough for the presence checks below
            break
        break
    break
print(f"otlp messages received downstream : {len(CAPTURED)}")
print(f"spans on the wire after Collector : {total}")
print(f"deployment.environment present?   : {'deployment.environment' in sample_attrs}")
print(f"customer.email present?           : {'customer.email' in sample_attrs}")
print(f"sample attribute keys             : {sorted(sample_attrs.keys())}")
col.terminate(); col.wait()
Sample run (with otelcol-contrib v0.105.0 on a Razorpay-ish staging laptop):
otlp messages received downstream : 4
spans on the wire after Collector : 25
deployment.environment present?   : True
customer.email present?           : False
sample attribute keys             : ['deployment.environment', 'order.id']

Five lines tell the whole story. otlp messages received downstream : 4 is the batch processor at work — the SDK's BatchSpanProcessor shipped the 25 spans in whatever small OTLP messages its 200 ms flush cadence produced, and the Collector's batch: send_batch_size: 8, timeout: 200ms re-batched them into 4 downstream messages of at most 8 spans each. Why this matters: backend exporters charge or rate-limit per OTLP request, and a fleet of SDKs shipping many small messages costs proportionally more network round-trips than a Collector that re-batches them into backend-sized requests. The batch processor is the only reason most SaaS observability bills are not five times higher than they are. send_batch_size is the lever — a 512-span batch is normal for production; the 8-span batch used here exists only to make the count visible. spans on the wire after Collector : 25 confirms that no spans were dropped — the count in equals the count out, despite the re-batching. deployment.environment present?: True is the attributes processor's insert action — every span got the environment attribute the application forgot to set, applied uniformly at the Collector layer where the platform team controls it. customer.email present?: False is the same processor's delete action — the PII attribute that the careless SDK code emitted was stripped before it reached the backend. Why redaction lives here and not in the SDK: a thousand microservices instrumented by a thousand product engineers will inevitably emit some PII; the platform team cannot review every span. Centralising the strip in the Collector gives a single place where the data-protection officer can audit and prove the rule is enforced. SDK-side redaction is best-effort; Collector-side redaction is policy. sample attribute keys contains exactly what the platform team intended — the environment label plus the application's order.id, with the PII gone.

This same script is the diagnostic ladder when something is wrong: if you see fewer messages downstream than you expect, the problem is batch or memory_limiter (under memory pressure, memory_limiter drops batches before the exporter even sees them). If span counts mismatch, the problem is sampling — head- or tail-sampling configured but undocumented. If PII is still in the output, the attributes processor is missing from the pipeline, sits in a different pipeline than the one the spans flow through, or its rules run in the wrong order.

Receivers — every wire format is somebody's reality

Receivers exist because the world did not standardise on OTLP. The Collector ships with receivers for OTLP (gRPC and HTTP), Jaeger Thrift and gRPC, Zipkin v1 and v2, Prometheus scrape, OpenCensus, the StatsD wire format, the Carbon plaintext format, Filelog (tail any log file with regex parsing), Hostmetrics (CPU/mem/disk/network from /proc), Kafka (consume from a topic), AWS X-Ray, Datadog Agent, and several dozen more. The reason for the breadth is that fleets are heterogeneous: a service mesh might emit traces in Jaeger Thrift (because the Envoy fleet was deployed in 2019 with Jaeger), the application services might emit OTLP (because the SDK is OTel-native), the legacy databases might emit StatsD (because that was what the agent supported), and the host metrics come from /proc. The Collector ingests all of them through different receivers and unifies them into one in-memory representation — pdata (pmetric.Metrics, ptrace.Traces, plog.Logs, pprofile.Profiles).

The pdata representation is the silent hero of the design. Every receiver writes to pdata; every processor reads and mutates pdata; every exporter reads pdata and serialises to the wire format the backend expects. A Jaeger-Thrift trace ingested by the Jaeger receiver becomes a ptrace.Traces object; the same tail_sampling processor that sees an OTLP trace sees the converted Jaeger trace identically. The exporter on the other end can serialise that pdata back out as OTLP for Tempo, or Zipkin v2 JSON for an old Zipkin backend, or Kafka messages, or whatever. Format conversion is a side-effect of receivers + exporters; processors stay format-agnostic.

The Prometheus receiver is the most-used and most-misunderstood. Prometheus's pull model is fundamentally different from OTLP's push: the Collector configures itself as a Prometheus scraper (using a config block that is almost identical to a real Prometheus server's scrape_configs), pulls /metrics endpoints from configured targets at the configured scrape_interval, parses the Prometheus/OpenMetrics text format, and emits the resulting time series into the metrics pipeline. Because the receiver fully implements Prometheus's scrape protocol, the Collector can replace a Prometheus server entirely — but only on the scrape side; it cannot answer PromQL queries (that is the Prometheus server's job). The reason this matters operationally: when Razorpay's hypothetical platform team migrated from a sidecar Prometheus per pod to a centralised Collector, the migration was a config translation (every scrape_config block becomes a Prometheus-receiver target), not a service rewrite.
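
A sketch of that translation — the receiver embeds a Prometheus-style scrape_configs block inside its own config (the job name and targets below are illustrative):

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: payments-pods            # hypothetical job; one per old scrape_config block
          scrape_interval: 30s
          static_configs:
            - targets: ["10.0.4.17:9102", "10.0.4.18:9102"]
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]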

The Filelog receiver is the entry point for any log shipper that does not natively speak OTLP. Configure it with a glob (/var/log/payments/*.log) and an optional set of operators (regex parsers, JSON parsers, multiline joiners, severity extractors), and it tails the files, parses each line, and emits log records into the logs pipeline. The Collector replaces Fluent Bit and Filebeat in this role for fleets that have standardised on OTel.
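
A sketch of that configuration, assuming the payment services write one JSON object per line (the glob and field names are illustrative):

receivers:
  filelog:
    include: [/var/log/payments/*.log]
    operators:
      - type: json_parser                 # each line becomes a structured log record
      - type: severity_parser
        parse_from: attributes.level      # assumes the JSON carries a "level" field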

The OTLP receiver itself supports both gRPC (port 4317) and HTTP/protobuf (port 4318). The wire format is the same; the choice is operational. gRPC has lower overhead and better backpressure (proper flow-control via HTTP/2), but proxies and load balancers handle gRPC less well. HTTP/protobuf is friendlier for ALBs, AWS API Gateway, and corporate proxies that strip arbitrary HTTP/2 traffic. Most fleets run both ports on the same Collector and let SDKs pick. Why a single fleet often runs both: a service mesh's sidecar containers prefer gRPC because they share a Linux namespace with the application and can use UNIX-domain sockets; an external lambda function emitting traces from outside the VPC has to use HTTP/protobuf because lambda's networking does not handle long-lived gRPC connections well. One Collector listening on both ports lets each emitter pick the protocol that survives its network path.
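
The dual-listener setup is a few lines of config (the 0.0.0.0 bind addresses are the usual in-cluster choice, not a requirement):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317    # sidecars and in-VPC SDKs that can hold an HTTP/2 stream
      http:
        endpoint: 0.0.0.0:4318    # lambdas, ALBs, corporate proxies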

A subtle receiver-side concern is the header-and-auth chain. The OTLP receiver supports configuration of a chain of authenticators (OIDC validators, mTLS verifiers, a static bearer-token check) and a headers_setter that copies inbound headers (like a tenant ID) onto the in-memory batch as resource attributes. This is how multi-tenant routing starts: the receiver tags every batch at decode time with tenant=acme-corp based on a JWT claim, and a downstream routing connector reads that tag to fan out per-tenant. Without the receiver-side tagging, the tenant identity would be lost the moment OTLP became pdata. The discipline is that any policy decision that depends on who sent the data must be captured at the receiver layer; processors only see what the receiver decoded.

Processors — the place policy lives, in declared order

Processors are where every operational decision is made: which spans to keep, which attributes to add or remove, how to batch, how to redact, how to enrich, how to sample, how to split. The Collector ships with around forty processors in the contrib distribution; the half-dozen that matter for almost every fleet are memory_limiter, batch, attributes (and its newer cousin transform), resourcedetection, tail_sampling, and filter. Composing these correctly — and in the right order — is the entirety of operating the Collector well.

Order matters because each processor sees the output of the previous one. If memory_limiter runs before batch, the limiter sees small upstream batches and rejects them when memory is tight, causing back-pressure to the receiver and ultimately to the SDK; if memory_limiter runs after batch, the limiter sees large rebatched objects and rejects them late, after CPU has already been spent on rebatching. The correct order is memory_limiter first, always, because it is the only processor whose job is to be the relief valve. Past that, attributes and resourcedetection typically come before tail_sampling, because tail-sampling decisions often depend on attributes that the upstream processors add (deployment.environment, cloud.region, sampling-priority hints from baggage). And batch is almost always last, because it is the only processor that intentionally introduces latency to gain throughput, and you want it to operate on the final post-policy batch shape.
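
Spelled out as config, the recommended ordering reads (the exporter name is illustrative):

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, attributes, tail_sampling, batch]
      exporters: [otlp/tempo]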

The memory_limiter processor is the back-pressure mechanism. It periodically samples the process's heap usage against two thresholds: a soft limit (limit_mib minus spike_limit_mib) and a hard limit (limit_mib). Above the soft limit, the processor returns an error to its upstream component (which, for a receiver, manifests as a gRPC RESOURCE_EXHAUSTED, which the SDK retries). Above the hard limit, it also forces garbage collection and keeps refusing data until usage falls back. Why this design is necessary: the Collector is a Go process with a garbage collector; if it OOMs, all in-flight batches are lost and all downstream backends see a discontinuity. Better to push back on the SDKs and let them buffer, retry, or shed load gracefully than to crash and lose everything. The limit_mib should be set to roughly 75% of the container's memory limit; 25% headroom is what Go's GC needs for short bursts. Hotstar's hypothetical IPL playbook sets limit_mib: 6144 on 8 GiB Collector pods; without it, a hot-key error storm at toss can OOM a whole fleet of Collectors in a 30-second window.
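
A sketch matching the hypothetical 8 GiB pod above — the soft limit is the hard limit minus the spike allowance:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 6144          # hard limit ≈ 75% of an 8 GiB container
    spike_limit_mib: 1228    # ~20% of limit_mib (the default if unset); soft limit ≈ 4.8 GiB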

The tail_sampling processor implements the strategy of buffering all spans of a trace and making a keep/drop decision once the trace is complete. The mechanism is: span batches arrive, the processor keys them by trace_id, accumulates spans for a configurable decision_wait (default 30 s), then runs a chain of policies against the assembled trace and either forwards or drops. Policies include status_code (always keep error traces), latency (keep traces longer than X ms), numeric_attribute (keep based on a span attribute threshold), and probabilistic (random keep at rate r for traces that no other policy claimed). Memory is the cost — buffering 30 seconds of trace bytes for a Hotstar-scale fleet is gigabytes per Collector pod — and the gain is that you keep every error trace, every slow trace, and a representative random sample of the rest, instead of the head-based 1% that would lose 99% of error traces. See /wiki/tail-based-sampling-error-bias-and-late-decisions for the full mechanism.
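
A sketch of the policy chain described above (policy names, thresholds, and the 5% baseline are illustrative):

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 200000                 # max traces held in the decision buffer
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 2000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }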

[Figure — processor order is policy: the same four processors composed in two orders. Top (correct): memory_limiter → attributes → tail_sampling → batch. Bottom (broken, reversed): batch → tail_sampling → attributes → memory_limiter — tail_sampling decides on un-enriched traces (no environment label, no region) so the wrong policies fire, attributes redacts spans that sampling already dropped, and memory_limiter sits too late in the chain to act as a relief valve.]
Illustrative — the same four processors composed in two orders. The top order lets the relief-valve catch overload first, lets attributes enrich before sampling decides, and lets batch optimise the post-policy shape. The bottom order is a real misconfiguration: tail-sampling without enriched attributes makes wrong decisions, and the memory limiter runs too late to prevent OOM under burst.

The attributes and transform processors are the policy layer. attributes handles simple key-level operations: insert, update, delete, hash, redact-by-regex. transform (newer, more general) takes a small DSL — OTTL, the OpenTelemetry Transformation Language — and lets you write expressions like set(attributes["http.url"], URL(attributes["http.url"]).path) to strip query strings from URLs, or delete_key(attributes, "customer.email") where IsMatch(attributes["customer.email"], ".*@.*") to redact only spans where the attribute looks like an email. Both compile to fast in-memory mutations on pdata. The discipline is the same as in any rule engine: rules are evaluated in declared order, the order is the semantics.

The resourcedetection processor populates Resource attributes from cloud metadata services (AWS EC2 instance metadata, GCP metadata server, Azure IMDS, k8s downward-API) and from the host (hostname, os.type, host.arch). It is the "fix what the SDK forgot" processor — services that boot before the cloud-metadata service is reachable miss cloud.region, services running in k8s but configured manually miss k8s.pod.name. resourcedetection runs at the Collector layer where the metadata services are reliably reachable, and it sets the missing attributes uniformly across every signal flowing through.
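
A sketch of the usual configuration — detector order is precedence, and override: false keeps the processor from clobbering anything the SDK did manage to set (the detector list is illustrative):

processors:
  resourcedetection:
    detectors: [env, system, ec2, eks]   # pick the detectors for the platform you actually run on
    timeout: 2s
    override: false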

The filter processor is the keep-or-drop sibling of attributes. Where attributes mutates fields, filter decides whether the entire batch element survives. A typical use case at a Swiggy-style fleet: filter/spans drops health-check spans (http.target == "/healthz") before they reach tail_sampling, so the sampling buffer is not wasted on traffic the platform team has already decided is uninteresting. The cost saving is non-trivial — health checks at 1 Hz per pod across 5000 pods is 5000 spans/sec of pure noise, and dropping them at the filter step (before tail_sampling's 30-second buffer) saves the buffer memory and the downstream backend ingest cost. The discipline is to filter aggressively early in the pipeline and let the late processors operate on a smaller, more interesting set.
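
In recent contrib releases the filter processor takes OTTL conditions; a sketch of the health-check drop described above:

processors:
  filter/spans:
    error_mode: ignore
    traces:
      span:
        - attributes["http.target"] == "/healthz"   # spans matching any listed condition are dropped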

A processor that surprises new operators: groupbyattrs. Its job is to re-batch incoming batches by an attribute — e.g. group spans by service.name so each downstream batch contains only one service's spans. The use case is multi-tenant exporting where the per-tenant exporter expects single-tenant batches; the connector pattern (described in §Going deeper) supersedes most uses today, but groupbyattrs remains the right tool when the policy needs to re-shape batches without forking the pipeline.
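
Its configuration is a single list of grouping keys:

processors:
  groupbyattrs:
    keys: [service.name]    # re-batch so each outgoing batch carries one service's spans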

The transform processor with OTTL deserves a concrete example because the DSL is unfamiliar. To strip query strings from URLs, redact session tokens, and normalise HTTP status codes into broad classes, the OTTL block looks like:

transform/sanitise:
  trace_statements:
    - context: span
      statements:
        - set(attributes["http.url"], URL(attributes["http.url"]).path) where attributes["http.url"] != nil
        - delete_key(attributes, "http.request.header.authorization")
        - set(attributes["http.status.class"], Concat([Substring(Format("%d", attributes["http.status_code"]), 0, 1), "xx"], "")) where attributes["http.status_code"] != nil

Each statement runs against every span in every batch. The where clause is a guard; the action is a single mutation. Three statements compile to roughly 60 ns of overhead per span — at 3M spans/sec on one Collector pod, that is 180 ms of CPU per wallclock-second, or about 18% of one core. The lesson: OTTL is fast but not free, and a Collector with thirty transform blocks is a Collector spending a third of its CPU on string mutation. Profile with go tool pprof against the Collector's /debug/pprof/profile endpoint when CPU climbs.

Failure modes the three-stage shape produces

Operating a Collector fleet teaches you to recognise three families of failure, each rooted in one of the stages. Naming them well is the difference between a 30-minute incident and a 4-hour one.

Receiver-side failure: backpressure cascades. When the Collector's processors cannot keep up — usually because tail_sampling is buffering too many traces in memory and memory_limiter is starting to reject — the receiver returns gRPC errors to the SDKs. The SDKs retry. Retries arrive as new traffic at the same overloaded Collector. The retry storm doubles incoming RPS, the limiter rejects more, the SDKs retry harder, and within 30 seconds the Collector is processing more retry traffic than original traffic. The signature in the metrics is otelcol_receiver_refused_spans climbing while otelcol_receiver_accepted_spans flatlines and the SDK-side otel.exporter.queue.full counters fill. The fix is not "scale the Collector"; it is "fix the processor downstream of the receiver" — usually tail_sampling decision_wait too high, or memory_limiter mis-sized for the burst profile. Karan's hypothetical Hotstar incident at the start of this chapter was exactly this shape.

Processor-side failure: silent data corruption. A misconfigured attributes or transform processor can mutate spans in ways the application engineers did not authorise. A set(attributes["http.url"], "") clause meant to strip query strings from one path matches every URL because the predicate was wrong; suddenly Tempo shows traces with empty URLs and the on-call cannot search for the failing endpoint. The signature is "the data is arriving, the data is wrong" — receivers happy, exporters happy, dashboards lying. The diagnostic is to add a debug exporter on a parallel pipeline that sees the unmodified batch and compare side-by-side with the production exporter's output; the difference is the buggy processor's footprint. Always test OTTL changes in a canary Collector pod before rolling out fleet-wide.

Exporter-side failure: queue silently filling and dropping. Backend goes slow (Tempo ingester at 95% CPU, query queue building up), the OTLP exporter's send rate halves, the sending_queue fills, and once full, new batches from upstream are dropped at the queue insertion step with otelcol_exporter_enqueue_failed_spans climbing. The dashboard still shows traces — old ones that made it through before the queue filled — but new traces are vanishing. The signature is otelcol_exporter_queue_size plateaued at queue_size, otelcol_exporter_send_failed_spans flat, otelcol_exporter_enqueue_failed_spans climbing. The fix is either to scale the backend, increase the queue size (buying more buffer), or enable the persistent file-storage queue (buying durability across full-queue events). Why this is hard to diagnose at 03:00 IST: the symptom is "newer traces are missing" but the Collector is not erroring, the SDKs are not erroring, the backend is not erroring (just slow). Three healthy components, a fourth — the implicit "queue between exporter and backend" — that is silently shedding load. The runbook entry is one line: when traces are missing in Tempo but the Collector logs are clean, look at otelcol_exporter_enqueue_failed_spans first.

The discipline these failure modes impose: every Collector fleet must export its own self-telemetry to the same backend the application telemetry goes to. The Collector's internal telemetry (the service.telemetry block in the same config file) exposes ~80 metrics — otelcol_receiver_*, otelcol_processor_*, otelcol_exporter_* — and you scrape them with the same Prometheus that scrapes everything else. Without these self-metrics, the three failure modes above are invisible. With them, they are obvious.
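
Enabling the self-metrics is a few lines in the same config file (the port shown is the conventional default for this Collector generation):

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888    # scrape this endpoint with the same Prometheus as everything else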

Exporters — the bytes leaving the building

Exporters are where the in-memory pdata representation becomes wire bytes again, this time pointed at a backend. The set of available exporters mirrors the set of available receivers: OTLP (gRPC and HTTP) for OTel-native backends like Tempo and Honeycomb, Prometheus remote-write for Prometheus-compatible TSDBs (Cortex, Mimir, Thanos receive), Loki for Grafana Loki, Jaeger and Zipkin for legacy trace backends, Kafka for streaming pipelines, the cloud-vendor exporters (CloudWatch, Stackdriver, AWS X-Ray, Azure Monitor), and file/debug/logging for, well, debugging. Most production fleets use 2–4 exporters per pipeline: one to the primary backend, one to a long-term archive (Kafka or S3), one to a debug sink during migrations.

Every exporter has the same internal structure: a queue (configurable size, defaults to 1000 batches), a retry policy (exponential backoff with initial_interval, max_interval, max_elapsed_time), and the send logic (which speaks the backend's wire format). When the exporter receives a batch from the previous processor, it enqueues it; a worker pool drains the queue, attempts to send each batch, and on retriable errors (5xx, network errors, gRPC UNAVAILABLE) puts the batch back at the head of the queue with a backoff. On non-retriable errors (4xx, malformed data) the batch is dropped permanently. Why the queue and retry are critical at IPL scale: a backend that goes down for 30 seconds during peak load translates to ~3 million spans per Collector pod that need to be buffered while the backend recovers. With a 1000-batch queue at 5000 spans per batch, that is 5 million spans of buffer headroom — barely enough. The hypothetical Hotstar playbook tunes sending_queue.queue_size: 10000 for the OTLP exporter and runs Collectors with persistent volumes to spill the queue to disk if memory fills, because a 30-second outage is the easy case; a 5-minute outage during the toss is the case that decides whether the trace data survives or vanishes.

Exporters can also be configured to fan out the same batch to multiple backends simultaneously by listing multiple exporters in the pipeline's exporters: block. This is how the Cleartrip-style "shadow new backend during migration" pattern works at the Collector layer — every span is sent to the old Jaeger backend and the new Tempo backend, the team compares dashboards across both for a few weeks, and once parity is established the old exporter is removed in a one-line config change. The Collector's pipeline shape makes this trivial; without the Collector, every SDK in the fleet would have had to ship dual exporters and the migration would have been a coordinated rolling deploy across hundreds of services.
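
A sketch of the shadow wiring — both backends reached through named otlp exporters, endpoints illustrative:

exporters:
  otlp/jaeger: { endpoint: jaeger-collector:4317, tls: { insecure: true } }   # old backend
  otlp/tempo:  { endpoint: tempo-gateway:4317, tls: { insecure: true } }      # new backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, otlp/tempo]   # every batch ships to both during the migration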

The exporter's sending_queue deserves more attention than most teams give it. The default 1000-batch size is fine for steady-state operation but is a footgun for any backend that experiences a "blip" — at a Hotstar-scale per-pod ingress of ~100k spans/sec and 500 spans per batch, a 30-second Tempo outage is ~6000 batches, six times the queue size. Once the queue saturates, the exporter starts rejecting at enqueue time and otelcol_exporter_enqueue_failed_spans climbs. The defence is a layered tuning: increase sending_queue.queue_size to 5000–10000 for production exporters, set sending_queue.num_consumers to 8–16 (parallel workers draining the queue), and for the audit tier configure sending_queue.storage: file_storage so the queue spills to a persistent volume rather than dropping on overflow. The retry policy is the second lever: the default max_elapsed_time: 5m means an exporter retries a single batch for up to five minutes — fine for a transient blip, dangerous for a backend that is genuinely down, because the in-memory queue fills during the retry window. Pair the retry timeout with a queue size that matches your worst-case "backend down" window, or accept that prolonged outages will drop data.
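
Put together, the tuning described above looks roughly like this (the endpoint and numbers are illustrative; storage: file_storage assumes the file_storage extension shown in the persistent-queue section below):

exporters:
  otlp/tempo:
    endpoint: tempo-gateway:4317
    sending_queue:
      queue_size: 10000          # batches, not spans
      num_consumers: 16
      storage: file_storage      # audit tier only — spill to disk instead of dropping
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 5m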

Common confusions

  • "The Collector is the same thing as the SDK exporter." It is not. The SDK's exporter ships bytes to the Collector (or directly to a backend). The Collector is a separate process that receives those bytes, transforms them through processors, and ships them to backends. Pipelines are typed and policy-rich in the Collector; the SDK's exporter is a thin gRPC/HTTP client.
  • "Processors are run in parallel." They are not. Within a pipeline, processors run in strict declared order on each batch. The order is the policy. Two pipelines run in parallel, but a single pipeline is sequential through its processors.
  • "batch and memory_limiter are optional." They are effectively mandatory in production. Without batch, the Collector ships small messages and inflates network costs. Without memory_limiter, the Collector OOMs under burst load and loses every in-flight batch instead of pushing back gracefully on receivers.
  • "Tail-sampling at the Collector is a free latency win." It is not free — tail-sampling buffers spans until decision_wait elapses (30 s default), so every span sits in Collector memory for that long. The trade-off is memory cost vs the ability to keep all error traces; for a Hotstar-scale fleet this is gigabytes per Collector pod, paid against the win of 100% error-trace retention.
  • "The Collector is required to use OpenTelemetry." It is not — small fleets can ship OTLP straight from the SDK to a backend that accepts OTLP (Tempo, Honeycomb, Lightstep). The Collector becomes necessary at scale, when policy needs to live somewhere outside every SDK and the platform team needs a choke point. Below ~50 services, you can usually skip it.
  • "Receivers and exporters are symmetric." The wire-format set is symmetric (OTLP receiver pairs with OTLP exporter, and so on), but the semantics differ. A receiver decodes wire bytes and emits into the pipeline; an exporter takes pipeline pdata and serialises to wire bytes — and crucially handles retry, queue, and backoff that the receiver does not.

Going deeper

pdata is a memory-allocation arena, not a Go struct

The pdata representation that flows between pipeline stages is not a normal Go struct. The contrib team designed it as a memory-allocation arena where every field is a slice of bytes or a numeric primitive, allocated from a pool, with copy-on-write semantics for zero-cost batch mutations. This is why processors can mutate batches in place without paying for full deep-copies — when attributes deletes customer.email, it does not allocate a new attribute map; it marks the slot as deleted in the existing one. The trade-off is API ugliness — pdata.NewMap(), m.PutStr("k", "v") everywhere — but the throughput gain is roughly 4× over an idiomatic Go-struct representation. For a Collector handling 3M spans/sec on a 16-core pod, that 4× is the difference between fitting in one pod and needing four.

The connectors extension — pipelines that feed pipelines

Connectors are a recent addition (v0.86.0+) that act simultaneously as an exporter on one pipeline and a receiver on another. The killer use-case is spanmetrics: a connector that sits at the end of the trace pipeline (consuming spans), aggregates them into RED metrics (rate, errors, duration), and feeds those metrics into the metrics pipeline as if they had arrived from a Prometheus receiver. The result is RED dashboards in Grafana that derive from the trace stream automatically — no per-service instrumentation required. A hypothetical Razorpay payments fleet ran this for nine months and replaced ~400 lines of hand-rolled prometheus-client.Histogram code in their Flask apps with two lines of spanmetrics config in the Collector. The pattern generalises: servicegraph connector to derive service-dependency graphs from spans, routing connector to fan a single pipeline into many based on attributes, forward connector to plumb between exotic shapes.
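
The wiring that makes a connector both ends at once — it appears as an exporter in the trace pipeline and as a receiver in the metrics pipeline (backend exporters are illustrative):

connectors:
  spanmetrics: {}    # defaults: RED metrics keyed by service and span name
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo, spanmetrics]   # the connector consumes the trace stream here...
    metrics:
      receivers: [spanmetrics]               # ...and emits derived metrics here
      processors: [batch]
      exporters: [prometheusremotewrite]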

Multi-tenant routing — one Collector for fifty teams

A single Collector pod can multiplex fifty product teams' traffic if you configure tenant-aware routing. The pattern: the OTLP receiver is configured with auth: { authenticator: oidc }, every SDK includes a tenant-identifying JWT, the routing connector reads headers["X-Tenant"] and routes the batch into one of fifty per-tenant exporter pipelines. Each tenant's exporter has its own sending_queue, its own retry policy, its own destination backend. A noisy tenant fills its own queue and gets dropped without affecting other tenants. This is how the hypothetical observability platform team at Cred runs forty internal teams through one Collector fleet without one team's instrumentation bug taking down everyone else's telemetry.
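
A sketch of the shape (two tenants shown instead of fifty; the tenant resource attribute is assumed to have been set at the receiver, and the exact OTTL and keys vary by Collector version):

connectors:
  routing/tenants:
    default_pipelines: [traces/default]
    table:
      - statement: route() where attributes["tenant"] == "acme-corp"
        pipelines: [traces/acme]
      - statement: route() where attributes["tenant"] == "globex"
        pipelines: [traces/globex]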

Persistent queue — surviving Collector restarts without losing data

The sending_queue defaults to in-memory, which means a Collector restart loses every batch in the queue. For fleets where that loss is unacceptable (Zerodha-style trading audit traces that must reach the audit backend), the queue can be configured with storage: file_storage/<id> and the queue spills to a persistent volume. On restart, the Collector re-reads the queue from disk and resumes sending. The cost is disk I/O on every batch — roughly halving max throughput compared to in-memory queues — but the gain is durability across pod restarts, rolling deploys, and node drains. The discipline: only enable persistent queue for the audit tier where compliance demands it; the operations tier (everyday telemetry) can tolerate the small loss across a 5-second restart window in exchange for full throughput.
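
A sketch of wiring the file_storage extension into an audit-tier exporter (the directory, endpoint, and pipeline name are hypothetical):

extensions:
  file_storage/audit:
    directory: /var/lib/otelcol/queue    # a persistent volume mount
exporters:
  otlp/audit:
    endpoint: audit-backend:4317
    sending_queue:
      storage: file_storage/audit        # queue entries survive pod restarts
service:
  extensions: [file_storage/audit]
  pipelines:
    traces/audit:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/audit]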

The agent vs gateway deployment shapes, and the build-your-own-binary path

There are two canonical Collector deployment shapes. Agent (one Collector per node, typically as a DaemonSet sidecar): SDKs send to localhost:4317, the agent does light enrichment (resourcedetection for k8s downward-API attributes), and forwards to the gateway. Gateway (a horizontally-scaled fleet of Collector pods behind a load balancer): receives from agents, does heavy policy work (tail-sampling, redaction, multi-tenant routing), and ships to backends. The two-tier topology lets the agent stay tiny (256 MiB memory limit) while the gateway handles the GiB-scale tail-sampling buffers. Most Indian-fleet deployments at scale (Hotstar, Flipkart, Razorpay shape) run two-tier; smaller fleets run gateway-only.

The Collector binary itself is split into otelcol (core, ~dozen components) and otelcol-contrib (~200 components) to keep dependency trees and CVE-surface manageable. The discipline at production scale is to use the OpenTelemetry Collector Builder (ocb) to assemble a custom binary with only the components your fleet uses — a Razorpay-style fleet typically lands at a 90 MiB binary instead of the 250 MiB contrib distribution. The same Builder path is how you compile in a custom Go processor when OTTL is not enough; OTTL deliberately omits scripting hooks because at 3M spans/sec a 1 µs script call per span would burn 3 CPU-seconds per wallclock-second, so any in-process scripting must be Go-compiled-in, not interpreted-at-runtime.
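
A sketch of an ocb build manifest for a slimmed-down binary — the distribution name and component list are illustrative, with module versions matched to the v0.105.0 generation used elsewhere in this chapter:

dist:
  name: otelcol-payments          # hypothetical custom distribution
  output_path: ./dist
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.105.0
processors:
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.105.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.105.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.105.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.105.0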

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-api opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc grpcio
# Get the Collector binary (Linux x86_64 example):
curl -L -o /tmp/otelcol.tgz https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.105.0/otelcol-contrib_0.105.0_linux_amd64.tar.gz
tar xzf /tmp/otelcol.tgz -C /tmp && sudo mv /tmp/otelcol-contrib /usr/local/bin/
python3 collector_walkthrough.py
# Watch: 25 spans in, ~4 OTLP messages out, customer.email gone.

Where this leads next

The next chapter /wiki/auto-instrumentation covers the opentelemetry-instrument zero-code agent — the productised version of the API/SDK split that the Collector then receives traffic from. Most production OTel deployments are auto-instrumentation feeding a Collector feeding Tempo/Loki/Mimir; understanding both ends is what makes the topology sensible.

After auto-instrumentation comes /wiki/otlp-the-wire-format, the protobuf-level details of what the OTLP receiver actually decodes — the bytes Karan was watching on the wire when he was trying to figure out where the lost spans went. Then /wiki/processors-sampling-attribute-policy for the policy half of the Collector — the OTTL grammar, the tail-sampling memory math, and the multi-tenant routing patterns that turn one Collector fleet into a billing-aware platform tier.

The closing thought is the one Karan reached at 23:47 IST when the Hotstar Collector fleet stabilised. The bug was not in the application services and was not in Tempo. It was in memory_limiter being configured after tail_sampling in the trace pipeline — the limiter was rejecting batches that had already paid the tail-sampling buffer cost, and the rejection cascaded back through receivers as gRPC errors, which the SDKs retried, which inflated incoming traffic, which made the limiter reject more, which made the SDKs retry harder. A one-line config change moved memory_limiter to the front of the processor chain and the sawtooth flattened in 90 seconds. The pipeline shape was the bug; the pipeline shape was the fix. Receivers, processors, exporters — three layers, one config, every operational decision lives in one of them.
