Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

SDK vs API

It is the Tuesday before a Razorpay quarterly review. Aditi, the platform engineer responsible for the new merchant-onboarding-api, has spent four days adding OpenTelemetry instrumentation to thirty-seven business-logic spans. The code reviewer approved every PR. The unit tests pass. The service deploys to production at 14:30 IST. At 18:00 the SRE team Slacks her — Tempo shows zero traces from merchant-onboarding-api. Not a sampling issue, not a network issue: zero spans, no error logs, no exporter retries, nothing. The service is healthy, returns 200s, the application logs show the spans being "started" and "ended". The OTel pipeline shows nothing.

The bug is not in Aditi's instrumentation. It is in her mental model of what from opentelemetry import trace actually does. She added thirty-seven tracer.start_as_current_span(...) calls. She did not add the SDK initialisation that turns those calls into recording spans. The OTel API is, by design, a no-op when no SDK is registered — it returns a sentinel NonRecordingSpan that swallows every set_attribute, every add_event, every end() call. Her thirty-seven spans were created, decorated, and discarded inside the process; nothing ever crossed a network boundary because there was nothing to export.

This split — API as a thin contract, SDK as the runtime — is the single most consequential design decision in OpenTelemetry. It is also the one that bites every team on first contact.

The OpenTelemetry API is what your application code calls — get_tracer(), start_as_current_span(), set_attribute(). It is intentionally a no-op when no SDK is registered, returning sentinel objects that record nothing. The SDK is the runtime that you wire up at process boot — TracerProvider, span processors, exporters — that turns API calls into actual OTLP messages on the wire. Library authors instrument against the API; application authors register the SDK. Mixing the two surfaces is the bug; separating them is the entire point.

Two surfaces, two audiences

The split has been an explicit OTel design choice since the very first spec drafts in 2019. The API package and the SDK package are two separately-versioned, separately-installable Python distributions — opentelemetry-api and opentelemetry-sdk — and you can run a process with only the API installed, in which case every instrumentation call becomes a no-op.

The reason for the split is the library author problem. A Python library — say, psycopg2, or redis-py, or requests — wants to add tracing so that any application using it gets database/cache/HTTP spans for free. But the library cannot decide on behalf of the application whether to record traces, where to export them, or how often to sample. If psycopg2 shipped with a hardcoded SDK that exported to OTLP at http://localhost:4317, the library would be unusable in any environment where that endpoint did not exist. So the library imports only opentelemetry-api, calls get_tracer("psycopg2", "2.9.10"), and emits spans against whatever TracerProvider the application has registered — including the default no-op provider, in which case the spans cost a sub-microsecond allocation each and are immediately discarded.
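
A minimal sketch of the library-author side — only opentelemetry-api is assumed installed, and the module, function, and attribute names are illustrative:

# mylib/client.py — a library instruments against the API only (sketch).
# If the application never registers an SDK, every span below is a
# NonRecordingSpan and the calls cost almost nothing.
from opentelemetry import trace

# the scope name/version identify this library, not the application
tracer = trace.get_tracer("mylib", "1.0.0")

def fetch(key: str) -> str:
    with tracer.start_as_current_span("mylib.fetch") as span:
        if span.is_recording():            # skip attribute work under no-op
            span.set_attribute("mylib.key", key)
        return f"value-for-{key}"          # stand-in for the real work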

The application author is the one who installs opentelemetry-sdk, configures a TracerProvider, attaches a BatchSpanProcessor and an OTLPSpanExporter, and calls trace.set_tracer_provider(provider). Once that registration happens, every get_tracer() call in the process — including the ones inside psycopg2, redis-py, and requests — returns a tracer that produces recording spans. The library code does not change. The library does not even know whether the application registered an SDK; it just calls the API and lets the registered provider decide what to do.

[Figure: The OpenTelemetry API / SDK split — two surfaces, two audiences. Left panel: the API surface (opentelemetry-api) — what library authors and application code call (get_tracer, start_as_current_span, set_attribute); library authors import only the API and never instantiate a TracerProvider or exporter; with no SDK registered, calls resolve to NonRecordingSpan. Right panel: the SDK surface (opentelemetry-sdk) — what application boot code wires up (TracerProvider, span processors, exporters, trace.set_tracer_provider); after registration every API call becomes recording, including library calls already in progress. trace.get_tracer_provider() late-binds to whatever the SDK registered; the no-op default is where Aditi's 37 spans disappeared. Illustrative — package and class names from opentelemetry-python 1.27.0.]
Illustrative — the API/SDK split. The API knows only how to call `trace.get_tracer_provider()`; the SDK is what populates that provider. With no SDK registered, the default provider is a no-op and every span is silently discarded.

This is the pattern that lets a single Python process compose instrumentation from twelve different libraries (Flask, psycopg2, redis-py, requests, kafka-python, ...) without any of them coordinating with the others. They all import opentelemetry-api. They all call get_tracer(). The application installs the SDK once, and every library's spans flow through the same exporter. Why this layering matters: any other design would force every library to either (a) ship its own exporter (so you would have twelve exporters fighting over OTLP endpoints), or (b) require the application to pass an exporter into every library's init code (so every library would gain a tracer_provider= parameter). The OTel design lets the API be a thin contract that delegates to a globally-registered provider, and that contract is what makes the auto-instrumentation ecosystem work — pip install opentelemetry-instrumentation-flask works because Flask itself uses only the API.

A subtle but important point: the registration is a singleton. trace.set_tracer_provider(provider) mutates a module-level global in opentelemetry.trace, and you can register only once per process — subsequent calls log a warning and are ignored. This is why frameworks like opentelemetry-instrumentation-fastapi warn about initialisation order: instrumentation that captures a tracer outside the proxy mechanism before the SDK is registered will stay a no-op forever. The robust order is to register the SDK first, then import the instrumentation packages, then start the application.
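
A quick way to see the singleton semantics with your own eyes — a sketch; the exact warning text has shifted slightly across opentelemetry-python releases:

# singleton_demo.py — the first registration wins; the second is rejected.
import logging
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

logging.basicConfig(level=logging.WARNING)

first = TracerProvider()
second = TracerProvider()

trace.set_tracer_provider(first)
trace.set_tracer_provider(second)   # logs a warning; registration is ignored

print(trace.get_tracer_provider() is first)   # True — the singleton kept `first`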

Watching the no-op happen — and watching it disappear when the SDK lands

The cleanest way to internalise the split is to run the same instrumented code twice — once with no SDK, once with the SDK registered — and watch what changes. The script below builds a small instrumented function, runs it under both regimes, and prints the span objects' types and recording status so you can see the no-op pattern with your own eyes.

# api_vs_sdk_demo.py — show what the OTel API does with and without an SDK.
# pip install opentelemetry-api opentelemetry-sdk \
#             opentelemetry-exporter-otlp-proto-grpc \
#             grpcio
import time
from concurrent import futures
import grpc
from opentelemetry import trace
from opentelemetry.proto.collector.trace.v1 import (
    trace_service_pb2, trace_service_pb2_grpc)
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import (
    TraceServiceServicer, add_TraceServiceServicer_to_server)

# A function instrumented against ONLY the API. Note: no SDK import here.
def score_user(user_id: int) -> float:
    tracer = trace.get_tracer("recommendations.scoring", "1.0.0")
    with tracer.start_as_current_span("score_user") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("model.name", "rec-v3")
        # Show what kind of span this actually is
        print(f"  span class      = {type(span).__name__}")
        print(f"  is_recording()  = {span.is_recording()}")
        print(f"  span_context.trace_id = {span.get_span_context().trace_id}")
        time.sleep(0.005)
        return 0.84

# === Run 1: NO SDK registered. The API is a no-op. ===
print("=== Run 1: no SDK registered ===")
score_user(42)
score_user(43)
# Inspect the global provider
print(f"  global provider = {type(trace.get_tracer_provider()).__name__}")

# === Run 2: register the SDK, point at a fake collector. ===
CAPTURED = []
class FakeCollector(TraceServiceServicer):
    def Export(self, request, ctx):
        CAPTURED.append(request)
        return trace_service_pb2.ExportTraceServiceResponse()

srv = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
add_TraceServiceServicer_to_server(FakeCollector(), srv)
srv.add_insecure_port("127.0.0.1:14318"); srv.start()

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({"service.name": "recommendations-api"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://127.0.0.1:14318", insecure=True),
    schedule_delay_millis=200))
trace.set_tracer_provider(provider)

print("\n=== Run 2: SDK registered, exporter pointed at fake collector ===")
score_user(44)
score_user(45)
print(f"  global provider = {type(trace.get_tracer_provider()).__name__}")

provider.force_flush(); time.sleep(0.5)
print(f"\nfake collector received {len(CAPTURED)} OTLP messages")
total_spans = sum(len(ss.spans) for req in CAPTURED
                                for rs in req.resource_spans
                                for ss in rs.scope_spans)
print(f"total spans on the wire: {total_spans}")
Sample run:
=== Run 1: no SDK registered ===
  span class      = NonRecordingSpan
  is_recording()  = False
  span_context.trace_id = 0
  span class      = NonRecordingSpan
  is_recording()  = False
  span_context.trace_id = 0
  global provider = ProxyTracerProvider

=== Run 2: SDK registered, exporter pointed at fake collector ===
  span class      = _Span
  is_recording()  = True
  span_context.trace_id = 178432651074821390742398217482310947
  span class      = _Span
  is_recording()  = True
  span_context.trace_id = 84321097483210947832109842310984210984
  global provider = TracerProvider

fake collector received 1 OTLP messages
total spans on the wire: 2

Six lines deserve attention.

  • span class = NonRecordingSpan in Run 1 is the no-op surface — the API hands back a sentinel span object whose every method (set_attribute, add_event, set_status, end) is a no-op that returns immediately.
  • is_recording() == False is the API method for detecting this; instrumentation libraries that do expensive attribute computation should always check is_recording() first to avoid building strings the SDK will throw away.
  • span_context.trace_id == 0 is the giveaway for tail-based samplers and log-correlation code — a zero trace_id means no SDK, not "tracing is off for this request".
  • global provider = ProxyTracerProvider in Run 1 is the default that ships with opentelemetry-api; it forwards to a real provider once one is registered, and to a no-op tracer in the meantime. The proxy exists because the spec wants tracer = trace.get_tracer(...) calls made at module import time to keep working even if the application registers the SDK later: the proxy resolves the underlying tracer lazily on every API call, so an instrumentation library that captured a tracer at import is not stuck with a no-op forever — once set_tracer_provider() is called, every subsequent call on the proxy delegates to the real provider.
  • span class = _Span in Run 2 is the SDK's recording-span class from opentelemetry.sdk.trace; this is the class that builds OTLP messages and queues them for export.
  • fake collector received 1 OTLP messages confirms the BatchSpanProcessor flushed; the count is 1 message because force_flush() ships everything queued at the time of the call, and the two spans were both still in the same batch.

The diagnostic ladder for "I added spans and Tempo shows nothing" is exactly this script, condensed: print type(span).__name__ and span.is_recording() inside any one of your spans. If you see NonRecordingSpan and False, you are missing the SDK registration. If you see _Span and True, the spans are recording but the export is broken — different bug, different fix.

Why library authors must never import the SDK

The discipline the API/SDK split imposes is strict: a library that publishes telemetry must import only opentelemetry-api. If a library imports opentelemetry-sdk — even transitively — it is forcing every consumer of that library to install the SDK and to live with whatever provider that library happens to register. This breaks composability and is the reason most early OTel rollouts had violently incompatible SDKs colliding inside the same process.

The first failure mode is the double-registration warning. Suppose library-a and library-b both import the SDK and call set_tracer_provider(provider_a) and set_tracer_provider(provider_b) at module import time. Whichever import happens first wins; the other library's registration is rejected with a logged warning. The application has zero control over which one wins because Python's import order depends on dependency-resolution order. The result is non-deterministic instrumentation: the same code path emits to library-a's exporter on one machine and library-b's on another, depending on which library got imported first.

The second failure mode is the resource collision. The SDK's Resource block is set at provider construction. If two libraries each construct a TracerProvider with their own Resource, the application has no way to merge them — the singleton wins, and the loser's resource attributes are gone. The application's service.name might end up as library-a-internal instead of merchant-onboarding-api because library-a happened to register its provider first. Tempo will index spans under the wrong service. The service map will show mystery services. A related production bug at the hypothetical Razorpay-scale fintech started exactly this way — a vendored library had imported the SDK to "get tracing for free" and registered its own TracerProvider at import time, so the application's registration was rejected and its resource attributes never made it onto the wire. The fix was a four-line patch removing the library's SDK import; the time-to-find was eight days.

The third failure mode is the dependency bloat. The OTel SDK pulls in opentelemetry-exporter-otlp-proto-grpc (or http), which pulls in grpcio (about 6 MB compiled), which pulls in protobuf, which has its own version constraints. A small library that only wants to emit a few spans should not force every consumer to take that dependency tree. The API is small (~80 KB, no native deps) precisely so that any library can import it without imposing weight.

The OTel project enforces the discipline at the package level. The opentelemetry-api package's metadata deliberately does not depend on the SDK; you can pip install opentelemetry-api without opentelemetry-sdk ever entering your environment. Library authors should pin only the API and leave SDK installation to the application. The opentelemetry-instrumentation-* packages (Flask, psycopg2, requests, etc.) follow this rule — they instrument against the API and the application supplies SDK + exporter. Why this matters operationally: the Razorpay hypothetical fix was patching one vendored library to remove an import opentelemetry.sdk.trace. That single import path was pre-empting the application's provider, the application's resource block, and the application's exporter. The diagnostic was running pip show opentelemetry-sdk and finding it pulled in by a library that should not have needed it; the cure was removing the import and re-deploying. The whole class of bug exists only when libraries violate the discipline.

Bootstrapping the SDK — the four lines that matter

The application-side wiring of the SDK is small and almost mechanical, but each line carries a decision the application owner is the only one allowed to make. Here is the canonical four-line bootstrap, annotated:

# 0. Imports — everything below comes from the SDK and exporter packages;
#    the application is the only layer allowed to import them.
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# 1. Build the Resource — what every span/metric/log from this process inherits.
resource = Resource.create({
    "service.name": "merchant-onboarding-api",
    "service.version": "0.4.2",
    "deployment.environment": "production",
    "k8s.pod.name": os.environ["HOSTNAME"],
    "cloud.region": "ap-south-1",
})

# 2. Build the TracerProvider — the runtime that produces and routes spans.
provider = TracerProvider(
    resource=resource,
    sampler=ParentBasedTraceIdRatio(rate=0.10),  # head-based, 10% of root spans
)

# 3. Wire processors + exporter — how spans leave the process.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True),
    max_export_batch_size=512,
    max_queue_size=2048,
    schedule_delay_millis=5000,
))

# 4. Register globally — make the API surface delegate to this provider.
trace.set_tracer_provider(provider)

The first line — the Resource — is the decision the API has no equivalent for; only the SDK carries a Resource at all. The application owner picks the service.name, the deployment.environment, the cloud-region attribute. (See /wiki/the-data-model for which attributes belong on the Resource and which belong on the signal.)

The second line — TracerProvider(sampler=...) — is where the application makes the policy decision the libraries cannot. A library does not know whether the application wants 1% sampling for free-tier traffic and 100% sampling for paid-tier; only the application does, and only the application is allowed to set the sampler (a sketch of such a tier-aware sampler follows below). The default sampler is parent-based always-on: sample every root span, follow the parent's decision for children. Tail-based sampling is configured at the Collector layer, not in the SDK; see /wiki/tail-based-sampling-error-bias-and-late-decisions.
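
A sketch of what that application-owned policy could look like — a hypothetical tier-aware sampler built on opentelemetry-python's Sampler interface; user.tier is an attribute this application would set at span creation (the sampler only sees attributes passed to start_span):

# tier_sampler.py — hypothetical application-owned sampling policy (sketch).
from opentelemetry.sdk.trace.sampling import (
    Decision, Sampler, SamplingResult, TraceIdRatioBased)

class TierSampler(Sampler):
    """Keep 100% of paid-tier root spans, 1% of everything else."""

    def __init__(self) -> None:
        self._free_tier = TraceIdRatioBased(0.01)

    def should_sample(self, parent_context, trace_id, name, kind=None,
                      attributes=None, links=None, trace_state=None):
        if attributes and attributes.get("user.tier") == "paid":
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        # everything else falls through to the 1% ratio sampler
        return self._free_tier.should_sample(
            parent_context, trace_id, name, kind, attributes, links, trace_state)

    def get_description(self) -> str:
        return "TierSampler{paid=1.0, other=0.01}"

# usage: provider = TracerProvider(resource=resource, sampler=TierSampler())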

The third line — the BatchSpanProcessor + OTLPSpanExporter — is where the application decides where the spans go and how often. BatchSpanProcessor queues spans up to max_queue_size and flushes either every schedule_delay_millis or whenever max_export_batch_size is reached. The defaults (2048-span queue, 512-span batch, 5-second delay) are fine for most services; the failure mode is that under burst load the queue fills before the processor flushes, and excess spans are dropped silently. A Hotstar-scale service handling 800k spans/sec across 80 microservices needs the queue tuned up to 8192 and the batch to 1024, or the processor will drop ~5% of spans during the toss spike. The metric to monitor is otel.bsp.dropped_spans exposed by the SDK.

The fourth line — trace.set_tracer_provider(provider) — is what connects the API surface to this SDK. Until this line runs, every tracer.start_as_current_span(...) in the process is a no-op against the default provider. Ordering matters less in Python than the singleton rule suggests — the default proxy lets tracers captured at module-load time pick up the provider lazily — but instrumentation that bypasses the proxy never recovers. The robust pattern is to run all four lines as the very first thing in main(), before any application module is imported, as the sketch below shows.
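
A sketch of that ordering, with hypothetical module and function names:

# main.py — register the SDK before importing anything that captures tracers.
def main() -> None:
    # hypothetical helper wrapping the four bootstrap lines above
    from myservice.telemetry import bootstrap_tracing
    bootstrap_tracing()

    # only now import modules that may call get_tracer() at import time
    from myservice.app import create_app
    create_app().run()

if __name__ == "__main__":
    main()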

A useful sanity check at deploy time: log the global provider class. If you see ProxyTracerProvider after the bootstrap, the SDK registration silently failed and you should look for an exception in the boot logs. If you see TracerProvider, the registration succeeded and the API now produces recording spans.

[Figure: SDK bootstrap ordering — what happens before vs after set_tracer_provider(). A process-boot timeline: before registration, the global provider is ProxyTracerProvider, get_tracer() returns a ProxyTracer, and spans are NonRecordingSpan; after registration, the proxy delegates to the SDK TracerProvider, spans record, and the exporter ships them. The Aditi failure mode: a tracer captured at import via get_tracer(...) stays alive through the proxy, but instrumentation that instantiated NoOpTracer directly can never be rescued by a later set_tracer_provider() — fix the import order, not the SDK. Illustrative.]
Illustrative — boot ordering. The proxy makes most ordering bugs recoverable, but only because every API call resolves the underlying tracer lazily on each invocation. Instrumentation that bypasses the proxy is the one case the SDK cannot rescue.

Failure modes the split deliberately permits

The API/SDK split solves several problems by accepting a small set of failure modes that the application owner is responsible for handling. Naming them is what separates teams who debug in minutes from teams who debug in days.

Forgotten registration is silent. The single most common production failure is the one Aditi hit — the SDK is never registered, the API quietly produces no-op spans, the application emits no telemetry, and no error fires anywhere because emitting nothing is the contract of the API in the absence of an SDK. The mitigation is a startup self-test: in main() after the bootstrap, create a span, check span.is_recording(), and crash hard if it returns False. A platform team at a hypothetical IRCTC-scale booking service caught seventeen production deployments over six months that had merged without SDK init via this five-line self-test; the alternative was finding out at the next Tatkal incident that the canary had no telemetry at all.
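
A minimal sketch of that self-test, assuming the four-line bootstrap has already run; the function name is illustrative:

# boot_selftest.py — crash at boot if SDK registration silently failed.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider as SDKTracerProvider

def assert_tracing_is_live() -> None:
    provider = trace.get_tracer_provider()
    # a failed bootstrap leaves the API's ProxyTracerProvider in place
    if not isinstance(provider, SDKTracerProvider):
        raise RuntimeError(
            f"OTel self-test: global provider is {type(provider).__name__} — "
            "set_tracer_provider() never ran?")
    # spot-check a span; a ratio sampler can legitimately hand back a
    # non-recording span, so only an invalid (all-zero) context is fatal
    with trace.get_tracer("boot.selftest").start_as_current_span("selftest") as s:
        if not s.get_span_context().is_valid:
            raise RuntimeError("OTel self-test: span context invalid — no SDK?")

# call assert_tracing_is_live() in main(), right after the bootstrap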

Captured tracers never refresh. A library that does from opentelemetry.trace import NoOpTracer; tracer = NoOpTracer() (i.e. ignores get_tracer() and constructs a tracer directly) will never see the SDK no matter when it is registered. The proxy mechanism only works for tracers obtained through get_tracer(). Some legacy instrumentation libraries written in 2020 still do the wrong thing here; the symptom is that part of your application produces spans and another part produces nothing. The fix is to find the offending library (grep -r 'NoOpTracer\|DefaultTracer' venv/lib/) and patch it to use get_tracer() instead.
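
The broken pattern next to its fix, as a sketch — the direct-construction class has gone by slightly different names across early releases:

# broken: constructs a tracer directly — the proxy never sees it, so a later
# set_tracer_provider() can never upgrade it to a recording tracer.
from opentelemetry.trace import NoOpTracer
tracer = NoOpTracer()

# fixed: resolve through the API — the default ProxyTracer re-resolves the
# registered provider on every call, so late SDK registration still works.
from opentelemetry import trace
tracer = trace.get_tracer("legacy.lib", "0.9.1")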

Resource is fixed at provider construction. The Resource you pass to TracerProvider(resource=...) is effectively immutable; you cannot add attributes after the first span has been emitted. If the cloud-region attribute is not available at boot (because the cloud-metadata service is slow, common on EC2 first-boot), the SDK ships spans with no cloud.region and you cannot retroactively add it. The mitigation is Collector-side enrichment: the resourcedetection processor fills in attributes from the environment the Collector itself runs in, and in Kubernetes the k8sattributes processor enriches spans keyed off the connection's source IP.

  • "Importing opentelemetry-api is enough to start tracing." It is enough to write instrumentation code, but the spans go nowhere until an SDK is registered. The API is a contract; the SDK is the runtime. If your pip list does not show opentelemetry-sdk or opentelemetry-exporter-*, you have an instrumented application that emits zero telemetry.
  • "tracer.start_as_current_span() always creates a real span." It returns a real span only if the registered TracerProvider is an SDK provider. With no SDK registered, it returns a NonRecordingSpan — every method on it is a no-op. Check span.is_recording() if you are uncertain.
  • "Library authors should ship a default SDK so users get tracing out of the box." They should not. A library that imports the SDK forces every consumer to take the SDK, the exporter, and gRPC as transitive dependencies — and worse, the library's set_tracer_provider() call may overwrite the application's. Library authors instrument against the API only.
  • "You can register multiple TracerProviders for different libraries." You cannot. set_tracer_provider() is a process-global singleton; the second call logs a warning and is ignored. Multi-tenant or multi-destination routing is configured inside the SDK (multiple span processors, multiple exporters on one provider), not by registering multiple providers.
  • "The OTel API and SDK versions can drift independently." They can be on different minor versions, but the SDK depends on a compatible API range. The SDK's setup.cfg pins opentelemetry-api ~= X.Y, so an SDK 1.27 with API 1.20 will fail to import. Always upgrade them together; the convenience metapackage opentelemetry-distro handles this.
  • "is_recording() is the same as span_context.is_valid." They are not. is_recording() tells you whether the SDK will record attributes and emit OTLP for this span. span_context.is_valid tells you whether the trace_id and span_id are non-zero — a span can have a valid context (because it inherited one from W3C traceparent) but still be non-recording (because the local sampler dropped it). The two are independent.

Going deeper

What "no-op" actually costs at the API layer

The NonRecordingSpan is not literally zero work — it is the cheapest object the API can construct that still implements the full Span interface. On CPython 3.11, a NonRecordingSpan allocation is ~40 bytes, the start_as_current_span() call is one context-manager push to a thread-local stack, and end() is one pop. Roughly 0.3 µs per span, end-to-end, on a Razorpay-scale node. For most services this is negligible noise. For ultra-hot paths (Zerodha trading-engine matching loops at >2 million decisions/second per process), even the no-op cost can be visible — the discipline there is to gate the instrumentation behind span.is_recording() checks at a coarse granularity (per-batch, not per-decision — see the sketch below), or to register NoOpTracerProvider explicitly rather than relying on the default proxy. The proxy adds a single attribute lookup on every API call; NoOpTracerProvider skips even that. The Zerodha-scale heuristic: if your instrumented function runs more than 100k times per second, profile it with py-spy and compare with-API-no-SDK vs with-no-API-at-all. If the gap is meaningful, you have hit the rare case where even the API matters.
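
A sketch of the per-batch gating pattern — the domain types and matching logic are stand-ins:

# hot_path_gating.py — one span and one is_recording() check per batch.
from dataclasses import dataclass
from opentelemetry import trace

tracer = trace.get_tracer("engine.matching", "1.0.0")

@dataclass
class Order:
    id: int
    qty: int

def match(order: Order) -> bool:
    return order.qty > 0          # stand-in for the real matching logic

def match_batch(orders: list[Order]) -> None:
    with tracer.start_as_current_span("match_batch") as span:
        recording = span.is_recording()   # checked once per batch
        if recording:
            span.set_attribute("batch.size", len(orders))
        matched = 0
        for order in orders:              # hot loop: zero OTel calls per item
            if match(order):
                matched += 1
        if recording:
            span.set_attribute("batch.matched", matched)

match_batch([Order(1, 10), Order(2, 0)])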

The OTel API is more stable than the SDK on purpose

The OTel spec marks the API as "stable, no breaking changes after 1.0", while the SDK's internals — the underscore-prefixed classes like _Span, the BatchSpanProcessor's queueing, the OTLP exporter's gRPC retry logic, the sampler's decision-point — are free to evolve between minor versions, because only the application's bootstrap code touches them and the bootstrap is concentrated in one place. Application code calls the API, and the API's start_as_current_span(...), set_attribute(...), record_exception(...) signatures are frozen. A 2023 OTel SDK can run instrumentation written against the 2020 API; instrumentation written against a 2023 SDK might not import on a 2020 SDK runtime. This asymmetry is the only reason long-lived libraries can be safely instrumented at all — requests, psycopg2, redis-py author their tracing once against the stable API and never have to chase SDK churn.

Auto-instrumentation is the API/SDK split, productised

The opentelemetry-instrumentation-* ecosystem (Flask, FastAPI, Django, psycopg2, redis-py, requests, kafka-python, sqlalchemy, ...) is a direct consequence of the split. Each instrumentation package monkey-patches the target library to call the OTel API around the library's hot paths — psycopg2.connect() becomes with tracer.start_as_current_span("db.query"): psycopg2.connect_orig(...). The instrumentation packages depend only on opentelemetry-api; the application picks up the SDK separately. The launcher script opentelemetry-instrument python my_app.py (the OTel zero-code agent) does both: it sets up the SDK from environment variables (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_TRACES_SAMPLER) before the application module imports, then runs the application. The application gets full instrumentation without writing a single SDK line. See /wiki/auto-instrumentation for the patching mechanism in detail.
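
The shape of such a package, reduced to a sketch — not the real instrumentation code, just the pattern it follows; names are illustrative:

# instrument_sketch.py — how an instrumentation package wraps a library.
import functools
from opentelemetry import trace

def instrument(module, func_name: str, span_name: str) -> None:
    """Replace module.func_name with a span-wrapped version (API only)."""
    original = getattr(module, func_name)
    tracer = trace.get_tracer("sketch.instrumentation")  # no SDK import anywhere

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(span_name):
            return original(*args, **kwargs)

    setattr(module, func_name, wrapper)

# usage (hypothetical): instrument(psycopg2, "connect", "db.connect")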

Multiple exporters from a single SDK

A single SDK provider can have multiple span processors, and each processor can have its own exporter. The pattern is: provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(...))) followed by provider.add_span_processor(BatchSpanProcessor(JaegerExporter(...))) — both processors see every span, and both exporters ship the spans (now with two different wire formats) to two different backends. Cleartrip ran this dual-export pattern during their migration from Jaeger to Tempo for nine weeks: the OTLP exporter shipped to Tempo, the Jaeger exporter shipped to the existing Jaeger backend, and dashboards on both could be compared while the team validated that Tempo's ingestion was accurate. After the validation period, the Jaeger exporter was removed in a one-line config change. The SDK happily runs as many processors as you wire up; the cost is roughly linear in span volume per exporter, and the queues are independent so a slow downstream does not back up the others (it does eventually drop its own spans if the queue fills).
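
A sketch of the fan-out wiring — a ConsoleSpanExporter stands in for the second backend here so the snippet runs without extra packages; the Tempo endpoint is illustrative:

# dual_export.py — one provider, two processors, two destinations (sketch).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# every ended span is handed to both processors; each keeps an independent
# queue, so a slow destination drops only its own spans
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://tempo-gateway:4317", insecure=True)))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

trace.set_tracer_provider(provider)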

The cross-language consistency the split enforces

The API/SDK split is not a Python convention; it is a spec-level requirement. Java, Go, Node, Rust, .NET, Ruby, PHP, Erlang, Swift — every conformant OTel implementation has the same two-package structure. A Java library author imports io.opentelemetry.api; a Java application imports io.opentelemetry.sdk and the exporters. A Go library imports go.opentelemetry.io/otel; a Go application imports go.opentelemetry.io/otel/sdk. The cross-language discipline means a polyglot fleet — a Hotstar-scale service mesh with services in Go (auth), Java (recommendations), Python (analytics), Node (gateway) — can have one consistent SDK story per language while every service's instrumentation talks the same API shape. The teams that try to hand-roll instrumentation in one language and use OTel in another always end up regretting it; the API contract is what makes the polyglot work.

Reproduce this on your laptop

python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-api opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc \
            grpcio
python3 api_vs_sdk_demo.py
# Inspect: NonRecordingSpan in Run 1, _Span in Run 2.
# Then comment out trace.set_tracer_provider(provider) in Run 2 and watch
# the spans disappear from the fake collector — that is Aditi's bug.

Where this leads next

The next chapter /wiki/the-collector-receivers-processors-exporters follows the OTLP message after it leaves the SDK — into the OpenTelemetry Collector, where receivers parse it, processors transform it (sampling, attribute redaction, batching across services), and exporters ship it to the backends. The Collector is to the SDK what the SDK is to the API: another layer of pluggability that lets the application's wire-time decisions stay simple while the production policy lives in a config the platform team owns.

After the Collector come the auto-instrumentation packages (/wiki/auto-instrumentation) — the productised version of the API/SDK split that lets services get tracing with zero code changes. Then OTLP itself (/wiki/otlp-the-wire-format) for the protobuf-level details that Aditi's debugging script in this chapter only sketched. Then sampling and processors as policy (/wiki/processors-sampling-attribute-policy), where the application's SDK decides the head-based rate and the Collector decides the tail-based one.

The closing thought is the one Aditi arrived at on Tuesday at 19:30, after she finally added trace.set_tracer_provider(provider) to her boot script and Tempo lit up with seventeen thousand spans. The OTel API is not "tracing" — it is the interface through which tracing might happen. The SDK is what makes it actually happen. If your service emits zero spans, the question is never "is the API working" — the API is always working, in the sense that it is dutifully creating no-op spans and discarding them. The question is "is the SDK registered, and is the registration the singleton your code is reading from".

When you instrument a library, you write only against the API. When you operate an application, you wire up exactly one SDK. When you debug missing telemetry, you ask is_recording() first and work outward from there. The split is the design.
