Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
Auto-instrumentation
It is a Friday afternoon at a hypothetical Razorpay platform-engineering all-hands. Aditi, the new tech lead for the merchant-onboarding service, has been told that her team must ship distributed tracing by end of next sprint. She has 38 microservices, a Python+Java mix, and exactly zero engineers willing to add with tracer.start_as_current_span(...) to ten thousand request handlers. The platform team gives her a three-line solution: install one package, prepend opentelemetry-instrument to the service launcher, and traces will appear in Tempo by Monday morning. They do. They are also wrong about a third of the time, and the part that is wrong takes Aditi six weeks to discover.
Auto-instrumentation is the productised compromise that lets a fleet adopt OpenTelemetry without an application-code rewrite — a wrapper agent monkey-patches popular libraries (Flask, Django, FastAPI, requests, psycopg2, redis, kafka-python, the JDBC driver, the Servlet API, Spring) so that the SDK emits spans, metrics, and logs from inside your dependencies' call paths without you ever touching them. It is the dominant onboarding path for OTel in production today: somewhere between 60% and 80% of OTel-instrumented services in real fleets started with auto-instrumentation, and many never moved beyond it. Understanding what it does, what it misses, and what it costs is the difference between a clean trace tree and a six-week cardinality bill.
Auto-instrumentation is a startup-time agent that wraps known library functions with span-emitting decorators without modifying your source code. In Python, opentelemetry-instrument monkey-patches imports; in Java, the opentelemetry-javaagent.jar rewrites bytecode at class-load time via the JVM Instrumentation API. Both produce spans for HTTP servers, HTTP clients, DB drivers, and message brokers automatically — but neither understands your business logic, neither traces across thread or async boundaries reliably, and both have production failure modes (URL cardinality, hidden CPU cost, missed spans) that the README does not warn you about. Auto-instrumentation is the floor of OTel adoption, not the ceiling.
What "auto" actually means — wrap, don't rewrite
Auto-instrumentation is a misleading name. The agent does not magically understand your code; it intercepts function calls at the boundary between your code and well-known libraries. The set of libraries it knows about is finite — the OpenTelemetry contrib repo lists about 60 instrumentation packages for Python (opentelemetry-instrumentation-flask, -django, -fastapi, -requests, -psycopg2, -pymongo, -redis, -celery, -kafka-python, -grpc, -sqlalchemy, ...) and roughly 100 for Java (the opentelemetry-javaagent.jar bundles them in a single uberjar). Each instrumentation package contains hand-written wrapping logic for one library: "when flask.Flask.full_dispatch_request is called, start a server span; when it returns, set the HTTP status code and end the span". Multiply that pattern by every library the team behind the agent has bothered to support, and you have auto-instrumentation.
The "is this library covered?" question is the first question to ask before adopting auto-instrumentation. The supported list is online but goes stale fast — httpx was unsupported for two years after it became popular, then got an instrumentation package, then the package was rewritten when httpx 0.24 changed its internals. A quick pip search opentelemetry-instrumentation- on PyPI (or the equivalent gradle dependency check for Java) tells you in seconds whether your top dependencies are covered. For libraries that are not, the choice is to write a wrapper yourself (15–60 lines using wrapt), to ask upstream, or to switch to a covered library. Most teams pick the wrapper path for one or two key libraries and the switch path for everything else.
The mechanism in Python is monkey-patching at import time. When you launch your service with opentelemetry-instrument python myapp.py, the wrapper does three things before your code runs: it loads the SDK, it iterates over installed opentelemetry-instrumentation-* packages, and for each one it calls the package's instrument() method. That method finds the target library's classes (e.g. flask.Flask) and replaces specific methods on them with wrappers that call the original method but also emit spans. The replacement is a Python rebinding — Flask.full_dispatch_request = wrapped(Flask.full_dispatch_request) — done before any of your code imports Flask. From your code's perspective, Flask works exactly as it always did. From the SDK's perspective, every dispatched request now starts and ends a span around the wrapped function.
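A toy version of that rebinding makes the mechanism concrete. This is not the real FlaskInstrumentor — just the shape of the trick, with a stand-in class in place of Flask and the standard OTel tracer API:
# toy_patch.py — the monkey-patching shape in miniature (illustrative, not the real
# opentelemetry-instrumentation-flask code). We rebind a method on a "library" class
# so every caller gets a span, without the library or the application changing a line.
import time
from opentelemetry import trace

tracer = trace.get_tracer("toy.autoinstr")

class FakeFramework:                       # stand-in for flask.Flask
    def dispatch(self, route: str) -> str:
        time.sleep(0.01)                   # pretend this is your handler running
        return f"handled {route}"

_original_dispatch = FakeFramework.dispatch

def _traced_dispatch(self, route: str) -> str:
    # Same signature as the original: emit a span, then delegate.
    with tracer.start_as_current_span("FakeFramework.dispatch") as span:
        span.set_attribute("toy.route", route)
        return _original_dispatch(self, route)

# The rebinding an instrumentation package's instrument() performs:
FakeFramework.dispatch = _traced_dispatch

print(FakeFramework().dispatch("/merchant/<int:merchant_id>/status"))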
The mechanism in Java is bytecode rewriting via the JVM's java.lang.instrument API. When you launch with java -javaagent:opentelemetry-javaagent.jar -jar myapp.jar, the agent registers itself with the JVM as a ClassFileTransformer. Every time the JVM is about to load a class, the agent inspects the class's bytecode, decides whether it matches one of its instrumentation rules (e.g. "any class implementing javax.servlet.http.HttpServlet"), and if so, rewrites the bytecode in memory to inject span-emitting calls into the appropriate methods before passing the modified bytecode to the JVM's classloader. The Byte Buddy library does the heavy lifting. Your .class files on disk are unchanged; the running classes are different from what your IDE shows.
The two approaches differ in what they can reach. Python monkey-patching cannot reach C-extension internals — if you use lxml for XML parsing or numpy for array math, the time spent inside those C extensions is invisible to auto-instrumentation. Java bytecode rewriting reaches everything that loads as JVM bytecode, which is everything except JNI native methods. Why this matters for the trace you actually see: a Python service that spends 60% of its wall time in numpy.matmul will show traces that look idle for 60% of every request. The same workload in Java with auto-instrumentation will show full coverage as long as the matmul is a pure-Java BLAS implementation; it loses coverage only when the JVM hands off to a native BLAS via JNI. Your trace's completeness is bounded by what your runtime exposes to the patching mechanism, not by what your code is actually doing. The implication: do not assume auto-instrumentation gives you 100% time accounting. It gives you 100% accounting of the call sites it knows about, in the parts of the runtime it can reach.
Run it on your laptop — see what auto-instrumentation captures
The cleanest way to internalise auto-instrumentation is to launch a Flask app two ways — once with opentelemetry-instrument and once without — and diff the traces. The script below stands up a tiny Flask service, hits it with a few HTTP requests, captures the OTLP bytes the agent emits, and prints the spans the agent generated automatically.
# auto_instr_demo.py — run a Flask app with auto-instrumentation, capture spans,
# and inspect what the agent generated without us writing a single tracer.start_as_current_span.
# pip install flask requests opentelemetry-distro \
# opentelemetry-instrumentation-flask \
# opentelemetry-instrumentation-requests \
# opentelemetry-exporter-otlp-proto-grpc grpcio
# Launch: opentelemetry-instrument \
# --traces_exporter otlp \
# --exporter_otlp_endpoint http://localhost:24317 \
# --exporter_otlp_insecure true \
# --service_name razorpay-merchant-onboarding \
# python auto_instr_demo.py
import os, time, threading, requests
from concurrent import futures
import grpc
from flask import Flask, request, jsonify
from opentelemetry.proto.collector.trace.v1 import (
trace_service_pb2, trace_service_pb2_grpc)
# 1) Stand up a fake OTLP collector that just dumps received spans.
CAPTURED = []
class CaptureCollector(trace_service_pb2_grpc.TraceServiceServicer):
def Export(self, req, ctx):
CAPTURED.append(req)
return trace_service_pb2.ExportTraceServiceResponse()
srv = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
trace_service_pb2_grpc.add_TraceServiceServicer_to_server(CaptureCollector(), srv)
srv.add_insecure_port("127.0.0.1:24317"); srv.start()
# 2) A Flask app that calls a downstream HTTP endpoint and a fake DB.
app = Flask(__name__)
def fake_db_lookup(merchant_id: int) -> dict:
time.sleep(0.020) # simulate 20ms postgres call
return {"merchant_id": merchant_id, "kyc": "pending"}
@app.route("/merchant/<int:merchant_id>/status")
def status(merchant_id):
# Auto-instrumentation captures this Flask request automatically.
record = fake_db_lookup(merchant_id)
# The requests.get below is also captured automatically as a CLIENT span.
try:
r = requests.get(f"http://127.0.0.1:8001/risk?id={merchant_id}", timeout=0.5)
record["risk"] = r.json().get("risk", "unknown")
except Exception:
record["risk"] = "unreachable"
return jsonify(record)
# 3) Run server in a thread; hit it 5 times.
server_thread = threading.Thread(
target=lambda: app.run(port=8000, debug=False, use_reloader=False), daemon=True)
server_thread.start(); time.sleep(1.0)
for mid in [101, 102, 103, 104, 105]:
try:
requests.get(f"http://127.0.0.1:8000/merchant/{mid}/status", timeout=1.0)
except Exception as e:
pass
time.sleep(2.0)
# 4) Inspect what the agent emitted to OTLP.
spans = []
for r in CAPTURED:
for rs in r.resource_spans:
for ss in rs.scope_spans:
for sp in ss.spans:
attrs = {a.key: str(a.value)[:60] for a in sp.attributes}
spans.append({"name": sp.name, "kind": sp.kind, "attrs": attrs})
print(f"total spans emitted by auto-instrumentation: {len(spans)}")
print(f"unique span names: {sorted({s['name'] for s in spans})}")
for s in spans[:6]:
print(f" - name={s['name']!r:30} kind={s['kind']} http.route={s['attrs'].get('http.route','-')} http.url={s['attrs'].get('http.url','-')[:50]}")
Sample run (Flask 3.0 + opentelemetry-instrument 1.25 on a typical laptop):
total spans emitted by auto-instrumentation: 10
unique span names: ['GET', '/merchant/<int:merchant_id>/status']
- name='/merchant/<int:merchant_id>/status' kind=2 http.route=/merchant/<int:merchant_id>/status http.url=-
- name='GET' kind=3 http.route=- http.url=http://127.0.0.1:8001/risk?id=101
- name='/merchant/<int:merchant_id>/status' kind=2 http.route=/merchant/<int:merchant_id>/status http.url=-
- name='GET' kind=3 http.route=- http.url=http://127.0.0.1:8001/risk?id=102
- name='/merchant/<int:merchant_id>/status' kind=2 http.route=/merchant/<int:merchant_id>/status http.url=-
- name='GET' kind=3 http.route=- http.url=http://127.0.0.1:8001/risk?id=103
Six lines of output, three big lessons.

First, total spans emitted: 10 — the agent emitted two spans per request: one SERVER span for the Flask handler (kind=2) and one CLIENT span for the requests.get call (kind=3). You wrote zero tracer code; the agent generated all ten. Why this is the dominant onboarding path: an actual Razorpay-style merchant-onboarding service touches Flask + requests + psycopg2 + redis + kafka — five popular libraries. Auto-instrumentation gives you spans for all five with one pip install opentelemetry-distro opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests opentelemetry-instrumentation-psycopg2 opentelemetry-instrumentation-redis opentelemetry-instrumentation-kafka-python and one opentelemetry-instrument prefix on the launcher. The cost to onboard is roughly two hours; the cost to manually instrument the same code paths is two engineer-weeks per service, times 38 services.

Second, http.route=/merchant/<int:merchant_id>/status — the SERVER span captured the route template, not the concrete URL. This is the single most important automatic decision the agent makes: it sets http.route to the Flask routing pattern, not to /merchant/101/status. Why this matters for your TSDB bill: if the agent had set http.route to the concrete URL, every distinct merchant ID (you have ~3 million merchants) would create a distinct label value, blowing cardinality from ~50 routes to 3 million series — a 60,000× cardinality multiplier. The route-template choice is what keeps your cardinality bill survivable. But notice that the CLIENT span (http.url=http://127.0.0.1:8001/risk?id=101) does not template — it captures the full URL with the merchant_id query parameter. If that span ends up as a label on a metric in your dashboard, you have just unwittingly created a cardinality explosion.

Third, the CLIENT span includes the full query string. This is the trap. The merchant_id query parameter, the trace_id you may be passing as a custom header, the page=12345 pagination token — all of it ends up in http.url. Spans themselves are fine — Tempo stores them as content, not as labels — but if a downstream metric pipeline (the spanmetrics connector from ch.91) derives a metric from these spans without first stripping the query string, your Prometheus is now indexing 3 million distinct http.url label values. The fix lives at the Collector layer (an attributes processor stripping the query string from http.url), not in the auto-instrumentation; the agent gives you the raw URL because it does not know which substring of it is high-cardinality.
The deeper observation in this output is one Aditi missed for six weeks: the agent named the SERVER span '/merchant/<int:merchant_id>/status' (the route template) but named the CLIENT span just 'GET'. Different OTel instrumentations apply different naming conventions, and the inconsistency is visible in your trace tree as soon as you have more than a handful of services. Semantic conventions (ch.93) exist precisely to nail down these names; auto-instrumentation tries to follow them but is often slightly behind the spec.
What auto-instrumentation misses, and why
The list of things auto-instrumentation does not capture is longer than the list of things it does — and the missed cases are exactly where the bugs you actually hunt for tend to live. There are five families of misses, and a Razorpay-shape onboarding service ran into all five.
Custom business spans. Auto-instrumentation knows about HTTP servers, HTTP clients, DB drivers, and message brokers. It does not know that your compute_kyc_score(merchant) function is the most important operation in the request. From the agent's perspective, compute_kyc_score is "a Python function inside the Flask handler" — invisible, just CPU time inside the SERVER span. If you want to see KYC-scoring as its own span (to track its latency separately, to alert on its error rate), you have to write with tracer.start_as_current_span("compute_kyc_score") as s: ... manually. Auto + manual is the production pattern, not auto alone. A clean fleet has 80% of spans from auto-instrumentation (the libraries) and 20% from hand-written business spans (the operations). The 20% is what makes the trace useful for the on-call; the 80% is what makes it cheap to ship across 38 services. Treating either as the whole story is the mistake.
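The manual half of that pattern is small. A sketch of what the hand-written business span looks like next to the auto-generated ones — names and scoring logic are placeholders; the tracer API is the standard one:
# manual span on top of auto-instrumentation — the 20% that auto cannot see.
from opentelemetry import trace

tracer = trace.get_tracer("merchant-onboarding")   # instrumentation scope name is an example

def compute_kyc_score(merchant: dict) -> float:
    # Auto-instrumentation sees this function as anonymous CPU time inside the SERVER
    # span; the explicit span below is what makes it its own row in Tempo.
    with tracer.start_as_current_span("compute_kyc_score") as span:
        span.set_attribute("merchant.kyc.status", merchant.get("kyc", "unknown"))
        score = 0.5 if merchant.get("kyc") == "pending" else 0.9   # placeholder logic
        span.set_attribute("merchant.kyc.score", score)
        return score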
Async and thread boundaries. Python's asyncio confuses auto-instrumentation badly when you await something that releases the loop and another coroutine runs in between. The current span is held in a contextvars.ContextVar, which propagates correctly through await, but if you loop.run_in_executor(...) to push work to a thread pool, the thread does not inherit the ContextVar by default, and the span attached to that work is orphaned — visible in Tempo as a span with no parent, sitting under the trace root looking like a request that started from nowhere. Java has the equivalent problem with CompletableFuture.supplyAsync, fixed by the agent's @WithSpan-aware executor wrapping but only if you use the agent-provided executor. Why this hits production hard: a checkout flow that fans out KYC validation to a thread pool will show traces where the KYC span appears as a separate, parentless trace half the time, depending on whether the executor was wrapped. The on-call sees "intermittent missing spans" and assumes sampling. The fix is from opentelemetry.context import attach, detach in the executor task wrapper — manual context propagation — which auto-instrumentation cannot do for you.
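What that fix looks like in practice — a sketch of a submit helper that carries the caller's context into the pool, assuming the standard opentelemetry.context API and a placeholder task:
# thread_pool_context.py — manual context propagation into a worker thread.
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context, trace

tracer = trace.get_tracer("merchant-onboarding")
pool = ThreadPoolExecutor(max_workers=4)

def submit_with_context(fn, *args, **kwargs):
    ctx = context.get_current()              # capture in the request thread
    def _run():
        token = context.attach(ctx)          # re-attach in the worker thread
        try:
            return fn(*args, **kwargs)
        finally:
            context.detach(token)
    return pool.submit(_run)

def validate_kyc(merchant_id: int) -> bool:
    with tracer.start_as_current_span("validate_kyc"):   # now parents correctly
        return merchant_id % 2 == 0                       # placeholder work

with tracer.start_as_current_span("onboard_merchant"):
    future = submit_with_context(validate_kyc, 101)
print(future.result())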
Internal library code paths the agent does not know about. SQLAlchemy is auto-instrumented, but only at the Engine.execute level. If you use SQLAlchemy's connection pool with pool.connect() and run raw SQL on the connection without going through the Engine, the agent does not see it. Redis-py is auto-instrumented, but pipeline operations (pipeline.execute()) emit one span for the entire pipeline batch, not one per command. Kafka-python's KafkaConsumer.poll() emits a span on each poll, but if you process messages in a separate thread, the span context does not flow with the message — you have to manually attach it from the message headers. Each of these is a deliberate trade-off the instrumentation author made; none of them are documented in a single place.
Unsupported libraries. If you use httpx instead of requests, you need opentelemetry-instrumentation-httpx. If you use aiokafka instead of kafka-python, you need opentelemetry-instrumentation-aiokafka. If you use a niche library — clickhouse-connect, cassandra-driver, an internal RPC stub generated by Buf — there is no instrumentation, and those calls are invisible. The pattern: a fleet on auto-instrumentation slowly accretes a list of "we use library X, no auto-instrumentation exists, write a manual wrapper" exceptions.
Background jobs and worker pools. Celery is auto-instrumented for task execution but not for the chord/group/chain coordination — a chord with three parallel tasks will show three trace fragments, not one tree. RQ, Dramatiq, and Huey have varying degrees of support. A team that moved from Celery to Dramatiq for performance reasons silently lost their distributed-tracing coverage for ~30% of their workload, discovered six weeks later when an SRE noticed a daily 03:00 IST cron job had no trace. The same shape hits Spark and Flink jobs: the driver gets traced, the executors run instrumented Java but their span context never bridges back to the driver's trace, and the resulting "trace" is a fragmented forest where each executor is its own root. The fix is to extract the parent span context in the driver, serialise it into the job submission (as a baggage header or a custom field on the partition descriptor), and re-attach it on the executor side — manual plumbing the agent cannot do for you.
The honest summary: auto-instrumentation gives you 70% coverage in the first week and 85% with a sprint of cleanup; getting to 95% requires a multi-quarter manual-instrumentation effort. The last 5% is the spans that map to your business operations — KYC scoring, fraud check, ledger settlement — and those have to be written by the team that owns the operation. Auto-instrumentation is the floor, not the ceiling.
The ratio matters because it shapes how teams should budget engineering time. A common Razorpay-style mistake is to plan a "tracing rollout sprint" assuming auto-instrumentation will get the team to 100%. The sprint ships, the dashboards look full of spans, and three months later the on-call team realises that every interesting alert ends with a trace whose top-5 latency contributors are all unnamed gaps inside the SERVER span. The right plan is two-phase: phase one (one sprint) ships auto-instrumentation across the fleet and gets you to 70%; phase two (one quarter, one team at a time) adds manual spans for each service's top-10 business operations and gets you to 95%. Promising 100% in the first sprint trains stakeholders to distrust observability when it inevitably falls short.
The cost of auto-instrumentation — what it does to your hot path
Auto-instrumentation is not free, and the cost lives in three places that the marketing copy never mentions.
Per-request CPU overhead. Wrapping Flask.dispatch_request with span emission costs roughly 8–18 µs per request on a typical x86 server: span creation (~3 µs), context propagation (~2 µs), attribute setting (~3 µs), span ending and queueing for batched export (~5 µs). At 5000 RPS per pod, that is 40–90 ms of CPU per second — 4% to 9% of one core dedicated to span emission. The CLIENT span on each requests.get adds another 6–12 µs. For a service with 5 outbound HTTP calls per request, total auto-instrumentation overhead is ~12% of CPU at 5000 RPS. Why this matters for capacity planning: a service that ran at 60% CPU pre-instrumentation will run at 67–72% CPU post-instrumentation. If you sized for 70% headroom, you are now 5% from CPU saturation and a 20% traffic spike that you would have absorbed before now triggers autoscaling or, worse, latency degradation. The fix is not to skip auto-instrumentation; it is to size pods for ~10% extra CPU when you turn it on. Most teams forget and discover during a Big Billion Days traffic event that their headroom evaporated.
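The capacity math is worth doing explicitly once, with your own measured numbers in place of the hypothetical ones used here:
# span_overhead.py — back-of-envelope CPU cost of span emission at a given rate.
# The microsecond figures are the text's hypothetical numbers, not measurements.
def cores_spent_on_spans(rps: float, spans_per_request: float, us_per_span: float) -> float:
    """CPU-seconds of span emission per wall-clock second, i.e. fraction of one core."""
    return rps * spans_per_request * us_per_span / 1_000_000

print(cores_spent_on_spans(5000, 1, 8))    # 0.04 -> ~4% of one core (fast end, SERVER span)
print(cores_spent_on_spans(5000, 1, 18))   # 0.09 -> ~9% of one core (slow end)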
The numbers above are for Python with the OTel SDK in BatchSpanProcessor mode. The Java agent is similar in steady state (~10–15 µs per server span, ~6–10 µs per client span) but pays a higher one-time tax at startup because of the bytecode rewriting wave. Go's auto-instrumentation story is different: there is no Go agent that monkey-patches at runtime (Go's static linking and lack of monkey-patching primitives prevent this), so Go services either use compile-time wrapping (otelhttp.NewHandler, manual integration) or eBPF-based runtime instrumentation (Beyla, the OTel Go-Auto experiment). Per-request overhead in Go is therefore bounded by the wrapping you opted into, not by an agent transparently wrapping everything — which is one reason Go fleets often have lower observability overhead than Java fleets, but also why their coverage is lumpier.
Memory overhead from span buffering. The SDK buffers spans in a BatchSpanProcessor queue (default size 2048 spans) before exporting. Each span is roughly 800 bytes in pdata representation — span context (16 bytes trace_id + 8 bytes span_id + parent), 5–10 attributes (each ~50 bytes), timing fields (~40 bytes), event/link slots. A full queue is ~1.6 MB per process, which sounds small until you remember that the batch processor runs in addition to your application's normal heap. For a Java service with a 4 GiB heap and 200 active threads each holding span context, the agent adds 50–100 MiB of working set. Python is more compact (~30 MiB) but pays in GC pressure — span objects are short-lived but numerous, and gc.collect() runs more often once auto-instrumentation is on.
The queue size is also a tuning lever. Default 2048 is fine for steady state, but a service that experiences traffic bursts (a Hotstar service at toss, an IRCTC service at 10:00 IST Tatkal) can burst-emit 10× the steady-state span rate for 30 seconds. If the queue saturates faster than the OTLP exporter drains it, the SDK silently drops spans at enqueue time — otel.batch_processor.spans_dropped climbs in the SDK self-metrics. Production fleets size the queue to absorb their worst-known burst (5000+ for Tatkal-scale workloads), at the cost of an additional ~3 MiB of memory per process. The trade-off is the same as the Collector's exporter queue (ch.91) one layer up, and the discipline is to monitor both layers' drop counters together.
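When you own the SDK setup (rather than letting opentelemetry-instrument build it), the queue is an explicit constructor argument; with the agent, the same knobs are the OTEL_BSP_* environment variables. A sketch with illustrative values:
# bsp_tuning.py — sizing the BatchSpanProcessor queue for bursty workloads.
# Equivalent env vars when using opentelemetry-instrument:
#   OTEL_BSP_MAX_QUEUE_SIZE=5000 OTEL_BSP_MAX_EXPORT_BATCH_SIZE=512 OTEL_BSP_SCHEDULE_DELAY=5000
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True),
        max_queue_size=5000,            # default 2048; size for the worst known burst
        max_export_batch_size=512,      # spans per OTLP export call
        schedule_delay_millis=5000,     # how long a span may wait in the queue
    )
)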
Startup latency. Java's bytecode rewriting hits hard on the first class-load wave. A Spring Boot service that started in 8 seconds will start in 11–14 seconds with the javaagent attached, because every Spring class touched at startup must go through the Byte Buddy transformer. Python's import-time monkey-patching adds ~200–400 ms of import-time work. For a long-running service this is invisible; for a serverless function (Lambda), it is the difference between a 3-second cold start and a 6-second cold start, and it is why most lambda observability tooling does not use the standard Java agent — they use custom layers that defer instrumentation until after the first invocation.
The trick for production fleets is to measure these costs once, in a benchmark, and to bake the results into capacity planning. The hypothetical Hotstar streaming-platform team runs an annual benchmark in March (before IPL) that re-validates the per-RPS overhead of their javaagent version against their service's hot path; the number has crept up by 1–2% per year as the agent has added new instrumentations, and the team scales the fleet accordingly. The benchmark itself is straightforward — wrk2 -R 5000 -d 60s against the service with and without the agent, p99 measured by HdrHistogram — but the discipline of repeating it annually is what catches drift. Without the annual measurement, a team that sized for 5% overhead on the agent v1.20 wakes up two years later running the agent v1.32 with 11% overhead and wonders why their headroom evaporated.
Failure modes you will see in production
Three failures happen often enough that recognising them on first sight saves hours of debugging.
The unsanitised URL cardinality bomb. The CLIENT span's http.url attribute contains the full URL with query parameters. If a downstream connector (spanmetrics, anything that turns spans into metrics) uses http.url as a label without first stripping the query string, you get an explosion: 3 million distinct merchant_id values become 3 million distinct label values. The metric http_client_request_duration_seconds that should have ~50 series ends up with 3 million. Your Prometheus OOMs at 04:30 IST during the morning batch wave. The fix is mandatory: any transform processor that creates metrics from spans must strip query strings first. Best practice: a fleet-wide rule in the Collector (an OTTL statement in the transform processor, or the attributes processor) that rewrites http.url down to its path component before anything reads it for cardinality-sensitive purposes.
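What the stripping has to do, sketched in Python — the real rule lives in the Collector's transform or attributes processor, not in application code:
# strip_query.py — collapse high-cardinality URLs before anything labels a metric with them.
from urllib.parse import urlsplit

def metric_safe_url(url: str) -> str:
    parts = urlsplit(url)
    # Keep scheme://host/path, drop the query string and fragment. Concrete path
    # segments (e.g. /order/847291) still need templating on top of this.
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

print(metric_safe_url("http://127.0.0.1:8001/risk?id=101"))
# -> http://127.0.0.1:8001/risk   (3 million merchant IDs collapse into one value)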
The same pattern hits path parameters that the agent did not template. If your Flask route is /order/<int:order_id> the agent correctly captures http.route as the template — but the SERVER span's http.target attribute contains the concrete path /order/847291. Spanmetrics can use http.route (templated, ~50 cardinality) safely, or http.target (concrete, 3M cardinality) destructively. The choice of which attribute the connector keys on is a one-line config decision that determines whether your TSDB bill is ₹50,000/month or ₹50,00,000/month. Audit it before turning on spanmetrics.
The orphaned span tree. Auto-instrumentation propagates trace context through contextvars.ContextVar (Python) or ThreadLocal (Java). When work is dispatched to a thread pool or a different async context, the propagation can break. The signature in Tempo is a trace tree where some spans hang under the trace root with no parent, instead of nesting under the SERVER span that initiated the work. The fix is library-specific: Python's concurrent.futures.ThreadPoolExecutor works with from opentelemetry.context import attach, detach wrapping in submit; Java's Executors.newFixedThreadPool is fixed by the agent's WrappedExecutorService if you use the agent's helper. Manual context propagation is the discipline auto-instrumentation cannot do for you — the agent does not know your code's threading model.
A subtle variant: a Python service using gevent or eventlet for green-thread concurrency may break ContextVar propagation in ways that depend on the monkey-patching order. If gevent.monkey.patch_all() runs before opentelemetry-instrument activates, the ContextVar inheritance is wrong; if it runs after, it can be right but inconsistently. The diagnostic is the same as the thread-pool case (orphaned spans, no parent), and the fix is to standardise on one concurrency model and one patching order across the fleet rather than to debug per-service.
The "missing the most important span" failure. A trace shows the SERVER span (Flask), the DB span (psycopg2), the external HTTP CLIENT span (requests), and a 90 ms gap between the DB span ending and the HTTP span starting. Inside that 90 ms, your compute_kyc_score() function ran, but auto-instrumentation does not know about it, so the gap is just unaccounted time. On-call says "tracing is broken — there is a 90 ms hole". Tracing is not broken; tracing is exactly as good as the libraries it knows about, and your business logic is not a library. The fix is to add a manual span: with tracer.start_as_current_span("compute_kyc_score"): .... Five lines of code per business operation; the lift is per-service, not per-request.
The diagnostic ladder for "the trace is wrong" should always start with: is the missing span a known auto-instrumented library that should have been captured (likely a context-propagation issue), or is it custom business logic that nobody manually instrumented (you need to write the span)? The two failure modes look identical in Tempo and need entirely different fixes.
There is a fourth failure mode that hits less often but is harder to diagnose: double-instrumentation. If you pip install opentelemetry-instrumentation-flask and you also import the Flask integration manually in code (from opentelemetry.instrumentation.flask import FlaskInstrumentor; FlaskInstrumentor().instrument()), Flask gets wrapped twice. Each request emits two SERVER spans nested inside each other with identical attributes — the trace tree shows duplicate top-level spans and the on-call sees what looks like a recursion or a retry. The same shape hits when two different OTel distros (the OTel one and a vendor-provided one like Datadog's dd-trace-py running alongside) both instrument the same library. The fix is hygiene: pick one auto-instrumentation path per library, and never call instrument() manually if you are also using opentelemetry-instrument. The agent does the wrapping; your code should not.
The discipline these failure modes impose is concrete: when on-call gets a ticket "trace is incomplete", the runbook should walk through (1) is http.url being used as a metric label anywhere — if yes, suspect the cardinality bomb and check Prometheus memory; (2) does Tempo show parentless spans for the same trace_id — if yes, suspect orphaned context propagation; (3) is there an unaccounted gap in the SERVER span — if yes, the missing business span is the answer. Five minutes of triage against this checklist is faster than thirty minutes of staring at the trace.
Common confusions
- "Auto-instrumentation means I don't need to know OpenTelemetry." It is the opposite. Auto-instrumentation is a productivity multiplier on top of OTel knowledge — you need to know what spans look like, what semantic conventions exist, what the SDK pipeline does, otherwise you cannot debug when the agent does the wrong thing (which it will, on async boundaries, on unsupported libraries, on overloaded URLs).
- "
opentelemetry-instrumentadds spans to all my code." No. It adds spans only at the boundary of libraries it has instrumentation packages for. Your own business functions are invisible unless you wrap them manually. The 70%-coverage-out-of-the-box marketing claim assumes your service is mostly framework + DB + external HTTP — true for many, false for any service with serious in-process computation. - "Java auto-instrumentation is the same thing as Python's, just for Java." The mechanism is fundamentally different (bytecode rewriting vs monkey-patching), the coverage profile is different (Java reaches more, Python is constrained by C extensions), and the operational characteristics differ (Java has a noticeable startup hit; Python has GC pressure). They share an output format (OTLP) and a set of semantic conventions, not a runtime model.
- "Auto-instrumentation has zero performance cost." It has 4–12% CPU overhead on most realistic hot paths and 30–100 MiB of memory per process. Plan capacity accordingly. The cost is paid by every request; spans you never look at still cost you to emit.
- "If a span is missing in my trace, the agent is buggy." Sometimes. More often, the span is missing because (a) the library is not in the agent's instrumentation list, (b) the work crossed a thread/async boundary that broke context propagation, or (c) the operation is custom business code that was never wrapped. Triage which case you are in before opening a GitHub issue.
- "Auto-instrumentation replaces APM agents." Functionally yes, operationally only if you also run the SDK pipeline and the Collector to ship the data and the backends to store and query it. APM vendors used to bundle all four; OTel's auto-instrumentation gives you only the first. The Collector (ch.91) is what fills in the rest.
Going deeper
What instrumentation packages actually contain
An instrumentation package like opentelemetry-instrumentation-flask is roughly 200 lines of Python. The bulk of it is one class — FlaskInstrumentor(BaseInstrumentor) — that defines _instrument(self, **kwargs) and _uninstrument(self, **kwargs). The _instrument method imports Flask, finds flask.Flask, and uses wrapt.wrap_function_wrapper to wrap Flask.full_dispatch_request with a function that creates a span, calls the original, sets http.status_code from the response, and ends the span. The _uninstrument method reverses it. Reading one of these packages is the fastest way to understand exactly what auto-instrumentation does — run pip show -f opentelemetry-instrumentation-flask to find the installed files and open the package's main module. Every "magic" claim of auto-instrumentation dissolves into 30 lines of wrapt-based wrapping.
The same exercise on opentelemetry-instrumentation-psycopg2 reveals a slightly different pattern: the instrumentation does not wrap psycopg2's classes directly. Instead, it wraps the psycopg2.connect factory function so that every connection returned is a traced subclass of psycopg2.extensions.connection. The traced subclass overrides cursor() to return a traced cursor, and the traced cursor overrides execute() to wrap each query in a span. This is a deeper wrapping pattern — three levels of subclassing — because psycopg2's API surface is broader than Flask's. The lesson is that "auto-instrumentation" hides a wide range of wrapping techniques, and the choice of technique is dictated by the target library's API shape. There is no single trick; there are sixty different tricks, one per package.
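For the "unsupported library" case from earlier, the same wrapt pattern is what the 15–60-line manual wrapper looks like. A sketch: the Client class here is a local stand-in for a hypothetical internal RPC client, so the file runs on its own; a real wrapper would target the third-party module by name, e.g. wrapt.wrap_function_wrapper("payments_rpc", "Client.call", ...):
# manual_wrapper.py — a hand-rolled instrumentation in the contrib style, for a
# library with no opentelemetry-instrumentation-* package.
import wrapt
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode

tracer = trace.get_tracer("instrumentation.payments_rpc")

class Client:                                    # stand-in for the real library class
    def call(self, method: str, payload: dict) -> dict:
        return {"ok": True, "method": method}

def _traced_call(wrapped, instance, args, kwargs):
    method = args[0] if args else kwargs.get("method", "unknown")
    with tracer.start_as_current_span(f"payments_rpc {method}",
                                      kind=SpanKind.CLIENT) as span:
        span.set_attribute("rpc.method", str(method))
        try:
            return wrapped(*args, **kwargs)      # delegate to the unwrapped method
        except Exception as exc:
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            span.record_exception(exc)
            raise

# One wrapping call at startup — the same trick every contrib package's _instrument() uses.
wrapt.wrap_function_wrapper(__name__, "Client.call", _traced_call)

print(Client().call("CreateOrder", {"amount": 500}))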
Bytecode rewriting and the JVM Instrumentation API
Java's java.lang.instrument API was added in Java 5 (2004) for profilers and debuggers. The OTel javaagent is built on top of it via Byte Buddy, which provides a fluent builder for matching classes (isAnnotatedWith, extendsClass, nameStartsWith) and for transforming methods (@Advice.OnMethodEnter, @Advice.OnMethodExit). When the agent matches a class, it generates a new bytecode file for the class with method-entry and method-exit hooks injected; the JVM loads the modified bytecode. The transformed class is functionally identical except for the added telemetry calls. The cost is per-class-load (one-time) plus per-call (microseconds). The benefit is universal coverage of any JVM-loaded code, including third-party libraries you never knew were on your classpath. The hypothetical IRCTC ticketing fleet runs the javaagent across 200+ Java services without any source-code touch from any team — the agent jar is a deployment-time concern, not a development-time concern, and that separation of concerns is most of what makes it scale organisationally.
The deeper subtlety is classloader hierarchy. A complex Java service often has multiple classloaders — the system classloader for the JDK, the application classloader for your code, plugin classloaders for libraries loaded dynamically (Hibernate's enhancer, OSGi modules, Tomcat web-app classloaders). The javaagent attaches to the system classloader by default and may miss classes loaded by child classloaders unless the instrumentation package explicitly handles the hierarchy. This is why the OTel javaagent has separate instrumentation modules for "Servlet 3.1" and "Servlet 5.0" — different Tomcat versions load servlets through different classloader paths, and the matching rules differ. When a Java team reports "auto-instrumentation works in unit tests but produces no spans in production", classloader hierarchy is the first thing to check. The agent's -Dotel.javaagent.debug=true flag dumps which classes were transformed, and the absence of org.your.Servlet from that list is the diagnostic.
Async context propagation — the messy details
Python's auto-instrumentation uses contextvars.ContextVar for span context, which is the right choice — ContextVar is asyncio-native and propagates through await correctly. The trouble starts when work crosses out of the async loop into a thread (run_in_executor) or a process (multiprocessing). The thread does not inherit the ContextVar by default; the process certainly does not. The OTel SDK provides helpers — opentelemetry.context.attach(token) to set the context in the new thread, detach(token) to clear it — but the application code must use them. Java's javaagent solves this more aggressively: it wraps ExecutorService.submit to capture the current span and re-attach it in the worker thread. Python is moving toward similar wrapping for concurrent.futures.ThreadPoolExecutor in newer instrumentation versions, but the wrapping is only partial; full async context propagation in Python remains a manual discipline for now.
Cross-process propagation is harder still. When a Flask service spawns a subprocess via subprocess.Popen or hands work to a Celery worker, the trace context must be serialised, transmitted, and re-attached on the other side. The OTel spec defines two propagation formats — W3C traceparent for HTTP-style headers, and a Baggage header for additional context. Auto-instrumentation handles the easy cases (the requests instrumentation injects traceparent into outbound headers, the Celery instrumentation injects it into task metadata) but the hard cases — a custom RPC protocol, a queue your team built in-house, a job submitted via AWS SQS — require manual injection. The pattern is always the same: inject the current context into a carrier with propagator.inject(carrier) before sending, and restore it with propagator.extract(carrier) on the receiving side. Skipping this step is what produces the "fragmented forest" of traces that looks like every request started fresh.
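The same inject/extract pair, sketched for a queue the agent knows nothing about — the enqueue and worker functions are hypothetical; the propagate API is the standard one:
# manual_propagation.py — carry trace context through a home-grown queue by hand.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("merchant-onboarding")

def enqueue_job(queue: list, payload: dict) -> None:
    headers: dict = {}
    inject(headers)                      # writes traceparent (and baggage) into the dict
    queue.append({"payload": payload, "otel": headers})

def worker_loop(queue: list) -> None:
    for job in queue:
        parent_ctx = extract(job["otel"])                 # rebuild the remote context
        with tracer.start_as_current_span("process_job", context=parent_ctx):
            pass                                          # the job body goes here

jobs: list = []
with tracer.start_as_current_span("submit_kyc_refresh"):
    enqueue_job(jobs, {"merchant_id": 101})
worker_loop(jobs)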
When to drop auto and write manual spans
A practical rule from production: add manual spans only for the operations that show up in your top-5 latency contributors. If compute_kyc_score is the operation that determines whether a request meets your p99 SLO, you write a manual span for it (and probably for its 2–3 sub-operations). If format_response_json is just glue code at the end of the handler, leave it inside the SERVER span. The question to ask: "if this operation was 100 ms slower, would the on-call need to see it as a separate span to debug?" Yes → manual span. No → leave it to auto. A typical Razorpay-shape merchant-onboarding service ends up with 8–12 manual spans per request, on top of the ~6 auto-generated ones.
The corollary is that the first manual spans you add should be informed by data, not by guessing. Run the service with auto-instrumentation only for two weeks. Look at the traces in Tempo. Filter by p99 latency. Look for traces where the largest contributor to total duration is time sitting inside the SERVER span that no child span accounts for. Those gaps are where manual spans pay back. Rolling out manual spans without this data leads to teams instrumenting validate_input (which takes 50 µs) while ignoring compute_risk_score (which takes 80 ms), because the team that wrote validate_input cared about it and the team that wrote compute_risk_score did not. Data-driven manual instrumentation is what gets you to 95% trace usefulness; opinion-driven manual instrumentation gets you to 75% and a lot of low-value spans.
The hybrid future — auto + annotations, and the eBPF alternative
The OTel community's current direction is "auto for libraries, annotations for business logic". Java's @WithSpan annotation is the model: you annotate your business method, and the javaagent (which is already loaded) sees the annotation and auto-injects the span-emitting bytecode at class-load. Python's analogue — using tracer.start_as_current_span as a decorator, or a small home-grown wrapper like the sketch below — exists but is less elegant, because the decorator has to be applied in application code and has to reference a tracer; there is no agent to inject it at load time. The future is that auto-instrumentation handles the framework boundary, annotations handle the business operations, and the application code never explicitly references the OTel SDK at all. Spring Boot 3.2 with the javaagent is already this world; Python's FastAPI ecosystem is moving toward it via dependency-injected tracers.
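A sketch of that home-grown annotation style for Python — the decorator keeps the SDK reference in one module so business code only ever sees @with_span; names are hypothetical:
# with_span.py — a @WithSpan-flavoured decorator for Python business operations.
import functools
from opentelemetry import trace

tracer = trace.get_tracer("merchant-onboarding")

def with_span(name=None):
    def decorator(fn):
        span_name = name or fn.__qualname__
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@with_span("compute_kyc_score")
def compute_kyc_score(merchant_id: int) -> float:
    return 0.42   # placeholder business logic

print(compute_kyc_score(101))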
A parallel direction is eBPF-based auto-instrumentation, where a kernel-side eBPF program attaches uprobes to known library symbols (http.HandleFunc in Go, Connection.execute in Python's psycopg2, HttpServletRequest.getMethod in Java) and emits span-shaped events to userspace without touching the application process at all. Tools like Grafana Beyla and Pixie use this approach. The win is zero application changes, zero language-specific agents, and zero startup cost paid by the application. Why eBPF cannot fully replace language agents yet: eBPF sees only what crosses the kernel/user boundary or hits a uprobed symbol, so the spans are coarser than what a language agent produces, and span attributes (the request payload, the SQL query text, the route template) are harder to extract because the eBPF program has to parse the language-specific data structure from outside the process. A Java HttpServletRequest object is reachable in heap memory but the eBPF probe would need to know the JVM's object layout — version-specific, hard to maintain. So eBPF gets you "this URL was hit, this took 47ms" reliably; it struggles to get you "the SQL was SELECT * FROM merchants WHERE kyc=$1, the params were [pending]". For most fleets in 2026, the right answer is "language agents for primary observability, eBPF for the cases where you cannot install an agent" — sidecar-less environments, third-party closed-source binaries, kernel-network-stack timing. eBPF auto-instrumentation is covered properly in /wiki/why-ebpf-changed-the-game.
Reproduce this on your laptop
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install flask requests opentelemetry-distro \
opentelemetry-instrumentation-flask \
opentelemetry-instrumentation-requests \
opentelemetry-exporter-otlp-proto-grpc grpcio
opentelemetry-instrument \
--traces_exporter otlp \
--exporter_otlp_endpoint http://localhost:24317 \
--exporter_otlp_insecure true \
--service_name razorpay-merchant-onboarding \
python auto_instr_demo.py
# Watch: 10 spans emitted, route templates preserved on SERVER, full URL on CLIENT.
# Then re-run without `opentelemetry-instrument` to see the same code emit zero spans.
# The diff between those two runs is what auto-instrumentation buys you.
Where this leads next
The next chapter /wiki/semantic-conventions covers the spec that defines what auto-instrumentation is supposed to put in span attributes — http.route vs http.target vs url.full, db.system vs db.name, the resource attribute set every span carries. Auto-instrumentation packages try to follow semantic conventions but are often a few versions behind the spec; the chapter explains why and what to do about it.
After semantic conventions comes /wiki/the-otlp-protocol, the wire-format detail of what your auto-instrumented spans actually look like as bytes leaving your process — the same protocol the Collector (/wiki/the-collector-receivers-processors-exporters) decodes on the other side. Together, the three form the OTel vertical: the agent emits, the protocol carries, the collector transforms.
The orthogonal direction worth following is sampling — even a fully auto-instrumented service produces too many spans to keep at scale, and tail-based sampling at the Collector is what lets you keep the interesting ones. /wiki/wall-sampling-is-where-the-hard-tradeoffs-live frames the choices; the auto-instrumentation chapter you just finished is what makes the choices necessary in the first place.
Aditi shipped auto-instrumentation across her 38 services in two sprints, hit the URL-cardinality bomb in week three, the orphaned-span issue in week five, and the missing-business-span gap in week six. By week ten her team had auto-instrumentation as the floor, ~150 manual spans across her services as the ceiling, and a runbook that called out all three failure modes by name. The right way to read this chapter is the same way she did: the agent gives you a head start, not a finish line; understanding what it cannot do is what lets you finish.
References
- OpenTelemetry Python auto-instrumentation docs — official guide for opentelemetry-instrument, the supported-libraries list, and the configuration env vars.
- OpenTelemetry Java agent docs — javaagent installation, supported libraries, debugging tips for class-load failures.
- opentelemetry-python-contrib on GitHub — source of every opentelemetry-instrumentation-* Python package; reading one is the fastest way to understand the mechanism.
- Byte Buddy documentation — the bytecode-rewriting library the Java agent uses; useful when debugging instrumentation behaviour at the JVM level.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 7 on instrumentation strategy frames the auto-vs-manual trade-off well.
- wrapt library docs — the decorator-wrapping library Python instrumentation relies on; the section on "transparent object proxy" explains why monkey-patching Python classes is safe.
- /wiki/sdks-vs-api — internal: the API/SDK split that auto-instrumentation builds on.
- /wiki/the-collector-receivers-processors-exporters — internal: where the auto-emitted OTLP bytes go next.