Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

Building custom instrumentation

It is 02:14 IST on a Tuesday at a hypothetical Cred rewards-engine team, and Karan, the senior backend engineer on call, is reading a customer-support ticket that says "I redeemed 4,200 coins for a Myntra voucher and my balance dropped by 8,400". The auto-instrumentation traces look perfect — POST /redeem returned 200, the database span shows a single UPDATE that completed in 14ms, the downstream voucher-issuance span shows a single 200 from the partner API. There is no second span anywhere in the trace tree. Twenty minutes in, Karan adds three lines to the redemption handler — one custom span around the apply_redemption function with coins_debited and coupon_value_inr attributes — deploys to canary, and waits. The next reproduction shows two apply_redemption spans inside the same trace, both successful, the second one firing 240ms after the first because a Kafka rebalance had caused the consumer to reprocess the message and the idempotency check was reading from a Redis replica that had not caught up. Auto-instrumentation could never have shown that. The bug lived inside one Python function, between two database calls, in a code path the OTel flask and redis integrations have no idea exists.

Custom instrumentation is what you write when the bug is in your business logic, your domain language, your specific shape of correctness — when the ten built-in integrations have given you everything they can and you are still flying blind. It is the difference between "this request was slow" and "this request entered the fraud-review path because the device fingerprint matched 4 prior chargebacks within 7 days, then waited 2.1 seconds on the manual-review queue, then succeeded". Done well, custom instrumentation turns your service into a system that explains itself; done poorly, it produces gigabytes of low-cardinality noise that costs ₹4 lakh a month to store and answers no questions.

Custom instrumentation is the code you write to make your application's domain visible — spans around business operations, attributes naming the inputs and outcomes, events marking decisions. The seven decisions that separate useful from useless are: span name, span kind, parent span, attribute keys, attribute cardinality, event placement, and status. Get those seven right and a 02:14 IST debugging session turns into a five-minute query; get them wrong and you have an expensive log-line generator that nobody trusts.

What auto-instrumentation gives you, and what it cannot

Auto-instrumentation (chapter 92) hooks into well-known libraries — flask, requests, psycopg2, redis, pymongo, kafka-python, boto3 — and emits spans for every operation those libraries perform. A Flask GET /redeem/{voucher_id} produces a SERVER-kind span automatically; a downstream requests.post("partner-api/...") produces a CLIENT-kind span automatically; a cursor.execute("UPDATE wallets SET ...") produces a db.statement span automatically. The auto-instrumentation reads the library's internals, knows the wire-protocol-level operation boundaries, and emits spans at exactly those boundaries. You add zero lines of code and your trace tree is already 80% complete.

The 20% that auto-instrumentation cannot give you is everything that lives between library calls. The apply_redemption function in Karan's incident is pure Python — it reads from Redis, applies business rules, decides whether to debit coins, decides whether to issue a voucher, decides whether to log a fraud event. None of those decisions are library calls; they are conditional branches inside a function. Auto-instrumentation has no hook into "the moment the fraud-rule decided to require manual review" or "the moment the idempotency check read a stale value from a Redis replica". Those moments only become visible if you write the spans yourself.

Illustrative — auto-instrumentation reveals the library-level shape of a request; custom instrumentation reveals the business-logic shape. The 4.1 seconds of "business-logic darkness" in the top trace is exactly where the bugs live, and exactly where auto-instrumentation cannot help. The bottom trace is the same request after three `tracer.start_as_current_span` calls were added around the domain functions.

The decision of what to wrap in a custom span follows a simple rule: every function whose execution would change the answer to "what happened to this request" deserves a span. The Flask handler is already wrapped (auto-instrumentation). The Redis call is already wrapped. The database call is already wrapped. The apply_redemption function — the one that decides the outcome — is not. Wrap it. The check_idempotency function — the one that decides whether to proceed — is not. Wrap it. The fraud_review_decision function — the one that may add 2 seconds of latency to specific requests — is not. Wrap it. Why this rule beats "wrap every function": instrumenting every function produces a span tree with 200 nodes per request, dominated by trivial helper-function spans (format_currency, parse_uuid, validate_email) that add no information and 5–10x the span-storage cost. The right granularity is the outcome-changing function — the one whose result the trace consumer would actually want to filter or group by. A span around format_currency answers no debugging question; a span around apply_redemption with {decision=succeeded, coins_debited=4200} answers "how many redemptions succeeded with >4000 coins in the last hour" with a single TraceQL query.

The same logic applies to the attributes you attach. Auto-instrumentation gives you HTTP-level attributes (http.request.method, http.response.status_code, url.path) that follow the semantic conventions. Custom instrumentation gives you domain-level attributes — customer.id, redemption.coins_debited, redemption.voucher_value_inr, redemption.partner, fraud.score, fraud.rule_fired, idempotency.cache_hit, idempotency.replica_lag_ms. These are the attributes a debugging engineer at 02:14 IST will filter on. They are also the attributes that, if chosen carelessly, will drive your trace-storage bill into orbit; we will return to that.

The seven decisions, in working code

The cleanest way to internalise the decisions custom instrumentation forces is to write a small instrumented function, run it, and read the resulting span. The script below builds the same redemption pipeline Karan eventually shipped — three nested custom spans around three domain functions, attributes naming the domain state, an event marking the fraud-rule decision, status set on failure paths — and prints the resulting span tree to the console using the OTel ConsoleSpanExporter.

# custom_instrumentation_demo.py — wrap three domain functions in custom spans,
# show what the resulting span tree looks like and why each decision matters.
# pip install opentelemetry-sdk opentelemetry-api
import time, random, uuid
from opentelemetry import trace
from opentelemetry.trace import SpanKind, Status, StatusCode
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.sdk.resources import Resource

# Decision 0 — every service identifies itself once via Resource.
provider = TracerProvider(resource=Resource.create({
    "service.name": "cred-rewards-redemption",
    "service.version": "1.18.4",
    "deployment.environment": "production",
}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("cred.rewards.custom", "1.0.0")

def check_idempotency(idempotency_key: str) -> dict:
    # Decision 3 — name spans by domain operation, not by function name pattern.
    with tracer.start_as_current_span(
        "rewards.idempotency.check",                    # Decision 1: span name
        kind=SpanKind.INTERNAL,                         # Decision 2: span kind
    ) as span:
        time.sleep(0.012)
        replica_lag_ms = random.choice([5, 5, 5, 240])  # most lookups fast, occasionally stale
        cache_hit = False
        # Decision 4 — domain attributes name the inputs and outcomes,
        # using bounded-cardinality keys (no raw IDs as attribute *keys*).
        span.set_attribute("idempotency.cache_hit", cache_hit)
        span.set_attribute("idempotency.replica_lag_ms", replica_lag_ms)
        if replica_lag_ms > 100:
            # Decision 6 — events mark a moment the trace consumer should see
            # without inflating the span count.
            span.add_event("idempotency.replica_lag_warning",
                           attributes={"threshold_ms": 100, "observed_ms": replica_lag_ms})
        return {"hit": cache_hit, "replica_lag_ms": replica_lag_ms}

def fraud_review(customer_id: str, coins: int) -> dict:
    with tracer.start_as_current_span("rewards.fraud.review",
                                      kind=SpanKind.INTERNAL) as span:
        span.set_attribute("fraud.coins_at_risk", coins)
        # Decision 5 — bound the cardinality of attribute *values*. Use
        # rule names (low cardinality), not customer_ids (unbounded).
        rule_fired = "device_fingerprint_chargeback_lookback" if coins > 3000 else "none"
        decision = "manual_review" if rule_fired != "none" else "auto_approve"
        span.set_attribute("fraud.rule_fired", rule_fired)
        span.set_attribute("fraud.decision", decision)
        if decision == "manual_review":
            time.sleep(2.1)  # the queue wait
            span.add_event("fraud.queued_for_manual_review",
                           attributes={"queue_wait_ms": 2100})
        return {"decision": decision, "rule": rule_fired}

def apply_redemption(customer_id: str, coins: int, voucher_partner: str):
    with tracer.start_as_current_span("rewards.redemption.apply",
                                      kind=SpanKind.INTERNAL) as span:
        span.set_attribute("customer.id", customer_id)            # high-cardinality VALUE, low-cardinality KEY — fine
        span.set_attribute("redemption.coins_debited", coins)
        span.set_attribute("redemption.partner", voucher_partner)  # low cardinality — ~50 partners
        idem = check_idempotency(f"redeem:{customer_id}:{coins}")  # nested child span via context
        if idem["hit"]:
            span.set_attribute("redemption.outcome", "duplicate_skipped")
            return {"status": "skipped"}
        fraud = fraud_review(customer_id, coins)
        if fraud["decision"] == "manual_review":
            span.set_attribute("redemption.outcome", "queued_manual")
            # Decision 7 — status conveys outcome; it is NOT a duplicate of HTTP code.
            # OK = the operation completed as intended (queued is a valid outcome).
            span.set_status(Status(StatusCode.OK))
            return {"status": "queued"}
        time.sleep(0.014)
        span.set_attribute("redemption.outcome", "completed")
        span.set_status(Status(StatusCode.OK))
        return {"status": "completed"}

# Run the redemption pipeline with a request large enough to trip the fraud rule.
result = apply_redemption(customer_id=str(uuid.uuid4()), coins=4200, voucher_partner="myntra")
provider.force_flush()
print("Final result:", result)

Sample run (ConsoleSpanExporter output, summarised):
{ name: "rewards.idempotency.check"  duration: 12ms
  attributes: { idempotency.cache_hit: false, idempotency.replica_lag_ms: 240 }
  events: [{ name: "idempotency.replica_lag_warning", threshold_ms: 100, observed_ms: 240 }]
  status: UNSET }
{ name: "rewards.fraud.review"  duration: 2,103ms
  attributes: { fraud.coins_at_risk: 4200, fraud.rule_fired: "device_fingerprint_chargeback_lookback",
                fraud.decision: "manual_review" }
  events: [{ name: "fraud.queued_for_manual_review", queue_wait_ms: 2100 }]
  status: UNSET }
{ name: "rewards.redemption.apply"  duration: 2,118ms
  attributes: { customer.id: "8f3e...c7a2", redemption.coins_debited: 4200,
                redemption.partner: "myntra", redemption.outcome: "queued_manual" }
  status: OK }
Final result: {'status': 'queued'}

The three custom spans nest correctly without any explicit parent-passing — tracer.start_as_current_span reads the active span from contextvars and sets it as the parent of the new span, then makes the new span the active one for its body. This is decision 3 in disguise: parent-child relationships come from the call stack, not from arguments. Why this matters: the contextvars-based propagation means an awaited coroutine, a function called from inside a with block, or a callback registered with the active span all participate in the same trace tree without you wiring anything. The failure mode is when work escapes the context — a concurrent.futures.ThreadPoolExecutor.submit() does NOT inherit the active span by default, so a span started inside the submitted callable is parented to nothing (a new root). The OTel docs warn about this; the fix is with tracer.start_as_current_span(...): future = executor.submit(otel_context_wrapped(fn)) where otel_context_wrapped propagates the active context manually.

The seven decisions visible in those 60 lines are worth naming explicitly because every one of them has a wrong default that production fleets discover only at scale.

Decision 1 — span name. Use a hierarchical, dot-separated, low-cardinality name. "rewards.redemption.apply" is correct; "apply_redemption" is acceptable; f"apply_redemption_{customer_id}" is a disaster — every customer gets their own span name, your span-name index in Tempo blows up, and the auto-derived RED metrics (rate-error-duration per span name) become useless. The rule: span names should be cardinality-bounded by your operation set, not by your data. A redemption service has ~30 distinct operations; the span-name space should have ~30 distinct values, never thousands.

Decision 2 — span kind. INTERNAL for in-process work, SERVER for handling an inbound RPC, CLIENT for outbound RPCs, PRODUCER/CONSUMER for messaging. Auto-instrumentation gets this right by default; your custom spans are almost always INTERNAL (the OTLP protocol chapter covers why this matters for backend metric derivation). The wrong default is "leave it unset" — unset becomes INTERNAL in most SDKs, but some collectors and backends treat unset and INTERNAL differently for service-graph computation.

Decision 3 — parent. Always inherit the active context. Override only when you have a defensible reason — a fan-out worker that legitimately starts a new trace (use tracer.start_as_current_span("...", context=trace.set_span_in_context(NonRecordingSpan(SpanContext(...))))) or a follow-on async task that should link to but not be parented by its initiator (use Link instead). The wrong default is to manually pass parent= everywhere, which clutters every signature and breaks the moment one caller forgets.

Decision 4 — attribute keys. Use the semantic conventions for anything covered there (http.*, db.*, messaging.*, rpc.*); use a service-prefixed namespace for everything domain-specific (redemption.*, fraud.*, idempotency.*). Never invent ad-hoc keys (coins, redemed_for) — they collide across services. The keys are your stable query surface; renaming them later breaks every dashboard, alert, and saved query.

Decision 5 — attribute cardinality. Bound the number of distinct values per key. redemption.partner has ~50 values (good — fast aggregation). customer.id has 30M values (acceptable as a span attribute, because traces store cardinality cheaply, but never as a metric label). redemption.full_request_body_json has unbounded cardinality and unbounded size — never. Why the trace-vs-metric asymmetry matters here: a Prometheus time-series with 30M customer.id label values would create 30M active series, each consuming ~3KB of resident memory in the head block — about 90GB of RAM, OOMing every Prometheus instance within minutes. The same 30M customer.id values as span attributes cost ~30M sparse rows in Tempo's parquet blocks, which compress to a few hundred MB on disk because attribute values are stored only on the spans that have them, not as a multi-dimensional time-series cube. The data shape — sparse per-event rows vs dense per-second time-series — is what makes high-cardinality safe on traces and lethal on metrics. The chapter on cardinality covers why metrics fail under high cardinality; trace backends handle high-cardinality span attributes much better, but a 50KB attribute string still costs 50KB per span, and 220k spans/sec × 50KB = 11 GB/sec of telemetry, which no backend will swallow politely.

Decision 6 — events vs spans. A span.add_event("fraud.queued_for_manual_review", attributes={...}) is the right choice for a moment-in-time observation that does not have its own duration. A with tracer.start_as_current_span("fraud.queue_wait") is the right choice for an operation with a measurable duration. Events are 5–10x cheaper than spans (no span-context, no parent-pointer, no separate lifetime); a debugging engineer can still filter on them in the trace UI, but they do not pollute service-graph computation or per-span RED metrics.

Decision 7 — status. StatusCode.OK means "the operation completed as the code intended" — which includes outcomes like "queued for manual review" or "duplicate request, skipped". StatusCode.ERROR means "the operation failed in a way the calling code should treat as a failure". StatusCode.UNSET (the default) means "I don't know" and should be left alone for spans where success is not meaningful (a logging span, a context-propagation-only span). The wrong default is "set ERROR whenever HTTP status is 4xx" — a 404 on a GET /user/{id} is a normal outcome, not an error of the span.

Where custom instrumentation goes wrong at scale

A first-pass custom-instrumentation pull request usually ships before anyone audits its cost. The author wraps every function they care about, attaches every attribute they can think of, ships, and the trace cost climbs by 4x in the first week. The patterns below are the ones every fleet eventually has to clean up; recognising them early saves an embarrassing meeting with the FinOps team.

The "wrap every function" trap. A redemption service has perhaps 30 outcome-changing operations. A naive instrumentation wraps all 200 functions, including pure helpers (format_currency, parse_uuid, validate_email). The result is span trees with 200+ nodes per request, where 170 of them are 30-microsecond helpers carrying zero domain attributes. The Tempo cost climbs ~6x for ~5% added insight. The fix is to ask, for every span you propose: would a debugging engineer ever filter on this span name or its attributes? If no, do not create the span — leave the function uninstrumented and let the parent span's wall-clock duration absorb its cost.

The unbounded-attribute-value trap. A common pattern in early instrumentation is span.set_attribute("request.body", json.dumps(request_data)) — "future-me will thank present-me for capturing the full input". Future-you will not. A 4KB request body × 220k spans/sec × 24 hours = 75 TB/day of attribute storage; even with 60% gzip in OTLP that is 30 TB/day on the wire. The fix is the same as for log hygiene: pick the 5–10 fields that matter (customer.id, redemption.coins_debited, request.partner) and attach them as separate, bounded attributes. The full payload, if you really need it, belongs in a structured log line joined to the trace by trace_id.

The attribute-key namespace collision. Service A names its customer attribute userId. Service B names it user_id. Service C names it customer.id per semantic conventions. A trace that spans all three is uncorrelatable — TraceQL queries that filter on one key miss spans that use another. The fix is a service-wide attribute taxonomy enforced in code review and via the Collector's attributesprocessor (which can rename userId → customer.id at the edge). The Razorpay-shape platform team that owns this taxonomy maintains a single attributes.yaml file in the platform repo, code-reviews any addition, and rejects any PR that introduces a new key colliding with an existing one.

The status-as-HTTP-code antipattern. A handler that sees response.status_code == 404 calls span.set_status(Status(StatusCode.ERROR)). Three weeks later, the team's SLO dashboard shows a 12% error rate on /voucher/{id} and on-call gets paged at 03:00. The "errors" are all 404s on missing-voucher lookups — a normal outcome, not a service failure. The fix is to map HTTP status to span status thoughtfully: 5xx and intentional retries → ERROR; 4xx for client errors that are actually exceptional → ERROR; 4xx that represent normal outcomes (404 Not Found, 409 Conflict on idempotent requests) → OK or UNSET. The OTel HTTP semconv (§ http-spans) actually says: 5xx is always ERROR; 4xx is ERROR only on the client side, not the server side. Most teams get this wrong on first ship.

Illustrative — between roughly 8 and 25 thoughtfully-placed custom spans per request, you cover about 85% of common debugging questions at about 2x the auto-instrumentation-only cost. Past ~25 spans the cost grows linearly while the insight curve plateaus; this is where pure-helper-function instrumentation lives. The curve shape is robust across services we have seen at Razorpay/Hotstar-shape scale; the exact numbers shift by 30–50% depending on which auto-integrations are loaded.

The implication of the cost-vs-insight curve is that the right design is a span budget per request. A redemption-service handler aiming for 10–20 custom spans per request hits the insight plateau without falling off the cost cliff. The Hotstar-shape playback service that handles 1.2M concurrent IPL viewers cannot afford 200 spans per request — at 220k req/sec × 200 spans × 600 bytes/span × 60% compression × 86400 seconds/day, the wire bill alone is ₹40 lakh/month before storage. The discipline is to ship custom instrumentation in two phases: phase 1 wraps the 5–8 outcome-changing functions; phase 2 adds spans only where a real debugging session showed a blind spot. This forces every span to earn its place by answering a question that was actually asked.

The other operational discipline is deletion. Custom spans accumulate over years; the function they wrapped gets deleted, the span call gets renamed but not removed, the span's attributes drift from the actual data. A quarterly review of "spans whose name appeared in zero TraceQL queries in the last 90 days" — easy to compute from your trace backend's query logs — is the cheapest way to keep the span surface relevant. A platform team at a hypothetical Swiggy-shape delivery service runs this audit in their weekly platform-engineering meeting; spans that nobody queried get a one-week notice, then a PR that removes them, then a celebration when the trace bill drops.

Edge cases the SDKs will not warn you about

Custom instrumentation runs inside your application, on your call stack, in your async runtime, under your error-handling regime. The SDK does its best to be invisible, but a handful of edge cases bite every fleet that ships custom spans at scale. Knowing them avoids the 04:00 IST debugging session.

The exception-on-span-end leak. A with tracer.start_as_current_span(...) block that raises an exception inside the body will end the span on exit (the context manager handles cleanup), but the span's status is left UNSET unless you explicitly set it. Many tracing UIs render UNSET as success; the failed span looks identical to a successful one. The fix is record_exception=True (the default in newer SDKs) plus an explicit span.set_status(Status(StatusCode.ERROR, str(exc))) in an except block before re-raising, which both records the exception as a span event AND marks the status. The Python OTel API does this automatically for start_as_current_span when record_exception=True and set_status_on_exception=True are passed (defaults in opentelemetry-api >= 1.20); in older versions it does not, and many production codebases pinned to older versions silently swallow exception status.

The start_span vs start_as_current_span confusion. start_span returns a span but does NOT make it the active span for context propagation; start_as_current_span does both. If you call start_span and then tracer.start_as_current_span("child") inside its body, the child is parented to whatever was active before start_span, not to the span you just created. The fix is to almost always use start_as_current_span (or to manually with trace.use_span(span, end_on_exit=True): after start_span). The wrong-parent bug surfaces as "my child spans appear at the top of the trace alongside their parent" — usually noticed weeks after deploy when a debugging engineer cannot understand why a trace tree is flat.

Async context propagation breaks at thread-pool boundaries. Python's concurrent.futures.ThreadPoolExecutor creates worker threads with empty contextvars, so a span started inside a executor.submit(fn) callable is parented to nothing (it becomes a new trace root). The OTel opentelemetry-instrumentation-asyncio and opentelemetry-instrumentation-threading integrations propagate context across these boundaries; if you do not load them, you must wrap your callables manually: executor.submit(functools.partial(otel_context.copy_context().run, fn)) or use the helper opentelemetry.context.attach(otel_context.get_current()) inside the callable. The symptom is "my fan-out worker spans show up in their own traces" — fixable but ugly to detect after the fact.

FastAPI background tasks share request context only until response is sent. A BackgroundTasks.add_task(fn) in FastAPI runs the function after the response is returned. The OTel instrumentation has already ended the SERVER span by then; the background task's spans either start a new trace root (if context is detached) or, worse, attach to a closed span (if context is still set). The fix is to explicitly capture the trace context before the response — current_ctx = otel_context.get_current() — and re-attach it inside the background task: token = otel_context.attach(current_ctx); try: ...; finally: otel_context.detach(token). Or use a Link to the inbound trace instead of trying to extend it. Either way, this is application-specific glue you must write; the OTel auto-instrumentation does not know about your background-task pattern.

Sampling decisions are made on root-span creation, not later. When tracer.start_as_current_span creates a root span (no parent), the sampler runs and decides keep-or-drop. If you start a span manually with a forced trace ID and pretend it is part of a larger trace (a common pattern when reconstructing context from a Kafka message header), you must propagate the parent's sampling decision via the traceFlags field — otherwise your child spans run their own sampling and you end up with traces where some spans are kept and some dropped. The fix is to use the canonical context-propagation API (TraceContextTextMapPropagator().extract(carrier=...)) instead of constructing SpanContext manually; the propagator preserves the sampled bit. Manual SpanContext(trace_id=..., span_id=..., is_remote=True, trace_flags=TraceFlags(0x01)) is correct only if you remember the 0x01; the default TraceFlags(0) means "not sampled" and your spans get dropped silently.

Attribute-value type mismatches lose data silently. OTel's attribute API accepts str, int, float, bool, and homogeneous arrays of those. A dict value, a datetime, a Decimal, a bytes object — all are silently rejected by the SDK (some versions log a debug-level warning that nobody sees in production). The pattern that bites teams is span.set_attribute("redemption.timestamp", datetime.now()) — datetime is not a supported type, the attribute is dropped, the trace looks fine in the UI but is missing the key debugging field. The fix is span.set_attribute("redemption.timestamp", datetime.now().isoformat()) (string) or ... .timestamp() * 1e9 (int nanoseconds, matching the OTLP timestamp encoding). A pre-commit hook that scans for set_attribute calls with non-primitive arguments catches this before it ships.

The "I'll just monkey-patch the third-party library" detour. When auto-instrumentation does not exist for a library you depend on (a niche payment gateway SDK, a Bombay Stock Exchange feed parser), the temptation is to monkey-patch its functions to add spans. This works briefly. It breaks on the next library upgrade because the patched function's signature changed, or because the library moved to async, or because some other team's auto-instrumentation already patched the same function in a different way. The cleaner pattern is to write a thin wrapper module in your codebase that calls the third-party library and adds spans — def make_payment(...): with tracer.start_as_current_span("bse.feed.parse"): return bse_lib.parse(...) — and import that wrapper everywhere instead of the raw library. This trades a small amount of import-line churn for a stable instrumentation surface that survives library upgrades, and it confines the OTel knowledge to one file.

Common confusions

"Custom instrumentation replaces auto-instrumentation." It does not. Custom instrumentation complements auto-instrumentation — auto for library boundaries, custom for business logic. Removing the auto-instrumentation packages and writing all spans by hand means re-implementing the HTTP, database, and Redis spans you got for free; it is months of work to rebuild what pip install opentelemetry-instrumentation-flask opentelemetry-instrumentation-redis opentelemetry-instrumentation-psycopg2 gave you in 30 seconds.
"Every function deserves a span." It does not. The cost-vs-insight curve plateaus around 25 spans per request. Past that, you pay for spans nobody filters on. Wrap outcome-changing functions and the operations a debugging engineer would actually query.
"span.set_attribute and span.add_event are interchangeable." They are not. Attributes describe the span's whole operation (its inputs and outcomes); events describe specific moments within the span (a decision point, a queue-wait start, a retry attempt). A backend stores them differently — attributes are indexed for filtering, events are stored as a per-span timeline. Use attributes for filterable fields, events for moments in the span's history.
"Setting StatusCode.ERROR makes a span show up in error dashboards." It depends on the backend. Tempo's metrics-generator and Honeycomb's auto-RED dashboards filter on status.code = ERROR for the per-service error rate. Jaeger's UI filters on the error=true legacy tag, not on the OTLP status.code field, unless you have a recent collector that translates between them. Setting StatusCode.ERROR is necessary but not sufficient — confirm your backend honours it before relying on the dashboard.
"A custom span automatically inherits its parent's attributes." It does not. Each span carries only the attributes set on it directly. If you set customer.id on the SERVER span and then start an INTERNAL apply_redemption span, the INTERNAL span has no customer.id attribute unless you set it again. The propagation is on trace_id/span_id/baggage, not on attributes. The discipline is to either set high-value attributes on every relevant child span or to use Baggage for attributes you want to propagate across spans (with the cost of including them in every outbound RPC header).
"Custom instrumentation has to be in every service." It is most valuable in services where your team owns the business logic. A platform-team Kafka consumer can run on auto-instrumentation alone; the redemption service that decides whether to debit coins absolutely cannot. Distribute the effort: heavy custom instrumentation in domain services, light to none in pure-passthrough services.

Going deeper

The instrumentation library you build, not consume

Mature instrumentation work eventually produces a small internal library — call it cred_obs or razorpay_telemetry — that exposes a few decorators and helper functions on top of the OTel SDK. The decorators make the common pattern (start a span, set status on exception, record common attributes) one line instead of six. A typical helper looks like @traced(operation="rewards.redemption.apply", attrs=["coins", "partner"]) and inspects the function's arguments to populate attributes automatically. This is not an attempt to replace OTel — it is a thin convention layer on top, enforcing your service's naming conventions and reducing per-call-site boilerplate. The library should be ~200 lines and live in your service's repo or a shared platform-libs repo. A team that ships cred_obs early sees per-pull-request instrumentation diffs shrink from 12 lines to 1, which is the difference between custom instrumentation getting added consistently vs only-when-someone-feels-like-it.

The library is also where you enforce attribute-naming conventions automatically. The @traced decorator can validate attribute names against your attributes.yaml registry, reject keys that violate convention (raise at process startup, before any traffic), and rename legacy keys to canonical ones via the Collector's attributes processor. This is a lighter-weight version of what semconv enforcement looks like at scale and is the right onramp before adopting full semantic conventions discipline.

Linking spans across asynchronous boundaries

Some operations are not parent-child but related. A redemption that triggers a fraud-review job that runs 30 minutes later in a separate worker should not have the worker's span parented to the redemption's span (the redemption ended long ago). The right primitive is Link — a span can declare links to other spans, and trace UIs render the link as "this span is related to (but not a child of) trace X span Y". The Python API is tracer.start_as_current_span("...", links=[Link(SpanContext(trace_id=..., span_id=...))]). Tempo and Honeycomb both render links in the trace UI; Jaeger renders them but de-emphasises them. This is the right mechanism for fan-out workers, retry chains across separate service invocations, and any "the cause was over here, the effect is over there, but they are not in the same trace tree" pattern.

The link-vs-child decision boils down to time. If the second span starts before the first one ends, it is a child; the trace tree captures the temporal containment correctly. If the second span starts after the first ends — even if it was caused by the first — it is a link. A sloppy team uses parent-child for both, ending up with traces that have 30-minute gaps and look like the application stopped responding when actually it just dispatched a job to a queue.

The cardinality cost model for trace attributes vs metric labels

Trace backends (Tempo, Jaeger, Honeycomb) handle high-cardinality attributes much better than metric backends because traces are sparse — you store one row per span, not one time-series per attribute combination. A customer.id attribute with 30M unique values costs ~30M attribute rows in trace storage, which Tempo's parquet-based blocks compress to a few hundred MB. The same customer.id as a metric label would create 30M time-series, each with its own active series memory in Prometheus (~3KB/series = 90 GB of RAM, OOMing your Prometheus instances within minutes).

The implication is that trace attributes are the right home for high-cardinality identifying data; metric labels are not. The discipline is: anything you might want to correlate (find the trace for customer X) goes on spans; anything you might want to aggregate (count requests by service) goes on metric labels. The chapter on cardinality covers the metric side; this chapter is the trace side. Both are fluent in the same conversation: a span has customer.id (high cardinality, queryable), a metric has service.name and http.route (low cardinality, aggregatable). Putting customer.id on the metric is the mistake; leaving it off the span is also the mistake.

Why the instrumentation review goes in code review, not in a separate phase

A team's first instinct is to "do an instrumentation pass" as a separate piece of work — write new code, then go back and add spans. This always misses spans because the engineer who wrote the code is not the engineer adding instrumentation, and the why-this-decision context is already out of head. The mature pattern is to require custom instrumentation in code review for any function whose behaviour is conditional on data. The review checklist asks: does this function branch on inputs in a way that affects the outcome? If yes, where is the span? Where are the attributes naming the inputs? Where is the event marking the branch decision?

A Razorpay-shape platform team that adopted this discipline three years ago measures the result: median time-to-root-cause on production incidents dropped from 47 minutes to 11 minutes, primarily because every code path was already instrumented when the incident happened. The cost is modest — about 4 lines of instrumentation per 50 lines of production code — and the payback is paged engineers solving incidents in minutes instead of hours. The discipline lives in the code-review template and in the engineering-onboarding doc; it does not live in a separate "observability sprint" because separate sprints are where instrumentation goes to be skipped under deadline pressure.

Where instrumentation belongs in the testing pyramid

Custom spans are runtime behaviour and deserve runtime tests. The right testing pattern is the InMemorySpanExporter — an OTel exporter that captures spans into a Python list during test execution, letting you assert on the resulting span tree. A test for the redemption pipeline looks like: invoke apply_redemption(coins=4200); assert the span tree contains spans named rewards.redemption.apply, rewards.idempotency.check, rewards.fraud.review; assert fraud.decision == "manual_review" on the fraud span; assert the apply span's status is OK (the queue outcome is not a failure). This catches the entire class of "I refactored the function and the span name changed" regressions that would otherwise surface only in production three months later.

The pattern works because OTel's TracerProvider is configurable per-test — provider = TracerProvider(); exporter = InMemorySpanExporter(); provider.add_span_processor(SimpleSpanProcessor(exporter)) — and the tests run in milliseconds because no actual export happens. A team that adds 3–5 instrumentation tests per domain function builds a regression net that catches the silent-attribute-loss and span-rename bugs before merge. Compare this to discovering them in a 02:14 IST incident.

Reproduce this on your laptop

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install opentelemetry-sdk opentelemetry-api
python3 custom_instrumentation_demo.py
# Watch: three nested spans print to console with attributes, events, status.
# Try changing coins=2000 to skip the fraud rule, or replica_lag_ms=5 to skip
# the warning event. Note how the span tree changes — this is exactly what
# debugging via traces feels like in production.

Where this leads next

Custom instrumentation produces spans; those spans land in a backend via the OTLP protocol and are filtered, sampled, and routed by the Collector before being stored. The next chapter /wiki/instrumentation-best-practices covers the patterns that scale custom instrumentation across a fleet — naming taxonomies, attribute registries, code-review checklists, and the "instrumentation as a product" mindset that distinguishes a Razorpay-shape platform team from one that ships spans by accident.

The orthogonal direction is sampling: with custom instrumentation in place, your traces are richer but more expensive, and tail-based sampling becomes the right tool for keeping all errors and the slow-tail traces while dropping the unremarkable middle. The interaction between custom-instrumentation density and tail-sampling decision-rate is what determines your trace bill — heavy custom instrumentation with aggressive tail-sampling is cheaper than light instrumentation with head-sampling, because the spans you keep are the ones a debugging engineer actually wants.

Karan's incident at 02:14 IST resolved by 03:09 with three custom spans, one Redis-replica-read fix, and a Kafka-consumer-rebalance handler that became idempotent against double-delivery. The custom instrumentation he added is still in production three years later; it has answered seventeen distinct debugging questions across that time, none of which auto-instrumentation could have answered. That ratio — one PR of custom instrumentation, seventeen incidents debugged faster — is the case for treating custom instrumentation as a first-class engineering deliverable, not as a "nice to have" you get to once auto-instrumentation has been live for a while.

The deeper lesson is that custom instrumentation is domain-specific observability. The auto-instrumentation projects can ship integrations for HTTP and Redis because every team uses HTTP and Redis the same way. They cannot ship integrations for your fraud-review logic because your fraud-review logic is yours alone — its semantics, its decision points, its outcomes. The translation from your domain into spans, attributes, events, and status is something only you can do, and the quality of that translation determines how observable your service actually is. A service whose code is uninstrumented in its domain logic is a service whose owners do not yet understand what their service does at runtime. Custom instrumentation is the act of writing that understanding down in a form the system can show back to you at 02:14 IST.

References

OpenTelemetry Tracing API Specification — the binding spec for start_span, start_as_current_span, attributes, events, and status semantics.
OpenTelemetry Python Tracing API — the canonical Python reference for the patterns shown in this article.
Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapters 5 and 7 on instrumentation discipline and high-cardinality attributes.
Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018) — Chapter 4 on the case for explicit, domain-aware instrumentation.
OpenTelemetry HTTP Semantic Conventions — the rules for span status when HTTP status is 4xx vs 5xx.
OpenTelemetry InMemorySpanExporter (Python) — the test-time exporter for asserting on instrumentation behaviour.
/wiki/auto-instrumentation — internal: what you get for free before you write a single custom span.
/wiki/semantic-conventions — internal: the attribute-naming taxonomy your custom spans should align with.