Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
The data model
It is 23:47 on the night before a Hotstar IPL knockout game. Karan, an SRE three weeks into his second job, is debugging why the new recommendations-api shows up correctly in Tempo but is missing from the Grafana service map. He greps the spans — service.name=recommendations-api is set, the http.method attribute looks right, the trace IDs propagate from gateway correctly. The spans are there. The service-map query, written in TraceQL, is { resource.service.name = "recommendations-api" } and returns nothing.
The same query without the resource. prefix returns thousands of spans. Karan stares at this for ninety minutes before realising the SDK was emitting service.name as a span attribute, not as a resource attribute, because someone had written tracer.start_span("score").set_attribute("service.name", "recommendations-api") instead of putting it in the Resource. The key matched character for character, but the layer was wrong; the Tempo index was looking at the resource layer and only the resource layer.
This is not a bug in OpenTelemetry. It is a bug in Karan's mental model of where attributes live. The OTel data model is one schema with three nesting levels — Resource, Scope, Signal — and the level you put an attribute on decides whether it identifies the emitter (Resource), the library (Scope), or the event (Signal). Mix them up and your queries silently miss data.
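A minimal sketch of the two placements (the tracer and span names are hypothetical); the SDK accepts both calls without complaint, and only the queries behave differently:
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
# Right: process identity goes on the Resource, once, at SDK init.
# Found by { resource.service.name = "recommendations-api" } in TraceQL.
trace.set_tracer_provider(TracerProvider(resource=Resource.create({
    "service.name": "recommendations-api",
})))
tracer = trace.get_tracer("recommendations.scoring")
# Wrong: the same key as a span attribute lands on the signal layer instead,
# so resource-indexed queries (the service map, per-service dashboards) miss it.
with tracer.start_as_current_span("score") as span:
    span.set_attribute("service.name", "recommendations-api")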
OpenTelemetry has one data model with three nesting levels. The outermost level — Resource — describes the entity that emitted telemetry (a service, a host, a Kubernetes pod). The middle level — Instrumentation Scope — describes the library that produced the signal (opentelemetry-instrumentation-flask v0.45). The innermost level — the signal itself — is one of Span, Metric data point, Log record, or Profile sample, all sharing a common envelope. Where an attribute lives in this hierarchy decides which queries find it.
One schema, three nesting levels
The four pillars — traces, metrics, logs, profiles — are not four schemas. They are four signal types inside one schema, all wrapped in the same two-layer envelope of Resource and InstrumentationScope. This unification is the core engineering decision of the OpenTelemetry project, and it is the decision that lets you correlate a metric to a trace to a log without writing three different glue layers — they already share the resource block that says who emitted this.
The outermost wrapper is the Resource. A Resource is a set of attributes that describe the entity producing telemetry — service.name, service.version, service.namespace, deployment.environment, host.name, host.id, os.type, process.pid, process.runtime.name, process.runtime.version, k8s.pod.name, k8s.namespace.name, cloud.provider, cloud.region, cloud.availability_zone. The Resource is set once when the SDK initialises and stays constant for the lifetime of the process. Every span, every metric data point, every log record, every profile sample emitted by that process inherits the same Resource. Why this matters: queries like "show me p99 latency for recommendations-api in ap-south-1" filter on resource attributes, and the storage tier (Tempo, Mimir, Loki) indexes resource attributes specially — typically as a separate column or label set distinct from signal-level attributes — because resource attributes have low cardinality (one set per process) while signal-level attributes have high cardinality (one set per event). Putting service.name on a span attribute instead of the resource means it does not get the index treatment and the query plan looks different.
The middle wrapper is the InstrumentationScope (older docs call it InstrumentationLibrary). A Scope identifies the code that produced the signal — the library name and version. Auto-instrumentation sets the scope to the instrumentation package name (opentelemetry.instrumentation.flask); manual instrumentation sets it to whatever you pass to tracer = trace.get_tracer("checkout.pricing", "1.2.0"). The scope is per-library, not per-process, so a single process can have many scopes — Flask spans under one scope, psycopg2 spans under another, your business-logic spans under a third. Scope attributes are rarely queried directly but are essential for debugging which version of which instrumentation emitted a malformed span.
The innermost layer is the signal. There are five signal types as of mid-2026, all sharing the Resource + Scope envelope:
- Span (a unit of work in a trace) — has trace_id, span_id, parent_span_id, name, kind (server, client, internal, producer, consumer), start_time_unix_nano, end_time_unix_nano, attributes, events, links, status.
- Metric data point — there are several shapes: Sum (cumulative or delta counter), Gauge (snapshot value), Histogram (explicit-bucket), ExponentialHistogram (sparse log-scale buckets), Summary (legacy, discouraged in favour of Histogram). Each has a value plus time_unix_nano, attributes, exemplars (links to traces), and aggregation temporality.
- Log record — has time_unix_nano, observed_time_unix_nano, severity_number, severity_text, body, attributes, trace_id, span_id (if emitted within a trace context), flags.
- Profile sample (stable since OTel-spec 1.4) — has time_unix_nano, duration, stack_trace, value (CPU nanoseconds, allocation bytes, etc.), attributes.
- Event (OTel 1.5, replacing standalone log records for in-span events) — semantically a Log record with a trace_id and span_id always set.
The mental shortcut: Resource is constant for the process; Scope is constant for the library; Signal is per-event. If you find yourself setting service.name after a request arrives, you have made the same mistake Karan made.
A subtlety the spec carries forward from earlier OpenCensus and OpenTracing efforts is that the three layers are not just nesting — they correspond to three different lifecycles in your application. The Resource is constructed at SDK init and frozen for the process lifetime; mutating a Resource attribute after a span has been emitted is undefined behaviour and most SDKs simply ignore the attempt. The Scope is constructed once per library (per call to get_tracer(name, version)) and is also immutable thereafter. Only the signal-level attributes can be set, updated, and conditionally added during the lifetime of an event — the span attribute set during request handling, the metric attribute added to a histogram observation, the log attribute attached to an emitted record. Understanding the three lifecycles is what tells you when in your code to set what — Resource at SDK bootstrap, Scope at library import, Signal inside the request handler.
Reading a real OTLP message — where the levels live in protobuf
The OTLP wire format (defined in opentelemetry-proto) is the most concrete way to see the data model — every level is a distinct protobuf message with explicit field numbers, and you can capture an export from a real Python service and dissect it in fifteen lines. The script below configures the OTel SDK the way a small Flask service would, emits a parent and a child span from two different scopes, captures the OTLP TracesData export with a local fake collector, and prints the resource attributes, the scopes, and the span tree separately so you can see which fields end up where.
# otlp_dissect.py — show how Resource, Scope, Span nest in a real OTLP export.
# pip install grpcio opentelemetry-api opentelemetry-sdk \
#             opentelemetry-exporter-otlp-proto-grpc opentelemetry-proto
import time
from concurrent import futures
import grpc
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.proto.collector.trace.v1 import trace_service_pb2
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import (
    TraceServiceServicer, add_TraceServiceServicer_to_server)

CAPTURED = []

class FakeCollector(TraceServiceServicer):
    # Capture every export request the SDK sends instead of forwarding it.
    def Export(self, request, context):
        CAPTURED.append(request)
        return trace_service_pb2.ExportTraceServiceResponse()

# 1. Start a fake OTLP gRPC collector on :14317
srv = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
add_TraceServiceServicer_to_server(FakeCollector(), srv)
srv.add_insecure_port("127.0.0.1:14317")
srv.start()

# 2. Configure the SDK with a real Resource block
resource = Resource.create({
    "service.name": "hotstar-recommendations-api",
    "service.version": "2.4.1",
    "deployment.environment": "production",
    "k8s.pod.name": "rec-api-7c9d-x2k4m",
    "k8s.namespace.name": "recommendations",
    "cloud.region": "ap-south-1",
    "host.id": "i-0a7b3c1d4e5f6a7b8",
})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://127.0.0.1:14317", insecure=True),
    schedule_delay_millis=500))
trace.set_tracer_provider(provider)

# 3. Emit one parent + one child span via two different scopes
auto_tracer = trace.get_tracer("opentelemetry.instrumentation.flask", "0.45b0")
manual_tracer = trace.get_tracer("checkout.pricing", "1.2.0")
with auto_tracer.start_as_current_span("GET /score", kind=trace.SpanKind.SERVER) as parent:
    parent.set_attribute("http.request.method", "GET")
    parent.set_attribute("http.route", "/score")
    parent.set_attribute("user.cohort", "ipl-rerun")
    with manual_tracer.start_as_current_span("score_user_v3") as child:
        child.set_attribute("model.name", "rec-v3")
        child.set_attribute("scored_items", 42)

# 4. Flush and dissect the captured OTLP message
provider.force_flush()
time.sleep(1.0)
req = CAPTURED[0]
for rs in req.resource_spans:
    print("=== RESOURCE ===")
    for kv in rs.resource.attributes:
        print(f"  {kv.key} = {kv.value.string_value or kv.value.int_value}")
    for ss in rs.scope_spans:
        print(f"=== SCOPE: {ss.scope.name} v{ss.scope.version} ===")
        for sp in ss.spans:
            tid = sp.trace_id.hex()
            sid = sp.span_id.hex()
            print(f"  span name={sp.name!r} kind={sp.kind} "
                  f"trace_id={tid} span_id={sid}")
            for kv in sp.attributes:
                v = kv.value.string_value or kv.value.int_value
                print(f"    attr {kv.key} = {v}")
print(f"\nbytes on the wire: {req.ByteSize()} B")
Sample run:
=== RESOURCE ===
service.name = hotstar-recommendations-api
service.version = 2.4.1
deployment.environment = production
k8s.pod.name = rec-api-7c9d-x2k4m
k8s.namespace.name = recommendations
cloud.region = ap-south-1
host.id = i-0a7b3c1d4e5f6a7b8
telemetry.sdk.language = python
telemetry.sdk.name = opentelemetry
telemetry.sdk.version = 1.27.0
=== SCOPE: opentelemetry.instrumentation.flask v0.45b0 ===
span name='GET /score' kind=2 trace_id=8a3f...c91 span_id=4d2e...f8a
attr http.request.method = GET
attr http.route = /score
attr user.cohort = ipl-rerun
=== SCOPE: checkout.pricing v1.2.0 ===
span name='score_user_v3' kind=1 trace_id=8a3f...c91 span_id=7b1c...e2d
attr model.name = rec-v3
attr scored_items = 42
bytes on the wire: 487 B
Three lines deserve attention. Resource.create({...}) sets the seven attributes that every span in this process inherits — the SDK appends three more (telemetry.sdk.language, telemetry.sdk.name, telemetry.sdk.version) automatically because the OTel spec requires them. Why this matters: the spec mandates these three even on a minimal SDK init, which is the OTel team's way of guaranteeing that any backend can identify which SDK and which language emitted a given message — useful when an SDK bug produces malformed protobuf and you need to know whether to file the issue against the Python repo or the Go repo. auto_tracer = trace.get_tracer("opentelemetry.instrumentation.flask", "0.45b0") sets the InstrumentationScope; the dissection output shows two distinct === SCOPE === headers because the parent and child were emitted by different scopes, and the OTLP message groups spans by scope inside each ResourceSpans. parent.set_attribute("http.request.method", "GET") is a signal-level attribute — it lives on the span, not on the resource. If the team had instead written Resource.create({"service.name": ..., "http.request.method": "GET"}), the resource block would be wrong because http.request.method is per-event, not per-process — and worse, the SDK would silently accept it and attach http.request.method=GET to every span emitted from that process, breaking every cardinality budget downstream.
The ordering in the output is also load-bearing. A single OTLP message can contain multiple ResourceSpans, each with its own resource block — that lets a Collector batch spans from many services into one export. Inside each ResourceSpans is a list of ScopeSpans, one per scope. Inside each ScopeSpans is the list of spans. The protobuf field numbers (ResourceSpans.resource = 1, ResourceSpans.scope_spans = 2, ScopeSpans.scope = 1, ScopeSpans.spans = 2) are stable across spec versions and are what gives OTLP its decoder-tolerance — adding a field at level N does not break decoders at level N+1.
How metrics, logs, and profiles share the same envelope
The same Resource + Scope envelope wraps the other signal types, with field renames at the leaf to match the signal. A metrics export uses ResourceMetrics → ScopeMetrics → Metric → DataPoint. A logs export uses ResourceLogs → ScopeLogs → LogRecord. A profiles export (stable since OTel-spec 1.4, carried in opentelemetry-proto v1.3, mid-2024) uses ResourceProfiles → ScopeProfiles → ProfileContainer → Sample. The unification means a single Collector pipeline can route all of these signals through the same receivers and the same processors — service.name redaction at the Collector layer is one rule that applies to traces, metrics, logs, and profiles equally, because they all serialise it in the same Resource.attributes field.
The metric-data-point shape is where the data model gets denser, because there are five kinds: Sum (cumulative or delta — the SDK chooses based on the instrument type), Gauge (instantaneous), Histogram (explicit-bucket — you specify boundaries like [5, 10, 25, 50, 100, 250, 500, 1000]), ExponentialHistogram (sparse log-scale buckets, OTel-spec 1.0 stable, the format Prometheus is migrating toward), and Summary (legacy, supported for ingestion of old Prometheus client data). Each has an aggregation temporality — Cumulative (the value is a running total since process start, the Prometheus convention) or Delta (the value is the change since the last export, the Datadog convention). Mixing temporalities on the same metric is one of the nastier failure modes — a Prometheus exporter expecting cumulative will produce nonsense if it receives delta metrics, and the symptom is that all your rate() queries go negative whenever the SDK exports a value smaller than the previous one. Why this matters operationally: the OTel SDK lets you configure temporality per instrument via PreferredTemporality, and the default differs by exporter — the OTLP exporter defaults to Cumulative (Prometheus-friendly), the Datadog exporter defaults to Delta (Datadog-native). If your fleet emits to both and the SDK is configured once, one of the two backends is getting wrong data; the fix is to run two exporters with different temporality settings, or to convert at the Collector via the cumulativetodelta processor.
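A minimal sketch of the two-exporter fix, assuming the Python SDK's OTLP gRPC metric exporter and hypothetical endpoints; the preferred_temporality dict is keyed by instrument class:
from opentelemetry.sdk.metrics import Counter, Histogram, MeterProvider
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality, PeriodicExportingMetricReader)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# One process, two exporters: the Prometheus-style backend gets Cumulative,
# the Delta-native backend gets Delta. Endpoints here are hypothetical.
cumulative = OTLPMetricExporter(
    endpoint="http://mimir-gateway:4317",
    preferred_temporality={Counter: AggregationTemporality.CUMULATIVE,
                           Histogram: AggregationTemporality.CUMULATIVE})
delta = OTLPMetricExporter(
    endpoint="http://delta-backend:4317",
    preferred_temporality={Counter: AggregationTemporality.DELTA,
                           Histogram: AggregationTemporality.DELTA})
provider = MeterProvider(metric_readers=[
    PeriodicExportingMetricReader(cumulative, export_interval_millis=60_000),
    PeriodicExportingMetricReader(delta, export_interval_millis=60_000),
])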
A concrete failure to make this real: a Flipkart-scale e-commerce platform once shipped a Python service whose SDK was configured to emit Delta temporality (because the on-call engineer copy-pasted from a Datadog example) while the rest of the fleet was Cumulative. The service exported through the standard Collector pipeline and was forwarded to Mimir. Mimir treated the Delta values as Cumulative and computed rate(checkout_orders_total[5m]) by subtracting consecutive samples. Because Delta values reset to zero each export interval, the rate query went negative on every export boundary — every 60 seconds the dashboard plotted rate = -value/60s instead of the real rate. The on-call dismissed it as "Mimir being weird" for two weeks before someone noticed. The fix was a one-line SDK config change (PreferredTemporality(Cumulative)), and the symptom in the data model is observable end-to-end — the captured OTLP message had aggregation_temporality: 1 (Delta) where every other service had aggregation_temporality: 2 (Cumulative). Reading the OTLP bytes was the diagnostic; staring at the dashboard was not.
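Something like the following is enough to spot the odd service out, assuming you have a captured ExportMetricsServiceRequest in hand (the same fake-collector trick as the trace script works for metrics); field names follow opentelemetry-proto, where 1 means Delta and 2 means Cumulative:
def print_temporalities(req):  # req: ExportMetricsServiceRequest
    for rm in req.resource_metrics:
        svc = next((kv.value.string_value for kv in rm.resource.attributes
                    if kv.key == "service.name"), "?")
        for sm in rm.scope_metrics:
            for m in sm.metrics:
                # Only Sum and Histogram carry a temporality; Gauge does not.
                data = m.sum if m.HasField("sum") else (
                       m.histogram if m.HasField("histogram") else None)
                if data is not None:
                    print(svc, m.name, "temporality =", data.aggregation_temporality)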
The metric data point's exemplars field is the bridge to traces. An exemplar is a (value, time, span_id, trace_id, attributes) tuple attached to a histogram bucket — when the SDK records a 1.2-second observation in the [1000, 2500] bucket, it can also attach the trace_id of the request that produced that observation. The downstream effect is that a Grafana panel showing p99 latency can let the on-call click on a spike and jump to the trace that caused it. Without exemplars the histogram is anonymous; with exemplars the histogram is a debugger entry point. Exemplars are stored in the metric data point because that is where the value lives — they share the same Resource and Scope, but they reference the trace by ID, not by inclusion. The trace itself is in a separate ResourceSpans message (possibly even in a different Collector pipeline), and the join happens at query time in the storage tier (Tempo's traceID index, Mimir's exemplar storage).
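A short sketch of the exemplar fields on the wire, assuming m is a Histogram Metric message obtained from a walk like the temporality check above; the trace reference is by ID only:
for dp in m.histogram.data_points:
    for ex in dp.exemplars:
        # Each exemplar carries the observed value plus the trace it came from.
        print(f"observation {ex.as_double or ex.as_int} "
              f"-> trace_id={ex.trace_id.hex()} span_id={ex.span_id.hex()}")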
For logs, the critical fields are body (the message — can be a string or any structured value), attributes (key-value structured fields), and trace_id + span_id (set automatically when the log is emitted within an active span context). The trace_id and span_id on a LogRecord are what makes log-to-trace correlation work — see /wiki/log-to-trace-correlation-trace-ids-in-logs for how this propagates through your Loki labels and TraceQL queries.
A subtlety worth naming: the LogRecord has two timestamps — time_unix_nano (when the event was generated by the application) and observed_time_unix_nano (when the SDK saw it). For applications that emit logs synchronously, the two are equal. For applications that buffer logs through an async handler (the Python QueueHandler pattern, Loguru's enqueue mode), observed_time_unix_nano is later than time_unix_nano by the queue depth in time. The OTel spec mandates both because the gap between them is a diagnostic — a healthy SDK has a sub-millisecond gap, an SDK whose async queue is backing up has a multi-second gap, and an SDK that has lost its queue thread entirely has observed_time_unix_nano set at SDK shutdown for events from hours earlier. The Zerodha trading platform's observability team monitors this gap as an SLI on their own SDK fleet — when observed_time - time exceeds 5ms on more than 1% of records, the on-call investigates the application's logging configuration before the missed-log incident becomes a regulatory data-loss issue. The data model carries the diagnostic for free; only the team that knows it is there can use it.
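A hedged sketch of that gap check, given a captured ExportLogsServiceRequest; the 5 ms threshold mirrors the hypothetical SLI above:
def slow_log_records(req, threshold_ns=5_000_000):  # req: ExportLogsServiceRequest
    flagged = []
    for rl in req.resource_logs:
        for sl in rl.scope_logs:
            for rec in sl.log_records:
                # Gap between when the app generated the event and when the SDK saw it.
                gap = rec.observed_time_unix_nano - rec.time_unix_nano
                if rec.time_unix_nano and gap > threshold_ns:
                    flagged.append((rec.severity_text, gap / 1e6))  # gap in ms
    return flagged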
Where attributes belong — the rule that prevents most data-model bugs
The most commonly violated rule of the OTel data model is putting an attribute on a layer where it does not belong. Three rules of thumb cover 90% of cases.
First, process-identity attributes go on the Resource. service.name, service.version, service.namespace, deployment.environment, host.name, k8s.pod.name, cloud.region, cloud.availability_zone, host.id, os.type, process.pid, process.runtime.* — these identify who emitted the telemetry and never change for the lifetime of the process. The OTel Resource.create() call in the SDK init is the single point where they should be set; the Collector's resourcedetection processor can fill in cloud-provider attributes that the SDK could not (because it ran before the cloud metadata was reachable). If you ever find yourself calling span.set_attribute("service.name", ...) after a request has arrived, stop — that data should have been on the Resource, and the symptom (service-map queries miss the service) is the same one Karan hit.
Second, per-request and per-event attributes go on the signal. http.request.method, http.route, http.response.status_code, db.statement, messaging.destination.name, user.cohort, tenant.id, feature_flag.checkout.v3 — these change request to request and belong on the span, the metric data point, or the log record. They have no place on the Resource because they would conflict with the Resource's "constant for the process" guarantee.
Third, library-identity attributes go on the Scope. The InstrumentationScope's name and version identify which library emitted the signal. Most SDKs handle this automatically — tracer = trace.get_tracer("checkout.pricing", "1.2.0") sets the scope and you never touch it again. The scope is rarely queried directly, but when an alert fires on "p99 of checkout.pricing v1.1.x has regressed" you want the scope version recorded so the diff with v1.2.x is mechanical.
A useful test: think about how the queries you will run group and filter. Anything you would always group by (the service, the region, the deployment) is a Resource attribute. Anything you would filter by per-request (the HTTP method, the user cohort, the feature flag) is a signal attribute. Anything you would group by only when debugging instrumentation (the library version) is a Scope attribute. The grouping question is the projection of the data model onto the query language, and getting it right at instrumentation time saves dashboard rewrites later.
There is a corollary worth stating — resource attributes have hard cardinality limits enforced by the storage tier. Mimir caps resource-derived label cardinality at ~1M series per metric by default; Tempo caps the indexable resource attribute set at 200 entries per service; Loki rejects log streams whose resource-label set exceeds 16 unique values per label. Push past these and the storage tier silently drops data, returns 429, or in Loki's case auto-cardinality-blocks the stream. Signal-level attributes have no such hard caps in storage but do count against query-time cardinality budgets — a span attribute with 14 million distinct values still gets stored, but TraceQL queries that group by it will time out. Knowing which limit applies — storage-tier cardinality (resource) versus query-time cardinality (signal) — is what lets you put tenant.id in the right place: as a resource attribute if you have <1000 tenants and want fleet-wide grouping, as a signal attribute if you have millions of users and only filter at query time. See /wiki/cardinality-the-master-variable for the long form of this argument.
Edge cases the data model deliberately leaves room for
The data model has four places where the spec is loose enough to let teams diverge, and the divergence is the source of most cross-team observability friction.
The first is the Resource collision case. The spec does not forbid setting the same attribute on both the Resource and the signal — service.name on the Resource and service.name on a span attribute is technically valid OTLP. The semantics are that the signal-level attribute shadows the resource-level one for that particular event, but most query languages (TraceQL, PromQL, LogQL) treat the two as different fields (resource.service.name vs service.name), and the dashboards built on one will silently miss the other. The pragmatic rule: never set the same attribute on two layers; if you need per-request override, use a different attribute name (override.service.name) and document why.
The second is structured log bodies. The OTel LogRecord's body field can be a string, a number, a bool, an array, or a key-value map. The spec encourages structured bodies (a key-value map carries more information than a serialised JSON string), but most log backends — Loki especially — store the body as an opaque string and lose the structure on ingestion. This is improving (Loki 3.0's pattern parser, ClickHouse's JSONExtractString), but the practical pattern in mid-2026 is to put structured fields in attributes and reserve body for the human-readable message. Otherwise you ship structured logs that the storage tier flattens into strings, and the structure was useless work.
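A sketch of that pattern through the Python SDK's stdlib-logging bridge; the logs package is still underscored (opentelemetry.sdk._logs) and module paths can shift between SDK versions, so treat this as illustrative rather than canonical:
import logging
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor, ConsoleLogExporter

provider = LoggerProvider()
provider.add_log_record_processor(BatchLogRecordProcessor(ConsoleLogExporter()))
logging.getLogger().addHandler(LoggingHandler(logger_provider=provider))

log = logging.getLogger("checkout.pricing")
# body stays a human-readable string; the structured fields travel as attributes
log.warning("price lookup slow", extra={"tenant.id": "t-4821", "latency_ms": 212})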
The third is temporality at boundaries. When a metric crosses a Collector boundary, the temporality (Cumulative vs Delta) is a property of the data point, not of the metric definition. A processor like cumulativetodelta can convert one to the other, but if the Collector forwards a metric with Cumulative temporality to a backend that expects Delta (or vice versa), the backend's rate computations go wrong silently. This is the single biggest source of "metrics look fine in development, look catastrophic in production" reports — the dev environment is single-Collector, the production environment has a Collector chain, and somewhere in the chain the temporality changed.
The fourth is attribute typing on the wire versus in dashboards. OTLP's AnyValue field is typed — string, bool, int, double, bytes, array, kvlist — but most query languages (TraceQL, LogQL, PromQL) coerce everything to string at query time. A span attribute set as set_attribute("retry_count", 3) (an integer) is on the wire as int_value: 3 and renders correctly in Grafana — but a span attribute set as set_attribute("retry_count", "3") (a string) renders identically in Grafana, and the difference only surfaces when someone writes attributes.retry_count > 5, which TraceQL's range comparison silently filters to no results when the attribute is a string. This is a subtle bug class because the dashboard looks right; only specific query shapes break. Linters that check semantic-conventions compliance (the weaver tool) flag type mismatches; CI integration of weaver is a small investment that prevents a category of dashboard regressions that nobody finds until their on-call needs the query at 02:00.
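A minimal reproduction of the trap (the tracer and span names are hypothetical); both spans render identically in a Grafana table, but only the first survives a numeric comparison like attributes.retry_count > 5:
from opentelemetry import trace

tracer = trace.get_tracer("checkout.pricing")
with tracer.start_as_current_span("charge_ok") as span:
    span.set_attribute("retry_count", 3)       # wire: int_value: 3
with tracer.start_as_current_span("charge_bad") as span:
    span.set_attribute("retry_count", "3")     # wire: string_value: "3", compares as text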
Common confusions
- "OTel logs and span events are the same thing." They are similar but not interchangeable. A LogRecord can be emitted from any code path; a span event must be inside an active span context. Span events are stored as part of the span (in
Span.events) and travel with it; logs are a separate signal type with their own export path. As of OTel 1.5, the spec is converging — events are now Log records with span context — but the storage and query paths often remain separate. - "Putting
service.nameon the span as an attribute does the same thing as putting it on the resource." It does not. Resource attributes are indexed differently in the storage tier (typically as fixed columns or label sets with their own cardinality budget), and queries written asresource.service.name = Xwill miss spans whereservice.namewas set as a span attribute. Karan's bug at the top of the article is exactly this confusion. - "The InstrumentationScope is just a label — I can leave it empty." The spec requires every span and every metric data point to have a non-empty Scope. Some SDKs default the scope name to
__main__or the module name when you callget_tracer(None); some Collectors reject exports with empty scope. Naming the tracer (get_tracer("checkout.pricing", "1.2.0")) is one line at SDK init that pays back at debug time when an instrumentation regression is bisected by scope version. - "Cumulative and Delta temporality are interchangeable as long as the values are right." They are not. Cumulative metrics carry a start time (
start_time_unix_nano) that the consumer uses to compute rates over windows; Delta metrics do not, and forwarding Delta as Cumulative produces nonsenserate()queries. The conversion is one-directional in practice (Cumulative → Delta is mechanical via subtraction; Delta → Cumulative requires a stateful processor that remembers the running sum across exports). - "Adding more resource attributes is free — just add
pod.name,node.name,region, everything you might want to query." Each resource attribute multiplies the cardinality of every metric stream emitted by that process. A resource block with 12 attributes that vary across the fleet (pod name, node name, container_id, request_id-as-resource — yes, this happens) produces 12-dimensional fan-out at the metric backend. The Collector'sattributesprocessor can drop high-churn fields before they hit the backend; the SDK does not let you, because the SDK does not know which fields downstream consumers care about. The split — SDK enriches eagerly, Collector trims for downstream — is the correct shape, but it requires the Collector pipeline to actually exist. - "OTel logs are unstructured; metrics are structured; traces are something in between." All three are equally structured under the data model — they share Resource, share Scope, and have typed attribute key-value pairs at the signal level. The "logs are unstructured" intuition comes from legacy log shippers (syslog, Fluentd default config) that flatten everything to a string. OTel logs are first-class structured records; the storage backend is where most of the structure is sometimes lost, not the data model.
Going deeper
The protobuf field numbers and why they are stable
The OTLP protobuf schema (opentelemetry-proto repo) versions its messages with strict field-number stability — once a field number is allocated in a stable version of the spec, it cannot be reused or renumbered. ResourceSpans.resource is field 1; ResourceSpans.scope_spans is field 2; ResourceSpans.schema_url is field 3. New fields get new numbers (ResourceSpans.schema_url was added in 0.20.0 as field 3; older decoders see it as an unknown field and skip it). This is why a 2026 Collector running 1.27.0 SDK code can decode OTLP from a 2022 SDK running 1.10.0 — the field numbers all the older SDK knows about are stable, and any newer fields the older SDK never set are simply absent. The protobuf field-number stability is the operational property that makes mixed-version fleets workable; without it, an SDK upgrade in one service would break the Collector pipeline for every other service.
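You can read those allocated numbers straight off the generated descriptors shipped in the opentelemetry-proto Python package; a small sketch:
from opentelemetry.proto.trace.v1 import trace_pb2

# Print the stable field numbers for the three nesting levels of a traces export.
for msg in (trace_pb2.ResourceSpans, trace_pb2.ScopeSpans, trace_pb2.Span):
    print(msg.DESCRIPTOR.name)
    for field in msg.DESCRIPTOR.fields:
        print(f"  {field.number:>2}  {field.name}")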
Schema URLs and conventions versioning
Each ResourceSpans, ScopeSpans, ResourceMetrics, ResourceLogs message has an optional schema_url field that points to a versioned semantic-conventions schema (e.g. https://opentelemetry.io/schemas/1.27.0). The schema URL tells downstream consumers which version of http.request.method (vs the older http.method) to expect. The Collector's schema processor can transform older schema URLs to newer ones — renaming http.method → http.request.method and back — making it the single place to manage convention upgrades for a mixed-version fleet. Most teams under-use this; the practical pattern at Razorpay-scale is to set the Collector's schema processor to normalise everything to the newest stable schema before exporting downstream, so dashboards see one set of attribute names regardless of which SDK version produced them.
The 128-bit trace_id and 64-bit span_id — and why they are not UUIDs
OTel's trace_id is 128 bits (16 random bytes), span_id is 64 bits (8 random bytes). The choice is deliberate and not arbitrary — 128 bits is enough for ~340 undecillion unique trace IDs (collision probability ~zero at any plausible scale), and the W3C traceparent HTTP header carries them as hex strings (32 hex chars for trace_id, 16 for span_id). They are not RFC 4122 UUIDs — UUID v4 is 122 bits of randomness with 6 fixed bits, and OTel uses all 128 bits as random. This matters when you index by trace_id in your storage tier — Tempo's index is keyed on the raw 16-byte trace_id, not on a UUID-formatted string, and parsers that try to validate the trace_id as a UUID will reject perfectly valid OTel trace_ids. The spec does require that random bits dominate (the RandomIdGenerator is the only conformant one in the SDK), but the format is its own and converting to UUID for cross-tool compatibility is a translation, not a serialisation.
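A small sketch of the shape, using the SDK's RandomIdGenerator and the W3C traceparent layout (version 00, sampled flag 01):
from opentelemetry.sdk.trace.id_generator import RandomIdGenerator
from opentelemetry.trace import format_trace_id, format_span_id

gen = RandomIdGenerator()
trace_id = format_trace_id(gen.generate_trace_id())  # 32 hex chars, all 128 bits random
span_id = format_span_id(gen.generate_span_id())     # 16 hex chars
print(f"traceparent: 00-{trace_id}-{span_id}-01")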
Profile signals — the fifth pillar inside the same envelope
The OTel profile signal stabilised in spec 1.4 (mid-2024) and reached OTLP 1.3 (opentelemetry-proto v1.3) with the same Resource + Scope envelope as the other signals. A ProfileContainer carries one or more Sample entries, each with a stack trace, a duration, a value (CPU nanoseconds for on-CPU profiles, allocation bytes for memory profiles), and attributes. The unification is what makes pyroscope and parca emit OTLP-native profiles — they no longer have to invent a parallel protocol. The flamegraph rendering still happens at the storage tier (pyroscope's UI, parca's UI), but the wire format is now part of OTel proper. See /wiki/pyroscope-and-parca-architectures for how the storage backends consume these.
Why the data model unification is the load-bearing decision
Before OpenTelemetry, the three pillars had three separate data models — Prometheus exposition format, span schemas (Zipkin v1, Zipkin v2, Jaeger Thrift), JSON-Lines log formats, each with their own "service identity" representation. Correlating a metric to a trace meant gluing two schemas, and the glue was per-pair: Prometheus + Jaeger here, Datadog + Zipkin there, OpenTracing trying to unify the trace half. The OTel project's choice to unify the resource block across all signals is what makes correlation cheap — the trace_id-on-log-record, the exemplar-on-histogram, the span_id-on-profile-sample all work because they share the resource layer that already says "this came from recommendations-api in ap-south-1". A reader who internalises this stops asking "how do I correlate logs and traces in OTel" and starts asking "which signal-level field carries the cross-reference" — which is the right question. See /wiki/wall-opentelemetry-is-the-standard-understand-it-deeply for the part-level framing.
A concrete instance of the unification paying out at scale: a platform team at a hypothetical Razorpay-scale fintech ran a cardinality audit on their Mimir backend in late 2025 and found that 38% of their metric series cardinality was driven by resource attributes that should have been Collector-trimmed: process.pid (rotates every deploy, ~2400 values per service per day across the fleet), container.id (Kubernetes container IDs change with every pod restart), host.name (when the SDK falls back to hostname-as-identity rather than a stable host.id). The fix was a four-line attributes processor in the Collector that dropped these three before export to Mimir; the metric cardinality dropped by 36%, the Mimir ingester memory dropped from 18 GB to 11 GB, and the monthly storage cost dropped by ₹4.2 lakh. None of this required changing instrumentation in the 240 services — the data model deliberately leaves the SDK enriching eagerly and the Collector trimming for downstream, so the fix lives in one Collector config and is reviewed by the platform team alone. Why the audit pattern works: the SDK has no way to know which resource attributes downstream consumers will index — the same fleet exports to Mimir (which indexes everything as labels), Tempo (which indexes only service.name and name), and Loki (which indexes a small label set). Each backend has different cardinality economics. Centralising the trim at the Collector lets you tune per-backend without touching the SDK, and that is the design intent of the resource-vs-collector split. The audit script is itself ~40 lines of Python — query Mimir's /api/v1/labels for label names per series, group by metric, sort by cardinality, look at the top 20 — and it should run in CI for every team that owns instrumentation.
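A simplified, hedged version of that audit, using the Prometheus-compatible API routes; the endpoint and tenant header here are hypothetical, and this variant counts series per metric rather than reproducing the exact 40-line original:
import requests
from collections import Counter

BASE = "http://mimir-gateway:8080/prometheus"    # hypothetical Mimir endpoint
HEADERS = {"X-Scope-OrgID": "platform"}           # hypothetical tenant header

# 1. List every metric name, 2. count series per metric, 3. print the top 20.
metrics = requests.get(f"{BASE}/api/v1/label/__name__/values",
                       headers=HEADERS).json()["data"]
cardinality = Counter()
for name in metrics:
    series = requests.get(f"{BASE}/api/v1/series",
                          params={"match[]": name}, headers=HEADERS).json()["data"]
    cardinality[name] = len(series)

for name, count in cardinality.most_common(20):
    print(f"{count:>8}  {name}")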
Reproduce this on your laptop
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install grpcio opentelemetry-api opentelemetry-sdk \
            opentelemetry-exporter-otlp-proto-grpc opentelemetry-proto
python3 otlp_dissect.py
# Inspect: the RESOURCE block, the two SCOPE blocks, the parent and child spans.
# Then deliberately set a resource attribute as a span attribute and observe
# the duplication; that is the bug Karan hit at the top of this chapter.
Where this leads next
The data model is the substrate every other Part-14 chapter rests on. The next chapter /wiki/sdks-vs-api explains the runtime that produces these messages — why OTel separates the API (what your application calls) from the SDK (what produces and exports OTLP) and why the API is a no-op until an SDK is installed. After that comes the Collector internals (/wiki/the-collector-receivers-processors-exporters), where the data model becomes a pipeline of receivers, processors, and exporters operating on these same Resource + Scope + Signal envelopes. Then auto-instrumentation, semantic conventions, the OTLP wire format in detail, and custom instrumentation.
The data model is also the reason the cross-pillar correlation chapters in Part 12 work as they do. trace_id on a log record, exemplars on a histogram bucket, span events linking to logs — all three are signal-level fields that share the resource. Re-reading /wiki/exemplars-metrics-traces and /wiki/log-to-trace-correlation-trace-ids-in-logs after this chapter changes how those chapters read — what looked like a feature now looks like the obvious consequence of the data-model unification.
For the broader reframing of Part 14 as the operational layer that depends on the spec, return to /wiki/wall-opentelemetry-is-the-standard-understand-it-deeply. The wall set up the four surfaces; this chapter is the spec layer in detail. The next chapters are the SDK and the Collector layers.
The closing thought is the one Karan eventually arrived at after his ninety-minute debug. The OTel data model is not a list of fields to memorise — it is a small, deliberate hierarchy that mirrors how telemetry is produced (one process, many libraries, many events) and is queried (always group by service, sometimes by library, never by individual event).
When you instrument with that hierarchy in mind — Resource for who, Scope for what library, Signal for the event — your dashboards work, your correlation works, and the storage tier's cardinality stays inside its budget.
When you instrument by adding set_attribute(...) for whatever felt convenient at the moment, you ship the bug Karan shipped and the IPL knockout is the worst possible night to discover it. The data model is the contract; the SDK call is just one way of writing to it.
References
- OpenTelemetry data model specification — the binding document. Read the trace, metrics, and logs sub-specs end-to-end at least once; the data model section in each is the one this chapter draws on.
- opentelemetry-proto on GitHub — the protobuf schema. The trace.proto, metrics.proto, logs.proto, and profiles.proto files are the ground truth for field numbers.
- Charity Majors, Liz Fong-Jones, George Miranda, Observability Engineering (O'Reilly, 2022) — Chapter 4 on event-shaped data is the clearest published treatment of why the unified envelope matters.
- OpenTelemetry semantic conventions — the controlled vocabulary the data model assumes; service.*, http.*, db.*, and k8s.* are the most-used namespaces.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Chapter 4 — the foundational text that motivated this unification across vendors.
- W3C Trace Context (Recommendation) — the cross-process binding for the trace_id and span_id fields; the data model defines the fields, this spec defines how they propagate over HTTP and gRPC.
- OTLP specification — the wire format that serialises Resource + Scope + Signal as protobuf messages; the next-chapter detail of how the data model travels between processes.
- /wiki/wall-opentelemetry-is-the-standard-understand-it-deeply — internal: the four surfaces of OpenTelemetry, the Part 13 wall this chapter operationalises.
- /wiki/cardinality-the-master-variable — internal: why the Resource-vs-Signal placement decision is also a cardinality decision, and how the storage tier enforces the limits the data model leaves open.