JSON logs and schema drift

Karan is debugging an alert at 02:14 IST — payments-api timeout rate above 0.5% should have paged at 01:42 but did not, and the on-call dashboard is reading zero failures for a service the customer-care team is screaming about. He pulls the LogQL query the alert evaluates, sum by (acquirer) (rate({service="payments-api"} | json | reason="GATEWAY_TIMEOUT" [5m])), runs it against the last six hours, and gets back zero rows. He runs it without the reason filter and gets thousands of rows. He stares at the records, scrolls until his eyes hurt, and finally spots it: every event has a field called failure_reason, not reason. Three weeks ago, a junior engineer renamed the field as part of a tidy-up PR. The dashboards quietly read zero. The alerts quietly stopped firing. Nobody noticed because the absence-of-rows looks identical to the absence-of-failures, and the on-call rotation interpreted the silent dashboard as the system being healthy.

This is schema drift, and it is the dominant failure mode of structured-logging systems in production. The chapter that introduced structured vs unstructured logging argued that JSON-per-line is the wire format that unlocks queryability. That argument is right, but it is only half the contract. The other half is that the field names, types, and shapes have to stay stable across emitters and across time, or every dashboard, alert, and incident query that depended on the old shape silently rots. Drift is harder to fix than to prevent, and the systems that handle it well treat the log schema as a versioned, owned, enforced contract — the same way they treat their public API.

A JSON log line is queryable only as long as its schema is stable; the moment a field name, type, or nesting level changes, every query that depended on the old shape returns silently wrong results. Schema drift is invisible because absence-of-rows looks like absence-of-events, and the only durable defence is to treat the log schema as a versioned contract — owned by a team, validated at emit time, version-tagged on every record, and migrated by the agent rather than the application. Most production drift is a rename, a type change, or an enum-value addition; all three are detectable at PR time if you instrument the producer, and undetectable in production if you do not.

What schema drift actually looks like in production

The word "drift" suggests slow gradual change, but in practice it is usually one PR. A developer needs to add information to an existing event, or rename a field that was always wrong, or fix the type of a number that was being emitted as a string. The change passes code review because the diff looks small and harmless, deploys at 11am on a Tuesday, and breaks every downstream consumer of that field by 11:01am. The break is silent — the application logs are still flowing, the agent is still parsing, the backend is still indexing — but the queries that consumed the old shape now return zero rows where they used to return numbers, and the dashboards and alerts built on those queries are now lying.

There are five common drift patterns, and almost every production incident attributable to drift is one of these five.

Rename: reason becomes failure_reason, or user_id becomes customer_id, or amt becomes amount. The new field is correct; the old field stops being emitted; every query that filtered or grouped by the old name silently reads zero.

Type change: amount was a JSON string "4280" in 2024 because the original developer concatenated it from a string-typed config field, and someone fixes it in 2026 to be a JSON number 4280. The fix is correct; the dashboard that did amount > 5000 was secretly comparing strings lexicographically and now starts comparing numbers, so the panel jumps overnight in a way that looks like a real incident.

Enum drift: reason used to be one of {OK, GATEWAY_TIMEOUT, INSUFFICIENT_FUNDS, RISK_BLOCK} and someone adds STEP_UP_AUTH_REQUIRED because a new payment flow needs it. The addition is correct; the alert rule that fires on reason!="OK" and groups by reason now has a new bucket that never existed before, and the dashboard's hand-crafted colour palette runs out of colours and assigns the new bucket grey, which happens to be the colour the team uses for "no data".

Nesting change: merchant: "M01023" becomes merchant: {id: "M01023", region: "south"} because the team needs region for a regulatory filter. The change is correct; every JSON path query against the old merchant string field now returns the nested object serialised as a string, and the equality filter merchant="M01023" returns nothing.

Field removal: pan is removed from logs because of a privacy review (the right call), but a fraud-investigation dashboard that joined logs to a known-pan list still has the join configured and now displays nothing.

Five drift patterns — same producer, different break:

rename: reason: "TIMEOUT" → failure_reason: "TIMEOUT". Downstream symptom: query reads zero rows; alerts stop firing silently.
type change: amount: "4280" → amount: 4280. Downstream symptom: range filter flips ("600" > "5000" as strings; 600 < 5000 as numbers).
enum addition: reason ∈ {OK, TIMEOUT, …} → reason ∈ {OK, TIMEOUT, …, STEP_UP}. Downstream symptom: a new bucket appears; the dashboard palette runs out of colours.
nesting change: merchant: "M01023" → merchant: {id, region}. Downstream symptom: equality filter breaks; merchant="M01023" returns nothing.
removal: pan: "ABCDE1234F" → (field omitted). Downstream symptom: joins return empty; the fraud dashboard goes blank.

Why all five are silent: absence of matching rows looks identical to absence of events, and backends have no concept of "this query used to return data". The producer changed; the consumer's expectations did not; nothing in the pipeline alerts on the divergence.
Illustrative — the five common shapes of JSON log drift. Each one is a single change at the producer that quietly breaks downstream consumers. The producer's view ("I just renamed a field") and the consumer's view ("the alert stopped firing") are connected only by hindsight.

The reason all five drift patterns are dangerous is the same: the backend has no concept of "this query used to return rows and now does not". Loki, Elasticsearch, ClickHouse, Splunk — all of them happily evaluate the query, find zero matching records, and return zero rows. Zero rows is a perfectly valid result for "are there any payment failures?" — sometimes the answer really is no. The query engine cannot tell the difference between "the system is healthy" and "the field name changed and you are asking about a field nobody emits anymore". That distinction lives outside the query, in the schema contract between producer and consumer, and if that contract is not enforced somewhere the drift is invisible until an incident exposes it.

Why type changes are the most insidious of the five: rename and removal at least produce zero rows, which a sufficiently paranoid alert ("alert if zero rows for 30 minutes when QPS > 100") can catch. A type change produces non-zero rows that look right but answer the wrong question. String comparison is character-by-character, so a filter like amount > "5000" matches "600" (because "6" sorts after "5") but misses "40000" (because "4" sorts before "5"); the filter is silently inverted across much of the value space, and the inversion is invisible in low-volume staging where amounts cluster in similar magnitudes. The same trap exists for booleans rendered as "true"/"false" strings (Python's bool("false") is True because any non-empty string is truthy) and for ISO-8601 timestamps mixed with epoch seconds ("2026-04-25" < "2026-04-26" happens to work lexicographically, until someone emits epoch seconds and the comparison becomes meaningless).
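
The trap is easy to reproduce in a few lines of Python; the records below stand in for log rows whose amount field drifted to strings:

# compare_trap.py — string-typed numbers silently invert range filters
records = [{"amount": "600"}, {"amount": "9000"}, {"amount": "40000"}]

as_strings = [r["amount"] for r in records if r["amount"] > "5000"]
as_numbers = [r["amount"] for r in records if int(r["amount"]) > 5000]

print(as_strings)     # ['600', '9000']: 600 wrongly included, 40000 wrongly dropped
print(as_numbers)     # ['9000', '40000']
print(bool("false"))  # True: any non-empty string is truthy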

Detecting drift before it ships — the producer-side contract

The cheapest place to catch drift is at the producer, before the log line is ever emitted, in the same way that the cheapest place to catch a typed-API regression is at compile time. The discipline that has held up across Razorpay, Swiggy, Cred, and Flipkart is to define the log schema as a typed contract in code, validate every emission against it, and break the build when the contract changes incompatibly. The contract lives next to the application, the validator runs on every test, and the schema-version field on every log line lets the agent and downstream consumers detect drift at parse time.

The script below shows the pattern at its smallest: a pydantic-validated schema for two event types, a logger wrapper that enforces the schema, a deliberate set of drift attempts (type change, enum addition, rename, and a malformed identifier), and the resulting validation failures. The pattern scales to a real codebase by extracting the schema definitions to a shared package (e.g. razorpay_log_schema) that every service imports.

# log_schema_contract.py — typed log schema with producer-side enforcement
# pip install pydantic orjson
import sys
from typing import Literal
from pydantic import BaseModel, ConfigDict, Field, ValidationError
import orjson

# ----- the contract: every log event has a typed pydantic model -----
class PaymentEvent(BaseModel):
    # strict: refuse type coercion ("4280" is not an int); forbid: renames fail loud
    model_config = ConfigDict(strict=True, extra="forbid")
    schema_version: Literal[3] = 3
    event: Literal["payment_attempted", "payment_succeeded", "payment_failed"]
    merchant: str = Field(pattern=r"^M\d{5}$")
    amount_paise: int = Field(ge=100, le=10_000_000)  # ₹1 to ₹1L
    method: Literal["UPI", "CARD", "NETBANKING", "WALLET"]
    acquirer: str
    reason: Literal["OK", "GATEWAY_TIMEOUT", "INSUFFICIENT_FUNDS", "RISK_BLOCK", "STEP_UP_AUTH_REQUIRED"]
    retries: int = Field(ge=0, le=5)
    trace_id: str = Field(min_length=32, max_length=32)

class RiskEvent(BaseModel):
    model_config = ConfigDict(strict=True, extra="forbid")
    schema_version: Literal[3] = 3
    event: Literal["risk_evaluated"]
    user: str = Field(pattern=r"^U\d+$")
    score: int = Field(ge=0, le=100)
    verdict: Literal["ALLOW", "REVIEW", "BLOCK"]
    trace_id: str = Field(min_length=32, max_length=32)

# ----- the wrapper: log() constructs and validates, fails loud on drift -----
def log(model_cls: type[BaseModel], **fields) -> None:
    try:
        record = model_cls(**fields)  # pydantic validates every field here
    except ValidationError as e:
        # In production: emit a metric AND a fail-fast log; never silently drop
        sys.stderr.write(f"SCHEMA_VIOLATION {model_cls.__name__}: {e}\n")
        return
    sys.stdout.write(orjson.dumps(record.model_dump()).decode() + "\n")

# ----- a clean event passes -----
log(PaymentEvent,
    event="payment_failed", merchant="M01023", amount_paise=4280,
    method="UPI", acquirer="razorpay-acq-3", reason="GATEWAY_TIMEOUT",
    retries=3, trace_id="a"*32,
)

# ----- four drift attempts; each fails at validation -----
print("--- drift attempts ---", file=sys.stderr)
try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise="4280",  # str
                 method="UPI", acquirer="x", reason="OK", retries=0, trace_id="a"*32)
except ValidationError as e:
    print(f"type-change caught: {e.errors()[0]['msg']}", file=sys.stderr)

try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise=4280,
                 method="UPI", acquirer="x", reason="STEP_UP_REQUIRED",  # not in enum (typo)
                 retries=0, trace_id="a"*32)
except ValidationError as e:
    print(f"enum-drift caught: {e.errors()[0]['msg'][:60]}...", file=sys.stderr)

try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise=4280,
                 method="UPI", acquirer="x", failure_reason="OK",  # rename
                 retries=0, trace_id="a"*32)
except ValidationError as e:
    print(f"rename caught: {[err['type'] for err in e.errors()]}", file=sys.stderr)

try:
    PaymentEvent(event="payment_failed", merchant="MERCHANT_01023",  # pattern fail
                 amount_paise=4280, method="UPI", acquirer="x", reason="OK",
                 retries=0, trace_id="a"*32)
except ValidationError as e:
    print(f"pattern caught: {e.errors()[0]['msg']}", file=sys.stderr)

Sample run:

{"schema_version":3,"event":"payment_failed","merchant":"M01023","amount_paise":4280,"method":"UPI","acquirer":"razorpay-acq-3","reason":"GATEWAY_TIMEOUT","retries":3,"trace_id":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
--- drift attempts ---
type-change caught: Input should be a valid integer
enum-drift caught: Input should be 'OK', 'GATEWAY_TIMEOUT', 'INSUFFICIENT_FUNDS...
rename caught: ['extra_forbidden', 'missing']
pattern caught: String should match pattern '^M\d{5}$'

The four drift attempts each surface as a pydantic ValidationError at PR-test time, not as a silent dashboard regression at 02:14 IST. The Literal types catch enum drift; strict=True catches type drift (lax-mode pydantic would silently coerce "4280" to 4280, which is exactly the drift the contract exists to refuse); the Field(ge=, le=) constraints catch range drift; pattern= catches identifier-shape drift; extra="forbid" catches rename. Every drift mode in the previous figure has a corresponding pydantic check, every check produces a clear test failure, and every test failure blocks the merge.

The per-line walkthrough: schema_version: Literal[3] = 3 is the version-tag — every record emitted carries schema_version=3 so downstream consumers can branch on it during migration. merchant: str = Field(pattern=r"^M\d{5}$") constrains merchant not just by type but by shape — a typo like MERCHANT_01023 fails at emission, not at the dashboard. The Literal[...] type for reason is the enum lock — adding a new value requires changing the schema definition, which requires bumping schema_version to 4, which makes the schema migration visible at code review. The log(model_cls, **fields) wrapper centralises emission so every log call goes through validation; this is the single code path the team owns, audits, and instruments.

Why centralised emission matters more than the validator: the validator catches violations only in calls that go through the wrapper. A team that ships pydantic schemas but lets developers also call logger.info(json.dumps(...)) directly has the same drift problem as before, because the validator never sees those calls. The discipline is to forbid raw logger.info at the lint level (a flake8/ruff plugin checks for it) and force every emission through the typed wrapper. Without this lint rule, the schema is aspirational; with it, the schema is the only path to production.

A second piece of producer-side hygiene is schema diffing in CI. The pydantic models compile to JSON Schema (PaymentEvent.model_json_schema()), and the JSON schemas are committed to the repo. A CI job compares the current PR's JSON schema against the main branch's schema and flags every change as either backwards-compatible (adding an optional field, broadening an enum) or incompatible (renaming, removing, narrowing a type). Compatible changes pass with a notification; incompatible changes require a schema_version bump in the same PR. This pattern is borrowed from the data-engineering and gRPC worlds where contract testing is the norm; observability has been slower to adopt it, but the teams that have (Cred 2024, Razorpay early-2025) report that schema-related incidents drop 80-90% in the year after adoption.
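
A sketch of that CI job, assuming the JSON Schemas from main are committed under schemas/<ModelName>.json and the models live in the shared package mentioned above; it treats any change to a field's spec as incompatible unless it is a pure addition, while a real gate would also whitelist enum broadening and verify that schema_version was bumped in the same PR:

# schema_diff_ci.py — sketch of the schema-diff gate run on every PR
import json, pathlib, sys
from razorpay_log_schema import PaymentEvent, RiskEvent  # the shared package

def props(model) -> dict[str, str]:
    # field name -> canonical JSON-Schema spec string, for set arithmetic
    return {k: json.dumps(v, sort_keys=True)
            for k, v in model.model_json_schema()["properties"].items()}

incompatible: list[str] = []
for model in (PaymentEvent, RiskEvent):
    committed_doc = json.loads(
        (pathlib.Path("schemas") / f"{model.__name__}.json").read_text())
    committed = {k: json.dumps(v, sort_keys=True)
                 for k, v in committed_doc["properties"].items()}
    current = props(model)
    for f in committed.keys() - current.keys():
        incompatible.append(f"{model.__name__}.{f}: removed or renamed")
    for f in committed.keys() & current.keys():
        if committed[f] != current[f]:
            incompatible.append(f"{model.__name__}.{f}: type or constraint changed")
    for f in current.keys() - committed.keys():
        print(f"compatible addition: {model.__name__}.{f}")

if incompatible:
    print("incompatible schema changes; bump schema_version in this PR:")
    print("\n".join("  " + line for line in incompatible))
    sys.exit(1)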

What the agent does — schema migration and version-aware parsing

The producer-side contract catches drift at PR time for new code, but it does nothing for the years of already-deployed services running on schema v1, v2, and v3 simultaneously. A real fleet has at least three versions of every event in flight at any moment: the version that was current when the service was last deployed, the version that was current six months ago when the slower-moving services were last touched, and the version of records still being read out of long-retention storage. The dashboards and alerts that consume these events cannot branch on schema version themselves — they would be unreadable — so the agent layer translates every record to the canonical current schema before it reaches the backend.

Vector and the OTel Collector both support this pattern as a transform. The Python below shows the algorithm; production deploys use Vector's VRL (Vector Remap Language) or OTel's transform processor, which is the same logic in a higher-level config. The transform reads schema_version from the record, looks up the migration chain for that version, applies each step in order, and emits the canonicalised record.

# schema_migrate.py — agent-side schema migration with version-tagged records
# pip install orjson
import orjson
from typing import Callable

# ----- the migration registry: version N -> version N+1 -----
MIGRATIONS: dict[int, Callable[[dict], dict]] = {}

def migration(from_version: int):
    def decorator(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return decorator

@migration(1)
def v1_to_v2(rec: dict) -> dict:
    # v2 split flat merchant into {id, region}
    if isinstance(rec.get("merchant"), str):
        rec["merchant"] = {"id": rec["merchant"], "region": "unknown"}
    rec["schema_version"] = 2
    return rec

@migration(2)
def v2_to_v3(rec: dict) -> dict:
    # v3 renamed `failure_reason` back to `reason`, narrowed amount to integer.
    # Rebuilding the dict keeps the renamed key in its original position, so
    # canonicalised records come out with a stable field order.
    if "failure_reason" in rec:
        rec = {("reason" if k == "failure_reason" else k): val
               for k, val in rec.items()}
    if isinstance(rec.get("amount_paise"), str):
        try:
            rec["amount_paise"] = int(rec["amount_paise"])
        except ValueError:
            rec["amount_paise"] = None
    rec["schema_version"] = 3
    return rec

CURRENT_VERSION = 3

def migrate(rec: dict) -> dict:
    v = rec.get("schema_version", 1)  # records pre-versioning are v1
    while v < CURRENT_VERSION:
        if v not in MIGRATIONS:
            rec["_migration_error"] = f"no migration from v{v}"
            return rec
        rec = MIGRATIONS[v](rec)
        v = rec["schema_version"]
    return rec

# ----- a stream of mixed-version records, all canonicalised to v3 -----
samples = [
    # v1: flat merchant, string amount, "failure_reason"
    {"event": "payment_failed", "merchant": "M01023", "amount_paise": "4280",
     "failure_reason": "GATEWAY_TIMEOUT", "schema_version": 1},
    # v2: nested merchant, still uses "failure_reason"
    {"event": "payment_failed", "merchant": {"id": "M00071", "region": "south"},
     "amount_paise": 12500, "failure_reason": "INSUFFICIENT_FUNDS", "schema_version": 2},
    # v3: native canonical form, passes through
    {"event": "payment_failed", "merchant": {"id": "M03340", "region": "west"},
     "amount_paise": 9800, "reason": "RISK_BLOCK", "schema_version": 3},
    # ancient pre-versioning record (no schema_version field at all)
    {"event": "payment_failed", "merchant": "M99001", "amount_paise": "1000",
     "failure_reason": "OK"},
]

for rec in samples:
    out = migrate(rec)
    print(orjson.dumps(out).decode())

Sample run:

{"event":"payment_failed","merchant":{"id":"M01023","region":"unknown"},"amount_paise":4280,"reason":"GATEWAY_TIMEOUT","schema_version":3}
{"event":"payment_failed","merchant":{"id":"M00071","region":"south"},"amount_paise":12500,"reason":"INSUFFICIENT_FUNDS","schema_version":3}
{"event":"payment_failed","merchant":{"id":"M03340","region":"west"},"amount_paise":9800,"reason":"RISK_BLOCK","schema_version":3}
{"event":"payment_failed","merchant":{"id":"M99001","region":"unknown"},"amount_paise":1000,"reason":"OK","schema_version":3}

Four input records (v1, v2, v3, and a pre-versioning record treated as v1), all emerging in a single canonical v3 shape. The dashboards downstream of the agent see only v3 — they do not branch on version, do not handle string amounts, do not handle flat merchant fields. The branching lives in the migration registry, which has one function per version transition, each owned by the team that introduced that version.

Why the agent is the right place for migration, not the application or the backend: putting migration in the application means every service has to deploy a new version every time the schema changes, which for a fleet of 80 microservices is a six-month rollout for what should be a config change. Putting migration in the backend (Loki, Elasticsearch) means the backend has to grow application-specific logic, which is a layering violation and forces every backend change to wait on the schema migration. The agent is the single layer that already touches every record, runs in every cluster, and is owned by the platform team — so it is the natural home for "every record gets normalised to one shape before downstream sees it". This is the same architectural argument as for the push-vs-pull collection decision: the agent is the boundary that lets the rest of the pipeline assume canonical input.

The migration registry has two operational properties worth calling out. First, every migration is forward-only and idempotent — running a v2-to-v3 migration on a record that is already v3 must be a no-op (in practice this is enforced by the version check in the loop). Second, migrations never delete information. A v1-to-v2 migration that splits merchant: "M01023" into merchant: {id: "M01023", region: "unknown"} adds a placeholder for the missing region; it does not throw away the merchant id. This matters because forensic queries against old records still need to work, and dropped fields cannot be recovered without a re-shipment from the application that no longer exists. The defensive default is to add an unknown placeholder for missing values rather than null or omit, because dashboards typically treat null and missing as "no data" and that is rarely what the migration intends.
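
Both properties are cheap to pin down in a test. A minimal sketch against the registry above (the sample record is illustrative; importing the demo module also runs its sample loop, so in a real repo the registry would live in its own module):

# migration_invariants.py — the invariants a migration registry ships with
import copy
from schema_migrate import migrate, CURRENT_VERSION

def check_invariants(rec: dict) -> None:
    once = migrate(copy.deepcopy(rec))
    twice = migrate(copy.deepcopy(once))
    assert once["schema_version"] == CURRENT_VERSION  # forward-only: lands on current
    assert once == twice                              # idempotent: re-migration is a no-op
    if "merchant" in rec:                             # information-preserving: id survives
        old = rec["merchant"]
        old_id = old if isinstance(old, str) else old["id"]
        assert once["merchant"]["id"] == old_id

check_invariants({"event": "payment_failed", "merchant": "M01023",
                  "amount_paise": "4280", "failure_reason": "OK"})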

A subtle related issue is that migrations apply only to logs in transit, not to logs in cold storage. A query against last quarter's records goes against the storage shape that was current then, which is some mix of v1, v2, and v3 records. Two strategies handle this. The first is read-time migration: the query layer applies migrations as records are read from storage, which keeps the dashboards reading canonical v3 even on old data. Loki's | line_format and | label_format can do limited remapping; ClickHouse's view-based reads can do full migrations via SQL functions; Elasticsearch's runtime fields can compute current-shape fields from old-shape fields at query time. The second is rewrite-on-rotation: when a chunk rolls from hot to warm to cold storage, the rotation job applies the current migration chain and writes the canonicalised version back to cold storage. Razorpay does the latter for payments logs (their cold-storage layout has been v3 since mid-2025); Cred does the former (read-time migration with a 30-day TTL on the legacy mapping). Both work; the rewrite approach is cleaner but pays the storage rewrite cost; the read-time approach is cheaper but imposes a per-query CPU cost forever.
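
A sketch of the rewrite-on-rotation job, reusing the migration registry above; the one-JSON-record-per-line chunk layout is an assumption, and real backends rewrite their own chunk formats:

# rewrite_on_rotation.py — canonicalise a chunk as it rolls to cold storage
import pathlib
import orjson
from schema_migrate import migrate

def rewrite_chunk(src: pathlib.Path, dst: pathlib.Path) -> None:
    # run once per chunk at rotation time; cold storage then only ever
    # holds records in the current schema shape
    with src.open("rb") as fin, dst.open("wb") as fout:
        for line in fin:
            fout.write(orjson.dumps(migrate(orjson.loads(line))) + b"\n")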

Living with drift — the architectural patterns that actually hold

The producer-side contract and agent-side migration handle the cases where the team owns the producer and is willing to invest in the discipline. In practice, a real fleet has services owned by different teams with different schema-discipline cultures, third-party services emitting their own schemas that you cannot change, and legacy services nobody is willing to touch. The architecture has to handle this heterogeneity without forcing a single team to become the schema police for the whole company.

The pattern that works at scale is a federated schema registry — one shared repo where every team publishes the JSON Schema for the events they emit, owned by the emitting team, reviewed by the platform team. The registry is the source of truth for "what events flow through the pipeline and what shape are they". The platform team's job is to enforce that every emission has a registered schema; the application team's job is to keep their schema accurate and version-bumped. The registry is queryable as a service (schema-registry.internal/v1/schemas/payment_failed?version=3) so the agent can pull the current schema for validation, the dashboard editor can autocomplete field names, and the alert-rule linter can flag references to fields that do not exist.
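
A sketch of a consumer of that service; the URL shape follows the example above, and the assumption that the endpoint returns a plain JSON Schema document is mine:

# registry_client.py — pull a schema from the registry and validate records
# pip install requests jsonschema
import requests
from jsonschema import Draft202012Validator

REGISTRY = "https://schema-registry.internal/v1/schemas"

def fetch_validator(event: str, version: int) -> Draft202012Validator:
    schema = requests.get(f"{REGISTRY}/{event}",
                          params={"version": version}, timeout=2).json()
    return Draft202012Validator(schema)

def violations(rec: dict, validator: Draft202012Validator) -> list[str]:
    # an empty list means the record conforms to the registered schema
    return [err.message for err in validator.iter_errors(rec)]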

Schema registry as the contract layer between producers, agents, and consumers: producers (payments-api publishing v3, risk-engine v2, ledger v4) push their JSON Schemas to a central registry; the agent (Vector / OTel Collector) pulls them to validate incoming records and apply migrations; dashboards, the alert linter, and analyst tools pull them for autocomplete, linting, and query validation. The metadata path shown here is separate from the data path (producer → agent → backend).
Illustrative — the schema registry as the contract layer between producers, the agent, and consumers. Producers publish their JSON Schemas; the agent uses them for validation and migration; consumers (dashboards, alerts, analysts) use them for autocomplete, linting, and query-validation. The data path is separate.

The registry is also where schema deprecation lives. When the v2-to-v3 migration happens, the registry marks v2 as deprecated with a sunset date six months out. Any service still publishing v2 records six months after the deprecation gets a CI failure on its next build (the registry serves a deprecation header, the linter checks the header, the build breaks). This is the lever that keeps the migration honest — without a sunset and a hard CI break, services drift on the old schema indefinitely and the agent's migration registry grows unboundedly. The teams that have run schema registries for two or more years (Confluent's experience with Avro/Schema Registry in Kafka, the gRPC ecosystem with Buf's breaking-change detector, Razorpay's internal observability registry since mid-2024) all converge on the same answer: deprecation has to be enforced, not suggested, because suggestions get ignored when there is a backlog.
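
A sketch of that build-time check, assuming the registry signals deprecation with an RFC 8594 Sunset header (the header name is my assumption; the text above only says "a deprecation header"):

# deprecation_gate.py — fail the build when a schema version is past sunset
# pip install requests
import sys
import datetime
from email.utils import parsedate_to_datetime
import requests

resp = requests.get("https://schema-registry.internal/v1/schemas/payment_failed",
                    params={"version": 2}, timeout=2)
sunset = resp.headers.get("Sunset")  # e.g. "Sat, 01 Nov 2025 00:00:00 GMT"
if sunset and parsedate_to_datetime(sunset) < datetime.datetime.now(datetime.timezone.utc):
    print("payment_failed v2 is past its sunset date; migrate to v3 before building")
    sys.exit(1)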

A second pattern worth knowing is dual-emission during migration. When failure_reason is being renamed to reason, the application emits both fields for one release cycle — the old field for old consumers, the new field for new consumers — and bumps schema_version to indicate that both are present and consumers should migrate. The dual-emission window is typically one to four weeks; long enough for downstream dashboards and alerts to be updated, short enough that the cost of carrying both fields does not dominate. After the window, the old field is removed and the schema bumps to the next version. Dual-emission is what makes large-scale schema changes shippable without a flag day, and it works because the cost of one extra string per log line for a few weeks is much cheaper than coordinating a synchronous rewrite of every consumer.
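
A sketch of the shim, with a hypothetical v4 marking the window in which both names are present:

# dual_emission.py — one-release-cycle shim during the failure_reason -> reason rename
def with_dual_fields(rec: dict) -> dict:
    out = dict(rec)
    if "reason" in out:
        # old consumers keep reading failure_reason; new consumers read reason;
        # after the window the old field is dropped and the version bumps again
        out["failure_reason"] = out["reason"]
    out["schema_version"] = 4  # hypothetical version marking the dual window
    return out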

A third pattern is drift-detection as an SLO. The platform team measures the rate of schema-violation events (records that fail validation at the agent) and treats it like any other SLO with an error budget. A budget of 0.01% violations per million records is typical for mature pipelines; spikes above 0.1% page the team because they signify a producer that is emitting non-conforming records faster than the migration registry can absorb them, which usually means a service has bypassed the typed wrapper and is calling logger.info(json.dumps(...)) directly. The metric is rate(schema_violations_total[5m]) / rate(records_total[5m]) and the alert fires on multi-window burn rate exactly the way an availability SLO does; the violation events themselves are also logged (to a separate stream so they don't recurse) so the on-call can see which producer is at fault.
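
The counters behind that ratio are straightforward to maintain at the agent. A sketch with prometheus_client, assuming the producer label comes from the record's service field:

# drift_slo_counters.py — the two counters that feed the violation-rate SLI
# pip install prometheus-client
from prometheus_client import Counter

RECORDS = Counter("records_total", "records seen by the agent", ["producer"])
VIOLATIONS = Counter("schema_violations_total",
                     "records failing schema validation", ["producer"])

def observe(rec: dict, valid: bool) -> None:
    producer = rec.get("service", "unknown")
    RECORDS.labels(producer).inc()
    if not valid:
        # rate(schema_violations_total) / rate(records_total) is the SLI
        VIOLATIONS.labels(producer).inc()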

Common confusions

Going deeper

Avro, Protobuf, and the lessons from the Kafka world

The data-engineering world solved schema evolution for streaming pipelines a decade before observability got serious about it. Apache Avro (used heavily in Kafka pipelines via Confluent's Schema Registry) and Protocol Buffers (used heavily in gRPC, OTLP, and Kafka with Schema-aware producers) both formalise the producer-consumer schema contract in a way that JSON does not. Avro carries the writer's schema with the records (or delegates to a Schema Registry that returns the schema by ID); the reader specifies its expected schema, and Avro's resolution rules describe exactly how fields are renamed, defaulted, or skipped during reading. Protobuf assigns numeric tags to every field, which makes renames free (the tag is what matters, not the name) and additions safe (unknown tags are preserved on round-trip). Neither format is a perfect fit for logs — Avro's schema-with-data is heavy for ad-hoc log records, and Protobuf's typed wire format is awkward for the free-form attribute bags that wide events want — but both ecosystems have battle-tested patterns the observability world is now adopting. Specifically: schema-id-as-prefix (every record carries a 4-byte schema fingerprint that the agent uses to look up the schema in the registry), typed defaults (every field has a well-defined default value used when the field is missing, instead of null/undefined ambiguity), and forwards-compatible change rules (adding optional fields is fine, removing required fields is not, type changes require a version bump). The OTLP wire format itself is Protobuf-based and inherits these properties for the OTel-defined attributes; the user-defined attribute space sits on top of OTLP's KeyValue map and is therefore back to needing application-level schema discipline.
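
The schema-id-as-prefix pattern is small enough to show whole; this sketch follows the Confluent wire format (a zero magic byte, then a big-endian 4-byte schema id, then the payload):

# schema_id_framing.py — Confluent-style wire framing for schema lookup
import struct

def frame(schema_id: int, payload: bytes) -> bytes:
    # magic byte 0x00, big-endian u32 schema id, then the serialised record
    return b"\x00" + struct.pack(">I", schema_id) + payload

def unframe(buf: bytes) -> tuple[int, bytes]:
    if buf[:1] != b"\x00":
        raise ValueError("unknown magic byte; not a schema-framed record")
    (schema_id,) = struct.unpack(">I", buf[1:5])
    return schema_id, buf[5:]  # the agent resolves schema_id via the registry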

Why "wide events" makes drift worse, and what to do about it

The wide-events school (championed by Honeycomb and discussed in structured-vs-unstructured-logging) advocates emitting log records with hundreds of attributes — every attribute the producer has access to, on the theory that disk is cheap and you cannot predict which attribute you will need to query in a future incident. The intuitive concern is that wide events make drift worse because there are more fields that can change. In practice the relationship is more subtle: wide events have more fields, but they also have less coupling between fields (each is independently named and typed, so changing one rarely cascades), and the schema is typically machine-generated from the application's data model rather than hand-written, so renames are caught by the codegen step. The drift mode that wide events make harder is type drift on attributes the producer constructs dynamically: attributes["custom_" + key] = value patterns where the type of value depends on runtime control flow. The defensive pattern, used by Honeycomb's own SDK and by the observability teams that emit very-wide events at scale (LinkedIn's Burrow, Uber's M3 wide-event experiments), is to enforce a typed-bag layer between application and emission: event.set_int("custom_tries", n) and event.set_str("custom_status", s) make the type explicit and checkable at lint time; event.set("custom_tries", n) would let n change type without notice. The lesson is that wide events are not in tension with schema discipline — they require a different shape of discipline (typed setters, codegen, runtime asserts) but the underlying contract is the same.
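
A sketch of that typed-bag layer, modelled on the set_int/set_str discipline just described (the class and method names are illustrative):

# typed_bag.py — typed setters make dynamically-built attributes drift-proof
class WideEvent:
    def __init__(self) -> None:
        self.attrs: dict[str, object] = {}

    def set_int(self, key: str, value: int) -> None:
        # bool is a subclass of int in Python; reject it explicitly
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"{key}: expected int, got {type(value).__name__}")
        self.attrs[key] = value

    def set_str(self, key: str, value: str) -> None:
        if not isinstance(value, str):
            raise TypeError(f"{key}: expected str, got {type(value).__name__}")
        self.attrs[key] = value

event = WideEvent()
event.set_int("custom_tries", 3)        # fine
try:
    event.set_int("custom_tries", "3")  # type drift caught at the producer
except TypeError as e:
    print(e)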

Detecting drift in production with anomaly-detection on field shapes

For drift that escapes the producer (because the producer bypassed the validator, or used a third-party library that emits its own schema, or the contract was incomplete), the only way to catch it is in production by watching the shape of the data. The pattern is to track, per field, the JSON type distribution, the value cardinality, and the null rate in 5-minute windows, and to alert when any of these jumps. A field that has been 100% integer for six months suddenly emitting 0.1% strings is almost always a type-drift bug; a field whose null rate jumps from 0% to 8% is almost always a removal-during-rename; a field whose value cardinality doubles overnight is almost always an enum addition. The OTel Collector's groupbyattrs processor combined with a custom metrics-generation step can produce these telemetry-of-telemetry signals; Cred's platform team published their version (an open-source Vector transform) in late 2024. The signals are noisy — legitimate releases also change distributions — but the anomalies are concentrated in time and tied to specific producers, so the on-call workflow is to receive the alert, look at the producer's most recent deploy, and ask "did this PR change the schema?" Most of the time, the answer is yes and the fix is a schema-version bump that should have happened in the PR.
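
A sketch of the per-field shape tracking (windowing, cardinality, and alert wiring omitted; the tolerance value is illustrative):

# field_shapes.py — per-field type distributions for telemetry-of-telemetry
from collections import Counter, defaultdict

def json_type(value) -> str:
    return "null" if value is None else type(value).__name__

class ShapeTracker:
    def __init__(self) -> None:
        self.counts: dict[str, Counter] = defaultdict(Counter)

    def observe(self, rec: dict) -> None:
        for field, value in rec.items():
            self.counts[field][json_type(value)] += 1

    def drifted(self, baseline: "ShapeTracker", tolerance: float = 0.001) -> list[str]:
        # flag fields whose type mix moved more than `tolerance` vs the baseline
        out = []
        for field, counter in self.counts.items():
            total = sum(counter.values())
            base = baseline.counts.get(field, Counter())
            base_total = sum(base.values()) or 1
            for t, n in counter.items():
                if abs(n / total - base.get(t, 0) / base_total) > tolerance:
                    out.append(f"{field}: {t} fraction moved to {n / total:.4f}")
        return out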

What the OTel Schema Specification gets right and what it punts on

The OpenTelemetry Schema Specification (released 2022, evolving since) is the most ambitious attempt at an industry-standard solution to log/metric/trace schema versioning. It defines a YAML format for declaring schema transformations between versions ("rename attribute X to Y in version 1.5.0"), a schema_url field on every record/resource that points to the schema document, and a translation primitive that collectors can implement to canonicalise records to a target schema version. What the spec gets right: the locus of migration is the collector, not the application (matching the architectural argument made earlier in this chapter); migrations are declarative, not imperative (the YAML is data, not code, so it can be reviewed, diffed, versioned); the schema covers all three signals (a single schema-url applies to logs, metrics, and traces from the same resource, so you do not have to coordinate three separate schema registries). What the spec punts on: application-level attributes. The spec covers the OTel-defined semantic conventions (http.status_code, db.statement) but explicitly says that user-defined attributes are out of scope. This is a reasonable scoping decision but it means the bulk of an application's drift surface is still the application's responsibility. The spec is worth reading because it formalises the architectural principles, but it is not a complete solution by itself.

When to break the schema vs work around it

Sometimes the right call is to break the schema deliberately. A field name that is offensive, misleading, or wrong (customer.gender as a single-byte enum that does not match the company's diversity policy; user.email that the privacy team requires removed; card.full_pan that should never have been logged at all) should be removed even though it breaks consumers. The pattern is to schedule a hard break at a known time, communicate widely, provide a backwards-compatibility layer (the field exists but is null, so the absence is graceful) for one to two weeks, and then remove the field. The architectural hygiene of a registry and a migration registry makes these breaks possible without chaos because the consumers have a single signal (deprecation header on the schema) that tells them what is changing and when. Without the registry, every consumer has to be hunted down individually, which is how you end up with the schema staying drift-broken for two years because nobody can finish the migration. The lesson from a decade of API versioning is that the ability to break things deliberately is the discipline that makes versioning worth the cost; if you cannot break, you are paying the migration overhead with no benefit. The registry plus the version field plus the agent-side migration is the toolkit that makes deliberate breaks safe.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pydantic orjson
python3 log_schema_contract.py
python3 schema_migrate.py
# Expected: contract validation rejects four drift attempts at PR time;
# migration canonicalises four mixed-version records (v1, v2, v3, and a
# pre-versioning record) to a single v3 shape. Combine the two scripts in a
# real pipeline by running the
# pydantic-validated emitter against a Vector instance configured with the
# migration registry transform.

Where this leads next

The next chapters in this section move from the schema-and-shape question to the query language that operates on the structured records — LogQL's grammar for label vs structured-metadata vs body queries, what each layer can and cannot express, and the latency and cost implications of each query shape. The contract this chapter described is what makes those queries reliable; without the contract, the queries are clever syntax over unreliable data, and the dashboards built on them are accidents waiting for the next rename PR.
