JSON logs and schema drift
Karan is debugging an alert at 02:14 IST — payments-api timeout rate above 0.5% should have paged at 01:42 but did not, and the on-call dashboard is reading zero failures for a service the customer-care team is screaming about. He pulls the LogQL query the alert evaluates, {service="payments-api"} | json | reason="GATEWAY_TIMEOUT" | rate by (acquirer), runs it against the last six hours, and gets back zero rows. He runs it without the reason filter and gets thousands of rows. He stares at the records, scrolls until his eyes hurt, and finally spots it: every event has a field called failure_reason, not reason. Three weeks ago, a junior engineer renamed the field as part of a tidy-up PR. The dashboards quietly read zero. The alerts quietly stopped firing. Nobody noticed because the absence-of-rows looks identical to the absence-of-failures, and the on-call rotation interpreted the silent dashboard as the system being healthy.
This is schema drift, and it is the dominant failure mode of structured-logging systems in production. The chapter that introduced structured vs unstructured logging argued that JSON-per-line is the wire format that unlocks queryability. That argument is right, but it is only half the contract. The other half is that the field names, types, and shapes have to stay stable across emitters and across time, or every dashboard, alert, and incident query that depended on the old shape silently rots. Drift is harder to fix than to prevent, and the systems that handle it well treat the log schema as a versioned, owned, enforced contract — the same way they treat their public API.
A JSON log line is queryable only as long as its schema is stable; the moment a field name, type, or nesting level changes, every query that depended on the old shape returns silently wrong results. Schema drift is invisible because absence-of-rows looks like absence-of-events, and the only durable defence is to treat the log schema as a versioned contract — owned by a team, validated at emit time, version-tagged on every record, and migrated by the agent rather than the application. Most production drift is a rename, a type change, or an enum-value addition; all three are detectable at PR time if you instrument the producer, and undetectable in production if you do not.
What schema drift actually looks like in production
The word "drift" suggests slow gradual change, but in practice it is usually one PR. A developer needs to add information to an existing event, or rename a field that was always wrong, or fix the type of a number that was being emitted as a string. The change passes code review because the diff looks small and harmless, deploys at 11am on a Tuesday, and breaks every downstream consumer of that field by 11:01am. The break is silent — the application logs are still flowing, the agent is still parsing, the backend is still indexing — but the queries that consumed the old shape now return zero rows where they used to return numbers, and the dashboards and alerts built on those queries are now lying.
There are five common drift patterns, and almost every production incident attributable to drift is one of these five.
- Rename — reason becomes failure_reason, or user_id becomes customer_id, or amt becomes amount. The new field is correct; the old field stops being emitted; every query that filtered or grouped by the old name silently reads zero.
- Type change — amount was a JSON string "4280" in 2024 because the original developer concatenated it from a string-typed config field, and someone fixes it in 2026 to be a JSON number 4280. The fix is correct; the dashboard that did amount > 5000 was secretly comparing strings lexicographically and now starts comparing numbers, so the panel jumps overnight in a way that looks like a real incident.
- Enum drift — reason used to be one of {OK, GATEWAY_TIMEOUT, INSUFFICIENT_FUNDS, RISK_BLOCK} and someone adds STEP_UP_AUTH_REQUIRED because a new payment flow needs it. The addition is correct; the alert rule that fires on reason!="OK" and groups by reason now has a new bucket that never existed before, and the dashboard's hand-crafted colour palette runs out of colours and assigns the new bucket grey, which happens to be the colour the team uses for "no data".
- Nesting change — merchant: "M01023" becomes merchant: {id: "M01023", region: "south"} because the team needs region for a regulatory filter. The change is correct; every JSON path query against the old merchant string field now sees the JSON object serialised as a string, and the equality filter merchant="M01023" returns nothing.
- Field removal — pan is removed from logs because of a privacy review (the right call), but a fraud-investigation dashboard that joined logs to a known-pan list still has the join configured and now displays nothing.
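The rename case can be reduced to a few lines. This is an illustrative toy, not a real query engine, but it shows the core problem: a query engine treats a missing field as a non-match, not as an error.

```python
# Toy "query" over in-memory records: a rename reads as zero rows, not as a failure.
records_before = [{"reason": "GATEWAY_TIMEOUT"}, {"reason": "OK"}]
records_after = [{"failure_reason": "GATEWAY_TIMEOUT"}, {"failure_reason": "OK"}]

def count_timeouts(records):
    # Missing field -> non-match. The engine cannot know the field was renamed.
    return sum(1 for r in records if r.get("reason") == "GATEWAY_TIMEOUT")

print(count_timeouts(records_before))  # 1 -- the dashboard shows the real failure
print(count_timeouts(records_after))   # 0 -- same failure, invisible after the rename
```

Zero is a perfectly legal answer to "how many timeouts?", which is exactly why nothing errors and nothing pages.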
The reason all five drift patterns are dangerous is the same: the backend has no concept of "this query used to return rows and now does not". Loki, Elasticsearch, ClickHouse, Splunk — all of them happily evaluate the query, find zero matching records, and return zero rows. Zero rows is a perfectly valid result for "are there any payment failures?" — sometimes the answer really is no. The query engine cannot tell the difference between "the system is healthy" and "the field name changed and you are asking about a field nobody emits anymore". That distinction lives outside the query, in the schema contract between producer and consumer, and if that contract is not enforced somewhere the drift is invisible until an incident exposes it.
Why type changes are the most insidious of the five: rename and removal at least produce zero rows, which a sufficiently paranoid alert ("alert if zero rows for 30 minutes while QPS > 100") can catch. A type change produces non-zero rows that look right but answer the wrong question. In the lexicographic-vs-numeric case, amount > 5000 evaluated over strings wrongly admits "600" (because "6" sorts after "5") and wrongly drops "10000" (because "1" sorts before "5"), silently corrupting the filter across much of the value space, and it is invisible in low-volume staging where amounts cluster in similar magnitudes. The same trap exists for booleans rendered as "true"/"false" strings (Python's bool("false") is True because any non-empty string is truthy) and for ISO-8601 timestamps mixed with epoch seconds ("2026-04-25" < "2026-04-26" happens to work lexicographically, until someone emits epoch seconds and the comparison becomes meaningless).
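The lexicographic trap is easy to reproduce with illustrative values: the same filter over int-typed and string-typed amounts returns different, equally plausible-looking answers.

```python
# The same "amount > 5000" filter, over numbers and over strings.
amounts = [600, 4280, 10000, 50000]

as_ints = [a for a in amounts if a > 5000]               # numeric comparison
as_strs = [s for s in map(str, amounts) if s > "5000"]   # lexicographic comparison

print(as_ints)  # [10000, 50000]
print(as_strs)  # ['600', '50000'] -- 600 wrongly included, 10000 wrongly dropped
```

Both result sets are non-empty and look reasonable on a dashboard panel, which is why this never surfaces as an error.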
Detecting drift before it ships — the producer-side contract
The cheapest place to catch drift is at the producer, before the log line is ever emitted, in the same way that the cheapest place to catch a typed-API regression is at compile time. The discipline that has held up across Razorpay, Swiggy, Cred, and Flipkart is to define the log schema as a typed contract in code, validate every emission against it, and break the build when the contract changes incompatibly. The contract lives next to the application, the validator runs on every test, and the schema-version field on every log line lets the agent and downstream consumers detect drift at parse time.
The script below shows the pattern at its smallest: a pydantic-validated schema for two event types, a logger wrapper that enforces the schema, a deliberate set of drift attempts (rename, type change, enum addition), and the resulting validation failures. The pattern scales to a real codebase by extracting the schema definitions to a shared package (e.g. razorpay_log_schema) that every service imports.
# log_schema_contract.py — typed log schema with producer-side enforcement
# pip install pydantic orjson
import sys
from typing import Literal
from pydantic import BaseModel, ConfigDict, Field, ValidationError
import orjson

# ----- the contract: every log event has a typed pydantic model -----
class StrictEvent(BaseModel):
    # strict=True stops pydantic silently coercing "4280" -> 4280;
    # extra="forbid" turns a rename into an extra_forbidden error
    model_config = ConfigDict(extra="forbid", strict=True)

class PaymentEvent(StrictEvent):
    schema_version: Literal[3] = 3
    event: Literal["payment_attempted", "payment_succeeded", "payment_failed"]
    merchant: str = Field(pattern=r"^M\d{5}$")
    amount_paise: int = Field(ge=100, le=10_000_000)  # ₹1 to ₹1L
    method: Literal["UPI", "CARD", "NETBANKING", "WALLET"]
    acquirer: str
    reason: Literal["OK", "GATEWAY_TIMEOUT", "INSUFFICIENT_FUNDS", "RISK_BLOCK", "STEP_UP_AUTH_REQUIRED"]
    retries: int = Field(ge=0, le=5)
    trace_id: str = Field(min_length=32, max_length=32)

class RiskEvent(StrictEvent):
    schema_version: Literal[3] = 3
    event: Literal["risk_evaluated"]
    user: str = Field(pattern=r"^U\d+$")
    score: int = Field(ge=0, le=100)
    verdict: Literal["ALLOW", "REVIEW", "BLOCK"]
    trace_id: str = Field(min_length=32, max_length=32)

# ----- the wrapper: log() validates before emit, fails loud on drift -----
def log(model: StrictEvent) -> None:
    try:
        # pydantic validates at construction; re-validate here so fields
        # mutated after construction cannot sneak past the contract
        validated = model.__class__.model_validate(model.model_dump()).model_dump()
    except ValidationError as e:
        # In production: emit a metric AND a fail-fast log; never silently drop
        sys.stderr.write(f"SCHEMA_VIOLATION {model.__class__.__name__}: {e}\n")
        return
    sys.stdout.write(orjson.dumps(validated).decode() + "\n")

# ----- a clean event passes -----
log(PaymentEvent(
    event="payment_failed", merchant="M01023", amount_paise=4280,
    method="UPI", acquirer="razorpay-acq-3", reason="GATEWAY_TIMEOUT",
    retries=3, trace_id="a" * 32,
))

# ----- four drift attempts; each fails at validation -----
print("--- drift attempts ---", file=sys.stderr)
try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise="4280",  # str, rejected by strict mode
                 method="UPI", acquirer="x", reason="OK", retries=0, trace_id="a" * 32)
except ValidationError as e:
    print(f"type-change caught: {e.errors()[0]['msg']}", file=sys.stderr)
try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise=4280,
                 method="UPI", acquirer="x", reason="STEP_UP_REQUIRED",  # not in enum (typo)
                 retries=0, trace_id="a" * 32)
except ValidationError as e:
    print(f"enum-drift caught: {e.errors()[0]['msg'][:60]}...", file=sys.stderr)
try:
    PaymentEvent(event="payment_failed", merchant="M01023", amount_paise=4280,
                 method="UPI", acquirer="x", failure_reason="OK",  # rename
                 retries=0, trace_id="a" * 32)
except ValidationError as e:
    print(f"rename caught: {sorted({err['type'] for err in e.errors()})}", file=sys.stderr)
try:
    PaymentEvent(event="payment_failed", merchant="MERCHANT_01023",  # pattern fail
                 amount_paise=4280, method="UPI", acquirer="x", reason="OK",
                 retries=0, trace_id="a" * 32)
except ValidationError as e:
    print(f"pattern caught: {e.errors()[0]['msg']}", file=sys.stderr)
Sample run:
{"schema_version":3,"event":"payment_failed","merchant":"M01023","amount_paise":4280,"method":"UPI","acquirer":"razorpay-acq-3","reason":"GATEWAY_TIMEOUT","retries":3,"trace_id":"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
--- drift attempts ---
type-change caught: Input should be a valid integer
enum-drift caught: Input should be 'OK', 'GATEWAY_TIMEOUT', 'INSUFFICIENT_FUNDS...
rename caught: ['extra_forbidden', 'missing']
pattern caught: String should match pattern '^M\d{5}$'
The four drift attempts each surface as a pydantic ValidationError at PR-test time, not as a silent dashboard regression at 02:14 IST. The Literal types catch enum drift; the Field(ge=, le=) constraints catch range drift; strict=True catches type drift that pydantic's default lax mode would silently coerce away (a string "4280" quietly coerced to 4280 is exactly the drift you are trying to surface); pattern= catches identifier-shape drift; and extra="forbid" catches renames by rejecting the unknown field instead of ignoring it. Every drift mode listed earlier has a corresponding pydantic check, every check produces a clear test failure, and every test failure blocks the merge.
A walkthrough of the load-bearing lines: schema_version: Literal[3] = 3 is the version tag — every record emitted carries schema_version=3 so downstream consumers can branch on it during migration. merchant: str = Field(pattern=r"^M\d{5}$") constrains merchant not just by type but by shape — a typo like MERCHANT_01023 fails at emission, not at the dashboard. The Literal[...] type on reason is the enum lock — adding a new value requires changing the schema definition, which requires bumping schema_version to 4, which makes the schema migration visible at code review. The log(model) wrapper centralises emission so every log call goes through validation; this is the single code path the team owns, audits, and instruments. Centralised emission matters more than the validator itself: the validator catches violations only in calls that go through the wrapper. A team that ships pydantic schemas but lets developers also call logger.info(json.dumps(...)) directly has the same drift problem as before, because the validator never sees those calls. The discipline is to forbid raw logger.info at the lint level (a flake8/ruff plugin checks for it) and force every emission through the typed wrapper. Without the lint rule the schema is aspirational; with it, the schema is the only path to production.
A second piece of producer-side hygiene is schema diffing in CI. The pydantic models compile to JSON Schema (PaymentEvent.model_json_schema()), and the JSON schemas are committed to the repo. A CI job compares the current PR's JSON schema against the main branch's schema and flags every change as either backwards-compatible (adding an optional field, broadening an enum) or incompatible (renaming, removing, narrowing a type). Compatible changes pass with a notification; incompatible changes require a schema_version bump in the same PR. This pattern is borrowed from the data-engineering and gRPC worlds where contract testing is the norm; observability has been slower to adopt it, but the teams that have (Cred 2024, Razorpay early-2025) report that schema-related incidents drop 80-90% in the year after adoption.
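A minimal sketch of that CI diff step, assuming JSON Schemas with a flat properties map (real schemas nest, and a production checker would recurse; the classification rules are the part that matters):

```python
# Sketch: classify a schema change as backwards-compatible or incompatible.
def classify_change(old: dict, new: dict) -> str:
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    removed = set(old_props) - set(new_props)
    if removed:
        return "incompatible"  # renames/removals break old consumers
    for name, spec in old_props.items():
        if new_props[name].get("type") != spec.get("type"):
            return "incompatible"  # type changes need a schema_version bump
        old_enum, new_enum = spec.get("enum"), new_props[name].get("enum")
        if old_enum and new_enum and not set(old_enum) <= set(new_enum):
            return "incompatible"  # dropping enum values breaks old filters
    added = set(new_props) - set(old_props)
    if added & set(new.get("required", [])):
        return "incompatible"      # a new *required* field breaks old emitters
    return "compatible"            # optional additions / enum broadening pass

old = {"properties": {"reason": {"type": "string", "enum": ["OK", "GATEWAY_TIMEOUT"]}}}
new = {"properties": {"failure_reason": {"type": "string"}}}
print(classify_change(old, new))  # prints "incompatible": "reason" is gone
```

The CI job runs this over PaymentEvent.model_json_schema() from the PR branch and from main, passes on "compatible", and on "incompatible" fails unless the same diff also bumps schema_version.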
What the agent does — schema migration and version-aware parsing
The producer-side contract catches drift at PR time for new code, but it does nothing for the years of already-deployed services running on schema v1, v2, and v3 simultaneously. A real fleet has at least three versions of every event in flight at any moment: the version that was current when the service was last deployed, the version that was current six months ago when the slower-moving services were last touched, and the version of records still being read out of long-retention storage. The dashboards and alerts that consume these events cannot branch on schema version themselves — they would be unreadable — so the agent layer translates every record to the canonical current schema before it reaches the backend.
Vector and the OTel Collector both support this pattern as a transform. The Python below shows the algorithm; production deploys use Vector's VRL (Vector Remap Language) or OTel's transform processor, which is the same logic in a higher-level config. The transform reads schema_version from the record, looks up the migration chain for that version, applies each step in order, and emits the canonicalised record.
# schema_migrate.py — agent-side schema migration with version-tagged records
# pip install orjson
import orjson
from typing import Callable

# ----- the migration registry: version N -> version N+1 -----
MIGRATIONS: dict[int, Callable[[dict], dict]] = {}

def migration(from_version: int):
    def decorator(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return decorator

@migration(1)
def v1_to_v2(rec: dict) -> dict:
    # v2 split flat merchant into {id, region}
    if isinstance(rec.get("merchant"), str):
        rec["merchant"] = {"id": rec["merchant"], "region": "unknown"}
    rec["schema_version"] = 2
    return rec

@migration(2)
def v2_to_v3(rec: dict) -> dict:
    # v3 renamed `failure_reason` back to `reason`, narrowed amount to integer
    if "failure_reason" in rec:
        rec["reason"] = rec.pop("failure_reason")
    if isinstance(rec.get("amount_paise"), str):
        try:
            rec["amount_paise"] = int(rec["amount_paise"])
        except (ValueError, TypeError):
            rec["amount_paise"] = None
    rec["schema_version"] = 3
    return rec

CURRENT_VERSION = 3

def migrate(rec: dict) -> dict:
    v = rec.get("schema_version", 1)  # records pre-versioning are v1
    while v < CURRENT_VERSION:
        if v not in MIGRATIONS:
            rec["_migration_error"] = f"no migration from v{v}"
            return rec
        rec = MIGRATIONS[v](rec)
        v = rec["schema_version"]
    return rec

# ----- a stream of mixed-version records, all canonicalised to v3 -----
samples = [
    # v1: flat merchant, string amount, "failure_reason"
    {"event": "payment_failed", "merchant": "M01023", "amount_paise": "4280",
     "failure_reason": "GATEWAY_TIMEOUT", "schema_version": 1},
    # v2: nested merchant, still uses "failure_reason"
    {"event": "payment_failed", "merchant": {"id": "M00071", "region": "south"},
     "amount_paise": 12500, "failure_reason": "INSUFFICIENT_FUNDS", "schema_version": 2},
    # v3: native canonical form, passes through
    {"event": "payment_failed", "merchant": {"id": "M03340", "region": "west"},
     "amount_paise": 9800, "reason": "RISK_BLOCK", "schema_version": 3},
    # ancient pre-versioning record (no schema_version field at all)
    {"event": "payment_failed", "merchant": "M99001", "amount_paise": "1000",
     "failure_reason": "OK"},
]
for rec in samples:
    print(orjson.dumps(migrate(rec)).decode())
Sample run:
{"event":"payment_failed","merchant":{"id":"M01023","region":"unknown"},"amount_paise":4280,"schema_version":3,"reason":"GATEWAY_TIMEOUT"}
{"event":"payment_failed","merchant":{"id":"M00071","region":"south"},"amount_paise":12500,"schema_version":3,"reason":"INSUFFICIENT_FUNDS"}
{"event":"payment_failed","merchant":{"id":"M03340","region":"west"},"amount_paise":9800,"reason":"RISK_BLOCK","schema_version":3}
{"event":"payment_failed","merchant":{"id":"M99001","region":"unknown"},"amount_paise":1000,"schema_version":3,"reason":"OK"}
Four input records at four different schema versions, all emerging in a single canonical v3 shape. The dashboards downstream of the agent see only v3 — they do not branch on version, do not handle string amounts, do not handle flat merchant fields. The branching lives in the migration registry, which has one function per version transition, each owned by the team that introduced that version.
Why the agent is the right place for migration, not the application or the backend: putting migration in the application means every service has to deploy a new version every time the schema changes, which for a fleet of 80 microservices is a six-month rollout for what should be a config change. Putting migration in the backend (Loki, Elasticsearch) means the backend has to grow application-specific logic, which is a layering violation and forces every backend change to wait on the schema migration. The agent is the single layer that already touches every record, runs in every cluster, and is owned by the platform team — so it is the natural home for "every record gets normalised to one shape before downstream sees it". This is the same architectural argument as for the push-vs-pull collection decision: the agent is the boundary that lets the rest of the pipeline assume canonical input.
The migration registry has two operational properties worth calling out. First, every migration is forward-only and idempotent — running a v2-to-v3 migration on a record that is already v3 must be a no-op (in practice this is enforced by the version check in the loop). Second, migrations never delete information. A v1-to-v2 migration that splits merchant: "M01023" into merchant: {id: "M01023", region: "unknown"} adds a placeholder for the missing region; it does not throw away the merchant id. This matters because forensic queries against old records still need to work, and dropped fields cannot be recovered without a re-shipment from the application that no longer exists. The defensive default is to add an unknown placeholder for missing values rather than null or omit, because dashboards typically treat null and missing as "no data" and that is rarely what the migration intends.
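The idempotency property is cheap to pin down in a unit test. A self-contained sketch with a two-step chain inlined (a real test would import the registry from the migration module instead):

```python
# Idempotency check: migrating an already-current record must be a no-op.
CURRENT_VERSION = 3
MIGRATIONS = {
    # v1 -> v2: split flat merchant into {id, region}, add a placeholder region
    1: lambda r: {**r, "merchant": {"id": r["merchant"], "region": "unknown"},
                  "schema_version": 2},
    # v2 -> v3: rename failure_reason back to reason
    2: lambda r: {**{k: v for k, v in r.items() if k != "failure_reason"},
                  "reason": r.get("failure_reason", r.get("reason")),
                  "schema_version": 3},
}

def migrate(rec):
    while rec.get("schema_version", 1) < CURRENT_VERSION:
        rec = MIGRATIONS[rec.get("schema_version", 1)](rec)
    return rec

v1 = {"merchant": "M01023", "failure_reason": "OK", "schema_version": 1}
once = migrate(dict(v1))
twice = migrate(dict(once))
assert once == twice          # the version guard in the loop makes re-runs no-ops
print(once["schema_version"])  # 3
```

The version check in the while loop is what enforces idempotency: a v3 record never enters the loop, so replaying a stream through the agent twice cannot double-apply a migration.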
A subtle related issue is that migrations apply only to logs in transit, not to logs in cold storage. A query against last quarter's records goes against the storage shape that was current then, which is some mix of v1, v2, and v3 records. Two strategies handle this. The first is read-time migration: the query layer applies migrations as records are read from storage, which keeps the dashboards reading canonical v3 even on old data. Loki's | line_format and | label_format can do limited remapping; ClickHouse's view-based reads can do full migrations via SQL functions; Elasticsearch's runtime fields can compute current-shape fields from old-shape fields at query time. The second is rewrite-on-rotation: when a chunk rolls from hot to warm to cold storage, the rotation job applies the current migration chain and writes the canonicalised version back to cold storage. Razorpay does the latter for payments logs (their cold-storage layout has been v3 since mid-2025); Cred does the former (read-time migration with a 30-day TTL on the legacy mapping). Both work; the rewrite approach is cleaner but pays the storage rewrite cost; the read-time approach is cheaper but imposes a per-query CPU cost forever.
Living with drift — the architectural patterns that actually hold
The producer-side contract and agent-side migration handle the cases where the team owns the producer and is willing to invest in the discipline. In practice, a real fleet has services owned by different teams with different schema-discipline cultures, third-party services emitting their own schemas that you cannot change, and legacy services nobody is willing to touch. The architecture has to handle this heterogeneity without forcing a single team to become the schema police for the whole company.
The pattern that works at scale is a federated schema registry — one shared repo where every team publishes the JSON Schema for the events they emit, owned by the emitting team, reviewed by the platform team. The registry is the source of truth for "what events flow through the pipeline and what shape are they". The platform team's job is to enforce that every emission has a registered schema; the application team's job is to keep their schema accurate and version-bumped. The registry is queryable as a service (schema-registry.internal/v1/schemas/payment_failed?version=3) so the agent can pull the current schema for validation, the dashboard editor can autocomplete field names, and the alert-rule linter can flag references to fields that do not exist.
The registry is also where schema deprecation lives. When the v2-to-v3 migration happens, the registry marks v2 as deprecated with a sunset date six months out. Any service still publishing v2 records six months after the deprecation gets a CI failure on its next build (the registry serves a deprecation header, the linter checks the header, the build breaks). This is the lever that keeps the migration honest — without a sunset and a hard CI break, services drift on the old schema indefinitely and the agent's migration registry grows unboundedly. The teams that have run schema registries for two or more years (Confluent's experience with Avro/Schema Registry in Kafka, the gRPC ecosystem with Buf's breaking-change detector, Razorpay's internal observability registry since mid-2024) all converge on the same answer: deprecation has to be enforced, not suggested, because suggestions get ignored when there is a backlog.
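The build-time gate can be sketched in a few lines. The response shape and field names here are illustrative stand-ins for whatever the internal registry actually serves, not a real API:

```python
# Hypothetical sketch of the CI deprecation gate: the registry response shape
# ({"version": ..., "deprecated_sunset": "YYYY-MM-DD"}) is an assumption.
import datetime

def check_schema_allowed(registry_response: dict, today: datetime.date) -> None:
    sunset = registry_response.get("deprecated_sunset")
    if sunset and today >= datetime.date.fromisoformat(sunset):
        # Hard CI break, not a warning: warnings get ignored under backlog pressure
        raise SystemExit(
            f"schema v{registry_response['version']} sunset on {sunset}: "
            "bump schema_version before this service can build"
        )

# A service still publishing v2 after the sunset date fails its next build:
try:
    check_schema_allowed({"version": 2, "deprecated_sunset": "2026-10-01"},
                         datetime.date(2026, 11, 5))
except SystemExit as e:
    print(e)
```

The important design choice is that the gate raises rather than warns; the sunset date buys teams six months, and the hard failure is what makes the sixth month real.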
A second pattern worth knowing is dual-emission during migration. When failure_reason is being renamed to reason, the application emits both fields for one release cycle — the old field for old consumers, the new field for new consumers — and bumps schema_version to indicate that both are present and consumers should migrate. The dual-emission window is typically one to four weeks; long enough for downstream dashboards and alerts to be updated, short enough that the cost of carrying both fields does not dominate. After the window, the old field is removed and the schema bumps to the next version. Dual-emission is what makes large-scale schema changes shippable without a flag day, and it works because the cost of one extra string per log line for a few weeks is much cheaper than coordinating a synchronous rewrite of every consumer.
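A minimal sketch of what dual-emission looks like at the emit site, using the failure_reason-to-reason rename from earlier (the function and flag names are illustrative):

```python
# Dual-emission during the failure_reason -> reason rename window.
def emit_payment_failed(reason: str, dual_emission: bool = True) -> dict:
    rec = {"event": "payment_failed", "reason": reason, "schema_version": 3}
    if dual_emission:
        # Old consumers keep matching on failure_reason until their
        # dashboards and alert rules have been updated to read reason.
        rec["failure_reason"] = reason
    return rec

rec = emit_payment_failed("GATEWAY_TIMEOUT")
print(rec)
# During the window both the old query (failure_reason="GATEWAY_TIMEOUT") and
# the new query (reason="GATEWAY_TIMEOUT") match the same record. After the
# window, dual_emission flips to False and the schema bumps again.
```

The flag is typically driven by config rather than code so closing the window is a rollout, not a redeploy.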
A third pattern is drift-detection as an SLO. The platform team measures the rate of schema-violation events (records that fail validation at the agent) and treats it like any other SLO with an error budget. A budget of 0.01% of records (100 violations per million) is typical for mature pipelines; spikes above 0.1% page the team because they signify a producer emitting non-conforming records faster than the migration registry can absorb them, which usually means a service has bypassed the typed wrapper and is calling logger.info(json.dumps(...)) directly. The metric is rate(schema_violations_total[5m]) / rate(records_total[5m]) and the alert fires on multi-window burn rate exactly the way an availability SLO does; the violation events themselves are also logged (to a separate stream so they don't recurse) so the on-call can see which producer is at fault.
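The arithmetic behind that alert is the standard burn-rate calculation, shown here as a sketch with the 0.01% budget from above:

```python
# Burn rate = observed violation ratio / budgeted violation ratio.
BUDGET = 0.0001  # 0.01% of records may fail validation

def burn_rate(violations: int, records: int) -> float:
    return (violations / records) / BUDGET

# 120 violations in a million records: slightly over budget (~1.2x),
# the slow window notices eventually.
slow = burn_rate(120, 1_000_000)

# 5,000 violations in a million records: ~50x budget, the fast window
# pages immediately -- the signature of a producer bypassing the wrapper.
fast = burn_rate(5_000, 1_000_000)
print(slow, fast)
```

A multi-window rule then pages only when both a short window (e.g. 5m) and a long window (e.g. 1h) burn fast, exactly as for an availability SLO.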
Common confusions
- "JSON logs are self-describing, so the schema is enforced by the format itself." JSON specifies six wire types (string, number, boolean, null, object, array) but says nothing about which fields exist, what their semantics are, or what types each field should hold. {"amount": "4280"} and {"amount": 4280} are both valid JSON, and the format has no opinion on which is correct. The schema lives outside the JSON, in the producer's contract; without a separate enforced contract, JSON logs are exactly as drift-prone as any other format, and the JSON-ness gives a false sense of safety.
- "Schema validation at the producer is too slow for a hot logging path." A pydantic v2 model validates a typical log record in 5-15 microseconds; a dataclass with manual checks takes 1-3 microseconds; both are negligible compared to the I/O cost of writing to disk or shipping over the network. The concern was real for pydantic v1 (which validated in tens of microseconds), but pydantic v2 rewrote the validator in Rust and the per-call cost dropped 10-50x. For pipelines emitting 10,000 events per second per process, validation costs 0.05-0.15% of CPU; for any volume below that, validation is effectively free.
- "Backwards-compatible changes don't need a schema-version bump." They do, even though the change is technically compatible. Without a version bump, old consumers cannot tell the difference between "this record is from a producer that has the new field" and "this record happens to lack the new field for a different reason". The schema version is a coordination signal between producer and consumer about what shape the record commits to, not just a technical compatibility flag. Bumping the version on every change — even adding an optional field — keeps the coordination clean and the migration logic simple.
- "OpenTelemetry's schema_url solves schema versioning." OTel's schema_url field on a resource or scope points to a versioned schema document and is meant to enable forward and backward compatibility between SDK versions and collector versions for the OTel-defined attributes (semantic conventions). It does not solve schema versioning for application-level event attributes — those still need their own contract. schema_url is a useful primitive, but it covers a narrow band of the problem (the OTel core attributes), and most application drift happens in the user-defined attribute space that OTel intentionally leaves to applications.
- "Once we have pydantic at the producer, we don't need agent-side migration." Producer-side validation catches drift in new code at PR time but does nothing for records already emitted by older deployments. Until every service in the fleet is running the same schema version (which never happens in practice for fleets above a few dozen services), the agent will see records at multiple schema versions in the same hour, and the only way to give downstream consumers a single canonical shape is to migrate at the agent. Producer enforcement and agent migration are complementary, not alternatives.
- "Schema drift is an Indian-fintech problem; mature stacks have solved it." Every observability stack at every company has drift; the variation is in how visible it is and how fast it gets caught. Stripe, Spotify, Uber, and Datadog have all published postmortems where a log-schema change caused an alert to stop firing or a dashboard to read wrong; the difference between them and a less-mature stack is not the absence of drift but the presence of detection (drift SLOs, schema registries, dual-emission discipline). The default state of any sufficiently large logging system is drift; the well-run ones have built the immune system, not eliminated the disease.
Going deeper
Avro, Protobuf, and the lessons from the Kafka world
The data-engineering world solved schema evolution for streaming pipelines a decade before observability got serious about it. Apache Avro (used heavily in Kafka pipelines via Confluent's Schema Registry) and Protocol Buffers (used heavily in gRPC, OTLP, and Kafka with Schema-aware producers) both formalise the producer-consumer schema contract in a way that JSON does not. Avro carries the writer's schema with the records (or delegates to a Schema Registry that returns the schema by ID); the reader specifies its expected schema, and Avro's resolution rules describe exactly how fields are renamed, defaulted, or skipped during reading. Protobuf assigns numeric tags to every field, which makes renames free (the tag is what matters, not the name) and additions safe (unknown tags are preserved on round-trip). Neither format is a perfect fit for logs — Avro's schema-with-data is heavy for ad-hoc log records, and Protobuf's typed wire format is awkward for the free-form attribute bags that wide events want — but both ecosystems have battle-tested patterns the observability world is now adopting. Specifically: schema-id-as-prefix (every record carries a 4-byte schema fingerprint that the agent uses to look up the schema in the registry), typed defaults (every field has a well-defined default value used when the field is missing, instead of null/undefined ambiguity), and forwards-compatible change rules (adding optional fields is fine, removing required fields is not, type changes require a version bump). The OTLP wire format itself is Protobuf-based and inherits these properties for the OTel-defined attributes; the user-defined attribute space sits on top of OTLP's KeyValue map and is therefore back to needing application-level schema discipline.
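The schema-id-as-prefix pattern is simple enough to sketch end to end. This is an illustrative in-memory stand-in, not Confluent's actual wire format (which uses a magic byte plus a registry-assigned integer id rather than a hash):

```python
# Sketch: a 4-byte schema fingerprint prefixed to every record lets the
# agent resolve the *writer's* schema before parsing the payload.
import hashlib
import json

REGISTRY: dict[bytes, dict] = {}  # fingerprint -> schema (in-memory stand-in)

def register(schema: dict) -> bytes:
    fp = hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).digest()[:4]
    REGISTRY[fp] = schema
    return fp

def encode(record: dict, fp: bytes) -> bytes:
    return fp + json.dumps(record).encode()  # 4-byte prefix, then payload

def decode(wire: bytes) -> tuple[dict, dict]:
    schema = REGISTRY[wire[:4]]  # the agent looks the schema up, never guesses
    return json.loads(wire[4:]), schema

fp = register({"name": "payment_failed", "version": 3})
record, schema = decode(encode({"reason": "OK"}, fp))
print(record, schema["version"])
```

The payoff is that a record's schema version travels with the record itself, so a reader never has to infer the shape from the field names — which is exactly the inference that drift breaks.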
Why "wide events" makes drift worse, and what to do about it
The wide-events school (championed by Honeycomb and discussed in structured-vs-unstructured-logging) advocates emitting log records with hundreds of attributes — every attribute the producer has access to, on the theory that disk is cheap and you cannot predict which attribute you will need to query in a future incident. The intuitive concern is that wide events make drift worse because there are more fields that can change. In practice the relationship is more subtle: wide events have more fields, but they also have less coupling between fields (each is independently named and typed, so changing one rarely cascades), and the schema is typically machine-generated from the application's data model rather than hand-written, so renames are caught by the codegen step. The drift mode that wide events make harder is type drift on attributes that the producer constructs dynamically — attributes["custom_" + key] = value patterns where the type of value depends on runtime control flow. The defensive pattern, used by Honeycomb's own SDK and by the observability teams that emit very-wide events at scale (LinkedIn's Burrow, Uber's M3 wide-event experiments), is to enforce a typed-bag layer between application and emission: event.set_int("custom_tries", n) and event.set_str("custom_status", s) at compile/lint time make the type explicit; event.set("custom_tries", n) would let n change type without notice. The lesson is that wide events are not in tension with schema discipline — they require a different shape of discipline (typed setters, codegen, runtime asserts) but the underlying contract is the same.
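A minimal sketch of the typed-bag layer, with illustrative method names (real SDKs differ in naming but the shape is the same):

```python
# Typed setters make each attribute's type explicit at the call site,
# so runtime type drift on dynamically built attributes fails loudly.
class WideEvent:
    def __init__(self) -> None:
        self.attrs: dict[str, object] = {}

    def set_int(self, key: str, value: int) -> None:
        # bool is a subclass of int in Python; exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{key}: expected int, got {type(value).__name__}")
        self.attrs[key] = value

    def set_str(self, key: str, value: str) -> None:
        if not isinstance(value, str):
            raise TypeError(f"{key}: expected str, got {type(value).__name__}")
        self.attrs[key] = value

ev = WideEvent()
ev.set_int("custom_tries", 3)
try:
    ev.set_int("custom_tries", "3")  # the drift an untyped set() would let through
except TypeError as e:
    print(e)
```

An untyped ev.set("custom_tries", n) would accept whatever type n happens to be at runtime; the per-type setter moves the failure from the dashboard to the emit site.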
Detecting drift in production with anomaly-detection on field shapes
For drift that escapes the producer (because the producer bypassed the validator, or used a third-party library that emits its own schema, or the contract was incomplete), the only way to catch it is in production, by watching the shape of the data. The pattern is to track, per field, the JSON type distribution, the value cardinality, and the null rate in 5-minute windows, and to alert when any of these jumps. A field that has been 100% integer for six months suddenly emitting 0.1% strings is almost always a type-drift bug; a field whose null rate jumps from 0% to 8% is almost always a removal-during-rename; a field whose value cardinality doubles overnight is almost always an enum addition. The OTel Collector's groupbyattrs processor combined with a custom metrics-generation transform can produce these telemetry-of-telemetry signals; CRED's platform team published their version (an open-source Vector transform) in late 2024. The signals are noisy (legitimate releases also change distributions), but the anomalies are concentrated in time and tied to specific producers, so the on-call workflow is to receive the alert, look at the producer's most recent deploy, and ask "did this PR change the schema?" Most of the time the answer is yes, and the fix is a schema-version bump that should have happened in the PR.
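The core of such a detector fits in a short sketch. The 5% null-rate threshold and the class shape below are illustrative assumptions, not the published transform; a real implementation would also track value cardinality and window the baselines.

```python
from collections import Counter

class FieldShapeTracker:
    """Per-field JSON type distribution and null rate over a window of records."""

    def __init__(self) -> None:
        self.types: dict[str, Counter] = {}
        self.counts: Counter = Counter()
        self.total = 0

    def observe(self, record: dict) -> None:
        self.total += 1
        for field, value in record.items():
            self.counts[field] += 1
            self.types.setdefault(field, Counter())[type(value).__name__] += 1

    def anomalies(self, baseline: "FieldShapeTracker") -> list[str]:
        alerts = []
        for field, base_types in baseline.types.items():
            # Null rate jump: a field present in the baseline going missing
            # is the signature of a removal-during-rename.
            null_rate = 1 - self.counts[field] / max(self.total, 1)
            base_null = 1 - baseline.counts[field] / max(baseline.total, 1)
            if null_rate - base_null > 0.05:  # illustrative threshold
                alerts.append(f"{field}: null rate {base_null:.0%} -> {null_rate:.0%}")
            # New JSON type: the signature of type drift.
            new_types = set(self.types.get(field, {})) - set(base_types)
            if new_types:
                alerts.append(f"{field}: new JSON type(s) {sorted(new_types)}")
        return alerts

baseline, window = FieldShapeTracker(), FieldShapeTracker()
for _ in range(100):
    baseline.observe({"reason": "GATEWAY_TIMEOUT", "tries": 2})
for i in range(100):
    # 10% of the window's records dropped "reason" and stringified "tries".
    window.observe({"tries": "2"} if i < 10 else
                   {"reason": "GATEWAY_TIMEOUT", "tries": 2})
alerts = window.anomalies(baseline)
```

Running this yields one null-rate alert for reason and one new-type alert for tries, each naming the field, which is exactly the breadcrumb the on-call needs to find the rename PR.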
What the OTel Schema Specification gets right and what it punts on
The OpenTelemetry Schema Specification (released 2022, evolving since) is the most ambitious attempt at an industry-standard solution to log/metric/trace schema versioning. It defines a YAML format for declaring schema transformations between versions ("rename attribute X to Y in version 1.5.0"), a schema_url field on every record/resource that points to the schema document, and a translation primitive that collectors can implement to canonicalise records to a target schema version. What the spec gets right: the locus of migration is the collector, not the application (matching the architectural argument made earlier in this chapter); migrations are declarative, not imperative (the YAML is data, not code, so it can be reviewed, diffed, versioned); the schema covers all three signals (a single schema_url applies to logs, metrics, and traces from the same resource, so you do not have to coordinate three separate schema registries). What the spec punts on: application-level attributes. The spec covers the OTel-defined semantic conventions (http.status_code, db.statement) but explicitly says that user-defined attributes are out of scope. This is a reasonable scoping decision, but it means the bulk of an application's drift surface is still the application's responsibility. The spec is worth reading because it formalises the architectural principles, but it is not a complete solution by itself.
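The declarative-migration-applied-at-the-collector idea can be sketched without the full spec. The registry below is data, not code, which is the property the spec's YAML is after; the version numbers, rename rules, and string-sorted versions are toy assumptions, and the real spec's transformation format is considerably richer.

```python
# Declarative migration registry: reviewable, diffable data.
# Versions and renames are illustrative, not from any published schema.
MIGRATIONS: dict[str, list[tuple[str, str, str]]] = {
    "1.4.0": [("rename", "reason", "failure_reason")],
    "1.5.0": [("rename", "acquirer", "acquirer_id")],
}
# Lexicographic sort is fine for this toy while versions share digit counts.
VERSIONS = sorted(MIGRATIONS)

def canonicalise(record: dict, target: str) -> dict:
    """Apply every migration after the record's declared version up to the
    target, the way a collector/agent would, so the application never changes."""
    out = dict(record)
    # Records with no version tag are treated as pre-versioning.
    current = out.pop("schema_version", "0.0.0")
    for version in VERSIONS:
        if version <= current or version > target:
            continue
        for op, old, new in MIGRATIONS[version]:
            if op == "rename" and old in out:
                out[new] = out.pop(old)
    out["schema_version"] = target
    return out

migrated = canonicalise(
    {"schema_version": "1.3.0", "reason": "GATEWAY_TIMEOUT", "acquirer": "hdfc"},
    target="1.5.0",
)
```

Because the record carries its own version, the same function canonicalises a mixed stream of old and new producers to one shape, which is precisely what keeps the payments-api dashboard alive across a rename.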
When to break the schema vs work around it
Sometimes the right call is to break the schema deliberately. A field name that is offensive, misleading, or wrong (customer.gender as a single-byte enum that does not match the company's diversity policy; user.email that the privacy team requires removed; card.full_pan that should never have been logged at all) should be removed even though it breaks consumers. The pattern is to schedule a hard break at a known time, communicate widely, provide a backwards-compatibility layer (the field exists but is null, so the absence is graceful) for one to two weeks, and then remove the field. The architectural hygiene of a schema registry plus a migration registry makes these breaks possible without chaos, because the consumers have a single signal (a deprecation header on the schema) that tells them what is changing and when. Without the registry, every consumer has to be hunted down individually, which is how you end up with the schema staying drift-broken for two years because nobody can finish the migration. The lesson from a decade of API versioning is that the ability to break things deliberately is the discipline that makes versioning worth the cost; if you cannot break, you are paying the migration overhead with no benefit. The registry plus the version field plus the agent-side migration is the toolkit that makes deliberate breaks safe.
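The two-phase removal is mechanical enough to sketch. The field name, cutoff date, and shim shape below are illustrative assumptions about how an agent-side transform might implement the grace window.

```python
from datetime import date

# Illustrative cutoff for the hard break, communicated in advance.
REMOVAL_DATE = date(2025, 7, 1)

def apply_deprecation(record: dict, today: date) -> dict:
    """Phase 1 (before the cutoff): the field is present but null, so
    consumers see a graceful absence instead of a missing key.
    Phase 2 (after the cutoff): the field is gone entirely."""
    out = {k: v for k, v in record.items() if k != "card_full_pan"}
    if today < REMOVAL_DATE:
        out["card_full_pan"] = None
    return out

during_window = apply_deprecation(
    {"card_full_pan": "411111", "amount": 100}, date(2025, 6, 1))
after_window = apply_deprecation(
    {"card_full_pan": "411111", "amount": 100}, date(2025, 8, 1))
```

Running the shim at the agent rather than in the application means the sensitive value is stripped for every producer at once, including the ones nobody remembered to update.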
# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install pydantic loguru orjson
python3 log_schema_contract.py
python3 schema_migrate.py
# Expected: contract validation rejects four drift attempts at PR time;
# migration canonicalises four records at four schema versions to a single
# v3 shape. Combine the two scripts in a real pipeline by running the
# pydantic-validated emitter against a Vector instance configured with the
# migration registry transform.
Where this leads next
- Structured vs unstructured logging — the chapter this one builds on, covering why the schema discipline matters and what the wire-format choice unlocks. The drift problem is the price of admission for the queryability gains structured logging promises.
- Cardinality: the master variable — schema drift and cardinality drift have similar shapes (a producer-side change silently breaks a downstream consumer), and the patterns for catching them — version tags, registries, anomaly detection — overlap heavily. Treat the two together when designing the platform team's drift-detection SLO.
- Wall: logs are the oldest pillar and the most abused — drift is the third pathology in that chapter's framing, after over-emission and under-structure. The producer-side contract patterns here are the same primitives that defang all three pathologies; this chapter goes deepest on the schema-stability piece.
The next chapters in this section move from the schema-and-shape question to the query language that operates on the structured records — LogQL's grammar for label vs structured-metadata vs body queries, what each layer can and cannot express, and the latency and cost implications of each query shape. The contract this chapter described is what makes those queries reliable; without the contract, the queries are clever syntax over unreliable data, and the dashboards built on them are accidents waiting for the next rename PR.
References
- OpenTelemetry — Schema specification — the canonical spec for schema_url, schema-versioning of resource and signal attributes, and the YAML format for declaring inter-version transformations.
- Confluent — Schema Registry and Schema Evolution — the Kafka-world predecessor that observability is now borrowing from; covers compatibility modes (FORWARD, BACKWARD, FULL) and the registry-as-service pattern.
- Pydantic v2 — performance — the Rust-rewrite numbers that make producer-side validation cheap enough to be the default; per-record validation in single-digit microseconds.
- Charity Majors, "Wide Events Are the Future of Observability" (Honeycomb, 2022) — the wide-events position and the practical patterns (typed setters, codegen) for keeping wide events' schemas stable.
- Buf — breaking-change detection for Protobuf — the gRPC-world tool that catches schema breaking changes at PR time; the model the observability schema-registries are converging on.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 5 — practical guidance on schema design for log-based observability, with the foundational arguments for why schema is a contract.
- Structured vs unstructured logging — internal chapter on the prerequisite discipline that this chapter's drift-management patterns operate on top of.