Logs are the oldest pillar and the most abused

It is 03:47 IST on a Friday. A Razorpay payments-API pod has just failed a health check, the Slack channel is filling up, and Aditi — the SRE on call — opens her laptop. She does not start with the metric dashboard. She does not pull a trace. She types kubectl logs payments-api-7d8f9c-xkqj2 --tail=2000 | grep ERROR because that is what she has done since 2014 and the muscle memory is older than the company. Twenty seconds later she has the stack trace. The metric panel that finally drew her eye to the right pod was useful; the trace would have shown her the request shape. But the thing that told her why was 412 lines of unstructured text written by a logger.error() call somewhere in a payments-gateway library nobody owns anymore. The metric said "something is wrong"; the trace said "this is the path the request took"; the log said "this is the actual exception, with the actual SQL fragment, with the actual stack". Every senior engineer she has worked with has the same pattern. The log is where the truth lives, even when nobody admits it on the architecture diagram.

Logs are the oldest of the three pillars. They predate metrics-as-time-series, predate distributed tracing, predate the word "observability". A Unix system in 1972 already wrote messages to a file, and the design has barely moved — the file is now an object in S3, the messages are now JSON, the consumer is now Loki or Elasticsearch, but the contract between an engineer and her log line is the same one that shipped with syslog(3). That conservatism is the source of both the strength of logs and every operational pathology around them. This chapter is about that contract: what a log line actually promises, why it ends up being abused at scale, and what the discipline of structured wide-event logging looks like when you take the cost seriously. The concrete reader takeaway is a vocabulary — over-emission, under-structure, indefinite retention — for naming the three pathologies you will personally meet within six months of running a production logging stack at any meaningful volume.

Logs are append-only event streams that hold the highest-detail record of what your system did. Every other pillar — metrics, traces, SLOs — is in some sense a structured aggregation built on top of that record. Logs win on detail and lose on volume; the three pathologies that make them the most abused pillar are over-emission (every event becomes a log line), under-structure (free-text messages cannot be queried), and indefinite retention (yesterday's debugging never gets cheaper).

The contract a log line makes — and what every other pillar gives up

Every other primitive in observability has a shape before it has a value. A counter is a number with a name and a known set of labels. A histogram is a fixed set of buckets with counts. A span has a parent, a duration, and a status enum. A log line has none of that. It is the place where the system says "I want to record something" and the engineer who wrote the line is the one who decides what that something looks like — what fields exist, what types they have, what conventions they follow. That total absence of imposed shape is the source of every property in this chapter, good and bad. It is what makes logs a perfect debugging companion at 03:47 on a Friday and what makes them the most expensive observability spend on a Monday-morning finance review. Both come from the same root: when the data shape is decided per emit, the system handling the data has to handle every shape that has ever been emitted.

A log line is the most expressive primitive in the observability stack. It carries a timestamp, a level, a free-form message, and arbitrary key-value attributes — and it is the only pillar where the producer can put anything at all into the payload. Metrics restrict you to a number with bounded labels. Traces restrict you to spans with a fixed parent-child structure. A log line says nothing about its shape. The PHP request handler that Riya wrote in her first job at Cleartrip in 2017 emitted lines like Booking failed for PNR=TLF8K2 amount=4280 reason=GATEWAY_TIMEOUT retries=3 user_segment=GOLD ip=49.36.182.x and the queries built on top of them — grep "GATEWAY_TIMEOUT" | awk '{print $5}' | sort | uniq -c — were the entire production debugging toolkit of the company for years.

That maximal expressiveness is also why logs are the most expensive pillar at scale. A Razorpay-shaped fleet that emits 1,200 payments-events per second per pod across 800 pods generates roughly 960k log lines per second. At 1.4 KB average line size — which is conservative once you include trace IDs, request IDs, customer attributes, and a stack trace on every fifth line — that is ~1.34 GB per second of raw log volume. Compress at 8× (gzip on JSON is generous), index 30 days, and you are storing ~430 TB of indexed log data for the payments service alone. Compare that to metrics: the same fleet emits roughly 14k active series per pod after careful labelling, total ~11M series; at a 15-second scrape and ~1.3 bytes per Gorilla-compressed sample, that is roughly 85 GB of Prometheus blocks per day, or ~2.5 TB over the same 30 days. Per "thing observed", logs are roughly a thousand times more expensive than metrics, and that ratio is not a tooling artefact — it is fundamental to the data shape.

The expressive freedom also comes with a discipline cost the other pillars do not pay. A metric is born well-defined: http_requests_total{method="POST",status="500",service="payments-api"} — every label is named, every value is a number, every dimension is bounded by the cardinality of the label values. A trace span is born well-defined: a name, a duration, a parent_span_id, and a structured set of attributes. A log line is born as a string. Whether trace_id lives at the top level of the JSON or nested under meta, whether the timestamp is ts or time or @timestamp, whether level is INFO or info or 20 (Python's numeric levels) — all of these are decisions every codebase makes independently and inconsistently, and the agent in the middle has to coerce them into a single shape before the backend can index them. A four-year-old codebase with logs from twelve different versions of three different frameworks will have at least four different timestamp formats and the agent's parsing rules grow to match. This is the work that does not show up in any architecture diagram but consumes most of the platform team's time during a logging migration.

[Figure: Log volume vs metric volume — same workload, different shapes. Two panels compare the per-second data shape of the same payments fleet. Left, logs: 960,000 events/sec (800 pods × 1,200/sec), ~1.4 KB per line, ~1.34 GB/sec raw, ~430 TB indexed over 30 days; cost grows linearly with traffic. Right, metrics: ~11M active series (14k per pod × 800 pods), ~1.3 bytes per Gorilla-compressed sample, a few TB over 30 days; cost grows with cardinality. Two log-scale bars highlight the roughly thousand-fold gap per observed event.]
Illustrative — same workload, two pillar shapes. Logs hold every event verbatim; metrics aggregate into pre-defined time-series. The detail-vs-cost axis is the single most important trade-off in observability.

There is one more property of logs worth highlighting before the cost discussion: logs are append-only and totally-ordered per emitter. A counter increment can be lost (the scrape window covered it but the value got overwritten by a newer one) or aggregated away; a span can be sampled-out at the trace tail; but a log line, once flushed, is a discrete record that the downstream pipeline either has or does not have, and the order in which two log lines from the same process leave the process is the order they will sit in the storage backend. That total ordering is what makes logs the reliable backbone for forensics — when something has gone wrong, the sequence of log lines from the affected process is the most complete reconstruction of the in-process state available, ordered as the process saw events. Metrics smear time across scrape boundaries; traces carry causal but not absolute order. Logs are a tape recording, and that property is what every other pillar lacks. The cost of that property is what this whole chapter is about.

Why that thousand-fold ratio is fundamental, not a tooling problem: a log line stores (timestamp, level, message, attributes) for every event — N events make N records. A metric stores (name, labels, sample[]) and aggregates all events for a (name, labels) pair into a single time-series; a counter that increments 960k times in a second produces one sample at scrape time, not 960k. The compression ratio on metrics comes from this aggregation step, not from byte-level encoding — Gorilla XOR is doing single-digit-byte savings on top of an already-aggregated structure. You cannot apply the same trick to logs without losing the per-event detail that is the entire point of having logs in the first place. Aggregating logs into a counter is a metric. Aggregating logs into a sampled stream is a trace. Once you stop being event-level, you are no longer a log.
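
To make that arithmetic concrete, here is the same comparison as a few lines of Python, using the illustrative fleet numbers above (event rate, line size, series count, and scrape interval are the assumptions, not measurements):

# Back-of-envelope: per-event log records vs aggregated metric samples.
EVENTS_PER_SEC   = 960_000       # 800 pods x 1,200 events/sec
LOG_LINE_BYTES   = 1_400         # average structured line, incl. trace ids
ACTIVE_SERIES    = 11_200_000    # ~14k series/pod x 800 pods
SCRAPE_SECONDS   = 15
BYTES_PER_SAMPLE = 1.3           # Gorilla-compressed Prometheus sample

log_bps    = EVENTS_PER_SEC * LOG_LINE_BYTES                      # one record per event
metric_bps = ACTIVE_SERIES * BYTES_PER_SAMPLE / SCRAPE_SECONDS    # one sample per series per scrape

print(f"logs   : {log_bps / 1e9:.2f} GB/s")                       # ~1.34 GB/s
print(f"metrics: {metric_bps / 1e6:.2f} MB/s")                    # ~0.97 MB/s
print(f"ratio  : {log_bps / metric_bps:,.0f}x")                   # ~1,400x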

The three pathologies — over-emission, under-structure, indefinite retention

Every team that runs logs at scale eventually meets the same three pathologies, in the same order, usually as separate tickets six months apart that nobody connects until the postmortem. Understanding them as a single shape is what separates a team that pays for logs by the petabyte from one that pays for them by the terabyte and gets more debugging value out of the smaller bill.

The script below stands up a tiny structured-logging pipeline using loguru, emits a synthetic burst of payment events, and computes the three pathologies as concrete numbers — over-emission rate, structure ratio, and retention cost — for the run. It is the same shape of measurement Razorpay's platform team runs as a weekly job to catch log-bill regressions before finance does.

# log_pathologies.py — measure over-emission, structure, retention cost
# pip install loguru orjson
import gzip, random, time
import loguru, orjson

logger = loguru.logger
logger.remove()  # drop the default stderr sink

# Serialise each record as a compact JSON wide event: level, constant msg,
# plus every field attached via logger.bind(...).
def serialize(record) -> str:
    line = {"level": record["level"].name, "msg": record["message"], **record["extra"]}
    return orjson.dumps(line).decode()

def patcher(record):
    record["extra"]["serialized"] = serialize(record)

logger = logger.patch(patcher)

# Sink that captures lines into a list so we can measure them
captured: list[str] = []
def sink(message):
    captured.append(message.rstrip("\n"))
logger.add(sink, format="{extra[serialized]}", level="DEBUG")

# Simulate a payment service emitting events for 1 second of traffic
random.seed(13)
EVENTS_PER_SEC   = 1_200
ERROR_RATE       = 0.008   # 0.8% errors
DEBUG_RATE       = 0.40    # 40% debug-level chatter
PAYMENTS         = ["UPI", "CARD", "NETBANKING", "WALLET"]
MERCHANTS        = [f"M{i:05d}" for i in range(2_000)]

start = time.time()
for i in range(EVENTS_PER_SEC):
    method   = random.choice(PAYMENTS)
    merchant = random.choice(MERCHANTS)
    amount   = random.randint(100, 50_000)
    is_error = random.random() < ERROR_RATE
    is_debug = random.random() < DEBUG_RATE

    payload = {
        "ts": time.time(), "service": "payments-api", "region": "ap-south-1",
        "method": method, "merchant_id": merchant, "amount_paise": amount,
        "trace_id": f"{random.getrandbits(64):016x}",
    }
    if is_error:
        payload["error"] = "GATEWAY_TIMEOUT"
        payload["upstream"] = "razorpay-acquirer-3"
        logger.bind(**payload).error("payment_failed")
    elif is_debug:
        logger.bind(**payload).debug("payment_attempt")
    else:
        logger.bind(**payload).info("payment_succeeded")

elapsed = time.time() - start

# Three pathology measurements
total_lines = len(captured)
debug_lines = sum(1 for l in captured if '"level":"DEBUG"' in l)
raw_bytes   = sum(len(l) for l in captured)
gz_bytes    = len(gzip.compress("\n".join(captured).encode()))

# Structure ratio: of the captured lines, what fraction parses as JSON?
def is_json(line):
    try: orjson.loads(line); return True
    except Exception: return False
structured = sum(1 for l in captured if is_json(l))

print(f"events emitted        : {total_lines:,} in {elapsed*1000:.0f} ms")
print(f"debug-level lines     : {debug_lines:,} ({debug_lines/total_lines*100:.0f}% of total)")
print(f"raw bytes             : {raw_bytes:,}  ({raw_bytes/total_lines:.0f} B/line avg)")
print(f"gzip bytes            : {gz_bytes:,}  ({gz_bytes/raw_bytes*100:.0f}% of raw)")
print(f"structured fraction   : {structured/total_lines*100:.0f}%")
print(f"projected 30-day cost : {(raw_bytes*86400*30)/(1024**4):.2f} TB raw  /  "
      f"{(gz_bytes*86400*30)/(1024**4):.2f} TB compressed")
print(f"if DEBUG dropped      : {((raw_bytes - sum(len(l) for l in captured if 'DEBUG' in l))*86400*30)/(1024**4):.2f} TB raw")

Sample run on a 2024 MacBook Air (exact numbers vary with library versions and hardware):

events emitted        : 1,200 in 38 ms
debug-level lines     : 478 (40% of total)
raw bytes             : 412,344  (344 B/line avg)
gzip bytes            : 67,128  (16% of raw)
structured fraction   : 100%
projected 30-day cost : 0.97 TB raw  /  0.16 TB compressed
if DEBUG dropped      : 0.58 TB raw

The script measures three things at once, and each maps to one pathology: the debug-level fraction is the over-emission rate (here 40% of the lines are chatter nobody will ever query), the structured fraction is the inverse of under-structure (here 100%, because every line is a JSON wide event), and the projected 30-day figure is the retention cost, which keeps growing until somebody writes a policy.

The reason these three pathologies compound — rather than being independent line items — is that each one masks the others. Over-emission inflates the dataset, which makes under-structure expensive (every unstructured line you cannot query is paid for in storage even though it has no value), which makes indefinite retention painful (you keep things you can't query because dropping them feels like throwing away signal). Fix any one in isolation and the other two close half the gap automatically: drop DEBUG and the unstructured-fraction matters less; structure your logs and the retention question becomes "which streams" rather than "all of it"; tier retention and the over-emission cost stops growing linearly with calendar time. The teams that keep their log bill under control are the ones that recognise this loop and hit all three at once.

Why the gzip ratio of 16% is misleading as a cost lever: log lines contain repeated structural tokens — JSON keys ("trace_id", "merchant_id", "region"), constant string values ("ap-south-1", "payments-api"), and similar fields across adjacent lines compress aggressively because gzip's sliding-window dictionary catches the repetition. But this only helps the storage tier, not the index. Loki and Elasticsearch both build inverted indexes on labels and tokens, and those indexes are not compressed at the same ratio — Loki's chunks compress to roughly 8-12% of raw, but the per-stream label index sits in a separate boltdb-shipper structure that does not benefit from the same dictionary repetition. When the cost line item on your bill is "ingest + index", gzip ratio on the chunks is doing maybe a third of the work; the rest is paid as label cardinality. The trick to cheap log retention is therefore not "compress harder" but "have fewer streams" — which means keeping log labels low-cardinality and pushing the high-cardinality fields into the structured payload where they are searched, not indexed.

What logs are good at — and the questions only logs can answer

Before diving deeper into the discipline, it is worth being honest about what logs uniquely buy you, because the over-emission pathology often hides behind the assumption that "more logs = more visibility". Logs are the right primitive for exactly four kinds of question, and the right pillar for almost nothing outside that set.

The first kind is per-event forensics. "Show me everything that request r-7d8f9c did across all services" is a question only logs (or traces, if you have full trace retention) can answer, because the answer is a sequence of specific events with their specific attributes — the SQL query that ran, the HTTP body that was sent, the exception that was thrown. A counter that ticked is irrelevant; you need the actual event. This is the on-call engineer's primary use of logs, and it is the use that justifies most of the cost.

The second kind is rare-event detail. "Show me all the times the gateway returned a 502 between 02:00 and 04:00 last night" — when the events themselves are infrequent (a few hundred a day, not millions), keeping every one of them with full detail is cheaper and richer than aggregating into a histogram and losing the per-event attributes. Anything below ~100 events/sec is in this regime; anything above probably wants a metric for the count and a sampled subset of logs for the detail.

The third kind is audit and compliance. "Show me every payment-state-transition for transaction T14829204, in order, with the actor for each transition" — this is a structurally log-shaped question because the audit record is the event-by-event history. RBI, PCI-DSS, and SOC 2 all expect this shape; a metric cannot replace it because the regulator wants the per-event record, not the aggregate.

The fourth kind is bootstrap before metrics exist. The first three months of a new service, before anyone has wired up the metric instrumentation, the logs are everything. You ship logger.info() calls liberally, learn what is happening from production traffic, and then promote the high-frequency ones into Prometheus counters as the service matures. This is a legitimate pattern; the failure mode is forgetting the second step and leaving the verbose log emission on forever.

Anything outside these four categories — counts, rates, latency distributions, dashboards, alerts on aggregate behaviour — is a metric question, not a log question. Asking it of the log pipeline pays the per-event cost without buying the per-event detail. This is the structural reason the over-emission pathology is so persistent: every "we should log this" decision is locally justifiable, but the aggregate decision should usually have been a metric. The discipline is to ask, before adding a logger.info(), whether the question this log is for is in the four-kinds list. If it is not, it should be a counter, gauge, or histogram, and the log call should not exist.

How structured logging changes the contract

The single largest improvement you can make to a log pipeline is to stop emitting f"User {user_id} did {action} at {timestamp}" and start emitting logger.bind(user_id=user_id, action=action).info("user_action"). Both produce a line. The first is unparseable at query time; the second is queryable as {service="payments"} | json | user_id="U10234" in LogQL or user_id:"U10234" in Elasticsearch DSL. The structural shift turns logs from a grep corpus into a queryable event stream, and it is the foundation of every modern logging stack.

Loguru, structlog, Python's logging with a JSONFormatter, and Java's Logback with logstash-logback-encoder all emit the same shape — a JSON object per line with a timestamp, a level, a message, and arbitrary attributes nested in. The convention that has settled across the industry (called structured logging or sometimes wide events) is that the message stays short and constant — "payment_failed", not "payment failed for merchant M01023 amount 4280 reason GATEWAY_TIMEOUT" — and all the variable parts go into the attributes. The reason this matters is that the message text is what becomes a log-stream label in Loki and an inverted-index term in Elasticsearch; if it changes per event, you blow up the cardinality of the label or term and the storage system slows down or rejects the writes.

[Figure: Unstructured vs structured logs — same event, different queryability. Top: an unstructured f-string line, "2026-04-25T11:23:14 INFO Payment failed for merchant M01023 amount 4280 reason GATEWAY_TIMEOUT user_segment GOLD", which supports only grep-style substring search; range filters on amount, group-bys per merchant, and percentiles all fail. Bottom: the equivalent structured JSON line with discrete fields (msg, merchant, amount, reason, user_segment, trace_id), which supports field-level filters and aggregations with a low label budget — only msg/level/service indexed, amount and merchant in the payload.]
Illustrative — the structural choice between unstructured f-strings and JSON wide events decides which queries are possible. The same information lives in both forms; only the structured form lets the backend index, group, and aggregate it.

The trap is the natural drift back to f-strings. New engineers join, copy a log call from elsewhere in the codebase, and write logger.error(f"Failed to charge {customer.id} for {amount}: {exception}") because that is what every Stack Overflow answer shows. The codebase regresses one log call at a time. The fix is lint-level enforcement — a pre-commit hook that flags any logger.<level>(f"...") calls in production code paths and forces the structured form. Razorpay added this rule to their CI in 2022 and the structured fraction across their services climbed from 38% to 94% over six months without any campaign or refactor.
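
What such a check can look like: a minimal AST-based sketch, not any company's actual rule. The logger method names and the command-line shape are assumptions; it flags only f-string literals passed as the first argument.

# flag_fstring_logs.py — pre-commit sketch that flags logger.<level>(f"...") calls.
import ast, sys

LEVELS = {"debug", "info", "warning", "error", "critical", "exception"}

def violations(path: str):
    tree = ast.parse(open(path).read(), filename=path)
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in LEVELS
                and node.args
                and isinstance(node.args[0], ast.JoinedStr)):  # f-string literal
            yield node.lineno

if __name__ == "__main__":
    bad = [(p, ln) for p in sys.argv[1:] for ln in violations(p)]
    for path, lineno in bad:
        print(f"{path}:{lineno}: use logger.bind(...).level('event_name'), not an f-string")
    sys.exit(1 if bad else 0)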

The other discipline that pays off is propagating the trace context into every log line. If your service is already instrumented with OpenTelemetry, the active span carries a trace_id and span_id that uniquely identify the request in flight. Bind those into every logger.info() call (the loguru bind() and structlog contextvars patterns make this a one-liner per request) and your logs become navigable from a trace — open the slow span in Tempo, copy its trace_id, paste it into Loki, and you have every log line that handler emitted. Without that bind, the trace and the log live in separate worlds and the only way to correlate them is the timestamp, which is fragile across machines with skewed clocks. Swiggy's order-pipeline team wrote up their 2024 migration to this pattern and reported that median time-to-resolution on customer-facing tickets dropped from 28 minutes to 9 minutes — not because they got faster at querying, but because the trace and log surfaces stopped being two separate manual joins.
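
A minimal sketch of that bind, assuming the service already has the OpenTelemetry SDK configured; the field names trace_id and span_id follow the convention used throughout this chapter:

# Propagate the active OpenTelemetry span context into every loguru line.
from loguru import logger
from opentelemetry import trace

def add_trace_context(record):
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        record["extra"]["trace_id"] = format(ctx.trace_id, "032x")
        record["extra"]["span_id"] = format(ctx.span_id, "016x")

logger = logger.patch(add_trace_context)

# Inside a request handler, every line now carries the ids of the active span:
logger.bind(merchant_id="M01023", amount_paise=4280).info("payment_attempt")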

A subtle structural decision is what becomes a label vs what stays in the payload. The label set in Loki, or the indexed-fields set in Elasticsearch, controls cardinality directly: a label service with 50 distinct values produces 50 streams; adding user_id to the label set with 14 million distinct values produces 14 million streams and breaks the backend. The right discipline is the same as for metrics — labels are for the small, slow-changing dimensions you query on (service, level, region, cluster, environment), and the high-cardinality identifying fields (user_id, trace_id, merchant_id, request_id) live in the structured payload where they are searched by the per-chunk scanner, not indexed. Most "Loki got slow last week" incidents resolve to a single PR that added a high-cardinality field as a label, and most fix PRs are a one-line change that moves the offending field out of the label set and back into the structured payload.
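
The stream arithmetic is worth seeing once. The sketch below computes the worst-case stream count as the product of label cardinalities (actual stream counts are bounded by the combinations that really occur, but the blow-up direction is the same); the cardinality numbers are illustrative:

# Streams in Loki: one per unique combination of label values (worst case shown).
from math import prod

safe_labels = {"service": 50, "level": 5, "region": 4, "env": 3}
print(f"{prod(safe_labels.values()):,}")       # 3,000 streams, fine

with_user_id = {**safe_labels, "user_id": 14_000_000}
print(f"{prod(with_user_id.values()):,}")      # 42,000,000,000 potential streams, the backend is gone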

The wider observation is that structured logging is what makes log-as-pillar fungible with the other pillars. A structured log line carries a trace_id, so it can be joined to a span in Tempo. It carries a service and a level, so it can drive an alert rule the same way a metric does. It carries enumerable fields like method and region, so a query sum by (region) (rate({service="payments-api"} | json | level="error" [5m])) produces a metric-shaped time-series derived from the log stream — Loki and Elasticsearch both support this kind of log-to-metric extraction. Once you are emitting structured wide events, the boundary between "log" and "metric" stops being a property of the data and becomes a property of the query: the same event can be aggregated into a counter, scanned for a specific incident, or correlated with a trace, depending on which question you are asking. Charity Majors's "logs are streams not files" essay frames this as the core insight of the modern observability era, and it is — but it only works if every log line is structured. An f-string log line is permanently stuck being a log; only a wide event can be promoted into the other pillars.

The retention question — what nobody actually queries past day 4

The hardest log conversation is the one with whoever owns the budget. Every team wants 90-day retention "for the audit". The audit, when you actually ask, is a regulatory requirement that 90 days of aggregate transaction records be retained — not 90 days of every DEBUG line every microservice ever emitted. The Reserve Bank of India's payment-aggregator guidelines specifically require transaction logs (the structured payment events with PII redacted), not application traces. PCI-DSS requires audit logs from systems that handle card data, with specific event types listed. Neither regulation requires 90-day retention of payment_attempt lines from a payments service.

The data shape that matters here is how recently a log line was last queried, not how recently it was written. A log line written at 03:14 on Monday gets queried roughly 8-12 times in the next 4 hours (the on-call doing the immediate triage), maybe 2-3 times in the next 4 days (the postmortem investigation), and then approximately zero times forever. The empirical curve decays sharply — every published study from a large-shop logging postmortem (Cloudflare, Datadog's customer-aggregate numbers, Grafana Cloud's product-team reports) shows somewhere in the 90-99% range of queries hitting data less than 7 days old, and Hotstar's internal numbers from the 2023 IPL season landed in the same shape, with 99.4% of queries in the 0-72 hour window. Tiering retention to that curve is the single biggest cost lever in the log pipeline: keep roughly the first week hot and indexed for on-call triage, move everything older to compressed object storage that is scanned on demand for postmortems and audits, and delete past the regulatory horizon.

A fleet that spends ₹6 lakh/month on flat-30-day-hot retention typically spends ₹1.8 lakh/month on the same coverage tiered correctly — the saved 70% pays for the engineer who set up the tiering, with change. The reason most teams still pay the flat rate is the friction of deciding what tier each stream belongs in, which is itself a structured-logging problem. If your payment-success logs are queryable by service and level labels, you can write a Loki retention policy service=~".*", level=info, retention=7d and a separate level=error, retention=90d and the tiering happens automatically. If your logs are unstructured grep-corpus, every line gets the same retention because the system has no way to discriminate.
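
In Loki, that kind of per-stream policy is a few lines of compactor-driven retention config. A sketch of the shape; the field names follow recent Loki documentation, but defaults and exact semantics vary by version, so treat it as illustrative rather than copy-paste:

# loki config fragment — tiered, per-stream retention via the compactor
compactor:
  retention_enabled: true

limits_config:
  retention_period: 90d              # fallback for anything not matched below
  retention_stream:
    - selector: '{level="error"}'
      priority: 2
      period: 90d                    # errors stay queryable for the audit window
    - selector: '{level=~"info|debug"}'
      priority: 1
      period: 7d                     # routine chatter ages out after a week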

There is a corollary worth pulling out about retention: the dashboards and alerts you build today will outlive the people who built them, and dashboards that query logs older than a few days have a habit of failing silently when you compress retention. A dashboard panel that says count_over_time({service="payments-api", level="error"}[7d]) works on a fleet with 7-day hot retention; reduce hot retention to 24 hours without telling the dashboard owner and the panel quietly returns wrong numbers (zero, or a partial count) and the on-call uses those numbers to make decisions. The discipline is to declare retention as explicit per-stream policy — committed to the same git repo as the dashboards — and to add a CI check that warns when a dashboard query window exceeds its data source's retention. Almost no team does this. Almost every team eventually has the postmortem that begins with "the dashboard said zero errors but there were 14,000".
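
The check itself is small. A sketch, assuming Grafana dashboards stored as JSON under dashboards/ and a hand-maintained map of hot retention per datasource; both the layout and the map are assumptions for illustration:

# check_dashboard_retention.py — warn when a panel's query window exceeds hot retention.
import json, re, sys, pathlib

RETENTION_DAYS = {"loki-prod": 7}            # datasource uid -> hot retention (days)
RANGE = re.compile(r"\[(\d+)d\]")            # only catches day-granularity LogQL ranges

failures = []
for path in pathlib.Path("dashboards").glob("*.json"):
    dash = json.loads(path.read_text())
    for panel in dash.get("panels", []):
        ds = panel.get("datasource")
        uid = ds.get("uid", "") if isinstance(ds, dict) else str(ds)
        for target in panel.get("targets", []):
            for days in map(int, RANGE.findall(target.get("expr", ""))):
                if days > RETENTION_DAYS.get(uid, 10**9):
                    failures.append(f"{path}: panel '{panel.get('title')}' queries "
                                    f"[{days}d] but {uid} keeps {RETENTION_DAYS[uid]}d hot")

print("\n".join(failures) or "all dashboard query windows fit their retention")
sys.exit(1 if failures else 0)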

A second pattern that compounds the saving is per-stream sampling at the agent. The Vector or Fluent Bit pipeline sitting between your application and the log backend can apply a per-stream sample rate — keep 100% of level=error, 100% of level=warn, 10% of level=info for high-volume services like a CDN edge or an order-routing service, and 1% of level=debug. The arithmetic is brutal: a 10% INFO-sample on a service that emits 80% of its volume at INFO drops the total bill by ~72%, and the "loss" is per-event detail you would only query during an incident — but for incidents you have already kept every error and warn, and the 10% sample of INFO statistically catches anything that recurs for more than a few seconds. The trick is doing the sampling at the agent, not at the application — application-side sampling scatters the decision across every codebase, and changing a rate means redeploying the fleet, whereas agent-side sampling is one config file change that takes effect fleet-wide. Cred's platform team reported in their 2024 SREcon talk that moving from "log everything at INFO from every service" to "sample INFO at 5% per high-volume service, keep error/warn at 100%" cut their monthly Loki bill by 64% and not one engineer noticed the change in debugging quality — the errors they actually queried for were never sampled out.
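
A sketch of what that looks like in a Vector config. The route and sample transforms are real Vector primitives, but the source name, the level field, and the sink details are assumptions to adapt:

# vector.toml fragment: errors/warns shipped at 100%, INFO sampled at 10%.
[transforms.by_level]
type = "route"
inputs = ["kubernetes_logs"]
route.keep_all = '.level == "error" || .level == "warn"'
route.info     = '.level == "info"'

[transforms.sample_info]
type = "sample"
inputs = ["by_level.info"]
rate = 10                          # keep 1 in 10 INFO lines

[sinks.loki]
type = "loki"
inputs = ["by_level.keep_all", "sample_info"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels = { service = "{{ service }}", level = "{{ level }}" }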

Why "tiered retention" is more powerful than "shorter retention": shorter retention loses information uniformly — you keep nothing past day 7. Tiered retention loses random-access speed uniformly past day 7 but still keeps the data, so the rare audit query that reaches into day-45 still works (just slower). The cost difference is roughly 100×, which means tiering past day-7 is essentially free compared to keeping it hot. The only cost is query latency on cold data, which is acceptable for the post-incident or audit query patterns that hit cold data — nobody on call at 03:00 is querying day-45. The mistake teams make is treating retention as a single dial rather than a per-tier schedule, and once they make that mistake the only available cost lever is the blunt "shorten everything" which loses real value.

A practical observation: most of the cost wins in this chapter compound. Drop DEBUG, structure the lines, sample INFO, tier the retention, redact PII at the agent — each is a single config or PR, and applied together they typically cut a flat-rate logging bill by 60-75% with no loss of debugging signal. The reason most fleets do not get there is not that the techniques are exotic; they are well-documented in every tool's introductory guide. The reason is that each technique requires a specific person to own it, and in most organisations the team that ships logs (application teams) is not the same team that owns the bill (platform team) is not the same team that owns the dashboards (SRE) is not the same team that owns the alerts (ops). Without one team that owns the outcome — the cost-per-incident-resolved ratio — every individual decision is locally rational and globally pathological. The first observability article a platform team should write is the one that names the owner.

Failure modes the docs do not warn you about

Three failure modes show up in production log pipelines that no introductory documentation calls out, and recognising the shape early saves the on-call team several hours of "why is the dashboard wrong" debugging.

The first is log loss during pod termination. When Kubernetes sends SIGTERM to a pod, the application has a terminationGracePeriodSeconds (default 30s) to exit cleanly, but the log-collecting agent (Fluent Bit, Vector) running as a sidecar or DaemonSet may have a buffer of unsent log lines that the central Loki has not yet acknowledged. If the pod terminates before the buffer drains, those lines are lost — and the failure is silent, because the agent itself goes down with the pod. The fix is two-part: give the agent enough room to drain its buffers on shutdown (Fluent Bit's Grace setting in the [SERVICE] section, Vector's graceful-shutdown timeout), and in the application, log "shutting down" at INFO before exiting so the on-call can later check Loki for the line; if that line is missing, the pod terminated faster than the agent could ship, and the remedy is a wider grace period or smaller batches. Hotstar's video-pipeline team lost six hours of session-close logs to this exact failure during the 2023 IPL final because a rolling deploy set a termination grace period shorter than the time Vector needed to flush.

The second is stdout blocking the application under log-volume backpressure. Most container runtimes (Docker, containerd) write stdout to a file on the node, and if the disk is slow or the file rotates, a print() or logger.info() call can block the application's main thread for tens to hundreds of milliseconds. A payments service that calls logger.info() synchronously in its hot path during a log-burst can see its own p99 latency double for the duration of the burst, with the symptom looking like "the database got slow" because the application traces show time spent in the post-DB code path. The fix is to use the asynchronous logging modes (loguru's enqueue=True, Python logging's QueueHandler+QueueListener, Logback's AsyncAppender) which decouple the application thread from the I/O. The trade is a small in-process queue that drops log lines if the queue fills under sustained backpressure — which is exactly the right trade, because the application surviving is more important than every log line being preserved.
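
A sketch of both async modes side by side; the queue size and handler targets are illustrative choices, not recommendations:

# Decouple log emission from I/O so a slow sink cannot stall the request path.
import logging, sys
from logging.handlers import QueueHandler, QueueListener
from queue import Queue
from loguru import logger

# loguru: one flag on the sink moves all writes onto a background thread
logger.add(sys.stdout, serialize=True, enqueue=True)

# stdlib logging: explicit queue + listener thread does the same decoupling
q = Queue(maxsize=10_000)          # bounded: under backpressure, drop rather than stall
listener = QueueListener(q, logging.StreamHandler(sys.stdout))
listener.start()

log = logging.getLogger("payments")
log.addHandler(QueueHandler(q))
log.propagate = False
log.error("handler_failed")        # returns immediately; the I/O happens on the listener thread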

The third is log lines bigger than the ingest limit. Loki's default per-line limit is 256 KB; Elasticsearch's is 100 MB but the practical limit on a single document is much smaller before query performance degrades. A stack trace from a deeply-nested microservice failure can easily exceed 100 KB, especially in Java or Python with framework-heavy stacks. When the line exceeds the limit, the agent either truncates silently or rejects with a 413, and either way the on-call engineer sees "the trace is missing the actual error" precisely when they need it most. The fix is to log structured exception data — logger.bind(exception_type=type(e).__name__, exception_message=str(e), traceback=traceback.format_exc()[:8000]).error("handler_failed") — with the traceback explicitly truncated, and ship the full traceback to a separate sink (a "crash dump" object store) when it exceeds the threshold.

Why these three failures share a shape: in all three cases, the failure is at the boundary between the application and the log pipeline — the moment of pod termination, the moment of stdout write back-pressure, the moment of agent-side parsing. The application sees no error (the log call returns successfully or print blocks invisibly), the agent sees a partial signal (a flush timeout, a 413, a truncation), and the backend sees missing or malformed data. The on-call engineer sees only the missing data, several layers downstream, and has no clean way to back-trace which boundary was where the loss happened. The defensive pattern across all three is the same: log a heartbeat of every state transition the application makes (startup_complete, request_received, handler_done, shutdown_initiated) at INFO with structured attributes, so that the absence of a heartbeat in Loki distinguishes "log pipeline failed" from "application failed". Without those heartbeats, every log gap becomes a forensic exercise; with them, the gap shape itself is a signal.
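
A sketch of the heartbeat discipline with loguru; the state names are the ones used above, and the extra fields are illustrative:

# One structured line per lifecycle transition, so a gap in Loki can be
# classified as "pipeline lost it" vs "the process never got there".
from loguru import logger

def heartbeat(state: str, **fields):
    logger.bind(heartbeat=True, state=state, **fields).info("lifecycle")

heartbeat("startup_complete", version="2026.04.1")
heartbeat("request_received", request_id="r-7d8f9c")
heartbeat("handler_done", request_id="r-7d8f9c", duration_ms=42)
heartbeat("shutdown_initiated", reason="SIGTERM")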

Common confusions

Going deeper

What syslog settled in the 1980s and what a 2026 log line adds on top

The syslog daemon shipped with 4.2BSD in 1983 and defined a contract that has propagated almost unchanged into every modern logging stack: a log message has a facility (which subsystem produced it: kernel, mail, auth, daemon, user, local0-7) and a severity (emergency, alert, critical, error, warning, notice, info, debug). Combined as <facility * 8 + severity>, that single byte is the priority field that still appears in RFC 5424 syslog-format messages and that Linux's journalctl -p err still parses. Every modern logging library — Python's logging, Go's log/slog, Java's slf4j, Rust's tracing — has a level enum that is one-to-one with syslog severities, and every cloud log backend (CloudWatch, Stackdriver, Loki, Elastic) accepts a level field in its ingest API for routing. The decisions baked into syslog — that severity is one-dimensional, that facility is enum-not-string, that the priority is per-line not per-process — are foundational and almost never reconsidered. The one decision that did get reconsidered is the message format itself: syslog's <priority>timestamp hostname tag: message was strictly textual and unparseable at scale, which is why every modern shop layers JSON-on-top while keeping the level enum from 1983.
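
The arithmetic is small enough to show whole; the facility and severity codes below are the standard syslog values:

# priority = facility * 8 + severity, exactly as syslog defined it.
FACILITIES = {"kern": 0, "user": 1, "mail": 2, "daemon": 3, "auth": 4, "local0": 16}
SEVERITIES = {"emerg": 0, "alert": 1, "crit": 2, "err": 3,
              "warning": 4, "notice": 5, "info": 6, "debug": 7}

def pri(facility: str, severity: str) -> int:
    return FACILITIES[facility] * 8 + SEVERITIES[severity]

print(pri("daemon", "err"))    # 27 — the <27> prefix on an RFC 5424 message
print(pri("local0", "info"))   # 134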

What modern structured logs added on top is roughly twelve standard fields, and recognising what each is for is the difference between a useful pipeline and a parsing nightmare. ts (or time, @timestamp) — the wall-clock timestamp, RFC 3339 with timezone, microsecond resolution ideally. level — the syslog severity. msg (or message, event) — the constant event name like payment_failed, never an f-string. service — the emitting service, low cardinality, becomes a label. env — production/staging/dev, low cardinality, becomes a label. trace_id and span_id — OpenTelemetry context propagated from the active span; high cardinality, stays in payload. request_id — per-request UUID emitted at the entry edge, useful for joining logs even when tracing is off; high cardinality, payload. user_id / merchant_id / tenant_id — business identifiers; high cardinality, payload. error_type — exception class name on errors; low cardinality, can be a label. duration_ms — numeric, payload, queried as > / <. The label-vs-payload split is the single most important performance decision in the pipeline, and getting it wrong is the difference between a backend that costs ₹1.8 lakh and one that costs ₹6 lakh for the same coverage.
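
Assembled, a single wide event carrying those fields looks roughly like this (the values are invented; per the split above, only level, service, env, and perhaps error_type would be indexed as labels, and everything else stays in the payload):

{
  "ts": "2026-04-25T11:23:14.104882+05:30",
  "level": "error",
  "msg": "payment_failed",
  "service": "payments-api",
  "env": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "9f86d081-884c-4d63-a2fe-1f2b3c4d5e6f",
  "merchant_id": "M01023",
  "error_type": "GatewayTimeout",
  "duration_ms": 3012
}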

Loki's index-vs-chunk split — why it costs what it does

Loki was built explicitly around the insight that log indexes are expensive and most queries hit recent data with a known label scope. The architecture splits each log stream into two pieces: labels (a small set of low-cardinality fields like service, level, cluster that get indexed in a boltdb-shipper structure) and chunks (the actual log line bodies, gzip-compressed and stored in S3 or GCS). A LogQL query {service="payments-api", level="error"} |= "GATEWAY_TIMEOUT" first uses the label index to locate the chunks for (service=payments-api, level=error) over the time range, then linearly scans those chunks for the substring GATEWAY_TIMEOUT. The label index is small and fast; the chunk scan is bounded by the label filter. This is structurally cheaper than Elasticsearch's "index every token in every field" approach by roughly 5-10× on storage and similar on ingest, with the trade that any-field-substring searches are slower because they cannot use a token index — Loki has to scan. The lesson Loki encodes is that most log queries are label-scoped substring searches, and you can build a much cheaper system if you optimise for that pattern at the cost of slowing down full-text queries that hit the long tail.

Vector, Fluentd, and the agent that does the real work

Most production logging pipelines today put a routing/filtering agent (Vector, Fluentd, Fluent Bit, OpenTelemetry Collector) between the application and the storage backend, and that agent is where the actual cost-control and compliance happens. The agent reads log lines from stdout/stderr (or a file, or a Kubernetes-pod tail), applies parsing rules to enforce structure, drops or samples lines based on level/source/content, redacts PII, and routes different streams to different sinks. The pattern that Razorpay's platform team settled on in 2023 is: ERROR-level lines go to Loki with 90-day retention and PagerDuty integration, INFO lines go to Loki with 7-day retention, DEBUG lines get sampled at 1% and sent to S3-only-no-index, and any line containing a PAN/Aadhaar/mobile regex match gets routed through a redaction transform before any sink sees it. Done in Vector with about 200 lines of TOML, the routing replaces a previous Fluentd setup that cost roughly ₹4 lakh/month more on backend ingest.

The redaction step is the part that cannot live in the application. Indian fintech and health-tech logs end up with PII by accident — a stack trace that includes the request body, a debug log that prints customer.dict() and pulls in the PAN, an error message that says "Aadhaar 1234-5678-9012 not found in DB". Once that line is in the log backend, it is in backups, in long-term archive, and arguably in violation of the DPDP Act 2023 which requires deletion-on-request. The discipline that has held up across every shop that has faced an audit is redaction at the agent — a Vector transform that runs every log line through a regex set (\b\d{4}[ -]?\d{4}[ -]?\d{4}\b for Aadhaar, [A-Z]{5}\d{4}[A-Z] for PAN, the 10-digit mobile pattern, the email pattern) and replaces matches with [REDACTED:AADHAAR] etc. Doing this at the agent means a single application logging mistake does not ship raw PII to backups; doing it in the application means trusting every developer in every service to remember the rule on every call, which is a discipline failure waiting to happen. Paytm, Zerodha, and CRED all ship a shared agent config of about 40 redaction rules; the platform team owns the rules, application teams cannot accidentally bypass them. The agent is not glamorous — it is the boring middle layer most architecture diagrams gloss over — but it is where the difference between a ₹6 lakh and a ₹1.8 lakh log bill, and between an audit-pass and an audit-fail, actually lives.
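
A sketch of that transform in Vector's VRL, using the regexes described above (the mobile-number pattern is an assumption; the inputs reference the sampling fragment sketched earlier, so it slots between sampling and the sinks):

# vector.toml fragment: agent-side PII redaction with a remap transform.
[transforms.redact_pii]
type = "remap"
inputs = ["by_level.keep_all", "sample_info"]
source = '''
msg = string!(.message)
msg = replace(msg, r'\b\d{4}[ -]?\d{4}[ -]?\d{4}\b', "[REDACTED:AADHAAR]")
msg = replace(msg, r'\b[A-Z]{5}\d{4}[A-Z]\b', "[REDACTED:PAN]")
msg = replace(msg, r'\b[6-9]\d{9}\b', "[REDACTED:MOBILE]")
.message = msg
'''

The sinks then take their inputs from redact_pii rather than from the raw streams, so a single application logging mistake cannot ship unredacted PII past the agent.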

Why log-to-metric extraction is cheaper than parallel emission

A pattern that has solidified across mature observability stacks is using the log stream as the source-of-truth for derived metrics, rather than emitting the log line and the metric independently. Loki's metrics query (rate({service="payments-api", level="error"}[5m])) and Elasticsearch's transform API both convert a log stream into a counter or histogram on the fly. The reason this matters for cost: the alternative — having the application emit logger.error("payment_failed", merchant=...) and errors_total.labels(merchant=...).inc() — pays the data path twice, once at log volume and once at metric volume, and the two streams drift if a developer adds the logger.error and forgets the metric increment (or vice versa). Extracting the metric from the log eliminates the drift, halves the emission cost, and keeps the metric definition in the same place as the log line. The trade is that the metric is only as fast as the log query, which on Loki means 5-30 second freshness rather than the 15-second scrape cadence of native Prometheus — usually fine for SLO panels and dashboards, sometimes too slow for tight burn-rate alerts. For the high-frequency-write tier you keep a small set of native Prometheus counters; for the long-tail of "what was the rate of this rare event grouped by some attribute" questions, log-derived metrics win on both cost and consistency.

Coordinated emission — the synchronised log burst

The same synchronised-burst problem that hits push-collected metrics hits logs, and worse. Every service that does periodic work (cache refresh, batch aggregator, cron-style cleanup) emits a flurry of log lines at the same moment of the minute or hour, and the pipeline downstream sees a periodic spike rather than smooth ingest. The Loki ingester's per-stream rate-limit defaults to roughly 3 MB/s; a payments fleet that synchronises batch-completion logs at :00 of every minute can push a stream over its limit for a few seconds and trigger 429-rejected lines (which are silently lost unless the agent is configured to retry with backpressure). The fix is the same as for metrics — jitter the emission — but logs add a wrinkle: the emission cadence is usually controlled by application code (a for loop that logs each iteration), not by a flush timer, so jittering means changing the application logic rather than the agent config. Hotstar's video-pipeline team learned this during the 2024 IPL season when their session-cleanup job logged every closed session at the same moment and the resulting 12-second log gap (caused by the ingester rejecting bursts) showed up as a "missing data" panel on the dashboards used to debug the very session-cleanup job. The fix was a time.sleep(random.uniform(0, 0.1)) between loop iterations — a handful of lines that resolved the burst.

# Reproduce this on your laptop
python3 -m venv .venv && source .venv/bin/activate
pip install loguru orjson
python3 log_pathologies.py
# Expected: ~1200 events, ~40% DEBUG; exact line size and gzip ratio depend on the serialisation.
# Drop the DEBUG lines to see the cost-reduction lever first-hand.
# To compare to a real Loki ingest, run a local Loki via docker
# (docker run -p 3100:3100 grafana/loki) and ship via python-logging-loki.

There is also a chapter on cardinality coming up that addresses the label-vs-payload split rigorously, with explicit cardinality budgets and the arithmetic of what happens when each one is exceeded. The intuition this chapter builds — that high-cardinality fields go in the payload, low-cardinality in the label — gets formalised there.

A short note on tooling choices before moving on: nothing in this chapter is specific to a particular log backend. The pathologies (over-emission, under-structure, indefinite retention) and their fixes (level discipline, structured wide events, tiered retention) apply identically to Splunk, Elasticsearch, Loki, Honeycomb, ClickHouse-as-log-store, and the half-dozen other systems shops choose between. The choice between them is a separate, mostly orthogonal decision that depends on query patterns (label-scoped substring vs full-text), team familiarity, and existing infrastructure. The discipline this chapter prescribes survives any of those choices; the choice that does not survive is "we will figure it out later".

Where this leads next

The next chapter in this section moves from "what is a log" to "how do you query a billion of them efficiently" — Loki's LogQL grammar, Elasticsearch's index strategies, and the structural reason that label-vs-payload separation determines query throughput. The discipline that started with a print() and a tail-and-grep ends in a system that holds petabytes and answers questions in seconds — and the journey is mostly about which fields you index, not about how fast the disks are.

A line worth carrying forward into every chapter that follows: the goal is not "more logs" or "better logs in isolation". It is the right log line in the right stream, indexed by the right labels, retained for the right duration, with the right detail in the payload, so that the on-call engineer who opens her laptop at 03:47 IST has the answer to "what happened" within the first twenty seconds of the incident. Every operational decision in a logging pipeline — agent config, retention policy, sampling rate, structured-vs-text — eventually traces back to that one outcome. The systems that get this right are not the ones with the cleverest backend; they are the ones that treat logs as a finite budget and spend it where the questions actually get asked.

The next few chapters in Part 2 build on this foundation. They walk the structured-event discipline through to its conclusions: how Loki and Elasticsearch differ at the index layer and how that determines which queries are cheap; how log-to-metric extraction makes the same data answer two kinds of question without paying twice; how the per-line cost trade-offs from this chapter manifest in the per-query latency of the dashboards built on top. By the end of the section, the reader should be able to look at a logging architecture diagram for any production fleet and say which pathologies it has built in, where its cost-per-incident-resolved sits relative to its peers, and which one config change would cut the bill the most without losing signal. That diagnostic ability — not raw familiarity with any particular tool — is what makes the difference between a team that runs logs and a team that lets logs run them.

A closing thought before the references: the title of this chapter calls logs "the most abused" pillar, and that framing is deliberate. It is not that logs are bad; it is that the freedom they offer — write anything, anytime, with any structure — has no built-in pushback against the natural tendency to overuse them. Metrics push back: every new label costs cardinality, every cardinality-blow-up shows up immediately as a Prometheus OOM. Traces push back: every new span costs context-propagation work, every new attribute costs storage in the sampled tail. Logs do not push back until the bill arrives, and by then the codebase has six years of logger.info(f"...") calls in it and "fix the logs" is a six-month engineering project. The discipline this chapter describes is not glamorous and not new; it is the discipline of pushing back early, while the codebase is small enough that the pattern can still be set, and the team is small enough that one person can own the outcome.

References