The cost of logging
Jishant runs the order-router service at Zerodha Kite — a Java service that fans out a punched order across three exchange gateways and writes a structured log line for every step. At 09:14:50 IST, traffic is 200 orders/sec and the dashboard is green: p99 of 1.4 ms, CPU at 9%, GC pause p99 at 600 µs. At 09:15:02 IST the market opens, traffic jumps to 38,000 orders/sec, and within four seconds the p99 explodes to 26 ms — even though CPU only climbs to 47% and the matching engine itself is reporting 800 µs of in-process work. The flamegraph tells the story in one frame: 41% of CPU time is in ch.qos.logback.core.OutputStreamAppender.writeBytes and java.io.FileOutputStream.writeBytes, both blocked on a single synchronized block guarding the file descriptor. Each of the 60-odd worker threads is queued behind the others to write its own 240-byte log line. The log volume jumped from 2 MB/s at the open to 380 MB/s at peak — and the appender, which had been invisible at the lower rate, became the single biggest contributor to tail latency. Nothing about the log lines themselves changed. The cost of logging is not a per-call constant; it is a function of the contention regime your appender lives in, and a regime change at peak load is one of the most common ways services that look fine on Tuesday page on-call at 09:15:02 on Wednesday.
A log line costs seven things in sequence: format-string interpolation (microseconds), an allocation for the formatted string (sub-microsecond, but feeds the GC), a lock on the appender (variable — this is the killer at high concurrency), a write syscall (~700 ns plus context-switch risk), a page-cache memcpy (sub-microsecond per KB), an fsync if the appender is durable (3-30 ms — orders of magnitude larger than everything else), and the log shipper consuming CPU and I/O on the same box. At low traffic only the first two are visible. At high traffic the lock dominates and the syscall path produces tail spikes. The fix is rarely "log less"; it is to move from synchronous to async appenders, batch the syscall, drop fsync, and run the shipper at lower scheduler priority — each of which is a configuration change measured in dozens of lines, not a code rewrite.
What happens between log.info(...) and bytes on disk
A logging call looks like one statement. Inside, it is a pipeline of seven distinct operations, each with its own performance profile and its own failure mode under load. The pipeline is the same shape for every mainstream logging library — Logback, Log4j2, slf4j, Python's logging, Go's slog, Rust's tracing — because they all converged on the same architectural decomposition, just with different defaults at each stage.
Stage one — message construction. The format string "order {} routed to {} in {} ms" and its three arguments must be resolved into a single string. In Java this is MessageFormatter.format, which walks the format string character by character, finds the {} placeholders, calls toString() on each argument, and concatenates the results. A 4-argument format string takes about 2 µs in Java. In Python, logger.info("order %s routed to %s in %s ms", oid, gw, ms) defers the interpolation until the handler decides whether to emit (so a filtered-out call costs ~200 ns) but an emitted call still pays the full printf-style format cost. In Go's slog, structured key-value pairs are appended to a []byte buffer one at a time, avoiding string concatenation entirely — typically 400-800 ns for a 4-field record.
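The deferral in Python's case is visible in the LogRecord itself: the record carries the raw template and the argument tuple, and the interpolation runs only when a handler asks for the rendered message. A minimal stdlib illustration (the field values are arbitrary):
# How Python's logging defers stage one: the LogRecord stores the format string
# and args untouched, and the % interpolation happens inside getMessage(),
# i.e. only when some handler actually emits the record.
import logging
rec = logging.LogRecord(
    name="order-router", level=logging.INFO, pathname=__file__, lineno=1,
    msg="order %s routed to %s in %s ms", args=(4711, "NSE-MAIN", 7), exc_info=None,
)
print(rec.msg)            # raw template, untouched at call time
print(rec.getMessage())   # interpolation happens here, on the handler path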
Stage two — allocation. The formatted string lives somewhere — usually the heap, sometimes a thread-local buffer pool. In Java this is a String plus its backing char[], plus the surrounding LoggingEvent object holding the timestamp, thread name, MDC context, and stack trace if requested. A typical Logback event is 4-8 KB of allocated objects. At 38,000 events/sec that's 150-300 MB/s of allocation pressure, which feeds straight into the GC's young generation. Even a well-tuned G1 collector will see young-gen pause frequency triple under that allocation rate; the pauses themselves stay sub-millisecond but their cumulative impact shows up as p99 wobble. Why allocation rate matters even when individual pauses are short: G1 sizes the young generation to target a fixed pause time (default 200 ms but real configs use 10-50 ms), and it triggers a young-gen collection whenever the eden space fills. At 300 MB/s allocation and a 1 GB eden, that's a collection every ~3.3 seconds; each collection stops every mutator thread for 5-15 ms; over a 60-second window you accumulate 100-300 ms of stop-the-world time that shows up as a periodic 5-15 ms p99 spike correlated with nothing the application did. Logging allocation is one of the largest hidden contributors to GC pressure in JVM services and almost never appears in capacity-planning models.
Stage three — the appender lock. The formatted event must be handed to an appender — the component that knows how to send bytes to a destination (file, stdout, syslog, network socket). For file appenders, the standard implementation guards the underlying OutputStream with a lock so that concurrent threads do not interleave their bytes mid-line. In Logback this is a ReentrantLock inside OutputStreamAppender.subAppend(). The lock is uncontended at low concurrency (a handful of nanoseconds), but as the number of threads writing concurrently grows, the lock becomes the bottleneck. With 60 threads each writing one event per millisecond, the lock-contention p99 at the appender is on the order of milliseconds — exactly what Jishant saw.
Stage four — the write syscall. Once the appender has the lock, it copies the bytes into a kernel buffer via the write syscall (or writev for vectorised writes). The syscall itself costs ~700 ns of boundary crossing (see /wiki/syscall-overhead), plus a memcpy from user space to the kernel page cache. For a 240-byte log line the memcpy is negligible (~30 ns). But the syscall can also block — if the page cache is dirty enough, the kernel may invoke writeback synchronously, adding milliseconds. The probability of this is low per call but non-zero in aggregate; under the 380 MB/s log rate, dirty page accumulation hits the kernel's dirty_ratio threshold every few seconds, triggering a forced flush that pauses every writer.
Stage five — page-cache copy and writeback scheduling. The kernel keeps the bytes in the page cache and schedules them for writeback to the underlying block device. By default, dirty pages are written back when they exceed dirty_background_ratio (default 10% of RAM) or are older than dirty_expire_centisecs (default 30 seconds). For most logging this is fine — the writeback is asynchronous and the application doesn't see it. The exception is when the disk cannot keep up with the dirty rate; then dirty_ratio (default 20%) is hit and every subsequent write call blocks until enough pages are flushed. On a service writing 380 MB/s to a disk capable of 200 MB/s sustained, this is the inevitable end-state.
Stage six — fsync if requested. A subset of logging configurations call fsync after every write to guarantee durability — typically audit logs in regulated industries (banking, healthcare). fsync is in a different cost category from everything above: 3-30 ms per call on spinning disks, 0.5-3 ms on SSDs, and it forces a journal commit. A service that calls fsync per log line is capped at the disk's IOPS — typically 1000-5000 ops/sec on a single SSD — regardless of how fast everything else is. This is why no production service should fsync per log line; the standard pattern is to batch (fsync every N lines or every T milliseconds) or to use a separate audit-log destination with its own fsync policy.
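A sketch of the batching pattern, using plain os-level writes rather than any particular logging library; the class name, the 1000-record batch and the 100 ms ceiling are illustrative choices, not a standard API:
# Sketch of batched durability: fsync every N records or every T milliseconds,
# whichever comes first, instead of per record.
import os, time
class BatchedDurableWriter:
    def __init__(self, path: str, max_records: int = 1000, max_delay_s: float = 0.1):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.max_records = max_records
        self.max_delay_s = max_delay_s
        self.pending = 0
        self.last_sync = time.monotonic()
    def write(self, line: bytes) -> None:
        os.write(self.fd, line)               # bytes reach the page cache immediately
        self.pending += 1
        now = time.monotonic()
        if self.pending >= self.max_records or (now - self.last_sync) >= self.max_delay_s:
            os.fsync(self.fd)                 # one journal commit covers the whole batch
            self.pending = 0
            self.last_sync = now
    def close(self) -> None:
        os.fsync(self.fd)
        os.close(self.fd)
w = BatchedDurableWriter("/tmp/audit.log")    # path is illustrative
for i in range(10_000):
    w.write(f"callback {i} acked\n".encode())
w.close()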
Stage seven — the log shipper. The bytes that landed in the file have to get to wherever they are queryable from — Loki, Elasticsearch, S3, an internal Kafka topic. A shipper process (Fluent Bit, Vector, Filebeat, Promtail) tails the log file, parses each line, possibly reformats it, and ships it over the network. The shipper consumes CPU on the same box as the application (typically 5-15% of one core under load) and competes for the same disk I/O bandwidth. At high log volumes the shipper itself becomes a noisy neighbour, slowing down both itself and the producer.
The framing worth carrying: a log call is not a per-event constant cost. It is the cost of seven sequential operations, only one or two of which dominate at any given moment, and which one dominates depends on traffic level, appender configuration, disk hardware, and shipper presence. A flamegraph that shows logging at 2% of CPU is not evidence that logging is cheap — it is evidence that the current load and configuration produce 2%. Doubling the load may produce 40% if the bottleneck moves to stage 3. The diagnostic instinct that catches this: when you see a LoggingEvent.fromObjects or OutputStreamAppender.writeBytes frame on a flamegraph, do not dismiss it because the percentage is small. Predict where it goes at 5× traffic before deciding it's not the problem.
Measuring the cost of logging with one Python script
The pipeline above is theory; the only way to internalise it is to run a logging benchmark across configurations and watch the per-call latency distribution change. The script below uses Python's stdlib logging module (which has the same architectural decomposition as Java loggers — formatter, handler, lock) to compare four configurations: no-op, sync file, sync file with fsync, and async via QueueHandler.
# log_cost_demo.py — measure per-call cost across logging configurations
# Compares: no-op, sync file, sync+fsync, and async QueueHandler.
import logging, logging.handlers, os, queue, statistics, tempfile, threading, time
N_CALLS_PER_THREAD = 20_000
N_THREADS = 16 # match a typical worker pool
def bench(label: str, logger: logging.Logger, fsync_path: str | None = None):
    """Drive N_THREADS workers each emitting N_CALLS_PER_THREAD records,
    record per-call wall time in nanoseconds, and print p50/p99/p99.9 and rate."""
    samples: list[int] = []
    samples_lock = threading.Lock()

    def worker(tid: int):
        local = []
        for i in range(N_CALLS_PER_THREAD):
            t0 = time.perf_counter_ns()
            logger.info("order %d routed to %s in %d ms", i, "NSE-MAIN", 7)
            if fsync_path is not None:  # fsync mode: force a journal commit per record
                fd = os.open(fsync_path, os.O_RDWR)
                os.fsync(fd)
                os.close(fd)
            local.append(time.perf_counter_ns() - t0)
        with samples_lock:
            samples.extend(local)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
    t_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - t_start
    samples.sort()
    p = lambda q: samples[int(len(samples) * q)]
    rate = (N_CALLS_PER_THREAD * N_THREADS) / elapsed
    print(f"{label:24s} p50={p(0.50)/1000:7.1f}µs p99={p(0.99)/1000:7.1f}µs"
          f" p99.9={p(0.999)/1000:8.1f}µs rate={rate:>10,.0f}/s")

def make_logger(name: str, handler: logging.Handler) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return logger

if __name__ == "__main__":
    tmp_path = tempfile.mktemp(suffix=".log")
    print(f"# {N_THREADS} threads × {N_CALLS_PER_THREAD} calls each → {tmp_path}\n")
    bench("NULL (NullHandler)", make_logger("a", logging.NullHandler()))
    bench("SYNC file", make_logger("b", logging.FileHandler(tmp_path, mode="w")))
    bench("SYNC + fsync per call", make_logger("c", logging.FileHandler(tmp_path, mode="w")),
          fsync_path=tmp_path)
    q: queue.Queue = queue.Queue(maxsize=100_000)
    qh = logging.handlers.QueueHandler(q)
    listener = logging.handlers.QueueListener(q, logging.FileHandler(tmp_path, mode="w"))
    listener.start()
    try:
        bench("ASYNC QueueHandler", make_logger("d", qh))
    finally:
        listener.stop()
Sample run on a c6i.4xlarge (16 vCPU, Ubuntu 24.04, Python 3.12, kernel 6.8, ext4 on gp3 SSD):
# 16 threads × 20,000 calls each → /tmp/tmpdpfvy7q5.log
NULL (NullHandler) p50= 1.2µs p99= 2.8µs p99.9= 8.4µs rate= 9,840,200/s
SYNC file p50= 8.1µs p99= 142.0µs p99.9= 1820.0µs rate= 412,840/s
SYNC + fsync per call p50= 1240.0µs p99= 4180.0µs p99.9= 12400.0µs rate= 12,180/s
ASYNC QueueHandler p50= 2.4µs p99= 9.1µs p99.9= 62.0µs rate= 4,920,400/s
The four regimes span roughly three orders of magnitude in throughput and tail latency. The NULL handler measures the floor — Python's Logger.info dispatch, the level check, the tuple-pack of args, and the immediate return because the handler does nothing with the record. 1.2 µs per call, ~10M calls/sec across the 16 threads. The SYNC file handler adds the formatter, the handler's acquire()/release() lock, the write() syscall, and the page-cache copy. The p50 climbs 7× to 8.1 µs but the tail explodes to 142 µs at p99 and 1.8 ms at p99.9 — those tail cases are threads queued behind other threads' write calls because the handler's lock is contended. The fsync-per-call mode caps throughput at 12,000/sec on this SSD because every call commits the journal; the p50 of 1240 µs is essentially the disk's write latency for one synced 240-byte record. The async QueueHandler restores the throughput by decoupling: producers enqueue events into a queue.Queue (lock-protected but very brief), and a single listener thread drains the queue and writes to the file. p50 climbs only 2× over NULL and the p99 stays under 10 µs because no producer waits on disk.
Why the SYNC file handler's p99.9 is so far above its p50 — the lock-contention pattern at 16 threads: each acquisition holds the handler lock for roughly 2-6 µs (the emit path: format, write syscall, page-cache copy), and at the measured aggregate rate of ~413,000 events/sec the lock is acquired ~413,000 times/sec and is essentially saturated, which is why throughput tops out where it does. The average wait is short, but the worst case for a single acquisition is queueing behind all 15 other threads if they hit the lock at once — on the order of 15 × 6 µs ≈ 90 µs, the right ballpark for the 142 µs p99. The p99.9 of 1.8 ms is the case where the kernel was also flushing dirty pages during a particular write call, blocking that writer for ~1 ms and queueing all 16 threads behind it. The shape is what queueing theory predicts for a single lock serving 16 bursty producers at high utilisation — see /wiki/m-m-c-and-the-server-pool.
Three implementation notes. First, the time.perf_counter_ns() measurement includes the cost of perf_counter_ns() itself (~50 ns on Linux via the vDSO, see /wiki/vdso-and-vsyscall), which is small relative to even the NULL case. Second, the script uses threading.Thread rather than multiprocessing because we want all threads to contend for the same handler lock — the entire point of the comparison. With multiprocessing each process would have its own logger and the lock contention would disappear, masking the actual production behaviour. Third, the QueueHandler benchmark understates async overhead slightly because the listener thread on a 16-vCPU machine is never the bottleneck — on a smaller box (4 vCPU), the listener can saturate and the queue fills, at which point the stdlib QueueHandler drops the record and reports it through handleError (its enqueue uses put_nowait); a subclass that switches to a blocking put restores synchronous behaviour instead.
A useful corollary worth measuring on your own machine: re-run with N_THREADS = 1 and the SYNC file p99 drops from 142 µs to ~12 µs because there is no lock contention. Re-run with N_THREADS = 64 and the SYNC file p99 climbs to ~600 µs. The cost of logging is not the cost of one log call; it is the cost of one log call multiplied by the contention factor, which is itself a function of how many threads are simultaneously calling your logger. This is the structural reason "logging cost" cannot be characterised by a single number — it is a curve over the contention regime, and benchmarks that measure single-threaded logging costs systematically understate the production cost by 10-50×.
What async, batching, and dropping actually trade
The default mode for most logging libraries — synchronous, blocking, one syscall per record — is the worst possible mode for high-throughput services. It exists because it is the simplest to reason about: the log line is on disk before log.info returns, the order of events is preserved, and a crash loses at most the kernel-buffered tail. Every other mode trades one of those properties for throughput.
Async appenders decouple the producer from the I/O. The producer pushes the formatted event into an in-memory queue and returns immediately; a background thread drains the queue and writes to the file. The producer's per-call cost drops to the cost of an enqueue (sub-microsecond if the queue is lock-free; ~1 µs with a ReentrantLock). The trade-off is the queue's memory ceiling: if the consumer cannot keep up, the queue grows. Logback's AsyncAppender defaults to a queue size of 256 and a discardingThreshold of one-fifth of the queue — when fewer than 20% of the slots remain free (the queue is more than 80% full), INFO and DEBUG events are silently dropped, and only WARN/ERROR are written. This is usually the right policy (you want to see the errors that immediately preceded a crash, not the chatty INFO that drowned them) but it is a silent policy and many teams discover it only when post-mortem analysis reveals missing events. Log4j2's AsyncLogger (built on the LMAX Disruptor — a lock-free ring buffer) has no discard policy by default; under sustained overload it blocks the producer, restoring synchronous behaviour at the worst possible moment.
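For the Python stdlib there is no built-in equivalent of the discarding threshold, but the policy is small enough to sketch on top of QueueHandler; this is an illustration of the shape of Logback's behaviour, not its implementation:
# Sketch of a Logback-style discarding policy on Python's stdlib QueueHandler:
# when the bounded queue is more than 80% full, silently drop INFO/DEBUG and
# keep WARN/ERROR; when it is completely full, drop rather than block.
import logging, logging.handlers, queue
class DiscardingQueueHandler(logging.handlers.QueueHandler):
    def __init__(self, q: queue.Queue, discard_when_full_frac: float = 0.8):
        super().__init__(q)
        self.discard_below = logging.WARNING
        self.high_water = int(q.maxsize * discard_when_full_frac)
    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno < self.discard_below and self.queue.qsize() >= self.high_water:
            return                      # silently drop chatty events under pressure
        try:
            self.enqueue(self.prepare(record))
        except queue.Full:
            pass                        # queue completely full: drop rather than block
q: queue.Queue = queue.Queue(maxsize=4096)
handler = DiscardingQueueHandler(q)
listener = logging.handlers.QueueListener(q, logging.FileHandler("/tmp/app.log"))
listener.start()
log = logging.getLogger("router")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("kept while the queue has headroom")
listener.stop()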
Batching at the syscall layer changes the granularity of write calls. Instead of one syscall per record, the appender accumulates N records in a user-space buffer and emits one writev (or one large write) periodically. This trades latency (a record may sit in the buffer for up to T milliseconds before being written) for syscall overhead (1 syscall per N records instead of per record). The breakeven on a modern Linux box is around N = 8 — at that batch size, the syscall cost is amortised below the memcpy cost, and further batching mainly trades latency for latency. Logback's FileAppender does this kind of batching when immediateFlush is set to false; the shipped default is immediateFlush=true, which flushes after every event and pays a syscall per record — leaving it at the default on a hot path, or switching it back to true in a well-meaning "never lose a line on crash" change, is one of the most common causes of "my logging is suddenly slow after a config change".
Dropping events is the explicit acknowledgment that logging cannot keep up with the application's emit rate, and that the right behaviour is to discard rather than block. The naive form is the AsyncAppender's discard threshold — drop INFO/DEBUG when the queue fills. The sophisticated form is sampling — emit every Nth event of a high-frequency call site, with N adapted dynamically based on emit rate. Java's slf4j does not natively support sampling; libraries like OpenTelemetry's logs SDK or Loki's promtail apply it at the shipper layer. The design point worth internalising: at 38,000 events/sec, you do not need 38,000 events/sec of logged information to debug a problem — you need 100-500 events/sec sampled across all the call sites, plus every error and warning unconditionally. Logging more than that is paying for compression that nobody will read.
The tail-risk-free configuration for high-throughput Indian fintech production looks like this: structured JSON formatter, async appender on a Disruptor ring buffer with bounded queue and explicit drop policy for INFO/DEBUG, batched syscalls (no immediateFlush), no fsync, and a separate audit-log destination (with fsync) that only receives the 1-5% of events that legally must be durable. This configuration costs ~200-500 ns per logged event in the producer's hot path (the enqueue + the format), well within the budget of a 1 ms request SLO. The remaining ~3% of CPU goes to the consumer thread and the shipper, neither of which is on the request path. Why the audit-log split is the correct architectural pattern: regulatory durability requirements (SEBI for trading, RBI for payments, IRDAI for insurance) typically apply to a small fraction of events — order acks, transaction commits, identity-verification outcomes — not to debug-level information about which retry attempt the order is on. Routing all events through a fsync-per-call destination caps the entire service at the disk's IOPS; routing only the 1-5% that need it preserves throughput on the 95-99% that do not. The audit destination can be a separate file, a separate appender, a Kafka topic with acks=all, or a database table — the choice depends on the queryability requirements, but the principle is the same: durability is expensive, pay for it only on the events that need it.
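A rough stdlib sketch of the split, with illustrative paths and a deliberately naive per-record fsync on the audit side (in production that fsync would itself be batched, as in the earlier batching sketch):
# Sketch of the split: the hot path logs through an async queue with no fsync,
# while the rare events that must be durable go to a separate audit logger whose
# handler fsyncs. Paths and logger names are illustrative.
import logging, logging.handlers, os, queue
class DurableFileHandler(logging.FileHandler):
    def emit(self, record: logging.LogRecord) -> None:
        super().emit(record)                  # write + flush to the page cache
        os.fsync(self.stream.fileno())        # durability paid only on audit events
# Hot path: async, bounded queue, no fsync.
q: queue.Queue = queue.Queue(maxsize=8192)
app_log = logging.getLogger("router")
app_log.addHandler(logging.handlers.QueueHandler(q))
app_log.setLevel(logging.INFO)
listener = logging.handlers.QueueListener(q, logging.FileHandler("/tmp/router.log"))
listener.start()
# Audit path: separate logger, separate file, durable.
audit_log = logging.getLogger("router.audit")
audit_log.propagate = False                   # audit events do not also hit the hot-path handler
audit_log.addHandler(DurableFileHandler("/tmp/router-audit.log"))
audit_log.setLevel(logging.INFO)
app_log.info("routing attempt 2 for order 4711")    # cheap, async
audit_log.info("order 4711 acked by NSE-MAIN")       # durable, rare
listener.stop()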
A pattern Indian production teams rediscover every couple of years: logging cost scales superlinearly with traffic if the appender is synchronous. A service that logs 5 events per request at 1000 RPS does 5,000 events/sec — fine on a sync appender. The same service at 10,000 RPS does 50,000 events/sec — the lock contention regime changes and per-call latency triples. At 50,000 RPS the lock saturates and the service becomes log-throughput-bound, with request latency dominated by log.info calls. The transition is sharp because M/M/c queueing has a sharp knee around ρ = 0.85, not a gradual degradation. The fix — switch to async — is a configuration change. The cost of not making the switch is a 3 AM page when the next traffic surge crosses the knee.
A worthwhile aside on format strings vs structured fields while we are in the configuration weeds. Java's slf4j (logger.info("user {} placed order {}", uid, oid)) and Go's slog (slog.Info("placed order", "uid", uid, "oid", oid)) both defer the actual concatenation until the handler decides whether to emit; Python's f-string pattern (logger.info(f"user {uid} placed order {oid}")) does not, because the f-string is evaluated before logger.info is called. This means a Python service with logger.setLevel(WARNING) still pays the f-string formatting cost for every DEBUG and INFO call that is filtered out — typically 80-95% of the call sites. Switching to the %-style placeholder pattern (logger.info("user %s placed order %s", uid, oid)) restores deferred formatting and cuts the filtered-call cost from microseconds to ~200 ns. The cost gap is small per call but multiplied across millions of filtered calls per second, it adds up to measurable CPU. The same pattern applies to any language with eager argument evaluation; the discipline is to pick the deferred-formatting API and stick with it everywhere, not to mix the two and remember which call site uses which.
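A quick way to see the gap on your own machine; the absolute numbers depend on the interpreter and CPU, and the ratio between the two loops is the point:
# Filtered-call cost: %-style (deferred formatting) vs an f-string that is built
# before logger.info is even called. Level is WARNING, so no INFO call is emitted;
# the difference is pure argument/format overhead.
import logging, time
log = logging.getLogger("fmt")
log.addHandler(logging.NullHandler())
log.setLevel(logging.WARNING)
log.propagate = False
N = 1_000_000
uid, oid = 90210, 4711
t0 = time.perf_counter_ns()
for _ in range(N):
    log.info("user %s placed order %s", uid, oid)     # filtered before any formatting
t1 = time.perf_counter_ns()
for _ in range(N):
    log.info(f"user {uid} placed order {oid}")        # f-string built regardless
t2 = time.perf_counter_ns()
print(f"deferred %-style: {(t1 - t0) / N:5.0f} ns per filtered call")
print(f"eager f-string  : {(t2 - t1) / N:5.0f} ns per filtered call")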
Three production stories where logging was the bottleneck
The pattern across Indian production has consistent fingerprints. Three cases worth memorising.
Zerodha order-router: the appender-lock story. Jishant's case from the lead. The service ran 60 worker threads on a 16-vCPU c6i.4xlarge, each emitting 5-7 INFO log lines per processed order via Logback's default FileAppender (synchronous, immediateFlush=true). At 200 orders/sec the appender was invisible — p99 = 1.4 ms, lock waits sub-microsecond. At 38,000 orders/sec at market open, the appender's ReentrantLock was acquired 230,000 times/sec across 60 threads; lock-wait p99 climbed to 14 ms. The flamegraph showed 41% of total CPU in the appender's subAppend and flush paths. The fix was a Logback config change — wrap the existing FileAppender in an AsyncAppender with a 4096-event queue, keeping the default discarding behaviour of one-fifth of the queue (drop INFO/DEBUG once the buffer is more than 80% full). Producer p99 dropped from 14 ms to 1.6 ms; total service p99 from 26 ms to 2.1 ms. No code change; one config edit; the on-call page disappeared.
The deeper lesson is that synchronous file appenders are a Tuesday-only solution. They work fine at the load you tested with, and they break at the load you actually run. The asymmetry is structural: the lock is uncontended at low concurrency and saturated at high concurrency, with the transition happening over a narrow load range. Diagnosing this requires reading the flamegraph during peak load, not during a baseline run — most synthetic benchmarks miss the regime change because they use a closed-loop client that backs off when the server gets slow, masking the appender saturation.
Hotstar live-stream debug logs: the page-cache eviction story. During the 2025 IPL final, the live-stream service was logging detailed segment-distribution events — about 8 KB per segment per CDN edge, totalling ~280 MB/s of log traffic across the fleet. The service's CPU and memory looked fine, but request p99 had a periodic 40 ms spike every 60 seconds, perfectly correlated across all instances. iostat -x showed %util on the data disk hitting 100% for 5-second windows aligned with the spikes. The cause was the kernel's writeback being triggered by dirty_background_ratio — 280 MB/s of dirty pages accumulated over 30 seconds before the kernel started flushing them, then the flush competed with the application's own disk reads (segment files being served from disk) and saturated the IOPS budget for several seconds. The fix had two parts: drop the debug-level segment-distribution logs entirely (they were 90% of the volume and nobody had read them in months), and tune dirty_background_bytes to a smaller value so the kernel started flushing earlier and never accumulated such a large backlog. p99 dropped from 40 ms periodic spikes to a flat 8 ms.
A useful generalisation: logging volume is a memory-pressure problem before it is a CPU problem. Bytes written to the page cache are not free — they accumulate as dirty pages and eventually must be flushed, and the flush is bursty rather than continuous because the kernel is optimising for batch efficiency. A service that writes 280 MB/s of logs is consuming 280 MB/s of dirty-page budget, which on most servers is a finite resource (dirty_ratio = 20% of RAM, perhaps 25 GB on a 128 GB box). Once you cross the budget, every subsequent write blocks until enough is flushed — and that block looks like an unrelated p99 spike on the application dashboard. Monitoring Dirty: and Writeback: in /proc/meminfo is the cheap way to see this coming; building a Prometheus alert on node_memory_Dirty_bytes / node_memory_MemTotal_bytes > 0.15 catches it days before it becomes a customer-visible incident.
Razorpay payment-callback: the audit-log fsync story. The UPI payment-callback service logged each callback to an audit file with fsync after every write, per the team's interpretation of RBI's PA-PG durability guidelines. At normal load (8000 callbacks/sec across 12 instances) this was fine — the disk's 3000 IOPS handled it with headroom. During Diwali week 2025, callback volume rose to 35,000/sec per instance, exceeded the disk IOPS, and the service became latency-bound on fsync. Median callback processing time climbed from 4 ms to 280 ms; p99 hit 8 seconds. The on-call response was to scale horizontally — but each new instance brought its own fsync bottleneck, so scaling did not help. The actual fix was to read RBI's guidelines carefully, realise that batched durability with a 100 ms commit interval also satisfied the requirement, and switch to fsync every 100 ms instead of per call. The disk's 3000 IOPS now supported 30,000+ callbacks/sec per instance because each sync covered 1000 callbacks instead of one. Latency normalised within minutes of the deploy.
The pattern across all three: the cost of logging was below the noise floor at normal load, the bottleneck appeared sharply at peak load, and the fix was always a configuration change rather than a code change. The diagnostic ladder for "is logging my problem" is flamegraph during peak load → look for *Appender / write / fsync frames, then iostat -x 1 → look for %util pegged near 100% on the log disk, then cat /proc/meminfo | grep -E 'Dirty|Writeback' → look for accumulation, then pidstat -d → look for log shipper competing for disk. Most teams reach for "log less" as the first response, which is sometimes right but more often masks the structural issue — the configuration was wrong for the load regime, and reducing the load just delays the next regime transition by a few months.
A useful piece of operational discipline that catches all five of these patterns before they become incidents: every service should have a logging cost dashboard with three panels — bytes-per-second written by the service to its log file (from pidstat -d), CPU consumed by the appender (from a flamegraph aggregator like Pyroscope filtered to logging frames), and dirty-page accumulation on the log disk (from node_memory_Dirty_bytes). Set alert thresholds at 50 MB/s, 5% appender CPU, and 10% dirty-ratio respectively. These three numbers will trip days before customer-visible latency degrades, giving the team time to make the configuration change calmly rather than during a 3 AM page. The dashboard takes about an hour to build and saves an indeterminate number of pages over the year — a payoff ratio that is almost embarrassing once you compute it. The teams that build this dashboard early treat logging as the first-class production cost it actually is; the teams that don't, learn to.
A subtler fourth pattern worth flagging because it generalises: the PhonePe stack-trace-on-error story. The UPI-handler service logged stack traces on every caught exception, which seemed defensive. During a brief upstream outage that produced a uniform 12% exception rate across all calls, the cost of Throwable.fillInStackTrace (which walks the JVM stack and allocates a StackTraceElement[]) climbed to 18% of CPU, dwarfing the actual exception-handling logic. Each stack-trace allocation was 4-8 KB, multiplying GC pressure by the exception rate. The fix was to log only the first 10 frames of the stack, configured via Logback's %ex{10} pattern — the same diagnostic information for 1/10th the cost. The lesson: stack traces are themselves a logging cost, and the cost is invisible until error rate spikes (which is exactly the moment you cannot afford added load). Structured logging with explicit error field — rather than inline stack-trace text — is the cleaner pattern; the stack trace becomes a separate large field that the shipper can index or truncate independently.
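In Python's stdlib the analogous trick is a formatter that caps the traceback depth; a minimal sketch, with the 10-frame cap chosen to mirror the %ex{10} fix:
# Rough Python analogue of Logback's %ex{10}: a formatter that limits how many
# stack entries are rendered, so an error storm does not multiply log volume.
import logging, traceback
class TruncatedTracebackFormatter(logging.Formatter):
    def __init__(self, fmt=None, max_frames: int = 10):
        super().__init__(fmt)
        self.max_frames = max_frames
    def formatException(self, ei) -> str:
        # limit caps the number of rendered stack entries
        return "".join(traceback.format_exception(*ei, limit=self.max_frames)).rstrip()
handler = logging.StreamHandler()
handler.setFormatter(TruncatedTracebackFormatter("%(levelname)s %(message)s", max_frames=10))
log = logging.getLogger("upi-handler")
log.addHandler(handler)
try:
    1 / 0
except ZeroDivisionError:
    log.exception("callback processing failed")   # traceback capped at 10 frames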
A fifth pattern that has become endemic on container platforms: the Dream11 stdout-vs-file story. The fantasy-cricket service ran in Kubernetes pods writing logs to stdout, which the container runtime (containerd) captures via a cri-containerd log driver and writes to a host-side file under /var/log/pods/.... From the application's perspective, print(...) looks like a write to a TTY-like FD; in reality it is a write to a pipe whose other end is a kubelet-managed shim that synchronously copies into the host file. During the IPL toss-to-first-ball window — the 200× write spike Dream11 experiences for ~90 seconds before every match — the pipe's 64 KB kernel buffer filled, and every subsequent print blocked the writer until the shim drained the pipe. The application's flamegraph showed 23% of CPU in _io.BufferedWriter.write blocking on the pipe; nothing in the application code or its logger configuration was the cause. The fix was to bypass stdout entirely — write directly to a host-mounted file under /var/log/dream11/, with the shipper tailing that file instead of consuming the container runtime's log stream. p99 normalised within the first 30 seconds of the next match. The lesson generalises beyond Dream11: stdout in a container is not a free pipe to /dev/null; it is a synchronous channel into a shim with bounded buffering, and at high log rates that shim becomes the appender lock you didn't know you had.
Common confusions
- "Logging is free because it's just a function call." False at every traffic level above 1000 events/sec. The function call dispatches a 7-stage pipeline; the cost grows superlinearly with concurrency due to appender lock contention. A flamegraph that shows 2% of CPU in logging at low load can show 40% at peak load with no code change.
- "Async logging guarantees no event loss." Depends on the appender. Logback's
AsyncAppenderdefaults to dropping INFO/DEBUG when 80% full. Log4j2'sAsyncLoggerblocks the producer when its ring buffer fills. Python'sQueueHandlerwith a bounded queue raises an exception. Each is a defensible policy, but they are different policies, and "async" alone does not specify which one you have. - "
fsyncis the same asflush." No.flushempties the language-level buffer into the kernel's page cache (writesyscall).fsyncforces the kernel to commit the page cache to the underlying storage device.flushis microseconds;fsyncis milliseconds. Callingflushper record is fine; callingfsyncper record caps you at disk IOPS. - "Structured (JSON) logging is more expensive than plain text." Marginally — JSON formatting adds ~500 ns over plain text per record — but the savings on the shipper side (no regex parsing, direct field indexing) far exceed this. On a balanced view, structured logging is cheaper end-to-end because the cost shifts from the shipper (which is harder to scale) to the producer (which is easier).
- "Log shippers don't affect application latency because they're separate processes." False if they run on the same host. The shipper consumes CPU (typically 5-15% of a core) and disk I/O (reading the log file you just wrote). At high log volumes the shipper competes with the application for both, producing periodic latency hiccups correlated with its tail-and-ship cycle. Running the shipper at lower scheduler priority (
nice 10) and bounded CPU (cpulimitor cgroup) is the standard mitigation. - "Logging at DEBUG level in production is fine because we filter it out." True only if the filter happens before message construction. In Java's slf4j,
logger.debug("order {} state: {}", oid, expensiveToString(state))always evaluatesexpensiveToString(state)even when DEBUG is disabled, because Java arguments are eager. The protective pattern isif (logger.isDebugEnabled()) logger.debug(...)for any call with non-trivial argument computation, or uselogger.atDebug().log(...)(slf4j 2.0) which defers argument evaluation.
Going deeper
The appender lock — what it actually protects, and why removing it is hard
The appender lock exists to prevent byte-interleaving across concurrent writes. Without it, two threads writing simultaneously could produce output like INFO 10:15:22 INFO 10:15:22 order order 4711 7234 routedrouted... — partial writes from each thread interleaved at the byte level. POSIX only promises atomicity for write() calls of fewer than PIPE_BUF bytes (typically 4096) on pipes and FIFOs; for regular files the practical guarantee is at best per-syscall, and the appender often does multiple write calls per record (one for the message, one for the newline, sometimes one per field), so per-syscall atomicity does not add up to per-record atomicity. The appender therefore serialises at the user-space level to ensure each log record is a contiguous unit on disk.
Removing the lock requires either (a) a single-writer architecture where only one thread ever writes (the async-appender pattern — many producers, one consumer), or (b) a per-thread file appender where each thread writes to its own file and the shipper concatenates them at read time, or (c) lock-free batching where each thread reserves a slot in a ring buffer (LMAX Disruptor) and writes its bytes there without contending with other writers. The Disruptor approach is the most general and is what Log4j2's AsyncLogger uses; it costs about 400 ns per event in the producer (a CAS to reserve a slot, a memcpy into the slot) and trades the lock for memory-ordering complexity. Why the Disruptor is the right primitive for high-throughput logging: a ReentrantLock serialises all writers behind a single mutex, producing M/M/1 queueing with a sharp knee around utilisation 0.85. A lock-free ring buffer with N slots produces M/M/N queueing — the same total throughput is achievable at much higher per-writer rates because there is no serial bottleneck. For 60 producers each emitting 1000 events/sec (60,000 events/sec total), the Disruptor handles it with sub-microsecond latency at all percentiles; the ReentrantLock saturates at the same total rate with millisecond p99. The shape change is the same one we see in any move from M/M/1 to M/M/c — see /wiki/m-m-c-and-the-server-pool.
Sampling — the right tool for high-frequency call sites
Sampling is the explicit acceptance that every event is not equally informative. The first event from a new error class is highly informative; the 10,000th event from the same class adds almost nothing. Adaptive sampling adjusts the per-call-site emit rate so that hot call sites emit at most N events per second while cold call sites emit every event. The result is a logged stream that contains "every error you've never seen before, plus a representative sample of the ones you have, plus all the cold-path information unchanged."
The most important sampling technique is deterministic sampling by trace id: hash the trace id, emit only if the hash falls in the kept fraction. This guarantees that a sampled trace is sampled consistently across all services — if request a4b9c1 is sampled at the API gateway, it is sampled at the order service, the matching engine, the payment service, and the audit log. This is essential for distributed-trace correlation: a partial trace (sampled at some hops, dropped at others) is almost useless. OpenTelemetry's TraceIdRatioBased sampler does this; rolling your own is ten lines of code (if mmh3.hash(trace_id) % 100 < kept_pct: emit).
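A stdlib-only version of the one-liner (zlib.crc32 standing in for mmh3, which works the same way but needs a pip install):
# Deterministic sampling by trace id: hash the id, keep the event iff the hash
# falls in the kept fraction. Any stable hash works; crc32 keeps it stdlib-only.
import zlib
def keep_trace(trace_id: str, kept_pct: int = 1) -> bool:
    return zlib.crc32(trace_id.encode()) % 100 < kept_pct
# Every service that sees the same trace id makes the same decision, so a kept
# trace is kept at the gateway, the order service, the matching engine and the audit log.
print(keep_trace("a4b9c1d87e2f"))   # same answer on every host, every time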
A counterintuitive sampling principle: sample inversely to call frequency, not uniformly. A call site that fires once per request (frequent) gets sampled at 1%; a call site that fires once per hour (infrequent) is never sampled. The total log volume is roughly the sum across call sites of (sampled rate), so this allocation favours rare events — which are exactly the events you want to investigate when something goes wrong. Uniform sampling at 1% gives you 1% of the noisy events and 1% of the signal events; inverse-frequency sampling gives you a fraction of the noise and all of the signal.
A practical implementation pattern: maintain a per-call-site Counter keyed by the format string (or the call-site __file__:__lineno__) that counts emissions in the current 10-second window, and emit only when random() < target_rate / observed_rate. The observed-rate denominator is updated continuously, so a call site that fires 10,000 times per second emits at 1% (yielding 100 events/sec) while a call site that fires 0.1 times per second emits at 100% (yielding 0.1 events/sec). The target_rate is the budget the operator sets — typically 200-500 events/sec across all call sites combined. The implementation is about 40 lines of Python (a sketch follows below); OpenTelemetry's TraceIdRatioBased sampler covers the fixed-ratio case for traces, but per-call-site adaptive sampling for logs is usually a custom logging filter you wire yourself. The total cost in the producer's hot path is one hash, one counter increment, and one comparison — sub-microsecond, dwarfed by even the fastest formatter.
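A sketch of that filter, with an illustrative per-call-site budget of 10 events/sec and a 10-second window; warnings and errors bypass the sampler entirely:
# Inverse-frequency sampling as a logging.Filter: each call site (pathname, lineno)
# gets an observed-rate estimate over a rolling window, and a record is kept with
# probability target / observed_rate. Hot sites are heavily sampled; rare sites pass.
import collections, logging, random, time
class InverseFrequencySampler(logging.Filter):
    def __init__(self, target_per_site_per_sec: float = 10.0, window_s: float = 10.0):
        super().__init__()
        self.target = target_per_site_per_sec
        self.window_s = window_s
        self.window_start = time.monotonic()
        self.counts: collections.Counter = collections.Counter()
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True                                  # warnings and errors always pass
        now = time.monotonic()
        if now - self.window_start >= self.window_s:
            self.counts.clear()                          # start a fresh observation window
            self.window_start = now
        site = (record.pathname, record.lineno)
        self.counts[site] += 1
        # Treat the window as at least 1 s old so early rate estimates stay conservative.
        observed_rate = self.counts[site] / max(now - self.window_start, 1.0)
        if observed_rate <= self.target:
            return True                                  # cold call site: keep every event
        return random.random() < self.target / observed_rate
handler = logging.StreamHandler()
handler.addFilter(InverseFrequencySampler())
log = logging.getLogger("router")
log.addHandler(handler)
log.setLevel(logging.INFO)
for i in range(100_000):
    log.info("order %d routed", i)                       # hot call site: only a sample survives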
Kernel writeback — the page-cache time bomb
The kernel's writeback subsystem is responsible for moving dirty pages from the page cache to the underlying block device. Three tunables govern its behaviour: dirty_background_ratio (default 10% of RAM — when crossed, kernel starts background flush), dirty_ratio (default 20% of RAM — when crossed, foreground writers block until below), and dirty_expire_centisecs (default 3000, i.e. 30 seconds — pages older than this are flushed regardless of ratio). For most workloads the defaults are fine because the steady-state dirty rate is well below the device's flush rate.
Logging breaks this assumption when the dirty rate exceeds the device's bandwidth. A service writing 200 MB/s of logs to a disk that can sustain 100 MB/s will accumulate dirty pages at 100 MB/s; on a 64 GB box, dirty_ratio = 20% is 12.8 GB, which is reached after ~128 seconds of overload. From that point onward, every write blocks until enough pages are flushed, producing the periodic latency spikes that look like an unrelated bug. Monitoring this requires watching /proc/meminfo for Dirty and Writeback (Prometheus's node_memory_Dirty_bytes and node_memory_Writeback_bytes expose them) and alerting when they approach the 20% threshold.
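A minimal watcher for those two numbers, reading /proc/meminfo directly; the 15% warning threshold mirrors the alerting suggestion earlier in the chapter:
# Read Dirty/Writeback from /proc/meminfo and print them as a fraction of RAM,
# the same values the Prometheus node exporter exposes as node_memory_Dirty_bytes
# and node_memory_Writeback_bytes. Watches for 30 seconds.
import time
def meminfo_kb() -> dict[str, int]:
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            out[key] = int(value.strip().split()[0])   # values are reported in kB
    return out
for _ in range(30):
    m = meminfo_kb()
    dirty_frac = m["Dirty"] / m["MemTotal"]
    print(f"Dirty={m['Dirty'] / 1024:8.1f} MB  Writeback={m['Writeback'] / 1024:8.1f} MB"
          f"  dirty/total={dirty_frac:6.3%}")
    if dirty_frac > 0.15:                              # approaching dirty_ratio (default 20%)
        print("warning: dirty pages approaching the blocking threshold")
    time.sleep(1)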
The two real fixes are (a) reduce the log rate so steady-state dirty accumulation is below device bandwidth, or (b) add a separate disk for logs so the application's data I/O isn't competing with logging writeback. Both are operational changes, not code changes. The seductive but wrong fix is to lower dirty_background_ratio — this triggers writeback earlier but doesn't increase the device's bandwidth, so the same overload still occurs, just sooner.
A subtle interaction worth flagging: the noatime mount option for the log filesystem is essentially mandatory at high log volumes. Without it, every read of a log file (which the shipper does continuously) updates the file's access time, and because the file's mtime is also advancing continuously, even the relatime default keeps triggering those updates; each one is only a small inode write, but at shipper read rates they add a steady stream of metadata and journal writes to a disk that is already saturated with log data. For log filesystems, noatime is the right setting and the saving is immediate; noatime also implies nodiratime, so directory access-time updates stop as well.
Structured logging and the JSON tax
Structured logging — emitting each event as a JSON object with named fields rather than a printf-style string — has three direct costs: (a) the JSON encoder per call (~500 ns for 8 fields in Python's stdlib json, ~200 ns in orjson), (b) the larger byte volume per event (typical structured event is 400-800 bytes vs 100-200 for plain text), and (c) the keys repeated on every event ("timestamp", "level", "service", "trace_id", "user_id" — 50-80 bytes of overhead per event for the keys alone).
The savings are on the shipper side. A plain-text log line requires a regex (grok pattern in Logstash, pattern in Promtail) to extract structured fields for indexing; that regex costs 5-50 µs per event in the shipper's CPU. A JSON log line is parsed once (~1 µs) and the fields are directly available. At 100,000 events/sec, the regex tax on the shipper is 5-50% of one core; the JSON parse is 0.1% of one core. End-to-end, structured logging is cheaper.
The format that minimises both producer and shipper cost is JSON Lines with a fixed key order — the producer can stream-write keys without dictionary lookups, and the shipper can use SIMD-accelerated JSON parsers (simdjson and its language ports) that benefit from the predictable shape. The Loki/Grafana stack consumes JSON Lines natively; ELK can be configured to. Avoid Avro or Protobuf for logs unless you are also building the consumer — these binary formats save bytes on the wire but make ad-hoc shell investigation (grep, jq, awk from a tail of the log file) much harder, and ad-hoc investigation is the actual workflow when something is broken at 3 AM.
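A minimal fixed-shape JSON Lines formatter for Python's stdlib logging; the field set is illustrative and deliberately small:
# JSON Lines formatter with a fixed key order: every record has the same shape,
# which keeps the encoder cheap and gives the shipper a predictable layout to parse.
import json, logging
class JsonLinesFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "ts": round(record.created, 6),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(doc, separators=(",", ":"))   # compact, stable key order
handler = logging.StreamHandler()
handler.setFormatter(JsonLinesFormatter())
log = logging.getLogger("router")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("order %d routed to %s", 4711, "NSE-MAIN")
log.info("order acked", extra={"trace_id": "a4b9c1d87e2f"})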
Reproducing logging-cost measurements on your laptop
To run the measurements in this chapter on your own machine:
# Install Python tooling and set up the venv
python3 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install py-spy # for flamegraph capture under load
# Run the four-regime benchmark
python3 log_cost_demo.py
# Capture a flamegraph during the SYNC-file regime to see the appender lock
python3 log_cost_demo.py &
py-spy record -o flame.svg -d 5 -p $!
# Inspect the kernel's dirty-page accumulation while the benchmark runs
watch -n 1 "grep -E 'Dirty|Writeback' /proc/meminfo"
# Compare per-record throughput across handler types
python3 -c "
import logging, logging.handlers, queue, time
q = queue.Queue(); h = logging.handlers.QueueHandler(q)
l = logging.getLogger('x'); l.addHandler(h); l.setLevel(logging.INFO)
t0 = time.perf_counter()
for i in range(1_000_000): l.info('iter %d', i)
print(f'{1_000_000/(time.perf_counter()-t0):,.0f} events/sec via QueueHandler')
"
You should see ~10M events/sec for NULL, ~400k for SYNC file (with tail latency in the 100 µs range), ~12k for SYNC+fsync, and ~5M for async QueueHandler on a typical 8-core laptop. The flamegraph during the SYNC regime should show 30-50% of CPU in _RLock.acquire and BufferedWriter.write — that's the appender lock and the syscall path, exactly the two stages this chapter focused on.
A useful exercise after the basic measurements: switch the SYNC file handler to log to /dev/null instead of a real file and re-measure. Throughput climbs from ~400k to ~2M events/sec because the page-cache write and writeback path drop out, leaving only the formatter, the lock, and the syscall. The exercise builds the intuition that the cost of logging is the cost of all seven stages, not just the visible ones — and that even removing the disk entirely (the most extreme optimisation) only buys 5× because the appender lock and the formatter still dominate at 16-thread concurrency.
Where this leads next
This chapter is the fourth in Part 12 — the costs your code does not contain but does pay. The first three covered syscall overhead, context-switch cost, and scheduler latency. This one covers the cost that is in your code but is invisible because nobody reads log.info(...) as a 7-stage pipeline. Together the four chapters describe the spectrum from "the kernel is doing things on your behalf" (1-3) to "your library is doing things on your behalf" (4) — and the diagnostic instinct in both cases is the same: read the flamegraph, find the foreign frame, predict where it goes at peak load.
- /wiki/syscall-overhead — Part 12 ch. 83. The boundary cost the appender's write pays per call.
- /wiki/context-switch-cost — Part 12 ch. 85. The cost the appender's blocked threads pay when they wake.
- /wiki/scheduler-latency — Part 12 ch. 84. The wait the log shipper introduces when it competes for CPU.
- /wiki/m-m-c-and-the-server-pool — the queueing-theory model behind appender-lock contention; explains why the regime change at peak load is sharp rather than gradual.
- /wiki/coordinated-omission-and-hdr-histograms — why measuring logging cost from inside the application (using time.perf_counter_ns) almost always undercounts the tail.
A senior engineer reading the next four chapters builds a complete map of "where did my latency budget go to things outside my application code". The answer is rarely "more compute"; it is almost always "configuration change in one of the places nobody owns." Logging is the most common because every team has a logging configuration and almost no team has an SRE whose job is to own it.
The right organisational pattern is to make logging configuration a code-reviewed artefact, the same way feature-flag configurations or service-mesh policies are. The Logback XML, the Python logging.dictConfig, the Go slog.Handler setup — all of these belong in a versioned logging.yaml (or equivalent) that is reviewed by the platform team before each change, with explicit benchmark numbers attached to any modification of immediateFlush, discardingThreshold, or appender type. The teams that treat logging config as application code catch regressions in PR review; the teams that treat it as deployment trivia catch them in production.
A practical follow-up worth committing to muscle memory: when you next profile a service in production, search the flamegraph for any frame containing Appender, Logger, slog, Logback, Log4j, Promtail, FluentBit, or Vector. If their combined CPU is above 5%, your logging is on the critical path and the configuration is one peak-traffic event away from being the bottleneck. The fix order is: switch sync to async first (largest win, lowest risk), drop fsync-per-call second (only if your audit policy allows), reduce log volume third (sampling, level tuning), and tune kernel writeback last (only if the first three weren't enough). This ladder catches 95% of logging-related production incidents in Indian fintech and gaming production, in roughly the order of frequency they actually appear.
A closing framing for the chapter: the cost of logging is the cost of reasoning about your system after the fact, paid in latency and throughput at the moment things are happening. The right configuration buys you that reasoning capability for under 1% of CPU and zero impact on tail latency. The wrong configuration buys you the same reasoning capability for 15-40% of CPU and the latency spikes that make the on-call schedule painful. The difference between the two is roughly thirty lines of YAML or XML in the logging config file — not a code change, not a redesign, not new hardware. The teams that get this right early ship faster because they trust their logs; the teams that get it wrong learn to fear the moment traffic doubles, and either ship slower (out of caution) or page on-call more (when caution loses). Knowing which seven stages your log.info traverses, and which one will dominate at peak load, is the difference.
A second closing observation worth internalising: most monitoring stacks measure CPU consumed by the application but not CPU consumed on behalf of the application by adjacent processes — the log shipper, the metrics agent, the security scanner, the CRI runtime. On a typical Kubernetes node these adjacent processes consume 8-15% of total CPU, and on a node under logging stress that figure can reach 25-30% with the shipper alone. The cgroup hierarchy that Kubernetes builds (kubepods.slice/kubepods-burstable.slice/...) makes this measurable per-pod (kubectl top pod reads from the cgroup cpuacct), but the CPU consumed by the kubelet itself, the CRI runtime, and node-level daemons is invoiced to the system.slice and is invisible in pod-level metrics. The fix is to alert on node_cpu_seconds_total{mode!="idle"} - sum_over_pods rather than on aggregate pod CPU — that delta is the work being done on the node that no pod is paying for, which is exactly the work that scales with logging volume. The application's own dashboard shows its own CPU is fine; the node's CPU is pegged because of the work being done to ship the application's logs. The teams that catch this early build a single Grafana panel — "non-application CPU per node" — alongside their main dashboards, and learn to alert on it. The teams that don't, eventually do, after the third or fourth incident where "the box has free CPU and the service is slow" turns out to be the shipper drowning the application's I/O budget. The cost of logging is not just the cost of log.info; it is the cost of every downstream step that the log line triggers, and the operational discipline that prevents incidents is to measure all of those steps as one number.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), §10.5 "Disk I/O" — the canonical decomposition of writeback, dirty pages, and how application writes interact with the kernel's I/O scheduler.
- Logback AsyncAppender documentation — the canonical reference for queue size, discard threshold, and the trade-offs between throughput and event loss.
- LMAX Disruptor paper, Thompson et al. (2011) — the lock-free ring buffer that underlies Log4j2's AsyncLogger and most modern high-throughput logging libraries.
- Linux kernel documentation, Documentation/admin-guide/sysctl/vm.rst — the maintained reference for dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, and the writeback model.
- Cindy Sridharan, "Logs and Metrics" (Medium, 2017) — the canonical essay on when to log and when to use metrics instead, and the failure modes of treating logs as a metrics substitute.
- Loki design document, Grafana Labs (2018) — the design rationale for label-only indexing in Loki and the production-cost argument for structured-log shipping.
- /wiki/syscall-overhead — Part 12 ch. 83. The per-syscall boundary cost the appender pays on every write.
- /wiki/m-m-c-and-the-server-pool — the queueing-theory model behind the sharp regime transition at the appender lock.