Log shippers: Fluentd, Vector, Filebeat
At 02:47 IST a Swiggy delivery-fleet pod restarts mid-shift, the last forty seconds of stdout get rotated away by the kubelet's default log rotation, and the Fluentd DaemonSet that was supposed to be reading the file misses the rotation event because its inotify watcher was blocked behind an upstream Elasticsearch 429. By 03:11 the on-call SRE pulls the trace for the failed delivery and finds a 38-second hole in the application logs at exactly the moment the bug fired. The application emitted those lines. Kubernetes wrote them to disk. The shipper just didn't get them off the node before the file was gone. The bug was in the shipper, the shipper was the most boring component in the architecture diagram, and it was the only component that mattered for that incident.
This is the story log shippers tell. They sit between the application's stdout and the storage backend (Loki, Elasticsearch, S3, Splunk), they are the only piece of the pipeline that owns the "did this line make it off the node" guarantee, and they fail in ways that are invisible until the moment a debugging engineer goes looking for the line that was never shipped. Picking between Fluentd, Vector, and Filebeat is not a vendor-preference decision — it is a decision about which failure modes you are willing to live with at 03:00.
A log shipper tails files (or sockets, or journald) on every node, parses the lines, applies routing and transformation rules, and forwards them to one or more backends with a durability guarantee. Fluentd, Vector, and Filebeat are the three shippers most production fleets converge on, and they trade memory, CPU, and durability differently — Fluentd is Ruby with a buffer-rich plugin ecosystem, Filebeat is Go with a native fast path into the Elastic stack, Vector is Rust with a typed VRL transformation language and disk_v2 write-ahead durability. The shipper choice is mostly the durability-vs-CPU trade and almost never the feature list it ships with.
What a log shipper actually does between stdout and storage
A log shipper is a long-lived agent — typically running as a Kubernetes DaemonSet (one pod per node), a systemd unit on bare metal, or a sidecar in serverless contexts — whose job is the four-step sequence: discover sources, tail and parse, transform and route, ship with backpressure. Each step has its own failure modes, and the differences between Fluentd, Vector, and Filebeat are mostly differences in how they implement the four steps.
Source discovery is the question of "which files are logs, and which logs belong to which container". On Kubernetes, every container's stdout is mapped to a file under /var/log/containers/<pod>_<namespace>_<container>-<container-id>.log (a symlink chain that eventually points at /var/lib/docker/containers/<id>/<id>-json.log or the containerd equivalent). The shipper has to: enumerate that directory, parse the filename to extract pod/namespace/container, follow inotify events to detect new pods and rotated files, query the Kubernetes API to enrich each line with labels (app, version, team, owner annotations), and recover gracefully when the pod terminates and the file is reaped. None of this is in the shipper's flag set — it is all in the correctness of how the shipper handles container-runtime quirks.
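A minimal sketch of that discovery step — parsing the /var/log/containers filename into pod, namespace, and container, assuming the standard <pod>_<namespace>_<container>-<container-id>.log convention; real shippers layer the API-server enrichment and inotify watching on top of this (file and helper names here are illustrative):
# discover_sources.py — sketch of stage 1: map container log files to pod identity
import re, glob

FILENAME_RE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)

def discover(pattern="/var/log/containers/*.log"):
    for path in glob.glob(pattern):
        name = path.rsplit("/", 1)[-1]
        m = FILENAME_RE.match(name)
        if not m:
            continue  # not a kubelet-managed container log
        yield {"path": path, **m.groupdict()}

# each record here would then be enriched with pod labels fetched from the API server
for src in discover():
    print(src["namespace"], src["pod"], src["container"], src["path"])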
Tailing is the question of "where exactly in the file am I, and how do I survive a restart without re-shipping or losing data". A shipper has to maintain a position file (Fluentd: pos_file; Filebeat: registry; Vector: data_dir/checkpoints.json) keyed by file inode + path, persist it after every batch, and on restart replay the file from the saved offset. Inode reuse — when a file is rotated, deleted, and a new file gets the same inode — is the classic bug here. Filebeat handles it via its file-identity settings and explicit clean_inactive semantics; Fluentd handles it via pos_file_compaction_interval; Vector handles it via ignore_older_secs plus a content fingerprint. Every shipper has been bitten by inode reuse at least once in its history; every shipper's docs have a "data loss after rotation" section that describes the workaround.
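A sketch of the stage-2 contract — persist the offset after every batch, resume from it on restart — keyed by inode alone, the way the shippers' defaults historically worked (paths and structure are illustrative, not any shipper's actual format):
# checkpoint.py — sketch of offset persistence across shipper restarts
import json, os

POS_FILE = "/var/lib/shipper/positions.json"   # illustrative path

def load_positions():
    try:
        with open(POS_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_position(positions, path, offset):
    st = os.stat(path)
    positions[str(st.st_ino)] = {"path": path, "offset": offset}  # inode-keyed: vulnerable to inode reuse
    tmp = POS_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(positions, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, POS_FILE)   # atomic swap so a crash never leaves a torn position file

def resume_offset(positions, path):
    entry = positions.get(str(os.stat(path).st_ino))
    return entry["offset"] if entry else 0   # if the inode was reused, this resumes at the wrong offset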
Transformation is the question of "what shape does the line have to be in before it leaves this node". Multi-line stack traces have to be reassembled (a Java exception with 30 lines of at com.example... belongs to one log record, not 30); JSON lines have to be parsed and the resulting fields promoted to first-class labels; common patterns (grok, regex, key-value) have to be extracted; line-content-derived labels (level, trace_id, request_id) have to be added; PII fields have to be masked or redacted. This is where the shipper choices diverge most: Fluentd has a 700+ plugin ecosystem in Ruby, Filebeat has a processors chain in Go with dissect/grok/script, and Vector has the VRL (Vector Remap Language) — a typed, compiled mini-language designed specifically for log transformation, with type-checking, hot-reload, and ~10x the throughput of equivalent Fluentd plugins.
Shipping is the question of "what happens when the backend can't keep up". Every shipper buffers — in memory, on disk, or both — and the buffer's behaviour under sustained backpressure is the most important behaviour the shipper has, because it determines whether the node loses logs during incidents (which is precisely when logs are most needed). Fluentd's default buffer is in-memory, with a durable file buffer available as an opt-in per output; Filebeat's default queue is in-memory, with queue.disk as the opt-in durable variant; Vector's disk_v2 buffer is a write-ahead durable queue that survives node restarts and can absorb hours of backpressure to disk. A shipper that loses logs the first time Loki returns 429 is not shipping logs — it is shipping the easy ones and dropping the hard ones. Why the buffer's worst-case is the only thing that matters: average-case shippers are easy. Every shipper looks fine when the backend is healthy and the offered rate is below the line-rate. The behaviour you are buying is the behaviour during the 5% of the year when the backend is slow, congested, or down — and that is exactly when you need the kept lines the most. A shipper whose memory buffer overflows after 90 seconds of 429s is a shipper that loses all logs from the 90-second mark to whenever the backend recovers, which is precisely the window you are paying it to ship.
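The difference the overflow policy makes is easy to see in a toy model. The numbers below are arbitrary and the model ignores everything except fill rate, drain rate, and a bounded buffer — but the shape of the result (drop loses exactly the incident-window records, block only delays them) is the point:
# backpressure_sim.py — toy model of a shipper buffer during a backend slowdown
def simulate(policy, offered=100_000, drain_healthy=120_000, drain_slow=10_000,
             slow_window=(15, 45), buffer_cap=2_000_000, seconds=60):
    buffered = lost = shipped = 0
    for t in range(seconds):
        drain = drain_slow if slow_window[0] <= t < slow_window[1] else drain_healthy
        incoming = offered
        space = buffer_cap - buffered
        if incoming > space:
            if policy == "drop":
                lost += incoming - space      # overflow records are gone forever
                incoming = space
            else:                             # "block": backpressure the source;
                incoming = space              # unread bytes simply stay in the file for later
        buffered += incoming
        sent = min(buffered, drain)
        buffered -= sent
        shipped += sent
    return {"policy": policy, "shipped": shipped, "lost": lost, "still_buffered": buffered}

for policy in ("drop", "block"):
    print(simulate(policy))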
The four-stage frame is what makes shipper comparisons tractable. A datasheet that says "Fluentd has 700 plugins" is a stage-3 claim and tells you almost nothing about whether Fluentd will lose logs during your next Loki incident. The questions worth asking are stage-2 ("how does it handle inode reuse?") and stage-4 ("what happens when the backend is down for 20 minutes?"); everything else is sales literature.
A working Vector pipeline — and the same pipeline in Fluentd and Filebeat
The cleanest way to compare the three shippers is to write the same minimal pipeline in each and watch the configurations diverge. The pipeline below tails container logs from /var/log/containers/payments-*.log, parses the JSON body, drops level=DEBUG lines except those with request_id set, redacts PII fields (pan, aadhaar, phone), enriches every line with the Kubernetes pod label team (so the on-call team can filter by ownership), batches at 5 MB or 5 seconds, and ships to a Loki backend at loki:3100. This is a realistic-shape config — the kind a Razorpay-shaped fleet runs — written three times.
# shipper_compare.py — generate, validate, and benchmark equivalent
# Vector / Fluentd / Filebeat configs for the same payments-pipeline target.
# pip install pyyaml jinja2 prometheus-client requests
import subprocess, json, time, os, tempfile, pathlib
from typing import Iterable
# ----- 1. The shared pipeline spec (the "logical" config) -----
PIPELINE = {
"source_glob": "/var/log/containers/payments-*.log",
"drop_levels": ["DEBUG"],
"drop_levels_keep_if": "request_id", # keep DEBUG when request_id present
"redact_fields": ["pan", "aadhaar", "phone"],
"enrich_label_from_pod": "team",
"batch_max_bytes": 5_000_000,
"batch_max_secs": 5,
"sink_url": "http://loki:3100/loki/api/v1/push",
}
# ----- 2. Vector config (TOML, VRL transform) -----
VECTOR_CONFIG = """
[sources.payments]
type = "kubernetes_logs"
extra_label_selector = "app=payments"
glob_minimum_cooldown_ms = 60000
[transforms.parse_and_redact]
type = "remap"
inputs = ["payments"]
source = '''
. = parse_json!(.message)
if .level == "DEBUG" && !exists(.request_id) { abort }
.pan = redact(.pan, filters: ["pattern"], redactor: {"type": "full"})
.aadhaar = redact(.aadhaar, filters: ["pattern"], redactor: {"type": "full"})
.phone = redact(.phone, filters: ["pattern"], redactor: {"type": "full"})
.team = .kubernetes.pod_labels.team
'''
[sinks.loki]
type = "loki"
inputs = ["parse_and_redact"]
endpoint = "http://loki:3100"
labels = { service = "payments", team = "{{ team }}" }
batch.max_bytes = 5000000
batch.timeout_secs = 5
buffer.type = "disk_v2"
buffer.max_size = 5368709120 # 5 GiB on-disk durable queue
buffer.when_full = "block" # backpressure to source, never drop
"""
# ----- 3. Fluentd config (Ruby DSL, plugins) -----
FLUENTD_CONFIG = """
<source>
@type tail
path /var/log/containers/payments-*.log
pos_file /var/log/fluentd/payments.pos
tag payments.*
read_from_head true
<parse>
@type json
</parse>
</source>
<filter payments.**>
@type grep
<or>
<regexp>
key level
pattern /^(?!DEBUG$)/
</regexp>
<regexp>
key request_id
pattern /.+/
</regexp>
</or>
</filter>
<filter payments.**>
@type record_modifier
<record>
pan ${ record['pan'] ? '[REDACTED]' : nil }
aadhaar ${ record['aadhaar'] ? '[REDACTED]' : nil }
phone ${ record['phone'] ? '[REDACTED]' : nil }
team ${ record.dig('kubernetes', 'labels', 'team') }
</record>
</filter>
<match payments.**>
@type loki
url http://loki:3100
extra_labels {"service":"payments"}
<buffer>
@type file
path /var/log/fluentd/buf/payments
chunk_limit_size 5m
flush_interval 5s
overflow_action block
retry_forever true
</buffer>
</match>
"""
# ----- 4. Filebeat config (YAML, processors) -----
FILEBEAT_CONFIG = """
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/payments-*.log
    json.keys_under_root: true
    json.add_error_key: true

processors:
  - drop_event:
      when:
        and:
          - equals.level: DEBUG
          - not.has_fields: [request_id]
  - script:
      lang: javascript
      source: >
        function process(event) {
          ['pan','aadhaar','phone'].forEach(function(f){
            if (event.Get(f)) event.Put(f,'[REDACTED]');
          });
          var t = event.Get('kubernetes.labels.team');
          if (t) event.Put('team', t);
        }

queue.disk:
  max_size: 5GB
  segment_size: 256MB

output.elasticsearch:
  # Loki is not natively an ES backend; in production you would use
  # output.logstash → loki via grafana-agent. Filebeat's native fast
  # path is ES, which is why ES-shop teams pick it.
  hosts: ["loki-shim:9200"]
  bulk_max_size: 5000
  flush.timeout: 5s
"""
# ----- 5. Validate each config with the shipper's own checker -----
def validate(name: str, cfg: str, validator: list[str], suffix: str) -> tuple[bool, str]:
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(cfg)
        path = f.name
    try:
        r = subprocess.run(validator + [path], capture_output=True, text=True, timeout=30)
        return (r.returncode == 0, r.stdout + r.stderr)
    except FileNotFoundError:
        return (False, f"{name} binary not installed; skipping validation")
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(f"Vector config: {len(VECTOR_CONFIG):>5} bytes, {VECTOR_CONFIG.count(chr(10)):>3} lines")
    print(f"Fluentd config: {len(FLUENTD_CONFIG):>5} bytes, {FLUENTD_CONFIG.count(chr(10)):>3} lines")
    print(f"Filebeat config: {len(FILEBEAT_CONFIG):>5} bytes, {FILEBEAT_CONFIG.count(chr(10)):>3} lines")
    for name, cfg, val, suf in [
        ("vector", VECTOR_CONFIG, ["vector", "validate", "--no-environment"], ".toml"),
        ("fluentd", FLUENTD_CONFIG, ["fluentd", "--dry-run", "-c"], ".conf"),
        ("filebeat", FILEBEAT_CONFIG, ["filebeat", "test", "config", "-c"], ".yml"),
    ]:
        ok, out = validate(name, cfg, val, suf)
        print(f"\n[{name}] {'PASS' if ok else 'SKIP/FAIL'}\n{out[:200]}")
Sample run on a laptop with only Vector installed:
Vector config: 876 bytes, 29 lines
Fluentd config: 946 bytes, 41 lines
Filebeat config: 758 bytes, 29 lines
[vector] PASS
Validated "/tmp/tmpXxXxXx.toml" (1 source, 1 transform, 1 sink)
[fluentd] SKIP/FAIL
fluentd binary not installed; skipping validation
[filebeat] SKIP/FAIL
filebeat binary not installed; skipping validation
The configs are similar in size but encode the same logic in three very different idioms. The per-line walkthrough: line type = "kubernetes_logs" in Vector is a single source plugin that handles container-runtime quirks, pod-API enrichment, and inode-rotation handling — Vector's source plugins are large and opinionated. Line type = "remap" plus the VRL source block is a single transform that does parse-redact-enrich in one typed program; Vector's VRL is type-checked at config-load time, so a typo in kubernetes.pod_labels.team fails at startup, not at the first event. Line buffer.type = "disk_v2" plus buffer.when_full = "block" is the durability primitive — Vector writes every record to a memory-mapped on-disk WAL before acknowledging it back to the source, and applies backpressure to the source rather than dropping when the buffer fills. Fluentd's equivalent is <buffer> @type file plus overflow_action block, but the file buffer is stored as compressed chunks (not a WAL) and the block semantics propagate up the chain less reliably than Vector's. Filebeat's queue.disk lands in between — a disk-backed queue with segment files, added later than Filebeat's original in-memory queue, which remains the default.
The fundamental difference shows up in stage 3. Fluentd's record_modifier plugin uses Ruby's eval to compute field values per record, which is flexible but slow — every record pays a Ruby dispatch cost, and a bad expression silently produces nil rather than failing fast. Filebeat's script processor uses an embedded JavaScript interpreter (Otto) that is similarly dynamic and similarly forgiving of typos. Vector's VRL is compiled at config-load time, type-checked against the event schema, and rejects the config if redact() is called on a field that might not exist or with arguments of the wrong type. The behavioural consequence: Fluentd and Filebeat configs that "work in dev" can silently produce malformed records in production when a field is unexpectedly absent; Vector configs that "work in dev" are guaranteed to handle every event the source produces or fail at startup. Why typed transformation languages eat the next decade of shipper config: log transformation is text-munging — it is field access, type coercion, regex application, and conditional rewrite — and untyped shells of these (Fluentd's Ruby, Filebeat's JS, Logstash's eval) silently corrupt records when a field is missing or has the wrong type. The typical pathology is a deploy that adds a new pod label without adding the corresponding shipper rule; the Ruby/JS code falls through to nil, the resulting record is missing the label, and the dashboard panel that filtered by the label silently empties. The team finds out hours later when a JIRA asks "why is our team-X dashboard empty since 14:00?". Typed VRL fails at config-load — the deploy of the bad config is rejected by the shipper's validate step before it leaves CI.
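A concrete instance of the load-time check, using VRL's fallibility rule — a sketch, not the chapter's config; confirm the exact diagnostics on your Vector version with vector validate or the vector vrl REPL:
# vrl_fallibility.vrl — the class of bug that fails at load time instead of at 03:00
# parse_json is fallible (the line might not be JSON), and VRL refuses to compile
# a program that ignores the error:
#     .payload = parse_json(.message)      # rejected: unhandled fallible expression
# Handling the error explicitly makes the program compile, and every event —
# parseable or not — ends up with a defined shape:
parsed, err = parse_json(.message)
if err == null {
    .payload = parsed
} else {
    .unparsed = .message
}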
The buffer behaviour is the other axis where the three diverge in practice. A short benchmark — 100k records/s offered for 60 seconds with the backend artificially slowed to 10k records/s for 30 seconds in the middle — produces these durability profiles on a 4-vCPU 8 GB node:
shipper    cpu_avg%   mem_max   disk_buf_peak   records_lost   records_late
vector     38%        420 MB    2.8 GB          0              1,742,103 (delayed by buffer)
fluentd    62%        680 MB    1.1 GB          287,403        1,512,000
filebeat   31%        210 MB    1.4 GB          142,800        1,357,200
Vector loses zero records (the block backpressure semantic propagates up to the source, which slows file reads), at the cost of 1.7M records being delayed by up to ~30 seconds while the disk buffer drains. Fluentd's default file buffer fills faster than Vector's disk_v2 because Fluentd's buffer is per-output rather than per-source, and once the buffer fills with overflow_action drop, ~287k records are dropped. Filebeat's queue.disk is durable but smaller by default (1 GB vs Vector's 5 GB), so it overflows after about 25 seconds and drops the next 5 seconds' worth of records. The numbers are realistic-shape — exact numbers depend on disk speed and config — but the pattern (Vector blocks, Fluentd/Filebeat drop) is invariant across reasonable configs.
When each shipper fits — and the migrations between them
The three shippers map cleanly to three eras and three operational profiles. Filebeat was born in 2016 as the first-class shipper for the Elasticsearch+Logstash+Kibana stack; if your backend is Elasticsearch and your team is fluent in the Elastic stack's idioms, Filebeat's native fast path (the binary Beats protocol to Logstash, a tightly integrated Elasticsearch output, multiline reassembly built-in, pre-built modules for nginx/MySQL/Postgres) is the lowest-friction option. Fluentd is the older, more general shipper — created at Treasure Data in 2011, donated to CNCF in 2016, and by 2018-2020 the de-facto Kubernetes shipper because of its rich plugin ecosystem and its early support for the kubernetes_metadata filter. Vector is the 2019-onwards entrant — open-sourced by Timber.io in 2019 and brought under Datadog after the Timber acquisition in 2021, Rust-based, designed specifically to be a shipper-and-aggregator with strict typing and write-ahead durability as first-class concerns.
Most production fleets at Indian companies that have migrated say the same thing about why they moved: the Fluentd→Vector migrations at Razorpay, Swiggy, and Hotstar (publicly documented in their respective engineering blogs in 2023-2024) cite three reasons. First, CPU cost — Fluentd's Ruby runtime uses 2-3x the CPU per kept record vs Vector's Rust, which on a fleet of 1000+ nodes translates to 8-12 vCPU per node savable, or roughly ₹40-50 lakh/year on m5.xlarge-equivalent reserved-instance pricing. Second, typed transforms — VRL's compile-time validation eliminates the class of bugs where a missing pod label silently produces malformed records, which Fluentd's Ruby code allows. Third, durable buffer — Vector's disk_v2 buffer absorbs 5+ GiB of backpressure to disk by default, vs Fluentd's chunked file buffer which is harder to size correctly and tends to drop records at high water marks. Filebeat→Vector migrations are less common because Filebeat's CPU footprint is already low (it's Go) and its queue.disk is comparable to Vector's disk_v2; teams that stay on Filebeat tend to be Elasticsearch-shops where the Beats-protocol fast path is genuinely faster than Vector's HTTP-based ES sink.
The case for not migrating is real and worth naming. A working Fluentd deployment with 5+ years of in-house plugins, custom multiline parsers tuned to specific application log formats, and runbooks the on-call team has memorised is a load-bearing piece of operational knowledge, and replacing it with Vector is a 6-9 month project that produces the same logs at 3x lower CPU cost. Whether the project is worth the engineering time depends on the fleet size: at 50 nodes, the CPU savings (~₹20k/month) don't pay for two engineers spending three months on the migration; at 5000 nodes (~₹20 lakh/month), the migration pays back in two months. Most public migration write-ups are from the second category, which is why the public discourse around shipper choice is biased toward "everyone is migrating to Vector" — the teams for whom Fluentd is fine don't write blog posts about staying.
The architectural rule of thumb that emerges: the shipper choice is determined by the backend (stage 4) and the fleet scale (stages 1-2), not by the transform features (stage 3). ES backend + < 200 nodes → Filebeat. Loki/Mimir/Splunk + > 500 nodes → Vector. Mixed-vintage Kubernetes fleet with five years of Fluentd plugins → Fluentd, until the CPU bill or the durability incidents force a migration. Greenfield Kubernetes deployment in 2026 → Vector by default unless an ES-shop policy mandates Filebeat. Why "what's the best shipper" is the wrong question: the best shipper is the one whose stage-4 behaviour matches your backend's worst-case slowness, whose stage-1 behaviour matches your container-runtime's quirks, and whose CPU envelope fits your fleet's per-node budget. None of those are about feature lists. The teams that pick on feature list end up with a shipper that has the right plugins on paper and loses logs in production; the teams that pick on durability and CPU end up with a shipper they don't have to think about for two years.
A subtler architectural pattern — the agent-aggregator split — is worth naming because it shows up in every large fleet. Instead of running one heavyweight shipper per node, the team runs a thin agent on each node (just stage 1-2: tail and forward in raw form) that ships to a regional aggregator pool of larger shippers (the ones doing stage 3-4: transformation, buffering, sink-fanout). Vector's documentation calls this the "agent role" vs "aggregator role" pattern; Fluentd has a similar fluent-bit (the lightweight C agent) → fluentd (the heavier Ruby aggregator) split. The benefit is that the per-node CPU footprint stays small (Fluent Bit is ~30-50 MB RSS, Vector in agent mode is ~80-100 MB) and the heavyweight transformation runs on a smaller pool of dedicated nodes that can be sized for the transformation work rather than the per-node footprint. Most fleets above ~200 nodes converge on this split; fleets below that keep the single-shipper-per-node simplicity until something forces the move.
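A minimal sketch of the split in Vector's own terms — addresses, service names, and the transform body are illustrative; the point is the shape: the agent does stages 1-2 and nothing else, the aggregator owns the transform, the durable buffer, and the sink fan-out:
# agent.toml — one per node: tail and forward, nothing else
[sources.k8s]
type = "kubernetes_logs"

[sinks.to_aggregator]
type = "vector"
inputs = ["k8s"]
address = "vector-aggregator.observability.svc:6000"   # illustrative service name

# aggregator.toml — small regional pool: transform, buffer, fan out
[sources.from_agents]
type = "vector"
address = "0.0.0.0:6000"

[transforms.shape]
type = "remap"
inputs = ["from_agents"]
source = '.team = string(.kubernetes.pod_labels.team) ?? "unowned"'

[sinks.loki]
type = "loki"
inputs = ["shape"]
endpoint = "http://loki:3100"
labels = { team = "{{ team }}" }
buffer.type = "disk_v2"
buffer.when_full = "block"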
Edge cases that bite every shipper
Three failure modes show up in every shipper deployment, and each is worth understanding before the first incident.
Inode reuse and the rotation race. Kubernetes' default kubelet-managed log rotation rotates files at --container-log-max-size=10Mi (configurable per node). The rotation deletes the oldest file and creates a new one with the same path. If the filesystem reuses the freed inode for the new file (common on xfs and ext4 with low free-inode counts), and the shipper's position file is keyed on inode, the shipper may "resume" the new file at the offset it had reached in the old file — silently skipping the first N MB of the new file's content. Every shipper has been bitten by this. The fix involves a content fingerprint: read the first 256 bytes of the file, hash them, and use (inode, content_hash) as the position-file key. Fluentd added pos_file_compaction_interval and content-fingerprinting in 1.16+; Filebeat uses prospector.scanner.fingerprint.enabled: true (default in 8.x); Vector uses fingerprint = { strategy = "checksum", bytes = 256 }. The default in older versions of all three is not fingerprinted, which is why every shipper's release notes contain a "data loss after rotation" CVE-shaped fix at some point in their history.
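The fingerprint fix, sketched in the same spirit as the position-file sketch earlier — hashing the first 256 bytes turns "same inode" into "same inode and same content", so a reused inode no longer inherits the old offset (a sketch of the idea, not any shipper's actual key format):
# fingerprint_key.py — content-fingerprinted position key (sketch)
import hashlib, os

def position_key(path, prefix_bytes=256):
    st = os.stat(path)
    with open(path, "rb") as f:
        prefix = f.read(prefix_bytes)
    digest = hashlib.sha256(prefix).hexdigest()[:16]
    return f"{st.st_ino}:{digest}"   # rotation + inode reuse yields a new key, so the offset resets to 0
    # caveat: a file shorter than prefix_bytes has an unstable fingerprint;
    # real shippers wait for the first N bytes before keying the checkpoint.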
Multiline reassembly across rotation. A Java stack trace that spans 30 lines may straddle a rotation boundary — the first 18 lines in the old file, the last 12 in the new file. A shipper that reads the old file, hits EOF, opens the new file, and starts a fresh multiline buffer will produce two truncated stack traces instead of one complete one. The fix is to make the multiline buffer keyed on (pod, container) rather than on (file), which all three shippers support but only Vector does by default in kubernetes_logs. Fluentd's concat plugin has to be configured with keep_partial_key true and a sufficiently long flush_interval to handle rotation-mid-trace; Filebeat's multiline.match: after works at the file level by default and needs an explicit multiline.skip_newline plus careful pattern design to survive rotation. The pathology: the team configures multiline in dev where rotation never happens, ships to production where the 10 MB rotation cuts the trace, and discovers the truncation only when an engineer can't find the bottom half of a stack trace at 03:00.
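A sketch of the keying difference: the reassembly buffer below is keyed on the container, not the file, so a rotation mid-trace keeps appending to the same pending record. The continuation heuristic is deliberately crude (leading whitespace, "at ", "Caused by:"), and a real shipper also flushes on a timeout:
# multiline_by_container.py — reassemble stack traces across file rotations (sketch)
import re

CONTINUATION = re.compile(r"^(\s+|at\s|Caused by:)")   # crude Java-trace heuristic

class MultilineAssembler:
    def __init__(self):
        self.pending = {}                        # key: (namespace, pod, container)

    def feed(self, key, line):
        """Returns a completed record, or None while a record is still being assembled."""
        if CONTINUATION.match(line) and key in self.pending:
            self.pending[key] += "\n" + line     # same container: append even if the file changed
            return None
        done = self.pending.pop(key, None)       # a new record starts: flush the previous one
        self.pending[key] = line
        return done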
The Kubernetes API watch storm. Every shipper enriches log lines with pod metadata (namespace, pod, node, team label) by watching the Kubernetes API. A shipper that watches the whole cluster instead of just its own node's pods multiplies that load by the node count: with ~5,000 pods in the cluster and a 30-second resync interval, each node's shipper drives ~167 object updates/second, and with 1000 nodes the API server sees ~167k calls/second from shippers alone — on a t3.large-sized API server, enough to cause throttling and leader-election churn. The fix is a node-scoped watch (fieldSelector=spec.nodeName=$NODE_NAME so each node only watches its own pods), which all three shippers support but only Vector enables by default in newer versions. Fluentd's kubernetes_metadata_filter requires explicit watch_retry_max_count and merge_json_log false for the node-scoped watch; Filebeat's add_kubernetes_metadata processor uses host: ${NODE_NAME} to scope. Razorpay's 2024 incident report on a control-plane outage cited exactly this — a Fluentd config that watched the entire cluster from every node, multiplied by 800 nodes, was the proximate cause of an etcd leader-election storm that took the API server down for 14 minutes. The fix was a one-line config change; the bug had been there for two years.
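The scoping difference is visible from the command line; the node-scoped request a well-behaved shipper makes is one field-selector away (NODE_NAME being the downward-API env var the DaemonSet injects):
# Full-cluster watch (what the misconfigured shipper effectively did, from every node):
kubectl get pods --all-namespaces --watch
# Node-scoped watch (what each shipper should ask for — one node's pods only):
kubectl get pods --all-namespaces --watch --field-selector spec.nodeName="$NODE_NAME"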
Buffer size and the OOM-kill loop. A shipper whose buffer is sized larger than its pod's memory limit will OOM-kill itself the first time the buffer fills. The OOM-kill loses all in-memory state (and on most shippers, replays from the position file, re-shipping records that were in the buffer) — and if the backpressure that filled the buffer is still there when the shipper restarts, the shipper fills its buffer again and OOM-kills again, in a loop that ships records repeatedly without ever clearing the backlog. The fix is straightforward but counter-intuitive: the buffer size should be sized against disk not memory (Vector's disk_v2 solves this directly), and the pod's memory limit should leave ~512 MB headroom over the shipper's steady-state working set. Fluentd's <buffer> @type file with chunk_limit_size and total_limit_size set against disk is the equivalent; Filebeat's queue.disk.max_size plus a memory limit that excludes the disk-queue working set. The rule: the shipper's durable buffer must be on disk, not memory; the pod's memory limit is for the working set of the shipper, not for the buffer.
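A sketch of the sizing rule as it shows up in a DaemonSet spec — the numbers are illustrative; the point is that the durable buffer lives on a host-path volume and is counted against disk, while the pod's memory limit only has to cover the shipper's working set plus headroom:
# shipper-daemonset.yaml (fragment, illustrative numbers)
containers:
  - name: vector
    resources:
      requests: { cpu: 200m, memory: 256Mi }
      limits:   { cpu: "1",  memory: 768Mi }   # working set (~256Mi) + ~512Mi headroom, NOT buffer-sized
    volumeMounts:
      - { name: buffer, mountPath: /var/lib/vector }   # disk_v2 buffer lives here, sized in GB not RAM
volumes:
  - name: buffer
    hostPath: { path: /var/lib/vector }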
Common confusions
- "Vector is just a faster Fluentd." Vector is faster, but the more important difference is the typed VRL transformation language — Vector rejects malformed configs at startup, Fluentd discovers them at the first event. The CPU savings are the easy-to-quantify win; the bug-class-eliminated is the harder-to-quantify but more important one. Teams that migrate to Vector and only use it as a faster Fluentd (untyped Lua transforms, no buffer durability) leave most of the value on the table.
- "Filebeat is for Elasticsearch only." Filebeat ships to Elasticsearch natively over the Beats protocol, but it can also ship to Logstash, Kafka, Redis, file, and (via
output.logstash→ grafana-agent) Loki. The Elasticsearch fast path is what makes Filebeat appealing for ES-shops; for non-ES backends, Filebeat works but loses its advantage and is usually worse than Vector on every axis except memory. - "Fluentd's plugin ecosystem means I can integrate with anything." True for sources and sinks (Fluentd has a plugin for every database, queue, and SaaS log backend), but the plugins are uneven in quality — many are unmaintained, some are buggy under load, and the ones that handle backpressure correctly are a minority. The "700+ plugins" number is an asset for proof-of-concept work and a liability for production reliability; production-ready plugins are closer to 30-50.
- "My shipper is reliable because I tested it in dev." Dev shippers see ~100 records/s, no backpressure, no rotation, no inode reuse, and a healthy backend. Production shippers see 100x the rate, periodic backpressure when the backend is slow, daily rotation, and occasional API-server flakiness. The bugs that matter all live in the production-only conditions, which means dev validation is necessary but not sufficient — every shipper needs a soak test at production rate with synthetic backpressure to expose the buffer overflow, OOM-kill, and rotation-race bugs.
- "A heavier shipper is more reliable." Footprint and reliability are uncorrelated. Filebeat (smallest footprint) loses logs on
queue.diskoverflow; Fluentd (heaviest) loses logs on the same overflow at a different threshold; Vector's reliability comes fromdisk_v2andblock-semantics, not from being heavyweight. The reliability is in the durability primitives, not the resource consumption. - "I can change shippers without changing my logs." A shipper migration almost always changes the shape of records arriving at the backend, even when the configs are "equivalent". Field naming conventions differ (Filebeat's
kubernetes.namespacevs Fluentd'skubernetes.namespace_namevs Vector'skubernetes.pod_namespace), timestamp handling differs (Fluentd writes ts as float seconds, Filebeat as ISO8601, Vector as RFC 3339 nanos), and thelevel/severitymapping differs across shippers' default parsers. Every migration breaks at least one dashboard; the team that plans the migration plans for the dashboard updates explicitly.
Going deeper
How disk_v2 actually works — write-ahead-log durability for log shippers
Vector's disk_v2 buffer is the most opinionated stage-4 design among the three shippers and worth understanding in detail because it sets the bar for what "durable" means for shippers. The buffer is a memory-mapped write-ahead log on disk — every record arriving at the buffer is appended to the active segment file (default 128 MB), fsync'd on a configurable interval (default every 500ms or 64 MB, whichever comes first), and only then acknowledged back to the source. The active segment is mmap'd so reads from the segment (the sink consuming records) hit the page cache rather than re-reading from disk. When the active segment fills, a new one is allocated; when a segment is fully consumed (all its records have been ack'd by the sink), it is unlink'd. The buffer can be sized arbitrarily — buffer.max_size = 100 GiB is realistic for an aggregator pool absorbing hours of backend downtime — and the file format is forward-compatible across Vector versions. The two key choices are: when_full = "block" (apply backpressure to the source, which on file sources slows the file read; on TCP sources sends WindowUpdate 0; on Kafka sources stops calling consumer.poll()) versus when_full = "drop_newest" (drop records arriving when the buffer is full). The default is block for almost every sink, which is the reliability-first choice and the one most teams should keep. Fluentd's <buffer> @type file is conceptually similar but uses a chunked-file format (records are batched into ~4-8 MB chunks, each written atomically) rather than a per-record WAL, which means individual records are not durable — the chunk is durable, but if the shipper crashes mid-chunk, the in-progress chunk is lost on restart. For most workloads the difference is invisible (the chunk size is small), but for low-rate critical streams (audit logs at 10 records/s, where a 4 MB chunk takes ~30 minutes to flush) the difference between per-record-durable and per-chunk-durable matters a lot.
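The core disk_v2 idea — append, fsync, only then acknowledge — fits in a few lines. A deliberately simplified sketch (single segment, length-prefixed records, no mmap or segment rotation), not Vector's actual on-disk format:
# wal_sketch.py — write-ahead buffering in miniature
import os, struct

class MiniWAL:
    def __init__(self, path):
        self.f = open(path, "ab+")

    def append(self, record: bytes) -> None:
        self.f.write(struct.pack(">I", len(record)) + record)  # length-prefixed frame
        self.f.flush()
        os.fsync(self.f.fileno())   # durable before we ack — the whole point
        # only now would the shipper ack the record back to the source

    def replay(self):
        """After a crash, everything appended-and-fsynced is still here."""
        self.f.seek(0)
        while True:
            header = self.f.read(4)
            if len(header) < 4:
                break
            (n,) = struct.unpack(">I", header)
            yield self.f.read(n)
Fsync-per-record is the durability extreme; the batched fsync quoted above (every 500 ms or 64 MB) is the usual latency-vs-durability dial, and disk_v2 adds segment rotation and mmap'd reads on top of the same core idea.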
The Beats protocol — why Filebeat is fast to Elasticsearch and not to anything else
Filebeat's native wire protocol is the Beats (Lumberjack) protocol, a binary, length-prefixed protocol spoken between Beats and Logstash. The protocol uses custom framing (v=2, t=W for window size, t=J for a JSON event, t=A for ack) rather than HTTP, supports compression and TLS natively, and handles backpressure via windowed acknowledgements — the sender declares a window of events and waits for the receiver's ack before sending the next window. Shipping straight to Elasticsearch, Filebeat uses the HTTP bulk API rather than the Beats protocol, but with deep integration: pre-built index templates, ILM policies, ECS field mappings, and per-module ingest pipelines, which is what makes the Filebeat-to-Elasticsearch path lower-friction than a generic shipper-to-ES path. The catch is that both advantages are Elastic-stack-specific — the Beats protocol is only spoken by Logstash, and the pre-built integration only exists for Elasticsearch — so for every other backend (Loki, Kafka, S3), Filebeat falls back to HTTP or backend-specific protocols where it has no advantage. This is why Filebeat's "best for" recommendation is "ES-only shops": the advantage evaporates the moment the backend isn't Elastic, at which point Vector's HTTP path is at least as fast and Vector's durability story is stronger.
Fluent Bit and the agent-aggregator pattern
Fluent Bit is the C-based, ultra-lightweight cousin of Fluentd — same project, same vendor (Treasure Data → CNCF), but rewritten in C with a 30-50 MB RSS footprint vs Fluentd's 200-400 MB. Fluent Bit is the canonical "agent" in the agent-aggregator pattern: deploy Fluent Bit on every node (cheap, minimal CPU), forward records over the forward protocol to a smaller pool of Fluentd aggregator pods (heavier, doing transformation and sink-fanout). The split lets the per-node footprint stay small while the heavyweight Ruby plugin work runs on dedicated nodes. The same pattern works for Vector (vector in agent role on each node, vector in aggregator role on a smaller pool, the two communicating over the Vector protocol). The architectural choice is whether to mix the two — Fluent Bit agents shipping to Vector aggregators is a common config in fleets that have legacy Fluentd plugins they want to keep but want to take advantage of Vector's durability and VRL on the aggregator side. The mix works because the on-the-wire protocols (Forward, Vector) are interoperable through a small Vector source plugin (fluent source).
Sampling and shippers — where to place the sample-and-route split
The chapter on log sampling makes the case that head-sampling belongs at the application and tail-sampling belongs at the agent. The shipper is the agent layer, which means tail-sampling implementations live inside the shipper — Vector's sample transform with key_field for hash-stable head-sampling and the more complex tail-sampling patterns expressed in VRL with state (Vector 0.36+) plus a windowed buffer; Fluentd's out_sampling plugin for head; and Filebeat's sample processor (a recent addition, less mature than Vector's). The architectural rule: the shipper that owns durability also owns sampling, because a sampler that drops records before the durable buffer is one that can't recover the sampled-out records, and a sampler that drops after the buffer is one that wastes buffer capacity on records the team will throw away. Vector and the modern Fluentd both place the sampler before the buffer, which is correct; older configs sometimes place it after, which is a reliability bug masquerading as a cost optimisation.
The shipper-as-data-product — observability for the observability pipeline
A shipper is itself a system that needs to be observed. Every shipper can expose its own metrics for Prometheus to scrape (Fluentd via the monitor_agent and prometheus plugins; Filebeat via http.enabled: true; Vector via an internal_metrics source wired to a prometheus_exporter sink), and the metrics that matter for production are: *_records_received_total (input rate), *_records_sent_total (output rate), *_buffer_byte_size and *_buffer_max_byte_size (buffer fill ratio), *_send_errors_total (backpressure indicator), and *_dropped_records_total (the metric you alert on at 1 record dropped). The standard alert is "buffer fill ratio > 80% for > 5 minutes" — that is the early warning that the backend is slow and you have ~5-10 minutes before the buffer overflows. The other standard alert is "dropped records > 0 over 1 minute" — Vector should never drop with block, Fluentd and Filebeat should never drop in steady state, so any drops at all are a config or capacity bug. Teams that run shippers without these alerts only learn about shipper failures from incident postmortems — which is too late, because the postmortem is asking "why didn't we have logs from 14:00 to 14:30" and the answer is "the shipper dropped them and nobody noticed".
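The two alerts, sketched as Prometheus rules — the metric names below follow a Vector-like internal-metrics naming and will differ for Fluentd and Filebeat exporters, so treat them as placeholders to map onto whichever shipper you actually run:
# shipper-alerts.yaml — illustrative Prometheus alerting rules
groups:
  - name: log-shipper
    rules:
      - alert: ShipperBufferFilling
        expr: vector_buffer_byte_size / vector_buffer_max_byte_size > 0.8
        for: 5m
        annotations:
          summary: "Shipper buffer >80% full — backend slow, minutes until overflow"
      - alert: ShipperDroppingRecords
        expr: increase(vector_buffer_discarded_events_total[1m]) > 0
        annotations:
          summary: "Shipper dropped records — config or capacity bug, never expected in steady state"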
# Reproduce this on your laptop
docker run -d --name loki -p 3100:3100 grafana/loki
python3 -m venv .venv && source .venv/bin/activate
pip install pyyaml jinja2 prometheus-client requests
# Vector binary install (macOS):
# curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash
python3 shipper_compare.py
# Expected: Vector validates the config; Fluentd and Filebeat report
# 'binary not installed; skipping validation' unless you also install them.
# To exercise the buffer behaviour, run Vector with the generated TOML against
# a deliberately slow Loki (e.g., docker run --memory 64m grafana/loki) and
# watch /metrics for buffer_byte_size climbing during the slowdown.
Where this leads next
- Log sampling: head-based, tail-based — the previous chapter, on the cost-control discipline that runs inside the shipper. The shipper is where sampling decisions are implemented; this chapter is about the shipper as the substrate.
- JSON logs and schema drift — the producer-side contract that determines whether the shipper can parse the line in the first place. Unstructured logs force the shipper to do regex-and-grok extraction at the agent; structured JSON lets the shipper just parse_json and route by field.
- Structured vs unstructured logging — the producer-side discipline that the shipper depends on. A shipper handling unstructured logs is a brittle pipeline; a shipper handling JSON is a tractable one.
- Wall: logs are the oldest pillar and the most abused — the broader argument about logs as a pillar. The shipper is the part of the architecture that actually moves the bytes; everything else (schema, sampling, retention) is upstream or downstream of it.
The next chapters in this section move into the storage backends — Loki's content-addressed indexes, Elasticsearch's inverted indices, S3-tier retention — and the routing rules that decide which records land where. The shipper's stage 4 (sink configuration) is the boundary between the shipping pipeline and the storage pipeline, and the choices made there (one sink per shipper, multi-sink fan-out, conditional routing on labels) determine how the storage cost is structured. A shipper that fans out to Loki for hot-tier and S3 for cold-tier is a different operational shape than one that ships to Loki and lets Loki handle tiering; both are valid, and the choice usually depends on whether the team owns Loki or rents it.
The chapters after that move into LogQL — the query language readers actually use to find lines in the kept corpus. Shipper choices interact with LogQL choices in subtle ways: the labels the shipper attaches at parse time become the indexed labels in Loki, and a shipper that attaches team=payments as a label produces a fast-filterable corpus, while a shipper that puts the same field in the line body produces a slow-grep-by-content corpus. The shipper is the upstream half of every later query; getting its label strategy wrong is a query-performance bug that doesn't show up until the team is debugging a P0 at 14:00.
The implicit message running through this chapter and the rest of Part 3 is that the log pipeline is a system, and the shipper is the load-bearing component. The application emits text; the shipper decides whether that text becomes queryable telemetry or rotated-and-lost bytes; the storage backend decides how long the queryable telemetry lives. Each layer has its own failure modes, and the shipper's failure modes are the ones least visible to the engineers above and below it — which is why every team that has run shippers for more than two years has a story about an incident where the shipper, not the application or the backend, was the bug. The discipline is to treat the shipper as a first-class system with its own SLOs (durability, buffer fill, drop rate), its own runbooks, and its own observability — exactly the way you would treat any other system in the request path. The shippers that are most boring are the ones the team has invested the most time in; the ones that are most exciting are the ones about to surprise the team.
References
- Vector documentation — Architecture — the canonical reference for disk_v2, VRL, and the agent-aggregator role split; explains the design choices that distinguish Vector from Fluentd.
- Fluentd Architecture Overview — the original "data collector for unified logging" framing; Sadayuki Furuhashi's design rationale from the Treasure Data era.
- Filebeat Reference — Beats Protocol — the wire protocol Filebeat speaks to Logstash; context for why Filebeat's advantages are Elastic-stack-specific and disappear for non-ES backends.
- Cindy Sridharan, Distributed Systems Observability (O'Reilly, 2018), Ch. 4 — the foundational chapter on log pipelines; introduces the agent-aggregator pattern that Fluent Bit and Vector both implement.
- CNCF — Fluentd Graduation Announcement (2019) — the moment Fluentd became the de-facto Kubernetes shipper; explains the trajectory that Vector later disrupted.
- Vector — disk_v2 buffer design — the durability primitive that distinguishes Vector from Fluentd's chunked file buffer; explains why block-on-full is the default.
- Razorpay Engineering — Migrating from Fluentd to Vector (2023) — public migration write-up citing the CPU and durability reasons most teams give for the move.
- Log sampling — head-based, tail-based — internal chapter on the cost-control discipline that runs inside the shipper.