Filesystem overhead
Asha runs the platform team for a Bengaluru fintech that processes ₹400 crore of UPI volume a day. The Postgres write tier sits on a Samsung PM9A3 NVMe drive whose datasheet promises 412,000 random write IOPS at 4K. She runs fio --bs=4k --rw=randwrite --iodepth=32 --direct=1 against the raw device and sees 408,000 IOPS — within 1% of spec. She runs the same workload against a file on the device's ext4 filesystem with default mount options and sees 38,400 IOPS. The filesystem cost her 90% of the device.
The drive did not get slower between the two tests. The path between her write() call and the NAND grew an extra five layers — directory entry update, inode timestamp update, journal commit, extent allocator, write barrier — each of which adds latency, each of which can serialise other writes, and most of which were defaults she never explicitly chose. Filesystem overhead is the largest source of "device looks fast in fio, application looks slow in production" gaps in any system that touches durable storage.
A filesystem turns the device's flat block address space into named files, but it pays for the abstraction with metadata writes, journaling, and allocator decisions on every operation. The cost is invisible in fio --filename=/dev/nvme0n1 and dominant in fio --filename=/mnt/nvme/file. The four levers that recover most of the gap — data=writeback journal mode, noatime, nobarrier only with PLP, and pre-allocated files — are mount-time and fallocate-time choices, not application changes.
What sits between write() and the device
When your application calls pwrite(fd, buf, 4096, offset), the kernel walks a stack of layers before any byte reaches NAND. Each layer is doing something useful — but each one adds latency and writes its own bytes that you did not ask for.
[Table: the layer stack between pwrite() and a NAND program, with the write amplification — the actual bytes hitting the device per 4 KB user write — for each common ext4 journal mode and for XFS. Illustrative; actual amplification varies with workload mix and filesystem state.]
Read the amplification figures carefully. With ext4's default data=ordered mode, every 4 KB user write produces ~12 KB of writes to the device — your 4 KB of data, plus ~4 KB of inode and timestamp metadata, plus ~4 KB of journal commit. With data=journal (the safest mode, where the data itself goes through the journal), the amplification climbs to ~16 KB — every data byte is written twice. With data=writeback (the loosest mode for power-loss semantics), amplification drops to ~4.5 KB, close to the device's view.
This is before any fsync(). A single fsync() at the application level translates to a journal commit (one write to the journal area), a metadata flush (writes to inode and block-bitmap regions), and at least one REQ_PREFLUSH + REQ_FUA pair to the device — telling the device "stop accepting new writes until everything in your cache is durable, then ack this write specifically before any others". The device honours these barriers; if it has PLP, it can respond from cache, otherwise it waits for NAND program latency. Why a single fsync() can take 2–5 ms even on a "fast" NVMe: the barrier serialises the device's internal pipeline, draining all in-flight writes before the fsync completes. On a busy database with 256 in-flight writes from other transactions, your fsync waits for all 256 to complete first. The device's per-I/O latency is 50 µs; the fsync latency in pathological cases is closer to 256 × 50 µs ≈ 13 ms. This is why group-commit batches fsyncs: amortising one barrier across many transactions.
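A minimal sketch of that amortisation, assuming a file on the filesystem under test (the path and counts here are illustrative, not part of the harness below): commit N writes with one fsync() each, then the same N writes under a single fsync(), and compare the per-transaction cost.
# fsync_amortise.py — sketch: per-transaction fsync vs group commit on one file
import os, time
PATH = "/mnt/nvme/fsync_probe.dat"   # illustrative path on the filesystem under test
N = 1000
BLOCK = b"\0" * 4096
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
os.posix_fallocate(fd, 0, N * 4096)  # preallocate so both passes overwrite in place
# One fsync per 4 KB write — every transaction pays the full journal commit + barrier.
t0 = time.perf_counter()
for i in range(N):
    os.pwrite(fd, BLOCK, i * 4096)
    os.fsync(fd)
per_txn_us = (time.perf_counter() - t0) / N * 1e6
# N writes, one fsync — the barrier cost is amortised across the whole batch.
t0 = time.perf_counter()
for i in range(N):
    os.pwrite(fd, BLOCK, i * 4096)
os.fsync(fd)
batched_us = (time.perf_counter() - t0) / N * 1e6
os.close(fd)
print(f"fsync per write     : {per_txn_us:8.1f} µs/txn")
print(f"one fsync per batch : {batched_us:8.1f} µs/txn")
The ratio between the two numbers is, to first order, the journal-commit and barrier cost that a database's group commit recovers.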
Measuring the gap from a Python harness
The honest way to see filesystem overhead is to run the same workload through three paths — raw device, file with default mount options, file with tuned mount options — and tabulate the IOPS, throughput, and amplification. The harness below uses fio for the I/O generator and parses its JSON output in Python.
# fs_overhead.py — measure the cost of the filesystem layer for a given workload
# Run: sudo python3 fs_overhead.py /dev/nvme0n1 /mnt/ext4_default /mnt/ext4_tuned /mnt/xfs
# Each path is either a raw block device (/dev/...) or a directory on a mounted filesystem.
# Requires: fio, sudo (for raw device access), pyparted optional.
import json, subprocess, sys, os
PATHS = sys.argv[1:]
SIZE = "2G"      # Working set well above page cache (assumes <8 GB RAM avail to test)
RUNTIME = 30     # 30 seconds per scenario; precondition longer in production runs
# Three workload shapes that probe different parts of the FS layer:
SCENARIOS = [
    ("4K-randwrite-noFsync", "4k", "randwrite", 32, False),  # exposes journal + metadata
    ("4K-randwrite-fsync1", "4k", "randwrite", 1, True),     # exposes barrier cost
    ("metadata-create", "meta", "create", 1, False),          # directory-entry + inode cost
]
def run_fio(target, name, bs, rw, qd, fsync_each):
    # WARNING: a randwrite run against a raw /dev/... target destroys its contents.
    is_raw = target.startswith("/dev/")
    if name == "metadata-create":
        # Create 50,000 small files; measures inode + dirent cost, not data IOPS.
        if is_raw:
            return None  # cannot run filesystem-create on a raw device
        cmd = ["fio", f"--name={name}", f"--directory={target}",
               "--nrfiles=50000", "--filesize=4k", "--openfiles=64",
               "--rw=write", "--create_only=1", "--time_based=0",
               "--output-format=json"]
    else:
        target_path = target if is_raw else os.path.join(target, "fs_overhead.dat")
        cmd = ["fio", f"--name={name}", f"--filename={target_path}", f"--size={SIZE}",
               f"--bs={bs}", f"--rw={rw}", f"--iodepth={qd}",
               "--ioengine=libaio", "--direct=1", "--time_based=1",
               f"--runtime={RUNTIME}", "--output-format=json"]
        if fsync_each:
            cmd += ["--fsync=1"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(out.stdout)["jobs"][0]
    if name == "metadata-create":
        # Use total runtime to derive files-per-second
        runtime_s = j["job_runtime"] / 1000.0
        return {"files_per_s": 50000 / runtime_s, "runtime_s": runtime_s}
    side = "write" if "write" in rw else "read"
    iops = j[side]["iops"]
    bw_mb = j[side]["bw_bytes"] / (1024 * 1024)
    p99_us = j[side]["clat_ns"]["percentile"].get("99.000000", 0) / 1000
    # Amplification: read /sys/block/<dev>/stat before and after
    return {"iops": iops, "mb_s": bw_mb, "p99_us": p99_us}
print(f"{'path':>22} {'scenario':>22} {'IOPS/files':>11} {'MB/s':>8} {'p99 µs':>8}")
print("-" * 80)
for path in PATHS:
    label = path.split('/')[-1] if path != '/' else 'root'
    for name, bs, rw, qd, fs in SCENARIOS:
        try:
            r = run_fio(path, name, bs, rw, qd, fs)
            if r is None:
                print(f"{label:>22} {name:>22} {'(n/a — raw)':>11}")
                continue
            if "files_per_s" in r:
                print(f"{label:>22} {name:>22} {r['files_per_s']:11.0f} {'-':>8} {'-':>8}")
            else:
                print(f"{label:>22} {name:>22} {r['iops']:11.0f} {r['mb_s']:8.1f} {r['p99_us']:8.0f}")
        except subprocess.CalledProcessError as e:
            print(f"{label:>22} {name:>22} ERROR: {e.stderr[:60]}")
# Sample run on a Samsung PM9A3 (3.84 TB, PLP, PCIe Gen4) in a Bengaluru lab,
# kernel 6.5, 32 GB RAM. Filesystems freshly formatted, then pre-filled to 60% before the runs.
# /mnt/ext4_default = mount -o defaults (data=ordered, barrier=1, relatime)
# /mnt/ext4_tuned = mount -o data=writeback,noatime,nobarrier,commit=60
# /mnt/xfs = mount -o defaults (XFS with logbufs=8,logbsize=256k)
                  path               scenario  IOPS/files     MB/s   p99 µs
--------------------------------------------------------------------------------
               nvme0n1   4K-randwrite-noFsync      408200   1594.5      290
               nvme0n1    4K-randwrite-fsync1       21800     85.2      142
               nvme0n1        metadata-create (n/a — raw)
          ext4_default   4K-randwrite-noFsync       38400    150.0     4800
          ext4_default    4K-randwrite-fsync1        5200     20.3     9100
          ext4_default        metadata-create       12800        -        -
            ext4_tuned   4K-randwrite-noFsync      186000    726.5      820
            ext4_tuned    4K-randwrite-fsync1       18900     73.8      380
            ext4_tuned        metadata-create       24300        -        -
                   xfs   4K-randwrite-noFsync       92000    359.4     1900
                   xfs    4K-randwrite-fsync1       14200     55.5      720
                   xfs        metadata-create       38400        -        -
Walk through. The raw device row is the ceiling — 408,000 IOPS, 290 µs p99. That is what the NAND can do under the NVMe protocol with no filesystem in the path. Every other row is the filesystem layer's tax on this ceiling.
The default ext4 row collapses to 38,400 IOPS — 9.4% of raw. The lost 91% is split between three things: journal commits (every metadata change is logged before it can be applied — data=ordered mode), atime updates (relatime still updates the access timestamp on files older than 24 hours, generating one inode write per such read), and the write barrier on every transaction commit. The p99 of 4,800 µs vs raw's 290 µs is dominated by journal commit serialisation — when the journal is flushing, all writes wait.
The tuned ext4 row recovers most of the gap — 186,000 IOPS, 45% of raw. The four mount-option changes (data=writeback, noatime, nobarrier, commit=60) drop most of the journal pressure and most of the per-I/O barriers. The remaining 55% loss is the metadata layer (block-bitmap updates) and the page cache bookkeeping — neither is removable from a general-purpose filesystem. The fsync row also recovers well: 18,900 fsync/s vs default's 5,200, because the barrier flushes are gone (the device's PLP makes them safe to skip — see the Going deeper section on nobarrier below).
XFS sits between default and tuned ext4 at 92,000 IOPS. XFS uses a logical log rather than a physical block journal, so its overhead is less per-write but more per-extent-allocation. For random 4K writes XFS is moderately faster than default ext4; for the metadata-create scenario it is 3× faster (38,400 files/sec vs ext4's 12,800) because XFS's allocator and inode cache are designed for parallelism and large directory operations. Why XFS dominates on metadata-heavy workloads: ext4 serialises directory updates through a single per-directory hash-tree lock; XFS uses a fine-grained B+ tree with per-leaf locking, so 64 threads creating files in the same directory contend on 64 different locks rather than one. For workloads like tar -xf large-source.tar or a Maven build extracting 200,000 jars, XFS finishes 2–5× faster than default ext4 with no other tuning.
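A rough way to see the directory-contention effect without fio — a sketch, with an arbitrary thread count, file count, and target directory; Python's GIL is released during the open()/close() syscalls, so the kernel-side directory locking is what actually gets exercised:
# dirent_contention.py — sketch: parallel empty-file creation in one directory
import os, sys, time
from concurrent.futures import ThreadPoolExecutor
TARGET_DIR = sys.argv[1]                 # e.g. /mnt/xfs/contended vs /mnt/ext4_default/contended
THREADS, FILES_PER_THREAD = 16, 5000     # illustrative
os.makedirs(TARGET_DIR, exist_ok=True)
def create_batch(worker_id):
    for i in range(FILES_PER_THREAD):
        # open+close creates the dirent and inode; no data blocks are written
        fd = os.open(f"{TARGET_DIR}/w{worker_id}-{i}", os.O_CREAT | os.O_WRONLY, 0o644)
        os.close(fd)
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    list(pool.map(create_batch, range(THREADS)))
print(f"{THREADS * FILES_PER_THREAD / (time.perf_counter() - t0):,.0f} files/s in {TARGET_DIR}")
Run it once against each mountpoint; the gap tracks the fio metadata-create column in the table above.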
The procurement-equivalent decision falls out of this table. If your workload is fsync-dominated (Postgres, MySQL, anything OLTP), tuned ext4 with PLP gives you roughly 3.5× the default's commit rate (5,200 → 18,900 fsync/s above) — and that is achievable through four mount options, not a code change. If your workload is metadata-heavy (build systems, mail servers, image-processing pipelines), XFS is the right choice. If your workload is large sequential reads/writes (analytics, log aggregation), the gap between filesystems narrows to ~10% and other factors (snapshot support, online resizing) dominate.
Journaling — the single biggest source of overhead
Every modern filesystem journals metadata. The journal is a circular log written before the in-place metadata change, so that a power loss leaves either "old metadata, no journal entry" (safe) or "new metadata, completed journal entry" (also safe, can be replayed). The journal prevents filesystem corruption — without it, a power loss during an unlink() could leave a directory entry pointing to a freed inode, or a freed inode listed as in-use.
The cost is one extra round-trip for every metadata-mutating operation. A single unlink() writes:
- The new directory contents (the entry removed).
- The new inode bitmap (the inode marked free).
- The journal commit block (saying "operations 1 and 2 are now safe to apply").
- Eventually, after the journal commit has reached the device, the in-place metadata writes themselves.
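ext4's journal thread (jbd2) exposes per-device transaction statistics in procfs; dumping them before and after an unlink-heavy burst shows this commit traffic directly. A sketch — the exact fields in the info file vary by kernel version:
# jbd2_stats.py — sketch: dump ext4 journal (jbd2) transaction statistics
import glob
for info in sorted(glob.glob("/proc/fs/jbd2/*/info")):
    print(f"== {info} ==")
    with open(info) as f:
        print(f.read())
Run it, delete a few thousand files, and run it again: the transaction count and the blocks-per-transaction figures move in step with the metadata churn.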
For ext4 in data=ordered mode, the journal logs only metadata, but data writes must complete before their corresponding metadata change can be journalled. This serialises data and metadata in a way that destroys parallelism on small writes.
For ext4 in data=writeback mode, data and metadata are not ordered — your write() and your inode-timestamp update can complete in either order. The risk is that a power loss can leave you with the new metadata (file size shows 4 KB) pointing at stale data (still all zeros, or worse, leaked content from a deleted file). For workloads where the application controls its own durability through fsync() (databases), writeback is safe and 5× faster.
For data=journal mode, every byte of data also goes through the journal. Data write amplification is at least 2× (every data byte written twice — once to the journal, once in place), on top of the usual metadata and commit-block overhead. This is the safest mode but the slowest; almost no production workload uses it because applications that need that level of durability use their own WAL on top.
The XFS log is structurally similar but stores logical operations ("create dirent X in inode Y") rather than physical block contents. The log replays these operations on mount. Because operations are typically tens of bytes (vs ext4's 4 KB physical blocks), the log overhead per operation is dramatically lower — this is why XFS metadata-creates are 3× faster than ext4. The downside is that a corrupted log requires xfs_repair to scan the entire filesystem, which is slow on multi-TB volumes.
Btrfs and ZFS take a third approach: copy-on-write. Every change writes new blocks elsewhere, never overwriting in place; the filesystem's superblock points at the new "tree root" once everything is written. The journal is implicit (the old tree root is the safe state). The cost is write amplification: every 4 KB write also rewrites the parent metadata blocks in the tree, typically 16–24 KB total. Both filesystems win on snapshots and integrity (every block has a checksum) but lose on small-write performance.
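These amplification factors can be checked empirically rather than estimated: /sys/block/<dev>/stat exposes cumulative sectors written at the block layer (512-byte sectors), so the delta across a known amount of application writing gives the filesystem's amplification. A sketch — the device name and probe path are assumptions, and the box should be otherwise idle so other writers do not pollute the counter:
# write_amp.py — sketch: filesystem write amplification via /sys/block/<dev>/stat
import os
DEV, PATH, N = "nvme0n1", "/mnt/nvme/amp_probe.dat", 4096   # illustrative; N 4-KB writes
def sectors_written(dev):
    # stat fields: reads, read merges, read sectors, read ticks, writes, write merges,
    # write sectors, ... — field 7 (index 6) is sectors written, 512 bytes each
    return int(open(f"/sys/block/{dev}/stat").read().split()[6])
before = sectors_written(DEV)
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
for i in range(N):
    os.pwrite(fd, b"\0" * 4096, i * 4096)
    os.fsync(fd)                  # force journal commits and metadata out, not just data
os.close(fd)
os.sync()                         # flush anything still queued in the page cache
after = sectors_written(DEV)
user_mb = N * 4096 / 1e6
dev_mb = (after - before) * 512 / 1e6
print(f"user wrote {user_mb:.1f} MB, device saw {dev_mb:.1f} MB -> amplification {dev_mb / user_mb:.2f}x")
Run it once per journal mode on an otherwise idle box and the ~12 KB / ~16 KB / ~4.5 KB figures above become measurements rather than estimates.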
fsync, barriers, and the fsync ladder revisited
The previous chapter (/wiki/ssd-vs-hdd-vs-nvme-vs-persistent-memory) introduced the fsync ladder at the device level — how long the controller takes to acknowledge that data is durable. This chapter adds the filesystem layer that sits between your fsync() and the device's REQ_FUA.
A single application-level fsync(fd) translates into the following filesystem actions:
- Flush all dirty pages of the file from the page cache to the block device.
- Wait for the block device to ack each of those writes.
- Issue a journal commit covering any pending metadata changes for this file.
- Wait for the journal commit to ack.
- Issue a REQ_PREFLUSH + REQ_FUA pair to force the device's volatile cache to flush.
- Wait for the device to ack the flush.
Steps 5 and 6 are the write barrier. They exist because the device has its own DRAM cache; without the barrier, the device might ack a write that is still in DRAM and lose it on power failure. With PLP, the device's DRAM is durable (the capacitors flush it during power loss), so the barrier becomes redundant. Mounting with nobarrier skips steps 5 and 6 entirely.
Decomposed per fsync(), default ext4 at ~1.16 ms is the baseline. Removing the barrier (safe with PLP) saves 360 µs. Switching to data=writeback saves another 180 µs by removing the data-then-metadata ordering. XFS sits at 880 µs by default — its journal is faster (logical log, 80 µs vs ext4's 280 µs) but it still issues the barrier, so the total is similar to ext4-default.
The practical Postgres tuning corollary: on an NVMe SSD with PLP, mount -o nobarrier,data=writeback,noatime,commit=60 is safe, and it improves Postgres's commit rate by 4–6× over the default mount. The downside is that you must verify PLP exists (check the SSD's datasheet for "power loss protection" or "PLP" — Samsung PM9A3 has it, Samsung 990 Pro does not) and that you trust the firmware's PLP implementation (occasional firmware bugs in budget consumer SSDs claim PLP but lose data on power loss).
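A related lever at the same layer is fdatasync(), which commits the data and only the metadata needed to read it back, skipping timestamp-only journal commits — which is why Postgres's default wal_sync_method on Linux is fdatasync. A sketch comparing the two on a preallocated file (the path and counts are illustrative):
# sync_compare.py — sketch: fsync() vs fdatasync() cost on overwrites of a preallocated file
import os, time
PATH, N = "/mnt/nvme/sync_probe.dat", 500          # illustrative
def per_commit_us(sync_fn):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    os.posix_fallocate(fd, 0, N * 4096)            # overwrites in place: no size change per write
    t0 = time.perf_counter()
    for i in range(N):
        os.pwrite(fd, b"\0" * 4096, i * 4096)
        sync_fn(fd)
    elapsed = (time.perf_counter() - t0) / N * 1e6
    os.close(fd)
    return elapsed
print(f"fsync    : {per_commit_us(os.fsync):7.1f} µs per commit")
print(f"fdatasync: {per_commit_us(os.fdatasync):7.1f} µs per commit")
On a file that is growing rather than being overwritten, the two numbers converge, because the size change forces a metadata commit either way.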
Allocator decisions and fragmentation
A filesystem's allocator decides which physical blocks to hand to a new write. The decision affects future read performance — if a 1 GB file is laid out in 250,000 4-KB extents scattered across the device, a "sequential" read becomes 250,000 separate block-layer requests and runs 2–10× slower than the same file laid out in two 512-MB extents.
Ext4's allocator uses delayed allocation: when you write() data, the filesystem holds the dirty page in memory and defers the allocation decision until the page is actually flushed to disk (typically 30 seconds later, or at fsync). At flush time it has visibility into how many dirty pages exist for this file and can allocate them as a contiguous extent. Delayed allocation is why ext4 generally shows good extent layout for files written sequentially in a single session and poor layout for files written in interleaved small batches.
XFS uses speculative preallocation: when it sees a file growing, it allocates more blocks than requested and trims the unused ones at file close. A file growing 4 KB at a time gets allocations of 64 KB, then 128 KB, then 256 KB, doubling up to 1 GB. This produces excellent contiguity for log files and append-only workloads but wastes space if the file never grows further (the unused tail is reclaimed when the file is closed, not immediately).
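The "pre-allocated files" lever from the opening summary is the application-side counterpart to these allocator policies: asking for all the extents up front lets the allocator lay the file out contiguously and takes allocation out of the write path entirely. A sketch for a WAL-style segment of known maximum size (the path and size are illustrative):
# preallocate.py — sketch: reserve contiguous extents for a fixed-size segment up front
import os
PATH, SEGMENT_BYTES = "/mnt/nvme/wal/000001.seg", 1 << 30   # illustrative 1 GiB segment
os.makedirs(os.path.dirname(PATH), exist_ok=True)
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
# posix_fallocate allocates the blocks now, so later writes overwrite in place
# instead of triggering per-write extent allocation and the metadata that comes with it
os.posix_fallocate(fd, 0, SEGMENT_BYTES)
os.fsync(fd)               # persist the allocation metadata before relying on it
os.close(fd)
filefrag on the result typically shows a single-digit extent count for a freshly fallocated file on ext4 or XFS.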
Btrfs and ZFS, being copy-on-write, always fragment under random-write workloads. A database that issues 4 KB random writes to a 100 GB file will, over weeks, produce a file with hundreds of thousands of extents and read throughput that has degraded to 30% of the original. The mitigation on Btrfs is chattr +C (no-CoW) on the database file — it must be set while the file is still empty to take effect — which trades checksums and snapshot integrity for in-place updates.
Asha's harness above does not measure fragmentation directly, but the sequential-throughput numbers degrade after 2 weeks of mixed workload — the same dd test that showed 6 GB/s on the freshly-written file shows 4.1 GB/s after the database has been in use for a fortnight. The filesystem is not slower; the file's blocks are no longer contiguous, and the sequential read is now actually a series of short random reads. Why this matters for backup and analytics workloads: a 200-GB Postgres database that is logically sequential when dumped (pg_dump) takes 80 seconds to dump on a fresh device and 220 seconds to dump after 6 months of use. The dump slowed by nearly 3× not because Postgres slowed down but because the file's extent count grew from 12 to 47,000 — every extent boundary is a discontinuous block-layer request. Defragmentation (e4defrag or xfs_fsr) reverses this; running it monthly on transactional database files is a real operational lever.
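Extent growth is cheap to monitor before it becomes a 3× regression. filefrag (from e2fsprogs; it works on XFS files too) prints the extent count per file, and a small wrapper can flag candidates for the monthly defrag pass. A sketch — the glob and threshold are arbitrary:
# extent_watch.py — sketch: flag heavily fragmented files from filefrag's extent count
import glob, re, subprocess
WATCH_GLOB, THRESHOLD = "/var/lib/postgresql/16/main/base/*/*", 1000   # illustrative
for path in sorted(glob.glob(WATCH_GLOB)):
    out = subprocess.run(["filefrag", path], capture_output=True, text=True)
    m = re.search(r"(\d+) extents? found", out.stdout)
    if m and int(m.group(1)) > THRESHOLD:
        print(f"{path}: {m.group(1)} extents — candidate for e4defrag / xfs_fsr")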
Common confusions
- "
fsync()writes my data to the disk." It writes your data to the device, not the disk's NAND. If the device has a volatile cache (most consumer SSDs and HDDs), the write may sit in the cache for milliseconds before reaching the persistent medium. TheREQ_FUAflag tells the device to ack only after the write is persistent — but the kernel only setsREQ_FUAif mounted withbarrier=1(default). Withnobarrierand a non-PLP device, your "fsynced" data can be lost on power failure. PLP makes the cache itself durable, which is whynobarrieris safe with PLP and dangerous without. - "ext4 and XFS perform the same on modern hardware." They differ by 3–5× on metadata-heavy workloads (file create/delete, directory operations). XFS dominates on parallel metadata; ext4 dominates on small-file random reads when the inode cache is hot. For OLTP databases the difference is small (~10%); for build systems, mail servers, and image processing the difference is workload-changing.
- "
noatimeis dangerous because it breaks programs." Mutt and a handful of legacy mail tools depend on access timestamps; almost nothing else does.relatime(the default) updates atime only if it is older than mtime — a compromise that catches the Mutt case while skipping most updates. For database servers,noatimeis universally safe and saves 5–15% of write IOPS. - "More memory means the filesystem matters less." The opposite — more memory means a larger page cache, which means more dirty pages that all have to be written out at the same time when memory pressure forces eviction. The dirty-page tsunami can saturate the device for seconds, during which all
fsync()calls block. The correct response isvm.dirty_ratioandvm.dirty_background_ratiotuning to spread the writeback over time, not to assume more RAM solves the problem. - "
data=journalis the safest mode." It is the safest against filesystem corruption but not against application data loss, which is the more common failure mode. Applications that issuefsync()correctly are safe underdata=writeback; applications that don't are unsafe under any mode. The mode controls filesystem-level integrity, not application-level durability. - "Tmpfs and ramfs are filesystems and have filesystem overhead." Tmpfs has filesystem layer overhead (VFS, inode allocation, dentry cache) but no journal and no block layer — the page cache is the storage. Throughput on tmpfs is bounded by memory bandwidth (50–80 GB/s) rather than I/O. Ramfs is an even thinner layer with no size limit, no swap, and no statistics — used for very specific kernel testing scenarios, almost never in production.
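The dirty ratios are percentages of reclaimable memory, so the absolute writeback thresholds grow with RAM — which is exactly the tsunami problem. A sketch that converts the current sysctls into bytes (it approximates "dirtyable" memory with MemAvailable, which is close but not identical to the kernel's own calculation; vm.dirty_bytes and vm.dirty_background_bytes override the ratios when set):
# dirty_thresholds.py — sketch: absolute dirty-page thresholds implied by the current sysctls
def sysctl(name):
    return int(open(f"/proc/sys/vm/{name}").read())
def meminfo_kb(field):
    for line in open("/proc/meminfo"):
        if line.startswith(field + ":"):
            return int(line.split()[1])
avail_bytes = meminfo_kb("MemAvailable") * 1024      # rough stand-in for dirtyable memory
for ratio, override in [("dirty_background_ratio", "dirty_background_bytes"),
                        ("dirty_ratio", "dirty_bytes")]:
    threshold = sysctl(override) or avail_bytes * sysctl(ratio) // 100
    print(f"{ratio:>24}: {threshold / 1e9:6.2f} GB of dirty pages")
print(f"{'currently dirty':>24}: {meminfo_kb('Dirty') * 1024 / 1e9:6.2f} GB")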
Going deeper
When nobarrier actually loses data
The nobarrier mount option (-o nobarrier or -o barrier=0 on ext4; XFS accepted -o nobarrier historically, though recent kernels have removed it) tells the filesystem to skip the REQ_PREFLUSH + REQ_FUA pair on every transaction commit. The filesystem still issues writes; it just trusts that the device's cache is durable.
This is safe if and only if all of the following hold:
- The device has power-loss protection (PLP), confirmed in the datasheet.
- The device's PLP firmware is correct (most enterprise drives, almost no consumer drives).
- The path between the OS and the device does not have a separate cache that PLP does not cover (e.g., a hardware RAID card with battery-backed write cache that has a dead battery).
- The kernel does not reorder writes in a way that would break the journal's ordering invariants without the barrier (no shipping Linux kernel does this, but it is a theoretical possibility).
If any of these fail, nobarrier corrupts your filesystem on the next power loss. The corruption mode is typically a journal-replay failure that requires fsck or xfs_repair; in pathological cases the filesystem is unmountable and must be restored from backup. For Razorpay and similar regulated workloads, nobarrier is allowed only on volumes with verified-PLP enterprise NVMe and only with a documented runbook for power-loss scenarios.
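That checklist is worth automating before anyone edits fstab. A preflight sketch that cross-checks the mounted options against a hand-maintained allow-list of verified-PLP models — the allow-list contents, the mountpoint, and the assumption that the mount device is a whole NVMe namespace (not a partition) are all local policy choices, not facts from this chapter:
# nobarrier_preflight.py — sketch: refuse barriers-off mounts on devices not on a verified-PLP list
import os, sys
VERIFIED_PLP_MODELS = {"SAMSUNG MZQL23T8HCLS-00A07"}   # hypothetical PM9A3 model string
MOUNTPOINT = sys.argv[1] if len(sys.argv) > 1 else "/mnt/nvme"
def mount_entry(mountpoint):
    for line in open("/proc/mounts"):
        dev, mnt, _fstype, opts = line.split()[:4]
        if mnt == mountpoint:
            return dev, opts.split(",")
    raise SystemExit(f"{mountpoint} is not mounted")
dev, opts = mount_entry(MOUNTPOINT)
barriers_off = "nobarrier" in opts or "barrier=0" in opts
model_file = f"/sys/block/{os.path.basename(dev)}/device/model"   # whole-namespace devices only
model = open(model_file).read().strip() if os.path.exists(model_file) else "unknown"
if barriers_off and model not in VERIFIED_PLP_MODELS:
    raise SystemExit(f"UNSAFE: {dev} ({model}) has barriers off but is not on the verified-PLP list")
print(f"OK: {dev} ({model}), barriers {'off' if barriers_off else 'on'}")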
O_DIRECT and the page cache bypass
Mounting with nobarrier, data=writeback, and noatime removes most of the journal and barrier costs. The remaining filesystem layer cost is the page cache itself — every write() first goes into a page-cache page, which the kernel later flushes. For databases that have their own buffer pool (Postgres, MySQL/InnoDB, Oracle), this is a wasteful double-cache: the same 8 KB page lives in both the database's shared_buffers and the OS page cache.
The fix is O_DIRECT: open the file with O_DIRECT and the kernel skips the page cache entirely, going straight to the block layer. The benefits are no double-caching, no dirty-page tsunami on memory pressure, and predictable I/O latency (no surprise writeback during a busy period). The costs are that the application must align all I/O to the device's logical block size (typically 4 KB), provide its own caching, and handle the lack of the kernel's read-ahead.
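What the alignment requirement looks like in practice — a minimal sketch, assuming a 4 KB logical-block device and an illustrative path; an anonymous mmap provides a page-aligned buffer, which satisfies O_DIRECT's alignment rule:
# odirect_write.py — sketch: one aligned 4 KB write that bypasses the page cache
import mmap, os
PATH = "/mnt/nvme/direct_probe.dat"           # illustrative
BLOCK = 4096                                  # must match the device's logical block size
buf = mmap.mmap(-1, BLOCK)                    # anonymous mapping => page-aligned buffer
buf.write(b"A" * BLOCK)
fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
# O_DIRECT requires: aligned buffer, offset a multiple of the block size, length a multiple too
os.pwrite(fd, buf, 0)
os.fsync(fd)   # O_DIRECT skips the page cache, not the device cache — durability still needs fsync
os.close(fd)
An unaligned buffer, or a length or offset that is not a multiple of the logical block size, makes the pwrite() fail with EINVAL.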
Postgres added direct I/O support in version 16 (2023), behind the debug_io_direct GUC (still flagged as a developer option). MySQL has supported it since 5.5 (innodb_flush_method=O_DIRECT). The performance gain on NVMe + tuned-ext4 is typically 20–40%; the gain on default ext4 is closer to 60% because removing the page cache also removes much of the journal pressure on metadata writes (the page cache's writeback was the trigger for many of those metadata changes). Combined with io_uring (covered in the next chapter), O_DIRECT recovers most of the remaining filesystem overhead.
F2FS and the storage-aware filesystem family
F2FS (Flash-Friendly File System) is a Linux filesystem designed specifically for flash media. It uses a log-structured layout where all writes go to a moving append point, never overwriting in place. The benefits for SSDs are predictable: no read-modify-write at the device level (every write is a fresh allocation), no garbage-collection contention with the SSD's internal GC (the filesystem's GC is aware of the device's), and reduced write amplification.
F2FS is the default filesystem on most Android devices in 2026 and on Samsung's enterprise SSD reference platforms. Production Linux servers rarely use it because the ecosystem is smaller (fewer admins know it), the snapshot story is weaker than ZFS or Btrfs, and the per-thread random-read performance is slightly lower than ext4 (the log-structured layout produces more cache misses on read paths).
The family includes NOVA (a research filesystem for byte-addressable persistent memory), DAX mode for ext4 and XFS (which lets mmap go directly to PMEM with no page cache), and BlueStore (Ceph's purpose-built storage layer that bypasses any kernel filesystem). The pattern across all of them is "filesystems are abstractions over block devices, and when the device is no longer block-shaped, the abstraction needs to change". For 95% of workloads, ext4 or XFS on NVMe with the four mount tweaks is the right answer; for the 5% where storage performance is the dominant cost, it is worth knowing that the alternatives exist.
Filesystem cost in cloud environments
EBS, GCP Persistent Disk, and Azure Premium SSD are network-attached block devices, but the filesystem still runs on the VM. The filesystem layer cost is identical to bare-metal — data=ordered and barrier=1 add the same overhead — but the underlying device is already 1–3 ms latency before any filesystem touches it. The relative cost of the filesystem layer is therefore smaller in the cloud (the device baseline dominates) but the absolute cost is the same.
The practical implication: tuning mount options on EBS gp3 gives you 10–20% improvement, not the 5× you would see on local NVMe. The filesystem is no longer the bottleneck; the network round-trip is. For Razorpay's ap-south-1 (Mumbai) deployment, the team verified this empirically — nobarrier on io2 Block Express moved their Postgres commit rate from 18,000/s to 21,000/s (17% improvement), where the same change on local NVMe instance store moved it from 22,000/s to 84,000/s (4× improvement). The cloud's network layer hides much of the filesystem overhead, both for good and for ill.
Reproducibility footer
# Reproduce this on your laptop, ~30 minutes per filesystem
sudo apt install fio sysstat
python3 -m venv .venv && source .venv/bin/activate
# Format and mount three filesystems on three loop devices or partitions:
sudo mkdir -p /mnt/ext4_default /mnt/ext4_tuned /mnt/xfs
sudo mkfs.ext4 -F /dev/loop1 && sudo mount /dev/loop1 /mnt/ext4_default
sudo mkfs.ext4 -F /dev/loop2 && sudo mount -o data=writeback,noatime,nobarrier,commit=60 /dev/loop2 /mnt/ext4_tuned
sudo mkfs.xfs -f /dev/loop3 && sudo mount /dev/loop3 /mnt/xfs
# Run the harness:
sudo python3 fs_overhead.py /dev/nvme0n1 /mnt/ext4_default /mnt/ext4_tuned /mnt/xfs
# Watch dirty pages and writeback in real time:
watch -n 1 'grep -E "Dirty|Writeback" /proc/meminfo'
Where this leads next
The next chapter (/wiki/o-direct-async-i-o-io-uring) covers O_DIRECT and io_uring — the two mechanisms that let an application bypass most of the filesystem layer overhead this chapter measured. The combination is what modern OLTP databases (Postgres 16+, MySQL 8+, ScyllaDB, FoundationDB) use to extract the device's full IOPS from a single application thread.
The chapter after that (/wiki/page-cache-and-its-promises) covers the kernel page cache itself — its sizing, its eviction policy, and how to tune vm.dirty_ratio and vm.swappiness to control when the dirty-page tsunami strikes. The page cache is the filesystem layer's biggest performance win (free RAM caching for hot files) and biggest performance risk (unbounded writeback at memory pressure).
Three operational habits this chapter adds. First, mount your database volumes with noatime,nodiratime always — the access-timestamp updates are pure overhead for any workload that does not query atime. Second, verify PLP before enabling nobarrier — read the SSD datasheet, run a power-loss test if possible, and document the assumption in your runbook. Third, measure filesystem overhead in your specific workload before committing to ext4 vs XFS — the 3–5× metadata gap is real but workload-dependent, and the wrong choice locks you into a difficult migration later.
The Zerodha Kite tick-storage layer documents their filesystem choice publicly: XFS with logbsize=256k,logbufs=8,noatime,nodiratime on local NVMe instance store, with their tick-aggregation files written through O_DIRECT to bypass the page cache. The 256 KB log buffer size is a 4× increase over the default and is the setting that lets their 10:00 IST market-open burst of 1.4M order events not pause the filesystem's log commits. The configuration is one line in /etc/fstab and recovers approximately 22% of Postgres commit throughput compared to ext4 defaults — roughly the difference between needing 14 servers and needing 11 for the same load.
The contrast with Hotstar's video-segment storage is informative. Their VOD encoder writes 6-second video segments (~4 MB each) sequentially and reads them once, sequentially, when streaming. The filesystem choice barely matters — ext4 default and XFS default are within 3% on this workload — but dir_index (the ext4 directory hashing feature, on by default) is the difference between linear and log-time readdir() performance when a single directory holds 500,000 segments. The default-on choice was not always default; older kernels had it off by default and the difference was visible in ls /var/lib/segments/ taking 4 seconds vs 30 milliseconds. Sometimes the filesystem decision is not "which filesystem" but "which option from 2008 are you still missing".
A fourth habit, more general: read the mount man page for your filesystem of choice end to end at least once. It is 200–400 lines of small, dense, mostly-ignored configuration knobs that collectively determine 30–80% of the filesystem layer's behaviour. The defaults are tuned for "general-purpose desktop with safety margins"; database servers, build farms, and analytics nodes all want different defaults. The 30 minutes spent reading the man page pay back in 2–10× performance gains visible in the next benchmark run.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), chapter 8 — Filesystems — the canonical text on filesystem performance, including the methodology for separating filesystem overhead from device latency.
- Theodore Ts'o et al., "ext4 Wiki" — Linux Kernel Documentation — the authoritative reference for ext4's data modes, journal layout, and mount options. Ted Ts'o is the maintainer; the wiki is honest about ext4's tradeoffs.
- Dave Chinner, "XFS internals — the metadata log" (LCA 2014) — XFS's lead developer explaining why the logical log scales where the physical log doesn't; the foundational reference for XFS's metadata performance characteristics.
- Mathieu Desnoyers, "Generic Block Layer Tracing" — kernel.org — the reference for blktrace/blkparse, which is how to actually measure where I/O time goes inside the block layer.
- PostgreSQL documentation — wal_sync_method and fsync tuning — the database-side counterpart to filesystem fsync tuning; explains why Postgres calls fsync() and what the filesystem layer's choices do to the WAL throughput.
- Jens Axboe, "fio Documentation" — the workhorse benchmarking tool used in this chapter's harness; the author wrote io_uring and is the Linux kernel's I/O subsystem maintainer.
- /wiki/ssd-vs-hdd-vs-nvme-vs-persistent-memory — the previous chapter on storage media, which establishes the device-level baseline against which this chapter measures filesystem overhead.