SSD vs HDD vs NVMe vs persistent memory

Karan runs the platform team for a Pune-based broker that clears equity trades for retail clients. The order-router writes one row per order to Postgres and one row to a separate audit log; both fsync at commit because SEBI's audit requirement permits no acknowledged-but-lost trades. The current production box uses a 3.84 TB enterprise SATA SSD rated at 98,000 random-write IOPS — the spec sheet looks healthy and the device costs ₹38,000. A vendor demo of an NVMe SSD at ₹71,000 promises "1.4 million IOPS"; Karan's procurement spreadsheet shows that as 14× the IOPS of a part whose 98,000 already dwarfs anything the workload should need, at nearly twice the price. He buys the SATA. Six weeks later the order-commit p99 is 28 ms during the 09:15 IST market open and the trading desk is escalating. He swaps a single test box to NVMe. The same Postgres binary, same schema, same workload — p99 drops to 3 ms.

The procurement spreadsheet was not wrong about the IOPS numbers. It was wrong about what kind of device each row was describing. SATA SSDs, NVMe SSDs, spinning HDDs, and persistent memory are not four points on one continuum — they are four architectures with different latency floors, different queue-depth requirements, and different fsync paths. Compare them on raw IOPS and you get the wrong answer for any workload where commit latency matters, which is most of them.

The four storage media you actually buy in 2026 — spinning HDD, SATA SSD, NVMe SSD, persistent memory — sit at different points on three cliffs: per-I/O latency, queue-depth-to-saturate, and fsync-acknowledge time. Spec-sheet IOPS hides all three. The right comparison is a four-cell table per workload shape (random small-read IOPS at production queue depth, sequential throughput, fsync-bounded commit rate, and tail-latency under 80% load) measured from fio invocations driven by a Python harness — and the cost-per-useful-IOPS gap between SATA and NVMe is typically 5–10× in NVMe's favour for OLTP-style workloads.

Four media, three architectural cliffs

A spinning HDD is a mechanical device — a stack of platters rotating at 7,200 or 15,000 RPM with read/write heads moving across them. Every random I/O pays a seek (3–9 ms to position the head) plus a half-rotation latency (3–4 ms at 7,200 RPM). The combined access time is 6–12 ms; the device cannot do better than ~80–150 random IOPS regardless of how aggressively you queue requests. Sequential reads bypass seeks and can hit 200–250 MB/s on a modern enterprise drive. The architecture is mechanical; the bottleneck is geometry.
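
The geometry arithmetic is worth doing once by hand; a short sketch with illustrative mid-range values, not a specific drive:

# Back-of-the-envelope HDD random-I/O arithmetic from the geometry above (illustrative values).
rpm, avg_seek_ms = 7200, 8.0
half_rotation_ms = 60_000 / rpm / 2          # average rotational latency = half a revolution
access_ms = avg_seek_ms + half_rotation_ms   # ~12 ms per random I/O
print(f"access ≈ {access_ms:.1f} ms  ->  ≈ {1000 / access_ms:.0f} random IOPS per spindle")
# A 4 ms seek instead gives ~8 ms access and ~120 IOPS; queueing cannot push past the geometry.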

A SATA SSD uses NAND flash with no moving parts but speaks the SATA protocol — designed in 2003 for spinning rust. The protocol caps at one outstanding command per port (NCQ extends to 32) and saturates around 540 MB/s sequential, 80,000–100,000 random IOPS. The flash itself can do far more; the SATA protocol is the ceiling. Per-I/O latency is 60–250 µs, dominated by protocol overhead rather than NAND read time (~25 µs). The architecture is "flash bolted onto a spinning-disk interface".

An NVMe SSD speaks NVM Express — a protocol designed for flash from scratch — over PCIe lanes. Each device exposes 64K hardware queues with 64K commands per queue (the SATA equivalent: 1 queue, 32 commands). Per-I/O latency is 50–110 µs (dominated by NAND read time, the protocol overhead is sub-µs). Sequential throughput on PCIe Gen 4 hits 7,000 MB/s; random IOPS reaches 600K–1.5M with sufficient queue depth. The architecture removes every legacy bottleneck the SATA stack imposed; what remains is the flash itself plus the PCIe lanes.

Persistent memory (Intel Optane DC, the PMEM-class devices) plugs into DDR DIMM slots and is byte-addressable — you mmap it and dereference pointers, no I/O syscall in the path. Latency is 100–350 ns for reads, 100–200 ns for writes (compare to DRAM: 70 ns; NVMe: 50,000 ns). Throughput per channel is ~6 GB/s. There is no queue, because there is no I/O — the CPU's memory controller sees these accesses the way it sees DRAM, just slower and persistent. After Intel discontinued Optane in 2022, the surviving PMEM-class devices in 2026 are CXL-attached memory expanders (lower density, higher latency than Optane was, but conceptually the same architecture: load/store, not read/write).
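
The "no syscall in the data path" claim is easiest to see in code. A minimal sketch, assuming a DAX-mounted pmem filesystem at /mnt/pmem0 (hypothetical path); production code would use MAP_SYNC and CPU cache-flush instructions via libpmem rather than msync, but the access pattern is the same:

# Byte-addressable access to persistent memory via mmap: loads and stores, no read()/write().
import mmap, os

path = "/mnt/pmem0/counter.bin"              # hypothetical file on a DAX-mounted pmem device
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.ftruncate(fd, 4096)                       # one page is plenty for the demo
buf = mmap.mmap(fd, 4096)                    # on a DAX mount, stores land on the media directly

value = int.from_bytes(buf[0:8], "little")   # a plain memory load
buf[0:8] = (value + 1).to_bytes(8, "little") # a plain memory store: no syscall in the data path
buf.flush()                                  # msync for durability; real pmem code uses CLWB/SFENCE
buf.close(); os.close(fd)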

[Figure: latency vs random IOPS, log-log scatter. HDD ~8,000 µs and ~100 IOPS; SATA SSD ~150 µs and ~90K IOPS; NVMe SSD ~75 µs and ~800K IOPS; persistent memory ~0.3 µs and ~1B ops/s. Annotated cliffs: mechanical to flash, ~50× faster; SATA to PCIe, ~10× more IOPS; I/O to load/store, ~250× lower latency.]
Four media on the latency-IOPS plane, showing the three architectural cliffs that separate them. Each cliff is a category change: HDD→SATA SSD removes mechanical seeks, SATA→NVMe removes the legacy storage protocol, NVMe→PMEM removes the I/O syscall altogether. Illustrative — typical 2026 numbers; specific devices vary within each cluster.

The cliffs matter because they determine which workloads benefit. A nightly batch that reads 220 GB sequentially does not care about cliff 2 (SATA throughput is already 540 MB/s, the bottleneck is upstream). An OLTP workload that issues 50,000 fsyncs/s cares enormously — at SATA's ~250 µs commit latency, you saturate at 4,000 fsync/s; at NVMe's ~50 µs you saturate at 20,000; at PMEM's 200 ns you saturate at 5,000,000. The choice is not "fast vs slow"; it is "which cliff am I on the wrong side of for my workload's bottleneck operation?".

Measuring all four from one Python harness

The fair comparison runs the same fio workload against each device and reads the JSON. Karan ran exactly this — four physical devices in one machine, one harness, one table — and the numbers told him which device to buy without further argument. The harness below uses fio (covered in /wiki/disk-performance-iops-throughput-latency) to run four canonical shapes and produces a per-device table. Why one harness across all four devices instead of trusting vendor numbers: vendor benchmarks are run with their preferred queue depth, their preferred block size, and a freshly-trimmed device with full burst credits. Your workload has none of those properties. Running the same fio script against your candidate hardware on your kernel against your filesystem is the only honest comparison — and it takes 20 minutes per device.

# media_compare.py — run identical fio workload against multiple devices, tabulate
# Run: python3 media_compare.py /mnt/hdd/test /mnt/sata_ssd/test /mnt/nvme/test
# Requires: fio (apt install fio); each path must be on the device under test.
import json, subprocess, sys, os

devices = [(p, p.split('/')[2]) for p in sys.argv[1:]]  # (path, label)
size = "4G"   # 4 GB test file per device; the page cache is bypassed by --direct=1 below

# Four canonical I/O shapes that map to real workload patterns.
SCENARIOS = [
    # name,            bs,     rw,         iodepth, comment
    ("4K-randread",    "4k",   "randread",  32, "OLTP point-read shape"),
    ("4K-randwrite",   "4k",   "randwrite", 32, "OLTP point-write shape"),
    ("256K-seqread",   "256k", "read",       8, "batch scan shape"),
    ("4K-fsync",       "4k",   "randwrite",  1, "fsync-per-write commit shape"),
]

def run_fio(target, bs, rw, iodepth, fsync_each=False):
    cmd = ["fio", "--name=t", f"--filename={target}", f"--size={size}",
           f"--bs={bs}", f"--rw={rw}", f"--iodepth={iodepth}",
           "--ioengine=libaio", "--direct=1", "--time_based=1",
           "--runtime=20", "--output-format=json"]
    if fsync_each:
        cmd += ["--fsync=1"]   # fsync after every single write
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(out.stdout)["jobs"][0]
    side = "write" if "write" in rw else "read"
    iops = j[side]["iops"]
    bw_mb = j[side]["bw_bytes"] / (1024 * 1024)
    p50_us = j[side]["clat_ns"]["percentile"].get("50.000000", 0) / 1000
    p99_us = j[side]["clat_ns"]["percentile"].get("99.000000", 0) / 1000
    return iops, bw_mb, p50_us, p99_us

print(f"{'device':>12}  {'workload':>14}  {'IOPS':>9}  {'MB/s':>8}  {'p50 µs':>8}  {'p99 µs':>8}")
print("-" * 78)
for path, label in devices:
    for name, bs, rw, qd, _comment in SCENARIOS:   # the comment field is documentation only
        fsync_each = (name == "4K-fsync")
        iops, mb, p50, p99 = run_fio(path, bs, rw, qd, fsync_each)
        print(f"{label:>12}  {name:>14}  {iops:9.0f}  {mb:8.1f}  {p50:8.0f}  {p99:8.0f}")
    # Cleanup per-device test file
    if os.path.exists(path):
        os.remove(path)
# Sample run on a 4-device workstation (Bengaluru lab, kernel 6.5, ext4):
# - Seagate Exos 18TB enterprise HDD (7200 RPM, 256MB cache)
# - Samsung PM893 SATA SSD (3.84TB, enterprise, PLP)
# - Samsung PM9A3 NVMe SSD (3.84TB, U.2, PCIe Gen 4, PLP)
# - Intel Optane P5800X (1.6TB, PCIe Gen 4 NVMe; Optane media in an SSD form factor, not the
#   byte-addressable DIMM — exposed here as a fast block device so the comparison stays like-for-like)

      device        workload       IOPS      MB/s    p50 µs    p99 µs
------------------------------------------------------------------------
         hdd     4K-randread        148       0.6      6800     14200
         hdd    4K-randwrite        134       0.5      7100     16800
         hdd    256K-seqread        921     230.3      8200     12400
         hdd       4K-fsync         132       0.5      7400     17300
        sata     4K-randread      94800     370.3       320       780
        sata    4K-randwrite      82100     320.7       370       920
        sata    256K-seqread       2110     527.5      3700      4900
        sata       4K-fsync       3920      15.3       240       650
        nvme     4K-randread     782400    3056.3        62       180
        nvme    4K-randwrite     412800    1612.5       110       290
        nvme    256K-seqread      24300    6075.0      2570      4100
        nvme       4K-fsync       21800      85.2        38       142
      optane     4K-randread    1480200    5782.0        18        38
      optane    4K-randwrite    1320500    5158.2        20        45
      optane    256K-seqread      28400    7100.0      2200      3800
      optane       4K-fsync     142800     558.0         6        14

Walk through. The HDD row is in a different unit — 148 random IOPS where the SSDs do 80,000–1.5M, and 16 ms p99 where the SSDs do 0.2–0.9 ms. There is no shape of workload where the HDD beats any SSD at random I/O. The only HDD-favouring number is sequential throughput per rupee — 250 MB/s for ₹4/GB beats SATA SSD's 530 MB/s for ₹15/GB by a factor of ~2× on cold-data archive workloads.

The SATA fsync row at 3,920/sec is the row that buys NVMe. The SATA SSD does 82,000 random write IOPS without fsync, but only 3,920 with fsync per write — a 21× collapse. The NVMe does 412,800 without fsync and 21,800 with — a 19× collapse, but the absolute number is 5.5× higher than SATA. Optane does 142,800 with fsync — 36× higher than SATA, because the controller's fsync path is the wall, not the NAND. Why fsync collapses the IOPS so dramatically: each fsync forces the controller to wait until the write reaches non-volatile storage and is durable across a power loss. With queue depth 32, the controller can pipeline 32 NAND programs in parallel and amortise the per-I/O cost; with --fsync=1 (fsync after each write), the queue depth is effectively 1 because each write blocks waiting for the previous fsync to complete. The IOPS becomes 1 / fsync_latency — for SATA that is ~250 µs giving ~4000/s, for NVMe ~45 µs giving ~22000/s, for Optane ~7 µs giving ~140000/s.
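
That last formula is worth sanity-checking against the table; a few lines, using the measured p50 fsync latencies from the run above:

# fsync-bounded ceiling ≈ 1 / fsync latency; p50 values taken from the table above.
fsync_p50_us = {"sata": 240, "nvme": 38, "optane": 6}
for dev, lat_us in fsync_p50_us.items():
    print(f"{dev:>7}: ceiling ≈ {1e6 / lat_us:>9,.0f} fsync/s")
# -> sata ≈ 4,167, nvme ≈ 26,316, optane ≈ 166,667: each within ~20% of the measured
#    3,920 / 21,800 / 142,800 (the gap is submission overhead and occasional controller pauses).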

The 256K-seqread row tells the throughput story. All four devices are throughput-limited here, not IOPS-limited (note the IOPS column is small — 921 for HDD, 28,400 for Optane). The throughput ratio is 230 : 528 : 6075 : 7100 — a ~26× gap from HDD to NVMe and only 1.2× from NVMe to Optane. Sequential throughput is not where Optane wins; the win is in latency and fsync.

The procurement decision falls out of this table cleanly. If your workload is dominated by 4K-fsync (any OLTP database in production), Optane gives 36× SATA's commit rate; NVMe gives 5.5×. If your workload is dominated by 256K-seqread (a Spark batch, a Druid scan), NVMe gives 11× SATA's throughput; Optane is barely better. If your workload is 4K-randread (a hot Redis-as-cache or Postgres index probe), NVMe gives 8× SATA's IOPS; Optane gives 16×.

The fsync ladder — why "write IOPS" is meaningless without it

The number that distinguishes the four media in production is not raw IOPS, throughput, or even median latency. It is fsync acknowledge time — how long from issuing fsync() until the kernel returns and the data is guaranteed to survive a power loss. Every transactional system on top of the device runs at this rate; everything else is decoration.

The ladder, fastest to slowest:

PMEM (App Direct)        ~150 ns   load/store instruction, no syscall in the path
NVMe + PLP               ~40 µs    acknowledged from the capacitor-backed controller cache
SATA + PLP               ~300 µs   cache ack plus SATA protocol overhead
EBS io2 Block Express    ~350 µs   network round trip plus a PLP-equivalent remote ack
NVMe (no PLP)            ~500 µs   must wait for the NAND program
SATA (no PLP)            ~1 ms     NAND program plus protocol overhead
EBS gp3                  ~1.5 ms   network round trip plus remote write
HDD                      ~8 ms     half a rotation plus, usually, a seek
fsync acknowledge time across storage media, fastest to slowest. The 1 ms mark separates "fast enough for per-commit fsync OLTP" from "you need group commit"; the 10 µs mark separates PMEM from everything else. Illustrative — typical 2026 numbers; the within-class variation is itself often 2–3×.

The Karan story collapses to one number: SATA fsync was ~3,500/s, NVMe fsync was ~22,000/s. The Postgres commit throughput is bounded by min(commits/s_target, fsync/s_capacity) — at the 09:15 IST market open his target was ~15,000 commits/s, the SATA could deliver 3,500, the queue grew, and p99 latency grew with the queue. NVMe at 22,000 had headroom; the queue stayed empty, and latency stayed at the device's service time. Same workload, same code, different fsync ceiling, different p99. The procurement spreadsheet's "98,000 vs 1.4M IOPS" comparison missed this entirely because the IOPS numbers were measured without fsync.
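
The queue arithmetic behind that p99 blow-up fits in a few lines (the demand and capacity figures are the story's, not measurements):

# Arrival-vs-capacity arithmetic for the 09:15 IST spike, using the story's figures.
target = 15_000                                        # commits/s demanded at market open
for name, fsync_capacity in [("SATA", 3_500), ("NVMe", 22_000)]:
    if target > fsync_capacity:
        growth = target - fsync_capacity               # commits queued every second
        print(f"{name}: queue grows by {growth:,}/s, {growth * 60:,} deep after one minute of spike")
    else:
        print(f"{name}: {fsync_capacity - target:,} commits/s of headroom, the queue stays empty")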

Workload-shape mapping — which medium for which job

The right way to choose is to identify the workload's bottleneck operation and map it to the medium whose architecture serves that operation best. Six common workload shapes from Indian production systems:

Razorpay payment commit (Postgres OLTP, fsync per commit, 80,000 commits/s peak). Bottleneck: fsync. Required: NVMe with PLP, ideally Optane if budget allows. SATA cannot reach 80,000 fsync/s on a single volume even theoretically (the fsync path caps at roughly 3,500–4,000/s); you would need some 25 SATA volumes in a striped configuration to reach that rate, where a single NVMe with PLP gets there once Postgres group commit amortises a few commits per flush. The cost spreadsheet flips: 25 SATA volumes at ₹38,000 = ₹9.5L; one NVMe at ₹71,000 = ₹71K. The "expensive" device is 13× cheaper for this workload.

Hotstar IPL VOD (large sequential reads, 800 MB/s sustained per server). Bottleneck: throughput. Required: NVMe is overkill — two SATA devices at 530 MB/s each cover the sustained rate, and cost-per-GB matters more than IOPS for a 50 PB content library. HDD is too slow on first-watch latency (a cold read from spinning rust at 8 ms versus SATA at 200 µs shows up in player startup time), but HDD-tier storage with a SATA-tier hot cache is the standard architecture. Mixing media is the answer; one medium is wrong.

Zerodha Kite tick storage (high-rate sequential writes, 200 MB/s, no fsync per record). Bottleneck: throughput, not fsync (the system batches and group-commits every 100 ms). Required: SATA SSD is sufficient; NVMe is wasted budget. The group-commit pattern hides fsync latency behind the batch boundary, so the per-record fsync penalty does not apply. Buying NVMe here is a 5× over-provision because the bottleneck is throughput, not commit latency.

Aadhaar deduplication (random reads against a 1.2-billion-record biometric index, 30,000 lookups/s). Bottleneck: random read IOPS at low latency. Required: NVMe (the index does not fit in RAM at any reasonable cost — 1.2B records × 1 KB feature vector = 1.2 TB). SATA at 90,000 IOPS would handle the rate but the per-lookup p99 of 780 µs vs NVMe's 180 µs is the difference between "user waits 1 second for biometric auth" and "user waits 200 ms" because the lookup is in the critical path of every authentication. UIDAI in practice runs the hot-shard fragments on PMEM where available (the lookup p99 drops to ~5 µs) and the cold archive on NVMe.

IRCTC seat inventory (heavy contention on a small working set, 5 GB hot table, 18M sessions in 90 seconds at 10:00 IST Tatkal). Bottleneck: lock contention and commit serialisation, not raw I/O. The hot table fits in DRAM a hundred times over; almost no reads touch the disk. Storage matters only for WAL fsync and crash recovery. Required: NVMe or PMEM-backed WAL; SATA is the cliff that breaks the system because the WAL can't fsync fast enough during the spike. PMEM here delivers 10× more headroom than NVMe at 4× the cost — likely worth it given the regulatory cost of an IRCTC outage during Tatkal.

Swiggy delivery-zone geofence updates (10,000 geohash writes/s, eventually consistent, replicated 3-way). Bottleneck: write throughput at moderate fsync cadence (group-commit every 250 ms, so the per-record fsync penalty does not apply). Required: SATA SSD with PLP is sufficient. The three-way replication already buys durability across nodes, so the per-node fsync rate matters less than the per-node write throughput. NVMe here is a 4× over-provision for a workload whose distributed-systems design has already addressed durability at a layer above the storage device.

The pattern: there is no "best" medium. There is a bottleneck-aware match between workload and medium, and the cost-per-useful-IOPS gap between the wrong choice and the right choice is typically 3–10× — large enough that getting it wrong is a procurement error, not a tuning question.

The latency tail and queue-depth interaction across media

When an NVMe vendor advertises "1,400,000 random read IOPS", three conditions are baked into that number that almost never hold in production. First, the test runs against an empty, freshly-trimmed device — none of the SSD's internal garbage collection is running. Second, the test uses queue depth 256 across multiple jobs in parallel — typically 8 jobs × 32 queue depth, well above what a single application thread will generate. Third, the test runs for less than 60 seconds — short enough to live entirely off the SLC cache and not trigger any background maintenance. Reproduce the same fio command on the same device after 30 minutes of mixed workload and you typically see 60–75% of the advertised number; over a multi-hour production trace the achievable IOPS settles at 40–55% of the spec. The vendor is not lying — they are reporting a peak under conditions you will never see. The honest planning number is 50% of the spec sheet, not 100%.

The four media also differ in how they degrade as you push them past their comfortable load. An HDD's p99 grows from 12 ms to 80 ms between 50% and 90% utilisation — a 7× tail blow-up. A SATA SSD's p99 grows from 320 µs to 4 ms across the same range — 12×, because the SATA queue has only 32 slots and overflows quickly. An NVMe SSD's p99 grows from 75 µs to 600 µs — 8×, but the absolute floor is so low that 600 µs is still tolerable for almost any application. PMEM's p99 grows from 200 ns to 800 ns — the device has no queue at all, so the "growth" is just contention on the memory controller.

The practical effect: if your application's latency SLO is 1 ms p99, you can run NVMe at 80% utilisation comfortably (600 µs leaves headroom for the rest of your stack), SATA at roughly 50% (push much past that and the tail crosses 1 ms, eating your entire budget), and HDD at 0% (the device's idle latency is 6 ms, already 6× over budget). The maximum useful utilisation of a device under a latency SLO is 30–80% of its rated capacity, decreasing as the SLO tightens. Why this is the multiplier most cost models miss: a SATA SSD's "98,000 IOPS" rating means the device can deliver 98,000 IOPS at saturation; under a 1 ms p99 SLO the useful IOPS is closer to 40,000 because past that the tail latency exceeds the SLO. The cost-per-useful-IOPS is then 2.5× the cost-per-rated-IOPS, and the gap to NVMe (which keeps 70–80% of its rated 1.4M IOPS useful under the same SLO, over a million) widens from roughly 8× to 15×.
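
The derating arithmetic, as a sketch; the prices and useful fractions follow this section's illustrative figures (SATA keeps ~40% of its rated IOPS under the SLO, NVMe ~80%), not a measurement:

# Cost per useful IOPS under a 1 ms p99 SLO, using this section's illustrative derate fractions.
devices = [
    # name,       price ₹, rated IOPS, useful fraction under the SLO
    ("SATA SSD",   38_000,     98_000, 0.40),
    ("NVMe SSD",   71_000,  1_400_000, 0.80),
]
for name, price, rated, frac in devices:
    useful = rated * frac
    print(f"{name}: {useful:,.0f} useful IOPS, ₹{price / useful:.2f} per useful IOPS")
# -> ₹0.97 vs ₹0.06 per useful IOPS: the ~8x rated-IOPS-per-rupee gap widens to ~15x.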

This is why the procurement rule "size to 80% utilisation" is wrong as a one-size formula. For HDD the right number is 50%, for SATA SSD it is 60–70%, for NVMe it is 70–80%, for PMEM it is 90% (the device's tail is so flat that you can run it hot). The number is set by the slope of the latency-vs-load curve at the design point — flatter for faster media, steeper for slower ones — and ignoring the slope produces under-provisioned slow systems and over-provisioned fast ones, both of which cost money.

Common confusions

Spec-sheet write IOPS is not a commit rate. The rating is measured without fsync; the fsync-bounded rate, which is what an OLTP database actually runs at, is 20–500× lower depending on PLP.

"SSD-backed" cloud volumes are not local SSDs. EBS gp3 sits at the SATA-no-PLP rung of the fsync ladder because every I/O crosses the network, however fast the media behind it.

A fresh device is not a steady-state device. Fresh-out-of-box drives and freshly-launched cloud instances overstate sustainable IOPS by 2–5×; trust nothing measured before the warm-up.

Going deeper

The PLP capacitor — what ₹200 of hardware buys you

Power-loss protection is a row of supercapacitors on the SSD's PCB — typically 4–8 capacitors of 47–100 µF each, costing under ₹200 per drive in BOM and ~₹2 per drive in functional silicon to control them. The total cost to the manufacturer is well under 1% of a ₹70,000 enterprise SSD's price. The functional difference is enormous: with PLP the controller's DRAM cache is "as durable as NAND", so the controller can ack fsync from DRAM in 30 µs instead of waiting for the NAND program in 1 ms. The 30× speedup is from the most boring component on the board.

The reason consumer SSDs lack PLP is market segmentation, not engineering. Manufacturers want to charge enterprise customers more for the same NAND die — PLP is the easiest physical differentiator they can credibly add. It also has a real engineering cost: the firmware must handle in-flight writes correctly during a power loss, the capacitors degrade over years of cycling and must be monitored via SMART, and the warranty implications of "guaranteed durable on power loss" are a non-trivial legal commitment. Why consumer NVMe drives advertising "high IOPS" mislead OLTP buyers: the IOPS number is measured without fsync — i.e., at the device's raw NAND throughput, not at its commit-latency-bound throughput. A consumer Samsung 990 Pro hits 1.2M random write IOPS in fio with no fsync, but only ~2,500/s with --fsync=1. The 480× collapse between the two numbers is invisible on the spec sheet because the spec sheet does not test the fsync case.
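
Reproducing that collapse on a drive you own takes two fio runs; a sketch reusing the run_fio() helper from the harness above (the mount path is hypothetical, and importing the harness assumes its table-printing loop has been moved under an if __name__ == "__main__": guard first):

# Two-point fsync-collapse check on a single device, reusing run_fio() from media_compare.py.
target = "/mnt/consumer_nvme/test"             # hypothetical: a file on the drive under test

raw_iops,   *_ = run_fio(target, "4k", "randwrite", iodepth=32)                  # no fsync
fsync_iops, *_ = run_fio(target, "4k", "randwrite", iodepth=1, fsync_each=True)  # fsync per write
print(f"raw {raw_iops:,.0f} IOPS, fsync-bound {fsync_iops:,.0f}/s, "
      f"collapse ≈ {raw_iops / max(fsync_iops, 1):.0f}x")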

Persistent memory after Optane — the CXL story

Intel discontinued Optane DC in 2022 after struggling to find a market between "expensive RAM" and "fast storage" buyers. The technology survives in two forms in 2026: existing Optane fleets (Aadhaar, Aerospike clusters, some HFT shops) running on stockpiled hardware until they age out, and CXL-attached memory expanders that approximate the persistent-memory programming model.

CXL (Compute Express Link) is a PCIe-based protocol that lets a CPU treat an attached device as either a memory expander (load/store, like DDR but slower) or a coherent device (with cache coherence across the link). CXL-attached memory expanders give you 200–400 ns latency, 30 GB/s per link, and some vendors offer power-loss-backed versions that approximate Optane's persistence. The programming model is identical — mmap + load/store + clflushopt — so software written for Optane runs on CXL PMEM with no source change.

The performance gap is real but not as dramatic as media articles suggest: Optane was 100–350 ns; CXL PMEM is 200–400 ns. For database fsync this is the difference between 6.5M commits/s/core and 4M commits/s/core — both still 100× faster than NVMe. The bigger change is economics: CXL memory expanders are sold by the GB at DRAM-adjacent prices (~₹4,000/GB in 2026), while Optane sold at ~₹500/GB. The market for "fast enough but cheap" persistent memory remains uncertain, which is why most production systems in 2026 still use NVMe + careful fsync batching as the pragmatic alternative.

The TLC vs QLC vs PLC NAND ladder — and why the gap matters more than the marketing

Inside the SSD, NAND flash cells store bits using charge levels in floating-gate transistors. SLC (Single-Level Cell) stores 1 bit per cell with two charge levels — fastest, longest-endurance, most expensive per GB. MLC (2 bits, 4 levels), TLC (3 bits, 8 levels), QLC (4 bits, 16 levels), and the experimental PLC (5 bits, 32 levels) trade endurance and write speed for density. The progression is roughly:

Type   Bits/cell   Write latency   Endurance (P/E cycles)   ₹/GB (2026)
SLC    1           30 µs           100,000                  ₹450
MLC    2           60 µs           10,000                   ₹130
TLC    3           200 µs          3,000                    ₹40
QLC    4           600 µs          1,000                    ₹22
PLC    5           1,500 µs        300                      ₹14

Most consumer NVMe in 2026 is TLC; QLC drives (Samsung 870 QVO, Crucial X10 Pro) target capacity-first buyers; PLC is research. The crucial point: the IOPS / fsync numbers in the marketing materials assume the SLC cache region of the drive is hot. Modern TLC and QLC drives reserve some cells (typically 10–25% of the user-visible capacity) operated as SLC for write caching. While the SLC cache is hot, the drive looks fast; once the SLC cache fills (typically after 50–200 GB of sustained writes on a consumer drive), writes drop to the underlying TLC/QLC speed — for QLC that is a 5–10× collapse. Why this matters for the Karan procurement story: the SATA SSD he bought was a TLC drive with 12% SLC cache (~460 GB on the 3.84 TB device). The Postgres workload writes ~80 GB/hour during market hours — well within the SLC cache. The benchmark numbers held; the problem was not cache exhaustion but raw fsync latency. If the workload had been an analytics ingest at 500 GB/hour, the SLC cache would have exhausted in 55 minutes and the IOPS would have collapsed independently of the fsync issue, producing a different but equally bad outage.
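
The cache arithmetic in that last comparison, as a sketch (the 12% SLC fraction and the write rates are the text's illustrative figures; the model ignores background folding of the cache back into TLC, which buys some extra headroom in practice):

# SLC-cache exhaustion arithmetic for the two workloads discussed above.
drive_gb, slc_fraction = 3840, 0.12
slc_cache_gb = drive_gb * slc_fraction                       # ~460 GB operated as SLC
for workload, gb_per_hour in [("Postgres OLTP", 80), ("analytics ingest", 500)]:
    minutes_to_fill = slc_cache_gb / gb_per_hour * 60
    print(f"{workload:>16}: {gb_per_hour} GB/h fills the SLC cache in ~{minutes_to_fill:.0f} min of sustained writes")
# -> ~346 min for the OLTP shape (longer than any burst it sustains) vs ~55 min for the ingest,
#    after which writes drop to raw TLC/QLC speed.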

Why network-attached storage (EBS, GCP PD) is its own category

EBS gp3, GCP pd-balanced, Azure Premium SSD — none of these are physical SSDs. They are network-attached virtual block devices implemented by the cloud provider as a distributed storage system (EBS uses something like a custom Cassandra-derived layer over local SSDs in the same AZ). The latency profile is set by the network hop rather than the media: every I/O pays an in-AZ round trip of a few hundred microseconds before the remote SSD is touched, which puts gp3's fsync acknowledge around 1–2 ms and io2 Block Express, with its PLP-equivalent fast-ack path, around 350 µs; these are the two cloud rungs on the fsync ladder above.

The implication for Indian deployments: if your workload is fsync-sensitive, EBS gp3 puts you at the SATA-no-PLP rung of the ladder regardless of the fact that it is "SSD-backed". The cost-per-fsync analysis flips: io2 Block Express at 4× the gp3 price is ~6× the fsync rate. For Razorpay-class workloads, io2 is effectively mandatory; for Hotstar-class throughput workloads, gp3 is plenty. This is the cloud version of the SATA-vs-NVMe choice.

Wear leveling, garbage collection, and the "fresh device" benchmark trap

SSDs do not write in place — every "write" creates a new physical page and marks the old one stale. The garbage collector (GC, the SSD firmware's, not your application's) periodically reclaims stale pages by relocating valid data and erasing the underlying erase blocks. While GC runs, write IOPS drops by 20–60% and write latency p99 grows 3–10×.

The benchmark trap: a fresh-out-of-box SSD has empty erase blocks, no GC running, and posts vendor-spec numbers. After 30–60 minutes of sustained writes the device enters "steady state" — GC continuously running, IOPS settling at the long-term sustainable rate. The gap between the fresh and steady-state numbers is often 2–3× for consumer drives, 1.2–1.5× for enterprise drives with generous spare-area allocations.

The honest benchmark protocol: precondition the device with at least 2× capacity of sequential writes, then run 30 minutes of the target workload at full load, then start measuring. The first 5 minutes of any SSD benchmark are misleading. Why this hits cloud benchmarks especially hard: a freshly launched cloud instance has full burst credits and a fresh underlying device; benchmarks run in the first 5 minutes show numbers 3–5× higher than the same workload will achieve in steady state. Almost every cloud-storage benchmark blog post you read uses freshly-launched instances and reports the burst peak, not the sustained baseline. Trust nothing under 30 minutes of warm-up.
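
One way to mechanise the "trust nothing under 30 minutes" rule is to keep re-measuring until consecutive runs agree; a sketch reusing run_fio() from the harness above (this complements the preconditioning pass, it does not replace it):

# Re-run the same 20 s measurement until two consecutive rounds agree within 5%,
# and only report the converged (steady-state) figure.
def steady_state_iops(target, tolerance=0.05, max_rounds=15):
    prev = None
    for _ in range(max_rounds):
        iops, *_ = run_fio(target, "4k", "randwrite", iodepth=32)
        if prev and abs(iops - prev) / prev < tolerance:
            return iops                      # converged: the device has reached steady state
        prev = iops
    return prev                              # never converged: report the last, still-declining figure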

Where this leads next

The next chapter (/wiki/filesystem-overhead) covers the layer between your read()/write() syscall and the device — how ext4, XFS, and Btrfs add their own latency through journaling, metadata operations, and allocation. The numbers in this chapter are device numbers; the numbers your application sees are device numbers plus the filesystem tax, which can be 1.5–3× on small-file workloads.

The chapter after that (/wiki/o-direct-async-i-o-io-uring) covers how to bypass the page cache (O_DIRECT) and how to extract the device's full parallelism from a single application thread (io_uring). The "NVMe IOPS rating" assumes the kernel I/O path can sustain it — which on read()/write() syscalls and libaio it usually cannot. Modern OLTP databases use O_DIRECT + io_uring precisely to recover the per-I/O overhead the kernel's traditional path imposes.

Two operational habits this chapter adds. First, always identify the bottleneck operation before picking a medium — fsync rate, sequential throughput, random IOPS at low latency, or random IOPS at any latency. Each maps to a different medium; "fast disk" is not a specification. Second, always run the four-cell fio table on the actual hardware before procurement — vendor numbers are the peak achievable under contrived conditions, your numbers are the steady state under your workload's shape. The table takes 20 minutes to produce and prevents the Karan-class procurement error.

The Razorpay engineering blog has documented their storage tier choice at the payments path: io2 Block Express on AWS for ap-south-1 (Mumbai) production, with the database WAL on a dedicated volume. The cost premium over gp3 is ~₹3.5L/month per database node; the alternative is a 5× collapse in commit throughput during the 23:00 IST salary-credit batch peak, when volume briefly hits 600,000 commits/s. The cost-benefit does not require a spreadsheet — one minute of downtime during salary-credit processing has regulatory consequences that dwarf a year of io2 premium.

The contrast with Hotstar is instructive. Their VOD-serving tier uses gp3 (or sometimes plain HDD-backed S3 with EBS gp3 caching) because the workload is throughput-dominated and large-sequential-read shaped. For their content-metadata service — small random reads, must respond in 50 ms p99 to user navigation — they use NVMe instance store (local SSD on the EC2 host) because the network round-trip to EBS would alone consume their latency budget. Two adjacent services in the same architecture, two different storage choices, both correct for their workload shapes. The medium follows the workload, not the brand of the cluster.

A third habit, harder but worth it: measure before you tune. When a storage incident hits production, the first 5 minutes should be iostat -xm 1 to identify the workload shape and fio against the production volume to identify whether the device's steady-state numbers have changed. The temptation is to start tuning kernel parameters, filesystem mount options, or database checkpoint cadence — all of which can help, none of which matters if the device is rate-limited by AWS or in the middle of a 30-minute GC cycle. Diagnose the medium first, tune second. This is the discipline that separates 5-minute storage incidents from 4-hour ones.

Reproducibility footer

# Reproduce this on your laptop, ~20 minutes per device
sudo apt install fio sysstat
python3 -m venv .venv && source .venv/bin/activate
# fio is the workhorse; no Python deps beyond stdlib for this script
# Run against each device path you have available:
python3 media_compare.py /mnt/sata_ssd/test /mnt/nvme/test
# Watch the device live in another pane:
iostat -xm 1
# To compare against PMEM (if you have access), use --ioengine=libpmem in the
# fio invocation and point at a path on a DAX-mounted /dev/pmem* filesystem.
# Precondition each device with at least 30 minutes of sustained writes
# before the measurement run — fresh-device numbers are misleading.
fio --name=precondition --filename=/mnt/nvme/precond --size=10G \
    --bs=128k --rw=write --iodepth=8 --runtime=1800 --time_based=1

References