SSD vs HDD vs NVMe vs persistent memory
Karan runs the platform team for a Pune-based broker that clears equity trades for retail clients. The order-router writes one row per order to Postgres and one row to a separate audit log; both fsync at commit because SEBI's audit requirement permits no acknowledged-but-lost trades. The current production box uses a 3.84 TB enterprise SATA SSD rated at 98,000 random-write IOPS — the spec sheet looks healthy and the device costs ₹38,000. A vendor demo of an NVMe SSD at ₹71,000 promises "1.4 million IOPS"; Karan's procurement spreadsheet shows the SATA part at nearly half the price, with a rated IOPS figure more than six times his peak commit rate. He buys the SATA. Six weeks later the order-commit p99 is 28 ms during the 09:15 IST market open and the trading desk is escalating. He swaps a single test box to NVMe. The same Postgres binary, same schema, same workload — p99 drops to 3 ms.
The procurement spreadsheet was not wrong about the IOPS numbers. It was wrong about what kind of device each row was describing. SATA SSDs, NVMe SSDs, spinning HDDs, and persistent memory are not four points on one continuum — they are four architectures with different latency floors, different queue-depth requirements, and different fsync paths. Compare them on raw IOPS and you get the wrong answer for any workload where commit latency matters, which is most of them.
The four storage media you actually buy in 2026 — spinning HDD, SATA SSD, NVMe SSD, persistent memory — sit at different points on three cliffs: per-I/O latency, queue-depth-to-saturate, and fsync-acknowledge time. Spec-sheet IOPS hides all three. The right comparison is a four-cell table per workload shape (random small-read IOPS at production queue depth, sequential throughput, fsync-bounded commit rate, and tail-latency under 80% load) measured from fio invocations driven by a Python harness — and the cost-per-useful-IOPS gap between SATA and NVMe is typically 5–10× in NVMe's favour for OLTP-style workloads.
Four media, three architectural cliffs
A spinning HDD is a mechanical device — a stack of platters rotating at 7,200 or 15,000 RPM with read/write heads moving across them. Every random I/O pays a seek (3–9 ms to position the head) plus a half-rotation latency (3–4 ms at 7,200 RPM). The combined access time is 6–12 ms; the device cannot do better than ~80–150 random IOPS regardless of how aggressively you queue requests. Sequential reads bypass seeks and can hit 200–250 MB/s on a modern enterprise drive. The architecture is mechanical; the bottleneck is geometry.
A SATA SSD uses NAND flash with no moving parts but speaks the SATA protocol — designed in 2003 for spinning rust. The protocol caps at one outstanding command per port (NCQ extends to 32) and saturates around 540 MB/s sequential, 80,000–100,000 random IOPS. The flash itself can do far more; the SATA protocol is the ceiling. Per-I/O latency is 60–250 µs, dominated by protocol overhead rather than NAND read time (~25 µs). The architecture is "flash bolted onto a spinning-disk interface".
An NVMe SSD speaks NVM Express — a protocol designed for flash from scratch — over PCIe lanes. Each device exposes 64K hardware queues with 64K commands per queue (the SATA equivalent: 1 queue, 32 commands). Per-I/O latency is 50–110 µs (dominated by NAND read time, the protocol overhead is sub-µs). Sequential throughput on PCIe Gen 4 hits 7,000 MB/s; random IOPS reaches 600K–1.5M with sufficient queue depth. The architecture removes every legacy bottleneck the SATA stack imposed; what remains is the flash itself plus the PCIe lanes.
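Little's law connects these queue-depth figures to the advertised IOPS: sustained IOPS equals in-flight I/Os divided by per-I/O latency. A quick sketch, using the illustrative latencies from this section (the helper functions are hypothetical, written for this chapter):

```python
# Little's law for storage: sustained IOPS = in-flight I/Os / per-I/O latency.
# Latencies below are the illustrative figures from this section, not measurements.
def required_queue_depth(target_iops, latency_s):
    """Concurrent I/Os needed to sustain target_iops at this per-I/O latency."""
    return target_iops * latency_s

def iops_ceiling(queue_depth, latency_s):
    """IOPS ceiling at a given queue depth, before the device itself saturates."""
    return queue_depth / latency_s

print(required_queue_depth(1_400_000, 70e-6))  # ~98 in-flight I/Os for the rated 1.4M
print(iops_ceiling(1, 70e-6))                  # one thread at QD1: ~14K IOPS, 1% of rated
print(iops_ceiling(32, 250e-6))                # SATA NCQ's 32 slots: near the protocol ceiling
```

This is why a single synchronous application thread never sees the spec-sheet number on NVMe: the device needs roughly a hundred outstanding I/Os to stay busy, and one thread supplies one.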
Persistent memory (Intel Optane DC, the PMEM-class devices) plugs into DDR DIMM slots and is byte-addressable — you mmap it and dereference pointers, no I/O syscall in the path. Latency is 100–350 ns for reads, 100–200 ns for writes (compare to DRAM: 70 ns; NVMe: 50,000 ns). Throughput per channel is ~6 GB/s. There is no queue, because there is no I/O — the CPU's memory controller sees these accesses the way it sees DRAM, just slower and persistent. After Intel discontinued Optane in 2022, the surviving PMEM-class devices in 2026 are CXL-attached memory expanders (lower density, higher latency than Optane was, but conceptually the same architecture: load/store, not read/write).
The cliffs matter because they determine which workloads benefit. A nightly batch that reads 220 GB sequentially does not care about cliff 2 (SATA throughput is already 540 MB/s, the bottleneck is upstream). An OLTP workload that issues 50,000 fsyncs/s cares enormously — at SATA's ~250 µs commit latency, you saturate at 4,000 fsync/s; at NVMe's ~50 µs you saturate at 20,000; at PMEM's 200 ns you saturate at 5,000,000. The choice is not "fast vs slow"; it is "which cliff am I on the wrong side of for my workload's bottleneck operation?".
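The saturation arithmetic above is just the reciprocal of commit latency. A minimal sketch, using the illustrative latencies from this paragraph:

```python
# fsync-bound commit ceiling: one durable commit per fsync round-trip.
# Latencies are the illustrative figures quoted in the paragraph above.
FSYNC_LATENCY_S = {"sata": 250e-6, "nvme": 50e-6, "pmem": 200e-9}

def max_commits_per_s(medium):
    # With fsync-per-commit and no group commit, the ceiling is 1 / latency.
    return 1.0 / FSYNC_LATENCY_S[medium]

for m in ("sata", "nvme", "pmem"):
    print(f"{m}: {max_commits_per_s(m):,.0f} fsync/s")  # 4,000 / 20,000 / 5,000,000
```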
Measuring all four from one Python harness
The fair comparison runs the same fio workload against each device and reads the JSON. Karan ran exactly this — four physical devices in one machine, one harness, one table — and the numbers told him which device to buy without further argument. The harness below uses fio (covered in /wiki/disk-performance-iops-throughput-latency) to run four canonical shapes and produces a per-device table. Why one harness across all four devices instead of trusting vendor numbers: vendor benchmarks are run with their preferred queue depth, their preferred block size, and a freshly-trimmed device with full burst credits. Your workload has none of those properties. Running the same fio script against your candidate hardware on your kernel against your filesystem is the only honest comparison — and it takes 20 minutes per device.
```python
# media_compare.py — run identical fio workload against multiple devices, tabulate
# Run: python3 media_compare.py /mnt/hdd/test /mnt/sata_ssd/test /mnt/nvme/test
# Requires: fio (apt install fio); each path must be on the device under test.
import json, subprocess, sys, os

# Label = mount-point component, e.g. /mnt/nvme/test -> "nvme".
devices = [(p, p.split('/')[2]) for p in sys.argv[1:]]  # (path, label)
size = "4G"  # 4 GB working set; --direct=1 bypasses the page cache, and 4G stays below typical SSD spare-area

# Four canonical I/O shapes that map to real workload patterns.
SCENARIOS = [
    # name, bs, rw, iodepth, comment
    ("4K-randread",  "4k",   "randread",  32, "OLTP point-read shape"),
    ("4K-randwrite", "4k",   "randwrite", 32, "OLTP point-write shape"),
    ("256K-seqread", "256k", "read",       8, "batch scan shape"),
    ("4K-fsync",     "4k",   "randwrite",  1, "fsync-per-write commit shape"),
]

def run_fio(target, bs, rw, iodepth, fsync_each=False):
    cmd = ["fio", "--name=t", f"--filename={target}", f"--size={size}",
           f"--bs={bs}", f"--rw={rw}", f"--iodepth={iodepth}",
           "--ioengine=libaio", "--direct=1", "--time_based=1",
           "--runtime=20", "--output-format=json"]
    if fsync_each:
        cmd += ["--fsync=1"]  # fsync after every single write
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    j = json.loads(out.stdout)["jobs"][0]
    side = "write" if "write" in rw else "read"
    iops = j[side]["iops"]
    bw_mb = j[side]["bw_bytes"] / (1024 * 1024)
    p50_us = j[side]["clat_ns"]["percentile"].get("50.000000", 0) / 1000
    p99_us = j[side]["clat_ns"]["percentile"].get("99.000000", 0) / 1000
    return iops, bw_mb, p50_us, p99_us

print(f"{'device':>12} {'workload':>14} {'IOPS':>9} {'MB/s':>8} {'p50 µs':>8} {'p99 µs':>8}")
print("-" * 72)
for path, label in devices:
    for name, bs, rw, qd, _comment in SCENARIOS:
        fsync_each = (name == "4K-fsync")
        iops, mb, p50, p99 = run_fio(path, bs, rw, qd, fsync_each)
        print(f"{label:>12} {name:>14} {iops:9.0f} {mb:8.1f} {p50:8.0f} {p99:8.0f}")
    # Cleanup per-device test file
    if os.path.exists(path):
        os.remove(path)
```
```
# Sample run on a 4-device workstation (Bengaluru lab, kernel 6.5, ext4):
# - Seagate Exos 18TB enterprise HDD (7200 RPM, 256MB cache)
# - Samsung PM893 SATA SSD (3.84TB, enterprise, PLP)
# - Samsung PM9A3 NVMe SSD (3.84TB, U.2, PCIe Gen 4, PLP)
# - Intel Optane DC P5800X (1.6TB, 3D XPoint; in DIMM form with App Direct mode it
#   would be true PMEM — here exposed as a fast NVMe block device for fair comparison)

      device       workload      IOPS     MB/s   p50 µs   p99 µs
------------------------------------------------------------------------
         hdd    4K-randread       148      0.6     6800    14200
         hdd   4K-randwrite       134      0.5     7100    16800
         hdd   256K-seqread       921    230.3     8200    12400
         hdd       4K-fsync       132      0.5     7400    17300
        sata    4K-randread     94800    370.3      320      780
        sata   4K-randwrite     82100    320.7      370      920
        sata   256K-seqread      2110    527.5     3700     4900
        sata       4K-fsync      3920     15.3      240      650
        nvme    4K-randread    782400   3056.3       62      180
        nvme   4K-randwrite    412800   1612.5      110      290
        nvme   256K-seqread     24300   6075.0     2570     4100
        nvme       4K-fsync     21800     85.2       38      142
      optane    4K-randread   1480200   5782.0       18       38
      optane   4K-randwrite   1320500   5158.2       20       45
      optane   256K-seqread     28400   7100.0     2200     3800
      optane       4K-fsync    142800    558.0        6       14
```
Walk through. The HDD row is in a different unit — 148 random IOPS where the SSDs do 80,000–1.5M, and 16 ms p99 where the SSDs do 0.2–0.9 ms. There is no shape of workload where the HDD beats any SSD at random I/O. The only HDD-favouring number is sequential throughput per rupee — 250 MB/s for ₹4/GB beats SATA SSD's 530 MB/s for ₹15/GB by a factor of ~2× on cold-data archive workloads.
The SATA fsync row at 3,920/sec is the row that buys NVMe. The SATA SSD does 82,000 random write IOPS without fsync, but only 3,920 with fsync per write — a 21× collapse. The NVMe does 412,800 without fsync and 21,800 with — a 19× collapse, but the absolute number is 5.5× higher than SATA. Optane does 142,800 with fsync — 36× higher than SATA, because the controller's fsync path is the wall, not the NAND. Why fsync collapses the IOPS so dramatically: each fsync forces the controller to wait until the write reaches non-volatile storage and is durable across a power loss. With queue depth 32, the controller can pipeline 32 NAND programs in parallel and amortise the per-I/O cost; with --fsync=1 (fsync after each write), the queue depth is effectively 1 because each write blocks waiting for the previous fsync to complete. The IOPS becomes 1 / fsync_latency — for SATA that is ~250 µs giving ~4000/s, for NVMe ~45 µs giving ~22000/s, for Optane ~7 µs giving ~140000/s.
The 256K-seqread row tells the throughput story. All four devices are throughput-limited here, not IOPS-limited (note the IOPS column is small — 921 for HDD, 28,400 for Optane). The throughput ratio is 230 : 528 : 6075 : 7100 — a 26× gap from HDD to NVMe and only 1.2× from NVMe to Optane. Sequential throughput is not where Optane wins; the win is in latency and fsync.
The procurement decision falls out of this table cleanly. If your workload is dominated by 4K-fsync (any OLTP database in production), Optane gives 36× SATA's commit rate; NVMe gives 5.5×. If your workload is dominated by 256K-seqread (a Spark batch, a Druid scan), NVMe gives 11× SATA's throughput; Optane is barely better. If your workload is 4K-randread (a hot Redis-as-cache or Postgres index probe), NVMe gives 8× SATA's IOPS; Optane gives 16×.
The fsync ladder — why "write IOPS" is meaningless without it
The number that distinguishes the four media in production is not raw IOPS, throughput, or even median latency. It is fsync acknowledge time — how long from issuing fsync() until the kernel returns and the data is guaranteed to survive a power loss. Every transactional system on top of the device runs at this rate; everything else is decoration.
The ladder, fastest to slowest:
- PMEM (App Direct mode): ~150 ns. The CPU's `clflushopt` + `sfence` instructions make a cache line durable. There is no I/O syscall. Commit throughput is bounded by core clock, ~6.5M fsync-equivalents/sec per core.
- NVMe SSD with PLP (Samsung PM9A3, Micron 7450 Pro, Intel D7-P5520): 30–50 µs. The controller acks from its capacitor-backed DRAM cache the moment the write is in-cache; the actual NAND program happens later but is durable because of the capacitors. Commit throughput is ~22,000 fsync/s per device.
- NVMe SSD without PLP (Samsung 990 Pro, WD Black SN850X — consumer parts): 200–800 µs. The controller has DRAM but no PLP, so it must wait for the actual NAND program to complete before acking. Commit throughput is ~2,500 fsync/s per device — 10× worse than enterprise NVMe.
- SATA SSD with PLP (Samsung PM893, Micron 5400 PRO): 200–400 µs. Same controller architecture as enterprise NVMe but the SATA protocol adds 50–100 µs per command. Commit throughput is ~3,500 fsync/s.
- SATA SSD without PLP (any consumer SATA): 400–2,000 µs. NAND-program-bound, plus protocol overhead. Commit throughput is ~700 fsync/s.
- HDD: 4–12 ms (one platter rotation plus a possible seek). Commit throughput is ~150 fsync/s.
- EBS gp3 with `O_DIRECT`: 800 µs–3 ms (network round-trip plus the remote write). Commit throughput is ~500 fsync/s.
- EBS io2 Block Express: 200–500 µs (remote write but with the equivalent of PLP). Commit throughput is ~3,500 fsync/s.
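One way to use the ladder: given a target commit rate, filter the rungs that can hold it with headroom. A sketch using the commit-throughput figures quoted above; `media_for_commit_rate` and the 70% headroom factor are assumptions made for illustration:

```python
# The fsync ladder as data: (rung, fsync/s) with the figures quoted above, fastest first.
LADDER = [
    ("pmem-appdirect",       6_500_000),
    ("nvme-plp",                22_000),
    ("ebs-io2-blockexpress",     3_500),
    ("sata-plp",                 3_500),
    ("nvme-no-plp",              2_500),
    ("sata-no-plp",                700),
    ("ebs-gp3",                    500),
    ("hdd",                        150),
]

def media_for_commit_rate(target_fsync_per_s, headroom=0.7):
    """Rungs that can hold the target at no more than `headroom` of their ceiling."""
    return [name for name, cap in LADDER if cap * headroom >= target_fsync_per_s]

print(media_for_commit_rate(15_000))  # Karan's market-open target
# ['pmem-appdirect', 'nvme-plp'] — no SATA rung qualifies at any PLP level.
```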
The Karan story collapses to one number: SATA fsync was ~3,500/s, NVMe fsync was ~22,000/s. The Postgres commit throughput is bounded by min(commits/s_target, fsync/s_capacity) — at the 09:15 IST market open his target was ~15,000 commits/s, the SATA could deliver 3,500, the queue grew, and p99 latency grew with the queue. NVMe at 22,000 had headroom; the queue stayed empty, and latency stayed at the device's service time. Same workload, same code, different fsync ceiling, different p99. The procurement spreadsheet's "98,000 vs 1.4M IOPS" comparison missed this entirely because the IOPS numbers were measured without fsync.
Workload-shape mapping — which medium for which job
The right way to choose is to identify the workload's bottleneck operation and map it to the medium whose architecture serves that operation best. Six common workload shapes from Indian production systems:
Razorpay payment commit (Postgres OLTP, fsync per commit, 80,000 commits/s peak). Bottleneck: fsync. Required: NVMe with PLP, ideally Optane if budget allows. SATA cannot reach 80,000 fsync/s on a single volume even theoretically (the protocol caps at ~3,500/s); you would need ~25 SATA volumes in a striped configuration, where four NVMe-with-PLP devices or a single Optane deliver the same rate. The cost spreadsheet flips: 25 SATA volumes at ₹38,000 = ₹9.5L; four NVMe at ₹71,000 = ₹2.84L. The "expensive" device is more than 3× cheaper for this workload.
Hotstar IPL VOD (large sequential reads, 800 MB/s sustained per server). Bottleneck: throughput. Required: NVMe is overkill — SATA at 530 MB/s sequential is in the same league (two striped SATA devices cover the rate), and cost-per-GB matters more than IOPS for a 50 PB content library. HDD alone is too slow on first-watch latency (a cold read from spinning rust at 8 ms vs SATA's 200 µs is felt in player startup time), but HDD-tier storage with a SATA-tier hot cache is the standard architecture. Mixing media is the answer; any single medium is wrong.
Zerodha Kite tick storage (high-rate sequential writes, 200 MB/s, no fsync per record). Bottleneck: throughput, not fsync (the system batches and group-commits every 100 ms). Required: SATA SSD is sufficient; NVMe is wasted budget. The group-commit pattern hides fsync latency behind the batch boundary, so the per-record fsync penalty does not apply. Buying NVMe here is a 5× over-provision because the bottleneck is throughput, not commit latency.
Aadhaar deduplication (random reads against a 1.2-billion-record biometric index, 30,000 lookups/s). Bottleneck: random read IOPS at low latency. Required: NVMe (the index does not fit in RAM at any reasonable cost — 1.2B records × 1 KB feature vector = 1.2 TB). SATA at 90,000 IOPS would handle the rate but the per-lookup p99 of 780 µs vs NVMe's 180 µs is the difference between "user waits 1 second for biometric auth" and "user waits 200 ms" because the lookup is in the critical path of every authentication. UIDAI in practice runs the hot-shard fragments on PMEM where available (the lookup p99 drops to ~5 µs) and the cold archive on NVMe.
IRCTC seat inventory (heavy contention on a small working set, 5 GB hot table, 18M sessions in 90 seconds at 10:00 IST Tatkal). Bottleneck: lock contention and commit serialisation, not raw I/O. The hot table fits in DRAM with 100× room to spare; almost no reads touch the disk. Storage matters only for WAL fsync and crash recovery. Required: NVMe- or PMEM-backed WAL; SATA is the cliff that breaks the system because the WAL cannot fsync fast enough during the spike. PMEM here delivers 10× more headroom than NVMe at 4× the cost — likely worth it given the regulatory cost of an IRCTC outage during Tatkal.
Swiggy delivery-zone geofence updates (10,000 geohash writes/s, eventually consistent, replicated 3-way). Bottleneck: write throughput at moderate fsync cadence (group-commit every 250 ms, so per-record fsync penalty does not apply). Required: SATA SSD with PLP is sufficient. The replication 3-way already buys durability across nodes, so the per-node fsync rate matters less than the per-node write throughput. NVMe here is a 4× over-provision for a workload whose distributed-systems design has already addressed durability at a layer above the storage device.
The pattern: there is no "best" medium. There is a bottleneck-aware match between workload and medium, and the cost-per-useful-IOPS gap between the wrong choice and the right choice is typically 3–10× — large enough that getting it wrong is a procurement error, not a tuning question.
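The mapping can be condensed into a lookup table, a sketch only: the bottleneck categories and recommendations below are distilled from the six examples above and deliberately ignore capacity, replication, and budget.

```python
# Bottleneck-operation -> medium, condensed from the six workload examples above.
# A sketch, not a sizing tool: it ignores capacity, replication, and budget.
RECOMMENDATION = {
    "fsync-per-commit":        "NVMe with PLP (PMEM/Optane if budget allows)",
    "sequential-throughput":   "SATA SSD (HDD tier + SSD cache at PB scale)",
    "group-committed-writes":  "SATA SSD with PLP",
    "random-read-low-latency": "NVMe (PMEM for the hot shard)",
    "wal-only-io":             "NVMe- or PMEM-backed WAL volume",
}

def pick_medium(bottleneck):
    return RECOMMENDATION[bottleneck]

print(pick_medium("fsync-per-commit"))
```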
The latency tail and queue-depth interaction across media
When an NVMe vendor advertises "1,400,000 random read IOPS", three conditions are baked into that number that almost never hold in production. First, the test runs against an empty, freshly-trimmed device — none of the SSD's internal garbage collection is running. Second, the test uses queue depth 256 across multiple jobs in parallel — typically 8 jobs × 32 queue depth, well above what a single application thread will generate. Third, the test runs for less than 60 seconds — short enough to live entirely off the SLC cache and not trigger any background maintenance. Reproduce the same fio command on the same device after 30 minutes of mixed workload and you typically see 60–75% of the advertised number; over a multi-hour production trace the achievable IOPS settles at 40–55% of the spec. The vendor is not lying — they are reporting a peak under conditions you will never see. The honest planning number is 50% of the spec sheet, not 100%.
The four media also differ in how they degrade as you push them past their comfortable load. An HDD's p99 grows from 12 ms to 80 ms between 50% and 90% utilisation — a 7× tail blow-up. A SATA SSD's p99 grows from 320 µs to 4 ms across the same range — 12×, because the SATA queue has only 32 slots and overflows quickly. An NVMe SSD's p99 grows from 75 µs to 600 µs — 8×, but the absolute floor is so low that 600 µs is still tolerable for almost any application. PMEM's p99 grows from 200 ns to 800 ns — the device has no queue at all, so the "growth" is just contention on the memory controller.
The practical effect: if your application's latency SLO is 1 ms p99, you can run NVMe at 80% utilisation comfortably (600 µs leaves headroom for the rest of your stack), SATA only to the point where its tail crosses 1 ms — somewhere between 50% and 70% load — and HDD not at all (the device's idle latency is 6 ms, already 6× over budget). The maximum useful utilisation of a device under a latency SLO is 30–80% of its rated capacity, decreasing as the SLO tightens. Why this is the multiplier most cost models miss: a SATA SSD's "98,000 IOPS" rating means the device can deliver 98,000 IOPS at saturation; under a 1 ms p99 SLO the useful IOPS is closer to 40,000 because past that the tail latency exceeds the SLO. The cost-per-useful-IOPS is then 2.5× the cost-per-rated-IOPS, and the gap to NVMe (which stays within the SLO at 80% load, roughly 1.1M useful IOPS) widens from 8× to 15×.
This is why the procurement rule "size to 80% utilisation" is wrong as a one-size formula. For HDD the right number is 50%, for SATA SSD it is 60–70%, for NVMe it is 70–80%, for PMEM it is 90% (the device's tail is so flat that you can run it hot). The number is set by the slope of the latency-vs-load curve at the design point — flatter for faster media, steeper for slower ones — and ignoring the slope produces under-provisioned slow systems and over-provisioned fast ones, both of which cost money.
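The useful-IOPS arithmetic above can be checked in a few lines. Prices come from the opening story; the 40,000 useful-IOPS figure is the chapter's illustrative estimate, not a measurement:

```python
# Cost per useful IOPS under a 1 ms p99 SLO. Prices from the opening story;
# useful-IOPS figures are this chapter's illustrative estimates.
def cost_per_iops(price_inr, iops):
    return price_inr / iops

sata_rated  = cost_per_iops(38_000, 98_000)  # cost per *rated* IOPS
sata_useful = cost_per_iops(38_000, 40_000)  # cost per *useful* IOPS under the SLO
print(f"SATA penalty for the SLO: {sata_useful / sata_rated:.2f}x")  # ~2.45x, the text's ~2.5x
```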
Common confusions
- "NVMe is always faster than SATA." Not on sequential workloads at moderate throughput — SATA saturates around 530 MB/s and NVMe Gen 4 around 7,000 MB/s, but a Spark job doing 400 MB/s of cold reads gets identical wall-time on both. NVMe wins on IOPS, fsync, and latency; on raw throughput the gap is "10× headroom you may not need". Pay for NVMe when the bottleneck is on the IOPS / fsync / latency axis, not because "newer is faster".
- "Persistent memory is just faster SSD." No — it is byte-addressable load/store memory that survives a power cycle, not a faster block device. The programming model is `mmap` + pointer dereference + `clflushopt`, not `read()`/`write()`. A SQLite database on PMEM in App Direct mode and the same database on NVMe behave differently because the former has no I/O syscall in its hot path. The performance gap (250×) is because the syscall + DMA + interrupt path is gone, not because the storage cells are 250× faster.
- "Enterprise SSD vs consumer SSD is just a marketing distinction." It is not — the architectural difference is power-loss protection (PLP). Enterprise SSDs have on-board capacitors that power the controller long enough during a power loss to flush its DRAM cache to NAND. This lets the controller ack `fsync` from cache (~30 µs) instead of waiting for NAND (~600 µs–2 ms). The 20× fsync gap between consumer and enterprise NVMe is entirely from PLP. For databases the difference is workload-changing, not marginal.
- "`hdparm` and `dd` benchmarks are good enough to compare drives." No — `hdparm -t` does sequential reads through the page cache and reports cached throughput; `dd if=/dev/zero of=test bs=1M` benchmarks `memcpy` to the page cache, not the device. Both will report SATA SSD and NVMe as "similar" because they are measuring the kernel's caching, not the device's storage. Use `fio` with `--direct=1` and the right `--bs`/`--rw`/`--iodepth` for your workload's shape; anything less is theatre.
- "HDDs are obsolete in 2026." For OLTP they are; for cold archive at multi-petabyte scale they remain the cheapest medium per GB by a factor of 4–6× over SATA SSD and 8–12× over NVMe. Hotstar's 50 PB content library, ISRO's satellite imagery archives, and IRCTC's multi-decade audit trail all live on HDD-tier storage with SSD caching for hot working sets. The mistake is using HDD where the workload is IOPS-bound; the equally bad mistake is using NVMe where the workload is bytes-stored-bound.
- "All NVMe drives perform the same." The 10× spread within the NVMe category (consumer NVMe like Samsung 990 Pro vs enterprise NVMe like Intel D7-P5520 vs Optane) is wider than the gap between SATA and consumer NVMe. PLP, NAND quality (TLC vs QLC vs SLC vs 3D XPoint), spare-area headroom, and controller architecture all swing fsync latency by an order of magnitude. The "NVMe" label is a protocol name, not a performance class — read the actual datasheet for the device, not the bus.
Going deeper
The PLP capacitor — what ₹2 of hardware buys you
Power-loss protection is a row of supercapacitors on the SSD's PCB — typically 4–8 capacitors of 47–100 µF each, costing under ₹200 per drive in BOM and ~₹2 per drive in functional silicon to control them. The total cost to the manufacturer is well under 1% of a ₹70,000 enterprise SSD's price. The functional difference is enormous: with PLP the controller's DRAM cache is "as durable as NAND", so the controller can ack fsync from DRAM in 30 µs instead of waiting for the NAND program in 1 ms. The 30× speedup is from the most boring component on the board.
The reason consumer SSDs lack PLP is one of segmentation, not engineering. Manufacturers want to charge enterprise customers more for the same NAND die — PLP is the easiest physical differentiator they can credibly add. It also has a real engineering cost: the firmware must handle in-flight writes correctly during a power loss, the capacitors degrade over years of cycling and must be monitored via SMART, and the warranty implications of "guaranteed durable on power loss" are a non-trivial legal commitment. Why consumer NVMe drives advertising "high IOPS" mislead OLTP buyers: the IOPS number is measured without fsync — i.e., at the device's raw NAND throughput, not at its commit-latency-bound throughput. A consumer Samsung 990 Pro hits 1.2M random write IOPS in fio --fsync=0, but only ~2,500/s in fio --fsync=1. The 480× collapse between the two numbers is invisible on the spec sheet because the spec sheet does not test the fsync case.
Persistent memory after Optane — the CXL story
Intel discontinued Optane DC in 2022 after struggling to find a market between "expensive RAM" and "fast storage" buyers. The technology survives in two forms in 2026: existing Optane fleets (Aadhaar, Aerospike clusters, some HFT shops) running on stockpiled hardware until they age out, and CXL-attached memory expanders that approximate the persistent-memory programming model.
CXL (Compute Express Link) is a PCIe-based protocol that lets a CPU treat an attached device as either a memory expander (load/store, like DDR but slower) or a coherent device (with cache coherence across the link). CXL-attached memory expanders give you 200–400 ns latency, 30 GB/s per link, and some vendors offer power-loss-backed versions that approximate Optane's persistence. The programming model is identical — mmap + load/store + clflushopt — so software written for Optane runs on CXL PMEM with no source change.
The performance gap is real but not as dramatic as media articles suggest: Optane was 100–350 ns; CXL PMEM is 200–400 ns. For database fsync this is the difference between 6.5M commits/s/core and 4M commits/s/core — both still 100× faster than NVMe. The bigger change is economics: CXL memory expanders are sold by the GB at DRAM-adjacent prices (~₹4,000/GB in 2026), while Optane sold at ~₹500/GB. The market for "fast enough but cheap" persistent memory remains uncertain, which is why most production systems in 2026 still use NVMe + careful fsync batching as the pragmatic alternative.
The TLC vs QLC vs PLC NAND ladder — and why the gap matters more than the marketing
Inside the SSD, NAND flash cells store bits using charge levels in floating-gate transistors. SLC (Single-Level Cell) stores 1 bit per cell with two charge levels — fastest, longest-endurance, most expensive per GB. MLC (2 bits, 4 levels), TLC (3 bits, 8 levels), QLC (4 bits, 16 levels), and the experimental PLC (5 bits, 32 levels) trade endurance and write speed for density. The progression is roughly:
| Type | Bits/cell | Write latency | Endurance (P/E cycles) | ₹/GB (2026) |
|---|---|---|---|---|
| SLC | 1 | 30 µs | 100,000 | ₹450 |
| MLC | 2 | 60 µs | 10,000 | ₹130 |
| TLC | 3 | 200 µs | 3,000 | ₹40 |
| QLC | 4 | 600 µs | 1,000 | ₹22 |
| PLC | 5 | 1,500 µs | 300 | ₹14 |
Most consumer NVMe in 2026 is TLC; QLC drives (Samsung 870 QVO, Crucial X10 Pro) target capacity-first buyers; PLC is research. The crucial point: the IOPS / fsync numbers in the marketing materials assume the SLC cache region of the drive is hot. Modern TLC and QLC drives reserve some cells (typically 10–25% of the user-visible capacity) operated as SLC for write caching. While the SLC cache is hot, the drive looks fast; once the SLC cache fills (typically after 50–200 GB of sustained writes on a consumer drive), writes drop to the underlying TLC/QLC speed — for QLC that is a 5–10× collapse. Why this matters for the Karan procurement story: the SATA SSD he bought was a TLC drive with 12% SLC cache (~460 GB on the 3.84 TB device). The Postgres workload writes ~80 GB/hour during market hours — well within the SLC cache. The benchmark numbers held; the problem was not cache exhaustion but raw fsync latency. If the workload had been an analytics ingest at 500 GB/hour, the SLC cache would have exhausted in 55 minutes and the IOPS would have collapsed independently of the fsync issue, producing a different but equally bad outage.
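The SLC-exhaustion arithmetic above in code form. Capacity and cache fraction are the Karan example's figures, and the model assumes sustained ingest outruns the drive's background flush to TLC:

```python
# Minutes until the SLC write cache fills, assuming sustained ingest outruns
# the drive's background flush to TLC (the analytics-ingest scenario above).
def slc_exhaustion_minutes(capacity_gb, slc_fraction, ingest_gb_per_hour):
    cache_gb = capacity_gb * slc_fraction   # ~460 GB on the 3.84 TB TLC drive
    return cache_gb / ingest_gb_per_hour * 60

print(round(slc_exhaustion_minutes(3840, 0.12, 500)))  # 55 — the "55 minutes" above
```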
Why network-attached storage (EBS, GCP PD) is its own category
EBS gp3, GCP pd-balanced, Azure Premium SSD — none of these are physical SSDs. They are network-attached virtual block devices implemented by the cloud provider as a distributed storage system (EBS uses something like a custom Cassandra-derived layer over local SSDs in the same AZ). The latency profile is:
- Per-I/O: 200 µs–1 ms for the network round-trip alone, plus the underlying physical-device latency, plus any rate-limiter throttling.
- Throughput: capped by the network bandwidth allocated to the VM (typically 4–25 Gbps), not by the underlying device.
- IOPS: provisioned, not physical — you pay for IOPS, the cloud provider enforces the cap.
- fsync: 1–3 ms typical for gp3, 200–500 µs for io2 Block Express (which has a PLP-equivalent in the storage backend).
The implication for Indian deployments: if your workload is fsync-sensitive, EBS gp3 puts you at the SATA-no-PLP rung of the ladder regardless of the fact that it is "SSD-backed". The cost-per-fsync analysis flips: io2 Block Express at 4× the gp3 price is ~6× the fsync rate. For Razorpay-class workloads, io2 is effectively mandatory; for Hotstar-class throughput workloads, gp3 is plenty. This is the cloud version of the SATA-vs-NVMe choice.
Wear leveling, garbage collection, and the "fresh device" benchmark trap
SSDs do not write in place — every "write" creates a new physical page and marks the old one stale. The garbage collector (GC, the SSD firmware's, not your application's) periodically reclaims stale pages by relocating valid data and erasing the underlying erase blocks. While GC runs, write IOPS drops by 20–60% and write latency p99 grows 3–10×.
The benchmark trap: a fresh-out-of-box SSD has empty erase blocks, no GC running, and posts vendor-spec numbers. After 30–60 minutes of sustained writes the device enters "steady state" — GC continuously running, IOPS settling at the long-term sustainable rate. The gap between the fresh and steady-state numbers is often 2–3× for consumer drives, 1.2–1.5× for enterprise drives with generous spare-area allocations.
The honest benchmark protocol: precondition the device with at least 2× capacity of sequential writes, then run 30 minutes of the target workload at full load, then start measuring. The first 5 minutes of any SSD benchmark are misleading. Why this hits cloud benchmarks especially hard: a freshly launched cloud instance has full burst credits and a fresh underlying device; benchmarks run in the first 5 minutes show numbers 3–5× higher than the same workload will achieve in steady state. Almost every cloud-storage benchmark blog post you read uses freshly-launched instances and reports the burst peak, not the sustained baseline. Trust nothing under 30 minutes of warm-up.
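The three-phase protocol can be scripted. A sketch that only builds the fio command lines without executing them; the device path, block size, queue depth, and runtimes are placeholder assumptions to adapt to your hardware:

```python
# Build the three-phase honest-benchmark protocol as fio command lines.
# A sketch: commands are constructed here, not executed.
def fio_cmd(name, dev, runtime_s, extra=()):
    # Common target shape: 4K random writes at QD32, O_DIRECT, time-based.
    return ["fio", f"--name={name}", f"--filename={dev}", "--direct=1",
            "--ioengine=libaio", "--bs=4k", "--rw=randwrite", "--iodepth=32",
            "--time_based=1", f"--runtime={runtime_s}", *extra]

def protocol(dev, capacity_gb):
    return [
        # Phase 1: precondition with 2x capacity of sequential writes to force steady-state GC.
        ["fio", "--name=precondition", f"--filename={dev}", "--rw=write",
         "--bs=1M", f"--size={2 * capacity_gb}G", "--direct=1", "--ioengine=libaio"],
        # Phase 2: 30 minutes of the target workload shape at full load.
        fio_cmd("warmup", dev, 1800),
        # Phase 3: the measurement window — only these numbers count.
        fio_cmd("measure", dev, 300, extra=["--output-format=json"]),
    ]

for cmd in protocol("/dev/nvme0n1", 3840):
    print(" ".join(cmd))
```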
Where this leads next
The next chapter (/wiki/filesystem-overhead) covers the layer between your read()/write() syscall and the device — how ext4, XFS, and Btrfs add their own latency through journaling, metadata operations, and allocation. The numbers in this chapter are device numbers; the numbers your application sees are device numbers plus the filesystem tax, which can be 1.5–3× on small-file workloads.
The chapter after that (/wiki/o-direct-async-i-o-io-uring) covers how to bypass the page cache (O_DIRECT) and how to extract the device's full parallelism from a single application thread (io_uring). The "NVMe IOPS rating" assumes the kernel I/O path can sustain it — which on read()/write() syscalls and libaio it usually cannot. Modern OLTP databases use O_DIRECT + io_uring precisely to recover the per-I/O overhead the kernel's traditional path imposes.
Two operational habits this chapter adds. First, always identify the bottleneck operation before picking a medium — fsync rate, sequential throughput, random IOPS at low latency, or random IOPS at any latency. Each maps to a different medium; "fast disk" is not a specification. Second, always run the four-cell fio table on the actual hardware before procurement — vendor numbers are the peak achievable under contrived conditions, your numbers are the steady state under your workload's shape. The table takes 20 minutes to produce and prevents the Karan-class procurement error.
The Razorpay engineering blog has documented their storage-tier choice on the payments path: io2 Block Express on AWS for ap-south-1 (Mumbai) production, with the database WAL on a dedicated volume. The cost premium over gp3 is ~₹3.5L/month per database node; the alternative is a 5× collapse in commit throughput during the 23:00 IST salary-credit batch peak, when volume briefly hits 600,000 commits/s. The cost-benefit does not require a spreadsheet — one minute of downtime during salary-credit processing has regulatory consequences that dwarf a year of io2 premium.
The contrast with Hotstar is instructive. Their VOD-serving tier uses gp3 (or sometimes plain HDD-backed S3 with EBS gp3 caching) because the workload is throughput-dominated and large-sequential-read shaped. For their content-metadata service — small random reads, must respond in 50 ms p99 to user navigation — they use NVMe instance store (local SSD on the EC2 host) because the network round-trip to EBS would alone consume their latency budget. Two adjacent services in the same architecture, two different storage choices, both correct for their workload shapes. The medium follows the workload, not the brand of the cluster.
A third habit, harder but worth it: measure before you tune. When a storage incident hits production, the first 5 minutes should be iostat -xm 1 to identify the workload shape and fio against the production volume to identify whether the device's steady-state numbers have changed. The temptation is to start tuning kernel parameters, filesystem mount options, or database checkpoint cadence — all of which can help, none of which matters if the device is rate-limited by AWS or in the middle of a 30-minute GC cycle. Diagnose the medium first, tune second. This is the discipline that separates 5-minute storage incidents from 4-hour ones.
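The first-five-minutes triage from a single iostat -xm 1 interval can itself be mechanised. A rough sketch — the field names follow sysstat's extended output (r/s, w/s, rareq-sz, wareq-sz, await), and the thresholds are rules of thumb of mine, not standards:

```python
def classify(r_s, w_s, rareq_kb, wareq_kb, await_ms):
    """Rough workload-shape triage from one `iostat -xm 1` interval:
    average request size separates sequential from random, the read/write
    split names the direction, and high await on a random shape flags a
    latency-bound device."""
    total = r_s + w_s
    avg_sz = ((r_s * rareq_kb) + (w_s * wareq_kb)) / total if total else 0
    shape = "sequential" if avg_sz >= 64 else "random"
    direction = "read" if r_s >= w_s else "write"
    verdict = f"{shape} {direction}-heavy"
    if await_ms > 10 and shape == "random":
        verdict += " -- latency-bound; check queue depth and medium"
    return verdict

# An OLTP-looking interval: many small reads, high await.
print(classify(r_s=8000, w_s=200, rareq_kb=4, wareq_kb=8, await_ms=15))
```

The output for that interval is "random read-heavy" plus the latency-bound flag — exactly the shape where the medium, not a tunable, is the first suspect.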
Reproducibility footer
# Reproduce this on your laptop, ~20 minutes per device
sudo apt install fio sysstat
python3 -m venv .venv && source .venv/bin/activate
# fio is the workhorse; no Python deps beyond stdlib for this script
# Run against each device path you have available:
python3 media_compare.py /mnt/sata_ssd/test /mnt/nvme/test
# Watch the device live in another pane:
iostat -xm 1
# To compare against PMEM (if you have access), use --ioengine=libpmem in the
# fio invocation and point at a path on a /dev/pmem* mount.
# Precondition each device with at least 30 minutes of sustained writes
# before the measurement run — fresh-device numbers are misleading.
fio --name=precondition --filename=/mnt/nvme/precond --size=10G \
--bs=128k --rw=write --iodepth=8 --runtime=1800 --time_based=1
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), chapter 9 — Disks — the canonical text on storage device performance, including the latency-vs-utilisation curve and the methodology this chapter draws on.
- Jens Axboe, "fio Documentation" — the workhorse benchmarking tool. The author also wrote io_uring and is the Linux kernel's I/O subsystem maintainer.
- Andy Rudoff et al., "Persistent Memory Programming" (pmem.io) — the canonical resource for PMEM-class device programming, including the libpmem API and the clflushopt + sfence durability sequence.
- NVM Express Specification 2.0 — the NVMe protocol reference; particularly the queue model and command-completion semantics that distinguish NVMe from SATA.
- Aerospike, "How Aerospike uses Optane to deliver sub-millisecond latency" — production case study from a NoSQL vendor using PMEM as the primary storage tier; useful for understanding the App Direct programming model.
- AWS EBS — io2 Block Express documentation — the cloud reference for high-performance network-attached storage, including the latency and IOPS guarantees that distinguish io2 from gp3.
- Samsung PM9A3 Datasheet — example of an enterprise NVMe SSD's spec sheet; useful for reading IOPS-vs-block-size and fsync-latency numbers critically.
- /wiki/disk-performance-iops-throughput-latency — the previous chapter that established the three-numbers framework (IOPS, throughput, latency) this chapter applies across four media classes.