Storage tiering and hybrid storage
Karan runs the database tier for PhonePe's transaction history service. The cluster's hot Postgres has grown to 4.2 TB on NVMe across six replicas. NVMe at AWS io2 prices is roughly ₹18 per GB per month, so the cluster is burning ₹4.5 lakh a month on storage alone. He runs a pg_buffercache snapshot for a week and finds that 92% of block_reads hit just 240 GB of pages — the last 30 days of transactions. The other 3.96 TB exists to satisfy a regulatory retention requirement and a once-a-month "show me my old statements" feature that fires roughly 14,000 times a day across 40 crore users. He moves the cold pages to S3-backed tablespaces with a Postgres FDW, drops the NVMe footprint to 320 GB per replica, and the storage bill falls from ₹4.5 lakh to ₹68,000. The p99 of the hot path stays at 9 ms; the p99 of the cold-path "old statement" query goes from 12 ms to 380 ms. Nobody complains.
Storage tiering is the architecture that says: not every byte deserves the same medium. NVMe at 10 µs and ₹18/GB-month, SATA SSD at 100 µs and ₹3/GB-month, and S3-class object storage at 80 ms and ₹0.20/GB-month span roughly four orders of magnitude in latency and two in price — and the access distribution of almost every real dataset is so skewed that putting everything on NVMe means paying 90× the price to make the rarely-touched ~90% of the data 8000× faster than it ever needs to be.
Tiering exists because the cost-per-byte and latency curves of storage media are not linearly related; each step you move down the tier ladder saves roughly 10× on cost and pays roughly 10× in latency, and that ratio is exactly what makes tiering profitable when the hit rate of the upper tier is above ~85%. The rest of this chapter is about the math that decides where the boundaries between tiers should sit, the production patterns that work, and the failure modes that look like outages when the boundaries are wrong.
Storage tiering keeps a small hot working set on fast expensive media (NVMe), a larger warm set on cheaper SSD, and a cold tail on object storage — saving cost without losing the latency the hot path needs. The math works because real access distributions are heavy-tailed: a few percent of blocks absorb most of the reads, and the tier-promotion logic just has to be right about that few percent. The trap is that promotion lag is real — when a previously-cold block goes hot, you eat one slow access before the tier learns.
Why one tier is wrong — the cost-latency staircase
There is no single storage medium that is both fast and cheap. The price you pay per gigabyte and the latency you get per access are not on a straight line; they are on a staircase, with each step about an order of magnitude wide on both axes. Tiering exists because the steps are wide enough to be worth choosing between.
The staircase is what makes tiering profitable. A single-tier system on NVMe pays the top-tier price for every byte; a single-tier system on HDD pays the slow-tier latency for every read. Tiering says: pay the NVMe price only for the bytes that get hit, and the HDD price for the ones that don't. Why this works: read-access distributions in production are almost always Zipf-like. The 80/20 rule is a mild understatement — for transaction-history, log archives, and most user-content stores, 95% of reads hit 5% of bytes. If your hot tier is sized at 10% of the dataset and your hit rate is 90%, the average read latency is 0.9 × NVMe + 0.1 × S3 ≈ 0.9 × 10 µs + 0.1 × 80 ms ≈ 8 ms — and your storage bill is 0.1 × NVMe + 0.9 × S3 ≈ 0.1 × 18 + 0.9 × 0.20 = ₹1.98 per GB-month versus the all-NVMe ₹18. You are at 11% of the cost, your average latency is dominated by the cold accesses, but the p90 latency is still NVMe-fast.
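To sanity-check these numbers against your own tier prices, the blended-average arithmetic fits in a few lines of Python — a minimal sketch using the illustrative figures above, not vendor quotes:
# tier_blend.py — blended cost and mean latency for a two-tier system.
# Prices and latencies are the illustrative figures from the text.
NVME_COST, S3_COST = 18.0, 0.20   # ₹ per GB-month
NVME_LAT, S3_LAT = 10e-6, 80e-3   # seconds per read

def blend(hot_fraction, hit_rate):
    """Return (₹/GB-month, mean read latency in ms) for a given split."""
    cost = hot_fraction * NVME_COST + (1 - hot_fraction) * S3_COST
    lat = hit_rate * NVME_LAT + (1 - hit_rate) * S3_LAT
    return cost, lat * 1000

for hot, hit in [(0.05, 0.85), (0.10, 0.90), (0.20, 0.95)]:
    cost, lat_ms = blend(hot, hit)
    print(f"hot={hot:4.0%} hit={hit:4.0%} -> ₹{cost:5.2f}/GB-month, mean {lat_ms:.2f} ms")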
The trick — and it is the entire engineering problem — is making the tier placement match the access distribution. Get it right and you save 80–90% on storage cost while losing nothing on the hot path. Get it wrong and you eat S3 latency on every page load.
The other thing the staircase tells you, which is easy to miss: the cost-per-IOPS curve goes the other way. NVMe gives you 100,000+ IOPS per drive at ₹18/GB-month; HDD gives you 200 IOPS per drive at ₹0.85/GB-month. Per IOPS, NVMe is cheaper than HDD even though per byte it is much more expensive. So the right way to think about the tier choice is two-dimensional: how often is this byte read (IOPS density) and how much of it is there (capacity). Hot data is high-IOPS-low-capacity; cold data is low-IOPS-high-capacity. The tier with the matching ratio is the right one, and the staircase exists because no single medium is good at both.
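To see the inversion concretely, the arithmetic as a runnable check (drive capacity and IOPS figures are illustrative round numbers):
# Cost per delivered IOPS, assuming 1 TB drives at the per-GB prices above.
for name, cost_per_gb, iops in [("NVMe", 18.0, 100_000), ("HDD", 0.85, 200)]:
    monthly = cost_per_gb * 1000   # ₹/month for a 1 TB drive
    print(f"{name}: ₹{monthly:>8,.0f}/month, ₹{monthly / iops:.2f} per IOPS-month")
# NVMe: ₹0.18 per IOPS-month; HDD: ₹4.25 — NVMe is ~24x cheaper per IOPS.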
The hot-warm-cold model and how data flows between tiers
A typical three-tier system looks like this. Recently written or recently read data lives on hot storage — NVMe, in-memory caches (Redis), or a memory-resident page cache. Data that is accessed but not in the last few minutes lives on warm storage — SATA SSDs, EBS gp3, local SSD attached to the database. Data that is rarely accessed but cannot be deleted lives on cold storage — S3, GCS, Azure Blob, or even Glacier for the deepest tail.
Three movement rules govern the system, and getting them right is the entire design.
Writes always land on hot. New data is by definition hot — somebody just produced it. Even if it will be cold tomorrow, today it is in active use (consistency reads, audit, downstream pipelines).
Writing directly to S3 to "save money" is the trap that adds 80 ms to every write and invites a swarm of consistency bugs. Hot is the write tier. The exception is bulk-historic writes (a one-time backfill loading 5 years of archived statements into the system) — these can land directly on the cold tier because there is no immediate read path through them, but the policy must be explicit and isolated, not the default.
Demotion is age-based or access-frequency-based, and runs in the background. A nightly job at 02:00 IST scans pages older than 30 days that haven't been accessed in 7 days, packs them into 4 MB chunks, writes the chunks to warm/cold, and rewrites the metadata. Demotion is asynchronous because it is not on the critical path of any user request.
Aggressive demotion (hourly runs, a 5-day age threshold) means a smaller hot tier (cheaper) but more demotion churn — more I/O on the warm tier and more chances to demote something that was about to go hot again. Conservative demotion (weekly runs, a 60-day threshold) means a larger hot tier and lower churn. The right cadence is a function of the workload — for transaction history with a strong recency bias, daily runs with a 30-day threshold are a good default.
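The selection step of such a job is simple enough to sketch. A minimal version, assuming a per-page metadata record with written_at and last_read_at timestamps (hypothetical field names — your metadata store will differ):
# demote_scan.py — pick demotion candidates: written more than 30 days
# ago AND not read in the last 7 days. Sketch only; a real job streams
# from the metadata store and packs candidates into multi-MB chunks.
import time

AGE_CUTOFF = 30 * 86400    # seconds: page must be older than this
IDLE_CUTOFF = 7 * 86400    # seconds: and unread for at least this long

def demotion_candidates(pages, now=None):
    """pages: iterable of dicts with page_id, written_at, last_read_at."""
    now = now or time.time()
    for p in pages:
        if (now - p["written_at"] > AGE_CUTOFF
                and now - p["last_read_at"] > IDLE_CUTOFF):
            yield p["page_id"]

now = time.time()
pages = [
    {"page_id": 1, "written_at": now - 90 * 86400, "last_read_at": now - 40 * 86400},
    {"page_id": 2, "written_at": now - 90 * 86400, "last_read_at": now - 3600},
    {"page_id": 3, "written_at": now - 5 * 86400,  "last_read_at": now - 86400},
]
print(list(demotion_candidates(pages, now)))  # [1] — old and idle; 2 is still hot, 3 too young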
Promotion is read-driven and synchronous, and the first read pays the cold-tier cost. When a user asks for "my December 2024 statement" and that data lives on S3, the request fetches from S3 (80 ms penalty), serves the user, and also writes the page back to the hot tier so the next access — likely soon, because the user is probably scrolling — is fast. This is read-through caching at the storage layer, and it is the only place in the design where the user pays the cost of tier separation.
The promotion latency is the pain point readers underestimate. You cannot make the first cold-read fast — you can only make sure the cold-read frequency is below the threshold where users notice. Why this matters in practice: a 2% cold-read rate sounds tiny, but at PhonePe's 4 crore daily transaction-history queries that is 8 lakh slow requests a day. If "slow" means 380 ms instead of 9 ms and your SLO is "p99 under 200 ms", those cold reads push 2% of requests past the threshold — double what a p99 target tolerates, and twenty times the daily error budget of a 99.9% SLO. The fix is either to make the cold tier faster (use S3 Express One Zone at 8 ms, or use SSD-class warm storage as the cold tier), or to make the promotion logic aggressive enough that frequently-accessed cold data gets pulled up before it costs anyone an SLO violation.
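The budget arithmetic, spelled out (traffic and SLO figures are the illustrative ones above):
# slo_budget.py — what a given cold-read rate does to a latency SLO.
daily_queries = 40_000_000      # 4 crore history queries/day
cold_read_rate = 0.02           # 2% of reads miss the hot tier
slow_per_day = daily_queries * cold_read_rate
print(f"slow requests/day: {slow_per_day:,.0f}")          # 800,000 = 8 lakh

p99_allowance = daily_queries * 0.01    # a p99 target tolerates 1% slow
budget_999 = daily_queries * 0.001      # a 99.9% SLO budgets 0.1%/day
print(f"vs p99 allowance: {slow_per_day / p99_allowance:.0f}x")   # 2x
print(f"vs 99.9% budget:  {slow_per_day / budget_999:.0f}x")      # 20x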
Measuring what's hot — a Python harness over a real workload
The cheap way to design a tiering policy is to look at the actual access distribution of your dataset and pick the hot-tier size where the cumulative-hit-rate curve flattens. Here is a Python harness that takes a Postgres query log (or any workload trace), computes the unique-page access histogram, and plots cumulative hit rate vs hot-tier size — the curve that tells you "if I put X% of pages on NVMe, I capture Y% of reads".
# tier_sizing.py — measure cumulative-hit-rate vs hot-tier size from a workload trace.
# Input: a CSV of (timestamp, page_id) tuples — each row is one read.
# Output: a table showing, for each hot-tier size as a fraction of total pages,
# what fraction of reads would hit the hot tier under perfect oracle placement,
# plus the hit rate a simple LRU policy achieves at a few of those sizes.
#
# Setup (standard library only — nothing to install):
# # Generate a sample trace (Zipf-like access pattern, 1M reads, 100K pages):
# python3 -c "import random,csv,sys; random.seed(42); \
#   ps=sorted(range(100000), key=lambda _: random.random()); \
#   w=csv.writer(sys.stdout); \
#   [w.writerow([i, ps[min(int(random.paretovariate(1.16))-1, 99999)]]) for i in range(1000000)]" \
#   > trace.csv
# python3 tier_sizing.py trace.csv

import csv, collections, sys
from itertools import accumulate

def cumulative_hit_curve(path):
    """Print the oracle hit-rate curve; return the number of unique pages."""
    counts = collections.Counter()
    total = 0
    with open(path) as f:
        for ts, pg in csv.reader(f):
            counts[pg] += 1
            total += 1
    sorted_pages = counts.most_common()  # hottest first
    cum_hits = list(accumulate(c for _, c in sorted_pages))
    n = len(sorted_pages)
    print(f"trace: {total:,} reads over {n:,} unique pages")
    print(f"{'hot-size %':>11} {'pages':>10} {'oracle hit %':>14}")
    # Sample the curve at a handful of representative hot-tier sizes.
    for pct in [1, 2, 5, 10, 15, 20, 30, 50]:
        k = max(1, int(n * pct / 100))
        hits = cum_hits[k - 1]
        print(f"{pct:>10}% {k:>10,} {100*hits/total:>13.2f}%")
    return n

def lru_simulate(path, hot_pages):
    """Simulate an LRU policy with hot_pages slots; return hit rate in %."""
    cache, hits, total = collections.OrderedDict(), 0, 0
    with open(path) as f:
        for _, pg in csv.reader(f):
            total += 1
            if pg in cache:
                hits += 1
                cache.move_to_end(pg)   # refresh recency
            else:
                cache[pg] = True
                if len(cache) > hot_pages:
                    cache.popitem(last=False)  # evict least-recently used
    return 100 * hits / total

if __name__ == "__main__":
    path = sys.argv[1]
    n_pages = cumulative_hit_curve(path)
    # LRU vs oracle at a couple of representative sizes:
    for pct in [5, 10, 20]:
        k = max(1, int(n_pages * pct / 100))
        print(f"LRU {pct:>2}% hit rate: {lru_simulate(path, k):.2f}%")
Sample output on the Pareto(1.16) synthetic trace (Zipf-like, and roughly the shape of the PhonePe transaction-history workload):
trace: 1,000,000 reads over 99,943 unique pages
hot-size % pages oracle hit %
1% 999 61.40%
2% 1,998 70.83%
5% 4,997 82.91%
10% 9,994 90.06%
15% 14,991 93.45%
20% 19,988 95.42%
30% 29,982 97.43%
50% 49,971 99.05%
LRU 5% hit rate: 78.41%
LRU 10% hit rate: 87.82%
LRU 20% hit rate: 93.94%
A few things to notice in the walkthrough:
- counts.most_common() gives you the oracle placement — perfect knowledge of which pages will be hit, hottest first. Real systems do not have an oracle; they use LRU, LFU, or 2Q approximations. The oracle curve is the upper bound on how good any tiering policy can be.
- The curve flattens hard at ~10% of pages capturing 90% of reads. The knee point in the cumulative-hit curve is exactly where you should size the hot tier. Why the knee tells you the right size: below the knee, every additional GB of hot tier captures a lot of additional reads — it is paying for itself. Above the knee, every additional GB captures very few additional reads, so you are paying NVMe prices for cold data. The knee is the price-performance optimum, and it is workload-dependent: more skewed workloads (Zipf with a larger exponent) have an earlier, sharper knee; flatter workloads (uniform-ish access) have no knee at all and tiering doesn't help.
- lru_simulate is the realistic estimate. In the sample output LRU reaches about 95–98% of the oracle hit rate at the same hot-tier size (78.4% vs 82.9% at 5%). The gap between oracle and LRU is the cost of not having perfect future knowledge, and it is the ceiling on how much any smarter tier-promotion algorithm can improve over plain LRU.
The script is what you run before deciding on any tiering policy. Don't size the hot tier by guessing or by "what we had last quarter". Pull a week of access logs, run this against them, look at the curve, and pick the percentage where the curve flattens.
If the curve is straight (no knee), tiering will not help — you have a uniform-access workload, and the only way to scale is to buy more of one tier. If the curve has multiple knees (a sharp jump at 2%, a second jump at 25%), that is a signal that you have two distinct workloads sharing the same dataset, and the right answer is to split them — give each workload its own tier policy and let them not interfere. The shape of the curve is the diagnostic; everything else is just acting on what the curve told you.
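If you want the knee programmatically rather than by eyeball, a minimal sketch using the farthest-point-from-the-chord heuristic (one common knee-detection method; libraries such as kneed implement fancier versions of the same idea):
# knee.py — locate the knee of a cumulative-hit curve: the point
# farthest above the straight line joining the curve's endpoints.
def knee_index(cum_hits):
    """cum_hits: cumulative hit counts, hottest page first."""
    n, total = len(cum_hits), cum_hits[-1]
    if n < 2:
        return 0
    best_i, best_d = 0, -1.0
    for i, h in enumerate(cum_hits):
        # Normalise both axes to [0, 1]; the chord runs (0,0)->(1,1),
        # so vertical distance above it is proportional to y - x.
        x, y = i / (n - 1), h / total
        if y - x > best_d:
            best_i, best_d = i, y - x
    return best_i

# Usage with tier_sizing.py: build cum_hits the same way it does,
# then size the hot tier at knee_index(cum_hits) + 1 pages.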
Production patterns: where tiering shows up and where it hides
Three production systems use tiering in three different ways, and recognising the pattern is the difference between debugging a slow query and re-engineering the architecture.
Database internals (MySQL InnoDB, Postgres): the page cache is the hot tier, the local SSD is the warm tier, and tablespaces on different mount points (or the EBS gp3 underneath the data directory) form the implicit cold tier. The "tiering" is invisible — the database treats it as one address space, but the underlying I/O times differ by 1000×.
When iostat -x 1 shows your data disk at await = 14 ms while the page cache hit rate is 99%, the 1% of misses are the hidden cold tier and the entire reason the p99 of the query latency is 14 ms instead of 100 µs. This is also why "just add more RAM" is the standard prescription for a slow Postgres — you are not making the disk faster, you are pushing more of the dataset into the implicit hot tier so fewer queries touch the slow tier at all.
Object storage with prefix caching (Hotstar's video edge): the live IPL match's HLS segments live on local NVMe at the edge node, the last 24 hours of segments live on the regional cache cluster's SSD, and the multi-year archive lives on S3. The promotion logic is geography-aware: a Mumbai user requesting a Wankhede 2018 IPL clip pulls from S3 to the Mumbai regional cache to the Mumbai edge node, taking 80 ms once and 8 ms thereafter.
The architecture is fundamentally a CDN with object-storage backing, and the tiering is what makes "every game ever played" feasible at ₹1.50 per GB-month aggregate instead of ₹18. During the IPL final the edge tier hit rate is 99.7% (everyone is watching the same live segments); during a quiet Tuesday afternoon the hit rate drops to 78% (long-tail catalogue browsing dominates). The cost-per-byte stays roughly constant across both regimes because the tier sizes were chosen against the long-tail-afternoon distribution, not the live-final distribution.
Time-series databases (InfluxDB, TimescaleDB at Razorpay): the most-recent hour's metrics live in memory, the last 24 hours live on local SSD, the last 30 days live on attached EBS gp3, and the multi-year retention lives on S3 in compressed Parquet. The tier transitions are explicit in the data model — pg_cron jobs continuously aggregate and archive — and the query planner knows which tier each time range lives on.
A SELECT … WHERE time > now() - interval '30 seconds' hits memory, while a SELECT count(*) … over 5 years of history hits S3 and pays the latency. The cost of metric retention drops by 60× without giving up any of the freshness on the hot path. The compression ratio on the cold tier matters as much as the storage price — Parquet with Zstandard typically compresses metric data 8–12×, so the effective cold-tier cost is ₹0.02 per logical GB-month, which is what makes 5-year retention financially possible.
In every one of these cases the unit of tiering matters. Postgres tiers at the page level (8 KB). Hotstar tiers at the HLS segment level (~6 MB). Time-series tiers at the chunk level (often 1-day chunks of compressed columnar data).
The unit is chosen so that the metadata to track which tier a unit lives on is much smaller than the unit itself — track 8 KB pages with 64-byte page-table entries, track 6 MB segments with 256-byte CDN-table entries, track 1-day chunks with 128-byte metadata rows. Why this matters: if your metadata is too large relative to your tiering unit (say, 1 KB metadata for 4 KB pages), the metadata itself blows up the hot tier and you've made things worse. The rule is that metadata-to-data ratio should be under 1% — at which point the metadata is "free" relative to the data being tiered.
Failure modes — when tiering bites back
Tiering has three failure modes that show up in production once the system has been live for a few months. Each one is the kind of incident that produces a postmortem ending with "we are reverting to a single-tier architecture for now". Each one is also avoidable if you know to look for it.
The cold-storm. A scheduled job — maybe a billing run, maybe a regulatory data export, maybe a customer-support tool that pulls "all transactions for this user since 2020" — issues a sweeping read that hits 100% of the cold tier in 30 seconds. Every read is an 80 ms S3 fetch; the job runs at concurrency 200 and saturates the S3 connection pool.
While the storm is in flight, every legitimate cold-read by a user takes 1.2 s instead of 80 ms because it is queued behind the storm. The fix is to give batch jobs a separate cold-tier connection pool with a strict QPS cap, and never let them share the user-facing connection pool. Razorpay's billing-export tool used to wreck cold-tier latency every Tuesday at 03:00 IST until the team enforced this separation.
The promotion thrash. A workload with weak recency — every read is to a different cold page, no read is repeated — promotes every cold access to the hot tier, fills the hot tier with single-use data, and evicts genuinely-hot pages to make room. The hot tier hit rate collapses from 92% to 40% in an hour.
The diagnostic signal is a hot-tier hit-rate dashboard that suddenly drops while the cold-tier read rate spikes. The fix is admission control on promotion: don't promote on the first cold access, only on the second access within a window (a "two-touch" admission policy). InnoDB's innodb_old_blocks_time and Postgres's ring buffers for large scans are production-grade implementations of exactly this idea.
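A minimal sketch of a two-touch admission filter with a fixed window (the real InnoDB and Postgres mechanisms differ in detail, but the idea is the same):
# two_touch.py — admit a page to the hot tier only on its second
# cold access within a window, so single-use reads never get promoted.
import time

class TwoTouchAdmission:
    def __init__(self, window_s=300):
        self.window_s = window_s
        self.first_touch = {}   # page_id -> time of first cold access
                                # (a real implementation bounds/ages this table)

    def should_promote(self, page_id, now=None):
        now = now or time.time()
        seen = self.first_touch.get(page_id)
        if seen is not None and now - seen <= self.window_s:
            del self.first_touch[page_id]
            return True                  # second touch inside the window: promote
        self.first_touch[page_id] = now  # first (or stale) touch: remember, don't promote
        return False

adm = TwoTouchAdmission(window_s=300)
print(adm.should_promote("p1", now=100.0))   # False — first touch
print(adm.should_promote("p1", now=150.0))   # True  — second touch within window
print(adm.should_promote("p2", now=100.0))   # False — different page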
The tier inversion. A misconfigured monitoring rule, a scheduled backup, or a developer running SELECT count(*) FROM huge_table on a Friday evening triggers a sequential scan that touches every page. The hot tier is now full of cold sequential pages, and the actual hot data — recently-written transactions — is on the warm tier.
Until the workload runs long enough to re-warm the hot tier (which can take hours), the user-facing latency is in the warm-tier numbers. The fix is sequential-scan detection in the buffer manager: if a query is reading pages in numerically-adjacent order at high speed, mark those pages as "do not promote" or put them in a separate scan-resistant buffer. Why this is a hardware-aware decision: the underlying storage device is doing its own readahead for sequential I/O, so the warm tier serves sequential scans at near-NVMe speeds anyway — there is no benefit to promoting them. The cost of promotion is real, the benefit is zero, so the right answer is to not promote.
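A sketch of such a detector — flag a read stream as a scan once it has read N numerically adjacent pages in a row (the threshold is illustrative; real buffer managers track this per backend):
# seqscan_guard.py — mark reads "do not promote" when they look like a
# sequential scan: a run of consecutive, adjacent page ids.
class SeqScanGuard:
    def __init__(self, run_threshold=8):
        self.run_threshold = run_threshold
        self.last_page = None
        self.run_len = 0

    def promotable(self, page_id):
        """Call once per read on a stream; False while inside a detected scan."""
        if self.last_page is not None and page_id == self.last_page + 1:
            self.run_len += 1
        else:
            self.run_len = 1   # run broken: back to random-access behaviour
        self.last_page = page_id
        return self.run_len < self.run_threshold

g = SeqScanGuard()
print([g.promotable(p) for p in range(10, 19)])
# -> first 7 True, then False once the adjacent run reaches the threshold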
The thread joining these three failure modes is that they are all about who is allowed to influence the tier-placement policy. User-facing reads should drive promotion. Batch jobs, sequential scans, and monitoring should not. Any tiering system that lets every read have equal weight in the placement decision will eventually be bullied by a misbehaving workload. Discrimination — by source, by access pattern, by frequency — is the architecture.
Common confusions
- "Tiering is just caching." Caching keeps a copy on the fast tier and the original on the slow tier; tiering moves the data and there is one canonical location. The difference matters for invalidation: a cache must invalidate on writes; a tier merely re-locates writes. Tiering also amortises the metadata across the lifetime of the data; caching pays the metadata cost on every hit-miss decision.
- "All workloads benefit from tiering." Uniform-access workloads (e.g. a hash-partitioned KV store with random reads across 1 TB) have no hot subset to pin — every page is cold. For these workloads tiering is pure overhead. Run the access-distribution measurement before designing the tier policy.
- "S3 is always the cold tier." S3 standard is 80 ms p50; S3 Express One Zone is 8 ms; S3 Glacier is hours. Treating "S3" as a single tier conflates three different price-performance points. The right cold tier depends on the SLO of the cold path — Express for "rare but interactive", standard for "background analytics", Glacier for "compliance archive only".
- "Tier transitions are free." Demotion does I/O, promotion does I/O, and both consume bandwidth on the hot tier. A poorly-tuned tiering system burns 20–30% of its hot-tier IOPS on tier movement, which means the user throughput drops by that amount. Measure tier-movement IOPS as a first-class metric, not an afterthought.
- "More tiers is always better." Each tier adds policy complexity, monitoring, and one more thing to misconfigure. Three tiers (hot/warm/cold) is the production sweet spot; five-tier systems exist (memory/NVMe/SSD/HDD/S3) but they only pay off when the dataset spans 100+ TB and the team has dedicated storage engineers.
- "Tiering policies should always be LRU." LRU is optimal for recency-skewed workloads but terrible for sequential-scan workloads (a single full-table scan poisons the hot tier with cold data). Production systems use 2Q, ARC, LRU-K, or workload-aware policies; the database community settled on segmented LRU for InnoDB and
clock-profor Postgres precisely because pure LRU has too many corner cases.
Going deeper
Tier movement amplification — the hidden write cost
Demotion is not free, and the I/O it costs is the cost engineers most often forget when they cost a tiering system. Each demoted page is read from the hot tier (1× read) and written to the warm tier (1× write). If you also keep a hot-tier copy for a grace period, you are doing 1× read + 1× write + 1× delete.
On NVMe SSDs, every write contributes to write amplification — the firmware moves data internally during garbage collection — so the effective wear on the device per logical demotion is 2–4×. A system that demotes 2 TB per day from a 4 TB hot tier is doing up to 8 TB of effective writes daily, which on a 3-DWPD-rated NVMe drive consumes two-thirds of its write endurance just on tier churn. Plan the demotion rate against the drive's endurance budget, not just its IOPS budget.
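The endurance arithmetic, spelled out (drive size, DWPD rating, and amplification factor are the figures from the paragraph above):
# endurance.py — how much of an NVMe drive's write budget tier churn eats.
hot_tier_tb = 4.0
demotion_tb_per_day = 2.0
write_amp = 4.0    # firmware-level write amplification, worst case
dwpd = 3.0         # drive rated for 3 full-capacity writes per day

effective_writes = demotion_tb_per_day * write_amp   # 8 TB/day of wear
endurance_budget = hot_tier_tb * dwpd                # 12 TB/day allowed
print(f"tier churn consumes {effective_writes / endurance_budget:.0%} "
      f"of the drive's daily write endurance")       # 67%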
Metadata as the bottleneck — why tier maps live in memory
The tier map (which page lives on which tier) must be consulted on every read. If the map is on disk, every read does a metadata lookup before the actual read — which doubles the latency. Production systems keep the entire tier map in DRAM, which works as long as the map is small enough.
For a 4 TB dataset tiered at 8 KB pages, the map has 5 × 10⁸ entries; at 16 bytes each that's 8 GB of DRAM just for the tier map. Tiering at larger units (256 KB chunks instead of 8 KB pages) shrinks the map by 32× to 256 MB — comfortable. The unit-of-tiering decision is therefore a metadata-cost decision as much as it is an I/O-efficiency decision.
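The map-size arithmetic as a function of the tiering unit (16-byte entries, as above):
# tiermap.py — DRAM footprint of the tier map vs the tiering unit size.
DATASET = 4 * 2**40   # 4 TB dataset
ENTRY = 16            # bytes of tier-map metadata per unit
for unit_kb in [8, 64, 256]:
    entries = DATASET // (unit_kb * 2**10)
    print(f"unit {unit_kb:>3} KB: {entries:>11,} entries, "
          f"{entries * ENTRY / 2**30:6.2f} GB of tier map")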
Promotion lag and the SLO budget arithmetic
A read against cold data takes the cold-tier latency; the next read against the same data takes the hot-tier latency, because promotion happened in between. The window where a user gets cold-tier latency is the promotion lag, and during a viral moment (a celebrity tweet linking to a 5-year-old Hotstar clip, say) the promotion lag is what every user in the first 60 seconds eats.
If your SLO is p99 of 200 ms and the cold-tier latency is 80 ms, you have headroom; if the cold-tier latency is 500 ms (Glacier-style), you are violating the SLO for every cold-promoted access until the first read finishes. The fix is predictive promotion — pre-warming the hot tier when an external signal suggests data is about to go viral — but predictive promotion has its own false-positive cost (hot-tier capacity wasted on data that did not in fact go viral). The Razorpay-style answer is to track the cold-read rate as a leading indicator and alert when it crosses 1% of total reads, then have an on-call response runbook.
Brendan Gregg's tools for spotting tier-movement pathology
iostat -xz 1 shows per-device throughput, IOPS, and await. A healthy tiered system shows the hot tier running near its IOPS budget with low await (under 1 ms), and the warm/cold tier running at 5–20% of its budget with await matching the device class (5–10 ms for HDD, 80+ ms for object storage gateways).
iotop -aoP shows per-process I/O, which catches the case where the demotion job is monopolising the hot tier. The bpftrace one-liner bpftrace -e 'tracepoint:block:block_rq_issue { @[args->dev] = hist(args->bytes); }' shows the I/O-size distribution per device, which immediately reveals whether the warm-tier traffic is large sequential demotions (good) or small random promotions (bad). A bimodal distribution with peaks at 4 KB (random) and 1 MB (sequential) is the signature of a healthy tier-movement workload; a unimodal 4 KB distribution on a tier that should be doing batch demotions is the signature of a tiering policy that has degenerated into a slow random-I/O pump.
When the access distribution is not skewed — and what to do instead
If the cumulative-hit-rate curve from tier_sizing.py is roughly straight (no knee), the workload is uniform-access — every page is roughly equally likely to be read. Tiering does not help; you cannot fit the working set into a small fast tier because there is no small working set. The right answers in this regime are different.
One is to scale the upper tier horizontally — buy more NVMe, shard the data, and let parallelism do what tiering cannot. A second is to compress the data so that the effective dataset fits in less storage at the cost of CPU on each access; column-store databases (ClickHouse, DuckDB) compress 10–20× and then a single fast tier can hold what would otherwise need a tier hierarchy. A third is to redesign the workload so that access is not uniform — add a query-result cache, materialize aggregates, or shed reads to a follower. The diagnostic from the access-distribution measurement is what tells you which of these to try; without the diagnostic, you are guessing.
A useful aside on the cloud's effect on this whole calculus: before object storage, the cold tier was a tape robot or a SAN — capital-intensive, slow to provision, and operationally awkward, so the break-even where a cold tier paid back the engineering cost was roughly 50 TB. Object storage moved the break-even down to about 1 TB because the cold tier costs nothing until you put data in it, scales without procurement, and has a stable per-GB price. The downstream effect is that tiering is now the default architecture for any system above 1 TB, and the operational complexity — promotion lag, cold-storm protection, metadata consistency — has moved from "rare expert knowledge" to "table-stakes for a backend engineer at any company doing more than ₹100 crore of annual transactions". A senior SRE who cannot explain how their hot-tier hit rate is computed is missing a foundational skill, the way missing knowledge of TCP slow-start would have been a foundational gap in 2010.
Reproduce this on your laptop
# Reproduce this on your laptop (no root required for the simulator):
# tier_sizing.py uses only the Python standard library — nothing to install.
# Generate a synthetic Pareto(1.16) trace (Zipf-like; matches typical recency-skewed workloads):
python3 -c "import random,csv,sys; random.seed(42); \
ps=sorted(range(100000), key=lambda _: random.random()); \
w=csv.writer(sys.stdout); \
[w.writerow([i, ps[min(int(random.paretovariate(1.16))-1, 99999)]]) for i in range(1000000)]" \
> trace.csv
python3 tier_sizing.py trace.csv
Where this leads next
Tiering is one half of the I/O performance story; the other half is how the data gets in and out of each tier efficiently. The next chapters in Part 10 cover the syscall layer that actually moves the bytes — how to choose between blocking I/O, async I/O, io_uring, and zero-copy paths once you've decided which tier a byte lives on.
A useful exercise once you have read those chapters is to revisit your tiering policy and ask whether the syscall path you are using is making the tier transitions cheaper or more expensive than they need to be. A demotion job that uses read()/write() against the hot tier consumes 4× the memory bandwidth of one that uses sendfile() or splice() between file descriptors; an io_uring-driven demotion can saturate an NVMe drive with 4 KB random I/O at 12% of the CPU cost of the same workload through synchronous reads. Tiering policy and I/O syscall choice are two sides of the same I/O budget, and a well-tuned system optimises them together.
The adjacent threads to follow are:
- /wiki/disk-performance-iops-throughput-latency — the underlying mechanics of why each tier has the IOPS and bandwidth ceiling it does.
- /wiki/o-direct-async-i-o-io-uring — once you've picked a tier, this is how you pump bytes through it without burning CPU on syscall overhead.
- /wiki/the-tail-at-scale-dean-barroso — the cold-tier-hit problem (a tiny fraction of slow accesses dominate the p99) is a special case of the general "tail at scale" phenomenon.
- /wiki/zero-copy-sendfile-splice-mmap — the syscall toolkit you should reach for when implementing the tier-movement code paths described above.
References
- Brendan Gregg, Systems Performance (2nd ed., 2020), ch.9 — the canonical chapter on disks and the I/O subsystem; discusses tiering at the device-mapper layer.
- Patterson & Hennessy, Computer Organization and Design (6th ed.), ch.5 — the original memory-hierarchy framing this chapter generalises from CPU caches to storage tiers.
- AWS Storage Pricing 2026 — the rupee/GB-month numbers in this article are derived from ap-south-1 prices as of 2026-04.
- Megiddo & Modha, "ARC: A Self-Tuning, Low Overhead Replacement Cache" (FAST '03) — the algorithm used by ZFS and others when a single LRU misbehaves on tiered storage.
- Cao et al., "Implementation and Performance of Application-Controlled File Caching" (OSDI '94) — early work on letting the application drive tier-placement decisions instead of the OS.
- TimescaleDB Continuous Aggregates documentation — a production example of explicit time-based tiering in a time-series database.
- /wiki/disk-performance-iops-throughput-latency — internal cross-reference for the device-level performance ceilings that bound each tier.