SSTable Per Column Family — The On-Disk Layout

Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.

In short

A column family in Cassandra — what CQL calls a table — is physically stored on each node as a set of SSTables: Sorted String Tables, each one an immutable file written whenever a memtable fills and flushes. Inside a single SSTable, rows are laid out sorted first by the partition key's token (its consistent-hash value) and then, within one partition, by the clustering keys declared in the primary key. That sort order is the reason range scans inside a partition cost one sequential disk read instead of thousands of random ones.

Around each Data.db file sits a small family of siblings. Index.db maps partition keys to byte offsets inside Data.db; Summary.db is a sparser summary of that index that fits in RAM; Filter.db holds a bloom filter that answers "might this partition key exist in this SSTable?" in constant time with a tiny false-positive rate; Statistics.db records per-SSTable metadata the query planner uses; CompressionInfo.db holds LZ4 chunk offsets; TOC.txt and Digest.crc32 provide the manifest and checksum.

Reads are a merge scan across the memtable and every SSTable that might hold the key, newest-first, cell-by-cell, with latest-timestamp-wins semantics and tombstone suppression. Bloom filters and summaries let most SSTables answer "no" without any disk I/O at all. In the background, compaction merges older SSTables into fewer larger ones, dropping overwritten cells and expired tombstones. This chapter walks through the on-disk layout file-by-file, shows what ls on a live Cassandra data directory prints, and builds a minimal SSTable writer and reader in Python so the abstractions become concrete.

You have now seen the wide-column data model and the partition-plus-clustering primary key it depends on. Those chapters described the logical layout: a table is a map from partition key to a sorted list of rows, each row a sparse column map, shard-distributed across the ring.

This chapter opens the box and shows you the physical layout. What does a Cassandra node actually store on its local disk? When you issue SELECT * FROM tweets WHERE user_id = 'priya' AND tweet_time = '...', which bytes does the storage engine read, in what order, and why is the answer "almost none"?

Opening a Cassandra data directory

SSH into a Cassandra node and run ls inside a single column family's directory. You will see something close to this:

$ ls /var/lib/cassandra/data/twitter/tweets-a9f3c1/
na-1-big-Data.db
na-1-big-Index.db
na-1-big-Filter.db
na-1-big-Statistics.db
na-1-big-CompressionInfo.db
na-1-big-Summary.db
na-1-big-TOC.txt
na-1-big-Digest.crc32
na-2-big-Data.db
na-2-big-Index.db
... (six to seven files per generation)

The directory is named tweets-a9f3c1 — the column family name followed by a UUID suffix stable across the cluster, so the same table has the same folder on every node. Inside, files come in groups identified by a generation number (1, 2, 3, ...). Each group is one SSTable — one immutable file produced by one memtable flush or one compaction. Six to seven files per group is normal; Data.db is the bulk and the rest are support structures. The na- prefix is the format version (Cassandra 4.x uses "na"); -big- is the SSTable family type ("big" is default; "trie" is the 5.0 alternative).

Each SSTable is immutable. Once written, its bytes never change. Updates produce new rows in newer SSTables; the old row still sits in the older SSTable until compaction drops it. This is the core invariant of the log-structured merge tree family that Cassandra, RocksDB, LevelDB, ScyllaDB, and HBase all descend from.

Why immutability matters: append-only writes are sequential disk I/O — the fastest pattern on both SSDs and spinning disks. In-place updates would need random writes plus write-ahead logging to be crash-safe, and in-place deletes would fragment the file. By accepting the cost of read-time merging across multiple files, LSM engines turn every write into a sequential append and every flush into a sequential file creation. Compaction pays the merge cost in the background, off the critical path.

The anatomy of an SSTable — file by file

Take one generation, say na-1-big-. That one SSTable is actually seven or eight files working together. Each one exists for a specific lookup stage.

Data.db — the rows themselves. This is where the actual cell data lives: for every partition in this SSTable, a sorted run of rows, each row a list of cells in clustering-key order. It is the biggest file by far — typically 95% or more of the SSTable's on-disk size. Data.db is LZ4-compressed in fixed-size chunks (default 64 KB), so the engine can decompress just the chunk it needs on a seek.

Index.db — a sorted map from partition key to byte offset inside Data.db. For every partition in this SSTable, Index.db records the key and a pointer saying "this partition starts at offset X in Data.db". It is sparse relative to rows — one entry per partition, not per row — because the row-level sort within a partition is handled by a second scan inside the partition block.

Summary.db — a sparser summary of Index.db. Every 128th entry in Index.db (by default) is copied into Summary.db, which is small enough to hold entirely in RAM. Given a partition key, the engine binary-searches Summary.db in memory to find the narrow Index.db page to read, then binary-searches that page to find the exact Data.db offset. Two-level index = two-level lookup.

Filter.db — the bloom filter. For every partition key present in this SSTable, one entry is set in a bit vector. At read time, the engine hashes the query's partition key, checks a handful of bits, and learns either "definitely not in this SSTable" (skip entirely — no disk I/O) or "might be, go look". With the default 1% false-positive rate and 10 bits per key, a 100M-row SSTable has a 125 MB filter that eliminates ~99% of negative lookups without touching the disk.

Statistics.db — per-SSTable metadata: min and max clustering keys, partition count estimates, cell-count distribution, min and max timestamps, TTL metadata, compaction strategy hints. The query planner uses it to skip SSTables whose time ranges do not overlap the query (essential for TimeWindow compaction), and the compactor uses it to decide which SSTables to merge next.

CompressionInfo.db — if Data.db is compressed (the default), this file records the byte offset of every compression chunk in the uncompressed stream and in the compressed stream. Given "I want uncompressed byte X", the engine looks up the containing chunk, decompresses just that chunk, and extracts the slice.

TOC.txt — plain-text table of contents listing which component files exist for this SSTable. Mainly for tools and repair: a missing file is an invalid SSTable.

Digest.crc32 — a CRC32 checksum of Data.db for integrity checking on startup and during repair.

The eight sibling files of a single SSTable generation and how they relate. Filter.db is the first gate: it rejects most partition keys without any disk I/O. Summary.db lives in RAM and narrows the search to a page of Index.db. Index.db gives a byte offset into Data.db, where the actual rows live. The remaining files hold metadata that the planner, compactor, and repair tools consult.

How an SSTable is built

Writes do not go to SSTables directly. They land in two places at once: the commit log (an append-only file on disk, for durability) and the memtable (an in-RAM sorted map, keyed first by partition token and then by clustering key).

The memtable lives in RAM and absorbs all writes for that column family. It is a sorted structure — typically a skip list or concurrent B-tree — so range scans and point gets are cheap even before flush. When the memtable's size crosses its threshold (default ~128 MB, tunable per-table via memtable_flush_writers and related settings), a background flush thread takes it over.

The flush sequence is deterministic:

The in-RAM sorted map is iterated in order. Because it is already sorted by token and clustering key, no external sort is needed.
For each partition, the flusher writes the rows into Data.db sequentially. As it reaches a partition boundary, it records a new entry in Index.db: (partition_key, byte_offset_in_Data_db).
Every Nth (default 128th) Index.db entry is echoed into Summary.db so the narrower index stays RAM-resident.
For every partition key written, a hash is inserted into the bloom filter. At the end, the filter bit vector is serialised to Filter.db.
During the single pass, min/max clustering keys, min/max timestamps, cell-count estimates, and partition counts are accumulated and written to Statistics.db.
Data.db is compressed in fixed-size chunks; offsets are recorded in CompressionInfo.db.
TOC.txt is written listing every component file; Digest.crc32 is written as the final act.
The commit-log segment that held these writes is marked safely flushable — it can now be discarded because the data is durable on disk.
The memtable is discarded; a new empty one starts.

Why the sort-then-stream flow matters: the entire flush is sequential I/O. No back-and-forth seeking, no in-place updates, no rebalancing. On an SSD that means near-peak write bandwidth; on spinning disks it is the only pattern that does not thrash. The memtable's sorted structure is what makes this possible — without a pre-sorted in-RAM buffer you would have to sort at flush time, which would either use disk (slow) or need much more RAM (expensive).

The commit log is the durability story. A write is acknowledged only after it is in the commit log and the memtable. If the node crashes before flush, on restart the commit log is replayed into a fresh memtable and the state is reconstructed. The commit log is forward-only — new entries append; old entries are dropped as memtables flush. You will see this pattern in the dedicated commit log and memtable chapter.

How a read finds a row

A read — point get or range scan within a partition — performs a merge scan across the memtable (in RAM, always checked) and every SSTable on disk, newest-first. The merge is needed because a single row may have been written across many SSTables over time: one write went to SSTable 1, a later partial update went to SSTable 2 (only some cells), the most recent update landed in the memtable. Cassandra merges these cell-by-cell, keeping the latest timestamp — the read path we cover separately.

For a point get on (partition_key, clustering_key):

Check the memtable first. If the full row is there with a recent timestamp, great — but do not stop. Cells may have been written in older SSTables that the memtable does not yet shadow, and tombstones (deletes) in older SSTables must be honoured.
For each SSTable, newest-first:
- Check the bloom filter. Hash the partition key, probe the bits. If any bit is zero, the key is definitely not in this SSTable — skip the SSTable entirely, no disk I/O. The 1% false-positive rate means ~99% of absent keys are eliminated at this gate.
- If the filter says "maybe", binary-search the in-RAM Summary.db to find the Index.db page that would contain the key.
- Read that one Index.db page (a few KB), binary-search it for the exact partition key, and retrieve the byte offset into Data.db.
- Seek to the Data.db offset, decompress the LZ4 chunk at that offset, and scan the partition block for the clustering key. The partition block is itself sorted by clustering key, so this is a binary search or a simple linear scan of a small block.
- Collect the row's cells with their timestamps.
Merge all discovered versions — memtable plus any SSTables that contributed — cell by cell, keeping the latest timestamp per cell. Apply tombstones to suppress deleted cells. Return the surviving row.

For a range scan inside a partition (WHERE user_id = 'priya' AND tweet_time > '2026-04-24T00:00:00Z' LIMIT 100), the per-SSTable flow is the same — bloom filter, summary, index, data — but the Data.db read yields a clustering-key-ordered slice of the partition, and the merge step produces a single sorted stream of rows.

Typical cost in production: 1-3 SSTables contribute actual disk reads (the rest are filtered out by the bloom filter), 1-2 KB of Index.db is read per contributing SSTable, and 4-8 KB of Data.db is read per contributing SSTable. Total: well under 1 ms from a warm OS page cache on an SSD, 5-10 ms on a cold read from an NVMe.

Python — a minimal SSTable writer

Strip the real format down to essentials and the pieces click into place:

import struct, os

class SSTableWriter:
    """Minimal sorted on-disk writer. Keys must be fed in sorted order."""

    def __init__(self, path):
        self.data = open(path + ".Data.db", "wb")
        self.index = open(path + ".Index.db", "wb")
        self.last_key = None

    def write(self, key: str, value: str):
        if self.last_key is not None and key <= self.last_key:
            raise ValueError("keys must arrive in ascending order")
        offset = self.data.tell()
        kb = key.encode("utf-8")
        vb = value.encode("utf-8")
        self.data.write(struct.pack(">I", len(kb)) + kb)
        self.data.write(struct.pack(">I", len(vb)) + vb)
        self.index.write(struct.pack(">I", len(kb)) + kb +
                         struct.pack(">Q", offset))
        self.last_key = key

    def close(self):
        self.data.close()
        self.index.close()

What this captures: a Data.db with length-prefixed (key, value) pairs in sorted order, and an Index.db with (key, offset) pairs for every record. What it skips: LZ4 compression, partition-versus-cell structure, bloom filter, summary, timestamps, TTLs, tombstones. The real Cassandra format is more elaborate, but the skeleton is here.

Why keys must arrive sorted: the whole point of an SSTable is that its readers can binary-search or sequential-scan without a rebuild. If the writer accepted arbitrary order, it would need to sort at close time (extra memory or temp files). In practice the memtable's sorted structure guarantees the flush iteration order is already correct, so enforcing sorted input is a cheap invariant.

Python — a minimal reader

The reader loads Index.db once into a dict, then serves point gets by seeking to the recorded offset and reading the length-prefixed row:

import struct

class SSTableReader:
    def __init__(self, path):
        self.data = open(path + ".Data.db", "rb")
        self.offsets = {}
        with open(path + ".Index.db", "rb") as idx:
            while True:
                header = idx.read(4)
                if not header:
                    break
                klen = struct.unpack(">I", header)[0]
                key = idx.read(klen).decode("utf-8")
                offset = struct.unpack(">Q", idx.read(8))[0]
                self.offsets[key] = offset

    def get(self, key: str):
        if key not in self.offsets:
            return None
        self.data.seek(self.offsets[key])
        klen = struct.unpack(">I", self.data.read(4))[0]
        self.data.read(klen)  # skip stored key
        vlen = struct.unpack(">I", self.data.read(4))[0]
        return self.data.read(vlen).decode("utf-8")

    def close(self):
        self.data.close()

Two methods, thirty lines, and you have the essence of SSTable lookup: Index.db maps key to offset, Data.db holds length-prefixed values, get seeks once and reads once.

Add realism with a bloom filter. A few lines using a standard library or pybloom_live:

from pybloom_live import BloomFilter

class SSTableReaderWithBloom(SSTableReader):
    def __init__(self, path, capacity, error_rate=0.01):
        super().__init__(path)
        self.bloom = BloomFilter(capacity=capacity, error_rate=error_rate)
        for key in self.offsets:
            self.bloom.add(key)

    def get(self, key: str):
        if key not in self.bloom:
            return None  # definitely not here, zero disk I/O
        return super().get(key)

In a real LSM engine, the bloom filter is built during the write pass and persisted to Filter.db, not reconstructed at open time — but the lookup behaviour is identical.

Every column family — every CQL table — has its own set of SSTables in its own directory under the keyspace folder. There is no shared file between tweets and users; they are physically independent. Consequences:

Per-column-family tuning. Compression codec, chunk size, compaction strategy, bloom-filter rate, caching policy, TTL, memtable thresholds — all set per table via CREATE TABLE ... WITH .... Time-series tables pick TimeWindow compaction; reference tables pick Leveled for read-heavy workloads; profile tables pick SizeTiered for write-heavy.
Per-column-family repair and backup. Snapshot, restore, or anti-entropy repair one table without touching others.
Per-column-family I/O accounting. Metrics break down hit rate, flush rate, compaction activity, and disk usage by table — how operators detect hot tables.

In Bigtable and HBase the grouping is finer: a single table has multiple column families, each storing its columns in separate SSTables. The classic example is a web-crawl row where metadata (small, hot) sits in one family and contents (huge HTML, cold) sits in another, so metadata scans do not drag HTML through I/O. Cassandra collapses this — each CQL table is effectively one Bigtable column family — but allows separately-tuned tables instead. DynamoDB hides the file layout behind the service boundary, but internally each table is split into 10 GB partitions each owning its own log-structured storage files.

Compaction — the background maintenance

SSTables accumulate. Every memtable flush produces another; steady-state writes create 10 to 100+ per column family per node over hours. Three problems follow if nothing manages them: reads get slower as each new SSTable adds another bloom filter probe and (rarely) another seek; disk space grows because overwritten cells are not reclaimed and deletes write tombstones rather than erasing cells; and expired TTL data lingers until the containing SSTable is rewritten.

Compaction is the background thread that fixes all three. It reads multiple input SSTables, merges them in sorted order, drops obsolete cells (overwritten, deleted, expired), and writes one (or fewer) output SSTables; old inputs are deleted. SSTable count stays bounded, disk space is reclaimed, tombstones eventually purged.

Strategies vary. SizeTiered (the old default) groups similarly-sized SSTables and merges when enough accumulate — good for write-heavy workloads. Leveled compaction organises SSTables into levels with non-overlapping key ranges — good for read-heavy workloads, higher write amplification. TimeWindow groups SSTables by time window and never merges across windows — perfect for append-only timelines. Each has a dedicated chapter: tiered, leveled, and reclaiming deleted and overwritten keys.

Bloom filter math

A bloom filter for N keys at target false-positive rate p uses roughly m = -N * ln(p) / (ln 2)^2 bits and k = (m/N) * ln 2 hashes. At the default p = 1%: ~9.6 bits per key and 7 hashes — 120 MB for a 100M-row SSTable, 1.2 GB for a billion-row one. At p = 0.1%: 14.4 bits per key. At p = 10%: 4.8 bits per key.

The payoff: for a partition key absent from an SSTable, ~99% of the time the filter returns "no" and the engine skips all disk I/O for that SSTable. Across 50 SSTables, the expected seeks for a key present in only one of them is about 1.5 — one real hit plus 0.5 from false positives. This is the main reason LSM reads are competitive with B-trees despite spreading data across many files. Tune bloom_filter_fp_chance down to 0.001 for read-heavy tables; up for write-heavy append-only ones.

Trace a read through the layers

A read request arrives at the cluster:

SELECT * FROM tweets
WHERE user_id = 42 AND tweet_time = '2026-04-24 10:00:00';

Step through what happens on the replica node that owns the partition.

Coordinator routing. The coordinator node hashes user_id = 42 to a token, looks up the preference list on the ring, and forwards the read to one of the N replicas for that token (or fan-out to R of them for quorum reads). The chosen replica's storage engine now owns the query.
Memtable. The engine first checks the tweets memtable — an in-RAM sorted map — for (partition = 42, clustering = 2026-04-24 10:00:00). Maybe the cell is there (recent write); maybe it is absent. Either way, the engine keeps going because older SSTables may contain additional cells or tombstones to merge.
SSTables, newest-first. Say the tweets directory has 7 SSTable generations: na-1 through na-7. The engine visits them from 7 backwards.
- na-7: hash partition = 42, probe bloom filter. Filter returns "no" — skip entirely, zero disk I/O.
- na-6: bloom filter says "maybe". Binary-search Summary.db in RAM, find the relevant Index.db page offset. Read one ~4 KB Index.db page, binary-search it, get the Data.db offset for partition 42. Seek to Data.db offset; decompress one 64 KB LZ4 chunk. The partition block for user_id = 42 starts here. Binary-search the block for clustering key 2026-04-24 10:00:00. Found — collect the cells with their timestamps.
- na-5 through na-3: bloom filter says "no" on all three — skip.
- na-2: bloom filter says "maybe". Same flow as na-6. The partition block yields a tombstone for one of the cells (it was deleted after being written). Record the tombstone.
- na-1: bloom filter says "no" — skip.
Merge. Engine now has: memtable cells (if any) + na-6 cells + na-2 tombstone. It picks the latest timestamp for each cell, applies the na-2 tombstone to suppress the deleted cell, and assembles the final row.
Return. The row is serialised and returned to the coordinator, which merges replica responses (for quorum reads) and replies to the client.

Cost in hot-cache conditions: two bloom-filter probes (constant time), two Summary.db binary searches (constant time, in RAM), two ~4 KB Index.db reads, two ~64 KB Data.db chunk decompressions, and one small merge. Under 1 ms end-to-end for a warm read on modern SSDs. Five of the seven SSTables were rejected at the bloom filter gate and never touched the disk.

A point get on seven SSTables. Bloom filters eliminate five without disk I/O. The two survivors walk Summary → Index → Data to return their cells. The merge step picks the latest timestamp per cell and applies any tombstones before returning the row.

Common confusions

"Memtable is RAM, SSTable is disk." True — memtable absorbs writes and flushes to an SSTable at threshold. Before flush, the commit log makes writes durable.
"SSTables are mutable." Never. Once written, an SSTable is byte-for-byte immutable until compaction deletes it. Updates produce new rows in new SSTables.
"Bloom filters give exact answers." No — probabilistic. False positives are allowed, false negatives are never. The read path relies on the false-negative guarantee.
"Each table has one SSTable." Far from it. A busy production table has 10 to 100+ SSTables per node at any moment. Compaction bounds the number.
"More SSTables always means slower reads." Usually, but bloom filters flatten the curve — reading through 50 SSTables is often only ~1.5× slower than reading through 5.
"Index.db is an index over rows." It is an index over partitions. Within a partition, the clustering-key sort of Data.db is what finds the specific row.

Everything up to this point is the working model. If you want to go under the hood, the next four topics fill in the physics.

LSM-trees as a family

Cassandra's storage is one instance of the general log-structured merge tree pattern introduced by O'Neil, Cheng, Gawlick, and O'Neil in 1996. The pattern: buffer writes in memory, flush in sorted runs to disk, merge-compact runs in the background. LevelDB, RocksDB, HBase, Cassandra, ScyllaDB, and InfluxDB are all LSM engines; the storage ideas in this chapter transfer directly across them.

Bloom filter construction and tuning

Bloom filters trade space, false-positive rate, and hash count. The optimum hash count for fixed space is k = (m/n) * ln 2; under-hashing leaves gaps, over-hashing collides bits. Cassandra uses Murmur3 for two independent hashes and derives the rest via the Kirsch-Mitzenmacher trick. Production starting points: 0.01 for write-heavy tables, 0.001 for read-heavy tables.

Compaction strategies

Each strategy makes a different trade in the read-write-space triangle. SizeTiered minimises write amplification but has bursty read amplification. Leveled minimises read amplification at 3-5× higher write amplification. TimeWindow optimises for time-ordered append-only data. Pick wrong and your disk fills up or your p99 collapses — covered in the dedicated compaction chapters.

The trie-SSTable format (Cassandra 5.0)

Cassandra 5.0 introduced trie-big, which replaces Index.db + Summary.db with a single trie-based index giving sub-microsecond partition lookups without any RAM-resident summary. Radix-compressed, memory-mapped, and enables index-only queries in some cases. Production adoption is early; big remains the default. Both formats read the same kind of sorted data file — only the index structure differs.

Where this leads next

Chapter 88 — query-first schema design — takes this physical layout and explains why you design Cassandra tables around access patterns, not entities. If every read is a bloom-filter-plus-summary-plus-index-plus-data walk across a handful of SSTables, you want your queries to hit one column family and one partition. That constraint drives denormalisation. Chapter 89 covers joins and cross-partition aggregations — short answer: not supported, move the work into application code or offline pipelines. Chapter 90 quantifies compaction strategies and the operational knobs that decide whether your table stays healthy under production load.

References

Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, and Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI 2006 — the founding paper defining the SSTable format, column-family organisation, and compaction model that Cassandra, HBase, and most subsequent wide-column systems inherited.
O'Neil, Cheng, Gawlick, and O'Neil, The log-structured merge-tree (LSM-tree), Acta DataFlow 1996 — the original paper introducing the buffer-and-merge storage pattern that SSTables implement. Covers the read/write/space trade-off analysis that still drives compaction strategy choice today.
Apache Software Foundation, Cassandra SSTable Format Documentation — the canonical reference for the on-disk file layout, component file semantics, and the evolution from the ma through na format versions.
Sociogram RocksDB team, RocksDB Architecture Guide — a detailed architecture overview of a production LSM storage engine closely related to Cassandra's. Useful as a cross-reference for memtable, SSTable, and compaction mechanics, with more explicit treatment of write-amplification measurement than the Cassandra documentation.
George, HBase: The Definitive Guide, Chapter 8 — Architecture, O'Reilly 2011 — a thorough treatment of the Bigtable-derived HBase storage engine, including HFile (the HBase analogue of SSTable), region servers, and column-family separation motivated by access pattern heterogeneity.
ScyllaDB Engineering, ScyllaDB Storage Internals — ScyllaDB's documentation of its Cassandra-compatible SSTable format, including the shard-per-core extensions and the MC/MD/ME format versions. Useful as a performance-oriented second view on the same file format.

Opening a Cassandra data directory

The anatomy of an SSTable — file by file

How an SSTable is built

How a read finds a row

Python — a minimal SSTable writer

Python — a minimal reader

Column families share configurations, not files

Compaction — the background maintenance

Bloom filter math

Common confusions

LSM-trees as a family

Bloom filter construction and tuning

Compaction strategies

The trie-SSTable format (Cassandra 5.0)

Where this leads next

References