In short

Wide-column stores — Cassandra, HBase, Bigtable, ScyllaDB, and DynamoDB's table model — organise data on disk very differently from a relational database. A row is identified by a partition key, which is consistent-hashed to pick the physical node that stores the row. Within a partition, rows are stored sorted by a clustering key — typically a timestamp or a monotonic id — so range scans within one partition are sequential disk reads. Each row holds a sparse, potentially very wide, map of column-name to value: thousands of columns is fine, and two rows in the same table can have totally different columns. Schema is per-row-flexible; you do not ALTER TABLE to add a column, you just start writing it.

The design trade is blunt. Access within a partition is very fast: point gets on (partition_key, clustering_key) are O(1) logical, range scans on clustering keys are sequential I/O, everything benefits from data locality. Access across partitions is slow or impossible: no joins, no cross-partition transactions, no ad-hoc secondary queries without pre-built indexes. You model your schema around the queries you plan to run — not around the entities and relationships in your domain.

This model exists because of Google's web-crawl problem. In 2004, relational databases could not store 100 billion rows (one per URL), each with hundreds of sparse columns (crawl timestamps, inlinks, cached content, language annotations), and serve both batch scans and point lookups on the same hardware. The 2006 Bigtable paper (Chang et al., OSDI) showed an alternative: partition by row-key, cluster by column-family, forget about joins, scale horizontally to petabytes. Cassandra (Facebook 2008) merged Bigtable's data model with Dynamo's leaderless replication; HBase is an open-source Bigtable clone; DynamoDB ships the same core ideas as a managed service. When your data has a natural primary entity — a user, a device, a sensor, a conversation — with attached time-series or activity history, wide-column is often the right choice, and often the only one that scales.

Build 10 dismantled single-leader replication and showed how Dynamo runs leaderless clusters, consistent-hashes keys onto a ring, and tunes N/R/W. That was the distribution layer. What actually sits on each node — what data structure holds the rows, what queries are cheap, what queries are impossible — was deferred. Build 11 opens that box.

The answer is the wide-column model, and it looks nothing like the relational tables you have used since freshman year.

The Google web-crawl problem

In 2004, Google's crawler had indexed a substantial chunk of the public web — roughly 8 billion pages, on its way to hundreds of billions. For each URL, the crawler kept the last crawl timestamp, the full HTML content, the outbound links, the inbound links ("anchors"), extracted metadata, language detection, near-duplicate cluster IDs, per-crawl-version revision history. Some URLs had tens of anchors; others had tens of thousands. Some were crawled every hour; others once a year. The data was enormous, extremely sparse, and wildly uneven across rows.

No relational database in 2004 could hold it. A pages table with columns (url, content, last_crawl, language, anchor_1, ..., anchor_N) was impossible because N was unbounded and mostly empty. Normalising to a separate anchors(url, anchor_url) table forced a join on every page fetch — a disaster at crawler scale. Denormalising into JSON blobs defeated range scans like "all URLs on wikipedia.org crawled this week".

The engineers who wrote the Bigtable paper (Chang et al., OSDI 2006) described this exactly. They needed three properties that relational databases delivered only in the small: horizontal scale to petabytes, per-row schema flexibility, and locality of access for rows that share a key prefix. Bigtable was the design that came out, and it introduced the wide-column model the rest of this chapter describes.

Why relational falls down here: the relational model trades flexibility for guarantees — fixed schema, joins to recombine normalised entities, transactions across rows. Those guarantees cost money on every write and become expensive to distribute. Petabyte-scale OLTP gives up most of the guarantees anyway — you are not running multi-row ACID across 100 billion web pages. Wide-column accepts the loss of guarantees up front and spends the savings on scale and flexibility.

Relational vs wide-column — a side-by-side

Relational (PostgreSQL, MySQL, Oracle). Tables have a fixed schema declared up front. Every row has the same columns. Data is normalised — each fact stored in exactly one place — and queries use joins to recombine related rows. You model by thinking about entities and their relationships. Queries are ad-hoc: the SQL planner figures out how to combine tables, indexes, and join algorithms.

Wide-column (Cassandra, HBase, Bigtable, DynamoDB). Rows are identified by a compound primary key and hold a sparse map of column-name to value. Two rows in the same table can have completely different sets of columns. Data is denormalised — each query gets its own table, and the same fact may be written to many tables. There are no joins. You model by thinking about access patterns and build one table per query. Ad-hoc queries that were not planned for are slow or impossible.

The Cassandra community motto, due to Patrick McFadin: "in relational, you model your entities; in wide-column, you model your queries." Internalise this sentence. It is the single thing that most trips up engineers coming from Postgres — they try to normalise, and then the joins they expect are not there.

[Figure: a wide-column row layout. A horizontal strip labelled "partition user:priya" (hashed to node C on the ring) holds three rows sorted by clustering key event_time DESC — a page_view at 11:05 with columns url, device, session_id, referer; a click at 11:04 with columns target, session_id; a purchase at 10:58 with columns item_id, amount_inr, payment, session_id, gift_note. The column sets differ row to row, illustrating the sparse per-row schema.]
One partition of a wide-column table. The partition key user:priya consistent-hashes to one physical node. Inside the partition, three rows are stored sorted by event_time (descending). Notice that each row has a different set of columns — a page-view has a URL and a referer; a purchase has an amount and a payment method and a free-text gift note. Two of the rows share a session_id; one has none. Schema is per-row-flexible. A read for "last 100 events for user:priya" is a single sequential scan of this partition on one node.

The primary key — partition plus clustering

A wide-column table has a compound primary key with two named parts.

The partition key decides which physical node in the cluster stores the row. Cassandra and DynamoDB pass it through the consistent-hash function (see Consistent Hashing and Virtual Nodes) to find N preference-list nodes and replicate to all N. Every row with the same partition key lives together.

The clustering key decides the order of rows within a partition. Rows are stored physically sorted by clustering key on disk — this is the storage layout, not a query-time sort. Scanning a partition in clustering-key order is a sequential SSTable read, the fastest I/O pattern the system offers.

Put in CQL (Cassandra Query Language) syntax:

CREATE TABLE user_events (
    user_id    text,
    event_time timestamp,
    event_type text,
    payload    text,
    PRIMARY KEY ((user_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

The inner parentheses around user_id declare it as the partition key. Everything after — here, event_time — is clustering keys. CLUSTERING ORDER BY (event_time DESC) stores rows newest-first within each partition, so "last 100 events for a user" reads the first 100 rows sequentially, no sort.

Why this two-level key matters: the partition key is a sharding and availability decision — one partition lives on one set of N nodes and is the unit of read-your-writes, failure isolation, and single-partition transactions. The clustering key is a layout decision — it determines whether the scans you care about are sequential or random on disk. Get the partition key wrong and queries hit every node; get the clustering key wrong and scans become random I/O. Both choices happen at table-creation time and are nearly impossible to change without rewriting the table.
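The partition-key half of that decision can be sketched with a toy hash ring — three hypothetical nodes, MD5 standing in for Cassandra's default Murmur3 (MD5 is in fact what the older RandomPartitioner used):

```python
import hashlib
from bisect import bisect_right

# Each node owns one token on the ring (real clusters use many virtual nodes).
ring = sorted((hashlib.md5(n.encode()).hexdigest(), n)
              for n in ("node-a", "node-b", "node-c"))

def replicas(partition_key, n=2):
    """First n nodes clockwise from the key's ring position."""
    token = hashlib.md5(partition_key.encode()).hexdigest()
    start = bisect_right(ring, (token, ""))
    return [ring[(start + i) % len(ring)][1] for i in range(min(n, len(ring)))]

# Every row sharing a partition key routes to the same replica set:
assert replicas("user:priya") == replicas("user:priya")
```

Because the routing input is the partition key alone, every clustering-key value under it lands on the same nodes — which is exactly why a partition is the unit of locality.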

The sparse-map row

Inside a row, the data is a dictionary: {column_name: value}. This is the single design choice that most distinguishes wide-column from relational.

Columns are not declared in a global schema in the relational sense. Cassandra does declare them — CREATE TABLE user_events (...) lists the known ones — but collection columns such as map<text, text> accept arbitrary keys at runtime, the CQL descendant of the Thrift-era dynamic column families. DynamoDB is more relaxed still: declare the partition and sort keys, and every other attribute is write-time dynamic. HBase declares only column families; the columns within a family are arbitrary.

Two rows can have different columns. Row A might have 5 columns: {event_type: "click", target: "#btn", session: "s1", device: "mobile", ts: "..."}. Row B might have 50 columns including gift_note and 47 A/B-test flags that A does not. The database does not care. Missing columns consume no storage — sparse on disk, sparse in RAM.

Columns can be added on the fly. Need a referer_source column? Start writing it on new rows. Old rows do not have it; if you want them to, you migrate explicitly. No ALTER TABLE lock, no rewrite of 100 million rows.

This is why wide-column is often mis-described as "schemaless". It is not — partition and clustering keys are schema, and most production tables declare their main columns. What is flexible is the long tail of sparse columns: A/B flags, experiment buckets, per-event metadata, profile attributes that differ by region or tier. Relational requires ALTER TABLE for each; wide-column lets you add them at write time.

Access patterns — wide-column is not a relational substitute

Wide-column scales because it imposes a narrow set of fast access patterns and makes everything outside that set slow or impossible.

Point get by full primary key. SELECT * FROM user_events WHERE user_id = 'priya' AND event_time = '2026-04-24T11:05:00Z'. Coordinator hashes the partition key, routes to N replicas, reads R, merges. O(1) logical cost.

Range scan within a partition. SELECT * FROM user_events WHERE user_id = 'priya' AND event_time > '2026-04-24T00:00:00Z' LIMIT 100. Same routing — one partition, one set of replicas. The scan is a sequential SSTable read. Bread-and-butter for timelines, feeds, per-entity history.

Cross-partition query without a secondary index. SELECT * FROM user_events WHERE event_type = 'purchase' hits every node and scans every partition. Production Cassandra and DynamoDB reject or heavily warn on such queries. Either build a separate table with event_type as the partition key, or run the query through an external batch tool (Spark, Hive, EMR).

Joins. Not supported. No JOIN operator in CQL. Either look up related rows on the application side (round-trips per row), or denormalise by storing the join result inside the row.

Aggregates. COUNT(*) and SUM(amount) are practical only within a single partition; CQL will accept them cluster-wide, but only by scanning every node. Cluster-wide aggregates go through external tools (Spark on Cassandra, Athena on DynamoDB).

Secondary indexes exist but fan out to every node and are catastrophic for high-cardinality columns. Production advice is almost always: build a second table instead.

Why access patterns are king: the wide-column storage layout — sorted SSTables per column-family, compaction, bloom filters, memtables — is optimised for exactly two operations: point get by (partition, clustering) and range scan by clustering inside a partition. Queries that match are fast at any cluster size; queries that do not are slow, period. You design the schema to keep queries inside the sweet spot.

Denormalisation is the default

In relational, you normalise to avoid redundancy; each fact lives in one table and joins recombine. In wide-column, you denormalise to match queries; each query gets its own table and the same fact may live in five tables. Storage is multiplied; query-time cost is flat.

Messaging example. A relational schema: conversations, participants, messages — three tables, normalised; "show my conversations with last messages and participants" is a three-way join.

The wide-column translation splits by access pattern:

-- Query: "show messages in conversation X, ordered by time"
CREATE TABLE messages_by_conversation (
    conversation_id text,
    message_time    timestamp,
    sender_id       text,
    text            text,
    PRIMARY KEY ((conversation_id), message_time)
);

-- Query: "show all conversations this user participates in"
CREATE TABLE conversations_by_user (
    user_id         text,
    conversation_id text,
    last_activity   timestamp,
    title           text,
    other_members   set<text>,
    PRIMARY KEY ((user_id), last_activity, conversation_id)
) WITH CLUSTERING ORDER BY (last_activity DESC);

-- Query: "show messages this user sent across all conversations"
CREATE TABLE messages_by_user (
    user_id         text,
    message_time    timestamp,
    conversation_id text,
    text            text,
    PRIMARY KEY ((user_id), message_time)
) WITH CLUSTERING ORDER BY (message_time DESC);

Three tables, each for one query, all denormalised. Sending a message writes to messages_by_conversation, updates conversations_by_user for every participant, and appends to messages_by_user. One logical send is three or four writes. The reward: each of the three queries runs in one partition scan on one node, regardless of system scale.

Cassandra's batch API groups related writes together — not as an ACID transaction, but as a logged group the coordinator retries until every target succeeds. Write amplification is the tax you pay for flat read latency.
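A minimal sketch of that write path, with Python dicts standing in for the three tables (no batches, no retries — just the fan-out of one logical send):

```python
from collections import defaultdict

# Dicts keyed by partition key stand in for the three denormalised tables.
messages_by_conversation = defaultdict(list)
conversations_by_user = defaultdict(dict)   # user -> {conversation_id: last_activity}
messages_by_user = defaultdict(list)

def send_message(conversation_id, participants, sender, ts, text):
    row = {"message_time": ts, "sender_id": sender, "text": text}
    messages_by_conversation[conversation_id].append(row)    # write 1
    for user in participants:                                # writes 2..k: bump
        conversations_by_user[user][conversation_id] = ts    # each listing
    messages_by_user[sender].append(row)                     # final write

send_message("conv1", ["priya", "arjun"], "priya", 1105, "namaste")
```

One call, four writes for a two-person conversation — the write amplification the chapter describes, made concrete.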

Schema flexibility in practice

Adding a column. In Cassandra, ALTER TABLE user_events ADD column_name text is metadata-only — schema version bump, gossip, done in seconds. Existing rows do not acquire the column; new writes may include it. In DynamoDB, no ALTER at all — just start writing the attribute. Compare to PostgreSQL before version 11 (which made default-valued columns metadata-only): adding a non-null column with a default to a billion-row table meant a full table rewrite that could take hours and block writes.

Removing a column. Stop writing it; old rows still have it; a background scan with DELETE column_name FROM ... or a TTL cleans up. No blocking operation.

Changing a column type. Not supported directly. Dual-write the new column, backfill, drop the old. The application manages the migration — more work than ALTER TABLE ... TYPE ..., but never blocks writes.
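The three-phase migration can be sketched with plain dicts standing in for sparse rows — the column names (amount, amount_inr) are illustrative, not from any real schema:

```python
# Hypothetical type migration: text column "amount" -> int column "amount_inr".
rows = [
    {"amount": "1299"},                     # old row, not yet migrated
    {"amount": "250", "amount_inr": 250},   # row written after dual-write began
]

def write_amount(row, amount):
    """Phase 1 -- dual-write: every new write fills both columns."""
    row["amount"] = str(amount)
    row["amount_inr"] = amount

def backfill(rows):
    """Phase 2 -- backfill: scan old rows and derive the new column."""
    for row in rows:
        if "amount_inr" not in row and "amount" in row:
            row["amount_inr"] = int(row["amount"])

backfill(rows)
# Phase 3: once every reader uses amount_inr, stop writing "amount" and delete it.
```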

Per-row variation. Premium users can have a premium_expiry column; free users do not. Row size grows only for users who have the column.

Column families — Bigtable and HBase

Bigtable introduced one grouping above columns: the column family. Columns are grouped into families declared at table-creation time ("metadata", "contents", "anchors"), and each family is stored in a separate SSTable on disk.

Why separate storage per family? Access patterns for different columns of the same row are often wildly different. The contents family in a web-crawl table — full HTML of each page — is huge and rarely read alongside metadata (timestamp, HTTP status). Storing them together forces every metadata read to drag multi-megabyte HTML through I/O. Storing them separately means metadata reads touch only the metadata SSTable.

Families also enable per-family configuration: different compression, different TTL, in-memory versus on-disk. HBase inherited this directly. Cassandra's "column family" terminology is related but works differently — what Cassandra calls a column family is closer to a table. DynamoDB simplified further: no families, just attributes, all stored together.

Python sketch — a toy wide-column storage

Strip the model down to a data structure and the access primitives become unambiguous:

from sortedcontainers import SortedDict

class WideColumnStore:
    """A single-node wide-column store. No replication, no SSTables,
    no compaction — just the logical model."""

    def __init__(self):
        # partition_key -> SortedDict of (clustering_key -> {column: value})
        self.partitions = {}

    def put(self, pk, ck, columns):
        if pk not in self.partitions:
            self.partitions[pk] = SortedDict()
        # Merge semantics: new columns overwrite, absent columns persist.
        row = self.partitions[pk].setdefault(ck, {})
        row.update(columns)

    def get(self, pk, ck):
        return self.partitions[pk][ck]

    def range(self, pk, ck_start, ck_end, limit=None):
        partition = self.partitions.get(pk, SortedDict())  # empty, but still irange-able
        out = []
        for ck in partition.irange(ck_start, ck_end):
            out.append((ck, partition[ck]))
            if limit is not None and len(out) >= limit:
                break
        return out

    def delete_column(self, pk, ck, column):
        self.partitions[pk][ck].pop(column, None)

Every operation does what the model says: put upserts and merges columns, get returns the column map for one (partition, clustering) pair, range scans a clustering-key interval within one partition, delete_column drops a single cell.

Why SortedDict and not dict: the clustering-key order is the whole point. Range scans must be O(log n + results). In real Cassandra this is the memtable — an in-RAM sorted map that buffers writes until flush to on-disk SSTable. The SSTable is also sorted by clustering key, so the range-scan primitive survives the RAM-to-disk transition.

What the sketch skips: replication across nodes, SSTable compaction, bloom filters, per-column-family separation, cell timestamps for conflict resolution. Those are engineering on top of the logical model.
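One of the skipped pieces, cell timestamps, fits in a few lines. This is a sketch of last-write-wins reconciliation — a simplification of Cassandra's actual rule — where each cell carries its write timestamp and replicas merge divergent rows column by column:

```python
def merge_rows(row_a, row_b):
    """Last-write-wins merge: each cell is (value, write_timestamp);
    for each column, the cell with the higher timestamp survives."""
    merged = dict(row_a)
    for col, cell in row_b.items():
        if col not in merged or cell[1] > merged[col][1]:
            merged[col] = cell
    return merged

# Two replicas diverged on the same row:
replica_1 = {"device": ("mobile", 100), "session": ("s42", 100)}
replica_2 = {"device": ("desktop", 140)}
merged = merge_rows(replica_1, replica_2)
```

The merge keeps the newer device cell from replica_2 and the session cell that only replica_1 has — no coordination needed, which is why leaderless replication can repair rows lazily.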

When wide-column is the right choice

Rule of thumb: if your dominant query is "give me the latest N items for a specific entity", wide-column is the right tool. If it is "items matching ad-hoc filters", it is not.

When wide-column is the wrong choice

Design a Twitter-like timeline in wide-column

Build a social feed for 100M users, each writing a few tweets per day and reading a timeline on every app open. The relational approach — tweets(tweet_id, author_id, time, text) joined against follows(follower, followee) — does not scale; the follow-graph is enormous and the sort touches unbounded data.

The wide-column design splits into two tables.

Table 1: a user's own tweets.

CREATE TABLE tweets_by_user (
    author_id  text,
    tweet_time timestamp,
    tweet_id   text,
    text       text,
    media_url  text,
    PRIMARY KEY ((author_id), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC);

Partition by author_id; all of a user's tweets live on one set of N nodes, sorted newest-first. "Priya's last 100 tweets" is a single-partition range scan.

Table 2: the timeline of each follower.

CREATE TABLE home_timeline (
    user_id      text,
    tweet_time   timestamp,
    author_id    text,
    tweet_id     text,
    text_snippet text,
    PRIMARY KEY ((user_id), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC);

Partition by user_id — the recipient, not the author. When Priya tweets, the service looks up her follower list and writes one row to the home_timeline of every follower. 5000 followers means 5001 writes per tweet. This is write fan-out — denormalising at write time so reads are trivial.

"Show Arjun's home timeline" is a single-partition range scan of home_timeline WHERE user_id = 'arjun' LIMIT 100. No joins, no follow-graph walk, no sort. Stays fast even if Arjun follows 10,000 accounts.

[Figure: dual-table design for a timeline feed. Left: the tweets_by_user partition for author priya, three tweet rows keyed by tweet_time (11:05 "hello Delhi", 10:30 "samosas at CP", 09:12 "morning run"). An arrow labelled "fan-out on write" points right to the home_timeline partition for follower arjun, where Priya's rows appear duplicated alongside rows from other authors (ravi, meena).]
Two wide-column tables for a timeline. Left: Priya's tweets, partitioned by her author_id. Right: Arjun's home timeline, partitioned by his user_id. When Priya tweets, the service writes one row on the left and one row into every follower's home_timeline on the right — Arjun's included. Reading Arjun's timeline is a single-partition scan of the right-hand table. Priya's original row also lives on the left for direct "visit Priya's profile" queries.

A celebrity tweet to 50M followers produces 50M writes — the fan-out-on-write problem. Twitter's real system is a hybrid: fan-out on write for normal users, fan-out on read for celebrities (pull their tweets at read time). Both paths are wide-column scans; the choice is which table gets duplicated writes.
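The hybrid read path is a merge of sorted streams: the precomputed home_timeline partition plus the tweets_by_user partitions of any celebrities the reader follows, both already in descending time order. A sketch with illustrative data, not Twitter's actual code:

```python
import heapq

# Rows are (tweet_time, author_id, text), newest-first (clustering order DESC).
precomputed = [(1105, "priya", "hello Delhi"), (1030, "priya", "samosas at CP")]
celebrity_tweets = {"bigstar": [(1100, "bigstar", "new film!"),
                                (900, "bigstar", "thanks 50M!")]}
follows_celebrities = {"arjun": ["bigstar"]}   # celebrity follows, kept separately

def read_timeline(user, limit=3):
    # Pull celebrity partitions at read time; merge with the precomputed rows.
    pulled = [celebrity_tweets[c] for c in follows_celebrities.get(user, [])]
    merged = heapq.merge(precomputed, *pulled, key=lambda r: r[0], reverse=True)
    return list(merged)[:limit]
```

Both inputs are single-partition scans; the only extra cost is the k-way merge at read time, paid instead of 50M writes at tweet time.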

Common confusions

Going deeper

The Bigtable paper

Chang, Dean, Ghemawat et al. published Bigtable: A Distributed Storage System for Structured Data at OSDI 2006 after years of production use at Google. The paper describes the row-key-plus-column-family model, the SSTable on-disk format, the Chubby lock service for metadata, the three-level tablet location hierarchy, and performance measurements at petabyte scale. HBase, Cassandra, ScyllaDB, Hypertable, Accumulo, and DynamoDB all owe design debts to it. Bigtable itself runs on GFS (SOSP 2003); MapReduce (OSDI 2004) consumes it for offline analysis.

HBase, Cassandra, ScyllaDB

HBase (2007) was an open-source Bigtable reimplementation for the Hadoop ecosystem, running on HDFS instead of GFS. It preserves the data model closely — column families, regions, coprocessors.

Cassandra (Facebook 2008) merged Bigtable's data model with Dynamo's distribution. Architecturally it is closer to Dynamo — gossip membership, leaderless quorum, no master — but the SSTable format and column-family organisation are Bigtable-derived.

ScyllaDB (2015) is a C++ reimplementation of Cassandra, wire-compatible at the CQL protocol level but using a thread-per-core shared-nothing architecture (Seastar). It delivers 5-10× higher per-node throughput on the same hardware.

DynamoDB's management

Amazon's DynamoDB service is a managed database whose name contains "Dynamo" but whose architecture has drifted. Elhemali et al. published Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service at USENIX ATC 2022, describing auto-sharding into partitions (10 GB units), adaptive capacity, Multi-Paxos replication per partition for strong consistency, and multi-region global tables. Recognisably wide-column in its data model, but structurally closer to sharded range-partitioning than to Dynamo-the-paper.

Where this leads next

This chapter introduced the model. The rest of Build 11 drills into the pieces.

By the end of Build 11 you will design a production wide-column schema for a realistic workload — a ride-hail event stream, a messaging service, an IoT pipeline — and reason about scaling bottlenecks before deployment.

References

  1. Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, and Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI 2006 — the founding paper of the wide-column model. Describes the row-key-plus-column-family design, the SSTable format, Chubby-based metadata, and production deployments at Google including the web-crawl use case that motivated the original design.
  2. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007 — the leaderless-replication paper that Cassandra combined with Bigtable's data model. Relevant here for the distribution half of Cassandra's design.
  3. Apache Software Foundation, Apache Cassandra Architecture Documentation — the canonical reference for Cassandra's architecture, data model, partition and clustering keys, and CQL. Especially useful as a cross-check on the data-modelling vocabulary this chapter uses.
  4. Amazon Web Services, Amazon DynamoDB Developer Guide — the managed DynamoDB reference, with coverage of the table/item/attribute model, partition and sort keys, global secondary indexes, and transaction support added in 2018.
  5. Kleppmann, Designing Data-Intensive Applications, Chapter 3 — Storage and Retrieval, O'Reilly 2017 — the clearest pedagogical treatment of LSM-trees, SSTables, and the Bigtable data model, with worked comparisons to B-tree-based relational storage engines.
  6. Allen, Bouchier, Boyd, et al., Scaling Cassandra beyond a Thousand-Node Cluster, Netflix Technology Blog 2019 — a production account of running Cassandra at hundreds-of-petabytes scale at Netflix, with concrete discussion of partition design, hot-partition pathology, and denormalisation strategies learned the hard way.