In short
In Cassandra, ScyllaDB, and DynamoDB, a table's primary key is always a compound of two parts: a partition key and zero or more clustering keys. The partition key is consistent-hashed to pick which physical node (and which N replicas) store the row. Every row that shares a partition key lives on the same set of nodes — that is the entire purpose of the concept. Within a partition, rows are physically sorted on disk by the clustering keys in declared order; range queries on those keys are sequential SSTable reads, which is as fast as the storage layer goes.
Choosing the partition key is a distribution decision. A good one has high cardinality (billions of distinct values), aligns with your dominant read pattern ("give me everything for user X"), and bounds each partition to at most around 100 MB. A bad one — a boolean, a country code, a day-of-week — crushes the whole cluster into two or three hot partitions that cannot be rebalanced by adding nodes. Choosing clustering keys is a sort-order decision. The column you range-query by goes first: (event_time, event_id) sorts events by time (newest first if declared DESC) and breaks time ties by id; range filters on event_time are free, range filters on event_id alone are not.
Cassandra's CQL syntax encodes both choices in one place and one typo changes the answer entirely: PRIMARY KEY ((user_id), event_time, event_id) declares user_id as the partition key and the other two as clustering keys. Slide the inner closing parenthesis to after event_time — PRIMARY KEY ((user_id, event_time), event_id) — and user_id, event_time become a composite partition key: every distinct (user_id, event_time) pair hashes to a different node, timelines shatter, and "last 100 events for user X" becomes a cluster-wide scan. One misplaced parenthesis; three orders of magnitude of latency difference. This chapter walks through the theory and the traps.
Chapter 85 introduced the wide-column model: rows live on the node their partition key hashes to, sorted by clustering key within the partition, each row a sparse map of columns. That chapter treated the compound primary key as a single concept. This one separates the two halves — partition key and clustering key — and shows how each one independently governs which queries run in milliseconds and which ones are impossible to scale.
Most Cassandra production incidents trace back to a primary key that looked sensible during schema review and turned pathological in week four, once the access pattern met the real data distribution. The CQL syntax makes the mistake easy to commit. Understanding what each bracket does, and what each choice costs, is the whole game.
The primary key is a tuple
Every wide-column table has a primary key declared as a tuple. In Cassandra's CQL the declaration looks like this:
CREATE TABLE user_events (
    user_id text,
    event_time timestamp,
    event_id uuid,
    payload text,
    PRIMARY KEY ((user_id), event_time, event_id)
);
Read the PRIMARY KEY clause left to right. The inner parentheses wrap the partition key — here just user_id. Everything after the inner parentheses is a clustering key, ordered — here event_time first, then event_id. The tuple (user_id, event_time, event_id) uniquely identifies a row; Cassandra rejects duplicates and upserts over them.
The two halves do completely different jobs.
The partition key hashes to a node. Cassandra takes the partition key's value, runs it through the Murmur3 hash function, and looks the resulting token up in the consistent-hash ring (see Consistent Hashing and Virtual Nodes). A clockwise walk collects N distinct physical nodes into the preference list; those N nodes hold all the replicas. The partition key is the unit of distribution. Every row with the same user_id value lands on the same N replicas, regardless of its event_time.
The clustering keys sort within the partition. Inside the storage for one user_id, rows are laid out on disk in the order (event_time ASC, event_id ASC). That ordering is physical — it is not a query-time sort, it is the on-disk byte sequence. A range scan from event_time = T1 to event_time = T2 reads a contiguous run of rows from one SSTable.
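A sketch of why that ordering matters: keep a partition's rows sorted by clustering-key tuples, and a time range becomes two binary searches plus one contiguous slice. The data here is hypothetical; only the standard library is used.

```python
import bisect

# One partition's rows, kept sorted by clustering key (event_time, event_id).
rows = sorted([
    ((100, "e1"), {"payload": "a"}),
    ((100, "e2"), {"payload": "b"}),
    ((105, "e3"), {"payload": "c"}),
    ((110, "e4"), {"payload": "d"}),
])
keys = [k for k, _ in rows]

# Range scan: event_time in [100, 105]. Two binary searches bound a
# contiguous slice -- the analogue of a sequential SSTable read.
lo = bisect.bisect_left(keys, (100,))
hi = bisect.bisect_left(keys, (106,))  # exclusive upper bound on event_time
hits = rows[lo:hi]
```

Because the slice is contiguous, the cost is proportional to the rows returned, not to the partition size before the range.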
Why this tuple-with-inner-parens is worth a whole chapter: the ordering and grouping of the key columns are not just metadata — they are the storage layout. Adding a column to the partition key moves the data. Reordering clustering keys rewrites every SSTable. These are decisions you can change only by creating a new table and rewriting the data in bulk.
Two declarations look similar and do totally different things. Compare:
-- Variant 1: partition key is user_id; clustering = event_time.
PRIMARY KEY ((user_id), event_time)
-- Variant 2: composite partition key (user_id, event_time); no clustering.
PRIMARY KEY ((user_id, event_time))
Variant 1 stores all of Priya's events on one set of replicas, sorted by time. "Last 100 events for Priya" is a one-partition range scan. Variant 2 hashes each (user_id, event_time) pair separately — every event of Priya's lands on a different node. "Last 100 events for Priya" is not even expressible without ALLOW FILTERING and a cluster-wide scan. The two variants differ by the position of a single parenthesis in the schema file.
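The difference is easy to demonstrate with a stand-in hash (hashlib here; real Cassandra uses Murmur3 and a vnode ring):

```python
import hashlib

def node_for(pk_tuple, num_nodes=10):
    # Stand-in for Murmur3 + ring lookup: a stable hash of the partition key.
    return int(hashlib.md5(repr(pk_tuple).encode()).hexdigest(), 16) % num_nodes

event_times = range(1000)

# Variant 1: partition key (user_id) -- every event of Priya's routes to one node.
v1_nodes = {node_for(("priya",)) for _ in event_times}

# Variant 2: composite partition key (user_id, event_time) -- events scatter.
v2_nodes = {node_for(("priya", t)) for t in event_times}
```

Variant 1 touches a single node; variant 2 spreads the same timeline across the cluster, which is exactly why the timeline query stops being answerable from one place.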
What the partition key determines
The partition key is load-bearing in three distinct ways, and you have to think about all three before declaring it.
It determines distribution across the cluster. Cassandra and DynamoDB pass the partition key through a hash function and place the row at the resulting ring position. With a good hash, a large population of distinct partition-key values spreads roughly uniformly across the physical nodes. With one distinct partition-key value, one node does all the work while the others sit idle. Distribution is only as good as the cardinality of the partition key.
It determines the boundary of efficient queries. A single CQL query can only reach one partition efficiently. The coordinator hashes the WHERE-clause partition key, routes the request to the preference list, and collects a response. If the WHERE clause does not restrict the partition key, the coordinator has no route — it would have to contact every node in the cluster and scatter-gather the results. Production Cassandra rejects such queries unless you add ALLOW FILTERING, and ALLOW FILTERING is almost always a mistake.
It determines maximum partition size. All rows with the same partition key live together on disk in one logical unit. Cassandra stores the compacted form of each partition as a run of bytes in an SSTable; if that run exceeds a few hundred megabytes, reads slow down because the per-partition index overhead grows and bloom filters become less selective. The community rule of thumb is to keep each partition under 100 MB for healthy OLTP performance; beyond roughly 2 GB, Cassandra's read behaviour degrades noticeably. DynamoDB caps each partition at 10 GB and automatically splits partitions that grow beyond it.
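The cardinality point can be simulated in a few lines (hashlib standing in for Murmur3; the node count is illustrative):

```python
import hashlib
from collections import Counter

def node_for(key, num_nodes=100):
    # Stand-in for Murmur3 + the consistent-hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_nodes

# High cardinality: 100,000 distinct user ids land on all 100 nodes.
high = Counter(node_for(f"user_{i}") for i in range(100_000))

# Low cardinality: a boolean key can never use more than two nodes,
# no matter how many machines the cluster has.
low = Counter(node_for(v) for v in ["true", "false"] * 50_000)
```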
The filter user_id = 'priya' routes the query to a single node via the consistent-hash ring. Once there, the partition is a run of rows on disk sorted by the clustering key event_time DESC, so the range scan event_time > '2026-04-24T00:00' is a sequential read of a contiguous byte run. Partition key picks the machine; clustering key picks the slice.
What the clustering keys determine
Clustering keys do one job: they decide the on-disk sort order within a partition, and therefore which range queries are cheap.
Sort order. Cassandra stores rows physically sorted by the clustering keys in the declared order. (event_time, event_id) sorts primarily by event_time; ties within the same event_time are broken by event_id. WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC) flips the primary sort to descending, which is almost always what you want for timeline-style tables where the newest entries come first.
Range-query capability. The clustering keys are the only columns you can range-filter on in a WHERE clause without ALLOW FILTERING. And even then, the filter must be a prefix of the clustering-key tuple. Given PRIMARY KEY ((user_id), event_time, event_id) you can filter:
- WHERE user_id = ? AND event_time > ? — valid, ranges on the first clustering column.
- WHERE user_id = ? AND event_time = ? AND event_id > ? — valid, uses both in prefix order.
- WHERE user_id = ? AND event_id > ? — rejected: skips event_time, not a prefix.
This rule falls out of the physical layout directly. The rows are sorted by event_time first, then event_id. A range on event_time alone reads a contiguous byte run. A range on event_id alone would need to touch every event_time bucket to find matching rows — it is not contiguous and the engine refuses to pretend otherwise.
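The prefix rule is mechanical enough to express as a small checker (a hypothetical helper, not part of any driver API):

```python
def range_filter_allowed(clustering_keys, equality_cols, range_col):
    """A range on range_col is legal only if every clustering column
    before it is pinned by equality -- i.e. the filter is a prefix."""
    for col in clustering_keys:
        if col == range_col:
            return True
        if col not in equality_cols:
            return False
    return False
```

range_filter_allowed(["event_time", "event_id"], set(), "event_id") returns False — the same rejection Cassandra issues for the event_id-only filter above.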
Nested tie-breakers. Clustering keys after the first are pure tie-breakers. (event_time, event_id) means events at the same timestamp get a deterministic order — event_id ASC — so duplicate timestamps do not cause duplicate primary keys. A single event_time clustering key is sometimes enough if your timestamps are truly unique, but at scale two events can arrive at the same millisecond and one will silently overwrite the other. The tie-breaker column is cheap insurance.
Why prefix-only and not arbitrary-column range: the SSTable format stores rows in one sorted sequence per partition. A range on the first clustering column is an interval in that sequence. A range on a later column — skipping one — is not an interval, it is an arbitrary subset requiring a full partition scan to find. Cassandra declines to disguise that cost behind convenient syntax.
Good partition-key choices
A good partition key has four properties at once. Miss any one and the table will eventually break.
High cardinality. There should be at least tens of thousands and ideally billions of distinct values. Every distinct value becomes its own partition and can live on its own preference list. Low cardinality collapses the ring into a few points regardless of how many nodes the cluster has.
Alignment with the primary access pattern. The partition key should be whatever your dominant query already filters by. If you ask "last N events for user X" all day, user_id is the partition key. If you ask "last N readings for sensor Y", sensor_id is. The query-first mindset (see chapter 88) reads the query and names the partition key from it.
Bounded partition size. Each distinct partition-key value's data should fit comfortably under 100 MB of compacted bytes. A user with 200 events a day storing 1 KB per event accumulates ~73 MB per year — fine in year one, past the threshold during year two. A sensor emitting once a second and storing 200 bytes per reading produces ~6.3 GB per year by itself — not fine, and needs bucketing into a composite partition key.
Even distribution across values. Cardinality is necessary but not sufficient. If 80% of your traffic lands on 0.1% of your partition-key values (a power-law distribution common in user-facing systems — see the celebrity problem below), you get hot partitions even with billions of distinct keys. The best partition keys are ones where both the count of rows per partition and the request rate per partition are reasonably even.
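The growth arithmetic is worth scripting into schema review. A tiny helper (decimal megabytes; the numbers are the examples above):

```python
def mb_per_year(events_per_day, bytes_per_event):
    # Compacted bytes accumulated per partition per year, in decimal MB.
    return events_per_day * bytes_per_event * 365 / 1_000_000

user_feed = mb_per_year(200, 1_000)  # 73.0 MB/year: crosses 100 MB in year two
sensor = mb_per_year(86_400, 200)    # ~6,307 MB/year: needs a bucketed key
```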
Typical good choices in real systems:
- user_id for per-user feeds, notifications, preferences.
- device_id or sensor_id for IoT telemetry.
- session_id for ephemeral conversation state.
- order_id for order-centric views (line items attached to an order).
- tenant_id combined with a sub-key for multi-tenant SaaS isolation.
Bad partition-key choices
The failure modes are stereotyped. Once you see them named, you can spot them at schema review.
Low cardinality. A boolean column has two values, so the whole table splits across exactly two partitions no matter how large your cluster. A country_code has about 200 distinct values, so you get at most 200 partitions — even on a thousand-node cluster, 800 nodes are idle. A day_of_week has seven values; a category enum has perhaps a dozen. These are catastrophic partition keys for anything past a few GB of total data.
Monotonically increasing values. A current_timestamp_truncated_to_day looks like a reasonable partition key for time-series data, but it is pathological. Every write between midnight and 11:59 PM lands on today's partition. All of your writes land on the replica set that owns that day's hash. Tomorrow the hot spot moves, today's partition cools, yesterday's is cold, and the cluster's write bandwidth is gated by one preference list at a time. You paid for N machines and are using three.
Unbalanced distributions (the celebrity problem). PRIMARY KEY ((celebrity_id), post_time) looks uniform — millions of celebrities, each a distinct partition. But the most-followed celebrity has 100M followers fetching her posts, while the average celebrity has 500 followers. The top-1% partitions receive 90% of the traffic. No amount of cardinality helps; the skew is in the request rate, not the key space. The production fix is usually to split the hot partition at the application layer — (celebrity_id, shard_bucket) where shard_bucket is a round-robin modulo — and aggregate at read time.
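A sketch of that application-level split (helper names and the bucket count are illustrative choices):

```python
import random

NUM_BUCKETS = 8  # tuning knob: more buckets, more write parallelism, more read fan-out

def write_partition_key(celebrity_id):
    # Spread writes for the hot entity across its buckets.
    return (celebrity_id, random.randrange(NUM_BUCKETS))

def read_partition_keys(celebrity_id):
    # Reads must fan out to every bucket and merge newest-first in the app.
    return [(celebrity_id, bucket) for bucket in range(NUM_BUCKETS)]
```

Only entities known (or measured) to be hot need bucketing; applying it everywhere taxes every read with the fan-out.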
Monotonic IDs from a single sequence. An auto-incrementing order_id as the partition key is borderline. Cardinality is fine; distribution is even. But if your query is "recent orders", the recent partition-key values would need to cluster together in hash space, which happens only with a locality-sensitive hash — and Murmur3 is deliberately uniform. So the distribution is fine, but the "give me recent orders" query becomes a cross-partition scan. The real issue here is access-pattern alignment, not distribution.
Why hot partitions cannot be rebalanced away: the ring distributes distinct partition keys evenly. Within one partition key, all replicas live on the N nodes determined by its hash; you cannot spread one partition across more than N nodes without changing the schema. When 90% of requests go to one partition, no number of additional machines helps — the bottleneck is the fixed preference list of that single partition. The fix has to be schema-level: give the hot entity more distinct partition keys by splitting it.
Composite partition keys
Sometimes no single column has the right combination of high cardinality, query alignment, and bounded size. A sensor that emits once a second produces too much data per year to fit in one sensor_id-keyed partition. A tenant in a multi-tenant SaaS might have billions of rows total — too much for one tenant_id partition — but the typical query still filters by tenant.
The fix is a composite partition key — two or more columns inside the inner parentheses, hashed together:
CREATE TABLE sensor_readings (
    sensor_id text,
    day date,
    ts timestamp,
    temp float,
    PRIMARY KEY ((sensor_id, day), ts)
);
The inner parens contain two columns. Cassandra hashes the pair (sensor_id, day) as a single value; each distinct pair becomes its own partition on its own preference list. Clustering key ts sorts within.
Now each partition is one sensor's data for one day — bounded at 86400 readings (one per second, worst case), which is comfortable. Different days for the same sensor live on different nodes, so "sensor 42 on 2026-04-24" is one partition but "sensor 42 across all time" is a cross-partition query.
This pattern — called bucketing or time-bucketing — shows up in every production time-series schema. The bucket column trades range-query breadth (you can no longer grab a month of data in one partition scan) for bounded partition size (you can now store a decade without any one partition exploding). The right bucket granularity depends on event rate and retention: day for 1 Hz sensors, hour for kHz sensors, minute for high-frequency trading tickers.
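Picking the granularity reduces to arithmetic: take the coarsest bucket whose worst-case row count fits a target. A sketch (the row target is an assumed tuning knob; a larger target maps the same event rate to a coarser bucket):

```python
BUCKETS = [("minute", 60), ("hour", 3_600), ("day", 86_400), ("month", 2_592_000)]

def pick_bucket(events_per_second, max_rows_per_partition=1_000_000):
    # Coarsest bucket whose worst-case row count stays under the target.
    # Falls back to "minute" if even that is too fine for the rate.
    best = BUCKETS[0][0]
    for name, seconds in BUCKETS:
        if events_per_second * seconds <= max_rows_per_partition:
            best = name
    return best
```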
Multi-tenant SaaS uses a similar idea for isolation:
CREATE TABLE tenant_resources (
    tenant_id text,
    resource_id uuid,
    created_at timestamp,
    payload text,
    PRIMARY KEY ((tenant_id), resource_id, created_at)
);
Here tenant_id alone is the partition key. Every tenant gets its own partition — good for isolation (a noisy tenant's hot partition cannot drown out others) and for compliance (a tenant's data lives on an identifiable preference list). Clustering sorts resources within the tenant by id, then by creation time for versioning.
Good clustering-key choices
The first clustering key should be the column you range-query by. If you ask for data newer than a timestamp, that timestamp is the first clustering key. If you ask for sequence numbers in a range, that sequence number is first. Every clustering column after the first is a tie-breaker.
CLUSTERING ORDER BY (event_time DESC) is free and worth setting. Descending storage means "the latest row" is the first byte of the partition — the hottest-cached byte. A LIMIT 100 on a newest-first query reads the first 100 rows and stops. If you ascend-order the table, the same query has to scan to the end of the partition and take the last 100, which in a large partition means scanning everything. The order flag is metadata-only at table-create time; it has zero runtime cost once chosen, so choose it deliberately.
A worked comparison. For a table tracking user events, PRIMARY KEY ((user_id), event_time, event_type) with ASC ordering means:
- "Last 100 events for user X" scans from the end of the partition, expensive if the partition has 10M rows.
- With WITH CLUSTERING ORDER BY (event_time DESC), the same query reads the first 100 rows of the partition.
Cassandra's query planner does not reorder rows on the fly; ORDER BY in CQL works only for the clustering order, and only in the direction you declared (or the exact reverse). You cannot say "sort by event_type instead" at query time unless event_type is a clustering key and the query has already constrained everything before it.
The ALLOW FILTERING trap
Cassandra's CQL has a keyword that looks innocent and is not: ALLOW FILTERING. If you write a query whose WHERE clause does not match the primary-key structure — no partition key, or a partition key plus some non-prefix clustering filter — Cassandra rejects it with the message "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance". Append ALLOW FILTERING to the query and it runs.
What does it do? It scans every partition in the range it can reach — which, without a partition-key restriction, means every partition in the cluster. A million-row table becomes a million reads. A billion-row table becomes a billion reads. The query that returned in 100 ms during testing with ten thousand rows becomes a cluster-wide thirty-minute sweep in production.
Production rule: never use ALLOW FILTERING in OLTP. If the query is important, the schema is wrong — create a second table whose primary key matches the query's filter, and write to both tables on every insert. If the query is rare enough that correctness matters more than latency, run it from Spark or an equivalent analytics tool against a snapshot, not against the live cluster. If you find ALLOW FILTERING in a production codebase, treat it as a latent outage.
Why the keyword exists at all: during development and one-off data inspection, filtering is genuinely useful. Cassandra does not want to prevent you from exploring a small table by hand. But the keyword is opt-in precisely to prevent anyone shipping it to production by accident. Seeing it in a pull request is the signal that the data model needs a new table, not a bigger hint.
Secondary indexes — the partial escape
Cassandra ships a secondary-index feature that looks like the relational index you know from Postgres. CREATE INDEX ON user_events (event_type) lets you WHERE event_type = 'purchase' without ALLOW FILTERING. For real workloads, though, secondary indexes are almost always the wrong answer.
Under the hood, a Cassandra secondary index is a per-node inverted table — each node maintains, for its local partitions, a lookup from indexed-column value to partition key. A query that uses the index still has to contact every node in the cluster, because any partition might contain a row matching the filter. The coordinator fans the query out to every node, each node consults its local index, and results are merged. For low-cardinality values (event_type = 'click' matching 90% of rows), that is roughly a full table scan. For very high-cardinality values (user_email = 'specific@example.com' matching one row), the fan-out is wasted on 99% of the nodes that have no match.
Cassandra 3.4 added SASI indexes (SSTable Attached Secondary Index), an improved implementation with better performance for prefix and range queries. Still, SASI indexes are scoped per SSTable and still fan out per node; they are an improvement, not a solution.
The production pattern is unambiguous: if you query by a non-primary-key column often, create a second table with that column as the partition key and write to both. The two tables are kept consistent by the application — the write path does two inserts instead of one, and the read path picks whichever table matches the query. Storage doubles; query latency stays flat. This is the "query a table per access pattern" dictum from chapter 85, applied to what relational databases would solve with an index.
Worked example — Twitter timeline
Concrete example. Each Twitter user has a timeline of their own tweets; the most common read is "give me the latest N tweets by user X". The table:
CREATE TABLE tweets (
    user_id bigint,
    tweet_time timestamp,
    tweet_id bigint,
    text text,
    media_url text,
    PRIMARY KEY ((user_id), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC, tweet_id DESC);
Partition key user_id. Clustering (tweet_time DESC, tweet_id DESC) — newest first, ties broken by id. Every tweet of a user's lands on the same preference list and sits sorted newest-first on disk.
Query 1 — latest 100 tweets for user 5. SELECT * FROM tweets WHERE user_id = 5 LIMIT 100; Coordinator hashes 5, routes to the three replicas, one of them reads the first 100 rows of the partition, returns. Total latency: a few milliseconds. Scales to billion-user systems trivially.
Query 2 — user 5's tweets in the last hour. SELECT * FROM tweets WHERE user_id = 5 AND tweet_time > '...' LIMIT 100; Same partition, range-scan on tweet_time, also milliseconds. The range filter is a prefix of the clustering key, so no ALLOW FILTERING.
Query 3 — all tweets across all users in the last hour. No partition-key restriction, so Cassandra refuses without ALLOW FILTERING. With ALLOW FILTERING, the coordinator scans every partition on every node — billions of partitions, most of them empty for this filter. Unusable. The real answer is a second table:
CREATE TABLE tweets_by_hour (
    hour timestamp,  -- truncated to the hour
    tweet_time timestamp,
    user_id bigint,
    tweet_id bigint,
    text text,
    PRIMARY KEY ((hour), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC, tweet_id DESC);
Now "all tweets in the last hour" is one partition scan of WHERE hour = '2026-04-24 11:00:00'. The partition for the current hour is hot but bounded — one hour of site-wide writes is large but finite. The cost: every tweet is written to both tables on the write path. That is the trade. Two writes to bound read latency; live with it.
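The write path in application code is then two inserts per tweet, with the hour truncated for the second table (a sketch: session.execute stands in for the driver call, and the CQL strings are elided):

```python
from datetime import datetime, timezone

def truncate_to_hour(ts: datetime) -> datetime:
    return ts.replace(minute=0, second=0, microsecond=0)

def write_tweet(session, user_id, tweet_time, tweet_id, text):
    # Two inserts per tweet, one per read pattern. In production these go
    # out as a logged batch or with retries so the two tables converge.
    session.execute("INSERT INTO tweets ...",
                    (user_id, tweet_time, tweet_id, text))
    session.execute("INSERT INTO tweets_by_hour ...",
                    (truncate_to_hour(tweet_time), tweet_time, user_id, tweet_id, text))
```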
Python sketch — partition + clustering storage
A minimal wide-column store with composite partition keys and multi-column clustering:
from sortedcontainers import SortedDict

class WideColumnStore:
    def __init__(self, num_nodes=10):
        self.num_nodes = num_nodes
        # (node_index, partition_key_tuple) -> SortedDict where keys are
        # clustering-key tuples and values are the row's column map
        self.partitions = {}

    def _route(self, pk_tuple):
        # Simplified: real Cassandra uses Murmur3 + vnodes.
        return hash(pk_tuple) % self.num_nodes

    def put(self, pk_tuple, ck_tuple, columns):
        node = self._route(pk_tuple)
        part = self.partitions.setdefault((node, pk_tuple), SortedDict())
        row = part.setdefault(ck_tuple, {})
        row.update(columns)  # sparse-merge semantics

    def get(self, pk_tuple, ck_tuple):
        node = self._route(pk_tuple)
        return self.partitions[(node, pk_tuple)][ck_tuple]

    def range(self, pk_tuple, ck_start, ck_end, limit=None):
        node = self._route(pk_tuple)
        part = self.partitions.get((node, pk_tuple), SortedDict())
        out = []
        for ck in part.irange(ck_start, ck_end):
            out.append((ck, part[ck]))
            if limit and len(out) >= limit:
                break
        return out
Every method mirrors the model. put routes by partition key, finds-or-creates the partition on that node, and inserts the row at its sorted clustering-key position. range does a single-partition clustering-key sweep — the engine never touches a second partition. get is the trivial point-lookup case.
Why the partition key hashes but the clustering key does not: the partition key is the only thing that identifies which node a row lives on, so it must hash to a ring position. Clustering keys identify which slot within a partition — they are compared, not hashed, because the storage must preserve their natural order for range scans to be sequential.
Sensor telemetry with a composite partition key
An industrial plant operates 10000 temperature sensors, each emitting a reading every ten seconds. That is 8640 readings per sensor per day, or about 3.15M per year per sensor. Retention is five years. Per-sensor partitions would grow to ~15M rows — past the comfortable threshold.
The schema uses a composite partition key of (device_id, day):
CREATE TABLE sensor_readings (
    device_id int,
    day date,
    ts timestamp,
    temp float,
    humidity float,
    PRIMARY KEY ((device_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
Each partition is now one device on one day: 8640 rows, ~500 KB compacted, comfortably small. Different days for the same device live on different nodes — (device_42, 2026-04-24) and (device_42, 2026-04-23) hash independently.
Query A — "temperature curve for device 42 on 2026-04-24":
SELECT ts, temp FROM sensor_readings
WHERE device_id = 42 AND day = '2026-04-24';
One partition, one preference list, one sequential scan of 8640 rows. Returns in ~5 ms.
Query B — "last 10 readings for device 42":
SELECT * FROM sensor_readings
WHERE device_id = 42 AND day = '2026-04-24' LIMIT 10;
The client has to know today's date. That is the trade-off of bucketed partition keys: one more column in the WHERE clause for a dramatically smaller partition.
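When "the last N readings" may straddle midnight, the client walks backwards over day buckets until it has enough rows. A sketch with a hypothetical fetch_day function standing in for the per-partition query:

```python
from datetime import date, timedelta

def last_n_readings(fetch_day, device_id, n, today, max_days_back=7):
    # fetch_day(device_id, day, limit) -> newest-first rows for one
    # (device_id, day) partition. The walk is bounded so a long-silent
    # device cannot turn one read into an unbounded scan.
    out = []
    day = today
    for _ in range(max_days_back):
        out.extend(fetch_day(device_id, day, n - len(out)))
        if len(out) >= n:
            break
        day -= timedelta(days=1)
    return out[:n]
```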
Query C — "average temperature across all devices on 2026-04-24":
Not expressible efficiently on this schema. You would need a different table keyed by day alone (rollup) or run aggregation through Spark over all 10000 device partitions for that day. In production, the rollup table is kept up-to-date by a separate stream consumer that ingests the raw readings and updates a per-day summary.
The bucketing column day sits inside the partition key. Each day's readings for one device form a bounded partition (8640 rows, about 500 KB). Different days hash to different nodes, so write load for a single device spreads across the cluster over time rather than piling onto one preference list. The clustering key ts DESC puts the newest row at the start of each partition.
Common confusions
- "Primary key equals partition key." The primary key is the whole tuple including clustering keys. The partition key is just the part inside the inner parens. PRIMARY KEY ((user_id), event_time) has a primary key of (user_id, event_time) but a partition key of only user_id.
- "Clustering keys distribute data across nodes." They do not. Clustering keys only sort rows within an already-chosen partition. Only the partition key decides placement.
- "Compound partition key and compound primary key are the same." They are not. A compound primary key has multiple columns total in the tuple. A compound (or composite) partition key is specifically multiple columns inside the inner parens, hashed together. ((a, b), c) has a composite partition key; ((a), b, c) has a simple partition key and a compound primary key.
- "I can add a secondary index later if queries change." You technically can, but at scale the index either fans out cluster-wide or degrades write throughput. Plan for new queries by creating new tables, not new indexes.
- "ALLOW FILTERING is fine if the data is small." It is fine if the data will never grow. Tables in production always grow. Seeing ALLOW FILTERING in a pull request is seeing a future outage.
- "Clustering order can be changed later." It cannot without rewriting every row. Pick it at table-create time.
Going deeper
Token ranges and token-aware routing
Cassandra's ring spans the full signed 64-bit Murmur3 token range, and each vnode occupies a contiguous token range. The preference list for a partition starts at the vnode whose range contains the key's token and walks clockwise, skipping duplicate physical nodes, until it has collected N distinct nodes. Modern Cassandra drivers are token-aware: the client computes the partition-key hash locally and sends the request directly to a replica, saving one hop compared to routing through a random coordinator. Token-aware routing cuts tail latency by the round-trip time of one extra hop — a significant fraction of the total for fast queries.
ScyllaDB's shard-per-core model
ScyllaDB partitions the token range twice: once across nodes (Cassandra-style) and once across CPU cores within each node. Each core owns a slice of the ring independently — no shared memory, no locks between cores. The driver can route a query not just to the right node but to the right core on that node. Per-node throughput rises 5-10× on equivalent hardware because of reduced contention and cache-line ping-pong. The schema-level partition-key decision is unchanged; the mechanical execution is different.
DynamoDB's partition-plus-sort-key vocabulary
DynamoDB uses slightly different terminology. The partition key is the same concept (called the "hash key" in older docs). The sort key is what Cassandra calls the first clustering key — a single column that sorts rows within a partition. DynamoDB does not support multi-column clustering; if you need a second sort dimension, you synthesise it into the sort-key column (e.g., "2026-04-24#purchase" as a sortable composite string). DynamoDB's Global Secondary Indexes are essentially the "second table for the second access pattern" pattern from Cassandra, managed automatically by the service.
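Synthesising the composite sort key is plain string building, with fixed-width numeric components so lexicographic order matches logical order (the separator and padding here are conventions, not a DynamoDB API):

```python
def make_sort_key(day: str, event_type: str, seq: int) -> str:
    # Zero-pad numbers so "10" does not sort before "9" as a string.
    return f"{day}#{event_type}#{seq:010d}"
```

make_sort_key("2026-04-24", "purchase", 7) yields "2026-04-24#purchase#0000000007", and a begins_with condition on "2026-04-24#" selects one day's rows.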
Bigtable's single-key row
Google's Bigtable uses a single-level row key with no clustering concept. Flexibility comes from column families (grouped per-row columns stored in separate SSTables) rather than from a multi-part key. To get clustering-like behaviour, Bigtable users encode sort information into the row key itself — e.g., "device_42#2026-04-24#11:05:30" — and range-scan over the encoded prefix. Same idea, different syntax.
Where this leads next
This chapter was about the logical model — the shape of the key and what it lets you query. The next chapters drill into implementation:
- Chapter 87 — the SSTable on-disk format: how one partition's rows are physically arranged, how multiple SSTables per partition are merged at read time, how bloom filters make point gets cheap.
- Chapter 88 — query-first schema design: taking an application's access patterns and mechanically deriving the table set, using the Chebotko diagram methodology.
- Chapter 89 — why joins do not exist in wide-column and what to do instead, including denormalisation strategies and the write-fan-out pattern.
- Chapter 90 — materialised views (Cassandra's built-in second-table pattern) and why the community generally still recommends manual duplication.
By the end of Build 11 you will be able to read a CQL schema and know immediately which queries it answers cheaply and which ones it cannot answer at all — before the first row is written.
References
- Apache Software Foundation, CQL Data Definition: Primary Key — the canonical reference on the partition-key / clustering-key syntax, including composite partition keys and clustering-order declarations. The source of truth for what each bracket means.
- DataStax, Basic Rules of Cassandra Data Modeling — the most-cited short guide to choosing partition keys and clustering keys, with the query-first methodology and concrete rules of thumb on partition size and cardinality.
- Amazon Web Services, Best Practices for Designing and Using Partition Keys — the DynamoDB reference on partition-key choice, hot partitions, and write sharding. The vocabulary is DynamoDB-specific but the trade-offs are identical to Cassandra's.
- Google Cloud, Bigtable Schema Design: Row Key Design Patterns — Bigtable's advice on encoding sort information into single-level row keys, including reverse-timestamp, salted prefix, and field-promotion techniques. Useful as a contrast to Cassandra's two-level key.
- Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the clearest pedagogical treatment of partitioning strategies, covering hash partitioning, range partitioning, skewed workloads, and the celebrity problem, with comparisons across Cassandra, DynamoDB, HBase, and MongoDB.
- ScyllaDB, ScyllaDB Architecture: Shard-per-Core — the architectural description of how ScyllaDB extends Cassandra's node-level partitioning down to per-core partitioning using the Seastar framework, with benchmark data showing the resulting per-node throughput improvements.