In short
Relational modelling begins with entities — User, Tweet, Follow, Conversation, Message — which become tables, which get normalised to avoid redundancy, and then you write queries against whatever schema fell out. The schema exists first; queries are a read layer on top.
Wide-column inverts this. You start with the list of queries the application needs, and for each query you design one dedicated table whose primary key matches that query's access pattern exactly. The partition key is whatever the query filters on; the clustering keys are whatever the query sorts by. Data is duplicated across tables freely — denormalisation is the default, not a last resort. A single logical entity like a tweet may appear in tweets_by_user, tweets_by_hashtag, and tweets_by_location, each a separate table with its own primary key and its own copy of the row. Writes fan out across every table that contains the data; one user action produces 3–5 physical inserts. The trade is blunt: more storage, more write work, but every read is a single-partition lookup on a single node with known latency.
This chapter walks through the query-first methodology step by step, contrasts it with relational modelling, builds a complete schema for a messaging application from its query list, and catalogues the anti-patterns that trip up engineers bringing Postgres habits to Cassandra: trying to use one table for multiple access patterns, writing queries that need ALLOW FILTERING, letting partitions grow unbounded, treating denormalisation as a code smell instead of a design pattern. By the end you should be able to read an application's query list and produce a working Cassandra schema on a whiteboard in under ten minutes — and spot a broken schema in someone else's pull request in under one.
A team at a Bangalore fintech startup moves their backend from PostgreSQL to Cassandra. The sales pitch was straightforward: ten times the throughput at a quarter of the latency. They keep the same schema — users, accounts, transactions, line items, joined through foreign keys — port it to CQL, swap the driver, run the smoke tests, push to staging.
Everything is slow. Not a little slow — dramatically slower than Postgres had been on one-tenth the hardware. Point lookups take 40 ms instead of 2 ms. A transaction-history page times out. The nightly reconciliation job that took twenty minutes in Postgres now takes four hours and then crashes.
The instinct is to blame Cassandra. More replicas, bigger heaps, faster disks. None of it helps, because nothing about the hardware is the problem. The problem is that every query the application runs fans out across every node in the cluster, merges results from dozens of partitions, and re-sorts them at the coordinator — because the schema was designed for a relational database and not for the one they deployed. The fix is not to tune Cassandra. The fix is to throw away the schema and design a new one starting from the list of queries the application actually runs.
That methodology has a name: query-first schema design. It is the single skill that distinguishes engineers who can build working wide-column systems from engineers who believe they can and cannot. This chapter teaches it.
The query-first methodology
Query-first is a five-step procedure. Each step is mechanical once the previous one is done. The ingenuity lives in step 1 — enumerating queries completely — and in step 3 — picking the right partition key for each. The rest is typing.
Step 1. List every query the application needs. Every one — reads, writes, rare admin queries, exports, everything. Write them as user-facing sentences, not as SQL: "show user X's conversation list"; "show all messages in conversation Y sorted by time newest-first"; "show user X's unread count per conversation". The quality of the schema is gated on the completeness of this list. A query you forgot to list will either force a schema migration later or get served by ALLOW FILTERING in production, which is a future outage.
Step 2. For each query, identify three things. (a) Which rows are returned — the result shape. (b) What value identifies the access point — the filter key. (c) What the sort order is — which column drives the ordering of the result set. "Show user X's conversations, most recent first" has: result shape = conversation records for that user; filter key = user_id; sort = last_activity DESC.
Step 3. Design one table whose primary key matches that query's access pattern exactly. The partition key is the filter key from step 2(b). The clustering keys are the sort column from step 2(c), followed by whatever tie-breaker columns make the primary key unique. Every non-key attribute the query returns becomes a regular column in that same table. One query, one table, one primary key.
Step 4. Accept that the same data appears in multiple tables. If user.email appears in the output of three queries, it lives in three tables — not because of a bug, because denormalisation is the design. If you find yourself considering a "canonical" table that the others reference by foreign key, stop. There are no foreign keys in Cassandra, and there are no joins.
Step 5. Write the application to update every relevant table on every change. When a user changes their email, the update writes to every table that stores user.email. The write-path is the expensive part of wide-column; reads are the cheap part. That asymmetry — pay at write, save at read — is the entire bargain of denormalisation.
Why this order and not any other: the storage layout of a wide-column system is fixed at table-create time. Partition keys determine which physical nodes hold the data; clustering keys determine the byte layout on disk. Queries that match the layout are milliseconds; queries that do not are impossible in production. Designing tables from the query list is the only way to guarantee alignment. Starting from entities and hoping queries will fit is the failure mode.
Contrast with relational modelling
A relational schema is built entity-first. You identify the things in your domain — users, tweets, messages, conversations, line items — make one table per thing, give each a primary key, add foreign keys between them to represent relationships. You normalise to Third Normal Form or thereabouts: every fact lives in exactly one place, recombined at query time via JOIN. The schema is a description of the domain; the queries are a separate layer that uses it.
This pays off beautifully when queries are unpredictable. A BI analyst at the same fintech wants to know "which users signed up last Tuesday, made a first deposit within seven days, and whose current balance is over ₹50,000" — a query nobody planned for. The relational database composes the answer on the fly from the existing entity tables: users joined to deposits joined to balances, with WHERE filters and a planner that picks indexes and join orders. Ad-hoc is the default.
Wide-column gives up this flexibility and gets scale in return. You do not model entities first; you do not have joins; you do not have ad-hoc queries. You model queries first. One table per query, primary key matched to that query's access pattern, every non-key column denormalised into the row. A tweet is not a row in a tweets table — a tweet is five rows in five different tables (tweets_by_user, tweets_by_hashtag, tweets_by_location, home_timeline, tweets_by_id), each with its own partition key chosen to serve a specific read path. The logical entity exists only in the application's mind; the physical storage is organised by access path.
Patrick McFadin's line from chapter 85 was: "in relational, you model your entities; in wide-column, you model your queries." Now you see what that means operationally. The diagrams are inverted. Instead of an entity-relationship diagram where boxes are tables and arrows are foreign keys, a wide-column design has a Chebotko diagram: queries are boxes on the left, tables are boxes on the right, and arrows map each query to the one table that serves it. The arrows are one-way and many-to-one — many queries can share a table only if they have identical access patterns, which is rare.
Worked example — a messaging app
Concretely. You are building the back-end for a messaging application — direct messages, conversation threads, unread counts. The product team gives you three queries, which you treat as load-bearing.
Query 1. "Show user X's conversation list — all threads they are part of, most recently active first."
Query 2. "Show all messages in conversation Y, ordered by time newest-first, paginated."
Query 3. "Show user X's unread message count per conversation."
Each gets its own table.
For Query 1, the filter key is user_id (the viewer) and the sort is last_activity DESC (most recently active first). The partition key is user_id; the first clustering key is last_activity; conversation_id tie-breaks within the same timestamp:
CREATE TABLE conversations_by_user (
user_id text,
last_activity timestamp,
conversation_id text,
other_members set<text>,
title text,
last_message text,
PRIMARY KEY ((user_id), last_activity, conversation_id)
) WITH CLUSTERING ORDER BY (last_activity DESC, conversation_id ASC);
One partition per user holds every conversation they belong to, sorted newest-activity-first. "Show Priya's conversation list" is a single-partition scan on user_id = 'priya', reading rows in order until the page limit. No merge, no network fan-out, no sort at read time.
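The read itself is one statement. A minimal sketch, with the user ID and page size as illustrative values:
SELECT conversation_id, last_activity, title, last_message
FROM conversations_by_user
WHERE user_id = 'priya'
LIMIT 20;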
For Query 2, the filter is conversation_id and the sort is sent_at DESC:
CREATE TABLE messages_by_conversation (
conversation_id text,
sent_at timestamp,
message_id uuid,
sender_id text,
body text,
media_url text,
PRIMARY KEY ((conversation_id), sent_at, message_id)
) WITH CLUSTERING ORDER BY (sent_at DESC, message_id DESC);
One partition per conversation. "Show messages in conversation c-42" is a partition scan. Tie-broken by message_id so two messages at the same millisecond cannot overwrite one another.
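Pagination follows the same shape: restrict the clustering key to everything older than the last message of the previous page. A minimal sketch with illustrative values (in practice the driver's automatic paging can manage the cursor for you):
SELECT sent_at, message_id, sender_id, body
FROM messages_by_conversation
WHERE conversation_id = 'c-42'
  AND sent_at < '2026-04-24 11:05:00+0000'
LIMIT 50;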
For Query 3, the filter is the pair (user_id, conversation_id) and the result is a single scalar — the unread count. No sort dimension, so no clustering keys:
CREATE TABLE unread_by_user_conversation (
user_id text,
conversation_id text,
unread_count counter,
PRIMARY KEY ((user_id, conversation_id))
);
counter is a Cassandra-specific type that supports atomic += and -= without a read-modify-write cycle. Every (user, conversation) pair is its own partition — exactly one row per partition. Reading a user's full unread list across conversations requires a separate query per conversation, which is fine because the messaging UI already knows the conversation IDs from Query 1.
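Both the increment and the read address the full partition key. A minimal sketch with illustrative IDs:
UPDATE unread_by_user_conversation
SET unread_count = unread_count + 1
WHERE user_id = 'arjun' AND conversation_id = 'c-42';

SELECT unread_count
FROM unread_by_user_conversation
WHERE user_id = 'arjun' AND conversation_id = 'c-42';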
Three queries, three tables, three non-overlapping partition keys. Each table is a purpose-built index into the data. The row representing "the message Priya sent to Arjun at 11:05 on 24 April" lives in messages_by_conversation, has its sent_at mirrored into conversations_by_user rows for both Priya and Arjun, and increments unread_by_user_conversation for Arjun alone. One logical message; three tables touched.
Update fan-out
The write path for "Priya sends a message in conversation with Arjun" looks like this:
- Insert into messages_by_conversation — the message itself.
- Update conversations_by_user for Priya — set last_activity = now for that conversation.
- Update conversations_by_user for Arjun — same.
- Increment unread_by_user_conversation for Arjun (not Priya — she sent it, she has read it).
Four writes for one user action. If the conversation is a group chat with twelve members, the fan-out grows to twenty-four writes: one message insert, twelve conversations_by_user updates, and eleven unread increments (everyone except the sender). Fan-out scales with group size.
This is acceptable because writes in wide-column are cheap and reads are the bottleneck. An LSM-tree write is an append to a memtable plus an append to the commit log — in the tens of microseconds. A read may have to consult multiple SSTables, check bloom filters, merge tombstones, hit the cache. The physics of the storage engine make writes much cheaper than reads, so designs that trade one more write for one guaranteed-fast read are net wins.
Compare with the relational alternative. In Postgres the send-message path is one INSERT into messages, one update of conversations.last_activity, one update per participant's unread_count. On paper that is simpler. But the read path — "show Priya's conversation list with last activity and unread counts" — becomes a three-way join across conversations, participants, messages, with a GROUP BY and LEFT JOIN for the counts. At a hundred conversations per user and a few thousand queries per second, the join cost dominates. Cassandra trades that join cost for write amplification and gets back a flat, predictable read latency that does not degrade with user count.
Why four writes instead of one: in the relational schema, the four pieces of information you updated — the message row, Priya's last-activity, Arjun's last-activity, Arjun's unread count — live in different tables that get recombined via joins at read time. In Cassandra there are no joins, so you have to pre-combine the information at write time. The writes are the join, moved earlier in the pipeline.
Duplicating data is the default — not a bug
A single message in the schema above appears in two tables, and its metadata influences rows in a third. Scale this up — a tweet in a social app, a product view in an e-commerce app, a sensor reading in a telemetry system — and one logical row may appear in five or more tables. Storage cost grows proportionally.
The reaction of engineers coming from relational is almost universal: this feels wasteful. They count the duplicated bytes and flinch. They propose "just one source-of-truth table and derive the others with views". They argue for foreign keys.
None of that works in wide-column, and the instinct itself is the first thing to unlearn. The reasoning is economic, not aesthetic.
Disk is cheap. At 2026 cloud prices, one terabyte of SSD storage costs about $75 per month on AWS, or roughly ₹6,200. A social network with 100 million users, each with an average of 1 KB of profile data, needs 100 GB for one copy. Duplicating that to five tables is 500 GB — about $37 per month for the whole system. The financial argument against denormalisation is essentially zero.
Latency is expensive. The real resource you are managing is p99 read latency — the tail of the distribution that determines how your app feels and how many users churn. A 50-ms tail on a page load is acceptable; a 500-ms tail is a revenue problem. Denormalised reads at single-digit milliseconds are the only way to hold a tight p99 as the dataset grows. Normalised reads with multi-table lookups amplify both the median latency and the variance.
Consistency is application-managed, and that is fine. Yes, the "copies" of a piece of data can briefly disagree during a write — a tweet might appear in tweets_by_user a microsecond before it appears in home_timeline. For the vast majority of applications, this is invisible to the user and irrelevant to the business. Where it matters — financial ledgers, inventory — you use lightweight transactions or a different storage engine entirely.
Denormalisation is not a last resort to be justified in a design review; it is the defining design pattern of the model. A wide-column schema that does not duplicate data heavily is almost certainly wrong.
Bucket partitions to bound size
One pattern falls out of the methodology and bites teams every time if they skip it: partitions must be bounded in size. The community rule of thumb from chapter 86 is 100 MB per partition, soft; a few hundred MB as the practical warning zone; 2 GB as the hard ceiling in Cassandra before read behaviour degrades noticeably.
In the messaging schema above, conversations_by_user has user_id as the partition key. That means every conversation the user has ever been part of lives in one partition. A heavy user of a messaging app in production might accumulate 50,000 conversations over five years. At 500 bytes per row (metadata only — last message snippet, member list, title) that is 25 MB. Fine today, nearing the soft ceiling in a decade, and a clear migration risk.
The fix is bucketing: add a time component to the partition key so the unbounded dimension gets sliced. For conversations_by_user:
CREATE TABLE conversations_by_user (
user_id text,
year_month text, -- e.g. "2026-04"
last_activity timestamp,
conversation_id text,
other_members set<text>,
title text,
last_message text,
PRIMARY KEY ((user_id, year_month), last_activity, conversation_id)
) WITH CLUSTERING ORDER BY (last_activity DESC, conversation_id ASC);
Now the partition key is (user_id, year_month). Each partition holds one user's conversations that had activity in one specific month — April 2026, March 2026, and so on. Partitions are bounded by the number of conversations active in a month, which is much smaller than the cumulative total.
The cost is query breadth. "Show Priya's conversation list" now needs to know the month. For the common case — the UI shows recent conversations — this is fine; the client queries year_month = '2026-04', gets the current month's activity, shows it. For "all conversations ever", the client walks multiple month-buckets: April, March, February, ... This is slower but bounded; the client knows how far back to walk and stops when it hits a page break or an empty month.
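A sketch of that bucket walk against the table above, using the DataStax Python driver; the function names and the six-month cutoff are illustrative, not part of the schema:
from datetime import date

def month_buckets(start: date, months_back: int):
    """Yield 'YYYY-MM' bucket strings, newest month first."""
    year, month = start.year, start.month
    for _ in range(months_back):
        yield f"{year:04d}-{month:02d}"
        month -= 1
        if month == 0:
            year, month = year - 1, 12

def recent_conversations(session, user_id, page_size=20, months_back=6):
    """Walk month buckets newest-first until the page fills or the buckets run out."""
    stmt = session.prepare("""
        SELECT conversation_id, last_activity, title, last_message
        FROM conversations_by_user
        WHERE user_id = ? AND year_month = ?
        LIMIT ?
    """)
    rows = []
    for bucket in month_buckets(date.today(), months_back):
        remaining = page_size - len(rows)
        if remaining <= 0:
            break
        rows.extend(session.execute(stmt, (user_id, bucket, remaining)))
    return rows
Each iteration is still a single-partition read; the loop only decides how many partitions to visit.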
The bucket granularity is a trade-off. Finer buckets (hour, day) bound partitions more tightly at the cost of needing more buckets scanned for a "give me everything" query. Coarser buckets (year) make the everything-query one partition but risk oversized partitions for heavy users. The rule: size each bucket so the busiest user produces a 10–50 MB partition in one bucket, and use as coarse a bucket as satisfies that rule. A 1 Hz sensor uses daily buckets; a 1 kHz sensor uses hourly buckets; a conversation list with ten messages per day uses monthly or yearly buckets.
The "good partition key" checklist
Before shipping a table, check the partition key against four independent properties. Failing any one is a future outage.
High cardinality. At least 10× the number of nodes in the cluster, ideally millions-to-billions of distinct values. The partition key is the unit of distribution; if there are only 200 distinct values (a country_code) the whole table squeezes onto at most 200 replica sets no matter how many nodes you own.
Evenly distributed. High cardinality is necessary but not sufficient. If 80% of your writes target 0.1% of the distinct partition-key values — the celebrity problem, the hot tenant problem, the viral post problem — you get hot partitions even with billions of distinct keys. Sample your production traffic; plot a histogram; look for power-law tails; split the hot keys with an extra bucketing column if the tail is fat.
Bounded size. Each distinct partition-key value's data should fit comfortably under 100 MB compacted. Compute the row size, multiply by the maximum number of rows per value, check the product. If it is unbounded in time (a user accumulates history forever), bucket the partition key with a time component so the growth stops.
Query access pattern aligned. The partition key must match what the dominant query filters on. If the query is "give me X for user 42" the partition key is user_id. If the query is "give me X for device 42 on day D" the partition key is (device_id, day). If the schema's partition key does not appear in the query's WHERE clause, the query does not hit a single partition and falls into the ALLOW FILTERING abyss.
Miss any one of these and you have made the first and most expensive schema mistake. Partition-key changes in production are schema rewrites — you create a new table and migrate every row. For a billion-row table, that is days of dual-writing, backfilling, and cutover, with complex rollback. It is worth twice as much design review as you think.
In the messaging schema, user_id becomes the partition key of conversations_by_user; its sort field last_activity becomes the first clustering key. Queries 2 and 3 are treated the same way. A single user action — "send message" — writes to all three tables in a coordinated fan-out, paying the write cost to keep all three reads at single-partition latency.
Anti-patterns
Five patterns come up in code review again and again. Each is a variant of the same underlying error: trying to treat Cassandra like a relational database.
One table for many queries. The engineer defines a messages table with message_id as the partition key, then writes queries like WHERE sender_id = ? and WHERE conversation_id = ? and WHERE sent_at > ?. None of them match the partition key. All of them either fail outright or need ALLOW FILTERING. The fix is always the same: create one table per query pattern, pay the write-fan-out cost, keep reads single-partition.
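To make the mismatch concrete, here is a sketch against that hypothetical single-table design, keyed only by message_id:
-- Only this query matches the partition key and reads one partition:
SELECT * FROM messages WHERE message_id = 123e4567-e89b-12d3-a456-426614174000;

-- These filter on non-key columns; Cassandra rejects them unless ALLOW FILTERING
-- is appended, which turns each read into a scan of every partition in the cluster:
SELECT * FROM messages WHERE conversation_id = 'c-42';
SELECT * FROM messages WHERE sender_id = 'priya';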
Unbounded partitions. The engineer picks user_id as the partition key for an append-only log, forgets to bucket it, and the table works beautifully for six months. At month seven, the heaviest user's partition crosses 500 MB; reads on that user slow down, then time out. Bucketing after the fact is a schema migration. The fix is to bound partition size at design time, always, with a time component in the partition key whenever the dimension is unbounded in time.
Boolean partition keys. A table keyed by ((is_active)) has exactly two partitions — true and false — no matter how many rows it holds. Every write and read goes to one of two preference lists. The fix is to compose the boolean with a high-cardinality column: ((is_active, user_id)).
Counter tables for high-cardinality dimensions. Cassandra's counter type is atomic and safe for fan-out, but it has schema restrictions — a counter table cannot mix counter columns with non-counter columns, cannot be included in batched multi-table writes atomically, and has its own replication semantics. For a handful of well-scoped counters (per-conversation unread, per-user notification count) it works fine. For "count by arbitrary dimension" rollups, a stream-processing pipeline that materialises into a normal table is usually better.
Non-idempotent updates. Cassandra writes can be retried on timeout — the coordinator or the client reissues the request, so the same logical write can land more than once. If the write is UPDATE balance SET value = value + 100, a retry double-increments. The fix is to make writes idempotent: assign a client-side write ID and record the write with INSERT ... IF NOT EXISTS, with that ID as part of the primary key, or express the write as an absolute set rather than a delta (SET value = 500). For counters specifically, this is hard — Cassandra's internal counter implementation uses per-shard accumulation with its own idempotency handling, but retries on counter writes still carry a small risk of double-count. Design around it when correctness matters.
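A sketch of the write-ID pattern, using a hypothetical ledger-events table rather than the messaging schema; names and values are illustrative:
-- One row per logical write, keyed by a client-generated ID.
CREATE TABLE ledger_events_by_account (
    account_id text,
    write_id uuid,
    delta decimal,
    PRIMARY KEY ((account_id), write_id)
);

-- Safe to retry: a second attempt with the same write_id is reported as not applied
-- instead of being applied twice.
INSERT INTO ledger_events_by_account (account_id, write_id, delta)
VALUES ('acct-42', 7f3a2b10-9c4e-4d6a-8b1f-2a1e5f6d7c8e, 100)
IF NOT EXISTS;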
When Cassandra says no
Even with a perfectly query-first schema, some queries are genuinely unanswerable at OLTP latencies in Cassandra. Recognising these queries before proposing a table for them saves time and heartache.
"Find the user with the most messages." A global aggregation across every partition. No table layout makes this fast — you would need an index that merges across partitions at query time, which is exactly the thing wide-column does not provide. The production answer is a separate batch or streaming job: run Spark over the Cassandra data, emit the top-K list, store it in a third table that the application reads. Update the top-K list every hour or every day, depending on how fresh the answer needs to be.
"Show messages mentioning the keyword 'Diwali'." Full-text search. Cassandra's SASI indexes can do prefix matching on strings, but genuine inverted-index search with ranking, stemming, fuzzy matching, is outside the model. Production systems attach a separate search engine — Elasticsearch, OpenSearch, Solr — that consumes a change stream from Cassandra and builds its own indexes. The search engine answers keyword queries; Cassandra answers partition-key queries; both are fed by the same write path.
"All conversations between users A and B across history." If conversations_by_user is partitioned by user_id, then A's conversations live on one preference list and B's on another. Finding the intersection requires reading both partitions and intersecting on the application side — expensive if either user has many conversations. The better pattern is a dedicated conversation_between table partitioned by a canonicalised pair (min(user_a, user_b), max(user_a, user_b)), so the lookup is a single-partition read. This is the query-first methodology applied: a new query type gets a new table.
Python sketch — a write fan-out helper
Concretely, the write path for "send message" using the DataStax Python driver:
import uuid
from datetime import datetime
from cassandra.query import BatchStatement, BatchType
def send_message(session, sender_id, recipient_id, conv_id, body):
"""Send one message; fan out to three tables in a logged batch."""
msg_id = uuid.uuid4()
now = datetime.utcnow()
batch = BatchStatement(batch_type=BatchType.LOGGED)
batch.add(session.prepare("""
INSERT INTO messages_by_conversation
(conversation_id, sent_at, message_id, sender_id, body)
VALUES (?, ?, ?, ?, ?)
"""), (conv_id, now, msg_id, sender_id, body))
for user_id in (sender_id, recipient_id):
batch.add(session.prepare("""
INSERT INTO conversations_by_user
(user_id, last_activity, conversation_id, last_message)
VALUES (?, ?, ?, ?)
"""), (user_id, now, conv_id, body[:80]))
session.execute(batch)
# Counter updates cannot share a batch with non-counter statements.
session.execute(session.prepare("""
UPDATE unread_by_user_conversation
SET unread_count = unread_count + 1
WHERE user_id = ? AND conversation_id = ?
"""), (recipient_id, conv_id))
return msg_id
Four statements total, three inside a LOGGED batch (which gives Cassandra's "eventually all-or-nothing" semantics — the coordinator persists the batch log and retries each statement until each succeeds), and one counter update outside the batch because Cassandra forbids mixing counter and non-counter writes in a single batch.
Why LOGGED and not UNLOGGED: an unlogged batch is just a convenience wrapper around independent writes — if the coordinator dies, some may land and others may not. A logged batch persists the batch itself to Cassandra's batch-log table first, then replays it until every target succeeds; this guarantees atomicity across partitions at the cost of extra round-trips. For the messaging use case, partial failure (the message lands but one conversation-list row does not) produces a visible inconsistency, so the stronger guarantee is worth the latency.
Mini-schema for a small social app
Five queries drive the feed and profile pages of a small social app. The schema that serves them is five tables, one per query.
Q1. User's own timeline — tweets they have posted.
Q2. A given tweet's full content, by tweet ID.
Q3. User profile lookup, by user ID.
Q4. Who does user X follow?
Q5. Who follows user X?
-- Q1: one user's own tweets, newest first
CREATE TABLE tweets_by_author (
author_id text, tweet_time timestamp, tweet_id uuid,
text text, media_url text,
PRIMARY KEY ((author_id), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC);
-- Q2: one tweet by its ID (permalinks, reply loading)
CREATE TABLE tweets_by_id (
tweet_id uuid, author_id text, tweet_time timestamp,
text text, media_url text, reply_to uuid,
PRIMARY KEY ((tweet_id))
);
-- Q3: user profile
CREATE TABLE users_by_id (
user_id text, display_name text, bio text,
avatar_url text, joined_at timestamp,
PRIMARY KEY ((user_id))
);
-- Q4: who does this user follow?
CREATE TABLE following_by_user (
user_id text, followed_at timestamp, followee_id text,
followee_name text,
PRIMARY KEY ((user_id), followed_at, followee_id)
) WITH CLUSTERING ORDER BY (followed_at DESC);
-- Q5: who follows this user?
CREATE TABLE followers_by_user (
user_id text, followed_at timestamp, follower_id text,
follower_name text,
PRIMARY KEY ((user_id), followed_at, follower_id)
) WITH CLUSTERING ORDER BY (followed_at DESC);
followee_name in following_by_user and follower_name in followers_by_user are denormalised copies of users_by_id.display_name — stored inline so "show who I follow with their names" is a single partition scan with no cross-table lookup. The price is that a display-name change has to be propagated to every row that mentions the user.
A follow action ("Arjun follows Priya") produces two writes: one row in following_by_user for Arjun (adding Priya to his follow list) and one row in followers_by_user for Priya (adding Arjun to her follower list).
A profile edit (Priya changes her display name) produces one update to users_by_id plus a background job that walks followers_by_user WHERE user_id = 'priya' and following_by_user WHERE followee_id = 'priya' and rewrites the denormalised follower_name/followee_name in every row. For a typical user with a few hundred followers, this completes in seconds. For a celebrity with millions, it takes minutes to hours — and the UI shows a stale display name in the interim. This staleness is acceptable in most applications; where it is not, you design the user experience to avoid relying on the denormalised field.
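A minimal sketch of the follow action's two writes; the IDs and the shared timestamp are illustrative (in practice the application generates one timestamp and binds it to both statements):
INSERT INTO following_by_user (user_id, followed_at, followee_id, followee_name)
VALUES ('arjun', '2026-04-24 11:05:00+0000', 'priya', 'Priya');

INSERT INTO followers_by_user (user_id, followed_at, follower_id, follower_name)
VALUES ('priya', '2026-04-24 11:05:00+0000', 'arjun', 'Arjun');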
Common confusions
- "Query-first is cheating." It is the design pattern for this storage model. "Cheating" would be reusing relational idioms that have no implementation in Cassandra and pretending they work.
- "Relational modelling doesn't apply." It applies perfectly well — to relational databases. Postgres and MySQL are excellent tools, and their entity-first methodology is correct for them. Wide-column is a different storage model with different tradeoffs; using its methodology for it is not a repudiation of relational, just the right tool for the right job.
- "Denormalisation is a last resort." In OLTP relational systems it often is — a pragmatic concession to performance after profiling. In wide-column it is the first resort. The schema is denormalised from line one.
- "I can refactor later." You cannot. Migrating a denormalised wide-column schema is punishing — every table has to be rewritten, every write path has to dual-write to old and new tables during the migration, every reader has to tolerate inconsistency between the halves for the duration. A simple schema change in Postgres —
ALTER TABLE ADD COLUMN— is seconds of metadata work. The equivalent in Cassandra, when it involves changing partition keys, is weeks of migration work. Design correctly up front. - "A view or materialised view saves the duplication." Cassandra has materialised views, but the community consensus (and the Cassandra 4.x documentation) is to avoid them in production for anything nontrivial — they introduce consistency edge cases that the manual dual-write pattern does not. Manual fan-out is verbose but predictable.
Going deeper
KDM and the Chebotko methodology
The query-first methodology was formalised for Cassandra by Artem Chebotko and colleagues (then at DataStax) in a 2015 paper, "A Big Data Modeling Methodology for Apache Cassandra" (Chebotko, Kashlev, Lu). The paper introduces a two-stage workflow: first build a conceptual model as a standard entity-relationship diagram, then transform it into a logical model driven by the application's query set, using a specific notation — the Chebotko diagram — that maps each query to exactly one table. The paper's contribution is the formal rules for this transformation: how to derive partition keys from query predicates, how to derive clustering keys from query orderings, how to detect when one table can satisfy multiple queries. DataStax Academy's training materials are built on it.
Kashlev Data Modeling
Andrey Kashlev, a co-author of the paper, developed a visual tool — the Kashlev Data Modeler — that implements the methodology step by step: you input your entity-relationship diagram and your query list, and the tool generates the Cassandra table set. The tool is now retired, but the methodology has propagated into DataStax's newer tooling and the official Cassandra data-modelling documentation. The paper remains the clearest short introduction to the approach.
Jonathan Ellis's early Cassandra patterns
Jonathan Ellis, co-founder of DataStax and long-time chair of the Apache Cassandra project, gave a series of talks between 2012 and 2016 that codified the patterns this chapter describes — query-driven modelling, denormalisation as the default, the specific anti-patterns with names. His 2014 talk "Cassandra Data Modeling — Basics and Advanced" is still the single best hour of video on the subject and is worth watching before any wide-column project.
Where this leads next
Query-first design is the mindset; the next few chapters fill in the specific techniques it relies on.
- Chapter 89 — why joins do not exist and what you do instead, in depth: the write-fan-out pattern, denormalisation strategies, consistency windows during partial-update scenarios.
- Chapter 90 — CQL in detail: the full grammar of PRIMARY KEY, the difference between INSERT and UPDATE semantics in Cassandra (they do the same thing — both upsert), collection types (set, list, map), USING TTL, prepared statements, batches.
- Chapter 91 — sharding and secondary indexes: when Cassandra's built-in index features do and do not work, how partition tokens are distributed across nodes, how token-aware drivers route queries.
By the end of Build 11 you will be able to take a product spec for a realistic application — a ride-hail dispatcher, a payment ledger, a user-generated content platform — and produce a complete wide-column schema and write-path sketch from the query list, before writing any code.
References
- Apache Software Foundation, Apache Cassandra Data Modeling Documentation — the official reference on query-first schema design, partition-size rules, and the standard anti-patterns. The canonical starting point for anyone designing a production Cassandra schema.
- DataStax Academy, Data Modeling Course (DS220) — the most-cited formal training on query-first Cassandra modelling, with worked examples across social, IoT, messaging, and e-commerce workloads, and concrete rules of thumb on partition sizing and bucketing.
- Chebotko, Kashlev, and Lu, A Big Data Modeling Methodology for Apache Cassandra, IEEE Big Data Congress 2015 — the academic formalisation of query-first design for Cassandra, including the Chebotko diagram notation and rules for deriving partition and clustering keys from query predicates.
- Ellis, Cassandra Data Modeling Patterns, DataStax 2014 — Jonathan Ellis's original talk and white-paper codifying the write-fan-out pattern, denormalisation-as-default, and the named anti-patterns this chapter draws on.
- Kleppmann, Designing Data-Intensive Applications, Chapter 3 — Storage and Retrieval, O'Reilly 2017 — the clearest pedagogical treatment of the storage models that motivate query-first design, with worked comparisons across LSM-tree and B-tree engines.
- Carpenter and Hewitt, Cassandra: The Definitive Guide, 3rd Edition, O'Reilly 2020 — the long-form reference on Cassandra internals and data modelling, including a detailed chapter on Chebotko-methodology schema design with runnable CQL examples.