In short
CQL is the query language for Cassandra and its compatible relatives — ScyllaDB, DataStax Enterprise, Azure Cosmos DB's Cassandra API. It was designed to look like SQL because SQL is what working engineers already know: the same SELECT, INSERT, UPDATE, DELETE, WHERE, CREATE TABLE keywords, the same punctuation. A developer sitting down to Cassandra for the first time can write valid CQL in the first minute.
Under the skin, it is not a relational query language. It cannot express JOIN, subqueries, OR across columns, arbitrary WHERE predicates, GROUP BY across partitions, or COUNT(*) over an unbounded range. What it can do efficiently is the narrow set of operations the wide-column storage model supports: single-partition reads with equality on the partition key, range scans on the leading clustering columns, batch updates within one partition, compare-and-set via lightweight transactions, counter increments, and per-row TTL expiry.
The SQL-shape is a familiarity aid; the semantic model is still the query-first wide-column discipline from Chapters 86–89. Every CQL query you run must have been anticipated when the table was created — the partition key must match the filter, the clustering keys must match the sort. When the query shape doesn't match the table shape, CQL either rejects the query or demands ALLOW FILTERING, which is the wire-level confession that you are about to scan every partition in the cluster. This chapter walks through the CQL surface one keyword at a time, points at each divergence from SQL, and gives you the vocabulary to read a schema and know exactly which queries it will and will not serve.
A developer joins a team running Cassandra. They read the schema — users, tweets, followers — and conclude that it looks like a normal database. They open a CQL shell and type what they have typed a thousand times before:
SELECT * FROM users WHERE email = 'priya@example.com';
Cassandra responds:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW FILTERING"
The query is syntactically valid SQL and syntactically valid CQL. It is semantically rejected because email is not part of the primary key of users, and no physical layout can answer the question without scanning every node. The surprise is structural: a query that looks like SQL is not necessarily a query Cassandra can run. CQL's grammar overlaps with SQL's in surface, but the set of queries it accepts is a smaller set — determined entirely by the partition-key and clustering-key structure of the tables you created. This chapter unpicks the relationship.
What CQL has
CQL exposes enough SQL-like surface to feel immediately familiar. The surface you actually use day to day:
- Data manipulation — SELECT, INSERT, UPDATE, DELETE. One quirk that matters later: INSERT and UPDATE are semantically identical. Both upsert; there is no "row exists" check by default, and both create the row if it is missing. The distinction is syntactic sugar.
- Data definition — CREATE KEYSPACE, CREATE TABLE, CREATE INDEX, CREATE MATERIALIZED VIEW, CREATE TYPE (user-defined types), CREATE FUNCTION, CREATE AGGREGATE, plus the usual ALTER and DROP. Schema changes propagate over gossip and take seconds to a minute to reach every node.
- Data types — primitives (int, bigint, varint, decimal, float, double, boolean, text, blob, timestamp, date, uuid, timeuuid, inet); collections (list<T>, set<T>, map<K, V>); the frozen<T> modifier that stores a collection as a single opaque blob; user-defined types; and counter, which has its own semantics.
- WHERE operators — =, IN, range operators (<, <=, >, >=) on clustering columns, CONTAINS and CONTAINS KEY for collections, LIKE on SASI-indexed text. No OR, no !=, no subquery predicates.
- Lightweight transactions — IF clauses on INSERT, UPDATE, DELETE that give atomic compare-and-set within a single partition, implemented via Paxos.
- Per-row lifetime — USING TTL <seconds> expires the row automatically; USING TIMESTAMP <microseconds> overrides the write timestamp used for last-write-wins conflict resolution.
- Batches — BEGIN BATCH ... APPLY BATCH. LOGGED batches provide atomicity across partitions; UNLOGGED batches only group statements over the wire for network efficiency.
- Prepared statements — compile once with ? placeholders, execute many times. Cached on the coordinator. The standard production pattern.
- Paging — result sets larger than a few megabytes are transparently paged; the driver fetches pages on demand.
That is substantial. You can build most of an OLTP application with only this surface. What you cannot do is the part of SQL that assumes a single-node engine with global visibility — and that is most of the rest.
What CQL doesn't have
The omissions are deliberate. Each missing feature corresponds to an operation that the distributed storage layer cannot cheaply support.
- No JOIN. Not in any form. As Chapter 89 explained at length, joining two partitioned tables requires a broadcast or a shuffle, and Cassandra refuses to pretend otherwise. If you need the join, you denormalise or you do an application-side join.
- No subqueries. A CQL SELECT cannot appear inside another CQL query. No WHERE id IN (SELECT ...), no EXISTS, no scalar subqueries in the projection.
- No OR in WHERE. WHERE user_id = 42 OR user_id = 43 is not valid. The nearest equivalent is IN (42, 43) on a single column. Disjunctions across different columns require two queries and an application-side union.
- No arbitrary WHERE predicates. You can filter by = on partition-key columns, by =, IN, or range operators on clustering columns (subject to the prefix rule from Chapter 86), and by = on indexed columns. Nothing else without ALLOW FILTERING. No length(text) > 140, no compound predicates comparing columns to each other, no NOT.
- No GROUP BY across partitions. Cassandra 3.10 added GROUP BY, but only within a single partition. Cross-partition aggregations are Spark's job.
- No aggregation across partitions. COUNT(*) on a table is technically valid but runs a full cluster-wide scan and takes minutes on any sizeable table. Maintain a counter or materialise counts via a stream job instead.
- No cross-table constraints. No foreign keys, no REFERENCES, no CASCADE. Referential integrity is the application's job.
- No stored procedures. User-defined functions and aggregates exist but operate over a single query's result set, not over control flow. No PL/pgSQL equivalent.
- No transactions across partitions. Lightweight transactions are single-partition only. Atomically transferring money between two accounts on different nodes requires application-level two-phase commit or a different engine.
Why this list is not a feature request: each missing capability would require the coordinator to see rows from multiple partitions at once, breaking the shared-nothing invariant the cluster is built on. A database that "just" added joins would re-introduce the coordinator bottleneck it was designed to avoid. The omissions are what make linear scaling work; including them would change the scaling story entirely.
The WHERE clause — strict rules
The single most important thing to understand about CQL is exactly what the WHERE clause will accept. The rules follow directly from the on-disk layout described in Chapter 86.
Partition-key columns must use = or IN. No range operators. WHERE user_id > 100 is rejected — the partition key is hashed, and ranges over a hash are meaningless. The only exception is the internal TOKEN() function used by drivers for token-range scans, not a pattern for application code.
IN on the partition key is supported but expensive at scale: IN (1, 2, ..., 1000) translates into 1000 separate partition reads fanned out from the coordinator. A few dozen values is fine; a few hundred is borderline; thousands cause coordinator pressure. Most drivers recommend issuing N parallel queries instead.
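The fan-out alternative can be sketched in plain Python. Here fetch_partition is a hypothetical stand-in for a real single-partition read (SELECT ... WHERE user_id = ?); the production version would use the driver's async API rather than threads:

```python
# Sketch of the "N parallel queries" recommendation: issue one independent
# single-partition read per key and merge client-side, instead of handing the
# coordinator a giant IN list to fan out itself.
from concurrent.futures import ThreadPoolExecutor

def fetch_partition(user_id):
    # Stand-in for one coordinator-routed partition read.
    return [{"user_id": user_id, "tweet_id": user_id * 10}]

def fetch_many(user_ids):
    # Total latency ~ the slowest single read; no one coordinator
    # carries the whole key list.
    with ThreadPoolExecutor(max_workers=8) as pool:
        partitions = list(pool.map(fetch_partition, user_ids))
    return [row for partition in partitions for row in partition]

rows = fetch_many([1, 2, 3])
```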
Clustering-key columns can use =, IN, or range operators, but only in prefix order. Given PRIMARY KEY ((a), b, c, d), the legal shapes:
- WHERE a = ? AND b = ? — equality prefix of length 1 on the clustering columns.
- WHERE a = ? AND b = ? AND c = ? AND d > ? — prefix of length 2 plus a range on the next column.
- WHERE a = ? AND b > ? — range on the first clustering column, nothing after.
Illegal: WHERE a = ? AND c = ? (skips b), WHERE a = ? AND b > ? AND c > ? (two ranges). The prefix rule falls out of the layout directly — rows are sorted by (b, c, d) within each partition, and only prefix-plus-leading-range filters correspond to contiguous byte runs on disk.
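The prefix rule is mechanical enough to express as a few lines of Python — a sketch of the legal shapes above, assuming clustering columns are listed in declared order and each restriction is either equality or a range; this is not the server's actual validation code:

```python
# Legal clustering filters: a run of leading equalities, optionally followed
# by ONE range, with nothing restricted after it.
def is_legal_clustering_filter(clustering_cols, restrictions):
    """restrictions: dict of column -> '=' or 'range'."""
    i = 0
    # 1. Consume the run of leading equalities.
    while i < len(clustering_cols) and restrictions.get(clustering_cols[i]) == "=":
        i += 1
    # 2. Allow at most one range on the next column.
    if i < len(clustering_cols) and restrictions.get(clustering_cols[i]) == "range":
        i += 1
    # 3. Nothing may be restricted past that point.
    return all(c not in restrictions for c in clustering_cols[i:])

cols = ["b", "c", "d"]                        # from PRIMARY KEY ((a), b, c, d)
is_legal_clustering_filter(cols, {"b": "="})                          # legal
is_legal_clustering_filter(cols, {"b": "=", "c": "=", "d": "range"})  # legal
is_legal_clustering_filter(cols, {"c": "="})                          # illegal: skips b
is_legal_clustering_filter(cols, {"b": "range", "c": "range"})        # illegal: two ranges
```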
Non-PK, non-indexed columns cannot be filtered without ALLOW FILTERING, which reads every partition the query can reach and applies the filter post-hoc. Without a partition-key restriction, that is every partition on every node — a full table scan.
ALLOW FILTERING is the single most dangerous phrase in CQL. Production incident reviews at Cassandra-using companies regularly feature a filter that slipped into OLTP code and took down the cluster when data volumes grew past the threshold. Treat it like rm -rf in a shell script: fine in tools you control, never in committed application code.
Why indexed columns are a partial workaround and not a solution: a secondary index is a per-node inverted table from indexed-value to local partition keys. A query using the index fans out to every node; each node consults its local index; the coordinator merges. For high-cardinality matches the fan-out is wasted on 99% of nodes with no match. For low-cardinality matches it is roughly a full scan. Indexes are useful for admin paths, not substitutes for well-designed tables.
Python/CQL examples — what is cheap and what isn't
Concretely, given the schema from Chapter 86:
CREATE TABLE tweets (
user_id bigint,
tweet_time timestamp,
tweet_id bigint,
text text,
PRIMARY KEY ((user_id), tweet_time, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_time DESC, tweet_id DESC);
OK — single-partition read with clustering-key range.
SELECT * FROM tweets
WHERE user_id = 42
AND tweet_time > '2026-01-01'
LIMIT 100;
Partition key equality, clustering-key range on the leading clustering column. One partition, one node, a contiguous byte-run read. Typical latency under 5 ms.
OK — multi-partition read via IN.
SELECT * FROM tweets
WHERE user_id IN (1, 2, 3);
Three separate partition reads in parallel, merged at the coordinator. Each one independently fast; total latency roughly the slowest of the three. Fine for small N.
OK — clustering-key prefix with tie-breaker equality.
SELECT * FROM tweets
WHERE user_id = 42
AND tweet_time = '2026-04-24 11:05:30'
AND tweet_id > 999;
Equality on the leading clustering column plus range on the next. Legal prefix shape.
NOT OK — non-indexed filter.
SELECT * FROM tweets WHERE text LIKE '%cricket%';
text is neither the partition key nor a clustering key nor an indexed column. Rejected. Adding ALLOW FILTERING would run a full cluster scan; the production answer is a separate search engine (Elasticsearch, OpenSearch) fed by a change stream.
NOT OK — OR on the partition key.
SELECT * FROM tweets WHERE user_id = 42 OR user_id = 43;
No OR. Use WHERE user_id IN (42, 43).
NOT OK — range on partition key.
SELECT * FROM tweets WHERE user_id > 100;
Partition keys are hashed, so a range over them is meaningless. The server rejects the query outright.
NOT OK — skipping a clustering column.
SELECT * FROM tweets WHERE user_id = 42 AND tweet_id > 999;
Skips tweet_time, which is the first clustering column. Not a prefix; rejected.
Counter columns
Cassandra supports a special column type — counter — that can be atomically incremented or decremented without a read-modify-write cycle.
CREATE TABLE post_likes (
post_id bigint PRIMARY KEY,
likes counter
);
UPDATE post_likes SET likes = likes + 1 WHERE post_id = 42;
The UPDATE ... SET counter = counter + delta form is the only way to modify a counter. You cannot INSERT a counter, cannot set an absolute value, cannot decrement-then-check atomically.
Counter tables have a hard schema restriction: a counter column cannot share a table with a non-counter non-primary-key column. So you either have a table of counters or a table of regular columns — no mixing, no retrofitting.
Under the hood, counters use a commutative replicated data type (CRDT) — each replica accumulates its local deltas, and reads merge per-replica deltas into a total. This lets increments succeed under network partitions without coordination. The trade-off: counters are not strictly consistent under all failure modes — retries can double-count in rare scenarios because the client cannot tell whether the first attempt reached some replicas before the timeout. For correctness-critical counts (financial balances, inventory), counters are the wrong tool — use a regular column with lightweight transactions, or a different engine entirely.
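The merge behaviour can be sketched as a G-counter, the textbook commutative counter CRDT — a simplification, since Cassandra's actual counter cells carry per-replica clocks and support decrements, but the convergence idea is the same:

```python
# G-counter sketch: one cell per replica, increments touch only the local
# cell, reads sum all cells, merges take the per-cell max. Merge is
# commutative and idempotent, which is what lets increments survive partitions.
from collections import defaultdict

class GCounter:
    def __init__(self):
        self.cells = defaultdict(int)  # replica_id -> locally accumulated delta

    def increment(self, replica_id, delta=1):
        self.cells[replica_id] += delta  # no read-modify-write of the total

    def merge(self, other):
        # Safe to exchange in any order, any number of times.
        for rid, v in other.cells.items():
            self.cells[rid] = max(self.cells[rid], v)

    def value(self):
        return sum(self.cells.values())

a, b = GCounter(), GCounter()
a.increment("replica-1"); a.increment("replica-1")  # two likes via replica 1
b.increment("replica-2", 3)                         # three likes via replica 2
a.merge(b); b.merge(a)                              # replicas converge
```

The double-count hazard in the text lives outside this model: a client retry is a second increment call, and nothing in the CRDT can tell it apart from a genuinely new like.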
Lightweight transactions (IF clauses)
CQL offers conditional updates via IF — what Cassandra calls lightweight transactions (LWT). They give atomic compare-and-set within a single partition.
INSERT INTO users (user_id, email)
VALUES (42, 'priya@example.com') IF NOT EXISTS;
UPDATE accounts
SET balance = 500
WHERE account_id = 42 IF balance = 1000;
The first inserts only if the row is absent; the second changes the balance only if it matches. Both are useful — the first for uniqueness checks (username registration), the second for optimistic concurrency control.
Under the hood, LWT runs Paxos: four phases (prepare, read, propose, commit), each a round trip. An LWT costs roughly 4× the latency of a normal write. A 2 ms write becomes an 8 ms LWT. LWTs also serialise on the partition — only one Paxos round in flight per partition — so contention amplifies the effective latency under concurrent attempts.
Production rule: use LWT only where the compare-and-set is genuinely needed, on paths that are not in the hot loop. Username uniqueness at registration is a good fit. A per-request counter is not. LWT is also strictly single-partition; cross-partition atomicity requires logged batches or a different engine.
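The compare-and-set semantics (though not the Paxos machinery) can be modelled with a toy in-memory store, where a per-partition lock stands in for the one-round-in-flight-per-partition serialisation described above:

```python
# Toy model of an LWT's observable behaviour: check-then-write is atomic per
# partition, and a failed condition reports [applied]=False plus the current
# value, the way a real LWT result row does. Nothing here models the four
# network phases or their 4x latency cost.
import threading

class PartitionStore:
    def __init__(self):
        self.rows = {}
        self.locks = {}

    def update_if(self, pk, column, new_value, expected):
        """UPDATE ... SET column = new_value WHERE pk = ? IF column = expected"""
        lock = self.locks.setdefault(pk, threading.Lock())
        with lock:
            row = self.rows.setdefault(pk, {})
            if row.get(column) != expected:
                return (False, row.get(column))  # condition failed; report current
            row[column] = new_value
            return (True, new_value)

store = PartitionStore()
store.rows[42] = {"balance": 1000}
applied = store.update_if(42, "balance", 500, expected=1000)  # condition holds
stale = store.update_if(42, "balance", 100, expected=1000)    # condition now stale
```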
TTL and timestamp control
USING TTL expires rows automatically. The TTL is specified in seconds at write time, and the row becomes invisible to reads after that many seconds.
INSERT INTO sessions (session_id, user_id, created_at)
VALUES (uuid(), 42, toTimestamp(now()))
USING TTL 3600;
The row is tombstoned at TTL expiry; physical deletion happens during the next compaction. Reads silently skip expired rows. TTL is how Cassandra handles session stores, cache entries, event logs with retention, and rate-limit windows — any time-bounded state. The TTL is per-row and can be changed on rewrite.
USING TIMESTAMP overrides the write timestamp used for last-write-wins conflict resolution. Useful when importing historical data (so subsequent real writes with current timestamps still win), and for idempotent retries (a deterministic timestamp per logical write prevents stale retries from clobbering later legitimate writes).
Why per-row TTL is a first-class feature and not application cleanup: keeping expiry in the storage engine means the read path skips expired rows without a coordinator-level scan, and compaction reclaims space. The alternative — periodically scanning to delete expired rows — is exactly the cluster-wide scan Cassandra is built to avoid. TTL pushes expiry logic down to where it costs O(1) per read.
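The O(1)-per-read claim is worth seeing concretely. A toy model, assuming the engine stores a write time and TTL alongside each row — not Cassandra's actual cell format:

```python
# TTL visibility as a read-path filter: no background scan ever looks for
# expired rows; each read checks written_at + ttl against now, and compaction
# reclaims the space later.
import time

def write(table, pk, columns, ttl=None, now=None):
    table[pk] = {"columns": columns,
                 "written_at": now if now is not None else time.time(),
                 "ttl": ttl}

def read(table, pk, now=None):
    now = now if now is not None else time.time()
    row = table.get(pk)
    if row is None:
        return None
    if row["ttl"] is not None and now >= row["written_at"] + row["ttl"]:
        return None  # expired — skipped at read time, deleted at compaction
    return row["columns"]

sessions = {}
write(sessions, "s-1", {"user_id": 42}, ttl=3600, now=1000.0)
live = read(sessions, "s-1", now=2000.0)     # within TTL — visible
expired = read(sessions, "s-1", now=5000.0)  # past written_at + ttl — gone
```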
BATCH statements
BATCH groups multiple statements. Two flavours with very different semantics:
- BEGIN BATCH ... APPLY BATCH (logged) — atomic across partitions. The coordinator writes the batch to a replicated batch log first, then executes each statement. If a statement fails, the coordinator retries. If the coordinator dies, another node replays from the log. "Eventually all-or-nothing" — every statement eventually succeeds, though a reader between replays might see a partial state briefly.
- BEGIN UNLOGGED BATCH ... APPLY BATCH — just a network-efficient wrapper. Not atomic. Makes sense only when all statements target the same partition, in which case the replica executes them atomically at a bandwidth saving.
Rule: logged for cross-partition atomicity, unlogged only within one partition for performance. The common mistake — using logged batches for random bags of writes on the assumption that "atomic is good" — imposes heavy coordinator overhead for no benefit when the writes span partitions anyway.
Typical pattern from Chapter 88's messaging schema — "send message" touches three tables on different partitions and uses a logged batch:
BEGIN BATCH
INSERT INTO messages_by_conversation (conversation_id, sent_at, message_id, sender_id, body)
VALUES ('c-42', '2026-04-24 11:05:30', uuid(), 'priya', 'hi');
INSERT INTO conversations_by_user (user_id, last_activity, conversation_id)
VALUES ('priya', '2026-04-24 11:05:30', 'c-42');
INSERT INTO conversations_by_user (user_id, last_activity, conversation_id)
VALUES ('arjun', '2026-04-24 11:05:30', 'c-42');
APPLY BATCH;
The counter update for Arjun's unread count goes in a separate statement — counters cannot be mixed with non-counter writes in one batch.
Python usage with the DataStax driver
Real CQL is almost always called from a driver, not from a shell. The DataStax Python driver is the standard on the Python side.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel
from datetime import datetime
cluster = Cluster(['cassandra-1.example.com', 'cassandra-2.example.com'])
session = cluster.connect('social_app')
# Prepared statement — compiled once, reused many times.
insert_tweet = session.prepare("""
INSERT INTO tweets_by_user (user_id, tweet_time, tweet_id, text)
VALUES (?, ?, ?, ?)
""")
insert_tweet.consistency_level = ConsistencyLevel.QUORUM
session.execute(insert_tweet, (42, datetime(2026, 4, 24, 11, 5, 30), 999, 'hi'))
# Query with paging — result sets larger than fetch_size are paged transparently.
stmt = SimpleStatement(
"SELECT * FROM tweets_by_user WHERE user_id = %s",
fetch_size=100,
)
for row in session.execute(stmt, (42,)):
print(row.tweet_id, row.text)
cluster.shutdown()
Three patterns are worth noting. Prepared statements (session.prepare(...) returning a reusable object) are compiled and cached on the coordinator — subsequent executions skip the parse step and are faster and safer than string-interpolated queries. Consistency levels (QUORUM, ONE, ALL, LOCAL_QUORUM for multi-DC setups) control how many replicas must acknowledge before the driver considers the operation successful — a per-query knob on the CAP-theorem trade-off. Paging (fetch_size=100) is automatic for large result sets; the driver fetches the first page, the application iterates, the driver requests the next page in the background.
The driver also supports async futures (session.execute_async(...) returns a future you can wait on or attach callbacks to) for high-concurrency workloads, and token-aware routing (the driver hashes the partition key locally and sends the request directly to a replica, skipping the coordinator hop). Token-aware routing is on by default in modern versions of the driver, and both features are part of the reason Cassandra's end-to-end latency is as low as it is.
Schema-first or schema-last?
Cassandra sits between strict-schema Postgres and schemaless MongoDB, and the exact position is worth naming.
Cassandra requires CREATE TABLE before any insert. Writing to a non-existent table, or inserting a column not declared in the schema, is a hard error. In that sense Cassandra is schema-first.
But each row is a sparse map of columns. Two rows in the same table can have completely different sets of non-PK columns populated. Unwritten columns are genuinely absent from disk — not stored as NULL. Storage is proportional to the columns you write, not to the schema's total column count.
Adding columns is cheap. ALTER TABLE tweets ADD country_code text is metadata-only — no existing rows are rewritten. Reading an old row returns NULL for the new column; writing it on new rows is seamless.
The philosophy: the schema defines the structure — primary key and column types — not which columns every row must have. Pragmatic middle ground: enough schema to catch typos and enforce primary-key discipline, sparse enough to make evolution and per-row variability easy.
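That middle ground can be sketched as a dict-backed toy — a declared column set that rejects typos, over sparse rows that store only what was written. Column names here are illustrative:

```python
# Schema-first + sparse rows: the schema says which columns MAY exist; each
# stored row physically contains only the columns written to it.
schema = {"user_id", "email", "name", "country_code"}

def insert(table, pk, columns):
    unknown = set(columns) - schema
    if unknown:
        # Schema-first: undeclared columns are a hard error, as in Cassandra.
        raise ValueError(f"undeclared columns: {unknown}")
    table[pk] = dict(columns)  # sparse: only written columns occupy storage

users = {}
insert(users, 42, {"user_id": 42, "email": "priya@example.com"})
insert(users, 43, {"user_id": 43, "name": "Arjun", "country_code": "IN"})
# Rows 42 and 43 have disjoint non-PK columns; the absent columns are not
# stored NULLs — they simply do not exist on disk.
```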
SQL vs CQL for the same question
Suppose you want "the top 10 users by total purchases in 2026". In SQL against a normalised schema:
SELECT u.name, SUM(o.amount) AS total
FROM users u JOIN orders o ON u.id = o.user_id
WHERE o.year = 2026
GROUP BY u.name ORDER BY total DESC LIMIT 10;
Postgres joins users and orders, groups, sums, sorts, limits. On a hundred-million-row orders table this takes tens of seconds on a single node — but it runs.
CQL cannot express any of it. No JOIN, no GROUP BY across partitions, no ORDER BY on an aggregate. The query, as written, is unavailable.
The production answer — following Chapter 88's query-first methodology — is a stream processor (Flink or Kafka Streams) that consumes order events, maintains a rolling top-K in its own state, and periodically writes the answer to a simple table:
CREATE TABLE top_users_by_year (
year int,
rank int,
user_id bigint,
user_name text,
total bigint,
PRIMARY KEY ((year), rank)
);
SELECT rank, user_name, total FROM top_users_by_year
WHERE year = 2026 LIMIT 10;
A single-partition lookup, served in milliseconds regardless of data volume. The cost moved: instead of a cluster-wide join-aggregate-sort at query time, you run a continuous streaming aggregation that materialises the answer in advance.
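The streaming side can be sketched in a few lines — hypothetical order events feeding running totals, with a periodic materialisation step that produces exactly the rows the lookup table would hold. A real Flink or Kafka Streams job adds state backends, windows, and fault tolerance, but the shape is this:

```python
# Rolling aggregation in stream-processor state, materialised as ranked rows.
from collections import defaultdict
import heapq

totals = defaultdict(int)  # user_id -> running total, the job's state

def on_order_event(user_id, amount):
    totals[user_id] += amount  # per-event update; no query-time aggregation

def materialise_top_k(k=10):
    # Rows the job would periodically write into top_users_by_year.
    best = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
    return [{"rank": i + 1, "user_id": uid, "total": t}
            for i, (uid, t) in enumerate(best)]

for uid, amt in [(1, 50), (2, 200), (1, 100), (3, 30)]:
    on_order_event(uid, amt)
top = materialise_top_k(2)  # user 2 leads with 200, user 1 follows with 150
```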
Common confusions
- "CQL is SQL." It is shaped like SQL but supports a much smaller subset. Every idiom that assumes global visibility — joins, subqueries, cross-partition aggregation, arbitrary WHERE — is absent. Starting a Cassandra project with "let me port my Postgres queries" is the fastest way to a broken system.
- "I can use OR freely." CQL has no OR in the WHERE clause. IN approximates OR on a single column; cross-column disjunctions require two queries.
- "ALLOW FILTERING is fine for occasional use." Every use is a full cluster scan. Fine in admin queries against a small test table, never in OLTP production — the cluster either absorbs the scan (so the table is small enough you did not need Cassandra) or it does not (so the scan takes down the cluster on the day it runs). The keyword is the signal that a new table is needed, not that the query is acceptable.
- "Counters work like any other column." They do not. Dedicated table, no mixing with non-counter columns, not strictly consistent under retry edge cases, not composable with lightweight transactions. Use them for display metrics where a rare double-count is acceptable; not for financial balances.
- "INSERT and UPDATE are different." They are not. Both upsert by default. The only difference is the IF variants: INSERT ... IF NOT EXISTS for create-only, UPDATE ... IF ... for compare-and-set. Without IF, the keywords are interchangeable.
- "Lightweight transactions are lightweight." Only relative to full two-phase commit. They still cost 4× a regular write (four Paxos phases, each a round trip) and serialise on the partition. Use them where the semantics justify the cost.
Going deeper
CQL2 vs CQL3 — the 2013 redesign
The original query language for Cassandra was Thrift RPC over a column-family abstraction — conceptually, a get/set API over nested maps of maps. CQL1 (2011) was a minimal SQL-like veneer. CQL2 (2012) added declarative schema. CQL3, shipped in Cassandra 1.2 (2013), was a ground-up redesign — it introduced the partition-key-plus-clustering-key primary key model, made tables strongly typed, and aligned the query surface with SQL much more closely. "CQL" in modern documentation always means CQL3. The Thrift era is dead — every modern driver speaks CQL3 over the native binary protocol.
DataStax Enterprise extensions
DataStax Enterprise (DSE) is a commercial distribution of Cassandra that ships with DSE Search (Solr integration), DSE Graph (Gremlin traversals), and DSE Analytics (embedded Spark). The CQL syntax in DSE is a strict superset, with additional keywords for these features. Queries like WHERE solr_query = '...' are not portable back to open-source Cassandra, but the core grammar is unchanged.
ScyllaDB CQL compatibility
ScyllaDB is a Cassandra-compatible rewrite in C++ using the Seastar framework. Its CQL is source-compatible — a well-designed schema migrates between the two without changes. Differences live mostly in operational behaviour (per-core sharding, different metrics, tuning knobs) and a few edge cases around materialised-view consistency, where Scylla's documentation is more explicit. Scylla also ships some features Cassandra is slower to adopt (the DynamoDB-compatible Alternator API, Lua UDFs), but the core PRIMARY KEY/SELECT/INSERT surface is identical.
The native binary protocol
The wire protocol between drivers and Cassandra is the native binary protocol (versions 3, 4, 5). Stateful request/response over TCP — the driver opens a connection per node, negotiates protocol version, sends framed queries. It supports prepared statements, paging, batches, async responses, and server push notifications for schema or topology changes. Most developers never touch it directly. But knowing that prepared statements are identified by a server-generated ID invalidated on schema changes — and that a DDL on one node causes every driver connection in the cluster to re-prepare — explains a class of "my queries stopped working after ALTER TABLE" symptoms.
Where this leads next
Chapter 91 is sharding in wide-column systems — how Cassandra's consistent hash ring distributes partitions, how virtual nodes relate to real ones, how token-aware drivers route queries, and how topology changes move data. The WHERE rules you learned here fall directly out of how the ring maps partition hashes to nodes. Chapter 92 covers replication factor and consistency levels, completing the picture of how a single CQL query becomes a coordinated operation across N replicas.
References
- Apache Software Foundation, Apache Cassandra CQL Reference — the canonical CQL specification, covering every keyword, data type, and clause. The source of truth for what the grammar allows.
- DataStax, CQL for Cassandra Developers — the most approachable tutorial on CQL, including worked examples of IF clauses, counter updates, and batches, aimed at developers coming from SQL.
- Amazon Web Services, Comparing DynamoDB and Cassandra — AWS's side-by-side comparison of Cassandra CQL and DynamoDB's API, useful for engineers evaluating managed alternatives or migrating between the two.
- Carpenter and Hewitt, Cassandra: The Definitive Guide, 3rd Edition, O'Reilly 2020 — the long-form reference on CQL, Paxos-based lightweight transactions, and the native protocol. The most complete single book on the language.
- Kleppmann, Designing Data-Intensive Applications, Chapter 3 — Storage and Retrieval, O'Reilly 2017 — the clearest treatment of why distributed storage engines omit joins and cross-partition aggregations, with comparisons across Cassandra, DynamoDB, HBase, and Bigtable.
- ScyllaDB, ScyllaDB CQL Differences from Cassandra — ScyllaDB's catalogue of where its CQL implementation differs from open-source Cassandra, including the handling of lightweight transactions, materialised views, and some optional features.