In short

Your users table is sharded by user_id. Every read whose WHERE clause names a user_id is a single-shard lookup — exactly what sharding was designed for. Then product asks for a feature that is two lines of SQL on a single-machine system and a screaming problem on yours: log in by email. The query is SELECT * FROM users WHERE email = ?. The email is not the shard key. There is no information in the email that tells the router which shard owns the row. The natural place to look is everywhere — a 1000-shard scatter-gather for every login, thousands of times a second.

A global secondary index (GSI) replaces the scatter-gather with a second single-shard lookup. The GSI is itself a sharded table — user_by_email(email PRIMARY KEY, user_id) — sharded by email. On every write that creates or updates a user, the engine (or your application) writes to both: the base on its shard, and the GSI on its shard. On every read by email: hash the email to find the GSI shard, fetch the row to learn the user_id, hash that to find the base shard, fetch the user. Two single-shard lookups instead of one cluster-wide fan-out. Latency goes from "scan all 1000 shards" (seconds, if it works) to "two round trips" (low milliseconds).

DynamoDB, Spanner, and CockroachDB ship native GSIs the engine maintains. Cassandra's SASI and MATERIALIZED VIEW are limited approximations. Vitess exposes the same idea as a lookup vindex. The trade-off is the one denormalisation always pays: a second sharded table is more storage, every base write fans out to two shards, and across them the system is eventually consistent unless the engine runs a distributed transaction. A GSI is what you buy when an access pattern is hot enough that scatter-gather would dominate your bill.

You have shipped a sharded users table. Reads keyed by user_id are 4 ms. Then login arrives. The auth service wants to run one line:

SELECT user_id, password_hash FROM users WHERE email = ?;

The email is not the shard key. The router has nothing to hash. The query has no choice but to ask every shard "do you have a user with this email?" and stitch the result. With 1000 shards, that is 1000 round trips for one login. Multiply by a few thousand logins per second and the cluster is doing scatter-gather all day.
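The forced fan-out is easy to see in code. A toy in-memory model — the shard count, table layout, and function names are illustrative, not a real router API:

```python
# Toy model of the scatter-gather a router must do when the filter
# column is not the shard key. All names here are illustrative.
NUM_SHARDS = 8  # stand-in for 1000

# Each shard is a dict keyed by user_id (the shard key).
shards = [dict() for _ in range(NUM_SHARDS)]

def put_user(user_id, email):
    shards[hash(user_id) % NUM_SHARDS][user_id] = {"user_id": user_id, "email": email}

def find_by_user_id(user_id):
    # Shard key in the predicate: hash it, ask exactly one shard.
    return shards[hash(user_id) % NUM_SHARDS].get(user_id)

def find_by_email(email):
    # No shard key in the predicate: every shard must be asked.
    # Cost grows linearly with NUM_SHARDS.
    for shard in shards:                      # one round trip per shard
        for row in shard.values():
            if row["email"] == email:
                return row
    return None

put_user("u1", "a@example.com")
assert find_by_user_id("u1")["email"] == "a@example.com"   # touches 1 shard
assert find_by_email("a@example.com")["user_id"] == "u1"   # touches all 8
```

The asymmetry between the two read functions is the whole problem: one hashes, one scans the cluster.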

This is the wall every sharded system hits. Chapter 89 taught you to denormalise wide-column tables for joins; this chapter teaches the analogous pattern for sharded lookups. You build a second sharded table keyed by the column you want to query, keep it consistent with the first, and turn the scatter-gather into a single-shard index lookup.

What a GSI is — architecturally

A global secondary index is another sharded table. Not a B-tree inside a single node the way a single-machine secondary index is. Not metadata. A full-blown distributed table, with its own shards, replication, write path, and failure modes — that happens to store nothing but a mapping from the indexed column to the base table's primary key.

The schema is mechanical. If your base table is

users(user_id PRIMARY KEY, email, name, ...)   sharded by user_id

and you want to query by email, the GSI is

user_by_email(email PRIMARY KEY, user_id)      sharded by email

Two columns, sharded on the column you filter on. Reads by email touch one GSI shard, get back a user_id, then touch one base shard.

Why this works: sharding turns "find this row" into "hash this column, ask one shard", and you only get fast access on the column you sharded by. A GSI is a second sharded table whose shard column is the one the original table did not let you query fast. You have paid for a second shard map so that one more column gets the fast-path treatment.

The maintenance contract is the part that surprises people. Every insert into the base requires an insert into the GSI. Every delete requires a delete in both. Every update that changes the indexed column is a delete-then-insert in the GSI plus the update in the base. The engine — or your application — keeps the two tables in lockstep across shards. The cost of cluster-wide consistency between them is the central trade-off of the pattern.

Figure: Global secondary index architecture — two single-shard lookups (a GSI shard for email -> user_id, then a base shard for the row) replacing a 1000-shard scatter-gather across the base table.
The flow. The router cannot hash an email against the base table because the base table is sharded by user_id. It hashes the email against the GSI, fetches the user_id from one GSI shard, then hashes the user_id against the base table and fetches the row from one base-table shard. Two round trips, both single-shard. The path the GSI replaces — broadcasting the email predicate to every base-table shard — is what every login would otherwise cost.

Two kinds — GSI and LSI in DynamoDB terminology

DynamoDB names a distinction the rest of the industry leaves implicit, so the vocabulary is worth borrowing.

Global Secondary Index (GSI). A separate table with its own partition key and sort key, sharded independently of the base. Can be added or dropped at any time. Maintained asynchronously and eventually consistent with the base — there is a real, observable lag between a base write and the GSI reflecting it. In exchange you get a different partition key, which is the entire point.

Local Secondary Index (LSI). Same partition key as the base, different sort key. Lives on the same partition, so writes to both are atomic on the partition leader. Strongly consistent with the base on request. Cannot be added after table creation, and shares the partition's size limit (10 GB in DynamoDB) with the base. Use an LSI for a different ordering inside a partition; use a GSI for a different partition key altogether.
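In DynamoDB the distinction shows up directly in the table definition. A sketch of `create_table` parameters carrying both kinds of index — table and index names are invented, and the parameter shapes should be checked against the DynamoDB API reference before use:

```python
# Sketch of DynamoDB create_table parameters showing both index kinds.
# Names are invented; pass a dict like this to boto3's
# client.create_table(**params) after verifying against the API docs.
params = {
    "TableName": "users",
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "email", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    # GSI: its own partition key -- a separately sharded table underneath.
    "GlobalSecondaryIndexes": [{
        "IndexName": "user_by_email",
        "KeySchema": [{"AttributeName": "email", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "KEYS_ONLY"},  # email -> user_id only
    }],
    # LSI: same partition key, different sort key -- same partition, atomic,
    # and only declarable at table-creation time.
    "LocalSecondaryIndexes": [{
        "IndexName": "users_by_created",
        "KeySchema": [
            {"AttributeName": "user_id", "KeyType": "HASH"},
            {"AttributeName": "created_at", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    "BillingMode": "PAY_PER_REQUEST",
}
```

The GSI declares a new partition key; the LSI reuses the base's partition key and only changes the sort order within it — which is the whole distinction in two stanzas.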

Most other systems implement only one. Cassandra's MATERIALIZED VIEW is GSI-shaped. Vitess vindexes are GSI-shaped. CockroachDB and Spanner offer both — Spanner unusually offering strong consistency on the global variety because TrueTime makes the commit protocol cheap. When a system says "secondary index" without qualification, it almost always means GSI in our sense.

The classic use case — email login

The pattern is so common it is almost the only thing you build a GSI for in your first year.

Base table — sharded by user_id:

CREATE TABLE users (
  user_id        UUID PRIMARY KEY,
  email          TEXT UNIQUE,
  password_hash  TEXT,
  display_name   TEXT,
  created_at     TIMESTAMP
) DISTRIBUTE BY HASH(user_id);

GSI — a sharded table whose primary key is email:

CREATE TABLE user_by_email (
  email    TEXT PRIMARY KEY,
  user_id  UUID
) DISTRIBUTE BY HASH(email);

The login path:

  1. SELECT user_id FROM user_by_email WHERE email = ? — single GSI shard, ~3 ms.
  2. SELECT password_hash FROM users WHERE user_id = ? — single base shard, ~3 ms.
  3. Compare hash, return a session token.

Total: 6 ms across two round trips. Without the GSI, the same query on a 1000-shard cluster costs 1000 round trips and saturates the cluster the moment login traffic gets above modest. The factor of 1000 justifies the cost of a second sharded table.

Why two round trips and not one: the GSI stores only the indexed column and the foreign key, because copying every column would double your storage. The GSI is two-step by design — find the user_id, then fetch the row. A "covered index" is the variant that projects extra columns into the GSI, trading storage for skipping the second lookup.

The same pattern handles every "log in by some unique identifier" feature: phone, username, referral code. Each is another sharded table.

Write amplification

GSI cost is paid on the write path. Consider an email change.

On a single-machine relational database, the operation is one row update; Postgres handles the secondary index update inside the same transaction. Atomic, invisible.

On a sharded system with a GSI, it expands to three writes on three different shards:

  1. UPDATE users SET email = ? — base shard.
  2. DELETE FROM user_by_email WHERE email = old_email — old GSI shard.
  3. INSERT INTO user_by_email VALUES (new_email, user_id) — new GSI shard.

User creation: two writes. User deletion: two writes. Email rename: three writes including the cleanup. Multiply by every GSI you maintain: with three GSIs (email, phone, username), a user creation costs four writes on four shards, while an email rename still costs three — only the email GSI is touched.

Write amplification is acceptable when the read:write ratio justifies it. For login, reads outnumber email renames by something like 10000:1 in any consumer product, so the three writes per rename are amortised across ten thousand cheap reads. For a workload where the indexed column changes constantly — a last_active_at updated every page view — a GSI adds a delete-and-insert to every page view with no read benefit. That is why you index identity-stable columns (email, phone, username), not flux-prone ones.
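The amortisation argument is one line of arithmetic; a sketch with the chapter's illustrative numbers (the function name is invented):

```python
# Back-of-envelope write amplification: how many extra GSI writes
# are paid for each read the GSI serves.
def amortised_writes_per_read(gsi_writes_per_change, reads_per_change):
    """Extra GSI writes per read served by the index."""
    return gsi_writes_per_change / reads_per_change

# Login by email: ~10000 reads between email renames, 3 writes per rename.
assert amortised_writes_per_read(3, 10_000) == 0.0003  # negligible

# A flux-prone column like last_active_at: say 100 updates per read.
assert amortised_writes_per_read(100, 1) == 100.0      # never index this
```

Three orders of magnitude separate the two cases, which is why the decision is per-column, not per-table.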

Consistency challenges

The base and the GSI are two sharded tables on two shard maps. Updating both atomically requires a distributed transaction — two-phase commit, or a TrueTime-coordinated commit. Most systems do not offer that, or offer it at a latency cost that defeats the purpose of a fast index.

So the dominant choice is eventual consistency. The base write commits independently; the GSI write follows asynchronously. There is a window — typically milliseconds, sometimes hundreds of milliseconds under load — during which the base says one thing and the GSI says another.

A user changes their email at 12:00:00.000. A login attempt with the new email arrives at 12:00:00.050. If the GSI has not caught up, the lookup returns "no such email" and login fails. The user retries 200 ms later, login succeeds. Applications handle this either by accepting the brief failure or by falling back to a scatter-gather verify on a GSI miss.
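The fallback-on-miss behaviour can be sketched with a toy in-memory model — all names and the shard count are illustrative:

```python
# Toy: read-by-email with a scatter-gather verify on a GSI miss, one way
# applications paper over GSI replication lag. Names are illustrative.
NUM_SHARDS = 4
base = [dict() for _ in range(NUM_SHARDS)]   # sharded by user_id
gsi = [dict() for _ in range(NUM_SHARDS)]    # sharded by email

def find_by_email(email):
    # Fast path: one GSI shard, then one base shard.
    hit = gsi[hash(email) % NUM_SHARDS].get(email)
    if hit is not None:
        return base[hash(hit["user_id"]) % NUM_SHARDS].get(hit["user_id"])
    # Slow path: GSI lagging (or email unknown) -- verify by scatter-gather.
    for shard in base:
        for row in shard.values():
            if row["email"] == email:
                return row
    return None

# Simulate lag: the base write has landed, the GSI write has not.
base[hash("u1") % NUM_SHARDS]["u1"] = {"user_id": "u1", "email": "new@x.com"}
assert find_by_email("new@x.com")["user_id"] == "u1"   # fallback found it
```

The trade-off is explicit in the code: correctness during the lag window is bought back at scatter-gather prices, which is acceptable only if misses are rare.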

Stronger guarantees cost. DynamoDB GSIs are eventually consistent and the docs are explicit. Spanner's global indexes are strongly consistent at higher write latency. CockroachDB indexes are transactionally consistent with the base. Cassandra's MATERIALIZED VIEW is eventually consistent and historically buggy under replica failures.

The right mental model: a GSI is a denormalised second copy, like the wide-column denormalisation of Chapter 89. You either pay at write time for synchronous fan-out, or you defer the work and accept eventual consistency.

Python implementation sketch

A minimal client-managed GSI, in case your engine does not provide one. The pattern generalises to any sharded store with a put(key, value) and get(key) interface.

class ShardedTable:
    """In-memory stand-in for a hash-sharded table: num_shards dicts,
    with every operation routed to one shard by hashing the key."""
    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

    def delete(self, key):
        self._shard(key).pop(key, None)

class ShardedDB:
    """Two logical tables, each transparently sharded under the hood."""
    def __init__(self):
        self.users = ShardedTable(num_shards=1000)
        self.users_by_email = ShardedTable(num_shards=1000)

def create_user(db, user_id, email, name):
    db.users.put(user_id, {"user_id": user_id, "email": email, "name": name})
    db.users_by_email.put(email, {"email": email, "user_id": user_id})

def update_email(db, user_id, new_email):
    user = db.users.get(user_id)
    old_email = user["email"]
    user["email"] = new_email
    db.users.put(user_id, user)
    db.users_by_email.delete(old_email)
    db.users_by_email.put(new_email, {"email": new_email, "user_id": user_id})

def delete_user(db, user_id):
    user = db.users.get(user_id)
    if user is None:
        return
    db.users.delete(user_id)
    db.users_by_email.delete(user["email"])

def find_by_email(db, email):
    row = db.users_by_email.get(email)
    if row is None:
        return None
    return db.users.get(row["user_id"])

Why this is the structure even when the engine does GSI maintenance for you: the engine runs the same steps internally, with extra machinery for replica replay and crash recovery. Reading the code makes the cost model concrete — you can count the writes and round trips and see there is no hidden cheap path; engines that "do GSIs for you" pay the same fan-out under the covers, and their write latency reflects it.

A production version adds idempotency tokens (a retry of create_user after partial failure does not double-insert), a write-ahead log of pending GSI ops (a crash between the base write and the GSI write replays on recovery), and lag metrics. For a learning implementation, the four functions above are the whole story.
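A minimal sketch of the pending-ops log described above, assuming idempotent GSI operations — all class and function names are invented for illustration:

```python
# Sketch of the write-ahead pattern: durably log the intended GSI write
# before the base write, then apply the log; after a crash, re-apply it.
# All names are illustrative; the log stands in for durable storage.
class PendingLog:
    def __init__(self):
        self.ops = []                      # stand-in for a durable log
    def append(self, op):
        self.ops.append(op)
    def replay(self, apply):
        for op in list(self.ops):
            apply(op)                      # apply() must be idempotent
            self.ops.remove(op)

log = PendingLog()
gsi = {}                                   # stand-in for the GSI table

def apply_gsi_op(op):
    kind, key, value = op
    if kind == "put":
        gsi[key] = value                   # same put twice: harmless
    elif kind == "delete":
        gsi.pop(key, None)                 # delete of absent key: harmless

def create_user(base, user_id, email):
    log.append(("put", email, {"user_id": user_id}))      # 1) log intent
    base[user_id] = {"user_id": user_id, "email": email}  # 2) base write
    log.replay(apply_gsi_op)                              # 3) GSI write(s)

base = {}
create_user(base, "u1", "a@x.com")
assert gsi["a@x.com"]["user_id"] == "u1"
assert log.ops == []                       # nothing pending after success
```

A crash between steps 2 and 3 leaves the op in the log; recovery replays it, and idempotency makes the replay safe even if the write had partially landed.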

Engine-specific implementations

DynamoDB. Native GSIs declared at table creation or added via UpdateTable. The engine maintains them; you do not write fan-out code. Eventually consistent by default; strongly-consistent reads available on LSIs but not GSIs. GSIs have separate provisioned throughput, and underprovisioning a GSI throttles base writes — a subtle operational gotcha. Adding a GSI triggers a backfill that can take hours.

Cassandra. Three options, none ideal. CREATE INDEX makes a per-node inverted index, so reads fan out to every node and read throughput scales sub-linearly with cluster size. SASI adds range and prefix support but the same fan-out. CREATE MATERIALIZED VIEW is closest to a true GSI but its consistency story under replica failures is shaky enough that operators have recommended hand-built second tables since Cassandra 3.x.

Vitess. A lookup vindex is a GSI declared in VSchema, backed by a separate MySQL table. Maintenance is on the application; Vitess's contribution is the routing layer that turns WHERE email = ? into the two-step lookup transparently.

CockroachDB. SQL CREATE INDEX on a non-primary-key column produces a sharded, transactionally-consistent GSI via the distributed transaction layer.

Spanner. CREATE INDEX produces a globally strongly-consistent index because TrueTime makes the commit protocol cheap enough to do at every write.

Pick by consistency story. Strong: Spanner, CockroachDB. Eventual is fine: DynamoDB, Vitess. On Cassandra, build the second table yourself.

When NOT to use a GSI

The cost — extra storage, write amplification, consistency complexity — is worth paying only when the read benefit is large.

The query is rare. A back-office search you run twice a week can scatter-gather. A GSI is operational overhead with no user-facing payoff.

Low cardinality on the indexed column. A GSI on country_code (200 values) has only 200 distinct keys, and the partition holding India's entry absorbs a wildly disproportionate share of rows and traffic — the GSI inherits the hot-partition problem of low-cardinality shard keys.

Writes outnumber reads on the indexed column. A last_login_at updated every login fans out to the GSI on every login, doubling the write load with no read benefit.

The query already filters on the shard key. WHERE user_id = ? AND email = ? already routes to one shard; the email filter is local.

Low selectivity. An index on subscription_tier (4 values, 25% each) returns 250 million rows on a billion-user table — a scan, not a lookup.

When YES

High-frequency query on a non-shard-key column. Login by email, lookup by phone, @mention resolution. Thousands per second, millisecond latency required.

Read-dominant with high selectivity. Read:write ratio at least 100:1 on the indexed column; each indexed value points to one or a few rows.

Latency budget rules out scatter-gather. A 50 ms login SLA cannot tolerate a 1000-shard fan-out where the slowest shard sets the response time.

Operational maturity for the second consistency surface. Monitoring on GSI lag, alerts on divergence, a tested backfill tool that rebuilds the GSI from the base.
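The "slowest shard sets the response time" point above is tail-latency arithmetic: a fan-out meets the SLA only if every shard it touches does. A sketch:

```python
# Tail-latency arithmetic behind "the slowest shard sets the response time".
# If each shard independently answers within the SLA with probability p,
# a fan-out to n shards is within the SLA only when ALL n are.
def fanout_sla_hit_rate(p_single, n_shards):
    return p_single ** n_shards

# Each shard hits its latency target 99% of the time:
assert fanout_sla_hit_rate(0.99, 1) == 0.99          # single-shard lookup
assert fanout_sla_hit_rate(0.99, 1000) < 0.0001      # 1000-shard scatter-gather
# Two single-shard round trips (GSI + base) stay near the single-shard rate:
assert fanout_sla_hit_rate(0.99, 2) > 0.98
```

A per-shard 99% target becomes a cluster-wide hit rate of under 0.01% at 1000-way fan-out, which is why the latency budget, not just the throughput bill, rules out scatter-gather.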

A social app's GSI portfolio

You are building a social app, users sharded by user_id. Over the first year, access patterns accumulate.

By user_id — home page, friends list, profile. Single-shard, no GSI.

By email, for login. 5000 reads per second. user_by_email ships in week one.

By username, for @mentions and profile URLs. 20000 reads per second — every comment-render parses the text and resolves usernames. user_by_username ships in month two.

By phone_number, for SMS login and contact import. 1000 reads per second. user_by_phone ships in month three.

By referral_code on signup. A few per second — scatter-gather would be tolerable, but referral attribution is on the growth-metrics critical path. GSI ships anyway: 50 ms instead of 1500 ms.

By account_status (active / suspended / deleted). Three values, low cardinality, mostly used by operational scans. No GSI — the partitions would be hot. Answer offline in the data warehouse.

The portfolio: four GSIs, each its own sharded table, each its own write amplification. Every signup is one base write plus four GSI writes. Every email change is three writes (base + GSI delete + GSI insert) on three shards. The bookkeeping is non-trivial; you write a BackfillGSI job that rebuilds any GSI from the base if it diverges.

The decision was per-pattern: which queries are hot, latency-critical, and stably-keyed enough to justify a GSI. There is no rule that every non-shard-key access pattern needs one — only a calculation per query.
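The BackfillGSI job mentioned above reduces to a scan-and-rederive loop. A toy sketch — function names are illustrative, and a real job scans in checkpointed batches while writes continue:

```python
# Toy BackfillGSI: rebuild a GSI from the base table by scanning every
# base shard and re-deriving the email -> user_id mapping. Names are
# illustrative; a production job checkpoints and tolerates live writes.
def backfill_gsi(base_shards, gsi_shards, indexed_column, num_gsi_shards):
    for shard in gsi_shards:           # start from an empty index
        shard.clear()
    for shard in base_shards:          # full scan of the base table
        for row in shard.values():
            key = row[indexed_column]
            gsi_shards[hash(key) % num_gsi_shards][key] = {
                indexed_column: key, "user_id": row["user_id"],
            }

base = [
    {"u1": {"user_id": "u1", "email": "a@x.com"}},
    {"u2": {"user_id": "u2", "email": "b@x.com"}},
]
gsi = [dict(), dict()]
backfill_gsi(base, gsi, "email", 2)
assert gsi[hash("a@x.com") % 2]["a@x.com"]["user_id"] == "u1"
```

The base is the source of truth and the GSI is always rederivable from it — which is what makes eventual consistency between the two tolerable in the first place.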

Figure: A social app's GSI portfolio — the base users table sharded by user_id, fanned out to four separate sharded GSI tables: user_by_email, user_by_username, user_by_phone, and user_by_referral, each sharded by its own column.
The portfolio. The base users table is sharded by user_id. Four GSIs — one per non-shard-key access pattern that justified the cost — each its own sharded table on its own shard column. Reads on any indexed column become single-shard lookups against the corresponding GSI followed by single-shard lookups against the base. Writes on the base fan out to as many GSIs as touch the changed column.

Going deeper

DynamoDB Streams + Lambda for custom GSI fan-out

When the built-in GSIs are not flexible enough — say, a GSI whose key is a computed value — build it yourself with DynamoDB Streams. Every base write emits an event; a Lambda consumes the stream and writes into a separate table you treat as a GSI. This is what the engine does internally for native GSIs. You own Lambda concurrency, retry semantics, and poison-event dead-letter queues; you get full control over index shape, projection, and consistency.
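The stream-consumer shape can be sketched in pure Python — the record layout (an old and a new image per change) loosely mirrors a DynamoDB Streams record, but the names and structure here are illustrative:

```python
# Sketch of a change-stream consumer maintaining a hand-built GSI.
# The record shape (old/new images) loosely mirrors a DynamoDB Streams
# event; field names and the gsi_table interface are illustrative.
def handle_stream_records(records, gsi_table):
    for rec in records:
        old, new = rec.get("old"), rec.get("new")
        if old and old["email"] != (new or {}).get("email"):
            gsi_table.pop(old["email"], None)         # remove stale mapping
        if new:
            gsi_table[new["email"]] = new["user_id"]  # computed key goes here

gsi = {}
# A create, then an email change, arriving as two stream batches:
handle_stream_records([{"old": None, "new": {"user_id": "u1", "email": "a@x"}}], gsi)
handle_stream_records([{"old": {"user_id": "u1", "email": "a@x"},
                        "new": {"user_id": "u1", "email": "b@x"}}], gsi)
assert "a@x" not in gsi and gsi["b@x"] == "u1"
```

The handler is where a computed index key would go: derive whatever key you want from the new image before writing the mapping, which is exactly the flexibility native GSIs lack.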

The write-side flush pattern for strongly-consistent GSIs

If your engine's GSIs are eventually consistent and your access pattern cannot tolerate the lag, layer strong consistency on top with a write-side flush: the writer issues the base and GSI writes in sequence and publishes a "flushed" event only after both succeed; readers that need strong consistency wait for the flush event. The latency cost is real, but it lets you mix strong and eventual reads on the same GSI per query.
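A minimal sketch of the flush pattern using a single in-process event — a simplification, since a real system publishes a flush marker per write rather than one global flag, and all names are illustrative:

```python
import threading

# Sketch of the write-side flush: the writer signals "flushed" only after
# both the base and GSI writes succeed; a strongly-consistent reader waits
# for that signal before trusting the GSI. One global Event is a toy
# stand-in for per-write flush markers.
base, gsi = {}, {}
flushed = threading.Event()

def write_user(user_id, email):
    base[user_id] = {"user_id": user_id, "email": email}  # 1) base write
    gsi[email] = user_id                                  # 2) GSI write
    flushed.set()                                         # 3) publish flush

def strong_read_by_email(email, timeout=1.0):
    if not flushed.wait(timeout):       # block until the flush is published
        return None                     # timed out: treat as unavailable
    user_id = gsi.get(email)
    return base.get(user_id) if user_id else None

t = threading.Thread(target=write_user, args=("u1", "a@x.com"))
t.start()
row = strong_read_by_email("a@x.com")   # waits out the lag window
t.join()
assert row["user_id"] == "u1"
```

An eventually-consistent reader on the same GSI simply skips the wait and reads gsi directly, which is how the two consistency levels coexist per query.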

MongoDB compound indexes vs GSIs

MongoDB has primary shard keys and per-shard local indexes. A "compound index" — {email: 1, created_at: -1} — sorts within a shard but does not change the shard key. A query filtering on email against a collection sharded on user_id is still a scatter-gather. MongoDB has no GSIs in the DynamoDB sense; the equivalent is a second collection sharded by the indexed column, maintained by application code or Atlas Search. Confusing the two is a common design error.

Where this leads next

Chapter 96 covers scatter-gather queries — the path you walk when no GSI exists and the latency budget allows it. Chapter 97 covers cross-shard transactions — updating a base and a GSI atomically via two-phase or Paxos commit, and the latency cost. Chapter 98 returns to resharding when the GSI's distribution itself becomes hot.

The thread across Build 12 is the same as Build 11: every access pattern that does not match the storage layout costs something, and the engineering work is choosing where to pay it. The GSI is one of the half-dozen ways. Knowing which patterns it fits — and which it does not — separates a sharded system that scales from one that scatter-gathers itself to death.

References

  1. Amazon Web Services, Using Global Secondary Indexes in DynamoDB — the canonical reference for GSI semantics, consistency model, capacity provisioning, and the difference from LSIs. The "Best Practices" subsection on projection types and write-cost analysis is essential reading.
  2. Apache Software Foundation, Materialized Views in Cassandra — Cassandra's GSI-shaped feature, with the official caveats about replica-failure consistency and the recommendation to build a second table manually for production correctness needs.
  3. PlanetScale / CNCF Vitess, Lookup Vindexes — Vitess's mechanism for indexing a sharded MySQL table by a non-shard-key column, with the lookup-table-driven design that closely mirrors a hand-built GSI.
  4. Bacon et al., Spanner: Becoming a SQL System, SIGMOD 2017 — the Spanner paper, including the discussion of how globally distributed indexes stay strongly consistent via TrueTime-coordinated commits, and the cost model for that consistency.
  5. Cockroach Labs, Indexes — CockroachDB SQL Reference — the SQL-syntax view of distributed secondary indexes, including coverage of partial indexes, hash-sharded indexes, and how Cockroach maintains transactional consistency between base tables and indexes.
  6. Kleppmann, Designing Data-Intensive Applications, Chapter 6 — Partitioning, O'Reilly 2017 — the textbook treatment of secondary indexes in partitioned systems, including the document-partitioned vs term-partitioned distinction (the same axis as LSI vs GSI) and the consistency consequences of each.