Nested Indexes and Dot-Path Queries

In short

A BSON document is a tree. An order has a customer, the customer has an address, the address has a city; the order also has an array of line items, each of which has a product, which itself has a brand. None of that flattens into a tuple. To say "find every order shipped to Mumbai" you need to point at a node inside the tree, and the language MongoDB picked for pointing is dot-path notation: address.city, customer.address.pincode, items.0.name, items.product.brand.

Dot-paths show up in three places. They are how you query (db.orders.find({"address.city": "Mumbai"}) walks into the address sub-document and matches its city field). They are how you index (db.orders.createIndex({"address.city": 1}) builds a B-tree over the same path). And when the path crosses an array, they trigger MongoDB's most powerful and most dangerous index variant — the multikey index — which stores one entry per array element, so a collection of 10 million orders averaging 4 line items produces 40 million index keys.

This chapter walks the dot-path mechanics end to end, using a Bharat Bazaar orders collection: dot-paths into nested sub-documents, three flavours of array indexing (items.0 for first element, items for any element, positional items.$ for projection), the multikey index and its size trade-off, the array operator zoo ($elemMatch, $all, $in), and a real PyMongo example with explain() output proving the planner picked the index. By the end you will know exactly what dot-paths buy you, when a deeper path becomes a query-planner liability, and why the seasoned MongoDB modeller prefers three nesting levels and a denormalised city_indexed field over six pretty levels and a slow query.

Where the tree comes from

In a relational schema you would model the Bharat Bazaar order like this: an orders row points to a customers row by customer_id, the customers row points to an addresses row by address_id, and a separate order_items table holds one row per line item with its own product_id joining to products. Four tables, three foreign keys, every list-an-order query a multi-way join.

In MongoDB you embed. One document carries the order, the customer at the time of the order (denormalised on purpose, so a later customer rename does not retro-edit history), the shipping address, and the line items as an inline array:

{
  "_id": ObjectId("..."),
  "order_no": "BB-2026-00471829",
  "placed_at": datetime(2026, 4, 22, 15, 12),
  "customer": {
    "name": "Aarav Sharma",
    "email": "aarav.s@example.in",
    "phone": "+91-98xxxxxx12"
  },
  "address": {
    "line1": "Flat 4B, Marigold Apartments",
    "line2": "Koramangala 5th Block",
    "city": "Bengaluru",
    "state": "KA",
    "pincode": "560095"
  },
  "items": [
    {"product": {"sku": "SHO-CAM-9-BLK", "name": "Stride Pro", "brand": "Bata"},
     "qty": 1, "price_inr": 2499},
    {"product": {"sku": "STA-NB-A4-100", "name": "Notebook A4 100p", "brand": "Classmate"},
     "qty": 3, "price_inr": 60}
  ],
  "total_inr": 2679,
  "status": "shipped"
}

Two structural facts about this shape determine everything in this chapter. First, the tree has depth — customer is one level down from the document root; address.city is also one level down but in a different subtree; items.0.product.brand is three levels down (array index, then product field, then brand field). Second, items is an array, which means a single document can match a query like "an item with brand Bata" through any of its array elements — the database has to consider all of them.

Both of these break the assumption a relational index is built on. A B-tree on customers.email indexes one value per row. A B-tree on orders.items.brand would have to index N values per document, where N is the array length. That is the multikey index, and it is the heart of the chapter.

Dot-paths into sub-documents

The simplest case is dot-path into a non-array sub-document. The order has a customer sub-document; you want orders for a specific email:

db.orders.find({"customer.email": "aarav.s@example.in"})

MongoDB parses "customer.email" as "descend into the value at customer, then match its email field against the given value." The traversal is purely structural — there is no join, no second collection lookup, just a walk of the BSON tree inside one document. Why this is cheap at the read path: the value at customer is a length-prefixed embedded document (BSON type 0x03), so the parser can locate the customer subtree, then within it locate the email field, without scanning the rest of the order. The same dot-path expression can be matched without a full document parse on every hit.

The dot-path generalises to any depth. Bharat Bazaar tracks supplier brands per line item, so the path "the brand of the product of the first line item" is items.0.product.brand. The path "any line item's product brand" — without specifying which one — is items.product.brand. MongoDB treats these differently, and the difference matters for indexes:

# Find orders where the FIRST item is a Bata product
db.orders.find({"items.0.product.brand": "Bata"})

# Find orders where ANY item is a Bata product
db.orders.find({"items.product.brand": "Bata"})

The first form is a positional path: items.0 is "the element at index 0 of the array." The query matches only when that specific position has the matching brand. The second form is an unindexed-path traversal: when MongoDB sees a dot-path crossing an array without a numeric index, it implicitly fans out — the query matches if any element of items has a product.brand of "Bata". That implicit fan-out is the multikey behaviour, and it is what makes array indexing both powerful and expensive.

(positional)</text><text x="560" y="84" text-anchor="middle" font-size="10" fill="#374151">projection placeholder</text><text x="470" y="110" font-family="monospace" font-size="10">tags = ["sale", "new",</text><text x="470" y="124" font-family="monospace" font-size="10"> "festive"]</text><line x1="460" y1="140" x2="660" y2="140" stroke="#9ca3af"/><text x="560" y="158" text-anchor="middle" font-family="monospace" font-size="9">find({"tags": "new"},</text><text x="560" y="170" text-anchor="middle" font-family="monospace" font-size="9"> {"tags.": 1})returns: tags = ["new"](only the matched element)$ is replaced by the indexof the matched element.Used in projection and inupdate operators.A query operator, not an index.

Indexing a dot-path

A nested-field index in MongoDB is built the same way as a top-level index — same B-tree, same createIndex syntax — but the key the index extracts from each document is the value at the dot-path:

db.orders.create_index([("address.city", 1)])
db.orders.create_index([("customer.email", 1)])
db.orders.create_index([("address.city", 1), ("placed_at", -1)])  # compound

The first index lets find({"address.city": "Mumbai"}) go from a collection scan (read every document, check the field) to a B-tree seek (jump straight to the orders shipped to Mumbai, in O(log N) instead of O(N)). The third is a compound index that supports both find({"address.city": "Mumbai"}) and find({"address.city": "Mumbai"}).sort({"placed_at": -1}) — sorted index entries by city then by placed_at descending.

What MongoDB will not do is auto-create indexes for paths it sees in queries. You must declare every index. A query against an unindexed path is a COLLSCAN — full collection scan — which on a 10-million-document orders collection will read every document into memory. The first sign of this in production is "the dashboard query that used to take 50 ms now takes 12 seconds and is pegging the CPU." See MongoDB indexes overview for the full list of supported index types.

Two index variations matter for nested paths specifically. Sparse indexes ({"sparse": True}) only index documents that have the field — useful for optional sub-documents like promo.discount_pct, where most orders have no promo and you do not want the index carrying NULL entries for them. Wildcard indexes (createIndex({"address.$**": 1})) index every field at every depth under address — useful for ad-hoc queries when you do not know which field will be queried, but expensive in space and slower per-write than a targeted index.

Multikey indexes — one entry per array element

Now the dragon. When you create an index on a field that is itself an array, or a dot-path that crosses an array, MongoDB builds a multikey index: one B-tree entry per array element, all pointing back to the same document. This is what makes find({"items.product.brand": "Bata"}) fast — but it also means the index can be much larger than the document count.

In the diagram, three documents with arrays of length 2, 1, and 3 produce six index entries — twice the document count. For Bharat Bazaar's real load, an order averages four line items and a tags array with three tags; an index on items.product.brand carries roughly 4 × N entries for N orders, and an index on tags carries 3 × N. A 10-million-order collection: 40 million items.product.brand index keys plus 30 million tags index keys, on top of all the single-key indexes. That is real RAM pressure on the working set.

The constraint MongoDB enforces to keep multikeys tractable: a compound index can be multikey on at most one field. You cannot have a compound index on (items.brand, items.qty) because that would be the Cartesian product of the two arrays, which can blow up exponentially. You can have (customer.email, items.brand) because only items.brand is array-valued. Why this limit exists: if you indexed two array fields together, a document with 5 items and 4 tags would generate 20 index entries — and a document with 50 items and 30 tags would generate 1,500 entries. The B-tree would grow faster than the data, and the planner would have no good way to estimate selectivity. The single-array constraint keeps the multiplier linear.

The query operator zoo for arrays

Dot-paths into arrays are powerful but blunt — find({"items.brand": "Bata"}) matches if any item is Bata. Real queries often want more precision:

$elemMatch — "an array element exists that satisfies these conditions together." Different from a plain dot-path conjunction. find({"items": {"$elemMatch": {"brand": "Bata", "qty": {"$gte": 2}}}}) matches orders with at least one Bata item bought in quantity 2 or more. Without $elemMatch, find({"items.brand": "Bata", "items.qty": {"$gte": 2}}) matches orders that have some Bata item AND some item with qty ≥ 2 — possibly different items.
$all — "the array contains all of these values." find({"tags": {"$all": ["sale", "festive"]}}) matches if both tags are present.
$in — "the field equals any of these values." find({"address.city": {"$in": ["Mumbai", "Bengaluru", "Pune"]}}) matches the multi-city case efficiently using the index.
Positional $ — used in projection and update to refer to the matched element. update_one({"items.product.sku": "SHO-CAM-9-BLK"}, {"$set": {"items.$.qty": 2}}) updates the qty of the first matching item without rewriting the whole array.
Array filters (arrayFilters) — for updating multiple matching elements with conditions: update_one({...}, {"$set": {"items.$[i].discount": 0.1}}, array_filters=[{"i.brand": "Bata"}]).

Each of these (documented in the MongoDB query operator reference) plays differently with multikey indexes. $elemMatch and $all can use a multikey index efficiently; $elemMatch in particular is the only way to push multiple per-element conditions into the index lookup. $in is index-friendly and is the right way to express "any of these N values." Positional $ and array filters are write-side operators — they do not select documents, they pinpoint elements within an already-matched document.

Worked example — Bharat Bazaar orders, dot-paths, indexes, explain plans

Indexing and querying the Bharat Bazaar orders collection

End to end: insert sample orders, build the right indexes, run the three target queries, prove via explain() that each query uses its index.

from pymongo import MongoClient, ASCENDING
from bson import ObjectId, Decimal128
from datetime import datetime
from pprint import pprint

client = MongoClient("mongodb://localhost:27017")
db = client.bharat_bazaar
orders = db.orders
orders.drop()  # start fresh

# A handful of orders — three cities, mixed brands, varying item counts
orders.insert_many([
    {
      "order_no": "BB-2026-00471829", "placed_at": datetime(2026, 4, 22, 15, 12),
      "customer": {"name": "Aarav Sharma", "email": "aarav.s@example.in"},
      "address": {"city": "Bengaluru", "state": "KA", "pincode": "560095"},
      "items": [
        {"product": {"sku": "SHO-BAT-9", "name": "Stride Pro", "brand": "Bata"},
         "qty": 1, "price_inr": 2499},
        {"product": {"sku": "STA-CLM-A4", "name": "Notebook A4", "brand": "Classmate"},
         "qty": 3, "price_inr": 60},
      ],
      "total_inr": 2679, "status": "shipped",
    },
    {
      "order_no": "BB-2026-00471830", "placed_at": datetime(2026, 4, 22, 15, 30),
      "customer": {"name": "Priya Iyer", "email": "p.iyer@example.in"},
      "address": {"city": "Bengaluru", "state": "KA", "pincode": "560034"},
      "items": [
        {"product": {"sku": "SHO-NIK-8", "name": "Air Stride", "brand": "Nike"},
         "qty": 1, "price_inr": 5499},
      ],
      "total_inr": 5499, "status": "shipped",
    },
    {
      "order_no": "BB-2026-00471831", "placed_at": datetime(2026, 4, 22, 16, 02),
      "customer": {"name": "Rohan Mehta", "email": "rohan.m@example.in"},
      "address": {"city": "Mumbai", "state": "MH", "pincode": "400001"},
      "items": [
        {"product": {"sku": "ELC-SAM-65", "name": "Smart TV 65", "brand": "Samsung"},
         "qty": 1, "price_inr": 89999},
        {"product": {"sku": "SHO-BAT-10", "name": "School Shoe", "brand": "Bata"},
         "qty": 2, "price_inr": 1299},
        {"product": {"sku": "STA-CLM-A4", "name": "Notebook A4", "brand": "Classmate"},
         "qty": 5, "price_inr": 60},
      ],
      "total_inr": 92897, "status": "packed",
    },
])

# Indexes — one per query pattern we want to support
orders.create_index([("address.city", ASCENDING)],
                    name="ix_address_city")  # dot-path index
orders.create_index([("items.product.brand", ASCENDING)],
                    name="ix_items_brand")  # multikey index (items is an array)
orders.create_index([("items.0.product.name", ASCENDING)],
                    name="ix_first_item_name")  # positional, NOT multikey

print("indexes:", [ix["name"] for ix in orders.list_indexes()])

Now the three target queries, each with explain() to confirm the index is hit.

# Query 1: orders shipped to Bengaluru — uses ix_address_city
q1 = {"address.city": "Bengaluru"}
print("\nQ1 results:", [o["order_no"] for o in orders.find(q1)])
plan = orders.find(q1).explain()["queryPlanner"]["winningPlan"]
print("Q1 stage:", plan["inputStage"]["stage"], "index:",
      plan["inputStage"].get("indexName"))
# Q1 results: ['BB-2026-00471829', 'BB-2026-00471830']
# Q1 stage: IXSCAN index: ix_address_city

# Query 2: orders containing at least one Bata item — multikey on items.product.brand
q2 = {"items.product.brand": "Bata"}
print("\nQ2 results:", [o["order_no"] for o in orders.find(q2)])
plan = orders.find(q2).explain()["queryPlanner"]["winningPlan"]
print("Q2 stage:", plan["inputStage"]["stage"], "index:",
      plan["inputStage"].get("indexName"),
      "isMultiKey:", plan["inputStage"].get("isMultiKey"))
# Q2 results: ['BB-2026-00471829', 'BB-2026-00471831']
# Q2 stage: IXSCAN index: ix_items_brand isMultiKey: True

# Query 3: orders where the FIRST item is named "Stride Pro" — positional, not multikey
q3 = {"items.0.product.name": "Stride Pro"}
print("\nQ3 results:", [o["order_no"] for o in orders.find(q3)])
plan = orders.find(q3).explain()["queryPlanner"]["winningPlan"]
print("Q3 stage:", plan["inputStage"]["stage"], "index:",
      plan["inputStage"].get("indexName"),
      "isMultiKey:", plan["inputStage"].get("isMultiKey"))
# Q3 results: ['BB-2026-00471829']
# Q3 stage: IXSCAN index: ix_first_item_name isMultiKey: False

Three things worth noting in the output. Q1 uses an ordinary B-tree index on a dot-path — same shape as a regular column index, no special handling. Q2 uses a multikey index (isMultiKey: True); MongoDB seeks to "Bata" in the B-tree, finds two index entries pointing at orders 1 and 3, and returns those documents with no further filtering. Q3's index is on items.0.product.name — note isMultiKey: False, because items.0 pins a single position rather than fanning across the array. Why this distinction matters operationally: a multikey index has stricter rules (only one array field per compound index, larger storage footprint), while a positional index behaves like any other single-key index. If you find yourself querying mostly the first array element, a positional index is cheaper than a full multikey one.

For the more precise "Bata items in quantity ≥ 2", use $elemMatch:

q4 = {"items": {"$elemMatch": {"product.brand": "Bata", "qty": {"$gte": 2}}}}
print("Q4 results:", [o["order_no"] for o in orders.find(q4)])
# Q4 results: ['BB-2026-00471831']

Without $elemMatch, find({"items.product.brand": "Bata", "items.qty": {"$gte": 2}}) would match order 1 too — order 1 has a Bata item (qty 1) AND a Classmate item (qty 3), so the per-element conditions are satisfied by different elements. $elemMatch is the only way to require both conditions on the same element.

How deep is too deep

Dot-paths work to arbitrary depth — you could nest seven levels and write find({"a.b.c.d.e.f.g": 1}) — but every level is a dereference at parse time, and every level is a step the index-extraction code has to walk on every write. The unwritten rule among experienced MongoDB modellers is roughly:

One to two levels of nesting (address.city, customer.email): essentially free. This is the sweet spot and the default for most embedding decisions.
Three levels (items.product.brand): comfortable, but be deliberate. Indexes still work cleanly; queries are still readable.
Four levels (order.shipping.address.delivery.window): smell. You are probably over-embedding. Consider de-normalising — promote frequently-queried fields to the top level (e.g. add delivery_city alongside the nested address) so common queries do not need the deep path.
Five or more: a hard signal something is wrong with the model. Either the data is genuinely a different shape (consider a separate collection plus a $lookup) or the embedding is hiding a relationship that should be its own document.

The reason has nothing to do with MongoDB performance per se — the parse cost of an extra dot is negligible. The reason is query authorability and migration risk. Deep paths are hard to write correctly, hard to grep for in application code, and dangerous to refactor. Adding a level (renaming address to shipping_address) means rewriting every query that traverses through it. Shallow shapes are forgiving; deep shapes are brittle.

A common pattern in production: keep the original nested shape for fidelity (the customer's address as the customer entered it), but add denormalised top-level fields for hot-path queries (indexed_city, indexed_pincode) that get populated on insert and stay flat. Querying and indexing happen against the flat fields; the nested original stays for display and audit. The cost is a tiny bit of write-time duplication for a meaningful win in query authorability and index size.

Couchbase, Cosmos, and the wider dot-path world

MongoDB is not the only document store, and dot-paths are not the only addressing scheme. Couchbase indexes use N1QL (essentially SQL with a path syntax — WHERE address.city = "Mumbai") backed by GSI (Global Secondary Index) which is a separate index service rather than co-located B-trees. Azure Cosmos DB uses the same dot-path idea but indexes everything by default (configurable via an indexing policy) — convenient until your write throughput halves because every field is being indexed twice. The IETF-standardised path syntax for JSON is JSONPath (RFC 9535), with $.address.city syntax (dollar-sign root, then dot or bracket); it is the syntax PostgreSQL's jsonb_path_query and AWS DocumentDB use.

Two themes recur across all of these. First, the path syntax is always shallow-to-deep, root-to-leaf, with dots or slashes as separators — none of them invented something cleverer because none of them needed to. Second, every system that supports indexing on a path that crosses an array faces the same multikey trade-off: index size scales with array element count, not document count. This is not a MongoDB-specific quirk; it is a fundamental cost of indexing tree-shaped data with array nodes.

Going deeper — when the index is not enough

For most queries, a well-chosen single-key, compound, or multikey index is the entire optimisation story. But document databases live in a corner of the design space where a few specific patterns push beyond what a B-tree alone can do. The papers and blog posts that explore these patterns are worth reading once you have the basics down.

Wildcard indexes — when you do not know the field

A wildcard index (createIndex({"$**": 1}) or createIndex({"address.$**": 1})) indexes every field at every depth in the document or under a subtree. It is the answer to "the user can filter on any of 47 attributes; I cannot index them all individually." But the cost is real: write throughput drops noticeably (every field of every insert is being indexed), the index can become enormous, and the planner cannot use a wildcard index for sort operations or range queries the way it can use a targeted one. Use wildcards when you genuinely have ad-hoc filterability requirements; do not use them as a "just in case" default.

Hashed indexes — for sharding, not for range

A hashed index (createIndex({"customer.email": "hashed"})) indexes the hash of the field, not the value itself. It is meaningless for range queries ({"customer.email": {"$gt": "x"}} cannot use it) but indispensable as a shard key — the hash distribution gives even chunk sizes across shards regardless of the value distribution. Multikey hashed indexes are not allowed; the shard key cannot be an array field.

Text indexes and the language tag

A text index (createIndex({"product.description": "text"})) builds a token-level inverted index — the right tool for "find products whose description mentions cotton kurta with embroidery." It accepts a default_language parameter ("english", "hindi", "none") which determines stemming; for mixed-language Indian e-commerce content, "none" (no stemming) is often the safest default. Note that a collection can have at most one text index, but it can cover multiple fields with weights.

Index intersection vs compound

When you query find({"address.city": "Mumbai", "status": "shipped"}), MongoDB's planner can either use a compound index (address.city, status) if one exists, or use index intersection — seek both single-key indexes and intersect the results. Intersection is a real plan but rarely optimal compared to a well-chosen compound. The right rule of thumb: build compound indexes for your most common multi-field query patterns; do not rely on intersection for hot paths.

Covered queries

If every field your query needs is in the index (both the filter fields and the projection fields), MongoDB can answer the query from the index alone without fetching the document. This is a "covered query" and is dramatically faster — no document-fetch round-trip per result. The recipe: include in your compound index every field you project. For a query like find({"address.city": "Mumbai"}, {"order_no": 1, "_id": 0}), the compound index (address.city, order_no) covers it.

The deeper trade-offs of indexing strategy in document stores — when wildcard beats targeted, when text beats regex, when you should denormalise versus building a complex compound — are the subject of operational experience more than theory. The MongoDB indexing strategies guide is the canonical practical reference; for a more academic look at indexing trade-offs in NoSQL systems, the Cattell survey of scalable SQL and NoSQL data stores puts dot-path indexing in its broader context.

What to take away

A document is a tree. To query and index it, you need a way to point at nodes in the tree, and dot-path notation is that pointer language — universal across MongoDB, Couchbase, Cosmos DB, AWS DocumentDB, and Postgres jsonb. Dot-paths into sub-documents are cheap and behave like ordinary indexes. Dot-paths that cross arrays trigger multikey indexes, which are powerful but produce one entry per array element — index size grows with total elements, not document count, and a compound index can only contain one multikey field. The query operator zoo ($elemMatch, $all, $in, positional $, array filters) gives precision that plain dot-paths cannot, especially for "this element satisfies multiple conditions together."

Modelling discipline matters more than syntax. Three levels of nesting is the comfortable ceiling; deeper paths become brittle to refactor and fragile to query. Denormalise for hot-path queries — keep a flat indexed_city next to the nested address.city if address.city is your most-asked filter — and your indexes stay small and your queries stay readable. Always run explain() on a representative query before shipping; an index that is not actually used is just storage and write overhead.

The next chapter takes the schema-flexibility promise that makes all of this possible and asks the harder question — what does it cost when nobody enforces the shape but the application code, and how do production teams keep their document collections from becoming archaeology?

References

MongoDB indexes — overview — the canonical reference for index types, including single-key, compound, multikey, wildcard, hashed, and text indexes.
MongoDB query operators — array — $all, $elemMatch, positional $, and array filters with examples.
Couchbase index documentation — N1QL path syntax and the GSI architecture; useful contrast to MongoDB's co-located B-trees.
JSONPath — RFC 9535 — the IETF-standardised path syntax for JSON, the syntax PostgreSQL jsonb_path_query and several other systems use.
MongoDB indexing strategies guide — practical patterns for compound indexes, covered queries, and index intersection.
Cattell — Scalable SQL and NoSQL Data Stores (SIGMOD Record 2010) — a foundational survey placing document-database indexing in the broader scalable-storage landscape.