Embeddings, Similarity Metrics, and the Curse of Dimensionality

In short

An embedding is a function item → vector ∈ ℝ^d chosen so that items with similar meaning land at similar coordinates. The vector for "comfortable jogging shoes" sits next to "running sneakers" and far from "hardcover novel"; the embedding for a photo of a Royal Enfield sits next to the embedding for the string "motorcycle" if you used a multimodal model like CLIP. Once your data lives in this space, every "find similar" query becomes the same primitive: return the k vectors closest to the query vector. Three metrics define "closest": cosine similarity (a·b / (||a|| ||b||), measures angle, scale-invariant, the default for text), L2 distance (√Σ(aᵢ−bᵢ)², measures straight-line gap, used when magnitudes carry meaning), and inner product (a·b, the cheapest, equivalent to cosine when both vectors are unit-normalised). Modern embeddings are 384 to 3072 dimensions; brute-force scanning all N of them on every query is O(N·d) and dies past about a hundred thousand vectors on a CPU. Worse, the curse of dimensionality says that in high d the ratio between the farthest and nearest vector to your query collapses toward 1 — neighbours stop being meaningfully different from non-neighbours. Both problems — speed and statistical concentration — are why the next chapters introduce vector indexes (flat, IVF, HNSW, PQ) that trade exactness for orders-of-magnitude speedups. This chapter sets up the vocabulary; the rest of Build 19 builds the indexes.

You have spent eighteen builds learning to retrieve data by exact match. B-trees find rows where customer_id = 7421. Inverted indexes find documents where the term samsung appears literally. Bloom filters answer "definitely-not / maybe" for an exact key. Every one of these structures rests on a binary question: are these two things the same string of bytes, yes or no?

Now consider the queries that never quite fit that shape. A user types comfortable jogging shoes into your e-commerce search box, and you have a product titled Reebok Floatride Energy 4 — lightweight running sneakers for daily training. There is not a single overlapping word. BM25, the king of keyword ranking, scores this match at zero. The user, looking at the result, would call it a perfect match. Why: humans search by meaning; the inverted index searches by spelling. The gap between the two is the gap that vector search exists to close.

This chapter opens Build 19, the build where you stop matching strings and start matching meanings. By the end of it, three pieces of vocabulary will be solid: what an embedding is, what a similarity metric computes, and what the curse of dimensionality takes away from you. The next five chapters build the indexes — flat search, IVF, HNSW, product quantization, pre/post filtering — that make this idea fast enough to ship.

What an embedding actually is

An embedding is the output of a learned function f: item → ℝ^d, where d is some fixed integer (today, usually between 384 and 3072) and the items might be words, sentences, paragraphs, images, audio clips, product SKUs, or user profiles. The function is learned in the machine-learning sense — somebody trained a neural network on a giant corpus until the network started producing vectors with one specific property: semantically similar items get geometrically close vectors.

That is the whole pitch. Read it twice. There is no rule of physics that says the word motorcycle must end up near the word bike in some 768-dimensional Euclidean space — it ends up there because Word2Vec, BERT, or whichever model trained itself on enough sentences to notice that motorcycle and bike show up in similar sentences (I rode my ___ to college), and the training procedure punishes it whenever it places these two words far apart. After enough billions of training examples, the network's internal representation of language ends up with this geometric property: distance in vector space ≈ semantic distance.

Each item — a word, a phrase, a product description, even an image — is mapped by an embedding model to a point in a high-dimensional space. Items that mean similar things land in the same neighbourhood; items that mean different things land far apart. Multimodal models like CLIP put text and images in the *same* space, so a photo of a motorcycle ends up near the *string* "motorcycle". Once your data lives here, "find me similar things" becomes "find me nearby points".

Where embeddings come from

There are four broad families of embedding model you will meet in practice, in roughly chronological order.

Word embeddings (legacy). Mikolov's Word2Vec (2013) was the first popular embedding model. It produces a vector per word — typically 300 dimensions — using a shallow neural net trained to predict surrounding words. Stanford's GloVe (2014) used global word co-occurrence statistics to similar effect. These are still useful for keyword-level tasks but cannot embed a sentence sensibly: averaging the word vectors of not bad gives you something near not and bad, which is the opposite of the sentence's meaning. Why: word vectors are positionally and compositionally blind — they have no notion of negation, scope, or word order.

Sentence and document embeddings (the modern default). Sentence-BERT (Reimers and Gurevych, 2019) showed how to fine-tune BERT into producing one fixed vector per sentence such that cosine similarity between vectors corresponds to semantic similarity between sentences. It runs locally, is open-source, and produces 384-dimensional vectors fast enough for production. The hosted versions — OpenAI text-embedding-3, Cohere embed-english-v3, Voyage voyage-3 — go further: 1536 to 3072 dimensions, slightly better quality, but you pay per call and your data leaves your server.

Multimodal embeddings. OpenAI's CLIP (2021) and successors train two networks — one for images, one for text — and force them to produce the same vector for a (caption, image) pair. Result: you can search images using text queries, or find text that matches an image, all using cosine similarity in the same shared space. This is what powers reverse-image search, "find products that look like this photo", and the image search inside Pinterest and Google Lens.

Recommendation embeddings. Trained on user interaction data — clicks, purchases, watch time. The vector for a Bollywood film like RRR lands near Bahubali not because their plots are similar (they kind of are) but because users who liked one tended to watch the other. Two-tower networks (one tower for users, one for items) are the canonical architecture; YouTube, Spotify, Flipkart, and Netflix all run variations on this.

Typical dimensions in 2026

Model	`d`	Notes
Word2Vec, GloVe	50–300	Legacy, word-level
Sentence-BERT (mini)	384	Local, free, the workhorse
BERT-base	768	The classic baseline
OpenAI `text-embedding-ada-002`	1536	The 2022–2023 default
OpenAI `text-embedding-3-small`	1536	2024 release, cheaper, better
OpenAI `text-embedding-3-large`	3072	Best-in-class, expensive
CLIP ViT-L/14	768	Multimodal text + image

For a hundred million items, a 1536-dim float32 embedding occupies 1536 × 4 = 6 KB per vector — 600 GB total. A 384-dim small model brings that down to 1.5 KB per vector or 150 GB. Why dimension matters for storage: every coordinate is typically a 32-bit float, so memory grows linearly with d. Compression schemes in chapter 155 cut this 8–32× by quantising floats to bytes.

The three similarity metrics

Once your items are points in ℝ^d, "similar" needs a definition. There are three that matter in practice, and the difference between them is the difference between a system that retrieves the right things and one that retrieves embarrassing things.

The three metrics are not interchangeable. Cosine measures the angle between two vectors and ignores their magnitudes — perfect for text embeddings, where a longer document does not mean a "bigger" topic. L2 measures absolute distance; magnitudes count, which is what you want when the vectors encode positions or physical quantities. Inner product is just the numerator of cosine — fastest to compute, and equivalent to cosine if you have already unit-normalised your vectors at index time, which is the cheap trick most production systems use.

Cosine similarity

\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}

Range: [-1, 1]. 1 means identical direction, 0 means orthogonal (unrelated), -1 means opposite. The denominator divides out the magnitudes, leaving only the angle. Why this matters for text: a one-sentence email and a thousand-word essay on the same topic should be considered "the same topic". The thousand-word vector has bigger raw magnitude (more total signal), but the direction — what the document is about — is what semantic search cares about. Cosine throws away the magnitude and keeps the direction.

This is the default metric for almost every text embedding model in 2026. If a model card does not specify, assume cosine.

L2 (Euclidean) distance

d_2(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_i (a_i - b_i)^2}

The straight-line distance between two points. Smaller = more similar. Range: [0, ∞). Use it when magnitude carries information — for instance, image embeddings where the model was trained with L2 loss, or when your "vectors" are actually physical positions (lat/long, GPS traces). FAISS defaults to L2; pgvector lets you pick. Why pick L2 over cosine in image work: convolutional networks trained with contrastive L2 loss produce embeddings whose magnitudes encode confidence — a brighter, sharper image gets a longer vector. Cosine would throw that signal away.

Inner product (dot product)

\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i

The cheapest of the three — no square roots, no division. Larger = more similar (note the flip; it is not a distance, it is a similarity score). Range: (-∞, ∞) in general. Used when:

Vectors are already unit-normalised (i.e., ||a|| = ||b|| = 1). Then a·b = cos θ exactly, so inner product is cosine, and you save the normalisation cost on every query.
Recommendation systems with magnitude-as-popularity, where a longer item vector means "popular item" and should win ties. Maximum inner product search (MIPS) is its own active research area.

The unification

Here is the trick most production systems use: normalise every vector to unit length at index time, store them that way, and use inner product at query time. You get cosine semantics with dot-product speed.

If ||a|| = ||b|| = 1:

cos θ = a · b (cosine equals inner product)
||a − b||² = 2 − 2 (a·b) = 2 − 2 cos θ (L2² is monotonic in cosine)

All three metrics become rank-equivalent — they will return the same top-k in the same order. Why this matters: it lets a single index data structure (HNSW, IVF) serve any of the three metrics interchangeably. The choice of metric becomes a one-line config flag, not a re-indexing job.

Computing them: a concrete example

An Indian product catalogue with Sentence-BERT. You run a Flipkart-scale e-commerce site. You have 100,000 product descriptions and you want a "find similar products" feature plus a semantic search box that handles comfortable jogging shoes even when the product title says Reebok Floatride lightweight running sneakers.

Step 1: Embed everything once, offline.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim, runs on CPU
descriptions = [
    "Reebok Floatride Energy 4 — lightweight running sneakers for daily training",
    "Bata formal black leather shoes for office wear",
    "Royal Enfield Classic 350 motorcycle, single-cylinder 349cc",
    "Penguin paperback edition of Midnight's Children by Salman Rushdie",
    # ... 99,996 more
]
# Encode all 100K and L2-normalise so we can use dot product later
embeddings = model.encode(descriptions, normalize_embeddings=True)
print(embeddings.shape)        # (100000, 384)
print(np.linalg.norm(embeddings[0]))  # 1.0 (unit-normalised)

The encode step takes a few minutes for 100K descriptions on a CPU, an order of magnitude less on a GPU. You do this once and persist the resulting (100000, 384) matrix to disk (NumPy .npy, Parquet, or directly into pgvector / Pinecone). Why one-time: embeddings only change when the model changes or the description changes; product catalogue churn is slow.

Step 2: Embed the query and compute all three similarities.

query = "comfortable jogging shoes"
q = model.encode([query], normalize_embeddings=True)[0]   # shape (384,)

# Cosine = inner product when both sides are unit-normalised
cosine_scores = embeddings @ q                # (100000,) — one matrix-vector mul

# L2 distance, computed without normalisation assumption
l2_distances = np.linalg.norm(embeddings - q, axis=1)   # (100000,)

# Inner product (identical to cosine here because we normalised)
ip_scores = embeddings.dot(q)

The single line embeddings @ q is doing 100,000 × 384 = 38.4 million multiplications and additions. On a modern x86 CPU with NumPy backed by BLAS, this runs in about 30–80 ms — fast enough for a prototype, painful for production.

Step 3: Pick the top-5.

top5_idx = np.argpartition(-cosine_scores, 5)[:5]
top5_idx = top5_idx[np.argsort(-cosine_scores[top5_idx])]

for i in top5_idx:
    print(f"{cosine_scores[i]:.3f}  {descriptions[i]}")

Sample output:

0.812  Reebok Floatride Energy 4 — lightweight running sneakers for daily training
0.794  Asics Gel-Nimbus 25 cushioned running shoes, men's road running
0.771  Nike Pegasus 40 daily trainer with React foam, breathable mesh
0.733  Adidas Ultraboost 22 men's running shoes, energy return midsole
0.689  Puma Velocity Nitro 2 men's neutral running shoes

Notice the relevance: not one result contains the literal word comfortable or the literal word jogging. BM25 would have ranked all of these at zero. The embedding model knows that "jogging" and "running" live in the same neighbourhood of ℝ³⁸⁴, and that "comfortable" is encoded in adjectives like "cushioned", "energy return", "breathable mesh". This is the entire reason vector search exists.

Step 4: Verify the metric equivalence.

# Top-5 by L2 should match top-5 by cosine when vectors are unit-normalised
top5_l2 = np.argpartition(l2_distances, 5)[:5]
top5_l2 = top5_l2[np.argsort(l2_distances[top5_l2])]
assert set(top5_l2.tolist()) == set(top5_idx.tolist())   # passes

# And the algebraic identity ||a−b||² = 2 − 2(a·b) for unit vectors
for i in top5_idx[:3]:
    lhs = l2_distances[i] ** 2
    rhs = 2 - 2 * cosine_scores[i]
    print(f"{lhs:.6f}  vs  {rhs:.6f}")   # match to floating-point precision

The brutal arithmetic. That embeddings @ q step takes about 800 ms on a single-core CPU at 100M vectors instead of 100K — every query, every user, every time. Even with all eight cores busy, you are still in the 100 ms range and burning the entire CPU on one query. The vector indexes in the next four chapters drop this to 5 ms with 95 %+ recall. That is the entire reason Build 19 exists.

The curse of dimensionality

Brute-force search is too slow at scale, but that is the easy problem — engineering solves slow problems. The harder problem is statistical, and it has a name: the curse of dimensionality, characterised in the database community by Beyer, Goldstein, Ramakrishnan, and Shaft (1999), in a paper with the wonderful title "When is 'Nearest Neighbour' Meaningful?".

The result, in one sentence: as d → ∞, under mild assumptions on the data distribution, the ratio of the farthest point's distance from your query to the nearest point's distance approaches 1.

\lim_{d \to \infty} \frac{\max_i \|q - x_i\| - \min_i \|q - x_i\|}{\min_i \|q - x_i\|} \to 0

In English: in a million-dimensional space, every point is roughly the same distance from your query as every other point. The concept of "nearest" stops carrying useful information.

In two dimensions, your nearest neighbour stands out clearly from random points and the distance histogram is broad. In a thousand dimensions, the histogram collapses to a sharp peak — every point is approximately the same distance from your query, and the difference between "1st nearest" and "100th nearest" is razor-thin in raw distance. The query for a vector index is no longer "find the obvious nearest neighbour" but "find one of the slightly-closer points among a sea of indistinguishable ones".

You can verify this in three lines of NumPy:

import numpy as np
def concentration(d, N=10000):
    X = np.random.randn(N, d); q = np.random.randn(d)
    dists = np.linalg.norm(X - q, axis=1)
    return dists.max() / dists.min()

for d in [2, 10, 100, 1000, 10000]:
    print(f"d={d:>5}  max/min ratio = {concentration(d):.3f}")

Run it. You will see something like:

d=    2  max/min ratio = 47.213
d=   10  max/min ratio = 5.184
d=  100  max/min ratio = 1.527
d= 1000  max/min ratio = 1.149
d=10000  max/min ratio = 1.046

In 2D, the farthest random point is forty-seven times farther than the nearest. In 10,000D, it is barely 5 % farther. Why this happens geometrically: in d dimensions the volume of a unit ball is concentrated in a thin shell near its surface, and most of the probability mass of any reasonable distribution sits at distance roughly √d from the centre. The variance of distances grows much slower than the mean, so the relative spread shrinks.

What this means for vector search

Two practical implications, and they are the load-bearing reasons Build 19 exists:

Brute-force exact k-NN gives weakly meaningful answers. If the top-1 vector is at distance 0.413 and the top-100 vector is at distance 0.421, the top-1 is barely a winner. Tiny noise — re-tokenisation, model fine-tune, a different random seed — swaps them. Insisting on exact k-NN is buying false precision.
You cannot prune efficiently using one-dimensional bounds. B-trees, hash indexes, and range scans all work because in 1D you can split data on a single coordinate and discard half of it. In 1000D, splitting on coordinate 47 hardly tells you anything about distance — a point that is "small" in coordinate 47 might be "large" in coordinate 312, and the total distance is dominated by the sum across all coordinates. Classic database indexes do not help here.

The solution to both problems is the same: stop computing exact nearest neighbours and start computing approximate ones, with carefully tuned data structures that accept (say) 95 % recall in exchange for 100× speedup. This trade is called approximate nearest neighbour search (ANN), and the next four chapters are nothing but ANN structures: flat search (no index, the baseline), IVF (k-means clusters), HNSW (hierarchical small-world graphs), and product quantization (compression). Each one trades a slice of recall for a slice of speed; the engineer's job is to pick the right point on that curve for your latency, memory, and accuracy budget.

What Build 19 builds

Five chapters, each one stacking on the previous:

Chapter 152 — Flat search and why it doesn't scale. The baseline: scan all N vectors on every query. Exact, simple, and O(N·d). Useful up to a few hundred thousand vectors and as the ground truth against which approximate methods are measured.
Chapter 153 — IVF (Inverted File Indexes for vectors). Cluster your vectors into k groups using k-means at index time. At query time, search only the nprobe clusters closest to the query. 10–50× faster than flat with 90–95 % recall.
Chapter 154 — HNSW (Hierarchical Navigable Small World). A multi-layer graph where each node is connected to its near neighbours; queries navigate down the layers like a skip list. The state of the art for in-memory vector search; what Pinecone, Weaviate, Qdrant, and pgvector all use.
Chapter 155 — Product Quantization (PQ) and DiskANN. Compress each 1536-dim float32 vector down to 64 bytes by splitting it into chunks and quantising each chunk against a learned codebook. Essential for billion-scale.
Chapter 156 — Filtered search. "Find similar products that ship to Bangalore and cost under ₹3000." Combining a metadata filter with a vector search is harder than it looks — pre-filter and post-filter each fail in different ways.
Chapter 157 — Real systems compared. pgvector vs Pinecone vs Weaviate vs Qdrant: who picks which index, what they charge, and where each one is the right answer.

Going-deeper

The story above treated embeddings as fixed objects and metrics as the choice of measurement. Both views simplify reality. The deeper questions are: where do good embeddings actually come from, and is there a better metric than the three above?

How embedding models are actually trained

Modern sentence and document embedding models are trained with contrastive learning: take a pair (query, positive_passage) and a batch of (query, negative_passage) and adjust the network weights so that cos(query, positive) > cos(query, negative). The training data comes from things like search query logs (where a clicked result is a positive and a non-clicked one is a negative), question-answer pairs from Stack Overflow, and human-annotated triples. The famous triplet loss and InfoNCE loss are different shapes of this same idea.

The choice of negatives matters enormously. Random negatives (passages from a random other query) teach the model little — they are too easy to distinguish. Hard negatives (passages that look textually similar to the positive but are semantically wrong) are where the real signal comes from. Constructing hard negatives is half the engineering of a good embedding model.

Calibration is not what you think

You will be tempted to use cosine similarity scores as confidences — "0.83 is highly similar, 0.42 is barely related". This is wrong in two ways:

Score distributions are model-specific. OpenAI text-embedding-3-small typically scores everything in [0.2, 0.6]; Sentence-BERT spreads more widely; CLIP image-to-text scores rarely exceed 0.35 even for perfect matches. There is no universal threshold.
Scores are not probabilities. A cosine of 0.7 does not mean "70 % related". To get calibrated confidences, train a small re-ranker on top of the retrieval candidates — typically a cross-encoder model that takes both texts as joint input and outputs a learned probability. Sentence-BERT's authors recommend exactly this two-stage architecture: bi-encoder retrieval (fast, approximate) followed by cross-encoder re-ranking (slow, accurate, on the top-100 only).

Beyond the three: learned metrics and Mahalanobis distance

Cosine, L2, and inner product all treat every dimension equally. But the embedding model probably did not — some dimensions encode load-bearing semantic axes, others encode noise. The Mahalanobis distance weights each dimension by the inverse of its variance, which can give a better-calibrated similarity for distributions with very anisotropic geometry. In practice this is rarely worth the trouble for modern embeddings (the models are already trained to make cosine work) but it shows up in classical pattern recognition and in some recommender systems.

A more modern direction is learned similarity: train a small neural net s(a, b) that takes two embeddings and outputs a similarity score directly. This is exactly what cross-encoders are. They are too slow for first-stage retrieval over millions of candidates but are the standard for the re-ranking stage.

Why dimensions keep growing

Why has the industry moved from 300-dim Word2Vec to 3072-dim text-embedding-3-large? The naive answer — "more dimensions = more capacity = better quality" — is partially right but understates the cost. Each doubling of d doubles your storage, doubles your CPU cost per similarity, and (mildly) worsens the curse of dimensionality.

The real reason is that modern embedding tasks are plural: a single 3072-dim vector is now expected to encode topic, sentiment, formality, language, factual entities, code structure, and more, all in shared space. With 384 dimensions there is not enough room to keep these signals from clobbering each other. With 3072, the model can dedicate sub-spaces to each. The 2024 trick of Matryoshka representation learning (used by OpenAI text-embedding-3) lets you truncate the vector to any prefix length you want — use the full 3072 dims when latency-tolerant, the first 256 dims when latency-bound, and the geometry remains usable at every prefix.

What you should walk away with

Three things, in this order:

An embedding is f: item → ℝ^d chosen so that semantically close items have geometrically close vectors. Modern models (Sentence-BERT, OpenAI text-embedding-3, CLIP) produce these in 384–3072 dimensions, and the best are multimodal.
Three similarity metrics matter — cosine, L2, inner product. They become rank-equivalent when vectors are unit-normalised, which is the trick that lets one index serve all three.
The curse of dimensionality says brute-force k-NN is both too slow and statistically thin in high d. The next four chapters introduce the indexes that fix the speed problem and accept the approximation.

You now know what an embedding is. The next chapter (flat search) shows you what not to do, and why doing it anyway is sometimes still the right answer.

References

Mikolov et al., Efficient Estimation of Word Representations in Vector Space (Word2Vec, 2013) — the paper that started modern word embeddings.
Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019) — how to turn BERT into a sentence embedder.
OpenAI, New embedding models and API updates (text-embedding-3 docs) — production embedding API and Matryoshka representations.
Beyer, Goldstein, Ramakrishnan, Shaft, When Is "Nearest Neighbor" Meaningful? (1999) — the canonical curse-of-dimensionality paper for databases.
Pinecone, Embeddings and Vector Search tutorial — a hands-on walkthrough with code.
Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021) — the multimodal text+image embedding paper.