Note: Company names, engineers, incidents, numbers, and scaling scenarios in this article are hypothetical — even when they resemble real ones. See the full disclaimer.
In short
An embedding is a function item → vector ∈ ℝ^d chosen so that items with similar meaning land at similar coordinates, turning "find similar" into "find nearby points". Three metrics define nearby — cosine, L2, inner product — and they become rank-equivalent once vectors are unit-normalised. The catch is the curse of dimensionality: in high d, the farthest and nearest vectors to your query collapse toward the same distance, which is exactly why brute-force k-NN is both slow and statistically thin, and why the rest of Build 19 builds approximate indexes.
A user types comfortable jogging shoes into your search box, and your best-selling product is titled Reebok Floatride Energy 4 — lightweight running sneakers. Not one overlapping word. BM25 scores this match at zero; the user calls it perfect. That gap — between matching spelling and matching meaning — is what vector search exists to close.
You have spent eighteen builds learning to retrieve data by exact match. B-trees find rows where customer_id = 7421. Inverted indexes find documents where the term samsung appears literally. Bloom filters answer "definitely-not / maybe" for an exact key. Every one of these structures rests on a binary question: are these two things the same string of bytes, yes or no? Why this chapter matters: humans search by meaning; the inverted index searches by spelling. The next five chapters build the indexes — flat search, IVF, HNSW, product quantization, pre/post filtering — that make meaning-search fast enough to ship.
What an embedding actually is
An embedding is the output of a learned function f: item → ℝ^d, where d is some fixed integer (today, usually between 384 and 3072) and the items might be words, sentences, paragraphs, images, audio clips, product SKUs, or user profiles. The function is learned in the machine-learning sense — somebody trained a neural network on a giant corpus until the network started producing vectors with one specific property: semantically similar items get geometrically close vectors.
That is the whole pitch. Read it twice. There is no rule of physics that says the word motorcycle must end up near the word bike in some 768-dimensional Euclidean space — it ends up there because Word2Vec, BERT, or whichever model trained itself on enough sentences to notice that motorcycle and bike show up in similar sentences (I rode my ___ to college), and the training procedure punishes it whenever it places these two words far apart. After enough billions of training examples, the network's internal representation of language ends up with this geometric property: distance in vector space ≈ semantic distance.
Where embeddings come from
There are four broad families of embedding model you will meet in practice, in roughly chronological order.
Word embeddings (legacy). Mikolov's Word2Vec (2013) was the first popular embedding model. It produces a vector per word — typically 300 dimensions — using a shallow neural net trained to predict surrounding words. Stanford's GloVe (2014) used global word co-occurrence statistics to similar effect. These are still useful for keyword-level tasks but cannot embed a sentence sensibly: averaging the word vectors of not bad gives you something near not and bad, which is the opposite of the sentence's meaning. Why: word vectors are positionally and compositionally blind — they have no notion of negation, scope, or word order.
Sentence and document embeddings (the modern default). Sentence-BERT (Reimers and Gurevych, 2019) showed how to fine-tune BERT into producing one fixed vector per sentence such that cosine similarity between vectors corresponds to semantic similarity between sentences. It runs locally, is open-source, and produces 384-dimensional vectors fast enough for production. The hosted versions — OpenAI text-embedding-3, Cohere embed-english-v3, Voyage voyage-3 — go further: 1536 to 3072 dimensions, slightly better quality, but you pay per call and your data leaves your server.
Multimodal embeddings. OpenAI's CLIP (2021) and successors train two networks — one for images, one for text — and force them to produce the same vector for a (caption, image) pair. Result: you can search images using text queries, or find text that matches an image, all using cosine similarity in the same shared space. This is what powers reverse-image search, "find products that look like this photo", and the image search inside Pinscope and Querion Lens.
Recommendation embeddings. Trained on user interaction data — clicks, purchases, watch time. The vector for a Bollywood film like RRR lands near Bahubali not because their plots are similar (they kind of are) but because users who liked one tended to watch the other. Two-tower networks (one tower for users, one for items) are the canonical architecture; YouTube, Tunestack, BharatBazaar, and Streamora all run variations on this.
Typical dimensions in 2026
| Model | d |
Notes |
|---|---|---|
| Word2Vec, GloVe | 50–300 | Legacy, word-level |
| Sentence-BERT (mini) | 384 | Local, free, the workhorse |
| BERT-base | 768 | The classic baseline |
OpenAI text-embedding-ada-002 |
1536 | The 2022–2023 default |
OpenAI text-embedding-3-small |
1536 | 2024 release, cheaper, better |
OpenAI text-embedding-3-large |
3072 | Best-in-class, expensive |
| CLIP ViT-L/14 | 768 | Multimodal text + image |
For a hundred million items, a 1536-dim float32 embedding occupies 1536 × 4 = 6 KB per vector — 600 GB total. A 384-dim small model brings that down to 1.5 KB per vector or 150 GB. Why dimension matters for storage: every coordinate is typically a 32-bit float, so memory grows linearly with d. Compression schemes in chapter 155 cut this 8–32× by quantising floats to bytes.
The three similarity metrics
Once your items are points in ℝ^d, "similar" needs a definition. There are three that matter in practice, and the difference between them is the difference between a system that retrieves the right things and one that retrieves embarrassing things.
Cosine similarity
Range: [-1, 1]. 1 means identical direction, 0 means orthogonal (unrelated), -1 means opposite. The denominator divides out the magnitudes, leaving only the angle. Why this matters for text: a one-sentence email and a thousand-word essay on the same topic should be considered "the same topic". The thousand-word vector has bigger raw magnitude (more total signal), but the direction — what the document is about — is what semantic search cares about. Cosine throws away the magnitude and keeps the direction.
This is the default metric for almost every text embedding model in 2026. If a model card does not specify, assume cosine.
L2 (Euclidean) distance
The straight-line distance between two points. Smaller = more similar. Range: [0, ∞). Use it when magnitude carries information — for instance, image embeddings where the model was trained with L2 loss, or when your "vectors" are actually physical positions (lat/long, GPS traces). FAISS defaults to L2; pgvector lets you pick. Why pick L2 over cosine in image work: convolutional networks trained with contrastive L2 loss produce embeddings whose magnitudes encode confidence — a brighter, sharper image gets a longer vector. Cosine would throw that signal away.
Inner product (dot product)
The cheapest of the three — no square roots, no division. Larger = more similar (note the flip; it is not a distance, it is a similarity score). Range: (-∞, ∞) in general. Used when:
- Vectors are already unit-normalised (i.e.,
||a|| = ||b|| = 1). Thena·b = cos θexactly, so inner product is cosine, and you save the normalisation cost on every query. - Recommendation systems with magnitude-as-popularity, where a longer item vector means "popular item" and should win ties. Maximum inner product search (MIPS) is its own active research area.
The unification
Here is the trick most production systems use: normalise every vector to unit length at index time, store them that way, and use inner product at query time. You get cosine semantics with dot-product speed.
If ||a|| = ||b|| = 1:
cos θ = a · b(cosine equals inner product)||a − b||² = 2 − 2 (a·b) = 2 − 2 cos θ(L2² is monotonic in cosine)
All three metrics become rank-equivalent — they will return the same top-k in the same order. Why this matters: it lets a single index data structure (HNSW, IVF) serve any of the three metrics interchangeably. The choice of metric becomes a one-line config flag, not a re-indexing job.
Computing them: a concrete example
An Indian product catalogue with Sentence-BERT. You run a BharatBazaar-scale e-commerce site. You have 100,000 product descriptions and you want a "find similar products" feature plus a semantic search box that handles comfortable jogging shoes even when the product title says Reebok Floatride lightweight running sneakers.
Step 1: Embed everything once, offline.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2') # 384-dim, runs on CPU
descriptions = [
"Reebok Floatride Energy 4 — lightweight running sneakers for daily training",
"Bata formal black leather shoes for office wear",
"BharatBike Classic 350 motorcycle, single-cylinder 349cc",
"Penguin paperback edition of Midnight's Children by Salman Rushdie",
# ... 99,996 more
]
# Encode all 100K and L2-normalise so we can use dot product later
embeddings = model.encode(descriptions, normalize_embeddings=True)
print(embeddings.shape) # (100000, 384)
print(np.linalg.norm(embeddings[0])) # 1.0 (unit-normalised)
The encode step takes a few minutes for 100K descriptions on a CPU, an order of magnitude less on a GPU. You do this once and persist the resulting (100000, 384) matrix to disk (NumPy .npy, Parquet, or directly into pgvector / Pinecone). Why one-time: embeddings only change when the model changes or the description changes; product catalogue churn is slow.
Step 2: Embed the query and compute all three similarities.
query = "comfortable jogging shoes"
q = model.encode([query], normalize_embeddings=True)[0] # shape (384,)
# Cosine = inner product when both sides are unit-normalised
cosine_scores = embeddings @ q # (100000,) — one matrix-vector mul
# L2 distance, computed without normalisation assumption
l2_distances = np.linalg.norm(embeddings - q, axis=1) # (100000,)
# Inner product (identical to cosine here because we normalised)
ip_scores = embeddings.dot(q)
The single line embeddings @ q is doing 100,000 × 384 = 38.4 million multiplications and additions. On a modern x86 CPU with NumPy backed by BLAS, this runs in about 30–80 ms — fast enough for a prototype, painful for production.
Step 3: Pick the top-5.
top5_idx = np.argpartition(-cosine_scores, 5)[:5]
top5_idx = top5_idx[np.argsort(-cosine_scores[top5_idx])]
for i in top5_idx:
print(f"{cosine_scores[i]:.3f} {descriptions[i]}")
Sample output:
0.812 Reebok Floatride Energy 4 — lightweight running sneakers for daily training
0.794 Asics Gel-Nimbus 25 cushioned running shoes, men's road running
0.771 Nike Pegasus 40 daily trainer with React foam, breathable mesh
0.733 Adidas Ultraboost 22 men's running shoes, energy return midsole
0.689 Puma Velocity Nitro 2 men's neutral running shoes
Notice the relevance: not one result contains the literal word comfortable or the literal word jogging. BM25 would have ranked all of these at zero. The embedding model knows that "jogging" and "running" live in the same neighbourhood of ℝ³⁸⁴, and that "comfortable" is encoded in adjectives like "cushioned", "energy return", "breathable mesh". This is the entire reason vector search exists.
Step 4: Verify the metric equivalence.
# Top-5 by L2 should match top-5 by cosine when vectors are unit-normalised
top5_l2 = np.argpartition(l2_distances, 5)[:5]
top5_l2 = top5_l2[np.argsort(l2_distances[top5_l2])]
assert set(top5_l2.tolist()) == set(top5_idx.tolist()) # passes
# And the algebraic identity ||a−b||² = 2 − 2(a·b) for unit vectors
for i in top5_idx[:3]:
lhs = l2_distances[i] ** 2
rhs = 2 - 2 * cosine_scores[i]
print(f"{lhs:.6f} vs {rhs:.6f}") # match to floating-point precision
The brutal arithmetic. That embeddings @ q step takes about 800 ms on a single-core CPU at 100M vectors instead of 100K — every query, every user, every time. Even with all eight cores busy, you are still in the 100 ms range and burning the entire CPU on one query. The vector indexes in the next four chapters drop this to 5 ms with 95 %+ recall. That is the entire reason Build 19 exists.
The curse of dimensionality
Brute-force search is too slow at scale, but that is the easy problem — engineering solves slow problems. The harder problem is statistical, and it has a name: the curse of dimensionality, characterised in the database community by Beyer, Goldstein, Ramakrishnan, and Shaft (1999), in a paper with the wonderful title "When is 'Nearest Neighbour' Meaningful?".
The result, in one sentence: as d → ∞, under mild assumptions on the data distribution, the ratio of the farthest point's distance from your query to the nearest point's distance approaches 1.
In English: in a million-dimensional space, every point is roughly the same distance from your query as every other point. The concept of "nearest" stops carrying useful information.
You can verify this in three lines of NumPy:
import numpy as np
def concentration(d, N=10000):
X = np.random.randn(N, d); q = np.random.randn(d)
dists = np.linalg.norm(X - q, axis=1)
return dists.max() / dists.min()
for d in [2, 10, 100, 1000, 10000]:
print(f"d={d:>5} max/min ratio = {concentration(d):.3f}")
Run it. You will see something like:
d= 2 max/min ratio = 47.213
d= 10 max/min ratio = 5.184
d= 100 max/min ratio = 1.527
d= 1000 max/min ratio = 1.149
d=10000 max/min ratio = 1.046
In 2D, the farthest random point is forty-seven times farther than the nearest. In 10,000D, it is barely 5 % farther. Why this happens geometrically: in d dimensions the volume of a unit ball is concentrated in a thin shell near its surface, and most of the probability mass of any reasonable distribution sits at distance roughly √d from the centre. The variance of distances grows much slower than the mean, so the relative spread shrinks.
What this means for vector search
Two practical implications, and they are the load-bearing reasons Build 19 exists:
-
Brute-force exact k-NN gives weakly meaningful answers. If the top-1 vector is at distance
0.413and the top-100 vector is at distance0.421, the top-1 is barely a winner. Tiny noise — re-tokenisation, model fine-tune, a different random seed — swaps them. Insisting on exact k-NN is buying false precision. -
You cannot prune efficiently using one-dimensional bounds. B-trees, hash indexes, and range scans all work because in 1D you can split data on a single coordinate and discard half of it. In 1000D, splitting on coordinate 47 hardly tells you anything about distance — a point that is "small" in coordinate 47 might be "large" in coordinate 312, and the total distance is dominated by the sum across all coordinates. Classic database indexes do not help here.
The solution to both problems is the same: stop computing exact nearest neighbours and start computing approximate ones, with carefully tuned data structures that accept (say) 95 % recall in exchange for 100× speedup. This trade is called approximate nearest neighbour search (ANN), and the next four chapters are nothing but ANN structures: flat search (no index, the baseline), IVF (k-means clusters), HNSW (hierarchical small-world graphs), and product quantization (compression). Each one trades a slice of recall for a slice of speed; the engineer's job is to pick the right point on that curve for your latency, memory, and accuracy budget.
What Build 19 builds
Five chapters, each one stacking on the previous:
- Chapter 152 — Flat search and why it doesn't scale. The baseline: scan all
Nvectors on every query. Exact, simple, and O(N·d). Useful up to a few hundred thousand vectors and as the ground truth against which approximate methods are measured. - Chapter 153 — IVF (Inverted File Indexes for vectors). Cluster your vectors into
kgroups using k-means at index time. At query time, search only thenprobeclusters closest to the query. 10–50× faster than flat with 90–95 % recall. - Chapter 154 — HNSW (Hierarchical Navigable Small World). A multi-layer graph where each node is connected to its near neighbours; queries navigate down the layers like a skip list. The state of the art for in-memory vector search; what Pinecone, Weaviate, Qdrant, and pgvector all use.
- Chapter 155 — Product Quantization (PQ) and DiskANN. Compress each 1536-dim float32 vector down to 64 bytes by splitting it into chunks and quantising each chunk against a learned codebook. Essential for billion-scale.
- Chapter 156 — Filtered search. "Find similar products that ship to Bangalore and cost under ₹3000." Combining a metadata filter with a vector search is harder than it looks — pre-filter and post-filter each fail in different ways.
- Chapter 157 — Real systems compared. pgvector vs Pinecone vs Weaviate vs Qdrant: who picks which index, what they charge, and where each one is the right answer.
Common confusions
-
"Cosine and dot product are different metrics, so I should benchmark both." Only true when vectors are not unit-normalised. Once you
normalize_embeddings=Trueat index time — which is what every modern Sentence-BERT, OpenAI, and Cohere pipeline recommends —a · bandcos θare literally the same number. Benchmarking both is benchmarking the same thing twice. Pick inner product for speed; the rank order is identical. -
"A cosine score of 0.85 means '85 % similar'." No. Cosine returns a real number in
[-1, 1], and the distribution of scores is model-specific. OpenAItext-embedding-3-smallrarely produces scores below 0.2 even for unrelated text; CLIP image-text scores rarely exceed 0.35 even for perfect matches. There is no universal threshold. To get a calibrated probability, run a cross-encoder re-ranker on the top candidates — the cosine score is a ranking signal, not a confidence. -
"More dimensions = better embeddings." Half-true. More dimensions buys capacity for plural semantic signals (topic + sentiment + entity + style) but also doubles your storage, doubles your similarity-compute cost, and pushes you deeper into the curse of dimensionality. The 2024 Matryoshka trick lets you truncate
text-embedding-3-largeto its first 256 dims with surprisingly little quality loss — proof that most of the action is in the early dimensions. -
"The curse of dimensionality means vector search doesn't work in high
d." This is the most common misreading of Beyer et al. Their result is about random data; real embeddings are emphatically not random — they live on a low-dimensional manifold inside ℝ^d, with semantic clusters and meaningful structure. The curse predicts that brute-force exact k-NN gives ratios like 1.05 between top-1 and top-100, which is true and is exactly why approximate indexes (HNSW, IVF) work — they exploit the manifold structure that uniform-distribution analysis ignores. -
"L2 distance and cosine similarity will return different top-k results." Not when both vectors are unit-normalised. The identity
‖a − b‖² = 2 − 2(a·b)makes L2 a strictly monotonic function of cosine on the unit sphere, so the rank order is identical. FAISS will return the same top-k whether you index withIndexFlatL2orIndexFlatIPas long as youfaiss.normalize_L2()before adding. Forget that step and they diverge. -
"Embeddings are deterministic, so the same input always gives the same vector." True for one fixed model checkpoint, false the moment you upgrade.
text-embedding-ada-002andtext-embedding-3-smallproduce vectors in different geometries — you cannot mix them in the same index. Whenever you change the embedding model, every existing vector in your store becomes stale and must be re-embedded. This is one of the load-bearing operational pains of running a vector database in production.
Going-deeper
The story above treated embeddings as fixed objects and metrics as the choice of measurement. Both views simplify reality. The deeper questions are: where do good embeddings actually come from, and is there a better metric than the three above?
How embedding models are actually trained
Modern sentence and document embedding models are trained with contrastive learning: take a pair (query, positive_passage) and a batch of (query, negative_passage) and adjust the network weights so that cos(query, positive) > cos(query, negative). The training data comes from things like search query logs (where a clicked result is a positive and a non-clicked one is a negative), question-answer pairs from Stack Overflow, and human-annotated triples. The famous triplet loss and InfoNCE loss are different shapes of this same idea.
The choice of negatives matters enormously. Random negatives (passages from a random other query) teach the model little — they are too easy to distinguish. Hard negatives (passages that look textually similar to the positive but are semantically wrong) are where the real signal comes from. Constructing hard negatives is half the engineering of a good embedding model.
Calibration is not what you think
You will be tempted to use cosine similarity scores as confidences — "0.83 is highly similar, 0.42 is barely related". This is wrong in two ways:
-
Score distributions are model-specific. OpenAI
text-embedding-3-smalltypically scores everything in[0.2, 0.6]; Sentence-BERT spreads more widely; CLIP image-to-text scores rarely exceed0.35even for perfect matches. There is no universal threshold. -
Scores are not probabilities. A cosine of
0.7does not mean "70 % related". To get calibrated confidences, train a small re-ranker on top of the retrieval candidates — typically a cross-encoder model that takes both texts as joint input and outputs a learned probability. Sentence-BERT's authors recommend exactly this two-stage architecture: bi-encoder retrieval (fast, approximate) followed by cross-encoder re-ranking (slow, accurate, on the top-100 only).
Beyond the three: learned metrics and Mahalanobis distance
Cosine, L2, and inner product all treat every dimension equally. But the embedding model probably did not — some dimensions encode load-bearing semantic axes, others encode noise. The Mahalanobis distance weights each dimension by the inverse of its variance, which can give a better-calibrated similarity for distributions with very anisotropic geometry. In practice this is rarely worth the trouble for modern embeddings (the models are already trained to make cosine work) but it shows up in classical pattern recognition and in some recommender systems.
A more modern direction is learned similarity: train a small neural net s(a, b) that takes two embeddings and outputs a similarity score directly. This is exactly what cross-encoders are. They are too slow for first-stage retrieval over millions of candidates but are the standard for the re-ranking stage.
Why dimensions keep growing
Why has the industry moved from 300-dim Word2Vec to 3072-dim text-embedding-3-large? The naive answer — "more dimensions = more capacity = better quality" — is partially right but understates the cost. Each doubling of d doubles your storage, doubles your CPU cost per similarity, and (mildly) worsens the curse of dimensionality.
The real reason is that modern embedding tasks are plural: a single 3072-dim vector is now expected to encode topic, sentiment, formality, language, factual entities, code structure, and more, all in shared space. With 384 dimensions there is not enough room to keep these signals from clobbering each other. With 3072, the model can dedicate sub-spaces to each. The 2024 trick of Matryoshka representation learning (used by OpenAI text-embedding-3) lets you truncate the vector to any prefix length you want — use the full 3072 dims when latency-tolerant, the first 256 dims when latency-bound, and the geometry remains usable at every prefix.
What you should walk away with
Three things, in this order:
- An embedding is
f: item → ℝ^dchosen so that semantically close items have geometrically close vectors. Modern models (Sentence-BERT, OpenAI text-embedding-3, CLIP) produce these in 384–3072 dimensions, and the best are multimodal. - Three similarity metrics matter — cosine, L2, inner product. They become rank-equivalent when vectors are unit-normalised, which is the trick that lets one index serve all three.
- The curse of dimensionality says brute-force k-NN is both too slow and statistically thin in high
d. The next four chapters introduce the indexes that fix the speed problem and accept the approximation.
You now know what an embedding is. The next chapter (flat search) shows you what not to do, and why doing it anyway is sometimes still the right answer.
References
- Mikolov et al., Efficient Estimation of Word Representations in Vector Space (Word2Vec, 2013) — the paper that started modern word embeddings.
- Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019) — how to turn BERT into a sentence embedder.
- OpenAI, New embedding models and API updates (text-embedding-3 docs) — production embedding API and Matryoshka representations.
- Beyer, Goldstein, Ramakrishnan, Shaft, When Is "Nearest Neighbor" Meaningful? (1999) — the canonical curse-of-dimensionality paper for databases.
- Pinecone, Embeddings and Vector Search tutorial — a hands-on walkthrough with code.
- Radford et al., Learning Transferable Visual Models From Natural Language Supervision (CLIP, 2021) — the multimodal text+image embedding paper.