Embeddings Deep Dive
Distance metrics, normalization, chunking, and dimensionality tradeoffs
Overview
An embedding maps text (or images, code) into a fixed-length vector where distance approximates semantic similarity. Going deeper means understanding how the numbers behave: which distance metric to use, why normalization matters, how chunking changes retrieval quality, and the cost/quality tradeoff of vector dimensionality. These choices determine whether semantic search feels precise or noisy.
Syntax / Usage
Cosine similarity is the default metric because it ignores magnitude and compares direction. If vectors are L2-normalized, cosine similarity equals the dot product, which is cheaper to compute at scale.
import numpy as np
from openai import OpenAI
client = OpenAI()
def embed(texts):
resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
return np.array([d.embedding for d in resp.data])
def normalize(v):
return v / np.linalg.norm(v, axis=1, keepdims=True)
def cosine(a, b):
a, b = normalize(a), normalize(b)
return a @ b.T # dot product of unit vectors == cosine similarity
vecs = embed(["cancel my subscription", "how to unsubscribe", "reset password"])
sim = cosine(vecs[:1], vecs)
print(sim.round(3)) # first two are close, third is far
Examples
Some models (like text-embedding-3-*) support Matryoshka truncation—you can shorten dimensions to cut storage and speed up search with modest quality loss:
resp = client.embeddings.create(
model="text-embedding-3-large",
input="quarterly revenue report",
dimensions=256, # truncate from 3072 to save memory
)
short_vec = resp.data[0].embedding
Chunking strategy strongly affects recall. Overlapping windows preserve context that would otherwise be split across boundaries:
def chunk(text: str, size: int = 600, overlap: int = 100):
words = text.split()
step = size - overlap
return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
chunks = chunk(long_document) # ~600-word windows, 100-word overlap
vectors = embed(chunks)
Common Mistakes
- Comparing embeddings from different models or model versions
- Using dot product on unnormalized vectors, letting length dominate meaning
- Chunks too large (diluted meaning) or too small (missing context)
- Embedding boilerplate—nav, footers, repeated headers—into the vectors
- Assuming higher dimensions always help; they raise cost and memory linearly
See Also
embeddings ai-vector-databases rag-basics