stackademic

The leading education platform for anyone with an interest in software development.

Embeddings Deep Dive

Distance metrics, normalization, chunking, and dimensionality tradeoffs

Overview

An embedding maps text (or images, code) into a fixed-length vector where distance approximates semantic similarity. Going deeper means understanding how the numbers behave: which distance metric to use, why normalization matters, how chunking changes retrieval quality, and the cost/quality tradeoff of vector dimensionality. These choices determine whether semantic search feels precise or noisy.

Syntax / Usage

Cosine similarity is the default metric because it ignores magnitude and compares direction. If vectors are L2-normalized, cosine similarity equals the dot product, which is cheaper to compute at scale.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return a @ b.T  # dot product of unit vectors == cosine similarity

vecs = embed(["cancel my subscription", "how to unsubscribe", "reset password"])
sim = cosine(vecs[:1], vecs)
print(sim.round(3))  # first two are close, third is far

Examples

Some models (like text-embedding-3-*) support Matryoshka truncation—you can shorten dimensions to cut storage and speed up search with modest quality loss:

resp = client.embeddings.create(
    model="text-embedding-3-large",
    input="quarterly revenue report",
    dimensions=256,  # truncate from 3072 to save memory
)
short_vec = resp.data[0].embedding

Chunking strategy strongly affects recall. Overlapping windows preserve context that would otherwise be split across boundaries:

def chunk(text: str, size: int = 600, overlap: int = 100):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

chunks = chunk(long_document)  # ~600-word windows, 100-word overlap
vectors = embed(chunks)

Common Mistakes

  • Comparing embeddings from different models or model versions
  • Using dot product on unnormalized vectors, letting length dominate meaning
  • Chunks too large (diluted meaning) or too small (missing context)
  • Embedding boilerplate—nav, footers, repeated headers—into the vectors
  • Assuming higher dimensions always help; they raise cost and memory linearly

See Also

embeddings ai-vector-databases rag-basics