Vector Databases
Storing and querying embeddings for similarity search at scale
Overview
A vector database stores high-dimensional embeddings and answers nearest-neighbor queries fast. Instead of exact keyword matches, it ranks records by geometric similarity (cosine, dot product, or L2). This powers semantic search, RAG retrieval, recommendations, and deduplication. Under the hood, an approximate nearest neighbor (ANN) index (HNSW, IVF) trades a little recall for large speedups over brute-force scans.
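To make the trade-off concrete, here is a minimal sketch of the brute-force scan that an ANN index approximates. It uses only the standard library; the function names, the toy 3-dimensional vectors, and the ids are illustrative assumptions, not part of any vector database API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Score every stored vector against the query and return the top-k ids."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [vec_id for _, vec_id in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical); real ones have hundreds of dimensions.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "password-reset": [0.0, 0.2, 0.9],
    "billing-faq": [0.8, 0.3, 0.1],
}
top = brute_force_search([1.0, 0.0, 0.0], store, k=2)
print(top)
```

Every query touches every stored vector, so cost grows linearly with collection size. Indexes such as HNSW or IVF avoid that full scan by examining only a subset of candidates, which is where the small loss in recall comes from.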
Syntax / Usage
Most systems expose the same lifecycle: create a collection with a fixed dimension and distance metric, upsert vectors with metadata, then query with a vector plus optional metadata filters. The example below uses Chroma, a lightweight local vector store.
import chromadb
from openai import OpenAI

client = OpenAI()
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},  # distance metric
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

docs = ["Refunds take 5 days", "Reset your password in settings"]
collection.upsert(
    ids=["doc-1", "doc-2"],
    embeddings=embed(docs),
    documents=docs,
    metadatas=[{"category": "billing"}, {"category": "account"}],
)

hits = collection.query(
    query_embeddings=embed(["how long for a refund?"]),
    n_results=3,
    where={"category": "billing"},  # metadata pre-filter
)
print(hits["documents"])
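Query results come back grouped per query vector, so each field is a list of lists. The sketch below flattens one query's hits into plain records. It assumes the nested result shape Chroma returns by default (ids, documents, distances); the helper name flatten_hits and the sample values, including the distance, are made up for illustration.

```python
def flatten_hits(result: dict, query_index: int = 0) -> list[dict]:
    """Turn one query's nested result lists into a list of flat records."""
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    distances = result["distances"][query_index]
    return [
        {"id": i, "document": doc, "distance": dist}
        for i, doc, dist in zip(ids, documents, distances)
    ]


# Hand-written sample in the assumed shape; the distance value is invented.
sample = {
    "ids": [["doc-1"]],
    "documents": [["Refunds take 5 days"]],
    "distances": [[0.21]],
}
rows = flatten_hits(sample)
print(rows)
```

With cosine space, a smaller distance means a closer match, so the first record is the best hit.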
Examples
Always embed queries with the same model that produced the stored vectors. A model with a different dimension fails loudly, but a different model with the same dimension fails silently and returns meaningless rankings:
QUERY_MODEL = "text-embedding-3-small"  # must match ingestion model

def search(question: str, k: int = 5):
    q = client.embeddings.create(model=QUERY_MODEL, input=question).data[0].embedding
    return collection.query(query_embeddings=[q], n_results=k)
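One way to enforce this is to record the ingestion model and dimension next to the index and check them before every query. The following is a sketch under that assumption; IndexConfig and check_query are hypothetical names, not part of Chroma or the OpenAI client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    """What the stored vectors were built with (hypothetical helper)."""
    model: str
    dimension: int


def check_query(config: IndexConfig, model: str, vector: list[float]) -> None:
    """Raise if the query vector cannot be compared with the stored vectors."""
    if model != config.model:
        raise ValueError(f"query model {model!r} != ingestion model {config.model!r}")
    if len(vector) != config.dimension:
        raise ValueError(f"query has {len(vector)} dims, index expects {config.dimension}")


# text-embedding-3-small produces 1536-dimensional vectors by default.
config = IndexConfig(model="text-embedding-3-small", dimension=1536)
check_query(config, "text-embedding-3-small", [0.0] * 1536)  # passes silently
```

The model-name check matters more than the dimension check: two models can share a dimension and still produce incompatible vector spaces.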
Postgres with the pgvector extension works well when you already run Postgres (for example, on Supabase) and want SQL joins alongside similarity search:
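import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://...")
register_vector(conn)  # lets psycopg adapt numpy arrays to the vector type

# Pass a numpy array: a plain Python list is sent as a Postgres array, not a vector
vec = np.array(embed(["annual pricing plans"])[0])
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (vec,),
).fetchall()  # <=> is cosine distance in pgvector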
Common Mistakes
- Mixing embedding models between ingestion and query, producing meaningless distances
- Forgetting to normalize vectors when using dot-product similarity
- Skipping metadata filters, so retrieval crosses tenants or categories
- Setting k too high, flooding the LLM context with weak matches
- Treating ANN recall as exact; tune ef_search / nprobe for accuracy-critical use
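Two of the mistakes above are easy to demonstrate with toy numbers. The sketch below shows how an unnormalized dot product favors long vectors over well-aligned ones, and how recall@k can be measured by comparing ANN results against an exact scan. All names, vectors, and id lists are illustrative assumptions.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the exact top-k that the approximate top-k also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


query = [1.0, 0.0]
short_aligned = [0.5, 0.0]  # same direction as the query, small magnitude
long_off_axis = [3.0, 3.0]  # 45 degrees off, large magnitude

# Raw dot product rewards magnitude: the off-axis vector scores higher.
raw_prefers_long = dot(query, long_off_axis) > dot(query, short_aligned)

# After normalization the aligned vector scores higher, as intended.
q = normalize(query)
norm_prefers_aligned = dot(q, normalize(short_aligned)) > dot(q, normalize(long_off_axis))

# Hypothetical id lists: the ANN index missed one of the three true neighbors.
recall = recall_at_k(["a", "b", "x"], ["a", "b", "c"], k=3)
print(raw_prefers_long, norm_prefers_aligned, round(recall, 3))
```

Measuring recall this way on a sample of real queries gives a baseline for deciding whether raising ef_search or nprobe is worth the added latency.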
See Also
- embeddings
- ai-embeddings-deep-dive
- rag-basics