stackademic

Vector Databases

Storing and querying embeddings for similarity search at scale

Overview

A vector database stores high-dimensional embeddings and answers nearest-neighbor queries fast. Instead of exact keyword matches, it ranks records by geometric similarity (cosine, dot product, or L2). This powers semantic search, RAG retrieval, recommendations, and deduplication. Under the hood, an approximate nearest neighbor (ANN) index (HNSW, IVF) trades a little recall for large speedups over brute-force scans.
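
To make the metrics concrete, here is a minimal sketch of the brute-force scan that an ANN index approximates. It uses only the standard library; the helper names and the 3-dimensional toy vectors are invented for illustration and are not output from a real embedding model.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar, but it grows with vector length."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def brute_force_top_k(query, vectors, k=2):
    """Exact search: score every stored vector, then keep the k best."""
    scored = [(cosine_similarity(query, v), vid) for vid, v in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(vid, score) for score, vid in scored[:k]]

# Toy 3-dimensional "embeddings" (made-up values for illustration only).
toy_vectors = {
    "refund": [0.9, 0.1, 0.0],
    "password": [0.0, 0.2, 0.9],
    "invoice": [0.8, 0.3, 0.1],
}
top = brute_force_top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```

The scan touches every stored vector, so its cost grows linearly with collection size. That linear cost is what HNSW and IVF avoid, at the price of occasionally missing a true neighbor.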

Syntax / Usage

Most systems expose the same lifecycle: create a collection with a fixed dimension and distance metric, upsert vectors with metadata, then query with a vector plus optional metadata filters. The example below uses Chroma, a lightweight local vector store.

import chromadb
from openai import OpenAI

client = OpenAI()
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},  # distance metric
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

docs = ["Refunds take 5 days", "Reset your password in settings"]
collection.upsert(
    ids=["doc-1", "doc-2"],
    embeddings=embed(docs),
    documents=docs,
    metadatas=[{"category": "billing"}, {"category": "account"}],
)

hits = collection.query(
    query_embeddings=embed(["how long for a refund?"]),
    n_results=3,
    where={"category": "billing"},  # metadata pre-filter
)
print(hits["documents"])
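
Chroma returns one inner list per query embedding, so the results of a single query sit at index 0 of each field. The sketch below flattens that shape and drops weak matches. It assumes the result includes a "distances" field; `flatten_hits`, the 0.5 cutoff, and the sample values are hypothetical and exist only to illustrate the shape.

```python
def flatten_hits(hits, max_distance=None):
    """Turn a single-query result into a flat list of (id, document, distance)."""
    rows = zip(hits["ids"][0], hits["documents"][0], hits["distances"][0])
    return [
        (doc_id, doc, dist)
        for doc_id, doc, dist in rows
        if max_distance is None or dist <= max_distance
    ]

# Hand-written stand-in for a query result (values are made up).
sample_hits = {
    "ids": [["doc-1", "doc-7"]],
    "documents": [["Refunds take 5 days", "Shipping takes 3 days"]],
    "distances": [[0.21, 0.64]],
}
close_hits = flatten_hits(sample_hits, max_distance=0.5)
```

A sensible cutoff depends on the embedding model and the metric, so it is worth choosing by inspecting real query results rather than copying a number.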

Examples

Always embed queries with the same model that produced the stored vectors. A dimension mismatch usually fails loudly, but two different models with the same dimension fail silently by returning meaningless distances:

QUERY_MODEL = "text-embedding-3-small"  # must match ingestion model

def search(question: str, k: int = 5):
    q = client.embeddings.create(model=QUERY_MODEL, input=question).data[0].embedding
    return collection.query(query_embeddings=[q], n_results=k)
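
One way to enforce the rule is to record the ingestion model and its dimension once and validate every query against that record. The following is a sketch under assumptions: `EmbeddingSpec` and `check_query` are hypothetical helpers, not part of Chroma or the OpenAI SDK, and 1536 is the default output size of text-embedding-3-small.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """What the stored vectors were built with (hypothetical helper)."""
    model: str
    dimension: int

INGEST_SPEC = EmbeddingSpec(model="text-embedding-3-small", dimension=1536)

def check_query(spec, query_model, query_vector):
    """Raise before searching if the query cannot be compared to stored vectors."""
    if query_model != spec.model:
        raise ValueError(
            f"query model {query_model!r} != ingestion model {spec.model!r}"
        )
    if len(query_vector) != spec.dimension:
        raise ValueError(
            f"query has {len(query_vector)} dims, index expects {spec.dimension}"
        )
    return query_vector
```

Calling `check_query(INGEST_SPEC, QUERY_MODEL, q)` just before `collection.query` turns a silent relevance bug into an immediate exception.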

Postgres with the pgvector extension works well when you already run Postgres (for example on Supabase) and want SQL joins alongside similarity search:

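import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://...")
register_vector(conn)  # lets psycopg send NumPy arrays as the pgvector vector type
# Wrap the embedding in a NumPy array: a plain Python list is sent as a
# Postgres array, which the <=> operator does not accept.
vec = np.array(embed(["annual pricing plans"])[0])
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (vec,),
).fetchall()  # <=> is cosine distance in pgvector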

Common Mistakes

  • Mixing embedding models between ingestion and query, producing meaningless distances
  • Forgetting to normalize vectors when using dot-product similarity
  • Skipping metadata filters, so retrieval crosses tenants or categories
  • Setting k too high, flooding the LLM context with weak matches
  • Treating ANN results as exact; for accuracy-critical use, raise ef_search (HNSW) or nprobe (IVF) and measure recall
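
Two of the mistakes above are easy to check in code. The sketch below shows that a dot product on unnormalized vectors is inflated by magnitude, and how recall can be measured by comparing ANN results to an exact scan. The helper names and the document ids are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the ANN index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

a, b = [3.0, 4.0], [6.0, 8.0]            # same direction, different lengths
raw = dot(a, b)                          # score reflects magnitude, not just angle
unit = dot(normalize(a), normalize(b))   # close to 1.0, the cosine similarity

# Made-up ids: the ANN index found two of the three true neighbors.
recall = recall_at_k(["d1", "d2", "d9"], ["d1", "d2", "d3"])
```

Running `recall_at_k` over a sample of real queries, with the exact ids taken from a brute-force scan, gives a number to watch while tuning ef_search or nprobe.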

See Also

  • embeddings
  • ai-embeddings-deep-dive
  • rag-basics