Vector Databases — AI Engineer Track

Overview

A vector database stores high-dimensional embeddings and answers nearest-neighbor queries fast. Instead of exact keyword matches, it ranks records by geometric similarity (cosine, dot product, or L2). This powers semantic search, RAG retrieval, recommendations, and deduplication. Under the hood, an approximate nearest neighbor (ANN) index (HNSW, IVF) trades a little recall for large speedups over brute-force scans.
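To make the three similarity measures and the brute-force baseline concrete, here is a minimal pure-Python sketch. The toy 3-dimensional vectors and the helper names (`dot`, `l2_distance`, `cosine_similarity`, `brute_force_top_k`) are illustrative assumptions, not part of any library; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def brute_force_top_k(query, vectors, k=2):
    """Exact search: score every stored vector, then sort (linear in collection size)."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Made-up toy "embeddings" for illustration only.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
query = [1.0, 0.05, 0.0]
top = brute_force_top_k(query, vectors, k=2)
print(top)
```

An ANN index such as HNSW or IVF avoids scoring every vector, which is where the speedup comes from and why a few true neighbors can occasionally be missed.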

Syntax / Usage

Most systems expose the same lifecycle: create a collection with a fixed dimension and distance metric, upsert vectors with metadata, then query with a vector plus optional metadata filters. The example below uses Chroma, a lightweight local vector store.

import chromadb
from openai import OpenAI

client = OpenAI()
db = chromadb.PersistentClient(path="./vectordb")
collection = db.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},  # distance metric
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

docs = ["Refunds take 5 days", "Reset your password in settings"]
collection.upsert(
    ids=["doc-1", "doc-2"],
    embeddings=embed(docs),
    documents=docs,
    metadatas=[{"category": "billing"}, {"category": "account"}],
)

hits = collection.query(
    query_embeddings=embed(["how long for a refund?"]),
    n_results=3,
    where={"category": "billing"},  # metadata pre-filter
)
print(hits["documents"])

Examples

Always embed queries with the same model that produced the stored vectors. A dimension mismatch typically raises an error, but a different model with the same dimension fails silently: the query runs and returns meaningless neighbors.

QUERY_MODEL = "text-embedding-3-small"  # must match ingestion model

def search(question: str, k: int = 5):
    q = client.embeddings.create(model=QUERY_MODEL, input=question).data[0].embedding
    return collection.query(query_embeddings=[q], n_results=k)
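One way to catch a model mismatch early is to record which model and dimension were used at ingestion and validate every query against that record. The sketch below is a hypothetical guard: `EmbeddingSpec` and `check_query` are names invented for illustration, and the dimension of 1536 assumes the default output size of text-embedding-3-small.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingSpec:
    """What the collection was ingested with (hypothetical helper)."""
    model: str
    dimension: int

def check_query(spec: EmbeddingSpec, query_model: str, query_vector: list[float]) -> None:
    """Raise before querying if the model or dimension differs from ingestion."""
    if query_model != spec.model:
        raise ValueError(
            f"query model {query_model!r} does not match ingestion model {spec.model!r}"
        )
    if len(query_vector) != spec.dimension:
        raise ValueError(
            f"query has {len(query_vector)} dimensions, collection expects {spec.dimension}"
        )

# Assumption: 1536 is the default dimension of text-embedding-3-small.
INGESTION_SPEC = EmbeddingSpec(model="text-embedding-3-small", dimension=1536)

check_query(INGESTION_SPEC, "text-embedding-3-small", [0.0] * 1536)  # passes silently
```

Calling `check_query` at the top of a search function turns a silent quality problem into an immediate, explainable failure.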

Postgres with the pgvector extension works well when you already run Postgres (for example, through Supabase) and want SQL joins alongside similarity search:

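import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://...")
register_vector(conn)  # teaches psycopg to send numpy arrays as pgvector values
# A plain Python list would be sent as a Postgres array, not a vector,
# so wrap the embedding in a numpy array before passing it as a parameter.
vec = np.array(embed(["annual pricing plans"])[0])
rows = conn.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (vec,),
).fetchall()  # <=> is cosine distance in pgvector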
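The query above presumes a `chunks` table with `content` and `embedding` columns already exists. A minimal sketch of the setup it might rely on follows; the `id` column, the index name, the 1536 dimension, and the `setup_schema` helper are assumptions for illustration, and the HNSW index type requires a reasonably recent pgvector release.

```python
EMBEDDING_DIM = 1536  # assumption: default output size of text-embedding-3-small

SCHEMA_STATEMENTS = [
    # Enable pgvector in the current database.
    "CREATE EXTENSION IF NOT EXISTS vector",
    # Fixed-dimension vector column next to ordinary SQL columns.
    f"""CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text NOT NULL,
        embedding vector({EMBEDDING_DIM})
    )""",
    # ANN index using the cosine operator class, matching the <=> operator.
    "CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw "
    "ON chunks USING hnsw (embedding vector_cosine_ops)",
]

def setup_schema(conn) -> None:
    """Run each DDL statement on an open connection (hypothetical helper)."""
    for statement in SCHEMA_STATEMENTS:
        conn.execute(statement)
```

With psycopg, this would be called as `setup_schema(conn)` followed by `conn.commit()`. The operator class of the index should match the distance operator used in queries, otherwise Postgres may fall back to a sequential scan.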

Common Mistakes

Mixing embedding models between ingestion and query, producing meaningless distances
Forgetting to normalize vectors when using dot-product similarity, so vector length skews the ranking
Skipping metadata filters, so retrieval crosses tenants or categories
Setting k too high, flooding the LLM context with weak matches
Treating ANN recall as exact instead of tuning ef_search (HNSW) or nprobe (IVF) for accuracy-critical use
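Two of these mistakes are easy to demonstrate in a few lines. The sketch below uses made-up 2-dimensional vectors to show how an unnormalized dot product can rank a longer but less aligned vector first, and defines a simple recall@k check for comparing ANN results against an exact search. The helper names (`normalize`, `dot`, `recall_at_k`) are illustrative, not library functions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

query = [1.0, 0.0]
aligned = [1.0, 0.0]   # same direction as the query, short
long_vec = [3.0, 3.0]  # 45 degrees off the query, but much longer

# Raw dot product prefers the longer vector even though it points elsewhere.
raw_prefers_long = dot(query, long_vec) > dot(query, aligned)

# After normalization, the aligned vector wins, as cosine similarity would say.
norm_prefers_aligned = dot(query, normalize(aligned)) > dot(query, normalize(long_vec))

# Hypothetical ID lists: the ANN index missed one of four true neighbors.
recall = recall_at_k(approx_ids=[1, 2, 3, 9], exact_ids=[1, 2, 3, 4])
```

Measuring recall@k on a sample of queries against a brute-force baseline gives a concrete number to watch while adjusting ef_search or nprobe.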
