RAG Basics

Overview

Retrieval-augmented generation (RAG) grounds LLM answers in your own data. Instead of relying only on what the model memorized during training, the system retrieves relevant documents at query time and injects them into the prompt. This reduces hallucinations, supports private knowledge bases, and lets you update answers by re-indexing content rather than retraining the model.

Typical RAG stack: ingest documents → chunk → embed → store in a vector DB → on query, retrieve top chunks → send chunks + question to the LLM → return answer with citations.

Syntax / Usage

Pipeline stages:

Ingestion     Load PDFs, Markdown, DB rows, API docs
Chunking      Split into overlapping segments (~400–800 tokens)
Embedding     Vectorize each chunk
Indexing      Store vectors + metadata in vector store
Retrieval     Embed query → similarity search → top-k chunks
Generation    Prompt LLM with context + user question
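
The Retrieval stage above can be sketched as a brute-force cosine-similarity search over an in-memory index. This is only an illustration: the IndexedChunk shape and the function names are assumptions, and a real vector store would normally run this search for you with an approximate index.

```typescript
// Hypothetical shape of a stored chunk: text, source metadata, and its embedding.
interface IndexedChunk {
  id: string;
  content: string;
  source: string;
  embedding: number[];
}

interface ScoredChunk {
  chunk: IndexedChunk;
  score: number;
}

// Cosine similarity between two vectors of equal length (1 = same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as not similar to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query embedding and keep the k best.
function retrieveTopK(
  queryEmbedding: number[],
  index: IndexedChunk[],
  k: number
): ScoredChunk[] {
  return index
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The query must be embedded with the same model as the chunks, otherwise the scores are not comparable.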

Minimal prompt template:

Answer using ONLY the context below. If the answer is not in the context, say "I don't know."

Context:
---
{chunk_1}
---
{chunk_2}
---

Question: {user_question}
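
As a minimal sketch, the template above could be filled in by a small helper like the following. The function name buildPrompt is an assumption for illustration; the wording and the --- separators are taken from the template.

```typescript
// Fill the prompt template with retrieved chunk texts and the user's question.
// Chunks are separated by "---" lines, matching the template above.
function buildPrompt(chunks: string[], userQuestion: string): string {
  const context = chunks.map((c) => c.trim()).join("\n---\n");
  return [
    'Answer using ONLY the context below. If the answer is not in the context, say "I don\'t know."',
    "",
    "Context:",
    "---",
    context,
    "---",
    "",
    `Question: ${userQuestion.trim()}`,
  ].join("\n");
}
```

Keeping the template in one function makes it easy to change the instructions or separators without touching the retrieval code.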

Architecture in a Next.js app:

User → /api/ask → embed query → Supabase pgvector search
              → build prompt with hits → LLM API → stream response
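
The request flow above might be organized as follows. This is a sketch under stated assumptions, not a definitive handler: AskDeps, ask, and the three injected functions are hypothetical names standing in for your embedding client, your pgvector query, and your LLM client, and streaming is left out to keep the example short.

```typescript
// Hypothetical dependency interface; each function wraps a real client in production.
interface AskDeps {
  embed: (text: string) => Promise<number[]>;
  search: (
    embedding: number[],
    k: number
  ) => Promise<{ content: string; source: string }[]>;
  complete: (prompt: string) => Promise<string>;
}

interface AskResult {
  answer: string;
  sources: string[];
}

// Core logic an /api/ask handler could call: embed, search, prompt, generate.
async function ask(question: string, deps: AskDeps, k = 5): Promise<AskResult> {
  const queryEmbedding = await deps.embed(question);
  const hits = await deps.search(queryEmbedding, k);

  // Nothing retrieved: answer without calling the LLM.
  if (hits.length === 0) {
    return { answer: "I don't know.", sources: [] };
  }

  const context = hits.map((h) => h.content).join("\n---\n");
  const prompt =
    'Answer using ONLY the context below. If the answer is not in the context, say "I don\'t know."' +
    `\n\nContext:\n---\n${context}\n---\n\nQuestion: ${question}`;

  const answer = await deps.complete(prompt);
  // Return de-duplicated sources so the UI can show citations.
  const sources = Array.from(new Set(hits.map((h) => h.source)));
  return { answer, sources };
}
```

Injecting the three dependencies keeps the flow testable with fakes and independent of any particular vendor SDK.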

Tune top-k (3–8 chunks), similarity threshold, and chunk size for your content type.
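
One way to keep these knobs adjustable is to apply them in a single function over already-scored hits. The names RetrievalConfig and selectHits are illustrative assumptions, and the 0.75 threshold in the usage note is a placeholder, not a recommendation.

```typescript
// Tunable retrieval settings, kept together so they can be adjusted per content type.
interface RetrievalConfig {
  topK: number; // maximum number of chunks to pass to the LLM
  minScore: number; // drop hits whose similarity is below this value
}

// Apply the similarity threshold first, then keep the topK best remaining hits.
// filter() returns a new array, so the caller's list is not mutated by sort().
function selectHits<T extends { score: number }>(
  hits: T[],
  config: RetrievalConfig
): T[] {
  return hits
    .filter((h) => h.score >= config.minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, config.topK);
}
```

For example, selectHits(hits, { topK: 5, minScore: 0.75 }) may return fewer than five chunks, or none, when few hits clear the threshold.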

Examples

On publish, chunk each document (~600 tokens with a 100-token overlap), embed the chunks in batches, and upsert them into doc_chunks with content, embedding, and source metadata.
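
The chunking step could look roughly like this. As a simplifying assumption, the sketch counts whitespace-separated words as tokens; a real pipeline would count tokens with the embedding model's tokenizer. The function name chunkText is hypothetical.

```typescript
// Split text into overlapping chunks. Sizes are in whitespace-separated words,
// used here as a rough stand-in for model tokens.
function chunkText(text: string, chunkSize = 600, overlap = 100): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    // Stop once the window has reached the end, to avoid a trailing
    // chunk that only repeats the overlap.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

Each chunk repeats the last 100 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.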

Hybrid search (vector + keyword) improves recall for exact terms like SKUs and error codes.
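
One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below. This is one option among several, not a method the text above prescribes; the function name is illustrative and k = 60 is a conventional default rather than a tuned value.

```typescript
// Merge several ranked lists of chunk ids with reciprocal rank fusion.
// Each list contributes 1 / (k + rank) per id, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

RRF uses only rank positions, so it sidesteps the problem that vector similarity scores and keyword scores are on different scales.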

Common Mistakes

Retrieving too many irrelevant chunks: this pollutes the context and wastes tokens
Omitting citations and source metadata: users cannot verify answers
Letting the index go stale after content updates: trigger re-ingestion from publish webhooks
Sending raw HTML/PDF noise: strip headers, nav, and boilerplate before chunking
Skipping evaluation: maintain a set of test questions with their expected source docs
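
For the evaluation point, a minimal sketch of a retrieval check is shown below. It computes a hit rate: the fraction of test questions for which at least one expected source document was retrieved. EvalCase, Retriever, and hitRate are hypothetical names, and the retriever is synchronous here only to keep the example short.

```typescript
// One evaluation case: a question and the source docs that should be retrieved for it.
interface EvalCase {
  question: string;
  expectedSources: string[];
}

// Hypothetical retriever signature: returns the source ids of the retrieved chunks.
type Retriever = (question: string) => string[];

// Fraction of cases where at least one expected source appears in the results.
function hitRate(cases: EvalCase[], retrieve: Retriever): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const retrieved = new Set(retrieve(c.question));
    if (c.expectedSources.some((s) => retrieved.has(s))) {
      hits++;
    }
  }
  return hits / cases.length;
}
```

Running this after every change to chunk size, top-k, or the embedding model shows whether retrieval improved or regressed before users notice.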
