RAG Basics
Retrieval-augmented generation pipeline for developer applications
Overview
Retrieval-augmented generation (RAG) grounds LLM answers in your own data. Instead of relying only on what the model memorized during training, the system retrieves relevant documents at query time and injects them into the prompt. This tends to reduce hallucinations, supports private knowledge bases, and lets you update answers by re-indexing content rather than retraining the model.
Typical RAG stack: ingest documents → chunk → embed → store in a vector DB → on query, retrieve top chunks → send chunks + question to the LLM → return answer with citations.
Syntax / Usage
Pipeline stages:
Stage       Description
Ingestion   Load PDFs, Markdown, DB rows, API docs
Chunking    Split into overlapping segments (~400–800 tokens)
Embedding   Vectorize each chunk
Indexing    Store vectors + metadata in vector store
Retrieval   Embed query → similarity search → top-k chunks
Generation  Prompt LLM with context + user question
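The retrieval stage can be illustrated without a vector database. The sketch below is a minimal in-memory version, assuming embeddings are plain number arrays and that cosine similarity is the metric; the `IndexedChunk` shape and the function names are hypothetical and only serve the illustration. A real vector store would run the same comparison with an index instead of a full scan.

```typescript
// Hypothetical shape for an indexed chunk; field names are illustrative.
interface IndexedChunk {
  id: string;
  content: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // A zero vector has no direction, so treat it as matching nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the k best.
function retrieveTopK(
  queryEmbedding: number[],
  index: IndexedChunk[],
  k: number
): { chunk: IndexedChunk; score: number }[] {
  return index
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The full scan is linear in the number of chunks, which is fine for a demo or a few thousand chunks but is the part a vector DB replaces.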
Minimal prompt template:
Answer using ONLY the context below. If the answer is not in the context, say "I don't know."
Context:
---
{chunk_1}
---
{chunk_2}
---
Question: {user_question}
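The template above can be filled programmatically. This is a minimal sketch, assuming the retrieved chunks arrive as plain strings; `buildPrompt` is a hypothetical helper name, not part of any library.

```typescript
// Instruction line taken from the template; kept as a constant so it is easy to tune.
const INSTRUCTION =
  "Answer using ONLY the context below. If the answer is not in the context, say \"I don't know.\"";

// Builds the prompt: instruction, then each chunk preceded by a "---" separator,
// then a closing separator and the user's question.
function buildPrompt(chunks: string[], userQuestion: string): string {
  const lines: string[] = [INSTRUCTION, "Context:"];
  for (const chunk of chunks) {
    lines.push("---", chunk.trim());
  }
  lines.push("---", `Question: ${userQuestion}`);
  return lines.join("\n");
}
```

Keeping the separators explicit makes it easier for the model, and for anyone debugging, to see where one chunk ends and the next begins.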
Architecture in a Next.js app:
User → /api/ask → embed query → Supabase pgvector search
→ build prompt with hits → LLM API → stream response
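The flow above could be wired together roughly as follows. This is a sketch under stated assumptions, not a Next.js or Supabase API: the three dependencies in `AskDeps` are hypothetical stand-ins that a real app would implement with an embedding API call, a pgvector similarity query, and a streaming LLM client. Streaming is modeled with a simple `onToken` callback.

```typescript
// Hypothetical dependency interface; none of these names come from a real SDK.
interface AskDeps {
  embedQuery: (text: string) => Promise<number[]>;
  searchChunks: (queryEmbedding: number[], topK: number) => Promise<string[]>;
  streamCompletion: (
    prompt: string,
    onToken: (token: string) => void
  ) => Promise<void>;
}

// Core of an /api/ask handler: embed, search, build the prompt, stream the answer.
async function handleAsk(
  question: string,
  deps: AskDeps,
  onToken: (token: string) => void,
  topK = 5
): Promise<void> {
  const queryEmbedding = await deps.embedQuery(question);
  const hits = await deps.searchChunks(queryEmbedding, topK);

  const prompt = [
    "Answer using ONLY the context below.",
    "Context:",
    ...hits.map((hit) => `---\n${hit}`),
    "---",
    `Question: ${question}`,
  ].join("\n");

  // Tokens are forwarded to the caller as the model produces them.
  await deps.streamCompletion(prompt, onToken);
}
```

Injecting the dependencies keeps the handler testable with stubs and makes it straightforward to swap the vector store or model provider later.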
Tune top-k (3–8 chunks), similarity threshold, and chunk size for your content type.
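Two of these knobs, top-k and the similarity threshold, can be applied as a post-filter on scored hits. The sketch below assumes each hit carries a `similarity` field where higher means closer; the option names and the 0.7 threshold in the usage note are illustrative values, not recommendations from any library.

```typescript
// Hypothetical tuning options for the retrieval step.
interface RetrievalOptions {
  topK: number;          // maximum number of chunks to keep
  minSimilarity: number; // drop hits scoring below this value
}

// Keeps only hits at or above the threshold, best first, capped at topK.
function selectHits<T extends { similarity: number }>(
  hits: T[],
  options: RetrievalOptions
): T[] {
  return hits
    .filter((hit) => hit.similarity >= options.minSimilarity) // filter() copies, so the input is not mutated
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, options.topK);
}
```

For example, `selectHits(hits, { topK: 5, minSimilarity: 0.7 })` returns at most five hits and may return none, in which case the prompt's "I don't know" instruction does the rest.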
Examples
On publish, chunk documents (~600 tokens with a 100-token overlap), embed the chunks in batches, and upsert them into doc_chunks with content, embedding, and source metadata.
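A rough sketch of the chunking and batching steps is shown here. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the helper names and the `DocChunkRow` shape, including the `source` field name, are hypothetical. The embedding call and the upsert itself are left out.

```typescript
// Splits text into windows of `size` tokens; consecutive windows share `overlap` tokens.
// Whitespace-separated words approximate tokens here.
function chunkText(text: string, size = 600, overlap = 100): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Groups items into batches so embeddings can be requested one batch at a time.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical row shape for the doc_chunks table.
interface DocChunkRow {
  content: string;
  embedding: number[];
  source: string;
}

// Pairs each chunk with its embedding and the source it came from.
function buildRows(chunks: string[], embeddings: number[][], source: string): DocChunkRow[] {
  if (chunks.length !== embeddings.length) {
    throw new Error("chunks and embeddings must have the same length");
  }
  return chunks.map((content, i) => ({ content, embedding: embeddings[i] ?? [], source }));
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.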
Hybrid search (vector + keyword) improves recall for exact terms like SKUs and error codes.
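One common way to merge vector and keyword results is reciprocal rank fusion (RRF), sketched below. The source does not prescribe a merge method, so treat RRF as one option among several; the function name is hypothetical, and `k = 60` is a frequently used default rather than a tuned value. Inputs are ranked lists of chunk IDs, best first.

```typescript
// Merges ranked ID lists with reciprocal rank fusion: each list contributes
// 1 / (k + rank) to a document's score, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF uses only ranks, it sidesteps the problem that vector similarities and keyword scores live on different scales. A chunk that appears in both lists, such as one matching an exact SKU and also sitting close in embedding space, rises toward the top.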
Common Mistakes
- Retrieving too many irrelevant chunks: pollutes the context and wastes tokens
- No citations or source metadata: users cannot verify answers
- Stale index after content updates: trigger re-ingestion from publish webhooks
- Sending raw HTML/PDF noise without cleaning headers, nav, and boilerplate
- Skipping evaluation: maintain a set of test questions with their expected source docs
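The evaluation point can be made concrete with a small retrieval check. This is a minimal sketch, assuming each test question lists the source documents that should be retrieved and that `retrieve` returns the source IDs of the retrieved chunks; both the `EvalCase` shape and the synchronous `retrieve` signature are simplifications for illustration (a real retriever would be async).

```typescript
// Hypothetical evaluation case: a question plus the sources expected to answer it.
interface EvalCase {
  question: string;
  expectedSources: string[];
}

// Fraction of questions for which at least one expected source was retrieved.
function hitRate(
  cases: EvalCase[],
  retrieve: (question: string) => string[]
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const evalCase of cases) {
    const retrieved = new Set(retrieve(evalCase.question));
    if (evalCase.expectedSources.some((source) => retrieved.has(source))) {
      hits += 1;
    }
  }
  return hits / cases.length;
}
```

Running this after each change to chunk size, top-k, or the similarity threshold shows whether retrieval got better or worse before any answer quality is judged.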
See Also
embeddings, prompt-engineering, large-language-models, ai-apis