Advanced RAG
Hybrid search, reranking, query rewriting, and grounded generation
Overview
Advanced RAG addresses the failure modes of naive top-k retrieval: missed exact terms, irrelevant chunks, and ungrounded answers. The core techniques are hybrid search (dense vectors + sparse keyword), reranking with a cross-encoder, query transformation (rewriting or decomposition), and strict grounding so the model cites only retrieved context. The first three raise recall and precision before the generation step runs; grounding then constrains the generation step itself.
Syntax / Usage
A robust pipeline retrieves a wide candidate set cheaply, then reranks precisely. Reciprocal rank fusion (RRF) merges dense and keyword result lists using only rank positions, so the two score scales never need to be tuned against each other.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment; used in the Examples section

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked ID lists via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # RRF is conventionally defined on 1-based ranks: score += 1 / (k + rank)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# vector_search and bm25_search stand in for your own retrievers;
# query is the user's search string.
dense_hits = vector_search(query, k=30)    # semantic candidates (ids)
keyword_hits = bm25_search(query, k=30)    # lexical candidates (ids)
fused = rrf([dense_hits, keyword_hits])[:10]
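To make the fusion behavior concrete, here is a small worked sketch on toy ID lists. The document IDs and the `rrf_scores` helper are illustrative assumptions, not part of any library; the helper uses 1-based ranks and the same k = 60 default, and returns the raw scores so the effect of appearing in both lists is visible.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Return the fused RRF score for every document ID (ranks are 1-based)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Hypothetical result lists: "a" and "c" appear in both, "b" and "d" in only one.
dense_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]

scores = rrf_scores([dense_hits, keyword_hits])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents found by both retrievers ("a", "c") outrank those found by only one, even though "b" sat above "c" in the dense list. No raw similarity or BM25 score is ever compared.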
Examples
A cross-encoder reranker scores each (query, chunk) pair jointly, which is typically more accurate than comparing independently computed embeddings but also much slower per pair. Apply it only to the fused shortlist to keep cost down:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 4) -> list[str]:
    """Return the top_n chunk texts, ordered by cross-encoder relevance."""
    pairs = [(query, c) for c in chunks]
    scores = reranker.predict(pairs)  # one relevance score per (query, chunk) pair
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
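The fused list from RRF holds document IDs, while the reranker scores chunk text, so some lookup step has to sit between the two. The sketch below shows one way to wire that up. The `rerank_ids` helper, the in-memory `doc_store` dict, and the toy `overlap_score` scorer are all assumptions made for illustration; in practice a function with the same shape as `reranker.predict` could be passed as `score_fn`.

```python
from typing import Callable, Sequence

def rerank_ids(
    query: str,
    fused_ids: Sequence[str],
    doc_store: dict[str, str],
    score_fn: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 4,
) -> list[str]:
    """Look up chunk text for each fused ID, score the pairs, keep the best IDs."""
    ids = [i for i in fused_ids if i in doc_store]  # skip IDs with no stored text
    pairs = [(query, doc_store[i]) for i in ids]
    scores = score_fn(pairs)
    ranked = sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)
    return [i for i, _ in ranked[:top_n]]

def overlap_score(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: count query words that also occur in the chunk."""
    return [float(len(set(q.lower().split()) & set(c.lower().split()))) for q, c in pairs]

# Hypothetical chunk store keyed by document ID.
doc_store = {
    "a": "reset your password from the account settings page",
    "b": "billing invoices are emailed monthly",
    "c": "password rules require twelve characters",
}

top = rerank_ids("how to reset password", ["b", "a", "c", "zz"], doc_store, overlap_score, top_n=2)
print(top)  # ['a', 'c']
```

Only the shortlist passed in is scored, so the cost of the expensive scorer is bounded by the fused list length rather than the corpus size.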
Query rewriting turns vague or conversational questions into a standalone, retrieval-friendly form, and grounded generation instructs the model to answer only from the retrieved context and to cite it:
def rewrite(question: str) -> str:
    """Rewrite a conversational question as a standalone search query."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Rewrite as a standalone search query."},
                  {"role": "user", "content": question}],
    )
    # message.content can be None; fall back to the original question
    return r.choices[0].message.content or question

def answer(question: str, chunks: list[str]) -> str:
    """Answer from the numbered chunks only, citing them by index."""
    context = "\n---\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    prompt = ("Use ONLY the context. Cite sources like [0].\n"
              f"If unsupported, say you don't know.\n\n{context}\n\nQ: {question}")
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content or ""
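A prompt instruction alone does not guarantee that every citation in the answer points at a real chunk. One possible post-check, sketched here with a hypothetical `check_citations` helper, parses the `[i]` markers and separates indices that exist from those that do not. It verifies only that a cited index is in range, not that the cited chunk actually supports the claim.

```python
import re

def check_citations(answer_text: str, num_chunks: int) -> tuple[list[int], list[int]]:
    """Split cited indices into (valid, invalid) for chunks numbered 0..num_chunks-1."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer_text)})
    valid = [i for i in cited if i < num_chunks]
    invalid = [i for i in cited if i >= num_chunks]
    return valid, invalid

# Hypothetical model output for a call that supplied two chunks, [0] and [1].
sample = "The limit is 100 requests per minute [0]. Tokens expire daily [3]."
valid, invalid = check_citations(sample, num_chunks=2)
print(valid, invalid)  # [0] [3]
```

An answer with no valid citations, or with any invalid ones, could be retried or flagged rather than shown to the user as grounded.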
Common Mistakes
- Reranking the entire corpus instead of a cheap candidate shortlist
- Dropping keyword search, so exact IDs, codes, and names get missed
- Passing conversation history verbatim without resolving pronouns/context
- No grounding instruction, letting the model blend memory with retrieval
- Skipping retrieval evaluation instead of measuring hit rate and answer faithfulness
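Retrieval quality can be measured separately from generation. The sketch below computes hit rate at k over a small labeled set, plus mean reciprocal rank (MRR) as an additional, commonly used rank-sensitive metric. The function names, the toy document IDs, and the relevance labels are assumptions made for illustration.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant ID in the top k results."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant ID per query (0 if none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical retrieval output for three queries, with labeled relevant IDs.
results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d2"}, {"d6"}, {"d0"}]

print(hit_rate_at_k(results, relevant, k=2))    # 1 of 3 queries hit in the top 2
print(hit_rate_at_k(results, relevant, k=3))    # 2 of 3 queries hit in the top 3
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 1/3 + 0) / 3
```

Running the same labeled set before and after a change such as adding keyword search or a reranker shows whether the change helped retrieval, independent of the generated answers.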
See Also
rag-basics ai-vector-databases ai-evaluation