Advanced RAG

Hybrid search, reranking, query rewriting, and grounded generation

Overview

Advanced RAG addresses the failure modes of naive top-k retrieval: missed exact terms, irrelevant chunks, and ungrounded answers. The core techniques are hybrid search (dense vectors + sparse keyword), reranking with a cross-encoder, query transformation (rewriting or decomposition), and strict grounding so the model cites only retrieved context. The first three raise recall and precision before the generation step runs; grounding constrains the generation step itself.

Syntax / Usage

A robust pipeline retrieves a wide candidate set cheaply, then reranks precisely. Reciprocal rank fusion (RRF) merges dense and keyword result lists using only rank positions, so the two retrievers' incompatible score scales never need to be normalized or tuned.

from openai import OpenAI

client = OpenAI()

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked ID lists via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        # Ranks are 1-based, matching the usual RRF formula 1 / (k + rank).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

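# vector_search and bm25_search stand in for your own retrievers; each is
# expected to return document IDs ordered from best to worst match.
dense_hits = vector_search(query, k=30)     # semantic candidates (ids)
keyword_hits = bm25_search(query, k=30)      # lexical candidates (ids)
fused = rrf([dense_hits, keyword_hits])[:10]  # keep the 10 best fused IDs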
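
To see why documents found by both retrievers rise to the top, a small worked example may help. This is a self-contained sketch: the helper `rrf_scores` and the toy IDs are made up for illustration, and it assumes 1-based ranks as in the usual RRF formula.

```python
from collections import defaultdict

def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Return the fused RRF score per document ID (ranks are 1-based)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Toy ID lists standing in for real search results (hypothetical data).
dense_hits = ["d3", "d1", "d7"]
keyword_hits = ["d1", "d9", "d3"]

scores = rrf_scores([dense_hits, keyword_hits])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Here `d1` and `d3` each collect a contribution from both lists, so they outrank `d9` and `d7`, which appear in only one. With `k = 60` the gap between adjacent ranks is small, so appearing in several lists matters more than being first in one.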

Examples

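A cross-encoder reranker scores each (query, chunk) pair jointly, so it can model interactions between query and chunk tokens that independently computed embeddings cannot. It is typically more accurate, but it needs one model forward pass per pair, so apply it only to the fused shortlist: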

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 4) -> list[str]:
    """Return the top_n chunks, ordered by cross-encoder relevance to the query."""
    pairs = [(query, c) for c in chunks]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]
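
The shortlist-then-rerank flow can be exercised without downloading a model if the pair scorer is passed in as a parameter. The sketch below assumes that design; `rerank_with`, `PairScorer`, and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a stand-in for testing, not a cross-encoder.

```python
import re
from typing import Callable, Sequence

# A pair scorer maps (query, chunk) pairs to one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank_with(score_pairs: PairScorer, query: str,
                chunks: list[str], top_n: int = 4) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair and return the top_n with their scores."""
    if not chunks:
        return []
    scores = score_pairs([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [(c, float(s)) for c, s in ranked[:top_n]]

def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Hypothetical stand-in: fraction of query words that occur in the chunk."""
    out = []
    for query, chunk in pairs:
        q_words = set(re.findall(r"\w+", query.lower()))
        c_words = set(re.findall(r"\w+", chunk.lower()))
        out.append(len(q_words & c_words) / max(len(q_words), 1))
    return out

shortlist = [
    "Reset your password from the account settings page.",
    "Our office is closed on public holidays.",
    "To reset a forgotten password, use the email link.",
]
top = rerank_with(overlap_scorer, "reset forgotten password", shortlist, top_n=2)
for chunk, score in top:
    print(f"{score:.2f}  {chunk}")
```

Returning the scores alongside the chunks makes it possible to drop low-scoring chunks with a threshold instead of always keeping `top_n`. A real cross-encoder's `predict` method should fit the same `score_pairs` slot, since it also takes a list of pairs and returns one score per pair.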

Query rewriting turns vague or conversational questions into standalone, retrieval-friendly queries, and grounded generation instructs the model to answer only from the retrieved chunks and to cite them:

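def rewrite(question: str) -> str:
    """Ask the model for a standalone search query; fall back to the input."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Rewrite as a standalone search query."},
                  {"role": "user", "content": question}],
    )
    # message.content can be None, so fall back to the original question.
    return (r.choices[0].message.content or question).strip()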

def answer(question: str, chunks: list[str]) -> str:
    """Answer from the numbered chunks only, citing them as [0], [1], ..."""
    context = "\n---\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    prompt = (f"Use ONLY the context. Cite sources like [0].\n"
              f"If unsupported, say you don't know.\n\n{context}\n\nQ: {question}")
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # message.content can be None; return an empty string in that case.
    return r.choices[0].message.content or ""
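
A grounding instruction is only a request, so it can be worth checking the returned text afterwards. The sketch below assumes the `[0]`-style markers from the prompt above; `check_citations` is a hypothetical helper. It verifies only that each cited index refers to a chunk that was actually supplied, not that the chunk supports the claim.

```python
import re

def check_citations(answer_text: str, num_chunks: int) -> dict[str, object]:
    """Report which [i] markers in the answer point at supplied chunks."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer_text)}
    valid = sorted(i for i in cited if i < num_chunks)
    invalid = sorted(i for i in cited if i >= num_chunks)
    return {"valid": valid, "invalid": invalid, "has_citation": bool(valid)}

# Hypothetical model output for a request that supplied three chunks.
report = check_citations("Tokens expire after 24 hours [1], per policy [5].", num_chunks=3)
print(report)  # {'valid': [1], 'invalid': [5], 'has_citation': True}
```

An answer with no valid citation, or with any invalid index, is a reasonable candidate for a retry or for falling back to "I don't know".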

Common Mistakes

  • Reranking the entire corpus instead of a cheap candidate shortlist
  • Dropping keyword search, so exact IDs, codes, and names get missed
  • Passing conversation history verbatim without resolving pronouns/context
  • No grounding instruction, letting the model blend memory with retrieval
  • Skipping evaluation instead of measuring retrieval hit rate and answer faithfulness
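
For the evaluation point in the list above, retrieval quality can be measured with a small labeled set before touching the generation step. This sketch assumes you have, for each test query, the ranked IDs your retriever returned and a set of IDs judged relevant; the function names and toy data are illustrative. It covers hit rate and mean reciprocal rank only; answer faithfulness needs a separate check.

```python
def hit_rate_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:k]))
    return hits / len(results) if results else 0.0

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant ID (0 when none is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0

# Toy evaluation set: three queries, their ranked results, and relevant IDs.
results = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"b"}, {"x"}, {"g", "i"}]

print(hit_rate_at_k(results, relevant, k=2))    # 2 of 3 queries hit
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 0 + 1) / 3 = 0.5
```

Running the same set against dense-only, keyword-only, and fused retrieval gives a direct comparison of what each stage contributes.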

See Also

rag-basics, ai-vector-databases, ai-evaluation