Advanced RAG — AI Engineer Track

Overview

Advanced RAG addresses the failure modes of naive top-k retrieval: missed exact terms, irrelevant chunks, and ungrounded answers. The core techniques are hybrid search (dense vectors + sparse keyword), reranking with a cross-encoder, query transformation (rewriting or decomposition), and strict grounding so the model cites only retrieved context. Hybrid search and query transformation raise recall, reranking raises precision before the generation step runs, and grounding keeps the final answer tied to the retrieved evidence.

Syntax / Usage

A robust pipeline retrieves a wide candidate set cheaply, then reranks precisely. Reciprocal rank fusion (RRF) merges dense and keyword result lists without tuning score scales.

from openai import OpenAI

client = OpenAI()

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked ID lists via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

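# vector_search and bm25_search are placeholders for your own retrievers;
# each returns a ranked list of chunk IDs for the query string.
dense_hits = vector_search(query, k=30)      # semantic candidates (ids)
keyword_hits = bm25_search(query, k=30)      # lexical candidates (ids)
fused = rrf([dense_hits, keyword_hits])[:10]  # shortlist passed on to reranking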
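The bm25_search call above stands in for whatever keyword index you use. As a rough illustration of what such a retriever computes, here is a minimal pure-Python Okapi BM25 ranker. The name bm25_rank, the lowercase whitespace tokenizer, and the defaults k1=1.5 and b=0.75 are assumptions for this sketch; a production system would normally use a search engine or an established library instead.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split. Real analyzers also handle punctuation, stemming, etc.
    return text.lower().split()


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return doc IDs sorted by Okapi BM25 score; docs sharing no query term are dropped."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of docs containing each term at least once.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))

    scores: dict[str, float] = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms.intersection(tf):
            # IDF with +1 inside the log so it never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and document-length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "reset your password from the account settings page",
    "b": "error ERR-4021 means the session token expired",
    "c": "troubleshooting login problems after a password reset",
}
print(bm25_rank("what does ERR-4021 mean", docs))  # ['b']
```

The error code matches here only as a literal token, which is exactly the kind of exact term that dense retrieval alone can miss.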

Examples

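A cross-encoder reranker reads each (query, chunk) pair jointly in a single forward pass, which is typically much more accurate than comparing independently computed embeddings. Because that costs one model call per pair, apply it only to the fused shortlist, never to the whole corpus: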

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 4) -> list[str]:
    pairs = [(query, c) for c in chunks]        # one (query, chunk) pair per candidate
    scores = reranker.predict(pairs)            # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]       # keep only the best top_n chunks
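To make the cost argument concrete, here is a sketch of the retrieve-then-rerank funnel with the scorer passed in as a function, so only shortlist_size pairs are ever scored no matter how large the candidate list is. The names retrieve_then_rerank and overlap_scorer are hypothetical, and overlap_scorer is a toy token-overlap heuristic standing in for the cross-encoder only so the example runs without downloading a model; it is not a substitute for one.

```python
from typing import Callable

# A batch scorer maps (query, [doc texts]) -> [relevance scores], one per doc.
BatchScorer = Callable[[str, list[str]], list[float]]


def retrieve_then_rerank(
    query: str,
    candidate_ids: list[str],
    corpus: dict[str, str],
    scorer: BatchScorer,
    shortlist_size: int = 20,
    top_n: int = 4,
) -> list[str]:
    """Score only the first `shortlist_size` candidates, then keep the best `top_n`."""
    shortlist = candidate_ids[:shortlist_size]
    scores = scorer(query, [corpus[doc_id] for doc_id in shortlist])
    ranked = sorted(zip(shortlist, scores), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]


def overlap_scorer(query: str, docs: list[str]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query tokens present in each doc."""
    q_tokens = set(query.lower().split())
    return [len(q_tokens & set(d.lower().split())) / max(len(q_tokens), 1) for d in docs]


corpus = {f"d{i}": f"unrelated filler text {i}" for i in range(100)}
corpus["d3"] = "rotate the api key from the admin console"
candidates = list(corpus)  # pretend this is the fused ranking from RRF
print(retrieve_then_rerank("rotate api key", candidates, corpus, overlap_scorer, top_n=2))  # ['d3', 'd0']
```

With the reranker defined above, a batch scorer along the lines of `lambda q, docs: reranker.predict([(q, d) for d in docs]).tolist()` could be passed instead; scoring the whole shortlist in one predict call lets the model batch the pairs.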

Query rewriting turns vague or conversational questions into standalone, retrieval-friendly queries, and grounded generation instructs the model to answer only from the retrieved context and to cite it:

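def rewrite(question: str) -> str:
    """Turn a vague or conversational question into a standalone search query."""
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Rewrite the user's question as a standalone search query. Return only the query."},
                  {"role": "user", "content": question}],
    )
    return r.choices[0].message.content.strip()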

def answer(question: str, chunks: list[str]) -> str:
    context = "\n---\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
    prompt = (f"Use ONLY the context. Cite sources like [0].\n"
              f"If unsupported, say you don't know.\n\n{context}\n\nQ: {question}")
    r = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content
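The prompt in answer asks for citations but cannot enforce them. A cheap post-check is to parse the [n] markers and confirm each one points at a chunk that was actually supplied. This sketch assumes the [0]-style numbering used by answer; validate_citations and CitationReport are hypothetical names, and the check only confirms that citations exist and are in range, not that the cited chunk really supports the claim.

```python
import re
from dataclasses import dataclass

# Matches markers like [0] or [12], as produced by the "[{i}] chunk" context format.
CITATION = re.compile(r"\[(\d+)\]")


@dataclass
class CitationReport:
    cited: list[int]    # distinct chunk indices the answer refers to
    invalid: list[int]  # cited indices with no matching chunk in the context

    @property
    def ok(self) -> bool:
        # Require at least one citation and no out-of-range references.
        return bool(self.cited) and not self.invalid


def validate_citations(answer_text: str, num_chunks: int) -> CitationReport:
    cited = sorted({int(n) for n in CITATION.findall(answer_text)})
    invalid = [i for i in cited if i >= num_chunks]
    return CitationReport(cited=cited, invalid=invalid)


report = validate_citations("Tokens expire after 24h [1]. Rotate them weekly [3].", num_chunks=2)
print(report.cited, report.invalid, report.ok)  # [1, 3] [3] False
```

An answer that correctly says it doesn't know will carry no citations, so handle that case before treating ok == False as a grounding failure.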

Common Mistakes

Reranking the entire corpus instead of a cheap candidate shortlist
Dropping keyword search, so exact IDs, codes, and names get missed
Passing conversation history verbatim to retrieval instead of first resolving pronouns and context
No grounding instruction, letting the model blend memory with retrieval
Skipping retrieval evaluation: measure hit rate and answer faithfulness
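Hit rate is the simplest retrieval metric to track: for each labeled query, did any relevant chunk appear in the top k results? Below is a minimal sketch, assuming a small labeled set that maps query IDs to relevant chunk IDs (the data shapes and the name hit_rate_at_k are assumptions). Answer faithfulness needs a separate judge, human or model, and is not covered here.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of labeled queries whose top-k retrieved IDs include a relevant ID."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1
        for query_id, relevant_ids in relevant.items()
        # A query with no retrieved results simply counts as a miss.
        if relevant_ids & set(results.get(query_id, [])[:k])
    )
    return hits / len(relevant)


results = {"q1": ["c3", "c1", "c9"], "q2": ["c4", "c5", "c6"]}
relevant = {"q1": {"c1"}, "q2": {"c7"}}
print(hit_rate_at_k(results, relevant, k=2))  # 0.5
print(hit_rate_at_k(results, relevant, k=1))  # 0.0
```

Computing this for dense-only, keyword-only, and fused retrieval on the same labeled set is a direct way to check whether hybrid search is paying for itself.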
