The RAG Interview Question I Couldn’t Answer

Prathmesh Vhatkar

And the 25% accuracy gap it taught me to take seriously.

An interviewer asked me one question about RAG last week, and I froze. “Why do we even need rerankers? Isn’t semantic search enough?”

I gave the textbook answer. Something about improving precision, reducing noise, better results. She nodded and moved on.

I knew I had lost the signal.

It took me the rest of the evening to work out what I should have said. This post is that answer, the one I wish I had given in the room.

A quick refresher on what RAG actually does

If you have touched generative AI in the last two years, you have probably heard the term RAG thrown around. It stands for retrieval-augmented generation.

The idea is simple. Large language models are smart but have no idea what is in your company’s internal documents, your codebase, or any knowledge that was not in their training data. So when someone asks a question, you first go find the most relevant documents, then hand those documents to the language model along with the question. The model reads the documents and answers based on what it found.
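
To make that flow concrete, here is a tiny sketch in Python. The `retrieve` stub and the prompt wording are my own illustrative choices, not a standard interface.

```python
# A sketch of the basic RAG flow. Retrieval is stubbed out here;
# the prompt wording is an illustrative assumption, not a standard format.
def retrieve(question: str, k: int = 3) -> list[str]:
    # Stand-in for a real search over your corpus.
    return ["(relevant document 1)", "(relevant document 2)", "(relevant document 3)"][:k]

def build_prompt(question: str, documents: list[str]) -> str:
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

question = "What is our refund policy?"
prompt = build_prompt(question, retrieve(question))
# `prompt` is what gets sent to the language model along with the question.
```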

The retrieval part is where most people stop thinking. They assume: “I just embed everything with OpenAI, store it in a vector database, do a cosine similarity search, and I am done.”

That is the mental model the interviewer was testing. It is also the mental model that breaks production RAG systems.

The two-tower bottleneck

Here is what is actually happening when you do a standard semantic search.

Your embedding model takes the query and turns it into a vector, a list of numbers that represents its meaning. It takes each document and does the same thing. Now you have two vectors, one for the query, one for the document. You compare them using a dot product, which is just a math operation that measures how similar two vectors are.

The key point: the query and the document never meet before the comparison. The model looked at each one separately and built its representation in isolation. This design is called a bi-encoder, because it encodes the two things independently.

It is fast. You can pre-compute and store the document vectors in a database, and at query time you only need to embed the new query and do a nearest-neighbor search. Millions of documents, milliseconds of latency.

It is also shallow. The model never got to compare the query and the document together. It is ranking without reading.

Here is the example that made this click for me.

Query: “How do I prevent heart attacks?”

Document: “Heart attacks kill millions every year.”

Both mention heart attacks. Both live in the same semantic neighborhood. Cosine similarity will be high. A vector search will rank this document near the top.

But one is a question and the other is a statistic. The document does not answer the query. It is topically similar but completely unhelpful.

This is the gap. Semantic similarity is not the same as relevance. And a bi-encoder cannot tell the difference, because it never actually looked at the two together.
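
You can see this in a few lines of Python. Here is a rough sketch using the sentence-transformers library; the model name is just one common choice, and exact scores will vary.

```python
# Bi-encoder view of the example above, using sentence-transformers.
# The model choice is an assumption; exact scores depend on the model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I prevent heart attacks?"
docs = [
    "Heart attacks kill millions every year.",  # topically similar, no answer
    "Regular exercise and a healthy diet lower the risk of heart attacks.",  # actually answers
]

# Each text is encoded independently; the query and documents never "meet".
query_vec = model.encode(query)
doc_vecs = model.encode(docs)

scores = util.cos_sim(query_vec, doc_vecs)[0]
for doc, score in zip(docs, scores):
    print(f"{float(score):.3f}  {doc}")
# Both documents tend to score high: cosine similarity measures topical
# closeness, not whether the document answers the question.
```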

The fix: cross-encoders

A cross-encoder fixes this by encoding the query and the document in the same pass.

Instead of two separate vectors, you feed the model a single sequence that looks roughly like: query, separator token, document. The separator token is a special marker that tells the model “the first part is one thing, the second part is another thing, reason about how they relate.”

Now every word in the query can attend to every word in the document, and vice versa. The model sees the interaction. It can notice that one is phrased as a question and the other as a fact. It can check whether the document actually contains an answer, not just related vocabulary. Instead of two vectors being compared, the model produces a single relevance score for the pair.

That is what you want. Deep, interaction-aware scoring.
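
Running the same pair through a cross-encoder looks like this. Again a sketch with sentence-transformers; the reranker model is one publicly available example, not a recommendation.

```python
# Cross-encoder view of the same pair.
# The model name is one common public reranker, used here only as an example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I prevent heart attacks?"
pairs = [
    (query, "Heart attacks kill millions every year."),
    (query, "Regular exercise and a healthy diet lower the risk of heart attacks."),
]

# The query and document go through the model together as one sequence,
# so every token in one can attend to every token in the other.
scores = reranker.predict(pairs)
for (q, doc), score in zip(pairs, scores):
    print(f"{score:.3f}  {doc}")
# The document that actually answers the question should now score clearly higher.
```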

So why don’t we just use cross-encoders for everything?

Because they are expensive. In a bi-encoder setup, you embed every document once, store the vectors, and reuse them forever. In a cross-encoder setup, you cannot pre-compute anything, because the score depends on both the query and the document together. Every new query means running the full model over every document pair.

If you have 10 million documents, that is 10 million forward passes through a transformer for every single user query. Completely unusable at production scale.

The production pattern: retrieve, then rerank

The answer most production RAG systems converge on is a two-stage pipeline.

Stage one is the retriever. You use a bi-encoder over your entire corpus. It is fast, it is cheap, and its job is not to be correct. Its job is to narrow 10 million documents down to maybe 100 candidates, making sure the right answer is somewhere in that pile. This is called recall, the ability to catch the correct answer even if you also catch a lot of junk alongside it.

Stage two is the reranker. You take those 100 candidates and run a cross-encoder on each one, scoring them against the query. Now you get 100 scores that actually reflect whether each document answers the question. You take the top 10 and feed them to the language model.

Fast recall at the start. Deep precision at the end. Each stage does the job the other cannot.
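
Put together, the pipeline is only a few lines. This is a sketch, not production code; the model names and the 100-to-10 cutoffs are assumptions chosen to match the numbers above.

```python
# Retrieve-then-rerank in one sketch, using sentence-transformers for both stages.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Heart attacks kill millions every year.",
    "Regular exercise and a healthy diet lower the risk of heart attacks.",
    "The company cafeteria opens at 8am.",
]
# In a real system these embeddings are precomputed once and stored.
corpus_vecs = retriever.encode(corpus, convert_to_tensor=True)

def retrieve_then_rerank(query: str, recall_k: int = 100, final_k: int = 10) -> list[str]:
    # Stage one: cheap bi-encoder search over everything, tuned for recall.
    query_vec = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_vec, corpus_vecs, top_k=min(recall_k, len(corpus)))[0]
    candidates = [corpus[hit["corpus_id"]] for hit in hits]

    # Stage two: expensive cross-encoder scoring over the small candidate set.
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)

    # These top documents are what you hand to the language model.
    return [doc for doc, _ in reranked[:final_k]]

top_docs = retrieve_then_rerank("How do I prevent heart attacks?")
```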

The numbers that make it matter

Public benchmarks on standard RAG evaluation suites consistently show a big gap between retrieval-only and retrieve-then-rerank setups. Retrieval-only precision lands somewhere around 60%. Add a cross-encoder reranker and it jumps into the mid-80s.

That roughly 25-point delta is the difference between a RAG system that confidently answers your question and one that confidently hallucinates because the top document it pulled was topically related but factually irrelevant.

This is why rerankers are not a nice-to-have. They are the thing that makes RAG actually work.

What I learned

The real lesson from that interview was not about rerankers. It was about the gap between knowing the name of a technique and understanding the problem the technique was invented to solve.

“Use a reranker” is a fact you can memorize from a tutorial. “Bi-encoders rank without reading, so you need a cross-encoder to actually compare query and document together, but cross-encoders are too expensive to run over everything, which is why you use a bi-encoder first to narrow the field” is a mental model. The interviewer was not testing whether I had read about rerankers. She was testing whether I understood what retrieval was actually doing and where it broke.

I did not. Now I do.

If you are building RAG and you have not added a reranker yet, you are leaving accuracy on the table. If you are interviewing for ML or infrastructure roles, you will get asked some version of this question. Either way, the two-tower bottleneck is a load-bearing idea. It is worth sitting with.