RAG Retrieval Optimization: Hybrid Search, Re-Ranking, Query Transformation


Introduction





Retrieval quality is the single biggest factor in RAG system performance. Even the best LLM cannot produce accurate answers from irrelevant context. This article covers three optimization layers: hybrid search that combines embedding similarity with keyword matching, re-ranking that refines initial results, and query transformation that bridges the gap between user questions and searchable terms.





Hybrid Search





Pure vector search excels at semantic similarity but misses exact keyword matches. Pure keyword search finds exact terms but misses conceptually related content. Hybrid search combines both:






from qdrant_client import QdrantClient
from qdrant_client.models import NamedSparseVector, NamedVector, SearchRequest

client = QdrantClient(host="localhost", port=6333)

def hybrid_search(
    query: str, collection: str = "documents", limit: int = 10
) -> list[dict]:
    # Dense vector from a bi-encoder (embedding_model is assumed
    # to be loaded elsewhere)
    dense_vector = embedding_model.encode(query).tolist()

    # Sparse vector, BM25-style (sparse_encoder is assumed to return
    # a qdrant-compatible SparseVector)
    sparse_vector = sparse_encoder.encode(query)

    # Assumes the collection was created with named vectors
    # "dense" and "sparse"
    results = client.search_batch(
        collection_name=collection,
        requests=[
            # Dense search, over-fetching 2x to give fusion room
            SearchRequest(
                vector=NamedVector(name="dense", vector=dense_vector),
                limit=limit * 2,
                with_payload=True,
            ),
            # Sparse search
            SearchRequest(
                vector=NamedSparseVector(name="sparse", vector=sparse_vector),
                limit=limit * 2,
                with_payload=True,
            ),
        ],
    )

    # Fusion: Reciprocal Rank Fusion over the two ranked lists
    return rrf_fusion(results[0], results[1], k=60)







Reciprocal Rank Fusion





RRF combines ranked lists from multiple retrieval methods:






def rrf_fusion(
    dense_results: list, sparse_results: list, k: int = 60, limit: int = 10
) -> list[dict]:
    scores = {}
    # Each result contributes 1 / (k + rank); documents found by both
    # methods accumulate both contributions.
    for rank, result in enumerate(dense_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
    for rank, result in enumerate(sparse_results):
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)

    reranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": id_, "score": score} for id_, score in reranked[:limit]]







RRF is simple, effective, and requires no training. The constant `k` (typically 60) prevents a single high rank from dominating.
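A quick arithmetic check makes the damping effect concrete (a standalone sketch, independent of the retrieval code above):

```python
# RRF contribution of a document at 1-based rank r: 1 / (k + r)
def rrf_contribution(rank: int, k: int = 60) -> float:
    return 1 / (k + rank)

# With k=60, rank 1 contributes only ~15% more than rank 10,
# so one method's top hit cannot swamp the fused score.
print(rrf_contribution(1))   # ~0.0164
print(rrf_contribution(10))  # ~0.0143

# With k=0, rank 1 contributes 10x more than rank 10.
print(rrf_contribution(1, k=0))   # 1.0
print(rrf_contribution(10, k=0))  # 0.1
```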





Cross-Encoder Re-Ranking





After initial retrieval, a cross-encoder model re-scores candidates with higher accuracy:






from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 50, rerank_top: int = 5) -> list[dict]:
    # First stage: fast bi-encoder retrieval (each candidate is assumed
    # to carry its document text under the "text" key)
    candidates = hybrid_search(query, limit=top_k)

    # Second stage: cross-encoder re-ranking over (query, passage) pairs
    pairs = [(query, cand["text"]) for cand in candidates]
    scores = reranker.predict(pairs)

    for cand, score in zip(candidates, scores):
        cand["rerank_score"] = float(score)

    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_top]







Cross-encoders are 10-100x slower than bi-encoders but significantly more accurate. The two-stage pattern (wide bi-encoder retrieval, narrow cross-encoder re-ranking) balances speed and quality.





Query Transformation





User queries are rarely optimal for retrieval. Transform them before searching:






def transform_query(user_query: str, technique: str = "expansion") -> str:
    if technique == "expansion":
        return expand_query(user_query)
    elif technique == "decomposition":
        return decompose_query(user_query)
    elif technique == "hypothetical":
        return hyde_query(user_query)
    raise ValueError(f"Unknown technique: {technique}")

def expand_query(query: str) -> str:
    """Generate search-friendly expansions of the original query."""
    expansions = call_llm(f"""
    Generate 3 alternative phrasings of this query for better search retrieval.
    Keep the core meaning but vary terminology.

    Original: {query}
    """)
    return f"{query}\n{expansions}"

def hyde_query(query: str) -> str:
    """Hypothetical Document Embeddings: generate a hypothetical ideal document,
    then use its embedding for retrieval."""
    hypothetical = call_llm(f"Write a short passage that perfectly answers: {query}")
    return hypothetical
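To make the HyDE flow concrete, here is a self-contained toy sketch: the stubbed LLM, the two-document corpus, and the bag-of-words "embeddings" are all illustrative stand-ins, not the production pipeline. The point is only the shape of the flow: embed the hypothetical answer, not the raw query.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def stub_llm(prompt: str) -> str:
    # Stand-in for call_llm: pretend the model wrote an ideal passage
    return "reciprocal rank fusion combines ranked lists from dense and sparse retrievers"

corpus = {
    "doc1": "reciprocal rank fusion merges ranked lists from multiple retrievers",
    "doc2": "cross encoders jointly score a query and passage pair",
}

def hyde_search(query: str) -> str:
    # Embed the hypothetical answer, then retrieve by similarity to it
    hypothetical = stub_llm(f"Write a short passage that perfectly answers: {query}")
    hyp_vec = toy_embed(hypothetical)
    return max(corpus, key=lambda d: cosine(hyp_vec, toy_embed(corpus[d])))

print(hyde_search("what is RRF?"))  # doc1
```

The hypothetical passage shares far more vocabulary with doc1 than the three-word query does, which is exactly the gap HyDE exploits.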







Query Decomposition





Complex questions should be split into sub-queries, each searched independently:






def decompose_and_retrieve(question: str) -> list[dict]:
    sub_queries = call_llm(f"""
    Break this question into 2-4 independent sub-questions:
    {question}
    """)
    sub_queries = parse_list(sub_queries)

    all_results = []
    for sq in sub_queries:
        results = hybrid_search(sq, limit=5)
        all_results.extend(results)

    # Deduplicate by ID, keeping the first (highest-ranked) occurrence
    seen = set()
    unique_results = []
    for r in all_results:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique_results.append(r)
    return unique_results[:10]







Measuring Retrieval Quality





Track these metrics to validate improvements:






import numpy as np

def evaluate_retrieval(test_queries: list[dict], retriever_fn) -> dict:
    """test_queries: [{query, relevant_doc_ids}]"""
    metrics = {"recall@k": {}, "mrr": 0.0}
    for k in [1, 3, 5, 10]:
        recalls = []
        for tq in test_queries:
            results = retriever_fn(tq["query"], limit=k)
            retrieved_ids = {r["id"] for r in results}
            relevant = set(tq["relevant_doc_ids"])
            recall = len(retrieved_ids & relevant) / len(relevant) if relevant else 0
            recalls.append(recall)
        # Record recall per cutoff instead of overwriting the dict
        metrics["recall@k"][k] = float(np.mean(recalls))

    # MRR: average over queries of 1 / rank of the first relevant result
    reciprocal_ranks = []
    for tq in test_queries:
        results = retriever_fn(tq["query"], limit=10)
        relevant = set(tq["relevant_doc_ids"])
        rr = next((1 / rank for rank, r in enumerate(results, start=1)
                   if r["id"] in relevant), 0.0)
        reciprocal_ranks.append(rr)
    metrics["mrr"] = float(np.mean(reciprocal_ranks))
    return metrics
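As a sanity check on the metric definitions themselves, here is a toy example computed by hand (the document IDs and ranking are illustrative):

```python
# One test query whose relevant docs are {"a", "b"}; the retriever
# returned ["c", "a", "d", "b"] in that order.
relevant = {"a", "b"}
retrieved = ["c", "a", "d", "b"]

# recall@k: fraction of relevant docs among the top k results
recall_at_2 = len(set(retrieved[:2]) & relevant) / len(relevant)  # only "a" -> 0.5
recall_at_4 = len(set(retrieved[:4]) & relevant) / len(relevant)  # both -> 1.0

# Reciprocal rank: first relevant doc appears at rank 2 -> RR = 1/2
rr = next(1 / rank for rank, d in enumerate(retrieved, start=1) if d in relevant)

print(recall_at_2, recall_at_4, rr)  # 0.5 1.0 0.5
```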







Conclusion





Optimize RAG retrieval in three layers. First, implement hybrid search combining dense and sparse vectors with RRF fusion. Second, add cross-encoder re-ranking to sharpen precision on the first-stage candidates. Third, transform user queries through expansion, decomposition, or HyDE. Measure recall@k and MRR after each change to validate improvements before deploying.