RAG Retrieval Optimization: Hybrid Search, Re-Ranking, Query Transformation
Introduction
Retrieval quality is the single biggest factor in RAG system performance. Even the best LLM cannot produce accurate answers from irrelevant context. This article covers three optimization layers: hybrid search that combines embedding similarity with keyword matching, re-ranking that refines initial results, and query transformation that bridges the gap between user questions and searchable terms.
Hybrid Search
Pure vector search excels at semantic similarity but misses exact keyword matches. Pure keyword search finds exact terms but misses conceptually related content. Hybrid search combines both:
from qdrant_client import QdrantClient, models

client = QdrantClient(host="localhost", port=6333)

# embedding_model (a dense bi-encoder) and sparse_encoder (a BM25-style
# encoder exposing .indices and .values) are assumed to be initialized elsewhere.
def hybrid_search(
    query: str, collection: str = "documents", limit: int = 10
) -> list[dict]:
    # Generate dense vector
    dense_vector = embedding_model.encode(query).tolist()
    # Sparse vector (BM25-style)
    sparse_vector = sparse_encoder.encode(query)
    # The collection is assumed to define named "dense" and "sparse" vectors
    results = client.search_batch(
        collection_name=collection,
        requests=[
            # Dense search
            models.SearchRequest(
                vector=models.NamedVector(name="dense", vector=dense_vector),
                limit=limit * 2,
                with_payload=True,
            ),
            # Sparse search
            models.SearchRequest(
                vector=models.NamedSparseVector(
                    name="sparse",
                    vector=models.SparseVector(
                        indices=sparse_vector.indices,
                        values=sparse_vector.values,
                    ),
                ),
                limit=limit * 2,
                with_payload=True,
            ),
        ],
    )
    # Fusion: Reciprocal Rank Fusion over the two ranked lists
    return rrf_fusion(results[0], results[1], k=60, limit=limit)
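The search above assumes the collection was created with named dense and sparse vectors. A minimal setup sketch, assuming a 384-dimension dense encoder (MiniLM-style; adjust the size to your model):

client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE)
    },
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)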
Reciprocal Rank Fusion
RRF combines ranked lists from multiple retrieval methods:
def rrf_fusion(
    dense_results: list, sparse_results: list, k: int = 60, limit: int = 10
) -> list[dict]:
    scores, payloads = {}, {}
    for result_list in (dense_results, sparse_results):
        for rank, result in enumerate(result_list):  # ranks are 1-indexed below
            scores[result.id] = scores.get(result.id, 0) + 1 / (k + rank + 1)
            payloads[result.id] = result.payload or {}  # keep text for re-ranking
    reranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [
        {"id": id_, "score": score, "text": payloads[id_].get("text", "")}
        for id_, score in reranked[:limit]
    ]
RRF is simple, effective, and requires no training. The constant `k` (typically 60) prevents a single high rank from dominating.
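Formally, each document's fused score is the sum of its reciprocal ranks across retrievers:

$$\mathrm{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \mathrm{rank}_r(d)}$$

With k = 60, a document ranked 1st by dense search and 3rd by sparse search scores 1/61 + 1/63 ≈ 0.032, while one found 1st by only a single retriever scores 1/61 ≈ 0.016, so agreement between retrievers is rewarded without letting any single rank dominate.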
Cross-Encoder Re-Ranking
After initial retrieval, a cross-encoder model re-scores candidates with higher accuracy:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, top_k: int = 50, rerank_top: int = 5) -> list[dict]:
    # First stage: fast bi-encoder retrieval
    candidates = hybrid_search(query, limit=top_k)
    # Second stage: cross-encoder re-ranking of (query, passage) pairs
    pairs = [(query, cand["text"]) for cand in candidates]
    scores = reranker.predict(pairs)
    for cand, score in zip(candidates, scores):
        cand["rerank_score"] = float(score)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_top]
Cross-encoders are 10-100x slower than bi-encoders but significantly more accurate. The two-stage pattern (wide bi-encoder retrieval, narrow cross-encoder re-ranking) balances speed and quality.
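A quick usage sketch of the two-stage pipeline (the query string is illustrative):

results = retrieve_and_rerank("How does reciprocal rank fusion weight ranks?")
for r in results:
    print(f"{r['rerank_score']:.3f}  {r['text'][:80]}")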
Query Transformation
User queries are rarely optimal for retrieval. Transform them before searching:
def transform_query(user_query: str, technique: str = "expansion") -> str:
    if technique == "expansion":
        return expand_query(user_query)
    elif technique == "decomposition":
        # Decomposition changes the retrieval flow; a full sketch,
        # decompose_and_retrieve, appears in the next section
        return decompose_query(user_query)
    elif technique == "hypothetical":
        return hyde_query(user_query)
    raise ValueError(f"Unknown technique: {technique}")

def expand_query(query: str) -> str:
    """Generate search-friendly expansions of the original query."""
    expansions = call_llm(f"""
    Generate 3 alternative phrasings of this query for better search retrieval.
    Keep the core meaning but vary terminology.
    Original: {query}
    """)
    return f"{query}\n{expansions}"

def hyde_query(query: str) -> str:
    """Hypothetical Document Embeddings (HyDE): generate a hypothetical ideal
    document, then use its embedding for retrieval."""
    hypothetical = call_llm(f"Write a short passage that perfectly answers: {query}")
    return hypothetical
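These transforms lean on a `call_llm` helper that is not defined above. One possible sketch, assuming the OpenAI Python client; any chat-completion API works, and the model name is illustrative:

from openai import OpenAI

llm_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    # Single-turn completion; swap in your provider of choice
    response = llm_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content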
Query Decomposition
Complex questions should be split into sub-queries, each searched independently:
def decompose_and_retrieve(question: str) -> list[dict]:
    sub_queries = call_llm(f"""
    Break this question into 2-4 independent sub-questions:
    {question}
    """)
    # parse_list extracts clean sub-queries from the LLM response (sketch below)
    sub_queries = parse_list(sub_queries)
    all_results = []
    for sq in sub_queries:
        results = hybrid_search(sq, limit=5)
        all_results.extend(results)
    # Deduplicate by ID, preserving sub-query order
    seen = set()
    unique_results = []
    for r in all_results:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique_results.append(r)
    return unique_results[:10]
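`parse_list` is likewise left undefined; a minimal sketch that strips numbering and bullets from each line of the LLM response:

import re

def parse_list(llm_output: str) -> list[str]:
    """Extract clean items from a numbered or bulleted LLM response."""
    items = []
    for line in llm_output.splitlines():
        line = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if line:
            items.append(line)
    return items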
Measuring Retrieval Quality
Track these metrics to validate improvements:
import numpy as np

def evaluate_retrieval(test_queries: list[dict], retriever_fn) -> dict:
    """test_queries: [{"query": ..., "relevant_doc_ids": [...]}]"""
    metrics = {"recall@k": {}, "mrr": 0.0}
    for k in [1, 3, 5, 10]:
        recalls = []
        for tq in test_queries:
            results = retriever_fn(tq["query"], limit=k)
            retrieved_ids = {r["id"] for r in results}
            relevant = set(tq["relevant_doc_ids"])
            recall = len(retrieved_ids & relevant) / len(relevant) if relevant else 0
            recalls.append(recall)
        metrics["recall@k"][k] = float(np.mean(recalls))
    # MRR: mean reciprocal rank of the first relevant hit in the top 10
    reciprocal_ranks = []
    for tq in test_queries:
        results = retriever_fn(tq["query"], limit=10)
        relevant = set(tq["relevant_doc_ids"])
        rank = next((i + 1 for i, r in enumerate(results) if r["id"] in relevant), None)
        reciprocal_ranks.append(1 / rank if rank else 0)
    metrics["mrr"] = float(np.mean(reciprocal_ranks))
    return metrics
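A usage sketch with a hand-labeled test set (the document IDs are hypothetical):

test_queries = [
    {"query": "what does the RRF constant k control?", "relevant_doc_ids": ["doc-12"]},
    {"query": "when does sparse beat dense retrieval?", "relevant_doc_ids": ["doc-3", "doc-7"]},
]
metrics = evaluate_retrieval(test_queries, hybrid_search)
print(metrics)  # {"recall@k": {1: ..., 3: ..., 5: ..., 10: ...}, "mrr": ...}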
Conclusion
Optimize RAG retrieval in three layers. First, implement hybrid search combining dense and sparse vectors with RRF fusion. Second, add cross-encoder re-ranking to sharpen precision over the first-stage candidates. Third, transform user queries through expansion, decomposition, or HyDE. Measure recall@k and MRR after each change to validate improvements before deploying.