RAG (Retrieval-Augmented Generation) is the most common production LLM pattern in 2026, powering everything from customer support chatbots to legal document analysis. But most RAG implementations stop at "chunk documents, embed, query" — leaving 50%+ of potential accuracy on the table. This guide covers the production practices that separate a 70% accurate RAG system from a 95% accurate one.

Chunking Strategies: The Foundation of RAG

| Strategy | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Fixed-size (512 tokens) | General documents, quick to implement | Simple, predictable | Splits sentences mid-thought; loses context |
| Sentence-aware (split on sentence boundaries) | Articles, documentation | Semantically meaningful chunks | Can produce very uneven chunk sizes |
| Recursive (split on paragraph, then sentence, then token) | Mixed content types | Balanced approach | Slightly more complex to implement |
| Semantic (split on topic boundaries via embedding similarity) | Long, multi-topic documents | Best semantic coherence | Slower, requires LLM or embedding model |
| Agentic (LLM decides chunk boundaries) | Complex, unstructured documents | Best quality | Expensive, slow, LLM cost per chunk |

The optimal chunk size in 2026: 256-512 tokens with 10-20% overlap. Larger chunks (1,024+) improve context but dilute retrieval precision. The overlap ensures no answer straddles a chunk boundary.
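As a minimal sketch of the overlap idea, here is a fixed-size chunker over a pre-tokenized list (a plain Python list stands in for real tokenizer output; `chunk_fixed` and its parameters are illustrative names, not a specific library's API):

```python
def chunk_fixed(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks.

    The last `overlap` tokens of each chunk are repeated at the start of
    the next one, so an answer that straddles a boundary still appears
    whole in at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)
            if tokens[i:i + size]]

# Toy example: 10 "tokens", chunk size 4, overlap 1 (25%)
chunks = chunk_fixed(list(range(10)), size=4, overlap=1)
# Each chunk starts with the last token of the previous chunk
```

Note that the final chunk can be shorter than `size`; production chunkers usually either keep it (short tail) or merge it into the previous chunk.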

Embedding Model Selection

| Model | Dimensions | MTEB Score | Cost (per 1M tokens) | Best For |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-large | 256-3,072 (Matryoshka) | 64.6 | $0.13 | General purpose, best quality |
| Cohere Embed v4 | 1,024 | 65.2 | $0.10 | Multilingual, long documents |
| BGE-M3 (BAAI) | 1,024 | 63.8 | Free (OSS) | Self-hosted, multilingual, dense+sparse hybrid |
| Jina embeddings v3 | 1,024 | 62.4 | Free (up to 1M tokens/day) | Long context (8K token input), task-specific |
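Matryoshka embeddings (the 256-3,072 range above) let you truncate a vector to its leading dimensions and re-normalize, trading quality for storage. A stdlib-only sketch of that truncation, with toy vectors in place of real model output:

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize to unit length so cosine similarity stays valid."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dim "embedding" truncated to 2 dims
small = truncate_embedding([3.0, 4.0, 0.1, 0.2], dims=2)  # ~[0.6, 0.8]
```

In practice you would store the truncated vectors in your index and keep the full-width ones only if you plan to re-rank with them later.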

Advanced Retrieval Techniques

  • Hybrid search (dense + sparse): Combine vector similarity (semantic meaning) with BM25 keyword matching (exact terms). Critical for code search, legal docs, and any domain with precise terminology.
  • Multi-hop retrieval: First retrieval finds context, second retrieval uses that context to find more specific information. Essential for complex questions: "What was the revenue impact of the pricing change announced in Q3?"
  • Re-ranking: Retrieve 20-50 chunks, then use a cross-encoder (Cohere Rerank, BGE-Reranker) to score relevance and keep the top 3-5. Adds ~100ms latency but dramatically improves precision.
  • Query transformation: Rewrite user queries before retrieval — decompose complex questions into sub-questions, or generate hypothetical answers to use as search queries.
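The dense and sparse result lists from hybrid search have incompatible score scales, so they are usually merged by rank rather than score. A common approach is Reciprocal Rank Fusion (RRF); this is a generic sketch, not any particular vector database's implementation:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked doc-ID lists (e.g. dense + BM25).

    Each document scores sum(1 / (k + rank)) over every list it appears in;
    k=60 is the constant from the original RRF paper and damps the
    influence of any single list's top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # vector-similarity ranking
sparse = ["d1", "d9", "d3"]  # BM25 keyword ranking
fused = rrf([dense, sparse])  # d1 first: ranked highly in both lists
```

Documents that appear in both lists float to the top, which is exactly the behavior you want before handing candidates to a cross-encoder re-ranker.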

RAG Architecture Pattern

User Query
  -> Query Rewriting (decompose, expand)
  -> Hybrid Retrieval (vector + keyword, 50 candidates)
  -> Re-ranking (cross-encoder, top 5)
  -> Context Assembly (deduplicate, order by relevance)
  -> LLM Generation (with citation grounding)
  -> Response with Citations
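The pipeline above can be sketched as a single function over caller-supplied stages. Everything here is a placeholder (`retriever`, `reranker`, `llm`, the `Answer` type, and the dict shape of documents are assumptions for illustration, not a specific framework's API):

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    citations: list

def answer_query(query, retriever, reranker, llm, n_candidates=50, top_k=5):
    """Sketch of the RAG pipeline: rewrite -> retrieve -> re-rank -> generate."""
    sub_queries = [query]  # query rewriting would expand/decompose this list
    candidates = []
    for q in sub_queries:
        candidates.extend(retriever(q, limit=n_candidates))  # hybrid retrieval
    # Context assembly: deduplicate by ID while preserving retrieval order
    seen, unique = set(), []
    for doc in candidates:
        if doc["id"] not in seen:
            seen.add(doc["id"])
            unique.append(doc)
    top = reranker(query, unique)[:top_k]        # cross-encoder scoring
    context = "\n\n".join(d["text"] for d in top)
    text = llm(query=query, context=context)     # grounded generation
    return Answer(text=text, citations=[d["id"] for d in top])
```

Keeping each stage a plain callable makes it easy to swap a BM25-only retriever for a hybrid one, or to A/B test re-rankers, without touching the orchestration code.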

Bottom line: The default "chunk + embed + query" RAG pipeline gets you to ~70% accuracy. Upgrading to hybrid search + re-ranking reaches ~85%; adding query transformation and multi-hop retrieval pushes past 90%. The highest-ROI upgrade for most teams is re-ranking: it costs ~$0.01 per query and meaningfully improves answer quality. See also: AI API Integration Guide and LangChain vs LlamaIndex vs Haystack.