RAG (Retrieval-Augmented Generation) is the most common production LLM pattern in 2026, powering everything from customer support chatbots to legal document analysis. But most RAG implementations stop at "chunk documents, embed, query", leaving 20-25 points of accuracy on the table. This guide covers the production practices that separate a 70% accurate RAG system from a 95% accurate one.
Chunking Strategies: The Foundation of RAG
| Strategy | Best For | Pros | Cons |
|---|---|---|---|
| Fixed-Size (512 tokens) | General documents, quick to implement | Simple, predictable | Splits sentences mid-thought; loses context |
| Sentence-Aware (split on sentence boundaries) | Articles, documentation | Semantically meaningful chunks | Can produce very uneven chunk sizes |
| Recursive (split on paragraph, then sentence, then token) | Mixed content types | Balanced approach | Slightly more complex to implement |
| Semantic (split on topic boundaries via embedding similarity) | Long, multi-topic documents | Best semantic coherence | Slower, requires LLM or embedding model |
| Agentic (LLM decides chunk boundaries) | Complex, unstructured documents | Best quality | Expensive, slow, LLM cost per chunk |
A good default chunk size in 2026 is 256-512 tokens with 10-20% overlap. Larger chunks (1,024+ tokens) carry more context per chunk but dilute retrieval precision. The overlap reduces the chance that an answer straddles a chunk boundary.
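As a concrete reference, here is a minimal token-based chunker with overlap. It is a sketch, assuming the tiktoken tokenizer (cl100k_base encoding); unlike a recursive or sentence-aware splitter, it ignores sentence boundaries.

```python
# Minimal sliding-window chunker over tokens, with overlap between neighbors.
# Assumes the tiktoken package; chunk_size/overlap follow the guidance above.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into ~chunk_size-token chunks that overlap by `overlap` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start : start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already covers the end of the document
    return chunks
```

A recursive splitter applies the same windowing idea but prefers paragraph and sentence boundaries before falling back to raw token splits.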
Embedding Model Selection
| Model | Dimensions | MTEB Score | Cost (per 1M tokens) | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 256-3,072 (Matryoshka) | 64.6 | $0.13 | General purpose, best quality |
| Cohere Embed v4 | 1,024 | 65.2 | $0.10 | Multilingual, long documents |
| BGE-M3 (BAAI) | 1,024 | 63.8 | Free (OSS) | Self-hosted, multilingual, dense+sparse hybrid |
| Jina embeddings v3 | 1,024 | 62.4 | Free (up to 1M tokens/day) | Long context (8K token input), task-specific |
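For illustration, here is a minimal embedding call with the OpenAI SDK; the `dimensions` parameter uses text-embedding-3-large's Matryoshka property to truncate vectors (here to 1,024 dimensions) and cut storage and search cost. This is a sketch, assuming OPENAI_API_KEY is set in the environment.

```python
# Sketch of embedding chunks with the OpenAI SDK; assumes OPENAI_API_KEY is set.
# `dimensions` truncates the Matryoshka embedding of text-embedding-3-large
# (here to 1,024 dims) to reduce vector-store size and search cost.
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str], dims: int = 1024) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=chunks,
        dimensions=dims,
    )
    return [item.embedding for item in response.data]
```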
Advanced Retrieval Techniques
- Hybrid search (dense + sparse): Combine vector similarity (semantic meaning) with BM25 keyword matching (exact terms). Critical for code search, legal docs, and any domain with precise terminology. A fusion sketch appears after this list.
- Multi-hop retrieval: First retrieval finds context, second retrieval uses that context to find more specific information. Essential for complex questions: "What was the revenue impact of the pricing change announced in Q3?"
- Re-ranking: Retrieve 20-50 chunks, then use a cross-encoder (Cohere Rerank, BGE-Reranker) to score relevance and keep the top 3-5. Adds ~100ms latency but dramatically improves precision.
- Query transformation: Rewrite user queries before retrieval — decompose complex questions into sub-questions, or generate hypothetical answers to use as search queries.
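Below is a minimal sketch of hybrid retrieval with reciprocal rank fusion (RRF). It assumes the rank_bm25 package and precomputed chunk embeddings in an (N, d) NumPy array (for example, from the embedding sketch above); the fused top candidates would then go to a cross-encoder re-ranker.

```python
# Sketch of hybrid retrieval with reciprocal rank fusion (RRF). Assumes the
# rank_bm25 package and precomputed chunk embeddings in an (N, d) NumPy array;
# the fused top candidates would then be passed to a cross-encoder re-ranker.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, query_vec: np.ndarray, chunks: list[str],
                  chunk_vecs: np.ndarray, top_k: int = 50, rrf_k: int = 60) -> list[str]:
    # Sparse ranking: BM25 over whitespace-tokenized chunks (best rank first).
    bm25 = BM25Okapi([c.split() for c in chunks])
    sparse_order = np.argsort(-bm25.get_scores(query.split()))

    # Dense ranking: cosine similarity against the precomputed embeddings.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    dense_order = np.argsort(-sims)

    # RRF: each ranker contributes 1 / (rrf_k + rank), so agreement wins.
    scores = np.zeros(len(chunks))
    for order in (sparse_order, dense_order):
        for rank, idx in enumerate(order, start=1):
            scores[idx] += 1.0 / (rrf_k + rank)
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]
```

Rank fusion is used here because BM25 and cosine scores live on incompatible scales; fusing by rank sidesteps score normalization entirely.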
RAG Architecture Pattern
User Query
-> Query Rewriting (decompose, expand)
-> Hybrid Retrieval (vector + keyword, 50 candidates)
-> Re-ranking (cross-encoder, top 5)
-> Context Assembly (deduplicate, order by relevance)
-> LLM Generation (with citation grounding)
-> Response with Citations
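The sketch below wires these stages together. The embed, search, and rerank callables are stand-ins, for example the chunking, embedding, and hybrid-search sketches above plus a cross-encoder re-ranker such as Cohere Rerank or BGE-Reranker; model names are illustrative, not prescriptive.

```python
# End-to-end sketch of the pipeline above. The embed/search/rerank callables
# are stand-ins (for example, the sketches from earlier sections plus a
# cross-encoder re-ranker); model names are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()

def answer(query: str, chunks: list[str], chunk_vecs, embed, search, rerank) -> str:
    # 1. Query rewriting: turn the user question into a standalone search query.
    rewrite = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Rewrite as a standalone search query: {query}"}],
    ).choices[0].message.content

    # 2. Hybrid retrieval: ~50 candidates from dense + sparse search.
    candidates = search(rewrite, embed([rewrite])[0], chunks, chunk_vecs, top_k=50)

    # 3. Re-ranking: a cross-encoder keeps the handful of most relevant chunks.
    context = rerank(rewrite, candidates, top_n=5)

    # 4. Context assembly + grounded generation with numbered citations.
    sources = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    prompt = (f"Answer using only the sources below and cite them as [n].\n\n"
              f"Sources:\n{sources}\n\nQuestion: {query}")
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```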
Bottom line: The default "chunk + embed + query" RAG pipeline gets you to ~70% accuracy. Upgrade to hybrid search + re-ranking for ~85%. Add query transformation and multi-hop retrieval for 90%+. The highest-ROI upgrade for most teams is adding re-ranking: it costs ~$0.01 per query and meaningfully improves answer quality. See also: AI API Integration Guide and LangChain vs LlamaIndex vs Haystack.