Embeddings: Techniques and Best Practices

Embeddings convert text into dense vector representations that capture semantic meaning. They are the foundation of semantic search, clustering, recommendation systems, and retrieval-augmented generation.


Embedding Models


Different embedding models excel at different tasks. OpenAI's text-embedding-ada-002 (1536 dimensions) is a strong general-purpose model. text-embedding-3-small (1536 dimensions by default, reducible via the API's dimensions parameter) offers better retrieval quality than ada-002 at lower cost. Sentence-transformers models such as all-MiniLM-L6-v2 (384 dimensions) run locally.


Multilingual embeddings support cross-lingual retrieval: intfloat/multilingual-e5-large works across roughly 100 languages, and Cohere's embed-multilingual models likewise cover a wide range of languages. Domain-specific embeddings fine-tuned on your own data often outperform general-purpose models.


Embedding Quality Factors


Embedding quality depends on training data, model architecture, and dimensionality. Higher dimensions capture more information but cost more to store and query. Matryoshka embeddings are trained so that the leading dimensions carry the most information, which lets you truncate vectors to a smaller dimension without retraining.
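A minimal sketch of Matryoshka-style truncation, assuming the model was trained with a Matryoshka objective (otherwise truncation discards information arbitrarily). The random vector here just stands in for a real model output.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length.

    Only meaningful for models trained with a Matryoshka-style objective,
    where the leading dimensions carry the most information.
    """
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Toy stand-in for a real 1536-dimensional model output.
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

Re-normalizing after truncation matters: it restores unit length so cosine similarity and dot product stay interchangeable downstream.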


Text normalization matters. Remove irrelevant formatting, standardize whitespace, and handle special characters consistently. Embedding a long text averages its meaning into a single vector, so chunk long documents; around 1024 tokens is a reasonable default chunk size, but validate against your own data. For asymmetric search, experiment with the prefix instructions the model was trained with ("search_query:" versus "search_document:").
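A sketch of chunking with overlap, using word counts as a rough proxy for tokens; the function name and parameters are illustrative, and a real tokenizer (e.g. tiktoken) would give token-accurate boundaries.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks, with `overlap` words shared
    between neighboring chunks so sentences cut at a boundary still
    appear whole in at least one chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# 500 synthetic "words" -> chunks covering 0-199, 160-359, 320-499.
doc = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(doc, chunk_size=200, overlap=40)
print(len(chunks))  # 3
```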


Similarity Metrics


Cosine similarity is the most common metric: it measures the angle between vectors, ignoring magnitude. Dot product considers both angle and magnitude; with unit-normalized vectors it is equivalent to cosine similarity. Euclidean distance also reflects magnitude differences, which is useful for clustering.
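The three metrics, and the cosine/dot-product equivalence under normalization, can be checked in a few lines of NumPy on hand-picked vectors:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (magnitude-invariant)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance (magnitude-sensitive)."""
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 2.0])  # norm 3
b = np.array([2.0, 0.0, 0.0])  # norm 2

print(cosine(a, b))            # 2 / (3 * 2) = 0.333...
print(float(np.dot(a, b)))     # 2.0 -- dot product sees magnitude
print(euclidean(a, b))         # sqrt(1 + 4 + 4) = 3.0

# After unit-normalization, dot product equals cosine similarity.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(float(np.dot(a_n, b_n)))  # 0.333...
```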


Choose the similarity metric your embedding model was trained for. OpenAI embeddings are unit-normalized, so cosine similarity and dot product rank results identically. Sentence-transformers models typically target cosine similarity, while some providers, such as Cohere, recommend dot product for certain models. When in doubt, check your model's documentation.


Vector Databases


Pinecone, Weaviate, Qdrant, and Milvus are purpose-built vector databases. PostgreSQL with pgvector extends existing databases with vector search. Chroma is lightweight for development. Each offers different trade-offs in scalability, consistency, and query features.


Index type determines the trade-off between search speed and accuracy. HNSW (Hierarchical Navigable Small World) offers fast approximate nearest-neighbor search at the cost of extra memory. IVF (Inverted File Index) is more memory-efficient but requires tuning how many partitions to probe. Brute-force search is exact but slow for large collections.
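For context on what the approximate indexes are approximating, exact brute-force search is a few lines of NumPy. This is an illustrative sketch, not any library's API; the function name and random corpus are made up.

```python
import numpy as np

def brute_force_knn(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Exact top-k neighbors by cosine similarity over the whole corpus.

    Cost is O(n * d) per query: fine for thousands of vectors, too slow
    at scale -- which is where HNSW and IVF come in.
    """
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = corpus_n @ query_n               # one cosine score per row
    return np.argsort(-sims)[:k]            # indices, best first

rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 64))
query = corpus[42] + 0.01 * rng.normal(size=64)  # slightly perturbed copy of row 42
print(brute_force_knn(query, corpus, k=3)[0])    # 42
```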


Preprocessing


Clean text before embedding. Remove HTML tags, normalize unicode, standardize whitespace, and handle special characters. For retrieval, prepend task prefixes matching the embedding model's training format. Test different chunk sizes and overlap strategies for your specific use case.
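The cleaning steps above can be sketched with the standard library; the function name and default prefix are illustrative, and real pipelines may need HTML-aware parsing rather than a tag-stripping regex.

```python
import html
import re
import unicodedata

def clean_for_embedding(text: str, prefix: str = "") -> str:
    """Minimal pre-embedding cleanup: strip HTML tags, decode entities,
    normalize unicode (NFKC), and collapse whitespace. The optional
    prefix supports models trained with task instructions
    (e.g. "search_document: ")."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = html.unescape(text)                  # &nbsp; -> non-breaking space, etc.
    text = unicodedata.normalize("NFKC", text)  # ligature fi -> fi, NBSP -> space
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return prefix + text

raw = "<p>Hello&nbsp;<b>world</b></p>\n\tﬁne print"
print(clean_for_embedding(raw, prefix="search_document: "))
# search_document: Hello world fine print
```

Note the ordering: entities are decoded after tag stripping so they land in the text, and unicode normalization runs before whitespace collapsing so NFKC-converted spaces (like NBSP) get collapsed too.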