Decision: For most RAG pipelines, use recursive 512-token chunks [4], text-embedding-3-small or Qwen3-Embedding-0.6B as your starting model [3], an HNSW index up to ~100M vectors and IVF or IVF-PQ above that [7], and BM25+dense hybrid retrieval with RRF fusion when exact-keyword recall matters [9]. Chunking quality constrains retrieval accuracy more than model choice [4]. Every embedder swap forces a full re-index, so re-embed only on model switches or chunking restructures [16].
Embedding Models
The MTEB quality curve flattens past 768 dimensions for most tasks: going from 1536 to 3072 dims adds marginal recall at roughly 2× the storage cost [3]. Open-weight models now match or exceed closed APIs on MTEB v2; the self-hosting crossover is ~500M–1B tokens/month [2]. Models marked † support Matryoshka Representation Learning (MRL): truncating to 256 dims retains 93–95% of retrieval quality at 6× lower storage than 1536 dims [3]. MTEB scores are not directly comparable across leaderboard versions, so benchmark on your own corpus before committing [1].
| Model | Provider | MTEB | Dims | $/1M tokens | Best for |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | ~55 (ret) | 1536† | $0.02 | Safe default; widest integration |
| text-embedding-3-large | OpenAI | ~59 (ret) | 3072† | $0.13 | Text-only high quality |
| voyage-3-large | Voyage AI | ~62 (ret) | 1024 | $0.18 | Top retrieval API benchmark |
| jina-embeddings-v4 | Jina AI | ~65 (ret) | 1024† | $0.02 | Best-value API; MRL 256→<1% quality loss |
| Gemini Embedding 2 | Google | SOTA (ret) | — | API | Multilingual & cross-modal; all-rounder |
| Qwen3-Embedding-8B | Alibaba | ~75.1 (avg) | — | Self-hosted | Open-weight SOTA; 119 languages |
| Qwen3-Embedding-0.6B | Alibaba | ~67.8 (avg) | — | Self-hosted | Compact; matches API leaders |
| BGE-M3 | BAAI | ~68.2 (avg) | 1024 | Self-hosted | 100+ langs; dense+sparse+ColBERT in one |
| NV-Embed-v2 | NVIDIA | ~64 (ret) | 4096 | Self-hosted | High overall MTEB; large dimension |
Sources: [1], [2], [3]. MTEB (ret) = retrieval sub-task; (avg) = overall MTEB average.
At comparable scale, self-hosting BGE-M3 costs ~$500/month vs ~$13,000/month for OpenAI’s API [2]. For multilingual and cross-modal retrieval, Gemini Embedding 2 is the strongest all-rounder [1].
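The MRL truncation trade-off described above can be sketched in a few lines. This is a minimal illustration, assuming the embeddings come from an MRL-trained model (truncating a non-MRL model this way generally hurts quality far more). The helper names `truncate_mrl` and `storage_gb` are hypothetical, and the random matrix is only a stand-in for real model output.

```python
import numpy as np

def truncate_mrl(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components, then re-normalize each row to unit length."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def storage_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB, ignoring index overhead (float32 by default)."""
    return n_vectors * dims * bytes_per_value / 1e9

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1536)).astype(np.float32)  # stand-in for real embeddings
small = truncate_mrl(full, 256)

print(small.shape)
print(storage_gb(10_000_000, 1536), storage_gb(10_000_000, 256))
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable. The storage ratio between 1536 and 256 dims is exactly 6×, which matches the figure quoted for MRL models.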
Chunking Strategies
The best and worst strategies on the same corpus differ by ~9% in recall [4]. A January 2026 systematic analysis (arXiv:2601.14123) [20] found that chunk overlap adds no measurable benefit and raises indexing cost; treat overlap as a tunable parameter rather than a mandatory default. Optimal chunk size by query type: factoid/lookup → 64–256 tokens; analytical/narrative → 512–1024 tokens; quality degrades sharply beyond ~2,500 tokens [4].
| Strategy | Best for | Size target | Rel. speed | Key trade-off |
|---|---|---|---|---|
| Recursive | General default | 512 tok | Fast | 69% end-to-end accuracy in head-to-head [4] |
| Fixed-size | Simple, uniform docs | 512–1024 tok | Fastest | Splits mid-sentence; ignores boundaries |
| Sentence | Clean narrative prose | ~100 tok avg | Fast | Matches semantic quality for docs <5k tok [14] |
| Semantic | Multi-topic unstructured prose | ~43 tok avg | 14× slower | Chunks too short for LLM generation [4] |
| Hierarchical | Long analytical docs | Leaf + parent | Medium | Embeds small, retrieves big; best factoid+narrative |
| Late chunking | Long coherent docs | Full doc first | Medium | +1.9% nDCG@10 over naive; needs long-ctx model [5] |
| Contextual | High-stakes, lower volume | 512 + LLM prefix | Slow (LLM) | −35% retrieval failures; ~$1.02/M tokens [15] |
| Agentic | Irregular / complex structure | Variable | Slowest | Highest quality ceiling; high latency + cost |
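To make the "Recursive" row concrete, here is a minimal sketch of recursive splitting: try the coarsest separator first, greedily merge pieces up to the size target, and only fall back to finer separators for pieces that are still too large. It is an illustration under stated assumptions, not any particular library's implementation: `count_tokens` counts whitespace-separated words as a stand-in for the embedding model's tokenizer, and the separator list is a hypothetical default.

```python
from typing import List, Sequence

def count_tokens(text: str) -> int:
    """Assumption: whitespace words approximate model tokens."""
    return len(text.split())

def recursive_chunks(text: str, max_tokens: int = 512,
                     separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard split on words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_chunks(text, max_tokens, finer)
    chunks, buffer = [], ""
    for part in parts:
        candidate = part if not buffer else buffer + sep + part
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate              # still fits: keep merging
            continue
        if buffer.strip():
            chunks.append(buffer)           # flush what we have so far
        if count_tokens(part) > max_tokens:
            chunks.extend(recursive_chunks(part, max_tokens, finer))
            buffer = ""
        else:
            buffer = part
    if buffer.strip():
        chunks.append(buffer)
    return chunks

paragraph = " ".join(f"w{i}" for i in range(300))
doc = "\n\n".join([paragraph] * 4)
chunks = recursive_chunks(doc, max_tokens=512)
print(len(chunks), [count_tokens(c) for c in chunks])
```

No overlap is applied here, in line with the finding cited above that overlap is a tunable parameter rather than a required default.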
Late chunking [5] [6]: embed all document tokens first, then split the token-embedding sequence before mean pooling. The transformer sees full context before chunking, so coreferences (“it”, “they”, “the city”) resolve correctly. Gains of 1.8–1.9% nDCG@10 over naive chunking on BEIR; gains grow with document length [5]. Requires a long-context embedding model (e.g., jina-embeddings-v3).
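The pooling step of late chunking can be sketched as follows. This assumes you already have per-token embeddings for the whole document from a long-context model; the random matrix below is only a stand-in for that output, and `late_chunk_pool` and `fixed_spans` are hypothetical helper names.

```python
import numpy as np

def fixed_spans(n_tokens: int, chunk_tokens: int):
    """Half-open (start, end) token spans covering the document."""
    return [(s, min(s + chunk_tokens, n_tokens)) for s in range(0, n_tokens, chunk_tokens)]

def late_chunk_pool(token_embeddings: np.ndarray, spans) -> np.ndarray:
    """Mean-pool contextualized token embeddings per span, then L2-normalize."""
    pooled = np.stack([token_embeddings[s:e].mean(axis=0) for s, e in spans])
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1200, 64))  # stand-in: one row per document token
spans = fixed_spans(1200, 512)
chunk_vectors = late_chunk_pool(token_embeddings, spans)
print(spans, chunk_vectors.shape)
```

The only difference from naive chunking is the order of operations: the transformer runs once over the full document, and splitting happens on its output rather than on its input.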
Contextual Retrieval (Anthropic): an LLM prepends a 50–100 token context summary to each chunk before embedding. On its own this cuts retrieval failures by 35%; combined with cross-encoder reranking, the reduction reaches 67% [15]. Cost is ~$1.02/M document tokens with prompt caching [4].
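The prepend step can be sketched as below. The LLM call is deliberately abstracted behind a `describe` callable, which is a hypothetical interface and not Anthropic's API; `toy_describe` is a stand-in so the example runs without a model.

```python
from typing import Callable, List

def contextualize_chunks(document: str, chunks: List[str],
                         describe: Callable[[str, str], str]) -> List[str]:
    """Prefix each chunk with a short, document-aware context string before embedding."""
    return [f"{describe(document, chunk).strip()}\n\n{chunk}" for chunk in chunks]

def toy_describe(document: str, chunk: str) -> str:
    """Stand-in for an LLM call: uses the document's first line as context."""
    title = document.strip().splitlines()[0]
    return f"From the document '{title}'."

doc = "Acme Q2 Report\nRevenue grew 3%.\nCosts fell."
chunks = ["Revenue grew 3%.", "Costs fell."]
contextual = contextualize_chunks(doc, chunks, toy_describe)
print(contextual[0])
```

In a real pipeline `describe` would send the full document plus the chunk to an LLM, which is where prompt caching of the document prefix matters for cost.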
ANN Index Algorithms
| Algorithm | Recall | Query latency | Memory | Build speed | Filtered search |
|---|---|---|---|---|---|
| HNSW | ~98%+ | Sub-ms; log complexity | High | Slow | Degrades >90% filter ratio [7] |
| IVF | ~95% | Fast; tunable via nprobe | Low | Fast | Stable; two-level centroid+fine filter [7] |
| IVF-PQ | ~90–95% | Fast | 4–8× lower | Medium | Good; 1B vecs ~500 GB vs 4 TB HNSW [18] |
| Flat | 100% | O(n) | Medium | Instant | Perfect; only viable for <1M vectors |
Use HNSW by default for pure similarity search without heavy filtering [7]. Switch to IVF when filtered searches are common or memory is constrained. Use IVF-PQ at billion-scale: 4–8× memory reduction at ~5% recall cost [18]. For IVF, set nlist ≈ √n and tune nprobe (1–16) at query time without rebuilding the index [7].
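The selection rules and the nlist heuristic above can be written down as a small helper. This is one reading of the guidance, not a definitive policy: the function names are hypothetical, the 100M and 1B thresholds are taken from the text, and the memory estimate assumes 1024-dimensional float32 vectors, a dimensionality the source does not state.

```python
import math

def suggest_index(n_vectors: int, heavy_filtering: bool = False,
                  memory_constrained: bool = False) -> str:
    """Map the rules of thumb above to an index family."""
    if n_vectors >= 1_000_000_000:
        return "IVF-PQ"
    if n_vectors > 100_000_000:
        return "IVF-PQ" if memory_constrained else "IVF"
    if heavy_filtering or memory_constrained:
        return "IVF"
    return "HNSW"

def ivf_nlist(n_vectors: int) -> int:
    """nlist ≈ sqrt(n), as suggested above."""
    return max(1, round(math.sqrt(n_vectors)))

def raw_float32_tb(n_vectors: int, dims: int) -> float:
    """Uncompressed float32 vector storage in TB, excluding graph or list overhead."""
    return n_vectors * dims * 4 / 1e12

print(suggest_index(5_000_000), suggest_index(5_000_000, heavy_filtering=True))
print(ivf_nlist(1_000_000))
print(raw_float32_tb(1_000_000_000, 1024))
```

Under the 1024-dim assumption, one billion raw float32 vectors come to about 4.1 TB, which is consistent with the ~4 TB HNSW figure in the table; the ~500 GB IVF-PQ figure then corresponds to roughly 8× compression.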
Vector Databases
| Database (GitHub stars) | Type | Scale | p50 latency | Deployment | Standout |
|---|---|---|---|---|---|
| FAISS ⭐ 40k | library | 1B+ (in-mem) | <1ms | Self-hosted | No persistence; fastest raw search |
| Milvus ⭐ 44.7k | OSS | 1B+ | ~6ms | Both | Distributed; enterprise-grade |
| Qdrant ⭐ 32k | OSS | 100M+ | ~4ms | Both | Rust; best filtered-search latency |
| Weaviate ⭐ 16.3k | OSS | 100M+ | ~5ms | Both | Native BM25+dense hybrid |
| Chroma ⭐ 28.3k | OSS | <1M | ~10ms | Both | Dev-friendly; degrades above 1M vecs |
| pgvector ⭐ 21.7k | extension | 1–10M | 5–50ms | Self-hosted | Postgres extension; zero new infra |
Sources: [8], [17]. Stars fetched June 2026.
Scale selection [8]: under 10M vectors, any option works; 10M–1B narrows to Qdrant, Weaviate, Milvus, or managed Pinecone; above 1B → Milvus distributed or Vespa. Qdrant and Weaviate maintain low latency with complex filters; others can see 2–3× slowdowns. By 2026, hybrid search (BM25+dense) is table stakes — Weaviate, Qdrant, and Vespa ship it natively; Pinecone added it; pgvector requires manual composition [8].
Advanced Retrieval Pipeline
The production pattern for maximum recall [9] [22]:
BM25 (sparse) + dense ANN → top-1000 candidates
→ Reciprocal Rank Fusion (RRF)
→ cross-encoder rerank → top-100
→ LLM generate with citations
Why hybrid. BM25 handles exact-match and rare-term queries that break dense retrieval; dense vectors handle semantic paraphrase and synonyms that break BM25 [19]. Reciprocal Rank Fusion (RRF) sidesteps score-incompatibility between the two signals by working purely on rank position, not raw scores [9]. One team reported +48% RAG accuracy using cascading BM25+FAISS+cross-encoder reranking in production [19].
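Because RRF uses only rank positions, it fits in a few lines. The sketch below assumes the commonly used smoothing constant k = 60, which the source does not specify, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["d3", "d1", "d7"]    # hypothetical sparse results, best first
dense_hits = ["d1", "d9", "d3"]   # hypothetical dense results, best first
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents found by both retrievers (d1, d3) rise to the top without BM25 scores and cosine similarities ever being compared directly.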
Cross-encoder reranking. Unlike bi-encoders (which embed query and document independently), a cross-encoder processes the query+document pair jointly, letting attention directly model their interaction. The “retrieve top-1000 → rerank top-100” pattern stays within p99 latency budgets for interactive search [22].
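The rerank stage reduces to "score every (query, document) pair jointly, keep the best". In the sketch below the scorer is passed in as a callable; `toy_pair_score` is a token-overlap stand-in for a real cross-encoder, used only so the example runs without a model, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float], top_k: int = 100) -> List[str]:
    """Score each (query, candidate) pair and return the top_k candidates."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [doc for _, doc in scored[:top_k]]

def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

candidates = [
    "how to bake bread",
    "reset the router admin password",
    "router firmware update",
]
top = rerank("reset router password", candidates, toy_pair_score, top_k=2)
print(top)
```

The cost of this stage grows linearly with the number of candidates, since each pair needs its own forward pass, which is why it runs on a retrieved shortlist and not on the whole corpus.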
Production Optimization
Batching. GPUs thrive on large batches (256–512 sequences); CPUs are optimal at 16–64 [10]. A key finding: in a naive embedding function, GPU inference accounts for only ~10% of total compute, while the remaining ~90% is CPU-side tokenization [10]. Disaggregating tokenization from inference (pipeline parallelism) and running multiple model replicas per GPU produced a 16× throughput improvement, sustaining 230k tokens/sec on A10G GPUs [10]. Self-hosting via Hugging Face Text Embeddings Inference (TEI) on a GPU instance is straightforward for teams crossing the ~500M-token/month threshold [21].
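The idea of separating tokenization from inference can be illustrated with a producer thread that tokenizes batches while the consumer runs the model. This is a structural sketch only: `toy_tokenize` and `toy_infer` are stand-ins for a real tokenizer and model, all names are hypothetical, and in CPython a pure-Python tokenizer would be limited by the GIL, so real gains depend on a tokenizer that releases the GIL or on separate processes.

```python
import queue
import threading
from typing import Iterable, Iterator, List

def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield consecutive lists of up to batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def toy_tokenize(text: str) -> List[int]:
    return [len(word) for word in text.split()]      # stand-in for a tokenizer

def toy_infer(token_batch: List[List[int]]) -> List[float]:
    return [float(sum(tokens)) for tokens in token_batch]  # stand-in for the model

def embed_pipelined(texts: Iterable[str], batch_size: int = 256,
                    max_pending: int = 4) -> List[float]:
    """Tokenize in a background thread so inference never waits on tokenization."""
    pending: queue.Queue = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel marking the end of the stream

    def producer() -> None:
        for batch in batched(texts, batch_size):
            pending.put([toy_tokenize(text) for text in batch])
        pending.put(done)

    threading.Thread(target=producer, daemon=True).start()
    outputs: List[float] = []
    while True:
        item = pending.get()
        if item is done:
            break
        outputs.extend(toy_infer(item))
    return outputs

print(embed_pipelined(["a bb", "ccc", "dd dd"], batch_size=2))
```

The bounded queue provides backpressure: if inference falls behind, the tokenizer blocks rather than accumulating batches in memory.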
Vector quantization. Three tiers [11] [12]:
| Method | Compression | Recall impact | When to use |
|---|---|---|---|
| INT8 scalar | ~4× | <0.5% loss | Default first step; nearly free [13] |
| Binary (BBQ) | ~32× | Moderate loss | High-throughput; add a rescoring pass [11] |
| MRL truncation | 4–12× | 1–7% loss | Models with Matryoshka training [3] |
Binary quantization achieves 32× compression and 80% faster queries but requires a full-vector rescoring pass to recover precision [11]. Combined INT8 + dimensionality reduction can cut storage 4–8× with under 1% MTEB score drop [12].
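The compression ratios in the table, and the rescoring pass that binary quantization needs, can be sketched with NumPy. Two assumptions to flag: the INT8 step uses a single global scale (production systems often calibrate per dimension), and the binary step is plain sign-bit quantization, not the BBQ scheme named in the table. All function names are hypothetical and the data is random.

```python
import numpy as np

def int8_quantize(vectors: np.ndarray):
    """Symmetric scalar quantization to int8 with one global scale."""
    scale = np.abs(vectors).max() / 127.0
    quantized = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return quantized, scale

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """One sign bit per dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=1)

def search_with_rescoring(query: np.ndarray, db: np.ndarray, db_bits: np.ndarray,
                          top_k: int = 5, oversample: int = 4) -> np.ndarray:
    """Shortlist by Hamming distance on bits, then rescore with full vectors."""
    query_bits = binary_quantize(query[None, :])
    hamming = np.unpackbits(np.bitwise_xor(db_bits, query_bits), axis=1).sum(axis=1)
    shortlist = np.argsort(hamming, kind="stable")[: top_k * oversample]
    exact = db[shortlist] @ query                      # full-precision dot products
    return shortlist[np.argsort(-exact, kind="stable")[:top_k]]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)       # unit-length rows

db_int8, scale = int8_quantize(db)
db_bits = binary_quantize(db)
print(db.nbytes // db_int8.nbytes, db.nbytes // db_bits.nbytes)  # 4 32

hits = search_with_rescoring(db[42], db, db_bits)
print(hits[0])  # 42
```

The oversampling factor controls how much recall the rescoring pass can recover: a larger shortlist costs more full-precision dot products but is less likely to miss a true neighbor.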