Corpus Embedding & Indexing: 2026 Production Playbook

Decision: For most RAG pipelines, use recursive 512-token chunks [4], text-embedding-3-small or Qwen3-Embedding-0.6B as the starting model [3], an HNSW index up to ~100M vectors and IVF or IVF-PQ above that [7], and BM25+dense hybrid retrieval fused with RRF when exact-keyword recall matters [9]. Chunking quality constrains retrieval accuracy more than model choice [4]. Every embedder swap forces a full re-index, so re-embed only when switching models or restructuring chunks [16].

Embedding Models

The MTEB quality curve flattens past 768 dimensions for most tasks: going from 1536 to 3072 dims adds marginal recall while doubling vector storage (and, at the API prices listed below, costing roughly 6.5× more per token) [3]. Open-weight models now match or exceed closed APIs on MTEB v2; the self-hosting crossover is ~500M–1B tokens/month [2]. Models marked † support Matryoshka Representation Learning (MRL): truncating from 1536 to 256 dims retains 93–95% of retrieval quality at 6× lower storage [3]. MTEB scores are not directly comparable across leaderboard versions, so benchmark on your own corpus before committing [1].
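
As a rough illustration of MRL truncation, the sketch below keeps the first 256 components of each vector, re-normalizes, and computes the raw storage ratio. The helper names (`truncate_mrl`, `storage_bytes`) are made up for this example, and the random vectors only stand in for real model output; truncation preserves quality only for models trained with MRL.

```python
import numpy as np

def truncate_mrl(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize each row to unit length."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

def storage_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage (float32 by default), ignoring index overhead."""
    return n_vectors * dims * bytes_per_value

# Assumption: random vectors stand in for 1536-dim output of an MRL-trained model.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1536)).astype(np.float32)

small = truncate_mrl(full, 256)
ratio = storage_bytes(1_000_000, 1536) / storage_bytes(1_000_000, 256)
print(small.shape, ratio)
```

Re-normalizing after truncation matters when the index uses dot product as a proxy for cosine similarity.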

Model	Provider	MTEB	Dims	$/1M tokens	Best for
text-embedding-3-small	OpenAI	~55 (ret)	1536†	$0.02	Safe default; widest integration
text-embedding-3-large	OpenAI	~59 (ret)	3072†	$0.13	Text-only high quality
voyage-3-large	Voyage AI	~62 (ret)	1024	$0.18	Top retrieval API benchmark
jina-embeddings-v4	Jina AI	~65 (ret)	1024†	$0.02	Best-value API; MRL 256→<1% quality loss
Gemini Embedding 2	Google	SOTA (ret)	—	API	Multilingual & cross-modal; all-rounder
Qwen3-Embedding-8B	Alibaba	~75.1 (avg)	—	Self-hosted	Open-weight SOTA; 119 languages
Qwen3-Embedding-0.6B	Alibaba	~67.8 (avg)	—	Self-hosted	Compact; matches API leaders
BGE-M3	BAAI	~68.2 (avg)	1024	Self-hosted	100+ langs; dense+sparse+ColBERT in one
NV-Embed-v2	NVIDIA	~64 (ret)	4096	Self-hosted	High overall MTEB; large dimension

Sources: [1], [2], [3]. MTEB (ret) = retrieval sub-task; (avg) = overall MTEB average.

At comparable scale, self-hosting BGE-M3 costs ~$500/month vs ~$13,000/month for OpenAI’s API [2]. For multilingual and cross-modal retrieval, Gemini Embedding 2 is the strongest all-rounder [1].

Chunking Strategies

The best and worst strategies on the same corpus differ by ~9% in recall [4]. A January 2026 systematic analysis (arXiv:2601.14123) [20] found that chunk overlap adds no measurable benefit and raises indexing cost, so treat overlap as a tunable parameter rather than a mandatory default. Optimal chunk size by query type: factoid/lookup → 64–256 tokens; analytical/narrative → 512–1024 tokens; quality degrades sharply beyond ~2,500 tokens [4].

Strategy	Best for	Size target	Rel. speed	Key trade-off
Recursive	General default	512 tok	Fast	69% end-to-end accuracy in head-to-head [4]
Fixed-size	Simple, uniform docs	512–1024 tok	Fastest	Splits mid-sentence; ignores boundaries
Sentence	Clean narrative prose	~100 tok avg	Fast	Matches semantic quality for docs <5k tok [14]
Semantic	Multi-topic unstructured prose	~43 tok avg	14× slower	Chunks too short for LLM generation [4]
Hierarchical	Long analytical docs	Leaf + parent	Medium	Embeds small, retrieves big; best factoid+narrative
Late chunking	Long coherent docs	Full doc first	Medium	+1.9% nDCG@10 over naive; needs long-ctx model [5]
Contextual	High-stakes, lower volume	512 + LLM prefix	Slow (LLM)	−35% retrieval failures; ~$1.02/M tokens [15]
Agentic	Irregular / complex structure	Variable	Slowest	Highest quality ceiling; high latency + cost
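
The recursive strategy in the table can be sketched as follows: try the coarsest separator first, recurse into any piece that is still too large, then greedily merge neighbours back together while they fit. This is a minimal sketch, not any particular library's splitter. It assumes whitespace-separated words as a stand-in for model tokens, and the separator list is an illustrative choice.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate model tokens.
    return len(text.split())

def recursive_chunks(text, max_tokens=512, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator that yields pieces within max_tokens."""
    if count_tokens(text) <= max_tokens:
        return [text.strip()] if text.strip() else []
    if not separators:
        # No separator left: hard-split on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_chunks(part, max_tokens, finer))
    # Greedily merge adjacent pieces while the result still fits.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

demo = recursive_chunks("a b c\n\nd e f g h i\n\nj", max_tokens=5)
print(demo)
```

In practice the token counter would be the embedding model's own tokenizer, since word counts can undershoot token counts considerably.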

Late chunking [5] [6]: embed all document tokens first, then split the token-embedding sequence into chunks and mean-pool each chunk separately. Because the transformer sees the full document before chunking, coreferences (“it”, “they”, “the city”) resolve correctly. Reported gains are 1.8–1.9% nDCG@10 over naive chunking on BEIR, and they grow with document length [5]. The technique requires a long-context embedding model (e.g., jina-embeddings-v3).
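
The pooling step of late chunking can be sketched in a few lines. The token embeddings here are a synthetic array standing in for the per-token output of a long-context model run once over the whole document, and the span boundaries are hypothetical token indices.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray,
               spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool contextualized token embeddings inside each [start, end) span."""
    return np.stack([token_embeddings[start:end].mean(axis=0)
                     for start, end in spans])

# Assumption: 10 tokens x 4 dims of synthetic data instead of real model output.
token_embeddings = np.arange(40, dtype=np.float32).reshape(10, 4)
spans = [(0, 4), (4, 10)]  # hypothetical chunk boundaries in token indices

chunk_vectors = late_chunk(token_embeddings, spans)
print(chunk_vectors.shape)
```

The only difference from naive chunking is the order of operations: the model runs before the split, so each pooled vector already carries document-wide context.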

Contextual Retrieval (Anthropic): an LLM prepends a 50–100 token context summary to each chunk before embedding. On its own this cuts retrieval failures by 35%; combined with cross-encoder reranking, the reduction reaches 67% [15]. The cost is ~$1.02/M document tokens with prompt caching [4].
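
A minimal sketch of the chunk-preparation step, assuming the LLM call is injected as a plain function. `contextualize` and `fake_summarize` are hypothetical names for this example; a real summarizer would call an LLM with the full document and the chunk and return a 50–100 token context line.

```python
from typing import Callable

def contextualize(document: str, chunks: list[str],
                  summarize: Callable[[str, str], str]) -> list[str]:
    """Prefix each chunk with a context line before it is embedded."""
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]

def fake_summarize(document: str, chunk: str) -> str:
    # Placeholder for an LLM call; it only echoes the start of the document.
    return f"From a document beginning: {document[:30]!r}."

doc = "Acme Q2 report. Revenue grew 3% over the previous quarter."
prepared = contextualize(
    doc, ["Revenue grew 3% over the previous quarter."], fake_summarize
)
print(prepared[0])
```

The prefixed text is what gets embedded (and indexed for BM25, if used); the original chunk can still be stored separately for display.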

ANN Index Algorithms

Algorithm	Recall	Query latency	Memory	Build speed	Filtered search
HNSW	~98%+	Sub-ms; log complexity	High	Slow	Degrades >90% filter ratio [7]
IVF	~95%	Fast; tunable via nprobe	Low	Fast	Stable; two-level centroid+fine filter [7]
IVF-PQ	~90–95%	Fast	4–8× lower	Medium	Good; 1B vecs ~500 GB vs 4 TB HNSW [18]
Flat	100%	O(n)	Medium	Instant	Perfect; only viable for <1M vectors

Use HNSW by default for pure similarity search without heavy filtering [7]. Switch to IVF when filtered searches are common or memory is constrained. Use IVF-PQ at billion-scale: 4–8× memory reduction at ~5% recall cost [18]. For IVF, set nlist ≈ √n and tune nprobe (1–16) at query time without rebuilding the index [7].
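
The rules of thumb above can be written down as a small helper. This is only one way to encode them: the function names are invented for this sketch, the thresholds are the playbook's guidance rather than hard limits, and the 1024-dim figure in the memory estimate is an assumption chosen because it reproduces the ~4 TB number for 1B float32 vectors.

```python
import math

def suggest_index(n_vectors: int, heavy_filtering: bool = False,
                  memory_constrained: bool = False) -> dict:
    """Pick an index family from corpus size and workload, per the guidance above."""
    if n_vectors >= 1_000_000_000:
        kind = "IVF-PQ"
    elif heavy_filtering or memory_constrained or n_vectors > 100_000_000:
        kind = "IVF"
    else:
        kind = "HNSW"
    params = {"index": kind}
    if kind.startswith("IVF"):
        params["nlist"] = round(math.sqrt(n_vectors))  # nlist ≈ √n
        params["nprobe_range"] = (1, 16)               # tuned at query time
    return params

def raw_float32_gb(n_vectors: int, dims: int) -> float:
    """Uncompressed float32 vector storage in GB, ignoring index overhead."""
    return n_vectors * dims * 4 / 1e9

print(suggest_index(5_000_000))
print(suggest_index(1_000_000_000))
print(raw_float32_gb(1_000_000_000, 1024))  # assumption: 1024-dim vectors
```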

Vector Databases

Database	Stars	Type	Scale	p50 latency	Deployment	Standout
FAISS	40k	library	1B+ (in-mem)	<1ms	Self-hosted	No persistence; fastest raw search
Milvus	44.7k	OSS	1B+	~6ms	Both	Distributed; enterprise-grade
Qdrant	32k	OSS	100M+	~4ms	Both	Rust; best filtered-search latency
Weaviate	16.3k	OSS	100M+	~5ms	Both	Native BM25+dense hybrid
Chroma	28.3k	OSS	<1M	~10ms	Both	Dev-friendly; degrades above 1M vecs
pgvector	21.7k	extension	1–10M	5–50ms	Self-hosted	Postgres extension; zero new infra

Sources: [8], [17]. Stars fetched June 2026.

Scale selection [8]: under 10M vectors, any option works; 10M–1B narrows the field to Qdrant, Weaviate, Milvus, or managed Pinecone; above 1B, use distributed Milvus or Vespa. Qdrant and Weaviate maintain low latency with complex filters; others can see 2–3× slowdowns. By 2026, hybrid search (BM25+dense) is table stakes: Weaviate, Qdrant, and Vespa ship it natively, Pinecone has added it, and pgvector requires manual composition [8].

Advanced Retrieval Pipeline

The production pattern for maximum recall [9] [22]:

BM25 (sparse) + dense ANN → top-1000 candidates
  → Reciprocal Rank Fusion (RRF)
  → cross-encoder rerank → top-100
  → LLM generate with citations

Why hybrid. BM25 handles exact-match and rare-term queries that break dense retrieval; dense vectors handle semantic paraphrase and synonyms that break BM25 [19]. Reciprocal Rank Fusion (RRF) sidesteps score-incompatibility between the two signals by working purely on rank position, not raw scores [9]. One team reported +48% RAG accuracy using cascading BM25+FAISS+cross-encoder reranking in production [19].
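
To make the rank-only behaviour concrete, here is a small RRF sketch. Each document scores the sum of 1/(k + rank) over the lists it appears in. The constant k = 60 is the commonly used default and is an assumption here, not a value taken from the sources above; the document IDs are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists using rank positions only, never raw scores."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # hypothetical sparse results, best first
dense_hits = ["d1", "d5", "d3"]   # hypothetical dense results, best first

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that appear in both lists ("d1", "d3") rise to the top even though BM25 and cosine scores were never compared directly.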

Cross-encoder reranking. Unlike bi-encoders (which embed query and document independently), a cross-encoder processes the query+document pair jointly, letting attention directly model their interaction. The “retrieve top-1000 → rerank top-100” pattern stays within p99 latency budgets for interactive search [22].
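
The rerank stage reduces to scoring each (query, document) pair jointly and keeping the best. In the sketch below the scorer is injected as a function; `overlap_score` is a toy stand-in for a real cross-encoder so the example runs without a model, and both names are hypothetical.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_k: int = 100) -> list[str]:
    """Score every query+document pair and return the top_k documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a cross-encoder: fraction of query words found in doc.
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

candidates = [
    "ivf uses centroids",
    "hnsw index needs more memory",
    "hnsw graph",
]
top = rerank("hnsw index memory", candidates, overlap_score, top_k=2)
print(top)
```

Because the scorer runs once per candidate, the size of the candidate list is the main latency lever for this stage.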

Production Optimization

Batching. GPUs thrive on large batches (256–512 sequences); CPUs are optimal at 16–64 [10]. A key finding: in a naive embedding function, GPU inference accounts for only ~10% of total compute; the other ~90% is CPU-side tokenization [10]. Disaggregating tokenization from inference (pipeline parallelism) and running multiple model replicas per GPU produced a 16× throughput improvement, sustaining 230k tokens/sec on A10G GPUs [10]. Self-hosting via Hugging Face Text Embeddings Inference (TEI) on a GPU instance is straightforward for teams crossing the ~500M-token/month threshold [21].
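
The idea of separating tokenization from inference can be sketched with a thread pool that tokenizes batches ahead while the main thread runs inference. This is a simplified illustration, not the setup from [10]: `tokenize` and `infer` are injected placeholders, and threads only give real overlap when the tokenizer releases the GIL; otherwise separate processes would be needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_pipelined(texts: list[str], tokenize: Callable, infer: Callable,
                    batch_size: int = 256, workers: int = 4) -> list:
    """Tokenize batches in worker threads while this thread runs inference."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits every batch up front; tokenized batches come back
        # in input order, so inference never waits on an earlier batch twice.
        for tokens in pool.map(tokenize, batched(texts, batch_size)):
            results.extend(infer(tokens))
    return results

# Placeholders: a whitespace "tokenizer" and a "model" that returns token counts.
fake_tokenize = lambda batch: [text.split() for text in batch]
fake_infer = lambda token_batch: [len(tokens) for tokens in token_batch]

out = embed_pipelined(["a b", "c", "d e f"], fake_tokenize, fake_infer, batch_size=2)
print(out)
```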

Vector quantization. Three tiers [11] [12]:

Method	Compression	Recall impact	When to use
INT8 scalar	~4×	<0.5% loss	Default first step; nearly free [13]
Binary (BBQ)	~32×	Moderate loss	High-throughput; add a rescoring pass [11]
MRL truncation	4–12×	1–7% loss	Models with Matryoshka training [3]

Binary quantization achieves 32× compression and 80% faster queries but requires a full-vector rescoring pass to recover precision [11]. Combined INT8 + dimensionality reduction can cut storage 4–8× with under 1% MTEB score drop [12].
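
The sketch below illustrates INT8 scalar quantization and a binary shortlist followed by full-vector rescoring. It is a NumPy toy under stated assumptions, not how any particular database implements these features: a single global scale for INT8, sign-bit binarization, synthetic random vectors, and invented function names.

```python
import numpy as np

def int8_quantize(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric scalar quantization: float32 -> int8 with one global scale."""
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Keep only the sign bit of each dimension, packed 8 per byte."""
    return np.packbits(vectors > 0, axis=1)

def search_with_rescore(query, vectors, codes, shortlist=10, top_k=3):
    """Hamming shortlist on binary codes, then exact dot-product rescoring."""
    q_code = np.packbits(query > 0)
    hamming = np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]
    exact = vectors[candidates] @ query   # rescoring uses the full vectors
    return candidates[np.argsort(-exact)[:top_k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 128)).astype(np.float32)  # synthetic corpus

int8_codes, scale = int8_quantize(vectors)
bit_codes = binary_quantize(vectors)
hits = search_with_rescore(vectors[42], vectors, bit_codes)
print(vectors.nbytes // int8_codes.nbytes, vectors.nbytes // bit_codes.nbytes)
```

The rescoring pass is why binary quantization still needs the full-precision vectors somewhere, typically on disk or in a slower tier.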