TL;DR: Hybrid retrieval (BM25 + dense, fused via RRF) outperforms either method alone by 3–7% NDCG [1]; a cross-encoder reranker adds another 5–15 NDCG@10 points [2]. Production path: retrieve the top 100 candidates per method → fuse with RRF (k=60) → rerank to the top 10. For self-hosted reranking, use gte-reranker-modernbert-base (149M parameters, best accuracy-to-size ratio); for an API, use Voyage rerank-2.5 (32K context, +7.94% over Cohere) [14].
Pipeline Architecture
```text
Query
├─ BM25 (top-N) ──────────┐
└─ Dense ANN (top-N) ──────┤→ RRF fusion (top 100) → Cross-encoder (top 10) → LLM
```
Two stages solving two different failure modes:
- RRF recovers recall lost by either retriever individually — score-agnostic, no normalization required.
- Reranker fixes precision — bi-encoder similarity ≠ relevance at positions 1–5.
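The two-stage flow above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not a reference implementation: `hybrid_search`, the `Retriever` / `Fuser` / `Reranker` aliases, and the toy stand-ins are hypothetical names introduced here, and the placeholder fusion is a plain round-robin rather than RRF.

```python
from typing import Callable, Sequence

# Hypothetical stage signatures. A real system would wrap a BM25 index,
# an ANN index, a fusion routine, and a cross-encoder behind these.
Retriever = Callable[[str, int], list[str]]          # (query, n) -> ranked doc ids
Fuser = Callable[[Sequence[list[str]]], list[str]]   # ranked lists -> fused ranking
Reranker = Callable[[str, list[str]], list[str]]     # (query, ids) -> reordered ids


def hybrid_search(
    query: str,
    retrievers: Sequence[Retriever],
    fuse: Fuser,
    rerank: Reranker,
    per_retriever: int = 100,
    fused_pool: int = 100,
    top_k: int = 10,
) -> list[str]:
    """Stage 1 retrieves and fuses for recall; stage 2 reranks for precision."""
    ranked_lists = [retrieve(query, per_retriever) for retrieve in retrievers]
    shortlist = fuse(ranked_lists)[:fused_pool]
    return rerank(query, shortlist)[:top_k]


# Toy stand-ins so the sketch runs end to end.
def bm25_stub(query: str, n: int) -> list[str]:
    return ["d1", "d2", "d3"][:n]


def dense_stub(query: str, n: int) -> list[str]:
    return ["d3", "d4", "d1"][:n]


def round_robin(lists: Sequence[list[str]]) -> list[str]:
    # Placeholder fusion: interleave the lists and drop duplicates.
    merged: list[str] = []
    for position in range(max(len(ranking) for ranking in lists)):
        for ranking in lists:
            if position < len(ranking) and ranking[position] not in merged:
                merged.append(ranking[position])
    return merged


def reverse_stub(query: str, ids: list[str]) -> list[str]:
    return list(reversed(ids))  # Placeholder for a cross-encoder


result = hybrid_search("q", [bm25_stub, dense_stub], round_robin, reverse_stub, top_k=2)
print(result)
```

Keeping each stage behind a callable makes it straightforward to swap the fusion method or reranker without touching the control flow.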
Stage 1 — Hybrid Retrieval
Why each retriever fails alone
BM25 scores are unbounded real numbers whose magnitude depends on the corpus and the query; cosine similarity lives in [−1, 1]. Naïve weighted averaging fails because the scales are incompatible [1].
| Signal | BM25 | Dense |
|---|---|---|
| Exact terms (IDs, named entities) | ✓ | ✗ |
| Semantic paraphrase | ✗ | ✓ |
| Rare / out-of-vocabulary tokens | ✓ | ✗ |
| Concept synonyms | ✗ | ✓ |
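The scale mismatch described above is easy to reproduce. The snippet below is a toy illustration with made-up scores (the values and the `naive_blend` helper are assumptions for this example): even with 75% of the weight on the dense score, the blended ranking simply reproduces the BM25 order.

```python
# Hypothetical raw scores for three documents.
bm25 = {"a": 14.2, "b": 9.7, "c": 3.1}       # unbounded, corpus-dependent
cosine = {"a": 0.41, "b": 0.44, "c": 0.93}   # bounded in [-1, 1]


def naive_blend(alpha: float) -> list[str]:
    """Rank by alpha * cosine + (1 - alpha) * bm25 with no normalization."""
    combined = {d: alpha * cosine[d] + (1 - alpha) * bm25[d] for d in bm25}
    return sorted(combined, key=combined.get, reverse=True)


bm25_order = sorted(bm25, key=bm25.get, reverse=True)
blended_order = naive_blend(alpha=0.75)

# The dense retriever's clear favorite ("c") still ranks last, because the
# BM25 values are an order of magnitude larger than the cosine values.
print(bm25_order, blended_order)
```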
Benchmark: WANDS (e-commerce)
| Method | NDCG@10 |
|---|---|
| BM25 only | 0.6983 |
| Dense only | 0.6953 |
| RRF hybrid | 0.7068 |
| Tuned hybrid | 0.7497 |
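With field boosting, the tuned hybrid lifts NDCG@10 by about 7.4% over BM25 alone and 7.8% over dense alone; untuned RRF gains 1.2–1.7% [1]. A production case study reported 91% retrieval accuracy vs 62% dense-only, a 29-point gain (reported as +48% relative) [3].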
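The relative lifts follow directly from the WANDS table. A short check of the arithmetic, using only the four NDCG@10 values listed above (`relative_lift` is a helper name introduced for this example):

```python
ndcg = {
    "BM25 only": 0.6983,
    "Dense only": 0.6953,
    "RRF hybrid": 0.7068,
    "Tuned hybrid": 0.7497,
}


def relative_lift(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


for hybrid in ("RRF hybrid", "Tuned hybrid"):
    for baseline in ("BM25 only", "Dense only"):
        lift = relative_lift(ndcg[hybrid], ndcg[baseline])
        print(f"{hybrid} vs {baseline}: {lift:+.1f}%")
```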
Fusion methods
Reciprocal Rank Fusion (RRF) — the default
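Score(d) = Σ_r 1 / (k + rank_r(d)), where rank_r(d) is the 1-based position of document d in retriever r's list; a retriever that did not return d contributes nothing.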
k=60 favors consensus across lists; no normalization required [4]. Use k=30–40 if top-1 precision matters more than top-10 recall. Supported natively in Qdrant (v1.10+), Elasticsearch, OpenSearch [5], and Azure AI Search [20].
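A minimal sketch of RRF in plain Python, assuming each retriever returns an ordered list of document ids (the `rrf_fuse` name and the toy ids are illustrative, not taken from any of the platforms listed):

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc ids. Ranks are 1-based; absent docs add 0."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]

fused = rrf_fuse([bm25_hits, dense_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (`d1`, `d3`) land above documents that only one retriever returned, which is the consensus effect that a larger k reinforces.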
Alpha-weighted fusion (Pinecone): score = α × dense + (1-α) × sparse. Start α=0.75 for natural-language queries; shift to α=0.3–0.4 for entity or product lookup [1].
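Unlike RRF, the alpha-weighted formula only makes sense when both scores sit on a comparable scale. The sketch below applies it after per-query min-max normalization; that normalization step, the helper names, and the toy scores are assumptions of this example rather than a description of any vendor's implementation.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def alpha_fuse(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.75
) -> list[tuple[str, float]]:
    """score = alpha * dense + (1 - alpha) * sparse on normalized scores."""
    d, s = min_max(dense), min_max(sparse)
    docs = d.keys() | s.keys()  # a doc missing from one list scores 0 there
    combined = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0) for doc in docs
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


dense_scores = {"a": 0.41, "b": 0.44, "c": 0.93}
sparse_scores = {"a": 14.2, "b": 9.7, "c": 3.1}

semantic_order = [doc for doc, _ in alpha_fuse(dense_scores, sparse_scores, alpha=0.75)]
lookup_order = [doc for doc, _ in alpha_fuse(dense_scores, sparse_scores, alpha=0.3)]
print(semantic_order, lookup_order)
```

With α=0.75 the dense favorite wins; dropping to α=0.3 hands the top slot to the strongest lexical match, which is the behavior wanted for entity or product lookup.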
⚠ Weaviate v1.24 silently switched from RRF to Relative Score Fusion as default — pin fusionType explicitly when upgrading [1].
Dynamic Alpha Tuning (DAT) — an LLM scores top-1 results from each retriever per query, calibrating α dynamically. Consistently outperforms fixed-α hybrid but adds an LLM round-trip [6].
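A rough sketch of the per-query calibration idea follows. The LLM call is abstracted behind a `judge` callable, and the proportional rule for α is a simplifying assumption of this example; the exact scoring prompt and mapping in [6] may differ.

```python
from typing import Callable


def dynamic_alpha(
    query: str,
    dense_top1: str,
    sparse_top1: str,
    judge: Callable[[str, str], float],
    default: float = 0.5,
) -> float:
    """Return the dense weight for this query.

    `judge(query, doc)` is a hypothetical LLM-backed scorer returning a
    non-negative usefulness score (for example 0 to 5).
    """
    dense_score = judge(query, dense_top1)
    sparse_score = judge(query, sparse_top1)
    total = dense_score + sparse_score
    if total == 0:
        return default  # Neither top hit looks useful: fall back to a balanced mix
    return dense_score / total


# Stand-in judge with fixed scores, used here in place of an LLM round-trip.
canned = {"dense doc": 4.0, "sparse doc": 1.0}
alpha = dynamic_alpha("q", "dense doc", "sparse doc", judge=lambda q, d: canned[d])
print(alpha)
```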
Advanced sparse: beyond BM25
- SPLADE ⭐ 995 (Jun 2026): learns sparse query/document expansion via the BERT MLM head; 38.8 MRR@10 on MS MARCO dev; generalizes better than BM25 out of domain on BEIR, at higher indexing cost [7].
- ColBERT v2 ⭐ 3.9k (Jun 2026): late interaction. Each token gets its own embedding, and relevance is the sum, over query tokens, of each token's maximum similarity to any document token (MaxSim). Bridges bi-encoders (fast indexing) and cross-encoders (precise scoring) in a single ANN-indexable structure [8].
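The late-interaction score can be written in a few lines. This is a dependency-free sketch of MaxSim on toy 2-dimensional token embeddings (the vectors are invented for illustration, and real ColBERT embeddings are L2-normalized and much higher-dimensional):

```python
Vector = list[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens: list[Vector], doc_tokens: list[Vector]) -> float:
    """For each query token take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query tokens, unit-length toy embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # covers both query tokens reasonably well
doc_b = [[0.0, 1.0]]               # matches only the second query token

print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Because the document token embeddings are computed offline, only the cheap max-and-sum step runs at query time.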
Stage 2 — Reranking
Cross-encoders concatenate query + document and score relevance through every transformer layer — no compression, no approximation. This is inherently a small-batch operation: only practical on shortlists of 50–200.
Only 8.8% of retrieved chunks keep their original rank after reranking; top-ranked evidence chunks averaged original retrieval rank 6.0, with several originally outside the top-10 [9]. The reranker fundamentally reorders the list.
Typical gain vs bi-encoder alone: +5 to +15 NDCG@10 points; one production system saw Precision@10 rise by 0.22 (0.62 → 0.84) [2] [10].
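Both kinds of figure quoted here (rank retention and Precision@k) are simple to measure on your own data. A small sketch with toy rankings, where the function names and document ids are illustrative:

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def rank_retention(before: list[str], after: list[str]) -> float:
    """Fraction of documents whose position is unchanged by reranking."""
    position_after = {doc: i for i, doc in enumerate(after)}
    kept = sum(1 for i, doc in enumerate(before) if position_after.get(doc) == i)
    return kept / len(before)


retrieved = ["a", "b", "c", "d", "e"]   # first-stage order
reranked = ["c", "b", "a", "e", "d"]    # order after the cross-encoder
relevant = {"b", "c"}

print(precision_at_k(retrieved, relevant, k=2))
print(precision_at_k(reranked, relevant, k=2))
print(rank_retention(retrieved, reranked))
```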
Reranker model comparison
| Model | Params | Context | Accuracy† | Latency‡ | Hosting |
|---|---|---|---|---|---|
| BGE-Reranker-v2-m3 [11] | 0.6B | 512 tokens | 60.4 MTEB | 90ms per 100 pairs | Self-host |
| gte-reranker-modernbert-base | 149M | — | 83.0% Hit@1 | ~150ms | Self-host |
| Nemotron-rerank-1b | 1.2B | — | 83.0% Hit@1 | 243ms | Self-host |
| Jina Reranker v3 [12] | 560M | 1024 tokens | 81.3% Hit@1 | 188ms | API / self-host |
| Qwen3-Reranker-0.6B [13] | 0.6B | 32K | >BGE MTEB | — | Self-host / $0.01/M |
| Qwen3-Reranker-4B [13] | 4B | 32K | 77.7% Hit@1 | >1000ms | $0.02/M |
| Cohere Rerank 3.5 [16] | — | 4K | baseline | ~595ms | API ~$100/100k |
| Voyage rerank-2.5 [14] | — | 32K | +7.94% vs Cohere | ~603ms | API, 200M free |
†Hit@1 from Agentset benchmark [15] unless noted; MTEB from official model cards.
‡Self-hosted GPU (H100); API includes network round-trip.
Decision matrix:
- Self-hosted English, any budget: gte-reranker-modernbert-base (149M) matches 1.2B nemotron at 83% Hit@1 [15].
- Self-hosted multilingual: Qwen3-Reranker-0.6B (32K context, 100+ languages, surpasses BGE) [13].
- API, best accuracy + long context: Voyage rerank-2.5 — 32K context (8× Cohere), instruction-following, +7.94% accuracy on 93-dataset suite [14].
- API, multilingual is critical: Cohere Rerank 3.5 — validated on Spanish, French, German, Japanese, Mandarin, Arabic, Hindi, Portuguese [16].
- Domain-steerable: Voyage rerank-2.5 or Qwen3-Reranker accept natural-language instructions (“prioritize recent technical docs”) to steer relevance without schema changes [14] [21].
Setup with FlagEmbedding ⭐ 11.8k (Jun 2026)
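```python
from FlagEmbedding import FlagReranker

# Assumes `query` (str) and `candidates` (list[str]) come from the fused shortlist.
reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)  # fp16: faster, slight accuracy cost
pairs = [(query, doc) for doc in candidates]
# normalize=True applies a sigmoid so scores fall in (0, 1)
scores = reranker.compute_score(pairs, normalize=True)
# Highest relevance first
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```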
Or with Sentence Transformers: CrossEncoder("BAAI/bge-reranker-v2-m3") [22].
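A sketch of the Sentence Transformers route, under a few assumptions: `load_bge_reranker`, `rerank`, and the `PairScorer` protocol are names invented for this example, and the model is loaded lazily so the ranking helper can be exercised with any object that exposes `predict`.

```python
from typing import Protocol, Sequence


class PairScorer(Protocol):
    """Anything with a CrossEncoder-style predict method."""

    def predict(self, pairs: Sequence[tuple[str, str]]) -> Sequence[float]: ...


def load_bge_reranker() -> PairScorer:
    # Imported lazily; calling this downloads the model on first use.
    from sentence_transformers import CrossEncoder

    return CrossEncoder("BAAI/bge-reranker-v2-m3")


def rerank(
    model: PairScorer, query: str, docs: list[str], top_k: int = 10
) -> list[tuple[str, float]]:
    """Score every (query, doc) pair and return the top_k by score."""
    scores = model.predict([(query, doc) for doc in docs])
    order = sorted(range(len(docs)), key=lambda i: float(scores[i]), reverse=True)
    return [(docs[i], float(scores[i])) for i in order[:top_k]]
```

Typical use would be `rerank(load_bge_reranker(), query, candidates, top_k=10)` on the fused shortlist.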
LLM-based rerankers
RankLLM ⭐ 603 (Jun 2026) — listwise reranking via RankGPT (GPT-4o), RankZephyr, RankVicuna, MonoT5 [17]. Listwise methods generalize better to unseen queries: 8% average degradation vs 12–15% for pointwise approaches [18]. Best quality, highest latency — suited for offline batch re-ranking or premium pipelines.
AnswerDotAI/rerankers ⭐ 1.6k (Jun 2026) — unified Python API wrapping cross-encoders, ColBERT, RankGPT, Cohere, Voyage, Jina, and FlashRank under Reranker('model-name').rank(query, docs) [19]. Good for A/B testing multiple backends.
Platform Support
| Platform | Fusion method | RRF native | Notes |
|---|---|---|---|
| Qdrant | Dense + sparse | ✓ (v1.10+) | Server-side Query API, no enterprise gate |
| Weaviate | Dense + BM25 | ✓ (manual) | v1.24+ default = RSF ⚠ — set fusionType |
| Pinecone | Dense + sparse | ✗ | Alpha-weighted; built-in sparse model available |
| Elasticsearch | Dense + BM25 | ✓ | Enterprise plan; ranx library for free tier |
| OpenSearch | BM25 + kNN | ✓ | Configurable k via search pipeline [5] |
| Azure AI Search | BM25 + vector | ✓ | Semantic ranker add-on for reranking [20] |
| Milvus / Zilliz | Dense + sparse | ✓ | Native Qwen3 embedding + reranker support [21] |
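As one concrete example of server-side RRF, the sketch below builds a request body for Qdrant's Query API (POST to `/collections/{collection_name}/points/query`). The named vectors `dense` and `sparse` are assumptions about the collection schema, and `qdrant_rrf_query` is a helper name made up for this example; check the field names against the Qdrant version you run.

```python
import json


def qdrant_rrf_query(
    dense_vector: list[float],
    sparse_indices: list[int],
    sparse_values: list[float],
    per_retriever: int = 100,
    limit: int = 10,
) -> dict:
    """Build a hybrid query: two prefetch branches fused with RRF."""
    return {
        "prefetch": [
            {
                "query": {"indices": sparse_indices, "values": sparse_values},
                "using": "sparse",   # assumed name of the sparse vector field
                "limit": per_retriever,
            },
            {
                "query": dense_vector,
                "using": "dense",    # assumed name of the dense vector field
                "limit": per_retriever,
            },
        ],
        "query": {"fusion": "rrf"},
        "limit": limit,
    }


body = qdrant_rrf_query([0.01, 0.45, 0.67], [1, 42], [0.22, 0.8])
print(json.dumps(body, indent=2))
```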
Tuning Checklist
- RRF k=60 — works without per-deployment tuning; adjust to k=30–40 only if top-1 matters more than top-10 [4].
- Candidate pool — 50–100 candidates per retriever for normal corpora; scale to 500 if recall is the bottleneck before reranking.
- Reranker cut — rerank top 50, send top 5–10 to LLM. More context → lost-in-the-middle degradation.
- Domain fine-tuning — +10–20 NDCG above zero-shot on narrow domains [2]. Worth it for enterprise search; skip for general assistants.
- Query-type routing — classify queries as entity-lookup vs semantic; apply α=0.3 vs α=0.75 accordingly. Gets most of DAT’s benefit without an extra LLM call [6].
- Batch reranking — batch cross-encoder calls (batch_size=32) and use Text Embeddings Inference (TEI) for GPU acceleration [2].
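Two of the checklist items lend themselves to small helpers. The sketch below shows a crude query-type router and a batching generator; the regular expression is a hypothetical heuristic (uppercase codes followed by digits, long digit runs, or quoted phrases), not a validated classifier, and both function names are introduced here for illustration.

```python
import re
from typing import Iterator

# Heuristic signals of an entity or ID lookup (an assumption of this sketch).
ENTITY_PATTERN = re.compile(r'\b[A-Z]{2,}[-_]?\d+\b|\b\d{4,}\b|"[^"]+"')


def choose_alpha(query: str) -> float:
    """Dense weight: 0.3 for entity-style lookups, 0.75 for natural language."""
    return 0.3 if ENTITY_PATTERN.search(query) else 0.75


def batched(items: list, size: int = 32) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


print(choose_alpha("SKU-48213 dimensions"))
print(choose_alpha("how do I clean a velvet sofa"))
print([len(batch) for batch in batched(list(range(70)))])
```

In practice each batch of (query, document) pairs would be passed to the cross-encoder in one call, and the router would be tuned against logged queries before being trusted.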