TL;DR
Naive RAG fails at retrieval, not generation. The bottleneck is threefold: poor chunk boundaries (fixed 512-token splits break context), lossy embeddings (similarity ≠ relevance), and rankings that place the right answer too low. Production systems from 2024–2025 report a retrieval failure rate of roughly 40%. [1] [2] Fixes: switch to late or contextual chunking, add a cross-encoder reranker, and use hybrid search (lexical + vector).
The Seven Failure Modes
1. Chunking destroys context. Default 512-token splits break paragraphs mid-thought and separate questions from their answers. Teams leave 8–15 points of recall on the table with naive chunking alone. [1] Contextual and late chunking with thematic boundary detection cut top-20 retrieval failures by up to 67%. [3]
2. Embedding similarity ≠ relevance. Bi-encoders compress each paragraph into a single lossy 1536-dimensional point. A relevant document can still score low because dense vectors capture aggregate semantic similarity across the corpus, not a human judgment of relevance to the specific question. [4]
3. Ranking finds the answer but buries it. The retriever pulls 20 docs and the right one lands at position 15, so when only the top few are passed along, the LLM never sees it. Hybrid retrieval (BM25 exact-match plus dense vector search) combined with cross-encoder reranking boosts precision by 18–42%. It adds 50–200 ms of latency but often cuts downstream LLM token spend enough to come out cheaper overall. [3]
4. Knowledge base is incomplete or stale. Docs are missing, content is outdated, or the index was built with a 2024-era embedding model that is two generations behind by 2026. [1] Fix: changefeeds, content ownership, and regular coverage audits.
5. Context overflow. Too many snippets exceed the token limit or dilute the signal. Semantic chunking and adaptive retrieval depth (stopping at K docs rather than K tokens) address this. [5]
6. Query complexity. A single user question is often compound or vague. Systems that generate multiple paraphrases or hypothetical answers before retrieval widen the query space and improve hit rate. [2]
7. Access-control leak. Vector indices contain chunks the user shouldn’t see. Similarity matching exfiltrates restricted content unless chunk-level ACLs are enforced at retrieval time. [1]
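To make failure mode 1 concrete, here is a minimal sketch of boundary-aware chunking: it packs whole sentences into chunks under a size budget instead of cutting at a fixed offset. It is not the contextual or late chunking cited above, only a simpler illustration of respecting boundaries. The function name `chunk_by_sentences`, the regex sentence splitter, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
import re


def chunk_by_sentences(text: str, max_words: int = 512) -> list[str]:
    """Pack whole sentences into chunks, targeting at most max_words each.

    Assumption: whitespace-separated words stand in for model tokens; a real
    pipeline would count tokens with the embedding model's own tokenizer.
    A single sentence longer than max_words becomes its own oversized chunk.
    """
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk before it would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Hypothetical sample text: a question followed directly by its answer.
sample = (
    "What is the refund window? Refunds are accepted within 30 days. "
    "Shipping is free over 50 dollars. Contact support for exceptions."
)
sample_chunks = chunk_by_sentences(sample, max_words=12)
```

With a 12-word budget the question and its answer stay in the same chunk, whereas a fixed-offset split can land between them.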
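Failure modes 3 and 7 can be sketched together: merge a lexical ranking and a dense ranking into one list, after dropping chunks the user may not see. The fusion method here is reciprocal rank fusion (RRF), one common way to combine rankings; the source does not say which fusion method it has in mind, so treat that choice as an assumption. The chunk IDs, group names, and both helper functions are hypothetical, and the two input rankings stand in for the output of a BM25 index and a vector index.

```python
from collections import defaultdict


def filter_by_acl(
    ranking: list[str],
    chunk_groups: dict[str, set[str]],
    user_groups: set[str],
) -> list[str]:
    """Keep only chunks that share at least one group with the user.

    Chunks with no ACL entry are dropped (deny by default). Order is preserved.
    """
    return [cid for cid in ranking if chunk_groups.get(cid, set()) & user_groups]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs with reciprocal rank fusion.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings from a lexical index and a dense index.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c5", "c3"]

# Hypothetical chunk-level ACLs and the requesting user's groups.
chunk_groups = {
    "c1": {"staff"},
    "c3": {"staff", "public"},
    "c5": {"finance"},
    "c7": {"public"},
}
user_groups = {"public"}

# Filter each ranking first so restricted chunks never reach fusion or the LLM.
allowed = [filter_by_acl(r, chunk_groups, user_groups) for r in (bm25_hits, dense_hits)]
fused = reciprocal_rank_fusion(allowed)
```

Filtering before fusion keeps restricted chunks from influencing the final order at all. In a real deployment the filter would more likely be pushed down into the index query itself, and a cross-encoder reranker would then rescore the fused list.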
Why It Matters
Naive RAG reaches a ~40% failure rate on real corpora. [1] Most failures are retrieval failures (the right answer is not in the top-k), not generation failures (the LLM hallucinating). The fix is not a better model but better retrieval engineering: chunking strategy, reranking, and hybrid search.
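One way to check whether failures are on the retrieval side is to measure hit rate at k on a small labeled query set: the fraction of queries with at least one known-relevant chunk in the top k results. The sketch below assumes such labels exist; the function name `hit_rate_at_k` and the toy data are hypothetical and are not results from the cited sources.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of labeled queries with a relevant chunk in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, gold in relevant.items()
        if gold & set(results.get(query, [])[:k])
    )
    return hits / len(relevant)


# Toy data: ranked chunk IDs per query, and the chunks labeled as relevant.
results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["g", "h", "i"],
    "q4": ["j", "k", "l"],
}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"h"}, "q4": {"z"}}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(results, relevant, k))
```

In this toy set, q2 and q3 only count as hits at larger k (a ranking problem), while q4 never hits because its relevant chunk was not retrieved at all (a coverage or chunking problem).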