Chunking Strategies for RAG Pipelines

TL;DR: Start with recursive character splitting at 512 tokens; it outperforms semantic chunking on most benchmarks while running about 14× faster [6]. Layer Contextual Retrieval on top (contextual embeddings, contextual BM25, and reranking) to cut top-20 retrieval failures by up to 67% [5]. Use hierarchical (small-to-big) chunking when you need both retrieval precision and generation context [8].

Strategy Comparison

Strategy	Throughput	When to use	Tooling
Fixed-size	Fastest	Prototyping only; breaks semantic boundaries	`CharacterTextSplitter` (LangChain)
Recursive character	4.82 MB/s ⭐	General-purpose default; respects paragraph/sentence/word hierarchy	`RecursiveCharacterTextSplitter` (LangChain)
Token-based	~4.82 MB/s	Hard token-budget limits, multilingual text	`TokenTextSplitter` (LangChain)
Sentence-based	Medium	Legal, news; no mid-sentence breaks	`SentenceSplitter` (LlamaIndex)
Structure-aware (MD/HTML)	Fast	Docs, wikis — biggest retrieval gain for structured text	Unstructured.io, `MarkdownTextSplitter`
Semantic	0.33 MB/s (~14× slower)	Topic-dense knowledge bases where accuracy > speed	`SemanticChunker` (LangChain), `SemanticSplitterNodeParser` (LlamaIndex)
Late chunking	~token-based speed	Ambiguous cross-references, pronoun resolution	Jina AI embedding API
Hierarchical (small-to-big)	Medium	Complex Q&A needing retrieval precision + generation context	`ParentDocumentRetriever` (LangChain)
Contextual Retrieval	Preprocessing cost	High-value production corpora	Claude 3 Haiku + prompt caching
Agentic / LLM-decided	10–50× slower	Messy structured docs (contracts, research papers)	Experimental only

⭐ = recommended default [1] [9]
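
A minimal sketch of the idea behind the recommended default: try the coarsest separator first (paragraphs) and fall back to finer ones (lines, then words) only for pieces that are still too long. This is a simplified standard-library illustration, not LangChain's `RecursiveCharacterTextSplitter` implementation. It measures length in characters rather than tokens, omits sentence-level separators, and the function name, separator list, and `max_chars` default are assumptions.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            # Greedily merge neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current.strip())
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too long: recurse with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current.strip())
    return [c for c in chunks if c]
```

Short paragraphs are kept whole and merged with neighbours when they fit, while an oversized paragraph is broken down at line and then word boundaries. A fixed-size splitter would cut at arbitrary offsets instead.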

Chunk Size by Query Type

Query type	Size	Notes
Factoid (dates, names)	64–256 tokens	Smaller = higher precision for point lookups
General-purpose	256–512 tokens	Sweet spot for most chat-style RAG
Analytical / multi-hop	512–1024 tokens	Multi-concept reasoning needs more context
Long-document QA	1500–2048 tokens	Pair with a strong cross-encoder reranker

Sources: [2] [9]. There is a context cliff: quality degrades beyond ~2,500 tokens per chunk [11].
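
The table above can be captured as a small lookup. The sketch below is only an illustration: the dictionary keys, the function name, and the choice to start at the upper end of each range are assumptions, while the ranges and the ~2,500-token cliff come from the text.

```python
# Token ranges per query type, copied from the table above.
CHUNK_SIZE_TOKENS = {
    "factoid": (64, 256),
    "general": (256, 512),
    "analytical": (512, 1024),
    "long_document": (1500, 2048),
}
CONTEXT_CLIFF_TOKENS = 2500  # quality degrades beyond roughly this size [11]


def pick_chunk_size(query_type: str) -> int:
    """Return a starting chunk size in tokens for a query type."""
    low, high = CHUNK_SIZE_TOKENS[query_type]
    # Assumption: start at the top of the range and shrink if recall is poor.
    return min(high, CONTEXT_CLIFF_TOKENS)
```

Starting at the upper bound gives 512 tokens for general-purpose queries, which matches the baseline recommended elsewhere in this guide. Treat the returned value as a first guess to validate against your own evaluation set.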

Key Benchmarks

Recursive at 512 tokens: 69% end-to-end accuracy across 50 academic papers (Vecta, Feb 2026) [2]
Semantic chunking: 54% in the same benchmark — 14.6× slower (4.82 → 0.33 MB/s) [6]
LLMSemanticChunker: 0.919 recall; ClusterSemanticChunker: 0.913 recall — at much higher cost [1]
Fixed-size baseline: context recall 0.72 (~1 in 4 queries misses relevant content) [10]
Sentence window optimization: context recall rises to 0.88, precision to 0.83 [10]
Best recall gap between strategies: up to 9% [7]

Advanced Techniques

Late Chunking

Encodes the full document with a long-context embedding model, then derives chunk vectors by pooling the already-contextualized token embeddings, so each chunk vector keeps context from text outside its boundaries. Yields ~3.5% relative improvement on BeIR retrieval benchmarks [3]. Gains are largest at small chunk sizes. Requires a long-context embedding model (Jina AI’s API exposes this directly).
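
The pooling step can be sketched in a few lines. The example assumes the token embeddings have already been produced by a long-context model over the whole document (toy 2-dimensional vectors stand in for them here) and assumes mean pooling. The function name and the `(start, end)` span format are hypothetical.

```python
from statistics import fmean


def late_chunk(token_embeddings, spans):
    """Mean-pool contextualized token vectors over each (start, end) span."""
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        # zip(*window) transposes tokens x dims into dims x tokens.
        chunk_vectors.append([fmean(dim) for dim in zip(*window)])
    return chunk_vectors


# Toy stand-in for the output of a long-context embedding model (4 tokens, 2 dims).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
vectors = late_chunk(token_embeddings, spans=[(0, 2), (2, 4)])
```

The key difference from ordinary chunking is the order of operations: the document is embedded once as a whole, and chunk boundaries are applied afterwards to the token vectors rather than to the raw text.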

Contextual Retrieval

Anthropic’s technique: a small model (Claude 3 Haiku) generates a 50–100 token contextual description for each chunk and prepends it before embedding [5]. The layered gains on top-20 retrieval failure rate:

Layer added	Failure rate	Reduction
Baseline	5.7%	—
+ Contextual Embeddings	3.7%	35%
+ Contextual BM25	2.9%	49%
+ Reranking	1.9%	67%

Cost: ~$1.02 per million document tokens with prompt caching. Worth it for corpora that don’t change often [5].
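
A rough sketch of the preprocessing step and its cost, under stated assumptions: `describe` is a hypothetical callable standing in for the LLM call that writes the 50–100 token context, and `fake_describe` is a placeholder so the example runs offline. It is not Anthropic's implementation. The price constant is the figure quoted above.

```python
from typing import Callable

PRICE_PER_MILLION_TOKENS = 1.02  # USD with prompt caching, as quoted above [5]


def contextualize(chunks, document, describe: Callable[[str, str], str]):
    """Prepend a short generated context to each chunk before embedding."""
    return [f"{describe(document, chunk)}\n\n{chunk}" for chunk in chunks]


def preprocessing_cost_usd(document_tokens: int) -> float:
    """Estimate the one-time cost of contextualizing a corpus."""
    return document_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


def fake_describe(document: str, chunk: str) -> str:
    # Placeholder for the LLM call; a real pipeline would prompt a small model.
    return f"From a document beginning {document[:30]!r}."


enriched = contextualize(["Revenue grew 3%."], "ACME Q2 report ...", fake_describe)
```

The enriched strings are what get embedded and indexed for BM25; the original chunk text is still available after the blank line. At the quoted rate, a 50-million-token corpus costs about $51 to preprocess.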

Hierarchical / Small-to-Big

Index small child chunks (200–400 tokens) for retrieval precision; return their parent chunks (1500–3000 tokens) to the LLM for generation context [8]. This adds pipeline complexity (two stores: one for child vectors, one for parent documents) but resolves the classic trade-off between retrieval precision and generation context. LangChain’s `ParentDocumentRetriever` implements this pattern directly.
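
The small-to-big pattern can be illustrated without any framework. The sketch below is a toy, not the `ParentDocumentRetriever` API: sizes are counted in words instead of tokens, word overlap stands in for vector similarity, and all function names are hypothetical.

```python
def build_small_to_big(words, parent_size=8, child_size=4):
    """Return (children, parents); each child records the id of its parent."""
    parents, children = {}, []
    for pid, start in enumerate(range(0, len(words), parent_size)):
        parent_words = words[start:start + parent_size]
        parents[pid] = " ".join(parent_words)
        for c in range(0, len(parent_words), child_size):
            children.append((" ".join(parent_words[c:c + child_size]), pid))
    return children, parents


def retrieve_parent(query, children, parents):
    """Match the query against small chunks, return the enclosing parent."""
    terms = set(query.lower().split())

    def overlap(child):
        # Toy relevance score: number of shared words (stand-in for similarity).
        return len(terms & set(child[0].lower().split()))

    _best_text, best_pid = max(children, key=overlap)
    return parents[best_pid]
```

The child list plays the role of the vector index and the `parents` dictionary plays the role of the document store, which is where the "two stores" complexity comes from.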

Agentic / Propositional Chunking

An LLM identifies logical propositions and groups them into coherent retrieval units — highest-fidelity for complex implicit structure (legal contracts, research papers) but 10–50× indexing cost [4]. In one 2026 study, agentic chunking reached 94.5% accuracy, ~4% above fixed-token approaches. Treat as a last resort after measuring baselines.

The Overlap Question

Conventional wisdom: 10–20% overlap prevents context loss at boundaries. Reality: a Jan 2026 arXiv study (Natural Questions dataset, SPLADE retrieval) found that overlap provides no measurable retrieval benefit and only increases indexing cost [11]. A separate production analysis found up to 14.5% recall improvement with overlap in dense-retriever setups [7]. The verdict: treat overlap as a variable to test on your data, not a free default.
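
To make the indexing-cost side of this trade-off concrete, here is a small sketch of fixed-size windows with a configurable overlap. The function name is an assumption, and integers stand in for real tokens.

```python
def sliding_chunks(tokens, size, overlap):
    """Cut tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


tokens = list(range(1000))  # integers stand in for real tokens
plain = sliding_chunks(tokens, size=100, overlap=0)
overlapped = sliding_chunks(tokens, size=100, overlap=20)
indexed_plain = sum(len(c) for c in plain)
indexed_overlapped = sum(len(c) for c in overlapped)
```

On this toy input, 20% overlap produces 13 chunks and 1,240 indexed tokens instead of 10 chunks and 1,000 tokens. Whether that extra indexing cost buys any recall is the part to measure on your own data.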

Decision Framework

Parse first. Layout-aware parsing (Unstructured.io, Docling) preserves tables, headers, and reading order before any splitting. Weak parsing → weak chunks regardless of strategy.
Baseline. Recursive character splitting, 512 tokens, 0–10% overlap. Measure with RAGAS: context recall, context precision, faithfulness [10].
Size to query type. If context recall is low on factoid queries → shrink to 128–256 tokens. If low on analytical queries → expand to 768–1024 tokens.
Structure wins first. If documents have headers or HTML structure, switch to structure-aware splitting before going semantic — often the biggest single gain for structured docs [4].
Upgrade cost-effectively. Add Contextual Retrieval before semantic chunking: better failure reduction, one-time preprocessing cost, no inference-time latency [5].
Hierarchical when precision + context both matter. Small-to-big adds pipeline complexity but pays off for complex Q&A [8].
Semantic / agentic last. Only after measuring a gap that justifies 14–50× ingestion cost.

80% of RAG failures trace to the ingestion and chunking layer — not the LLM, not the embedding model [2]. The canonical production stack: 400–600 token recursive chunks → retrieve top-30–50 → rerank to top-5 with a cross-encoder.
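
The retrieve-then-rerank stage of that stack has a simple two-pass shape. In the sketch below, `cheap_score` and `rerank_score` are hypothetical callables standing in for a vector search and a cross-encoder; the toy scorers only keep the example runnable.

```python
def retrieve_then_rerank(query, chunks, cheap_score, rerank_score,
                         candidates=30, final=5):
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    shortlist = sorted(chunks, key=lambda c: cheap_score(query, c),
                       reverse=True)[:candidates]
    return sorted(shortlist, key=lambda c: rerank_score(query, c),
                  reverse=True)[:final]


def word_overlap(query, chunk):
    # Toy first-stage score: number of shared words.
    return len(set(query.split()) & set(chunk.split()))


def overlap_density(query, chunk):
    # Toy second-stage score: shared words relative to chunk length.
    return word_overlap(query, chunk) / len(chunk.split())


docs = ["cats purr", "cats and dogs play in the garden with cats",
        "dogs bark", "fish swim"]
top = retrieve_then_rerank("cats dogs", docs, word_overlap, overlap_density,
                           candidates=3, final=2)
```

The expensive scorer only ever sees the shortlist, which is why the first stage can afford to retrieve 30–50 candidates while the reranker keeps 5.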

Tooling

Tool	Role
Chonkie (4.1k GitHub stars)	Fastest Python chunking library; 32+ integrations, 56 languages [12]
LangChain text splitters	`RecursiveCharacterTextSplitter` is the de-facto default
LlamaIndex `SentenceSplitter`	Sentence-aware splits; `SemanticSplitterNodeParser` for semantic
Unstructured.io	Layout-aware PDF/HTML parsing before chunking
Jina AI embeddings API	Late chunking via long-context encoder
RAGAS	Evaluation: context recall, precision, faithfulness