Session blueprint: RAG & Embeddings — from demos in 5 minutes to a production-grade pipeline (2026)

Nearly every fix in this session is a retrieval fix. The naive RAG failure modes research identifies seven production failure modes; six of them (semantic drift, boundary bleed, lost-in-the-middle degradation, sparse-term mismatch, missing context for ambiguous queries, and over-retrieval noise) are retrieval failures, not generation failures [1]. This is the organizing principle that gives the session its arc: each subsequent retrieval layer (chunking, hybrid search, reranking) is not a feature addition but a targeted retrieval patch, and the guardrail is the single layer that acts after retrieval.

The compounding cost problem must be made visible early. The corpus embedding & indexing research documents a 3–4× latency spread across embedding model tiers alone [2]; chunking strategies add preprocessing overhead that scales with corpus size; hybrid search and reranking introduce a cross-encoder pass that typically adds 50–200 ms per query [3]; and the “I don’t know” guardrail adds an output-scoring step on top [4]. Each layer is justified in isolation, but unless cumulative P95 latency is measured at every checkpoint, participants leave with a system that is too slow to ship without knowing which layer made it so. The session arc scaffold should wire a time.perf_counter block around the full pipeline from the first segment and print the result after each live-coding step so the cost graph builds in front of the room [5].
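The timing scaffold described above could look roughly like the following. This is a minimal sketch, not the session's actual scaffold: LatencyLog, fake_pipeline, and the checkpoint labels are hypothetical names, and the sleeps merely simulate added per-layer cost.

```python
import time
from contextlib import contextmanager
from statistics import quantiles


class LatencyLog:
    """Collects per-query wall-clock latencies, grouped by checkpoint label."""

    def __init__(self):
        self.samples = {}  # label -> list of latencies in milliseconds

    @contextmanager
    def measure(self, label):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples.setdefault(label, []).append(elapsed_ms)

    def p95(self, label):
        data = self.samples[label]
        if len(data) < 2:
            return data[0]
        # quantiles(n=20) returns 19 cut points; the last is the 95th percentile.
        return quantiles(data, n=20, method="inclusive")[-1]

    def report(self):
        for label, data in self.samples.items():
            print(f"{label:<16} n={len(data):<4} P95={self.p95(label):8.2f} ms")


def fake_pipeline(query, extra_delay_s=0.0):
    # Hypothetical stand-in for the real retrieve-and-generate call.
    time.sleep(0.001 + extra_delay_s)
    return f"answer to {query}"


log = LatencyLog()
# Each checkpoint re-runs the same queries through the full pipeline.
for checkpoint, delay in [("naive", 0.0), ("naive+rerank", 0.002)]:
    for i in range(20):
        with log.measure(checkpoint):
            fake_pipeline(f"q{i}", extra_delay_s=delay)
log.report()
```

Wrapping the whole pipeline rather than individual stages keeps the number comparable across checkpoints: every row of the report answers the same question, namely how long one query takes end to end at this point in the session.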

The contextual retrieval moment is the sharpest debate catalyst. Anthropic’s contextual retrieval technique, which prepends a generated context summary to each chunk before embedding, is documented in the chunking strategies research as producing step-change recall improvements [6]. The catch: it roughly doubles embedding cost per document and requires an LLM prompt call per chunk at index time. For an expert audience, this is not a settled best practice; it is a trade-off that depends directly on whether the corpus is updated daily (cost matters) or indexed once (cost matters less). The facilitator should surface this as an open vote: raise your hand if your corpus ingests more than 10 000 documents per day.
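To make the index-time cost concrete, the prepend-then-embed step can be sketched as below. This is an illustration under stated assumptions, not Anthropic's implementation: summarize_context is a hypothetical stub standing in for the per-chunk LLM call, split_into_chunks is a deliberately naive paragraph splitter, and no embedding model is invoked.

```python
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    original: str        # the raw chunk text
    contextualized: str  # context summary + chunk; this is what gets embedded


def summarize_context(document, chunk):
    # Hypothetical stub for the per-chunk LLM prompt call.
    # A real call would describe where the chunk sits within the document.
    title = document.strip().splitlines()[0]
    return f"From the document '{title}'."


def split_into_chunks(document):
    # Naive splitter (assumption): one chunk per blank-line-separated paragraph.
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def contextualize(document, llm=summarize_context):
    """Return the contextualized chunks and the number of LLM calls spent."""
    chunks = []
    llm_calls = 0
    for chunk in split_into_chunks(document):
        context = llm(document, chunk)
        llm_calls += 1  # one prompt call per chunk, paid at index time
        chunks.append(IndexedChunk(chunk, f"{context}\n\n{chunk}"))
    return chunks, llm_calls


doc = (
    "Refund policy\n\n"
    "Refunds are issued within 14 days.\n\n"
    "Contact support to start a claim."
)
chunks, calls = contextualize(doc)
print(f"{len(chunks)} chunks, {calls} LLM calls at index time")
for c in chunks:
    print(len(c.original), "->", len(c.contextualized), "characters to embed")
```

The call counter is the point of the sketch: the number of prompt calls grows linearly with the number of chunks, which is why ingestion rate decides whether the technique is affordable.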

Evaluation must precede optimization, not follow it. The RAG evaluation research and the observability research converge on the same prescription: instrument the dual-stage metrics (precision@k and recall for retrieval; faithfulness and hallucination rate for generation) before any tuning begins [7] [8]. This contradicts the session arc as scaffolded, which introduces evaluation as a late segment. The facilitator fix: deploy a minimal RAGAS harness in Segment 1 alongside the naive pipeline, then re-run it at each checkpoint. The quality graph across five stages is far more persuasive than a one-shot score at the end, and it makes the hallucination-rate drop after adding the guardrail visually undeniable.
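The retrieval half of the dual-stage metrics can be computed without any framework. The sketch below is plain Python, not the RAGAS API; the document IDs, the relevance labels, and the two checkpoint names are made-up illustrations.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


# Hypothetical gold labels for one query, and ranked results at two checkpoints.
relevant = {"d1", "d2", "d4"}
runs = {
    "naive": ["d3", "d7", "d1", "d9", "d2"],
    "hybrid+rerank": ["d1", "d2", "d3", "d4", "d9"],
}

for checkpoint, retrieved in runs.items():
    p = precision_at_k(retrieved, relevant, k=3)
    r = recall_at_k(retrieved, relevant, k=3)
    print(f"{checkpoint:<14} precision@3={p:.2f} recall@3={r:.2f}")
```

Re-running the same labelled queries at every checkpoint is what turns these two numbers into the quality graph; the generation-side metrics need a judge model or a labelled answer set and are out of scope for this sketch.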

The “I don’t know” guardrail is the only layer that touches generation. All other layers improve what gets retrieved; the guardrail is the only one that intervenes after retrieval, scoring whether the retrieved context actually supports the answer before returning it [9]. This makes it the natural final layer, but it also means it is sensitive to retrieval quality in the preceding layers. A threshold applied to naive retrieval will over-trigger (too many “I don’t know” responses); the same threshold after hybrid search + reranking will abstain far less often, because the retrieved context now supports more answers. Facilitators should run the guardrail demo twice: once on the naive pipeline to show false abstentions, and once after reranking to show the same threshold behaving correctly.
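The two-run demo can be sketched with a toy support scorer. Everything here is an assumption made for illustration: token overlap is a crude stand-in for a real faithfulness or entailment scorer, the 0.6 threshold is arbitrary, and the example passages are invented.

```python
import re

ABSTAIN = "I don't know."


def tokens(text):
    # Lowercased alphanumeric tokens, as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(answer, contexts):
    """Fraction of answer tokens that also occur in the retrieved contexts.

    Crude lexical proxy (assumption); a production guardrail would use an
    entailment model or an LLM judge instead.
    """
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(tokens(c) for c in contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def guarded_answer(answer, contexts, threshold=0.6):
    """Return the answer only if the context supports it; otherwise abstain."""
    score = support_score(answer, contexts)
    return (answer if score >= threshold else ABSTAIN), score


answer = "Refunds are issued within 14 days"
naive_contexts = ["Shipping takes 14 days within the EU."]
reranked_contexts = ["Refunds are issued within 14 days of purchase."]

for label, contexts in [("naive", naive_contexts), ("reranked", reranked_contexts)]:
    reply, score = guarded_answer(answer, contexts)
    print(f"{label:<9} score={score:.2f} -> {reply}")
```

The answer and the threshold are identical in both runs; only the retrieved context changes, which is exactly the dependency on upstream retrieval quality that the demo is meant to expose.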

The open question expert participants will ask first: at what corpus size does an in-process vector store (FAISS, LanceDB) break down and force a migration to a managed service? The corpus embedding & indexing research covers ANN algorithm trade-offs and managed-vs-self-hosted DB options but stops short of a concrete inflection-point benchmark [10]. The honest facilitator answer is that it depends on query concurrency, not corpus size alone: as an illustration, a 10 M-vector corpus served by a single-process FAISS index can collapse under 50 concurrent users, while a managed Qdrant or Weaviate cluster can handle the same load at 100 M vectors. That framing, not a number, is the correct answer to give the room.

Sub-topics

Corpus embedding & indexing

Naive RAG failure modes (live demo)

Chunking strategies

Hybrid search & reranking

"I don't know" guardrail

Session arc & live-coding scaffold

RAG evaluation

Observability & tracing