Evals: How Do You Know Your AI Works? — A Session Blueprint

Everything needed to run an evals session for a consultancy that ships AI to clients: the methodology that actually matters, the 2026 tooling shake-up, CI gates, agent/RAG specifics, and a runnable 2-hour lab.

Decision — Ship this as a 2-hour, demo-driven hands-on session standardized on DeepEval, and teach error analysis, not tool-clicking. In 2026 the eval harness is commoditized; the methodology above it is the hard part and the thing a consultancy can actually sell.

Across all five angles one theme dominates: the eval harness is now a solved, commoditized layer, and the field has converged on a two-tier stack, pairing a lightweight CI-gating framework (DeepEval, Ragas, or Promptfoo) with an annotation-and-dashboard platform (Braintrust, LangSmith, or Arize) [1]. The real differentiator is the discipline above the tools: build a golden set, write a binary LLM-as-judge grader, and calibrate that judge against human labels using true-positive and true-negative rates (TPR/TNR) rather than raw agreement, which can look high even when the judge misses every failure [2]. That is the “who grades the grader” loop, and it’s exactly where most client projects fly blind.
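To make the calibration step concrete, here is a minimal sketch of scoring a binary judge against human labels. It assumes "pass" is the positive class and that both label lists are aligned per example; the names `JudgeCalibration` and `calibrate_judge` are hypothetical and not taken from any of the frameworks above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgeCalibration:
    tpr: float        # share of human "pass" items the judge also passes
    tnr: float        # share of human "fail" items the judge also fails
    agreement: float  # raw share of items where judge and human match


def calibrate_judge(human: Sequence[bool], judge: Sequence[bool]) -> JudgeCalibration:
    """Compare judge verdicts to human labels (True = pass, an assumption)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return JudgeCalibration(
        tpr=true_pos / positives,
        tnr=true_neg / negatives,
        agreement=(true_pos + true_neg) / len(human),
    )


# Illustrative data only: 18 human passes, 2 human fails,
# and a lazy judge that passes everything.
human_labels = [True] * 18 + [False] * 2
judge_labels = [True] * 20
print(calibrate_judge(human_labels, judge_labels))
```

On this made-up set the judge shows 90% raw agreement and a TPR of 1.0, yet its TNR is 0.0: it never catches a failure. That gap is the reason to report TPR and TNR separately when the golden set is imbalanced.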

Tooling choice is unusually time-sensitive. OpenAI’s acquisition of Promptfoo (9 March 2026, ~$86M; the project stays MIT-licensed) [3][4] turns the most-starred leader into a vendor-aligned bet, a real concern when you ship provider-neutral evaluations for clients. That tips the lab’s default to DeepEval: pytest-native, CI-ready, provider-neutral (see tooling landscape and workshop design).

The angles reinforce each other on why evals matter at all. The reliability math is the punchline: assuming steps fail independently, a 90%-reliable agent step compounds to only ~43% end-to-end success over 8 steps (pass^k: 0.9^8 ≈ 0.43, so roughly 57% of runs fail) [5]. That is the quantitative case for the planted-regression demo and for blocking the build on a statistical gate, not a vibe (see evals in CI/CD). Agents and RAG also break the single-turn assumption entirely: you need trajectory and tool-use correctness, plus RAG’s decomposition into faithfulness, context precision, and context recall, each checked against explicit thresholds [6], so the eval layer has to be component-aware, not just end-to-end. This is also the cleanest hook back into the wider AI track, whose RAG and MCP/agent sessions produce precisely the systems these evals are built to test.
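The compounding arithmetic and the idea of a statistical gate can be sketched in a few lines. This is an illustration under stated assumptions, not the method of any cited source: steps are treated as independent, and the gate uses a Wilson score lower bound at 95% confidence, which is one reasonable choice among several. The function names and the example numbers are hypothetical.

```python
import math


def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for an observed pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def gate(passes: int, trials: int, threshold: float) -> bool:
    """Pass the build only if the pass rate is credibly above the threshold."""
    return wilson_lower_bound(passes, trials) >= threshold


print(f"8 steps at 90% each: {end_to_end_success(0.9, 8):.3f}")
print(f"45/50 passes, lower bound: {wilson_lower_bound(45, 50):.3f}")
print(f"gate at 0.85: {gate(45, 50, 0.85)}")
```

With 45 passes out of 50, the observed rate is 90% but the lower bound is about 0.79, so a gate set at 0.85 blocks the build. Gating on the bound rather than the raw rate keeps a small, lucky eval run from waving a regression through.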

Sharpest open question before scheduling: run it as a true hands-on lab — highest learning, highest setup-failure risk — or as a demo-driven hybrid that degrades gracefully to a talk?

Sub-topics