Evals: How Do You Know Your AI Works? — A Session Blueprint

Decision — Ship this as a 2-hour, demo-driven hands-on session standardized on DeepEval, and teach error analysis, not tool-clicking. In 2026 the eval harness is commoditized; the methodology above it is the hard part and the thing a consultancy can actually sell.

Across all five angles one theme dominates: the eval harness is now a solved, commoditized layer, and the field has converged on a two-tier stack: a lightweight CI-gating framework (DeepEval, Ragas, or Promptfoo) paired with an annotation-and-dashboard platform (Braintrust, LangSmith, or Arize) [1]. The real differentiator is the discipline above the tools: build a golden set, write a binary LLM-as-judge grader, and calibrate that judge against human labels via true-positive and true-negative rates (TPR/TNR) rather than raw agreement [2], since raw agreement can look high on an imbalanced set even when the judge misses every failure. That is the “who grades the grader” loop, and it’s exactly where most client projects fly blind.
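The calibration step can be sketched in a few lines of Python. This is a minimal illustration, not the API of any of the tools named above: the `calibrate` function and `JudgeCalibration` type are hypothetical names, and the labels are made-up booleans where True means "pass".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeCalibration:
    tpr: float        # share of human-labelled passes the judge also passes
    tnr: float        # share of human-labelled failures the judge also fails
    agreement: float  # raw fraction of items where judge and human match


def calibrate(human: list[bool], judge: list[bool]) -> JudgeCalibration:
    """Compare binary judge verdicts against human labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("golden set needs both passing and failing examples")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return JudgeCalibration(
        tpr=true_pos / positives,
        tnr=true_neg / negatives,
        agreement=(true_pos + true_neg) / len(human),
    )


if __name__ == "__main__":
    # Hypothetical imbalanced golden set: 18 passes, 2 failures.
    human_labels = [True] * 18 + [False] * 2
    # A lazy judge that passes everything.
    judge_labels = [True] * 20
    print(calibrate(human_labels, judge_labels))
```

In this made-up case the judge agrees with the humans on 90% of items, yet its TNR is 0: it never catches a failure. That gap is what calibrating on TPR/TNR exposes and raw agreement hides.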

Tooling choice is unusually time-sensitive. OpenAI’s acquisition of Promptfoo (9 March 2026, ~$86M; the project stays MIT) [3][4] turns the most-starred leader into a vendor-aligned bet — a real concern when you ship provider-neutral evaluations for clients. That tips the lab’s default to DeepEval: pytest-native, CI-ready, provider-neutral (see tooling landscape and workshop design).

The angles reinforce each other on why evals matter at all. The reliability math is the punchline: a 90%-reliable agent step compounds to ~43% end-to-end success over 8 steps (0.9^8 ≈ 0.43, i.e. a ~57% failure rate; pass^k) [5]. That is the quantitative case for the planted-regression demo and for blocking the build on a statistical gate, not a vibe (see evals in CI/CD). Agents and RAG also break the single-turn assumption entirely: you need trajectory and tool-use correctness, plus RAG’s faithfulness / context-precision-recall decomposition against explicit thresholds [6], so the eval layer has to be component-aware, not just end-to-end. This is also the cleanest hook back into the wider AI track, whose RAG and MCP/agent sessions produce precisely the systems these evals are built to test.
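The compounding arithmetic is easy to check directly. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; the function names are illustrative and not taken from any eval framework.

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` end-to-end success."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # 0.9 ** 8 is roughly 0.43: fewer than half of 8-step runs succeed.
    print(f"{end_to_end_success(0.9, 8):.3f}")
    # Per-step reliability needed for 90% success over 8 steps (about 0.987).
    print(f"{required_step_reliability(0.9, 8):.3f}")
```

Inverting the formula shows how demanding the target is: under the same independence assumption, getting 90% end-to-end success over 8 steps requires each step to be roughly 98.7% reliable.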

Sharpest open question before scheduling: run it as a true hands-on lab — highest learning, highest setup-failure risk — or as a demo-driven hybrid that degrades gracefully to a talk?

Sub-topics

Eval Methodologies & Metrics for LLM Systems: The Taxonomy, the Judge Problem, and the Statistics

LLM Eval Tooling Landscape 2026: A Consultancy's Decision Guide

Wiring LLM Evals into CI/CD: gates, flaky judges, and cost budgets (2026)

Agent & RAG Evals in 2026: Trajectories, Tool-Use, and Ragas Metrics

Teaching Evals: A 2h Hands-On Workshop (and Talk Variant) for Expert Devs