expedition · Session Blueprint · 2026-06-09

Evals — Vibes Don't Scale

A Complete Technical Session Blueprint — 7 modules for expert developers

7 modules
192 citations
42 min total
~246 golden examples needed
Core Thesis

The bottleneck in LLM evaluation is not tooling —
it's validation discipline.

Any of the eight viable open-source libraries can run your suite. None of them tells you whether the suite measures what you think it measures. Golden dataset → judge validation → CI gates. That's the strict sequence. [1]

[Figure: Hamel's LLM eval dependency diagram]
Dependency Chain — critical path
01 · Golden Dataset Construction · 7 min · 37 src
02 · LLM-as-Judge Grader Design · 9 min · 44 src
04 · Wiring Evals into CI · 7 min · 18 src
05 · Live Regression Demo Design · 5 min · 15 src
Seven Modules
MODULE 01 expedition
Golden Dataset Construction

Mine real failures via error analysis; seed ~20 cases by hand, grow from production traces and synthetic generation; size for ~250 examples using statistical power math.

Build a 250+ example golden set from production failures, not generic benchmarks
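The sizing step can be sketched with the standard margin-of-error formula for a proportion. This is a minimal illustration, not the session's own calculation: the function names are hypothetical, and the 80% expected pass rate is an assumption chosen because it reproduces the ~246 figure at 95% confidence and a ±5% margin.

```python
import math
from statistics import NormalDist


def required_sample_size(expected_pass_rate: float, margin: float,
                         confidence: float = 0.95) -> int:
    """Examples needed to estimate a pass rate within +/- margin."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * expected_pass_rate * (1 - expected_pass_rate) / margin ** 2
    return math.ceil(n)


def margin_of_error(expected_pass_rate: float, n: int,
                    confidence: float = 0.95) -> float:
    """Half-width of the confidence interval for a set of n examples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(expected_pass_rate * (1 - expected_pass_rate) / n)


# Assumed 80% pass rate (hypothetical input, not taken from real traces).
print(required_sample_size(0.80, 0.05))  # 246
# Worst case: a 50% pass rate maximizes variance.
print(required_sample_size(0.50, 0.05))  # 385
# A 20-example set leaves a margin of roughly +/- 17.5 points.
print(round(margin_of_error(0.80, 20), 3))
```

The last line shows why a hand-seeded set of 20 is a starting point rather than a decision instrument: its interval is wider than most regressions you would want to catch.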
[Figure: promptfoo LLM-as-judge]
MODULE 02 expedition
LLM-as-Judge Grader Design

Write a binary decomposed rubric; neutralize position bias (20–40% flip rate) and verbosity attacks (91.3% fool rate); validate TPR/TNR against human labels — not accuracy.

Design a grader with TPR/TNR > 0.85 validated against real human labels
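One common way to neutralize position bias is to run each pairwise comparison in both orders and accept a verdict only when the two runs agree. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; `toy_judge` is a stand-in with deliberate position bias on close pairs, not a real model call.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface: takes the two answers in display order
# and returns "first" or "second" for whichever it prefers.
Judge = Callable[[str, str], str]


def debiased_verdict(judge: Judge, answer_a: str, answer_b: str) -> str:
    """Ask the judge in both orders; keep the verdict only if it is stable."""
    a_wins_forward = judge(answer_a, answer_b) == "first"    # A shown first
    a_wins_backward = judge(answer_b, answer_a) == "second"  # A shown second
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "inconsistent"  # verdict flipped when positions were swapped


def flip_rate(judge: Judge, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of pairs whose verdict changes under a position swap."""
    pairs = list(pairs)
    flips = sum(debiased_verdict(judge, a, b) == "inconsistent"
                for a, b in pairs)
    return flips / len(pairs)


def toy_judge(first: str, second: str) -> str:
    """Stand-in judge: prefers longer answers, and on near-equal lengths
    always picks whichever answer is shown first (position bias)."""
    if abs(len(first) - len(second)) <= 5:
        return "first"
    return "first" if len(first) > len(second) else "second"


pairs = [("abc", "abd"), ("short", "a much longer answer")]
print(debiased_verdict(toy_judge, "abc", "abd"))  # inconsistent
print(flip_rate(toy_judge, pairs))                # 0.5
```

Treating inconsistent pairs as ties, or routing them to human review, costs a second judge call per pair. Note that the toy judge's length preference is itself a verbosity bias, which swapping positions does not remove.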
MODULE 03 expedition
Eval Framework Landscape 2026

8 viable OSS libs mapped, including DeepEval ⭐16k, Promptfoo ⭐22k (OpenAI-acquired), Inspect AI (UK AISI), Langfuse, and Braintrust. Compared on scoring methods, biases, agent/RAG support, and CI gating.

Select a framework with governance clarity; understand each lib's scoring bias trade-offs
[Figure: CI/CD LLM eval pipeline]
MODULE 04 survey
Wiring Evals into CI

Path-scoped triggers, a layered eval pyramid, and statistical delta gates (mean drop + Welch's t + effect size). Gate on delta, not absolute floors — floors let regressions pass undetected.

Implement path-scoped CI triggers with statistical delta gates that catch real regressions
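A delta gate of the kind described here might look like the following sketch. It combines the three signals named above: mean drop, a Welch-style test statistic, and an effect size (Cohen's d). The thresholds, the function name, and the sample data are all assumptions for illustration, and the p-value uses a normal approximation to the t distribution, which is reasonable only for suites of a few dozen examples or more.

```python
import math
from statistics import NormalDist, mean, variance


def _safe_ratio(num: float, den: float) -> float:
    """Divide, treating a zero denominator as an infinitely strong signal."""
    if den > 0:
        return num / den
    return 0.0 if num == 0 else math.copysign(math.inf, num)


def delta_gate(baseline, candidate, max_drop=0.02, alpha=0.05,
               min_effect=0.2):
    """Flag a regression only when all three signals agree.

    Thresholds are illustrative defaults, not recommended values.
    """
    var_b, var_c = variance(baseline), variance(candidate)
    drop = mean(baseline) - mean(candidate)
    # Welch's statistic: unpooled standard error, unequal variances allowed.
    std_err = math.sqrt(var_b / len(baseline) + var_c / len(candidate))
    t_stat = _safe_ratio(drop, std_err)
    # One-sided p-value via a large-sample normal approximation.
    p_value = 1 - NormalDist().cdf(t_stat)
    # Cohen's d using the average of the two variances.
    effect = _safe_ratio(drop, math.sqrt((var_b + var_c) / 2))
    regressed = drop > max_drop and p_value < alpha and effect > min_effect
    return {"drop": drop, "t": t_stat, "p": p_value,
            "effect": effect, "regressed": regressed}


# Hypothetical per-example pass/fail scores (1.0 = pass).
baseline = [1.0] * 90 + [0.0] * 10   # 90% pass on main
candidate = [1.0] * 70 + [0.0] * 30  # 70% pass on the pull request

ABSOLUTE_FLOOR = 0.65  # assumed floor, to show what it misses
print(mean(candidate) >= ABSOLUTE_FLOOR)           # True: floor gate passes
print(delta_gate(baseline, candidate)["regressed"])  # True: delta gate fails
```

The example shows the failure mode the module warns about: a 20-point drop clears a 0.65 floor yet is caught by the delta gate.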
MODULE 05 survey
Live Regression Demo Design

3-act arc: vibe check passes → innocent wording edit → eval catches the regression. Promptfoo with --cache for determinism on conference Wi-Fi. 60-second pre-recorded fallback essential.

Run a 3-minute demo that converts skeptics via the "I would have done that" test
MODULE 06 recon
Eval Security & Data Governance

Eval pipelines are an attack surface: LLM judges are susceptible to adversarial inputs, prompt injection, and token-level exploits. EU AI Act high-risk provisions take effect August 2026.

Enumerate your eval attack surface; document data lineage before EU AI Act deadline (Aug 2026)
MODULE 07 recon
Metrics Beyond Accuracy

Accuracy hides failures on imbalanced data. Precision/Recall for trade-offs; F1 for imbalance; PR-AUC for rare positives. Match the metric to the use case before picking a grader rubric.

Match metric to use case; understand when accuracy is actively misleading
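The precision/recall trade-off is easiest to see by moving a decision threshold over the same scored examples. The sketch below uses made-up labels and scores (3 positives in 10 examples) and a hypothetical helper name. It is meant only to show how accuracy can stay flat while precision and recall move in opposite directions.

```python
def metrics_at_threshold(labels, scores, threshold):
    """Precision, recall, F1, and accuracy when score >= threshold
    is treated as a positive prediction."""
    preds = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (tp + tn) / len(labels)}


# Illustrative data: 1 marks the rare positive class.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.8, 0.5, 0.35):
    m = metrics_at_threshold(labels, scores, threshold)
    print(threshold, {k: round(v, 2) for k, v in m.items()})
```

On this data the thresholds 0.8 and 0.5 both give 0.8 accuracy, yet recall doubles from 1/3 to 2/3 while precision falls from 1.0 to 2/3. Which of those operating points is right depends on whether a missed positive or a false alarm costs more in the use case.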
Critical Numbers
~246
examples needed for 95% confidence at a ±5% margin, assuming a pass rate near 80% (a worst-case 50% pass rate raises this to ~385).
"20 examples" can't move a launch decision.
91.3%
of the time, a verbosity padding attack fooled Claude-v1 and GPT-3.5 — at zero cost to the attacker.
20–40%
of close-pair pairwise verdicts flip when the position of candidates is swapped — position bias.
TPR/TNR
are the metrics that matter for judge validation, not accuracy. On a set where 90% of outputs pass, a judge that always predicts "pass" scores 90% accuracy while catching zero real failures.
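The always-pass case can be checked directly. The sketch below assumes "pass" is the positive class, so TPR is the share of human-passed outputs the judge also passes and TNR is the share of human-failed outputs the judge also fails. The label lists are synthetic, and the 0.85 bar is the target named in Module 02.

```python
def judge_rates(human_labels, judge_labels):
    """TPR, TNR, and accuracy of a judge against human labels (True = pass)."""
    positives = sum(1 for h in human_labels if h)
    negatives = len(human_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both passing and failing human labels")
    tp = sum(1 for h, j in zip(human_labels, judge_labels) if h and j)
    tn = sum(1 for h, j in zip(human_labels, judge_labels)
             if not h and not j)
    return {"tpr": tp / positives, "tnr": tn / negatives,
            "accuracy": (tp + tn) / len(human_labels)}


def meets_bar(rates, bar=0.85):
    """Both rates must clear the bar; accuracy is deliberately ignored."""
    return rates["tpr"] > bar and rates["tnr"] > bar


# Synthetic labels: 90 human-passed outputs, 10 human-failed.
human = [True] * 90 + [False] * 10

always_pass = [True] * 100
# A plausible judge: passes 85 of 90 good outputs, catches 9 of 10 bad ones.
decent = [True] * 85 + [False] * 5 + [False] * 9 + [True] * 1

print(judge_rates(human, always_pass))  # tpr 1.0, tnr 0.0, accuracy 0.9
print(meets_bar(judge_rates(human, always_pass)))  # False
print(meets_bar(judge_rates(human, decent)))       # True
```

The two judges differ by only four points of accuracy (0.90 versus 0.94), while TNR separates them completely (0.0 versus 0.9).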
Toolchain governance note: Promptfoo (⭐22k), the natural CI-gating choice and demo tool, was acquired by OpenAI in March 2026. Teams evaluating Anthropic, Llama, or Gemini models should weigh whether that ownership affects vendor neutrality. Alternatives: DeepEval (⭐16k, Python/pytest) and Inspect AI (UK AISI, reproducible safety evals) are mature and independently governed. EU AI Act high-risk provisions take effect August 2026; eval data lineage documentation is mandatory. [11]