expedition · Session Blueprint · 2026-06-09

Evals — Vibes Don't Scale

A Complete Technical Session Blueprint — 7 modules for expert developers

7 modules

192 citations

42 min total

~246 golden examples needed

Core Thesis

The bottleneck in LLM evaluation is not tooling —
it's validation discipline.

Any of the eight viable open-source libraries can run your suite. None of them tell you whether the suite is measuring what you think it is. Golden dataset → judge validation → CI gates. That's the strict sequence. ^[1]

Dependency Chain — critical path

Golden Dataset
Construction

7 min · 37 src

→

LLM-as-Judge
Grader Design

9 min · 44 src

→

Wiring Evals
into CI

7 min · 18 src

→

Live Regression
Demo Design

5 min · 15 src

Also supports 04: 03 · Eval Framework Landscape 2026

Context & prerequisites: 06 · Security & Governance, 07 · Metrics Beyond Accuracy

Seven Modules

MODULE 01 expedition

Golden Dataset Construction

Mine real failures via error analysis; seed ~20 cases by hand, grow from production traces and synthetic generation; size for ~250 examples using statistical power math.

Build a 250+ example golden set from production failures, not generic benchmarks

37 citations·7 min

MODULE 02 expedition

LLM-as-Judge Grader Design

Write a binary decomposed rubric; neutralize position bias (20–40% flip rate) and verbosity attacks (91.3% fool rate); validate TPR/TNR against human labels — not accuracy.

Design a grader with TPR/TNR > 0.85 validated against real human labels

44 citations·9 min
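
One common way to neutralize position bias in pairwise grading is to ask the judge twice with the candidates swapped and keep only verdicts that survive the swap. The sketch below assumes a hypothetical `judge(question, a, b)` callable that returns "A" or "B"; that signature and the two stub judges are illustrative, not part of any specific library.

```python
from typing import Callable

# Hypothetical judge signature: returns "A" or "B" for the slot it prefers.
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, first: str, second: str) -> str:
    """Query the judge in both orders; a verdict that flips becomes a tie."""
    forward = judge(question, first, second)    # `first` shown in slot A
    backward = judge(question, second, first)   # `first` shown in slot B
    first_wins_forward = forward == "A"
    first_wins_backward = backward == "B"
    if first_wins_forward and first_wins_backward:
        return "first"
    if not first_wins_forward and not first_wins_backward:
        return "second"
    return "tie"  # the verdict depended on position, so do not trust it


# Stub judges for illustration only.
def always_picks_slot_a(question: str, a: str, b: str) -> str:
    return "A"  # pure position bias


def picks_longer(question: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # position-independent (but verbosity-biased)


print(debiased_verdict(always_picks_slot_a, "q", "x", "y"))
print(debiased_verdict(picks_longer, "q", "short", "much longer answer"))
```

The share of pairs that come back as "tie" is a direct estimate of the judge's flip rate on your own data.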

⬡

MODULE 03 expedition

Eval Framework Landscape 2026

8 viable OSS libs mapped: DeepEval ⭐16k, Promptfoo ⭐22k (OpenAI-acquired), Inspect AI (UK AISI), Langfuse, Braintrust. Scoring methods, biases, agent/RAG support, CI gating.

Select a framework with governance clarity; understand each lib's scoring bias trade-offs

63 citations·10 min

MODULE 04 survey

Wiring Evals into CI

Path-scoped triggers, a layered eval pyramid, and statistical delta gates (mean drop + Welch's t + effect size). Gate on delta, not absolute floors — floors let regressions pass undetected.

Implement path-scoped CI triggers with statistical delta gates that catch real regressions

18 citations·7 min
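
A delta gate of the kind described above can be sketched with the standard library alone. This version computes the mean drop, Welch's t statistic, and Cohen's d, and blocks only when all three agree. Two assumptions to flag: the thresholds are illustrative defaults, and the p-value uses a normal approximation to the t distribution, which is reasonable for suites of a few hundred examples (`scipy.stats.ttest_ind` with `equal_var=False` gives the exact Welch test if SciPy is available).

```python
import math
from statistics import NormalDist, mean, variance


def delta_gate(baseline, candidate, max_drop=0.02, alpha=0.05, min_effect=0.2):
    """Block only if the drop is large, statistically significant, and non-trivial.

    baseline / candidate: per-example scores (e.g. 0/1 pass flags) from two runs.
    max_drop, alpha, min_effect: hypothetical thresholds, tune per project.
    """
    drop = mean(baseline) - mean(candidate)  # positive means the candidate is worse
    var_b, var_c = variance(baseline), variance(candidate)

    # Welch's t: unequal-variance standard error of the difference in means.
    se = math.sqrt(var_b / len(baseline) + var_c / len(candidate))
    if se == 0:
        t_stat = 0.0
        p_value = 0.0 if drop > 0 else 1.0
    else:
        t_stat = drop / se
        p_value = 1.0 - NormalDist().cdf(t_stat)  # one-sided, normal approximation

    # Cohen's d with the average of the two variances as the pooled variance.
    pooled_sd = math.sqrt((var_b + var_c) / 2)
    effect = drop / pooled_sd if pooled_sd > 0 else 0.0

    blocked = drop > max_drop and p_value < alpha and effect >= min_effect
    return {"drop": drop, "t": t_stat, "p": p_value, "d": effect, "blocked": blocked}


baseline = [1] * 180 + [0] * 20    # 90% pass
regressed = [1] * 160 + [0] * 40   # 80% pass
noisy = [1] * 178 + [0] * 22       # 89% pass

print(delta_gate(baseline, regressed)["blocked"])
print(delta_gate(baseline, noisy)["blocked"])
```

In this toy data the 90% to 80% drop is blocked while the one-point wobble is not. An absolute floor of, say, 0.75 would have let the 80% run through, which is the failure mode the delta gate is meant to close.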

▶

MODULE 05 survey

Live Regression Demo Design

3-act arc: vibe check passes → innocent wording edit → eval catches the regression. Promptfoo with --cache for determinism on conference Wi-Fi. 60-second pre-recorded fallback essential.

Run a 3-minute demo that converts skeptics via the "I would have done that" test

15 citations·5 min

⚑

MODULE 06 recon

Eval Security & Data Governance

Eval pipelines are an attack surface: LLM judges are susceptible to adversarial inputs, prompt injection, and token-level exploits. EU AI Act high-risk provisions take effect August 2026.

Enumerate your eval attack surface; document data lineage before EU AI Act deadline (Aug 2026)

8 citations·2 min

◈

MODULE 07 recon

Metrics Beyond Accuracy

Accuracy hides failures on imbalanced data. Precision/Recall for trade-offs; F1 for imbalance; PR-AUC for rare positives. Match the metric to the use case before picking a grader rubric.

Match metric to use case; understand when accuracy is actively misleading

7 citations·2 min
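
To make the accuracy trap concrete, here is a small worked example on made-up data: 1,000 items with 20 rare positives, and a classifier that flags 10 items, 5 of them correctly. The counts are invented for illustration only.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for boolean labels (True = positive)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 20 true positives in 1,000 items; the classifier catches 5 and raises 5 false alarms.
y_true = [True] * 20 + [False] * 980
y_pred = [True] * 5 + [False] * 15 + [True] * 5 + [False] * 975

for name, value in classification_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.980 while recall is 0.250 and F1 is about 0.333: the headline number looks fine even though three quarters of the rare cases are missed.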

Learning Paths

⚡ Quick Intel

4 min · recon modules only

07 · Metrics Beyond Accuracy

06 · Security & Governance

⚙ Core Engineering Track

23 min · the strict dependency chain

01 · Golden Dataset Construction

02 · LLM-as-Judge Grader Design

04 · Wiring Evals into CI

↗ Skeptic's Hook

21 min · demo first, then the why

05 · Live Regression Demo Design

02 · LLM-as-Judge Grader Design

01 · Golden Dataset Construction

◎ Full Expedition

42 min · all 7 modules in order

07 · Metrics Beyond Accuracy

01 → 02 → 03 (expedition core)

04 → 05 (integration + demo)

06 · Security & Governance

Critical Numbers

~246

examples needed for 95% confidence at ±5% margin.
"20 examples" can't move a launch decision.

[2] Statistical power math
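
The ~246 figure is consistent with the standard normal-approximation sample size for a proportion, n = z² · p(1 − p) / e², if the expected pass rate p is taken as roughly 0.8. That value of p is an assumption made here to reproduce the number; the blueprint does not state it. A minimal sketch:

```python
import math
from statistics import NormalDist


def required_examples(margin: float, confidence: float = 0.95,
                      expected_pass_rate: float = 0.8) -> int:
    """Normal-approximation sample size for estimating a pass rate.

    expected_pass_rate=0.8 is an assumed value; 0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    n = z ** 2 * expected_pass_rate * (1 - expected_pass_rate) / margin ** 2
    return math.ceil(n)


def margin_for(n: int, confidence: float = 0.95,
               expected_pass_rate: float = 0.8) -> float:
    """Half-width of the confidence interval you get from n examples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(expected_pass_rate * (1 - expected_pass_rate) / n)


print(required_examples(0.05))                          # assumed p = 0.8
print(required_examples(0.05, expected_pass_rate=0.5))  # worst-case p = 0.5
print(round(margin_for(20), 3))                         # margin with 20 examples
```

Under these assumptions the first call gives 246 and the worst-case call gives 385. With only 20 examples the margin is about ±17.5 percentage points, which is why a 20-example set cannot support a launch decision.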

91.3%

of the time, a verbosity padding attack fooled Claude-v1 and GPT-3.5 — at zero cost to the attacker.

[7] arxiv: LLMs are not fair evaluators

20–40%

of close-pair pairwise verdicts flip when the position of candidates is swapped — position bias.

[7] arxiv: LLMs are not fair evaluators

TPR/TNR

are the metrics that matter for judge validation, not accuracy. On a set where 90% of outputs truly pass, a judge that always predicts "pass" scores 90% accuracy while catching zero real failures.

[3] Hamel's Evals FAQ
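
The always-"pass" judge can be checked in a few lines. The sketch below treats "pass" as the positive class, so TPR is the share of true passes the judge accepts and TNR is the share of true failures it catches; that labeling convention and the 90/10 split are assumptions for illustration.

```python
import math


def judge_rates(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels (True = pass)."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    return {
        "accuracy": (tp + tn) / len(human),
        "tpr": tp / positives if positives else math.nan,
        "tnr": tn / negatives if negatives else math.nan,
    }


human_labels = [True] * 90 + [False] * 10   # 90% of outputs truly pass
always_pass = [True] * 100                  # judge that never fails anything

print(judge_rates(human_labels, always_pass))
```

Accuracy is 0.9 and TPR is 1.0, but TNR is 0.0: every real failure slips through, which a TPR/TNR > 0.85 target would reject immediately.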

⚠

Toolchain governance note: Promptfoo (⭐22k), the natural CI-gating choice and demo tool, was acquired by OpenAI in March 2026. Teams evaluating Anthropic, Llama, or Gemini models should weigh vendor alignment before adopting it. Alternatives: DeepEval ⭐16k (Python/pytest) and Inspect AI (UK AISI, reproducible safety evals) are mature and independently governed. EU AI Act high-risk provisions take effect August 2026, and eval data lineage documentation is mandatory. ^[11]