Evals — Vibes Don't Scale: A Complete Technical Session Blueprint

A complete session blueprint for expert developers covering golden datasets, LLM-as-judge validation, CI delta gates, and a live regression demo that catches what vibe checks miss.

7 succeeded 192 sources ~42 min read #206

The seven threads converge on one claim: the bottleneck in LLM evaluation is not tooling, it’s validation discipline [1]. Any of the eight viable open-source libraries can run your suite; none of them will tell you whether the suite is measuring what you think it is.

The dependency chain is strict. The golden dataset comes first: roughly 246 examples give a ±5% margin at 95% confidence [2] when the expected pass rate is near 80% (the worst case, a 50% pass rate, needs about 385), and “20 examples” can’t move a launch decision. Then validate the judge against those human labels using true positive rate (TPR) and true negative rate (TNR), not accuracy [3]: on a set where 90% of examples pass, a judge that always predicts “pass” scores 90% accuracy while catching zero real failures. Only then can CI gates be meaningful: use statistical delta gates (mean drop + Welch’s t + effect size) rather than absolute score floors [4].
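A minimal Python sketch of the three links in that chain, using only the standard library. Everything named here is an assumption for illustration: the function names, the convention that "pass" is the positive class, and the gate thresholds (`max_mean_drop`, `alpha`, `min_effect`) are hypothetical defaults rather than values from the cited sources. The p-value uses a normal approximation to Welch's t, which is reasonable only for larger samples; `scipy.stats.ttest_ind(..., equal_var=False)` would give the exact Welch test.

```python
import math
from statistics import NormalDist, mean, variance


def required_sample_size(expected_pass_rate: float, margin: float = 0.05,
                         confidence: float = 0.95) -> int:
    """Normal-approximation sample size for estimating a pass rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = expected_pass_rate
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def judge_rates(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Accuracy, TPR and TNR of a judge against human labels.

    Assumed convention: True means "pass" and is the positive class, so
    TNR is the share of human-labeled failures the judge also fails.
    """
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)
    tn = sum((not h) and (not j) for h, j in pairs)
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both passing and failing human labels")
    return {
        "accuracy": (tp + tn) / len(human),
        "tpr": tp / positives,
        "tnr": tn / negatives,
    }


def delta_gate(baseline: list[float], candidate: list[float],
               max_mean_drop: float = 0.02, alpha: float = 0.05,
               min_effect: float = 0.2) -> bool:
    """Return True when the candidate run should be blocked.

    Blocks only if all three signals agree: the mean dropped by more than
    max_mean_drop, the drop is significant, and the effect size is not tiny.
    """
    drop = mean(baseline) - mean(candidate)
    var_b, var_c = variance(baseline), variance(candidate)
    se = math.sqrt(var_b / len(baseline) + var_c / len(candidate))
    if se == 0:  # no variance in either run: fall back to the raw drop
        return drop > max_mean_drop
    t_stat = drop / se  # Welch's t statistic (unequal variances)
    p_one_sided = 1 - NormalDist().cdf(t_stat)  # normal approximation, not exact t
    cohens_d = drop / math.sqrt((var_b + var_c) / 2)
    return drop > max_mean_drop and p_one_sided < alpha and cohens_d >= min_effect


print(required_sample_size(0.8))  # 246
print(required_sample_size(0.5))  # 385

# An always-"pass" judge on a set where 90% of examples truly pass.
print(judge_rates([True] * 90 + [False] * 10, [True] * 100))
```

The last call shows why accuracy misleads: the always-pass judge reports 0.9 accuracy and a TPR of 1.0, but its TNR is 0.0, meaning it caught none of the ten real failures.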

Two pieces of practitioner consensus cut across every angle and should anchor the session’s prescriptions: (1) binary pass/fail beats Likert scales, both for ground-truth labeling [3] and for judge rubrics [5] — the gap between a 3 and a 4 is subjective, and annotators default to the middle to avoid hard calls; (2) decompose vague rubrics into specific binary checks — “is this response good?” is unjudgeable; “does the response name the correct refund policy?” is not [6].
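One way to picture the decomposition is as a list of named binary checks, each returning a plain pass or fail. The sketch below is illustrative only: `BinaryCheck`, `REFUND_CHECKS`, and the refund-policy wording are hypothetical, and simple string predicates stand in for what would usually be a separate judge call per check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BinaryCheck:
    name: str
    passes: Callable[[str], bool]  # True means the response passes this check


# Hypothetical checks for a refund-support bot; the policy details are invented.
REFUND_CHECKS = [
    BinaryCheck(
        "names_30_day_policy",
        lambda r: "30-day" in r.lower() or "30 day" in r.lower(),
    ),
    BinaryCheck(
        "does_not_promise_cash",
        lambda r: "cash refund" not in r.lower(),
    ),
]


def grade(response: str, checks: list[BinaryCheck]) -> dict[str, bool]:
    """Run every binary check and report each verdict by name."""
    return {check.name: check.passes(response) for check in checks}


print(grade("Our 30-day refund policy applies to this order.", REFUND_CHECKS))
print(grade("Sure, we offer a cash refund at any time.", REFUND_CHECKS))
```

Reporting each verdict by name, rather than one blended score, keeps a failure traceable to the specific criterion that broke.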

The bias catalogue is systematically underestimated. Position bias alone causes 20–40% of close-pair pairwise verdicts to flip on swap [7]. A verbosity padding attack fooled Claude-v1 and GPT-3.5 91.3% of the time [7]. The mitigation is cheap: run pairwise in both orders, count only consistent wins, and judge with a different model family than the one under test [8]. Few teams do this, so it belongs in the session as a concrete recommendation.
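The both-orders mitigation can be sketched in a few lines of Python. The `Judge` signature and the verdict strings below are assumptions made for illustration, not the interface of any particular library; the judge is passed in as a callable so a real model call can replace the stubs.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]
# Assumed signature: (prompt, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], Verdict]


def consistent_winner(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Judge the pair in both orders and count a win only if both agree.

    Returns "a", "b", or "none" when the verdict flips on swap or is a tie.
    """
    forward = judge(prompt, a, b)   # a shown first
    swapped = judge(prompt, b, a)   # b shown first
    if forward == "first" and swapped == "second":
        return "a"
    if forward == "second" and swapped == "first":
        return "b"
    return "none"


# Stub judges standing in for a model call.
def position_biased(prompt: str, first: str, second: str) -> Verdict:
    return "first"  # always prefers whichever answer is shown first


def prefers_shorter(prompt: str, first: str, second: str) -> Verdict:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) < len(second) else "second"


print(consistent_winner(position_biased, "q", "short", "much longer answer"))  # none
print(consistent_winner(prefers_shorter, "q", "short", "much longer answer"))  # a
```

A purely position-driven judge never produces a consistent win under this rule, which is the point: its verdicts drop out of the tally instead of inflating one side.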

A governance note for tool selection: Promptfoo — the most-starred CI-gating library at ⭐22k and the natural choice for the demo — was acquired by OpenAI in March 2026 [9]. Teams evaluating Anthropic, Llama, or Gemini models should factor vendor alignment into their toolchain; DeepEval (⭐16k, Python/pytest) and Inspect AI (UK AISI, reproducible safety evals) are mature alternatives with independent governance.

Security is a pre-condition, not a footnote. Eval pipelines are themselves an attack surface: LLM judges are susceptible to adversarial inputs, prompt injection, and token-level exploits [10]. The EU AI Act’s high-risk AI provisions take effect August 2026, mandating eval data lineage documentation [11]. If eval data contains production traces, consent provenance is required before it can legally leave your boundary for a cloud eval service.

The live demo closes the talk’s thesis. The 3-act arc — vibe check, innocent edit, eval catches it — passes the “I would have done that” test: every developer in the room has shipped a well-intentioned wording change that silently altered behavior elsewhere [12]. Promptfoo with --cache keeps the demo deterministic on conference Wi-Fi; a 60-second pre-recorded fallback is the cheapest insurance policy in live demo design [13].

The question none of the seven angles fully closes: how do you know when your rubric has converged? EvalGen names the paradox — you need criteria to grade outputs, but grading outputs is what reveals the criteria [14] — and operationalizes it via human-grading cycles, but the stopping condition remains heuristic. The honest answer for an expert audience: validate TPR/TNR across two annotation rounds on fresh data; if the numbers are stable, you’re as converged as you can be without more human labels. Nobody has shipped a better answer yet.
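The two-round stability heuristic can be written down as a simple comparison. This is a sketch under stated assumptions: `RoundCounts`, `rubric_converged`, and the 0.05 tolerance are hypothetical choices, and "pass" is treated as the positive class. The source gives no numeric threshold for "stable", so the tolerance is a knob to set per project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundCounts:
    """Judge-vs-human confusion counts for one annotation round."""
    tp: int  # human pass, judge pass
    fn: int  # human pass, judge fail
    tn: int  # human fail, judge fail
    fp: int  # human fail, judge pass

    @property
    def tpr(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def tnr(self) -> float:
        return self.tn / (self.tn + self.fp)


def rubric_converged(first: RoundCounts, second: RoundCounts,
                     tolerance: float = 0.05) -> bool:
    """True when TPR and TNR each moved by no more than `tolerance`."""
    return (abs(first.tpr - second.tpr) <= tolerance
            and abs(first.tnr - second.tnr) <= tolerance)


round_one = RoundCounts(tp=90, fn=10, tn=40, fp=10)
round_two = RoundCounts(tp=88, fn=12, tn=41, fp=9)
print(rubric_converged(round_one, round_two))  # True
```

With small rounds, sampling noise alone can move either rate by more than a tight tolerance, so the check is only as trustworthy as the number of fresh labels behind it.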

Sub-topics