Decision. Pick the cheapest metric that still measures the thing: deterministic checks for anything with a ground truth, LLM-as-judge only for open-ended quality, and never trust a judge you haven’t calibrated against humans (target 80–90% agreement [7]). Treat position, verbosity, and self-preference bias as defaults that are on until you turn them off [5], and never ship a winner from a small eval set without a confidence interval: below a few hundred examples the CLT lies to you [9].
This is the methods backbone for the session. Three things to land with the room: (1) a metric taxonomy that tells you which tool for which output, (2) the LLM-as-judge bias + “who grades the grader” trap with concrete mitigations, (3) the statistics that separate a real win from noise on the small eval sets every consultant actually has.
1. The metric taxonomy — which tool for which output
Three families have crystallized by 2026: deterministic, rubric/model-based, and composite [2]; the table also breaks out statistical-NLP and embedding metrics as intermediate rungs. The decision rule: use the highest row in the table that can actually measure your output, because each step down costs more money, more latency, and more calibration debt.
| Family | Metric | What it measures | Reference? | When to reach for it |
|---|---|---|---|---|
| Deterministic | Exact match / regex / JSON-schema valid | Structural / closed-form correctness | reference-based | Classification, extraction, tool-call args, format gates [2] |
| Deterministic | F1 / precision / recall | Classification quality | reference-based | Labels, retrieval hit/miss [1] |
| Statistical NLP | BLEU / ROUGE / METEOR | n-gram precision / recall overlap | reference-based | Translation, summarization vs a gold answer [1] |
| Statistical NLP | Levenshtein / edit distance | Char-level distance | reference-based | Code diffs, OCR, near-exact strings [1] |
| Embedding | BERTScore / cosine sim | Semantic similarity | reference-based | Paraphrase-tolerant matching [1] |
| Statistical NLP | Perplexity / self-BLEU | Fluency / diversity | reference-free | No gold answer available [21] |
| Model-based | G-Eval (CoT judge) | Open-ended quality on a rubric | either | Coherence, helpfulness, tone [1] |
| Model-based | QAG (yes/no decomposition) | Factuality via closed sub-questions | either | Hallucination / faithfulness [1] |
| Model-based | RAG metrics (faithfulness, answer relevancy, contextual precision/recall) | Retrieval + grounding quality | mostly reference-free (contextual precision/recall typically need a gold answer) | RAG pipelines [1] |
| Composite | DAG (decision-tree of judge calls) | Multi-criterion pass/fail | either | Gated, auditable verdicts [1] |
Reference-based vs reference-free is the axis consultants most often get wrong. Reference-based metrics compare output to a predefined gold answer — great for constrained outputs, useless when there’s no single right answer [21]. Reference-free metrics (perplexity, self-BLEU, embedding-based, and most RAG metrics like RAGAS faithfulness) score the output against the input/context with no gold label [21][17]. Open-ended generation almost always forces you reference-free → into judge territory.
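The deterministic, reference-based end of the table is cheap enough to sketch in a few lines. The following is a minimal illustration using only the Python standard library; the function names are hypothetical, and the token F1 is a simplified whitespace-token variant rather than any particular library's implementation.

```python
import json
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def is_valid_json_object(text: str, required_keys: set) -> bool:
    """Format gate: text parses as a JSON object holding every required key."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on lowercased whitespace tokens (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance using a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]
```

All four need a gold answer or a schema to compare against, which is exactly why they stop working once the task has no single right output.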
Tooling anchors (workshop install targets): DeepEval ⭐ 16k (Jun 2026) for G-Eval/RAG/DAG [16], RAGAS ⭐ 14k (Jun 2026) for reference-free RAG metrics [17], and lm-evaluation-harness ⭐ 13k (Jun 2026) for static golden-dataset benchmark runs [18].
2. Golden datasets — the ground truth you can’t skip
A golden dataset is a curated, human-labeled set used as ground truth [3]. In a mature 2026 pipeline, a PR/merge triggers an automated judge run against the full golden set, with the judge calibrated to 80–90% agreement with the human reference before its scores are trusted [7]. Evaluation fires at three lifecycle points: offline (curated set), online (live traffic), and pre-merge (CI gate) [2]. The golden set is also the substrate the judge is calibrated against, so it does double duty as both benchmark and judge-validation oracle.
3. LLM-as-judge: single vs pairwise, and the biases that come free
Single (pointwise) = score one output against a rubric (absolute). Pairwise = pick the better of two (relative). Pairwise is easier for both models and humans to do reliably, but it’s where position bias bites hardest; pointwise scales better to CI gates and per-example thresholds. Note: position bias is not pairwise-only; it persists in rubric-based pointwise grading too [20].
The three biases to name on stage, with numbers:
| Bias | What it is | Measured magnitude | Mitigation |
|---|---|---|---|
| Position | Favors answer by slot, not quality | Judge picks first answer in 68% of comparisons even when humans prefer the second [14]; GPT-4 ~40% inconsistency under (A,B)/(B,A) swap [14]; 10–15 pt winrate swing [6] | Swap-and-average both orderings; treat order-dependent verdicts as ties [12] |
| Verbosity | Longer = better, even if thin | 15–30 pt inflated preference for verbose answers (Wang et al.) across GPT-4/Claude/PaLM-2 [6] | Length-controlled metric: regress out length [12] |
| Self-preference | Rates its own family higher | GPT-4 bias score 0.520; driven by perplexity, not self-recognition [4]; adds 10–25% uniform bias [6] | Cross-family judge: never judge a model with its own family [12] |
The key reframe for the audience: these are default behaviors, not edge cases — frontier judges fail 50%+ of bias tests, and the bias silently distorts every uncalibrated pipeline [5]. The self-preference mechanism is the surprising bit: GPT-4 over-scores low-perplexity text regardless of whether it generated it — so it’s familiarity, not vanity [4].
Diagnostic numbers worth knowing: position-bias studies decompose it into Repetition Stability (capable judges >0.95), Position Consistency (GPT-4/Claude-3.5 ~0.82, but Claude-3-Haiku collapses to 0.23 on harder benchmarks), and Preference Fairness [11]. Critically, position bias grows as the quality gap between the two answers shrinks and is only weakly correlated with prompt length, so it’s worst exactly when the two candidates are close, i.e. when you most need the judge [11].
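The swap-and-average mitigation from the bias table can be sketched as follows. This is a minimal illustration, not a specific library's API: the `Judge` callable and its `"first"`/`"second"` return convention are assumptions, and in practice the callable would wrap a real model call.

```python
from typing import Callable, List

# Hypothetical judge signature: (prompt, answer_shown_first, answer_shown_second)
# returns "first" or "second" for whichever slot it prefers.
Judge = Callable[[str, str, str], str]


def swap_and_average(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; return "A", "B", or "tie" if the verdict flips."""
    a_wins_when_first = judge(prompt, answer_a, answer_b) == "first"
    a_wins_when_second = judge(prompt, answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # order-dependent verdict, so it carries no quality signal


def win_rate_for_a(verdicts: List[str]) -> float:
    """Win rate for A, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    points = {"A": 1.0, "tie": 0.5, "B": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)


# Toy judges for demonstration only.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


print(swap_and_average(always_first, "q", "short", "a longer answer"))
print(swap_and_average(prefers_longer, "q", "short", "a longer answer"))
```

The purely position-biased toy judge collapses to a tie under the swap, while the length-preferring one still returns a consistent winner, which is a reminder that swapping addresses position bias only and leaves verbosity bias untouched.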
4. Rubric grading — design rules that actually move agreement
- Binary vs ordinal: binary pass/fail for hard gates (policy, regulated constraints, observable thresholds); ordinal when the thing has legitimate gradations [12].
- Scale choice matters empirically: human–LLM alignment is highest on a 0–5 scale [13]; vague 1–10 scoring causes “catastrophic score inflation” [12].
- 4–7 independent criteria, each tied to a real business failure mode, with anchor examples to lock the grading boundaries; merge overlapping criteria (grammar + spelling → “fluency”) so a single mistake isn’t double-penalized [12].
- Force chain-of-thought (G-Eval style): generating reasoning tokens before the score raises agreement with human experts [1].
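One way to make these design rules enforceable is to treat the rubric as data and validate it before it reaches the judge. The sketch below is illustrative only: the `Criterion` class, the validation limits, and the prompt wording are assumptions drawn from the rules above, not a format any particular framework requires.

```python
from dataclasses import dataclass
from typing import Dict, List

SCALE = range(0, 6)  # 0-5 ordinal scale


@dataclass
class Criterion:
    name: str
    failure_mode: str         # the business failure this criterion guards against
    anchors: Dict[int, str]   # score -> anchor example that fixes the boundary


def validate_rubric(criteria: List[Criterion]) -> None:
    """Raise ValueError if the rubric breaks the design rules."""
    if not 4 <= len(criteria) <= 7:
        raise ValueError(f"expected 4-7 criteria, got {len(criteria)}")
    names = [c.name for c in criteria]
    if len(names) != len(set(names)):
        raise ValueError("criterion names must be unique")
    for c in criteria:
        if not c.anchors:
            raise ValueError(f"{c.name}: needs at least one anchor example")
        out_of_scale = [s for s in c.anchors if s not in SCALE]
        if out_of_scale:
            raise ValueError(f"{c.name}: anchors outside 0-5: {out_of_scale}")


def build_judge_prompt(criteria: List[Criterion], output_text: str) -> str:
    """Render a prompt that asks for reasoning before each score."""
    validate_rubric(criteria)
    lines = ["Grade the OUTPUT on each criterion using a 0-5 scale.", ""]
    for c in criteria:
        lines.append(f"Criterion: {c.name} (guards against: {c.failure_mode})")
        for score in sorted(c.anchors):
            lines.append(f"  {score} = {c.anchors[score]}")
    lines += [
        "",
        "For each criterion, write 'Reasoning:' first, then 'Score:' on its own line.",
        "",
        "OUTPUT:",
        output_text,
    ]
    return "\n".join(lines)
```

Asking for the reasoning line before the score line is the part that mirrors the G-Eval-style rule; the uniqueness check is only a crude stand-in for merging overlapping criteria, which still needs a human read.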
5. Who grades the grader? — judge calibration
The judge is itself a model you have not evaluated. Calibration is the loop that makes its scores trustworthy:
- Collect human corrections on a sample of judge verdicts [8].
- Build few-shot examples from corrected cases into the judge prompt [8].
- Track agreement over time (κ / weighted κ for ordinal, % agreement for binary); alert on drops [7][8].
Targets: strong judges (GPT-5 class) hit 80–90% human agreement, comparable to (and sometimes above) the rate at which two human annotators agree with each other (~85%) [7][15]. For ordinal rubrics, aim for Krippendorff’s α ≥ 0.8 (0.67–0.8 supports only tentative conclusions) [12].
War story to tell: a team’s hired domain expert re-graded 50 production outputs and the expert-vs-judge κ was 0.31. The judge had been over-rewarding outputs from its own model family (family bias) and under-penalizing fluent hallucinations (length-confidence). Without calibration, the scores looked reasonable while being systematically wrong [7].
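The agreement statistics in the calibration loop are simple enough to compute by hand. Below is a minimal standard-library sketch of Cohen's κ and a quadratic-weighted κ for ordinal scores; the function names and the toy labels are illustrative assumptions, and a production pipeline would more likely use an established statistics library.

```python
from collections import Counter
from typing import List


def cohens_kappa(human: List[int], judge: List[int]) -> float:
    """Chance-corrected agreement for categorical labels."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


def quadratic_weighted_kappa(human: List[int], judge: List[int]) -> float:
    """Kappa for ordinal scores: distant disagreements are penalized more."""
    n = len(human)
    observed = sum((h - j) ** 2 for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[a] * judge_counts[b] * (a - b) ** 2
        for a in human_counts
        for b in judge_counts
    ) / n ** 2
    if expected == 0:
        return 1.0  # no disagreement is possible by chance
    return 1 - observed / expected


# Toy binary verdicts: the two raters agree on 8 of 10 items.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
judge_labels = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
print(round(cohens_kappa(human_labels, judge_labels), 3))
```

In this toy case 80% raw agreement corresponds to κ of about 0.58, because half of that agreement would be expected by chance given the label balance. That gap is the reason to track κ alongside raw percent agreement.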
Operational minimum: judge-prompt registry + sampled runner + scoring DB + calibration job against the gold-set + drift monitor on κ. The hard part isn’t the code — it’s the discipline of maintaining the gold-set and recalibrating when agreement drops [7]. Cheap recurring check: sample 5–10% of verdicts for human re-grade and watch the trend [7].
6. Statistical significance on small eval sets — the part everyone skips
The single most defensible slide: report the standard error of the mean beneath every eval score [10]. For binary correctness, SE = √(μ̂(1−μ̂)/n) [10].
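To make the formula concrete, here is a small sketch that computes the standard error and a normal-approximation 95% interval for a binary accuracy score. The function names and the 0.80 accuracy figure are illustrative assumptions, and the interval shown is the plain CLT one, so it is only meant to show the arithmetic.

```python
import math
from typing import Tuple


def binary_standard_error(accuracy: float, n: int) -> float:
    """SE of the mean of n independent 0/1 scores: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


def clt_interval(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval: accuracy +/- z * SE."""
    se = binary_standard_error(accuracy, n)
    return accuracy - z * se, accuracy + z * se


for n in (100, 400):
    se = binary_standard_error(0.80, n)
    low, high = clt_interval(0.80, n)
    print(f"n={n}: accuracy=0.80, SE={se:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

At 80% accuracy the SE is 0.04 with 100 examples and 0.02 with 400, so quadrupling the set only halves the error bar.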
Then the traps, all of which hit the ~50–500-example sets consultants actually have:
- The CLT lies below a few hundred examples. CLT-based intervals “dramatically underestimate uncertainty” — error bars too small, false precision [9]. Below ~100 datapoints, switch to Bayesian or bootstrap intervals instead [9][10].
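- Bootstrap for the verdict: resample both systems; if the 95% CIs don’t overlap, you have a real difference; if they overlap, it could be noise, so get more data or test the paired difference directly [22]. Cost reality: 1,000 bootstrap iterations × LLM-judge on a 200-example set = 200,000 API calls if you re-run the judge inside every resample [22]; score each example once, cache the verdicts, and resample the cached scores to keep it at 200 calls per system.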
- Pair your comparisons. Comparing models A and B on the same questions, analyze per-question differences (s_A − s_B), not separate means — a “free” variance reduction because models agree on which questions are hard [10].
- Cluster when questions are correlated (same source doc, translation pairs): cluster-adjusted SE can be 3× larger than naïve CLT — ignore it and you’ll call noise a win [10].
- Right significance test for the metric: McNemar’s for paired binary, Wilcoxon signed-rank for ordinal, paired t-test for continuous [22].
- Power, before you collect: detecting a gap half the size needs 4× the samples (quadratic) — set your minimum detectable effect first, then size the set [10].
- Variance reduction: sampling K outputs per question cuts within-question variance by a factor of K; using token log-probabilities as a continuous score (vs {0,1}) eliminates within-question noise and beats binary correctness on signal-to-noise. Don’t change temperature to do it [10].
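Two of the checks from the list above, a paired bootstrap on per-question differences and an exact McNemar test for paired binary outcomes, can be sketched with the standard library alone. The function names, the percentile method, and the default of 2,000 resamples are assumptions made for illustration; both functions operate on per-example scores that have already been computed, so they make no further judge calls.

```python
import math
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    iterations: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for mean(s_A - s_B), resampling questions with replacement."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(iterations))
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper


def mcnemar_exact_p(correct_a: List[int], correct_b: List[int]) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs only."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0  # the systems never disagree, so there is nothing to test
    tail = sum(math.comb(discordant, k) for k in range(min(only_a, only_b) + 1))
    return min(1.0, 2 * tail / 2 ** discordant)


# Toy paired results on the same 10 questions (1 = correct).
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
print(paired_bootstrap_ci(system_a, system_b))
print(mcnemar_exact_p(system_a, system_b))
```

If the bootstrap interval for the difference includes 0, or the McNemar p-value is above the chosen threshold, the comparison has not separated the two systems yet.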
Workshop hooks (what to do with this)
- Live demo: run the same pairwise judge with (A,B) then (B,A); show the verdict flip → motivates swap-and-average in one slide [14].
- Exercise: hand teams a 40-example set with a 4% accuracy “win” between two prompts; have them compute the CI and discover it’s not significant [9][10].
- Takeaway artifact: a one-page decision tree — deterministic → embedding → judge, then “is it calibrated?” and “is the CI separated?” gates.
- The line that lands: “An uncalibrated LLM judge is a confident intern who never gets reviewed.” Calibration + CIs are the review.
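To back the 40-example exercise with numbers, a rough sizing sketch helps. It uses the textbook two-proportion sample-size formula for an unpaired comparison at 5% significance and 80% power; that choice, the function name, and the 80% baseline are assumptions for illustration, and a paired design on shared questions would typically need fewer examples.

```python
import math
from statistics import NormalDist


def examples_per_arm(
    p_baseline: float,
    p_candidate: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate examples per system needed to detect the accuracy gap."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_baseline * (1 - p_baseline) + p_candidate * (1 - p_candidate)
    gap = p_candidate - p_baseline
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / gap ** 2)


four_point_gap = examples_per_arm(0.80, 0.84)
two_point_gap = examples_per_arm(0.80, 0.82)
print(four_point_gap, two_point_gap, round(two_point_gap / four_point_gap, 2))
```

Under these assumptions a 4-point gap at an 80% baseline needs well over a thousand examples per system, and halving the gap roughly quadruples that, which is why a 40-example set cannot settle the question.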