Decision (workshop spine). A single-turn “is the final answer good?” eval is the wrong instrument for agents and RAG. Build a three-layer diagnostic stack: end-to-end outcome → trajectory/path → failing component, each with its own metric and dataset [1][6]. For agents, grade the path (tool correctness, argument correctness, step efficiency) and measure reliability with pass^k, not just pass@1: if runs are independent, a 90%-success agent passes all 8 of 8 runs only about 43% of the time (0.9^8 ≈ 0.43, i.e. a 57% chance of at least one failure) [7]. For RAG, decompose into Ragas-style faithfulness / answer relevancy / context precision / context recall so a regression points at retriever vs generator [2][3]. Calibrate every LLM-judge against human labels: frontier judges exceed 50% error on bias stress-tests [8].
Why single-turn evals are the wrong tool here
A single-turn eval scores input → output. Agents and RAG are not single-turn systems: they have trajectories, internal state, tool calls, and failure modes that unfold over time [4]. Output-only scoring is blind to a specific, recurring set of failures [6]:
| Failure mode | What single-turn eval sees | What actually happened |
|---|---|---|
| Ghost action | Plausible “Done, I booked it.” | Agent never called the booking tool; nothing changed downstream [6] |
| Confident fabrication | Polished, on-topic answer | Built on stale/hallucinated intermediate data [6] |
| Interrogation loop | Final answer eventually correct | Re-asked for info the user gave 3 turns ago [6] |
| Budget burning | Correct answer | 14 model completions instead of 2 → 10× cost/latency [6] |
| Inconsistency | One green run | Fails 1-in-2 reruns of the same task [7] |
The benchmark evidence is blunt: single-turn evaluations fail to capture agent failures that emerge across multi-step trajectories. Failure rates climb with trajectory length, and with errors in tool-selection timing and argument construction. All three only surface across an action sequence [5].
The three-layer diagnostic stack
These layers form a stack: outcome first, then path, then failing component — when the end-to-end score drops, component scores reveal where it originated [1]. Put component checks at tool-call spans, end-to-end checks at the root trace [6].
| Layer | Question | Agent metric | RAG metric |
|---|---|---|---|
| End-to-end | Did the task succeed? | Task Completion (LLM-judged, infers goal from input) [1] | Answer relevancy, end-to-end faithfulness [3] |
| Trajectory / path | Was the path right & efficient? | Tool Correctness, Step Efficiency, Plan Adherence [1] | (n/a — RAG path is the retrieve→generate spans below) |
| Component | Which part broke? | Argument Correctness, Handoff Correctness (per span) [6] | Context precision/recall (retriever), faithfulness (generator) [2] |
Key distinction for scoring fairness: separate agent failure from infrastructure failure. A tool timing out on a connection is not the agent’s fault; the agent failing to call the tool, or calling it with wrong parameters, is [4].
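A minimal sketch of that attribution rule, assuming each tool-call span records the tool name, its arguments, and an optional error string. The `ToolSpan` shape and the `INFRA_ERRORS` set are hypothetical; a real tracing tool exposes its own fields.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error labels that count as infrastructure faults, not agent faults.
INFRA_ERRORS = {"timeout", "connection_error", "rate_limited"}

@dataclass
class ToolSpan:
    tool: str
    args: dict
    error: Optional[str] = None  # None means the call returned normally

def classify_failure(expected_tool: str, expected_args: dict,
                     spans: list[ToolSpan]) -> str:
    """Classify one expected tool call as ok, agent_failure, infra_failure, or unclassified."""
    calls = [s for s in spans if s.tool == expected_tool]
    if not calls:
        return "agent_failure"      # the agent never called the tool
    correct = [s for s in calls if s.args == expected_args]
    if not correct:
        return "agent_failure"      # called, but with wrong parameters
    if any(s.error is None for s in correct):
        return "ok"                 # at least one correct call succeeded
    if all(s.error in INFRA_ERRORS for s in correct):
        return "infra_failure"      # right call, broken plumbing
    return "unclassified"           # correct call, unknown error: needs a human look
```

Scoring only the `agent_failure` bucket against the agent keeps flaky infrastructure from dragging down its metrics.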
Agent metrics: grade the path, not just the answer
Definitions that matter in a workshop, with whether each is deterministic or LLM-judged [1]:
| Metric | What it checks | Type |
|---|---|---|
| Tool Correctness | Were the right tools called? (selection + I/O) | Deterministic [1] |
| Argument Correctness | Right parameters into each tool, scored per call | Deterministic [1] |
| Step Efficiency | No redundant retries, loops, wasted calls | Deterministic or LLM-judged [1] |
| Task Completion | Goal inferred from input, then reasoning+tools+output graded | LLM-judged [1] |
| Plan Adherence | Did execution follow the intended plan/constraints? | LLM-judged [1] |
The data layer under all of this is traces: every span (retrieval, tool call, handoff, generation) with inputs, outputs, latency, and cost. Traces enable failure attribution to a specific span and surface unanticipated failure modes before you’ve written a formal metric for them [1].
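To make the deterministic path metrics concrete, here is a rough sketch that scores a trace against an expected list of calls. It is not DeepEval's implementation; the function names, the exact-match rule, and the `(tool, args)` tuple shape are assumptions chosen for illustration, and arguments are assumed to be JSON-serializable.

```python
import json
from collections import Counter

Call = tuple[str, dict]  # (tool name, arguments), a hypothetical trace shape

def tool_correctness(expected: list[str], called: list[str]) -> float:
    """Fraction of expected tools that appear in the trace (order ignored)."""
    wanted = set(expected)
    if not wanted:
        return 1.0
    return len(wanted & set(called)) / len(wanted)

def argument_correctness(expected: list[Call], actual: list[Call]) -> float:
    """Fraction of expected calls matched exactly (tool and arguments) in the trace."""
    if not expected:
        return 1.0
    hits = sum(any(a == e for a in actual) for e in expected)
    return hits / len(expected)

def redundant_calls(actual: list[Call]) -> int:
    """Count repeated identical calls, a crude proxy for step inefficiency."""
    counts = Counter((tool, json.dumps(args, sort_keys=True)) for tool, args in actual)
    return sum(n - 1 for n in counts.values())

trace = [("search", {"q": "a"}), ("search", {"q": "a"}), ("book", {"id": 2})]
expected = [("search", {"q": "a"}), ("book", {"id": 1})]
tool_score = tool_correctness([t for t, _ in expected], [t for t, _ in trace])
arg_score = argument_correctness(expected, trace)
wasted = redundant_calls(trace)
```

In this example the right tools were called, but one booking argument is wrong and one search is repeated, which an output-only score would not reveal.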
Reliability: pass^k, not pass@1
The single most important 2026 framing for an agent talk. pass@1 (or pass@k = “at least one of k succeeded”) rewards lucky single runs. τ-bench introduced pass^k = “all k attempts succeeded”, which decays as p^k [7]:
- 90% per-run success → about 43% consistency at k=8 (0.9^8 ≈ 0.43, so a 57% chance of at least one failure) [7]
- On τ-bench retail, even strong function-calling agents score pass^8 < 25% [7]
Workshop takeaway: run each eval case N times and report the all-pass rate. A green CI run from a single execution is reliability theatre. sierra-research/tau-bench ⭐ 1.2k (Jun 2026) is the canonical tool-agent-user benchmark; its 2026 tau2-bench update adds voice + knowledge-retrieval domains and a dual-control framework [7].
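A small sketch of the arithmetic, assuming independent runs with a fixed per-run success rate. The combinatorial estimators take `c` successes out of `n` recorded runs and mirror the standard unbiased pass@k form; treating pass^k the same way is an assumption about how a benchmark would estimate it, not a quote of τ-bench's code.

```python
from math import comb

def all_pass_probability(p: float, k: int) -> float:
    """Probability that k independent runs all succeed at per-run success rate p."""
    return p ** k

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate that at least one of k runs sampled from n (c successes) succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate that all k runs sampled from n (c successes) succeed."""
    return comb(c, k) / comb(n, k)

consistency = all_pass_probability(0.9, 8)   # roughly 0.43
lenient = pass_at_k(n=8, c=4, k=2)           # one lucky run is enough
strict = pass_hat_k(n=8, c=4, k=2)           # both sampled runs must pass
```

The same 4-of-8 task looks healthy under pass@2 and poor under pass^2, which is the gap a single green run hides.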
RAG metrics: decompose so the regression names its own cause
The four Ragas-lineage metrics, with 2026 production thresholds and the subsystem each indicts [2][3]:
| Metric | Measures | Subsystem | Target |
|---|---|---|---|
| Context Recall | Did retrieval surface all relevant chunks? | Retriever [2] | ≥0.90 @ k=10 [2] |
| Context Precision | What fraction of retrieved chunks were useful? | Retriever (noise) [2] | ≥0.45 @ k=10 [2] |
| Faithfulness | Do answer claims trace to retrieved context? | Generator [2] | ≥0.90; unsafe <0.70 [2] |
| Answer Relevancy | Is the answer on-point for the query? | Generator/end-to-end [3] | ≥0.80 [2] |
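As a simplified illustration of the two retriever metrics and the targets in the table, the sketch below assumes you already have labeled relevant chunk IDs for each question. Ragas's own implementations are LLM-assisted and differ in detail; the function names and the `TARGETS` dictionary here are hypothetical.

```python
# Targets copied from the table above; tune them for your own system.
TARGETS = {
    "context_recall": 0.90,
    "context_precision": 0.45,
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
}

def context_recall(relevant_ids: set[str], retrieved_ids: list[str]) -> float:
    """Share of the relevant chunks that retrieval surfaced."""
    if not relevant_ids:
        return 1.0
    return len(relevant_ids & set(retrieved_ids)) / len(relevant_ids)

def context_precision(relevant_ids: set[str], retrieved_ids: list[str]) -> float:
    """Share of the retrieved chunks that were relevant (ignores rank)."""
    if not retrieved_ids:
        return 0.0
    return sum(c in relevant_ids for c in retrieved_ids) / len(retrieved_ids)

def below_target(scores: dict[str, float]) -> list[str]:
    """Names of reported metrics that fall under their target."""
    return [m for m, t in TARGETS.items() if m in scores and scores[m] < t]

relevant = {"a", "b"}
retrieved = ["a", "x", "b", "y"]
scores = {
    "context_recall": context_recall(relevant, retrieved),
    "context_precision": context_precision(relevant, retrieved),
    "faithfulness": 0.65,  # made-up value for the example
}
failing = below_target(scores)
```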
How faithfulness is computed (worth demoing): an LLM judge decomposes the answer into atomic claims, then checks whether retrieved context supports / contradicts / is silent on each; score = supported-claim fraction [2].
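The scoring step can be sketched in a few lines. Claim extraction and the per-claim verdict are both LLM calls in practice; here the claims are passed in, and `substring_judge` is a toy stand-in invented for the demo, not how any real judge works.

```python
from typing import Callable

def faithfulness(claims: list[str], context: str,
                 judge: Callable[[str, str], str]) -> float:
    """Supported-claim fraction. `judge` returns 'supported', 'contradicted', or 'silent'."""
    if not claims:
        raise ValueError("no claims to verify")
    supported = sum(judge(claim, context) == "supported" for claim in claims)
    return supported / len(claims)

def substring_judge(claim: str, context: str) -> str:
    """Toy stand-in for the LLM judge: supported only on a verbatim match."""
    return "supported" if claim.lower() in context.lower() else "silent"

context = "The refund window is 30 days. Shipping is free."
claims = [
    "the refund window is 30 days",
    "shipping is free",
    "returns need a receipt",   # not in the context, so not supported
]
score = faithfulness(claims, context, substring_judge)  # 2 of 3 claims supported
```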
Diagnostic logic — the payoff of decomposition [2]:
- Recall drops → retrieval problem (chunking, embedding drift, index degradation)
- Faithfulness drops, citation accuracy holds → generation/prompt issue
- Citation accuracy drops, faithfulness holds → citation-parsing/prompt-enforcement bug
This is what an opaque “the answer was bad” score can never tell you [2].
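The diagnostic rules above can be encoded as a small triage function. This is a sketch under assumptions: the metric keys, the 0.05 tolerance, and the returned messages are all hypothetical, and both dictionaries are assumed to carry the same keys.

```python
def diagnose(baseline: dict[str, float], current: dict[str, float],
             tol: float = 0.05) -> str:
    """Map metric drops larger than `tol` to the subsystem most likely at fault."""
    dropped = {m for m in baseline if baseline[m] - current[m] > tol}
    if "context_recall" in dropped:
        return "retrieval: check chunking, embedding drift, index degradation"
    if "faithfulness" in dropped and "citation_accuracy" not in dropped:
        return "generation: check the prompt or generator"
    if "citation_accuracy" in dropped and "faithfulness" not in dropped:
        return "citations: check citation parsing or prompt enforcement"
    if dropped:
        return "mixed: inspect traces"
    return "no regression"

base = {"context_recall": 0.92, "faithfulness": 0.93, "citation_accuracy": 0.95}
verdict = diagnose(base, {**base, "faithfulness": 0.75})
```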
Tooling
| Tool | Stars | Niche |
|---|---|---|
| Ragas | ⭐ 14k | RAG metric reference impl; faithfulness/relevancy/precision/recall; LangChain/LlamaIndex native [3] |
| DeepEval | ⭐ 16k | 50+ first-party agent metrics (Tool/Argument Correctness, Plan Adherence) + RAG; pytest-style [1] |
| Opik | ⭐ 19k | Traces + evals + CI gating, self-hostable |
| Phoenix | ⭐ 10k | OpenInference tracing + eval, OTel-based |
LLM-as-judge: the grader is itself untrustworthy until calibrated
Most of the metrics above are LLM-judged, so judge reliability is load-bearing. Judges agree with humans ~85% of the time (higher than two humans agree), yet on bias stress-tests frontier models exceed 50% error rates [8]. Five named biases to flag in a talk [8]:
| Bias | Effect |
|---|---|
| Position | Slot A wins 10–15 pts more in pairwise [8] |
| Verbosity | Longer answers score higher at equal quality [8] |
| Self-preference | Judges score own model family 10–25% higher [8] |
| Format | Formatting/markdown sways the score [8] |
| Calibration drift | Same rubric, different distribution after a judge-model update [8] |
Mitigation: calibrate the judge on a held-out set of human labels and track disagreement — “otherwise you never audit the grader.” Disagreements signal rubric gaps, not just model variance [6].
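A minimal calibration sketch, assuming you hold human labels and judge labels for the same held-out cases. It reports raw agreement, lists the disagreeing cases for rubric review, and measures how often a pairwise verdict changes when the A/B order is swapped. The function names and the `pick` callable (returning "A" or "B") are hypothetical.

```python
from typing import Callable

def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Share of cases where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def disagreements(human: list[str], judge: list[str]) -> list[int]:
    """Indices to review: each one is a candidate rubric gap."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]

def position_flip_rate(pairs: list[tuple[str, str]],
                       pick: Callable[[str, str], str]) -> float:
    """Share of pairs whose winner changes when the two answers swap slots."""
    flips = 0
    for a, b in pairs:
        first = pick(a, b)    # answer `a` sits in slot A
        second = pick(b, a)   # answer `a` sits in slot B
        # A consistent judge prefers the same answer in both orders.
        if (first == "A") != (second == "B"):
            flips += 1
    return flips / len(pairs)

human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
rate = agreement_rate(human, judge)
to_review = disagreements(human, judge)
biased = position_flip_rate([("x", "y"), ("p", "q")], lambda a, b: "A")
```

A judge that always picks slot A flips on every pair, which is the extreme form of position bias.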
Building the eval dataset (the part teams skip)
You can’t run any of the above without a dataset of cases. The 2026 default pattern: synthetic + sampled production + human calibration, all against one metric set [9].
- RAG: a synthetic set consists of machine-generated (question, retrieved-context, ground-truth-answer) triples. It is the default way to build RAG test sets at scale because it removes expert-annotation bottlenecks in niche/regulated domains [9]. Ragas ships test-set generation for exactly these triples [3].
- Agents: domain experts author “must-pass” scenarios with explicit acceptance criteria as anchor tests; target each known failure mode directly (e.g. for a booking agent, assert the booking API was actually invoked, not that the reply sounds plausible) [6][10].
- Close the loop: convert every production incident into a new golden so the regression is caught in CI next time [6].
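One way to represent these cases, sketched under assumptions: the `GoldenCase` fields, the `source` labels, and the tool name `create_booking` are hypothetical. The check asserts on the trace, not on the wording of the reply.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    required_tools: list[str]   # tools that must appear in the trace
    source: str = "expert"      # "expert", "synthetic", or "incident"

def missing_tools(case: GoldenCase, called_tools: list[str]) -> list[str]:
    """Required tools absent from the trace; an empty list means the case passes."""
    return [t for t in case.required_tools if t not in called_tools]

def golden_from_incident(incident_id: str, user_input: str,
                         required_tools: list[str]) -> GoldenCase:
    """Turn a production incident into a regression case for CI."""
    return GoldenCase(f"incident-{incident_id}", user_input, required_tools,
                      source="incident")

case = golden_from_incident("1042", "Book me a table for two at 7pm",
                            ["create_booking"])
# The agent replied "Done, I booked it." but the trace shows only a search call.
ghost = missing_tools(case, ["search_restaurants"])
```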
CI / regression workflow (stacked gates)
- CI on every change — automated metrics + operating envelopes: max steps, token budgets, wall-clock timeouts. Fail the run when quality is fine but economics/latency are not [6].
- Pre-release — human review of a fixed sample after any prompt/model change, focused on high-risk intents [6].
- Production — monitor traces; audit flagged sessions (errors, complaints, metric drops) [6].
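The CI gate can be sketched as a function that returns violations, so a run fails on economics even when quality passes. The `RunStats` and `Envelope` shapes and every default limit below are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    quality: float        # aggregate metric score for the run
    steps: int            # model completions plus tool calls
    tokens: int
    wall_clock_s: float

@dataclass
class Envelope:
    min_quality: float = 0.8
    max_steps: int = 10
    max_tokens: int = 20_000
    max_wall_clock_s: float = 60.0

def gate(run: RunStats, env: Envelope) -> list[str]:
    """Return every violated limit; an empty list means the run may merge."""
    violations = []
    if run.quality < env.min_quality:
        violations.append(f"quality {run.quality} < {env.min_quality}")
    if run.steps > env.max_steps:
        violations.append(f"steps {run.steps} > {env.max_steps}")
    if run.tokens > env.max_tokens:
        violations.append(f"tokens {run.tokens} > {env.max_tokens}")
    if run.wall_clock_s > env.max_wall_clock_s:
        violations.append(f"wall clock {run.wall_clock_s}s > {env.max_wall_clock_s}s")
    return violations

# Quality is fine, but the agent took 14 steps against a limit of 10.
violations = gate(RunStats(0.92, 14, 9000, 12.0), Envelope())
```

In a CI job, a non-empty list would translate into a non-zero exit code.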
Workshop / talk hooks
- Live demo: show the same RAG answer scoring high on answer-relevancy but low on faithfulness — then trace it to a stale retrieved chunk. Makes “decompose or stay blind” visceral [2].
- The pass^k reveal: run one agent task 8× live; watch a “90% agent” fail twice. Anchors why reliability ≠ accuracy [7].
- Ghost-action trap: an agent that says “refund processed” with no tool call in the trace. This is the one-slide case for trace-based, component-level evals [6].
- Judge audit: swap A/B order and watch the score flip — live proof of position bias, and why you calibrate graders [8].