Atlas survey

Agent & RAG Evals in 2026: Trajectories, Tool-Use, and Ragas Metrics

Why single-turn evals miss agent/RAG failures, and the 2026 metric stack — trajectory + tool-use correctness, pass^k reliability, and Ragas-style RAG decomposition — that catches them.

15 sources · ~8 min read · evals · agents · rag · ragas · tool-use · llm-as-judge · trajectory-evaluation

Decision (workshop spine). A single-turn “is the final answer good?” eval is the wrong instrument for agents and RAG. Build a three-layer diagnostic stack: end-to-end outcome → trajectory/path → failing component, each with its own metric and dataset [1][6]. For agents, grade the path (tool correctness, argument correctness, step efficiency) and measure reliability with pass^k, not just pass@1: an agent with 90% per-run success passes all 8 of 8 runs only about 43% of the time (0.9^8 ≈ 0.43), which is a 57% chance of at least one failure [7]. For RAG, decompose into Ragas-style faithfulness / answer relevancy / context precision / context recall so a regression points at the retriever or the generator [2][3]. Calibrate every LLM judge against human labels: frontier judges exceed 50% error on bias stress-tests [8].

Why single-turn evals are the wrong tool here

A single-turn eval scores input → output. Agents and RAG are not single-turn systems: they have trajectories, internal state, tool calls, and failure modes that unfold over time [4]. Output-only scoring is blind to a specific, recurring set of failures [6]:

Failure mode | What single-turn eval sees | What actually happened
Ghost action | Plausible “Done, I booked it.” | Agent never called the booking tool; nothing changed downstream [6]
Confident fabrication | Polished, on-topic answer | Built on stale/hallucinated intermediate data [6]
Interrogation loop | Final answer eventually correct | Re-asked for info the user gave 3 turns ago [6]
Budget burning | Correct answer | 14 model completions instead of 2 → 10× cost/latency [6]
Inconsistency | One green run | Fails 1-in-2 reruns of the same task [7]

The benchmark evidence is blunt: single-turn evaluations fail to capture agent failures that emerge across multi-step trajectories, and failure rates climb with trajectory length, tool-selection timing, and argument construction — three modes that only surface across an action sequence [5].

The three-layer diagnostic stack

These layers form a stack: outcome first, then path, then failing component — when the end-to-end score drops, component scores reveal where it originated [1]. Put component checks at tool-call spans, end-to-end checks at the root trace [6].

Layer | Question | Agent metric | RAG metric
End-to-end | Did the task succeed? | Task Completion (LLM-judged, infers goal from input) [1] | Answer relevancy, end-to-end faithfulness [3]
Trajectory / path | Was the path right & efficient? | Tool Correctness, Step Efficiency, Plan Adherence [1] | n/a (RAG path is the retrieve→generate spans below)
Component | Which part broke? | Argument Correctness, Handoff Correctness (per span) [6] | Context precision/recall (retriever), faithfulness (generator) [2]

Key distinction for scoring fairness: separate agent failure from infrastructure failure. A tool timing out on a connection is not the agent’s fault; the agent failing to call the tool, or calling it with wrong parameters, is [4].
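The agent-versus-infrastructure distinction can be applied as a triage step before any score is assigned. The following is a minimal sketch, assuming each tool-call span carries an error label; the `ToolSpan` type, the `attribute_failure` function, and the error-label names are all hypothetical and not taken from any tracing library.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical error labels; real tracing backends name these differently.
INFRA_ERRORS = {"timeout", "connection_error", "rate_limited"}
AGENT_ERRORS = {"wrong_tool", "invalid_arguments"}


@dataclass
class ToolSpan:
    tool: str
    error_kind: Optional[str] = None  # None means the call succeeded


def attribute_failure(spans: list[ToolSpan], required_tools: set[str]) -> str:
    """Classify a run as 'agent', 'infrastructure', 'unknown', or 'none'."""
    called = {s.tool for s in spans}
    if not required_tools <= called:
        return "agent"  # a required tool was never called
    kinds = [s.error_kind for s in spans if s.error_kind is not None]
    if not kinds:
        return "none"
    if any(k in AGENT_ERRORS for k in kinds):
        return "agent"  # wrong tool or wrong parameters
    if all(k in INFRA_ERRORS for k in kinds):
        return "infrastructure"  # e.g. the tool timed out on a connection
    return "unknown"
```

Runs classified as "infrastructure" can then be retried or excluded from the agent's score rather than counted against it; which of the two is appropriate is a policy choice this sketch leaves open.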

Agent metrics: grade the path, not just the answer

Definitions that matter in a workshop, with whether each is deterministic or LLM-judged [1]:

Metric | What it checks | Type
Tool Correctness | Were the right tools called? (selection + I/O) | Deterministic [1]
Argument Correctness | Right parameters into each tool, scored per call | Deterministic [1]
Step Efficiency | No redundant retries, loops, wasted calls | Deterministic or LLM-judged [1]
Task Completion | Goal inferred from input, then reasoning+tools+output graded | LLM-judged [1]
Plan Adherence | Did execution follow the intended plan/constraints? | LLM-judged [1]
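The deterministic rows of this table can be approximated with plain comparisons between expected and actual tool calls. The sketch below is illustrative only and is not DeepEval's implementation: the `ToolCall` type and all three functions are hypothetical, tool correctness covers selection only (I/O checks are omitted), and exact argument matching is a simplifying assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def tool_correctness(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected tool names that appear among the actual calls."""
    if not expected:
        return 1.0
    called = {c.name for c in actual}
    return sum(1 for e in expected if e.name in called) / len(expected)


def argument_correctness(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Per-call score: an expected call counts only if some actual call
    has the same tool name and exactly the same arguments."""
    if not expected:
        return 1.0
    hits = sum(
        1 for e in expected
        if any(a.name == e.name and a.args == e.args for a in actual)
    )
    return hits / len(expected)


def redundant_calls(actual: list[ToolCall]) -> int:
    """Count exact repeats (same tool, same arguments) as a crude
    step-efficiency signal."""
    seen: set[tuple[str, str]] = set()
    repeats = 0
    for call in actual:
        key = (call.name, json.dumps(call.args, sort_keys=True, default=str))
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats
```

Exact matching is strict by design; a production check would likely normalise arguments (dates, casing, optional fields) before comparing.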

The data layer under all of this is traces: every span (retrieval, tool call, handoff, generation) with inputs, outputs, latency, and cost. Traces enable failure attribution to a specific span and surface unanticipated failure modes before you’ve written a formal metric for them [1].
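A trace of this kind can be modelled as an ordered list of spans. The sketch below is a hypothetical, minimal schema (field names such as `latency_ms` and `cost_usd` are assumptions, not the format of any specific tracing tool) showing how failure attribution and cost roll-ups fall out of the span data.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Span:
    kind: str  # "retrieval", "tool_call", "handoff", or "generation"
    name: str
    inputs: Any
    outputs: Any
    latency_ms: float
    cost_usd: float
    error: Optional[str] = None  # None means the span completed normally


def first_failing_span(trace: list[Span]) -> Optional[Span]:
    """Return the earliest span that recorded an error, if any."""
    return next((s for s in trace if s.error is not None), None)


def totals(trace: list[Span]) -> tuple[float, float]:
    """Total latency (ms) and cost (USD) across all spans in the trace."""
    return (
        sum(s.latency_ms for s in trace),
        sum(s.cost_usd for s in trace),
    )
```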

Reliability: pass^k, not pass@1

This is the single most important 2026 framing for an agent talk. pass@1 rewards a lucky single run, and pass@k (“at least one of k succeeded”) rewards a lucky best-of-k. τ-bench introduced pass^k (“all k attempts succeeded”), which decays as p^k when runs are independent with per-run success rate p [7]:

  • 90% per-run success → about 43% consistency at k=8 (0.9^8 ≈ 0.43), i.e. a 57% chance that at least one of the 8 runs fails [7]
  • On τ-bench retail, even strong function-calling agents score pass^8 < 25% [7]
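The decay is easy to check numerically. The sketch below gives the closed forms under the assumption of independent runs, plus a combinatorial estimate from n recorded trials of a single task (the probability that k trials drawn without replacement all succeeded). The function names are hypothetical; the estimator mirrors the standard pass@k construction and is offered as an assumption rather than a statement of how any particular benchmark computes it.

```python
from math import comb


def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that k independent runs all succeed."""
    return p ** k


def pass_at_k(p: float, k: int) -> float:
    """pass@k: probability that at least one of k independent runs succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k_estimate(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task from `trials` recorded runs:
    the chance that k runs drawn without replacement are all successes."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)
```

Under the independence assumption, `pass_hat_k(0.9, 8)` is roughly 0.43 while `pass_at_k(0.9, 8)` is above 0.99; the gap between the two is the point of the metric.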

Workshop takeaway: run each eval case N times and report the all-pass rate. A green CI run from a single execution is reliability theatre. sierra-research/tau-bench ⭐ 1.2k (Jun 2026) is the canonical tool-agent-user benchmark; its 2026 tau2-bench update adds voice + knowledge-retrieval domains and a dual-control framework [7].

RAG metrics: decompose so the regression names its own cause

The four Ragas-lineage metrics, with 2026 production thresholds and the subsystem each indicts [2][3]:

Metric | Measures | Subsystem | Target
Context Recall | Did retrieval surface all relevant chunks? | Retriever [2] | ≥0.90 @ k=10 [2]
Context Precision | What fraction of retrieved chunks were useful? | Retriever (noise) [2] | ≥0.45 @ k=10 [2]
Faithfulness | Do answer claims trace to retrieved context? | Generator [2] | ≥0.90; unsafe <0.70 [2]
Answer Relevancy | Is the answer on-point for the query? | Generator/end-to-end [3] | ≥0.80 [2]
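The targets in this table translate directly into a threshold check. The sketch below takes its numbers from the table; the metric key names, the status labels, and the `check_targets` function are hypothetical, and the scores are assumed to come from whichever evaluation library is in use.

```python
# Targets from the table above (retriever metrics measured at k=10).
TARGETS = {
    "context_recall": 0.90,
    "context_precision": 0.45,
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
}
UNSAFE_FAITHFULNESS = 0.70  # below this, the table marks the answer unsafe


def check_targets(scores: dict[str, float]) -> dict[str, str]:
    """Label each metric 'pass', 'below_target', or 'unsafe'."""
    status = {}
    for name, target in TARGETS.items():
        value = scores[name]
        if name == "faithfulness" and value < UNSAFE_FAITHFULNESS:
            status[name] = "unsafe"
        elif value >= target:
            status[name] = "pass"
        else:
            status[name] = "below_target"
    return status
```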

How faithfulness is computed (worth demoing): an LLM judge decomposes the answer into atomic claims, then checks whether retrieved context supports / contradicts / is silent on each; score = supported-claim fraction [2].
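The scoring step can be sketched in a few lines once the claims have been extracted. This is an illustration of the supported-claim fraction only, not the Ragas implementation: claim decomposition is assumed to have happened upstream, and `judge` is a hypothetical callable standing in for the LLM call that labels each claim.

```python
from typing import Callable, Literal

Verdict = Literal["supported", "contradicted", "silent"]


def faithfulness_score(
    claims: list[str],
    context: str,
    judge: Callable[[str, str], Verdict],
) -> float:
    """Supported-claim fraction over the atomic claims of one answer.

    `judge(claim, context)` is a stand-in for an LLM judge; both
    'contradicted' and 'silent' count against the score.
    """
    if not claims:
        raise ValueError("no claims to score")
    verdicts = [judge(claim, context) for claim in claims]
    return sum(v == "supported" for v in verdicts) / len(claims)
```

For a demo without model access, a stub judge such as `lambda claim, ctx: "supported" if claim in ctx else "silent"` is enough to show how a single unsupported claim pulls the score down.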

Diagnostic logic — the payoff of decomposition [2]:

  • Recall drops → retrieval problem (chunking, embedding drift, index degradation)
  • Faithfulness drops, citation accuracy holds → generation/prompt issue
  • Citation accuracy drops, faithfulness holds → citation-parsing/prompt-enforcement bug

This is what an opaque “the answer was bad” score can never tell you [2].
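The three diagnostic rules above can be written as a small lookup over metric deltas. This is a sketch under stated assumptions: the metric key names, the `diagnose` function, and the 0.05 tolerance are hypothetical, and a simultaneous drop in faithfulness and citation accuracy is deliberately left undiagnosed because the rules above do not cover it.

```python
def diagnose(delta: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Map metric changes (current minus baseline) to a suspect subsystem.

    A metric counts as 'dropped' when it fell by more than `tolerance`.
    """
    def dropped(name: str) -> bool:
        return delta.get(name, 0.0) < -tolerance

    findings = []
    if dropped("context_recall"):
        findings.append(
            "retriever: check chunking, embedding drift, index degradation"
        )
    if dropped("faithfulness") and not dropped("citation_accuracy"):
        findings.append("generator: check the generation prompt")
    if dropped("citation_accuracy") and not dropped("faithfulness"):
        findings.append("citations: check citation parsing and prompt enforcement")
    return findings
```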

Tooling

Tool | Stars | Niche
Ragas | ⭐ 14k | RAG metric reference impl; faithfulness/relevancy/precision/recall; LangChain/LlamaIndex native [3]
DeepEval | ⭐ 16k | 50+ first-party agent metrics (Tool/Argument Correctness, Plan Adherence) + RAG; pytest-style [1]
Opik | ⭐ 19k | Traces + evals + CI gating, self-hostable
Phoenix | ⭐ 10k | OpenInference tracing + eval, OTel-based

LLM-as-judge: the grader is itself untrustworthy until calibrated

Most of the metrics above are LLM-judged, so judge reliability is load-bearing. Judges agree with humans ~85% of the time (higher than two humans agree), yet on bias stress-tests frontier models exceed 50% error rates [8]. Five named biases to flag in a talk [8]:

Bias | Effect
Position | Slot A wins 10–15 pts more in pairwise [8]
Verbosity | Longer answers score higher at equal quality [8]
Self-preference | Judges score own model family 10–25% higher [8]
Format | Formatting/markdown sways the score [8]
Calibration drift | Same rubric, different distribution after a judge-model update [8]

Mitigation: calibrate the judge on a held-out set of human labels and track disagreement — “otherwise you never audit the grader.” Disagreements signal rubric gaps, not just model variance [6].
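Calibration reduces to comparing judge labels with human labels on the same held-out items. The sketch below computes raw agreement, Cohen's kappa (agreement corrected for chance), and the indices of disagreeing items for rubric review. Function names are hypothetical, and labels are assumed to be categorical values such as "pass" and "fail".

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def disagreements(judge: list[str], human: list[str]) -> list[int]:
    """Indices to review by hand; these often point at rubric gaps."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]
```

Tracking kappa alongside raw agreement matters when one label dominates, since a judge that always answers "pass" can show high raw agreement with little real signal.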

Building the eval dataset (the part teams skip)

You can’t run any of the above without a dataset of cases. The 2026 default pattern: synthetic + sampled production + human calibration, all against one metric set [9].

  • RAG: a synthetic set is (question, retrieved-context, ground-truth-answer) triples, machine-generated — the default way to build RAG test sets at scale because it removes expert-annotation bottlenecks in niche/regulated domains [9]. Ragas ships test-set generation for exactly these triples [3].
  • Agents: domain experts author “must-pass” scenarios with explicit acceptance criteria as anchor tests; target each known failure mode directly (e.g. for a booking agent, assert the booking API was actually invoked, not that the reply sounds plausible) [6][10].
  • Close the loop: convert every production incident into a new golden so the regression is caught in CI next time [6].
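A golden case for an agent can carry its acceptance criteria as data, so the check asserts on the trace rather than on the wording of the reply. The sketch below is hypothetical throughout (`GoldenCase`, `RunTrace`, the `source` labels, and the `create_booking` tool name are illustrative assumptions); it shows the booking example and the incident-to-golden conversion.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    required_tools: list[str]
    source: str = "expert"  # or "synthetic", "production_incident"


@dataclass
class RunTrace:
    final_reply: str
    tool_calls: list[str] = field(default_factory=list)  # tool names, in order


def missing_required_tools(case: GoldenCase, trace: RunTrace) -> list[str]:
    """Required tools that never appear in the trace (ghost-action check)."""
    called = set(trace.tool_calls)
    return [tool for tool in case.required_tools if tool not in called]


def golden_from_incident(
    incident_id: str, user_input: str, required_tools: list[str]
) -> GoldenCase:
    """Turn a production incident into a regression case for CI."""
    return GoldenCase(
        case_id=f"incident-{incident_id}",
        user_input=user_input,
        required_tools=required_tools,
        source="production_incident",
    )
```

A run whose reply reads “Done, I booked it.” but whose trace contains no `create_booking` call fails this check regardless of how plausible the reply sounds.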

CI / regression workflow (stacked gates)

  1. CI on every change — automated metrics + operating envelopes: max steps, token budgets, wall-clock timeouts. Fail the run when quality is fine but economics/latency are not [6].
  2. Pre-release — human review of a fixed sample after any prompt/model change, focused on high-risk intents [6].
  3. Production — monitor traces; audit flagged sessions (errors, complaints, metric drops) [6].
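The CI gate in step 1 can be sketched as a check that fails on either quality or the operating envelope. The types, field names, and limits below are hypothetical; real limits would come from the team's own cost and latency budgets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    max_steps: int
    max_tokens: int
    max_seconds: float


@dataclass
class RunStats:
    steps: int
    tokens: int
    seconds: float
    quality_passed: bool


def gate(run: RunStats, envelope: Envelope) -> list[str]:
    """Return the list of violations; an empty list means the run passes."""
    violations = []
    if not run.quality_passed:
        violations.append("quality")
    if run.steps > envelope.max_steps:
        violations.append("steps")
    if run.tokens > envelope.max_tokens:
        violations.append("tokens")
    if run.seconds > envelope.max_seconds:
        violations.append("wall_clock")
    return violations
```

A run that answers correctly in 14 steps against a 6-step envelope returns `["steps"]`, which is the "quality is fine but economics are not" failure that the gate exists to catch.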

Workshop / talk hooks

  • Live demo: show the same RAG answer scoring high on answer-relevancy but low on faithfulness — then trace it to a stale retrieved chunk. Makes “decompose or stay blind” visceral [2].
  • The pass^k reveal: run one agent task 8× live; a “90% agent” has a roughly 57% chance of failing at least once (1 − 0.9^8). Anchors why reliability ≠ accuracy [7].
  • Ghost-action trap: an agent that says “refund processed” with no tool call in the trace — the one-slide case for trace-based, component-level evals [6].
  • Judge audit: swap A/B order and watch the score flip — live proof of position bias, and why you calibrate graders [8].
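The judge-audit hook can be backed by a flip-rate measurement over a set of answer pairs. In the sketch below, `judge` is a hypothetical callable that sees two answers in presentation order and returns "first" or "second"; a position-consistent judge picks the same underlying answer whichever slot it appears in.

```python
from typing import Callable


def position_flip_rate(
    pairs: list[tuple[str, str]],
    judge: Callable[[str, str], str],
) -> float:
    """Fraction of pairs where swapping presentation order changes the winner."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)  # a is shown first
        reverse = judge(b, a)  # b is shown first
        # Index of the winning answer within the (a, b) pair for each ordering.
        forward_winner = 0 if forward == "first" else 1
        reverse_winner = 1 if reverse == "first" else 0
        if forward_winner != reverse_winner:
            flips += 1
    return flips / len(pairs)
```

A judge that always prefers the first slot scores a flip rate of 1.0 here, which makes a useful extreme case to show before running a real judge.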
