Decision. Default stack for shipping an LLM product in 2026: pick one CI-gating library + one tracing/annotation platform [16]. For the library, choose Promptfoo ⭐ 22k (Jun 2026) if you want CLI/YAML config and red-teaming [1], or DeepEval ⭐ 16k if your team thinks in pytest [2]. For the platform, take Langfuse ⭐ 29k if you want open-source + self-host [9], Braintrust if you want eval-first SaaS with no per-seat tax [22], or LangSmith if you already live in LangGraph [17]. The hard part isn’t the tool — it’s building the golden dataset and validating that your LLM judge agrees with humans [38].
Why this is now table stakes
AI apps fail differently from deterministic software: a PR can pass every unit test, type check, and uptime probe while the model silently gets worse at its actual task [61]. Manual “vibe checking” plateaus into a whack-a-mole loop as scope grows: you fix one prompt regression and three others slip in unnoticed [38]. That gap is what the eval-tooling market sells against, and it is growing fast: $1.97B in 2025 → $2.69B in 2026, a 36.3% CAGR for LLM observability/eval platforms [62].
The landscape splits into two layers that most teams now run together [16]:
- Eval libraries — code you run in CI to score outputs against a dataset.
- Eval/observability platforms — hosted (or self-hosted) tracing, dashboards, annotation queues, and online production scoring.
Layer 1 — Open-source eval libraries
| Tool | Stars | Lang / License | Best at | Watch-out | Cite |
|---|---|---|---|---|---|
| Promptfoo | ⭐ 22k | TypeScript / MIT | CLI/YAML testing + red teaming | Acquired by OpenAI Mar 2026 | [1] [11] [30] |
| OpenAI Evals | ⭐ 19k | Python / custom | Benchmark registry | Last push Apr 2026, slow cadence | [4] |
| DeepEval | ⭐ 16k | Python / Apache-2.0 | Pytest-style unit tests, 14+ metrics | LLM-judge token cost | [2] [12] |
| Ragas | ⭐ 14k | Python / Apache-2.0 | RAG metrics (faithfulness etc.) | Repo moved; last push Feb 2026 | [5] [45] |
| lm-evaluation-harness | ⭐ 13k | Python / MIT | Academic few-shot benchmarking | Not for app-level product evals | [6] |
| TruLens | ⭐ 3.4k | Python / MIT | RAG Triad, experiment tracking | Smaller community | [7] [55] |
| UpTrain | ⭐ 2.4k | Python / Apache-2.0 | 20+ preconfigured checks | ⚠ Stalled — last push Aug 2024 | [8] |
| Inspect (UK AISI) | ⭐ 2.2k | Python / MIT | Reproducible safety + agent evals | Steeper learning curve | [3] [14] |
How they differ in practice. Promptfoo is CLI/YAML-first — 50+ assertions and 40+ red-team categories, used by OpenAI and Anthropic [1] [11]. DeepEval is “pytest for LLMs”: 14+ metrics including G-Eval, faithfulness, bias, toxicity, summarization, and tool-correctness, scored with LLM-as-judge plus local NLP models [12] [13]. Ragas owns the deepest RAG metric set [13]. Inspect (UK AI Safety Institute) is architected around tasks/datasets/solvers/scorers with first-class agent + sandboxing support — practitioners reach for it on reproducible, security-oriented evals rather than quick prompt iteration [14] [15]. The recurring operating cost isn’t the library — it’s LLM-judge API tokens, roughly $150–500/month at 10,000 evals/day [16].
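The judge-cost figure above can be turned into a per-eval budget with simple arithmetic. The sketch below is illustrative only: the function names and the 30-day month are assumptions, and the only inputs taken from the text are the $150–500/month range and the 10,000 evals/day volume.

```python
def implied_cost_per_eval(monthly_usd: float, evals_per_day: int, days: int = 30) -> float:
    """Back out the per-eval judge cost implied by a monthly bill (30-day month assumed)."""
    return monthly_usd / (evals_per_day * days)


def monthly_judge_cost(evals_per_day: int, cost_per_eval_usd: float, days: int = 30) -> float:
    """Project monthly judge spend from a per-eval cost (hypothetical helper)."""
    return evals_per_day * cost_per_eval_usd * days


# Range quoted in the text: $150-500/month at 10,000 evals/day.
low = implied_cost_per_eval(150, 10_000)
high = implied_cost_per_eval(500, 10_000)
print(f"implied judge cost per eval: ${low:.5f} to ${high:.5f}")
```

Under these assumptions the quoted range works out to a small fraction of a cent per judged output, which is a useful sanity check when choosing a judge model or a sampling rate.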
Layer 2 — Eval & observability platforms
| Platform | OSS core? | Pricing model | Position | Cite |
|---|---|---|---|---|
| Langfuse | ✓ MIT, self-host free | Free Hobby → $29 → $199 → $2,499, no per-seat | OSS observability leader; ClickHouse acquired Jan 2026 | [18] [20] |
| Braintrust | ✗ proprietary | Free 1M spans → $249/mo flat, no per-seat | Eval-first SaaS; a16z + ICONIQ backed | [22] [27] |
| LangSmith | ✗ proprietary | Free 5k traces → $39/seat + $0.50/1k overage | Default for LangChain/LangGraph teams | [21] [17] |
| Arize Phoenix | ✓ Elastic-2.0, self-host free | Free self-host; paid Arize AX tier | OpenTelemetry-native, ⭐ 10k | [10] [23] [53] |
| Comet Opik | ✓ true OSS | Free self-host + enterprise hosted | OSS, ~⭐ 20k | [26] |
| W&B Weave | ✗ proprietary | Folded into W&B plans | Strong eval harness; CoreWeave bought W&B ~$1.7B | [24] [28] |
| Humanloop | — (dead) | — | ⚠ Shut down Sep 2025; team → Anthropic | [25] |
Consolidation is the 2026 story. Langfuse open-sourced its remaining cloud-only features (managed LLM-as-judge, annotation queues, prompt experiments) in June 2025 [19], then was acquired by ClickHouse alongside a $400M Series D in January 2026 — with a public commitment to stay MIT-licensed and self-hostable [20]. Braintrust killed per-seat pricing entirely (a 100-person team still starts at $249/mo) and raised an $80M Series B led by ICONIQ on top of its a16z Series A [22] [27]. W&B Weave rode CoreWeave’s ~$1.7B acquisition of Weights & Biases (closed May 2025) [28]. And the cautionary tale: Humanloop shut down in September 2025 after Anthropic acqui-hired its founders and ~a dozen staff (no IP transfer) — a reminder that betting your eval stack on a single closed vendor carries platform risk [25].
How evals run in CI — eval-driven development
By 2026 the dominant workflow is eval-driven development (EDD): the eval suite is the working spec, and scores are the shipping oracle [33]. The operator loop is tight — maintain a versioned golden-set dataset as the regression baseline, change a prompt or model, rerun the suite, read deltas in minutes (e.g. catching a tone score slip from 0.85 → 0.72), then open a PR; production traces that surface new edge cases get appended back into the golden set [33].
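The "rerun the suite, read deltas" step can be sketched as a comparison of per-metric means between a baseline run and a candidate run over the same golden set. This is a minimal illustration, not any vendor's implementation: the row format, the function names, and the 0.05 tolerance are all assumptions.

```python
from statistics import mean


def mean_scores(rows):
    """Average each metric over the golden set; rows is a list of {metric: score} dicts."""
    return {metric: mean(row[metric] for row in rows) for metric in rows[0]}


def score_deltas(baseline_rows, candidate_rows, tolerance=0.05):
    """Per-metric delta report; a drop larger than `tolerance` is flagged as a regression."""
    base = mean_scores(baseline_rows)
    cand = mean_scores(candidate_rows)
    report = {}
    for metric, base_score in base.items():
        delta = cand[metric] - base_score
        report[metric] = {
            "baseline": round(base_score, 3),
            "candidate": round(cand[metric], 3),
            "delta": round(delta, 3),
            "regressed": delta < -tolerance,
        }
    return report


# Toy golden set with two cases; tone slips from a mean of 0.85 to 0.72.
baseline = [{"tone": 0.9, "accuracy": 0.8}, {"tone": 0.8, "accuracy": 0.9}]
candidate = [{"tone": 0.74, "accuracy": 0.82}, {"tone": 0.70, "accuracy": 0.9}]

report = score_deltas(baseline, candidate)
for metric, row in report.items():
    print(metric, row)
```

Comparing against a stored baseline rather than an absolute threshold is what makes the golden set act as a regression suite: a metric can sit above a fixed bar and still have slipped.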
Tooling converges on PR-time gating:
- Promptfoo ships a GitHub Action that runs a before-vs-after eval on every PR touching a prompt, comments the results with an interactive viewer, caches LLM calls to cut cost, and fails the build when scores drop below a gate [29] [30].
- DeepEval is the pytest path: `assert_test()` raises when a metric falls below threshold, and `deepeval test run` adds async/repeats/identifiers on top of pytest, runnable as a shell step in any CI [31] [32].
- Braintrust runs eval suites on every PR via a native Action and blocks sub-threshold changes; LangSmith integrates pytest/Vitest/GitHub but informs rather than auto-blocks [34].
Mature teams split offline CI evals (LLM judge + human review) from online production evals running deterministic scorers under 50ms with ~5% async sampling for deeper judging [35]. Common gates: minPassRate 0.95, maxRegressions 0 [35].
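The two gates named above can be combined into a single pass/fail decision. The sketch below is a generic illustration, not the configuration format of any specific tool: `CaseResult`, `evaluate_gate`, and the snake_case parameter names are assumptions that mirror `minPassRate` and `maxRegressions`.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    passed_before: bool  # result on the base branch
    passed_after: bool   # result on the PR branch


def evaluate_gate(results, min_pass_rate=0.95, max_regressions=0):
    """Return (ok, pass_rate, regressed_ids) for one PR-time eval run."""
    if not results:
        raise ValueError("no eval results to gate on")
    pass_rate = sum(r.passed_after for r in results) / len(results)
    # A regression is a case that passed before the change and fails after it.
    regressed = [r.case_id for r in results if r.passed_before and not r.passed_after]
    ok = pass_rate >= min_pass_rate and len(regressed) <= max_regressions
    return ok, pass_rate, regressed


# 20 golden-set cases: 19 still pass, one that used to pass now fails.
results = [CaseResult(f"case-{i}", True, True) for i in range(19)]
results.append(CaseResult("case-19", True, False))

ok, pass_rate, regressed = evaluate_gate(results)
print(f"pass rate {pass_rate:.2f}, regressions {regressed}, gate {'open' if ok else 'closed'}")
```

Here the pass rate alone would clear the 0.95 bar, but the single regression closes the gate. In a CI step the boolean would typically become the process exit code.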
Scoring methods — and the skeptic’s read
Methods run a spectrum from cheap-and-deterministic to expensive-and-trusted: programmatic asserts → LLM-as-judge / rubric (G-Eval) → pairwise comparison → human review [38]. G-Eval is the canonical rubric method — chain-of-thought-generated steps plus log-prob-weighted form-filling, reaching the top Spearman correlation with humans (0.514) on SummEval [39].
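The log-prob-weighted step can be illustrated as an expected value over the rating scale: each candidate score is weighted by the probability the judge assigns to its token. The sketch below assumes the token log-probabilities have already been extracted from the judge's response. Renormalizing over the scale tokens is a practical assumption here, not something the text specifies.

```python
import math


def weighted_score(logprobs: dict) -> float:
    """Expected rating: sum of score * probability over the rating-scale tokens.

    `logprobs` maps each integer score to the judge's log-probability for that token.
    Probabilities are renormalized over the scale (an assumption of this sketch).
    """
    probs = {score: math.exp(lp) for score, lp in logprobs.items()}
    total = sum(probs.values())
    return sum(score * p / total for score, p in probs.items())


# Hypothetical judge output on a 1-5 scale: mass split across 3, 4, and 5.
example = {3: math.log(0.2), 4: math.log(0.5), 5: math.log(0.3)}
score = weighted_score(example)
print(f"weighted score: {score:.2f}")  # 3*0.2 + 4*0.5 + 5*0.3 = 4.10
```

The weighted form yields a continuous score, so two outputs that would both receive a flat "4" can still be separated by how confident the judge was.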
But “vibes don’t scale” has a twin truth: the judges are themselves unreliable. The numbers are uncomfortable:
| Bias | Measured effect | Source |
|---|---|---|
| Self-preference | GPT-4 ~0.520 self-preference on Chatbot Arena | [40] |
| Self-enhancement | GPT-4 +10%, Claude-v1 +25% own win-rate | [42] |
| Position bias | Claude-v1 favored first position 70% of time | [42] |
| Verbosity bias | >90% preference for the longer answer | [42] |
| Formatting fragility | Consistency collapses on format/paraphrase changes; >50% error on bias benchmarks | [41] |
The synthesis: no judge is uniformly reliable. A judge can match human raters about 80% of the time overall and still err on more than half of the cases in a targeted bias benchmark; the two findings coexist rather than contradict [41]. Practical mitigations that hold up in 2026:
- Never self-judge — same model as judge and candidate adds 10–25% uniform bias [44].
- Prefer pairwise ranking over direct scoring — direct-scoring misalignment can’t be fixed by calibration [43].
- Validate the judge against human labels continuously, tracking precision/recall on imbalanced data, not raw accuracy [38].
- Design for variance, not false determinism: control judge temperature, use pass bands not single thresholds, don’t gate inside the 0.41–0.60 “moderate” confidence band, recalibrate when divergence exceeds ~20–25%, and report `pass@k` (a 70%-per-trial agent reads ~97% at pass@3 but ~34% at pass^3) [36] [37].
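Two of the mitigations above reduce to small calculations. The sketch below shows judge validation against human labels using precision and recall, and the gap between pass@k (at least one of k trials succeeds) and pass^k (all k succeed). The helper names are hypothetical, and the pass-rate formulas assume independent trials with the same per-trial success rate.

```python
def judge_precision_recall(human_flags, judge_flags):
    """Precision/recall of the judge on the rare 'failure' class, with human labels as truth."""
    pairs = list(zip(human_flags, judge_flags))
    tp = sum(1 for h, j in pairs if h and j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = sum(1 for h, j in pairs if h == j) / len(pairs)
    return precision, recall, accuracy


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (written pass^k in the text)."""
    return p ** k


# Imbalanced toy set: 2 real failures in 10 outputs; the judge catches one and raises one false alarm.
human = [True, True] + [False] * 8
judge = [True, False, True] + [False] * 7
precision, recall, accuracy = judge_precision_recall(human, judge)
print(f"accuracy {accuracy:.2f}, precision {precision:.2f}, recall {recall:.2f}")

print(f"pass@3 {pass_at_k(0.7, 3):.3f}, pass^3 {pass_hat_k(0.7, 3):.3f}")
```

In the toy set the judge scores 80% accuracy while missing half of the real failures, which is why raw accuracy is the wrong number to track on imbalanced data.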
Agent, RAG & trajectory evals
RAG evaluation is anchored by Ragas’s four metrics, which split diagnosis across retriever and generator: faithfulness, answer relevancy, context precision, context recall [45]. They carry rough targets (~0.75 / 0.8 / 0.7 / 0.8) and an aggregate Ragas score that is their mean [46]. TruLens frames the same idea as the RAG Triad (groundedness / answer relevance / context relevance) [55]. A common lifecycle pairs Ragas for exploration, DeepEval for CI/CD, and a production monitor for ongoing observability [55].
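A small sketch of how the four scores might be read together. It does not call the Ragas library. The targets are mapped to the metrics in the order the text lists them, and the retriever/generator assignment of each metric is an assumption; the aggregate is a plain arithmetic mean, as the text describes.

```python
from statistics import mean

# Rough targets from the text, assumed to follow the order the metrics are listed in.
TARGETS = {
    "faithfulness": 0.75,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.8,
}

# Which half of the pipeline each metric points at (assumption for illustration).
COMPONENT = {
    "faithfulness": "generator",
    "answer_relevancy": "generator",
    "context_precision": "retriever",
    "context_recall": "retriever",
}


def summarize(scores):
    """Return the mean of the four metrics plus any metric that sits below its target."""
    aggregate = mean(scores[m] for m in TARGETS)
    below = {m: COMPONENT[m] for m in TARGETS if scores[m] < TARGETS[m]}
    return aggregate, below


scores = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.85,
    "context_precision": 0.6,
    "context_recall": 0.85,
}
aggregate, below = summarize(scores)
print(f"aggregate {aggregate:.2f}, below target: {below}")
```

The example is chosen so the aggregate looks healthy while context precision misses its target, which is the case for reading the four metrics separately instead of relying on the mean.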
Agent evals stratify into three levels — end-to-end task success, trajectory-level path analysis, and component tests — with core metrics tool correctness, task completion, step efficiency, and argument correctness [47]. Trajectory testing became production-default in 2026: scoring the agent’s actual path vs a golden trajectory, plus tool-call precision/recall, path efficiency, and cost — covering loops, error recovery, and state transitions [54]. Named tooling:
- LangSmith’s open-source agentevals ⭐ 613 package matches trajectories via four modes — strict, unordered, superset, subset — plus an LLM-judge evaluator [49] [50].
- Inspect supports ReAct, Deep Agent, and custom multi-agent architectures with first-class tool use [51].
- Galileo ships named metrics (Tool Selection Quality, Action Completion, Reasoning Coherence, Agent Efficiency) on Luna-2 small models at sub-200ms [52].
- Arize Phoenix does OpenTelemetry trajectory-span analysis with Trajectory Mapping for loop detection [53].
DeepEval also exposes six agentic metrics including plan adherence and plan quality, and its G-Eval remains the versatile escape hatch for arbitrary custom criteria [47] [48].
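The four trajectory match modes and tool-call precision/recall can be sketched on plain lists of tool names. This is a simplified illustration and not the agentevals API: it compares tool names only, ignores arguments, and treats repeated calls as a multiset. Both function names are hypothetical.

```python
from collections import Counter


def trajectory_match(actual, reference, mode="strict"):
    """Compare an agent's tool-call sequence with a golden trajectory."""
    a, r = Counter(actual), Counter(reference)
    if mode == "strict":      # same calls, same order
        return list(actual) == list(reference)
    if mode == "unordered":   # same calls, any order
        return a == r
    if mode == "superset":    # actual contains at least every reference call
        return all(a[tool] >= n for tool, n in r.items())
    if mode == "subset":      # actual makes no call beyond the reference
        return all(r[tool] >= n for tool, n in a.items())
    raise ValueError(f"unknown mode: {mode}")


def tool_call_precision_recall(actual, reference):
    """Precision: share of actual calls that were expected. Recall: share of expected calls made."""
    overlap = sum((Counter(actual) & Counter(reference)).values())
    precision = overlap / len(actual) if actual else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


reference = ["search", "fetch", "summarize"]
actual = ["search", "search", "fetch", "summarize"]  # one redundant retry

for mode in ("strict", "unordered", "superset", "subset"):
    print(mode, trajectory_match(actual, reference, mode))

precision, recall = tool_call_precision_recall(actual, reference)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

With one redundant call, only the superset mode matches, and precision drops to 0.75 while recall stays at 1.0. That separation is what lets a trajectory eval distinguish a wasteful path from a missing step.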
How we got here, and where it’s going
Three overlapping eras [60]:
- Academic-benchmark era (2020–22) — MMLU (57 subjects), BIG-bench, Stanford HELM, EleutherAI’s lm-eval-harness; successor chains like GLUE→SuperGLUE and MMLU→MMLU-Pro emerged as ceilings were hit [56].
- Framework era (2023) — OpenAI open-sourced Evals alongside GPT-4 in March 2023, dangling GPT-4 access for high-quality benchmark contributions [57].
- Platform era (2024–26) — commercial tooling (DeepEval/Confident AI, Braintrust, Galileo, Weave, Ragas) layered on top, shifting away from static leaderboards toward domain-specific custom metrics for RAG, agents, multi-turn, and safety [60].
The dominant 2026 trend is benchmark decay: static benchmarks now have a median discriminative lifespan under two years; HumanEval and GPQA Diamond are saturated (frontier models >94% vs 65% for PhD experts), and SWE-bench Verified gold patches can be reproduced verbatim from task IDs [58]. The three structural failure modes — saturation, contamination, gameability — push every serious team off public leaderboards and onto their own gold sets [63]. Agent benchmarks (GAIA, SWE-bench Verified, OSWorld, Tau²-bench, WebArena) matured in parallel but suffer a standardization gap: even “success” is encoded incompatibly across them [59].
The takeaway for a 2026 talk: the tools have commoditized — the moat is your golden dataset and a judge you’ve actually validated against humans. Vibes don’t scale; neither does an unvalidated judge.