2026 LLM EVAL LANDSCAPE · $2.69B MARKET · 36.3% CAGR · 63 SOURCES · 8 OSS LIBRARIES · 6 ACTIVE PLATFORMS · 1 DEAD PLATFORM · 5 DOCUMENTED JUDGE BIASES
Eval Standings Board · 2026
LLM Eval Framework Landscape
Open-source libraries · hosted platforms · scoring methods · judge biases
expedition depth · 63 citations · 10 min read
$2.69B market 2026 [62]
36.3% CAGR
~2yr benchmark lifespan
▶ CHAMPIONSHIP VERDICT — DEFAULT STACK 2026
Pick one CI-gating library + one tracing platform. [16] The hard part isn't the tool — it's building the golden dataset and validating that your LLM judge agrees with humans.
CI Library — pick one
Promptfoo ⭐ 22k
CLI/YAML + 40+ red-team categories [1]
DeepEval ⭐ 16k
if your team thinks in pytest [2]
Tracing Platform — pick one
Langfuse ⭐ 29k
open-source + self-host [9]
Braintrust
eval-first SaaS, no per-seat [22]
LangSmith
if you already live in LangGraph [17]
01 Open-Source Library Standings — sorted by stars
| # | Library | Stars | Lang · License | Status | Best for |
| 1 | Promptfoo (⚠ acquired by OpenAI Mar 2026) | ⭐ 22k | TypeScript · MIT | PICK | CLI/YAML; 50+ assertions, 40+ red-team categories; used by OpenAI & Anthropic [11] |
| 2 | | ⭐ 19k | Python · custom | USE | Benchmark registry; last push Apr 2026, slow cadence [4] |
| 3 | DeepEval | ⭐ 16k | Python · Apache-2.0 | PICK | pytest-style; 14+ metrics incl. G-Eval, faithfulness, bias, toxicity, tool-correctness [12] |
| 4 | (repo moved; last push Feb 2026) | ⭐ 14k | Python · Apache-2.0 | USE | RAG eval leader: faithfulness, answer relevancy, context precision & recall [45] |
| 5 | | ⭐ 13k | Python · MIT | USE | Academic few-shot benchmarking; not for product-level evals [6] |
| 6 | | ⭐ 3.4k | Python · MIT | USE | RAG Triad (groundedness / answer rel. / context rel.) + experiment tracking [55] |
| | | ⭐ 2.4k | Python · Apache-2.0 | STALLED | 20+ preconfigured checks; ⚠ last push Aug 2024; effectively abandoned [8] |
| 7 | | ⭐ 2.2k | Python · MIT | USE | Reproducible safety + agent evals; Tasks/Datasets/Solvers/Scorers architecture; first-class sandboxing [14] |
02 Eval & Observability Platform Standings
| # | Platform | OSS Core | Price Floor | Status | Standout |
| 1 | Langfuse (acquired by ClickHouse Jan 2026; stays MIT) | MIT ✓ | Free → $29/mo | PICK | ⭐ 29k; OSS leader; all cloud features opened Jun 2025; self-host free, no event limit [20] |
| 2 | Braintrust (a16z + ICONIQ backed · $80M Series B) | Proprietary | Free → $249/mo flat | PICK | Eval-first SaaS; no per-seat tax; blocks sub-threshold PRs via native Action [22] |
| 3 | LangSmith | Proprietary | Free → $39/seat | USE IF LANGCHAIN | Default for LangChain/LangGraph; informs rather than auto-blocks [17] |
| 4 | | Elastic-2.0 ✓ | Free self-host | USE | ⭐ 10k; OpenTelemetry-native; trajectory-span analysis for agent loops [23] |
| 5 | | True OSS ✓ | Free self-host | USE | ~⭐ 20k; fully open-source; free cloud tier + enterprise hosted [26] |
| 6 | W&B Weave (CoreWeave acquired W&B ~$1.7B, closed May 2025) | Proprietary | W&B plan bundled | USE | Strong eval harness; @weave.op() decorator is one of the simplest integrations [24] |
| | Humanloop (shut down Sep 2025 · founders → Anthropic) | | | DEAD | Cautionary tale: closed vendor → acqui-hire → platform gone, no IP transfer [25] |
03 Scoring Method Spectrum — cheap & fast → expensive & trusted
Cheapest
Programmatic Asserts
~$0 · <1ms · deterministic
Regex, JSON schema, substring, exact-match checks. No model cost. Only works for structured or templated outputs.
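A minimal sketch of the four check types named above, using only the Python standard library. The function names are illustrative, not taken from any eval library, and `check_json_keys` is a deliberately small stand-in for a real JSON Schema validator.

```python
import json
import re


def check_exact(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def check_contains(output: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in output.lower()


def check_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_keys(output: str, required: dict) -> bool:
    """Parse output as JSON and verify required keys and their types.

    Hypothetical helper: a tiny stand-in for a full JSON Schema validator.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Each check returns a plain bool, so a test runner can combine them with ordinary `assert` statements at no model cost.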
Default
LLM-as-Judge / G-Eval
~$150–500/mo @ 10k evals/day [16]
CoT-generated rubric steps + log-prob-weighted scoring. G-Eval reaches Spearman 0.514 on SummEval. Carries documented biases — see §04. [39]
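The log-prob-weighted scoring step can be sketched as an expected value over the candidate score tokens: score = Σ p(s) · s. The helper below is hypothetical. It assumes the judge API exposes log-probabilities for the score tokens, and it renormalizes over the tokens provided, which is a simplification of this sketch rather than something the source specifies.

```python
import math


def weighted_score(logprobs: dict[str, float]) -> float:
    """Probability-weighted score from log-probs over candidate score tokens.

    `logprobs` maps a score token such as "1".."5" to its log-probability.
    Probabilities are renormalized over the tokens given (an assumption of
    this sketch), so mass on non-score tokens is ignored.
    """
    probs: dict[int, float] = {}
    for token, logprob in logprobs.items():
        score = int(token.strip())
        probs[score] = probs.get(score, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on score tokens")
    return sum(score * p for score, p in probs.items()) / total
```

A judge that puts 0.2 on "3", 0.5 on "4", and 0.3 on "5" scores 4.1 rather than a flat 4, which keeps ties between close outputs from collapsing to the same integer.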
Preferred
Pairwise Ranking
2× judge calls per comparison
Present both outputs; run both orders; count only consistent wins. Preferred over direct scoring — calibration alone can't fix direct-score misalignment. [43]
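A sketch of the run-both-orders rule. The `Judge` signature is an assumption for illustration: any callable that takes the prompt plus two answers and returns "first" or "second" fits. A verdict that flips when the order flips is counted as a tie, not a win.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def consistent_winner(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    first_pass = judge(prompt, a, b)   # A shown first
    second_pass = judge(prompt, b, a)  # B shown first
    if first_pass == "first" and second_pass == "second":
        return "A"
    if first_pass == "second" and second_pass == "first":
        return "B"
    return "tie"  # verdict changed with order, so it is not counted as a win


def win_rates(judge: Judge, cases: list[tuple[str, str, str]]) -> dict[str, float]:
    """Fraction of A wins, B wins, and ties over (prompt, a, b) cases."""
    counts = {"A": 0, "B": 0, "tie": 0}
    for prompt, a, b in cases:
        counts[consistent_winner(judge, prompt, a, b)] += 1
    n = len(cases) or 1
    return {label: count / n for label, count in counts.items()}
```

A judge that always picks the first slot produces only ties under this rule, which is the point: position preference cannot register as a win.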
Ground Truth
Human Review
Days–weeks · required for judge validation
Use binary pass/fail, not Likert scales. Track TPR/TNR, not accuracy. ~246 examples gives a 95% CI of ±5% when the measured rate is near 80%. Without this, no judge is validated. [38]
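The ~246 figure is consistent with the standard sample-size formula for a proportion, n = z² · p(1 − p) / e², at p ≈ 0.8. That reading is an inference from the arithmetic, not something [38] is quoted as stating. The sketch below computes it, plus TPR/TNR against human labels; both function names are illustrative.

```python
import math


def sample_size(p: float, margin: float = 0.05, z: float = 1.96) -> int:
    """Examples needed to estimate a proportion: n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rate of the judge against human pass/fail labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    return tp / positives, tn / negatives
```

`sample_size(0.8)` gives 246; the worst case, `sample_size(0.5)`, gives 385. TPR and TNR are reported separately because a judge that passes everything can still post high accuracy on a mostly-passing dataset.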
04 Judge Reliability Report — Five Documented Biases
Self-Preference Bias
~0.520 self-preference
GPT-4 on Chatbot Arena. Never use the model under test as its own judge: same-model judging inflates win rates by roughly 10–25%. [40]
Self-Enhancement Win Rate
+25% own win rate
GPT-4 inflates its own win rate +10%; Claude-v1 +25%. Mitigation: judge with a different model family than the one under test. [42]
Position Bias
70% first-position
Claude-v1 favored the first answer 70% of the time in pairwise comparisons. Run both orders and count only consistent wins. [42]
Verbosity Bias
>90% prefers longer
Strong preference for longer responses regardless of quality. A padding attack fooled Claude-v1 and GPT-3.5 91.3% of the time. [42]
Formatting Fragility
>50% error on bias benchmarks
Consistency collapses on format/paraphrase changes. ~80% human agreement and >50% bias-benchmark error coexist without contradicting each other. [41]
Confidence Band Risk
0.41–0.60 = unreliable
Don't gate inside the "moderate" confidence band. Use pass bands, not single thresholds. Recalibrate when judge/human divergence exceeds ~20–25%. [36]
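One way to read "pass bands, not single thresholds" is a three-way gate: auto-fail below the band, auto-pass above it, and route anything inside it to a human. The sketch below assumes that reading. The band defaults reuse the 0.41–0.60 range and the 20% divergence trigger from this card; the function names and the exact comparison rules are hypothetical.

```python
def gate(score: float, fail_below: float = 0.41, pass_above: float = 0.60) -> str:
    """Three-way gate: "fail", "review", or "pass".

    Scores inside [fail_below, pass_above] go to human review instead of
    being auto-decided. Defaults are illustrative; tune them per judge.
    """
    if fail_below > pass_above:
        raise ValueError("fail_below must not exceed pass_above")
    if score < fail_below:
        return "fail"
    if score > pass_above:
        return "pass"
    return "review"


def needs_recalibration(human: list[bool], judge: list[bool],
                        max_divergence: float = 0.20) -> bool:
    """True when the judge disagrees with humans on more than max_divergence of cases."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    disagreements = sum(h != j for h, j in zip(human, judge))
    return disagreements / len(human) > max_divergence
```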
05 CI Gate Modes — how each tool handles a failing eval
● BLOCK — score gate
GitHub Action; before-vs-after on every prompt-touching PR; cached LLM calls; interactive viewer comment. Common gate: minPassRate 0.95, maxRegressions 0. [29]
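The gate logic itself is small. The sketch below is a generic before-vs-after comparison, not any tool's actual API: `ci_gate` and its snake_case parameters are hypothetical stand-ins for the minPassRate and maxRegressions settings named above, and results are assumed to be per-case pass/fail booleans keyed by case id.

```python
def ci_gate(before: dict[str, bool], after: dict[str, bool],
            min_pass_rate: float = 0.95, max_regressions: int = 0) -> tuple[bool, dict]:
    """Compare per-case pass/fail results from the base branch and the PR.

    Returns (ok, report). A regression is a case that passed before and
    fails now; a case missing from `after` is treated as failing.
    """
    if not after:
        raise ValueError("no eval results for the PR")
    pass_rate = sum(after.values()) / len(after)
    regressions = sorted(
        case_id for case_id, passed in before.items()
        if passed and not after.get(case_id, False)
    )
    ok = pass_rate >= min_pass_rate and len(regressions) <= max_regressions
    return ok, {"pass_rate": pass_rate, "regressions": regressions}
```

In a CI step, exiting non-zero when `ok` is false is what turns the score into a merge block.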
● BLOCK — pytest raises
assert_test() raises when a metric falls below its threshold. deepeval test run adds async execution and repeats on top of pytest, and runs as a shell step in any CI system. [31]
● BLOCK — native Action
Native GitHub Action blocks sub-threshold merges. Eval-first philosophy: evals are the working spec and scores are the shipping oracle. [34]
◎ INFORM — does not auto-block
Integrates with pytest/Vitest/GitHub but informs rather than auto-blocks sub-threshold changes. Teams must wire the block themselves. [34]
06 How We Got Here — Three Eras
2020 – 2022
Academic Benchmark Era
MMLU (57 subjects), BIG-bench, Stanford HELM, EleutherAI harness. Static public leaderboards. Successor chains emerged as ceilings were hit: GLUE → SuperGLUE → MMLU → MMLU-Pro. [56]
2023
Framework Era
OpenAI open-sourced Evals alongside GPT-4 in March 2023, dangling GPT-4 access for high-quality benchmark contributions. First wave of community-built eval frameworks. [57]
2024 – 2026
Platform Era + Benchmark Decay
Commercial tooling layered on top. Static benchmarks now have a median lifespan under 2 years; HumanEval and GPQA Diamond already saturated. Three structural failure modes — saturation, contamination, gameability — push every serious team off leaderboards onto their own gold sets. [60]
63 citations · expedition depth · $4.93 · 530s · Opus 4.8 · 2026-06-09