2026 LLM EVAL LANDSCAPE
$2.69B MARKET
36.3% CAGR
63 SOURCES
8 OSS LIBRARIES
6 ACTIVE PLATFORMS
1 DEAD PLATFORM
5 DOCUMENTED JUDGE BIASES
Eval Standings Board · 2026
LLM Eval Framework Landscape
Open-source libraries · hosted platforms · scoring methods · judge biases
expedition depth · 63 citations · 10 min read
36.3% CAGR · ~2yr benchmark lifespan
▶ CHAMPIONSHIP VERDICT — DEFAULT STACK 2026
Pick one CI-gating library + one tracing platform. [16]
The hard part isn't the tool — it's building the golden dataset and validating that your LLM judge agrees with humans.
Tracing Platform — pick one
01
Open-Source Library Standings — sorted by stars
| # | Library | Stars | Lang · License | Status | Best for |
|---|---|---|---|---|---|
| 1 | ⚠ Acquired by OpenAI Mar 2026 | ⭐ 22k | TypeScript · MIT | PICK | CLI/YAML; 50+ assertions, 40+ red-team categories; used by OpenAI & Anthropic [11] |
| 2 | | ⭐ 19k | Python · custom | USE | Benchmark registry; last push Apr 2026, slow cadence [4] |
| 3 | | ⭐ 16k | Python · Apache-2.0 | PICK | pytest-style; 14+ metrics incl. G-Eval, faithfulness, bias, toxicity, tool-correctness [12] |
| 4 | Repo moved; last push Feb 2026 | ⭐ 14k | Python · Apache-2.0 | USE | RAG eval leader: faithfulness, answer relevancy, context precision & recall [45] |
| 5 | | ⭐ 13k | Python · MIT | USE | Academic few-shot benchmarking; not for product-level evals [6] |
| 6 | | ⭐ 3.4k | Python · MIT | USE | RAG Triad (groundedness / answer rel. / context rel.) + experiment tracking [55] |
| ✕ | | ⭐ 2.4k | Python · Apache-2.0 | STALLED | 20+ preconfigured checks; ⚠ last push Aug 2024, effectively abandoned [8] |
| 7 | | ⭐ 2.2k | Python · MIT | USE | Reproducible safety + agent evals; Tasks/Datasets/Solvers/Scorers architecture; first-class sandboxing [14] |
02
Eval & Observability Platform Standings
| # | Platform | OSS Core | Price Floor | Status | Standout |
|---|---|---|---|---|---|
| 1 | Acquired by ClickHouse Jan 2026; stays MIT | MIT ✓ | Free → $29/mo | PICK | ⭐ 29k; OSS leader; all cloud features opened Jun 2025; self-host free, no event limit [20] |
| 2 | a16z + ICONIQ backed · $80M Series B | Proprietary | Free → $249/mo flat | PICK | Eval-first SaaS; no per-seat tax; blocks sub-threshold PRs via native Action [22] |
| 3 | | Proprietary | Free → $39/seat | USE IF LANGCHAIN | Default for LangChain/LangGraph; informs rather than auto-blocks [17] |
| 4 | | Elastic-2.0 ✓ | Free self-host | USE | ⭐ 10k; OpenTelemetry-native; trajectory-span analysis for agent loops [23] |
| 5 | | True OSS ✓ | Free self-host | USE | ~⭐ 20k; fully open-source; free cloud tier + enterprise hosted [26] |
| 6 | CoreWeave acquired W&B ~$1.7B, closed May 2025 | Proprietary | W&B plan bundled | USE | Strong eval harness; @weave.op() decorator is one of the simplest integrations [24] |
| ✕ | Shut down Sep 2025 · founders → Anthropic | n/a | n/a | DEAD | Cautionary tale: closed vendor → acqui-hire → platform gone, no IP transfer [25] |
03
Scoring Method Spectrum — cheap & fast → expensive & trusted
Cheapest
Programmatic Asserts
~$0 · <1ms · deterministic
Regex, JSON schema, substring, exact-match checks. No model cost. Only works for structured or templated outputs.
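A minimal sketch of the four check types named above, using only the Python standard library. The function names are hypothetical, and `check_json_keys` is a simplified stand-in for a real JSON Schema validator (it only checks required keys and their types).

```python
import json
import re


def check_exact(output: str, expected: str) -> bool:
    """Exact match, ignoring leading/trailing whitespace."""
    return output.strip() == expected.strip()


def check_contains(output: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in output.lower()


def check_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_keys(output: str, required: dict[str, type]) -> bool:
    """Simplified schema check: output parses as a JSON object and
    every required key is present with the expected Python type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Checks like these cost nothing to run on every commit, which is why they usually sit in front of any judge-based scoring.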
Default
LLM-as-Judge / G-Eval
~$150–500/mo @ 10k evals/day
[16]
CoT-generated rubric steps + log-prob-weighted scoring. G-Eval reaches Spearman 0.514 on SummEval. Carries documented biases — see §04.
[39]
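The log-prob-weighted scoring step can be sketched as an expected value over the candidate score tokens: score = Σ p(sᵢ) · sᵢ. The sketch below assumes you already have the judge's log-probabilities for each candidate score (for example 1 to 5); the function name and the renormalization over only those candidates are illustrative assumptions, not any library's API.

```python
import math


def weighted_score(score_logprobs: dict[int, float]) -> float:
    """Probability-weighted judge score.

    score_logprobs maps each candidate score (e.g. 1..5) to the
    log-probability the judge assigned to that score token.
    Probabilities are renormalized over the candidates (an assumption),
    so the result stays within the score range.
    """
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("no probability mass on candidate scores")
    return sum(score * p for score, p in probs.items()) / total


# A judge split evenly between 4 and 5 yields a score between them,
# instead of collapsing to a single integer.
example = weighted_score({4: math.log(0.5), 5: math.log(0.5)})
```

The weighting gives finer-grained scores than taking the single most likely token, which tends to produce many ties.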
Preferred
Pairwise Ranking
2× judge calls per comparison
Present both outputs; run both orders; count only consistent wins. Preferred over direct scoring — calibration alone can't fix direct-score misalignment.
[43]
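The both-orders rule above can be sketched as follows. The `judge` callable is hypothetical: it stands for whatever function sends the prompt and two candidate answers to your judge model and returns which position won.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]
# Hypothetical judge signature: (prompt, answer_shown_first, answer_shown_second) -> verdict
Judge = Callable[[str, str, str], Verdict]


def consistent_winner(
    prompt: str, answer_a: str, answer_b: str, judge: Judge
) -> Literal["a", "b", "inconsistent"]:
    """Run the comparison in both orders; count a win only if the
    same answer wins regardless of position."""
    forward = judge(prompt, answer_a, answer_b)   # A shown first
    backward = judge(prompt, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    # Ties and position-dependent verdicts are not counted as wins.
    return "inconsistent"
```

A judge that always prefers the first position returns "inconsistent" for every pair under this scheme, so its position bias cannot inflate either side's win count.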
Ground Truth
Human Review
Days–weeks · required for judge validation
Use binary pass/fail (not Likert scales). Track TPR/TNR, not accuracy. ~246 examples for 95% CI at ±5%. Without this, no judge is validated.
[38]
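The ~246 figure is consistent with the standard normal-approximation sample-size formula, n = z² · p(1 − p) / e², if the expected proportion p is about 0.8. That value of p is an assumption made here to reproduce the number; the source does not state its inputs. The sketch also shows one way to compute TPR and TNR for a judge against human labels, assuming "positive" means a human-labeled pass.

```python
import math


def sample_size(p: float, margin: float, z: float = 1.96) -> int:
    """Examples needed to estimate a proportion p within +/- margin
    at the confidence level implied by z (1.96 for 95%)."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rates of the judge,
    treating human pass/fail labels as ground truth."""
    positives = sum(human)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return true_pos / positives, true_neg / negatives


n = sample_size(p=0.8, margin=0.05)  # 246
```

Reporting TPR and TNR separately matters when passes heavily outnumber failures: a judge that passes everything can score high accuracy while catching no failures at all.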
04
Judge Reliability Report — Five Documented Biases
Self-Preference Bias
~0.520 self-preference
GPT-4 on Chatbot Arena. Never use the model under test as its own judge — same-model judging adds 10–25% uniform win-rate inflation.
[40]
Self-Enhancement Win Rate
+25% own win rate
GPT-4 inflates its own win rate +10%; Claude-v1 +25%. Mitigation: judge with a different model family than the one under test.
[42]
Position Bias
70% first-position
Claude-v1 favored the first answer 70% of the time in pairwise comparisons. Run both orders and count only consistent wins.
[42]
Verbosity Bias
>90% prefers longer
Strong preference for longer responses regardless of quality. A padding attack fooled Claude-v1 and GPT-3.5 91.3% of the time.
[42]
Formatting Fragility
>50% error on bias benchmarks
Consistency collapses on format/paraphrase changes. ~80% human agreement and >50% bias-benchmark error coexist without contradicting each other.
[41]
Confidence Band Risk
0.41–0.60 = unreliable
Don't gate inside the "moderate" confidence band. Use pass bands, not single thresholds. Recalibrate when judge/human divergence exceeds ~20–25%.
[36]
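One possible reading of the band advice, sketched under stated assumptions: scores below the band fail, scores above it pass, and anything inside 0.41–0.60 is routed to human review rather than gated automatically. The three-way routing, the function names, and the use of simple disagreement rate as "divergence" are all illustrative assumptions, not taken from the source.

```python
def band_decision(score: float, low: float = 0.41, high: float = 0.60) -> str:
    """Three-way decision instead of a single threshold.
    Assumption: the band edges themselves count as unreliable."""
    if score < low:
        return "fail"
    if score <= high:
        return "review"  # do not auto-gate inside the moderate band
    return "pass"


def needs_recalibration(
    human: list[bool], judge: list[bool], max_divergence: float = 0.20
) -> bool:
    """Flag the judge for recalibration when its disagreement rate
    with human labels exceeds max_divergence (assumed metric)."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and equal length")
    disagreements = sum(h != j for h, j in zip(human, judge))
    return disagreements / len(human) > max_divergence
```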
05
CI Gate Modes — how each tool handles a failing eval
● BLOCK — score gate
GitHub Action; before-vs-after on every prompt-touching PR; cached LLM calls; interactive viewer comment. Common gate: minPassRate 0.95, maxRegressions 0.
[29]
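The gate logic behind those two numbers can be sketched generically. This is not the tool's own configuration format or API; it is an illustrative Python version in which `before` and `after` are hypothetical per-test-case pass/fail maps from the base branch and the PR branch.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    pass_rate: float
    regressions: int
    passed: bool


def evaluate_gate(
    before: dict[str, bool],
    after: dict[str, bool],
    min_pass_rate: float = 0.95,
    max_regressions: int = 0,
) -> GateResult:
    """Block when the PR's pass rate is too low or when any case
    that passed on the base branch now fails."""
    if not after:
        raise ValueError("no eval results for the PR branch")
    pass_rate = sum(after.values()) / len(after)
    regressions = sum(
        1 for case, ok in after.items() if before.get(case, False) and not ok
    )
    passed = pass_rate >= min_pass_rate and regressions <= max_regressions
    return GateResult(pass_rate, regressions, passed)
```

In a CI step, the script would exit non-zero when `passed` is false. Note that with `max_regressions` at 0, a single newly failing case blocks the merge even if the overall pass rate is still at or above 0.95.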
● BLOCK — pytest raises
assert_test() raises when a metric falls below its threshold; deepeval test run adds async/repeats on top of pytest and runs as a shell step in any CI system.
[31]
● BLOCK — native Action
Native GitHub Action blocks sub-threshold merges. Eval-first philosophy: evals are the working spec and scores are the shipping oracle.
[34]
◎ INFORM — does not auto-block
Integrates with pytest/Vitest/GitHub but informs rather than auto-blocks sub-threshold changes. Teams must wire the block themselves.
[34]
06
How We Got Here — Three Eras
2020 – 2022
Academic Benchmark Era
MMLU (57 subjects), BIG-bench, Stanford HELM, EleutherAI harness. Static public leaderboards. Successor chains emerged as ceilings were hit: GLUE → SuperGLUE → MMLU → MMLU-Pro.
[56]
2023
Framework Era
OpenAI open-sourced Evals alongside GPT-4 in March 2023, dangling GPT-4 access for high-quality benchmark contributions. First wave of community-built eval frameworks.
[57]
2024 – 2026
Platform Era + Benchmark Decay
Commercial tooling layered on top. Static benchmarks now have a median lifespan under 2 years; HumanEval and GPQA Diamond already saturated. Three structural failure modes — saturation, contamination, gameability — push every serious team off leaderboards onto their own gold sets.
[60]