Eval Framework Landscape 2026: Vibes Don't Scale — Scoreboard

2026 LLM EVAL LANDSCAPE · $2.69B MARKET · 36.3% CAGR · 63 SOURCES · 8 OSS LIBRARIES · 6 ACTIVE PLATFORMS · 1 DEAD PLATFORM · 5 DOCUMENTED JUDGE BIASES

Eval Standings Board · 2026

LLM Eval Framework Landscape

Open-source libraries · hosted platforms · scoring methods · judge biases
expedition depth · 63 citations · 10 min read

$2.69B market 2026 ^[62]

36.3% CAGR

~2yr benchmark lifespan

▶ CHAMPIONSHIP VERDICT — DEFAULT STACK 2026

Pick one CI-gating library + one tracing platform. ^[16] The hard part isn't the tool — it's building the golden dataset and validating that your LLM judge agrees with humans.

CI Library — pick one

Promptfoo ⭐ 22k
CLI/YAML + 40+ red-team categories ^[1]

DeepEval ⭐ 16k
if your team thinks in pytest ^[2]

Tracing Platform — pick one

Langfuse ⭐ 29k
open-source + self-host ^[9]

Braintrust
eval-first SaaS, no per-seat ^[22]

LangSmith
if you already live in LangGraph ^[17]

01 Open-Source Library Standings — sorted by stars

#	Library	Stars	Lang · License	Status	Best for
1	Promptfoo (⚠ acquired by OpenAI, Mar 2026)	⭐ 22k	TypeScript · MIT	PICK	CLI/YAML; 50+ assertions, 40+ red-team categories; used by OpenAI & Anthropic ^[11]
2	OpenAI Evals	⭐ 19k	Python · custom	USE	Benchmark registry; last push Apr 2026, slow cadence ^[4]
3	DeepEval	⭐ 16k	Python · Apache-2.0	PICK	pytest-style; 14+ metrics incl. G-Eval, faithfulness, bias, toxicity, tool-correctness ^[12]
4	Ragas (repo moved; last push Feb 2026)	⭐ 14k	Python · Apache-2.0	USE	RAG eval leader: faithfulness, answer relevancy, context precision & recall ^[45]
5	lm-eval-harness	⭐ 13k	Python · MIT	USE	Academic few-shot benchmarking; not for product-level evals ^[6]
6	TruLens	⭐ 3.4k	Python · MIT	USE	RAG Triad (groundedness / answer rel. / context rel.) + experiment tracking ^[55]
✕	UpTrain	⭐ 2.4k	Python · Apache-2.0	STALLED	20+ preconfigured checks; ⚠ last push Aug 2024, effectively abandoned ^[8]
7	Inspect (UK AISI)	⭐ 2.2k	Python · MIT	USE	Reproducible safety + agent evals; Tasks/Datasets/Solvers/Scorers architecture; first-class sandboxing ^[14]

02 Eval & Observability Platform Standings

#	Platform	OSS Core	Price Floor	Status	Standout
1	Langfuse (acquired by ClickHouse, Jan 2026; stays MIT)	MIT ✓	Free → $29/mo	PICK	⭐ 29k; OSS leader; all cloud features opened Jun 2025; self-host free, no event limit ^[20]
2	Braintrust (a16z + ICONIQ backed; $80M Series B)	Proprietary	Free → $249/mo flat	PICK	Eval-first SaaS; no per-seat tax; blocks sub-threshold PRs via native Action ^[22]
3	LangSmith	Proprietary	Free → $39/seat	USE IF LANGCHAIN	Default for LangChain/LangGraph; informs rather than auto-blocks ^[17]
4	Arize Phoenix	Elastic-2.0 ✓	Free self-host	USE	⭐ 10k; OpenTelemetry-native; trajectory-span analysis for agent loops ^[23]
5	Comet Opik	True OSS ✓	Free self-host	USE	⭐ ~20k; fully open-source; free cloud tier + enterprise hosted ^[26]
6	W&B Weave (CoreWeave acquired W&B for ~$1.7B; closed May 2025)	Proprietary	W&B plan bundled	USE	Strong eval harness; `@weave.op()` decorator is one of the simplest integrations ^[24]
✕	Humanloop (shut down Sep 2025; founders → Anthropic)	n/a	n/a	DEAD	Cautionary tale: closed vendor → acqui-hire → platform gone, no IP transfer ^[25]

03 Scoring Method Spectrum — cheap & fast → expensive & trusted

Cheapest

Programmatic Asserts

~$0 · <1ms · deterministic

Regex, JSON schema, substring, exact-match checks. No model cost. Only works for structured or templated outputs.
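A minimal sketch of the four check types named above, using only the Python standard library. The function names are hypothetical and not taken from any of the listed tools; the JSON check is a simple key-and-type test standing in for full JSON Schema validation.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass only if the output equals the expected string (ignoring outer whitespace)."""
    return output.strip() == expected.strip()


def contains(output: str, needle: str) -> bool:
    """Case-insensitive substring check."""
    return needle.lower() in output.lower()


def matches_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern is found anywhere in the output."""
    return re.search(pattern, output) is not None


def is_json_with_keys(output: str, required: dict[str, type]) -> bool:
    """Pass if the output parses as a JSON object whose required keys have the given types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())
```

Checks like these cost nothing to run on every commit, which is why they usually sit in front of any model-based scoring.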

Default

LLM-as-Judge / G-Eval

~$150–500/mo @ 10k evals/day ^[16]

The judge generates chain-of-thought rubric steps, then scores with log-prob weighting. G-Eval reaches a Spearman correlation of 0.514 with human ratings on SummEval. Carries documented biases; see §04. ^[39]
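Log-prob-weighted scoring can be sketched as an expected value over the candidate score tokens. The sketch below assumes the judge API returns log-probabilities for the top tokens at the score position; the function name and input shape are hypothetical, not G-Eval's or any library's actual interface.

```python
import math


def logprob_weighted_score(token_logprobs: dict[str, float]) -> float:
    """Return the probability-weighted mean of the candidate scores.

    token_logprobs maps a token (e.g. "4") to its log-probability at the
    position where the judge emits its score. This shape is an assumption.
    """
    mass: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        text = token.strip()
        if not text.isdigit():
            continue  # ignore tokens that are not a score
        score = int(text)
        # Tokens such as "4" and " 4" map to the same score, so add their mass.
        mass[score] = mass.get(score, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0:
        raise ValueError("no score tokens found")
    # Renormalize, since the returned top-k probabilities rarely sum to 1.
    return sum(score * p for score, p in mass.items()) / total
```

Compared with taking the single most likely token, the weighted mean yields a continuous score, which makes ties between outputs less frequent.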

Preferred

Pairwise Ranking

2× judge calls per comparison

Present both outputs; run both orders; count only consistent wins. Preferred over direct scoring — calibration alone can't fix direct-score misalignment. ^[43]
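The both-orders rule can be sketched as follows. The `Judge` callable is a hypothetical stand-in for whatever model call a team uses; it is assumed to return "first", "second", or "tie" for the two answers in the order shown.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]


def consistent_winner(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Judge (a, b) and (b, a); credit a win only if both orders agree."""
    forward = judge(prompt, a, b)   # a shown first
    backward = judge(prompt, b, a)  # b shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Ties and order-dependent verdicts are not counted for either side.
    return "inconsistent"
```

A judge that always favors the first position returns "first" in both runs, so its verdict falls into the "inconsistent" bucket instead of inflating either side's win count.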

Ground Truth

Human Review

Days–weeks · required for judge validation

Use binary pass/fail labels, not Likert scales. Track TPR/TNR rather than raw accuracy. ~246 labeled examples give a 95% CI of ±5%. Without this step, no judge is validated. ^[38]
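A sketch of the two calculations behind this card. The sample-size function uses the standard normal-approximation formula n = z² · p(1 − p) / e²; the ~246 figure is reproduced when the expected rate p is set to 0.8, which is an inference here and not something the source states. Function names are hypothetical.

```python
import math


def sample_size(expected_rate: float, margin: float = 0.05, z: float = 1.96) -> int:
    """Labels needed for a confidence interval of +/- margin (normal approximation)."""
    return math.ceil(z**2 * expected_rate * (1 - expected_rate) / margin**2)


def tpr_tnr(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """True-positive and true-negative rate of the judge against human pass/fail labels."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    positives = [j for h, j in zip(human, judge) if h]      # human said pass
    negatives = [j for h, j in zip(human, judge) if not h]  # human said fail
    if not positives or not negatives:
        raise ValueError("need at least one human pass and one human fail")
    tpr = sum(positives) / len(positives)
    tnr = sum(not j for j in negatives) / len(negatives)
    return tpr, tnr
```

With `expected_rate=0.8` the function returns 246; the worst case, `expected_rate=0.5`, needs 385. Splitting agreement into TPR and TNR matters when failures are rare, since a judge that passes everything can still post high accuracy.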

04 Judge Reliability Report — Five Documented Biases

Self-Preference Bias

~0.520 self-preference

Measured for GPT-4 on Chatbot Arena. Never use the model under test as its own judge: same-model judging inflates win rates by roughly 10–25%. ^[40]

Self-Enhancement Win Rate

+25% own win rate

GPT-4 inflates its own win rate by about 10%; Claude-v1 by about 25%. Mitigation: judge with a different model family than the one under test. ^[42]

Position Bias

70% first-position

Claude-v1 favored the first answer 70% of the time in pairwise comparisons. Run both orders and count only consistent wins. ^[42]

Verbosity Bias

>90% prefers longer

Strong preference for longer responses regardless of quality. A padding attack fooled Claude-v1 and GPT-3.5 91.3% of the time. ^[42]

Formatting Fragility

>50% error on bias benchmarks

Consistency collapses on format/paraphrase changes. A judge can agree with humans ~80% of the time overall and still err on more than 50% of bias-benchmark cases; the two figures measure different things and do not contradict each other. ^[41]

Confidence Band Risk

0.41–0.60 = unreliable

Don't gate inside the "moderate" confidence band. Use pass bands, not single thresholds. Recalibrate when judge/human divergence exceeds ~20–25%. ^[36]
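One way to read the pass-band advice is as a three-way gate: clear passes, clear fails, and a middle band routed to human review. The sketch assumes the 0.41–0.60 band applies to a judge score normalized to 0–1, which the source does not spell out; names and defaults are hypothetical.

```python
def band_gate(score: float, lower: float = 0.41, upper: float = 0.60) -> str:
    """Return "pass", "fail", or "review" for a normalized judge score."""
    if score > upper:
        return "pass"
    if score < lower:
        return "fail"
    # Scores inside the band are treated as unreliable and escalated.
    return "review"


def needs_recalibration(human: list[bool], judge: list[bool], limit: float = 0.20) -> bool:
    """True when the judge disagrees with human labels on more than `limit` of cases."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    disagreements = sum(h != j for h, j in zip(human, judge))
    return disagreements / len(human) > limit
```

The `limit` default of 0.20 takes the lower end of the ~20–25% range quoted above; a team with noisier human labels might reasonably pick the upper end.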

05 CI Gate Modes — how each tool handles a failing eval

Promptfoo

● BLOCK — score gate

GitHub Action; before-vs-after on every prompt-touching PR; cached LLM calls; interactive viewer comment. Common gate: minPassRate 0.95, maxRegressions 0. ^[29]
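The gate logic quoted above can be sketched in a tool-agnostic way. This is not Promptfoo's configuration schema or API; the parameters simply mirror the `minPassRate` and `maxRegressions` names from the card, and the input shape (test-case id mapped to pass/fail) is an assumption.

```python
def ci_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    min_pass_rate: float = 0.95,
    max_regressions: int = 0,
) -> tuple[bool, float, list[str]]:
    """Compare candidate results with the baseline and decide whether to block.

    Returns (ok, pass_rate, regressions), where a regression is a case that
    passed on the baseline and no longer passes on the candidate.
    """
    if not candidate:
        raise ValueError("candidate run has no test cases")
    pass_rate = sum(candidate.values()) / len(candidate)
    regressions = sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )
    ok = pass_rate >= min_pass_rate and len(regressions) <= max_regressions
    return ok, pass_rate, regressions
```

In a CI job, a wrapper script would exit non-zero when `ok` is false so the pipeline marks the PR as failing. Tracking regressions separately from pass rate catches the case where one test is fixed while another breaks and the aggregate stays flat.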

DeepEval

● BLOCK — pytest raises

`assert_test()` raises when a metric falls below its threshold; `deepeval test run` adds async execution and repeats on top of pytest, and runs as a shell step in any CI system. ^[31]

Braintrust

● BLOCK — native Action

Native GitHub Action blocks sub-threshold merges. Eval-first philosophy: evals are the working spec and scores are the shipping oracle. ^[34]

LangSmith

◎ INFORM — does not auto-block

Integrates with pytest/Vitest/GitHub but informs rather than auto-blocks sub-threshold changes. Teams must wire the block themselves. ^[34]

06 How We Got Here — Three Eras

2020 – 2022

Academic Benchmark Era

MMLU (57 subjects), BIG-bench, Stanford HELM, EleutherAI harness. Static public leaderboards. Successor chains emerged as ceilings were hit: GLUE → SuperGLUE → MMLU → MMLU-Pro. ^[56]

2023

Framework Era

OpenAI open-sourced Evals alongside GPT-4 in March 2023, dangling GPT-4 access for high-quality benchmark contributions. First wave of community-built eval frameworks. ^[57]

2024 – 2026

Platform Era + Benchmark Decay

Commercial tooling layered on top. Static benchmarks now have a median lifespan under 2 years; HumanEval and GPQA Diamond already saturated. Three structural failure modes — saturation, contamination, gameability — push every serious team off leaderboards onto their own gold sets. ^[60]

63 citations · expedition depth · $4.93 · 530s · Opus 4.8 · 2026-06-09