LLM Eval Tooling Landscape 2026: A Consultancy's Decision Guide

Decision (consultancy shipping AI for clients). No single tool wins — build a two-layer stack: a code-first OSS eval framework that gates CI, plus an OSS platform you can self-host or run as SaaS per client.[1][37]

Default pick → DeepEval ⭐ 16k + Langfuse ⭐ 28k. Apache-2.0 pytest gating with the deepest metric set, paired with MIT self-hostable tracing/datasets you can stand up per-client with zero license cost.[10][22][39]

Pick Braintrust if clients will pay for a polished managed UI and stakeholder dashboards, and you don’t want to run infra. It is the strongest commercial experiment/annotation platform: $249/mo Pro, self-host only on Enterprise.[7][8][28]

Pick LangSmith if the client’s app is already LangChain/LangGraph — native, lowest-cost commercial entry ($39/seat) — but accept ecosystem lock-in.[12][36]

Pick Promptfoo ⭐ 22k for red-teaming / security (now OpenAI-owned, still MIT) and Inspect AI ⭐ 2.2k for safety/regulated, multi-provider capability evals.[2][4][17]

Add Ragas ⭐ 14k when the deliverable is RAG and you want reference-free retrieval metrics.[15][16]

For expert consultants the load-bearing insight isn’t the tool — it’s that harness design swings published benchmark scores 10–20 points on SWE-Bench Verified, so the real asset is your client’s golden dataset + CI gates, and tools are interchangeable plumbing around it.[5][38]
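To make the "interchangeable plumbing" point concrete, here is a minimal sketch of a tool-agnostic golden dataset with a swappable scorer. The names `GoldenCase`, `exact_match`, `contains_expected` and `evaluate`, and the sample cases, are hypothetical illustrations and not any listed tool's API. Swapping only the scorer changes the result on identical outputs, which is a small-scale version of the harness-design effect described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class GoldenCase:
    """One client-owned golden example (hypothetical schema)."""
    case_id: str
    prompt: str
    expected: str


# A scorer maps (case, model_output) to a score in [0, 1].
Scorer = Callable[[GoldenCase, str], float]


def exact_match(case: GoldenCase, output: str) -> float:
    """Strict scorer: output must equal the expected answer (case-insensitive)."""
    return 1.0 if output.strip().lower() == case.expected.strip().lower() else 0.0


def contains_expected(case: GoldenCase, output: str) -> float:
    """Lenient scorer: expected answer only has to appear in the output."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def evaluate(cases: Iterable[GoldenCase], outputs: Dict[str, str], scorer: Scorer) -> Dict[str, float]:
    """Score every case; a missing output is scored as an empty string."""
    return {c.case_id: scorer(c, outputs.get(c.case_id, "")) for c in cases}


# Illustrative data only.
GOLDEN = [
    GoldenCase("refund-01", "What is the refund window?", "30 days"),
    GoldenCase("refund-02", "Can I return opened items?", "yes"),
]
OUTPUTS = {
    "refund-01": "Refunds are accepted within 30 days of purchase.",
    "refund-02": "Yes",
}

strict = evaluate(GOLDEN, OUTPUTS, exact_match)
lenient = evaluate(GOLDEN, OUTPUTS, contains_expected)
print("strict: ", strict)
print("lenient:", lenient)
```

The dataset and the scorer contract stay with the client; any framework from the table can sit behind the `Scorer` signature.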

The 2026 shape of the market

Two consolidation events, plus a deprecation and a funding round, reshaped the landscape this cycle:

OpenAI acquired Promptfoo (announced 9 Mar 2026, ~$86M), a tool used by 125k devs and 25%+ of the Fortune 500. Red-teaming folds into OpenAI Frontier; the repo stays MIT and open.[2][3][33] For a consultancy shipping non-OpenAI models, the unresolved caveat is vendor objectivity: pair it with an independent scorer (Inspect AI) if that matters to a client.[34]
Humanloop is gone — acqui-hired by Anthropic (founders + ~12 staff, Aug 2025, no IP/assets), standalone platform sunset 8 Sep 2025, tech folded into the Anthropic Console. Don’t design a 2026 stack around it.[26][27]
Two OpenAI “Evals” exist and are easy to conflate: the OSS openai/evals framework (MIT, ⭐ 19k, benchmark registry, still developed) versus the hosted Evals platform/API, which OpenAI has announced for deprecation alongside Agent Builder. Treat the hosted product as a dead end; the OSS repo is a benchmark-running tool, not a product eval platform.[13][14]
Braintrust raised ~$80M Series B → ~$800M valuation (Feb 2026) — the commercial eval-platform tier is consolidating around it and LangSmith.[40]

Master comparison

Stars are current as of June 2026 (GitHub API). ✓ = first-class/native, ~ = possible but not the tool’s focus, ✗ = not supported.

Tool	⭐ Stars	OSS / License	Self-host vs SaaS	CI/CD fit	Dataset / experiment mgmt	LLM-as-judge	Agent eval	RAG eval	Pricing	Lock-in
DeepEval	⭐ 16k	OSS · Apache-2.0	Self-host (free); SaaS = Confident AI	✓ native `deepeval test run` (pytest)	~ via Confident AI cloud	✓ G-Eval, DAG	✓ task completion, tool correctness	✓ faithfulness, ctx recall/precision	Framework free; Confident AI $19.99–49.99/user; self-host @ Team/Ent	Low
Promptfoo	⭐ 22k	OSS · MIT (OpenAI-owned)	Self-host / CLI; Enterprise SaaS	✓ native GitHub Action (PR review)	~ YAML cases, no rich UI	✓ custom assertions	~ red-team focus	~ not primary	Free OSS; Enterprise on-prem	Low ⚠ vendor objectivity
Braintrust	— (closed)	Commercial · proprietary	SaaS; self-host = Enterprise only	✓ GitHub Action → SDK	✓ best-in-class UI, diffs, annotation	✓ sandboxed-Python scorers	✓ lifecycle metrics	~ via custom scorer	Free → Pro $249/mo → Ent	Medium
LangSmith	— (closed)	Commercial · proprietary	SaaS; self-host = Enterprise add-on	✓ native evaluator runs	✓ trace-curated datasets	✓ online + offline	✓ native LangGraph	~ general framework	Free 5k → Plus $39/seat → Ent	High (LangChain)
Ragas	⭐ 14k	OSS · Apache-2.0	Self-host (library)	~ wrap in pytest/CI	✗ (metrics lib)	~ judge-backed metrics	~ limited	✓ purpose-built, reference-free	Free	Low
Inspect AI	⭐ 2.2k	OSS · MIT (UK AISI)	Self-host (library + UI)	~ bring-your-own GitHub Actions	✓ Dataset→Task→Solver→Scorer	✓ model-graded + custom	✓ Docker-sandboxed agentic	~ capability-level	Free	Low
OpenAI Evals	⭐ 19k	OSS · MIT	Self-host only	~ no runner; wrap in Actions	✓ benchmark registry	✓ model-graded YAML	~ Completion-Fn protocol	✗	Free (hosted platform deprecating)	Low
Arize Phoenix	⭐ 10k	OSS · Elastic 2.0	Self-host (free); SaaS = Arize AX	✓ code evaluators, LLM jury	✓ score traces in-UI	✓ Code Evaluators, LLM jury	✓ OTel agent spans	✓ embedding-based	OSS free; AX Free → AX Pro $50/mo	Low (OTel-native)
Langfuse	⭐ 28k	OSS · MIT	Self-host free, unlimited; SaaS	✓ SDK in CI; datasets/runs	✓ datasets, runs, scores	✓ LLM-as-judge templates	✓ trace/span eval	✓ trace eval	Self-host free; cloud $29 / $199 / $2,499	Low

Sources backing the cells above: DeepEval[9][10][11][30]; Promptfoo[2][4][6][28]; Braintrust[5][7][8]; LangSmith[12][35][36]; Ragas[15][16]; Inspect AI[17][18]; OpenAI Evals[13][14]; Phoenix[19][20][21][31]; Langfuse[22][23][29].

Two axes that actually decide it

Axis 1 — framework (CI gate) vs platform (UI + storage). DeepEval, Promptfoo, Ragas, Inspect AI and OpenAI Evals are frameworks: they run in code, exit non-zero on a failed assertion, and gate a PR. Braintrust, LangSmith, Langfuse and Phoenix are platforms: persistent storage, dataset curation, annotation UI, production monitoring. The 2026 consensus is you want one of each, not one tool doing both badly.[1][37]

Axis 2 — CI gating strength. Among the frameworks, DeepEval (deepeval test run) and Promptfoo (PR-review GitHub Action) are the only two with native, first-class CI gating; Ragas, Inspect AI and OpenAI Evals gate a PR only once you wrap them in your own GitHub Actions glue. The platforms gate through their SDK or Action rather than a test runner. For a consultancy whose value proposition is “we ship tested AI,” that native gate is worth optimising for.[41][18]
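The gating behaviour both axes rely on (exit non-zero, block the PR) can be sketched without any framework. This is a minimal illustration under stated assumptions, not how DeepEval or Promptfoo implement it: the function names, the 0.5 pass threshold, the 90% minimum pass rate and the sample scores are all hypothetical.

```python
from typing import List, Mapping


def regressions(current: Mapping[str, float], baseline: Mapping[str, float], threshold: float) -> List[str]:
    """Case ids that passed in the baseline run but fail in the current run."""
    return sorted(
        cid for cid, old in baseline.items()
        if old >= threshold and current.get(cid, 0.0) < threshold
    )


def run_gate(
    current: Mapping[str, float],
    baseline: Mapping[str, float],
    threshold: float = 0.5,       # assumed per-case pass mark
    min_pass_rate: float = 0.9,   # assumed suite-level minimum
) -> int:
    """Return a process exit code: 0 lets the PR through, 1 blocks it."""
    passed = sum(1 for score in current.values() if score >= threshold)
    pass_rate = passed / len(current) if current else 0.0
    regressed = regressions(current, baseline, threshold)
    for cid in regressed:
        print(f"REGRESSION {cid}: {baseline[cid]:.2f} -> {current.get(cid, 0.0):.2f}")
    print(f"pass rate {pass_rate:.0%} (minimum {min_pass_rate:.0%})")
    return 1 if regressed or pass_rate < min_pass_rate else 0


# Illustrative scores only: one golden case regressed on this branch.
BASELINE = {"refund-01": 1.0, "refund-02": 1.0, "billing-01": 0.8}
CURRENT = {"refund-01": 1.0, "refund-02": 0.2, "billing-01": 0.9}

exit_code = run_gate(CURRENT, BASELINE)
print("exit code:", exit_code)
```

In a CI job the script would end with `raise SystemExit(exit_code)`, so a regressed golden case fails the step. A native gate saves you from writing and maintaining this glue per client.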

Self-host & lock-in (the consultancy-critical column)

A consultancy hands deliverables to clients with varying data-residency rules. Self-host capability and license terms dominate.

Tool	Free self-host?	License gotcha	Verdict for client work
Langfuse	✓ unlimited, MIT	none	Best — drop on client infra, no fees[22][23]
DeepEval	✓ framework; SaaS UI @ Team/Ent	Apache-2.0	Best — pure-code, runs anywhere[9][10]
Inspect AI / Promptfoo / Ragas	✓	MIT / MIT / Apache-2.0	Best — libraries, zero lock-in[15][17][4]
Phoenix	✓ no feature gates, air-gappable	Elastic 2.0 — can’t resell it as a hosted service	Fine for internal/client deploys; ⚠ can’t white-label as your own SaaS[20][21]
Braintrust	✗ Enterprise-only	proprietary	Locked to their cloud unless client buys Enterprise[7][8]
LangSmith	✗ Enterprise add-on	proprietary	Enterprise custom contract; + LangChain ecosystem pull[12][35][36]

Note the ELv2 trap for Phoenix: free to self-host for any client internally, but you may not offer it back as a managed/hosted service — relevant if your consultancy’s product is a hosted eval dashboard.[20] Langfuse’s MIT has no such restriction, and self-hosts at ~$1/GB-month vs Braintrust’s ~$3/GB — cheaper at scale.[29]
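As a rough worked example of the storage gap, the sketch below applies the two quoted rates to a few data volumes. It assumes both figures are USD per GB per month and scale linearly, which the source only states explicitly for Langfuse, and it ignores the infrastructure and engineering time that self-hosting adds. The constant and function names are hypothetical.

```python
# Rates quoted in the text; treating both as USD per GB per month is an assumption.
LANGFUSE_SELF_HOST_RATE = 1.0
BRAINTRUST_RATE = 3.0


def monthly_cost(stored_gb: float, rate_per_gb: float) -> float:
    """Linear storage cost for one month."""
    return stored_gb * rate_per_gb


def monthly_saving(stored_gb: float) -> float:
    """Difference between the two quoted rates at a given volume."""
    return monthly_cost(stored_gb, BRAINTRUST_RATE) - monthly_cost(stored_gb, LANGFUSE_SELF_HOST_RATE)


for gb in (50, 500, 5_000):
    print(
        f"{gb:>5} GB: Langfuse ~${monthly_cost(gb, LANGFUSE_SELF_HOST_RATE):,.0f}, "
        f"Braintrust ~${monthly_cost(gb, BRAINTRUST_RATE):,.0f}, "
        f"difference ~${monthly_saving(gb):,.0f}/month"
    )
```

Under these assumptions the difference only becomes material at hundreds of GB, so for small client pilots the self-host decision is more likely to be driven by data residency than by storage price.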

Worth-a-mention tier

Langfuse ⭐ 28k — the most-starred OSS LLMOps platform; arguably the default open observability+eval+prompt-management layer for 2026, and the strongest Braintrust alternative when self-hosting matters.[22][32] Promoted into the main table above on merit.
Evidently AI ⭐ 7.6k (Apache-2.0) — open-source eval/monitoring building blocks, not a turnkey review platform; reaching a dedicated tool’s maturity takes real engineering investment. Pick only if you want to assemble your own.[24]
Patronus AI — evaluation-first, differentiated by proprietary judge models: Lynx (hallucination), GLIDER (rubric scoring), Percival (agent monitoring). Niche but real when you want managed, research-grade judges instead of rolling your own.[25]
Humanloop — dead. Sunset 8 Sep 2025 after Anthropic acqui-hire; listed only so you don’t propose it.[26][27]

Pick-X-if (recommendation grid for a client-shipping consultancy)

If the client situation is…	Pick	Why
Default / greenfield, data-residency varies	DeepEval + Langfuse	Apache/MIT, self-host anywhere, native pytest gate + free unlimited platform[10][22][39]
Client wants a managed, polished UI; budget OK	Braintrust Pro	Best experiment/annotation UX, diffs, sandboxed scorers, $249/mo[7][28]
App is LangChain/LangGraph	LangSmith	Native tracing/eval, $39/seat — lowest commercial entry, accept lock-in[12][36]
Security / red-team / regulated	Promptfoo (+Inspect AI)	MIT red-teaming; add Inspect AI for vendor-independent scoring[2][17][34]
Public-sector / safety / multi-provider capability evals	Inspect AI	UK AISI, MIT, 200+ evals across 10+ providers, Docker sandboxing[17][18]
Deliverable is RAG	Ragas (+DeepEval)	Reference-free retrieval metrics; DeepEval for agent/safety overlap[15][16]
Already OTel-instrumented	Arize Phoenix	OTel-native, score traces in-UI, no SDK lock-in (mind ELv2)[19][31]
Want managed judge models, not DIY	Patronus AI	Proprietary Lynx/GLIDER/Percival judges[25]
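The grid above can be encoded as a small lookup for intake questionnaires. This is a sketch only: the `Client` field names are hypothetical, and the precedence order (first matching rule wins) is an assumption, since the grid itself does not rank its rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Client:
    """Hypothetical intake flags, one per row of the grid."""
    uses_langchain: bool = False
    needs_red_team: bool = False
    public_sector_or_safety: bool = False
    deliverable_is_rag: bool = False
    otel_instrumented: bool = False
    wants_managed_judges: bool = False
    wants_managed_ui: bool = False


# (flag name, pick) pairs; the ordering is an assumed precedence, not from the grid.
RULES = [
    ("uses_langchain", "LangSmith"),
    ("needs_red_team", "Promptfoo (+Inspect AI)"),
    ("public_sector_or_safety", "Inspect AI"),
    ("deliverable_is_rag", "Ragas (+DeepEval)"),
    ("otel_instrumented", "Arize Phoenix"),
    ("wants_managed_judges", "Patronus AI"),
    ("wants_managed_ui", "Braintrust Pro"),
]

DEFAULT_PICK = "DeepEval + Langfuse"


def recommend(client: Client) -> str:
    """Return the first matching pick, falling back to the default stack."""
    for flag, pick in RULES:
        if getattr(client, flag):
            return pick
    return DEFAULT_PICK


print(recommend(Client()))
print(recommend(Client(deliverable_is_rag=True)))
print(recommend(Client(uses_langchain=True, wants_managed_ui=True)))
```

Real engagements often match several rows at once; in that case treat the output as a starting point and combine picks along the framework-plus-platform split.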

Workshop framing (2h hands-on): demo the DeepEval pytest gate failing a PR on a regressed golden-dataset case, then push those same traces to a self-hosted Langfuse for the dashboard view — that one flow shows both axes and is reproducible on any client’s infra without a license.[1][38][41]